Original title: How to do linear interpolation in PySpark without Pandas UDF (only using Spark API)?
I have a Spark DataFrame with the following structure:
shock_rule_id  DATE        value
A              2024-01-01  100
A              2024-01-02  null
A              2024-01-03  130
B              2024-01-01  50
B              2024-01-02  null
B              2024-01-03  null
B              2024-01-04  80
I want to linearly interpolate the value column within each shock_rule_id group: for the sample above, A's 2024-01-02 should become 115, and B's 2024-01-02 and 2024-01-03 should become 60 and 70. I don't want to use a Pandas UDF; I'd like to do this with Spark API / SQL functions only. My DataFrame contains only business-day dates (no weekends or holidays), so consecutive rows are evenly spaced and row positions can serve as x-coordinates. I already wrote one approach using Spark SQL window functions (first, last, row numbers, etc.), simplified below:
from pyspark.sql import Window
from pyspark.sql.functions import col, first, last, row_number, when

w = Window.partitionBy("shock_rule_id").orderBy("DATE")
w_fwd = w.rowsBetween(Window.currentRow, Window.unboundedFollowing)

# Row numbers to simulate index positions, plus the nearest non-null
# value (and its position) before and after each row. row_num itself is
# never null, so it has to be masked with when(...) for ignorenulls to
# pick up the position of the nearest non-null *value*.
df_pos = (
    result_df
    .withColumn("row_num", row_number().over(w))
    .withColumn("prev_value", last("value", ignorenulls=True).over(w))
    .withColumn("prev_row", last(when(col("value").isNotNull(), col("row_num")),
                                 ignorenulls=True).over(w))
    .withColumn("next_value", first("value", ignorenulls=True).over(w_fwd))
    .withColumn("next_row", first(when(col("value").isNotNull(), col("row_num")),
                                  ignorenulls=True).over(w_fwd))
)
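If it helps to see where I'm heading, the final step I have in mind looks roughly like this (a sketch built on df_pos above; df_interp, value_interp, and the forward/back-fill fallbacks at the group edges are my own naming and choices, not a fixed requirement):

from pyspark.sql.functions import coalesce, col, when

# The when(...) guard restricts the division to rows that actually need
# interpolation (null value with a non-null neighbour on each side), which
# also avoids a divide-by-zero on non-null rows under ANSI mode.
interp = when(
    col("value").isNull()
    & col("prev_value").isNotNull()
    & col("next_value").isNotNull(),
    col("prev_value")
    + (col("next_value") - col("prev_value"))
    * (col("row_num") - col("prev_row"))
    / (col("next_row") - col("prev_row")),
)

df_interp = df_pos.withColumn(
    "value_interp",
    coalesce(
        col("value"),       # keep existing values untouched
        interp,             # interior gaps: linear interpolation
        col("prev_value"),  # trailing nulls in a group: forward-fill
        col("next_value"),  # leading nulls in a group: back-fill
    ),
)

For the sample data this produces 115 for A's gap and 60/70 for B's, matching the values I expect above.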