Original title: How to do linear interpolation in PySpark without Pandas UDF (only using Spark API)?
I have a Spark DataFrame with the following structure:
shock_rule_id  DATE        value
A              2024-01-01  100
A              2024-01-02  null
A              2024-01-03  130
B              2024-01-01  50
B              2024-01-02  null
B              2024-01-03  null
B              2024-01-04  80
I want to linearly interpolate the value column within each shock_rule_id group: for the sample above, A's 2024-01-02 should become 115, and B's 2024-01-02 and 2024-01-03 should become 60 and 70. I don't want to use a Pandas UDF; I'd like to do this with Spark API / SQL functions only. My DataFrame contains only business-day dates (no weekends or holidays), so consecutive rows are evenly spaced and row positions can serve as x-coordinates. I already wrote one approach using Spark SQL window functions (first, last, row numbers, etc.), simplified below:
from pyspark.sql import Window
from pyspark.sql.functions import col, first, last, row_number, when

w = Window.partitionBy("shock_rule_id").orderBy("DATE")
w_fwd = w.rowsBetween(Window.currentRow, Window.unboundedFollowing)

# Row numbers to simulate index positions, plus the nearest non-null
# value (and its position) before and after each row. row_num itself is
# never null, so it has to be masked with when(...) for ignorenulls to
# pick up the position of the nearest non-null *value*.
df_pos = (
    result_df
    .withColumn("row_num", row_number().over(w))
    .withColumn("prev_value", last("value", ignorenulls=True).over(w))
    .withColumn("prev_row", last(when(col("value").isNotNull(), col("row_num")),
                                 ignorenulls=True).over(w))
    .withColumn("next_value", first("value", ignorenulls=True).over(w_fwd))
    .withColumn("next_row", first(when(col("value").isNotNull(), col("row_num")),
                                  ignorenulls=True).over(w_fwd))
)
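If it helps to see where I'm heading, the final step I have in mind looks roughly like this (a sketch built on df_pos above; df_interp, value_interp, and the forward/back-fill fallbacks at the group edges are my own naming and choices, not a fixed requirement):

from pyspark.sql.functions import coalesce, col, when

# The when(...) guard restricts the division to rows that actually need
# interpolation (null value with a non-null neighbour on each side), which
# also avoids a divide-by-zero on non-null rows under ANSI mode.
interp = when(
    col("value").isNull()
    & col("prev_value").isNotNull()
    & col("next_value").isNotNull(),
    col("prev_value")
    + (col("next_value") - col("prev_value"))
    * (col("row_num") - col("prev_row"))
    / (col("next_row") - col("prev_row")),
)

df_interp = df_pos.withColumn(
    "value_interp",
    coalesce(
        col("value"),       # keep existing values untouched
        interp,             # interior gaps: linear interpolation
        col("prev_value"),  # trailing nulls in a group: forward-fill
        col("next_value"),  # leading nulls in a group: back-fill
    ),
)

For the sample data this produces 115 for A's gap and 60/70 for B's, matching the values I expect above.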