Original title: Create unique index for each group in PySpark
I am working with a relatively large dataframe (close to 1 billion rows) in PySpark. This dataframe is in "long" format, and I would like to have a unique index for each group defined by a groupBy over multiple columns. An example dataframe:

+--------------+-------+---------+------+------+
|classification|   id_1|     id_2|     t|     y|
+--------------+-------+---------+------+------+
|             1| person|    Alice|   0.1| 0.247|
|             1| person|    Alice|   0.2| 0.249|
|             1| person|    Alice|   0.3| 0.255|
|             0| animal|   Jaguar|   0.1| 0.298|
|             0| animal|   Jaguar|   0.2| 0.305|
|             0| animal|   Jaguar|   0.3| 0.310|
|             1| person|    Chris|   0.1| 0.267|
+--------------+-------+---------+------+------+
Here I would like to perform an operation such that I can index each group defined by ["classification", "id_1", "id_2"]. Example output (the index values themselves are arbitrary; all that matters is that each group gets its own):

+--------------+-------+---------+------+------+-----+
|classification|   id_1|     id_2|     t|     y|index|
+--------------+-------+---------+------+------+-----+
|             1| person|    Alice|   0.1| 0.247|    0|
|             1| person|    Alice|   0.2| 0.249|    0|
|             1| person|    Alice|   0.3| 0.255|    0|
|             0| animal|   Jaguar|   0.1| 0.298|    1|
|             0| animal|   Jaguar|   0.2| 0.305|    1|
|             0| animal|   Jaguar|   0.3| 0.310|    1|
|             1| person|    Chris|   0.1| 0.267|    2|
+--------------+-------+---------+------+------+-----+
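For context, here is a minimal sketch of one approach I have been considering (the "index" column name is my own choice): take the distinct group keys, assign each one an id with monotonically_increasing_id, and join that small lookup back onto the full dataframe.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, "person", "Alice", 0.1, 0.247),
        (1, "person", "Alice", 0.2, 0.249),
        (1, "person", "Alice", 0.3, 0.255),
        (0, "animal", "Jaguar", 0.1, 0.298),
        (0, "animal", "Jaguar", 0.2, 0.305),
        (0, "animal", "Jaguar", 0.3, 0.310),
        (1, "person", "Chris", 0.1, 0.267),
    ],
    ["classification", "id_1", "id_2", "t", "y"],
)

group_cols = ["classification", "id_1", "id_2"]

# Build a lookup of the distinct group keys, give each one a unique
# (but not necessarily consecutive) id, then join it back onto df.
groups = (
    df.select(*group_cols)
      .distinct()
      .withColumn("index", F.monotonically_increasing_id())
)

result = df.join(groups, on=group_cols, how="left")
result.show()

This gives unique but non-consecutive ids. I know dense_rank over a Window.orderBy(*group_cols) would produce consecutive indices instead, but a window without a partitionBy pulls everything into a single partition, which seems risky at close to a billion rows. Is there an idiomatic, scalable way to do this?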