PySpark Dynamic Column Computation
Below is my Spark data frame:

a b c
1 3 4
2 0 0
4 1 0
2 2 0

My output should be as below:

a b  c
1 3  4
2 0  2
4 1 -1
2 2 -1

The formula is c = prev(c) - a + b, i.e. 4 - 2 + 0 = 2 and 2 - 4 + 1 = -1.
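Note that the recurrence telescopes: each row only adds b - a to the previous c, so c[n] = c[0] + (b[1] - a[1]) + ... + (b[n] - a[n]). This is what makes the cumulative-sum trick in Solution 2 below work. A quick plain-Python check against the sample data (the first value of c, 4, is taken as given, not computed):

# Unrolled recurrence: c[n] = c[n-1] - a[n] + b[n], seeded with c[0] = 4
a = [1, 2, 4, 2]
b = [3, 0, 1, 2]
c = [4]
for i in range(1, len(a)):
    c.append(c[i - 1] - a[i] + b[i])
print(c)  # [4, 2, -1, -1]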
Solution 1:
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# Sample data; assumes `sc` is an existing SparkContext (e.g. the pyspark shell)
numbers = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [5, 6, 7]]
df = sc.parallelize(numbers).toDF(['a', 'b', 'c'])
df.show()

# Single window over the whole frame, ordered by column 'a'
w = Window.partitionBy().orderBy('a')

# result = previous row's a - current row's b + current row's c
df = df.withColumn('result', lag('a').over(w) - df.b + df.c)
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
| 5| 6| 7|
+---+---+---+
+---+---+---+------+
| a| b| c|result|
+---+---+---+------+
| 1| 2| 3| null|
| 2| 3| 4| 2|
| 3| 4| 5| 3|
| 5| 6| 7| 4|
+---+---+---+------+
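The first result is null because lag has no preceding row to read there. If a seed value is preferred over null, coalesce can supply one; a minimal sketch, assuming 0 is an acceptable default for the missing lag:

from pyspark.sql.functions import coalesce, lag, lit

# Replace the leading null produced by lag() with 0 (the choice of 0 is an assumption)
df = df.withColumn('result', coalesce(lag('a').over(w), lit(0)) - df.b + df.c)
df.show()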
Solution 2:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = sc.parallelize([
    [1, 3],
    [2, 0],
    [4, 1],
    [2, 2]
]).toDF(('a', 'b'))

# Stable id so the rows can be replayed in their original order
df1 = df.withColumn("row_id", f.monotonically_increasing_id())
w = Window.partitionBy().orderBy(f.col("row_id"))

# First row carries the known starting value of c (4); every later row
# contributes its increment b - a
df1 = df1.withColumn("c_temp", f.when(f.col("row_id") == 0, f.lit(4))
                                .otherwise(f.col("b") - f.col("a")))

# Running sum over the ordered window reconstructs c row by row
df1 = df1.withColumn("c", f.sum(f.col("c_temp")).over(w)).drop("c_temp", "row_id")
df1.show()
Output is:
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 4|
| 2| 0| 2|
| 4| 1| -1|
| 2| 2| -1|
+---+---+---+
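As a cross-check, the same cumulative-sum trick can be reproduced in plain pandas; a minimal sketch, assuming the same hard-coded starting value c[0] = 4:

import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 4, 2], 'b': [3, 0, 1, 2]})
delta = pdf['b'] - pdf['a']   # per-row increment b - a
delta.iloc[0] = 4             # first row carries the known starting value of c
pdf['c'] = delta.cumsum()     # running sum gives c: 4, 2, -1, -1
print(pdf)

Note that a window with no partitioning keys pulls all rows into a single partition, so both solutions are only practical for small data.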