Showing posts with the label Apache Spark Sql

Removing Duplicate Columns After A Df Join In Spark

When you join two DFs with similar column names: df = df1.join(df2, df1['id'] == df2['i…

Improve Parallelism In Spark Sql

I have the code below. I am using PySpark 1.2.1 with Python 2.7 (CPython). for colname in shuffle_co…

Selecting Empty Array Values From A Spark Dataframe

Given a DataFrame with the following rows: rows = [ Row(col1='abc', col2=[8], col3=[18]…

Spark: How To Transpose And Explode Columns With Nested Arrays

I applied an algorithm from the question below (in NOTE) to transpose and explode nested Spark dataf…

Pyspark - Append Previous And Next Row To Current Row

Let's say I have a PySpark data frame like so: 1 0 1 0 0 0 1 1 0 1 0 1 How can I append the la…

Implementing A Recursive Algorithm In Pyspark To Find Pairings Within A Dataframe

I have a Spark DataFrame (prof_student_df) that lists student/professor pairs for a timestamp. There…

If I Cache A Spark Dataframe And Then Overwrite The Reference, Will The Original Data Frame Still Be Cached?

Suppose I had a function to generate a (py)spark data frame, caching the data frame into memory as …

How To Make An Integer Index Row?

I have a DataFrame:
+-----+--------+---------+
|  usn|log_type|item_code|
+-----+--------+--------…