PySpark Application Fails With java.lang.OutOfMemoryError: Java Heap Space
I'm running Spark via PyCharm and, respectively, in the PySpark shell. I'm stuck with this error: java.lang.OutOfMemoryError: Java heap space at org.apache.spark.api.python.PythonR
Solution 1:
TL;DR Don't use parallelize outside tests and simple experiments. Because you use Python 2.7, range is not lazy, so you'll materialize a full range of values multiple times:
- the Python list after the call,
- a serialized version which is later written to disk,
- a serialized copy loaded on the JVM.
Using xrange would help, but you shouldn't use parallelize in the first place (or Python 2 in 2018).
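For context, here is a minimal sketch of the pattern this answer warns against; the question's exact code isn't shown, so sc.parallelize(range(1000000000)) is an assumption about what it roughly looked like:

# Hypothetical reconstruction of the problematic driver-side code (not the OP's exact snippet).
# On Python 2.7, range() eagerly builds the full billion-element list in driver memory;
# parallelize() then serializes it to disk and the serialized copy is loaded on the JVM,
# which is what blows the Java heap.
rdd = sc.parallelize(range(1000000000))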
If you want to create a series of values, just use SparkContext.range:
range(start, end=None, step=1, numSlices=None)
Create a new RDD of int containing elements from start to end (exclusive), increased by step every element. Can be called the same way as python’s built-in range() function. If called with a single argument, the argument is interpreted as end, and start is set to 0.
so in your case:
rdd = sc.range(1000000000, numSlices=100)
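As a quick sanity check (assuming a live SparkContext named sc, as above), the resulting RDD generates its values per partition only when an action runs, so nothing large is ever materialized on the driver:

# values are produced lazily inside each of the 100 partitions
print(rdd.getNumPartitions())  # 100
print(rdd.count())             # 1000000000, computed on the executors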
With DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000000, numPartitions=100)
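A short follow-up, assuming the same SparkSession: the DataFrame variant generates its single "id" column directly on the JVM, so the Python driver never holds the data either:

# one LongType column named "id", produced row by row on the executors
df.printSchema()
print(df.rdd.getNumPartitions())  # should report 100
print(df.count())                 # 1000000000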