
PySpark Application Fails With java.lang.OutOfMemoryError: Java Heap Space

I'm running Spark via PyCharm and, respectively, the PySpark shell. I'm stuck with this error: : java.lang.OutOfMemoryError: Java heap space at org.apache.spark.api.python.PythonR

Solution 1:

TL;DR Don't use parallelize outside tests and simple experiments. Because you use Python 2.7, range is not lazy, so you'll materialize the full range of values multiple times:

  • As a Python list right after the call.
  • As a serialized version that is later written to disk.
  • As a serialized copy loaded on the JVM.

Using xrange would help, but you shouldn't use parallelize in the first place (or Python 2 in 2018).
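For reference, here is a minimal sketch of the failing pattern (the question doesn't show the exact code, so the call below is a hypothetical reconstruction based on the size used later in this answer):

# Hypothetical reconstruction under Python 2.7: range() eagerly builds a
# Python list of 10^9 ints on the driver, which parallelize() then
# serializes again, exhausting the Java heap.
rdd = sc.parallelize(range(1000000000))

# xrange is lazy and avoids the driver-side list, but parallelize is
# still the wrong tool for generating a plain sequence of numbers.
rdd = sc.parallelize(xrange(1000000000))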

If you want to create a series of values, just use SparkContext.range:

range(start, end=None, step=1, numSlices=None)

Create a new RDD of int containing elements from start to end (exclusive), increased by step every element. Can be called the same way as python’s built-in range() function. If called with a single argument, the argument is interpreted as end, and start is set to 0.

so in your case:

rdd = sc.range(1000000000, numSlices=100)
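As a quick sanity check (standard RDD methods, not code from the question), the values are generated per partition on the executors, so nothing large has to live in driver memory:

# sc.range builds each partition's values lazily on the executors.
rdd.getNumPartitions()   # 100
rdd.count()              # 1000000000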

With DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000000000, numPartitions=100)
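spark.range gives you a DataFrame with a single LongType column named id, and aggregations run distributed without any driver-side list. A small illustrative check (not part of the original answer):

# Single column "id" of type long.
df.printSchema()                              # |-- id: long (nullable = false)
df.selectExpr("count(*)", "sum(id)").show()   # computed on the executors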
