PySpark Application Fails With java.lang.OutOfMemoryError: Java Heap Space
I'm running Spark via PyCharm and, respectively, in the PySpark shell. I'm stuck with this error: java.lang.OutOfMemoryError: Java heap space at org.apache.spark.api.python.PythonR
Solution 1:
TL;DR Don't use parallelize outside tests and simple experiments. Because you use Python 2.7, range is not lazy, so you'll materialize a full range of values multiple times:
- the Python list after the call,
- a serialized version which is later written to disk,
- a serialized copy loaded on the JVM.
Using xrange would help, but you shouldn't use parallelize in the first place (or Python 2 in 2018).
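For context, here is a minimal sketch of the pattern this answer warns against; the question's exact code isn't shown, so sc.parallelize(range(1000000000)) is an assumption about what it roughly looked like:

# Hypothetical reconstruction of the problematic driver-side code (not the OP's exact snippet).
# On Python 2.7, range() eagerly builds the full billion-element list in driver memory;
# parallelize() then serializes it to disk and the serialized copy is loaded on the JVM,
# which is what blows the Java heap.
rdd = sc.parallelize(range(1000000000))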
If you want to create a series of values, just use SparkContext.range:
range(start, end=None, step=1, numSlices=None)
Create a new RDD of int containing elements from start to end (exclusive), increased by step every element. Can be called the same way as python’s built-in range() function. If called with a single argument, the argument is interpreted as end, and start is set to 0.
so in your case:
rdd = sc.range(1000000000, numSlices=100)
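As a quick sanity check (assuming a live SparkContext named sc, as above), the resulting RDD generates its values per partition only when an action runs, so nothing large is ever materialized on the driver:

# values are produced lazily inside each of the 100 partitions
print(rdd.getNumPartitions())  # 100
print(rdd.count())             # 1000000000, computed on the executors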
With DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000000, numPartitions=100)
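A short follow-up, assuming the same SparkSession: the DataFrame variant generates its single "id" column directly on the JVM, so the Python driver never holds the data either:

# one LongType column named "id", produced row by row on the executors
df.printSchema()
print(df.rdd.getNumPartitions())  # should report 100
print(df.count())                 # 1000000000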