Dask Distributed Memory Error
Solution 1:
The most common cause of this error is trying to collect too much data, as in the following example using dask.dataframe:
import dask.dataframe as dd

df = dd.read_csv('s3://bucket/lots-of-data-*.csv')
df.compute()  # pulls the entire result back to the local machine
This loads all of the data into RAM across the cluster (which is fine), and then tries to bring the entire result back to the local machine by way of the scheduler, which probably can't handle hundreds of gigabytes of data all in one place. Worker-to-client communications pass through the scheduler, so it is the first single machine to receive all of the data and the first machine likely to fail.
If this is the case, then you probably want to use the Executor.persist method instead, which triggers computation but leaves the results on the cluster:
df = dd.read_csv('s3://bucket/lots-of-data-*.csv')
df = e.persist(df)  # e is the distributed Executor; data stays on the workers
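For context, here is a minimal sketch of the surrounding setup, assuming a running dask.distributed cluster; the scheduler address below is a placeholder, and in recent releases of dask.distributed the Executor class goes by the name Client:

import dask.dataframe as dd
from dask.distributed import Client

e = Client('tcp://scheduler-address:8786')  # placeholder scheduler address
df = dd.read_csv('s3://bucket/lots-of-data-*.csv')
df = e.persist(df)  # computation runs on the workers; results remain in cluster RAM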
Generally, we only use df.compute() for small results that we want to view in our local session.
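As an illustration of that pattern, something like the following keeps the heavy lifting on the cluster and only brings small aggregates back locally; the column name 'value' is hypothetical:

row_count = len(df)                  # a single integer comes back to the client
total = df['value'].sum().compute()  # 'value' is a placeholder column name
print(row_count, total)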