
Downloading Files From Google Storage Using Spark (python) And Dataproc

I have an application that parallelizes the execution of Python objects that process data to be downloaded from Google Storage (my project bucket). The cluster is created using Google Dataproc.

Solution 1:

The problem was clearly the Spark context. Replacing the call to "gsutil" with a call to "hadoop fs" solves it:

from subprocess import call
from os.path import join

def copyDataFromBucket(filename, remoteFolder, localFolder):
  # Copy a single file from the bucket to the local filesystem via the
  # Hadoop GCS connector (available on Dataproc workers, unlike gsutil here).
  call(["hadoop", "fs", "-copyToLocal", join(remoteFolder, filename), localFolder])

I also did a test to send data to the bucket. One only needs to replace "-copyToLocal" with "-copyFromLocal".
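For completeness, a minimal sketch of that upload variant; the function name copyDataToBucket is mine, mirroring the helper above:

def copyDataToBucket(filename, localFolder, remoteFolder):
  # Same approach in the other direction: -copyFromLocal pushes a local
  # file up to the bucket through the Hadoop GCS connector.
  call(["hadoop", "fs", "-copyFromLocal", join(localFolder, filename), remoteFolder])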
