
Create New Csv File In Google Cloud Storage From Cloud Function

This is my first time working with Google Cloud Storage. Below is a cloud function that is triggered whenever a CSV file gets uploaded to my-folder inside my bucket. My goal is to create

Solution 1:

I have a few questions about the code and the design of the solution:

  1. As I understand it, on the one hand the cloud function is triggered by a finalize event (see Google Cloud Storage Triggers); on the other hand you would like to save a newly created file into the same bucket. Upon success, the appearance of a new object in that bucket will trigger another instance of your cloud function. Is that the intended behaviour? Is your cloud function ready for that?

  2. Ontologically there is no such thing as a folder. Thus in this code:

    folder_name = 'my-folder'
    file_name = data['name']

the first line is a bit redundant, unless you would like to use that variable and value for something else, and file_name gets the object name including all prefixes (you may consider them as "folders").

  3. The example you refer to - storage_compose_file.py - shows how several objects in GCS can be composed into one. I am not sure that example is relevant for your case, unless you have some additional requirements.

  4. Now, let's have a look at this snippet:

        destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
        destination.content_type = 'text/csv'
        sources = [bucket.get_blob(file_name)]
        destination.compose(sources)

a. bucket.blob is a factory constructor - see the Buckets API description. I am not sure you really want to use bucket_name as part of its argument: a blob name is relative to the bucket and should not include the bucket name.

b. sources - becomes a list with only one element - a reference to the existing object in the GCS bucket.

c. destination.compose(sources) - is it an attempt to make a copy of the existing object? If successful - it may trigger another instance of your cloud function.

  5. About type changes:

    blob = bucket.blob(file_name)
    blob = blob.download_as_string()

After the first line the blob variable has the type google.cloud.storage.blob.Blob. After the second it is bytes. Python allows such reassignment... but would you really like it? By the way, the download_as_string method is deprecated in favour of download_as_bytes - see the Blobs / Objects API.

  6. About the output:

    output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]

    with open(output, 'w') as output_csv:

Bear in mind that all of this happens inside the memory of the cloud function. It has nothing to do with GCS buckets or blobs. If you would like to use temporary files within cloud functions, you have to put them in the /tmp directory - see Write temporary files from Google Cloud Function. I would guess that you get the error because of this issue.
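For illustration, a minimal sketch of that fix, with the question's slicing replaced by a simple basename; the example object name below is an assumption, not taken from the original code:

    from os import path

    file_name = 'my-folder/my_file.csv'                      # example object name from the trigger event
    local_output = path.join('/tmp', 'URL_' + path.basename(file_name))

    with open(local_output, 'w') as output_csv:              # /tmp is the only writable location
        output_csv.write('col1,col2\n1,2\n')                 # placeholder content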

=> Coming to some suggestions.

You probably would like to download the object into the cloud function's memory (into the /tmp directory). Then you would like to process the source file and save the result next to the source. Finally, you would like to upload the result to another (not the source) bucket, so that the upload does not trigger your function again. If my assumptions are correct, I would suggest implementing those steps one by one and checking that you get the desired result at each step, as in the sketch below.
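A minimal, hedged sketch of that flow, assuming a separate output bucket; the bucket name, the output prefix and the processing step are placeholders, not taken from the original code:

    from os import path
    from google.cloud import storage

    def process_csv(event, context):
        """Triggered by a finalize event on the source bucket."""
        client = storage.Client()
        source_bucket = client.bucket(event['bucket'])
        object_name = event['name']                          # e.g. 'my-folder/my_file.csv'

        # 1. Download the source object into /tmp.
        local_source = path.join('/tmp', path.basename(object_name))
        source_bucket.blob(object_name).download_to_filename(local_source)

        # 2. Process it and save the result next to the source, still in /tmp.
        local_result = path.join('/tmp', 'URL_' + path.basename(object_name))
        with open(local_source) as src, open(local_result, 'w') as dst:
            for line in src:
                dst.write(line)                              # placeholder for the real transformation

        # 3. Upload the result to a *different* bucket so this function is not re-triggered.
        output_bucket = client.bucket('my-output-bucket')    # assumption: a second bucket exists
        blob = output_bucket.blob('my-folder/' + path.basename(local_result))
        blob.upload_from_filename(local_result, content_type='text/csv')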

Solution 2:

You can save a CSV to Google Cloud Storage in two ways.

Either you save it directly to GCS with the gcsfs package listed in "requirements.txt", or you write it to the container's /tmp folder and push it to the GCS bucket from there afterwards.

Use the power of the Python package "gcsfs"

gcsfs stands for "Google Cloud Storage File System". Add

gcsfs==2021.11.1

or another version to your "requirements.txt". You do not use this package name directly in the code; instead, its installation allows you to save to Google Cloud Storage directly, with no interim step of writing to /tmp and pushing to the GCS bucket. You can also store the file in a sub-directory.

You can save a dataframe for example with:

df.to_csv('gs://MY_BUCKET_NAME/MY_OUTPUT.csv')

or:

df.to_csv('gs://MY_BUCKET_NAME/MY_DIR/MY_OUTPUT.csv')

or use an environment variable that you set in the first menu step when creating the Cloud Function:

from os import environ

df.to_csv(environ["CSV_OUTPUT_FILE_PATH"], index=False)
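Putting it together, a hedged sketch of a triggered function that writes straight to GCS this way; gcsfs only needs to be installed, not imported, and the output path and function name are placeholders:

    import pandas as pd

    def handle_upload(event, context):
        """Triggered by a CSV landing in the source bucket."""
        # pandas resolves gs:// paths through the installed gcsfs package
        df = pd.read_csv(f"gs://{event['bucket']}/{event['name']}")
        # ... transform df here ...
        df.to_csv('gs://MY_BUCKET_NAME/MY_DIR/URL_output.csv', index=False)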

I am not sure whether this is needed, but I saw an example where the gcsfs package is installed together with

fsspec==2021.11.1

and it will not hurt to add it. My tests of saving a small df to CSV on GCS did not need that package, though. Since I am not sure about this helper module, here is a quote:

Purpose (of fsspec):

To produce a template or specification for a file-system interface, that specific implementations should follow, so that applications making use of them can rely on a common behaviour and not have to worry about the specific internal implementation decisions with any given backend. Many such implementations are included in this package, or in sister projects such as s3fs and gcsfs.

In addition, if this is well-designed, then additional functionality, such as a key-value store or FUSE mounting of the file-system implementation may be available for all implementations "for free".

First in the container's "/tmp", then push to GCS

Here is an example of how to do what the other answer suggests: store the file first in the container's /tmp (and only there, no other directory is possible) and then move it to a bucket of your choice. You can also save it to the bucket that stores the source code of the cloud function, contrary to the last sentence of the other answer (tested, works):

    # Imports assumed by this snippet:
    from os import path
    import pandas as pd
    from google.cloud import storage

    # function `write_to_csv_file()` not used, but it might be helpful if no df is at hand:
    # def write_to_csv_file(file_path, file_content, root):
    #     """ Creates a file on runtime. """
    #     file_path = path.join(root, file_path)
    #
    #     # If the file is binary, rather use 'wb' instead of 'w'
    #     with open(file_path, 'w') as file:
    #         file.write(file_content)

    def push_to_gcs(file, bucket):
        """ Writes a local file to Google Cloud Storage. """
        file_name = file.split('/')[-1]
        print(f"Pushing {file_name} to GCS...")
        blob = bucket.blob(file_name)
        blob.upload_from_filename(file)
        print(f"File pushed to {blob.id} successfully.")

    # Root path on CF will be /workspace, while on local Windows: C:\
    root = path.dirname(path.abspath(__file__))
    file_name = 'test_export.csv'

    # This is the main step: you *must* use `/tmp`:
    file_path = '/tmp/' + file_name

    d = {'col1': [1, 2], 'col2': [3, 4]}
    df = pd.DataFrame(data=d)
    df.to_csv(path.join(root, file_path), index=False)

    # If you have a df anyway, `df.to_csv()` is easier.
    # The following file writer should rather be used if you have records instead (here: dfAsString).
    # Since we do not use `write_to_csv_file()`, it is commented out above, but it can be useful if no df is at hand.
    # dfAsString = df.to_string(header=True, index=False)
    # write_to_csv_file(file_path, dfAsString, root)

    # Cloud Storage client: move the csv file to Cloud Storage
    storage_client = storage.Client()
    bucket_name = 'MY_GOOGLE_STORAGE_BUCKET_NAME'  # placeholder: your bucket name
    bucket = storage_client.get_bucket(bucket_name)
    push_to_gcs(path.join(root, file_path), bucket)
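If this snippet is meant to run inside the triggered function from the question, the wiring could look roughly like this; the entry-point name is an assumption, and `push_to_gcs` and the imports are the ones defined above:

    def process_and_push(event, context):
        """Hypothetical entry point for the background Cloud Function."""
        local_file = '/tmp/test_export.csv'
        d = {'col1': [1, 2], 'col2': [3, 4]}
        pd.DataFrame(data=d).to_csv(local_file, index=False)

        bucket = storage.Client().get_bucket('MY_GOOGLE_STORAGE_BUCKET_NAME')  # placeholder
        push_to_gcs(local_file, bucket)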
