Working with Google Cloud Storage

To make the most of Document AI API, it is worth familiarizing yourself with Google Cloud Storage, because this is what allows you to process documents asynchronously (more on this in the usage vignette). Google Cloud Storage – not to be confused with Google Drive – is a central feature of Google Cloud Services (GCS); it serves as a kind of mailbox where you deposit files for processing with any one of the GCS APIs and retrieve the output from the processing you requested. In the context of OCR processing with daiR, a typical asynchronous workflow consists of uploading documents to a Storage bucket, telling Document AI where to find them, and downloading the JSON output files afterwards. To interact with Google Storage we use the googleCloudStorageR package.¹

Setup

Storage is such a fundamental service in GCS that it is enabled by default. If you followed the configuration vignette where you created a GCS account and stored an environment variable GCS_AUTH_FILE in your .Renviron file, you are pretty much ready to go. All you need to do is load googleCloudStorageR and you will be auto-authenticated.

library(googleCloudStorageR)

Creating and inspecting buckets

Google Storage keeps your files in so-called “buckets”, which you can think of as folders. There is no root location in your Google Storage space, so you need at least one bucket to store files.

To view and create buckets, you need your project id, which we encountered in step 3 in the configuration vignette. If you did not store it when setting up GCS, you can look it up in the Google Cloud Console or use the daiR function get_project_id().

project_id <- daiR::get_project_id()

Now let’s see how many buckets we have:

gcs_list_buckets(project_id)

Answer: zero, because we haven’t created one yet. This we can do with gcs_create_bucket(). Note that it has to be globally unique (“my_bucket” won’t work because someone’s already taken it). For this example, let’s use “example-bucket-34869” (change the number so you get a unique one). Also add a location (“eu” or “us”).

gcs_create_bucket("example-bucket-34869", project_id, location = "eu")

Now we can see the bucket listed:

gcs_list_buckets(project_id)

You can create as many buckets as you want and organize them as you like. But you will need to supply a bucket name with every call to Google Storage (and Document AI), so you may want to store the name of a default bucket in the environment. Here you have two options:

Set it for the current session with gcs_global_bucket("<your bucket name>")
Store it permanently in you .Renviron file by calling usethis::edit_r_environ() and adding GCS_DEFAULT_BUCKET=<your bucket name> to the list of variables (just as you did with GCS_AUTH_FILE and DAI_PROCESSOR_ID in the configuration vignette). Note that adding a default bucket to .Renviron will not prevent you from supplying other bucket names in individual calls to Google Storage when necessary.

To get a bucket’s file inventory, we use gcs_list_objects(). Leaving the parentheses empty will get information about the default bucket if you have set it.

gcs_list_objects()

At this point it’s obviously empty, so let’s upload something.

Uploading files

This we do with gcs_upload(). If the file is in your working directory, just write the filename; otherwise provide the full file path. If you want, you can store the file under another name in Google Storage with the name parameter; otherwise, just leave the parameter out. For this example we create a simple CSV file and upload it.

setwd(tempdir())
write.csv(mtcars, "mtcars.csv")
gcs_upload("mtcars.csv")

Now let’s check the contents.

gcs_list_objects()

Note that you can use the parameter name to change the name that the file will be stored under in the bucket.

gcs_upload("mtcars.csv", name = "testfile.csv")

The Google Storage API handles only one file at a time, so for bulk uploads you need to use iteration. Let’s create another CSV file, create a vector with the two files, and map over it.

library(purrr)
write.csv(iris, "iris.csv")
my_files <- list.files(pattern = "*.csv")
map(my_files, gcs_upload)

Note that if your my_files vector contains full filepaths, not specifying the name parameter in gcs_upload() will produce long and awkward filenames in the Storage bucket. To avoid this, use basename() in the name parameter, like so:

map(my_files, ~ gcs_upload(.x, name = basename(.x)))

Let’s check the contents again:

gcs_list_objects()

Note that there’s a file size limit of 5Mb, but you can change it with gcs_upload_set_limit().

gcs_upload_set_limit(upload_limit = 20000000L)

Downloading files

Downloads are performed with gcs_get_object(). Note that you need to explicitly provide the name that your file will be saved under using the parameter saveToDisk.

gcs_get_object("iris.csv", saveToDisk = "iris.csv")

If you want the file somewhere other than your working directory, just provide a path. You can also change the file’s basename if you want.

gcs_get_object("mtcars.csv", saveToDisk = "/path/to/mtcars_downloaded.csv")

To download multiple files we again need to use iteration. Here it helps to know that the “bucket inventory” function, gcs_list_objects(), returns a dataframe with a column called name. If you store the output of this function as an object, e.g. contents, you can access the filenames with contents$name. To download all the files in the bucket to our working directory, we would do something like this:

contents <- gcs_list_objects()
map(contents$name, ~ gcs_get_object(.x, saveToDisk = .x))

If files with the same names exist in the destination drive, the process will fail (to protect your local files). Add overwrite = TRUE if you don’t mind overwriting them.

Note that files in a Google Storage bucket can have names that include forward slashes. The JSON files returned by Document AI, for example, can look like this: 17346859889898929078/0/document-0.json. If you try to save such a file under its full filename, your computer will think the slashes are folder separators, look for matching folder names, and give an error if those folders (in this case 17346859889898929078 and 0) don’t already exist on your drive. To avoid this and get file simply as document-0.json, use basename(.x) in the saveToDisk parameter of gcs_get_object().

map(contents$name, ~ gcs_get_object(.x, saveToDisk = basename(.x))

Deleting

We can delete files in the bucket with gcs_delete_object():

gcs_delete_object("mtcars.csv")

To delete several, we again need to loop or map. The following code deletes everything in the bucket.

contents <- gcs_list_objects()
map(contents$name, gcs_delete_object)

Convenience functions

You can always create custom functions for frequently used operations. For example, I like to start a new OCR project with an empty bucket, so I have the following function in my .Rprofile.

empty_bucket <- function() {
  contents <- googleCloudStorageR::gcs_list_objects()
  lapply(contents$name, googleCloudStorageR::gcs_delete_object)
}

Google Storage has many other functionalities, and I recommend exploring the documentation of googleCloudStorageR to find out more. But we have covered the essential ones, and you are now ready to make full use of daiR. Take a look at the vignette on basic usage to get started.

It is possible to manually upload and download files to Google Storage in the Google Cloud Console. For uploads this can sometimes be easier than doing it programmatically, but downloads and deletions will be cumbersome.↩︎