Often the simplest, but most useful operation when working with the Platform is to move data in and out. From the perspective of the R client, we call moving data from the Platform to the local machine reading. Likewise, moving data from the local machine to the Platform is called writing.
The civis
client handles data imports and exports in two
basic ways:
Data can be stored on Platform in two places:
Tables in Redshift are accessed and modified using SQL queries. Tables in Redshift can be easily shared and used in multiple workflows by multiple people. However, importing and exporting even small files on Redshift can be slow.
R objects and arbitrary files can be stored on Amazon S3, and are
accessed using a numeric file id. Data frames are uploaded as CSVs for
portability, and arbitrary R objects are serialized using
saveRDS
for speed and efficiency.
The main workhorse for getting data from Platform is
read_civis
. This function is designed to work similarly to
the built in function read.csv
, returning a dataframe from
a table in Platform. For more flexibility, read_civis
can
download files from Redshift using an SQL query, or download a file from
S3 (‘the files endpoint’) using a file id.
To read from a table in Platform, simply provide the name of the schema, table within the schema, and the database:
For convenience, a default database can be set in the package options, and not specified in further calls to any IO function. If there is only one database available, this database will automatically be used as the default. In the examples that follow, we assume that a default database has been set.
read_civis
accepts SQL queries when more flexibility is
needed. This is accomplished by wrapping sql(...)
around a
string containing the query. With read_civis
, queries are
always read only, and always return a data.frame
.
query <- "SELECT * FROM table JOIN other_table USING id WHERE var1 < 23"
df <- read_civis(sql(query))
Finally, read_civis
accepts a file id as the first
argument to read in files from S3 as data frames. IDs are obtained from
write_civis_file
.
For maximum flexibility, read_civis
accepts parameters
from read.csv
which can be used to define data types when
the defaults are not appropriate. For instance, when numbers should be
read in as characters or when strings shouldn’t be read in as
factors.
The complement to reading data into the R workspace is writing data
to the Platform. The function write_civis
uploads data
frames or csv files to an Amazon Redshift database. The function
write_civis_file
uploads R objects and arbitrary files to
Amazon S3 (the files endpoint).
When creating a new table, write_civis
relies on
Platform to determine data types. Distkeys and sortkeys can optionally
be set to improve query performance. Again, we set a default database in
these examples for convenience.
options(civis.default_db = "my_database")
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
write_civis(df, tablename = "schema.tablename",
distkey = "id", sortkey1 = "date", sortkey2 = "type")
By default, write_civis
will fail if the table passed in
tablename
already exists. Optionally,
write_civis
can append to an existing table. It may also
delete all rows and then append (truncate). If specific datatypes are
required, a table may first be created with a SQL
CREATE TABLE
command and then data can be inserted with
write_civis
.
write_civis(df, tablename = "schema.tablename", if_exists = "append")
write_civis(df, tablename = "schema.tablename", if_exists = "truncate")
If a csv file is saved to disk but not loaded in the R workspace,
write_civis
will upload the csv to Platform without needing
first load the csv into RAM. This can save time when a file is large.
Uploading a csv directly to Platform is done by simply passing the file
name and path to write_civis
as the first argument:
Finally, write_civis_file
uploads data frames, R objects
and files to Amazon S3, which is also referred to as the ‘files
endpoint.’ Data frames are uploaded as CSVs. R objects saved to the
files endpoint and are serialized using saveRDS
.
Data frames and R objects can be loaded back into memory by passing
the file id to read_civis
, and an appropriate
using
argument.
# Upload a data frame
data(iris)
id <- write_civis_file(iris)
iris2 <- read_civis(id)
# Upload an arbitrary R object
farm <- list(chickens = 1, ducks = 4, pigs = 2, cows = 1)
id <- write_civis_file(farm)
farm2 <- read_civis(id, using = readRDS)
When passed a file name and path, write_civis_file
will
upload the file to S3 as-is. To read the file back into memory, an
appropriate function to convert the file to a data frame must be
provided to the using
argument of read_civis
.
For example, a JSON file can be read back into R using
jsonlite::fromJSON
.
Occasionally, a table may be too large to store in memory.
download_civis
can be used in place of
read_civis
to download data straight to disk from
Platform.
Like read_civis
, download_civis
can
download files from Amazon Redshift by passing
schema.tablename
, or sql(...)
as the first
argument. Files can be downloaded from Amazon S3 by passing the file id
to download_civis
.
Arbitrary queries can be run on Redshift using
query_civis
, which returns the meta-data of the query.
Existing queries can be re-run by passing the query id to
query_civis
:
Often an improper API key will return an error like below:
Error in api_key() :
The environmental variable CIVIS_API_KEY is not set. Add this to your .Renviron or call Sys.setenv(CIVIS_API_KEY = '<api_key>')
However, there may be cases where the errors are less
straightforward. It is a good idea to test that API credentials are
properly set with a simple call such as
civis::users_list_me()
. See the README to set up API keys
correctly.
This may happen if a table is empty or when no rows match a
WHERE
statement. To fix, double check that the query is
correct or the table is not empty.
For both read_civis
and write_civis
, the
database must be set to the correct, case sensitive name (not hostname)
of the database.
Error in get_db(database) :
Argument database is NULL and options("civis.default_db") not set. Set this option using options(civis.default_db = "my_database")
To see a complete list of database names, run: