piggyback grew out of the needs of students both in my
classroom and in my research group, who frequently need to work with
data files somewhat larger than one can conveniently manage by
committing directly to GitHub. As we frequently want to share and run
code that depends on >50MB data files on each of our own machines, on
continuous integration, and on larger computational servers, data
sharing quickly becomes a bottleneck.
GitHub allows repositories to attach files of up to 2 GB each to releases as a way to distribute large files associated with the project source code. There is no limit on the number of files or bandwidth to deliver them.
Install the latest release from CRAN using:
install.packages("piggyback")
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("ropensci/piggyback")
No authentication is required to download data from public
GitHub repositories using piggyback. Nevertheless, piggyback
recommends setting a token when possible to avoid rate limits.
To upload data to any repository, or to download data from
private repositories, you will need to authenticate first.
To do so, store your GitHub token in an environment variable, e.g. in a
.Renviron file in your home directory or project directory (any private
place you won’t upload); see usethis::edit_r_environ(). For one-off
use you can also set your token from the R console using:
Sys.setenv(GITHUB_PAT="xxxxxx")
But try to avoid putting Sys.setenv() in any R scripts: remember, the
goal here is to avoid writing your private token in any
file that might be shared, even privately.
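A script can check that a token is available without ever containing the token itself. The following is a minimal sketch, assuming the token is stored in the GITHUB_PAT environment variable as above; has_github_token() is a hypothetical helper and not part of piggyback.

```r
# Hypothetical helper (not part of piggyback): report whether a token is
# available, without ever printing or storing its value.
has_github_token <- function(var = "GITHUB_PAT") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (!has_github_token()) {
  message("GITHUB_PAT is not set; uploads and private-repository downloads need a token.")
}
```

Because the check only reads the environment, a script containing it can be shared safely.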
For more information, please see the usethis guide to GitHub credentials.
Download the latest version or a specific version of the data:
library(piggyback)
pb_download("iris2.tsv.gz",
repo = "cboettig/piggyback-tests",
tag = "v0.0.1",
dest = tempdir())
Note: Whenever you are working from a location inside a git repository
corresponding to your GitHub repo, you can simply omit the repo argument
and it will be detected automatically. Likewise, if you omit the release
tag, pb_download() will simply pull data from the most recent release
(latest). Third, you can omit tempdir() if you are using an RStudio
Project (.Rproj file) in your repository, and then the download location
will be relative to the project root. tempdir() is used throughout the
examples only to meet CRAN policies and is unlikely to be the choice you
actually want here.
Lastly, simply omit the file name to download all assets connected with a given release.
pb_download(repo = "cboettig/piggyback-tests",
tag = "v0.0.1",
dest = tempdir())
These defaults mean that in most cases, it is sufficient to simply
call pb_download() without additional arguments to pull in
any data associated with a project on a GitHub repo that is too large to
commit to git directly.
pb_download() will skip the download of any file that
already exists locally if the timestamp on the local copy is more recent
than the timestamp on the GitHub copy. pb_download() also
includes arguments to control the timestamp behavior, progress bar,
whether existing files should be overwritten, or if any particular files
should not be downloaded. See function documentation for details.
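The timestamp rule described above amounts to a simple comparison. The sketch below is illustrative only and is not piggyback's actual implementation; needs_download() and its arguments are hypothetical names chosen for this example.

```r
# Hypothetical helper illustrating the skip rule: download when there is no
# local copy, or when the local copy is not newer than the remote one.
needs_download <- function(local_path, remote_time) {
  if (!file.exists(local_path)) {
    return(TRUE)
  }
  local_time <- file.mtime(local_path)
  # Skip (FALSE) when the local timestamp is more recent than the remote one.
  !(local_time > remote_time)
}

tmp <- tempfile(fileext = ".tsv")
writeLines("a\tb", tmp)

needs_download(tmp, Sys.time() - 3600)  # remote copy is older: skip
needs_download(tmp, Sys.time() + 3600)  # remote copy is newer: download
```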
Sometimes it is preferable to have a URL from which the data can be
read in directly, rather than downloading the data to a local file. For
example, such a URL can be embedded directly into another R script,
avoiding any dependence on piggyback (provided the
repository is already public). To get a list of URLs rather than
actually downloading the files, use pb_download_url():
pb_download_url("data/mtcars.tsv.gz",
repo = "cboettig/piggyback-tests",
tag = "v0.0.1")
If your GitHub repository doesn’t have any releases
yet, piggyback will help you quickly create one. Create new
releases to manage multiple versions of a given data file. While you can
create releases as often as you like, making a new release is by no
means necessary each time you upload a file. If maintaining old versions
of the data is not useful, you can stick with a single release and
upload all of your data there.
pb_new_release("cboettig/piggyback-tests", "v0.0.2")
Once we have at least one release available, we are ready to upload.
By default, pb_upload() will attach data to the latest release.
## We'll need some example data first.
## Pro tip: compress your tabular data to save space & speed upload/downloads
readr::write_tsv(mtcars, "mtcars.tsv.gz")
pb_upload("mtcars.tsv.gz",
repo = "cboettig/piggyback-tests",
tag = "v0.0.1")
Like pb_download(), pb_upload() will by default overwrite any file of
the same name already attached to the release, unless the timestamp of
the previously uploaded version is more recent. You can toggle these
settings with overwrite=FALSE and use_timestamps=FALSE.
List all files currently piggybacking on a given release. Omit the
tag to see files on all releases.
pb_list(repo = "cboettig/piggyback-tests",
tag = "v0.0.1")
Delete a file from a release:
pb_delete(file = "mtcars.tsv.gz",
repo = "cboettig/piggyback-tests",
tag = "v0.0.1")
Note that this is irreversible unless you have a copy of the data elsewhere.
You can pass in a vector of file paths with something like
list.files() to the file argument of pb_upload() in order to
upload multiple files. Some common patterns:
library(magrittr)
## upload a folder of data
list.files("data", full.names = TRUE) %>%
pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
## upload certain file extensions
list.files(pattern = "\\.(tsv\\.gz|tif|zip)$") %>%
pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
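Note that list.files() takes a single regular expression in its pattern argument, not a vector of shell-style globs, so several extensions have to be combined into one expression. The sketch below shows one way to do that in a throwaway directory; ext_pattern() and the demo file names are hypothetical and only serve this example.

```r
# Hypothetical helper: turn a vector of extensions into one regular expression
# matching file names that end in any of them.
ext_pattern <- function(exts) {
  escaped <- gsub(".", "[.]", exts, fixed = TRUE)  # match dots literally
  paste0("[.](", paste(escaped, collapse = "|"), ")$")
}

# Create a few empty files in a temporary directory to match against.
demo_dir <- file.path(tempdir(), "pb-demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir,
                                c("a.tsv.gz", "b.tif", "c.zip", "notes.txt"))))

to_upload <- list.files(demo_dir,
                        pattern = ext_pattern(c("tsv.gz", "tif", "zip")),
                        full.names = TRUE)
basename(to_upload)
```

The resulting to_upload vector holds full paths, so it could be passed to the file argument of pb_upload() from any working directory.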
Similarly, you can download all current data assets of the latest or
specified release by calling pb_download() without a file argument.
To reduce API calls to GitHub, piggyback caches most calls with a
timeout of 1 second by default. This avoids repeating identical requests
to update its internal record of the repository data (releases, assets,
timestamps, etc.) during programmatic use. You can increase or decrease
this delay by setting the environment variable piggyback_cache_duration
to a value in seconds, e.g. Sys.setenv("piggyback_cache_duration"=10)
for a longer delay or Sys.setenv("piggyback_cache_duration"=0) to
disable caching, and then restarting R.
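How piggyback reads this setting internally is not shown here. The sketch below only illustrates the general pattern of reading a numeric duration from an environment variable with a fallback; the helper name cache_duration() is hypothetical, and its fallback of 1 second is taken from the default stated above.

```r
# Hypothetical helper (not piggyback's internal code): read the cache duration
# in seconds from the environment, falling back to a default when unset.
cache_duration <- function(default = 1) {
  raw <- Sys.getenv("piggyback_cache_duration", unset = NA)
  if (is.na(raw) || !nzchar(raw)) {
    return(default)
  }
  as.numeric(raw)
}

Sys.setenv("piggyback_cache_duration" = 10)
cache_duration()  # value read from the environment
Sys.unsetenv("piggyback_cache_duration")
cache_duration()  # falls back to the default
```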
GitHub assets attached to a release do not support file paths, and
will convert most special characters (#, %, etc.) to . or throw an
error (e.g. for file names containing $, @, /). piggyback will default
to using the base name of the file only (i.e. will only use
"mtcars.csv" if provided a file path like "data/mtcars.csv").
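Since only the base name is kept, two files from different directories can end up with the same asset name. A minimal sketch of a pre-upload check, using made-up example paths, might look like this:

```r
# Hypothetical pre-upload check: find paths whose base names collide, since
# only the base name is kept when a file is attached to a release.
paths <- c("data/mtcars.csv", "raw/mtcars.csv", "data/iris.csv")
asset_names <- basename(paths)

# Keep every path whose base name occurs more than once.
collisions <- paths[asset_names %in% asset_names[duplicated(asset_names)]]
collisions
```

Here collisions contains "data/mtcars.csv" and "raw/mtcars.csv", which would both be attached as "mtcars.csv"; renaming one of them before upload avoids the clash.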
piggyback is not intended as a data archiving solution.
Importantly, bear in mind that there is nothing special about multiple
“versions” in releases, as far as data assets uploaded by
piggyback are concerned. The data files piggyback attaches to a
release can be deleted or modified at any time: creating a new release
to store data assets is the functional equivalent of just creating new
directories v0.1, v0.2 to store your data. (GitHub releases are always
pinned to a particular git tag, so the code/git-managed contents
associated with the repo are more immutable, but remember that our data
assets just piggyback on top of the repo.)
Permanent, published data should always be archived in a proper data
repository with a DOI, such as zenodo.org. Zenodo can freely archive
public research data files up to 50 GB in size, and data is strictly
versioned (once released, a DOI always refers to the same version of the
data; new releases are given new DOIs). piggyback is meant
only to lower the friction of working with data during the research
process (e.g. to provide data accessible to collaborators or continuous
integration systems, including for private repositories).
GitHub documentation at the time of writing endorses the use of attachments to releases as a solution for distributing large files as part of your project.
Of course, it will be up to GitHub to decide if this use of release attachments is acceptable in the long term.