CRAN Task View: Reproducible Research
Maintainer: | John Blischak, Alison Hill, Ben Marwick, Daniel Sjoberg, Will Landau |
Contact: | jdblischak at gmail.com |
Version: | 2024-09-25 |
URL: | https://CRAN.R-project.org/view=ReproducibleResearch |
Source: | https://github.com/cran-task-views/ReproducibleResearch/ |
Contributions: | Suggestions and improvements for this task view are very welcome and can be made through issues or pull requests on GitHub or via e-mail to the maintainer address. For further details see the Contributing guide. |
Citation: | John Blischak, Alison Hill, Ben Marwick, Daniel Sjoberg, Will Landau (2024). CRAN Task View: Reproducible Research. Version 2024-09-25. URL https://CRAN.R-project.org/view=ReproducibleResearch. |
Installation: | The packages from this task view can be installed automatically using the ctv package. For example, ctv::install.views("ReproducibleResearch", coreOnly = TRUE) installs all the core packages or ctv::update.views("ReproducibleResearch") installs all packages that are not yet installed and up-to-date. See the CRAN Task View Initiative for more details. |
The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, understood, and verified. Packages in R for this purpose can be roughly split into groups for: literate programming, pipeline toolkits, package reproducibility, project workflows, code/data formatting tools, format convertors, and object caching.
The current maintainers gratefully acknowledge Max Kuhn for originally creating and maintaining this task view.
Literate Programming
The primary way that R facilitates reproducible research is using a document that is a combination of content and data analysis code. The Sweave function (in the base R utils package) and the knitr package can be used to blend the subject matter and R code so that a single document defines the content and the analysis. The brew and R.rsp packages contain alternative approaches to embedding R code into various markups.
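As a minimal sketch of this idea, the following uses only Sweave from base R. The file names and the chunk label are illustrative assumptions; the example writes a small .Rnw document to a temporary directory and weaves it into a .tex file. No LaTeX installation is needed for the weaving step itself.

```r
# Write a minimal Sweave (.Rnw) document to a temporary directory.
# File names and the chunk label are illustrative only.
rnw_file <- file.path(tempdir(), "example.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of 1 to 10 is \\Sexpr{mean(1:10)}.",
  "<<summary-chunk, echo=TRUE>>=",
  "x <- 1:10",
  "summary(x)",
  "@",
  "\\end{document}"
), rnw_file)

# Weave: run the embedded R code and write a .tex file in which the
# code, its output, and the inline \Sexpr{} result are merged with the text.
tex_file <- file.path(tempdir(), "example.tex")
utils::Sweave(rnw_file, output = tex_file, quiet = TRUE)

tex_lines <- readLines(tex_file)
```

The resulting .tex file can then be compiled with a LaTeX distribution. With knitr, the analogous step is knitr::knit() on an .Rnw or .Rmd file.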
The resources for literate programming are best organized by the document type/markup language:
LaTeX
Both Sweave and knitr can process LaTeX files. lazyWeave can create LaTeX documents from scratch. RweaveExtra provides Sweave drivers with additional options to control processing and output.
The knitr and rmarkdown packages (along with pandoc) can be used to create slides using the LaTeX beamer class.
Object Conversion Functions:
- summary tables/statistics: gtsummary, Hmisc, NMOF, papeR, quantreg, rapport, reporttools, sparktex, table1, tables, xtable, ztable, codebook
- tables/cross-tabulations: gtsummary, Hmisc, huxtable, lazyWeave, knitLatex, knitr, reporttools, table1, ztable
- graphics: animation, Hmisc, grDevices:::pictex, sparktex, tikzDevice
- statistical models/methods: gtsummary, memisc, quantreg, rms, stargazer, suRtex, texreg, xtable, ztable
- bibtex: bibtex and RefManageR
- others: latex2exp converts LaTeX equations to plotmath expressions.
Miscellaneous Tools
- Hmisc contains a function to correctly escape special characters. Standardized exams can be created using the exams package.
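To illustrate what an object conversion function does, here is a short sketch using xtable, one of the packages listed above for statistical models. The model and data set (mtcars) are arbitrary choices for illustration, and the code is skipped if xtable is not installed.

```r
# Convert a fitted linear model into LaTeX table markup with xtable.
# The model and data set are arbitrary; only the conversion step matters.
latex_code <- NULL
if (requireNamespace("xtable", quietly = TRUE)) {
  fit <- lm(mpg ~ wt, data = mtcars)
  tab <- xtable::xtable(fit)
  # print.results = FALSE returns the markup as a string instead of printing;
  # comment = FALSE drops the header comment that xtable adds by default.
  latex_code <- print(tab, print.results = FALSE, comment = FALSE)
  cat(latex_code)
} else {
  message("xtable is not installed; skipping the example")
}
```

The returned string can be pasted into a LaTeX document, or emitted directly from a Sweave or knitr chunk whose results are passed through verbatim.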
HTML
The knitr package can process HTML files directly. Sweave can also work with HTML by way of the R2HTML package. lazyWeave can create HTML format documents from scratch.
For HTML slides, a combination of the knitr and rmarkdown packages (along with pandoc) can be used to create slides using ioslides, reveal.js, Slidy, or remark.js (from the xaringan package).
The packages blogdown, bookdown, and distill can create entire websites.
Object Conversion Functions:
- summary tables/statistics: gtsummary, parameters, stargazer, table1, codebook
- tables/cross-tabulations: DT, flextable, formattable, gt, gtsummary, htmlTable, HTMLUtils, huxtable, hwriter, knitr, lazyWeave, table1, texreg, ztable
- statistical models/methods: gtsummary, rapport, stargazer, xtable
- others: knitcitations, RefManageR
Miscellaneous Tools: htmltools has various tools for working with HTML. tufterhandout can create Tufte-style handouts.
Markdown
The knitr package can process markdown files without assistance. The packages markdown and rmarkdown have general tools for working with documents in this format. lazyWeave can create markdown format documents from scratch. Also, the ascii package can write R objects to the AsciiDoc format.
Object Conversion Functions:
- summary tables/statistics: gtsummary, papeR
- tables/cross-tabulations: DT, formattable, gtsummary, htmlTable, knitr, lazyWeave, papeR, parameters
- statistical models/methods: gtsummary, pander, papeR, rapport, texreg
- others: RefManageR
Miscellaneous Tools: tufterhandout can create Tufte-style handouts. kfigr allows for figure indexing in markdown documents.
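As a small sketch of table conversion for Markdown, the following uses knitr::kable() on the first rows of a built-in data set. The choice of data and columns is arbitrary, and the code is skipped if knitr is not installed.

```r
# Render a data frame as a Markdown table with knitr::kable().
md_lines <- NULL
if (requireNamespace("knitr", quietly = TRUE)) {
  small <- head(mtcars[, c("mpg", "wt")], 3)
  md_table <- knitr::kable(small, format = "markdown")
  md_lines <- as.character(md_table)  # one element per line of the table
  print(md_table)
} else {
  message("knitr is not installed; skipping the example")
}
```

Inside an R Markdown document the same call would typically sit in a code chunk, and the table would be rendered in whatever output format the document is converted to.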
The officer (formerly ReporteRs and before that R2DOCX) package can create docx and pptx files. R2wd (windows only) can also create Word documents from scratch and R2PPT (also windows only) can create PowerPoint slides. The rtf package does the same for Rich Text Format documents. The openxlsx package creates xlsx files. The readODS package can read and write Open Document Spreadsheets.
Object Conversion Functions:
Pipeline toolkits help maintain and verify reproducibility. They synchronize computational output with the underlying code and data, and they tell the user when everything is up to date. In other words, they provide concrete evidence that results are re-creatable from the starting materials, and the data analysis project does not need to rerun from scratch. The targets package is such a pipeline toolkit. It is similar to GNU Make, but it is R-focused.
- drake: A general-purpose computational engine for data analysis, drake rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date.
- flowr: This framework allows you to design and implement complex pipelines, and deploy them on your institution’s computing cluster.
- maestro: Framework for creating and orchestrating data pipelines. Organize, orchestrate, and monitor multiple pipelines in a single project. Use tags to decorate functions with scheduling parameters and configuration.
- makeit: Run R scripts if needed, based on last modified time. Implemented in base R with no additional software requirements, organizational overhead, or structural requirements.
- makepipe: A suite of tools for transforming an existing workflow into a self-documenting pipeline with very minimal upfront costs.
- repo: A data manager meant to avoid manual storage/retrieval of data to/from the file system.
- targets: As a pipeline toolkit for Statistics and data science in R, the ‘targets’ package brings together function-oriented programming and ‘Make’-like declarative workflows.
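The core Make-like rule, rebuilding an output only when it is missing or older than its inputs, can be sketched in base R. The helper build_if_stale() and the file names below are hypothetical and are not the API of any package listed above. Toolkits such as targets also track more than file timestamps, including changes to the code.

```r
# Hypothetical helper (not from any listed package): run `build` only if
# `target` is missing or older than at least one of its dependencies.
build_if_stale <- function(target, deps, build) {
  stopifnot(all(file.exists(deps)))
  stale <- !file.exists(target) ||
    any(file.mtime(deps) > file.mtime(target))
  if (stale) build()
  invisible(stale)  # TRUE if the build step ran, FALSE if it was skipped
}

# A tiny two-step pipeline in a temporary directory.
work_dir <- tempfile("pipeline-")
dir.create(work_dir)
raw_file <- file.path(work_dir, "raw.csv")
out_file <- file.path(work_dir, "summary.rds")

write.csv(data.frame(x = 1:5), raw_file, row.names = FALSE)

summarise_raw <- function() {
  raw <- read.csv(raw_file)
  saveRDS(mean(raw$x), out_file)
}

first_run  <- build_if_stale(out_file, raw_file, summarise_raw)  # builds
second_run <- build_if_stale(out_file, raw_file, summarise_raw)  # skips
```

The second call does no work because the output is already newer than its input, which is the "up to date" signal described above.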
Package Reproducibility
R has various tools for ensuring that specific package versions can be required for analyses. As an example, the renv package installs packages in a project-specific directory, records “snapshots” of the current package versions in a “lockfile”, and restores the package setup on a different machine.
- checkpoint: Allows you to install packages as they existed on CRAN on a specific snapshot date as if you had a CRAN time machine.
- containerit (GitHub only): Package R sessions, scripts, workspace directories, and R Markdown documents together with all dependencies to execute them in Docker containers.
- dateback: Works like a virtual CRAN snapshot for source packages. It automatically downloads and installs ‘tar.gz’ files with dependencies, all of which were available on a specific day.
- groundhog: Make R scripts that rely on packages reproducible, by ensuring that every time a given script is run, the same versions of the packages used are loaded.
- liftr: Persistent reproducible reporting by containerization of R Markdown documents.
- miniCRAN: Makes it possible to create an internally consistent repository consisting of selected packages from CRAN-like repositories.
- packrat: Manage the R packages your project depends on in an isolated, portable, and reproducible way.
- rang: Resolve the dependency graph of R packages at a specific time point in order to reconstruct the R computational environment.
- renv: Create and manage project-local R libraries, save the state of these libraries to a ‘lockfile’, and later restore your library as required.
- Require: A single key function, ‘Require’, that makes rerun-tolerant versions of ‘install.packages’ and ‘require’ for CRAN packages, packages no longer on CRAN (i.e., archived), specific versions of packages, and GitHub packages.
- rix: Simplifies the creation of reproducible development environments using the ‘Nix’ package manager.
- switchr: Provides an abstraction for managing, installing, and switching between sets of installed R packages.
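The snapshot-and-compare idea behind a lockfile can be illustrated in base R. This is a deliberately simplified sketch: the helpers snapshot_versions() and check_versions() and the CSV lockfile format are hypothetical, and they only record and compare versions without installing or restoring anything. With renv itself, the corresponding steps are renv::snapshot() and renv::restore().

```r
# Hypothetical helpers illustrating the lockfile idea; not renv's API.

# Record the installed version of each package in a CSV "lockfile".
snapshot_versions <- function(pkgs, lockfile) {
  versions <- vapply(
    pkgs,
    function(p) as.character(utils::packageVersion(p)),
    character(1)
  )
  lock <- data.frame(package = pkgs, version = unname(versions),
                     stringsAsFactors = FALSE)
  write.csv(lock, lockfile, row.names = FALSE)
  invisible(lock)
}

# Compare the currently installed versions against the lockfile.
# packageVersion() raises an error if a recorded package is not installed.
check_versions <- function(lockfile) {
  lock <- read.csv(lockfile, colClasses = "character")
  installed <- vapply(
    lock$package,
    function(p) as.character(utils::packageVersion(p)),
    character(1)
  )
  lock$installed <- unname(installed)
  lock$matches <- lock$installed == lock$version
  lock
}

lockfile <- tempfile(fileext = ".csv")
snapshot_versions(c("stats", "utils"), lockfile)
status <- check_versions(lockfile)
print(status)
```

On another machine, any row with matches equal to FALSE would flag a package whose version differs from the one used for the original analysis.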
Project Workflows
Successfully completing a data analysis project often requires much more than statistics and visualizations. Efficiently managing the code, data, and results as the project matures helps reduce stress and errors. The following “workflow” packages assist the R programmer by managing project infrastructure and/or facilitating a reproducible workflow.
Workflow utility packages provide single-use functions to implement project infrastructure or solve a specific problem. As a typical example, usethis::use_git() initializes a Git repository, ignores common R files, and commits all project files.
- cabinets: Creates project specific directory and file templates that are written to a .Rprofile file.
- here: Constructs paths to your project’s files.
- prodigenr: Create a project directory structure, along with typical files for that project.
- RepoGenerator: Generates a project and repo for easy initialization of a GitHub repo for R workshops.
- rrtools (GitHub only): Instructions, templates, and functions for making a basic compendium suitable for doing reproducible research with R.
- starter: Get started with new projects by dropping a skeleton of a new project into a new or existing directory, initialise git repositories, and create reproducible environments with the ‘renv’ package.
- starters (GitHub only): Setting up R project directories for teaching, presenting, analysis, package development can be a pain. starters shortcuts this by creating folder structures and setting good defaults for you.
- trackdown: Collaborative writing and editing of R Markdown (or Sweave) documents via Google Docs.
- usethis: Automate package and project setup tasks that are otherwise performed manually.
Workflow framework packages provide an organized directory structure and helper functions to assist during the development of the project. As a typical example, ProjectTemplate::create.project() creates an organized setup with many subdirectories, and ProjectTemplate::run.project() executes each R script that is saved in the src/ subdirectory.
- exreport: Analysis of experimental results and automatic report generation in both interactive HTML and LaTeX.
- madrat: Provides a framework which should improve reproducibility and transparency in data processing. It provides functionality such as automatic meta data creation and management, rudimentary quality management, data caching, work-flow management and data aggregation.
- makeProject: This package creates an empty framework of files and directories for the “Load, Clean, Func, Do” structure described by Josh Reich.
- orderly: Order, create and store reports from R.
- projects: Provides a project infrastructure with a focus on manuscript creation.
- ProjectTemplate: Provides functions to automatically build a directory structure for a new R project. Using this structure, ‘ProjectTemplate’ automates data loading, preprocessing, library importing and unit testing.
- rcompendium: Eases the creation of an R package or research compendium (i.e. a predefined files/folders structure) so that users can focus on the code/analysis instead of wasting time organizing files.
- reportfactory: Provides an infrastructure for handling multiple R Markdown reports, including automated curation and time-stamping of outputs, parameterisation and provision of helper functions to manage dependencies.
- represtools: Reproducible research tools that automate the creation of an analysis directory structure and workflow. There are R Markdown skeletons which encapsulate typical analytic workflow steps. Functions will create appropriate modules which may pass data from one step to another.
- TAF: General framework to organize data, methods, and results used in reproducible scientific analyses. A TAF analysis consists of four scripts (data.R, model.R, output.R, report.R) that are run sequentially.
- tinyProject: Creates useful files and folders for data analysis projects and provides functions to manage data, scripts and output files.
- worcs: Create reproducible and transparent research projects in ‘R’. This package is based on the Workflow for Open Reproducible Code in Science (WORCS), a step-by-step procedure based on best practices for Open Science.
- workflowr: Provides a workflow for your analysis projects by combining literate programming (‘knitr’ and ‘rmarkdown’) and version control (‘Git’, via ‘git2r’) to generate a website containing time-stamped, versioned, and documented results.
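The pattern shared by many workflow frameworks, a fixed directory layout plus a helper that runs the project's scripts in order, can be sketched in base R. The functions create_skeleton() and run_scripts() and the directory names are hypothetical. Packages such as ProjectTemplate or TAF provide far more (data loading, caching, reporting).

```r
# Hypothetical helpers; real workflow frameworks do considerably more.

# Create a project root with a fixed set of subdirectories.
create_skeleton <- function(root, subdirs = c("data", "src", "reports")) {
  for (d in file.path(root, subdirs)) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(root)
}

# Source every .R script in src/ in alphabetical order, sharing one
# environment so that later scripts can use objects from earlier ones.
run_scripts <- function(root, env = new.env()) {
  scripts <- sort(list.files(file.path(root, "src"),
                             pattern = "\\.R$", full.names = TRUE))
  for (s in scripts) source(s, local = env)
  invisible(env)
}

project <- create_skeleton(tempfile("project-"))
writeLines("x <- 1:10", file.path(project, "src", "01-load.R"))
writeLines("x_mean <- mean(x)", file.path(project, "src", "02-analyse.R"))

result <- run_scripts(project)
print(get("x_mean", envir = result))
```

Numbered file names are one simple way to make the execution order explicit and reproducible.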
formatR and styler can be used to format R code.
highlight and highr can be used to color R code.
Packages humanFormat, lubridate, prettyunits, and rprintf have functions to better format data.
pander can be used for rendering R objects into Pandoc’s markdown. knitr has the function pandoc that can call an installed version of Pandoc to convert documents between formats such as Markdown, HTML, LaTeX, PDF and Word. tth facilitates TeX to HTML/MathML conversions.
Object Caching Packages
When using Sweave and knitr it can be advantageous to cache the results of time-consuming code chunks if the document will be re-processed (e.g. during debugging). knitr facilitates object caching and the Bioconductor package weaver can be used with Sweave.
Non-literate programming packages to facilitate caching/archiving are archivist, R.cache, reproducible, and storr.
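The basic mechanism of object caching, storing a result on disk and reusing it on later runs, can be sketched with saveRDS() and readRDS() from base R. The helper cached(), its arguments, and the cache key are hypothetical. Unlike knitr's chunk cache, this sketch does not notice when the code behind a key changes, so stale entries must be deleted by hand.

```r
# Hypothetical helper: evaluate `expr` once per key and reuse the stored
# result afterwards. `expr` is only evaluated on a cache miss because R
# evaluates function arguments lazily.
cached <- function(key, expr, cache_dir) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))  # cache hit: skip the computation
  }
  value <- expr            # cache miss: force the computation
  saveRDS(value, path)
  value
}

cache_dir <- tempfile("cache-")

# Count how often the "slow" computation actually runs.
n_evaluations <- 0
slow_mean <- function() {
  n_evaluations <<- n_evaluations + 1
  mean(1:100)
}

first  <- cached("slow-mean", slow_mean(), cache_dir)  # computes and stores
second <- cached("slow-mean", slow_mean(), cache_dir)  # reads from disk
```

In knitr the comparable behaviour is enabled per chunk with the cache = TRUE chunk option.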
CRAN packages
Core: | Hmisc, knitr, R2HTML, rms, xtable. |
Regular: | animation, archivist, ascii, bibtex, blogdown, bookdown, brew, cabinets, checkpoint, codebook, codebookr, dateback, distill, drake, DT, exams, exreport, flextable, flowr, formatR, formattable, groundhog, gt, gtsummary, here, highlight, highr, htmlTable, htmltools, HTMLUtils, humanFormat, huxtable, hwriter, kfigr, knitcitations, knitLatex, latex2exp, lazyWeave, liftr, lubridate, madrat, maestro, makeit, makepipe, makeProject, markdown, memisc, miniCRAN, mschart, NMOF, officer, openxlsx, orderly, packrat, pander, papeR, parameters, pharmaRTF, prettyunits, prodigenr, projects, ProjectTemplate, quantreg, R.cache, R.rsp, R2PPT, r2rtf, R2wd, rang, rapport, rcompendium, readODS, RefManageR, renv, repo, RepoGenerator, reportfactory, reporttools, represtools, reproducible, Require, rix, rmarkdown, rprintf, rtf, RweaveExtra, sparktex, stargazer, starter, storr, styler, suRtex, switchr, table1, tables, TAF, targets, texreg, tikzDevice, tinyProject, trackdown, tth, tufterhandout, unrtf, usethis, worcs, workflowr, xaringan, ztable. |
Related links
- Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis
- knitr: Elegant, flexible and fast dynamic report generation with R
- Wikipedia: Literate Programming
- Harrell: Reproducible Research (Biostatistics for Biomedical Research)
- Koenker, Zeileis: On Reproducible Econometric Research
- Peng: Reproducible Research and Biostatistics
- Rossini, Leisch: Literate Statistical Practice
- Baggerly, Coombes: Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology
- Leisch: Sweave, Part I: Mixing R and LaTeX
- Leisch: Sweave, Part II: Package Vignettes
- Betebenner: Using Control Structures with Sweave
- Garbade, Burgard: Using R/Sweave in Everyday Clinical Practice
- Gorjanc: Using Sweave with LyX
- Lecoutre: The R2HTML Package
- List of pipeline toolkits
- Computational Environments and Reproducibility
- Bryan: Project-oriented workflow
- rOpenSci: Reproducibility in Science
- Temple Lang, Gentleman: Statistical Analyses and Reproducible Research
- Marwick, Boettiger, Mullen: Packaging Data Analytical Work Reproducibly Using R (and Friends)
- Xie: Write An R Package Using Literate Programming Techniques
- Rolland: Reproducible Research in R (and friends)
- Schratz: Reproducibility of parallel tasks in R
Other resources