PubTator is an NCBI resource that provides detailed annotations of abstracts indexed in PubMed, which makes it a valuable research tool. PubTator does provide an API, but API access is inconvenient for high-throughput analyses and requires a reliable internet connection. Querying a local copy of the PubTator data is better suited to such analyses. The pubtatordb package makes it easy to set up and start using a local copy of PubTator’s data.
You can install the released version of pubtatordb from CRAN with:
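A minimal sketch of the standard CRAN install call, assuming a CRAN mirror is configured and the machine is online:

```r
# Install the released version from CRAN (requires network access).
install.packages("pubtatordb")
```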
The version on GitHub can be downloaded using the devtools package with:
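A sketch of the GitHub install. The repository slug `MAMC-DCI/pubtatordb` is an assumption that is not stated in this document; confirm it against the URL listed on the package's CRAN page before running.

```r
# install.packages("devtools")  # uncomment if devtools is not yet installed

# NOTE: the "MAMC-DCI/pubtatordb" slug is an assumption; verify it first.
devtools::install_github("MAMC-DCI/pubtatordb")
```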
Load the package.
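This is the usual attach call, assuming the package has already been installed:

```r
# Attach pubtatordb so its functions can be called without a prefix.
library(pubtatordb)
```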
After loading the package, database setup and querying can be accomplished in four steps.
First, manually create a folder to store the data. Then define the path to that folder and download the data to that location:
```r
# Download the data.
# tempdir() is used here only for illustration; its contents are deleted when
# the R session ends, so use the full path to a permanent folder in practice.
download_dir <- tempdir()
download_pt(download_dir)
```
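Because the temp directory does not persist between sessions, a permanent folder is usually the safer target. The following is a minimal base R sketch of one way to create such a folder and obtain its full path. The helper name `make_data_dir` and the example folder name are hypothetical and are not part of pubtatordb.

```r
# Create the folder if it does not exist yet and return its full path.
# `make_data_dir` is a hypothetical helper, not a pubtatordb function.
make_data_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  # normalizePath() expands "~" and relative segments into a full path.
  normalizePath(path, mustWork = TRUE)
}

# Example usage with a hypothetical folder in the home directory:
# download_dir <- make_data_dir("~/pubtator_data")
# download_pt(download_dir)
```

Calling the helper a second time with the same path is harmless: the folder already exists, so only the normalized path is returned.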
After defining the path to the download directory created above, the database can be created with:
```r
# Define the data directory, a subdirectory of the above directory.
pubtator_path <- file.path(download_dir, "PubTator")

# Create the database.
pt_to_sql(
  pubtator_path,
  skip_behavior = FALSE,
  remove_behavior = TRUE
)
```
If the .gz files from PubTator have already been extracted, their extraction can be skipped with the skip_behavior argument. After their insertion into the database, both the .gz and uncompressed files can be removed using the remove_behavior argument.
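As an illustration of the two arguments, the call below sketches the opposite configuration: it assumes the archives were already extracted during an earlier run and that the source files should be kept on disk. It relies on `pubtator_path` being defined as in the database creation step.

```r
# Rebuild the database from files that were extracted previously.
pt_to_sql(
  pubtator_path,
  skip_behavior = TRUE,    # skip extraction of archives already unpacked
  remove_behavior = FALSE  # keep the .gz and uncompressed files
)
```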
A connection to the database can be created using pt_connector. Note that this function is a wrapper for the dbConnect function of the DBI package.
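A minimal sketch of opening the connection. It assumes that pt_connector takes the same directory path that was passed to pt_to_sql; check the function's help page if the call fails.

```r
# Create a connection to the database built in the previous step.
# Assumption: pt_connector() accepts the PubTator data directory path.
db_con <- pt_connector(pubtator_path)

# When all queries are finished, close the connection with the DBI generic:
# DBI::dbDisconnect(db_con)
```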
Querying the data is accomplished using the pt_select function. The first five rows of the gene table can be selected with:
```r
# Query the data.
pt_select(
  db_con,
  "gene",
  columns = NULL,
  keys = NULL,
  keytype = NULL,
  limit = 5
)
```
The first five records for articles (PMIDs) that mention the genes with Entrez IDs 7356 or 4199 can be selected with:
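A sketch of such a filtered query, reusing the pt_select arguments from the previous call. The column names `ENTREZID` and `PMID` are assumptions about the gene table; verify them against the table's actual columns before relying on this call.

```r
# Select rows of the gene table whose Entrez ID matches one of the keys.
pt_select(
  db_con,
  "gene",
  columns = c("ENTREZID", "PMID"),  # assumed column names
  keys = c("7356", "4199"),         # Entrez IDs to match
  keytype = "ENTREZID",             # assumed: the column that keys are matched against
  limit = 5
)
```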
PubTator has several datasets. The names of tables in the database can be obtained with:
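Since pt_connector wraps dbConnect, `db_con` should behave like an ordinary DBI connection. The sketch below therefore uses the generic DBI function rather than a pubtatordb-specific helper, which is an assumption about how the connection object can be used.

```r
# List every table in the local database using the generic DBI call.
DBI::dbListTables(db_con)
```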
The column names for a particular table can be accessed with:
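Under the same assumption that `db_con` is a regular DBI connection, the generic DBI function for listing fields can serve as a sketch here, using the gene table as the example:

```r
# List the column names of the gene table using the generic DBI call.
DBI::dbListFields(db_con, "gene")
```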
The citation information for PubTator can be found on the PubTator website or with:
```r
pubtator_citations()
#> Please cite PubTator in any publications:
#> 1. Wei CH et. al., PubTator: a Web-based text mining tool for assisting Biocuration, Nucleic acids research, 2013, 41 (W1): W518-W522. doi: 10.1093/nar/gkt44
#> 2. Wei CH et. al., Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts, Database (Oxford), bas041, 2012
#> 3. Wei CH et. al., PubTator: A PubMed-like interactive curation system for document triage and literature curation, in Proceedings of BioCreative 2012 workshop, Washington DC, USA, 145-150, 2012
```
The views expressed are those of the author(s) and do not reflect the official policy of the Department of the Army, the Department of Defense or the U.S. Government.