vcfR is a package intended to help visualize, manipulate and quality filter data in VCF files.
More documentation for vcfR can be found at the vcfR documentation website.
Input files frequently present challenges to analysis. A common problem I encounter is that chromosome names are not standardized among VCF, FASTA and GFF files. This presents work for the analyst. I suggest reading these files into R, syncronizing the names in R and then proceeding with downstream analyses. The other option I see is to create a set of files where the data is identical to the initial files, but the names have been syncronized. This later choice results in the creation of files which are largely redundant, something I feel is unnecessary.
Memory use is another important consideration when using vcfR. A strength of R is that it was typically intended to read in entire datasets into memory. This allows for visualization, manipulation and analyses to be performed on the entire dataset at once. The size of genomic datasets, particularly the VCF data, present a challenge in that they may be too large for practical use in R. This presents us with the challenge of reading enough data into memory so that we can explore a large amount of it, but not so much that we exceed our existing resources. It has been my experience that R does not perfomr well when memory use approaches 1 GB of RAM, so simply investing in a workstation with a large amount of memory may not be a solution. (This may change in the future as R is under continual development.) My solution to finding this balance is to work on single chomosomes. Actually, the ‘chromosomes’ in my projects are typically supercontigs, scaffolds or contigs. This is one situation where not having complete chromosomes actually can be an advantage.
The vcfR package is designed to work with data from VCF files. The use of a sequence file (FASTA format) and an annotation file (GFF format) can provide helpful context, but are not required. We’ll begin our example by locating the data files from the package ‘pinfsc50.’
pkg <- "pinfsc50"
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = pkg)
dna_file <- system.file("extdata", "pinf_sc50.fasta", package = pkg)
gff_file <- system.file("extdata", "pinf_sc50.gff", package = pkg)
Then read in the VCF file with vcfR.
The function read.vcfR()
takes the filename you specify
and reads it into R where it is stored as a vcfR
object. The vcfR object is an S4 class object with
three slots containing the metadata, the fixed data and the genotype
data. More information on VCF data can be found in the vignette ‘vcf
data.’ This object provides a known organization for the data so that
downstream functions can easily access it.
Genomic reference sequence files are typically in FASTA format files. These can be read in using the package ape.
Annotation files (we currently support GFF), files which contain coordinates for annotations such as start and end points of genes, are tabular and can be read in with typical R functions.
In my experience, GFF files typically to not surround text with
quotes. This can present a challenge to reading these files into R.
Disabling quotes in the call to read.table()
typically
helps handle this.
Once the data has been read into memory, modifications can be made to chromosome names or any other inconsistencies and one can proceed.
vcfR was designed to work on an individual chromosome, supercontig or contig, depending on the state of your genome. Reading an entire genome into memory may present a technical challenge when there is a lot of data. For example, when the genome is large or there are a lot of samples. Attempting to read in lage datasets to memory may exhaust all available memory and result in an unresponsive computer. Working on chromosomes appears to be a natural way to decompose this problem. Once you have read an object into R (e.g., an annotation or sequence file) you may need to subset it to data for a single chromosome, if necessary.
Once the data are in memory we can use it to create a
chromR object with the function
create.chromR()
. The create.chromR()
function
creates a new chromR object and populates it with data you provided
it.
## Names in vcf:
## Supercontig_1.50
## Names of sequences:
## Supercontig_1.50 of Phytophthora infestans T30-4
## Warning in create.chromR(name = "Supercontig", vcf = vcf, seq = dna, ann = gff):
## Names in variant data and sequence data do not match perfectly.
## If you choose to proceed, we'll do our best to match the data.
## But prepare yourself for unexpected results.
## Names in annotation:
## Supercontig_1.50
## Initializing var.info slot.
## var.info slot initialized.
Note that the names of our three data sources are not identical. This results in a warning. When we examine the output we see that the name in the VCF file is ‘Supercontig_1.50,’ while the name in the FASTA is ‘Supercontig_1.50 of Phytophthora infestans T30-4.’ We know that these are synonyms so we can ignore the warning and proceed.
The parameter ‘name’ is a name you can assign to your object. This
information is used when plotting the chromR object. The vcfR object
should be of class vcfR, which was most likely created with the function
read.vcfR()
. This object is inserted into the chrom object
with the function vcfR2chromR()
. The parameter ‘seq’ should
be a DNAbin object (see the R package ape) with one sequence in it. This
sequence will be inserted into the chromR object with the function
seq2chromR()
. If a sequence is not provided,
seq2chromR()
will infer the length of your chromosome from
the maximum position in the vcfR object. The parameter ‘ann’ should be a
GFF
and be of class data.frame. These can typically be read in with base R
functions such as read.table()
. This table will then be
inserted into the chromR object by ann2chromR()
. This
function will check to see if columns 4 and 5 (“start” and “end”) are
numeric. If not, it will try to recast them as so.
Once the chromR object has been created a few processing steps are needed. First, you may want to get a quick look at some of your data. This can be done with the plot function.
The distribution of read depth (DP) stands out. Presumably, there is
some base ploid level at which most of each genome is sequenced at. Here
we see a peak, which may represent that base ploid region, but we also
see a long tail which may represent copy number variants. Because
genotypers typically expect a constant level of ploidy, variant calls in
copy number variants may be suspect. We can see that mapping qualities
(MQ) are all rather peaked around a value of 60. Because of this, if we
would like to filter on this parameter we now know that we would have to
employ a narrow threshold. Interpretation of the qualities (QUAL)
appears less straightforward and may be clinal from a value of zero. You
may conclude that this is not an ideal parameter to filter on. No SNP
densities are found at this point because this data results from
windowing analyses performed by proc_chromR()
(see below).
Filtering on other parameters may reveal a more straight forward
path.
Note that VCF data created by different variant calling software may or may not have these fields or their ranges may be different. For example, here the mapping quality is peaked at 60. Other softwares may create files where mapping quality is peaked at 20 or some other value. This is why it is important to visualize the distribution of your data so you understand its properties.
We can use the masker()
function to try to filter out
data that we do not have high confidence in. The masker() function uses
quality, depth and mapping quality to try to select high quality
variants. See ?masker
for default values. When using
masker, variants deemed to be of low quality are not deleted from the
dataset. Instead, a logical vector is created to indicate which variants
have or have not been filtered. This maintains the geometry of the data
matrices throughout the analyisis and allows the user to easily undo any
changes.
chrom <- masker(chrom, min_QUAL = 1, min_DP = 300, max_DP = 700, min_MQ = 59.9, max_MQ = 60.1)
plot(chrom)
Once we’re satisfied with which variants we want to consider to be of
high quality we can process the chromR object with
proc.chromR()
. This function calls several helper functions
to process the variant, sequence and annotation data for
visualization.
The function regex.win()
defines rectangles for where
called sequence (A, C, G and T) occur as well as where ambiguous
nucleotides occur (N) which are used later for plotting. This function
also defines rectangles for annotated features, which are also for
plotting.
The function var.win()
performs windowing analyses on
the data. Currently it summarizes variant count per window as well as
G/C content per window.
## Nucleotide regions complete.
## elapsed time: 0.341
## N regions complete.
## elapsed time: 0.312
## Population summary complete.
## elapsed time: 0.238
## window_init complete.
## elapsed time: 0.001
## windowize_fasta complete.
## elapsed time: 0.09
## windowize_annotations complete.
## elapsed time: 0.014
## windowize_variants complete.
## elapsed time: 0.001
Now that we’ve processed our chromR object, we have variant counts per window. We’re also ready to move on to more complex plots.
At this point we’ve input three types of data (variant, sequence and annotation), inserted them into a chromR object, masked variants we feel were not of high quality and processed some summaries of these data. We can now move on to visualizing these data.
The function chromoqc()
uses the R function
layout()
to make composite plots of the data. These plots
can include barplots as well as scatterplots which have chromosomal
coordinates.
We can also zoom in on a feature of interest by using the
xlim
parameter.
One of the goals of the package vcfR is to help investigators understand and explore their data. Once they’ve gained an understanding of this data, they will likely want to act upon it. One way to act upon this understanding is to use their aquired comprehension of their data to filter it to what they feel is of adequate quality.
Within the framework of the package vcfR, the filtering and output of
variants determined to be of adequate quality can be accomplished with
the function write.vcf()
. This function takes a vcfR object
and optionally subsets it using the mask, created in previous steps, and
outputs it to a (gzipped) VCF file. This file should be usable by all
VCF compliant softwares for downstream analyses.
Conversion of vcfR and chromR to objects supported in other R packages is covered in the vignette ‘Converting data.’