R/simcross is an R package to simulate genotypes from an experimental cross. The aim is flexibility rather than speed.
Meiosis is simulated following the Stahl model (Copenhaver et al. 2002) with the interference parameter being an integer. In this model, chiasma locations are the superposition of two processes: a proportion p come from a process exhibiting no interference (that is, a Poisson process) and the remainder (proportion 1 – p) come from a process following the chi-square model (Foss et al. 1993; Zhao et al. 1995). Thus, with p=0, the model reduces to the chi-square model. The chi-square model has a single parameter, m, which is a non-negative integer and controls the strength of interference. m=0 corresponds to no interference. Broman et al. (2002) estimated m=10 for the level of interference in the mouse (derived assuming p=0). The chi-square model is a special case of the gamma model (McPeek and Speed 1995) which has a positive parameter ν; the chi-square model corresponds to the case that m = ν – 1 is an integer.
In many organisms, there is always at least one chiasma for each pair of homologous chromosomes at meiosis. Simulations with R/simcross may be performed with this assumption; the same model is used, but with rejection sampling to get results conditional on there being at least one chiasma. Note that this isn’t an assumption of an obligate crossover; if there is exactly one chiasma, then a random meiotic product will have 0 or 1 crossovers, with probability 1/2 each. Also note that, with the obligate chiasma assumption, chromosomes must be greater than 50 cM.
There are two basic functions for simulating a cross:
create_parent
and cross
.
create_parent
is for generating a parent object, either
an inbred individual or the F1 offspring of two inbred
individuals. It takes two arguments: the length of the chromosome in cM,
and the allele or pair of alleles.
Here’s how to generate two inbred individuals, one from each of strain 1 and strain 2, and an F1 individual, with a 100 cM chromosome.
library(simcross)
p1 <- create_parent(L=100, allele=1)
p2 <- create_parent(L=100, allele=2)
f1 <- create_parent(L=100, allele=1:2)
You don’t need to specify the arguments by name (e.g., in the last
line above, you could just write create_parent(100, 1:2)
);
I’m doing so here just to better document the names of the
arguments.
The cross
function is used to generate one offspring
from the cross between two individuals. The input is a pair of
individuals (e.g., as produced by create_parent
), the
interference parameter m
(default m=10
), the
parameter p
for the Stahl model (default p=0
),
xchr
to indicate whether the chromosome being simulated is
the X chromosome (FALSE
, the default, to simulate an
autosome), and male
to indicate whether the offspring is to
be male (which only matters if xchr=TRUE
). Further, one may
use obligate_chiasma=TRUE
(default FALSE
) to
require at least one chiasma on the 4-strand bundle).
Here’s an example, to generate an F2 individual. (The outer parentheses cause the result to be printed.)
## $mat
## $mat$alleles
## [1] 2
##
## $mat$locations
## [1] 100
##
##
## $pat
## $pat$alleles
## [1] 1 2 1
##
## $pat$locations
## [1] 19.41483 51.33810 100.00000
The output is a list with two components: the maternal and paternal chromosomes. Each chromosome is a list with the allele in a set of intervals, and the locations of the right endpoints of the intervals.
In the example above, the maternal chromosome is a non-recombinant
2
chromosome. The paternal chromosome has two crossovers,
with the the 1
allele up to 19.4 cM, the 2
allele in the interval 19.4 – 51.3, and the 1
allele for
the remainder of the chromosome.
By default, we use m=10
, p=0
, and
obligate_chiasma=FALSE
. The chromosome length is taken from
the input parent objects (f1
in this case). If we wanted to
do the simulation with no crossover interference, but with an obligate
chiasma, we’d use:
Behind the scenes, there are two additional functions,
sim_crossovers
, for simulating crossover locations on a
chromosome, and sim_meiosis
, for simulating a meiotic
product from an individual, but in general the user need not bother with
these. The cross
function calls sim_meiosis
twice (once for each parent) and then combines the results into a single
individual. sim_meiosis
calls sim_crossovers
to generate the meiotic product.
While one could simulate any experimental cross from a series of
calls to create_parent
and cross
, it is
generally more efficient to first develop a table that describes
pedigree for the cross, and then simulate from the pedigree with the
function sim_from_pedigree
(see the next section).
To define a pedigree, we use a numeric matrix (or data frame) with
four columns: individual ID, mom, dad, and sex (coded as 0=female,
1=male). The R/simcross
package includes a sample pedigree for advanced intercross lines,
AILped
, taken from the QTLRel package.
Here is the top of that dataset:
## id mom dad sex generation
## 1 1 0 0 1 0
## 2 2 0 0 0 0
## 3 3 2 1 1 1
## 4 4 2 1 0 1
## 5 15 4 3 0 2
## 6 16 4 3 0 2
## 7 17 4 3 0 2
## 8 14 4 3 1 2
## 9 11 4 3 1 2
## 10 12 4 3 1 2
The function check_pedigree
can be used to check that a
pedigree matrix conforms to R/simcross’s requirements: founders have
mom == dad == 0
, all other individuals have both parents
present in the pedigree, and parents always precede any of their
children. The check_pedigree
function returns
TRUE
if the pedigree matrix is okay; otherwise, it throws
an error.
## [1] TRUE
R/simcross includes a set of functions for generating pedigree
matrices for different cross designs: sim_ril_pedigree
,
sim_4way_pedigree
, sim_ail_pedigree
, and
sim_do_pedigree
.
sim_ril_pedigree
generates a pedigree matrix for a
single recombinant inbred line derived from 2k founder lines
for some k>0. (Examples include the Collaborative Cross (Threadgill
and Churchill 2012) and MAGIC lines (Kover et al. 2009).) The arguments
are ngen
(number of generations of inbreeding),
selfing
(default FALSE
, for sibling mating),
parents
(a vector of integers for the parents; the length
must be a power of 2 (i.e., 2, 4, 8, 16, etc.) and corresponds to the
number of founder lines), and firstind
, the ID number to
attach to the first individual following the parents (so that the
pedigree matrices for multiple RIL may be rbind
-ed
together). Note that there’s no real simulation here; the
result is entirely deterministic.)
## id mom dad sex gen
## 1 1 0 0 0 0
## 2 2 0 0 1 0
## 3 3 0 0 0 0
## 4 4 0 0 1 0
## 5 5 1 2 0 1
## 6 6 3 4 1 1
## 7 7 5 6 0 2
## 8 8 5 6 1 2
## 9 9 7 8 0 3
## 10 10 7 8 1 3
## 11 11 9 10 0 4
## 12 12 9 10 1 4
## 13 13 11 12 0 5
## 14 14 11 12 1 5
## 15 15 13 14 0 6
## 16 16 13 14 1 6
The gen
column in the output is the generation number,
with 0 corresponding to the founders. The generations are simply
sequential and so don’t correspond to the numbering scheme used for the
Collaborative Cross (see, for example, Broman 2012).
sim_4way_pedigree
generates a pedigree matrix for an
intercross among four inbred lines. The arguments are ngen
(which must be 1 or 2) and nsibs
. We start with four inbred
individuals, and cross them in two pairs to generate a pair of
heterozygous individuals. If ngen==1
, we then just generate
a set of sum(nsibs)
F1 offspring. If
ngen==2
, we generate length(nsibs)
pairs of
F1s and intercross them to generate a set of F2
sibships; in this case, the input vector nsibs
determines
the sizes of the sibships.
The following generates two F2 sibships with 3 offspring in each.
## id mom dad sex gen
## 1 1 0 0 0 0
## 2 2 0 0 1 0
## 3 3 0 0 0 0
## 4 4 0 0 1 0
## 5 5 1 2 0 1
## 6 6 3 4 1 1
## 7 7 5 6 0 2
## 8 8 5 6 1 2
## 9 9 5 6 0 2
## 10 10 5 6 1 2
## 11 11 7 8 0 3
## 12 12 7 8 1 3
## 13 13 7 8 0 3
## 14 14 9 10 1 3
## 15 15 9 10 0 3
## 16 16 9 10 1 3
Advanced intercross lines (AIL) are generated by crossing two inbred lines to form the F1 hybrid, intercrossing to form the F2 generation, and then performing repeated intercrosses with some large set of breeding pairs. At each generation, the mating pairs are chosen at random, often with an effort to avoid matings between sibling pairs. I would prefer the term “advanced intercross populations,” as it’s a set of heterozygous, genetically distinct individuals; they aren’t really lines.
sim_ail_pedigree
generates a pedigree matrix for 2-way
advanced intercross lines. Unlike sim_ril_pedigree
and
sim_4way_pedigree
, this is actually a simulation, as the
mating pairs at each generation are chosen at random. The arguments are
ngen
(number of generations), npairs
(number
of mating pairs at each generation), nkids_per
(for number
of kids per sibship in the last generation), and design
("nosib"
to avoid matings between siblings, or
"random"
to form the matings completely at random). At each
generation, each mating pair gives two offspring (one male and one
female) for the next generation. At the last generation, there are
nkids_per
offspring per mating pair, to give a total of
npairs*nkids_per
at the last generation.
Here’s an example of the use of sim_ail_pedigree
:
## [1] 2504
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 2 2 200 200 200 200 200 200 200 200 200 200 500
With 100 breeding pairs and 5 kids per pair in last generation, we have 200 individuals for most generations and 500 at the end.
The Diversity Outbred population (DO; Svenson et al. 2013) is like AIL, but starting with eight inbred lines rather than two. Actually, the mouse DO started with partially-inbred individuals from Collaborative Cross lines (the so-called preCC; intermediate generations in the development of eight-way RIL). Heterogeneous stock (HS; Mott and Flint 2002) can be viewed as a special case, but starting with the eight founder lines.
sim_do_pedigree
generates a pedigree matrix for a DO
population. The arguments are just like those of
sim_ail_pedigree
, but with one addition:
ccgen
, which is a vector with the numbers of generations of
inbreeding to form the initial preCC lines that are then used to
initiate the DO. (The default for ccgen
is taken from
Figure 1 of Svenson et al. 2013.) Use ccgen=0
to simulate a
pedigree for HS. The length ccgen
should be
npairs
, the number of breeding pairs; we take two
individuals (one female and one male) from each preCC line to begin the
outcrossing generations.
## [1] 6772
## gen
## do 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## FALSE 16 576 288 288 288 288 288 288 246 118 70 50 40 22 12 6
## TRUE 0 288 288 288 288 288 288 288 288 288 288 288 720 0 0 0
The output contains the usual id
, mom
,
dad
, and sex
, plus gen
(for
generation number) and do
(1 indicates part of the DO
population, 0 indicates part of the earlier generations). As you can see
from the table above, the generation (gen
) has a different
meaning, according to whether do
is 0 or 1. When
do==0
, it is the number of generations following the
initial founder lines; when do==1
, it is the generation
number of the outbreeding DO population.
Also note that we start with sixteen lines rather than 8: to properly handle the X chromosome, we need to consider a male and female from each of the eight founder lines. These are numbered 1–8 for the females, and 9‐16 for the males. Also note that the preCC lines are formed from a cross among the eight founders, with the order of the crosses chosen at random (four females and four males, one from each of the eight founder lines).
The point of the construction of a pedigree matrix for a cross
design, as in the previous section, was in order to simulate genotype
data for the cross design. The R/simcross function for this is
sim_from_pedigree
. Its arguments are pedigree
(the pedigree matrix), L
(length of chromosome),
xchr
(TRUE
or FALSE
, according to
whether to simulate the X chromosome or an autosome; default
FALSE
), and then the parameters governing the crossing over
process: m
, p
, and
obligate_chiasma
, with defaults m=10
,
p=0
, and obligate_chiasma=FALSE
.
(L
can also be a vector of chromosome lengths, for
simulating multiple chromosomes at once. In this case xchr
should be either a logical vector, of the same length as L
,
or a character string with the name of the chromosome in L
that correspond to the X chromosome. An example of simulating multiple
chromosomes is in the final section in this vignette.)
The sim_from_pedigree
function calls
create_parent
for all founding individuals and
cross
for any offspring; the output is a list with each
component corresponding to one individual and having the form output by
cross
: a list with mat
and pat
,
for the maternal and paternal chromosomes, respectively, each of which
has alleles
(allele present in each interval) and
locations
(location of right endpoint of each
interval).
Here’s an example, for simulating an AIL to generation 8.
ailped <- sim_ail_pedigree(ngen=8, npairs=30, nkids_per=5)
xodat <- sim_from_pedigree(ailped, L=100)
Here’s the result for the last individual:
## $mat
## $mat$alleles
## [1] 2 1 2
##
## $mat$locations
## [1] 19.20702 43.17502 100.00000
##
##
## $pat
## $pat$alleles
## [1] 1 2 1 2
##
## $pat$locations
## [1] 17.78088 35.51860 71.15012 100.00000
We can plot the average number of breakpoints across the individuals’ two chromosomes, by generation, as follows.
n_breakpoints <- sapply(xodat, function(a) sum(sapply(a, function(b) length(b$alleles)-1)))
ave_breakpoints <- tapply(n_breakpoints, ailped$gen, mean)
gen <- as.numeric(names(ave_breakpoints))
plot(gen, ave_breakpoints,
xlab="Generation", ylab="Average no. breakpoints", las=1,
pch=21, bg="Orchid", main="AIL with 30 breeding pairs")
The function where_het
will show the regions where an
individual is heterozygous.
## left right
## 1 0.00000 17.78088
## 2 19.20702 35.51860
## 3 43.17502 71.15012
We can plot the average proportion of the chromosome that is heterozygous, by generation, as follows.
prop_het <- sapply(lapply(xodat, where_het), function(a) sum(a[,2]-a[,1])/100)
ave_prop_het <- tapply(prop_het, ailped$gen, mean)
plot(gen, ave_prop_het,
xlab="Generation", ylab="Average proportion heterozygous", las=1,
pch=21, bg="Orchid", main="AIL with 30 breeding pairs")
abline(h=0.5, lty=2)
Note: if we’d greatly restricted the number of breeding pairs per generation, we’d see evidence of inbreeding, as a reduced proportion of heterozygosity. There’s considerably more noise, though, since we’ve got just 6 individuals per generation.
ailped2 <- sim_ail_pedigree(ngen=8, npairs=3, nkids_per=50)
xodat2 <- sim_from_pedigree(ailped2, L=100)
prop_het2 <- sapply(lapply(xodat2, where_het), function(a) sum(a[,2]-a[,1])/100)
ave_prop_het2 <- tapply(prop_het2, ailped2$gen, mean)
gen2 <- as.numeric(names(ave_prop_het2))
plot(gen2, ave_prop_het2,
xlab="Generation", ylab="Average proportion heterozygous", las=1,
pch=21, bg="Orchid", main="AIL with 3 breeding pairs")
abline(h=0.5, lty=2)
R/simcross simulates the locations of crossovers as continuous values in the interval (0,L). This is precise and compact, and it allows detailed study of the breakpoint process, but it can be cumbersome to work with and is often not what you want from simulations. In most cases, one is interested in individuals’ genotypes at a set of markers.
There are two functions for getting marker genotypes on the basis of
the the detailed crossover location data: get_geno
and
convert2geno
.
The function get_geno
will grab the genotype at a
specified location on the chromosome, returning a matrix with two
columns: the maternal and paternal alleles for each individual.
Continuing with the simulation in the previous section (an AIL with 30
breeding pairs), here’s how to grab the genotype at 30 cM:
## mat pat
## 509 1 2
## 510 1 1
## 511 1 1
## 512 1 2
## 513 1 2
## 514 1 2
The get_geno
function could be useful, for example, for
grabbing QTL genotypes for use in constructing a phenotype.
The convert2geno
function takes the detailed crossover
information plus a vector of marker locations and returns a matrix with
marker genotypes.
First, construct a vector with the marker locations.
## m0 m10 m20 m30 m40 m50 m60 m70 m80 m90 m100
## 0 10 20 30 40 50 60 70 80 90 100
Then, pass the crossover information and map to
convert2geno
. I print the data for the last five
individuals.
## m0 m10 m20 m30 m40 m50 m60 m70 m80 m90 m100
## 510 1 2 2 1 1 2 2 2 2 2 2
## 511 2 3 2 1 1 2 2 1 2 2 2
## 512 2 1 2 2 1 2 2 3 2 2 2
## 513 2 2 2 2 1 2 2 3 3 3 3
## 514 2 2 2 2 1 2 2 2 3 3 3
Here the genotypes get recoded as 1
/ 2
/
3
for 11
/ 12
/ 22
.
That is, genotypes 1
and 3
are the homozygotes
for the allele from founders 1 and 2, respectively, and genotype
2
is the heterozygote.
For crosses with more than two founders, the output of
convert2geno
is a three-dimensional array, individuals ×
markers × alleles (maternal and paternal). For example, here are
genotypes for the sixth generations of inbreeding of an eight-way
RIL.
rilped <- sim_ril_pedigree(ngen=6, parents=1:8)
dat <- sim_from_pedigree(rilped, L=100)
geno <- convert2geno(dat, map)
geno[nrow(geno)-(1:0),,]
## , , mat
##
## m0 m10 m20 m30 m40 m50 m60 m70 m80 m90 m100
## 27 7 5 5 5 5 3 3 3 3 3 7
## 28 7 5 5 5 5 3 3 5 5 6 7
##
## , , pat
##
## m0 m10 m20 m30 m40 m50 m60 m70 m80 m90 m100
## 27 7 5 5 5 5 3 3 3 3 3 7
## 28 7 5 5 5 5 3 3 3 3 3 7
More commonly, one may be interested in individuals’ SNP genotypes.
This can also be obtained with convert2geno
; one just needs
to provide a matrix of SNP alleles for the founder lines, with the
argument founder_geno
. This should be a matrix of
1
s and 2
s, of dimension
n_founders
× n_markers
.
Let’s simulate SNP alleles for eight founder lines.
And then here are the SNP genotypes for the sixth generation of inbreeding an eight-way RIL.
## m0 m10 m20 m30 m40 m50 m60 m70 m80 m90 m100
## 27 3 3 1 1 1 3 3 3 3 1 3
## 28 3 3 1 1 1 3 3 2 2 1 3
The genotypes are again coded as 1
/ 2
/
3
for 11
/ 12
/
22
.
One can use sim_from_pedigree
and
convert2geno
to simulate multiple chromosomes at once.
To simulate multiple chromosomes with sim_from_pedigree
,
provide a vector of chromosome lengths. In this case xchr
should be either a logical vector, of the same length as L
,
or a character string with the name of the chromosome in L
that correspond to the X chromosome.
ailped3 <- sim_ail_pedigree(ngen=12, npairs=30, nkids_per=3)
xodat3 <- sim_from_pedigree(ailped3, c("1"=100, "2"=75, "X"=100), "X")
xodat3alt <- sim_from_pedigree(ailped3, c("1"=100, "2"=75, "X"=100),
c(FALSE, FALSE, TRUE))
To indicate that all chromosomes are autosomes, you can use
xchr=""
, xchr=FALSE
, or
xchr=NULL
. For example:
Having simulated multiple chromosomes with
sim_from_pedigree
, you may wish to use
convert2geno
to convert those results to marker genotypes.
This is done by providing the output of sim_from_pedigree
as well as a genetic marker map that is a list of vectors of
marker locations.
Let’s first construct the marker map, assuming three chromosomes with lengths 100, 75, and 100.
map <- list("1"=seq(0, 100, by=5),
"2"=seq(0, 75, by=5),
"X"=seq(0, 100, by=5))
for(i in seq(along=map))
names(map[[i]]) <- paste0("m", names(map)[i], "_", map[[i]])
We then use convert2geno
, ensuring that the inputs are
both lists with the same length.
For crosses like the DO, in which founder genotypes are needed, the
input founder_geno
must be a list of matrices (one matrix
per chromosome).
Here are some simulated founder genotypes:
fg <- vector("list", length(map))
for(i in seq(along=map))
fg[[i]] <- matrix(sample(1:2, 8*length(map[[i]]), replace=TRUE), nrow=8)
And now here is the simulation of a small DO population for multiple
chromosomes. When simulating the pedigree, we need to have 16 founders
(the 8 female founders and then the 8 male founders). After simulating
the genotypes with sim_from_pedigree()
, we use
collapse_do_alleles()
to collapse the alleles 9-16 (for the
male founders) into 1-8.
Broman KW (2012) Genotype probabilities at intermediate generations in the construction of recombinant inbred lines. Genetics 190:403-412 doi: 10.1534/genetics.111.132647
Broman KW, Rowe LB, Churchill GA, Paigen K (2002) Crossover interference in the mouse. Genetics 160:1123-1131 doi: 10.1093/genetics/160.3.1123
Copenhaver GP, Housworth EA, Stahl FW (2002) Crossover interference in arabidopsis. Genetics 160:1631-1639 doi: 10.1093/genetics/160.4.1631
Foss E, Lande R, Stahl FW, Steinberg CM (1993) Chiasma interference as a function of genetic distance. Genetics 133:681-691 doi: 10.1093/genetics/92.2.573
Kover PX, Valdar W, Trakalo J, Scarcelli N, Ehrenreich IM, Purugganan MD, Durrant C, Mott R (2009) A Multiparent Advanced Generation Inter-Cross to fine-map quantitative traits in Arabidopsis thaliana. PLoS Genetics 5:e1000551 doi: 10.1371/journal.pgen.1000551
McPeek MS, Speed TP (1995) Modeling interference in genetic recombination. Genetics 139:1031-1044 doi: 10.1093/genetics/139.2.1031
Mott R, Flint J (2002) Simultaneous detection and fine mapping of quantitative trait loci in mice using heterogeneous stocks. Genetics 160:1609-1618 doi: 10.1093/genetics/160.4.1609
Svenson KL, Gatti DM, Valdar W, Welsh CE, Cheng R, Chesler EJ, Palmer AA, McMillan L, Churchill GA (2012) High-resolution genetic mapping using the mouse Diversity Outbred population.. Genetics 190:437-447 doi: 10.1534/genetics.111.132597
Threadgill DW, Churchill GA (2012) Ten years of the Collaborative Cross. Genetics 190:291-294 doi: 10.1534/genetics.111.138032
Zhao H, Speed TP, McPeek MS (1995) Statistical analysis of crossover interference using the chi-square model. Genetics 139:1045-1056 doi: 10.1534/genetics.111.138032