% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/individualQC.R
\name{check_relatedness}
\alias{check_relatedness}
\title{Identification of related individuals}
\usage{
check_relatedness(
  indir,
  name,
  qcdir = indir,
  highIBDTh = 0.1875,
  filter_high_ldregion = TRUE,
  high_ldregion_file = NULL,
  genomebuild = "hg19",
  imissTh = 0.03,
  run.check_relatedness = TRUE,
  interactive = FALSE,
  verbose = FALSE,
  mafThRelatedness = 0.1,
  path2plink = NULL,
  keep_individuals = NULL,
  remove_individuals = NULL,
  exclude_markers = NULL,
  extract_markers = NULL,
  legend_text_size = 5,
  legend_title_size = 7,
  axis_text_size = 5,
  axis_title_size = 7,
  title_size = 9,
  showPlinkOutput = TRUE
)
}
\arguments{
\item{indir}{[character] /path/to/directory containing the basic PLINK data
files name.bim, name.bed and name.fam.}

\item{name}{[character] Prefix of PLINK files, i.e. name.bed, name.bim,
name.fam, name.genome and name.imiss.}

\item{qcdir}{[character] /path/to/directory where name.genome as returned
by plink --genome will be saved. Per default qcdir=indir. If
run.check_relatedness is FALSE, it is assumed that plink
--missing and plink --genome have been run and qcdir/name.imiss and
qcdir/name.genome exist. The user needs write permission for qcdir.}

\item{highIBDTh}{[double] Threshold for acceptable proportion of IBD between
a pair of individuals.}

\item{filter_high_ldregion}{[logical] Should high LD regions be filtered
before IBD estimation? Carried out per default, with the high LD regions for
hg19 selected as default via \code{genomebuild}. For genome builds that are
not provided or for non-human data, a file with high LD regions can be
provided via \code{high_ldregion_file}.}

\item{high_ldregion_file}{[character] Path to file with high LD regions used
for filtering before IBD estimation if \code{filter_high_ldregion} is TRUE,
otherwise ignored; for human genome data, high LD region files are provided
and can simply be chosen via \code{genomebuild}. Files have to be
space-delimited without column names and contain the following columns:
chromosome, region-start, region-end, region number. Chromosomes are
specified without 'chr' prefix. For instance:
1 48000000 52000000 1
2 86000000 100500000 2}

\item{genomebuild}{[character] Name of the genome build of the PLINK file
annotations, i.e. mappings in the name.bim file. Will be used to remove
high-LD regions based on the coordinates of the respective build. Options
are hg18, hg19 and hg38.}

\item{imissTh}{[double] Threshold for acceptable missing genotype rate in any
individual; has to be a proportion in (0,1).}

\item{run.check_relatedness}{[logical] Should plink --genome be run to
determine pairwise IBD of individuals; if FALSE, it is assumed that
plink --genome and plink --missing have been run and qcdir/name.imiss and
qcdir/name.genome are present;
\code{\link{check_relatedness}} will fail with missing file error otherwise.}

\item{interactive}{[logical] Should plots be shown interactively? When
choosing this option, make sure you have X-forwarding/graphical interface
available for interactive plotting. Alternatively, set interactive=FALSE and
save the returned plot object (p_IBD) via ggplot2::ggsave(plot=p_IBD,
other_arguments) or pdf(outfile); print(p_IBD); dev.off().}

\item{verbose}{[logical] If TRUE, progress info is printed to standard out.}

\item{mafThRelatedness}{[double] Threshold of minor allele frequency filter
for selecting variants for IBD estimation.}

\item{path2plink}{[character] Absolute path to PLINK executable
(\url{https://www.cog-genomics.org/plink/1.9/}) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, it is assumed that the PATH
set-up works and PLINK will be found by \code{\link[sys]{exec}}('plink').}

\item{keep_individuals}{[character] Path to file with individuals to be
retained in the analysis. The file has to be a space/tab-delimited text file
with family IDs in the first column and within-family IDs in the second
column. All samples not listed in this file will be removed from the current
analysis. See \url{https://www.cog-genomics.org/plink/1.9/filter#indiv}.
Default: NULL, i.e. no filtering on individuals.}

\item{remove_individuals}{[character] Path to file with individuals to be
removed from the analysis. The file has to be a space/tab-delimited text file
with family IDs in the first column and within-family IDs in the second
column. All samples listed in this file will be removed from the current
analysis. See \url{https://www.cog-genomics.org/plink/1.9/filter#indiv}.
Default: NULL, i.e. no filtering on individuals.}

\item{exclude_markers}{[character] Path to file with markers to be
removed from the analysis. The file has to be a text file with a list of
variant IDs (usually one per line, but it's okay for them to just be
separated by spaces). All listed variants will be removed from the current
analysis. See \url{https://www.cog-genomics.org/plink/1.9/filter#snp}.
Default: NULL, i.e. no filtering on markers.}

\item{extract_markers}{[character] Path to file with markers to be
included in the analysis. The file has to be a text file with a list of
variant IDs (usually one per line, but it's okay for them to just be
separated by spaces). All unlisted variants will be removed from the current
analysis. See \url{https://www.cog-genomics.org/plink/1.9/filter#snp}.
Default: NULL, i.e. no filtering on markers.}

\item{legend_text_size}{[integer] Size for legend text.}

\item{legend_title_size}{[integer] Size for legend title.}

\item{axis_text_size}{[integer] Size for axis text.}

\item{axis_title_size}{[integer] Size for axis title.}

\item{title_size}{[integer] Size for plot title.}

\item{showPlinkOutput}{[logical] If TRUE, plink log and error messages are
printed to standard out.}
}
\value{
Named [list] with i) fail_high_IBD containing a [data.frame] of
IIDs and FIDs of individuals who fail the highIBDTh in columns
FID1 and IID1. In addition, the following columns are returned (as originally
obtained by plink --genome):
FID2 (Family ID for second sample), IID2 (Individual ID for second sample),
RT (Relationship type inferred from .fam/.ped file), EZ (IBD sharing expected
value, based on just .fam/.ped relationship), Z0 (P(IBD=0)), Z1 (P(IBD=1)),
Z2 (P(IBD=2)), PI_HAT (Proportion IBD, i.e. P(IBD=2) + 0.5*P(IBD=1)), PHE
(Pairwise phenotypic code (1, 0, -1 = AA, AU, and UU pairs, respectively)),
DST (IBS distance, i.e. (IBS2 + 0.5*IBS1) / (IBS0 + IBS1 + IBS2)), PPC (IBS
binomial test), RATIO (HETHET : IBS0 SNP ratio (expected value 2));
ii) failIDs containing a [data.frame] with individual IDs [IID] and
family IDs [FID] of individuals failing the highIBDTh and iii) p_IBD, a
ggplot2 object containing all pair-wise IBD estimates as histograms
stratified by value of PI_HAT, which can be
shown by print(p_IBD).
}
\description{
Runs and evaluates results from plink --genome.
plink --genome calculates identity by state (IBS) for each pair of
individuals based on the average proportion of alleles shared at genotyped
SNPs. The degree of recent shared ancestry, i.e. the identity by descent
(IBD) can be estimated from the genome-wide IBS. The proportion of IBD
between two individuals is returned by plink --genome as PI_HAT.
check_relatedness finds pairs of samples whose proportion of IBD is larger
than the specified highIBDTh. Subsequently, for pairs of individuals that do
not have additional relatives in the dataset, the individual with the greater
genotype missingness rate is selected and returned as the individual failing
the relatedness check. For more complex family structures, the unrelated
individuals per family are selected (e.g. in a parents-offspring trio, the
offspring will be marked as fail, while the parents will be kept in the
analysis).
\code{check_relatedness} depicts all pair-wise IBD-estimates as histograms
stratified by value of PI_HAT.
}
\details{
\code{\link{check_relatedness}} wraps around
\code{\link{run_check_relatedness}} and
\code{\link{evaluate_check_relatedness}}. If run.check_relatedness is TRUE,
\code{\link{run_check_relatedness}} is executed; otherwise it is assumed that
plink --genome has been run externally and qcdir/name.genome exists.
\code{\link{check_relatedness}} will fail with missing file error otherwise.

For details on the output data.frame fail_high_IBD, check the original
description on the PLINK output format page:
\url{https://www.cog-genomics.org/plink/1.9/formats#genome}.
}
\examples{
\dontrun{
indir <- system.file("extdata", package="plinkQC")
name <- 'data'
path2plink <- "path/to/plink"

# whole dataset
relatednessQC <- check_relatedness(indir=indir, name=name, interactive=FALSE,
run.check_relatedness=FALSE, path2plink=path2plink)
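# A sketch of working with the returned list; the element names follow the
# Value section. The output file name "relatedness_IBD.pdf" is an arbitrary
# choice for this illustration.
head(relatednessQC$fail_high_IBD)
head(relatednessQC$failIDs)
# Save the histograms without an interactive graphics device
ggplot2::ggsave(filename="relatedness_IBD.pdf", plot=relatednessQC$p_IBD)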

# subset of dataset
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")
fail_relatedness <- check_relatedness(indir=indir, name=name,
remove_individuals=remove_individuals_file, path2plink=path2plink)
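
# custom high LD regions file
# A sketch, assuming a user-supplied file is needed (e.g. non-human data).
# The two regions are the ones shown in the high_ldregion_file argument
# description; the temporary file location is an arbitrary choice.
# Format: space-delimited, no column names, chromosomes without 'chr' prefix.
highld_file <- tempfile(fileext=".txt")
writeLines(c("1 48000000 52000000 1", "2 86000000 100500000 2"),
con=highld_file)
relatedness_customLD <- check_relatedness(indir=indir, name=name,
filter_high_ldregion=TRUE, high_ldregion_file=highld_file,
path2plink=path2plink)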
}
}
