If an analysis starts with an input matrix, it has to be appropriately pre-proceed before it is used any functions of CB2
. CB2
allows two different types of input: a numeric matrix/data frame with row.names
and a data.frame contains columns of counts and columns of sgRNA IDs and target genes. Either of them will work. This document explains how the input should be formed and how to process the input using CB2
. In the entire document, [@evers2016crispr]'s CRISPR-RT112 screen data are used.
The following code imports required packages which are required to run below codes.
library(CB2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
The following code block shows an example of the first type of input which CB2
can handle. Each column of Evers_CRISPRn_RT112$count
contains counts of guide RNAs of a sample (that was initially extracted from NGS data). A count of the input shows that how many guide RNA barcodes were observed from a given NGS sample. Each row of the matrix has a row name (e.g., RPS19_sg10
), and the name is the ID of a guide RNA. For example, RPS19_sg10
, which is the first-row name in the example, indicates that the first row contains the counts of RPS_sg10
guide RNA. Every guide RNA ID must have exactly one _
character, and it is used to be a separator of two strings. The first string displays the name of a gene whose gene is targeted by the guide RNA, and the second string is used as an identifier among guide RNAs that targets the same gene. For example, RPS_sg10
indicates that the guide RNA is designed to target the RPS
gene, and sg10
is the unique identifier.
NOTE : If the input contains multiple _
characters, CB2
is not able to run. In particular, if Entrez gene IDs are used as the gene names, CB2
does not handle the input. One of the solutions for this case is changing the gene names to another identifier (e.g., HGNC symbol) or using another type of input, which will explain below.
data("Evers_CRISPRn_RT112")
head(Evers_CRISPRn_RT112$count)
## B1 B2 B3 A1 A2 A3
## RPS19_sg10 10180 9768 9406 1005 1186 911
## NUP93_sg4 9073 8598 8363 688 479 581
## PSMD11_sg2 9408 9573 9384 1014 1196 797
## RPS3A_sg3 8922 8965 8779 1266 1303 1176
## PSMB3_sg3 7434 6958 6871 450 628 398
## POLR2A_sg5 7672 7537 7259 647 786 683
In addition, CB2
requires experiment design information which is formed as a data.frame and contains sample names and groups of each sample. In Evers_CRISPRn_RT112
data, Evers_CRISPRn_RT112$design
is the data.frame.
Evers_CRISPRn_RT112$design
## # A tibble: 6 x 2
## group sample_name
## <chr> <chr>
## 1 before B1
## 2 before B2
## 3 before B3
## 4 after A1
## 5 after A2
## 6 after A3
With the two variables, CB2
can perform the hypothesis test with measure_sgrna_stats
and measure_gene_stats
functions.
sgrna_stats <- measure_sgrna_stats(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design, "before", "after")
gene_stats <- measure_gene_stats(sgrna_stats)
head(gene_stats)
## # A tibble: 6 x 11
## gene n_sgrna cpm_a cpm_b logFC p_ts p_pa p_pb fdr_ts fdr_pa
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABCG8 10 1101. 1540. 0.481 1.03e-18 1.00e+ 0 2.41e-21 1.56e-18 1.00e+ 0
## 2 ADH7 10 799. 1173. 0.522 4.88e-20 1.00e+ 0 1.10e-22 8.55e-20 1.00e+ 0
## 3 CABP5 10 1171. 1662. 0.493 2.10e-21 1.00e+ 0 4.55e-24 4.64e-21 1.00e+ 0
## 4 COPA 10 860. 413. -1.51 4.41e-33 5.48e-33 8.12e- 1 1.37e-31 1.70e-31
## 5 COPB1 10 1054. 498. -1.57 3.29e-25 3.04e-24 5.31e- 1 1.26e-24 1.29e-23
## 6 COPS2 10 1055. 842. -0.728 6.56e-19 4.85e-12 1.43e- 4 1.03e-18 1.13e-11
## # … with 1 more variable: fdr_pb <dbl>
Another input type is a data.frame that contains two additional columns, which contain the guide RNA information (target gene and guide RNA identifier). A CSV file which was used in the CB2
publication ([@jeong2019beta]). The CSV file contains the additional columns, the first is the gene
column, and the second is the sgRNA
column.
df <- read_csv("https://raw.githubusercontent.com/hyunhwaj/CB2-Experiments/master/01_gene-level-analysis/data/Evers/CRISPRn-RT112.csv")
## Parsed with column specification:
## cols(
## gene = col_character(),
## sgRNA = col_character(),
## B1 = col_double(),
## B2 = col_double(),
## B3 = col_double(),
## A1 = col_double(),
## A2 = col_double(),
## A3 = col_double()
## )
df
## # A tibble: 961 x 8
## gene sgRNA B1 B2 B3 A1 A2 A3
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RPS19 RPS19-10 10180 9768 9406 1005 1186 911
## 2 NUP93 NUP93-4 9073 8598 8363 688 479 581
## 3 PSMD11 PSMD11-2 9408 9573 9384 1014 1196 797
## 4 RPS3A RPS3A-3 8922 8965 8779 1266 1303 1176
## 5 PSMB3 PSMB3-3 7434 6958 6871 450 628 398
## 6 POLR2A POLR2A-5 7672 7537 7259 647 786 683
## 7 RPS8 RPS8-8 6999 6966 6775 673 665 474
## 8 RPL36 RPL36-3 5988 5893 5498 266 276 201
## 9 RPS13 RPS13-4 5763 5546 5131 187 191 196
## 10 RPS11 RPS11-2 9444 9621 8835 1900 2213 1705
## # … with 951 more rows
Two additional parameters have to be set to the measure_sgrna_stats
function if an input matrix is this type. The first parameter is ge_id
, which specifies the column of genes, and the second parameter is sg_id
, which specifies the column of IDs. In the following code, gene_id
sets as gene
and sg_id
sets as sgRNA
.
head(measure_sgrna_stats(df, Evers_CRISPRn_RT112$design, "before", "after", ge_id = 'gene', sg_id = 'sgRNA'))
## gene sgRNA n_a n_b phat_a vhat_a phat_b vhat_b
## 1 RPS19 RPS19-10 3 3 0.002323257 5.369646e-10 0.0003257504 4.174644e-10
## 2 NUP93 NUP93-4 3 3 0.002060560 7.828460e-10 0.0001838979 3.300861e-10
## 3 PSMD11 PSMD11-2 3 3 0.002244954 1.772784e-10 0.0003147378 8.295566e-10
## 4 RPS3A RPS3A-3 3 3 0.002110486 1.666823e-10 0.0003935102 4.133227e-11
## 5 PSMB3 PSMB3-3 3 3 0.001683100 8.591075e-10 0.0001547065 3.831946e-10
## 6 POLR2A POLR2A-5 3 3 0.001778234 1.404884e-10 0.0002229180 2.109228e-10
## cpm_a cpm_b logFC t_value df p_ts p_pa
## 1 2323.275 325.7290 -2.830614 -64.65714 3.938262 4.148366e-07 2.074183e-07
## 2 2060.586 183.9145 -3.478824 -56.25377 3.432003 3.271576e-06 1.635788e-06
## 3 2246.609 314.6832 -2.831842 -60.83125 2.817477 1.761248e-05 8.806242e-06
## 4 2111.801 393.9205 -2.419522 -119.04667 2.934424 1.685127e-06 8.425636e-07
## 5 1683.145 154.6872 -3.435294 -43.36323 3.488096 6.833235e-06 3.416617e-06
## 6 1778.610 222.9897 -2.990057 -82.96805 3.845513 2.121616e-07 1.060808e-07
## p_pb fdr_ts fdr_pa fdr_pb
## 1 0.9999998 4.714939e-05 2.357470e-05 0.9999999
## 2 0.9999984 6.164676e-05 3.273760e-05 0.9999999
## 3 0.9999912 1.071240e-04 7.693454e-05 0.9999999
## 4 0.9999992 5.772537e-05 2.886268e-05 0.9999999
## 5 0.9999966 8.042437e-05 4.512440e-05 0.9999999
## 6 0.9999999 4.714939e-05 2.357470e-05 0.9999999