Introduction to canprot

The canprot package calculates chemical metrics of proteins from amino acid compositions. This vignette was compiled on 2024-03-28 with canprot version 2.0.0.

Reading FASTA files

KHAB17.fasta was obtained from the Supplemental Information of Kacar et al. (2017) and is provided in the extdata/fasta directory of canprot. Use read_fasta() to read the file; it returns a data frame of amino acid compositions with one row per sequence.

library(canprot)
fasta_file <- system.file("extdata/fasta/KHAB17.fasta", package = "canprot")
aa <- read_fasta(fasta_file)

## read_fasta: reading KHAB17.fasta ... 57 lines ... 6 sequences

The result is small enough that we can look at it here. The data frame has four columns for identifying information (protein, organism, ref, and abbrv); the first two are filled by read_fasta(). The chains column is the number of polypeptide chains.

aa

##        protein organism ref abbrv chains Ala Cys Asp Glu Phe Gly His Ile Lys
## 1 Anc_I/II/III   KHAB17  NA    NA      1  42   4  26  46  14  42  15  30  28
## 2    Anc_I/III   KHAB17  NA    NA      1  42   2  26  45  14  41  16  30  29
## 3   Anc_I/III'   KHAB17  NA    NA      1  42   1  25  44  15  42  14  32  30
## 4        Anc_I   KHAB17  NA    NA      1  52   1  28  32  18  40   9  18  28
## 5     Anc_IA/B   KHAB17  NA    NA      1  48   7  28  33  23  43  14  20  24
## 6       Anc_IB   KHAB17  NA    NA      1  46   7  27  31  23  44  15  20  24
##   Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
## 1  42  10   8  21   4  22  13  19  29   5  13
## 2  39  13   9  22   6  22  15  17  29   5  13
## 3  41  11   8  23   7  22  18  17  29   4  14
## 4  42  12  13  24  13  31  16  29  37  10  17
## 5  43  12  14  21  11  30  17  32  30  10  15
## 6  44  12  15  22  13  29  16  31  28   9  16

None of the first five columns is necessary for the calculation of chemical metrics. They are provided for compatibility with CHNOSZ, and if you don’t plan to use CHNOSZ, you can put any information here that you want, including NA values, or remove the columns completely. The columns that do matter for the calculation of chemical metrics are the last 20 columns, which are named with the 3-letter abbreviations of the amino acids.
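To make the column layout concrete, here is a minimal base R sketch that counts the residues in a single sequence and stores them in a one-row data frame holding only the 20 amino acid columns. The sequence toy_seq and the object names are hypothetical, chosen for illustration; this is not how read_fasta() is implemented.

```r
# One-letter codes in the same order as the amino acid columns shown above
one_letter <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
                "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
three_letter <- c("Ala", "Cys", "Asp", "Glu", "Phe", "Gly", "His", "Ile",
                  "Lys", "Leu", "Met", "Asn", "Pro", "Gln", "Arg", "Ser",
                  "Thr", "Val", "Trp", "Tyr")

# Hypothetical toy sequence (not taken from KHAB17.fasta)
toy_seq <- "MKVLAAGIVGLEK"

# Split into single residues and count them;
# factor() keeps a zero count for amino acids that are absent
residues <- strsplit(toy_seq, "")[[1]]
counts <- table(factor(residues, levels = one_letter))

# One-row data frame with only the 20 amino acid columns
aa_toy <- as.data.frame(t(as.vector(counts)))
names(aa_toy) <- three_letter
aa_toy
```

Going by the description above, a data frame with just these 20 columns should be enough for the chemical metrics functions, although the identifying columns are still needed for CHNOSZ.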

These particular sequences are ancestral sequences of Rubisco. A geochemical biology hypothesis is that proteins are oxidized when the environment is oxidizing. Is this what happens? Plot the carbon oxidation state (Z_C) to find out. Note the use of pre-formatted plot labels for chemical metrics available in cplab.

xlab <- "Ancestral sequences (older to younger)"
plot(Zc(aa), type = "b", xaxt = "n", xlab = xlab, ylab = cplab$Zc)
# Strip the "Anc_" prefix to get short axis labels
anc_names <- gsub(".*_", "", aa$protein)
axis(1, at = 1:6, labels = anc_names)
# Mark the proposed timing of the GOE
abline(v = 3.5, lty = 2, col = 2)
axis(3, at = 3.5, labels = "GOE (proposed)")

The vertical line denotes the proposed timing of the Great Oxidation Event (GOE) between Anc. I/III and Anc. I (Kacar et al., 2017). This analysis shows that the reconstructed ancestral Rubiscos become more oxidized around the proposed timing of the GOE.

Human proteins in canprot

canprot has a database of amino acid compositions of human proteins assembled from UniProt. Use human_aa() to get the amino acid composition. This example is for alanine aminotransferase, which has a UniProt ID of P24298:

(aa <- human_aa("P24298"))

##              protein organism  ref abbrv chains Ala Cys Asp Glu Phe Gly His Ile
## 5457 sp|P24298|ALAT1    HUMAN <NA>  <NA>      1  51  10  20  34  21  38   9  18
##      Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
## 5457  16  55  12  12  33  30  37  25  18  41   1  15

Zc(aa)

##       5457 
## -0.1482091
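To see where this number comes from, the following base R sketch recomputes the carbon oxidation state from the amino acid counts printed above. It assumes the usual definition for molecules made of C, H, N, O, and S, namely Z_C = (-H + 3N + 2O + 2S) / C, and uses the elemental formulas of the free amino acids. It is an independent illustration, not the code that Zc() runs internally; the object names are made up for this example.

```r
# Counts for P24298, copied from the human_aa() output above
counts <- c(Ala = 51, Cys = 10, Asp = 20, Glu = 34, Phe = 21, Gly = 38,
            His = 9, Ile = 18, Lys = 16, Leu = 55, Met = 12, Asn = 12,
            Pro = 33, Gln = 30, Arg = 37, Ser = 25, Thr = 18, Val = 41,
            Trp = 1, Tyr = 15)

# Elemental formulas of the free amino acids (C, H, N, O, S),
# with rows in the same order as counts
elements <- matrix(c(
   3,  7, 1, 2, 0,  # Ala
   3,  7, 1, 2, 1,  # Cys
   4,  7, 1, 4, 0,  # Asp
   5,  9, 1, 4, 0,  # Glu
   9, 11, 1, 2, 0,  # Phe
   2,  5, 1, 2, 0,  # Gly
   6,  9, 3, 2, 0,  # His
   6, 13, 1, 2, 0,  # Ile
   6, 14, 2, 2, 0,  # Lys
   6, 13, 1, 2, 0,  # Leu
   5, 11, 1, 2, 1,  # Met
   4,  8, 2, 3, 0,  # Asn
   5,  9, 1, 2, 0,  # Pro
   5, 10, 2, 3, 0,  # Gln
   6, 14, 4, 2, 0,  # Arg
   3,  7, 1, 3, 0,  # Ser
   4,  9, 1, 3, 0,  # Thr
   5, 11, 1, 2, 0,  # Val
  11, 12, 2, 2, 0,  # Trp
   9, 11, 1, 3, 0   # Tyr
), ncol = 5, byrow = TRUE,
dimnames = list(names(counts), c("C", "H", "N", "O", "S")))

# Multiply each row by its count and sum the atoms of each element
total <- colSums(elements * counts)

# Carbon oxidation state from the summed elemental composition
zc <- unname((-total["H"] + 3 * total["N"] + 2 * total["O"] +
              2 * total["S"]) / total["C"])
round(zc, 7)
# -0.1482091
```

The water released in forming peptide bonds is ignored here because it does not affect the result: each H2O removed changes the numerator by +2 (two fewer H) and -2 (one fewer O), and leaves the carbon count unchanged.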

Do you have a list of UniProt IDs for a differential expression dataset? Great! We can use those to calculate chemical metrics and make a boxplot. The IDs in this example come from Figure 5 of Doron et al. (2020), where they were identified as differentially regulated proteins in aggregate (3D) cell culture compared to monolayer (2D) culture.

up <-   c("Q92743", "P43490", "P52895", "P98160", "P23142", "P17301",
"U3KQK0", "Q15582", "Q9HCJ1", "P36222", "P27701", "Q08380", "P08572",
"P00734", "P22413", "O43657", "P35625", "O75348", "P02649", "P13861",
"P10620", "Q9H3N1", "A8K878", "P13611", "P07305", "E7ESP4", "Q9Y625",
"Q5ZPR3", "P62266", "Q96AQ6", "Q8N357", "Q13217", "Q9Y230", "Q9Y639",
"Q86W92", "C9JF17", "Q96PK6", "O95671", "P01033", "Q13501", "P69905",
"Q9Y5X1", "P50281", "Q9UBG0", "O60831", "P02751", "O43854", "P61803",
"J3KN66", "P42765", "P36543", "P15121", "Q16563", "Q12884", "P27695",
"P12110", "P07686", "Q92598", "Q02818", "Q07954", "O60493", "P40939",
"Q9Y3I0", "P51149", "P46776", "P46778", "P62805")

down <- c("J3KN67", "Q9Y490", "J3KNQ4", "E7EVA0", "Q01082", "J3KQ32",
"P54136", "Q9Y696", "Q01995", "Q15404", "P62714", "Q09666", "P07814",
"E7EQR4", "P46821", "O75369", "P02452", "P08123", "P54577", "P01023",
"Q6ZN40", "P42224", "B4DUT8", "Q13443", "Q9HCE1", "Q6DKJ4", "P50552",
"P35222", "P20908", "Q15417", "O75822", "P17812", "P05997", "P04080",
"O43294", "P08243", "P02458")

With those UniProt IDs for human proteins we can retrieve the amino acid compositions, then calculate a couple of chemical metrics and make some boxplots comparing the groups of differentially regulated proteins.

aa_down <- human_aa(down)
aa_up <- human_aa(up)
bp_names <- paste0(c("Down (", "Up ("), c(nrow(aa_down), nrow(aa_up)), c(")", ")"))

par(mfrow = c(1, 2))

Zclist <- list(Zc(aa_down), Zc(aa_up))
names(Zclist) <- bp_names
boxplot(Zclist, ylab = cplab$Zc, col = c(4, 2))
names(Zclist) <- c("x", "y")
p <- do.call(wilcox.test, Zclist)$p.value
legend("bottomleft", paste("p =", round(p, 3)), bty = "n")
title("Carbon oxidation state", font.main = 1)

nH2Olist <- list(nH2O(aa_down), nH2O(aa_up))
names(nH2Olist) <- bp_names
boxplot(nH2Olist, ylab = cplab$nH2O, col = c(4, 2))
names(nH2Olist) <- c("x", "y")
p <- do.call(wilcox.test, nH2Olist)$p.value
legend("bottomleft", paste("p =", round(p, 3)), bty = "n")
title("Stoichiometric hydration state", font.main = 1)

We find no significant difference in Z_C between the up- and down-regulated proteins. In contrast, n_H₂O is significantly lower for up-regulated than for down-regulated proteins.

A similar dehydration trend characterizes most datasets for 3D cell culture (Dick, 2021). The differential expression datasets analyzed in that paper, which were previously in canprot, have been moved to JMDplots.
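The p-values in the boxplot legends come from a Wilcoxon rank-sum test applied to a two-element list. The base R sketch below shows why the code above renames the list elements to x and y before calling do.call(): the list names are passed on as argument names, and wilcox.test() expects x and y. The numbers are made up for illustration only and are not taken from any dataset.

```r
# Made-up values for illustration only (not from the datasets above)
groups <- list(c(1.1, 2.3, 3.0, 4.2), c(5.5, 6.1, 7.4, 8.0, 9.3))

# Names like these are useful as boxplot labels ...
names(groups) <- c("Down (4)", "Up (5)")

# ... but do.call() would pass them as argument names,
# so rename the elements to the x and y arguments of wilcox.test()
names(groups) <- c("x", "y")
p_docall <- do.call(wilcox.test, groups)$p.value

# The same test written out explicitly
p_direct <- wilcox.test(groups$x, groups$y)$p.value

all.equal(p_docall, p_direct)
```

Both calls run the same two-sided test, so the choice between them is a matter of convenience when the groups are already stored in a list.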
