The canprot package calculates chemical metrics of proteins from amino acid compositions. This vignette was compiled on 2024-03-28 with canprot version 2.0.0.
KHAB17.fasta was obtained from the Supplemental Information of Kacar et al. (2017) and is provided in the extdata/fasta directory of canprot. Use read_fasta() to read the file and return a data frame of amino acid composition.
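library(canprot)
# Locate the FASTA file shipped with canprot, then count amino acids in each sequence
fasta_file <- system.file("extdata/fasta/KHAB17.fasta", package = "canprot")
aa <- read_fasta(fasta_file)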
## read_fasta: reading KHAB17.fasta ... 57 lines ... 6 sequences
The result is small enough that we can look at it here. The data frame has four columns for identifying information (protein, organism, ref, and abbrv); the first two are filled by read_fasta(). The chains column is the number of polypeptide chains.
## protein organism ref abbrv chains Ala Cys Asp Glu Phe Gly His Ile Lys
## 1 Anc_I/II/III KHAB17 NA NA 1 42 4 26 46 14 42 15 30 28
## 2 Anc_I/III KHAB17 NA NA 1 42 2 26 45 14 41 16 30 29
## 3 Anc_I/III' KHAB17 NA NA 1 42 1 25 44 15 42 14 32 30
## 4 Anc_I KHAB17 NA NA 1 52 1 28 32 18 40 9 18 28
## 5 Anc_IA/B KHAB17 NA NA 1 48 7 28 33 23 43 14 20 24
## 6 Anc_IB KHAB17 NA NA 1 46 7 27 31 23 44 15 20 24
## Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
## 1 42 10 8 21 4 22 13 19 29 5 13
## 2 39 13 9 22 6 22 15 17 29 5 13
## 3 41 11 8 23 7 22 18 17 29 4 14
## 4 42 12 13 24 13 31 16 29 37 10 17
## 5 43 12 14 21 11 30 17 32 30 10 15
## 6 44 12 15 22 13 29 16 31 28 9 16
None of the first five columns is necessary for the calculation of chemical metrics. They are provided for compatibility with CHNOSZ, and if you don’t plan to use CHNOSZ, you can put any information here that you want, including NA values, or remove the columns completely. The columns that do matter for the calculation of chemical metrics are the last 20 columns, which are named with the 3-letter abbreviations of the amino acids.
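Because only the 20 amino acid columns matter, a composition data frame can also be assembled by hand. The following is a minimal base R sketch, not part of canprot: the helper count_aa() and the toy sequence are hypothetical and serve only to illustrate the column layout shown above.

```r
# One-letter codes and the matching 3-letter column names,
# in the same order as the columns printed above
aa1 <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
         "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
aa3 <- c("Ala", "Cys", "Asp", "Glu", "Phe", "Gly", "His", "Ile", "Lys", "Leu",
         "Met", "Asn", "Pro", "Gln", "Arg", "Ser", "Thr", "Val", "Trp", "Tyr")

# Hypothetical helper (not a canprot function): count the 20 standard
# amino acids in one sequence and return a one-row data frame
count_aa <- function(sequence) {
  residues <- strsplit(toupper(sequence), "")[[1]]
  # factor() with fixed levels keeps zero counts and drops other characters
  counts <- as.integer(table(factor(residues, levels = aa1)))
  names(counts) <- aa3
  as.data.frame(as.list(counts))
}

# Toy sequence for illustration only
toy <- count_aa("MKVLAAGG")
print(toy)
```

Going by the description above, a data frame with just these 20 columns should be enough for calculating chemical metrics, although the identifying columns are still needed if you plan to use CHNOSZ.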
These particular sequences are ancestral sequences of Rubisco. A geochemical biology hypothesis is that proteins are oxidized when the environment is oxidizing. Is this what happens? Plot the carbon oxidation state (ZC) to find out. Note the use of pre-formatted plot labels for chemical metrics available in cplab.
xlab <- "Ancestral sequences (older to younger)"
plot(Zc(aa), type = "b", xaxt = "n", xlab = xlab, ylab = cplab$Zc)
names <- gsub(".*_", "", aa$protein)
axis(1, at = 1:6, names)
abline(v = 3.5, lty = 2, col = 2)
axis(3, at = 3.5, "GOE (proposed)")
The vertical line denotes the proposed timing of the Great Oxidation Event (GOE) between Anc. I/III and Anc. I (Kacar et al., 2017). This analysis shows that the reconstructed ancestral Rubiscos become more oxidized around the proposed timing of the GOE.
canprot has a database of amino acid compositions of human proteins assembled from UniProt. Use human_aa() to get the amino acid composition. This example is for alanine aminotransferase, which has a UniProt ID of P24298:
aa <- human_aa("P24298")
aa
## protein organism ref abbrv chains Ala Cys Asp Glu Phe Gly His Ile
## 5457 sp|P24298|ALAT1 HUMAN <NA> <NA> 1 51 10 20 34 21 38 9 18
## Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
## 5457 16 55 12 12 33 30 37 25 18 41 1 15
Zc(aa)
## 5457
## -0.1482091
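The ZC value printed above can be cross-checked by hand. The sketch below is plain base R and is not canprot's implementation. It assumes the standard elemental formulas of the 20 amino acids and the usual expression for the average oxidation state of carbon in a neutral molecule containing C, H, N, O, and S: ZC = (-nH + 3 nN + 2 nO + 2 nS) / nC. The amino acid counts are copied from the P24298 row above; all variable names are illustrative.

```r
# Elemental formulas (C, H, N, O, S) of the free amino acids
# (assumed standard formulas, not taken from canprot)
formulas <- rbind(
  Ala = c(3, 7, 1, 2, 0),   Cys = c(3, 7, 1, 2, 1),
  Asp = c(4, 7, 1, 4, 0),   Glu = c(5, 9, 1, 4, 0),
  Phe = c(9, 11, 1, 2, 0),  Gly = c(2, 5, 1, 2, 0),
  His = c(6, 9, 3, 2, 0),   Ile = c(6, 13, 1, 2, 0),
  Lys = c(6, 14, 2, 2, 0),  Leu = c(6, 13, 1, 2, 0),
  Met = c(5, 11, 1, 2, 1),  Asn = c(4, 8, 2, 3, 0),
  Pro = c(5, 9, 1, 2, 0),   Gln = c(5, 10, 2, 3, 0),
  Arg = c(6, 14, 4, 2, 0),  Ser = c(3, 7, 1, 3, 0),
  Thr = c(4, 9, 1, 3, 0),   Val = c(5, 11, 1, 2, 0),
  Trp = c(11, 12, 2, 2, 0), Tyr = c(9, 11, 1, 3, 0)
)
colnames(formulas) <- c("C", "H", "N", "O", "S")

# Amino acid counts copied from the P24298 output above
counts <- c(Ala = 51, Cys = 10, Asp = 20, Glu = 34, Phe = 21, Gly = 38,
            His = 9, Ile = 18, Lys = 16, Leu = 55, Met = 12, Asn = 12,
            Pro = 33, Gln = 30, Arg = 37, Ser = 25, Thr = 18, Val = 41,
            Trp = 1, Tyr = 15)

# Sum the elements over all amino acids (each row is scaled by its count),
# then remove one H2O per peptide bond in the single chain
elements <- colSums(formulas[names(counts), ] * counts)
n_bonds <- sum(counts) - 1
elements["H"] <- elements["H"] - 2 * n_bonds
elements["O"] <- elements["O"] - n_bonds

# Average oxidation state of carbon for the neutral protein
Zc_manual <- unname(
  (-elements["H"] + 3 * elements["N"] + 2 * elements["O"] + 2 * elements["S"]) /
    elements["C"]
)
print(round(Zc_manual, 7))
```

Under these assumptions the protein formula has 2429 carbon atoms and the result agrees with the value above to the printed precision. Removing water for the peptide bonds does not change ZC, because each H2O removed raises the -nH term by 2 and lowers the 2 nO term by 2.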
Do you have a list of UniProt IDs for a differential expression dataset? Great! We can use those to calculate chemical metrics and make a boxplot. The IDs in this example come from Figure 5 of Doron et al. (2020), where they were identified as differentially regulated proteins in aggregate (3D) cell culture compared to monolayer (2D) culture.
up <- c("Q92743", "P43490", "P52895", "P98160", "P23142", "P17301",
"U3KQK0", "Q15582", "Q9HCJ1", "P36222", "P27701", "Q08380", "P08572",
"P00734", "P22413", "O43657", "P35625", "O75348", "P02649", "P13861",
"P10620", "Q9H3N1", "A8K878", "P13611", "P07305", "E7ESP4", "Q9Y625",
"Q5ZPR3", "P62266", "Q96AQ6", "Q8N357", "Q13217", "Q9Y230", "Q9Y639",
"Q86W92", "C9JF17", "Q96PK6", "O95671", "P01033", "Q13501", "P69905",
"Q9Y5X1", "P50281", "Q9UBG0", "O60831", "P02751", "O43854", "P61803",
"J3KN66", "P42765", "P36543", "P15121", "Q16563", "Q12884", "P27695",
"P12110", "P07686", "Q92598", "Q02818", "Q07954", "O60493", "P40939",
"Q9Y3I0", "P51149", "P46776", "P46778", "P62805")
down <- c("J3KN67", "Q9Y490", "J3KNQ4", "E7EVA0", "Q01082", "J3KQ32",
"P54136", "Q9Y696", "Q01995", "Q15404", "P62714", "Q09666", "P07814",
"E7EQR4", "P46821", "O75369", "P02452", "P08123", "P54577", "P01023",
"Q6ZN40", "P42224", "B4DUT8", "Q13443", "Q9HCE1", "Q6DKJ4", "P50552",
"P35222", "P20908", "Q15417", "O75822", "P17812", "P05997", "P04080",
"O43294", "P08243", "P02458")
With those UniProt IDs for human proteins we can retrieve the amino acid compositions, then calculate a couple of chemical metrics and make some boxplots comparing the groups of differentially regulated proteins.
aa_down <- human_aa(down)
aa_up <- human_aa(up)
bp_names <- paste0(c("Down (", "Up ("), c(nrow(aa_down), nrow(aa_up)), c(")", ")"))
par(mfrow = c(1, 2))
Zclist <- list(Zc(aa_down), Zc(aa_up))
names(Zclist) <- bp_names
boxplot(Zclist, ylab = cplab$Zc, col = c(4, 2))
names(Zclist) <- c("x", "y")
p <- do.call(wilcox.test, Zclist)$p.value
legend("bottomleft", paste("p =", round(p, 3)), bty = "n")
title("Carbon oxidation state", font.main = 1)
nH2Olist <- list(nH2O(aa_down), nH2O(aa_up))
names(nH2Olist) <- bp_names
boxplot(nH2Olist, ylab = cplab$nH2O, col = c(4, 2))
names(nH2Olist) <- c("x", "y")
p <- do.call(wilcox.test, nH2Olist)$p.value
legend("bottomleft", paste("p =", round(p, 3)), bty = "n")
title("Stoichiometric hydration state", font.main = 1)
We find no significant difference in ZC between the up-regulated and down-regulated proteins. In contrast, nH2O is significantly lower for up-regulated than for down-regulated proteins.
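The code above renames the list elements to x and y before calling do.call(). The reason is that do.call() passes named list elements as named arguments, and the two samples in the default method of wilcox.test() are the arguments x and y; the boxplot labels would not match any argument name. A small sketch with made-up numbers (not data from the study) shows that the renamed list gives the same result as a direct call:

```r
# Made-up values for illustration only
group_a <- c(1.2, 1.5, 1.9, 2.3, 2.8)
group_b <- c(3.1, 3.4, 3.9, 4.2, 4.7)

# Direct call with the two samples as x and y
p_direct <- wilcox.test(group_a, group_b)$p.value

# Same call built from a list whose names match the argument names
samples <- list(x = group_a, y = group_b)
p_docall <- do.call(wilcox.test, samples)$p.value

print(c(direct = p_direct, docall = p_docall))
```

Keeping the values in one list is convenient here because the same object can be given display names for boxplot() and then argument names for the test.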
A similar dehydration trend characterizes most datasets for 3D cell culture (Dick, 2021). The differential expression datasets analyzed in that paper, which were previously in canprot, have been moved to JMDplots.
Dick JM. 2021. Water as a reactant in the differential expression of proteins in cancer. Computational and Systems Oncology 1(1): e1007. doi: 10.1002/cso2.1007
Doron G, Klontzas ME, Mantalaris A, Guldberg RE, Temenoff JS. 2020. Multiomics characterization of mesenchymal stromal cells cultured in monolayer and as aggregates. Biotechnology and Bioengineering 117(6): 1761–1778. doi: 10.1002/bit.27317
Kacar B, Hanson-Smith V, Adam ZR, Boekelheide N. 2017. Constraining the timing of the Great Oxidation Event within the Rubisco phylogenetic tree. Geobiology 15(5): 628–640. doi: 10.1111/gbi.12243