The ampir (short for antimicrobial peptide prediction in R) package was designed to be a fast and user-friendly method to predict antimicrobial peptides (AMPs) from protein datasets of any size. ampir uses a supervised statistical machine learning approach to predict AMPs. It incorporates two support vector machine classification models, “precursor” and “mature”, that have been trained on publicly available antimicrobial peptide data. The default model, “precursor”, is best suited for full-length proteins, and the “mature” model is best suited for small mature proteins (<60 amino acids). ampir also accepts custom (user-trained) models based on the caret package. Please see the ampir “How to train your model” vignette for details.
ampir’s associated paper is published in the Bioinformatics journal as btaa653. Please cite this paper if you use ampir in your research.
ampir is also available via a Shiny-based GUI at https://ampir.marine-omics.net/ where users can submit protein sequences in FASTA format to be classified by either the “precursor” or the “mature” model. The prediction results can then be downloaded as a CSV file.
You can install the released version of ampir from CRAN with:
install.packages("ampir")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("Legana/ampir")
library(ampir)
Standard input to ampir is a data.frame
with sequence names in the first column and protein sequences in the
second column.
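If the sequences are already in R rather than in a FASTA file, a data.frame in this layout can be built by hand. The following is a minimal sketch in base R only. The object name `my_manual_df` and the two sequences are made-up placeholders, and the column names `seq_name` and `seq_aa` are assumed to mirror those shown in the `read_faa()` output.

```r
# Hypothetical example: build an ampir-style input data.frame by hand.
# The names and sequences below are placeholders, not real proteins.
my_manual_df <- data.frame(
  seq_name = c("example_protein_1", "example_protein_2"),
  seq_aa   = c("MKTLLVAGLLCLSAAQAGWLKKIGKKIERVGQ",
               "MSKREYERQLANEEDEQLRNFQAAVAARS"),
  stringsAsFactors = FALSE  # keep sequences as character strings
)

# Sequence names are in column 1, amino acid sequences in column 2
str(my_manual_df)
```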
Read in a FASTA formatted file as a data.frame with read_faa():
my_protein_df <- read_faa(system.file("extdata/little_test.fasta", package = "ampir"))
seq_name | seq_aa |
---|---|
G1P6H5_MYOLU | MALTVRIQAACLLLLLLASLTSYSLLLSQTTQLADLQTQDTAGAT… |
L5L3D0_PTEAL | MKPLLIVFVFLIFWDPALAGLNPISSEMYKKCYGNGICRLECYTS… |
A0A183U1F1_TOXCA | LLRLYSPLVMFATRRVLLCLLVIYLLAQPIHSSWLKKTYKKLENS… |
Q5F4I1_DROPS | MNFYKIFIFVALILAISVGQSEAGWLKKLGKRLERVGQHTRDATI… |
A7S075_NEMVE | MFLKVVVVLLAVELSVAQSARQRVRPLDRKAGRKRFAPIFPRQCS… |
F1DFM9_9CNID | MKVLVILFGAMLVLMEFQKASAATLLEDFDDDDDLLDDGGDFDLE… |
Q5XV93_ARATH | MSKREYERQLANEEDEQLRNFQAAVAARSAILHEPKEAALPPPAP… |
Q2XXN9_POGBA | MRFLYLLFAVAFLFSVQAEDAELEQEQQGDPWEGLDEFQDQPPDD… |
Calculate the probability that each protein is an antimicrobial peptide with predict_amps(). Since these proteins are all full-length precursors rather than mature peptides, we use ampir’s built-in precursor model.
Note that amino acid sequences that are shorter than 10 amino acids and/or contain anything other than the standard 20 amino acids are not evaluated and will have NA as their prob_AMP value.
my_prediction <- predict_amps(my_protein_df, model = "precursor")
seq_name | seq_aa | prob_AMP |
---|---|---|
G1P6H5_MYOLU | MALTVRIQAACLLLLLLASLTSYSLLLSQTTQLADLQTQDTAGAT… | 0.612 |
L5L3D0_PTEAL | MKPLLIVFVFLIFWDPALAGLNPISSEMYKKCYGNGICRLECYTS… | 0.945 |
A0A183U1F1_TOXCA | LLRLYSPLVMFATRRVLLCLLVIYLLAQPIHSSWLKKTYKKLENS… | 0.088 |
Q5F4I1_DROPS | MNFYKIFIFVALILAISVGQSEAGWLKKLGKRLERVGQHTRDATI… | 0.998 |
A7S075_NEMVE | MFLKVVVVLLAVELSVAQSARQRVRPLDRKAGRKRFAPIFPRQCS… | 0.032 |
F1DFM9_9CNID | MKVLVILFGAMLVLMEFQKASAATLLEDFDDDDDLLDDGGDFDLE… | 0.223 |
Q5XV93_ARATH | MSKREYERQLANEEDEQLRNFQAAVAARSAILHEPKEAALPPPAP… | 0.009 |
Q2XXN9_POGBA | MRFLYLLFAVAFLFSVQAEDAELEQEQQGDPWEGLDEFQDQPPDD… | 0.733 |
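To anticipate which sequences would receive an NA, the two conditions in the note above (at least 10 residues, only the standard 20 amino acids) can be checked before prediction. The sketch below uses base R only. `is_evaluable()` is a hypothetical helper, not an ampir function, and ampir’s internal check may differ in detail (for example in how it treats lower-case letters).

```r
# Sketch: flag sequences that meet the two stated conditions.
# is_evaluable() is a hypothetical helper and is NOT part of ampir.
is_evaluable <- function(seq_aa) {
  # Condition 1: at least 10 amino acids long
  long_enough <- nchar(seq_aa) >= 10
  # Condition 2: only the 20 standard one-letter codes (upper case assumed)
  standard_only <- grepl("^[ACDEFGHIKLMNPQRSTVWY]+$", seq_aa)
  long_enough & standard_only
}

# Made-up test strings for illustration
example_seqs <- c(
  "MKPLLIVFVFLIFWDPALAG",   # 20 standard residues
  "MKPLLIV",                # shorter than 10 residues
  "MKPLLIVFVFXLIFWDPALAG"   # contains the non-standard letter X
)

evaluable <- is_evaluable(example_seqs)
print(evaluable)
```

Applied to a data.frame such as my_protein_df, the same helper could be called on the seq_aa column to count how many rows are expected to be skipped.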
Proteins with a predicted probability at or above a chosen threshold (here 0.8) can then be extracted and written to a FASTA file:
my_predicted_amps <- my_protein_df[which(my_prediction$prob_AMP >= 0.8),]
row | seq_name | seq_aa |
---|---|---|
2 | L5L3D0_PTEAL | MKPLLIVFVFLIFWDPALAGLNPISSEMYKKCYGNGICRLECYTS… |
4 | Q5F4I1_DROPS | MNFYKIFIFVALILAISVGQSEAGWLKKLGKRLERVGQHTRDATI… |
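The 0.8 cut-off used here is only one possible choice. As a rough illustration of how the threshold affects the selection, the sketch below counts how many of the eight example proteins pass a few candidate cut-offs. The probabilities are re-entered by hand from the prediction table above so that the code runs without ampir. The thresholds 0.5 and 0.95 are arbitrary values chosen for illustration.

```r
# prob_AMP values copied from the prediction table above
prob_AMP <- c(0.612, 0.945, 0.088, 0.998, 0.032, 0.223, 0.009, 0.733)

# Candidate cut-offs (0.8 is the one used above; the others are arbitrary)
thresholds <- c(0.5, 0.8, 0.95)

# Count sequences at or above each threshold; na.rm skips unevaluated rows
n_selected <- sapply(thresholds, function(t) sum(prob_AMP >= t, na.rm = TRUE))

threshold_summary <- data.frame(threshold = thresholds, n_selected = n_selected)
print(threshold_summary)
```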
Write the data.frame with sequence names in the first column and protein sequences in the second column to a FASTA formatted file with df_to_faa():
df_to_faa(my_predicted_amps, "my_predicted_amps.fasta")