The mulea
R package (Turek et al. 2024) is a
comprehensive tool for functional enrichment analysis. It provides two
different approaches:
For unranked sets of elements, such as significantly up- or
down-regulated genes, mulea
employs the set-based
overrepresentation analysis (ORA).
Alternatively, if the data consists of ranked elements, for
instance, genes ordered by p-value or log fold-change
calculated by the differential expression analysis, mulea
offers the gene set enrichment (GSEA)
approach.
For the overrepresentation analysis, mulea
employs a
progressive empirical false discovery rate (eFDR)
method, specifically designed for interconnected biological data, to
accurately identify significant terms within diverse ontologies.
mulea
expands beyond traditional tools by incorporating
a wide range of ontologies, encompassing Gene Ontology,
pathways, regulatory elements, genomic locations, and protein domains
for 27 model organisms, covering 22 ontology types from 16 databases and
various identifiers resulting in 879 files available at the ELTEbioinformatics/GMT_files_for_mulea
GitHub repository and through the muleaData
ExperimentData Bioconductor package.
Install the dependency fgsea
BioConductor package:
# Installing the BiocManager package if needed
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# Installing the fgsea package with the BiocManager
BiocManager::install("fgsea")
To install mulea
from CRAN:
To install the development version of mulea
from
GitHub:
First, load the mulea
and
dplyr
libraries. The
dplyr
library is not essential but is used
here to facilitate data manipulation and inspection.
This section demonstrates how to import the desired ontology, such as transcription factors and their target genes downloaded from the database, into a data frame suitable for enrichment analysis. We present multiple methods for importing the ontology. Ensure that the identifier type (e.g., UniProt protein ID, Entrez ID, Gene Symbol, Ensembl gene ID) matches between the ontology and the elements you wish to investigate.
The GMT
(Gene Matrix Transposed) format contains collections of genes or
proteins associated with specific ontology terms in a tab-delimited text
file. The GMT file can be read into R as a data frame using the
read_gmt
function from the mulea
package. Each
term is represented by a single row in both the GMT file and the data
frame. Each row includes three types of elements:
Ontology identifier (“ontology_id”): This uniquely identifies each term within the file or data frame.
Ontology name or description (“ontology_name”): This provides a user-friendly label or textual description for each term.
Associated gene or protein identifiers: These are listed in the “list_of_values” column, with identifiers separated by spaces, and belong to each term.1
mulea
GMT FileAlongside with the mulea
package we provide ontologies
collected from 16 publicly available databases, in a standardised GMT
format for 27 model organisms, from Bacteria to human. These files are
available at the ELTEbioinformatics/GMT_files_for_mulea
GitHub repository.
To read a downloaded GMT file locally:
# Reading the mulea GMT file locally
tf_ontology <- read_gmt("Transcription_factor_RegulonDB_Escherichia_coli_GeneSymbol.gmt")
Alternatively, one can read it directly from the GitHub repository:
mulea
is compatible with GMT files provided
with the Enricher
R package (Kuleshov et al. 2016).
Download and read such a GMT file (e. g.
TRRUST_Transcription_Factors_2019.txt) locally. Note that
this ontology is not suitable for analyzing the Escherichia coli
differential expression data described in the section The Differential
Expression Dataset to Analyse.
mulea
is compatible with the MsigDB (Subramanian et
al. 2005) GMT
files. Download and read such a GMT file (e. g.
c3.tft.v2023.2.Hs.symbols.gmt) locally. Note that this
ontology is not suitable for analyzing the Escherichia coli differential
expression data described in the section The Differential
Expression Dataset to Analyse.
muleaData
PackageAlternatively, you can retrieve the ontology using the muleaData
ExperimentData Bioconductor package:
# Installing the ExperimentHub package from Bioconductor
BiocManager::install("ExperimentHub")
# Calling the ExperimentHub library.
library(ExperimentHub)
# Downloading the metadata from ExperimentHub.
eh <- ExperimentHub()
# Creating the muleaData variable.
muleaData <- query(eh, "muleaData")
# Looking for the ExperimentalHub ID of the ontology.
EHID <- mcols(muleaData) %>%
as.data.frame() %>%
dplyr::filter(title == "Transcription_factor_RegulonDB_Escherichia_coli_GeneSymbol.rds") %>%
rownames()
# Reading the ontology from the muleaData package.
tf_ontology <- muleaData[[EHID]]
# Change the header
tf_ontology <- tf_ontology %>%
rename(ontology_id = "ontologyId",
ontology_name = "ontologyName",
list_of_values = "listOfValues")
Enrichment analysis results can sometimes be skewed by overly
specific or broad entries. mulea
allows you to customise
the size of ontology entries – the number of genes or proteins belonging
to a term – ensuring your analysis aligns with your desired scope.
Let’s exclude ontology entries with less than 3 or more than 400 gene symbols.
You can save the ontology as a GMT file using the
write_gmt
function.
The mulea
package provides the list_to_gmt
function to convert a list of gene sets into an ontology data frame. The
following example demonstrates how to use this function:
For further steps we will analyse a dataset from a microarray experiment (GSE55662) in the NCBI Gene Expression Omnibus . The study by Méhi et al. (2014) investigated antibiotic resistance evolution in Escherichia coli. Gene expression changes were compared between ciprofloxacin antibiotic-treated Escherichia coli bacteria and non-treated controls.
The expression levels of these groups were compared with the GEO2R tool:
To see how the dataset were prepared go to the Formatting the Results of a Differential Expression Analysis section.
The mulea
package implements a set-based enrichment
analysis approach using the hypergeometric test, which is
analogous to the one-tailed Fisher’s exact test. This method identifies
statistically significant overrepresentation of elements from a target
set (e.g., significantly up- or downregulated genes) within a
background set (e.g., all genes that were investigated in the
experiment). Therefore, a predefined threshold value, such as 0.05 for
the corrected p-value or 2-fold change, should be used in the
preceding analysis.
The overrepresentation analysis is implemented in the
ora
function which requires three inputs:
Ontology data frame: Fits the investigated taxa and the applied gene or protein identifier type, such as GO, pathway, transcription factor regulation, microRNA regulation, gene expression data, genomic location data, or protein domain content.
Target set: A vector of elements to investigate, containing genes or proteins of interest, such as significantly overexpressed genes in the experiment.
Background set: A vector of background elements representing the broader context, often including all genes investigated in the study.
Let’s read the text files containing the identifiers (gene symbols) of the target and the background gene set directly from the GitHub website. To see how these files were prepared, refer to the section on Formatting the Results of a Differential Expression Analysis.
To perform the analysis, we will first establish a model using the
ora
function. This model defines the parameters for the
enrichment analysis. We then execute the test itself using the
run_test
function. It is important to note
that for this example, we will employ 10,000 permutations for the
empirical false discovery rate (eFDR), which is the
recommended minimum, to ensure robust correction for multiple
testing.
# Creating the ORA model using the GMT variable
ora_model <- ora(gmt = tf_ontology_filtered,
# Test set variable
element_names = target_set,
# Background set variable
background_element_names = background_set,
# p-value adjustment method
p_value_adjustment_method = "eFDR",
# Number of permutations
number_of_permutations = 10000,
# Number of processor threads to use
nthreads = 2,
# Setting a random seed for reproducibility
random_seed = 1)
# Running the ORA
ora_results <- run_test(ora_model)
The ora_results
data frame summarises the enrichment
analysis, listing enriched ontology entries – in our case transcription
factors – alongside their associated p-values and eFDR
values.
We can now determine the number of transcription factors classified as “enriched” based on these statistical measures (eFDR < 0.05).
## [1] 10
Inspect the significant results:
ora_results %>%
# Arrange the rows by the eFDR values
arrange(eFDR) %>%
# Rows where the eFDR < 0.05
filter(eFDR < 0.05)
ontology_id | ontology_name | nr_common_with_tested_elements | nr_common_with_background_elements | p_value | eFDR |
---|---|---|---|---|---|
FNR | FNR | 26 | 259 | 0.0000003 | 0.0000000 |
LexA | LexA | 14 | 53 | 0.0000000 | 0.0000000 |
SoxS | SoxS | 7 | 37 | 0.0001615 | 0.0036667 |
Rob | Rob | 5 | 21 | 0.0004717 | 0.0051200 |
DnaA | DnaA | 4 | 13 | 0.0006281 | 0.0052000 |
FadR | FadR | 5 | 20 | 0.0003692 | 0.0056000 |
NsrR | NsrR | 8 | 64 | 0.0010478 | 0.0073714 |
ArcA | ArcA | 12 | 148 | 0.0032001 | 0.0197500 |
IHF | IHF | 14 | 205 | 0.0070758 | 0.0458600 |
MarA | MarA | 5 | 37 | 0.0066068 | 0.0483111 |
To gain a comprehensive understanding of the enriched transcription
factors, mulea
offers diverse
visualisation tools, including lollipop charts, bar plots, networks, and
heatmaps. These visualisations effectively reveal patterns and
relationships among the enriched factors.
Initialising the visualisation with the reshape_results
function:
# Reshapeing the ORA results for visualisation
ora_reshaped_results <- reshape_results(model = ora_model,
model_results = ora_results,
# Choosing which column to use for the
# indication of significance
p_value_type_colname = "eFDR")
Visualising the Spread of eFDR Values: Lollipop Plot
Lollipop charts provide a graphical representation of the distribution of enriched transcription factors. The y-axis displays the transcription factors, while the x-axis represents their corresponding eFDR values. The dots are coloured based on their eFDR values. This visualisation helps us examine the spread of eFDRs and identify factors exceeding the commonly used significance threshold of 0.05.
plot_lollipop(reshaped_results = ora_reshaped_results,
# Column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# Upper threshold for the value indicating the significance
p_value_max_threshold = 0.05,
# Column that indicates the significance values
p_value_type_colname = "eFDR")
Visualising the Spread of eFDR Values: Bar Plot
Bar charts offer a graphical representation similar to lollipop plots. The y-axis displays the enriched ontology categories (e.g., transcription factors), while the x-axis represents their corresponding eFDR values. The bars are coloured based on their eFDR values, aiding in examining the spread of eFDRs and identifying factors exceeding the significance threshold of 0.05.
plot_barplot(reshaped_results = ora_reshaped_results,
# Column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# Upper threshold for the value indicating the significance
p_value_max_threshold = 0.05,
# Column that indicates the significance values
p_value_type_colname = "eFDR")
Visualising the Associations: Graph Plot
This function generates a network visualisation of the enriched ontology categories (e.g., transcription factors). Each node represents an eriched ontology category, coloured based on its eFDR value. An edge is drawn between two nodes if they share at least one common gene belonging to the target set, indicating co-regulation. The thickness of the edge reflects the number of shared genes.
plot_graph(reshaped_results = ora_reshaped_results,
# Column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# Upper threshold for the value indicating the significance
p_value_max_threshold = 0.05,
# Column that indicates the significance values
p_value_type_colname = "eFDR")
Visualising the Associations: Heatmap
The heatmap displays the genes associated with the enriched ontology categories (e.g., transcription factors). Each row represents a category, coloured based on its eFDR value. Each column represents a gene from the target set belonging to the enriched ontology category, indicating potential regulation by one or more enriched transcription factors.
The ora
function allows you to choose between different
methods for calculating the FDR and adjusting the
p-values: eFDR, and all method
options
from the stats::p.adjust
documentation (holm, hochberg,
hommel, bonferroni, BH, BY, and fdr). The following code snippet
demonstrates how to perform the analysis using the
Benjamini-Hochberg and Bonferroni corrections:
# Creating the ORA model using the Benjamini-Hochberg p-value correction method
BH_ora_model <- ora(gmt = tf_ontology_filtered,
# Test set variable
element_names = target_set,
# Background set variable
background_element_names = background_set,
# p-value adjustment method
p_value_adjustment_method = "BH",
# Number of processor threads to use
nthreads = 2)
# Running the ORA
BH_results <- run_test(BH_ora_model)
# Creating the ORA model using the Bonferroni p-value correction method
Bonferroni_ora_model <- ora(gmt = tf_ontology_filtered,
# Test set variable
element_names = target_set,
# Background set variable
background_element_names = background_set,
# p-value adjustment method
p_value_adjustment_method = "bonferroni",
# Number of processor threads to use
nthreads = 2)
# Running the ORA
Bonferroni_results <- run_test(Bonferroni_ora_model)
To compare the significant results (using the conventional < 0.05 threshold) of the eFDR, Benjamini-Hochberg, and Bonferroni corrections, we can merge and filter the result tables:
# Merging the Benjamini-Hochberg and eFDR results
merged_results <- BH_results %>%
# Renaming the column
rename(BH_adjusted_p_value = adjusted_p_value) %>%
# Selecting the necessary columns
select(ontology_id, BH_adjusted_p_value) %>%
# Joining with the eFDR results
left_join(ora_results, ., by = "ontology_id") %>%
# Converting the data.frame to a tibble
tibble()
# Merging the Bonferroni results with the merged results
merged_results <- Bonferroni_results %>%
# Renaming the column
rename(Bonferroni_adjusted_p_value = adjusted_p_value) %>%
# Selecting the necessary columns
select(ontology_id, Bonferroni_adjusted_p_value) %>%
# Joining with the eFDR results
left_join(merged_results, ., by = "ontology_id") %>%
# Arranging by the p-value
arrange(p_value)
# filter the p-value < 0.05 results
merged_results_filtered <- merged_results %>%
filter(p_value < 0.05) %>%
# remove the unnecessary columns
select(-ontology_id, -nr_common_with_tested_elements,
-nr_common_with_background_elements)
ontology_name | p_value | eFDR | BH_adjusted_p_value | Bonferroni_adjusted_p_value |
---|---|---|---|---|
LexA | 0.0000000 | 0.0000000 | 0.0000001 | 0.0000001 |
FNR | 0.0000003 | 0.0000000 | 0.0000208 | 0.0000416 |
SoxS | 0.0001615 | 0.0036667 | 0.0082880 | 0.0248641 |
FadR | 0.0003692 | 0.0056000 | 0.0142127 | 0.0568507 |
Rob | 0.0004717 | 0.0051200 | 0.0145296 | 0.0726479 |
DnaA | 0.0006281 | 0.0052000 | 0.0161218 | 0.0967306 |
NsrR | 0.0010478 | 0.0073714 | 0.0230517 | 0.1613622 |
ArcA | 0.0032001 | 0.0197500 | 0.0616014 | 0.4928114 |
MarA | 0.0066068 | 0.0483111 | 0.1089670 | 1.0000000 |
IHF | 0.0070758 | 0.0458600 | 0.1089670 | 1.0000000 |
NarL | 0.0096065 | 0.0534000 | 0.1276532 | 1.0000000 |
NikR | 0.0099470 | 0.0615833 | 0.1276532 | 1.0000000 |
OxyR | 0.0174505 | 0.0786923 | 0.2067212 | 1.0000000 |
ExuR | 0.0261046 | 0.1051867 | 0.2680073 | 1.0000000 |
UxuR | 0.0261046 | 0.1051867 | 0.2680073 | 1.0000000 |
NrdR | 0.0328500 | 0.1232750 | 0.3161817 | 1.0000000 |
IscR | 0.0376038 | 0.1249412 | 0.3406459 | 1.0000000 |
Nac | 0.0419701 | 0.1487556 | 0.3590774 | 1.0000000 |
Fis | 0.0457307 | 0.1433053 | 0.3706596 | 1.0000000 |
A comparison of the significant results revealed that conventional p-value corrections (Benjamini-Hochberg and Bonferroni) tend to be overly conservative, leading to a reduction in the number of significant transcription factors compared to the eFDR. As illustrated in the below figure, by applying the eFDR we were able to identify 10 significant transcription factors, while with the Benjamini-Hochberg and Bonferroni corrections only 7 and 3, respectively. This suggests that the eFDR may be a more suitable approach for controlling false positives in this context.
To perform enrichment analysis using ranked lists, you need to provide an ordered list of elements, such as genes or proteins. This ranking is typically based on the results of your prior analysis, using metrics like p-values, z-scores, fold-changes, or others. Crucially, the ranked list should include all elements involved in your analysis. For example, in a differential expression study, it should encompass all genes that were measured.
mulea
utilises the Kolmogorov-Smirnov approach with a
permutation test (developed by Subramanian et al. (2005)) to calculate
gene set enrichment analyses. This functionality is implemented through
the integration of the fgsea
Bioconductor package (created by Korotkevich et al. (2021)).
GSEA requires input data about the genes analysed in our experiment. This data can be formatted in two ways:
Data frame: This format should include all genes investigated and their respective log fold change values (or other values for ordering the genes) obtained from the differential expression analysis.
Two vectors: Alternatively, you can provide two separate vectors. One vector should contain the gene symbols (or IDs), and the other should hold the corresponding log fold change values (or other values for ordering the genes) for each gene.
Let’s read the TSV file containing the identifiers (gene symbols) and the log fold change values of the investigated set directly from the GitHub website. For details on how this file was prepared, please refer to the Formatting the Results of a Differential Expression Analysis section.
# Reading the tsv containing the ordered set
ordered_set <- read_tsv("https://raw.githubusercontent.com/ELTEbioinformatics/mulea/master/inst/extdata/ordered_set.tsv")
## Rows: 7381 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): Gene.symbol
## dbl (1): logFC
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To perform the analysis, we will first establish a model using the
gsea
function. This model defines the parameters for the
enrichment analysis. Subsequently, we will execute the test itself using
the run_test
function. We will employ 10,000 permutations
for the false discovery rate, to ensure robust correction for multiple
testing.
# Creating the GSEA model using the GMT variable
gsea_model <- gsea(gmt = tf_ontology_filtered,
# Names of elements to test
element_names = ordered_set$Gene.symbol,
# LogFC-s of elements to test
element_scores = ordered_set$logFC,
# Consider elements having positive logFC values only
element_score_type = "pos",
# Number of permutations
number_of_permutations = 10000)
# Running the GSEA
gsea_results <- run_test(gsea_model)
The gsea_results
data frame summarises the enrichment
analysis, listing enriched ontology entries – in our case transcription
factors – alongside their associated p-values and adjusted
p-value values.
We can now determine the number of transcription factors classified as “enriched” based on these statistical measures (adjusted p-value < 0.05).
gsea_results %>%
# rows where the adjusted_p_value < 0.05
filter(adjusted_p_value < 0.05) %>%
# the number of such rows
nrow()
## [1] 7
Inspect the significant results:
gsea_results %>%
# arrange the rows by the adjusted_p_value values
arrange(adjusted_p_value) %>%
# rows where the adjusted_p_value < 0.05
filter(adjusted_p_value < 0.05)
ontology_id | ontology_name | nr_common_with_tested_elements | p_value | adjusted_p_value |
---|---|---|---|---|
LexA | LexA | 53 | 0.0000000 | 0.0000020 |
FNR | FNR | 259 | 0.0000323 | 0.0024689 |
GlaR | GlaR | 3 | 0.0001485 | 0.0075738 |
ModE | ModE | 45 | 0.0003258 | 0.0124627 |
ArcA | ArcA | 148 | 0.0004900 | 0.0138964 |
SoxS | SoxS | 37 | 0.0005450 | 0.0138964 |
DnaA | DnaA | 13 | 0.0007190 | 0.0157147 |
Initializing the visualisation with the reshape_results
function:
# Reshaping the GSEA results for visualisation
gsea_reshaped_results <- reshape_results(model = gsea_model,
model_results = gsea_results,
# choosing which column to use for the
# indication of significance
p_value_type_colname = "adjusted_p_value")
Visualising Relationships: Graph Plot
This function generates a network visualisation of the enriched ontology categories (e.g., transcription factors). Each node represents a category and is coloured based on its significance level. A connection (edge) is drawn between two nodes if they share at least one common gene belonging to the ranked list, meaning that both transcription factors regulate the expression of the same target gene. The thickness of the edge reflects the number of shared genes.
plot_graph(reshaped_results = gsea_reshaped_results,
# the column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# upper threshold for the value indicating the significance
p_value_max_threshold = 0.05,
# column that indicates the significance values
p_value_type_colname = "adjusted_p_value")
Other plot types such as lollipop plots, bar plots, and heatmaps can also be used to investigate the GSEA results.
This section aims to elucidate the structure and essential components
of the provided DE results table. It offers guidance to users on
interpreting the data effectively for subsequent analysis with
mulea
.
Let’s read the differential expression result file named GSE55662.table_wt_non_vs_cipro.tsv located in the inst/extdata/ folder directly from the GitHub website.
# Importing necessary libraries and reading the DE results table
geo2r_result_tab <- read_tsv("https://raw.githubusercontent.com/ELTEbioinformatics/mulea/master/inst/extdata/GSE55662.table_wt_non_vs_cipro.tsv")
Let’s delve into the geo2r_result_tab
data frame by
examining its initial rows:
ID | adj.P.Val | P.Value | t | B | logFC | Gene.symbol | Gene.title |
---|---|---|---|---|---|---|---|
1765336_s_at | 0.0186 | 2.4e-06 | 21.5 | 4.95769 | 3.70 | gnsB | Qin prophage; multicopy suppressor of secG(Cs) and fabA6(Ts) |
1760422_s_at | 0.0186 | 3.8e-06 | 19.6 | 4.68510 | 3.14 | NA | NA |
1764904_s_at | 0.0186 | 5.7e-06 | 18.2 | 4.43751 | 2.54 | sulA///sulA///sulA///ECs1042 | SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor |
Preparing the data frame appropriately for enrichment analysis is crucial. This involves specific steps tailored to the microarray experiment type. Here, we undertake the following transformations:
Gene Symbol Extraction: We isolate the primary
gene symbol from the Gene.symbol
column, eliminating any
extraneous information.
Handling Missing Values: Rows with missing gene
symbols (NA
) are excluded.
Sorting by Fold Change: The data frame is sorted
by log-fold change (logFC
) in descending order,
prioritizing genes with the most significant expression
alterations.
# Formatting the data frame
geo2r_result_tab <- geo2r_result_tab %>%
# Extracting the primary gene symbol and removing extraneous information
mutate(Gene.symbol = str_remove(string = Gene.symbol,
pattern = "\\/.*")) %>%
# Filtering out rows with NA gene symbols
filter(!is.na(Gene.symbol)) %>%
# Sorting by logFC
arrange(desc(logFC))
Before proceeding with enrichment analysis, let’s examine the initial
rows of the formatted geo2r_result_tab
data frame:
ID | adj.P.Val | P.Value | t | B | logFC | Gene.symbol | Gene.title |
---|---|---|---|---|---|---|---|
1765336_s_at | 0.0186 | 2.40e-06 | 21.5 | 4.95769 | 3.70 | gnsB | Qin prophage; multicopy suppressor of secG(Cs) and fabA6(Ts) |
1764904_s_at | 0.0186 | 5.70e-06 | 18.2 | 4.43751 | 2.54 | sulA | SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor |
1761763_s_at | 0.0186 | 1.54e-05 | 15.0 | 3.73568 | 2.16 | recN | recombination and repair protein///recombination and repair protein///recombination and repair protein///recombination and repair protein |
Following these formatting steps, the data frame is primed for further analysis.
A vector containing the gene symbols of significantly overexpressed (adjusted p-value < 0.05) genes with greater than 2 fold-change (logFC > 1).
target_set <- geo2r_result_tab %>%
# Filtering for adjusted p-value < 0.05 and logFC > 1
filter(adj.P.Val < 0.05 & logFC > 1) %>%
# Selecting the Gene.symbol column
select(Gene.symbol) %>%
# Converting the tibble to a vector
pull() %>%
# Removing duplicates
unique()
The first 10 elements of the target set:
## [1] "gnsB" "sulA" "recN" "c4435" "dinI" "c2757" "c1431"
## [8] "gabP" "recA" "ECs5456"
The number of genes in the target set:
## [1] 241
A vector containing the gene symbols of all genes were included in the differential expression analysis.
background_set <- geo2r_result_tab %>%
# Selecting the Gene.symbol column
select(Gene.symbol) %>%
# Converting the tibble to a vector
pull() %>%
# Removing duplicates
unique()
The number of genes in the background set:
## [1] 7381
Save the target and the background set vectors to text file:
# If there are duplicated Gene.symbols keep the first one only
ordered_set <- geo2r_result_tab %>%
# Grouping by Gene.symbol to be able to filter
group_by(Gene.symbol) %>%
# Keeping the first row for each Gene.symbol from rows with the same
# Gene.symbol
filter(row_number()==1) %>%
# Ungrouping
ungroup() %>%
# Arranging by logFC in descending order
arrange(desc(logFC)) %>%
select(Gene.symbol, logFC)
The number of gene symbols in the ordered_set
vector:
## [1] 7381
Save the ordered set data frame to tab delimited file:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Debian GNU/Linux 12 (bookworm)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so; LAPACK version 3.11.0
##
## locale:
## [1] LC_CTYPE=en_IE.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_IE.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_IE.UTF-8 LC_MESSAGES=en_IE.UTF-8
## [7] LC_PAPER=en_IE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Budapest
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [9] ggplot2_3.5.1 tidyverse_2.0.0 mulea_1.1.0
##
## loaded via a namespace (and not attached):
## [1] fastmatch_1.1-4 gtable_0.3.5 xfun_0.46
## [4] bslib_0.7.0 ggrepel_0.9.5 lattice_0.22-6
## [7] tzdb_0.4.0 vctrs_0.6.5 tools_4.4.1
## [10] generics_0.1.3 curl_5.2.1 parallel_4.4.1
## [13] fansi_1.0.6 highr_0.11 pkgconfig_2.0.3
## [16] Matrix_1.7-0 data.table_1.15.4 lifecycle_1.0.4
## [19] compiler_4.4.1 farver_2.1.2 munsell_0.5.1
## [22] ggforce_0.4.2 fgsea_1.30.0 graphlayouts_1.1.1
## [25] codetools_0.2-20 htmltools_0.5.8.1 sass_0.4.9
## [28] yaml_2.3.9 crayon_1.5.3 pillar_1.9.0
## [31] jquerylib_0.1.4 MASS_7.3-61 BiocParallel_1.38.0
## [34] cachem_1.1.0 viridis_0.6.5 tidyselect_1.2.1
## [37] digest_0.6.36 stringi_1.8.4 labeling_0.4.3
## [40] cowplot_1.1.3 polyclip_1.10-6 fastmap_1.2.0
## [43] grid_4.4.1 colorspace_2.1-0 cli_3.6.3
## [46] magrittr_2.0.3 ggraph_2.2.1 tidygraph_1.3.1
## [49] utf8_1.2.4 withr_3.0.0 scales_1.3.0
## [52] bit64_4.0.5 timechange_0.3.0 rmarkdown_2.27
## [55] bit_4.0.5 igraph_2.0.3 gridExtra_2.3
## [58] hms_1.1.3 memoise_2.0.1 evaluate_0.24.0
## [61] knitr_1.48 viridisLite_0.4.2 rlang_1.1.4
## [64] Rcpp_1.0.13 glue_1.7.0 tweenr_2.0.3
## [67] vroom_1.6.5 rstudioapi_0.16.0 jsonlite_1.8.8
## [70] R6_2.5.1 plyr_1.8.9
mulea
Package?To cite package mulea
in publications use:
Turek, Cezary, Márton Ölbei, Tamás Stirling, Gergely Fekete, Ervin Tasnádi, Leila Gul, Balázs Bohár, Balázs Papp, Wiktor Jurkowski, and Eszter Ari. 2024. “mulea - an R Package for Enrichment Analysis Using Multiple Ontologies and Empirical FDR Correction.” bioRxiv, March. https://doi.org/10.1101/2024.02.28.582444.
Korotkevich, Gennady, Vladimir Sukhov, Nikolay Budin, Boris Shpak, Maxim N. Artyomov, and Alexey Sergushichev. 2021. “Fast Gene Set Enrichment Analysis.” bioRxiv, February. https://doi.org/10.1101/060012.
Kuleshov, Maxim V., Matthew R. Jones, Andrew D. Rouillard, Nicolas F. Fernandez, Qiaonan Duan, Zichen Wang, Simon Koplev, et al. 2016. “Enrichr: A Comprehensive Gene Set Enrichment Analysis Web Server 2016 Update.” Nucleic Acids Research 44 (W1): W90–97. https://doi.org/10.1093/nar/gkw377.
Méhi, Orsolya, Balázs Bogos, Bálint Csörgő, Ferenc Pál, Ákos Nyerges, Balázs Papp, and Csaba Pál. 2014. “Perturbation of Iron Homeostasis Promotes the Evolution of Antibiotic Resistance.” Molecular Biology and Evolution 31 (10): 2793–2804. https://doi.org/10.1093/molbev/msu223.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.
Turek, Cezary, Márton Ölbei, Tamás Stirling, Gergely Fekete, Ervin Tasnádi, Leila Gul, Balázs Bohár, Balázs Papp, Wiktor Jurkowski, and Eszter Ari. 2024. “Mulea - an R Package for Enrichment Analysis Using Multiple Ontologies and Empirical FDR Correction.” bioRxiv, March. https://doi.org/10.1101/2024.02.28.582444.
The format of the actually used ontology slightly
deviates from standard GMT files. In tf_ontology
, both the
ontology_id
and ontology_name
columns contain
gene symbols of the transcription factors, unlike other
ontologies such as GO, where these columns hold specific identifiers and
corresponding names.↩︎