This R package edlibR
provides bindings to the C/C++
library edlib, which computes the exact pairwise sequence alignment
using the edit
distance (Levenshtein
distance). The functions within edlibR
are modeled
after the API of the Python
package edlib on PyPI
There are three functions within edlibR
:
The first function provided by edlibR
is
align()
. The function align()
computes the
pairwise alignment of the input query
against the input
target
:
align(query, target, [mode], [task], [k], [cigarFormat], [additionalEqualities])
A list is returned with the following fields:
query
and target
.list(c(start, end))
. Note: if the start or end
positions are NULL, this is encoded as NA
to work correctly
with R vectors.cigarFormat
in the function
align()
which is returned here for the function
getNiceAlignment()
. (Note: the function
getNiceAlignment()
only accepts
cigarFormat="extended"
.)## $editDistance
## [1] 1
##
## $alphabetLength
## [1] 5
##
## $locations
## $locations[[1]]
## [1] 1 3
##
## $locations[[2]]
## [1] 1 4
##
##
## $cigar
## [1] "3=1I"
##
## $cigarFormat
## [1] "extended"
## $editDistance
## [1] 3
##
## $alphabetLength
## [1] 8
##
## $locations
## $locations[[1]]
## [1] NA 8
##
##
## $cigar
## NULL
##
## $cigarFormat
## [1] "extended"
## $editDistance
## [1] 1
##
## $alphabetLength
## [1] 5
##
## $locations
## $locations[[1]]
## [1] 1 3
##
## $locations[[2]]
## [1] 1 4
##
##
## $cigar
## [1] "3=1I"
##
## $cigarFormat
## [1] "extended"
## the previous example with additionalEqualities
algn4 = align("ACTG", "CACTRT", mode="HW", task="path", additionalEqualities=list(c("R", "A"), c("R", "G")))
print(algn4)
## $editDistance
## [1] 0
##
## $alphabetLength
## [1] 5
##
## $locations
## $locations[[1]]
## [1] 1 4
##
##
## $cigar
## [1] "4="
##
## $cigarFormat
## [1] "extended"
edlibR
:
AACT
and target as AACTGGC
, the edit
distance would be 0, because removing GGC
from the end of
the second sequence is “free” and does not count into the total edit
distance. This method is appropriate when you want to find out how
well the first sequence fits at the beginning of the second
sequence.ACT
and
CGACTGAC
, the edit distance would be 0, because removing
CG
from the start and GAC
from the end of the
second sequence is “free” and does not count into the total edit
distance. This method is appropriate when you want to find out how
well the first sequence fits at any part of the second sequence.
For example, if your second sequence was a long text and your first
sequence was a sentence from that text, but slightly scrambled, you
could use this method to discover how scrambled it is and where it fits
in that text. In bioinformatics, this method is appropriate for
aligning a read to a sequence.cigarFormat="extended"
):
The function getNiceAlignment() takes the output of align(), and represents this in a visually informative format for human inspection (“NICE format”). This will be an informative string showing the matches, mismatches, insertions, and deletions.
getNiceAlignment(alignResult, query, target, [gapSymbol])
Note: Users must use the argument task="path"
within
align()
to output a CIGAR for
getNiceAlignment()
; otherwise, there will be no CIGAR for
getNiceAlignment()
to reconstruct the alignment in “NICE”
format. Also, users must use the argument
cigarFormat="extended"
within align(); otherwise, the CIGAR
will be too ambiguous for getNiceAlignment() to correctly reconstruct
the alignment() in “NICE” format.
library(edlibR)
query = "elephant"
target = "telephone"
result = align(query, target, task = "path")
nice_algn = getNiceAlignment(result, query, target)
print(nice_algn)
## $query_aligned
## [1] "-elephant"
##
## $matched_aligned
## [1] "-|||||.|."
##
## $target_aligned
## [1] "telephone"
align()
. As mentioned above, align()
must use
the arguments task="path"
and
cigarFormat="extended"
in order for the CIGAR to be
informative enough for getNiceAlignment()
to work
properly.alignResult
alignResult
query
and
target
(default="-"
). This must be a single
character, i.e. a string of length 1 (i.e. nchar(gapSymbol)
must equal 1).The function nice_print()
simply prints the output of
getNiceAlignment()
to the console for quickly inspecting
the alignment. Users can think of this function as a “pretty-print”
function for visualization.
library(edlibR)
## example above from getNiceAlignment()
query = "elephant"
target = "telephone"
result = align(query, target, task = "path")
nice_algn = getNiceAlignment(result, query, target)
nice_print(nice_algn)
## [1] "query: -elephant"
## [1] "matched: -|||||.|."
## [1] "target: telephone"
For more information regarding edlib, please see the publication in Bioinformatics.