Difference between revisions of "Alignment"
imported>Eunjin RYU (Created page with "<p><span style="font-size:24px"><u><strong>Sequence Alignment</strong></u></span></p> <p> </p> <p> </p> <p><span style="font-size:20px"><strong>What is sequence alig...") |
imported>Na kyung Jung |
||
(2 intermediate revisions by one other user not shown) | |||
Line 1: | Line 1: | ||
− | < | + | <h1>Sequence alignment</h1> |
− | <p> </p> | + | <p>In <a href="https://en.wikipedia.org/wiki/Bioinformatics" title="Bioinformatics">bioinformatics</a>, a <strong>sequence alignment</strong> is a way of arranging the sequences of <a href="https://en.wikipedia.org/wiki/DNA" title="DNA">DNA</a>, <a href="https://en.wikipedia.org/wiki/RNA" title="RNA">RNA</a>, or protein to identify regions of similarity that may be a consequence of functional, <a href="https://en.wikipedia.org/wiki/Structural_biology" title="Structural biology">structural</a>, or <a href="https://en.wikipedia.org/wiki/Evolution" title="Evolution">evolutionary</a> relationships between the sequences.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-mount-1">[1]</a></sup> Aligned sequences of <a href="https://en.wikipedia.org/wiki/Nucleotide" title="Nucleotide">nucleotide</a> or <a href="https://en.wikipedia.org/wiki/Amino_acid" title="Amino acid">amino acid</a> residues are typically represented as rows within a <a href="https://en.wikipedia.org/wiki/Matrix_(mathematics)" title="Matrix (mathematics)">matrix</a>. Gaps are inserted between the <a href="https://en.wikipedia.org/wiki/Residue_(chemistry)" title="Residue (chemistry)">residues</a> so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the <a href="https://en.wikipedia.org/wiki/Edit_distance" title="Edit distance">edit distance cost</a> between strings in a <a href="https://en.wikipedia.org/wiki/Natural_language" title="Natural language">natural language</a> or in financial data.</p> |
− | <p>< | + | <p><a href="https://en.wikipedia.org/wiki/File:Histone_Alignment.png"><img alt="" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Histone_Alignment.png/595px-Histone_Alignment.png" style="height:205px; width:595px" /></a></p> |
− | <p> </p> | + | <p>A sequence alignment, produced by <a href="https://en.wikipedia.org/wiki/ClustalO" title="ClustalO">ClustalO</a>, of mammalian <a href="https://en.wikipedia.org/wiki/Histone" title="Histone">histone</a> proteins. <br /> |
+ | Sequences are the <a href="https://en.wikipedia.org/wiki/Amino_acid#Table_of_standard_amino_acid_abbreviations_and_side_chain_properties" title="Amino acid">amino acids</a> for residues 120-180 of the proteins. Residues that are conserved across all sequences are highlighted in grey. Below the protein sequences is a key denoting <a href="https://en.wikipedia.org/wiki/Conserved_sequence" title="Conserved sequence">conserved sequence</a> (*), <a href="https://en.wikipedia.org/wiki/Conservative_mutation" title="Conservative mutation">conservative mutations</a> (:), semi-conservative mutations (.), and <a href="https://en.wikipedia.org/wiki/Segregating_site" title="Segregating site">non-conservative mutations</a> ( ).<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-2">[2]</a></sup></p> | ||
− | < | + | <h2>Contents</h2> |
− | < | + | <ul> |
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Interpretation">1 Interpretation</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Alignment_methods">2 Alignment methods</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Representations">3 Representations</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Global_and_local_alignments">4 Global and local alignments</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Pairwise_alignment">5 Pairwise alignment</a> | ||
+ | <ul> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Dot-matrix_methods">5.1 Dot-matrix methods</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Dynamic_programming">5.2 Dynamic programming</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Word_methods">5.3 Word methods</a></li> | ||
+ | </ul> | ||
+ | </li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Multiple_sequence_alignment">6 Multiple sequence alignment</a> | ||
+ | <ul> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Dynamic_programming_2">6.1 Dynamic programming</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Progressive_methods">6.2 Progressive methods</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Iterative_methods">6.3 Iterative methods</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Motif_finding">6.4 Motif finding</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Techniques_inspired_by_computer_science">6.5 Techniques inspired by computer science</a></li> | ||
+ | </ul> | ||
+ | </li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Structural_alignment">7 Structural alignment</a> | ||
+ | <ul> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#DALI">7.1 DALI</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#SSAP">7.2 SSAP</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Combinatorial_extension">7.3 Combinatorial extension</a></li> | ||
+ | </ul> | ||
+ | </li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Phylogenetic_analysis">8 Phylogenetic analysis</a> | ||
+ | <ul> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Assessment_of_significance">8.1 Assessment of significance</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Assessment_of_credibility">8.2 Assessment of credibility</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Scoring_functions">8.3 Scoring functions</a></li> | ||
+ | </ul> | ||
+ | </li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Other_biological_uses">9 Other biological uses</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Non-biological_uses">10 Non-biological uses</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#Software">11 Software</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#See_also">12 See also</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#References">13 References</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_alignment#External_links">14 External links</a></li> | ||
+ | </ul> | ||
− | < | + | <h2>Interpretation</h2> |
− | <p>< | + | <p>If two sequences in an alignment share a common ancestor, mismatches can be interpreted as <a href="https://en.wikipedia.org/wiki/Point_mutation" title="Point mutation">point mutations</a> and gaps as <a href="https://en.wikipedia.org/wiki/Indel" title="Indel">indels</a> (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between <a href="https://en.wikipedia.org/wiki/Amino_acid" title="Amino acid">amino acids</a> occupying a particular position in the sequence can be interpreted as a rough measure of how <a href="https://en.wikipedia.org/wiki/Conservation_(genetics)" title="Conservation (genetics)">conserved</a> a particular region or <a href="https://en.wikipedia.org/wiki/Sequence_motif" title="Sequence motif">sequence motif</a> is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose <a href="https://en.wikipedia.org/wiki/Side_chain" title="Side chain">side chains</a> have similar biochemical properties) in a particular region of the sequence, suggests that this region has structural or functional importance.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-predict-3">[3]</a></sup> Although DNA and RNA <a href="https://en.wikipedia.org/wiki/Nucleotide" title="Nucleotide">nucleotide</a> bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.</p> |
− | < | + | <h2>Alignment methods</h2> |
− | <p>< | + | <p>Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: <em>global alignments</em> and <em>local alignments</em>. Calculating a global alignment is a form of <a href="https://en.wikipedia.org/wiki/Global_optimization" title="Global optimization">global optimization</a> that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-Polyanovsky2011-4">[4]</a></sup> A variety of computational algorithms have been applied to the sequence alignment problem. These include slow but formally correct methods such as <a href="https://en.wikipedia.org/wiki/Dynamic_programming" title="Dynamic programming">dynamic programming</a>, as well as efficient <a href="https://en.wikipedia.org/wiki/Heuristic_algorithm" title="Heuristic algorithm">heuristic algorithms</a> and <a href="https://en.wikipedia.org/wiki/Probability" title="Probability">probabilistic</a> methods designed for large-scale database search, which are not guaranteed to find the best matches.</p> |
− | < | + | <h2>Representations</h2> |
− | <p>< | + | <p>Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol marks a column in which the aligned residues are identical; other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the <a href="https://en.wikipedia.org/wiki/Conservation_(genetics)" title="Conservation (genetics)">conservation</a> of a given amino acid substitution. For multiple sequences, the last row is often the <a href="https://en.wikipedia.org/wiki/Consensus_sequence" title="Consensus sequence">consensus sequence</a> determined by the alignment; the consensus sequence is also often represented in graphical format with a <a href="https://en.wikipedia.org/wiki/Sequence_logo" title="Sequence logo">sequence logo</a> in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-Schneider-5">[5]</a></sup></p> |
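<p>As a minimal sketch of the text representation described above, the following Python snippet derives a conservation row and a consensus row from already-aligned sequences. The function names and the example rows are illustrative assumptions, and only the asterisk (full identity) is implemented; the colon and period symbols would need a table of residue similarity groups that is not given here.</p>

```python
from collections import Counter


def conservation_line(rows):
    """Return a marker row: '*' where every sequence has the same residue."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    marks = []
    for column in zip(*rows):
        # A column of gaps only is not counted as conserved.
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


def consensus(rows):
    """Most frequent character in each column (hypothetical simple rule)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


rows = ["GATTACA", "GA-TACA", "GCTTACA"]  # made-up aligned sequences
for row in rows:
    print(row)
print(conservation_line(rows))
print(consensus(rows))
```

<p>Real tools apply more elaborate consensus rules (thresholds, ambiguity codes); the majority vote here is only the simplest possible choice.</p>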
+ | |||
+ | <p>Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a limited number of input and output formats, such as <a href="https://en.wikipedia.org/wiki/FASTA_format" title="FASTA format">FASTA format</a> and <a href="https://en.wikipedia.org/wiki/GenBank" title="GenBank">GenBank</a> format, and the output is not easily editable. Several conversion programs that provide graphical and/or command line interfaces are available, such as <a href="https://web.archive.org/web/20071024223546/http://bioweb.pasteur.fr/seqanal/interfaces/readseq.html" rel="nofollow">READSEQ</a> and <a href="https://en.wikipedia.org/wiki/EMBOSS" title="EMBOSS">EMBOSS</a>. There are also several programming packages that provide this conversion functionality, such as <a href="https://en.wikipedia.org/wiki/BioPython" title="BioPython">BioPython</a>, <a href="https://en.wikipedia.org/wiki/BioRuby" title="BioRuby">BioRuby</a> and <a href="https://en.wikipedia.org/wiki/BioPerl" title="BioPerl">BioPerl</a>. <a href="https://en.wikipedia.org/wiki/SAM_(file_format)" title="SAM (file format)">SAM/BAM files</a> use the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string format to represent an alignment of a sequence to a reference by encoding a sequence of events (e.g. match/mismatch, insertions, deletions).<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-6">[6]</a></sup></p> | ||
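<p>To make the CIGAR encoding concrete, here is a small Python sketch that splits a CIGAR string into (length, operation) events and counts how many reference and query positions the alignment covers. The helper names and the example string are assumptions made for illustration; the operation letters follow the SAM convention, in which M, D, N, = and X consume the reference and M, I, S, = and X consume the query.</p>

```python
import re

# One CIGAR event: a run length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) events."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches stray characters the regex skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(ops):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in ops if op in "MDN=X")


def query_length(ops):
    """Number of query bases accounted for, including soft clips."""
    return sum(n for n, op in ops if op in "MIS=X")


events = parse_cigar("3M1I3M1D5M")  # made-up example
print(events)
print(reference_span(events), query_length(events))
```

<p>In this example the single insertion adds a query base without advancing along the reference, and the single deletion does the opposite, so both totals come out equal.</p>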
+ | |||
+ | <h2>Global and local alignments</h2> | ||
+ | |||
+ | <p>Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm" title="Needleman–Wunsch algorithm">Needleman–Wunsch algorithm</a>, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The <a href="https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm" title="Smith–Waterman algorithm">Smith–Waterman algorithm</a> is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-Polyanovsky2011-4">[4]</a></sup></p> | ||
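<p>The relationship between the two algorithms can be sketched in a few lines of Python. The function below computes only the optimal score, not the alignment itself, and the scoring values (match +1, mismatch -1, linear gap -2) are arbitrary assumptions. With <code>local=False</code> it follows the Needleman–Wunsch scheme; with <code>local=True</code> it adds the Smith–Waterman choices of restarting at zero in any cell and ending at the best cell.</p>

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal global (default) or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are paid for along the first row and column.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                score = max(score, 0)  # local: an alignment may start anywhere
            H[i][j] = score
            best = max(best, score)
    # Local: end at the best cell; global: end in the bottom-right corner.
    return best if local else H[-1][-1]


x = "TTTGATTACATTT"  # made-up sequences sharing the core GATTACA
y = "CCCGATTACACCC"
print(alignment_score(x, y))              # global score
print(alignment_score(x, y, local=True))  # local score
```

<p>For these two sequences the global score is dragged down by the unrelated flanks, while the local score reflects only the shared core.</p>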
+ | |||
+ | <p>Hybrid methods, known as semi-global or "glocal" (short for <strong>glo</strong>bal-lo<strong>cal</strong>) methods, search for the best possible partial alignment of the two sequences (a subset of one or both starts and one or both ends has to be chosen before aligning). This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-brudno-7">[7]</a></sup> Another case where semi-global alignment is useful is when one sequence is short (for example a gene sequence) and the other is very long (for example a chromosome sequence). In that case, the short sequence should be globally (fully) aligned but only a local (partial) alignment is desired for the long sequence.</p> | ||
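<p>The short-versus-long case can be illustrated with a variant of the same dynamic programming scheme in which the long sequence may be skipped for free at both ends. This is a sketch under assumed scoring values (match +1, mismatch -1, gap -2), and <code>fit_score</code> is a hypothetical name, not a function from any particular package.</p>

```python
def fit_score(short, long_seq, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of `short` to some stretch of `long_seq`."""
    # Row 0 is all zeros: skipping any prefix of the long sequence is free.
    prev = [0] * (len(long_seq) + 1)
    for i in range(1, len(short) + 1):
        # Column 0: the first i residues of `short` face gaps only.
        cur = [i * gap] + [0] * len(long_seq)
        for j in range(1, len(long_seq) + 1):
            same = short[i - 1] == long_seq[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # Taking the maximum of the last row makes any suffix of the long
    # sequence free as well.
    return max(prev)


print(fit_score("GATTACA", "CCCCGATTACACCCC"))
print(fit_score("GATTACA", "CCCCGATACACCCC"))
```

<p>The overlap case mentioned above (the end of one sequence matching the start of another) uses the same idea with a different choice of which ends are free.</p>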
+ | |||
+ | <h2>Pairwise alignment</h2> | ||
+ | ||
+ | <p>Pairwise sequence alignment methods are used to find the best-matching piecewise (local or global) alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods;<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-mount-1">[1]</a></sup> however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low <a href="https://en.wikipedia.org/wiki/Information_content" title="Information content">information content</a>, especially where the number of repetitions differs in the two sequences to be aligned. One way of quantifying the utility of a given pairwise alignment is the 'maximal unique match' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness.</p> | ||
+ | |||
+ | <h3>Dot-matrix methods</h3> | ||
+ | |||
+ | <table> | ||
+ | <tbody> | ||
+ | <tr> | ||
+ | <td><a href="https://en.wikipedia.org/wiki/File:Mup_locus_showing_DNA_repeats.jpg"><img alt="" src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/Mup_locus_showing_DNA_repeats.jpg/200px-Mup_locus_showing_DNA_repeats.jpg" style="height:236px; width:200px" /></a> | ||
+ | |||
+ | <p>Self comparison of a part of a mouse strain genome. The dot-plot shows a patchwork of lines, demonstrating duplicated segments of DNA.</p> | ||
+ | </td> | ||
+ | </tr> | ||
+ | </tbody> | ||
+ | </table> | ||
+ | |||
+ | <p>See main article on <a href="https://en.wikipedia.org/wiki/Dot_plot_(bioinformatics)" title="Dot plot (bioinformatics)">dot plots (bioinformatics)</a>.</p> | ||
+ | |||
+ | <table> | ||
+ | <tbody> | ||
+ | <tr> | ||
+ | <td><a href="https://en.wikipedia.org/wiki/File:Zinc-finger-dot-plot.png"><img alt="" src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Zinc-finger-dot-plot.png/200px-Zinc-finger-dot-plot.png" style="height:200px; width:200px" /></a> | ||
+ | |||
+ | <p>A DNA <a href="https://en.wikipedia.org/wiki/Dot_plot_(bioinformatics)" title="Dot plot (bioinformatics)">dot plot</a> of a <a href="https://en.wikipedia.org/wiki/Human" title="Human">human</a> <a href="https://en.wikipedia.org/wiki/Zinc_finger" title="Zinc finger">zinc finger</a> <a href="https://en.wikipedia.org/wiki/Transcription_factor" title="Transcription factor">transcription factor</a> (GenBank ID NM_002383), showing regional <a href="https://en.wikipedia.org/wiki/Self-similarity" title="Self-similarity">self-similarity</a>. The main diagonal represents the sequence's alignment with itself; lines off the main diagonal represent similar or repetitive patterns within the sequence. This is a typical example of a <a href="https://en.wikipedia.org/wiki/Recurrence_plot" title="Recurrence plot">recurrence plot</a>.</p> | ||
+ | </td> | ||
+ | </tr> | ||
+ | </tbody> | ||
+ | </table> | ||
+ | |||
+ | <p>The dot-matrix approach, which implicitly produces a family of alignments for individual sequence regions, is qualitative and conceptually simple, though time-consuming to analyze on a large scale. In the absence of noise, it can be easy to visually identify certain sequence features (such as insertions, deletions, repeats, or <a href="https://en.wikipedia.org/wiki/Inverted_repeat" title="Inverted repeat">inverted repeats</a>) from a dot-matrix plot. To construct a <a href="https://en.wikipedia.org/wiki/Dot_plot_(bioinformatics)" title="Dot plot (bioinformatics)">dot-matrix plot</a>, the two sequences are written along the top row and leftmost column of a two-dimensional <a href="https://en.wikipedia.org/wiki/Matrix_(mathematics)" title="Matrix (mathematics)">matrix</a>, and a dot is placed at any point where the characters in the corresponding row and column match; this is a typical <a href="https://en.wikipedia.org/wiki/Recurrence_plot" title="Recurrence plot">recurrence plot</a>. Some implementations vary the size or intensity of the dot depending on the degree of similarity of the two characters, to accommodate conservative substitutions. The dot plots of very closely related sequences will appear as a single line along the matrix's <a href="https://en.wikipedia.org/wiki/Main_diagonal" title="Main diagonal">main diagonal</a>.</p> | ||
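<p>The construction just described is simple enough to sketch directly. The Python snippet below builds the matrix of exact character matches and prints it as text; the function names are illustrative, and no windowing or noise filtering is applied, which practical dot-plot tools usually add.</p>

```python
def dot_matrix(seq_x, seq_y):
    """matrix[r][c] is True where seq_y[r] equals seq_x[c]."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Text dot plot: seq_x along the top row, seq_y down the left column."""
    lines = ["  " + seq_x]
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(y + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Plotting a made-up sequence against itself: the main diagonal is filled,
# and off-diagonal dots mark characters that recur elsewhere.
print(render("GATTACA", "GATTACA"))
```

<p>With single-character matching on a four-letter alphabet, roughly a quarter of all cells light up by chance, which is the noise problem that sliding-window variants are meant to reduce.</p>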
+ | |||
+ | <p>Problems with dot plots as an information display technique include noise, lack of clarity, non-intuitiveness, and difficulty extracting match summary statistics and match positions on the two sequences. Much space is also wasted: the match data is inherently duplicated across the diagonal, and most of the actual area of the plot is taken up by either empty space or noise. Finally, dot plots are limited to two sequences. None of these limitations applies to Miropeats alignment diagrams, but those have their own particular flaws.</p> | ||
+ | |||
+ | <p>Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect can occur when a protein consists of multiple similar <a href="https://en.wikipedia.org/wiki/Structural_domain" title="Structural domain">structural domains</a>.</p> | ||
+ | |||
+ | <h3>Dynamic programming</h3> | ||
+ | ||
+ | <p>The technique of <a href="https://en.wikipedia.org/wiki/Dynamic_programming" title="Dynamic programming">dynamic programming</a> can be applied to produce global alignments via the <a href="https://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm" title="Needleman-Wunsch algorithm">Needleman–Wunsch algorithm</a>, and local alignments via the <a href="https://en.wikipedia.org/wiki/Smith-Waterman_algorithm" title="Smith-Waterman algorithm">Smith–Waterman algorithm</a>. In typical usage, protein alignments use a <a href="https://en.wikipedia.org/wiki/Substitution_matrix" title="Substitution matrix">substitution matrix</a> to assign scores to amino-acid matches or mismatches, and a <a href="https://en.wikipedia.org/wiki/Gap_penalty" title="Gap penalty">gap penalty</a> for matching an amino acid in one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in practice often simply assign a positive match score, a negative mismatch score, and a negative gap penalty. (In standard dynamic programming, the score of each residue position is independent of the identity of its neighbors, and therefore <a href="https://en.wikipedia.org/wiki/Base_stacking" title="Base stacking">base stacking</a> effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.) A common extension to standard linear gap costs is the use of two different gap penalties, one for opening a gap and one for extending it. Typically the former is much larger than the latter, e.g. -10 for gap open and -2 for gap extension. As a result, the number of gaps in an alignment is usually reduced, and residues and gaps are kept together, which typically makes more biological sense. The Gotoh algorithm implements affine gap costs by using three matrices.</p> | ||
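<p>The three-matrix idea can be sketched as follows. This Python function computes only the optimal global score under affine gap costs, using the example penalties from the text (-10 to open, -2 to extend) and assumed match and mismatch scores of +5 and -4. The convention that a gap of length <em>k</em> costs <code>gap_open + (k - 1) * gap_extend</code> is also an assumption, since implementations differ on whether the first gap position pays the extension penalty too.</p>

```python
NEG_INF = float("-inf")


def gotoh_score(a, b, match=5, mismatch=-4, gap_open=-10, gap_extend=-2):
    """Global alignment score with affine gaps, using three DP matrices."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap;
    # Y: b[j-1] aligned to a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a gap costs gap_open; continuing one costs gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


print(gotoh_score("GATTACA", "GATACA"))     # one gap of length 1
print(gotoh_score("GATTTTACA", "GATACA"))   # one gap of length 3
```

<p>Because extending is cheaper than opening, the second example prefers a single three-residue gap over three separate gaps, which is the behaviour the affine model is meant to encourage.</p>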
+ | |||
+ | <p>Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account <a href="https://en.wikipedia.org/wiki/Frameshift" title="Frameshift">frameshift</a> mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The <a href="https://en.wikipedia.org/wiki/BLAST" title="BLAST">BLAST</a> and <a href="https://en.wikipedia.org/wiki/EMBOSS" title="EMBOSS">EMBOSS</a> suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). More general methods are available from both commercial sources, such as <em>FrameSearch</em>, distributed as part of the <a href="https://en.wikipedia.org/wiki/Accelrys" title="Accelrys">Accelrys</a> <a href="https://en.wikipedia.org/wiki/GCG_(software)" title="GCG (software)">GCG package</a>, and <a href="https://en.wikipedia.org/wiki/Open_Source" title="Open Source">Open Source</a> software such as <a href="http://www.ebi.ac.uk/Tools/psa/genewise/" rel="nofollow">Genewise</a>.</p> | ||
+ | |||
+ | <p>The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of sequences or extremely long sequences.</p> | ||
+ | |||
+ | <h3>Word methods</h3> | ||
+ | |||
+ | <p>Word methods, also known as <em>k</em>-tuple methods, are <a href="https://en.wikipedia.org/wiki/Heuristic" title="Heuristic">heuristic</a> methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools <a href="https://en.wikipedia.org/wiki/FASTA" title="FASTA">FASTA</a> and the <a href="https://en.wikipedia.org/wiki/BLAST" title="BLAST">BLAST</a> family.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-mount-1">[1]</a></sup> Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.</p> | ||
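<p>The offset idea can be illustrated with a short Python sketch. It cuts the query into nonoverlapping words of length <em>k</em>, looks each one up at every position of the target, and tallies the difference between the two positions. The function name, the choice of <em>k</em> = 4 and the example sequences are assumptions; this is not the actual FASTA or BLAST procedure, only the shared underlying principle.</p>

```python
from collections import Counter, defaultdict


def word_offsets(query, target, k=4):
    """Count, per offset, how many query words match the target there."""
    index = defaultdict(list)
    # Nonoverlapping words from the query, as described in the text.
    for i in range(0, len(query) - k + 1, k):
        index[query[i:i + k]].append(i)
    offsets = Counter()
    # Every position of the target is a candidate match site.
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            offsets[j - i] += 1
    return offsets


hits = word_offsets("ACGTTGCA", "TTTACGTTGCATT")  # made-up sequences
print(hits.most_common(1))
```

<p>Here both query words land on the same offset, which is the signal that would trigger a more sensitive alignment of that region; a target with no repeated offset would be discarded cheaply.</p>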
+ | |||
+ | <p>In the FASTA method, the user defines a value <em>k</em> to use as the word length with which to search the database. The method is slower but more sensitive at lower values of <em>k</em>, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length <em>k</em>, but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as <a href="http://www.ebi.ac.uk/fasta33/" rel="nofollow">EMBL FASTA</a> and <a href="https://www.ncbi.nlm.nih.gov/BLAST/" rel="nofollow">NCBI BLAST</a>.</p> | ||
+ | |||
+ | <h2>Multiple sequence alignment</h2> | ||
+ | |||
+ | <p>Main article: <a href="https://en.wikipedia.org/wiki/Multiple_sequence_alignment" title="Multiple sequence alignment">Multiple sequence alignment</a></p> | ||
+ | |||
+ | <p><a href="https://en.wikipedia.org/wiki/File:Hemagglutinin-alignments.png"><img alt="" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Hemagglutinin-alignments.png/300px-Hemagglutinin-alignments.png" style="height:322px; width:300px" /></a></p> | ||
+ | |||
+ | <p>Alignment of 27 <a href="https://en.wikipedia.org/wiki/Avian_influenza" title="Avian influenza">avian influenza</a> <a href="https://en.wikipedia.org/wiki/Hemagglutinin" title="Hemagglutinin">hemagglutinin</a> protein sequences colored by residue conservation (top) and residue properties (bottom).</p> | ||
+ | |||
+ | <p><a href="https://en.wikipedia.org/wiki/Multiple_sequence_alignment" title="Multiple sequence alignment">Multiple sequence alignment</a> is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying <a href="https://en.wikipedia.org/wiki/Conservation_(genetics)" title="Conservation (genetics)">conserved</a> sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and <a href="https://en.wikipedia.org/wiki/Reaction_mechanism" title="Reaction mechanism">mechanistic</a> information to locate the catalytic <a href="https://en.wikipedia.org/wiki/Active_site" title="Active site">active sites</a> of <a href="https://en.wikipedia.org/wiki/Enzyme" title="Enzyme">enzymes</a>. Alignments are also used to aid in establishing evolutionary relationships by constructing <a href="https://en.wikipedia.org/wiki/Phylogenetic_tree" title="Phylogenetic tree">phylogenetic trees</a>. Multiple sequence alignments are computationally difficult to produce and most formulations of the problem lead to <a href="https://en.wikipedia.org/wiki/NP-complete" title="NP-complete">NP-complete</a> combinatorial optimization problems.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-wang-8">[8]</a></sup><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-elias-9">[9]</a></sup> Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.</p> | ||
+ | |||
+ | <h3>Dynamic programming</h3> | ||
+ | |||
+ | <p>The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and <a href="https://en.wikipedia.org/wiki/Computer_memory" title="Computer memory">memory</a>, it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the <em>n</em>-dimensional equivalent of the sequence matrix formed from two sequences, where <em>n</em> is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" <a href="https://en.wikipedia.org/wiki/Objective_function" title="Objective function">objective function</a>, has been implemented in the <a href="https://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html" rel="nofollow">MSA</a> software package.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-lipman-10">[10]</a></sup></p> | ||
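+ | |||
+ | <p>As a minimal sketch of the <em>n</em>-dimensional recurrence described above, the following Python fragment fills a three-dimensional table for three sequences and returns the optimal sum-of-pairs score. The function names and the match, mismatch, and gap values are illustrative assumptions, not parameters of the MSA package, and no traceback or search-space pruning is attempted.</p> | ||

```python
from itertools import combinations, product

# Illustrative scoring values (assumptions, not taken from any published tool).
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a, b):
    """Score one pair of aligned characters; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other cost nothing
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def column_score(column):
    """Sum-of-pairs score of a single alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def align3_score(x, y, z):
    """Optimal sum-of-pairs score for three sequences via a 3-D table."""
    seqs = (x, y, z)
    # Each move consumes a character (1) or inserts a gap (0) per sequence;
    # the all-gap move is excluded, leaving 2**3 - 1 = 7 predecessors.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    table = {(0, 0, 0): 0}
    # product() yields cells in lexicographic order, so every predecessor
    # of a cell has already been filled in when the cell is reached.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == (0, 0, 0):
            continue
        best = float("-inf")
        for move in moves:
            prev = tuple(c - d for c, d in zip(cell, move))
            if min(prev) < 0:
                continue
            column = [s[c - 1] if d else "-"
                      for s, c, d in zip(seqs, cell, move)]
            best = max(best, table[prev] + column_score(column))
        table[cell] = best
    return table[tuple(len(s) for s in seqs)]
```

+ | <p>For three sequences of length <em>L</em> the table holds (<em>L</em> + 1)<sup>3</sup> cells, each with up to seven predecessors; for <em>n</em> sequences the number of predecessors grows to 2<sup><em>n</em></sup> − 1 per cell, which illustrates why the basic method is rarely applied to more than a few sequences.</p> | ||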
+ | |||
+ | <h3>Progressive methods</h3> | ||
+ | |||
+ | <p>Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to <a href="https://en.wikipedia.org/wiki/FASTA" title="FASTA">FASTA</a>. Progressive alignment results are dependent on the choice of "most related" sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy.</p> | ||
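+ | |||
+ | <p>The order in which a progressive method joins sequences can be sketched as follows. This Python example is only an illustration under stated assumptions: it uses plain edit distance as the pairwise measure and average-linkage clustering to decide the join order, whereas real programs use their own distance measures and tree-building methods. The names <code>edit_distance</code> and <code>guide_order</code> are hypothetical.</p> | ||

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def guide_order(seqs):
    """Return the list of cluster merges, most similar pair first.

    Clusters are tuples of sequence indices; distances between clusters
    are averaged over all member pairs (average linkage, an assumption).
    """
    dist = {frozenset(p): edit_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(range(len(seqs)), 2)}

    def average(c1, c2):
        total = sum(dist[frozenset((i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [(i,) for i in range(len(seqs))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: average(*p))
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges
```

+ | <p>Each merge in the returned list corresponds to one alignment step: the first entry names the two sequences to align first, and later entries name the groups to be aligned to one another. An error in an early pairwise distance changes this order, which is the sensitivity described above.</p> | ||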
+ | |||
+ | <p>Many variations of the <a href="https://en.wikipedia.org/wiki/Clustal" title="Clustal">Clustal</a> progressive implementation<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-higgins-11">[11]</a></sup><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-thompson-12">[12]</a></sup><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-chenna-13">[13]</a></sup> are used for multiple sequence alignment, phylogenetic tree construction, and as input for <a href="https://en.wikipedia.org/wiki/Protein_structure_prediction" title="Protein structure prediction">protein structure prediction</a>. A slower but more accurate variant of the progressive method is known as <a href="https://en.wikipedia.org/wiki/T-Coffee" title="T-Coffee">T-Coffee</a>.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-notredame-14">[14]</a></sup></p> | ||
+ | |||
+ | <h3>Iterative methods</h3> | ||
+ | |||
+ | <p>Iterative methods attempt to improve on the heavy dependence on the accuracy of the initial pairwise alignments, which is the weak point of the progressive methods. Iterative methods optimize an <a href="https://en.wikipedia.org/wiki/Objective_function" title="Objective function">objective function</a> based on a selected alignment scoring method by assigning an initial global alignment and then realigning sequence subsets. The realigned subsets are then themselves aligned to produce the next iteration's multiple sequence alignment. Various ways of selecting the sequence subgroups and objective function have been reviewed in the literature.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-hirosawa-15">[15]</a></sup></p> | ||
+ | |||
+ | <h3>Motif finding</h3> | ||
+ | |||
+ | <p>Motif finding, also known as profile analysis, constructs global multiple sequence alignments that attempt to align short conserved <a href="https://en.wikipedia.org/wiki/Sequence_motif" title="Sequence motif">sequence motifs</a> among the sequences in the query set. This is usually done by first constructing a general global multiple sequence alignment, after which the highly <a href="https://en.wikipedia.org/wiki/Conservation_(genetics)" title="Conservation (genetics)">conserved</a> regions are isolated and used to construct a set of profile matrices. The profile matrix for each conserved region is arranged like a scoring matrix but its frequency counts for each amino acid or nucleotide at each position are derived from the conserved region's character distribution rather than from a more general empirical distribution. The profile matrices are then used to search other sequences for occurrences of the motif they characterize. In cases where the original <a href="https://en.wikipedia.org/wiki/Data_set" title="Data set">data set</a> contained a small number of sequences, or only highly related sequences, <a href="https://en.wikipedia.org/wiki/Pseudocount" title="Pseudocount">pseudocounts</a> are added to normalize the character distributions represented in the motif.</p> | ||
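+ | |||
+ | <p>The construction of a profile matrix with pseudocounts, and its use to scan another sequence, can be sketched as below. This is a minimal illustration that assumes a DNA alphabet, a pseudocount of 1 per character, and a uniform background distribution; the function names <code>build_profile</code> and <code>best_hit</code> are hypothetical.</p> | ||

```python
import math
from collections import Counter


def build_profile(block, alphabet="ACGT", pseudocount=1.0):
    """Per-column character frequencies of an ungapped conserved block.

    `block` is a list of equal-length strings. The pseudocount keeps
    characters that were never observed from receiving probability zero.
    """
    profile = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({ch: (counts[ch] + pseudocount) / total
                        for ch in alphabet})
    return profile


def best_hit(profile, sequence):
    """Return (start, log-odds score) of the best-scoring window.

    Scores are log2 ratios against a uniform background (an assumption);
    characters outside the profile's alphabet raise KeyError.
    """
    width = len(profile)
    background = 1.0 / len(profile[0])
    best_start, best_score = None, float("-inf")
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(math.log2(col[ch] / background)
                    for col, ch in zip(profile, window))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score
```

+ | <p>For example, a profile built from the block <code>["TATA", "TATA", "TACA"]</code> assigns the unobserved characters in each column a small nonzero probability, so a near-match in a scanned sequence is penalized rather than excluded outright.</p> | ||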
+ | |||
+ | <h3>Techniques inspired by computer science</h3> | ||
+ | |||
+ | <p>A variety of general <a href="https://en.wikipedia.org/wiki/Optimization_(mathematics)" title="Optimization (mathematics)">optimization</a> algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem. <a href="https://en.wikipedia.org/wiki/Hidden_Markov_model" title="Hidden Markov model">Hidden Markov models</a> have been used to produce probability scores for a family of possible multiple sequence alignments for a given query set; although early HMM-based methods produced underwhelming performance, later applications have found them especially effective in detecting remotely related sequences because they are less susceptible to noise created by conservative or semiconservative substitutions.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-karplus-16">[16]</a></sup> <a href="https://en.wikipedia.org/wiki/Genetic_algorithm" title="Genetic algorithm">Genetic algorithms</a> and <a href="https://en.wikipedia.org/wiki/Simulated_annealing" title="Simulated annealing">simulated annealing</a> have also been used in optimizing multiple sequence alignment scores as judged by a scoring function like the sum-of-pairs method. More complete details and software packages can be found in the main article <a href="https://en.wikipedia.org/wiki/Multiple_sequence_alignment" title="Multiple sequence alignment">multiple sequence alignment</a>.</p> | ||
+ | |||
+ | <p>The <a href="https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform" title="Burrows–Wheeler transform">Burrows–Wheeler transform</a> has been successfully applied to fast short read alignment in popular tools such as <a href="https://en.wikipedia.org/wiki/Bowtie_(sequence_analysis)" title="Bowtie (sequence analysis)">Bowtie</a> and BWA. See <a href="https://en.wikipedia.org/wiki/FM-index" title="FM-index">FM-index</a>.</p> | ||
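+ | |||
+ | <p>To illustrate the idea behind such tools, the sketch below computes the Burrows–Wheeler transform of a short text and counts exact occurrences of a pattern by backward search, the core operation of an FM-index. It is a teaching example under simplifying assumptions: rotations are sorted naively, rank queries are recomputed by scanning, and the text must not contain the terminator <code>$</code>. It does not reflect how Bowtie or BWA are implemented.</p> | ||

```python
def bwt(text):
    """Burrows-Wheeler transform using '$' as a unique terminator.

    Naive construction: sort all rotations and read the last column.
    Assumes '$' sorts before every character in `text`.
    """
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bw, pattern):
    """Count exact matches of `pattern` in the original text.

    Backward search over the transformed string `bw`: the half-open
    interval [lo, hi) of sorted rotations is narrowed one pattern
    character at a time, starting from the last character.
    """
    # first_row[c] = index of the first rotation starting with c.
    first_row = {}
    for index, ch in enumerate(sorted(bw)):
        first_row.setdefault(ch, index)

    def rank(ch, i):
        """Occurrences of ch in bw[:i]; a real FM-index precomputes this."""
        return bw[:i].count(ch)

    lo, hi = 0, len(bw)
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + rank(ch, lo)
        hi = first_row[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

+ | <p>With the text <code>banana</code>, <code>bwt</code> returns <code>annb$aa</code>, and searching that string for <code>ana</code> reports two occurrences. The search cost depends on the pattern length rather than on a scan of the whole text, provided the rank queries are precomputed.</p> | ||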
+ | |||
+ | <h2>Structural alignment</h2> | ||
+ | |||
+ | <p>Main article: <a href="https://en.wikipedia.org/wiki/Structural_alignment" title="Structural alignment">Structural alignment</a></p> | ||
+ | |||
+ | <p>Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the <a href="https://en.wikipedia.org/wiki/Secondary_structure" title="Secondary structure">secondary</a> and <a href="https://en.wikipedia.org/wiki/Tertiary_structure" title="Tertiary structure">tertiary structure</a> of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through <a href="https://en.wikipedia.org/wiki/X-ray_crystallography" title="X-ray crystallography">X-ray crystallography</a> or <a href="https://en.wikipedia.org/wiki/NMR_spectroscopy" title="NMR spectroscopy">NMR spectroscopy</a>). Because both protein and RNA structures are more evolutionarily conserved than sequence,<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-chothia-17">[17]</a></sup> structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.</p> | ||
+ | |||
+ | <p>Structural alignments are used as the "gold standard" in evaluating alignments for homology-based <a href="https://en.wikipedia.org/wiki/Protein_structure_prediction" title="Protein structure prediction">protein structure prediction</a><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-skolnick-18">[18]</a></sup> because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, structural alignments clearly cannot be used in structure prediction itself, because at least one sequence in the query set is the target to be modeled, whose structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-skolnick-18">[18]</a></sup></p> | ||
+ | |||
+ | <h3>DALI</h3> | ||
+ | |||
+ | <p>The DALI method, or <a href="https://en.wikipedia.org/wiki/Distance_matrix" title="Distance matrix">distance matrix</a> alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-holm-19">[19]</a></sup> It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank" title="Protein Data Bank">Protein Data Bank</a> (PDB). It has been used to construct the <a href="https://en.wikipedia.org/wiki/Families_of_structurally_similar_proteins" title="Families of structurally similar proteins">FSSP</a> structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at <a href="https://web.archive.org/web/20090301064750/http://ekhidna.biocenter.helsinki.fi/dali_server/start" rel="nofollow">DALI</a> and the FSSP is located at <a href="https://web.archive.org/web/20051125045348/http://ekhidna.biocenter.helsinki.fi/dali/start" rel="nofollow">The Dali Database</a>.</p> | ||
+ | |||
+ | <h3>SSAP</h3> | ||
+ | |||
+ | <p>SSAP (sequential structure alignment program) is a dynamic programming-based method of structural alignment that uses atom-to-atom vectors in structure space as comparison points. It has been extended since its original description to include multiple as well as pairwise alignments,<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-taylor-20">[20]</a></sup> and has been used in the construction of the <a href="https://en.wikipedia.org/wiki/CATH" title="CATH">CATH</a> (Class, Architecture, Topology, Homology) hierarchical database classification of protein folds.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-orengo-21">[21]</a></sup> The CATH database can be accessed at <a href="http://www.cathdb.info/" rel="nofollow">CATH Protein Structure Classification</a>.</p> | ||
+ | |||
+ | <h3>Combinatorial extension</h3> | ||
+ | |||
+ | <p>The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-shindyalov-22">[22]</a></sup> Based on measures such as rigid-body <a href="https://en.wikipedia.org/wiki/Root_mean_square_deviation_(bioinformatics)" title="Root mean square deviation (bioinformatics)">root mean square distance</a>, residue distances, local secondary structure, and surrounding environmental features such as residue neighbor <a href="https://en.wikipedia.org/wiki/Hydrophobic" title="Hydrophobic">hydrophobicity</a>, local alignments called "aligned fragment pairs" are generated and used to build a similarity matrix representing all possible structural alignments within predefined cutoff criteria. A path from one protein structure state to the other is then traced through the matrix by extending the growing alignment one fragment at a time. The optimal such path defines the combinatorial-extension alignment. A web-based server implementing the method and providing a database of pairwise alignments of structures in the Protein Data Bank is located at the <a href="https://web.archive.org/web/19981203071023/http://cl.sdsc.edu/" rel="nofollow">Combinatorial Extension</a> website.</p> | ||
+ | |||
+ | <h2>Phylogenetic analysis</h2> | ||
+ | |||
+ | <p>Main article: <a href="https://en.wikipedia.org/wiki/Computational_phylogenetics" title="Computational phylogenetics">Computational phylogenetics</a></p> | ||
+ | |||
+ | <p>Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-ortet-23">[23]</a></sup> The field of <a href="https://en.wikipedia.org/wiki/Phylogenetics" title="Phylogenetics">phylogenetics</a> makes extensive use of sequence alignments in the construction and interpretation of <a href="https://en.wikipedia.org/wiki/Phylogenetic_tree" title="Phylogenetic tree">phylogenetic trees</a>, which are used to classify the evolutionary relationships between homologous <a href="https://en.wikipedia.org/wiki/Gene" title="Gene">genes</a> represented in the <a href="https://en.wikipedia.org/wiki/Genome" title="Genome">genomes</a> of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young <a href="https://en.wikipedia.org/wiki/Most_recent_common_ancestor" title="Most recent common ancestor">most recent common ancestor</a>, while low identity suggests that the divergence is more ancient. This approximation, which reflects the "<a href="https://en.wikipedia.org/wiki/Molecular_clock" title="Molecular clock">molecular clock</a>" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the <a href="https://en.wikipedia.org/wiki/Coalescence_(genetics)" title="Coalescence (genetics)">coalescence</a> time), assumes that the effects of mutation and <a href="https://en.wikipedia.org/wiki/Natural_selection" title="Natural selection">selection</a> are constant across sequence lineages. 
Therefore, it does not account for possible differences among organisms or species in the rates of <a href="https://en.wikipedia.org/wiki/DNA_repair" title="DNA repair">DNA repair</a> or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between <a href="https://en.wikipedia.org/wiki/Silent_mutation" title="Silent mutation">silent mutations</a> that do not alter the meaning of a given <a href="https://en.wikipedia.org/wiki/Codon" title="Codon">codon</a> and other mutations that result in a different <a href="https://en.wikipedia.org/wiki/Amino_acid" title="Amino acid">amino acid</a> being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.</p> | ||
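+ | |||
+ | <p>As a rough illustration of how observed sequence difference is converted into an evolutionary distance, the sketch below applies the Jukes–Cantor correction, <em>d</em> = −(3/4) ln(1 − 4<em>p</em>/3), to the proportion <em>p</em> of differing sites in an aligned nucleotide pair. The Jukes–Cantor model is not discussed in the text above; it is used here only as a simple, well-known example that assumes equal rates for all substitutions, and the function names are hypothetical.</p> | ||

```python
import math


def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences.

    Columns containing a gap ('-') in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance (substitutions per site) from p-distance.

    The correction accounts for multiple substitutions at the same site
    and is undefined once p reaches the saturation value 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)
```

+ | <p>The corrected distance is always at least as large as the observed difference, and the gap between them widens as sequences diverge. Converting such a distance into elapsed time still requires the constant-rate assumption criticized above.</p> | ||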
+ | |||
+ | <p>Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly <a href="https://en.wikipedia.org/wiki/Heuristic" title="Heuristic">heuristic</a> because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is <a href="https://en.wikipedia.org/wiki/NP-hard" title="NP-hard">NP-hard</a>.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-felsenstein-24">[24]</a></sup></p> | ||
+ | |||
+ | <h3>Assessment of significance</h3> | ||
+ | |||
+ | <p>Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible that <a href="https://en.wikipedia.org/wiki/Convergent_evolution" title="Convergent evolution">convergent evolution</a> can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.</p> | ||
+ | |||
+ | <p>In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.</p> | ||
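+ | |||
+ | <p>The dependence on search-space size can be made concrete with the standard relation between a normalized (bit) score <em>S'</em> and the expected number of chance hits, <em>E</em> = <em>m</em> · <em>n</em> · 2<sup>−<em>S'</em></sup>, where <em>m</em> is the query length and <em>n</em> the total database length. The Python sketch below is a simplification: it uses raw lengths rather than the edge-corrected effective lengths that search programs apply, and the function names are hypothetical.</p> | ||

```python
import math


def expected_chance_hits(bit_score, query_length, database_length):
    """E-value: expected number of chance alignments scoring >= bit_score.

    Uses E = m * n * 2**(-S'), with raw (not effective) lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of at least one chance hit, P = 1 - exp(-E).

    Assumes the number of chance hits is Poisson distributed.
    """
    return -math.expm1(-e_value)
```

+ | <p>Under this relation, the same bit score is ten times less significant in a database ten times larger, and each additional bit of score halves the expected number of chance hits.</p> | ||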
+ | |||
+ | <p>Methods of statistical significance estimation for gapped sequence alignments are available in the literature.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-ortet-23">[23]</a></sup><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-altschul-25">[25]</a></sup><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-hartmann-26">[26]</a></sup><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-newberg-27">[27]</a></sup><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-eddy-28">[28]</a></sup><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-bastien-29">[29]</a></sup><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-agrawal11-30">[30]</a></sup><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-agrawal08-31">[31]</a></sup></p> | ||
+ | |||
+ | <h3>Assessment of credibility</h3> | ||
+ | |||
+ | <p>Statistical significance indicates the probability that an alignment of a given quality could arise by chance, but does not indicate how much superior a given alignment is to alternative alignments of the same sequences. Measures of alignment credibility indicate the extent to which the best scoring alignments for a given pair of sequences are substantially similar. Methods of alignment credibility estimation for gapped sequence alignments are available in the literature.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-NewbergLawrence2009-32">[32]</a></sup></p> | ||
+ | |||
+ | <h3>Scoring functions</h3> | ||
+ | |||
+ | <p>The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using <a href="https://en.wikipedia.org/wiki/Substitution_matrix" title="Substitution matrix">substitution matrices</a> that reflect the probabilities of given character-to-character substitutions. A series of matrices called <a href="https://en.wikipedia.org/wiki/Point_accepted_mutation" title="Point accepted mutation">PAM matrices</a> (Point Accepted Mutation matrices, originally defined by <a href="https://en.wikipedia.org/wiki/Margaret_Dayhoff" title="Margaret Dayhoff">Margaret Dayhoff</a> and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as <a href="https://en.wikipedia.org/wiki/BLOSUM" title="BLOSUM">BLOSUM</a> (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. <a href="https://en.wikipedia.org/wiki/Gap_penalty" title="Gap penalty">Gap penalties</a> account for the introduction of a gap (in the evolutionary model, an insertion or deletion mutation) in both nucleotide and protein sequences, and the penalty values should therefore be proportional to the expected rate of such mutations. The quality of the alignments produced thus depends on the quality of the scoring function.</p> | ||
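+ | |||
+ | <p>How a substitution matrix and gap penalties combine into a single alignment score can be sketched as follows. The example assumes a toy nucleotide matrix that distinguishes transitions from transversions and an affine gap model in which a gap of length <em>L</em> costs one opening penalty plus <em>L</em> − 1 extension penalties; all numeric values and function names are illustrative and are not drawn from PAM, BLOSUM, or any particular program.</p> | ||

```python
def nucleotide_matrix(match=2, transition=-1, transversion=-2):
    """Toy substitution matrix for DNA (values are assumptions).

    Transitions (purine<->purine or pyrimidine<->pyrimidine) are
    penalized less than transversions.
    """
    purines = "AG"
    matrix = {}
    for x in "ACGT":
        for y in "ACGT":
            if x == y:
                matrix[(x, y)] = match
            elif (x in purines) == (y in purines):
                matrix[(x, y)] = transition
            else:
                matrix[(x, y)] = transversion
    return matrix


def score_alignment(a, b, matrix, gap_open=-5, gap_extend=-1):
    """Score two gapped, equal-length rows of a pairwise alignment.

    The first column of each gap run costs `gap_open`; every further
    column of the same run costs `gap_extend`.
    """
    if len(a) != len(b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    open_gap_row = None  # which row currently carries an open gap
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            score += gap_extend if open_gap_row == row else gap_open
            open_gap_row = row
        else:
            score += matrix[(x, y)]
            open_gap_row = None
    return score
```

+ | <p>Rescoring the same alignment with different matrix or penalty values shows directly how the ranking of candidate alignments depends on the scoring function.</p> | ||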
+ | |||
+ | <p>It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.</p> | ||
+ | |||
+ | <h2>Other biological uses</h2> | ||
+ | |||
+ | <p>Sequenced RNA, such as <a href="https://en.wikipedia.org/wiki/Expressed_sequence_tags" title="Expressed sequence tags">expressed sequence tags</a> and full-length mRNAs, can be aligned to a sequenced genome to locate genes and obtain information about <a href="https://en.wikipedia.org/wiki/Alternative_splicing" title="Alternative splicing">alternative splicing</a><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-33">[33]</a></sup> and <a href="https://en.wikipedia.org/wiki/RNA_editing" title="RNA editing">RNA editing</a>.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-34">[34]</a></sup> Sequence alignment is also a part of <a href="https://en.wikipedia.org/wiki/Genome_assembly" title="Genome assembly">genome assembly</a>, where sequences are aligned to find overlap so that <em><a href="https://en.wikipedia.org/wiki/Contig" title="Contig">contigs</a></em> (long stretches of sequence) can be formed.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-35">[35]</a></sup> Another use is <a href="https://en.wikipedia.org/wiki/Single_nucleotide_polymorphism" title="Single nucleotide polymorphism">SNP</a> analysis, where sequences from different individuals are aligned to find single base pairs that often differ within a population.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-36">[36]</a></sup></p> | ||
+ | |||
+ | <h2>Non-biological uses</h2> | ||
+ | |||
+ | <p>The methods used for biological sequence alignment have also found applications in other fields, most notably in <a href="https://en.wikipedia.org/wiki/Natural_language_processing" title="Natural language processing">natural language processing</a> and in social sciences, where the <a href="https://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm" title="Needleman-Wunsch algorithm">Needleman-Wunsch algorithm</a> is usually referred to as <a href="https://en.wikipedia.org/wiki/Optimal_matching" title="Optimal matching">Optimal matching</a>.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-37">[37]</a></sup> Techniques that generate the set of elements from which words will be selected in natural-language generation algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of computer-generated mathematical proofs.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-Barzilay-38">[38]</a></sup> In the field of historical and comparative <a href="https://en.wikipedia.org/wiki/Linguistics" title="Linguistics">linguistics</a>, sequence alignment has been used to partially automate the <a href="https://en.wikipedia.org/wiki/Comparative_method_(linguistics)" title="Comparative method (linguistics)">comparative method</a> by which linguists traditionally reconstruct languages.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-39">[39]</a></sup> Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-prinzie-40">[40]</a></sup></p> | ||
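+ | |||
+ | <p>Because the Needleman-Wunsch recurrence only requires that elements can be compared, it applies unchanged to sequences of words or events. The sketch below aligns two lists of arbitrary items globally; the unit match, mismatch, and gap scores are assumptions chosen for illustration, and <code>None</code> marks a gap in the returned rows.</p> | ||

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two sequences of comparable items.

    Returns (score, aligned_a, aligned_b); None marks a gap.
    """
    n, m = len(a), len(b)
    # table[i][j] = best score for aligning a[:i] with b[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + diagonal,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal path.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diagonal = match if a[i - 1] == b[j - 1] else mismatch
            if table[i][j] == table[i - 1][j - 1] + diagonal:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append(None)
            i -= 1
        else:
            row_a.append(None)
            row_b.append(b[j - 1])
            j -= 1
    return table[n][m], row_a[::-1], row_b[::-1]
```

+ | <p>Aligning the word lists <code>"the cat sat".split()</code> and <code>"the cat quietly sat".split()</code> places a gap opposite <code>quietly</code>, in the same way that an insertion is represented in a biological alignment.</p> | ||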
+ | |||
+ | <h2>Software</h2> | ||
+ | |||
+ | <p>Main article: <a href="https://en.wikipedia.org/wiki/Sequence_alignment_software" title="Sequence alignment software">Sequence alignment software</a></p> | ||
+ | |||
+ | <p>A more complete list of available software categorized by algorithm and alignment type is available at <a href="https://en.wikipedia.org/wiki/Sequence_alignment_software" title="Sequence alignment software">sequence alignment software</a>, but common software tools used for general sequence alignment tasks include ClustalW2<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-41">[41]</a></sup> and T-Coffee<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-42">[42]</a></sup> for alignment, and BLAST<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-43">[43]</a></sup> and FASTA3x<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-44">[44]</a></sup> for database searching. Commercial tools such as <a href="https://en.wikipedia.org/wiki/DNASTAR" title="DNASTAR">DNASTAR Lasergene</a>, <a href="https://en.wikipedia.org/w/index.php?title=Geneious&action=edit&redlink=1" title="Geneious (page does not exist)">Geneious</a>, and <a href="https://en.wikipedia.org/wiki/PatternHunter" title="PatternHunter">PatternHunter</a> are also available. Tools annotated as performing <a href="http://edamontology.org/operation_0292" rel="nofollow">sequence alignment</a> are listed in the <a href="https://bio.tools/?page=1&function=%22Sequence%20alignment%22&sort=score" rel="nofollow">bio.tools</a> registry.</p> | ||
+ | |||
+ | <p>Alignment algorithms and software can be directly compared to one another using a standardized set of <a href="https://en.wikipedia.org/wiki/Benchmark_(computing)" title="Benchmark (computing)">benchmark</a> reference multiple sequence alignments known as BAliBASE.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-thompson2-45">[45]</a></sup> The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-46">[46]</a></sup><sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-thompson3-47">[47]</a></sup> A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.<sup><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_note-48">[48]</a></sup></p> | ||
+ | |||
+ | <h2>See also</h2> | ||
+ | |||
+ | <ul> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_homology" title="Sequence homology">Sequence homology</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Sequence_mining" title="Sequence mining">Sequence mining</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/BLAST" title="BLAST">BLAST</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/String_searching_algorithm" title="String searching algorithm">String searching algorithm</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Alignment-free_sequence_analysis" title="Alignment-free sequence analysis">Alignment-free sequence analysis</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/UGENE" title="UGENE">UGENE</a></li> | ||
+ | <li><a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm" title="Needleman–Wunsch algorithm">Needleman–Wunsch algorithm</a></li> | ||
+ | </ul> | ||
+ | |||
+ | <h2>References</h2> | ||
<ol> | <ol> | ||
+ | <li>^ <a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-mount_1-0"><sup><em><strong>a</strong></em></sup></a> <a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-mount_1-1"><sup><em><strong>b</strong></em></sup></a> <a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-mount_1-2"><sup><em><strong>c</strong></em></sup></a> <cite>Mount DM. (2004). <em>Bioinformatics: Sequence and Genome Analysis</em> (2nd ed.). Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY. <a href="https://en.wikipedia.org/wiki/International_Standard_Book_Number" title="International Standard Book Number">ISBN</a> <a href="https://en.wikipedia.org/wiki/Special:BookSources/978-0-87969-608-5" title="Special:BookSources/978-0-87969-608-5">978-0-87969-608-5</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-2" title="Jump up">^</a></strong> <cite><a href="http://www.ebi.ac.uk/Tools/msa/clustalw2/help/faq.html#23" rel="nofollow">"Clustal FAQ #Symbols"</a>. <em>Clustal</em>. Retrieved 8 December 2014.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-predict_3-0" title="Jump up">^</a></strong> <cite>Ng PC; Henikoff S (May 2001). <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC311071" rel="nofollow">"Predicting deleterious amino acid substitutions"</a>. <em>Genome Res</em>. <strong>11</strong> (5): 863–74. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1101%2Fgr.176601" rel="nofollow">10.1101/gr.176601</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC311071" rel="nofollow">311071</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/11337480" rel="nofollow">11337480</a>.</cite></li> | ||
+ | <li>^ <a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-Polyanovsky2011_4-0"><sup><em><strong>a</strong></em></sup></a> <a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-Polyanovsky2011_4-1"><sup><em><strong>b</strong></em></sup></a> <cite>Polyanovsky, V. O.; Roytberg, M. A.; Tumanyan, V. G. (2011). <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3223492" rel="nofollow">"Comparative analysis of the quality of a global algorithm and a local algorithm for alignment of two sequences"</a>. <em>Algorithms for Molecular Biology</em>. <strong>6</strong> (1): 25. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1186%2F1748-7188-6-25" rel="nofollow">10.1186/1748-7188-6-25</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3223492" rel="nofollow">3223492</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/22032267" rel="nofollow">22032267</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-Schneider_5-0" title="Jump up">^</a></strong> <cite>Schneider TD; Stephens RM (1990). <a href="http://nar.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=2172928" rel="nofollow">"Sequence logos: a new way to display consensus sequences"</a>. <em>Nucleic Acids Res</em>. <strong>18</strong> (20): 6097–6100. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1093%2Fnar%2F18.20.6097" rel="nofollow">10.1093/nar/18.20.6097</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC332411" rel="nofollow">332411</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/2172928" rel="nofollow">2172928</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-6" title="Jump up">^</a></strong> <cite><a href="https://samtools.github.io/hts-specs/SAMv1.pdf" rel="nofollow">"Sequence Alignment/Map Format Specification"</a> (PDF).</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-brudno_7-0" title="Jump up">^</a></strong> <cite>Brudno M; Malde S; Poliakov A; Do CB; Couronne O; Dubchak I; Batzoglou S (2003). <a href="http://bioinformatics.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=12855437" rel="nofollow">"Glocal alignment: finding rearrangements during alignment"</a>. <em>Bioinformatics</em>. 19. Suppl 1 (90001): i54–62. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1093%2Fbioinformatics%2Fbtg1005" rel="nofollow">10.1093/bioinformatics/btg1005</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/12855437" rel="nofollow">12855437</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-wang_8-0" title="Jump up">^</a></strong> <cite>Wang L; Jiang T. (1994). "On the complexity of multiple sequence alignment". <em>J Comput Biol</em>. <strong>1</strong> (4): 337–48. <a href="https://en.wikipedia.org/wiki/CiteSeerX" title="CiteSeerX">CiteSeerX</a> <a href="https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.408.894" rel="nofollow">10.1.1.408.894</a>. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1089%2Fcmb.1994.1.337" rel="nofollow">10.1089/cmb.1994.1.337</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/8790475" rel="nofollow">8790475</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-elias_9-0" title="Jump up">^</a></strong> <cite>Elias, Isaac (2006). "Settling the intractability of multiple alignment". <em>J Comput Biol</em>. <strong>13</strong> (7): 1323–1339. <a href="https://en.wikipedia.org/wiki/CiteSeerX" title="CiteSeerX">CiteSeerX</a> <a href="https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.256" rel="nofollow">10.1.1.6.256</a>. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1089%2Fcmb.2006.13.1323" rel="nofollow">10.1089/cmb.2006.13.1323</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/17037961" rel="nofollow">17037961</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-lipman_10-0" title="Jump up">^</a></strong> <cite>Lipman DJ; Altschul SF; Kececioglu JD (1989). <a href="http://www.pnas.org/cgi/pmidlookup?view=long&pmid=2734293" rel="nofollow">"A tool for multiple sequence alignment"</a>. <em>Proc Natl Acad Sci USA</em>. <strong>86</strong> (12): 4412–5. <a href="https://en.wikipedia.org/wiki/Bibcode" title="Bibcode">Bibcode</a>:<a href="http://adsabs.harvard.edu/abs/1989PNAS...86.4412L" rel="nofollow">1989PNAS...86.4412L</a>. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1073%2Fpnas.86.12.4412" rel="nofollow">10.1073/pnas.86.12.4412</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC287279" rel="nofollow">287279</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/2734293" rel="nofollow">2734293</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-higgins_11-0" title="Jump up">^</a></strong> <cite><a href="https://en.wikipedia.org/wiki/Desmond_G._Higgins" title="Desmond G. Higgins">Higgins DG</a>, Sharp PM (1988). <a href="http://linkinghub.elsevier.com/retrieve/pii/0378-1119(88)90330-7" rel="nofollow">"CLUSTAL: a package for performing multiple sequence alignment on a microcomputer"</a>. <em>Gene</em>. <strong>73</strong> (1): 237–44. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1016%2F0378-1119%2888%2990330-7" rel="nofollow">10.1016/0378-1119(88)90330-7</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/3243435" rel="nofollow">3243435</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-thompson_12-0" title="Jump up">^</a></strong> <cite>Thompson JD; <a href="https://en.wikipedia.org/wiki/Desmond_G._Higgins" title="Desmond G. Higgins">Higgins DG</a>; Gibson TJ. (1994). <a href="http://nar.oxfordjournals.org/content/22/22/4673" rel="nofollow">"CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice"</a>. <em>Nucleic Acids Res</em>. <strong>22</strong> (22): 4673–80. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1093%2Fnar%2F22.22.4673" rel="nofollow">10.1093/nar/22.22.4673</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC308517" rel="nofollow">308517</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/7984417" rel="nofollow">7984417</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-chenna_13-0" title="Jump up">^</a></strong> <cite>Chenna R; Sugawara H; Koike T; Lopez R; Gibson TJ; Higgins DG; Thompson JD. (2003). <a href="http://nar.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=12824352" rel="nofollow">"Multiple sequence alignment with the Clustal series of programs"</a>. <em>Nucleic Acids Res</em>. <strong>31</strong> (13): 3497–500. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1093%2Fnar%2Fgkg500" rel="nofollow">10.1093/nar/gkg500</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC168907" rel="nofollow">168907</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/12824352" rel="nofollow">12824352</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-notredame_14-0" title="Jump up">^</a></strong> <cite>Notredame C; <a href="https://en.wikipedia.org/wiki/Desmond_G._Higgins" title="Desmond G. Higgins">Higgins DG</a>; Heringa J. (2000). <a href="http://linkinghub.elsevier.com/retrieve/pii/S0022-2836(00)94042-7" rel="nofollow">"T-Coffee: A novel method for fast and accurate multiple sequence alignment"</a>. <em>J Mol Biol</em>. <strong>302</strong> (1): 205–17. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1006%2Fjmbi.2000.4042" rel="nofollow">10.1006/jmbi.2000.4042</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/10964570" rel="nofollow">10964570</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-hirosawa_15-0" title="Jump up">^</a></strong> <cite>Hirosawa M; Totoki Y; Hoshida M; Ishikawa M. (1995). <a href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/11/1/13" rel="nofollow">"Comprehensive study on iterative algorithms of multiple sequence alignment"</a>. <em>Comput Appl Biosci</em>. <strong>11</strong> (1): 13–8. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1093%2Fbioinformatics%2F11.1.13" rel="nofollow">10.1093/bioinformatics/11.1.13</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/7796270" rel="nofollow">7796270</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-karplus_16-0" title="Jump up">^</a></strong> <cite>Karplus K; Barrett C; Hughey R. (1998). <a href="http://bioinformatics.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=9927713" rel="nofollow">"Hidden Markov models for detecting remote protein homologies"</a>. <em>Bioinformatics</em>. <strong>14</strong> (10): 846–856. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1093%2Fbioinformatics%2F14.10.846" rel="nofollow">10.1093/bioinformatics/14.10.846</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/9927713" rel="nofollow">9927713</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-chothia_17-0" title="Jump up">^</a></strong> <cite>Chothia C; Lesk AM. (April 1986). <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1166865" rel="nofollow">"The relation between the divergence of sequence and structure in proteins"</a>. <em>EMBO J</em>. <strong>5</strong> (4): 823–6. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1166865" rel="nofollow">1166865</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/3709526" rel="nofollow">3709526</a>.</cite></li> | ||
+ | <li>^ <a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-skolnick_18-0"><sup><em><strong>a</strong></em></sup></a> <a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-skolnick_18-1"><sup><em><strong>b</strong></em></sup></a> <cite>Zhang Y; Skolnick J. (2005). <a href="http://www.pnas.org/cgi/pmidlookup?view=long&pmid=15653774" rel="nofollow">"The protein structure prediction problem could be solved using the current PDB library"</a>. <em>Proc Natl Acad Sci USA</em>. <strong>102</strong> (4): 1029–34. <a href="https://en.wikipedia.org/wiki/Bibcode" title="Bibcode">Bibcode</a>:<a href="http://adsabs.harvard.edu/abs/2005PNAS..102.1029Z" rel="nofollow">2005PNAS..102.1029Z</a>. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1073%2Fpnas.0407152101" rel="nofollow">10.1073/pnas.0407152101</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC545829" rel="nofollow">545829</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/15653774" rel="nofollow">15653774</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-holm_19-0" title="Jump up">^</a></strong> <cite>Holm L; Sander C (1996). <a href="http://www.sciencemag.org/cgi/pmidlookup?view=long&pmid=8662544" rel="nofollow">"Mapping the protein universe"</a>. <em>Science</em>. <strong>273</strong> (5275): 595–603. <a href="https://en.wikipedia.org/wiki/Bibcode" title="Bibcode">Bibcode</a>:<a href="http://adsabs.harvard.edu/abs/1996Sci...273..595H" rel="nofollow">1996Sci...273..595H</a>. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1126%2Fscience.273.5275.595" rel="nofollow">10.1126/science.273.5275.595</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/8662544" rel="nofollow">8662544</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-taylor_20-0" title="Jump up">^</a></strong> <cite>Taylor WR; Flores TP; Orengo CA. (1994). <a href="http://www.proteinscience.org/cgi/pmidlookup?view=long&pmid=7849601" rel="nofollow">"Multiple protein structure alignment"</a>. <em>Protein Sci</em>. <strong>3</strong> (10): 1858–70. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1002%2Fpro.5560031025" rel="nofollow">10.1002/pro.5560031025</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2142613" rel="nofollow">2142613</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/7849601" rel="nofollow">7849601</a>.</cite><sup>[<em><a href="https://en.wikipedia.org/wiki/Wikipedia:Link_rot" title="Wikipedia:Link rot">permanent dead link</a></em>]</sup></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-orengo_21-0" title="Jump up">^</a></strong> <cite>Orengo CA; Michie AD; Jones S; Jones DT; Swindells MB; Thornton JM (1997). "CATH--a hierarchic classification of protein domain structures". <em>Structure</em>. <strong>5</strong> (8): 1093–108. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1016%2FS0969-2126%2897%2900260-8" rel="nofollow">10.1016/S0969-2126(97)00260-8</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/9309224" rel="nofollow">9309224</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-shindyalov_22-0" title="Jump up">^</a></strong> <cite>Shindyalov IN; Bourne PE. (1998). <a href="http://peds.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=9796821" rel="nofollow">"Protein structure alignment by incremental combinatorial extension (CE) of the optimal path"</a>. <em>Protein Eng</em>. <strong>11</strong> (9): 739–47. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1093%2Fprotein%2F11.9.739" rel="nofollow">10.1093/protein/11.9.739</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/9796821" rel="nofollow">9796821</a>.</cite></li> | ||
+ | <li>^ <a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-ortet_23-0"><sup><em><strong>a</strong></em></sup></a> <a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-ortet_23-1"><sup><em><strong>b</strong></em></sup></a> <cite>Ortet P; Bastien O (2010). <a href="http://www.la-press.com/where-does-the-alignment-score-distribution-shape-come-from-article-a2393" rel="nofollow">"Where Does the Alignment Score Distribution Shape Come from?"</a>. <em>Evolutionary Bioinformatics</em>. <strong>6</strong>: 159–187. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.4137%2FEBO.S5875" rel="nofollow">10.4137/EBO.S5875</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3023300" rel="nofollow">3023300</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/21258650" rel="nofollow">21258650</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-felsenstein_24-0" title="Jump up">^</a></strong> <cite>Felsenstein J. (2004). <em>Inferring Phylogenies</em>. Sinauer Associates: Sunderland, MA. <a href="https://en.wikipedia.org/wiki/International_Standard_Book_Number" title="International Standard Book Number">ISBN</a> <a href="https://en.wikipedia.org/wiki/Special:BookSources/978-0-87893-177-4" title="Special:BookSources/978-0-87893-177-4">978-0-87893-177-4</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-altschul_25-0" title="Jump up">^</a></strong> <cite>Altschul SF; Gish W (1996). <em>Local Alignment Statistics</em>. <em>Meth.Enz</em>. Methods in Enzymology. <strong>266</strong>. pp. 460–480. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1016%2FS0076-6879%2896%2966029-7" rel="nofollow">10.1016/S0076-6879(96)66029-7</a>. <a href="https://en.wikipedia.org/wiki/International_Standard_Book_Number" title="International Standard Book Number">ISBN</a> <a href="https://en.wikipedia.org/wiki/Special:BookSources/9780121821678" title="Special:BookSources/9780121821678">9780121821678</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-hartmann_26-0" title="Jump up">^</a></strong> <cite>Hartmann AK (2002). "Sampling rare events: statistics of local sequence alignments". <em>Phys. Rev. E</em>. <strong>65</strong> (5): 056102. <a href="https://en.wikipedia.org/wiki/ArXiv" title="ArXiv">arXiv</a>:<a href="https://arxiv.org/abs/cond-mat/0108201" rel="nofollow">cond-mat/0108201</a>. <a href="https://en.wikipedia.org/wiki/Bibcode" title="Bibcode">Bibcode</a>:<a href="http://adsabs.harvard.edu/abs/2002PhRvE..65e6102H" rel="nofollow">2002PhRvE..65e6102H</a>. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1103%2FPhysRevE.65.056102" rel="nofollow">10.1103/PhysRevE.65.056102</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/12059642" rel="nofollow">12059642</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-newberg_27-0" title="Jump up">^</a></strong> <cite>Newberg LA (2008). <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2737730" rel="nofollow">"Significance of gapped sequence alignments"</a>. <em>J Comput Biol</em>. <strong>15</strong> (9): 1187–1194. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1089%2Fcmb.2008.0125" rel="nofollow">10.1089/cmb.2008.0125</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2737730" rel="nofollow">2737730</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/18973434" rel="nofollow">18973434</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-eddy_28-0" title="Jump up">^</a></strong> <cite>Eddy SR (2008). Rost, Burkhard (ed.). <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2396288" rel="nofollow">"A probabilistic model of local sequence alignment that simplifies statistical significance estimation"</a>. <em>PLoS Comput Biol</em>. <strong>4</strong> (5): e1000069. <a href="https://en.wikipedia.org/wiki/Bibcode" title="Bibcode">Bibcode</a>:<a href="http://adsabs.harvard.edu/abs/2008PLSCB...4E0069E" rel="nofollow">2008PLSCB...4E0069E</a>. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1371%2Fjournal.pcbi.1000069" rel="nofollow">10.1371/journal.pcbi.1000069</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2396288" rel="nofollow">2396288</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/18516236" rel="nofollow">18516236</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-bastien_29-0" title="Jump up">^</a></strong> <cite>Bastien O; Aude JC; Roy S; Marechal E (2004). <a href="http://bioinformatics.oxfordjournals.org/content/20/4/534.long" rel="nofollow">"Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics"</a>. <em>Bioinformatics</em>. <strong>20</strong> (4): 534–537. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1093%2Fbioinformatics%2Fbtg440" rel="nofollow">10.1093/bioinformatics/btg440</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/14990449" rel="nofollow">14990449</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-agrawal11_30-0" title="Jump up">^</a></strong> <cite>Agrawal A; Huang X (2011). <a href="https://archive.is/20130415004914/http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5276793" rel="nofollow">"Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices"</a>. <em>IEEE/ACM Transactions on Computational Biology and Bioinformatics</em>. <strong>8</strong> (1): 194–205. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1109%2FTCBB.2009.69" rel="nofollow">10.1109/TCBB.2009.69</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/21071807" rel="nofollow">21071807</a>. Archived from <a href="http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5276793" rel="nofollow">the original</a> on 2013-04-15.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-agrawal08_31-0" title="Jump up">^</a></strong> <cite>Agrawal A; Brendel VP; Huang X (2008). <a href="https://archive.is/20130128163812/http://inderscience.metapress.com/content/1558538106522500/" rel="nofollow">"Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment"</a>. <em>International Journal of Computational Biology and Drug Design</em>. <strong>1</strong> (4): 347–367. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1504%2FIJCBDD.2008.022207" rel="nofollow">10.1504/IJCBDD.2008.022207</a>. Archived from <a href="http://inderscience.metapress.com/content/1558538106522500/" rel="nofollow">the original</a> on 28 January 2013.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-NewbergLawrence2009_32-0" title="Jump up">^</a></strong> <cite>Newberg LA; Lawrence CE (2009). <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2858568" rel="nofollow">"Exact Calculation of Distributions on Integers, with Application to Sequence Alignment"</a>. <em>J Comput Biol</em>. <strong>16</strong> (1): 1–18. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1089%2Fcmb.2008.0137" rel="nofollow">10.1089/cmb.2008.0137</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2858568" rel="nofollow">2858568</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/19119992" rel="nofollow">19119992</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-33" title="Jump up">^</a></strong> <cite>Kim N; Lee C (2008). <em>Bioinformatics detection of alternative splicing</em>. <em>Methods Mol. Biol</em>. Methods in Molecular Biology™. <strong>452</strong>. pp. 179–97. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1007%2F978-1-60327-159-2_9" rel="nofollow">10.1007/978-1-60327-159-2_9</a>. <a href="https://en.wikipedia.org/wiki/International_Standard_Book_Number" title="International Standard Book Number">ISBN</a> <a href="https://en.wikipedia.org/wiki/Special:BookSources/978-1-58829-707-5" title="Special:BookSources/978-1-58829-707-5">978-1-58829-707-5</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/18566765" rel="nofollow">18566765</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-34" title="Jump up">^</a></strong> <cite>Li JB, Levanon EY, Yoon JK, et al. (May 2009). "Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing". <em>Science</em>. <strong>324</strong> (5931): 1210–3. <a href="https://en.wikipedia.org/wiki/Bibcode" title="Bibcode">Bibcode</a>:<a href="http://adsabs.harvard.edu/abs/2009Sci...324.1210L" rel="nofollow">2009Sci...324.1210L</a>. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1126%2Fscience.1170995" rel="nofollow">10.1126/science.1170995</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/19478186" rel="nofollow">19478186</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-35" title="Jump up">^</a></strong> <cite>Blazewicz J, Bryja M, Figlerowicz M, et al. (June 2009). "Whole genome assembly from 454 sequencing output via modified DNA graph concept". <em>Comput Biol Chem</em>. <strong>33</strong> (3): 224–30. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1016%2Fj.compbiolchem.2009.04.005" rel="nofollow">10.1016/j.compbiolchem.2009.04.005</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/19477687" rel="nofollow">19477687</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-36" title="Jump up">^</a></strong> <cite>Duran C; Appleby N; Vardy M; Imelfort M; Edwards D; Batley J (May 2009). "Single nucleotide polymorphism discovery in barley using autoSNPdb". <em>Plant Biotechnol. J</em>. <strong>7</strong> (4): 326–33. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1111%2Fj.1467-7652.2009.00407.x" rel="nofollow">10.1111/j.1467-7652.2009.00407.x</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/19386041" rel="nofollow">19386041</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-37" title="Jump up">^</a></strong> <cite>Abbott A.; Tsay A. (2000). "Sequence Analysis and Optimal Matching Methods in Sociology, Review and Prospect". <em>Sociological Methods and Research</em>. <strong>29</strong> (1): 3–33. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1177%2F0049124100029001001" rel="nofollow">10.1177/0049124100029001001</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-Barzilay_38-0" title="Jump up">^</a></strong> <cite>Barzilay R; Lee L. (2002). <a href="http://www.cs.cornell.edu/home/llee/papers/gen-msa.pdf" rel="nofollow">"Bootstrapping Lexical Choice via Multiple-Sequence Alignment"</a> (PDF). <em>Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)</em>. <strong>10</strong>: 164–171. <a href="https://en.wikipedia.org/wiki/ArXiv" title="ArXiv">arXiv</a>:<a href="https://arxiv.org/abs/cs/0205065" rel="nofollow">cs/0205065</a>. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.3115%2F1118693.1118715" rel="nofollow">10.3115/1118693.1118715</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-39" title="Jump up">^</a></strong> <cite>Kondrak, Grzegorz (2002). <a href="http://www.cs.ualberta.ca/~kondrak/papers/thesis.pdf" rel="nofollow">"Algorithms for Language Reconstruction"</a> (PDF). University of Toronto, Ontario. Retrieved 2007-01-21.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-prinzie_40-0" title="Jump up">^</a></strong> <cite>Prinzie A.; D. Van den Poel (2006). <a href="http://econpapers.repec.org/paper/rugrugwps/05_2F292.htm" rel="nofollow">"Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM"</a>. <em>Decision Support Systems</em>. <strong>42</strong>(2): 508–526. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1016%2Fj.dss.2005.02.004" rel="nofollow">10.1016/j.dss.2005.02.004</a>.</cite> See also Prinzie and Van den Poel's paper <cite>Prinzie, A; Vandenpoel, D (2007). <a href="http://econpapers.repec.org/paper/rugrugwps/07_2F442.htm" rel="nofollow">"Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB models"</a>. <em>Decision Support Systems</em>. <strong>44</strong> (1): 28–45. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1016%2Fj.dss.2007.02.008" rel="nofollow">10.1016/j.dss.2007.02.008</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-41" title="Jump up">^</a></strong> <cite>EMBL-EBI. <a href="http://www.ebi.ac.uk/Tools/msa/clustalw2/" rel="nofollow">"ClustalW2 < Multiple Sequence Alignment < EMBL-EBI"</a>. <em>www.EBI.ac.uk</em>. Retrieved 12 June 2017.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-42" title="Jump up">^</a></strong> <a href="https://web.archive.org/web/20080918022531/http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi" rel="nofollow">T-coffee</a></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-43" title="Jump up">^</a></strong> <cite><a href="http://blast.ncbi.nlm.nih.gov/Blast.cgi" rel="nofollow">"BLAST: Basic Local Alignment Search Tool"</a>. <em>blast.ncbi.nlm.NIH.gov</em>. Retrieved 12 June 2017.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-44" title="Jump up">^</a></strong> <cite><a href="http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml" rel="nofollow">"UVA FASTA Server"</a>. <em>fasta.bioch.Virginia.edu</em>. Retrieved 12 June 2017.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-thompson2_45-0" title="Jump up">^</a></strong> <cite>Thompson JD; Plewniak F; Poch O (1999). <a href="http://bioinformatics.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=10068696" rel="nofollow">"BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs"</a>. <em>Bioinformatics</em>. <strong>15</strong> (1): 87–8. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1093%2Fbioinformatics%2F15.1.87" rel="nofollow">10.1093/bioinformatics/15.1.87</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/10068696" rel="nofollow">10068696</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-46" title="Jump up">^</a></strong> <a href="https://web.archive.org/web/20121130084356/http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/prog_scores.html" rel="nofollow">BAliBASE</a></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-thompson3_47-0" title="Jump up">^</a></strong> <cite>Thompson JD; Plewniak F; Poch O. (1999). <a href="http://nar.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=10373585" rel="nofollow">"A comprehensive comparison of multiple sequence alignment programs"</a>. <em>Nucleic Acids Res</em>. <strong>27</strong> (13): 2682–90. <a href="https://en.wikipedia.org/wiki/Digital_object_identifier" title="Digital object identifier">doi</a>:<a href="https://doi.org/10.1093%2Fnar%2F27.13.2682" rel="nofollow">10.1093/nar/27.13.2682</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Central" title="PubMed Central">PMC</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC148477" rel="nofollow">148477</a>. <a href="https://en.wikipedia.org/wiki/PubMed_Identifier" title="PubMed Identifier">PMID</a> <a href="https://www.ncbi.nlm.nih.gov/pubmed/10373585" rel="nofollow">10373585</a>.</cite></li> | ||
+ | <li><strong><a href="https://en.wikipedia.org/wiki/Sequence_alignment#cite_ref-48" title="Jump up">^</a></strong> <cite><a href="http://3d-alignment.eu/" rel="nofollow">"Multiple sequence alignment: Strap"</a>. <em>3d-alignment.eu</em>. Retrieved 12 June 2017.</cite></li> | ||
</ol> | </ol> |
Latest revision as of 03:04, 2 December 2018
Contents
- 1 Sequence alignment
- 1.1 Contents
- 1.2 Interpretation
- 1.3 Alignment methods
- 1.4 Representations
- 1.5 Global and local alignments
- 1.6 Pairwise alignment
- 1.7 Multiple sequence alignment
- 1.8 Structural alignment
- 1.9 Phylogenetic analysis
- 1.10 Other biological uses
- 1.11 Non-biological uses
- 1.12 Software
- 1.13 See also
- 1.14 References
Sequence alignment
From Wikipedia, the free encyclopedia
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.[1] Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the edit distance cost between strings in a natural language or in financial data.
A sequence alignment, produced by ClustalO, of mammalian histone proteins.
Sequences are the amino acids for residues 120-180 of the proteins. Residues that are conserved across all sequences are highlighted in grey. Below the protein sequences is a key denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ).[2]
Interpretation
If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggests[3] that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.
Alignment methods
Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences).
Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity.[4]
A variety of computational algorithms have been applied to the sequence alignment problem. These include slow but formally correct methods like dynamic programming, as well as efficient heuristic algorithms and probabilistic methods designed for large-scale database search, which are not guaranteed to find the best matches.
Representations
Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol is used to show identity between two columns; other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the conservation of a given amino acid substitution. For multiple sequences the last row in each column is often the consensus sequence determined by the alignment; the consensus sequence is also often represented in graphical format with a sequence logo in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation.[5]
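The consensus row and the identity symbol described above can be sketched in a few lines of Python. This is a minimal illustration, not the behavior of any particular viewer: the function name `consensus_and_marks` is hypothetical, ties are broken by whichever character appears first in the column, and only the asterisk (full identity) is computed, because the colon and period symbols would need a table of conservative substitution groups that is not given here.

```python
from collections import Counter


def consensus_and_marks(rows):
    """Return (consensus, marks) for a list of equal-length aligned rows.

    Hypothetical helper: '-' is assumed to be the gap character.
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    consensus, marks = [], []
    for column in zip(*rows):
        counts = Counter(column)
        # Most frequent character in the column; ties go to the one seen first.
        residue, _ = counts.most_common(1)[0]
        consensus.append(residue)
        # '*' only when every row holds the same non-gap character.
        fully_conserved = len(counts) == 1 and residue != "-"
        marks.append("*" if fully_conserved else " ")
    return "".join(consensus), "".join(marks)


if __name__ == "__main__":
    alignment = ["ACG-T", "ACGAT", "TCGAT"]
    cons, marks = consensus_and_marks(alignment)
    for row in alignment:
        print(row)
    print(marks)
    print(cons)
```

A sequence logo carries more information than this consensus string, since it also shows how strongly each column is conserved.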
Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a limited number of input and output formats, such as FASTA format and GenBank format, and the output is not easily editable. Several conversion programs that provide graphical and/or command line interfaces are available, such as READSEQ and EMBOSS. There are also several programming packages which provide this conversion functionality, such as BioPython, BioRuby and BioPerl. The SAM/BAM files use the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string format to represent an alignment of a sequence to a reference by encoding a sequence of events (e.g. match/mismatch, insertions, deletions).[6]
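As a rough illustration of how a CIGAR string encodes alignment events, the sketch below splits a string into (length, operation) pairs and counts how many reference and query positions it covers. The function names are hypothetical. The operation letters and which of them consume the reference or the query follow the SAM specification; a production parser (for example in an established SAM/BAM library) would handle more edge cases.

```python
import re

# One CIGAR event: a run length followed by a single operation letter.
CIGAR_EVENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    events = [(int(length), op) for length, op in CIGAR_EVENT.findall(cigar)]
    # Reject strings containing anything the pattern did not account for.
    if "".join(f"{length}{op}" for length, op in events) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return events


def reference_span(events):
    """Number of reference positions covered (M, D, N, = and X consume it)."""
    return sum(length for length, op in events if op in "MDN=X")


def query_length(events):
    """Number of query bases described (M, I, S, = and X consume them)."""
    return sum(length for length, op in events if op in "MIS=X")


if __name__ == "__main__":
    events = parse_cigar("3M1I2M2D4M")
    print(events)
    print("reference span:", reference_span(events))
    print("query length:", query_length(events))
```

In this example the read has one inserted base and skips two reference bases, so it describes 10 query bases spread over 11 reference positions.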
Global and local alignments
Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the Needleman–Wunsch algorithm, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith–Waterman algorithm is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.[4]
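The difference between the two schemes can be seen in a score-only sketch. Both functions below fill the same dynamic programming table row by row; the local version adds the option of restarting at zero in every cell and reports the best cell anywhere in the table. The scoring values (match +1, mismatch -1, linear gap -2) and the function names are assumptions chosen for illustration, and no traceback is performed, so only the optimal score is returned.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # leading gaps are penalised
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                           prev[j] + gap,      # a[i-1] against a gap
                           cur[j - 1] + gap))  # b[j-1] against a gap
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,                     # start a new alignment here
                       prev[j - 1] + sub,
                       prev[j] + gap,
                       cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)            # the alignment may end anywhere
        prev = cur
    return best


if __name__ == "__main__":
    x, y = "CCCGATTACACCC", "TTGATTACATT"
    print("global:", global_score(x, y))
    print("local: ", local_score(x, y))
```

For these two inputs the shared core GATTACA dominates the local score, while the global score is dragged down by the dissimilar flanking residues and the length difference.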
Hybrid methods, known as semi-global or "glocal" (short for global-local) methods, search for the best possible partial alignment of the two sequences (a subset of one or both starts and one or both ends has to be chosen before aligning). This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap.[7] Another case where semi-global alignment is useful is when one sequence is short (for example a gene sequence) and the other is very long (for example a chromosome sequence). In that case, the short sequence should be globally (fully) aligned but only a local (partial) alignment is desired for the long sequence.
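The short-versus-long case can be sketched with a small change to the global recurrence: skipping a prefix or suffix of the long sequence is free, while the short sequence must still be aligned in full. The function name `fit_score` and the scoring values are assumptions for illustration, and only the score is computed.

```python
def fit_score(short, long_seq, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of `short` to some stretch of `long_seq`."""
    # Row 0 is all zeros: an unaligned prefix of long_seq costs nothing.
    prev = [0] * (len(long_seq) + 1)
    for i in range(1, len(short) + 1):
        # Column 0 is still penalised: residues of `short` cannot be skipped.
        cur = [i * gap]
        for j in range(1, len(long_seq) + 1):
            sub = match if short[i - 1] == long_seq[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,
                           prev[j] + gap,
                           cur[j - 1] + gap))
        prev = cur
    # An unaligned suffix of long_seq is also free: take the best end column.
    return max(prev)


if __name__ == "__main__":
    print(fit_score("GATTACA", "CCCCGATTACACCCC"))   # exact occurrence
    print(fit_score("GATTACA", "CCCCGATTTACACCCC"))  # one extra T in the target
```

The overlap case described above (the end of one sequence matching the start of the other) uses the same idea with a different choice of which ends are free.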
Pairwise alignment
Pairwise sequence alignment methods are used to find the best-matching piecewise (local or global) alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods;[1] however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low information content, especially where the number of repetitions differs in the two sequences to be aligned. One way of quantifying the utility of a given pairwise alignment is the 'maximum unique match' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness.
Dot-matrix methods
Self comparison of a part of a mouse strain genome. The dot-plot shows a patchwork of lines, demonstrating duplicated segments of DNA.
See main article on dot plots (bioinformatics).
A DNA dot plot of a human zinc finger transcription factor (GenBank ID NM_002383), showing regional self-similarity. The main diagonal represents the sequence's alignment with itself; lines off the main diagonal represent similar or repetitive patterns within the sequence. This is a typical example of a recurrence plot.
The dot-matrix approach, which implicitly produces a family of alignments for individual sequence regions, is qualitative and conceptually simple, though time-consuming to analyze on a large scale. In the absence of noise, it can be easy to visually identify certain sequence features (such as insertions, deletions, repeats, or inverted repeats) from a dot-matrix plot. To construct a dot-matrix plot, the two sequences are written along the top row and leftmost column of a two-dimensional matrix, and a dot is placed at any point where the characters in the corresponding row and column match; this is a typical recurrence plot. Some implementations vary the size or intensity of the dot depending on the degree of similarity of the two characters, to accommodate conservative substitutions. The dot plots of very closely related sequences will appear as a single line along the matrix's main diagonal.
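The construction just described can be sketched as a small text-mode plot. The function names are hypothetical, and the `window` parameter is an added assumption: with `window=1` a dot is placed wherever two single characters match, as in the description, while a larger window requires several consecutive characters to match and is one common way of suppressing noise.

```python
def dot_matrix(seq_top, seq_left, window=1):
    """Boolean matrix: rows follow seq_left, columns follow seq_top.

    A cell is True when the `window`-long words starting at that row and
    column are identical.
    """
    n_rows = len(seq_left) - window + 1
    n_cols = len(seq_top) - window + 1
    return [
        [seq_left[i:i + window] == seq_top[j:j + window] for j in range(n_cols)]
        for i in range(n_rows)
    ]


def render(matrix):
    """Draw the matrix with '*' for a dot and '.' for an empty cell."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


if __name__ == "__main__":
    # Identical sequences give a single line along the main diagonal.
    print(render(dot_matrix("ACGT", "ACGT")))
    print()
    # A repeated segment shows up as additional diagonal lines.
    print(render(dot_matrix("GATTACAGATTACA", "GATTACAGATTACA", window=3)))
```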
Problems with dot plots as an information display technique include noise, lack of clarity, non-intuitiveness, and difficulty extracting match summary statistics and match positions on the two sequences. Much space is also wasted: the match data is inherently duplicated across the diagonal, and most of the plot area is taken up by either empty space or noise. Finally, dot plots are limited to two sequences. None of these limitations apply to Miropeats alignment diagrams, but those have their own particular flaws.
Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect can occur when a protein consists of multiple similar structural domains.
Dynamic programming
The technique of dynamic programming can be applied to produce global alignments via the Needleman–Wunsch algorithm, and local alignments via the Smith–Waterman algorithm. In typical usage, protein alignments use a substitution matrix to assign scores to amino-acid matches or mismatches, and a gap penalty for matching an amino acid in one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in practice often simply assign a positive match score, a negative mismatch score, and a negative gap penalty. (In standard dynamic programming, the score of each amino acid position is independent of the identity of its neighbors, and therefore base stacking effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.) A common extension to standard linear gap costs is the use of two different gap penalties, one for opening a gap and one for extending it. Typically the former is much larger than the latter, e.g. -10 for gap open and -2 for gap extension. As a result, the number of gaps in an alignment is usually reduced and residues and gaps are kept together, which typically makes more biological sense. The Gotoh algorithm implements affine gap costs by using three matrices.
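A score-only sketch of global alignment with affine gap costs, in the spirit of the three-matrix approach, is given below. Conventions for affine penalties differ between programs; this sketch assumes that a gap of length L costs `gap_open + (L - 1) * gap_extend`, uses a simple match/mismatch score instead of a substitution matrix, and omits traceback. The function and matrix names are illustrative.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=1, mismatch=-1, gap_open=-10, gap_extend=-2):
    """Optimal global alignment score under affine gap costs."""
    n, m = len(a), len(b)
    # M: alignment ends with a[i-1] paired to b[j-1]
    # X: alignment ends with a[i-1] opposite a gap
    # Y: alignment ends with b[j-1] opposite a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Opening a gap is charged once; extending it is cheaper.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    print(affine_global_score("ACGT", "ACGT"))
    # Four unmatched residues are cheapest as one long gap: 4 + (-10 - 2*3)
    print(affine_global_score("ACGTACGT", "ACGT"))
```

Because opening is much more expensive than extending, the optimum groups the unmatched residues into a single run rather than scattering them, which is the behavior described above.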
Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account frameshift mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The BLAST and EMBOSS suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). More general methods are available from both commercial sources, such as FrameSearch, distributed as part of the Accelrys GCG package, and open-source software such as Genewise.
The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of sequences or extremely long sequences.
Word methods
Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family.[1] Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.
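The offset idea can be sketched as follows. This is a toy illustration of the principle only, not the actual FASTA or BLAST procedure: the function name `word_offsets` is hypothetical, the words are exact matches of a fixed length `k`, and no extension or rescoring step is performed.

```python
from collections import Counter, defaultdict


def word_offsets(query, subject, k=3):
    """Count, for each offset, how many query words match the subject there.

    Offset = (position in subject) - (position in query). Several distinct
    words sharing one offset suggest a diagonal worth aligning carefully.
    """
    # Index every k-long word of the subject by its start positions.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)

    offsets = Counter()
    # Step through the query in non-overlapping words.
    for i in range(0, len(query) - k + 1, k):
        for j in index.get(query[i:i + k], ()):
            offsets[j - i] += 1
    return offsets


if __name__ == "__main__":
    hits = word_offsets("GATTACAGG", "CCGATTACAGGCC", k=3)
    print(hits.most_common(3))
```

A database search would run this cheap screen against every candidate and apply a more sensitive alignment only to candidates where some offset collects enough word hits.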
In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length k, but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as EMBL FASTA and NCBI BLAST.
Multiple sequence alignment
Main article: Multiple sequence alignment
Alignment of 27 avian influenza hemagglutinin protein sequences colored by residue conservation (top) and residue properties (bottom)
Multiple sequence alignment is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. Multiple sequence alignments are computationally difficult to produce and most formulations of the problem lead to NP-complete combinatorial optimization problems.[8][9] Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.
Dynamic programming
The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and memory, it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the n-dimensional equivalent of the sequence matrix formed from two sequences, where n is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" objective function, has been implemented in the MSA software package.[10]
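Two small calculations illustrate the points above: the size of the n-dimensional table, which grows as the product of the sequence lengths, and a "sum of pairs" score for an existing multiple alignment. The function names and scoring values are assumptions for illustration, a column pair consisting of two gaps is assumed to score zero, and this sketch is not the procedure used by the MSA package.

```python
from itertools import combinations
from math import prod


def lattice_cells(sequences):
    """Number of cells in the n-dimensional dynamic programming table."""
    return prod(len(seq) + 1 for seq in sequences)


def sum_of_pairs(rows, match=1, mismatch=-1, gap=-2):
    """Score an alignment by summing the scores of every pair of rows."""
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue          # assumed: gap opposite gap contributes 0
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    print(sum_of_pairs(["AC-T", "ACGT", "AGGT"]))
    # Table size for 2, 3 and 4 sequences of 100 residues each.
    for n in (2, 3, 4):
        print(n, lattice_cells(["A" * 100] * n))
```

With 100-residue sequences the table already exceeds a hundred million cells at four sequences, which is consistent with the method rarely being used beyond three or four.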
Progressive methods
Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to FASTA. Progressive alignment results are dependent on the choice of "most related" sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy.
Many variations of the Clustal progressive implementation[11][12][13] are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. A slower but more accurate variant of the progressive method is known as T-Coffee.[14]
Iterative methods
Iterative methods attempt to improve on the heavy dependence on the accuracy of the initial pairwise alignments, which is the weak point of the progressive methods. Iterative methods optimize an objective function based on a selected alignment scoring method by assigning an initial global alignment and then realigning sequence subsets. The realigned subsets are then themselves aligned to produce the next iteration's multiple sequence alignment. Various ways of selecting the sequence subgroups and objective function have been reviewed.[15]
Motif finding
Motif finding, also known as profile analysis, constructs global multiple sequence alignments that attempt to align short conserved sequence motifs among the sequences in the query set. This is usually done by first constructing a general global multiple sequence alignment, after which the highly conserved regions are isolated and used to construct a set of profile matrices. The profile matrix for each conserved region is arranged like a scoring matrix but its frequency counts for each amino acid or nucleotide at each position are derived from the conserved region's character distribution rather than from a more general empirical distribution. The profile matrices are then used to search other sequences for occurrences of the motif they characterize. In cases where the original data set contained a small number of sequences, or only highly related sequences, pseudocounts are added to normalize the character distributions represented in the motif.
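A profile matrix with pseudocounts can be sketched as below. The function names, the DNA alphabet, and the choice of adding a constant pseudocount of 1 to every cell are assumptions made for illustration; real profile tools typically use more elaborate pseudocount schemes and score windows with log-odds against a background distribution rather than raw probabilities.

```python
def build_profile(aligned_motifs, alphabet="ACGT", pseudocount=1.0):
    """Per-position character frequencies for an ungapped motif alignment."""
    depth = len(aligned_motifs)
    denominator = depth + pseudocount * len(alphabet)
    profile = []
    for column in zip(*aligned_motifs):
        # The pseudocount keeps unseen characters from getting probability 0.
        profile.append({
            ch: (column.count(ch) + pseudocount) / denominator
            for ch in alphabet
        })
    return profile


def best_hit(profile, sequence):
    """Start index of the window in `sequence` most probable under the profile."""
    width = len(profile)

    def window_probability(start):
        p = 1.0
        for offset, column in enumerate(profile):
            p *= column[sequence[start + offset]]
        return p

    return max(range(len(sequence) - width + 1), key=window_probability)


if __name__ == "__main__":
    profile = build_profile(["TATA", "TATA", "TACA"])
    print(profile[2])
    print(best_hit(profile, "GGGTATAGGG"))
```

With only three input motifs, the third column would assign zero probability to A and G without the pseudocount, so any window containing them there could never be reported.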
Techniques inspired by computer science
A variety of general optimization algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem. Hidden Markov models have been used to produce probability scores for a family of possible multiple sequence alignments for a given query set; although early HMM-based methods produced underwhelming performance, later applications have found them especially effective in detecting remotely related sequences because they are less susceptible to noise created by conservative or semiconservative substitutions.[16] Genetic algorithms and simulated annealing have also been used in optimizing multiple sequence alignment scores as judged by a scoring function like the sum-of-pairs method. More complete details and software packages can be found in the main article multiple sequence alignment.
The Burrows–Wheeler transform has been successfully applied to fast short read alignment in popular tools such as Bowtie and BWA. See FM-index.
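The transform and the kind of exact-match counting that an FM-index supports can be sketched in a few lines. This is a teaching sketch under stated assumptions, not how Bowtie or BWA are implemented: the transform is built by sorting all rotations (quadratic in memory), the text is assumed not to contain the sentinel character `$`, and occurrence counts are recomputed by scanning instead of being precomputed.

```python
def bwt(text):
    """Burrows-Wheeler transform of `text` with a '$' sentinel appended."""
    text += "$"  # assumed absent from the input and smaller than all letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text, pattern):
    """Count exact matches of `pattern` using backward search on the BWT."""
    first_column = sorted(bwt_text)
    # first_row[c]: index of the first sorted rotation starting with c
    first_row = {}
    for row, ch in enumerate(first_column):
        first_row.setdefault(ch, row)

    def occ(ch, upto):
        """Occurrences of ch in bwt_text[:upto] (naive scan)."""
        return bwt_text[:upto].count(ch)

    low, high = 0, len(bwt_text)  # current range of matching rotations
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        low = first_row[ch] + occ(ch, low)
        high = first_row[ch] + occ(ch, high)
        if low >= high:
            return 0
    return high - low


if __name__ == "__main__":
    transformed = bwt("GATTACAGATTACA")
    print(transformed)
    print(count_occurrences(transformed, "TTA"))
```

The search processes the pattern from its last character to its first and never needs the original text, which is what makes the index compact enough for short-read alignment against large references.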
Structural alignment
Main article: Structural alignment
Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the secondary and tertiary structure of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through X-ray crystallography or NMR spectroscopy). Because both protein and RNA structures are more evolutionarily conserved than sequence,[17] structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.
Structural alignments are used as the "gold standard" in evaluating alignments for homology-based protein structure prediction[18] because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, structural alignments clearly cannot be used in structure prediction itself, because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.[18]
DALI
The DALI method, or distance matrix alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences.[19] It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the Protein Data Bank (PDB). It has been used to construct the FSSP structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at DALI and the FSSP is located at The Dali Database.
SSAP
SSAP (sequential structure alignment program) is a dynamic programming-based method of structural alignment that uses atom-to-atom vectors in structure space as comparison points. It has been extended since its original description to include multiple as well as pairwise alignments,[20] and has been used in the construction of the CATH (Class, Architecture, Topology, Homology) hierarchical database classification of protein folds.[21] The CATH database can be accessed at CATH Protein Structure Classification.
Combinatorial extension
The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment.[22] Based on measures such as rigid-body root mean square distance, residue distances, local secondary structure, and surrounding environmental features such as residue neighbor hydrophobicity, local alignments called "aligned fragment pairs" are generated and used to build a similarity matrix representing all possible structural alignments within predefined cutoff criteria. A path from one protein structure state to the other is then traced through the matrix by extending the growing alignment one fragment at a time. The optimal such path defines the combinatorial-extension alignment. A web-based server implementing the method and providing a database of pairwise alignments of structures in the Protein Data Bank is located at the Combinatorial Extension website.
Phylogenetic analysis
Main article: Computational phylogenetics
Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness.[23] The field of phylogenetics makes extensive use of sequence alignments in the construction and interpretation of phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young most recent common ancestor, while low identity suggests that the divergence is more ancient. This approximation reflects the "molecular clock" hypothesis, under which a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the coalescence time). It assumes that the effects of mutation and selection are constant across sequence lineages, and therefore does not account for possible differences among organisms or species in the rates of DNA repair or for the possible functional conservation of specific regions in a sequence. In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between silent mutations that do not alter the meaning of a given codon and other mutations that result in a different amino acid being incorporated into the protein. More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.
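As a rough illustration of how sequence identity can be turned into a distance and then into a divergence time under a strict molecular clock, the following Python sketch may help. It is not taken from the article's sources: the Jukes-Cantor correction is one simple substitution model chosen here as an assumption, and the function names and the substitution rate passed to divergence_time are hypothetical.

```python
import math


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gapped columns are ignored in this simple measure
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return matches / compared


def jukes_cantor_distance(identity: float) -> float:
    """Expected substitutions per site for nucleotides under the Jukes-Cantor model."""
    p = 1.0 - identity  # observed proportion of differing sites
    if p >= 0.75:
        raise ValueError("sequences are too divergent for this correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def divergence_time(distance: float, rate: float) -> float:
    """Time since divergence assuming a strict clock.

    `rate` is a hypothetical substitution rate per site per unit time;
    both lineages accumulate changes, hence the factor of two.
    """
    return distance / (2.0 * rate)


if __name__ == "__main__":
    ident = percent_identity("ACGTACGTAC", "ACGTTCGTAC")
    dist = jukes_cantor_distance(ident)
    print(ident, dist, divergence_time(dist, rate=1e-9))
```

The corrected distance is always at least as large as the raw proportion of differences, because repeated substitutions at the same site are hidden in the observed identity. The constant-rate assumption in divergence_time is exactly the one the paragraph above cautions against.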
Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly heuristic because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is NP-hard.[24]
Assessment of significance
Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, convergent evolution can, in principle, produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.
In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.
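One simple way to get a feel for alignment significance is a shuffling test: score the real pair, then score the query against many shuffled copies of the other sequence and see how unusual the real score is. The Python sketch below is only an illustration under stated assumptions. It is a Monte Carlo estimate, not the analytical statistics that BLAST uses, and the scoring values (match 2, mismatch -1, gap -2) and function names are arbitrary choices made for this example.

```python
import math
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman) with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def shuffle_significance(a, b, trials=200, seed=0):
    """Return (observed score, Z-score, empirical p-value) from shuffling b."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    observed = local_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(trials):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        null_scores.append(local_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    z = (observed - mean) / sd if sd > 0 else math.nan
    # Add-one estimate so the p-value is never reported as exactly zero.
    p = (1 + sum(s >= observed for s in null_scores)) / (trials + 1)
    return observed, z, p


if __name__ == "__main__":
    print(shuffle_significance("ACGTACGTTGCAACGT", "TTACGTACGTTGCA"))
```

Shuffling preserves the composition of the sequence, so the test asks whether the order of residues matters. It does not account for database size, which is why database search tools report values that depend on the search space.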
Methods of statistical significance estimation for gapped sequence alignments are available in the literature.[23][25][26][27][28][29][30][31]
Assessment of credibility
Statistical significance indicates the probability that an alignment of a given quality could arise by chance, but does not indicate how much superior a given alignment is to alternative alignments of the same sequences. Measures of alignment credibility indicate the extent to which the best scoring alignments for a given pair of sequences are substantially similar. Methods of alignment credibility estimation for gapped sequence alignments are available in the literature.[32]
Scoring functions
The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Point Accepted Mutation matrices, originally defined by Margaret Dayhoff and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as BLOSUM (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. Gap penalties account for the introduction of a gap (in the evolutionary model, an insertion or deletion mutation) in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the expected rate of such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function.
It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.
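The effect of the scoring function and gap penalty can be explored with a small global alignment routine. The following Python sketch of the Needleman–Wunsch algorithm is illustrative only: it uses a linear gap penalty and a toy match/mismatch score rather than a PAM or BLOSUM matrix, and the function names are assumptions made for this example. Integer scores are used so that the traceback can compare cell values exactly.

```python
def simple_score(x, y):
    """Toy substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def needleman_wunsch(a, b, score=simple_score, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (optimal score, aligned a, aligned b). `score` must return integers.
    """
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                F[i - 1][j] + gap,                            # gap in b
                F[i][j - 1] + gap,                            # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Re-run the same pair with different gap penalties and compare the results.
    for gap in (-1, -2, -4):
        total, top, bottom = needleman_wunsch("GATTACA", "GCATGCT", gap=gap)
        print(f"gap={gap}: score={total}\n  {top}\n  {bottom}")
```

Columns that stay the same across the different gap penalties are the robust parts of the alignment; columns that move are the ones where the solution is weak or non-unique. When several alignments tie, this traceback returns only one of them.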
Other biological uses
Sequenced RNA, such as expressed sequence tags and full-length mRNAs, can be aligned to a sequenced genome to locate genes and to obtain information about alternative splicing[33] and RNA editing.[34] Sequence alignment is also a part of genome assembly, where sequences are aligned to find overlap so that contigs (long stretches of sequence) can be formed.[35] Another use is SNP analysis, where sequences from different individuals are aligned to find single base pairs that often differ in a population.[36]
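The SNP use case can be sketched in a few lines: once sequences from several individuals are aligned, candidate variable sites are the columns in which more than one base appears. The Python below is a minimal sketch under simplifying assumptions. The function name and the min_minor_count threshold are hypothetical, gap and N characters are ignored, and sequencing error is not modeled.

```python
from collections import Counter


def variable_sites(aligned, min_minor_count=1):
    """Return (column index, base counts) for columns showing more than one base.

    `aligned` is a list of equal-length aligned sequences. A column is reported
    only if its second most common base occurs at least `min_minor_count` times.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for col, bases in enumerate(zip(*aligned)):
        counts = Counter(b for b in bases if b not in "-N")
        if len(counts) < 2:
            continue  # invariant column, or nothing but gaps
        minor = sorted(counts.values())[-2]  # count of the second most common base
        if minor >= min_minor_count:
            sites.append((col, dict(counts)))
    return sites


if __name__ == "__main__":
    individuals = ["ACGTAC", "ACGTAC", "ACCTAC", "ACCTA-"]
    print(variable_sites(individuals))
```

Raising min_minor_count filters out columns where only a single individual differs, which is one crude way to separate variants that recur in a population from one-off differences.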
Non-biological uses
The methods used for biological sequence alignment have also found applications in other fields, most notably in natural language processing and in social sciences, where the Needleman-Wunsch algorithm is usually referred to as Optimal matching.[37] Techniques that generate the set of elements from which words will be selected in natural-language generation algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of computer-generated mathematical proofs.[38] In the field of historical and comparative linguistics, sequence alignment has been used to partially automate the comparative method by which linguists traditionally reconstruct languages.[39] Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time.[40]
Software
Main article: Sequence alignment software
A more complete list of available software categorized by algorithm and alignment type is available at sequence alignment software, but common software tools used for general sequence alignment tasks include ClustalW2[41] and T-coffee[42] for alignment, and BLAST[43] and FASTA3x[44] for database searching. Commercial tools such as DNASTAR Lasergene, Geneious, and PatternHunter are also available. Tools annotated as performing sequence alignment are listed in the bio.tools registry.
Alignment algorithms and software can be directly compared to one another using a standardized set of benchmark reference multiple sequence alignments known as BAliBASE.[45] The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.[46][47] A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.[48]
See also
- Sequence homology
- Sequence mining
- BLAST
- String searching algorithm
- Alignment-free sequence analysis
- UGENE
- Needleman–Wunsch algorithm
References
- Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis (2nd ed.). Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY. ISBN 978-0-87969-608-5.
- "Clustal FAQ #Symbols". Clustal. Retrieved 8 December 2014.
- Ng PC; Henikoff S (May 2001). "Predicting deleterious amino acid substitutions". Genome Res. 11 (5): 863–74. doi:10.1101/gr.176601. PMC 311071. PMID 11337480.
- Polyanovsky, V. O.; Roytberg, M. A.; Tumanyan, V. G. (2011). "Comparative analysis of the quality of a global algorithm and a local algorithm for alignment of two sequences". Algorithms for Molecular Biology. 6 (1): 25. doi:10.1186/1748-7188-6-25. PMC 3223492. PMID 22032267.
- Schneider TD; Stephens RM (1990). "Sequence logos: a new way to display consensus sequences". Nucleic Acids Res. 18 (20): 6097–6100. doi:10.1093/nar/18.20.6097. PMC 332411. PMID 2172928.
- "Sequence Alignment/Map Format Specification" (PDF).
- Brudno M; Malde S; Poliakov A; Do CB; Couronne O; Dubchak I; Batzoglou S (2003). "Glocal alignment: finding rearrangements during alignment". Bioinformatics. 19. Suppl 1 (90001): i54–62. doi:10.1093/bioinformatics/btg1005. PMID 12855437.
- Wang L; Jiang T. (1994). "On the complexity of multiple sequence alignment". J Comput Biol. 1 (4): 337–48. CiteSeerX 10.1.1.408.894. doi:10.1089/cmb.1994.1.337. PMID 8790475.
- Elias, Isaac (2006). "Settling the intractability of multiple alignment". J Comput Biol. 13 (7): 1323–1339. CiteSeerX 10.1.1.6.256. doi:10.1089/cmb.2006.13.1323. PMID 17037961.
- Lipman DJ; Altschul SF; Kececioglu JD (1989). "A tool for multiple sequence alignment". Proc Natl Acad Sci USA. 86 (12): 4412–5. Bibcode:1989PNAS...86.4412L. doi:10.1073/pnas.86.12.4412. PMC 287279. PMID 2734293.
- Higgins DG; Sharp PM (1988). "CLUSTAL: a package for performing multiple sequence alignment on a microcomputer". Gene. 73 (1): 237–44. doi:10.1016/0378-1119(88)90330-7. PMID 3243435.
- Thompson JD; Higgins DG; Gibson TJ. (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice". Nucleic Acids Res. 22 (22): 4673–80. doi:10.1093/nar/22.22.4673. PMC 308517. PMID 7984417.
- Chenna R; Sugawara H; Koike T; Lopez R; Gibson TJ; Higgins DG; Thompson JD. (2003). "Multiple sequence alignment with the Clustal series of programs". Nucleic Acids Res. 31 (13): 3497–500. doi:10.1093/nar/gkg500. PMC 168907. PMID 12824352.
- Notredame C; Higgins DG; Heringa J. (2000). "T-Coffee: A novel method for fast and accurate multiple sequence alignment". J Mol Biol. 302 (1): 205–17. doi:10.1006/jmbi.2000.4042. PMID 10964570.
- Hirosawa M; Totoki Y; Hoshida M; Ishikawa M. (1995). "Comprehensive study on iterative algorithms of multiple sequence alignment". Comput Appl Biosci. 11 (1): 13–8. doi:10.1093/bioinformatics/11.1.13. PMID 7796270.
- Karplus K; Barrett C; Hughey R. (1998). "Hidden Markov models for detecting remote protein homologies". Bioinformatics. 14 (10): 846–856. doi:10.1093/bioinformatics/14.10.846. PMID 9927713.
- Chothia C; Lesk AM. (April 1986). "The relation between the divergence of sequence and structure in proteins". EMBO J. 5 (4): 823–6. PMC 1166865. PMID 3709526.
- Zhang Y; Skolnick J. (2005). "The protein structure prediction problem could be solved using the current PDB library". Proc Natl Acad Sci USA. 102 (4): 1029–34. Bibcode:2005PNAS..102.1029Z. doi:10.1073/pnas.0407152101. PMC 545829. PMID 15653774.
- Holm L; Sander C (1996). "Mapping the protein universe". Science. 273 (5275): 595–603. Bibcode:1996Sci...273..595H. doi:10.1126/science.273.5275.595. PMID 8662544.
- Taylor WR; Flores TP; Orengo CA. (1994). "Multiple protein structure alignment". Protein Sci. 3 (10): 1858–70. doi:10.1002/pro.5560031025. PMC 2142613. PMID 7849601.
- Orengo CA; Michie AD; Jones S; Jones DT; Swindells MB; Thornton JM (1997). "CATH--a hierarchic classification of protein domain structures". Structure. 5 (8): 1093–108. doi:10.1016/S0969-2126(97)00260-8. PMID 9309224.
- Shindyalov IN; Bourne PE. (1998). "Protein structure alignment by incremental combinatorial extension (CE) of the optimal path". Protein Eng. 11 (9): 739–47. doi:10.1093/protein/11.9.739. PMID 9796821.
- Ortet P; Bastien O (2010). "Where Does the Alignment Score Distribution Shape Come from?". Evolutionary Bioinformatics. 6: 159–187. doi:10.4137/EBO.S5875. PMC 3023300. PMID 21258650.
- Felsenstein J. (2004). Inferring Phylogenies. Sinauer Associates: Sunderland, MA. ISBN 978-0-87893-177-4.
- Altschul SF; Gish W (1996). Local Alignment Statistics. Methods in Enzymology. 266. pp. 460–480. doi:10.1016/S0076-6879(96)66029-7. ISBN 9780121821678.
- Hartmann AK (2002). "Sampling rare events: statistics of local sequence alignments". Phys. Rev. E. 65 (5): 056102. arXiv:cond-mat/0108201. Bibcode:2002PhRvE..65e6102H. doi:10.1103/PhysRevE.65.056102. PMID 12059642.
- Newberg LA (2008). "Significance of gapped sequence alignments". J Comput Biol. 15 (9): 1187–1194. doi:10.1089/cmb.2008.0125. PMC 2737730. PMID 18973434.
- Eddy SR (2008). Rost, Burkhard (ed.). "A probabilistic model of local sequence alignment that simplifies statistical significance estimation". PLoS Comput Biol. 4 (5): e1000069. Bibcode:2008PLSCB...4E0069E. doi:10.1371/journal.pcbi.1000069. PMC 2396288. PMID 18516236.
- Bastien O; Aude JC; Roy S; Marechal E (2004). "Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics". Bioinformatics. 20 (4): 534–537. doi:10.1093/bioinformatics/btg440. PMID 14990449.
- Agrawal A; Huang X (2011). "Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 8 (1): 194–205. doi:10.1109/TCBB.2009.69. PMID 21071807. Archived from the original on 2013-04-15.
- Agrawal A; Brendel VP; Huang X (2008). "Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment". International Journal of Computational Biology and Drug Design. 1 (4): 347–367. doi:10.1504/IJCBDD.2008.022207. Archived from the original on 28 January 2013.
- Newberg LA; Lawrence CE (2009). "Exact Calculation of Distributions on Integers, with Application to Sequence Alignment". J Comput Biol. 16 (1): 1–18. doi:10.1089/cmb.2008.0137. PMC 2858568. PMID 19119992.
- Kim N; Lee C (2008). Bioinformatics detection of alternative splicing. Methods in Molecular Biology. 452. pp. 179–97. doi:10.1007/978-1-60327-159-2_9. ISBN 978-1-58829-707-5. PMID 18566765.
- Li JB; Levanon EY; Yoon JK; et al. (May 2009). "Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing". Science. 324 (5931): 1210–3. Bibcode:2009Sci...324.1210L. doi:10.1126/science.1170995. PMID 19478186.
- Blazewicz J; Bryja M; Figlerowicz M; et al. (June 2009). "Whole genome assembly from 454 sequencing output via modified DNA graph concept". Comput Biol Chem. 33 (3): 224–30. doi:10.1016/j.compbiolchem.2009.04.005. PMID 19477687.
- Duran C; Appleby N; Vardy M; Imelfort M; Edwards D; Batley J (May 2009). "Single nucleotide polymorphism discovery in barley using autoSNPdb". Plant Biotechnol. J. 7 (4): 326–33. doi:10.1111/j.1467-7652.2009.00407.x. PMID 19386041.
- Abbott A.; Tsay A. (2000). "Sequence Analysis and Optimal Matching Methods in Sociology, Review and Prospect". Sociological Methods and Research. 29 (1): 3–33. doi:10.1177/0049124100029001001.
- Barzilay R; Lee L. (2002). "Bootstrapping Lexical Choice via Multiple-Sequence Alignment" (PDF). Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 10: 164–171. arXiv:cs/0205065. doi:10.3115/1118693.1118715.
- Kondrak, Grzegorz (2002). "Algorithms for Language Reconstruction" (PDF). University of Toronto, Ontario. Retrieved 2007-01-21.
- Prinzie A.; Van den Poel D. (2006). "Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM". Decision Support Systems. 42 (2): 508–526. doi:10.1016/j.dss.2005.02.004. See also Prinzie A.; Van den Poel D. (2007). "Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB models". Decision Support Systems. 44 (1): 28–45. doi:10.1016/j.dss.2007.02.008.
- EMBL-EBI. "ClustalW2 < Multiple Sequence Alignment < EMBL-EBI". www.EBI.ac.uk. Retrieved 12 June 2017.
- T-coffee
- "BLAST: Basic Local Alignment Search Tool". blast.ncbi.nlm.NIH.gov. Retrieved 12 June 2017.
- "UVA FASTA Server". fasta.bioch.Virginia.edu. Retrieved 12 June 2017.
- Thompson JD; Plewniak F; Poch O (1999). "BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs". Bioinformatics. 15 (1): 87–8. doi:10.1093/bioinformatics/15.1.87. PMID 10068696.
- BAliBASE
- Thompson JD; Plewniak F; Poch O. (1999). "A comprehensive comparison of multiple sequence alignment programs". Nucleic Acids Res. 27 (13): 2682–90. doi:10.1093/nar/27.13.2682. PMC 148477. PMID 10373585.
- "Multiple sequence alignment: Strap". 3d-alignment.eu. Retrieved 12 June 2017.