MUSCLE
Robert C. Edgar
Dept. of Plant and Microbial Biology, University of California at Berkeley
Automated multiple alignment of protein sequences is a fundamental tool in computational biology, but
current software is sometimes unable to process the large amounts of data derived from current sequencing
projects. The most popular program, CLUSTALW, can handle no more than a few hundred sequences, and
sometimes produces low-quality alignments. T-Coffee is more accurate, but is limited to around 100 sequences.
We describe a new algorithm, MUSCLE, which achieves the highest accuracy scores reported to date on four
different benchmarks, with average accuracy from 1.5 to 5% better than T-Coffee. MUSCLE is comparable in
speed to CLUSTALW on typical input data. With settings for optimal speed, MUSCLE has average accuracy that
is statistically indistinguishable from T-Coffee and is the fastest reported algorithm for large numbers of
sequences, able to align 5,000 sequences of length 350 in less than 10 minutes on a desktop computer. The
MUSCLE program and source code is freely available at http://www.drive5.com/muscle.
A New Hardware Architecture for Genomic Sequence Alignment
Greg Knowles and Paul Gardner-Stephen
School of Informatics and Engineering, Flinders University
In this paper we describe a novel hardware architecture for Genomic search and sequence alignment which
achieves a speed-up of between 100–200 times NCBI-Blast. Heuristic algorithms such as NCBI-Blast are
indispensable for searching today's large Genomic databases. A significant contributor to search time for
such algorithms is the dynamic programming evaluations, NCBI-Blast spends around 76% of its time budget in
this area. In a companion paper we introduce our search algorithm, DASH, which outperforms NCBI-Blast by
an order of magnitude in software, and has equivalent sensitivity. DASH is designed around the principle
f considering Genomic sequence alignments to typically consist of regions of high homology (ie the ¬diagonals¬)
interspersed with regions of low homology. In DASH, the optimal solution consists of such diagonals joined by
regions of exact dynamic programming. This is affordable due to the small area of these inter-connecting
regions. Extensive testing has shown that DASH spends over 90% of its time in finding the diagonals, and
less than 10% in dynamic programming. Accordingly, we have designed a chip which finds the diagonals directly
in hardware. On a Xilinx Virtex II FPGA (Field Programmable Gate Array) it performs 5,000 base comparisons
in parallel, for a throughput of 5.5E11 base comparisons/second. When combined with the inter-region dynamic
programming in software, we achieve a system speed-up of between 100-200 times Blast running on a 1.4GHz AMD
Opteron processor.
Return to Poster Abstract Index
Return to Top
DASHfaster: Accurate Sequence Alignment
Paul Gardner-Stephen and Greg Knowles
School of Informatics and Engineering, Flinders University
In this paper we present our genomic and protenomic sequence alignment algorithm, DASH, which results in
order of magnitude speed improvement when compared to NCBI-BLAST 2.2.6, with equivalent or superior sensitivity.
Dynamic programming is the predominant contributor to search time for algorithms such as BLAST and FastA.
Improving the efficiency of dynamic programming provides an opportunity to increase sensitivity, or
significantly reduce search times and help offset the effects of the continuing exponential growth in
atabase sizes in excess of Moore's Law. Specifically, for protein searching we have demonstrated an order
of magnitude speed up with equivalent sensitivity, or alternatively ~30% speed up with significantly
improved sensitivity, depending on the parameters selected. Smith-Waterman complete dynamic programming is
used as the sensitivity benchmark. Similar results are presented for nucleotide alignment. DASH is designed
around the principle of considering sequence alignments to typically consist of regions of high homology
(the ¬ diagonals ¬ ) interspersed with regions of low homology. Our technique localizes dynamic programming
to those regions of low homology and the extremities. Combined with an indexed approach to rapidly finding
the high homology diagonals, we achieve significant speed improvements. Given the efficiency of this approach,
the time budget available can be leveraged to provide superior sensitivity compared to NCBI-BLAST. Since our
algorithm is highly parallel, we have developed dedicated hardware which we will present in a companion paper,
and a distributed version of our software with approximately linear scalability, which we have employed to
provide an NCBI style web search service to our collaborators.
Return to Poster Abstract Index
Return to Top
SUPERCONTIGS: A Contig Scaffolding Tool
Daniela Puiu
Virginia Commonwealth University
SUPERCONTIGS is a genome-finishing program that orders, orients and groups the contigs generated by a
shotgun sequence assembly program based on clone pair information and an alignment with a related genome.
The program can be used in the genome finishing and gap closing process. The main idea developed in
SUPERCONTIGS is the combination of assembly and alignment results towards scaffold generation. This is
done by locating clone pairs assembled in different contigs and based on their position in the contigs and
the contigs alignment(s) to the related genome, compute their absolute alignment coordinates. If the clones
are located within a maximum clone library insert size and have opposite orientations, it is most likely
that the contigs they belong to are adjacent and can be clustered together. The contigs are hierarchically
clustered into scaffolds based on transitivity relation. The order and size of the scaffolds is computed,
followed by an estimation of the gap types (physical/sequence) and sizes. The results are formatted and saved
for analysis and visualization. The tool was initially developed for the Cryptosporidium hominis genome
assembly. C. hominis is an 8 chromosome intestinal parasite whose closest relative genome, Cryptosporidium
parvum was almost complete and used for alignment. SUPERCONTIGS took the 1,573 Phrap-generated C. hominis
contigs and the nucmer alignment to the C. parvum genome and grouped them into 305 scaffolds, separated by
226 physical gaps. These numbers were much smaller than the ones obtained using other scaffolding tools
(ex: Bambus), indicating a better performance.
Return to Poster Abstract Index
Return to Top
Multiple Alignment of Rearranged Genomes
Aaron Darling, Bob Mau, Mark Craven, and Nicole T. Perna
University of Wisconsin-Madison
As organisms evolve, their genomes undergo both small and large scale changes that together can have a
profound impact on the organism's phenotype. Through genome comparison, we hope to glean information
that reveals the genotypic differences responsible for novel phenotypes. Numerous multiple genome alignment
and comparison methods have been developed in response to the ready availability of genome sequence data.
Of these methods, none perform global alignment of an arbitrary set of genomes containing rearrangements.
We have developed an integrated method to rapidly align several genome sequences in the presence of large-scale
rearrangement events. In order to rapidly align genomes, the algorithm uses alignment anchors to reduce the
number of possible alignments. However, unlike previous methods which enforced strict collinearity among
alignment anchors, our algorithm breaks anchors into groups of local collinearity called Locally Collinear
Blocks (LCBs). To avoid spurious matches arising from random sequence similarity, each LCB must meet a
minimum sequence identity criteria. The implementation, Mauve, identifies large-scale changes such as
horizontal transfer, inversion, and rearrangement in addition to local changes such as nucleotide substitutions
and indels. In order to evaluate the quality of genome alignments produced by Mauve and by other software,
we designed a model of bacterial genome evolution that allows us to simulate evolution. In addition to the
evolved genome sequences, the simulation produces a 'correct' alignment that provides a basis for evaluating
the quality of calculated genome alignments. The evaluations of alignment quality suggest directions for
further refinement of genome alignment models.
Return to Poster Abstract Index
Return to Top
A Novel Genome Assembler: Using K-Mers to Indirectly PerformN-Squared Comparisons in O(N)
Ho Joon Lee, Stephanie Rogers, Maulik Shah, and Jeffrey Touchman
Computational Biosciences, Arizona State University
We have designed and implemented a novel, exhaustive approach to genome assembly. We aimed eliminate
traditional heuristics and indirectly compare each sequence fragment to all the others in less than
O(N2 ) running time. The first step in the algorithm involves building a k-mer library; accomplished
by scanning through each sequence fragment using a sliding window of size k and cataloging each of
these k-mers and the sequence fragment in which it occurred. We assume that neighboring fragments from
the genome will share k-mers. Therefore, from this k-mer table, we build an adjacency table; cataloging
each sequence fragment and its neighbors. This adjacency table is generated in O(N) and represents all
N2 comparisons. Finally, with the information in the adjacency table, we performed multiple breadth-first
searches to collect and separate the connected fragments. We assume that there exist disjoint graphs in
the adjacency table and that each such graph represents an entire contig. We performed this process on
two datasets, a simulated set and a real set: both have >40,000 sequence fragments of ~1,000 kb and 9-fold
coverage. In both instances, the majority of the fragments were assigned to one contig.
Return to Poster Abstract Index
Return to Top
|