CSB2004

POSTER ABSTRACTS:
SEQUENCE ALIGNMENT

MUSCLE
Robert C. Edgar
Dept. of Plant and Microbial Biology, University of California at Berkeley

Automated multiple alignment of protein sequences is a fundamental tool in computational biology, but current software is sometimes unable to process the large amounts of data derived from current sequencing projects. The most popular program, CLUSTALW, can handle no more than a few hundred sequences, and sometimes produces low-quality alignments. T-Coffee is more accurate, but is limited to around 100 sequences. We describe a new algorithm, MUSCLE, which achieves the highest accuracy scores reported to date on four different benchmarks, with average accuracy from 1.5 to 5% better than T-Coffee. MUSCLE is comparable in speed to CLUSTALW on typical input data. With settings for optimal speed, MUSCLE has average accuracy that is statistically indistinguishable from T-Coffee and is the fastest reported algorithm for large numbers of sequences, able to align 5,000 sequences of length 350 in less than 10 minutes on a desktop computer. The MUSCLE program and source code is freely available at http://www.drive5.com/muscle.

A New Hardware Architecture for Genomic Sequence Alignment
Greg Knowles and Paul Gardner-Stephen
School of Informatics and Engineering, Flinders University

In this paper we describe a novel hardware architecture for Genomic search and sequence alignment which achieves a speed-up of between 100–200 times NCBI-Blast. Heuristic algorithms such as NCBI-Blast are indispensable for searching today's large Genomic databases. A significant contributor to search time for such algorithms is the dynamic programming evaluations, NCBI-Blast spends around 76% of its time budget in this area. In a companion paper we introduce our search algorithm, DASH, which outperforms NCBI-Blast by an order of magnitude in software, and has equivalent sensitivity. DASH is designed around the principle f considering Genomic sequence alignments to typically consist of regions of high homology (ie the ¬diagonals¬) interspersed with regions of low homology. In DASH, the optimal solution consists of such diagonals joined by regions of exact dynamic programming. This is affordable due to the small area of these inter-connecting regions. Extensive testing has shown that DASH spends over 90% of its time in finding the diagonals, and less than 10% in dynamic programming. Accordingly, we have designed a chip which finds the diagonals directly in hardware. On a Xilinx Virtex II FPGA (Field Programmable Gate Array) it performs 5,000 base comparisons in parallel, for a throughput of 5.5E11 base comparisons/second. When combined with the inter-region dynamic programming in software, we achieve a system speed-up of between 100-200 times Blast running on a 1.4GHz AMD Opteron processor.

Return to Poster Abstract Index
Return to Top

DASHfaster: Accurate Sequence Alignment
Paul Gardner-Stephen and Greg Knowles
School of Informatics and Engineering, Flinders University

In this paper we present our genomic and protenomic sequence alignment algorithm, DASH, which results in order of magnitude speed improvement when compared to NCBI-BLAST 2.2.6, with equivalent or superior sensitivity. Dynamic programming is the predominant contributor to search time for algorithms such as BLAST and FastA. Improving the efficiency of dynamic programming provides an opportunity to increase sensitivity, or significantly reduce search times and help offset the effects of the continuing exponential growth in atabase sizes in excess of Moore's Law. Specifically, for protein searching we have demonstrated an order of magnitude speed up with equivalent sensitivity, or alternatively ~30% speed up with significantly improved sensitivity, depending on the parameters selected. Smith-Waterman complete dynamic programming is used as the sensitivity benchmark. Similar results are presented for nucleotide alignment. DASH is designed around the principle of considering sequence alignments to typically consist of regions of high homology (the ¬ diagonals ¬ ) interspersed with regions of low homology. Our technique localizes dynamic programming to those regions of low homology and the extremities. Combined with an indexed approach to rapidly finding the high homology diagonals, we achieve significant speed improvements. Given the efficiency of this approach, the time budget available can be leveraged to provide superior sensitivity compared to NCBI-BLAST. Since our algorithm is highly parallel, we have developed dedicated hardware which we will present in a companion paper, and a distributed version of our software with approximately linear scalability, which we have employed to provide an NCBI style web search service to our collaborators.

Return to Poster Abstract Index
Return to Top

SUPERCONTIGS: A Contig Scaffolding Tool
Daniela Puiu
Virginia Commonwealth University

SUPERCONTIGS is a genome-finishing program that orders, orients and groups the contigs generated by a shotgun sequence assembly program based on clone pair information and an alignment with a related genome. The program can be used in the genome finishing and gap closing process. The main idea developed in SUPERCONTIGS is the combination of assembly and alignment results towards scaffold generation. This is done by locating clone pairs assembled in different contigs and based on their position in the contigs and the contigs alignment(s) to the related genome, compute their absolute alignment coordinates. If the clones are located within a maximum clone library insert size and have opposite orientations, it is most likely that the contigs they belong to are adjacent and can be clustered together. The contigs are hierarchically clustered into scaffolds based on transitivity relation. The order and size of the scaffolds is computed, followed by an estimation of the gap types (physical/sequence) and sizes. The results are formatted and saved for analysis and visualization. The tool was initially developed for the Cryptosporidium hominis genome assembly. C. hominis is an 8 chromosome intestinal parasite whose closest relative genome, Cryptosporidium parvum was almost complete and used for alignment. SUPERCONTIGS took the 1,573 Phrap-generated C. hominis contigs and the nucmer alignment to the C. parvum genome and grouped them into 305 scaffolds, separated by 226 physical gaps. These numbers were much smaller than the ones obtained using other scaffolding tools (ex: Bambus), indicating a better performance.

Return to Poster Abstract Index
Return to Top

Multiple Alignment of Rearranged Genomes
Aaron Darling, Bob Mau, Mark Craven, and Nicole T. Perna
University of Wisconsin-Madison

As organisms evolve, their genomes undergo both small and large scale changes that together can have a profound impact on the organism's phenotype. Through genome comparison, we hope to glean information that reveals the genotypic differences responsible for novel phenotypes. Numerous multiple genome alignment and comparison methods have been developed in response to the ready availability of genome sequence data. Of these methods, none perform global alignment of an arbitrary set of genomes containing rearrangements. We have developed an integrated method to rapidly align several genome sequences in the presence of large-scale rearrangement events. In order to rapidly align genomes, the algorithm uses alignment anchors to reduce the number of possible alignments. However, unlike previous methods which enforced strict collinearity among alignment anchors, our algorithm breaks anchors into groups of local collinearity called Locally Collinear Blocks (LCBs). To avoid spurious matches arising from random sequence similarity, each LCB must meet a minimum sequence identity criteria. The implementation, Mauve, identifies large-scale changes such as horizontal transfer, inversion, and rearrangement in addition to local changes such as nucleotide substitutions and indels. In order to evaluate the quality of genome alignments produced by Mauve and by other software, we designed a model of bacterial genome evolution that allows us to simulate evolution. In addition to the evolved genome sequences, the simulation produces a 'correct' alignment that provides a basis for evaluating the quality of calculated genome alignments. The evaluations of alignment quality suggest directions for further refinement of genome alignment models.

Return to Poster Abstract Index
Return to Top

A Novel Genome Assembler: Using K-Mers to Indirectly PerformN-Squared Comparisons in O(N)
Ho Joon Lee, Stephanie Rogers, Maulik Shah, and Jeffrey Touchman
Computational Biosciences, Arizona State University

We have designed and implemented a novel, exhaustive approach to genome assembly. We aimed eliminate traditional heuristics and indirectly compare each sequence fragment to all the others in less than O(N2 ) running time. The first step in the algorithm involves building a k-mer library; accomplished by scanning through each sequence fragment using a sliding window of size k and cataloging each of these k-mers and the sequence fragment in which it occurred. We assume that neighboring fragments from the genome will share k-mers. Therefore, from this k-mer table, we build an adjacency table; cataloging each sequence fragment and its neighbors. This adjacency table is generated in O(N) and represents all N2 comparisons. Finally, with the information in the adjacency table, we performed multiple breadth-first searches to collect and separate the connected fragments. We assume that there exist disjoint graphs in the adjacency table and that each such graph represents an entire contig. We performed this process on two datasets, a simulated set and a real set: both have >40,000 sequence fragments of ~1,000 kb and 9-fold coverage. In both instances, the majority of the fragments were assigned to one contig.

Return to Poster Abstract Index
Return to Top

HOME • REGISTRATION • PAPERS • POSTERS • TUTORIALS • PROGRAM • KEYNOTE SPEAKERS • INVITED SPEAKERS
SPECIAL EVENTS • COMMITTEES • SPONSORS • NEWS ROOM • CONTACT US • PREVIOUS CONFERENCES