Podium Presentations
Pathways, Networks and Systems Biology (Session IV)
Can We Identify Cellular Pathways Implicated
in Cancer Using Gene Expression Data?
Nigam Shah Pennsylvania State University, University Park, PA
Jorge Lepre, Yuhai Tu and Gustavo Stolovitzky
IBM Computational Biology Center, T.J. Watson Research Center, Yorktown Heights, NY
Abstract
The cancer state of a cell is characterized by alterations of important
cellular processes such as cell proliferation, apoptosis, DNA-damage
repair, etc. Some of these alterations involve modifications of the
expression of genes that participate in the pathways responsible for
these processes. From this simple observation it follows that the
expression of genes associated with cancer related pathways should
exhibit differences between the normal and the cancerous states. We
explore various means to find those differences. Interestingly, these
differences can only be identified when groups of genes, as opposed
to isolated genes, are considered. Instead of using all the individual
genes on a DNA array in comparing cohorts of cancer patients with
control subjects, our analysis searches for signals within small subsets
of genes that are associated with specific pathways, thereby substantially
increasing the sensitivity of the analysis, while preserving the biological
information contained in the pathways. We analyze 6 different pathways
(p53, Ras, Brca, DNA damage repair, NFkb and β-catenin) and 4 different
types of cancer: colon, pancreas, prostate and kidney. Our results
are found to be mostly consistent with existing knowledge of the involvement
of these pathways in different cancers. Our analysis constitutes proof
of principle that it may be possible to predict the involvement of
a particular pathway in cancer or other diseases by using gene expression
data. Such a method would be particularly useful for diseases whose
biology is poorly understood.
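The group-of-genes idea in this abstract can be illustrated with a toy permutation test. This is only a sketch of pooling signal over a pathway's gene set, not the authors' actual statistic; the data layout (a dict of per-gene, per-sample expression values) and all names are invented for illustration:

```python
import random

def pathway_score(expr, genes, cases, controls):
    """Mean absolute difference in average expression over a gene set."""
    diffs = []
    for g in genes:
        case_mean = sum(expr[g][s] for s in cases) / len(cases)
        ctrl_mean = sum(expr[g][s] for s in controls) / len(controls)
        diffs.append(abs(case_mean - ctrl_mean))
    return sum(diffs) / len(diffs)

def permutation_p_value(expr, genes, cases, controls, n_perm=1000, seed=0):
    """Empirical p-value: the fraction of random case/control relabelings
    whose pathway score is at least the observed one."""
    rng = random.Random(seed)
    observed = pathway_score(expr, genes, cases, controls)
    samples = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(samples)
        if pathway_score(expr, genes, samples[:len(cases)],
                         samples[len(cases):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Pooling over the gene set is what buys sensitivity: a modest shift shared by several pathway genes can reach significance even when no single gene does.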
Statistical Inference for Well-ordered
Structures in Nucleotide Sequences
Shu-Yun Le Laboratory of Experimental and Computational Biology NCI Center for Cancer Research,
National Cancer Institute, NIH
Jih-H. Chen and Jacob V. Maizel, Jr.
Advanced Biomedical Computing Center
Abstract
Distinct, local structures are frequently correlated with functional
RNA elements involved in post-transcriptional regulation of gene expression.
Discovery of microRNAs (miRNAs) suggests that there is a large class
of small non-coding RNAs in eukaryotic genomes. These miRNAs have
the potential to form distinct fold-back stem-loop structures. The
prediction of those well-ordered folding sequences (WFS) in genomic
sequences is very helpful for our understanding of RNA-based gene
regulation and determination of local RNA elements with structure-dependent
functions. In this study, we describe a novel method for discovering
the local WFS in a nucleotide sequence by Monte Carlo simulation and
RNA folding. In this approach, the quality of a local WFS is assessed
by the energy difference (Ediff) between the optimal structure folded
in the local segment and its corresponding optimal, restrained structure
where all the previous base pairings formed are prohibited. Using
Monte Carlo simulations we can evaluate the random probabilities of
Ediff in a random sample. The distinct WFS can be discovered by scanning
successive segments along a sequence and evaluating the difference
between the Ediff of the natural sequence and those computed from randomly
shuffled sequences. Our results indicate that the statistically significant
WFS detected in the genomic sequences of Caenorhabditis elegans (C. elegans)
F49E12, T07C5, T07D1, T10H9, Y56A3A and Y71G12B are coincident with
known fold-back stem-loops found in miRNA precursors. The potential
and implications of our method in searching for miRNAs in genomes
is discussed.
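The shuffle-based significance machinery described here can be sketched as follows. The energy function below is a deliberately crude stand-in (it merely counts Watson-Crick pairs when a sequence is folded back on itself); a real application would call an RNA folding program to obtain the Ediff score, and the threshold and shuffle counts are illustrative only:

```python
import random

def toy_energy(seq):
    """Placeholder for an RNA folding energy: counts Watson-Crick
    complementary positions when the sequence is folded back on itself.
    A real analysis would substitute the Ediff from an RNA folding
    program here."""
    pair = {'A': 'U', 'U': 'A', 'G': 'C', 'C': 'G'}
    half = len(seq) // 2
    stem = sum(1 for i in range(half) if pair.get(seq[i]) == seq[-1 - i])
    return -stem  # more pairing = lower (better) energy

def shuffle_p_value(seq, energy=toy_energy, n_shuffles=500, seed=1):
    """Empirical probability that a mononucleotide-shuffled sequence
    scores at least as well (energy at least as low) as the real one."""
    rng = random.Random(seed)
    observed = energy(seq)
    letters = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if energy(''.join(letters)) <= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)
```

A sequence whose fold-back score is rarely matched by its shuffled versions is, in this sense, a statistically distinct well-ordered folding sequence.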
Combining Microarrays and Biological
Knowledge for Estimating Gene Networks via Bayesian Networks
Seiya Imoto Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan
Tomoyuki Higuchi The Institute of Statistical Mathematics, 4-6-7, Tokyo, Japan
Takao Goto Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan
Kousuke Tashiro Graduate School of Genetic Resources Technology, Kyushu University, Japan
Satoru Kuhara Graduate School of Genetic Resources Technology, Kyushu University, Japan
Satoru Miyano Human Genome Center, Institute of Medical Science, Univ. of Tokyo, Tokyo, Japan
Abstract
We propose a statistical method for estimating a gene network based on Bayesian networks
from microarray gene expression data together with biological knowledge
including protein-protein interactions, protein-DNA interactions, binding site
information, existing literature and so on. Unfortunately, microarray data do not contain
enough information for constructing gene networks accurately in many cases. Our method
adds biological knowledge to the estimation method of gene networks under a Bayesian statistical
framework, and also controls the trade-off between microarray information and biological
knowledge automatically. We conduct Monte Carlo simulations to show the effectiveness of
the proposed method. We analyze Saccharomyces cerevisiae gene expression data as an application.
Protein/RNA Structure Prediction and Modeling (Sessions VI & IX)
Towards
Index-based Similarity Search for Protein Structure Databases
Orhan Camoglu, Tamer Kahveci and Ambuj K. Singh
Department of Computer Science, University of California at Santa Barbara, Santa Barbara, CA
Abstract
We propose two methods for finding similarities in protein structure databases. Our techniques
extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. These
feature vectors are then indexed using a multidimensional index structure. Our first technique
considers the problem of finding proteins similar to a given query protein in a protein dataset.
This technique quickly finds promising proteins using the index structure. These proteins are
then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also
develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second
technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental
results show that our techniques improve the pruning time of VAST 3 to 3.5 times while keeping the sensitivity similar.
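The triplet-based feature extraction can be sketched as below. This is a minimal illustration: the three sorted pairwise distances between SSE midpoints are already invariant to rotation, translation, and triplet ordering, but the paper's actual feature vectors also encode SSE types, lengths, and angles:

```python
from itertools import combinations
from math import dist  # Python 3.8+

def triplet_features(sse_midpoints):
    """One feature vector per triplet of SSE midpoints: the three
    pairwise distances, sorted. (A sketch; the paper's features are
    richer.)"""
    feats = []
    for a, b, c in combinations(sse_midpoints, 3):
        feats.append(tuple(sorted((dist(a, b), dist(b, c), dist(a, c)))))
    return feats
```

Each such vector can then be inserted into a multidimensional index (e.g. an R-tree) so that candidate proteins sharing similar SSE triplets are retrieved before any expensive pairwise alignment is attempted.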
Local Similarity in RNA Secondary Structures
Matthias Höchsmann
International Graduate School in Bioinformatics and Genome Research, Universität Bielefeld, Bielefeld, Germany
Thomas Töller Graduiertenkolleg Bioinformatik, Universität, Bielefeld, Germany
Robert Giegerich and Stefan Kurtz Technische Fakultät, Universität Bielefeld, Germany
Abstract
We present a systematic treatment of alignment distance and local similarity algorithms on
trees and forests. We build upon the tree alignment algorithm for ordered trees given by
Jiang et al. (1995) and extend it to calculate local forest alignments, which is essential for
finding locally similar regions in RNA secondary structures. The time complexity of our algorithm
is O(|F1| · |F2| · deg(F1) · deg(F2) · (deg(F1) + deg(F2))),
where |Fi| is the number of nodes
in forest Fi and deg(Fi) is the degree of Fi. We provide carefully engineered dynamic programming
implementations using dense, two-dimensional tables, which considerably reduce the space requirement.
We suggest a new representation of RNA secondary structures as forests that allow reasonable scoring
of edit operations on RNA secondary structures. The comparison of RNA secondary structures is
facilitated by a new visualization technique for RNA secondary structure alignments. Finally, we show
how potential regulatory motifs can be discovered solely by their structural preservation, and
independent of their sequence conservation and position.
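The forest representation of RNA secondary structure can be sketched from dot-bracket notation: each base pair becomes an internal node and each unpaired base a leaf. This is a generic encoding of the kind tree-alignment algorithms operate on, not necessarily the paper's exact node labeling:

```python
def dotbracket_to_forest(structure):
    """Parse dot-bracket notation into a forest (list of top-level
    nodes). Nodes are ('P', children) for a base pair and ('U', [])
    for an unpaired base."""
    root = []
    stack = [root]  # stack of children lists
    for ch in structure:
        if ch == '.':
            stack[-1].append(('U', []))
        elif ch == '(':
            node = ('P', [])
            stack[-1].append(node)
            stack.append(node[1])
        elif ch == ')':
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
        else:
            raise ValueError("unexpected character %r" % ch)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return root

def degree(forest):
    """Maximum number of children of any node, counting the top level
    as children of a virtual root: the deg(F) term in the complexity
    bound above."""
    best = len(forest)
    for _, kids in forest:
        best = max(best, degree(kids))
    return best
```

A hairpin like "((..))." thus becomes a two-node forest (one paired region, one trailing unpaired base), and deg(F) stays small for typical secondary structures, which keeps the stated complexity bound manageable.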
CTSS: A Robust and Efficient Method for
Protein Structure Alignment Based on Local Geometrical and Biological Features
Tolga Can and Yuan-Fang Wang
Computer Science, UC Santa Barbara, Santa Barbara, CA
Abstract
We present a new method for conducting protein structure similarity searches, which improves on the
accuracy, robustness, and efficiency of some existing techniques. Our method is grounded in the theory
of differential geometry on 3D space curve matching. We generate shape signatures for proteins that
are invariant (i.e., they are not affected by the translation and rotation of a protein structure in
space), localized (i.e., the signature at each residue location is completely determined by the local
structure around that particular residue), robust (i.e., small perturbations of atomic coordinates induce small
changes in the associated signatures), compact (i.e., the size of the signatures is O(n), n: number
of residues), and biologically meaningful (i.e., we incorporate secondary structure assignment into
the signature). To improve matching accuracy, we smooth the noisy raw atomic coordinate data with
spline fitting. To improve matching efficiency, we adopt a hierarchical coarse-to-fine strategy. We
first use an efficient hashing-based technique to screen out unlikely candidates. We then perform
pairwise alignments only for a small number of candidates that survive the screening process. Contrary to other
hashing based techniques, our technique employs domain specific information (not just geometric
information) in constructing the hash key, and hence, is more tuned to the domain of biology.
Furthermore, the invariancy, localization, and compactness of the shape signatures allow us to utilize
a well-known local sequence alignment algorithm for aligning two protein structures. One measure of
the efficacy of the proposed technique is that we were able to discover new, meaningful motifs that
were not reported by other structure alignment methods. We have implemented our method as an interactive
tool that allows visual inspection of the alignment results and iterative discovery of possible suboptimal
alignments that may have biological importance.
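The invariance and locality properties claimed for the signatures can be illustrated with a discrete stand-in: the turning angle at each interior point of a 3D backbone polyline. This is only a sketch of a rotation- and translation-invariant local shape measure, not the paper's spline-smoothed differential-geometric signature:

```python
from math import acos, dist

def turning_angles(points):
    """Turning angle (radians) at each interior vertex of a 3D
    polyline. Depends only on local geometry, so it is unchanged by
    rotating or translating the whole curve."""
    angles = []
    for p, q, r in zip(points, points[1:], points[2:]):
        u = [q[i] - p[i] for i in range(3)]
        v = [r[i] - q[i] for i in range(3)]
        cosang = sum(a * b for a, b in zip(u, v)) / (dist(p, q) * dist(q, r))
        angles.append(acos(max(-1.0, min(1.0, cosang))))
    return angles
```

Because each per-residue value is invariant and local, a string of such values can be compared with a standard local sequence alignment algorithm, which is exactly the reduction the abstract exploits.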
Statistical
and Visual Morph Movie Analysis of Crystallographic Mutant Selection
Bias in Protein Mutation Resource Data
Werner G. Krebs Integrative Biosciences, San Diego Supercomputer Center,
National Partnership for Advanced Computational Infrastructure
Phil Bourne Department of Pharmacology, UC San Diego, CA
Abstract
Structural studies of the effects of non-silent mutations on protein
conformational change are an important key in deciphering the language
that relates protein amino acid primary structure to tertiary structure.
Elsewhere, we presented the Protein Mutant Resource (PMR) database,
a set of online tools that systematically identified groups of related
mutant structures in the Protein DataBank (PDB), accurately inferred
mutant classifications in the Gene Ontology using an innovative, statistically
rigorous data-mining algorithm with more general applicability, and
illustrated the relationship of these mutant structures via an intuitive
user interface. Here, we perform a comprehensive statistical analysis
of the effect of PMR mutations on protein tertiary structure. We find
that, although the PMR does contain spectacular examples of conformational
change, in general there is a counter-intuitive inverse relationship
between conformational change (measured as Cα displacement or RMS
of the core structure) and the number of mutations in a structure.
That is, post-translational modifications by structural biologists
present in the PDB contrast with naturally evolved mutations. We compare
the frequency of mutations in the PMR/PDB datasets against the accepted
PAM250 natural amino acid mutation frequency to confirm these observations.
We generated morph movies from PMR structure pairs using technology
previously developed for the Macromolecular Motions Database, http://molmovdb.org,
allowing bioinformaticists, geneticists, protein engineers, and rational
drug designers to analyze visually the mechanisms of protein conformational
change and distinguish between conformational change due to motions
(e.g., ligand binding) and mutations. The PMR morph movies and statistics
can be freely viewed from the PMR website, http://pmr.sdsc.edu.
A New Similarity Measure Among Protein Sequences
Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu
Institute of Information Science, Academia Sinica, Taipei, Taiwan
Abstract
Protein sequence analysis is an important tool to decode the logic of life. One of the most
important similarity measures in this area is the edit distance between amino acids of two
sequences. We believe this criterion should be reconsidered because protein features are
probably associated more with small peptide fragments than with individual amino acids.
In this paper, we design small patterns that are associated with highly conserved regions
among a set of protein sequences. These patterns are used analogously to the index terms in
information retrieval. Therefore, we do not consider gaps within patterns. This new similarity
measure has been applied to phylogenetic tree construction, protein clustering and protein
secondary structure prediction and has produced promising results.
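The information-retrieval analogy can be sketched with gap-free peptide fragments as index terms and cosine similarity between their count vectors. Note the hedge: the paper derives its patterns from conserved regions of a sequence set, whereas this illustration simply uses all k-mers:

```python
from collections import Counter
from math import sqrt

def kmer_vector(seq, k=3):
    """Counts of the gap-free length-k peptide fragments of a sequence,
    the analogue of index terms in information retrieval."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_similarity(v1, v2):
    """Standard IR-style cosine similarity of two term-count vectors."""
    dot = sum(c * v2[t] for t, c in v1.items() if t in v2)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0
```

Fragment-level similarity of this kind is what lets the measure capture features carried by small peptides rather than by individual amino acids.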
Pattern Recognition (Session VIII)
cWINNOWER Algorithm for Finding Fuzzy DNA Motifs
Shoudan Liang
NASA Advanced Supercomputing Division, NASA Ames Research Center, Moffett Field, CA
Abstract
The cWINNOWER algorithm detects fuzzy motifs in DNA sequences rich in protein-binding
signals. A signal is defined as any short nucleotide pattern having up to d mutations
differing from a motif of length l. The algorithm finds such motifs if multiple mutated
copies of the motif (i.e., the signals) are present in the DNA sequence in sufficient
abundance. The cWINNOWER algorithm substantially improves the sensitivity of the Winnower
method of Pevzner and Sze by imposing a consensus constraint, enabling it to detect much
weaker signals. We studied the minimum number of detectable motifs qc as a function of
sequence length N for random sequences. We found that qc increases linearly with N for
a fast version of the algorithm based on counting three-member sub-cliques. Imposing
consensus constraints reduces qc by a factor of three in this case, which makes the
algorithm dramatically more sensitive. Our most sensitive algorithm, which counts
four-member sub-cliques, needs a minimum of only 13 signals to detect motifs in a
sequence of length N = 12000 for (l, d) = (15, 4).
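The (l, d) signal model used here is easy to state in code. The sketch below only counts signals, i.e. windows within Hamming distance d of a motif; the clique-counting machinery that makes cWINNOWER efficient is not reproduced:

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def count_signals(sequence, motif, d):
    """Count length-l windows of the sequence that are signals for the
    motif, i.e. differ from it in at most d positions."""
    l = len(motif)
    return sum(1 for i in range(len(sequence) - l + 1)
               if hamming(sequence[i:i + l], motif) <= d)
```

A motif is detectable when enough such signals are present; the abstract's q_c is the minimum signal count the algorithm needs.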
LOGOS: A
Modular Bayesian Model for de novo Motif Detection
Eric P. Xing
Computer Science Division, University of California at Berkeley, Berkeley, CA
Wei Wu Life Science Division, Lawrence Berkeley National Lab, Berkeley, CA
Michael I. Jordan Computer Science and Statistics, UC Berkeley, CA
Richard M. Karp Computer Science Division, UC Berkeley, CA
Abstract
The complexity of the global organization and internal structure of motifs in higher
eukaryotic organisms raises significant challenges for motif detection techniques. To
achieve successful de novo motif detection it is necessary to model the complex dependencies
within and among motifs and incorporate biological prior knowledge. In this paper, we present
LOGOS, an integrated Local and GlObal motif Sequence model for biopolymer sequences, which
provides a principled framework for developing, modularizing, extending and computing
expressive motif models for complex biopolymer sequence analysis. LOGOS consists of two
interacting submodels: HMDM, a local alignment model capturing biological prior knowledge and
positional dependence within the motif local structure; and HMM, a global motif distribution model
modeling frequencies and dependencies of motif occurrences. Model parameters can be fit using
training motifs within an empirical Bayesian framework. A variational EM algorithm is developed
for de novo motif detection. LOGOS improves over existing models that ignore biological priors
and dependencies in motif structures and motif occurrences, and demonstrates superior performance
on both semi-realistic test data and cis-regulatory sequences from yeast data with regard to
sensitivity, specificity, flexibility and extensibility.
A Block Coding Method That Leads to Significantly
Lower Entropy Values for the Proteins of Haemophilus influenzae and its Coding Sections
G. Sampath
Department of Computer Science, The College of New Jersey, Ewing, NJ
Abstract
A simple statistical block code in combination with the LZW-based compression utilities
gzip and compress has been found to increase by a significant amount the level of compression possible
for the proteins encoded in Haemophilus influenzae (hi),
the first fully sequenced genome. The method yields an entropy of 3.665 bits per symbol
(bps), which is 0.657 bps below the maximum of 4.322 bps. This is an improvement of
0.452 bps over the best known to date of 4.118 bps using the lza-CTW algorithm of
Matsumoto, Sadakane, and Imai, the next best being 4.143 bps using Nevill-Manning and
Witten's cp algorithm. Using an efficient inverse map from the 20 amino acids to the 61
triplets that code for them, the genome too is found to be compressible, although the gain
is not as high. Calculated estimates based on this latter method yield an entropy of 1.757
bps for the coding portions of the genome, with a possibly lower actual entropy. Both of
these results, which flow from the sequential use of statistics-based encoding techniques
followed by substitution-based text compression algorithms (gzip, compress), hint at the
existence of hitherto unexplored redundancies at both the global and the local level in hi
and the proteins coded by it. Further work may help determine whether such decreases in
entropy are possible with other proteins and genomes or hi is a rarity (perhaps even
unique) in this respect.
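The 4.322 bps ceiling quoted above is just log2(20), the zeroth-order entropy of a uniform 20-letter alphabet. A quick way to see where the headroom lies (without reproducing the paper's block coding scheme) is to compute the empirical per-symbol entropy of a sequence:

```python
from collections import Counter
from math import log2

def entropy_bits_per_symbol(seq):
    """Zeroth-order empirical entropy in bits per symbol. With all 20
    amino acids equally frequent this reaches log2(20), about 4.322
    bits; compression methods aim to get below the empirical value by
    exploiting higher-order redundancy."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())
```

Any result below this zeroth-order figure, such as the 3.665 bps reported here, necessarily reflects redundancy beyond single-residue composition.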
Sequence Alignment (Session X)
Computing Highly Specific and
Mismatch Tolerant Oligomers Efficiently
Tomoyuki Yamada and Shinichi Morishita
Computer Science, University of Tokyo, Tokyo, Japan
Abstract
The sequencing of the genomes of a variety of species and the growing databases containing
expressed sequence tags (ESTs) and complementary DNAs (cDNAs) facilitate the design of highly
specific oligomers for use as genomic markers, PCR primers, or DNA oligo microarrays. The
first step in evaluating the specificity of short oligomers of about twenty units in length
is to determine the frequencies at which the oligomers occur. However, for oligomers longer
than about fifty units this is not efficient, as they usually have a frequency of only 1. A
more suitable procedure is to consider the mismatch tolerance of an oligomer, that is, the
minimum number of mismatches that allows a given oligomer to match a sub-sequence other than
the target sequence anywhere in the genome or the EST database. However, calculating the
exact value of mismatch tolerance is computationally costly and impractical. Therefore, we
studied the problem of checking whether an oligomer meets the constraint that its mismatch
tolerance is no less than a given threshold. Here, we present an efficient dynamic programming
algorithm solution that utilizes suffix and height arrays. We demonstrated the effectiveness
of this algorithm by efficiently computing a dense list of oligo-markers applicable to the
human genome. Experimental results show that the algorithm runs faster than the well-known
algorithm of Abrahamson by orders of magnitude and is able to enumerate 63%-79% of qualified oligomers.
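The quantity being computed can be defined by a brute-force sketch. The paper replaces this quadratic scan with a suffix- and height-array dynamic program; the code below only pins down the definition of mismatch tolerance:

```python
def mismatch_tolerance(oligo, genome, target_pos):
    """Minimum Hamming distance from the oligomer to any same-length
    window of the genome other than its own target site. An oligomer
    passes the design constraint when this value is at least the
    chosen threshold."""
    l = len(oligo)
    best = l
    for i in range(len(genome) - l + 1):
        if i == target_pos:
            continue
        d = sum(a != b for a, b in zip(oligo, genome[i:i + l]))
        best = min(best, d)
    return best
```

A tolerance of 0 means the oligomer occurs exactly somewhere else and is useless as a specific marker; larger values mean any cross-hybridizing site must carry that many mismatches.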
ANTICLUSTAL: Multiple Sequence Alignment
by Antipole Clustering and Linear Approximate 1-Median Selection
C. Di Pietro, A. Ferro, T. Maugeri, G. Pigola, A. Pulvirenti Dipartimento di Matematica e
Informatica, Università di Catania
D. Shasha Courant Institute of Mathematical Sciences, New York University, NY
V. Di Pietro, G. Emmanuele, E. Modica, M. Purrello, M. Ragusa, M. Scalia, S. Travali, V. Zimmitti
Dipartimento di Scienze Biomediche, Università di Catania
Abstract
In this paper we present a new Multiple Sequence Alignment (MSA) algorithm called AntiClustAl.
The method makes use of the commonly used idea of aligning homologous sequences belonging to
classes generated by some clustering algorithm, and then continues the alignment process in
a bottom-up way along a suitable tree structure. The final result is then read at the root
of the tree. Multiple sequence alignment in each cluster makes use of the progressive alignment
with the 1-median (center) of the cluster. The 1-median of the set S of sequences is the element
of S which minimizes the average distance from any other sequence in S. Its exact computation
requires quadratic time. The basic idea of our proposed algorithm is to make use of a simple
and natural algorithmic technique based on randomized tournaments which has been successfully
applied to large size search problems in general metric spaces. In particular, a clustering
algorithm called the Antipole tree is used. Compared with Clustal W, a widely used tool for
MSA, our algorithm shows better running times with fully comparable alignment quality. A
successful biological application showing high amino acid conservation during the evolution
of Xenopus laevis ASOD2 is also cited.
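The exact 1-median and its tournament approximation can be sketched as follows. Hamming distance on equal-length toy strings stands in for the real alignment-based sequence distance, and the tournament shown is a generic version of the randomized technique named in the abstract, not the paper's exact procedure:

```python
import random

def hamming(a, b):
    """Toy metric for equal-length strings; the real tool uses an
    alignment-based sequence distance."""
    return sum(x != y for x, y in zip(a, b))

def one_median(sequences, distance=hamming):
    """Exact 1-median: the element minimizing the total (equivalently,
    average) distance to all others. Quadratic in the number of
    sequences."""
    return min(sequences, key=lambda s: sum(distance(s, t) for t in sequences))

def tournament_median(sequences, distance=hamming, group=3, seed=0):
    """Randomized-tournament approximation: local winners of small
    random groups advance round by round until one remains, giving
    roughly linear work overall."""
    rng = random.Random(seed)
    pool = list(sequences)
    while len(pool) > 1:
        rng.shuffle(pool)
        pool = [one_median(pool[i:i + group], distance)
                for i in range(0, len(pool), group)]
    return pool[0]
```

Each tournament round does only O(group²) distance evaluations per group, which is how the quadratic cost of the exact 1-median is avoided on large inputs.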
Efficient Constrained Multiple
Sequence Alignment with Performance Guarantee
Francis Y.L. Chin, N.L. Ho, and T.W. Lam
Department of Computer Science and Information Systems, University of Hong Kong, Hong Kong
Abstract
The Constrained Multiple Sequence Alignment problem is to align a set of sequences subject
to a given constrained sequence, which arises from some knowledge of the structure of the
sequences. This paper presents new algorithms for this problem, which are more efficient in
terms of time and space (memory) than the previous algorithms [14], and with a worst-case guarantee
on the quality of the alignment. Saving the space requirement by a quadratic factor is particularly
significant as the previous O(n^4)-space algorithm has limited application due to its huge memory
requirement. Experiments on real data sets confirm that our new algorithms show improvements in both
alignment quality and resource requirements.
Comparative Genomics and Phylogenetic Analysis (Session XI)
Efficient
Reconstruction of Phylogenetic Networks (of SNPs) with Constrained
Recombination
Dan Gusfield and Satish Eddhu
Department of Computer Science, University of California at Davis,
Davis, CA
Charles Langley
Department of Evolution and Ecology, UC Davis, CA
Abstract
A phylogenetic network is a generalization of a phylogenetic tree,
incorporating more complex molecular phenomena, such as recombination,
than is incorporated into a pure phylogenetic tree. Genomic sequences
often do not fit a pure tree model, and a phylogenetic network is
required to explain the evolution of the sequences. Deducing that
history is important for the study of molecular evolution, but has
other applications such as deducing the location of genes associated
with a disease or a valuable genetic trait. Understanding the history
of recombinations is particularly important in these applications.
For greatest biological fidelity, the number of recombinations in
the network should be small, and/or the network should have particular
biologically significant structural properties.
In a seminal paper, Wang et al. [10] showed that the problem of finding a phylogenetic
network that derives an input set of sequences and minimizes the number of
recombination events is NP-hard. However, they gave a polynomial time algorithm that
was intended to determine whether the sequences could be derived on a phylogenetic
network, where the recombination cycles are node disjoint, a structural feature that is
realistic in particular evolutionary contexts. We call such a network a "galled-tree".
Unfortunately, the algorithm in [10] is only a sufficient, but not a necessary, test for the
existence of a galled-tree.
Here, we develop combinatorial constraints that galled-trees must obey, to obtain
a faster algorithm that is guaranteed to be both a necessary and sufficient test for the
existence of a galled-tree for the data. The algorithm determines whether there is a galled-tree for
the data, characterizes the set of all galled-trees for the sequences, and gives
a count of them. The combinatorial constraints we develop apply (for the most part)
to node-disjoint cycles in any phylogenetic network (not just galled-trees), and can be
used for example to prove that a given site cannot be on a node-disjoint cycle in any
phylogenetic network.
Perhaps more important than the specific results about galled-trees, we introduce
an approach that can be used to study recombination in phylogenetic networks that go
beyond galled-trees.
Prokaryote Phylogeny without Sequence Alignment:
From Avoidance Signature to Composition Distance
Bailin Hao T-Life Research Center, Fudan University, Shanghai 200433, China
Ji Qi and Bin Wang Institute of Theoretical Physics, Academia Sinica, Beijing, China
Abstract
A new and essentially simple method to infer prokaryotic phylogenetic relations from their
complete genome data without using sequence alignment is proposed. It is based on the appearance
frequency of oligopeptides of a fixed length (up to K=6) in their proteomes. This is a method without
fine adjustment and choice of genes. It can incorporate the effect of lateral gene transfer to some
extent and leads to results comparable with the bacteriologists' systematics as reflected in the
latest 2001 edition of Bergey's Manual of Systematic Bacteriology [1, 2]. A key point in our
approach is subtraction of a random background by using a Markovian model of order K-1 from the
composition vectors to highlight the shaping role of natural selection.
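The background subtraction described here can be sketched directly from the abstract's recipe: the expected frequency of each K-string under an order (K-1) Markov model is estimated from the shorter strings, and the stored component is the relative deviation from it. A toy cosine-based distance is included; this is a sketch of the composition-vector idea, not the authors' full pipeline:

```python
from collections import Counter

def composition_vector(seq, k):
    """Composition vector of K-strings (k >= 3) with the (K-1)-order
    Markov background subtracted: the expected frequency of
    w = a1..aK is f(a1..a_{K-1}) * f(a2..aK) / f(a2..a_{K-1}), and the
    stored component is (observed - expected) / expected."""
    def freqs(length):
        counts = Counter(seq[i:i + length] for i in range(len(seq) - length + 1))
        total = sum(counts.values())
        return {w: n / total for w, n in counts.items()}
    fk, fkm1, fkm2 = freqs(k), freqs(k - 1), freqs(k - 2)
    vec = {}
    for w, obs in fk.items():
        expected = fkm1[w[:-1]] * fkm1[w[1:]] / fkm2[w[1:-1]]
        vec[w] = (obs - expected) / expected
    return vec

def cv_distance(v1, v2):
    """1 minus the cosine correlation of two composition vectors."""
    dot = sum(x * v2.get(w, 0.0) for w, x in v1.items())
    n1 = sum(x * x for x in v1.values()) ** 0.5
    n2 = sum(x * x for x in v2.values()) ** 0.5
    return 1.0 - dot / (n1 * n2) if n1 and n2 else 1.0
```

Dividing out the Markov expectation is what removes the "random background", so that the surviving components reflect selection rather than mere compositional bias.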
Microarray Data Analysis (Sessions I & II)
Clustering Binary Fingerprint Vectors with Missing Values for DNA
Array Data Analysis
Andres Figueroa Department of Computer Science, UC Riverside, Riverside, CA
James Borneman Department of Plant Pathology, UC Riverside, Riverside, CA
Tao Jiang Department of Computer Science, UC Riverside, Riverside, CA
Abstract
Oligonucleotide fingerprinting is a powerful DNA array based method to characterize
cDNA and ribosomal RNA gene (rDNA) libraries and has many applications including gene expression
profiling and DNA clone classification. We are especially interested in the latter application.
A key step in the method is the cluster analysis of fingerprint data obtained from DNA array
hybridization experiments. Most of the existing approaches to clustering use (normalized) real
intensity values and thus do not treat positive and negative hybridization signals equally
(positive signals are much more emphasized). In this paper, we consider a discrete approach. Fingerprint
data are first normalized and binarized using control DNA clones. Because there may exist unresolved
(or missing) values in this binarization process, we formulate the clustering of (binary) oligonucleotide
fingerprints as a combinatorial optimization problem that attempts to identify clusters and resolve the
missing values in the fingerprints simultaneously. We study the computational complexity of this clustering
problem and a natural parameterized version, and present an efficient greedy algorithm based on MINIMUM
CLIQUE PARTITION on graphs. The algorithm takes advantage of some unique properties of the graphs
considered here, which allow us to efficiently find the maximum cliques as well as some special maximal
cliques. Our experimental results on simulated and real data demonstrate that the algorithm runs faster
and performs better than some popular hierarchical and graph-based clustering methods. The results on
real data from DNA clone classification also suggest that this discrete approach is more accurate than
clustering methods based on real intensity values, in terms of separating clones that have different
characteristics with respect to the given oligonucleotide probes.
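The core discrete notion, that two binary fingerprints with missing values can share a cluster only if they agree wherever both are resolved, can be sketched with a toy greedy pass. This is a deliberately simplified illustration; the paper's algorithm is a more careful greedy based on MINIMUM CLIQUE PARTITION:

```python
def compatible(f1, f2):
    """Two fingerprints over {'0', '1', 'N'} (N = unresolved) are
    compatible if they agree at every position where both are
    resolved."""
    return all(a == b or 'N' in (a, b) for a, b in zip(f1, f2))

def greedy_cluster(fingerprints):
    """Toy greedy clustering: place each fingerprint into the first
    cluster all of whose members it is compatible with, otherwise
    start a new cluster. Cluster membership implicitly resolves the
    missing values."""
    clusters = []
    for f in fingerprints:
        for cl in clusters:
            if all(compatible(f, g) for g in cl):
                cl.append(f)
                break
        else:
            clusters.append([f])
    return clusters
```

Clustering and missing-value resolution happen together: once "10N0" joins a cluster containing "1010", its N is effectively resolved to 1.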
Clustering Time-varying Gene
Expression Profiles Using Scale-space Signals
Tanveer Syeda-Mahmood
IBM Almaden Research Center, San Jose, CA
Abstract
The functional state of an organism is determined largely by the pattern of expression of its genes.
The analysis of gene expression data from gene chips has primarily revolved around clustering and
classification of the data using machine learning techniques based on the intensity of expression
alone with the time-varying pattern mostly ignored. In this paper, we present a pattern recognition-based
approach to capturing similarity by finding salient changes in the time-varying expression patterns of
genes. Such changes can give clues about important events, such as gene regulation by cell-cycle phases,
or even signal the onset of a disease. Specifically, we observe that dissimilarity between time series
is revealed by the sharp twists and bends produced in a higher-dimensional curve formed from the
constituent signals. Scale-space analysis is used to detect the sharp twists and turns and their relative
strength with respect to the component signals is estimated to form a shape similarity measure between
time profiles. A clustering algorithm is presented to cluster gene profiles using the scale-space distance
as a similarity metric. Multi-dimensional curves formed from time series within clusters are used as
cluster prototypes or indexes to the gene expression database, and are used to retrieve the functionally
similar genes to a query gene profile. Extensive comparison of clustering using scale-space distance in
comparison to traditional Euclidean distance is presented on the yeast genome database.
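The scale-space idea, smoothing a profile at a chosen scale and looking for sharp changes, can be sketched in one dimension. This is a crude stand-in for the paper's multi-dimensional curve analysis: a truncated Gaussian smoother plus a thresholded discrete second derivative, with the scale and threshold parameters invented for illustration:

```python
from math import exp

def gaussian_smooth(signal, sigma):
    """Smooth a time series with a truncated, renormalized Gaussian
    kernel; larger sigma = coarser scale."""
    radius = max(1, int(3 * sigma))
    kernel = [exp(-(i * i) / (2 * sigma * sigma))
              for i in range(-radius, radius + 1)]
    out = []
    for i in range(len(signal)):
        num = den = 0.0
        for j, w in zip(range(i - radius, i + radius + 1), kernel):
            if 0 <= j < len(signal):
                num += w * signal[j]
                den += w
        out.append(num / den)
    return out

def sharp_changes(signal, sigma=1.0, threshold=0.5):
    """Indices where the smoothed signal's discrete second derivative
    is large in magnitude: a 1D stand-in for detecting the 'sharp
    twists and bends' of the abstract."""
    s = gaussian_smooth(signal, sigma)
    return [i for i in range(1, len(s) - 1)
            if abs(s[i - 1] - 2 * s[i] + s[i + 1]) > threshold]
```

Smoothing first means only changes that survive at the chosen scale count as salient, which is what distinguishes a regulatory-event-like transition from measurement noise.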
Fast
and Sensitive Probe Selection for DNA Chips Using Jumps in Matching
Statistics
Sven Rahmann Department of Computational Molecular Biology, Max-Planck-
Institute for Molecular Genetics, and Department of Mathematics and Computer Science,
Freie Universität, Berlin, Germany
Abstract
The design of large-scale DNA microarrays is a challenging problem. So far, probe selection
algorithms must trade the ability to cope with large-scale problems for a loss of accuracy
in the estimation of probe quality. We present an approach based on jumps in matching
statistics that combines the best of both worlds.
This article consists of two parts. The first part is theoretical. We introduce the notion
of jumps in matching statistics between two strings and derive their properties. We estimate
the frequency of jumps for random strings in a non-uniform Bernoulli model and present a
new heuristic argument to find the center of the length distribution of the longest substring
that two random strings have in common. The results are generalized to near perfect matches
with a small number of mismatches.
In the second part, we use the concept of jumps to improve the accuracy of the longest
common factor approach for probe selection by moving from a string based to an energy based
specificity measure, while only slightly more than doubling the selection time.
Fast and Accurate Probe
Selection Algorithm for Large Genomes
Wing-Kin Sung and Wah-Heng Lee
Computer Science, National University of Singapore, Singapore
Abstract
In recent years, oligo microarray (DNA chip) technology has had a significant impact
on genomic studies. Many fields such as gene discovery, drug discovery, toxicological research and
disease diagnosis, will certainly benefit from its use. A microarray is an orderly arrangement of
thousands of DNA fragments where each DNA fragment is a probe (or a fingerprint) of a gene/cDNA.
It is important that each probe uniquely associates with a particular gene/cDNA; otherwise,
the performance of the microarray is degraded. Existing algorithms usually select probes using
the criteria of homogeneity, sensitivity, and specificity. However, most of these algorithms are
too slow to be practical. To improve efficiency, some of them resort to heuristics,
which reduces accuracy. Instead, we make use of smart filtering techniques
to avoid redundant computation while maintaining accuracy. Based
on the new algorithm, optimal short (20 bases) or long (50 or 70 bases) probes
can be computed efficiently for large genomes.
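The filtering idea can be illustrated with a minimal sketch: apply a cheap homogeneity test to every candidate window before any expensive genome-wide specificity computation. The GC-content bound, the parameter values, and the function names below are illustrative assumptions, not the paper's actual criteria:

```python
def gc_content(seq):
    # Fraction of G/C bases in a sequence.
    return sum(seq.count(b) for b in "GC") / len(seq)


def candidate_probes(gene, length=20, gc_range=(0.40, 0.60)):
    # Enumerate fixed-length windows and keep only those passing a cheap
    # homogeneity filter (GC content here; a melting-temperature bound
    # would be analogous).  Only the survivors would go on to the
    # expensive specificity check against the whole genome.
    lo, hi = gc_range
    return [gene[i:i + length]
            for i in range(len(gene) - length + 1)
            if lo <= gc_content(gene[i:i + length]) <= hi]
```

Pre-filtering this way avoids redundant computation exactly where it is most costly: the specificity check runs only on windows that could plausibly become probes.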
Degenerate Primer Design via Clustering
Xintao Wei and Giri Narasimhan
Computer Science, Florida International University, University Park, Miami, FL
David N. Kuhn Biological Sciences, Florida International University, University Park, Miami,
and USDA-ARS, Subtropical Horticulture Research Station, Miami, FL
Abstract
This paper describes a new strategy for designing degenerate primers for a given
multiple alignment of amino acid sequences. Degenerate primers are useful for amplifying
homologous genes/proteins. However, when a large collection of sequences is considered,
no consensus region may exist in the multiple alignment, making it impossible to design a single pair
of primers for the collection. In such cases, manual methods are used to find smaller groups
within the input collection so that primers can be designed for individual groups. Our strategy proposes
an automatic grouping of the input sequences from the multiple alignment by using clustering techniques.
The groups are then dealt with individually. Conserved regions are detected for each group. Conserved
regions are scored using a BlockSimilarity score, a novel alignment scoring scheme that is appropriate
for this application. Degenerate primers are then designed by reverse translating the conserved amino
acid sequences to the corresponding nucleotide sequences. Our program, DePiCt, was written in BioPerl
and was tested on the Toll-Interleukin Receptor (TIR) and the non-TIR family of plant
resistance genes. Existing programs for degenerate primer design were unable to find primers for these data sets.
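The reverse-translation step can be sketched as a lookup from conserved amino acids to IUPAC degenerate codons. The partial codon table and the function names below are hypothetical simplifications; DePiCt's actual tables and its BlockSimilarity scoring are more elaborate:

```python
# Partial, hypothetical amino-acid -> degenerate-codon table (IUPAC
# nucleotide codes); only unambiguous two-codon or one-codon amino
# acids are listed here.
DEGEN_CODON = {"M": "ATG", "W": "TGG", "F": "TTY", "K": "AAR",
               "N": "AAY", "D": "GAY", "E": "GAR", "C": "TGY"}
IUPAC_SIZE = {"A": 1, "C": 1, "G": 1, "T": 1, "R": 2, "Y": 2, "N": 4}


def reverse_translate(peptide):
    # Back-translate a conserved peptide into a degenerate primer.
    return "".join(DEGEN_CODON[aa] for aa in peptide)


def degeneracy(primer):
    # Number of distinct plain-DNA sequences the primer encodes;
    # designers keep this small so the primer pool stays specific.
    d = 1
    for base in primer:
        d *= IUPAC_SIZE[base]
    return d
```

Clustering first, then reverse-translating per-group conserved regions, keeps the degeneracy of each primer manageable even when the full collection has no consensus at all.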
Data Mining (Sessions VII and XII)
Discovering Compact and Highly Discriminative Features or Feature Combinations of Drug Activities Using Support Vector Machines
Hwanjo Yu and Jiong Yang Computer Science, University of Illinois, Urbana-Champaign, IL
Wei Wang
Computer Science Department, University of North Carolina, Chapel Hill, NC
Jiawei Han Computer Science, University of Illinois, Urbana-Champaign, IL
Abstract
Nowadays, high-throughput techniques such as microarray experiments make it feasible to examine
and collect massive data at the molecular level. These data, typically mapped to a very high-dimensional
feature space, carry rich information about the functionality of certain biological
entities and can be used to infer valuable knowledge for the purposes of classification and
prediction. Typically, a small number of features or feature combinations play deterministic
roles in functional discrimination. The identification of such features or feature combinations
is of great importance. In this paper, we study the problem of discovering compact and highly
discriminative features or feature combinations from a rich feature collection. We employ
support vector machines as the classification means and aim at finding compact feature combinations.
Compared to previous feature selection methods, which identify features solely based on their
individual roles in the classification, our method is able to identify, in a systematic fashion,
minimal feature combinations that ultimately have deterministic roles. Our experimental study
identifies features that are not significant individually but are most significant in combination.
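The search for compact combinations can be sketched as a greedy forward wrapper around any classifier score, which is exactly how combinations beat individually weak features. The nearest-centroid scorer below is a cheap stand-in for the SVM used in the paper, and all names are illustrative:

```python
def centroid_accuracy(X, y, feats):
    # Nearest-centroid training-set accuracy on a feature subset;
    # a cheap stand-in for the SVM classifier used in the paper.
    if not feats:
        return 0.0
    proj = [[row[f] for f in feats] for row in X]
    centroids = {}
    for c in set(y):
        rows = [p for p, label in zip(proj, y) if label == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    predictions = [min(centroids, key=lambda c: dist(p, centroids[c]))
                   for p in proj]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


def greedy_minimal_features(X, y, score):
    # Forward search: repeatedly add the single feature whose inclusion
    # most improves the score; stop when no remaining feature helps.
    selected, best = [], score(X, y, [])
    remaining = set(range(len(X[0])))
    while remaining:
        s, f = max((score(X, y, selected + [f]), f) for f in remaining)
        if s <= best:
            break
        selected.append(f)
        best = s
        remaining.discard(f)
    return selected
```

Because each candidate is scored jointly with the features already selected, the wrapper can retain a feature that is useless alone but discriminative in combination, unlike per-feature ranking methods.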
SMASHing Regulatory Sites in DNA by Human-mouse Sequence Comparisons
Mihaela Zavolan and Terry Gaasterland
Laboratory for Computational Genomics, and Laboratory for Molecular Genetics, The Rockefeller University,
New York, NY
Nikolaus Rajewsky Department of Biology, New York University, New York, NY
Nicholas D. Socci Pathology, Seaver Foundation Program in Bioinformatics and Laboratory for Molecular
Genetics, The Rockefeller University, New York, NY
Abstract
Regulatory sequence elements provide important clues to understanding and predicting gene expression.
Although the binding sites for hundreds of transcription factors are known, there has been no
systematic attempt to incorporate this information in the annotation of the human genome. Cross-species
sequence comparisons are critical to a meaningful annotation of regulatory elements since
they generally reside in conserved non-coding regions. To take advantage of the recently completed
drafts of the mouse and human genomes for annotating transcription factor binding sites, we developed
SMASH, a computational pipeline that identifies thousands of orthologous human/mouse proteins, maps
them to genomic sequences, extracts and compares upstream regions and annotates putative regulatory
elements in conserved, non-coding, upstream regions. Our current dataset consists of approximately
2500 human/mouse gene pairs. Transcription start sites were estimated by mapping quasi-full length
cDNA sequences. SMASH uses a novel probabilistic method to identify putative conserved binding
sites that takes into account the competition between transcription factors for binding DNA. SMASH
presents the results via a genome browser web interface which displays the predicted regulatory
information together with the current annotations for the human genome. Our results are validated
by comparison to previously published experimental data. SMASH results compare favorably to other
existing computational approaches.
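The cross-species requirement can be sketched as a simple conservation filter: report a binding site only when its consensus matches in both upstream regions. This omits SMASH's probabilistic scoring and its model of competition between transcription factors; the function names and the IUPAC table subset are assumptions made here:

```python
# Subset of the IUPAC nucleotide ambiguity codes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "N": "ACGT"}


def site_hits(seq, consensus):
    # Start positions where an IUPAC consensus matches the sequence.
    k = len(consensus)
    return [i for i in range(len(seq) - k + 1)
            if all(seq[i + j] in IUPAC[c] for j, c in enumerate(consensus))]


def conserved_site(human_up, mouse_up, consensus):
    # Keep a motif only if it occurs in both species' upstream regions;
    # SMASH additionally scores sites probabilistically and resolves
    # overlapping factors competing for the same stretch of DNA.
    return bool(site_hits(human_up, consensus)) and \
           bool(site_hits(mouse_up, consensus))
```

Requiring a hit in both orthologous upstream regions is what suppresses the large number of spurious single-genome matches that any short consensus inevitably produces.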
Initial Large-scale Exploration of Protein-protein Interactions in the Human Brain
Jake Y. Chen, Andrey Y. Sivachenko, Russell Bell, Connie Kurschner,
Irene Ota and Sudhir Sahasrabudhe
Myriad Proteomics, Inc., Salt Lake City, UT
Abstract
Study of protein interaction networks is crucial to post-genomic systems
biology. Aided by high-throughput screening technologies, biologists
are rapidly accumulating protein-protein interaction data. Using a
random yeast two-hybrid (R2H) process, we have performed large-scale
yeast two-hybrid searches with approximately fifty thousand random
human brain cDNA bait fragments against a human brain cDNA prey fragment
library. From these searches, we have identified 13,656 unique protein-protein
interaction pairs involving 4,473 distinct known human loci. In this
paper, we have performed our initial characterization of the protein
interaction network in human brain tissue. We have classified and
characterized all identified interactions based on Gene Ontology (GO)
annotation of interacting loci. We have also described the "scale-free"
topological structure of the network.
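The "scale-free" characterization is typically checked by tabulating the degree distribution of the interaction graph and inspecting it on log-log axes, where a power law appears as a straight line. A minimal sketch, with illustrative names:

```python
from collections import Counter


def degree_distribution(pairs):
    # Node degrees from undirected interaction pairs, then the
    # histogram degree -> number of proteins with that degree.
    # For a scale-free network this histogram falls off roughly
    # as a power law of the degree.
    deg = Counter()
    for a, b in pairs:
        deg[a] += 1
        deg[b] += 1
    return Counter(deg.values())
```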
An SVM-based Algorithm for Identification of Photosynthesis-specific Genome Features
Gong-Xin Yu, Al Geist, George Ostrouchov, and Nagiza F. Samatova
Computational Biology Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory,
Oak Ridge, TN
Abstract
This paper presents a novel algorithm for identification and functional characterization of "key"
genome features responsible for a particular biochemical process of interest. The central idea
behind our algorithm is that individual genome features (or their combinations) are identified
as significant "key" features if the discrimination accuracy between two classes of genomes with
respect to a given biochemical process is sufficiently affected by the inclusion or exclusion
of these features. In this paper, genome features are defined by high-resolution gene functions.
The discrimination procedure utilizes the Support Vector Machine (SVM) classification technique.
Changes in classification accuracy in response to addition or deletion of genome features measure
the significance of these features. The application of this SVM-based feature identification
algorithm to the oxygenic photosynthetic process resulted in a total of 126 highly confident
candidate genome features. They cover not only dominant genome features (those that always occur in
oxygenic photosynthetic genomes but not in the other genomes) but also weak yet complementary
genome features (whose combinations form unique dominant genome features). While many of these
features are well-known components of the oxygenic photosynthetic process, others are completely
unknown, including some hypothetical proteins. These results demonstrate that our SVM-based feature
identification algorithm has the capability to discover novel genome features related to a targeted biochemical process.
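The central accuracy-delta criterion can be sketched directly: score each feature by how much classification accuracy drops when it is excluded. The `accuracy` callable below stands in for the SVM cross-validation runs of the actual algorithm, and the names are illustrative:

```python
def feature_significance(features, accuracy):
    # Score each genome feature by the drop in classification accuracy
    # when it is excluded from the feature set; a large drop marks a
    # "key" feature for the biochemical process being discriminated.
    base = accuracy(features)
    return {f: base - accuracy([g for g in features if g != f])
            for f in features}
```

Weak-but-complementary features show up with small individual deltas; detecting them requires scoring exclusions of whole combinations rather than single features, which the same delta idea extends to.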
Biomedical Research (Session V)
Stochastic Stage-structured Modeling of the Adaptive Immune System
Dennis L. Chao and Stephanie Forrest Computer Science, University of New Mexico, Albuquerque, NM
Miles P. Davenport Pathology, Faculty of Medicine, University of New South Wales, Kensington, Australia
Alan S. Perelson Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM
Abstract
We have constructed a computer model of the cytotoxic T lymphocyte (CTL) response to antigen
and the maintenance of immunological memory. Because immune responses often begin with small
numbers of cells and there is great variation among individual immune systems, we have chosen
to implement a stochastic model that captures the life cycle of T cells more faithfully than
deterministic models. Past models of the immune response have been either differential-equation based,
which do not capture stochastic effects, or agent-based, which are computationally expensive.
We use a stochastic stage-structured approach that has many of the advantages of agent-based
modeling but is more efficient. Our model can provide insights into the effect infections have
on the CTL repertoire and the response to subsequent infections.
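The stage-structured stochastic update can be sketched as per-stage random transitions: at every time step each cell independently dies, advances to the next stage, or stays. This is an illustrative simplification, not the published model; the point it shows is that tracking stage counts rather than individual agents preserves small-number stochasticity at a fraction of the cost of agent-based simulation:

```python
import random


def step(stages, p_advance, p_death, rng):
    # One time step: each cell in stage i independently dies with
    # probability p_death, advances to stage i+1 with probability
    # p_advance (if a next stage exists), and otherwise stays put.
    nxt = [0] * len(stages)
    for i, count in enumerate(stages):
        for _ in range(count):
            u = rng.random()
            if u < p_death:
                continue                      # cell dies
            if u < p_death + p_advance and i + 1 < len(stages):
                nxt[i + 1] += 1               # cell matures to next stage
            else:
                nxt[i] += 1                   # cell remains in its stage
    return nxt
```

A production version would replace the per-cell loop with binomial draws per stage, which is what makes the stage-structured approach scale to realistic cell numbers.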
A Personalized and Automated dbSNP Surveillance System
Shuo Liu, Steve Lin, Mark Woon, Teri E. Klein, and Russ B. Altman
Department of Genetics, Stanford Medical Informatics, Stanford, CA
Abstract
The development of high throughput techniques and large-scale studies in the biological
sciences has given rise to an explosive growth in both the volume and types of data
available to researchers. A surveillance system that monitors data repositories and reports
changes helps manage the data overload. We developed a dbSNP surveillance system
(URL: http://www.pharmgkb.org/do/serve?id=tools.surveillance.dbsnp) that performs
surveillance on the dbSNP database and alerts users to new information. The system is
notable because it is personalized and fully automated. Each registered user has a list of
genes to follow and receives notification of new entries concerning these genes. The
system integrates data from dbSNP, LocusLink, PharmGKB, and Genbank to position
SNPs on reference sequences and classify SNPs into categories such as synonymous and
non-synonymous SNPs. The system uses data warehousing, object model-based data
integration, object-oriented programming, and a platform-neutral data access mechanism.
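The surveillance core reduces to a per-gene set difference between successive database snapshots, sketched below with hypothetical names; the real system adds the integration across dbSNP, LocusLink, PharmGKB, and GenBank, SNP classification, and per-user notification on top of this:

```python
def new_entries(previous, current):
    # Per-gene set difference between two snapshots of SNP identifiers:
    # everything in the current snapshot that the previous one lacked.
    report = {}
    for gene, snps in current.items():
        fresh = set(snps) - set(previous.get(gene, ()))
        if fresh:
            report[gene] = sorted(fresh)
    return report
```

Filtering this report against each registered user's gene list is what makes the alerts personalized rather than a firehose of every dbSNP change.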
Molecular Structure (Session III)
Experimental Studies of the Universal Chemical Key (UCK) Algorithm on the NCI Database of Chemical Compounds
Robert Grossman, Donald Hamelberg and Pavan Kasturi
Laboratory for Advanced Computing, University of Illinois, Chicago, IL
Bing Liu Department of Computer Science, University of Illinois, Chicago, IL
Abstract
We have developed an algorithm called the Universal Chemical Key (UCK) algorithm
that constructs a unique key for a molecular structure. The molecular structures are
represented as undirected labeled graphs with the atoms representing the vertices of
the graph and the bonds representing the edges. The algorithm was tested on 236,917 compounds
obtained from the National Cancer Institute (NCI) database of chemical compounds. In this
paper we present the algorithm, some examples and the experimental results on the NCI database.
On the NCI database, the UCK algorithm provided distinct unique keys for chemicals with different molecular structures.
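A unique-key construction of this general kind can be sketched with iterative neighborhood refinement in the Morgan / Weisfeiler-Lehman style: each atom's label absorbs the sorted labels of its neighbors, and the sorted multiset of final labels is the key. This is an illustrative stand-in, not the published UCK algorithm, and unlike the result reported above it can collide on some highly symmetric graphs:

```python
def canonical_key(atoms, bonds, rounds=3):
    # atoms: list of vertex labels (element symbols);
    # bonds: list of (i, j) undirected edges between atom indices.
    # Refine each atom's label with the sorted labels of its neighbours,
    # then join the sorted multiset into a permutation-invariant key.
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    labels = list(atoms)
    for _ in range(rounds):
        labels = [labels[i] + "("
                  + "".join(sorted(labels[j] for j in adj[i])) + ")"
                  for i in range(len(atoms))]
    return "|".join(sorted(labels))
```

Because the key depends only on labels and adjacency, isomorphic labelings of the same molecule map to the same string, while structurally different molecules generally do not.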