IEEE Computer Society Bioinformatics Conference

Full Papers Presented as Posters

Pathways, Networks and System Biology
Protein/RNA Structure Prediction and Modeling
Pattern Recognition
Sequence Alignment
Microarray Data Analysis
Data Mining
Biomedical Research

Pathways, Networks and System Biology (Session IV)

Two Operative Concepts for the Post-genomic Era: The "mémoire vive" of the Cell and a Molecular Algebra

Simone Bentolila
IGM, University Marne la Vallée - Cité Descartes, France

Abstract
The first successes in cloning experiments and stem cell "reprogramming" have already demonstrated the primordial role of cellular working-space memory and regulatory mechanisms, which use the knowledge stored in the DNA database in read mode. We present an analogy between living systems and informatics systems by considering: 1) the cell cytoplasm as a memory device accessible in read/write mode; 2) the mechanisms of regulation as a programming language defined by a grammar, a molecular algebra; 3) biological processes as volatile programs which are executed without being written; 4) DNA as a database in read-only mode. We also present applications to two biological algorithms: the immune response and glycogen metabolism.

Protein/RNA Structure Prediction and Modeling (Session IX)

Automated Protein NMR Resonance Assignments

Xiang Wan
Protein Engineering Network Centers of Excellence, Computing Science, University of Alberta, Edmonton, Alberta, Canada
Dong Xu
Protein Informatics Group, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN (supported by the Office of Biological and Environmental Research, U.S. Department of Energy, under Contract DE-AC05-00OR22725, managed by UT-Battelle, LLC)
Carolyn M. Slupsky
Protein Engineering Network Centers of Excellence, Medical Research Center, University of Alberta, Edmonton, Alberta, Canada
Guohui Lin
Protein Engineering Network Centers of Excellence, Computing Science, University of Alberta, Edmonton, Alberta, Canada (supported in part by NSERC, PENCE, and a Startup REE Grant from the University of Alberta)

Abstract
NMR resonance peak assignment is one of the key steps in solving an NMR protein structure. The assignment process links resonance peaks to individual residues of the target protein sequence, providing the prerequisite for establishing intra- and inter-residue spatial relationships between atoms. The process is tedious and time-consuming and can take many weeks. Although a number of computer programs exist to assist with assignment, many NMR labs still do the assignments manually to ensure quality. This paper presents (1) a new scoring system for mapping spin systems to residues, (2) an automated procedure for extracting adjacency information from NMR spectra, and (3) a very fast assignment algorithm, based on our previously proposed greedy filtering method and a maximum matching algorithm, to automate the assignment process. Computational tests on 70 instances of (pseudo) experimental NMR data for 14 proteins demonstrate that the new scoring scheme has much better discerning power with the aid of adjacency information between spin systems simulated across various NMR spectra. Typically, with automated extraction of adjacency information, our method achieves nearly complete assignments for most of the proteins. The experiments suggest that the fast automated assignment algorithm, together with the new scoring scheme and automated adjacency extraction, may be ready for practical use.
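
As a rough illustration of how a filtering step can be combined with a maximum matching, the sketch below keeps only spin-system/residue pairs whose score passes a threshold and then computes a maximum-cardinality bipartite matching with augmenting paths. It is not the authors' algorithm or scoring system: the names `filter_candidates` and `max_matching`, the threshold, and the example scores are all assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Tuple

Pair = Tuple[Hashable, Hashable]  # (spin system, residue position)


def filter_candidates(scores: Dict[Pair, float], threshold: float) -> Dict[Hashable, List[Hashable]]:
    """Keep only pairs scoring at least `threshold` (hypothetical filter;
    higher score is assumed to mean a better fit)."""
    candidates: Dict[Hashable, List[Hashable]] = {}
    for (spin, residue), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score >= threshold:
            candidates.setdefault(spin, []).append(residue)
    return candidates


def max_matching(candidates: Dict[Hashable, List[Hashable]]) -> Dict[Hashable, Hashable]:
    """Maximum-cardinality bipartite matching via augmenting paths.
    Returns a mapping spin system -> residue."""
    spin_of_residue: Dict[Hashable, Hashable] = {}

    def try_assign(spin: Hashable, seen: set) -> bool:
        for residue in candidates.get(spin, []):
            if residue in seen:
                continue
            seen.add(residue)
            # Take a free residue, or re-route the spin system currently holding it.
            if residue not in spin_of_residue or try_assign(spin_of_residue[residue], seen):
                spin_of_residue[residue] = spin
                return True
        return False

    for spin in candidates:
        try_assign(spin, set())
    return {spin: residue for residue, spin in spin_of_residue.items()}


# Made-up scores for three spin systems and three residue positions.
example_scores = {
    ("s1", 1): 0.9, ("s1", 2): 0.8,
    ("s2", 1): 0.7, ("s2", 2): 0.1,
    ("s3", 3): 0.6,
}
assignment = max_matching(filter_candidates(example_scores, threshold=0.5))
```

In this toy input, s2 can only go to residue 1 after filtering, so the matching moves s1 to residue 2 rather than leaving s2 unassigned.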

3D Structural Homology Detection via Unassigned Residual Dipolar Couplings

Christopher James Langmead and Bruce Randall Donald
Computer Science, Chemistry, Biological Sciences, Dartmouth, Hanover, NH

Abstract
Recognition of a protein's fold provides valuable information about its function. While many sequence-based homology prediction methods exist, an important challenge remains: two highly dissimilar sequences can have similar folds. How can we detect this rapidly, in the context of structural genomics? High-throughput NMR experiments, coupled with novel algorithms for data analysis, can address this challenge. We report an automated procedure for detecting 3D structural homologies from sparse, unassigned protein NMR data. Our method identifies the 3D structural models in a protein structural database whose geometries best fit the unassigned experimental NMR data. It does not use sequence information and is thus not limited by sequence homology. The method can also be used to confirm or refute structural predictions made by other techniques such as protein threading or sequence homology. The algorithm runs in O(pnk³) time, where p is the number of proteins in the database, n is the number of residues in the target protein, and k is the resolution of a rotation search. The method requires only uniform ¹⁵N-labelling of the protein and processes unassigned H^N-¹⁵N residual dipolar couplings, which can be acquired in a couple of hours. Our experiments on NMR data from 5 different proteins demonstrate that the method identifies closely related protein folds, despite low sequence homology between the target protein and the computed model.

Abbreviations used: NMR, nuclear magnetic resonance; RDC, residual dipolar coupling; DOF, degrees of freedom; 3D, three-dimensional; HSQC, heteronuclear single-quantum coherence; H^N, amide proton; SAR, structure activity relation; SO(3), special orthogonal (rotation) group in 3D.

Pattern Recognition (Session VIII)

Codon Optimization for DNA Vaccines and Gene Therapy Using Pattern Matching

Ravi Vijaya Satya and Amar Mukherjee
School of Engineering and Computer Science, University of Central Florida, Orlando, FL
Udaykumar Ranga
Jawaharlal Nehru Center for Advanced Scientific Research, Jakkur, Bangalore, India

Abstract
Codon optimization enhances the effectiveness of DNA expression vectors used in DNA vaccination and gene therapy by increasing protein expression. Additionally, certain nucleotide motifs have been shown experimentally to be immuno-stimulatory, while others have been shown to be immuno-suppressive. In this paper, we present algorithms to locate all possible occurrences of a given set of immuno-modulatory motifs in the DNA expression vectors corresponding to a given amino acid sequence, and to maximize or minimize the number and the context of these motifs in the vectors. The main contribution is the use of multiple pattern matching algorithms to synthesize a DNA sequence for a given amino acid sequence, together with a graph theoretic approach that finds the longest weighted path in a directed graph so as to maximize or minimize certain motifs while guaranteeing certain fitness factors of codon frequency usage for a particular species. This is achieved using O(n²) time and storage, where n is the length of the amino acid sequence, compared to a brute force algorithm that might take an exponential amount of resources. Based on this, we develop a software tool that could help the researcher codon-optimize the DNA vector for a given species for higher protein expression in a heterologous system. Additionally, this software could enable the researcher to analyze the content of CpG motifs in a given amino acid sequence and engineer CpG motifs for immuno-modulation.
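
The longest-weighted-path idea can be sketched on a layered graph in which each layer holds the synonymous codons of one amino acid and each edge is scored by codon usage plus a motif bonus or penalty. The sketch below is an illustration under stated assumptions, not the authors' algorithm: the function names, the tiny codon table and its usage weights are made up, scores are summed rather than combined in any species-specific way, and the dynamic program is exact only for motifs of at most 4 nucleotides (those that span at most two adjacent codons).

```python
from typing import Dict, Tuple

# Hypothetical, incomplete codon table with made-up usage weights.
CODON_USAGE: Dict[str, Dict[str, float]] = {
    "A": {"GCC": 0.40, "GCG": 0.11, "GCT": 0.26, "GCA": 0.23},
    "R": {"CGC": 0.19, "AGA": 0.20, "CGG": 0.21},
    "D": {"GAC": 0.54, "GAT": 0.46},
}


def count_motif(seq: str, motif: str) -> int:
    """Count (possibly overlapping) occurrences of `motif` in `seq`."""
    return sum(1 for i in range(len(seq) - len(motif) + 1) if seq[i:i + len(motif)] == motif)


def optimize_codons(protein: str, motif: str = "CG", motif_weight: float = 1.0) -> Tuple[float, str]:
    """Longest weighted path through the layered codon graph.

    A positive `motif_weight` rewards motif occurrences (maximize);
    a negative one penalizes them (minimize).
    Returns (best score, DNA sequence).
    """
    # best[codon] = (score, dna) of the best path ending in `codon`.
    best = {
        codon: (usage + motif_weight * count_motif(codon, motif), codon)
        for codon, usage in CODON_USAGE[protein[0]].items()
    }
    for amino_acid in protein[1:]:
        nxt = {}
        for codon, usage in CODON_USAGE[amino_acid].items():
            inside = count_motif(codon, motif)
            options = []
            for prev, (score, dna) in best.items():
                # Occurrences that straddle the boundary between prev and codon.
                spanning = count_motif(prev + codon, motif) - count_motif(prev, motif) - inside
                options.append((score + usage + motif_weight * (inside + spanning), dna + codon))
            nxt[codon] = max(options)
        best = nxt
    return max(best.values())


cpg_rich = optimize_codons("DA", motif="CG", motif_weight=1.0)
cpg_poor = optimize_codons("AR", motif="CG", motif_weight=-1.0)
```

With these made-up weights, the CpG-rich call selects GAC followed by GCG, which creates one CpG across the codon boundary and one inside the second codon; the CpG-poor call avoids the motif entirely.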

Haplotype Motifs: An Algorithmic Approach to Locating Evolutionarily Conserved Patterns in Haploid Sequences

Russell Schwartz
Biological Sciences, Carnegie Mellon University, Pittsburgh, PA

Abstract
The promise of plentiful data on common human genetic variations has given hope that we will be able to uncover genetic factors behind common diseases that have proven difficult to locate by prior methods. Much recent interest in this problem has focused on using haplotypes (contiguous regions of correlated genetic variations), instead of isolated variations, in order to reduce the size of the statistical analysis problem. To use such variation data most effectively, we will need a better understanding of haplotype structure, including both the general principles underlying haplotype structure in the human population and the specific structures found in particular genetic regions or sub-populations. This paper presents a probabilistic model for analyzing haplotype structure in a population using conserved motifs found in statistically significant sub-populations. It describes the model and computational methods for deriving the predicted motif set and haplotype structure for a population. It further presents results on simulated data, in order to validate the method, and on two real datasets from the literature, in order to illustrate its practical application.

Sequence Alignment (Session X)

A New Approach for Gene Annotation Using Unambiguous Sequence Joining

Alexandre Tchourbanov, Daniel Quest, Hesham Ali, Mark Pauley and Robert Norgren
Computer Science, College of Information Science and Technology, University of Nebraska, Omaha, NE and Genetics, Cell Biology and Anatomy, University of Nebraska, Medical Center, Omaha, NE

Abstract
This paper addresses accurate, automatic gene finding followed by precise identification and annotation of exon and intron boundaries for biologically verified nucleotide sequences, using the alignment of human genomic DNA to the curated mRNA transcript. We present a detailed description of a new cDNA/DNA homology gene annotation algorithm that combines the results of a BLASTN search with spliced alignment. Unambiguous joining of several genomic DNA sequences is the key feature that increases our annotation quality compared with other programs. We also address gene annotation with both non-canonical splice sites and short exons. The approach has been tested on the Genie learning subset as well as the full-scale human RefSeq, demonstrating performance as high as 97%.

Microarray Data Analysis (Session II)

Group Testing With DNA Chips: Generating Designs and Decoding Experiments

Alexander Schliep
Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Berlin, Germany
David C. Torney
Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM
Sven Rahmann
Mathematics and Computer Science, Freie Universität, Berlin, Germany and Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Berlin, Germany

Abstract
DNA microarrays are a valuable tool for massively parallel DNA-DNA hybridization experiments. Currently, most applications rely on the existence of sequence-specific oligonucleotide probes. In large families of closely related target sequences, such as different virus subtypes, the high degree of similarity often makes it impossible to find a unique probe for every target. On the other hand, a robust presence or absence call may be more important than a quantitative analysis in these cases. We propose a microarray design methodology based on a group testing approach. While probes might not bind to unique targets, a properly chosen probe set can still unambiguously distinguish the presence of one target set from the presence of a different target set. Our method is the first that explicitly takes cross-hybridization and experimental errors into account.

The complete approach consists of three steps: (1) Pre-selection of probe candidates, (2) Generation of a suitable group testing design, and (3) Decoding of hybridization results to infer presence or absence of individual targets. Our results show that this approach is very promising, even for difficult data sets and experimental error rates of up to 5%.
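
To make the decoding step (3) concrete, the sketch below applies a simple textbook-style group-testing decoder: any target bound by a negative probe is ruled out, and a target is called present when it is the only remaining explanation for some positive probe. This is an assumption-laden illustration, not the authors' decoder: it assumes error-free hybridization results, whereas the method described above explicitly models experimental errors. The function name `decode` and the probe and target names are hypothetical.

```python
from typing import Dict, Set, Tuple


def decode(design: Dict[str, Set[str]], positive_probes: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Error-free group-testing decoding.

    design maps each probe to the set of targets it hybridizes to.
    Returns (possibly_present, definitely_present).
    """
    possibly_present: Set[str] = set().union(*design.values())
    # A negative probe excludes every target it would have bound.
    for probe, targets in design.items():
        if probe not in positive_probes:
            possibly_present -= targets
    # A positive probe with a single surviving target pins that target down.
    definitely_present: Set[str] = set()
    for probe in positive_probes:
        remaining = design[probe] & possibly_present
        if len(remaining) == 1:
            definitely_present |= remaining
    return possibly_present, definitely_present


# Made-up design: no probe is unique to one target, yet each target has a distinct probe pattern.
example_design = {
    "probe1": {"virusA", "virusB"},
    "probe2": {"virusB", "virusC"},
    "probe3": {"virusA", "virusC"},
}
possible, definite = decode(example_design, positive_probes={"probe1", "probe3"})
```

Here probe2 is negative, which rules out virusB and virusC, so virusA is the only explanation for the two positive probes even though no probe binds it uniquely.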

Data Mining (Session XII)

A Probabilistic Model for Identifying Protein Names and their Name Boundaries

Kazuhiro Seki and Javed Mostafa
Laboratory of Applied Informatics Research, Indiana University, Bloomington, Indiana

Abstract
This paper proposes a method for identifying protein names in biomedical texts with an emphasis on detecting protein name boundaries. We use a probabilistic model which exploits several surface clues characterizing protein names and incorporates word classes for generalization. In contrast to previously proposed methods, our approach does not rely on natural language processing tools such as part-of-speech taggers and syntactic parsers, which reduces processing overhead and the number of probabilistic parameters to be estimated. A notion of certainty is also proposed to improve the precision of identification. We implemented a protein name identification system based on our proposed method and evaluated it on real-world biomedical texts in comparison with previous work. The results showed that overall our system performs comparably to the state-of-the-art protein name identification system and that higher performance is achieved for compound names. In addition, it is shown that our system can further improve precision by restricting the system output to names identified with high certainty.

Biomedical Research (Session V)

Fourier Harmonic Approach for Visualizing Temporal Patterns of Gene Expression Data

Li Zhang and Aidong Zhang
Computer Science and Engineering, State University of New York, Buffalo, NY
Murali Ramanathan
Pharmaceutical Sciences, State University of New York, Buffalo, NY

Abstract
DNA microarray technology provides a broad snapshot of the state of the cell by measuring the expression levels of thousands of genes simultaneously. Visualization techniques can enable the exploration and detection of patterns and relationships in a complex dataset by presenting the data in a graphical format in which the key characteristics become more apparent. The purpose of this study is to present an interactive visualization technique that conveys the temporal patterns of gene expression data in a form intuitive for non-specialized end-users. The first Fourier harmonic projection (FFHP) is introduced to translate the multi-dimensional time series data into a two-dimensional scatter plot. The spatial relationships of the points reflect the structure of the original dataset, and relationships among clusters become visible in two dimensions. The proposed method was tested using two published, array-derived gene expression datasets. Our results demonstrate the effectiveness of the approach.
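
One plausible reading of such a projection is to take the first harmonic of the discrete Fourier transform of each gene's time series and plot its real and imaginary parts as the two scatter-plot coordinates. The sketch below follows that reading; it is an assumption, since the abstract does not give the exact definition or normalization of the FFHP, and the function name `first_harmonic` and the example series are made up.

```python
import cmath
from typing import Sequence, Tuple


def first_harmonic(series: Sequence[float]) -> Tuple[float, float]:
    """Project a time series onto its first DFT harmonic.

    Returns (real part, imaginary part), usable as 2D plot coordinates.
    No normalization is applied (an assumption of this sketch).
    """
    n = len(series)
    coefficient = sum(
        value * cmath.exp(-2j * cmath.pi * t / n) for t, value in enumerate(series)
    )
    return coefficient.real, coefficient.imag


# Made-up expression profiles sampled at four time points.
flat_gene = first_harmonic([3.0, 3.0, 3.0, 3.0])       # no temporal pattern
early_peak = first_harmonic([1.0, 0.0, -1.0, 0.0])     # peaks at the first time point
later_peak = first_harmonic([0.0, 1.0, 0.0, -1.0])     # same shape, shifted by one step
```

Under this reading, a gene with a flat profile lands at the origin, while genes with the same temporal shape but different peak times land at the same distance from the origin at different angles.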
