Full Papers Presented as Posters
Pathways, Networks and System Biology (Session IV)
Two Operative Concepts for the Post-genomic Era:
The "mémoire vive" of the Cell and a Molecular Algebra
Simone Bentolila IGM, University Marne la Vallée - Cité Descartes, France
Abstract
The first successes in cloning experiments and stem cell "reprogramming" have already
demonstrated the primordial role of cellular working-space memory and regulatory
mechanisms, which use the knowledge stored in the DNA database in read mode.
We present an analogy between living systems and informatics systems by considering: 1) the cell
cytoplasm as a memory device accessible as read/write; 2) the mechanisms of
regulation as a programming language defined by a grammar, a molecular algebra;
3) biological processes as volatile programs which are executed without being written;
4) DNA as a database in read only mode. We also present applications to two biological algorithms:
the immune response and glycogen metabolism.
Protein/RNA
Structure Prediction and Modeling (Session IX)
Automated Protein NMR Resonance Assignments
Xiang Wan Protein Engineering Network Centers of Excellence, Computing Science,
University of Alberta, Edmonton, Alberta, Canada
Dong Xu Protein Informatics Group, Life Sciences Division, Oak Ridge National Laboratory,
Oak Ridge, TN. Supported by the Office of Biological and Environmental Research, U.S.
Department of Energy, under Contract DE-AC05-00OR22725, managed by UT-Battelle, LLC.
Carolyn M. Slupsky Protein Engineering Network Centers of Excellence, Medical Research
Center, University of Alberta, Edmonton, Alberta, Canada
Guohui Lin Protein Engineering Network Centers of Excellence, Computing Science,
University of Alberta, Edmonton, Alberta, Canada. Supported in part by
NSERC, PENCE, and a Startup REE Grant from the University of Alberta.
Abstract
NMR resonance peak assignment is one of the key steps in solving an NMR protein
structure. The assignment process links resonance peaks to individual residues of the
target protein sequence, providing the prerequisite for establishing intra- and
inter-residue spatial relationships between atoms. The assignment process is tedious
and time-consuming, and can take many weeks. Although a number of computer programs
exist to assist with the assignment process, many NMR labs still perform
the assignments manually to ensure quality. This paper presents (1) a new scoring
system for mapping spin systems to residues, (2) an automated adjacency information
extraction procedure from NMR spectra, and (3) a very fast assignment algorithm
based on our previously proposed greedy filtering method and a maximum matching
algorithm to automate the assignment process. The computational tests on 70 instances
of (pseudo) experimental NMR data of 14 proteins demonstrate that the new score
scheme has much better discerning power with the aid of adjacency information between
spin systems simulated across various NMR spectra. Typically, with automated extraction
of adjacency information, our method achieves nearly complete assignments
for most of the proteins. The experiments suggest that the fast
automated assignment algorithm, together with the new score scheme and automated
adjacency extraction, may be ready for practical use.
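The combination of filtering and maximum matching named in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the score matrix, the "lower score is better" convention, the `keep` parameter, and both function names are assumptions made for illustration, and adjacency information is ignored. Each spin system keeps only its best-scoring candidate residues, and an augmenting-path bipartite matching then assigns spin systems to distinct residues.

```python
def filter_candidates(scores, keep):
    """For each spin system (row), keep the `keep` lowest-scoring residues.

    Assumption: lower score means a better spin-system/residue fit.
    """
    return [sorted(range(len(row)), key=row.__getitem__)[:keep] for row in scores]


def max_matching(candidates, n_residues):
    """Maximum bipartite matching via augmenting paths (Kuhn's algorithm).

    candidates[s] lists the residue indices still allowed for spin system s.
    Returns a dict mapping spin system -> residue.
    """
    match_of_residue = [None] * n_residues

    def try_assign(s, visited):
        for r in candidates[s]:
            if r in visited:
                continue
            visited.add(r)
            # Take a free residue, or re-route the spin system that holds it.
            if match_of_residue[r] is None or try_assign(match_of_residue[r], visited):
                match_of_residue[r] = s
                return True
        return False

    for s in range(len(candidates)):
        try_assign(s, set())
    return {s: r for r, s in enumerate(match_of_residue) if s is not None}


# Hypothetical scores: 3 spin systems (rows) against 3 residues (columns).
scores = [
    [0.1, 0.9, 0.8],
    [0.2, 0.3, 0.9],
    [0.7, 0.2, 0.1],
]
candidates = filter_candidates(scores, keep=2)
assignment = max_matching(candidates, n_residues=3)
print(assignment)
```

Filtering first keeps the matching graph sparse, which is what makes the matching step fast; a real scoring scheme would also have to respect the adjacency constraints between spin systems.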
3D Structural Homology Detection
via Unassigned Residual Dipolar Couplings
Christopher James Langmead and Bruce Randall Donald Computer Science, Chemistry,
Biological Sciences, Dartmouth, Hanover, NH
Abstract
Recognition of a protein's fold provides valuable information about its function. While many
sequence-based homology prediction methods exist, an important challenge remains: two highly dissimilar
sequences can have similar folds. How can we detect this rapidly, in the context of structural genomics?
High-throughput NMR experiments, coupled with novel algorithms for data analysis, can address this
challenge. We report an automated procedure for detecting 3D structural homologies from sparse, unassigned
protein NMR data. Our method identifies the 3D structural models in a protein structural database whose
geometries best fit the unassigned experimental NMR data. It does not use sequence information and is
thus not limited by sequence homology. The method can also be used to confirm or refute structural predictions
made by other techniques such as protein threading or sequence homology. The algorithm runs in O(pnk^3) time,
where p is the number of proteins in the database, n is the number of residues in the target protein, and k
is the resolution of a rotation search. The method requires only uniform 15N-labelling of the protein and
processes unassigned HN-15N residual dipolar couplings, which can be acquired in a couple of hours.
Our experiments on NMR data from 5 different proteins demonstrate that the
method identifies closely related protein folds, despite low sequence homology between the target protein
and the computed model.
Abbreviations used: NMR, nuclear magnetic resonance; RDC, residual dipolar coupling; DOF, degrees of freedom;
3D, three-dimensional; HSQC, heteronuclear single-quantum coherence; HN, amide proton; SAR, structure-activity
relation; SO(3), special orthogonal (rotation) group in 3D.
Pattern Recognition (Session VIII)
Codon Optimization for DNA
Vaccines and Gene Therapy Using Pattern Matching
Ravi Vijaya Satya and Amar Mukherjee
School of Engineering and Computer Science, University of Central Florida, Orlando, FL
Udaykumar Ranga
Jawaharlal Nehru Center for Advanced Scientific Research, Jakkur,
Bangalore, India
Abstract
Codon optimization enhances the effectiveness of DNA expression vectors used in
DNA vaccination and gene therapy by increasing protein expression. Additionally,
certain nucleotide motifs have experimentally been shown to be immuno-stimulatory while certain others
have been shown to be immuno-suppressive. In this paper, we present algorithms to locate all the possible
occurrences of a given set of immuno-modulatory motifs in the DNA expression vectors corresponding to a
given amino acid sequence and maximize or minimize the number and the context of the immuno-modulatory
motifs in the DNA expression vectors. The main contribution is to use multiple pattern matching algorithms
to synthesize a DNA sequence for a given amino acid sequence and a graph theoretic approach for finding the longest weighted
path in a directed graph that will maximize or minimize certain motifs as well as guarantee certain
fitness factors of codon frequency usage for a particular species. This is achieved using O(n^2) time
and storage resources, compared to the brute force algorithm, which might take an exponential amount of
resources, where n is the length of the amino acid sequence. Based on this, we develop a software tool
that could help the researcher to codon-optimize the DNA vector for a given species for higher protein expression in a
heterologous system. Additionally, this software could also enable the researcher to analyze the content
of CpG motifs in a given amino acid sequence and engineer CpG motifs for immuno-modulation.
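The longest-weighted-path idea can be sketched as a layered dynamic program. This is a simplified illustration under stated assumptions, not the authors' tool: each node is a codon choice at one amino acid position, each edge weight counts motif occurrences that end in the newly added codon, and codon-usage fitness is omitted. The partial codon table, the function names, and the restriction to motifs of at most 4 nucleotides (so that an occurrence never spans more than two adjacent codons) are all assumptions of this sketch.

```python
# Partial standard codon table, enough for the example below.
CODONS = {
    "M": ["ATG"],
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "D": ["GAT", "GAC"],
}


def count_ending_in_last_codon(prev, cur, motif):
    """Count motif occurrences in prev+cur whose last base lies in cur."""
    window = prev + cur
    offset = len(prev)
    count = 0
    for start in range(len(window) - len(motif) + 1):
        end = start + len(motif) - 1
        if end >= offset and window.startswith(motif, start):
            count += 1
    return count


def best_encoding(protein, motif="CG", maximize=True):
    """Longest (or shortest) weighted path through the layered codon graph.

    Assumes len(motif) <= 4. Returns (motif_count, dna_sequence).
    """
    sign = 1 if maximize else -1
    # best[codon] = (signed score, DNA so far) for the best path ending in codon
    best = {
        c: (sign * count_ending_in_last_codon("", c, motif), c)
        for c in CODONS[protein[0]]
    }
    for aa in protein[1:]:
        nxt = {}
        for cur in CODONS[aa]:
            nxt[cur] = max(
                (score + sign * count_ending_in_last_codon(prev, cur, motif), dna + cur)
                for prev, (score, dna) in best.items()
            )
        best = nxt
    score, dna = max(best.values())
    return sign * score, dna


max_count, max_dna = best_encoding("MAR", motif="CG", maximize=True)
min_count, min_dna = best_encoding("MAR", motif="CG", maximize=False)
print(max_count, max_dna)
print(min_count, min_dna)
```

Because every position has at most six synonymous codons, the number of edges examined grows linearly with the protein length, whereas enumerating every synonymous DNA sequence grows exponentially.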
Haplotype Motifs: An Algorithmic
Approach to Locating Evolutionarily Conserved Patterns in Haploid Sequences
Russell Schwartz Biological Sciences, Carnegie Mellon University, Pittsburgh, PA
Abstract
The promise of plentiful data on common human genetic variations has given hope that we will be
able to uncover genetic factors behind common diseases that have proven difficult to locate by prior methods.
Much recent interest in this problem has focused on using haplotypes (contiguous regions of correlated
genetic variations), instead of the isolated variations, in order to reduce the size of the statistical
analysis problem. In order to most effectively use such variation data, we will need a better understanding
of haplotype structure, including both the general principles underlying haplotype structure in the human
population and the specific structures found in particular genetic regions or sub-populations. This paper
presents a probabilistic model for analyzing haplotype structure in a population using conservative motifs
found in statistically significant sub-populations. It describes the model and computational methods for
deriving the predicted motif set and haplotype structure for a population. It further presents results on
simulated data, in order to validate the method, and on two real datasets from the literature, in order to
illustrate its practical application.
Sequence Alignment (Session X)
A New Approach for Gene Annotation Using Unambiguous
Sequence Joint
Alexandre Tchourbanov, Daniel Quest, Hesham Ali, Mark Pauley and Robert Norgren
Computer Science, College of Information Science and Technology,
University of Nebraska, Omaha, NE and Genetics, Cell Biology and Anatomy,
University of Nebraska, Medical Center, Omaha, NE
Abstract
The problem addressed by this paper is accurate and automatic gene finding following
precise identification/annotation of exon and intron boundaries for biologically verified
nucleotide sequences, using the alignment of human genomic DNA to the curated mRNA
transcript. We present a detailed description of a new cDNA/DNA homology gene
annotation algorithm combining the results of BLASTN search with spliced alignment. The unambiguous
junction of several genomic DNA sequences is the key feature that increases our annotation quality
compared with other programs. We also address gene annotation with both non-canonical splice sites
and short exons. The approach has been tested on the Genie learning subset as well as the full-scale
human RefSeq, demonstrating performance as high as 97%.
Microarray Data Analysis (Session II)
Group Testing With DNA Chips: Generating Designs and Decoding Experiments
Alexander Schliep Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Berlin, Germany
David C. Torney Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM
Sven Rahmann Mathematics and Computer Science, Freie Universität, Berlin, Germany and
Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Berlin, Germany
Abstract
DNA microarrays are a valuable tool for massively parallel DNA-DNA hybridization experiments.
Currently, most applications rely on the existence of sequence-specific oligonucleotide probes. In large
families of closely related target sequences, such as different virus subtypes, the high degree of similarity
often makes it impossible to find a unique probe for every target. On the other hand, a robust presence or
absence call may be more important than a quantitative analysis in these cases. We propose a microarray design
methodology based on a group testing approach. While probes might not bind to unique targets, a properly
chosen probe set can still unambiguously distinguish the presence of one target set from the presence of a
different target set. Our method is the first one that explicitly takes cross-hybridization and experiment errors into account.
The complete approach consists of three steps: (1) Pre-selection of probe candidates, (2) Generation of a
suitable group testing design, and (3) Decoding of hybridization results to infer presence or absence of
individual targets. Our results show that this approach is very promising, even for difficult data sets
and experimental error rates of up to 5%.
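The decoding step (3) can be illustrated with the classical error-free group testing decoder. This is a deliberately simplified sketch and not the authors' method, which explicitly accounts for cross-hybridization and experimental errors; the design, the target names, and the `decode` function are hypothetical. A target is ruled out as soon as any probe that binds it stays negative, and a target is confirmed when it is the only remaining explanation for some positive probe.

```python
def decode(design, positives):
    """Error-free group testing decoder.

    design[p]  -- set of targets that probe p hybridizes to
    positives  -- set of probe indices with a positive signal
    Returns (candidates, definite): targets not ruled out, and targets
    that must be present to explain at least one positive probe.
    """
    all_targets = set().union(*design)

    # Any target bound by a negative probe cannot be present.
    excluded = set()
    for p, targets in enumerate(design):
        if p not in positives:
            excluded |= targets
    candidates = all_targets - excluded

    # A positive probe with exactly one surviving target confirms it.
    definite = set()
    for p in positives:
        hit = design[p] & candidates
        if len(hit) == 1:
            definite |= hit
    return candidates, definite


# Hypothetical design: 4 non-unique probes covering 4 targets.
design = [{"A", "B"}, {"B", "C"}, {"A", "C"}, {"C", "D"}]
candidates, definite = decode(design, positives={0, 1})
print(candidates, definite)
```

No probe in this design is unique to a single target, yet the pattern of positive and negative probes still identifies the target present, which is the point made in the abstract about non-unique probes.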
Data Mining (Session XII)
A Probabilistic Model
for Identifying Protein Names and their Name Boundaries
Kazuhiro Seki and Javed Mostafa Laboratory of Applied Informatics Research, Indiana University,
Bloomington, Indiana
Abstract
This paper proposes a method for identifying protein names in biomedical texts with an emphasis on
detecting protein name boundaries. We use a probabilistic model which exploits several surface clues characterizing
protein names and incorporates word classes for generalization. In contrast to previously proposed methods, our
approach does not rely on natural language processing tools such as part-of-speech taggers and syntactic parsers,
which reduces processing overhead and the number of probabilistic parameters to be estimated. A notion of certainty is also
proposed to improve precision for identification. We implemented a protein name identification system based on
our proposed method, and evaluated the system on real-world biomedical texts in conjunction with the previous
work. The results showed that overall our system performs comparably to the state-of-the-art protein name
identification system and that higher performance is achieved for compound names. In addition, it is shown
that our system can further improve precision by restricting the system output to those with high certainties.
Biomedical Research (Session V)
Fourier Harmonic Approach for Visualizing Temporal Patterns of Gene Expression Data
Li Zhang and Aidong Zhang Computer Science and Engineering, State University of New York, Buffalo, NY
Murali Ramanathan Pharmaceutical Sciences, State University of New York, Buffalo, NY
Abstract
DNA microarray technology provides a broad snapshot of the state of the cell by measuring the
expression levels of thousands of genes simultaneously. Visualization techniques can enable the exploration
and detection of patterns and relationships in a complex dataset by presenting the data in a graphical
format in which the key characteristics become more apparent. The purpose of this study is to present an
interactive visualization technique conveying the temporal patterns of gene expression data in a form intuitive
for non-specialized end-users. The first Fourier harmonic projection (FFHP) was introduced to translate the
multi-dimensional time series data into a two-dimensional scatter plot. The spatial relationship of the
points reflects the structure of the original dataset, and relationships among clusters can be read in two dimensions.
The proposed method was tested using two published, array-derived gene expression datasets.
Our results demonstrate the effectiveness of the approach.
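A minimal sketch of a first-harmonic projection follows, assuming it means taking the first coefficient of the discrete Fourier transform of each gene's time series and plotting its real and imaginary parts. The abstract does not give the exact definition or normalization, so the division by the series length and the function name are assumptions of this sketch.

```python
import cmath
import math


def first_harmonic(series):
    """Map a time series to a 2D point via its first DFT harmonic.

    Returns (real, imag) of the k=1 Fourier coefficient, divided by the
    series length (an assumed normalization).
    """
    n = len(series)
    coeff = sum(
        x * cmath.exp(-2j * math.pi * t / n) for t, x in enumerate(series)
    ) / n
    return coeff.real, coeff.imag


# Hypothetical expression profiles sampled at 8 time points.
peaks_early = [math.cos(2 * math.pi * t / 8) for t in range(8)]
flat = [3.0] * 8

point_early = first_harmonic(peaks_early)
point_flat = first_harmonic(flat)
print(point_early, point_flat)
```

In this projection the distance of a point from the origin reflects the strength of the single-cycle pattern and its angle reflects the timing of the peak, so genes with similar temporal profiles land close together in the scatter plot.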