Podium Presentations
Pathways, Networks and Systems Biology (Session IV)
Can We Identify Cellular Pathways Implicated
in Cancer Using Gene Expression Data?
Nigam Shah Pennsylvania State University, University Park, PA
Jorge Lepre, Yuhai Tu and Gustavo Stolovitzky
IBM Computational Biology Center, T.J. Watson Research Center, Yorktown Heights, NY
Abstract
The cancer state of a cell is characterized by alterations of important
cellular processes such as cell proliferation, apoptosis, DNA-damage
repair, etc. Some of these alterations involve modifications of the
expression of genes that participate in the pathways responsible for
these processes. From this simple observation it follows that the
expression of genes associated with cancer related pathways should
exhibit differences between the normal and the cancerous states. We
explore various means to find those differences. Interestingly, these
differences can only be identified when groups of genes, as opposed
to isolated genes, are considered. Instead of using all the individual
genes on a DNA array in comparing cohorts of cancer patients with
control subjects, our analysis searches for signals within small subsets
of genes that are associated with specific pathways, thereby substantially
increasing the sensitivity of the analysis, while preserving the biological
information contained in the pathways. We analyze 6 different pathways
(p53, Ras, Brca, DNA damage repair, NFkb and β-catenin) and 4 different
types of cancer: colon, pancreas, prostate and kidney. Our results
are found to be mostly consistent with existing knowledge of the involvement
of these pathways in different cancers. Our analysis constitutes proof
of principle that it may be possible to predict the involvement of
a particular pathway in cancer or other diseases by using gene expression
data. Such a method would be particularly useful for diseases whose
biology is poorly understood.
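The group-of-genes idea in this abstract can be illustrated with a toy permutation test. This is only a sketch of pooling signal over a pathway's gene set, not the authors' actual statistic; the data layout (a dict of per-gene, per-sample expression values) and all names are invented for illustration:

```python
import random

def pathway_score(expr, genes, cases, controls):
    """Mean absolute difference in average expression over a gene set."""
    diffs = []
    for g in genes:
        case_mean = sum(expr[g][s] for s in cases) / len(cases)
        ctrl_mean = sum(expr[g][s] for s in controls) / len(controls)
        diffs.append(abs(case_mean - ctrl_mean))
    return sum(diffs) / len(diffs)

def permutation_p_value(expr, genes, cases, controls, n_perm=1000, seed=0):
    """Empirical p-value: the fraction of random case/control relabelings
    whose pathway score is at least the observed one."""
    rng = random.Random(seed)
    observed = pathway_score(expr, genes, cases, controls)
    samples = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(samples)
        if pathway_score(expr, genes, samples[:len(cases)],
                         samples[len(cases):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Pooling over the gene set is what buys sensitivity: a modest shift shared by several pathway genes can reach significance even when no single gene does.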
Statistical Inference for Well-ordered
Structures in Nucleotide Sequences
Shu-Yun Le Laboratory of Experimental and Computational Biology NCI Center for Cancer Research,
National Cancer Institute, NIH
Jih-H. Chen and Jacob V. Maizel, Jr.
Advanced Biomedical Computing Center
Abstract
Distinct, local structures are frequently correlated with functional
RNA elements involved in post-transcriptional regulation of gene expression.
Discovery of microRNAs (miRNAs) suggests that there is a large class
of small non-coding RNAs in eukaryotic genomes. These miRNAs have
the potential to form distinct fold-back stem-loop structures. The
prediction of those well-ordered folding sequences (WFS) in genomic
sequences is very helpful for our understanding of RNA-based gene
regulation and determination of local RNA elements with structure-dependent
functions. In this study, we describe a novel method for discovering
the local WFS in a nucleotide sequence by Monte Carlo simulation and
RNA folding. In this approach, the quality of a local WFS is assessed
by the energy difference (Ediff) between the optimal structure folded
in the local segment and its corresponding optimal, restrained structure
where all the previous base pairings formed are prohibited. Using
Monte Carlo simulations we can evaluate the random probabilities of
Ediff in a random sample. The distinct WFS can be discovered by scanning
successive segments along a sequence and evaluating the difference
between the Ediff of the natural sequence and those computed from randomly
shuffled sequences. Our results indicate that the statistically significant
WFS detected in the genomic sequences of Caenorhabditis elegans (C. elegans)
F49E12, T07C5, T07D1, T10H9, Y56A3A and Y71G12B are coincident with
known fold-back stem-loops found in miRNA precursors. The potential
and implications of our method in searching for miRNAs in genomes
is discussed.
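The shuffle-based significance machinery described here can be sketched as follows. The energy function below is a deliberately crude stand-in (it merely counts Watson-Crick pairs when a sequence is folded back on itself); a real application would call an RNA folding program to obtain the Ediff score, and the threshold and shuffle counts are illustrative only:

```python
import random

def toy_energy(seq):
    """Placeholder for an RNA folding energy: counts Watson-Crick
    complementary positions when the sequence is folded back on itself.
    A real analysis would substitute the Ediff from an RNA folding
    program here."""
    pair = {'A': 'U', 'U': 'A', 'G': 'C', 'C': 'G'}
    half = len(seq) // 2
    stem = sum(1 for i in range(half) if pair.get(seq[i]) == seq[-1 - i])
    return -stem  # more pairing = lower (better) energy

def shuffle_p_value(seq, energy=toy_energy, n_shuffles=500, seed=1):
    """Empirical probability that a mononucleotide-shuffled sequence
    scores at least as well (energy at least as low) as the real one."""
    rng = random.Random(seed)
    observed = energy(seq)
    letters = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if energy(''.join(letters)) <= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)
```

A sequence whose fold-back score is rarely matched by its shuffled versions is, in this sense, a statistically distinct well-ordered folding sequence.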
Combining Microarrays and Biological
Knowledge for Estimating Gene Networks via Bayesian Networks
Seiya Imoto Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan
Tomoyuki Higuchi The Institute of Statistical Mathematics, 4-6-7, Tokyo, Japan
Takao Goto Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan
Kousuke Tashiro Graduate School of Genetic Resources Technology, Kyushu University, Japan
Satoru Kuhara Graduate School of Genetic Resources Technology, Kyushu University, Japan
Satoru Miyano Human Genome Center, Institute of Medical Science, Univ. of Tokyo, Tokyo, Japan
Abstract
We propose a statistical method for estimating a gene network based on Bayesian networks
from microarray gene expression data together with biological knowledge
including protein-protein interactions, protein-DNA interactions, binding site
information, existing literature and so on. Unfortunately, microarray data do not contain
enough information for constructing gene networks accurately in many cases. Our method
adds biological knowledge to the estimation method of gene networks under a Bayesian statistical
framework, and also controls the trade-off between microarray information and biological
knowledge automatically. We conduct Monte Carlo simulations to show the effectiveness of
the proposed method. We analyze Saccharomyces cerevisiae gene expression data as an application.
Protein/RNA Structure Prediction and Modeling (Sessions VI & IX)
Towards
Index-based Similarity Search for Protein Structure Databases
Orhan Camoglu, Tamer Kahveci and Ambuj K. Singh
Department of Computer Science, University of California at Santa Barbara, Santa Barbara, CA
Abstract
We propose two methods for finding similarities in protein structure databases. Our techniques
extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. These
feature vectors are then indexed using a multidimensional index structure. Our first technique
considers the problem of finding proteins similar to a given query protein in a protein dataset.
This technique quickly finds promising proteins using the index structure. These proteins are
then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also
develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second
technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental
results show that our techniques improve the pruning time of VAST 3 to 3.5 times while keeping the sensitivity similar.
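The triplet-based feature extraction can be sketched as below. This is a minimal illustration: the three sorted pairwise distances between SSE midpoints are already invariant to rotation, translation, and triplet ordering, but the paper's actual feature vectors also encode SSE types, lengths, and angles:

```python
from itertools import combinations
from math import dist  # Python 3.8+

def triplet_features(sse_midpoints):
    """One feature vector per triplet of SSE midpoints: the three
    pairwise distances, sorted. (A sketch; the paper's features are
    richer.)"""
    feats = []
    for a, b, c in combinations(sse_midpoints, 3):
        feats.append(tuple(sorted((dist(a, b), dist(b, c), dist(a, c)))))
    return feats
```

Each such vector can then be inserted into a multidimensional index (e.g. an R-tree) so that candidate proteins sharing similar SSE triplets are retrieved before any expensive pairwise alignment is attempted.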
Local Similarity in RNA Secondary Structures
Matthias Höchsmann
International Graduate School in Bioinformatics and Genome Research, Universität Bielefeld, Bielefeld, Germany
Thomas Töller Graduiertenkolleg Bioinformatik, Universität, Bielefeld, Germany
Robert Giegerich and Stefan Kurtz Technische Fakultät, Universität Bielefeld, Germany
Abstract
We present a systematic treatment of alignment distance and local similarity algorithms on
trees and forests. We build upon the tree alignment algorithm for ordered trees given by
Jiang et al. (1995) and extend it to calculate local forest alignments, which is essential for
finding locally similar regions in RNA secondary structures. The time complexity of our algorithm
is O(|F1| · |F2| · deg(F1) · deg(F2) · (deg(F1) + deg(F2))),
where |Fi| is the number of nodes
in forest Fi and deg(Fi) is the degree of Fi. We provide carefully engineered dynamic programming
implementations using dense, two-dimensional tables, which considerably reduce the space requirement.
We suggest a new representation of RNA secondary structures as forests that allow reasonable scoring
of edit operations on RNA secondary structures. The comparison of RNA secondary structures is
facilitated by a new visualization technique for RNA secondary structure alignments. Finally, we show
how potential regulatory motifs can be discovered solely by their structural preservation, and
independent of their sequence conservation and position.
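The forest representation of RNA secondary structure can be sketched from dot-bracket notation: each base pair becomes an internal node and each unpaired base a leaf. This is a generic encoding of the kind tree-alignment algorithms operate on, not necessarily the paper's exact node labeling:

```python
def dotbracket_to_forest(structure):
    """Parse dot-bracket notation into a forest (list of top-level
    nodes). Nodes are ('P', children) for a base pair and ('U', [])
    for an unpaired base."""
    root = []
    stack = [root]  # stack of children lists
    for ch in structure:
        if ch == '.':
            stack[-1].append(('U', []))
        elif ch == '(':
            node = ('P', [])
            stack[-1].append(node)
            stack.append(node[1])
        elif ch == ')':
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
        else:
            raise ValueError("unexpected character %r" % ch)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return root

def degree(forest):
    """Maximum number of children of any node, counting the top level
    as children of a virtual root: the deg(F) term in the complexity
    bound above."""
    best = len(forest)
    for _, kids in forest:
        best = max(best, degree(kids))
    return best
```

A hairpin like "((..))." thus becomes a two-node forest (one paired region, one trailing unpaired base), and deg(F) stays small for typical secondary structures, which keeps the stated complexity bound manageable.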
CTSS: A Robust and Efficient Method for
Protein Structure Alignment Based on Local Geometrical and Biological Features
Tolga Can and Yuan-Fang Wang
Computer Science, UC Santa Barbara, Santa Barbara, CA
Abstract
We present a new method for conducting protein structure similarity searches, which improves on the
accuracy, robustness, and efficiency of some existing techniques. Our method is grounded in the theory
of differential geometry on 3D space curve matching. We generate shape signatures for proteins that
are invariant (i.e., they are not affected by the translation and rotation of a protein structure in
space), localized (i.e., the signature at each residue location is completely determined by the local
structure around that particular residue), robust (i.e., small perturbations of atomic coordinates induce small
changes in the associated signatures), compact (i.e., the size of the signatures is O(n), n: number
of residues), and biologically meaningful (i.e., we incorporate secondary structure assignment into
the signature). To improve matching accuracy, we smooth the noisy raw atomic coordinate data with
spline fitting. To improve matching efficiency, we adopt a hierarchical coarse-to-fine strategy. We
first use an efficient hashing-based technique to screen out unlikely candidates. We then perform
pairwise alignments only for a small number of candidates that survive the screening process. Contrary to other
hashing based techniques, our technique employs domain specific information (not just geometric
information) in constructing the hash key, and hence, is more tuned to the domain of biology.
Furthermore, the invariancy, localization, and compactness of the shape signatures allow us to utilize
a well-known local sequence alignment algorithm for aligning two protein structures. One measure of
the efficacy of the proposed technique is that we were able to discover new, meaningful motifs that
were not reported by other structure alignment methods. We have implemented our method as an interactive
tool that allows visual inspection of the alignment results and iterative discovery of possible suboptimal
alignments that may have biological importance.
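The invariance and locality properties claimed for the signatures can be illustrated with a discrete stand-in: the turning angle at each interior point of a 3D backbone polyline. This is only a sketch of a rotation- and translation-invariant local shape measure, not the paper's spline-smoothed differential-geometric signature:

```python
from math import acos, dist

def turning_angles(points):
    """Turning angle (radians) at each interior vertex of a 3D
    polyline. Depends only on local geometry, so it is unchanged by
    rotating or translating the whole curve."""
    angles = []
    for p, q, r in zip(points, points[1:], points[2:]):
        u = [q[i] - p[i] for i in range(3)]
        v = [r[i] - q[i] for i in range(3)]
        cosang = sum(a * b for a, b in zip(u, v)) / (dist(p, q) * dist(q, r))
        angles.append(acos(max(-1.0, min(1.0, cosang))))
    return angles
```

Because each per-residue value is invariant and local, a string of such values can be compared with a standard local sequence alignment algorithm, which is exactly the reduction the abstract exploits.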
Statistical
and Visual Morph Movie Analysis of Crystallographic Mutant Selection
Bias in Protein Mutation Resource Data
Werner G. Krebs Integrative Biosciences, San Diego Supercomputer Center,
National Partnership for Advanced Computational Infrastructure
Phil Bourne Department of Pharmacology, UC San Diego, CA
Abstract
Structural studies of the effects of non-silent mutations on protein
conformational change are an important key in deciphering the language
that relates protein amino acid primary structure to tertiary structure.
Elsewhere, we presented the Protein Mutant Resource (PMR) database,
a set of online tools that systematically identified groups of related
mutant structures in the Protein DataBank (PDB), accurately inferred
mutant classifications in the Gene Ontology using an innovative, statistically
rigorous data-mining algorithm with more general applicability, and
illustrated the relationship of these mutant structures via an intuitive
user interface. Here, we perform a comprehensive statistical analysis
of the effect of PMR mutations on protein tertiary structure. We find
that, although the PMR does contain spectacular examples of conformational
change, in general there is a counter-intuitive inverse relationship
between conformational change (measured as Cα displacement or RMS
of the core structure) and the number of mutations in a structure.
That is, post-translational modifications by structural biologists
present in the PDB contrast with naturally evolved mutations. We compare
the frequency of mutations in the PMR/PDB datasets against the accepted
PAM250 natural amino acid mutation frequency to confirm these observations.
We generated morph movies from PMR structure pairs using technology
previously developed for the Macromolecular Motions Database, http://molmovdb.org,
allowing bioinformaticists, geneticists, protein engineers, and rational
drug designers to analyze visually the mechanisms of protein conformational
change and distinguish between conformational change due to motions
(e.g., ligand binding) and mutations. The PMR morph movies and statistics
can be freely viewed from the PMR website, http://pmr.sdsc.edu.
A New Similarity Measure Among Protein Sequences
Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu
Institute of Information Science, Academia Sinica, Taipei, Taiwan
Abstract
Protein sequence analysis is an important tool to decode the logic of life. One of the most
important similarity measures in this area is the edit distance between amino acids of two
sequences. We believe this criterion should be reconsidered because protein features are
probably associated more with small peptide fragments than with individual amino acids.
In this paper, we design small patterns that are associated with highly conserved regions
among a set of protein sequences. These patterns are used analogously to the index terms in
information retrieval. Therefore, we do not consider gaps within patterns. This new similarity
measure has been applied to phylogenetic tree construction, protein clustering and protein
secondary structure prediction and has produced promising results.
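The information-retrieval analogy can be sketched with gap-free peptide fragments as index terms and cosine similarity between their count vectors. Note the hedge: the paper derives its patterns from conserved regions of a sequence set, whereas this illustration simply uses all k-mers:

```python
from collections import Counter
from math import sqrt

def kmer_vector(seq, k=3):
    """Counts of the gap-free length-k peptide fragments of a sequence,
    the analogue of index terms in information retrieval."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_similarity(v1, v2):
    """Standard IR-style cosine similarity of two term-count vectors."""
    dot = sum(c * v2[t] for t, c in v1.items() if t in v2)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0
```

Fragment-level similarity of this kind is what lets the measure capture features carried by small peptides rather than by individual amino acids.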
Pattern Recognition (Session VIII)
cWINNOWER Algorithm for Finding Fuzzy DNA Motifs
Shoudan Liang
NASA Advanced Supercomputing Division, NASA Ames Research Center, Moffett Field, CA
Abstract
The cWINNOWER algorithm detects fuzzy motifs in DNA sequences rich in protein-binding
signals. A signal is defined as any short nucleotide pattern having up to d mutations
differing from a motif of length l. The algorithm finds such motifs if multiple mutated
copies of the motif (i.e., the signals) are present in the DNA sequence in sufficient
abundance. The cWINNOWER algorithm substantially improves the sensitivity of the Winnower
method of Pevzner and Sze by imposing a consensus constraint, enabling it to detect much
weaker signals. We studied the minimum number of detectable motifs qc as a function of
sequence length N for random sequences. We found that qc increases linearly with N for
a fast version of the algorithm based on counting three-member sub-cliques. Imposing
consensus constraints reduces qc by a factor of three in this case, which makes the
algorithm dramatically more sensitive. Our most sensitive algorithm, which counts
four-member sub-cliques, needs a minimum of only 13 signals to detect motifs in a
sequence of length N = 12000 for (l, d) = (15, 4).
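The (l, d) signal model used here is easy to state in code. The sketch below only counts signals, i.e. windows within Hamming distance d of a motif; the clique-counting machinery that makes cWINNOWER efficient is not reproduced:

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def count_signals(sequence, motif, d):
    """Count length-l windows of the sequence that are signals for the
    motif, i.e. differ from it in at most d positions."""
    l = len(motif)
    return sum(1 for i in range(len(sequence) - l + 1)
               if hamming(sequence[i:i + l], motif) <= d)
```

A motif is detectable when enough such signals are present; the abstract's q_c is the minimum signal count the algorithm needs.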
LOGOS: A
Modular Bayesian Model for de novo Motif Detection
Eric P. Xing
Computer Science Division, University of California at Berkeley, Berkeley, CA
Wei Wu Life Science Division, Lawrence Berkeley National Lab, Berkeley, CA
Michael I. Jordan Computer Science and Statistics, UC Berkeley, CA
Richard M. Karp Computer Science Division, UC Berkeley, CA
Abstract
The complexity of the global organization and internal structure of motifs in higher
eukaryotic organisms raises significant challenges for motif detection techniques. To
achieve successful de novo motif detection it is necessary to model the complex dependencies
within and among motifs and incorporate biological prior knowledge. In this paper, we present
LOGOS, an integrated Local and GlObal motif Sequence model for biopolymer sequences, which
provides a principled framework for developing, modularizing, extending and computing
expressive motif models for complex biopolymer sequence analysis. LOGOS consists of two
interacting submodels: HMDM, a local alignment model capturing biological prior knowledge and
positional dependence within the motif local structure; and HMM, a global motif distribution model
modeling frequencies and dependencies of motif occurrences. Model parameters can be fit using
training motifs within an empirical Bayesian framework. A variational EM algorithm is developed
for de novo motif detection. LOGOS improves over existing models that ignore biological priors
and dependencies in motif structures and motif occurrences, and demonstrates superior performance
on both semi-realistic test data and cis-regulatory sequences from yeast data with regard to
sensitivity, specificity, flexibility and extensibility.
A Block Coding Method That Leads to Significantly
Lower Entropy Values for the Proteins of Haemophilus influenzae and its Coding Sections
G. Sampath
Department of Computer Science, The College of New Jersey, Ewing, NJ
Abstract
A simple statistical block code in combination with the LZW-based compression utilities
gzip and compress has been found to increase by a significant amount the level of compression possible
for the proteins encoded in Haemophilus influenzae (hi),
the first fully sequenced genome. The method yields an entropy of 3.665 bits per symbol
(bps), which is 0.657 bps below the maximum of 4.322 bps. This is an improvement of
0.452 bps over the best known to date of 4.118 bps using the lza-CTW algorithm of
Matsumoto, Sadakane, and Imai, the next best being 4.143 bps using Nevill-Manning and
Witten's cp algorithm. Using an efficient inverse map from the 20 amino acids to the 61
triplets that code for them, the genome too is found to be compressible, although the gain
is not as high. Calculated estimates based on this latter method yield an entropy of 1.757
bps for the coding portions of the genome, with a possibly lower actual entropy. Both of
these results, which flow from the sequential use of statistics-based encoding techniques
followed by substitution-based text compression algorithms (gzip, compress), hint at the
existence of hitherto unexplored redundancies at both the global and the local level in hi
and the proteins coded by it. Further work may help determine whether such decreases in
entropy are possible with other proteins and genomes or hi is a rarity (perhaps even
unique) in this respect.
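The 4.322 bps ceiling quoted above is just log2(20), the zeroth-order entropy of a uniform 20-letter alphabet. A quick way to see where the headroom lies (without reproducing the paper's block coding scheme) is to compute the empirical per-symbol entropy of a sequence:

```python
from collections import Counter
from math import log2

def entropy_bits_per_symbol(seq):
    """Zeroth-order empirical entropy in bits per symbol. With all 20
    amino acids equally frequent this reaches log2(20), about 4.322
    bits; compression methods aim to get below the empirical value by
    exploiting higher-order redundancy."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())
```

Any result below this zeroth-order figure, such as the 3.665 bps reported here, necessarily reflects redundancy beyond single-residue composition.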
Sequence Alignment (Session X)
Computing Highly Specific and
Mismatch Tolerant Oligomers Efficiently
Tomoyuki Yamada and Shinichi Morishita
Computer Science, University of Tokyo, Tokyo, Japan
Abstract
The sequencing of the genomes of a variety of species and the growing databases containing
expressed sequence tags (ESTs) and complementary DNAs (cDNAs) facilitate the design of highly
specific oligomers for use as genomic markers, PCR primers, or DNA oligo microarrays. The
first step in evaluating the specificity of short oligomers of about twenty units in length
is to determine the frequencies at which the oligomers occur. However, for oligomers longer
than about fifty units this is not efficient, as they usually have a frequency of only 1. A
more suitable procedure is to consider the mismatch tolerance of an oligomer, that is, the
minimum number of mismatches that allows a given oligomer to match a sub-sequence other than
the target sequence anywhere in the genome or the EST database. However, calculating the
exact value of mismatch tolerance is computationally costly and impractical. Therefore, we
studied the problem of checking whether an oligomer meets the constraint that its mismatch
tolerance is no less than a given threshold. Here, we present an efficient dynamic programming
algorithm solution that utilizes suffix and height arrays. We demonstrated the effectiveness
of this algorithm by efficiently computing a dense list of oligo-markers applicable to the
human genome. Experimental results show that the algorithm runs faster than the well-known
algorithm of Abrahamson by orders of magnitude and is able to enumerate 63%-79% of qualified oligomers.
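The quantity being computed can be defined by a brute-force sketch. The paper replaces this quadratic scan with a suffix- and height-array dynamic program; the code below only pins down the definition of mismatch tolerance:

```python
def mismatch_tolerance(oligo, genome, target_pos):
    """Minimum Hamming distance from the oligomer to any same-length
    window of the genome other than its own target site. An oligomer
    passes the design constraint when this value is at least the
    chosen threshold."""
    l = len(oligo)
    best = l
    for i in range(len(genome) - l + 1):
        if i == target_pos:
            continue
        d = sum(a != b for a, b in zip(oligo, genome[i:i + l]))
        best = min(best, d)
    return best
```

A tolerance of 0 means the oligomer occurs exactly somewhere else and is useless as a specific marker; larger values mean any cross-hybridizing site must carry that many mismatches.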
ANTICLUSTAL: Multiple Sequence Alignment
by Antipole Clustering and Linear Approximate 1-Median Selection
C. Di Pietro, A. Ferro, T. Maugeri, G. Pigola, A. Pulvirenti Dipartimento di Matematica e
Informatica, Università di Catania
D. Shasha Courant Institute of Mathematical Sciences, New York University, NY
V. Di Pietro, G. Emmanuele, E. Modica, M. Purrello, M. Ragusa, M. Scalia, S. Travali, V. Zimmitti
Dipartimento di Scienze Biomediche, Università di Catania
Abstract
In this paper we present a new Multiple Sequence Alignment (MSA) algorithm called AntiClustAl.
The method makes use of the commonly used idea of aligning homologous sequences belonging to
classes generated by some clustering algorithm, and then continues the alignment process in
a bottom-up way along a suitable tree structure. The final result is then read at the root
of the tree. Multiple sequence alignment in each cluster makes use of the progressive alignment
with the 1-median (center) of the cluster. The 1-median of the set S of sequences is the element
of S which minimizes the average distance from any other sequence in S. Its exact computation
requires quadratic time. The basic idea of our proposed algorithm is to make use of a simple
and natural algorithmic technique based on randomized tournaments which has been successfully
applied to large size search problems in general metric spaces. In particular, a clustering
algorithm called the Antipole tree is used. Compared with Clustal W, a widely used tool for
MSA, our algorithm shows better running times with fully comparable alignment quality. A
successful biological application showing high amino acid conservation during the evolution
of Xenopus laevis ASOD2 is also cited.
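The exact 1-median and its tournament approximation can be sketched as follows. Hamming distance on equal-length toy strings stands in for the real alignment-based sequence distance, and the tournament shown is a generic version of the randomized technique named in the abstract, not the paper's exact procedure:

```python
import random

def hamming(a, b):
    """Toy metric for equal-length strings; the real tool uses an
    alignment-based sequence distance."""
    return sum(x != y for x, y in zip(a, b))

def one_median(sequences, distance=hamming):
    """Exact 1-median: the element minimizing the total (equivalently,
    average) distance to all others. Quadratic in the number of
    sequences."""
    return min(sequences, key=lambda s: sum(distance(s, t) for t in sequences))

def tournament_median(sequences, distance=hamming, group=3, seed=0):
    """Randomized-tournament approximation: local winners of small
    random groups advance round by round until one remains, giving
    roughly linear work overall."""
    rng = random.Random(seed)
    pool = list(sequences)
    while len(pool) > 1:
        rng.shuffle(pool)
        pool = [one_median(pool[i:i + group], distance)
                for i in range(0, len(pool), group)]
    return pool[0]
```

Each tournament round does only O(group²) distance evaluations per group, which is how the quadratic cost of the exact 1-median is avoided on large inputs.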
Efficient Constrained Multiple
Sequence Alignment with Performance Guarantee
Francis Y.L. Chin, N.L. Ho, and T.W. Lam
Department of Computer Science and Information Systems, University of Hong Kong, Hong Kong
Abstract
The Constrained Multiple Sequence Alignment problem is to align a set of sequences subject
to a given constrained sequence, which arises from some knowledge of the structure of the
sequences. This paper presents new algorithms for this problem, which are more efficient in
terms of time and space (memory) than the previous algorithms [14], and with a worst-case guarantee
on the quality of the alignment. Saving the space requirement by a quadratic factor is particularly
significant as the previous O(n^4)-space algorithm has limited application due to its huge memory
requirement. Experiments on real data sets confirm that our new algorithms show improvements in both
alignment quality and resource requirements.
Comparative Genomics and Phylogenetic Analysis (Session XI)
Efficient
Reconstruction of Phylogenetic Networks (of SNPs) with Constrained
Recombination
Dan Gusfield and Satish Eddhu
Department of Computer Science, University of California at Davis,
Davis, CA
Charles Langley
Department of Evolution and Ecology, UC Davis, CA
Abstract
A phylogenetic network is a generalization of a phylogenetic tree,
incorporating more complex molecular phenomena, such as recombination,
than is incorporated into a pure phylogenetic tree. Genomic sequences
often do not fit a pure tree model, and a phylogenetic network is
required to explain the evolution of the sequences. Deducing that
history is important for the study of molecular evolution, but has
other applications such as deducing the location of genes associated
with a disease or a valuable genetic trait. Understanding the history
of recombinations is particularly important in these applications.
For greatest biological fidelity, the number of recombinations in
the network should be small, and/or the network should have particular
biologically significant structural properties.
In a seminal paper, Wang et al. [10] showed that the problem of finding a phylogenetic
network that derives an input set of sequences and minimizes the number of
recombination events is NP-hard. However, they gave a polynomial time algorithm that
was intended to determine whether the sequences could be derived on a phylogenetic
network, where the recombination cycles are node disjoint, a structural feature that is
realistic in particular evolutionary contexts. We call such a network a "galled-tree".
Unfortunately, the algorithm in [10] is only a sufficient, but not a necessary, test for the
existence of a galled-tree.
Here, we develop combinatorial constraints that galled-trees must obey, to obtain
a faster algorithm that is guaranteed to be both a necessary and sufficient test for the
existence of a galled-tree for the data. The algorithm determines whether there is a galled-tree for
the data, characterizes the set of all galled-trees for the sequences, and gives
a count of them. The combinatorial constraints we develop apply (for the most part)
to node-disjoint cycles in any phylogenetic network (not just galled-trees), and can be
used for example to prove that a given site cannot be on a node-disjoint cycle in any
phylogenetic network.
Perhaps more important than the specific results about galled-trees, we introduce
an approach that can be used to study recombination in phylogenetic networks that go
beyond galled-trees.
Prokaryote Phylogeny without Sequence Alignment:
From Avoidance Signature to Composition Distance
Bailin Hao T-Life Research Center, Fudan University, Shanghai 200433, China
Ji Qi and Bin Wang Institute of Theoretical Physics, Academia Sinica, Beijing, China
Abstract
A new and essentially simple method to infer prokaryotic phylogenetic relations from their
complete genome data without using sequence alignment is proposed. It is based on the appearance
frequency of oligopeptides of a fixed length (up to K=6) in their proteomes. This is a method without
fine adjustment and choice of genes. It can incorporate the effect of lateral gene transfer to some
extent and leads to results comparable with the bacteriologists' systematics as reflected in the
latest 2001 edition of Bergey's Manual of Systematic Bacteriology [1, 2]. A key point in our
approach is subtraction of a random background by using a Markovian model of order K-1 from the
composition vectors to highlight the shaping role of natural selection.
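The background subtraction described here can be sketched directly from the abstract's recipe: the expected frequency of each K-string under an order (K-1) Markov model is estimated from the shorter strings, and the stored component is the relative deviation from it. A toy cosine-based distance is included; this is a sketch of the composition-vector idea, not the authors' full pipeline:

```python
from collections import Counter

def composition_vector(seq, k):
    """Composition vector of K-strings (k >= 3) with the (K-1)-order
    Markov background subtracted: the expected frequency of
    w = a1..aK is f(a1..a_{K-1}) * f(a2..aK) / f(a2..a_{K-1}), and the
    stored component is (observed - expected) / expected."""
    def freqs(length):
        counts = Counter(seq[i:i + length] for i in range(len(seq) - length + 1))
        total = sum(counts.values())
        return {w: n / total for w, n in counts.items()}
    fk, fkm1, fkm2 = freqs(k), freqs(k - 1), freqs(k - 2)
    vec = {}
    for w, obs in fk.items():
        expected = fkm1[w[:-1]] * fkm1[w[1:]] / fkm2[w[1:-1]]
        vec[w] = (obs - expected) / expected
    return vec

def cv_distance(v1, v2):
    """1 minus the cosine correlation of two composition vectors."""
    dot = sum(x * v2.get(w, 0.0) for w, x in v1.items())
    n1 = sum(x * x for x in v1.values()) ** 0.5
    n2 = sum(x * x for x in v2.values()) ** 0.5
    return 1.0 - dot / (n1 * n2) if n1 and n2 else 1.0
```

Dividing out the Markov expectation is what removes the "random background", so that the surviving components reflect selection rather than mere compositional bias.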
Microarray Data Analysis (Sessions I & II)
Clustering Binary Fingerprint Vectors with Missing Values for DNA
Array Data Analysis
Andres Figueroa Department of Computer Science, UC Riverside, Riverside, CA
James Borneman Department of Plant Pathology, UC Riverside, Riverside, CA
Tao Jiang Department of Computer Science, UC Riverside, Riverside, CA
Abstract
Oligonucleotide fingerprinting is a powerful DNA array based method to characterize
cDNA and ribosomal RNA gene (rDNA) libraries and has many applications including gene expression
profiling and DNA clone classification. We are especially interested in the latter application.
A key step in the method is the cluster analysis of fingerprint data obtained from DNA array
hybridization experiments. Most of the existing approaches to clustering use (normalized) real
intensity values and thus do not treat positive and negative hybridization signals equally
(positive signals are much more emphasized). In this paper, we consider a discrete approach. Fingerprint
data are first normalized and binarized using control DNA clones. Because there may exist unresolved
(or missing) values in this binarization process, we formulate the clustering of (binary) oligonucleotide
fingerprints as a combinatorial optimization problem that attempts to identify clusters and resolve the
missing values in the fingerprints simultaneously. We study the computational complexity of this clustering
problem and a natural parameterized version, and present an efficient greedy algorithm based on MINIMUM
CLIQUE PARTITION on graphs. The algorithm takes advantage of some unique properties of the graphs
considered here, which allow us to efficiently find the maximum cliques as well as some special maximal
cliques. Our experimental results on simulated and real data demonstrate that the algorithm runs faster
and performs better than some popular hierarchical and graph-based clustering methods. The results on
real data from DNA clone classification also suggest that this discrete approach is more accurate than
clustering methods based on real intensity values, in terms of separating clones that have different
characteristics with respect to the given oligonucleotide probes.
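The core discrete notion, that two binary fingerprints with missing values can share a cluster only if they agree wherever both are resolved, can be sketched with a toy greedy pass. This is a deliberately simplified illustration; the paper's algorithm is a more careful greedy based on MINIMUM CLIQUE PARTITION:

```python
def compatible(f1, f2):
    """Two fingerprints over {'0', '1', 'N'} (N = unresolved) are
    compatible if they agree at every position where both are
    resolved."""
    return all(a == b or 'N' in (a, b) for a, b in zip(f1, f2))

def greedy_cluster(fingerprints):
    """Toy greedy clustering: place each fingerprint into the first
    cluster all of whose members it is compatible with, otherwise
    start a new cluster. Cluster membership implicitly resolves the
    missing values."""
    clusters = []
    for f in fingerprints:
        for cl in clusters:
            if all(compatible(f, g) for g in cl):
                cl.append(f)
                break
        else:
            clusters.append([f])
    return clusters
```

Clustering and missing-value resolution happen together: once "10N0" joins a cluster containing "1010", its N is effectively resolved to 1.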
Clustering Time-varying Gene
Expression Profiles Using Scale-space Signals
Tanveer Syeda-Mahmood
IBM Almaden Research Center, San Jose, CA
Abstract
The functional state of an organism is determined largely by the pattern of expression of its genes.
The analysis of gene expression data from gene chips has primarily revolved around clustering and
classification of the data using machine learning techniques based on the intensity of expression
alone with the time-varying pattern mostly ignored. In this paper, we present a pattern recognition-based
approach to capturing similarity by finding salient changes in the time-varying expression patterns of
genes. Such changes can give clues about important events, such as gene regulation by cell-cycle phases,
or even signal the onset of a disease. Specifically, we observe that dissimilarity between time series
is revealed by the sharp twists and bends produced in a higher-dimensional curve formed from the
constituent signals. Scale-space analysis is used to detect the sharp twists and turns and their relative
strength with respect to the component signals is estimated to form a shape similarity measure between
time profiles. A clustering algorithm is presented to cluster gene profiles using the scale-space distance
as a similarity metric. Multi-dimensional curves formed from time series within clusters are used as
cluster prototypes or indexes to the gene expression database, and are used to retrieve the functionally
similar genes to a query gene profile. Extensive comparison of clustering using scale-space distance in
comparison to traditional Euclidean distance is presented on the yeast genome database.
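The scale-space idea, smoothing a profile at a chosen scale and looking for sharp changes, can be sketched in one dimension. This is a crude stand-in for the paper's multi-dimensional curve analysis: a truncated Gaussian smoother plus a thresholded discrete second derivative, with the scale and threshold parameters invented for illustration:

```python
from math import exp

def gaussian_smooth(signal, sigma):
    """Smooth a time series with a truncated, renormalized Gaussian
    kernel; larger sigma = coarser scale."""
    radius = max(1, int(3 * sigma))
    kernel = [exp(-(i * i) / (2 * sigma * sigma))
              for i in range(-radius, radius + 1)]
    out = []
    for i in range(len(signal)):
        num = den = 0.0
        for j, w in zip(range(i - radius, i + radius + 1), kernel):
            if 0 <= j < len(signal):
                num += w * signal[j]
                den += w
        out.append(num / den)
    return out

def sharp_changes(signal, sigma=1.0, threshold=0.5):
    """Indices where the smoothed signal's discrete second derivative
    is large in magnitude: a 1D stand-in for detecting the 'sharp
    twists and bends' of the abstract."""
    s = gaussian_smooth(signal, sigma)
    return [i for i in range(1, len(s) - 1)
            if abs(s[i - 1] - 2 * s[i] + s[i + 1]) > threshold]
```

Smoothing first means only changes that survive at the chosen scale count as salient, which is what distinguishes a regulatory-event-like transition from measurement noise.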
Fast
and Sensitive Probe Selection for DNA Chips Using Jumps in Matching
Statistics
Sven Rahmann Department of Computational Molecular Biology, Max-Planck-
Institute for Molecular Genetics, and Department of Mathematics and Computer Science,
Freie Universität, Berlin, Germany
Abstract
The design of large-scale DNA microarrays is a challenging problem. So far, probe selection
algorithms must trade the ability to cope with large-scale problems for a loss of accuracy
in the estimation of probe quality. We present an approach based on jumps in matching
statistics that combines the best of both worlds.
This article consists of two parts. The first part is theoretical. We introduce the notion
of jumps in matching statistics between two strings and derive their properties. We estimate
the frequency of jumps for random strings in a non-uniform Bernoulli model and present a
new heuristic argument to find the center of the length distribution of the longest substring
that two random strings have in common. The results are generalized to near perfect matches
with a small number of mismatches.
In the second part, we use the concept of jumps to improve the accuracy of the longest
common factor approach for probe selection by moving from a string based to an energy based
specificity measure, while only slightly more than doubling the selection time.
Fast and Accurate Probe
Selection Algorithm for Large Genomes
Wing-Kin Sung and Wah-Heng Lee
Computer Science, National University of Singapore, Singapore
Abstract
In recent years, oligo microarray (DNA chip) technology has had a significant impact
on genomic studies. Many fields such as gene discovery, drug discovery, toxicological research and
disease diagnosis, will certainly benefit from its use. A microarray is an orderly arrangement of
thousands of DNA fragments where each DNA fragment is a probe (or a fingerprint) of a gene/cDNA.
It is important that each probe uniquely associates with a particular gene/cDNA; otherwise,
the performance of the microarray is degraded. Existing algorithms usually select probes using
the criteria of homogeneity, sensitivity, and specificity. However, most of these algorithms are
too slow to be practical. To improve efficiency, some of them resort to heuristics,
which reduces accuracy. Instead, we make use of smart filtering techniques
to avoid redundant computation while maintaining accuracy. Based
on the new algorithm, optimal short (20 bases) or long (50 or 70 bases) probes
can be computed efficiently for large genomes.
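The filtering idea can be illustrated with a minimal sketch: apply a cheap homogeneity test to every candidate window before any expensive genome-wide specificity computation. The GC-content bound, the parameter values, and the function names below are illustrative assumptions, not the paper's actual criteria:

```python
def gc_content(seq):
    # Fraction of G/C bases in a sequence.
    return sum(seq.count(b) for b in "GC") / len(seq)


def candidate_probes(gene, length=20, gc_range=(0.40, 0.60)):
    # Enumerate fixed-length windows and keep only those passing a cheap
    # homogeneity filter (GC content here; a melting-temperature bound
    # would be analogous).  Only the survivors would go on to the
    # expensive specificity check against the whole genome.
    lo, hi = gc_range
    return [gene[i:i + length]
            for i in range(len(gene) - length + 1)
            if lo <= gc_content(gene[i:i + length]) <= hi]
```

Pre-filtering this way avoids redundant computation exactly where it is most costly: the specificity check runs only on windows that could plausibly become probes.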
Degenerate Primer Design via Clustering
Xintao Wei and Giri Narasimhan
Computer Science, Florida International University, University Park, Miami, FL
David N. Kuhn Biological Sciences, Florida International University, University Park, Miami,
and USDA-ARS, Subtropical Horticulture Research Station, Miami, FL
Abstract
This paper describes a new strategy for designing degenerate primers for a given
multiple alignment of amino acid sequences. Degenerate primers are useful for amplifying
homologous genes/proteins. However, when a large collection of sequences is considered,
no consensus region may exist in the multiple alignment, making it impossible to design a single pair
of primers for the collection. In such cases, manual methods are used to find smaller groups
within the input collection so that primers can be designed for individual groups. Our strategy proposes
an automatic grouping of the input sequences from the multiple alignment by using clustering techniques.
The groups are then dealt with individually. Conserved regions are detected for each group. Conserved
regions are scored using a BlockSimilarity score, a novel alignment scoring scheme that is appropriate
for this application. Degenerate primers are then designed by reverse translating the conserved amino
acid sequences to the corresponding nucleotide sequences. Our program, DePiCt, was written in BioPerl
and was tested on the Toll-Interleukin Receptor (TIR) and the non-TIR family of plant
resistance genes. Existing programs for degenerate primer design were unable to find primers for these data sets.
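The reverse-translation step can be sketched as a lookup from conserved amino acids to IUPAC degenerate codons. The partial codon table and the function names below are hypothetical simplifications; DePiCt's actual tables and its BlockSimilarity scoring are more elaborate:

```python
# Partial, hypothetical amino-acid -> degenerate-codon table (IUPAC
# nucleotide codes); only unambiguous two-codon or one-codon amino
# acids are listed here.
DEGEN_CODON = {"M": "ATG", "W": "TGG", "F": "TTY", "K": "AAR",
               "N": "AAY", "D": "GAY", "E": "GAR", "C": "TGY"}
IUPAC_SIZE = {"A": 1, "C": 1, "G": 1, "T": 1, "R": 2, "Y": 2, "N": 4}


def reverse_translate(peptide):
    # Back-translate a conserved peptide into a degenerate primer.
    return "".join(DEGEN_CODON[aa] for aa in peptide)


def degeneracy(primer):
    # Number of distinct plain-DNA sequences the primer encodes;
    # designers keep this small so the primer pool stays specific.
    d = 1
    for base in primer:
        d *= IUPAC_SIZE[base]
    return d
```

Clustering first, then reverse-translating per-group conserved regions, keeps the degeneracy of each primer manageable even when the full collection has no consensus at all.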
Data Mining (Sessions VII and XII)
Discovering Compact and Highly Discriminative Features or Feature Combinations of Drug Activities Using Support Vector Machines
Hwanjo Yu and Jiong Yang Computer Science, University of Illinois, Urbana-Champaign, IL
Wei Wang
Computer Science Department, University of North Carolina, Chapel Hill, NC
Jiawei Han Computer Science, University of Illinois, Urbana-Champaign, IL
Abstract
Nowadays, high-throughput techniques such as microarray experiments make it feasible to examine
and collect massive data at the molecular level. These data, typically mapped to a very high-dimensional
feature space, carry rich information about the functionality of certain biological
entities and can be used to infer valuable knowledge for the purposes of classification and
prediction. Typically, a small number of features or feature combinations play deterministic
roles in functional discrimination. The identification of such features or feature combinations
is of great importance. In this paper, we study the problem of discovering compact and highly
discriminative features or feature combinations from a rich feature collection. We employ
support vector machines as the classification means and aim at finding compact feature combinations.
Compared to previous feature selection methods, which identify features solely based on their
individual roles in the classification, our method is able to identify, in a systematic fashion,
minimal feature combinations that ultimately have deterministic roles. Our experimental study
identifies features that are not significant individually but are most significant in combination.
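The search for compact combinations can be sketched as a greedy forward wrapper around any classifier score, which is exactly how combinations beat individually weak features. The nearest-centroid scorer below is a cheap stand-in for the SVM used in the paper, and all names are illustrative:

```python
def centroid_accuracy(X, y, feats):
    # Nearest-centroid training-set accuracy on a feature subset;
    # a cheap stand-in for the SVM classifier used in the paper.
    if not feats:
        return 0.0
    proj = [[row[f] for f in feats] for row in X]
    centroids = {}
    for c in set(y):
        rows = [p for p, label in zip(proj, y) if label == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    predictions = [min(centroids, key=lambda c: dist(p, centroids[c]))
                   for p in proj]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


def greedy_minimal_features(X, y, score):
    # Forward search: repeatedly add the single feature whose inclusion
    # most improves the score; stop when no remaining feature helps.
    selected, best = [], score(X, y, [])
    remaining = set(range(len(X[0])))
    while remaining:
        s, f = max((score(X, y, selected + [f]), f) for f in remaining)
        if s <= best:
            break
        selected.append(f)
        best = s
        remaining.discard(f)
    return selected
```

Because each candidate is scored jointly with the features already selected, the wrapper can retain a feature that is useless alone but discriminative in combination, unlike per-feature ranking methods.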
SMASHing Regulatory Sites in DNA by Human-mouse Sequence Comparisons
Mihaela Zavolan and Terry Gaasterland
Laboratory for Computational Genomics, and Laboratory for Molecular Genetics, The Rockefeller University,
New York, NY
Nikolaus Rajewsky Department of Biology, New York University, New York, NY
Nicholas D. Socci Pathology, Seaver Foundation Program in Bioinformatics and Laboratory for Molecular
Genetics, The Rockefeller University, New York, NY
Abstract
Regulatory sequence elements provide important clues to understanding and predicting gene expression.
Although the binding sites for hundreds of transcription factors are known, there has been no
systematic attempt to incorporate this information in the annotation of the human genome. Cross-species
sequence comparisons are critical to a meaningful annotation of regulatory elements since
they generally reside in conserved non-coding regions. To take advantage of the recently completed
drafts of the mouse and human genomes for annotating transcription factor binding sites, we developed
SMASH, a computational pipeline that identifies thousands of orthologous human/mouse proteins, maps
them to genomic sequences, extracts and compares upstream regions and annotates putative regulatory
elements in conserved, non-coding, upstream regions. Our current dataset consists of approximately
2500 human/mouse gene pairs. Transcription start sites were estimated by mapping quasi-full length
cDNA sequences. SMASH uses a novel probabilistic method to identify putative conserved binding
sites that takes into account the competition between transcription factors for binding DNA. SMASH
presents the results via a genome browser web interface which displays the predicted regulatory
information together with the current annotations for the human genome. Our results are validated
by comparison to previously published experimental data. SMASH results compare favorably to other
existing computational approaches.
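The cross-species requirement can be sketched as a simple conservation filter: report a binding site only when its consensus matches in both upstream regions. This omits SMASH's probabilistic scoring and its model of competition between transcription factors; the function names and the IUPAC table subset are assumptions made here:

```python
# Subset of the IUPAC nucleotide ambiguity codes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "N": "ACGT"}


def site_hits(seq, consensus):
    # Start positions where an IUPAC consensus matches the sequence.
    k = len(consensus)
    return [i for i in range(len(seq) - k + 1)
            if all(seq[i + j] in IUPAC[c] for j, c in enumerate(consensus))]


def conserved_site(human_up, mouse_up, consensus):
    # Keep a motif only if it occurs in both species' upstream regions;
    # SMASH additionally scores sites probabilistically and resolves
    # overlapping factors competing for the same stretch of DNA.
    return bool(site_hits(human_up, consensus)) and \
           bool(site_hits(mouse_up, consensus))
```

Requiring a hit in both orthologous upstream regions is what suppresses the large number of spurious single-genome matches that any short consensus inevitably produces.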
Initial Large-scale Exploration of Protein-protein Interactions in the Human Brain
Jake Y. Chen, Andrey Y. Sivachenko, Russell Bell, Connie Kurschner,
Irene Ota and Sudhir Sahasrabudhe
Myriad Proteomics, Inc., Salt Lake City, UT
Abstract
Study of protein interaction networks is crucial to post-genomic systems
biology. Aided by high-throughput screening technologies, biologists
are rapidly accumulating protein-protein interaction data. Using a
random yeast two-hybrid (R2H) process, we have performed large-scale
yeast two-hybrid searches with approximately fifty thousand random
human brain cDNA bait fragments against a human brain cDNA prey fragment
library. From these searches, we have identified 13,656 unique protein-protein
interaction pairs involving 4,473 distinct known human loci. In this
paper, we have performed our initial characterization of the protein
interaction network in human brain tissue. We have classified and
characterized all identified interactions based on Gene Ontology (GO)
annotation of interacting loci. We have also described the "scale-free"
topological structure of the network.
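The "scale-free" characterization is typically checked by tabulating the degree distribution of the interaction graph and inspecting it on log-log axes, where a power law appears as a straight line. A minimal sketch, with illustrative names:

```python
from collections import Counter


def degree_distribution(pairs):
    # Node degrees from undirected interaction pairs, then the
    # histogram degree -> number of proteins with that degree.
    # For a scale-free network this histogram falls off roughly
    # as a power law of the degree.
    deg = Counter()
    for a, b in pairs:
        deg[a] += 1
        deg[b] += 1
    return Counter(deg.values())
```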
An SVM-based Algorithm for Identification of Photosynthesis-specific Genome Features
Gong-Xin Yu, Al Geist, George Ostrouchov, and Nagiza F. Samatova
Computational Biology Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory,
Oak Ridge, TN
Abstract
This paper presents a novel algorithm for identification and functional characterization of "key"
genome features responsible for a particular biochemical process of interest. The central idea
behind our algorithm is that individual genome features (or their combinations) are identified
as significant "key" features if the discrimination accuracy between two classes of genomes with
respect to a given biochemical process is sufficiently affected by the inclusion or exclusion
of these features. In this paper, genome features are defined by high-resolution gene functions.
The discrimination procedure utilizes the Support Vector Machine (SVM) classification technique.
Changes in classification accuracy in response to addition or deletion of genome features measure
the significance of these features. The application of this SVM-based feature identification
algorithm to the oxygenic photosynthetic process resulted in a total of 126 highly confident
candidate genome features. They cover not only dominant genome features (those that always occur in
oxygenic photosynthetic genomes but not in the other genomes) but also weak yet complementary
genome features (whose combinations form unique dominant genome features). While many of these
features are well-known components of the oxygenic photosynthetic process, others are completely
unknown, including some hypothetical proteins. These results demonstrate that our SVM-based feature
identification algorithm has the capability to discover novel genome features related to a targeted biochemical process.
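The central accuracy-delta criterion can be sketched directly: score each feature by how much classification accuracy drops when it is excluded. The `accuracy` callable below stands in for the SVM cross-validation runs of the actual algorithm, and the names are illustrative:

```python
def feature_significance(features, accuracy):
    # Score each genome feature by the drop in classification accuracy
    # when it is excluded from the feature set; a large drop marks a
    # "key" feature for the biochemical process being discriminated.
    base = accuracy(features)
    return {f: base - accuracy([g for g in features if g != f])
            for f in features}
```

Weak-but-complementary features show up with small individual deltas; detecting them requires scoring exclusions of whole combinations rather than single features, which the same delta idea extends to.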
Biomedical Research (Session V)
Stochastic Stage-structured Modeling of the Adaptive Immune System
Dennis L. Chao and Stephanie Forrest Computer Science, University of New Mexico, Albuquerque, NM
Miles P. Davenport Pathology, Faculty of Medicine, University of New South Wales, Kensington, Australia
Alan S. Perelson Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM
Abstract
We have constructed a computer model of the cytotoxic T lymphocyte (CTL) response to antigen
and the maintenance of immunological memory. Because immune responses often begin with small
numbers of cells and there is great variation among individual immune systems, we have chosen
to implement a stochastic model that captures the life cycle of T cells more faithfully than
deterministic models. Past models of the immune response have been either differential-equation based,
which do not capture stochastic effects, or agent-based, which are computationally expensive.
We use a stochastic stage-structured approach that has many of the advantages of agent-based
modeling but is more efficient. Our model can provide insights into the effect infections have
on the CTL repertoire and the response to subsequent infections.
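The stage-structured stochastic update can be sketched as per-stage random transitions: at every time step each cell independently dies, advances to the next stage, or stays. This is an illustrative simplification, not the published model; the point it shows is that tracking stage counts rather than individual agents preserves small-number stochasticity at a fraction of the cost of agent-based simulation:

```python
import random


def step(stages, p_advance, p_death, rng):
    # One time step: each cell in stage i independently dies with
    # probability p_death, advances to stage i+1 with probability
    # p_advance (if a next stage exists), and otherwise stays put.
    nxt = [0] * len(stages)
    for i, count in enumerate(stages):
        for _ in range(count):
            u = rng.random()
            if u < p_death:
                continue                      # cell dies
            if u < p_death + p_advance and i + 1 < len(stages):
                nxt[i + 1] += 1               # cell matures to next stage
            else:
                nxt[i] += 1                   # cell remains in its stage
    return nxt
```

A production version would replace the per-cell loop with binomial draws per stage, which is what makes the stage-structured approach scale to realistic cell numbers.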
A Personalized and Automated dbSNP Surveillance System
Shuo Liu, Steve Lin, Mark Woon, Teri E. Klein, and Russ B. Altman
Department of Genetics, Stanford Medical Informatics, Stanford, CA
Abstract
The development of high throughput techniques and large-scale studies in the biological
sciences has given rise to an explosive growth in both the volume and types of data
available to researchers. A surveillance system that monitors data repositories and reports
changes helps manage the data overload. We developed a dbSNP surveillance system
(URL: http://www.pharmgkb.org/do/serve?id=tools.surveillance.dbsnp) that performs
surveillance on the dbSNP database and alerts users to new information. The system is
notable because it is personalized and fully automated. Each registered user has a list of
genes to follow and receives notification of new entries concerning these genes. The
system integrates data from dbSNP, LocusLink, PharmGKB, and Genbank to position
SNPs on reference sequences and classify SNPs into categories such as synonymous and
non-synonymous SNPs. The system uses data warehousing, object model-based data
integration, object-oriented programming, and a platform-neutral data access mechanism.
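The surveillance core reduces to a per-gene set difference between successive database snapshots, sketched below with hypothetical names; the real system adds the integration across dbSNP, LocusLink, PharmGKB, and GenBank, SNP classification, and per-user notification on top of this:

```python
def new_entries(previous, current):
    # Per-gene set difference between two snapshots of SNP identifiers:
    # everything in the current snapshot that the previous one lacked.
    report = {}
    for gene, snps in current.items():
        fresh = set(snps) - set(previous.get(gene, ()))
        if fresh:
            report[gene] = sorted(fresh)
    return report
```

Filtering this report against each registered user's gene list is what makes the alerts personalized rather than a firehose of every dbSNP change.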
Molecular Structure (Session III)
Experimental Studies of the Universal Chemical Key (UCK) Algorithm on the NCI Database of Chemical Compounds
Robert Grossman, Donald Hamelberg and Pavan Kasturi
Laboratory for Advanced Computing, University of Illinois, Chicago, IL
Bing Liu Department of Computer Science, University of Illinois, Chicago, IL
Abstract
We have developed an algorithm called the Universal Chemical Key (UCK) algorithm
that constructs a unique key for a molecular structure. The molecular structures are
represented as undirected labeled graphs with the atoms representing the vertices of
the graph and the bonds representing the edges. The algorithm was tested on 236,917 compounds
obtained from the National Cancer Institute (NCI) database of chemical compounds. In this
paper we present the algorithm, some examples and the experimental results on the NCI database.
On the NCI database, the UCK algorithm provided distinct unique keys for chemicals with different molecular structures.
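A unique-key construction of this general kind can be sketched with iterative neighborhood refinement in the Morgan / Weisfeiler-Lehman style: each atom's label absorbs the sorted labels of its neighbors, and the sorted multiset of final labels is the key. This is an illustrative stand-in, not the published UCK algorithm, and unlike the result reported above it can collide on some highly symmetric graphs:

```python
def canonical_key(atoms, bonds, rounds=3):
    # atoms: list of vertex labels (element symbols);
    # bonds: list of (i, j) undirected edges between atom indices.
    # Refine each atom's label with the sorted labels of its neighbours,
    # then join the sorted multiset into a permutation-invariant key.
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    labels = list(atoms)
    for _ in range(rounds):
        labels = [labels[i] + "("
                  + "".join(sorted(labels[j] for j in adj[i])) + ")"
                  for i in range(len(atoms))]
    return "|".join(sorted(labels))
```

Because the key depends only on labels and adjacency, isomorphic labelings of the same molecule map to the same string, while structurally different molecules generally do not.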