SESSIONS
I. Structural Bioinformatics
II. Genomics 1
III. Transcriptomes 1
IV. Evolution and Phylogeny
V. Proteomics
VI. Transcriptomes 2
VII. Applied Bioinformatics
VIII. Data Mining and Ontology 1
X. Database and Ontology
XI. Pathways and Networks 1
XII. Pathways and Networks 2
XIII. Genomics 2
XIV. Transcriptomes 3
SESSION I - STRUCTURAL BIOINFORMATICS
Multiple RNA Structure Alignment
Zhuozhi Wang - Experimental Therapeutics, Research, University Health Network, Canada
Kaizhong Zhang - Department of Computer Science, University of Western Ontario, Canada
RNA structures can be viewed as special strings in which some characters are bonded to each other. The problem of aligning two RNA structures has been studied for some time, and several successful algorithms based on different models exist. In this paper, adopting the model introduced in [18], we propose two algorithms for aligning multiple RNA structures. We reduce the multiple RNA structure alignment problem to the problem of aligning two RNA structure alignments.
Weighting Features to Recognize 3D Patterns of Electron Density in X-ray Protein Crystallography
Kreshna Gopal [1], Tod D. Romo [2], James C. Sacchettini [2], Thomas R. Ioerger [1]
[1] Department of Computer Science, Texas A&M University
[2] Department of Biochemistry & Biophysics, Texas A&M University
Feature selection and weighting are central
problems in pattern recognition and instance-based
learning. In this work, we discuss the challenges of
constructing and weighting features to recognize 3D
patterns of electron density to determine protein
structures. We present SLIDER, a feature-weighting
algorithm that adjusts weights iteratively such that
patterns that match query instances are better ranked
than mismatching ones. Moreover, SLIDER makes
judicious choices of weight values to be considered in
each iteration, by examining specific weights at which
matching and mismatching patterns switch as nearest
neighbors to query instances. This approach reduces
the space of weight vectors to be searched. We make
the following two main observations: (1) SLIDER efficiently generates weights that contribute significantly to the retrieval of matching electron density patterns; (2) the optimum weight vector is sensitive to the distance metric, i.e., feature relevance can be, to a certain extent, sensitive to the underlying metric used to compare patterns.
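SLIDER itself is only summarized here, but the underlying idea of weighting features in instance-based retrieval is easy to illustrate. The sketch below is not the authors' algorithm; the feature vectors and weights are invented. It shows how changing feature weights in a weighted Euclidean distance changes which stored pattern ranks nearest to a query:

```python
import math

def weighted_distance(x, y, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def rank_patterns(query, patterns, weights):
    """Return pattern indices sorted by weighted distance to the query."""
    return sorted(range(len(patterns)),
                  key=lambda i: weighted_distance(query, patterns[i], weights))

# Two stored patterns: index 0 is close to the query on feature 0,
# index 1 is close on feature 1.
patterns = [[0.1, 1.0], [1.0, 0.1]]
query = [0.0, 0.0]

print(rank_patterns(query, patterns, [1.0, 0.01]))  # feature 0 emphasized: [0, 1]
print(rank_patterns(query, patterns, [0.01, 1.0]))  # feature 1 emphasized: [1, 0]
```

A weighting algorithm like SLIDER searches for the weight vector under which true matches consistently outrank mismatches.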
Reasoning About Molecular Similarity and Properties
Rahul Singh - Department of Computer Science, San Francisco State University
Ascertaining the similarity amongst molecules is a
fundamental problem in biology and drug discovery.
Since similar molecules tend to have similar biological
properties, the notion of molecular similarity plays an
important role in exploration of molecular structural
space, query-retrieval in molecular databases, and in
structure-activity modeling. This problem is related to
the issue of molecular representation. Currently,
approaches with high descriptive power like 3D
surface-based representations are available. However,
most techniques tend to focus on 2D graph-based
molecular similarity due to the complexity that
accompanies reasoning with more elaborate
representations. This paper addresses the problem of
determining similarity when molecules are described
using complex surface-based representations. It
proposes an intrinsic, spherical representation that
systematically maps points on a molecular surface to
points on a standard coordinate system (a sphere).
Molecular geometry, molecular fields, and effects due
to field super-positioning can then be captured as
distributions on the surface of the sphere. Molecular
similarity is obtained by computing the similarity of
the corresponding property distributions using a novel
formulation of histogram-intersection. This method is
robust to noise, obviates molecular pose-optimization,
can incorporate conformational variations, and
facilitates highly efficient determination of similarity.
Retrieval performance, applications in structure-activity modeling of complex biological properties,
and comparisons with existing research and
commercial methods demonstrate the validity and
effectiveness of the approach.
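The histogram-comparison step can be illustrated in miniature. The code below is the classic histogram-intersection measure, not the paper's novel formulation, and the two surface-property distributions are invented:

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized property distributions: the total
    probability mass they share, which lies in [0, 1]."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Hypothetical surface-property distributions of two molecules over 4 bins:
mol_a = [0.4, 0.3, 0.2, 0.1]
mol_b = [0.3, 0.3, 0.3, 0.1]
print(histogram_intersection(mol_a, mol_b))  # min per bin: 0.3+0.3+0.2+0.1 ~ 0.9
```

Because the distributions live on a standard spherical coordinate system, no pose optimization is needed before comparing them.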
High-throughput 3D Structural Homology Detection via NMR Resonance Assignment
Christopher James Langmead - Carnegie Mellon Dept. of Computer Science
Bruce Randall Donald - Computer Science Department, Chemistry Department, Biological Sciences Department,
Dartmouth Center for Structural Biology and Computational Chemistry, Dartmouth College
One goal of the structural genomics initiative is the identification
of new protein folds. Sequence-based structural
homology prediction methods are an important means for
prioritizing unknown proteins for structure determination.
However, an important challenge remains: two highly dissimilar
sequences can have similar folds—how can we
detect this rapidly, in the context of structural genomics?
High-throughput NMR experiments, coupled with novel algorithms
for data analysis, can address this challenge. We
report an automated procedure, called HD, for detecting
3D structural homologies from sparse, unassigned protein
NMR data. Our method identifies 3D models in a protein
structural database whose geometries best fit the unassigned
experimental NMR data. HD does not use, and is thus not limited by, sequence homology. The method
can also be used to confirm or refute structural predictions
made by other techniques such as protein threading
or homology modelling. The algorithm runs in O(pn + pn^(5/2) log(cn) + p log p) time, where p is the number of proteins in the database, n is the number of residues in the target protein, and c is the maximum edge weight in an integer-weighted bipartite graph. Our experiments on real NMR
data from 3 different proteins against a database of 4,500
representative folds demonstrate that the method identifies closely related protein folds, including sub-domains of
larger proteins, with as little as 10-30% sequence homology
between the target protein (or sub-domain) and the computed
model. In particular, we report no false-negatives or
false-positives despite significant percentages of missing experimental
data.
Abbreviations used: NMR, nuclear magnetic resonance; RDC,
residual dipolar coupling; 3D, three-dimensional; HSQC, heteronuclear
single-quantum coherence; HN, amide proton; SAR,
structure activity relation; SO(3), special orthogonal (rotation)
group in 3D.
Pair Stochastic Tree Adjoining Grammars for Aligning and Predicting Pseudoknot RNA Structures
Hiroshi Matsui, Kengo Sato, Yasubumi Sakakibara - Department of Biosciences and Informatics, Keio University
Motivation: Since the whole genome sequences of many species are now available, computational prediction of RNA secondary structures and computational identification of non-coding RNA regions by comparative genomics have become important, and they require more advanced alignment methods. Recently, an approach of structural
alignments for RNA sequences has been introduced to solve
these problems. By structural alignment, we mean a pairwise alignment of an unfolded RNA sequence to a folded RNA sequence of known secondary structure. Pair
HMMs on tree structures (PHMMTSs) proposed by Sakakibara
are efficient automata-theoretic models for structural
alignments of RNA secondary structures, but are incapable
of handling pseudoknots. On the other hand, tree adjoining
grammars (TAGs) are a subclass of context-sensitive grammars,
which is suitable for modeling pseudoknots. Our goal
is to extend PHMMTSs by incorporating TAGs to be able to
handle pseudoknots.
Results: We propose the pair stochastic tree adjoining
grammars (PSTAGs) for modeling RNA secondary structures
including pseudoknots and show strong experimental evidence that modeling pseudoknot structures significantly
improves the prediction accuracies of RNA secondary
structures. First, we extend the notion of PHMMTSs
defined on alignments of "trees" to PSTAGs defined on alignments of "TAG (derivation) trees", which represent a top-down parsing process of TAGs and are functionally equivalent
to derived trees of TAGs. Second, we modify PSTAGs so that they take as input a pair consisting of a linear sequence and a TAG tree representing a pseudoknot structure of RNA, to produce
a structural alignment. Then, we develop a polynomial-time
algorithm for obtaining an optimal structural alignment
by PSTAGs, based on a dynamic programming parser. We
have done several computational experiments for predicting pseudoknots by PSTAGs, and these experiments suggest that predictions of RNA pseudoknot structures by our method are more efficient and biologically plausible than those of other conventional methods. The binary
code for PSTAG method is freely available from our website
at http://www.dna.bio.keio.ac.jp/pstag/.
An Algorithm for Detecting Homologues of Known Structured RNAs in Genomes
Shu-Yun Le, Jacob V. Maizel, Jr. - Laboratory of Experimental and Computational Biology, NCI Center for Cancer Research,
National Cancer Institute, NIH
Kaizhong Zhang - Department of Computer Science, University of Western Ontario;
School of Computing Science, Simon Fraser University
Distinct RNA structures are frequently involved in a wide range
of functions in various biological mechanisms. The three-dimensional RNA structures solved by X-ray crystallography
and various well-established RNA phylogenetic
structures indicate that functional RNAs have characteristic
RNA structural motifs represented by specific combinations
of base pairings and conserved nucleotides in the
loop region. Discovery of well-ordered RNA structures and
their homologues in genome-wide searches will enhance
our ability to detect the RNA structural motifs and help
us to highlight their association with functional and regulatory
RNA elements. We present here a novel computer algorithm,
HomoStRscan, that takes a single RNA sequence
with its secondary structure to search for homologous RNAs
in complete genomes. This novel algorithm completely differs
from other currently used search algorithms of homologous
structures or structural motifs. For an arbitrary segment (or window) of the target sequence that has a size similar to that of the query sequence, HomoStRscan finds the most
similar structure to the input query structure and computes
the maximal similarity score (MSS) between the two structures.
The homologous RNA structures are then statistically
inferred from the MSS distribution computed in the target
genome. The method provides a flexible, robust and fine
search tool for any homologous structural RNAs.
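The window-scanning and statistical-inference idea can be sketched in miniature. The code below is not HomoStRscan: it slides a query-sized window along a toy target, scores each window with a crude sequence-identity measure rather than a structural similarity, and flags windows whose score is an outlier (here, more than three standard deviations above the mean) of the target-wide score distribution:

```python
def scan_for_homologues(target, query, score):
    """Score every query-sized window of the target, then flag windows
    whose score is an outlier of the whole-target score distribution."""
    n = len(query)
    scores = [score(target[i:i + n], query) for i in range(len(target) - n + 1)]
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return [i for i, s in enumerate(scores) if s > mean + 3 * sd]

def identity(a, b):
    """Fraction of aligned positions that match (a crude stand-in for a
    structure-aware similarity score)."""
    return sum(x == y for x, y in zip(a, b)) / len(b)

# Plant one copy of the query inside a low-complexity target:
target = "TTTT" * 20 + "ACGTACGT" + "TTTT" * 20
print(scan_for_homologues(target, "ACGTACGT", identity))
# [76, 80, 84]: the planted site at 80 plus two half-overlapping windows
```

HomoStRscan replaces the toy score with a structure-to-structure maximal similarity score (MSS) and infers significance from the MSS distribution over the genome.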
Inverse Protein Folding in 2D HP Model
Arvind Gupta - School of Computing Science, Simon Fraser University, Canada
Ján Mañuch - School of Computing Science and PIMS, Simon Fraser University, Canada
Ladislav Stacho - Department of Mathematics, Simon Fraser University, Canada
The inverse protein folding problem is that of designing
an amino acid sequence which has a particular native protein
fold. This problem arises in drug design where a particular
structure is necessary to ensure proper protein-protein
interactions. In this paper we show that in the 2D HP model
of Dill it is possible to solve this problem for a broad class of
structures. These structures can be used to closely approximate
any given structure. One of the most important properties of a good protein is its stability: the aptitude not to fold simultaneously into other structures. We show that for
a number of basic structures, our sequences have a unique
fold.
Analysis of a Systematic Search-Based Algorithm for Determining Protein Backbone Structure from a Minimum
Number of Residual Dipolar Couplings
Lincong Wang - Dartmouth Computer Science Department, Dartmouth College
Bruce Randall Donald - Dartmouth Computer Science Department, Dartmouth Chemistry Department,
Dartmouth Department of Biological Sciences, Dartmouth College
We have developed an ab initio algorithm for determining
a protein backbone structure using global orientational
restraints on internuclear vectors derived from residual
dipolar couplings (RDCs) measured in one or two different
aligning media by solution nuclear magnetic resonance
(NMR) spectroscopy [14, 15]. Specifically, the conformation
and global orientations of individual secondary
structure elements are computed, independently, by an exact
solution, systematic search-based minimization algorithm
using only 2 RDCs per residue. The systematic search is
built upon a quartic equation for computing, exactly and
in constant time, the directions of an internuclear vector
from RDCs, and linear or quadratic equations for computing
the sines and cosines of backbone dihedral (φ, ψ) angles
from two vectors in consecutive peptide planes. In contrast
to heuristic search such as simulated annealing (SA) or
Monte-Carlo (MC) used by other NMR structure determination
algorithms, our minimization algorithm can be analyzed
rigorously in terms of expected algorithmic complexity
and the coordinate precision of the protein structure as a
function of error in the input data. The algorithm has been
successfully applied to compute the backbone structures of
three proteins using real NMR data.
SESSION II - GENOMICS 1
Shannon Information in Complete Genomes
Chang-Heng Chang [1], Li-Ching Hsieh [1], Ta-Yuan Chen [1], Hong-Da Chen [1],
Liaofu Luo [3], Hoong-Chien Lee [1,2,4]
[1] Department of Physics, National Central University, Taiwan
[2] Department of Life Sciences, National Central University, Taiwan
[3] Physics Department, Inner Mongolia University, China
[4] National Center for Theoretical Sciences, Taiwan
The Shannon information in the genomes of all completely sequenced prokaryotes and eukaryotes is measured for word lengths of two to ten letters. It is found that, in a scale-dependent way, the Shannon information in complete genomes is much greater than that in matching random sequences: thousands of times greater in the case of short words. Furthermore, with the exception of the 14 chromosomes of Plasmodium falciparum, the Shannon information in all available complete genomes belongs to a universality class given by an extremely simple formula. The data are consistent with a model for genome growth composed of two main ingredients: random segmental duplications that increase the Shannon information in a scale-independent way, and random point mutations that preferentially reduce the larger-scale Shannon information.
inference drawn from the present study is that the
large-scale and coarse-grained growth of genomes
was selectively neutral, and this suggests an independent corroboration of Kimura's neutral theory of evolution.
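The quantity being measured can be made concrete with a toy computation. The definitions below are assumptions for illustration: word entropy is the Shannon entropy of the overlapping k-mer frequency distribution, and the information is its divergence from the 2k-bit maximum for the four-letter alphabet. A repetitive sequence then carries far more Shannon information (departs far more from randomness) than a matching random sequence:

```python
import math
import random
from collections import Counter

def word_entropy(seq, k):
    """Shannon entropy (bits) of the overlapping k-mer frequency distribution."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def shannon_information(seq, k):
    """Divergence from the maximum-entropy (fully random) case of 2k bits
    per k-letter word over the four-letter alphabet."""
    return 2 * k - word_entropy(seq, k)

random.seed(0)
rand = "".join(random.choice("ACGT") for _ in range(10000))
repetitive = "ACGTTGCAACGT" * 833   # a highly repetitive toy "genome"
print(shannon_information(repetitive, 6) > shannon_information(rand, 6))  # True
```

In the paper's growth model, segmental duplications make a genome repetitive at all scales, which is exactly what inflates this word-level information relative to a random sequence.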
Segmental Duplications Containing Tandem Repeated Genes Encoding Putative Deubiquitinating Enzymes
Hong Liu, Li Li, Asher Zilberstein, Chang S. Hahn - Aventis Pharmaceuticals Inc.
Both inter- and intra-chromosomal segmental duplications are known to have occurred in the human genome during evolution. Few cases of such segments
involving functional genes have been reported. While
searching for the human orthologs of murine
hematopoietic deubiquitinating enzymes (DUBs), we
identified four clusters of DUB-like genes on
chromosome 4p15 and chromosome 8p22-23 that are
over 90% identical to each other at the DNA level.
These genes are expressed in a cell type- and
activation-specific manner, with different clusters
possessing potentially distinct expression profiles.
Examining the surrounding sequences of these gene
duplication events, we have identified previously
unreported conserved sequence elements that are as
large as 35 to 74 kb encircling the gene clusters.
Traces of these elements are also found on
chromosome 12p13 and chromosome 11q13. The
coding and immediate upstream sequences for DUB-like genes, as well as the surrounding conserved elements, are present in the chimpanzee trace database, but not in the rodent genome. We hypothesize
that the segments containing these DUB clusters and
surrounding elements arose relatively recently in
evolution through inter- and intra-chromosomal
duplicative transpositions, following the divergence of
primates and rodents. A genome-wide systematic search for segmental duplications containing duplicated gene clusters has been performed.
Compressed Pattern Matching in DNA Sequences
Lei Chen, Shiyong Lu, Jeffrey Ram - Wayne State University
We propose derivative Boyer-Moore (d-BM), a new compressed pattern matching algorithm in DNA sequences.
This algorithm is based on the Boyer-Moore method, which is one of the most popular string matching algorithms.
In this approach, we compress both DNA sequences and patterns by using two bits to represent each A, T, C, G
character. Experiments indicate that this compressed pattern matching algorithm searches long DNA patterns
(length > 50) more than 10 times faster than the exact match routine of the software package Agrep, which is
known as the fastest pattern matching tool. Moreover, compression of DNA sequences by this method gives a
guaranteed space saving of 75%. In part the enhanced speed of the algorithm is due to the increased efficiency
of the Boyer-Moore method resulting from an increase in alphabet size from 4 to 256.
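The 2-bit encoding itself is straightforward. Below is a minimal sketch of packing and unpacking (not the d-BM matcher itself), which yields the guaranteed 75% space saving and the byte-level alphabet of 256 that the authors exploit:

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack a DNA string into bytes, 2 bits per base (4 bases per byte)."""
    assert len(seq) % 4 == 0
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for ch in seq[i:i + 4]:
            b = (b << 2) | CODE[ch]
        out.append(b)
    return bytes(out)

def unpack(data):
    """Inverse of pack()."""
    seq = []
    for b in data:
        for shift in (6, 4, 2, 0):
            seq.append(BASES[(b >> shift) & 3])
    return "".join(seq)

s = "ACGTTGCAACGT"
packed = pack(s)
print(len(packed), len(s))      # 3 bytes vs 12 characters: 75% space saving
assert unpack(packed) == s
```

Running Boyer-Moore over the packed bytes means each comparison handles four bases at once, and bad-character shifts are computed over 256 byte values instead of 4 letters.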
A Self-tuning Method for One-Chip SNP Identification
Michael Molla [1,2], Jude Shavlik [1], Thomas Albert [2], Todd Richmond [2], Steven Smith [2]
[1] University of Wisconsin-Madison
[2] NimbleGen Systems, Inc.
Current methods for interpreting oligonucleotide-based SNP-detection microarrays (SNP chips) are
based on statistics and require extensive parameter
tuning as well as extremely high-resolution images of
the chip being processed. We present a method, based on a simple data-classification technique called nearest-neighbors, that, on haploid organisms, produces results comparable to the published results of
the leading statistical methods and requires very little
in the way of parameter tuning. Furthermore, it can
interpret SNP chips using lower-resolution scanners of
the type more typically used in current microarray
experiments.
Along with our algorithm, we present the results of
a SNP-detection experiment where, when
independently applying this algorithm to six identical
SARS SNP chips, we correctly identify all 24 SNPs in a
particular strain of the SARS virus, with between 6 and
13 false positives across the six experiments.
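The nearest-neighbors core of such a method can be sketched in a few lines. The probe-intensity profiles below are invented for illustration; the abstract does not specify the real method's features or distance measure:

```python
def nearest_neighbor(query, examples):
    """Return the label of the training example closest to the query.
    Each example is (feature_vector, label); distance is squared Euclidean."""
    best_label, best_dist = None, float("inf")
    for features, label in examples:
        d = sum((q - f) ** 2 for q, f in zip(query, features))
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label

# Hypothetical probe-intensity profiles for a reference base vs. a SNP:
training = [([0.9, 0.1, 0.1, 0.1], "reference"),
            ([0.1, 0.9, 0.1, 0.1], "SNP")]
print(nearest_neighbor([0.8, 0.2, 0.1, 0.1], training))  # reference
```

Because classification reduces to distance comparisons against stored examples, such a scheme needs little parameter tuning and tolerates the coarser intensity values of lower-resolution scans.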
Space-conserving Optimal DNA-Protein Alignment
Pang Ko, Mahesh Narayanan, Anantharaman Kalyanaraman, Srinivas Aluru - Department of Electrical and Computer Engineering,
Iowa State University
DNA-protein alignment algorithms can be used
to discover coding sequences in a genomic sequence,
if the corresponding protein derivatives are known.
They can also be used to identify potential coding sequences of a newly sequenced genome, by using proteins from related species. Previously known algorithms either solve a simplified formulation, or sacrifice optimality to achieve practical implementation.
In this paper, we present a comprehensive formulation of the DNA-protein alignment problem, and an
algorithm to compute the optimal alignment in O(mn) time using only four tables of size (m + 1) × (n + 1), where m and n are the lengths of the DNA and protein sequences, respectively. We also developed a Protein and DNA Alignment program, PanDA, that implements the proposed solution. Experimental results indicate that our algorithm
produces high quality alignments.
SESSION III - TRANSCRIPTOMES 1
A Theoretical Analysis of Gene Selection
Sach Mukherjee, Stephen J. Roberts - Department of Engineering Science, University of Oxford, UK
A great deal of recent research has focused on the challenging
task of selecting differentially expressed genes from
microarray data ("gene selection"). Numerous gene selection
algorithms have been proposed in the literature, but it
is often unclear exactly how these algorithms respond to
conditions like small sample-sizes or differing variances.
Choosing an appropriate algorithm can therefore be difficult in many cases. In this paper we propose a theoretical
analysis of gene selection, in which the probability of successfully
selecting relevant genes, using a given gene ranking
function, is explicitly calculated in terms of population
parameters. The theory developed is applicable to any ranking
function which has a known sampling distribution, or
one which can be approximated analytically. In contrast to
empirical methods, the analysis can easily be used to examine
the behaviour of gene selection algorithms under a wide
variety of conditions, even when the number of genes involved runs into the tens of thousands. The utility of our approach
is illustrated by comparing three well-known gene
ranking functions.
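As a concrete example of a ranking function with a known sampling distribution, a Welch-style t-statistic is sketched below. This is one of many possible choices; the three functions actually compared in the paper are not named in this abstract, and the expression values are invented:

```python
import math

def t_statistic(group1, group2):
    """Welch-style t score for one gene's expression in two conditions."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Rank genes by |t|; higher means more evidence of differential expression.
genes = {"g1": ([5.1, 5.3, 5.2], [5.2, 5.1, 5.3]),   # not differential
         "g2": ([2.0, 2.2, 2.1], [4.0, 4.3, 4.1])}   # strongly differential
ranking = sorted(genes, key=lambda g: abs(t_statistic(*genes[g])), reverse=True)
print(ranking)  # ['g2', 'g1']
```

Because the t-statistic's sampling distribution is known, the paper's framework can compute the probability that a truly differential gene such as g2 outranks null genes at a given sample size and variance.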
Improved Fourier Transform Method for Unsupervised Cell-cycle Regulated Gene Prediction
Karuturi R. Krishna Murthy, Liu Jian Hua - Genome Institute of Singapore, Republic of Singapore
Motivation: Cell-cycle regulated gene prediction using
microarray time-course measurements of the mRNA expression
levels of genes has been used by several researchers.
The popularly employed approach is Fourier transform
(FT) method in conjunction with the set of known cell-cycle
regulated genes. In the absence of training data, the Fourier transform method is sensitive to noise, the additive monotonic
component arising from cell population growth and deviation
from strict sinusoidal form of expression. Known cell
cycle regulated genes may not be available for certain organisms
or using them for training may bias the prediction.
Results: In this paper we propose an Improved Fourier
Transform (IFT) method which handles several factors, such as the monotonic additive component of the cell-cycle expression and irregular or partial-cycle sampling of gene expression.
The proposed algorithm does not need any known
cell-cycle regulated genes for prediction. Apart from alleviating the need for a training set, it also removes bias towards
genes similar to the training set. We have evaluated the developed
method on two publicly available datasets: yeast
cell-cycle data and HeLa cell-cycle data. On both datasets, the proposed algorithm performed competitively with the supervised Fourier transform method. It
outperformed other unsupervised methods such as Partial
Least Squares (PLS) and Single Pulse Modeling
(SPM). This method is easy to comprehend and implement,
and runs faster.
Software: http://giscompute.gis.nus.edu.sg/cdcAnal
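The underlying Fourier-transform score can be sketched simply: project each gene's time course onto sine and cosine at the cell-cycle period and take the magnitude. This is the plain FT idea, not the IFT refinements (trend handling, irregular sampling) described above, and the time course below is synthetic:

```python
import math

def fourier_score(expression, times, period):
    """Magnitude of the Fourier component of a time course at a given period.
    Genes whose expression oscillates with the cell cycle score highly."""
    w = 2 * math.pi / period
    a = sum(x * math.cos(w * t) for x, t in zip(expression, times))
    b = sum(x * math.sin(w * t) for x, t in zip(expression, times))
    return math.hypot(a, b)

times = list(range(0, 80, 10))          # samples every 10 min over one cycle
cycling = [math.sin(2 * math.pi * t / 80) for t in times]
flat = [0.5] * len(times)
print(fourier_score(cycling, times, 80) > fourier_score(flat, times, 80))  # True
```

A monotonic growth trend in the raw data leaks power into this score, which is one reason the IFT explicitly models the additive monotonic component.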
SESSION IV - EVOLUTION AND PHYLOGENY
Algorithms for Association Study Design Using a Generalized Model of Haplotype Conservation
Russell Schwartz - Department of Biological Sciences and School of Computer Science, Carnegie Mellon University
There is considerable interest in computational
methods to assist in the use of genetic polymorphism
data for locating disease-related genes. Haplotypes,
contiguous sets of correlated variants, may provide a
means of reducing the difficulty of the data analysis
problems involved. The field to date has been dominated
by methods based on the "haplotype block"
hypothesis, which assumes discrete population-wide
boundaries between conserved genetic segments, but
there is strong reason to believe that haplotype blocks
do not fully capture true haplotype conservation patterns.
In this paper, we address the computational
challenges of using a more flexible, block-free representation
of haplotype structure called the "haplotype motif" model for downstream analysis problems.
We develop algorithms for htSNP selection
and missing data inference using this more generalized
model of sequence conservation. Application to
a dataset from the literature demonstrates the practical
value of these block-free methods.
Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Phylogenetic Trees
Usman W. Roshan - Department of Computer Science, University of Texas at Austin
Bernard M. E. Moret - Department of Computer Science, University of New Mexico
Tandy Warnow - Department of Computer Science, University of Texas at Austin; Radcliffe Institute for Advanced Study
Tiffani L. Williams - Department of Computer Science, University of New Mexico
Phylogenetic trees are commonly reconstructed based
on hard optimization problems such as maximum parsimony
(MP) and maximum likelihood (ML). Conventional
MP heuristics for producing phylogenetic trees produce
good solutions within reasonable time on small datasets (up
to a few thousand sequences), while ML heuristics are limited
to smaller datasets (up to a few hundred sequences).
However, since MP (and presumably ML) is NP-hard, such
approaches do not scale when applied to large datasets. In
this paper, we present a new technique called Recursive-Iterative-DCM3 (Rec-I-DCM3), which belongs to our family
of Disk-Covering Methods (DCMs). We tested this new
technique on ten large biological datasets ranging from
1,322 to 13,921 sequences and obtained dramatic speedups
as well as significant improvements in accuracy (better than
99.99%) in comparison to existing approaches. Thus, high-quality reconstructions can be obtained for datasets at least ten times larger than was previously possible.
MinPD: Distance-based Phylogenetic Analysis and Recombination Detection of Serially-Sampled HIV Quasispecies
Patricia Buendia, Giri Narasimhan - Bioinformatics Research Group (BioRG), School of Computer Science,
Florida International University
A new computational method to study within-host
viral evolution is explored to better understand the
evolution and pathogenesis of viruses. Traditional
phylogenetic tree methods are better suited to study
relationships between contemporaneous species,
which appear as leaves of a phylogenetic tree.
However, viral sequences are often sampled serially
from a single host. Consequently, data may be
available at the leaves as well as the internal nodes of
a phylogenetic tree. Recombination may further
complicate the analysis. Such relationships are not
easily expressed by traditional phylogenetic methods.
We propose a new algorithm, called MinPD, based on
minimum pairwise distances. Our algorithm uses
multiple distance matrices and correlation rules to
output a MinPD tree or network. We test our
algorithm using extensive simulations and apply it to a
set of HIV sequence data isolated from one patient
over a period of ten years. The proposed visualization
of the phylogenetic tree/network further enhances the
benefits of our methods.
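The minimum-pairwise-distance idea for serially sampled sequences can be sketched as follows. This toy version is not MinPD itself (which uses multiple distance matrices, correlation rules, and recombination detection); it simply links each sequence to its closest sequence from an earlier time point, so ancestors can sit at internal nodes:

```python
def min_pd_links(samples, distance):
    """For serially sampled sequences, link each sequence to its closest
    predecessor from an earlier time point.
    `samples` maps name -> (time, sequence)."""
    links = {}
    for name, (time, seq) in samples.items():
        earlier = [(other, s) for other, (t, s) in samples.items() if t < time]
        if earlier:
            links[name] = min(earlier, key=lambda e: distance(seq, e[1]))[0]
    return links

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

samples = {"t0":  (0, "ACGTACGT"),
           "t1a": (1, "ACGTACGA"),    # one change from t0
           "t2a": (2, "ACGTACAA")}    # one change from t1a, two from t0
print(min_pd_links(samples, hamming))  # {'t1a': 't0', 't2a': 't1a'}
```

Linking t2a to t1a rather than to t0 reflects the serial-sampling assumption that a later variant usually descends from a recently sampled one.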
SESSION V - PROTEOMICS
SPIDER: Software for Protein Identification from Sequence Tags with De Novo Sequencing Error
Yonghua Han, Bin Ma, Kaizhong Zhang - Department of Computer Science, University of Western Ontario, Canada
For the identification of novel proteins using MS/MS, de
novo sequencing software computes one or several possible
amino acid sequences (called sequence tags) for each
MS/MS spectrum. Those tags are then used to match, accounting for amino acid mutations, the sequences in a protein
database. If the de novo sequencing gives correct tags, the
homologs of the proteins can be identified by this approach
and software such as MS-BLAST is available for the matching.
However, de novo sequencing very often gives only partially
correct tags. The most common error is that a segment
of amino acids is replaced by another segment with approximately
the same masses. We developed a new efficient
algorithm to match sequence tags with errors to database
sequences for the purpose of protein and peptide identification. A software package, SPIDER, was developed and
made available on the Internet for free public use. This paper
describes the algorithms and features of the SPIDER software.
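The core difficulty, segments of different amino acids with nearly identical total mass, is easy to demonstrate with standard monoisotopic residue masses (the mass table and tolerance below are illustrative, not SPIDER's):

```python
# Standard monoisotopic residue masses (Da) for a few amino acids.
MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
        "Q": 128.05858, "K": 128.09496, "W": 186.07931}

def segment_mass(seg):
    return sum(MASS[a] for a in seg)

def mass_equivalent(seg1, seg2, tol=0.01):
    """True if two amino acid segments have (nearly) the same total mass --
    the kind of de novo sequencing confusion a tag matcher must tolerate."""
    return abs(segment_mass(seg1) - segment_mass(seg2)) <= tol

print(mass_equivalent("GG", "N"))   # True: 2 x 57.02146 ~ 114.04293
print(mass_equivalent("Q", "K"))    # False: 128.05858 vs 128.09496
print(mass_equivalent("Q", "GA"))   # True: 57.02146 + 71.03711 ~ 128.05858
```

A de novo tool can therefore output "GG" where the true peptide has "N"; SPIDER's matching must treat such mass-equivalent segment swaps as near-matches rather than outright errors.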
Estimating and Improving Protein Interaction Error Rates
Patrik D'haeseleer, George M. Church - Lipper Center for Computational Genetics, Harvard Medical School
High throughput protein interaction data sets have proven to be notoriously noisy. Although it is possible
to focus on interactions with higher reliability by using only those that are backed up by two or more
lines of evidence, this approach invariably throws out the majority of available data. Better use could be achieved by incorporating the probabilities associated with all available interactions into the analysis.
We present a novel method for estimating error rates associated with specific protein interaction data
sets, as well as with individual interactions given the data sets in which they appear. As a bonus, we also
get an estimate of the total number of protein interactions in yeast. Certain types of false positive results can be identified and removed, resulting in a significant improvement in the quality of the data set. For co-purification
data sets, we show how we can reach a tradeoff between the "spoke" and "matrix" representation of interactions
within co-purified groups of proteins to achieve an optimal false positive error rate.
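The spoke and matrix representations differ simply in which pairs within a co-purified group are counted as interactions; a minimal sketch with an invented group:

```python
from itertools import combinations

def spoke_pairs(bait, preys):
    """Spoke model: interactions only between the bait and each prey."""
    return [(bait, p) for p in preys]

def matrix_pairs(bait, preys):
    """Matrix model: interactions among all pairs in the purified complex."""
    members = [bait] + preys
    return list(combinations(members, 2))

group = ("A", ["B", "C", "D"])    # bait A co-purified with B, C, D
print(len(spoke_pairs(*group)))   # 3
print(len(matrix_pairs(*group)))  # 6, i.e. C(4, 2)
```

The spoke model under-reports prey-prey contacts while the matrix model over-reports indirect ones, so interpolating between the two is one way to tune the false positive rate.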
Automated Protein Classification Using Consensus Decision
Tolga Can*, Orhan Çamoglu*, Ambuj K. Singh, Yuan-Fang Wang - Department of Computer Science,
University of California, Santa Barbara
* These authors contributed equally to this work.
We propose a novel technique for automatically generating
the SCOP classification of a protein structure with high
accuracy. High accuracy is achieved by combining the decisions
of multiple methods using the consensus of a committee
(or an ensemble) classifier. Our technique is rooted in machine learning theory, which shows that by judiciously employing component classifiers, an ensemble classifier can be constructed to outperform its components. We use two
sequence- and three structure-comparison tools as component
classifiers. Given a protein structure, using the joint
hypothesis, we first determine if the protein belongs to an
existing category (family, superfamily, fold) in the SCOP
hierarchy. For the proteins that are predicted as members
of the existing categories, we compute their family-,
superfamily-, and fold-level classifications using the consensus
classifier. We show that we can significantly improve
the classification accuracy compared to the individual component
classifiers. In particular, we achieve error rates that are 3-12 times lower than the individual classifiers' error rates at the family level, 1.5-4.5 times lower at the superfamily level, and 1.1-2.4 times lower at the fold level.
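The consensus step can be sketched as a simple majority vote over component classifiers. The real committee combines five sequence- and structure-comparison tools under a joint hypothesis; the classifiers and fold labels below are stand-ins:

```python
from collections import Counter

def consensus(classifiers, query):
    """Majority vote over component classifiers; ties go to the label
    seen first among the votes."""
    votes = Counter(clf(query) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical component classifiers (stand-ins for sequence/structure tools):
clf1 = lambda q: "globin-fold"
clf2 = lambda q: "globin-fold"
clf3 = lambda q: "TIM-barrel"
print(consensus([clf1, clf2, clf3], "query-structure"))  # globin-fold
```

An ensemble outperforms its components when their errors are not strongly correlated, which is why mixing sequence-based and structure-based tools is attractive here.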
Separation of Ion Types in Tandem Mass Spectrometry Data Interpretation—A Graph-Theoretic Approach
Bo Yan #[1,2], Chongle Pan #[2,4], Victor N. Olman [1,2], Robert L Hettich [3], Ying Xu [1,2]
[1] Department of Biochemical and Molecular Biology, University of Georgia
[2] Computational Biology Institute, Oak Ridge National Laboratory
[3] Chemical Sciences Division, Oak Ridge National Laboratory
[4] Genome Science and Technology Graduate School, University of Tennessee
#Both authors contributed equally to this work
Mass spectrometry is one of the most popular
analytical techniques for identification of individual
proteins in a protein mixture, one of the basic
problems in proteomics. It identifies a protein through
identifying its unique mass spectral pattern. While the
problem is theoretically solvable, it remains a
challenging problem computationally. One of the key
challenges comes from the difficulty in distinguishing
the N- and C-terminus ions, mostly b- and y-ions
respectively. In this paper, we present a graph
algorithm for solving the problem of separating b- from
y-ions in a set of mass spectra. We represent each
spectral peak as a node and consider two types of
edges: a type-1 edge connects two peaks possibly of the
same ion types and a type-2 edge connects two peaks
possibly of different ion types, predicted based on local
information. The ion-separation problem is then
formulated and solved as a graph partition problem,
which is to partition the graph into three subgraphs,
corresponding to b-ions, y-ions, and others, so as to
maximize the total weight of type-1 edges while
minimizing the total weight of type-2 edges within each
subgraph. We have developed a dynamic programming
algorithm for rigorously solving this graph partition
problem and implemented it as a computer program
PRIME. We have tested PRIME on 18 data sets of highly
accurate FT-ICR tandem mass spectra and found that
it achieved ~90% accuracy in separating b- and y-ions.
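The graph-partition objective described above can be sketched in a few lines. This is an illustrative brute-force scorer over hypothetical edge lists and weights, not the authors' dynamic-programming PRIME implementation:

```python
from itertools import product

def partition_score(assignment, type1_edges, type2_edges):
    """Type-1 weight kept inside a subgraph counts positively;
    type-2 weight trapped inside a subgraph counts negatively."""
    score = 0.0
    for u, v, w in type1_edges:
        if assignment[u] == assignment[v]:
            score += w
    for u, v, w in type2_edges:
        if assignment[u] == assignment[v]:
            score -= w
    return score

def best_partition(n_peaks, type1_edges, type2_edges):
    """Brute force over all 3^n label assignments; the paper solves
    the same objective with dynamic programming instead."""
    best_score, best_labels = None, None
    for labels in product(("b", "y", "other"), repeat=n_peaks):
        s = partition_score(labels, type1_edges, type2_edges)
        if best_score is None or s > best_score:
            best_score, best_labels = s, labels
    return best_score, best_labels
```

On a toy spectrum of three peaks where peaks 0 and 1 look like the same ion type (type-1 edge) and peaks 0 and 2 look like different types (type-2 edge), the best partition keeps 0 and 1 together and separates 2.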
SESSION VI - TRANSCRIPTOMES 2
Minimum Entropy Clustering and Applications to Gene Expression Analysis
Haifeng Li - University of California Riverside
Keshu Zhang - Rensselaer Polytechnic Institute
Tao Jiang - University of California Riverside
Clustering is a common methodology for analyzing gene
expression data. In this paper, we present a new clustering
algorithm from an information-theoretic point of
view. First, we propose the minimum entropy (measured
on a posteriori probabilities) criterion, which is the conditional
entropy of clusters given the observations. Fano's inequality
indicates that it could be a good criterion for clustering.
We generalize the criterion by replacing Shannon's
entropy with Havrda-Charvát's structural α-entropy. Interestingly,
the minimum entropy criterion based on structural
α-entropy equals the probability of error of the nearest
neighbor method when α = 2. This provides further evidence that
the proposed criterion is good for clustering. With a nonparametric
approach for estimating a posteriori probabilities,
an efficient iterative algorithm is then established to
minimize the entropy. The experimental results show that
the clustering algorithm performs significantly better than
k-means/medians, hierarchical clustering, SOM, and EM in
terms of adjusted Rand index. Particularly, our algorithm
performs very well even when the correct number of clusters
is unknown. In addition, while most clustering algorithms
produce poor partitions in the presence of outliers, our
method can correctly reveal the structure of the data and
effectively identify outliers at the same time.
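The criterion above is built on Havrda-Charvát's structural α-entropy; a minimal sketch, assuming the standard normalization in which the measure reduces to Shannon entropy as α approaches 1:

```python
import math

def havrda_charvat_entropy(p, alpha):
    """Structural alpha-entropy of a discrete distribution p;
    reduces to Shannon entropy (in bits) as alpha -> 1."""
    if alpha == 1.0:
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)
    return (sum(pi ** alpha for pi in p) - 1.0) / (2.0 ** (1.0 - alpha) - 1.0)
```

At α = 2 a two-cluster posterior of (0.5, 0.5) gives entropy 1, the same as the Shannon value, while a degenerate posterior (1, 0) gives 0, which is the behavior a minimum-entropy criterion rewards.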
A Mixed Factors Model for Dimension Reduction and Extraction of a Group Structure in Gene Expression Data
Ryo Yoshida - The Graduate University for Advanced Studies, Japan
Tomoyuki Higuchi - Institute of Statistical Mathematics, Japan
Seiya Imoto - Human Genome Center, Institute of Medical Science, University of Tokyo, Japan
When we cluster tissue samples on the basis of genes,
the number of observations to be grouped is much smaller
than the dimension of the feature vector. In such a case, the
applicability of conventional model-based clustering is limited,
since the high dimensionality of the feature vector leads to
overfitting during the density estimation process. To overcome
this difficulty, we develop a methodological extension
of factor analysis. Our approach enables us not only to
prevent overfitting, but also to handle
the issues of clustering, data compression, and extraction
of a set of genes relevant to explaining the group structure.
The potential usefulness is demonstrated with an application
to a leukemia dataset.
Gridding and Compression of Microarray Images
Stefano Lonardi, Yu Luo - Department of Computer Science and Engineering, University of California Riverside
With the recent explosion of interest in microarray technology,
massive amounts of microarray images are currently
being produced. The storage and the transmission
of this type of data are becoming increasingly challenging.
Here we propose lossless and lossy compression algorithms
for microarray images originally digitized at 16 bpp (bits
per pixel) that achieve an average of 9.5-11.5 bpp (lossless)
and 4.6-6.7 bpp (lossy, with a PSNR of 63 dB). The
lossy compression is applied only on the background of the
image, thereby preserving the regions of interest. The methods
are based on a completely automatic gridding procedure
of the image.
SESSION VII - APPLIED BIOINFORMATICS
PoPS: A Computational Tool for Modeling and Predicting Protease Specificity
Sarah E. Boyd, Maria Garcia de la Banda - School of Computer Science & Software Engineering and Victorian
Bioinformatics Consortium, Monash University, Australia
Robert N. Pike, James C. Whisstock - Department of Biochemistry & Molecular Biology and Victorian
Bioinformatics Consortium, Monash University, Australia
George B. Rudy - Genetics & Bioinformatics Division, Walter & Eliza Hall Institute of Medical Research, Australia
Proteases play a fundamental role in the control of
intra- and extracellular processes by binding and cleaving
specific amino acid sequences. Identifying these targets
is extremely challenging. Current computational attempts
to predict cleavage sites are limited by representing
these amino acid sequences only as patterns or frequency matrices.
Here we present PoPS, a publicly accessible bioinformatics
tool (http://pops.csse.monash.edu.au/) which provides
a novel method for building computational models of
protease specificity that, while still being based on these
amino acid sequences, can be built from any experimental
data or expert knowledge available to the user. PoPS specificity models can be used to predict and rank likely cleavages
within a single substrate, and within entire proteomes.
Other factors, such as the secondary or tertiary structure
of the substrate, can be used to screen unlikely sites. Furthermore,
the tool also provides facilities to infer, compare
and test models, and to store them in a publicly accessible
database.
Selection of Patient Samples and Genes for Outcome Prediction
Huiqing Liu, Jinyan Li, Limsoon Wong - Institute for Infocomm Research, Singapore
Gene expression profiles with clinical outcome data enable
monitoring of disease progression and prediction of
patient survival at the molecular level. We present a new
computational method for outcome prediction. Our idea is
to use an informative subset of original training samples.
This subset consists of only short-term survivors who died
within a short period and long-term survivors who were still
alive after a long follow-up time. These extreme training
samples yield a clear platform to identify genes whose expression
is related to survival. To find relevant genes, we
combine two feature selection methods, an entropy measure
and the Wilcoxon rank sum test, so that a set of sharply
discriminating features is identified. The selected training samples
and genes are then integrated by a support vector machine
to build a prediction model, by which each validation
sample is assigned a survival/relapse risk score for drawing
Kaplan-Meier survival curves. We apply this method
to two data sets: diffuse large-B-cell lymphoma (DLBCL)
and primary lung adenocarcinoma. In both cases, patients
in high and low risk groups stratified by our risk scores are
clearly distinguishable. We also compare our risk scores to
some clinical factors, such as International Prognostic Index
score for DLBCL analysis and tumor stage information
for lung adenocarcinoma. Our results indicate that gene expression
profiles combined with carefully chosen learning
algorithms can predict patient survival for certain diseases.
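The Wilcoxon rank-sum filter named above can be sketched as follows; this is a generic illustration with midranks for ties, not the authors' code, and the group names are hypothetical:

```python
def wilcoxon_rank_sum(group_a, group_b):
    """Rank-sum statistic of group_a within the pooled sample,
    using midranks for tied expression values."""
    pooled = list(group_a) + list(group_b)
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(order):
        j = i
        # extend j over a block of tied values
        while j + 1 < len(order) and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        midrank = (i + j) / 2.0 + 1.0  # average 1-based rank of the block
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return sum(ranks[:len(group_a)])
```

A gene whose expression ranks are concentrated in one survivor group (very small or very large statistic relative to its null expectation) is a sharp discriminator between short- and long-term survivors.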
SESSION VIII - DATA MINING AND ONTOLOGY 1
Comparison of Two Schemes for Automatic Keyword Extraction from MEDLINE for Functional Gene Clustering
Ying Liu - College of Computing, Georgia Institute of Technology
Brian J. Ciliax - Department of Neurology, Emory University School of Medicine
Karin Borges - Department of Pharmacology, Emory University School of Medicine
Venu Dasigi - Southern Polytechnic State University
Ashwin Ram - College of Computing, Georgia Institute of Technology
Shamkant B. Navathe - College of Computing, Georgia Institute of Technology
Ray Dingledine - Department of Pharmacology, Emory University School of Medicine
One of the key challenges of microarray studies is to
derive biological insights from the unprecedented
quantities of data on gene-expression patterns.
Clustering genes by functional keyword association
can provide direct information about the nature of the
functional links among genes within the derived
clusters. However, the quality of the keyword lists
extracted from biomedical literature for each gene
significantly affects the clustering results. We
extracted keywords from MEDLINE that describe the
most prominent functions of the genes, and used the
resulting weights of the keywords as feature vectors
for gene clustering. By analyzing the resulting cluster
quality, we compared two keyword weighting
schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best
combination of background comparison set, stop list
and stemming algorithm was selected based on
precision and recall metrics. In a test set of four known
gene groups, a hierarchical algorithm correctly
assigned 25 of 26 genes to the appropriate clusters
based on keywords extracted by the TFIDF
weighting scheme, but only 23 of 26 with the z-score
method. To evaluate the effectiveness of the weighting
schemes for keyword extraction for gene clusters from
microarray profiles, 44 yeast genes that are
differentially expressed during the cell cycle were used
as a second test set. Using established measures of
cluster quality, the results produced from TFIDF-weighted
keywords had higher purity, lower entropy,
and higher mutual information than those produced
from normalized z-score weighted keywords. The
optimized algorithms should be useful for sorting
genes from microarray lists into functionally discrete
clusters.
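A minimal sketch of the TFIDF weighting scheme being compared; the paper's exact term-frequency normalization and background comparison sets may differ:

```python
import math
from collections import Counter

def tfidf_weights(gene_keywords, corpus):
    """TF-IDF weight of each keyword for one gene: term frequency in
    the gene's keyword list times log inverse document frequency
    over the whole corpus of keyword lists (assumes every term in
    gene_keywords appears in at least one corpus document)."""
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    tf = Counter(gene_keywords)
    n_docs = len(corpus)
    return {term: (count / len(gene_keywords)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()}
```

A keyword that appears in every gene's literature (e.g. a generic term) gets weight 0, while a keyword specific to one gene's documents is weighted up, which is what makes the resulting feature vectors useful for functional clustering.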
Calculation, Visualization, and Manipulation of MASTs (Maximum Agreement Subtrees)
Shiming Dong, Eileen Kraemer - Computer Science Department, The University of Georgia
Phylogenetic trees are used to represent the
evolutionary history of a set of species. Comparison of
multiple phylogenetic trees can help researchers find
the common classification of a tree group, compare
tree-construction inferences, or compute distances
between trees. We present TreeAnalyzer, a freely
available package for phylogenetic tree comparison. A
MAST (Maximum Agreement Subtree) algorithm is
implemented to compare the trees. Additional features
of this software include tree comparison, visualization,
manipulation, labeling, and printing.
Availability:
http://www.cs.uga.edu/~eileen/TreeAnalyzer
AZuRE, a Scalable System for Automated Term Disambiguation of Gene and Protein Names
Raf M. Podowski - AstraZeneca R&D Boston and Karolinska Institutet
John G. Cleary - Reel Two, Ltd. and University of Waikato
Nicholas T. Goncharoff - Reel Two, Inc.
Gregory Amoutzias - AstraZeneca
William S. Hayes - AstraZeneca R&D Boston
Researchers, hindered by a lack of standard gene and
protein-naming conventions, endure long, sometimes
fruitless, literature searches. A system is described
which is able to automatically assign gene names to
their LocusLink ID (LLID) in previously unseen
MEDLINE abstracts. The system is based on
supervised learning and builds a model for each
LLID. The training sets for all LLIDs are extracted
automatically from MEDLINE references in the
LocusLink and SwissProt databases. A validation
was done of the performance for all 20,546 human
genes with LLIDs. Of these, 7,344 produced good
quality models (F-measure > 0.7, nearly 60% of
which were > 0.9) and 13,202 did not, mainly due to
insufficient numbers of known document references.
A hand validation of MEDLINE documents for a set
of 66 genes agreed well with the system's internal
accuracy assessment. It is concluded that it is
possible to achieve high-quality gene disambiguation
using scalable automated techniques.
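The F-measure cutoff used to grade the models above (F > 0.7) is the standard weighted harmonic mean of precision and recall; for reference:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta = 1 gives
    the balanced F-measure used as a model-quality cutoff."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

Because it is a harmonic mean, a model cannot reach F > 0.7 by excelling at only one of precision or recall; both must be reasonably high.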
SESSION X - DATA BASE AND ONTOLOGY 2
Comparative Analysis of Gene Sets in the Gene Ontology Space under the Multiple Hypothesis Testing Framework
Sheng Zhong [1], Lu Tian [1], Cheng Li [1,3], Kai-Florian Storch [4], Wing H. Wong [1,2]
[1] Department of Biostatistics, Harvard University
[2] Department of Statistics, Harvard University
[3] Department of Biostatistical Sciences, Dana Farber Cancer Institute
[4] Department of Neurobiology, Harvard Medical School
The Gene Ontology (GO) resource can be used as
a powerful tool to uncover the properties shared
among, and specific to, a list of genes produced by
high-throughput functional genomics studies, such as
microarray studies. In the comparative analysis of
several gene lists, researchers may be interested in
knowing which GO terms are enriched in one list of
genes but relatively depleted in another. Statistical
tests such as Fisher's exact test or the Chi-square test can
be performed to search for such GO terms. However,
because multiple GO terms are tested simultaneously,
individual p-values from individual tests do not serve
as good indicators for picking GO terms. Furthermore,
because these multiple tests are highly correlated, the usual
multiple testing procedures that work under an
independence assumption are not applicable. In this
paper we introduce a procedure, based on False
Discovery Rate (FDR), to treat this correlated multiple
testing problem. This procedure calculates a
moderately conservative estimator of the q-value for every
GO term. We identify the GO terms whose q-values
satisfy a desired level as the significant GO terms. This
procedure has been implemented in the GoSurfer
software. GoSurfer is a Windows-based graphical data
mining tool. It is freely available at
http://www.gosurfer.org
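For illustration, here is the standard Benjamini-Hochberg q-value computation that FDR procedures of this kind build on. The paper's estimator is adapted to correlated GO tests, so this independent-test version is only a sketch of the underlying idea:

```python
def bh_qvalues(pvalues):
    """q-value of each test: the smallest FDR level at which that
    test would be declared significant, via the step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvals = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest p-value to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvals[i] = running_min
    return qvals
```

Selecting all GO terms with q-value below a desired level then controls the expected proportion of false discoveries among the reported terms at that level (under independence).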
Gene Ontology Friendly Biclustering of Expression Profiles
Jinze Liu [1], Wei Wang [1], and Jiong Yang [2]
[1] Department of Computer Science, University of North Carolina, Chapel Hill
[2] Department of Computer Science, University of Illinois, Urbana-Champaign
The soundness of clustering in the analysis of gene
expression profiles and gene function prediction is
based on the hypothesis that genes with similar expression
profiles are likely to be strongly correlated with their
functions in biological activities. Gene Ontology
(GO) has become a well-accepted standard for organizing
gene function categories. Different gene function
categories in GO can have very sophisticated relationships,
such as 'part of' and 'overlapping'. Until
now, no clustering algorithm has been able to generate gene clusters
within which the relationships naturally reflect those of gene function categories in the GO hierarchy.
This failure to reflect the relationships may
reduce the confidence of clustering in gene function
prediction. In this paper, we present a new clustering
technique, Smart Hierarchical Tendency Preserving
clustering (SHTP-clustering), based on a bicluster
model, Tendency Preserving cluster (TP-Cluster). By
directly incorporating Gene Ontology information into
the clustering process, the SHTP-clustering algorithm
yields a TP-cluster tree within which any subtree can
be well mapped to a part of the GO hierarchy. Our
experiments on yeast cell cycle data demonstrate that
this method is efficient and effective in generating
biologically relevant TP-Clusters.
SESSION XI - PATHWAYS AND NETWORKS 1
Dynamic Algorithm for Inferring Qualitative Models of Gene Regulatory Networks
Zheng Yun - BIRC, School of Comp. Eng., Nanyang Technological University, Singapore
Kwoh Chee Keong - School of Comp. Eng., Nanyang Technological University, Singapore
It is still an open problem to identify functional relations
in o(N · n^k) time for any domain [2], where N is the number
of learning instances, n is the number of genes (or variables)
in the Gene Regulatory Network (GRN) models, and
k is the indegree of the genes. To address the problem, we introduce
a novel algorithm, DFL (Discrete Function Learning),
for reconstructing qualitative models of GRNs from
gene expression data. We analyze its average-case complexity
of O(k · N · n^2) and its data requirements.
We also perform experiments on both synthetic and
Cho et al. [7] yeast cell cycle gene expression data to validate
the efficiency and prediction performance of the DFL
algorithm. The experiments of synthetic Boolean networks
show that the DFL algorithm is more efficient than current
algorithms without loss of prediction performance. The results
on yeast cell cycle gene expression data show that the
DFL algorithm can identify biologically significant models
with reasonable accuracy, sensitivity, and high precision
with respect to the literature evidence. We further introduce
a method to deal with noise in data sets.
The experimental results show that this method is
a good supplement to the DFL algorithm.
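The search target can be sketched as follows: among all k-subsets of candidate regulator columns, find the one sharing the most mutual information with the target gene. This brute-force version illustrates the o(N · n^k) baseline that DFL improves on; the data layout (rows of discretized expression values) is a hypothetical illustration:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def best_regulators(rows, target_col, k):
    """Score every k-subset of candidate regulator columns by mutual
    information with the target column and return the best subset."""
    n_cols = len(rows[0])
    target = [r[target_col] for r in rows]
    candidates = [c for c in range(n_cols) if c != target_col]

    def score(cols):
        joint = [tuple(r[c] for c in cols) for r in rows]
        return mutual_information(joint, target)

    return max(combinations(candidates, k), key=score)
```

For a target gene computed as the XOR of two regulators plus an irrelevant column, the exhaustive search correctly recovers the two true regulators, since neither alone carries any mutual information with the target.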
Mapping of Microbial Pathways through Constrained Mapping of Orthologous Genes
Victor Olman [1], Hanchuan Peng [1,2], Zhengchang Su [1,2] and Ying Xu [1,2]
[1] Computational Systems Biology Laboratory, Biochemistry and Molecular Biology Department, University of Georgia
[2] Computational Biology Institute, Oak Ridge National Laboratory
We present a novel computer algorithm
for mapping biological pathways from one
prokaryotic genome to another. The
algorithm maps genes in a known pathway to
their homologous genes (if any) in a target
genome that is most consistent with (a)
predicted orthologous gene relationship, (b)
predicted operon structures, and (c)
predicted co-regulation relationship of
operons. Mathematically, we have
formulated this problem as a constrained
minimum spanning tree problem (called a
Steiner network problem), and demonstrated
that this formulation has the desired property
through applications. We have solved this
mapping problem using a combinatorial
optimization algorithm, with guaranteed
global optimality. We have implemented this
algorithm as a computer program, called PMAP.
Our test results on pathway mapping
are highly encouraging---we have mapped a
number of pathways of H. influenzae, B.
subtilis, H. pylori, and M. tuberculosis to E.
coli using P-MAP, whose homologous
pathways in E. coli are known and hence the
mapping accuracy could be checked. We
have then mapped known E. coli pathways in
the EcoCyc database to the newly sequenced
organism Synechococcus sp WH8102, and
predicted 158 Synechococcus pathways.
Detailed analyses on the predicted pathways
indicate that P-MAP's mapping results are
consistent with our general knowledge about
(local) pathways. We believe that P-MAP will
be a useful tool for microbial genome
annotation projects and inference of
individual microbial pathways.
SESSION XII - PATHWAYS AND NETWORKS 2
Multi-knockout Genetic Network Analysis: The Rad6 Example
Alon Kaufman - Center of Neural Computation, Hebrew University, Israel
Martin Kupiec - Department of Molecular Microbiology and Biotechnology, Tel Aviv University, Israel
Eytan Ruppin - School of Computer Science and School of Medicine, Tel Aviv University, Israel
A novel and rigorous Multi-perturbation Shapley
Value Analysis (MSA) method has been recently presented
[12]. The method addresses the challenge of
defining and calculating the functional causal contributions
of elements of a biological system. This paper
presents the first study applying MSA to the analysis
of gene knockout data. The MSA identifies the importance
of genes in the Rad6 DNA repair pathway
of the yeast S. cerevisiae, quantifying their contributions
and characterizing their functional interactions.
Incorporating additional biological knowledge, a
new functional description of the Rad6 pathway is provided,
predicting the existence of additional DNA polymerase
and RFC-like complexes. The MSA is the first
method for rigorously analyzing multi-knockout experiments,
which are likely to soon become a standard and
necessary tool for analyzing complex biological systems.
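The Shapley value underlying MSA can be sketched directly from its definition, assuming a complete performance table over knockout subsets (real multi-knockout data is sparse, which is part of what makes the authors' method non-trivial):

```python
import math
from itertools import combinations

def shapley_values(genes, performance):
    """Shapley value of each gene, given `performance`: a dict mapping
    each frozenset of intact (non-knocked-out) genes to a measured
    performance score. The value is the gene's average marginal
    contribution over all orderings of the genes."""
    n = len(genes)
    values = {}
    for g in genes:
        others = [x for x in genes if x != g]
        total = 0.0
        for size in range(n):
            # probability that exactly this subset precedes g in a random order
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (performance[s | {g}] - performance[s])
        values[g] = total
    return values
```

In a toy two-gene system where performance depends only on gene A, the analysis assigns the full contribution to A and zero to B, which is the kind of functional attribution MSA extracts from knockout data.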
A Hierarchical Mixture of Markov Models for Finding Biologically Active Metabolic Paths Using Gene
Expression and Protein Classes
Hiroshi Mamitsuka - Institute for Chemical Research, Kyoto University, Japan
Yasushi Okuno - Graduate School of Pharmaceutical Sciences, Kyoto University, Japan
With the recent development of experimental high-throughput
techniques, the type and volume of accumulating
biological data have increased dramatically over the past
few years. Mining different types of data together might
lead us to new biological insights. We present a
new methodology for systematically combining three different
datasets to find biologically active metabolic
paths/patterns. The method consists of two steps. First,
it efficiently synthesizes metabolic paths from a given set of
chemical reactions that are already known and whose enzymes
are co-expressed. It then represents
the obtained metabolic paths in a more comprehensible
way through estimating parameters of a probabilistic
model by using these synthesized paths. This model
is built upon an assumption that an entire set of chemical
reactions corresponds to a Markov state transition
diagram. Furthermore, this model is a hierarchical latent
variable model, containing a set of protein classes as
a latent variable, for clustering input paths in terms of existing
knowledge of protein classes. We tested the performance
of our method using a main pathway of glycolysis,
and found that it achieved higher predictive performance
in classifying gene expression than
other unsupervised methods. We further
analyzed the estimated parameters of our probabilistic
models, and found that biologically active paths
were clustered into only two or three patterns for each expression
experiment type, and each pattern suggested
some new long-range relations in the glycolysis pathway.
SESSION XIII - GENOMICS 2
FastR: Fast Database Search Tool for Non-coding RNA
Vineet Bafna, Shaojie Zhang - Department of Computer Science and Engineering, University of California San Diego
The discovery of novel non-coding RNAs has been
among the most exciting recent developments in biology.
Yet, many more remain undiscovered. It has been
hypothesized that there is in fact an abundance of functional
non-coding RNA (ncRNA) with various catalytic
and regulatory functions. Computational methods tailored
specifically for ncRNA are being actively developed.
As the inherent signal for ncRNA is weaker than
that for protein coding genes, comparative methods offer
the most promising approach, and are the subject of
our research. We consider the following problem: given
an RNA sequence with a known secondary structure, efficiently
compute all structural homologs (scored as a
function of sequence and structural similarity) in a genomic
database. Our approach, based on structural filters
that eliminate a large portion of the database while
retaining the true homologs, allows us to search a typical
bacterial database in minutes on a standard PC, with
high sensitivity and specificity. This is two orders of magnitude
better than currently available software for the problem.
Recurrence Time Statistics: Versatile Tools for Genomic DNA Sequence Analysis
Yinhe Cao [1,3], Wen-wen Tung [2], J.B. Gao [3]
[1] 1026 Springfield Drive, Campbell, CA
[2] National Center for Atmospheric Research
[3] Department of Electrical and Computer Engineering, University of Florida
With the completion of the human genome and the genomes of a few model organisms, and the genomes
of many other organisms waiting to be sequenced, it has become increasingly important to
develop faster computational tools capable of easily identifying structures and
extracting features from DNA sequences. One of the more important structures in a DNA sequence
is repeat-related. Such structures often have to be masked before protein coding regions along
a DNA sequence are identified or redundant expressed sequence tags (ESTs) are
sequenced. Here we report a novel recurrence-time-based method for sequence analysis. The
method can conveniently study all kinds of periodicity and exhaustively find all repeat-related
features in a genomic DNA sequence. An efficient codon index is also derived from the recurrence
time statistics; it has the salient features of being largely species-independent and
of working well on very short sequences. Efficient codon indices are key elements of successful
gene-finding algorithms, and are particularly useful for determining whether a suspected EST
belongs to a coding or non-coding region. We illustrate the power of the method by studying
the genomes of E. coli, the yeast S. cerevisiae, the nematode worm C. elegans, and the human,
Homo sapiens. Computationally, our method is very efficient: it allows us to carry out analysis
of whole genomes on a PC.
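A recurrence-time tabulation of the kind described can be sketched in a few lines; this is an illustration of the statistic, not the authors' implementation:

```python
def recurrence_times(sequence, k):
    """Histogram of distances between successive occurrences of each
    k-mer; a dominant distance d signals period-d repeat structure
    (e.g. the period-3 signal inside protein coding regions)."""
    last_seen = {}
    histogram = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in last_seen:
            d = i - last_seen[kmer]
            histogram[d] = histogram.get(d, 0) + 1
        last_seen[kmer] = i
    return histogram
```

For the pure period-3 string "ATGATGATG" with k = 3, every 3-mer recurs at distance 3, so the histogram is {3: 4}; a single linear scan suffices, which is why the statistic scales to whole genomes.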
SESSION XIV - TRANSCRIPTOMES 3
Profile-based String Kernels for Remote Homology Detection and Motif Extraction
Rui Kuang [1], Eugene Ie [1,3], Ke Wang [1], Kai Wang [2], Mahira Siddiqi [2],
Yoav Freund [1,3,4], Christina Leslie [1,3,4]
[1] Department of Computer Science, Columbia University
[2] Department of Biomedical Informatics, Columbia University
[3] Center for Computational Learning Systems, Columbia University
[4] Center for Computational Biology and Bioinformatics, Columbia University
We introduce novel profile-based string kernels for use
with support vector machines (SVMs) for the problems of
protein classification and remote homology detection. These
kernels use probabilistic profiles, such as those produced by
the PSI-BLAST algorithm, to define position-dependent mutation
neighborhoods along protein sequences for inexact
matching of k-length subsequences ("k-mers") in the data.
By use of an efficient data structure, the kernels are fast to
compute once the profiles have been obtained. In fact,
the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation
time and the SVM training time. We present remote homology
detection experiments based on the SCOP database
where we show that profile-based string kernels used with
SVM classifiers strongly outperform all recently presented
supervised SVM methods. We also show how we can use the
learned SVM classifier to extract "discriminative sequence
motifs": short regions of the original profile that contribute
almost all the weight of the SVM classification score.
We show that these discriminative motifs correspond to
meaningful structural features in the protein data. The use
of PSI-BLAST profiles can be seen as a semi-supervised
learning technique, since PSI-BLAST leverages unlabeled
data from a large sequence database to build more informative
profiles. Recently presented "cluster kernels" give
general semi-supervised methods for improving SVM protein
classification performance. We show that our profile
kernel results are comparable to cluster kernels while providing
much better scalability to large datasets.
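For orientation, the exact-match spectrum kernel that profile-based kernels generalize can be written compactly; the profile version replaces exact k-mer identity with profile-defined mutation neighborhoods around each k-mer:

```python
from collections import Counter

def spectrum_kernel(seq_x, seq_y, k):
    """Inner product of exact k-mer count vectors for two sequences;
    a simple baseline string kernel for SVM protein classification."""
    counts_x = Counter(seq_x[i:i + k] for i in range(len(seq_x) - k + 1))
    counts_y = Counter(seq_y[i:i + k] for i in range(len(seq_y) - k + 1))
    return sum(c * counts_y[kmer] for kmer, c in counts_x.items())
```

Two sequences sharing many k-length subsequences get a large kernel value; the profile kernel additionally credits near-matching k-mers that fall inside the position-dependent mutation neighborhood defined by the PSI-BLAST profile.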
MISAE: A New Approach for Regulatory Motif Extraction
Zhaohui Sun, Jingyi Yang, Jitender S. Deogun - Department of Computer Science and Engineering, University
of Nebraska Lincoln
The recognition of regulatory motifs of co-regulated
genes is essential for understanding the regulatory mechanisms.
However, the automatic extraction of regulatory motifs
from a given data set of the upstream non-coding
DNA sequences of a family of co-regulated genes is difficult because regulatory motifs are often subtle and
inexact. This problem is further complicated by the corruption
of the data sets. In this paper, a new approach
called Mismatch-allowed Probabilistic Suffix Tree Motif
Extraction (MISAE) is proposed. It combines the
mismatch-allowed probabilistic suffix tree that is a probabilistic
model and local prediction for the extraction of regulatory
motifs. The proposed approach is tested on 15
co-regulated gene families and compares favorably with
other state-of-the-art approaches. Moreover, MISAE performs
well on "corrupted" data sets: it is able to extract
the motif from a "corrupted" data set in which fewer
than one fourth of the sequences contain the real motif.
Biclustering in Gene Expression Data by Tendency
Jinze Liu [1], Jiong Yang [2], and Wei Wang [1]
[1] Department of Computer Science, University of North Carolina, Chapel Hill
[2] Department of Computer Science, University of Illinois, Urbana-Champaign
The advent of DNA microarray technologies has
revolutionized the experimental study of gene expression.
Clustering is the most popular approach to analyzing
gene expression data and has indeed proven
to be successful in many applications. Our work focuses
on discovering a subset of genes which exhibit
similar expression patterns along a subset of conditions
in the gene expression matrix. Specifically, we
are looking for Order-Preserving Clusters (OP-Clusters),
in each of which a subset of genes induces a
similar linear ordering along a subset of conditions.
The pioneering work of the OPSM model[3], which
enforces the strict order shared by the genes in a cluster,
is included in our model as a special case. Our
model is more robust than OPSM because similarly
expressed conditions are allowed to form order equivalent
groups and no restriction is placed on the order
within a group. Guided by our model, we design and
implement a deterministic algorithm, namely OPC-Tree,
to discover OP-Clusters. Experimental study on
two real datasets demonstrates the effectiveness of the
algorithm in the application of tissue classification
and cell cycle identification. In addition, a large percentage
of OP-Clusters exhibit significant enrichment
of one or more function categories, which implies that
OP-Clusters indeed carry significant biological relevance.
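The strict order-shared-by-all-genes condition (the OPSM special case mentioned above) can be sketched as a simple check; OP-Clusters relax it by allowing order-equivalent groups of conditions:

```python
def same_tendency(rows, condition_idx):
    """True if every gene (row) induces the same linear ordering of
    expression values over the chosen subset of conditions - the
    strict OPSM-style requirement."""
    def ordering(row):
        values = [row[c] for c in condition_idx]
        return tuple(sorted(range(len(values)), key=lambda i: values[i]))
    reference = ordering(rows[0])
    return all(ordering(row) == reference for row in rows[1:])
```

Two genes with expression (1, 5, 3) and (2, 9, 4) share the ordering low < high over the same condition permutation and would pass this check even though their absolute values differ, which is the robustness that order-based biclustering buys over value-based similarity.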