An Efficient Algorithm for Unique Signature Discovery on Whole-Genome EST Databases
Hsiao Ping Lee, Tzu Fang Sheu, Ching Hua Shih and Chuan
Department of Computer Science, National Tsing-Hua University
ESTs are found to be an invaluable resource for the discovery of new genes, particularly those involved
in human disease processes. Unique oligonucleotides can be thought of as signatures that distinguish an
EST from all the others. An important application of those short signatures is to be used in PCR primer
design and microarray experiments. In this research, we propose an efficient algorithm to extract unique
signatures from the EST database. The secondary memory-based extension is also provided to handle the
dataset of whole-genome scale. The key idea underlying our method is that if we can reduce the amount of
candidates by identifying the non-signatures early, the efficiency of discovery will be improved. To
efficiently achieve that, we convert the EST dataset to a directed graph with weighted edges and construct
a frequency-based filter. The mechanisms separate the dissimilar oligonucleotides as early as possible and
limit the costly string comparasons performing only on the potentially similar ones. In the experiments on
human chromosome EST databases with a regular PC of limited resources, our approach is able to handle the
largest human chromosome 1 EST database of 230M bases, and accomplishes the discovery in 55 hours.
Furthermore, our method extracts all unique signatures from the EST dataset of 156M bases in one day.
With respect to the most recent discovery algorithm, our method reduces the costly string comparasons by
maximum 40 times over chromosome Y EST database. The improvements, either in dataset scale or discovery
efficiency, are significant.
VirusFinder: A Web-based Virus Identification System Using Viral Nucleotide Signatures
Yan Ding (1), Hong Cao (1) and Shu Wen (2)
(1) Dept. of Bioinformatics, American Type Culture Collection, 10801 University Blvd., Manassas,
Virginia 20110; (2) Dept. of Microecology, Dalian Medical University, Dalian 116027, China
A web-based information system for viral signature pattern recognition named VirusFinder has been developed
using Hemorrhagic Fever viruses (HFVs) as testing viruses. HFVs are well known causative agents for a group
of similar fatal diseases - Viral Hemorrhagic fever (VHF). Nucleotide signature-based detection has been proven
to be a rapid and accurate pathogen identification approach in natural outbreaks or acts of bioterrorism.
Species-specific nucleotide signatures of selective families of HFVs including arenaviruses, bunyaviruses,
filoviruses, and flaviviruses have been identified by comparative genomic analysis using the Clustal X program.
The unique nucleotide signatures for each individual species of HFVs were derived from the multiple sequence
alignments using the consensus sequence as reference. The species-specific signatures were then saved into
the local signature database after rigorous selection processes. Then VirusFinder integrated with intelligent
viral signature recognition algorithm was implemented using J2EE (Java 2 Enterprise Edition) technology, as
such, the species level differentiation and identification of HFVs would be automatically achieved given an
input of unknown testing HFV nucleotide sequences. Results show that VirusFinder is able to identify HFVs
at species level with 100% accuracy. VirusFinder is capable of identifying any registered viruses once their
nucleotide signatures are obtained.
Return to Poster Abstract Index
Return to Top
Finding Cancer Biomarkers from Mass Spectrometry Data by Decision Lists
Jian Liu and Ming Li
School of Computer Science, University of Waterloo
Biomarkers are cellular, molecular or genetic alternation or patterns caused by the presence of specific
diseases. Finding accurate biomarkers is key to early diagnosis of many otherwise incurable diseases. We
study the problem of finding biomarkers for mass spectrometry (SELDI-TOF) spectra from cancerous and
normal tissues. In contrast to the common practice of using vague methods, such as genetic algorithms,
or un-interpretable (as biomarker) methods, such as SVM, we looked for a method that is simple,
intuitive, interpretable, usable, and more accurate. We introduce decision-lists to this domain, where
each decision list serves as a biomarker. By employing statistical tests, a small set of proteins can
be used to discover biomarkers. As multiple biomarkers could exist for cancers, we extend a k-decision
list to a (k,l)- decision list by allowing l alternative monomials at each node (H_i), provided that all
alternative monomials at each node are equivalent in the sense that they give the same classification
for the training examples. It can be proven that (k,l)-decision lists are pac-learnable. This method can
be generalized to perform multi-classification. Our experiments on clinical cancer datasets show decision
lists give more accurate results than other methods, such as SVM and decision tree classifiers. More
interestingly, the resulting decision lists are more interpretable, for possible causal relationship between
cancer and differentially expressed proteins, and directly usable in clinical biomarker design.
Return to Poster Abstract Index
Return to Top
Incremental and Decremental Least Squares Support Vector Machine and Its Application to Drug Design
Hyunsoo Kim (1) and Haesun Park (1,2)
(1) Dept. of Computer Science and Engineering, University of Minnesota; (2) The National Science
Foundation, Arlington, VA
The kernel discriminant analysis (KDA) and least squares support vector machine (LS-SVM) have shown to exhibit
excellent classification performance in many applications. Recently, an incremental and decremental nonlinear
discriminant analysis based on kernel functions and minimum squared error formulation has been introduced. In
this paper, we propose an incremental and decremental LS- SVM based on QR decomposition updating and downdating.
It can efficiently compute the updated solution when data points are appended or removed using a linear system
instead of an underdetermined system. An efficient algorithm to compute leave-one-out cross validation using the
decremental LS-SVM is described as well. The experiment results illustrated that the proposed incremental
algorithm efficiently produces the same solutions as those obtained by LS-SVM which recomputes the solution all
over even for small changes in the data. For drug design, the results of each biochemistry laboratory test on a
new compound can be iteratively included in the training set. This procedure can further improve precision in
order to select the next best predicted organic compound. Instead of retraining entire data points, it is much
efficient to update solution by incremental LS-SVM.
Return to Poster Abstract Index
Return to Top
A Prediction Model for the Drug Efficacy of Interferon in CHC Patients Based on SNPs
Eugene Lin, Dennis Chen, Yuchi Hwang, Ashely Chang, and Z. John Gu
Vita Genomics, Inc.
In the studies of pharmacogenomics, genetic predisposition information, such as single nucleotide polymorphisms
(SNPs), can be used to understand the relationship between genetic variations (or population variations) and drug
efficacy. In this paper, a prediction model is resulted from analyzing chronic hepatitis C (CHC) patient's SNPs,
comparing to control groups, to predict the responsiveness of interferon (IFN) combination treatment. We have
developed an advanced methodology with the combination of artificial neural network (ANN) and other algorithms to
achieve a prediction with high accuracy among the patients. Filtering through thousands of SNPs of 150 genes, we
found nearly 30 SNPs relevant to responsiveness of IFN. With a statistical analysis of sensitivity (SEN),
specificity (SPE), positive prediction value (PPV), and negative prediction value (NPV), our model achieves a
higher successful rate of prediction, i.e., > 90% accuracy. This model allows patients and doctors to make
more informed decisions based on SNP genotyping data. The data was generated in the high-throughput genomics lab of
Vita Genomics, Inc.
Return to Poster Abstract Index
Return to Top
RNA Motif Search Using the Structure to String (STR2) Method
Oriel Bergig, Danny Barash, and Klara Kedem
Department of Computer Science, Ben-Gurion University
We present a novel approach for searching RNA motif structure in long genomes. Aside of the traditional
sequence-based search methods such as BLAST and FASTA, there is a growing interest in detecting specific RNA
secondary structure domains by using effective structure-based search methods such as the RNAMotif. Towards this
end, we devise a new algorithm with ideas taken from computational geometry. The method, called Structure to
String (STR2), was initially developed to detect structural motifs in the tertiary structure of proteins. It converts
an RNA secondary structure into a shape representing string of characters that capture the various structural motifs.
To transform an RNA secondary structure to a string of characters, we adopt an approach used in proteomics for
generating a collection of fragments. We identify a library of fragments for use in RNA secondary structure where each
fragment is represented by a character. A unique feature of our method is that the fragments represent the geometry of
the transitions between the secondary structure elements, such as the curve of the transition between stems and
loops. Consequently, we represent the secondary structures of the query and target sequences by their corresponding
character string representation and seek shape similarities by applying string matching algorithms. For the RNA folding
prediction we use mfold. The method is implemented efficiently using suffix trees and other economization procedures.
We show examples of its applicability on aptamer domains that are functionally important and are well predicted by
mfold before the conversion to strings.
Return to Poster Abstract Index
Return to Top
PRUNER: Algorithms for Finding Monad Patterns in DNA Sequences
Ravi VijayaSatya and Amar Mukherjee
University of Central Florida
We present new algorithms for discovering monad patterns in DNA sequences. Monad patterns are of the form (l,d)-k,
where l is the length of the pattern, d is the maximum number of mismatches allowed, and k is the minimum number
of times the pattern is repeated in the given sample. The time-complexity of some of the best known algorithms to
date is O(nt^2(4l)^d), where t is the number of input sequences, and n is the length of each input sequence. The
first algorithm that we present takes O((nt)^2(4l)^(d/2)) and space O(nt(4l)^(d/2)), and the second algorithm
takes O ((nt)^3(4l)^d/2) time using O((4l)^(d/2)) space. Our algorithms are based on the WINNOWER algorithm.
Each l-gram in the input sequence is represented by a vertex in a graph. An edge is added between two vertices
if the graph if the l-grams corresponding to the vertices mismatch in at most 2d positions. Both the algorithms
we present search all the patterns that mismatch in at most d positions with both the l-grams that are connected
by each edge. The first algorithm processes all the incident on a vertex at a time, there by reducing the number
of comparisons needed, but increasing the memory requirements, where as the second algorithm processes each
edge separately, requiring lesser memory but more computations. The algorithms were tested on generated samples,
and the second algorithm was found to be fast even for large values of l and d, as long as the d/l ratio
is small.
Return to Poster Abstract Index
Return to Top
CUBIC: A Program for Binding Sites Prediction
Victor Olman, Jizhu Lu, PhuongAn Dam, Zhengchang Su, and Ying Xu
Dept. of Biochemistry and Molecular Biology, Univ. of Georgia, and Inst. of Computational Biology,
Oak Ridge National Laboratory; Dept. of Computer Science and Engineering, Univ. of California, Riverside
The regulation of gene transcription is achieved through specific interactions between transcription factors and
their binding sites in the upstream region of the gene being regulated. Correct Identification of these binding
sites represents a key challenging problem in computational biology. Our approach to the problem is to find a
"clear" cluster in the space of all k-mers from the upstream regulatory regions of a set of genes that
potentially share similar binding sites. The cluster identification is achieved by using Minimal Spanning Tree
technique with a special distance defined among k-mers based on the profile of frequencies in positions. The
higher "conservation" of profile the better results for cluster identification. It has been shown that the widely
used "conservation" characteristics in positions is a result of a "common sense" requirement for "conservation."
We presented an iterative algorithm for finding the most "conserved" profile formed by at least M out of N given
upstream regions, and proved that the algorithm converges in a finite number of steps. We also developed a model
for evaluation of statistical significance of the results by extrapolating the idea of checking the existence of
clusters for hypothetical 1-dimensional uniform distribution into the space of k-mers. We have implemented the
algorithm as a software called CUBIC and tested it on a number of binding site finding tasks, and the results
are highly promising.
Acknowledgement: DOE office of Biological/Environmental Research, Genome to Life Project,
"Carbon Sequestration in Synechococcus sp.: From Molecular Machines to Hierarchical Modeling."
Return to Poster Abstract Index
Return to Top
BioSPRINT: Classification of Intron and Exon Sequences Using the SPRINT Algorithm
Kevin Crosby and Paula Gabbert
Furman University
This paper focuses on classifying sequences of DNA as either an intron or an exon. Insights from this
classification can reduce the time needed for laboratory work which distinguishes between introns and exons.
Using a classification tree based on the SPRINT algorithm, sequences from the Drosophila melanogaster and
the Caenorhabditis elegans genomes were used for training and testing. There is an inherent gap between the
sequence patterns of DNA and the numerical data needed for classification. This research focused on finding
a bridge for this gap by developing software which analyzes the sequences of DNA to create training sets for
classification. Certain characteristics of the sequences were quantitatively analyzed to produce features
upon which to base classification. The results of the classification algorithm were varied. A large test
sample error rate of 15% was shown for the Drosophila melanogaster, whereas the rate for Caenorhabditis
elegans was only 1.6%. The tree created for the C. elegans consisted of twelve splits of the dataset resulting
in thirteen terminal nodes which were nearly pure. This shows that the research done can be a dependable
indicator of intron or exon for sequences of DNA from the Caenorhabditis elegans genome. However, the
Drosophila melanogaster results are not quite as reliable. This tree was highly branched, and the terminal
nodes were more homogeneous than those in the tree of the C. elegans. Future work in the area of intron and
exon classification includes exploring additional techniques for extracting classification attributes from
DNA sequences to improve classification.
Return to Poster Abstract Index
Return to Top
A Hidden Markov Model for Gene Function Prediction from Sequential Expression Data
Xutao Deng and Hesham Ali
College of Information Science and Technology, Peter Kiewit Institute, University of Nebraska at Omaha
The prediction of function classes of unannotated genes or Open Reading Frames (ORFs) is important for
understanding the function role of genes and gene networks. Existing data mining tools, such as Support
Vector Machines (SVMs) and K-Nearest Neighbors (KNNs), can only achieve about 30% sensitivity. Hidden
Markov Models (HMMs) have shown great successes in modeling noisy sequential data sets in the area of
speech recognition and protein sequence profiling. Association test results showed significant Markov
dependency in time-series expression data, and therefore that HMMs would be especially appropriate for
modeling gene expressions. In this project, we developed a gene function prediction tool based on profile
HMMs. Each function class is associated with a distinct HMM whose parameters are trained using yeast
time-series gene expression data. The function annotations of the HMM training set were obtained from
Munich Information Centre for Protein Sequences (MIPS) data base. We designed several structural
variants of HMMs (single, double-split and triple split) and tested each of them on forty function
classes which include more than one hundred instances. The highest prediction specificity we achieved
was 51% by using double-split HMM with n-fold cross-validation. We also attempted to generalize HMMs
to Dynamic Bayesian Networks (DBNs) for gene function prediction using heterogeneous data sets.
Return to Poster Abstract Index
Return to Top
Using Enhancing Signals to Improve Specificity of Ab Initio Splice Site Sensors
Alexandre Tchourbanov, Hesham H. Ali, and Jitender Deogun
University of Nebraska at Omaha
Ab initio gene annotation methods is the only way to find splice sites in the absense of homologs. We
describe a new approach to improve the precision of ab initio splice site annotation in human genes.
The problem is known to be extremely challenging since the human splice signals are highly indistinct
and frequent cryptic sites confuse signal sensors. There is a strong evidence that Exonic Splicing
Enhancers (ESE) and Exonic Splicing Silencers (ESS) influence commitment to splicing at early stages.
We propose the use of a Naive Bayesian Network (BN) combined with Boltzmann machine splice sensor, to
improve the specificity of splice site prediction. The SpliceScan program is implemented to demonstrate
feasibility of sensitivity enhancement based on ESE/ESS signals interactions. SpliceScan performs
favorably compared to GeneSplicer, NNSplice and SpliceView. The designed method is of particular value for
ab initio gene annotation methods.
Return to Poster Abstract Index
Return to Top
SPVS: A Similar Parikh Vector Searching Algorithm for Protein Sequence Analysis
Xiaolu Huang and Hesham Ali
University of Nebraska at Omaha
It has been shown that in some protein sequences, when two regions share similar amino acid composition,
they also share the same structure even when the order of the amino acids is not preserved and is not
strictly contiguous. In several applications, such as in the identification of tumor marker ligand through
in vivo phage display peptides, order flexibility is preferred in the identification process. Traditional
protein sequence analysis tools, such as PSI-BLAST, are sequence order sensitive. In this study, a more
flexible protein sequence comparison algorithm, Similar Parikh Vector Searching (SPVS) algorithm is
designed to detect similar amino acid composition in a sequence while allowing flexibility in the sequence
order. Given an ordered alphabet A of finite k elements, with redundant elements permissible, a Parikh
vector of a word w on the alphabet A is the integer vector v = (n1, n2,, , ni,, nk) where ni is the number
of occurrences of the ith letter of A in w. In SPVS, a peptide sequence is broken into a group of Parikh
vectors of a predefined word size. SPVS searches for similar Parikh vectors between two sequences with
some predefined tolerance level. Then for each pair of similar Parikh vectors, the algorithm assigns an
Order Score (OS) that measures the difference in order between the two sequences. The preliminary test has
shown that SPVS is capable of detecting shuffled or reversed protein sequence regions that share same
biological structure between two protein sequences.
Return to Poster Abstract Index
Return to Top
A Liner-Time Algorithm for Locating All Tandem Repeat
Don Adjeroh and Feng Jianan
West Virginia University
Return to Poster Abstract Index
Return to Top
|