A Markov Model Based Gene Discrimination Approach in Trypanosomes
Allison Griggs and Shuba Gopal
Rochester Institute of Technology
The trypanosomes are a class of eukaryotic parasites that diverged from Saccharomyces cerevisiae
about 800 million years ago. Many possible gene structures are present within these genomes but
most appear to be non-functional and do not code for proteins. Initial analyses of these genomes
suggest that over 70% of the putative genes have no biological function [1]. To provide confirmation
for genes that are likely to be biologically relevant but which lack other sources of evidence
(sequence homology based evidence is limited because of the evolutionary distance from these organisms
to their better-studied counterparts), we developed a method that takes advantage of an unusual signal
in the regions immediately upstream of known trypanosome genes. This signal is required for mRNA maturation
prior to translation. Our method uses Markov models and linear discriminant analysis to compare these
upstream regions with coding regions and identifies coding regions most likely to be truly functional.
We have been able to identify true coding regions in Trypanosoma brucei with 93% accuracy (96% sensitivity
and 90% specificity). The related organism Leishmania major diverged from T. brucei 300 million years ago,
yet our approach identifies true coding regions in it with a similar accuracy of 92% (92% sensitivity and
93% specificity). Our approach significantly improves on existing methods. Current approaches have an error
discovery rate [2] of 0.21 [3]; the approach presented here has an error discovery rate of 0.08. Our success
in these organisms suggests such an approach may be applicable to other organisms that share aspects of
trypanosome biology.
[1] Based on public annotations of Chromosome I of T. brucei and L. major.
[2] Error discovery rate is calculated as the total number of false positives and false negatives divided
by the total number of ORFs analyzed: (FP + FN) / ORFs.
[3] Aggarwal, G., Worthey, E.A., McDonagh, P.D. and Myler, P.J. (2003).
"Importing statistical measures into Artemis enhances gene identification in the Leishmania genome project,"
BMC Bioinformatics. http://www.biomedcentral.com/1471-2105/4/23.
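As a minimal illustration of the measures quoted in this abstract, the following Python sketch computes accuracy, sensitivity, specificity and the error discovery rate of footnote [2] from a confusion matrix. The function name and the counts are hypothetical and are not taken from the study, and the total number of ORFs is assumed to equal TP + TN + FP + FN.

```python
def classification_summary(tp, tn, fp, fn):
    """Return standard measures for a binary gene / non-gene classifier.

    All four arguments are counts of ORFs. The total number of ORFs analyzed
    is assumed here to be tp + tn + fp + fn.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one ORF is required")
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        # Footnote [2]: (FP + FN) / ORFs
        "error_discovery_rate": (fp + fn) / total,
    }


# Hypothetical counts, for illustration only.
summary = classification_summary(tp=96, tn=90, fp=10, fn=4)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under the stated assumption about the total, the error discovery rate is simply one minus the accuracy.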
Predicting Gene Ontology Annotations from Sequence data using Kernel-Based Machine Learning Algorithms
J. J. Ward, J. S. Sodhi, B. F. Buxton, and D. T. Jones
University College London, Bioinformatics Group
In this early part of the post-genomic era, the inference of the functions associated with gene products is a
necessary first step in understanding the development and maintenance of living cells. We describe the development
of a machine learning method for predicting biological process as defined by the Gene Ontology (GO). The algorithm
uses features that can be generated from amino acid sequence alone, and does not require further experimental
studies such as microarrays, two-hybrid screens or systematic 'pull-down' assays. The budding yeast Saccharomyces
cerevisiae is used because of its comprehensive set of functional annotations, but the approach is sufficiently
general for application to other eukaryotic genomes. The input data include phylogenetic profiles, which represent
the distribution of orthologous proteins in the genomes of other organisms, position-specific scoring matrices, and
secondary structure and dynamic disorder predictions. These are represented using Mercer kernels constructed from
series expansions of Gram matrices, exponential kernels and diffusion kernels, which are used to represent pair-wise
relationships such as sequence or secondary structure element similarity between nodes (proteins) in a graph.
These kernels are benchmarked on the process prediction problem using a maximal margin (SVM) learning algorithm and
kernel Fisher discriminants. We also show preliminary results for an information-theoretic approach to obtaining
consensus annotations from multiple binary classifiers that are trained to recognize single nodes in directed acyclic
graph ontologies such as GO.
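To make the kernel construction more concrete, here is a minimal sketch of a diffusion kernel on a small protein similarity graph, computed with a truncated series expansion of the matrix exponential. The graph, the parameter beta, the number of series terms and all function names are assumptions chosen for illustration; the abstract does not say how the kernels were parameterized.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def graph_laplacian(adj):
    """Laplacian L = D - A of an undirected graph with adjacency matrix adj."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def diffusion_kernel(adj, beta=0.5, terms=30):
    """Approximate K = exp(-beta * L) by the series sum_k (-beta * L)^k / k!."""
    n = len(adj)
    lap = graph_laplacian(adj)
    h = [[-beta * lap[i][j] for j in range(n)] for i in range(n)]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]  # k = 0 term (identity)
    for k in range(1, terms):
        # term_k = term_{k-1} * h / k, which equals h^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, h)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return kernel


# Hypothetical graph: four proteins linked in a chain 0 - 1 - 2 - 3.
adjacency = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
K = diffusion_kernel(adjacency)
for row in K:
    print(" ".join(f"{x:.3f}" for x in row))
```

The resulting matrix is symmetric, and nodes that are closer in the graph receive larger kernel values, which is the pair-wise similarity a kernel method such as an SVM would consume.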
Deriving a Novel Codon Index by Combining Period-3 and Fractal Features of DNA Sequences
Jianbo Gao, Yinhe Cao, Yan Qi, and Wen-wen Tung
University of Florida
When a gene finding algorithm incorporates multiple useful sources of information about coding regions,
it becomes more successful. It is thus highly desirable to find new and efficient codon indices. Here we
propose a novel codon index, which we call the period-3 fractal deviation (PFD). This is obtained by
incorporating two incompatible features of DNA sequences, the period-3 feature in coding regions and the
fractal feature in both coding and non-coding regions. The former is due to the fact that in coding regions,
three nucleotide bases encode an amino acid and that the usage of bases at the three reading frames is
highly biased. The fractal feature comes from the fact that the background of a DNA sequence is fairly random.
These two features are incompatible because period-3 defines a specific scale of three nucleotide bases while
fractal means there is no specific scale. The PFD is very different for coding and non-coding sequences.
The accuracy of the PFD is evaluated by studying all 16 yeast chromosomes. It is found that when all
the coding and non-coding sequences are counted, the accuracy is over 80%. The accuracy is improved to almost
94% when only long coding and non-coding sequences are considered. In particular, we show that the fractal
deviation method automatically and correctly identifies which of the three reading frames is the one that
contains a gene. It is further shown that the PFD is complementary to other codon indices such as the codon
adaptation index and Fourier measures of period-3, and that integration of the PFD measure with those
indices can significantly improve the accuracy of gene finding algorithms.
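The period-3 feature described in this abstract can be illustrated with a Fourier measure evaluated at frequency 1/3. The sketch below is not the PFD itself, whose definition is not given here; it only computes the period-3 spectral power of a sequence. The function name and the normalization by sequence length are assumptions.

```python
import cmath


def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four base indicators.

    For each base, the 0/1 indicator sequence is Fourier transformed at
    frequency 1/3; the squared magnitudes are summed and divided by the
    sequence length (an assumed normalization).
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        coeff = sum(cmath.exp(-2j * cmath.pi * pos / 3)
                    for pos, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total / n


# Toy sequences: a perfect period-3 repeat and a period-4 repeat.
print(period3_power("ATG" * 40))
print(period3_power("ATGC" * 30))
```

A perfect period-3 repeat gives a large value, while the period-4 repeat gives a value close to zero. Real coding regions fall between these extremes.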
Using Stem-loop Characteristics as Means to Locate Structural RNA Genes
Kirt Noel and Kay C. Wiese
School of Interactive Arts and Technology, Simon Fraser University, Surrey, British Columbia, Canada
To date an effective and efficient structural RNA gene-finder has been elusive. The difficulty largely
results from a lack of sequence conservation. This work aims to evaluate a stem-loop focused approach to
identify structural RNA genes along genomic sequences. Previous attempts to develop a structural RNA
gene-finder are largely dependent on measuring the free energy of secondary structures formed by segments
or windows along a genomic sequence. The size of these segments carries no biological relevance; instead,
segment size is an abstraction related to the performance of an RNA folding algorithm. Though not strongly
supported, the reasoning suggests that domains of interest will display atypically low free energy values.
Problematic, however, is the elevated GC content of their genomic counterparts in GC-rich organisms.
Lastly, the polynomial computational complexity associated with RNA folding algorithms is not attractive.
Our simplified approach has an average computational complexity of O(n) for sequences of length n. The
search algorithm is designed to identify stem-loops which are typically found in ribosomal RNAs. Our goal
is to determine whether a stem-loop metric, or a combination of stem-loop related metrics, correlates
sufficiently with the presence of ribosomal RNA genes along genomic sequences across the entire GC content
spectrum. At this juncture, we can report a degree of success in distinguishing ribosomal
RNAs from their genomic counterparts in AT-rich organisms based on a stem-loop frequency metric alone.
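One way such a stem-loop frequency metric could be computed in linear time is sketched below. This is a hypothetical scan, not the authors' search algorithm: the fixed stem length, the loop size range, the inclusion of G-U wobble pairs and the function names are all assumptions. Because the stem length and loop range are constants, the work per position is bounded and the scan is O(n).

```python
# Watson-Crick pairs plus the G-U wobble pair (assumed pairing rules).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_stem_loops(seq, stem_len=4, min_loop=3, max_loop=8):
    """Return (start, loop_size) for each position that opens a perfect stem.

    A hit at `start` means the stem_len bases beginning there pair, in
    antiparallel order, with the stem_len bases that follow a loop of
    loop_size unpaired bases. Only the smallest loop per start is reported.
    """
    rna = seq.upper().replace("T", "U")
    n = len(rna)
    hits = []
    for start in range(n):
        for loop in range(min_loop, max_loop + 1):
            end = start + 2 * stem_len + loop - 1  # last base of the 3' arm
            if end >= n:
                break
            if all((rna[start + k], rna[end - k]) in PAIRS for k in range(stem_len)):
                hits.append((start, loop))
                break
    return hits


def stem_loop_frequency(seq, **kwargs):
    """Number of stem-loop hits per nucleotide."""
    return len(find_stem_loops(seq, **kwargs)) / len(seq) if seq else 0.0


print(find_stem_loops("GGGGAAAACCCC"))
print(stem_loop_frequency("A" * 30))
```

A sliding window over `stem_loop_frequency` would then give a per-region score that can be compared between annotated rRNA genes and the rest of the genome.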
RNA Gene Prediction via Positive Sample Only Learning (PSOL)
Richard F. Meraz, Xiaofeng He, Chris Ding, and Stephen R. Holbrook
Lawrence Berkeley National Laboratory
RNA genes lack most of the signals used for protein gene identification. The major shortcoming of previous
discriminative methods to distinguish functional RNA (fRNA) genes from other non-coding genomic sequences
is that only positive examples of fRNAs are known; there are no confirmed negatives -- only genomic sequence
that may be positive or negative. To address this problem we developed the Positive Sample Only Learning
(PSOL) method. Known fRNA gene sequences and intergenic sequences from a given genome are divided into
overlapping windows and parameterized using sequence statistics, computed free-energy of folding, and
conservation in related genomes. The dataset contains a small number of positive samples (known fRNAs) and
a large number of unlabeled samples, which include mainly true negatives (sequences not encoding fRNA) and
a small number of potentially positive examples (putative fRNAs) that we would like to identify. PSOL has
the following protocol: (1) Examples that are (a) far away in parameter space from the known positive examples
and (b) mutually far away among themselves are initially identified as true negative examples. (2) Support
Vector Machines are iteratively trained to label more true negative examples. (3) A final SVM trained using
the known positives and the resulting negative set is used to identify a given number of samples as positive.
Predicted sequence windows in the E. coli K12 genome were assembled into a few hundred putative RNA gene
predictions which were consistent with a previous whole-genome microarray analysis of transcription and other
computational assays for functional RNA genes.