CSB2004

POSTER ABSTRACTS:
FUNCTIONAL GENOMICS

A Markov Model Based Gene Discrimination Approach in Trypanosomes
Allison Griggs and Shuba Gopal
Rochester Institute of Technology

The trypanosomes are a class of eukaryotic parasites that diverged from Saccharomyces cerevisiae about 800 million years ago. Many possible gene structures are present within these genomes, but most appear to be non-functional and do not code for proteins. Initial analyses of these genomes suggest that over 70% of the putative genes have no biological function [1]. To provide confirmation for genes that are likely to be biologically relevant but which lack other sources of evidence (sequence homology based evidence is limited because of the evolutionary distance from these organisms to their better-studied counterparts), we developed a method that takes advantage of an unusual signal in the regions immediately upstream of known trypanosome genes. This signal is required for mRNA maturation prior to translation. Our method uses Markov models and linear discriminant analysis to compare these upstream regions with coding regions and identifies the coding regions most likely to be truly functional. We have been able to identify true coding regions in Trypanosoma brucei with 93% accuracy (96% sensitivity and 90% specificity). The related organism, Leishmania major, is 300 million years diverged from T. brucei, yet our approach is able to identify true coding regions with a similar accuracy, 92% (92% sensitivity and 93% specificity). Our approach significantly improves on existing methods. Current approaches have an error discovery rate [2] of 0.21 [3]; the approach presented here has an error discovery rate of 0.08. Our success in these organisms suggests such an approach may be applicable to other organisms that share aspects of trypanosome biology.

[1] Based on public annotations of Chromosome I of T. brucei and L. major.
[2] Error discovery rate is calculated as the total false positives and false negatives over the total ORFs analyzed: (FP + FN) / ORFs.
[3] Aggarwal, G., Worthey, E.A., McDonagh, P.D. and Myler, P.J. (2003). "Importing statistical measures into Artemis enhances gene identification in the Leishmania genome project," BMC Bioinformatics. http://www.biomedcentral.com/1471-2105/4/23.
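
As a rough illustration of the comparison described in this abstract, the sketch below trains first-order Markov chains on two sets of sequences, scores a candidate by log-odds between the two models, and computes the error rate defined in footnote [2]. The function names, the model order, the pseudocount, and the toy sequences are assumptions made for illustration; the abstract does not specify them, and the linear discriminant analysis step is not shown.

```python
import math

def train_markov(seqs, alphabet="ACGT", pseudo=1.0):
    """First-order Markov chain P(next base | current base) with pseudocounts."""
    counts = {a: {b: pseudo for b in alphabet} for a in alphabet}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }

def log_odds(seq, model_a, model_b):
    """Sum of log ratios of transition probabilities; > 0 favours model_a."""
    return sum(math.log(model_a[a][b] / model_b[a][b]) for a, b in zip(seq, seq[1:]))

def error_rate(fp, fn, total_orfs):
    """Footnote [2]: (FP + FN) / total ORFs analyzed."""
    return (fp + fn) / total_orfs

# Toy training data (hypothetical, not from the study).
upstream = train_markov(["TTTTTTCTTTTT", "CTTTCTTTTTCT"])
coding = train_markov(["ATGGCCGAAGCA", "ATGGAAGCGGCC"])

print(log_odds("TTTCTTTT", upstream, coding))   # positive: looks upstream-like
print(log_odds("ATGGCCGAA", upstream, coding))  # negative: looks coding-like
```

In a fuller pipeline, scores like these would be among the features passed to a discriminant classifier rather than thresholded directly.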

Predicting Gene Ontology Annotations from Sequence data using Kernel-Based Machine Learning Algorithms
J. J. Ward, J. S. Sodhi, B. F. Buxton, and D. T. Jones
University College London, Bioinformatics Group

In this early part of the post-genomic era, the inference of the functions associated with gene products is a necessary first step in understanding the development and maintenance of living cells. We describe the development of a machine learning method for predicting biological process as defined by the Gene Ontology (GO). The algorithm uses features that can be generated from amino acid sequence alone, and does not require further experimental studies such as microarrays, 2-hybrid screens or systematic 'pull-down' assays. The budding yeast Saccharomyces cerevisiae is used because of its comprehensive set of functional annotations, but the approach is sufficiently general for application to other Eukaryote genomes. The input data include phylogenetic profiles, which represent the distribution of orthologous proteins in the genomes of other organisms, position-specific scoring matrices, and secondary structure and dynamic disorder predictions. These are represented using Mercer kernels constructed from series expansions of Gram matrices, exponential kernels and diffusion kernels, which are used to represent pair-wise relationships such as sequence or secondary structure element similarity between nodes (proteins) in a graph. These kernels are benchmarked on the process prediction problem using a maximal margin (SVM) learning algorithm and kernel Fisher discriminants. We also show preliminary results for an information-theoretic approach to obtaining consensus annotations from multiple binary classifiers that are trained to recognize single nodes in directed acyclic graph ontologies such as GO.
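
To make the diffusion kernel mentioned above concrete, here is a minimal sketch that computes K = exp(beta * H) on a small graph by truncating the power series, where H is the negative graph Laplacian (adjacency minus degree). The value of beta, the number of series terms, and the three-node toy graph are assumptions for illustration only; the abstract does not give the authors' settings or implementation.

```python
def matmul(a, b):
    """Plain square-matrix product for small dense matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def diffusion_kernel(adj, beta=0.5, terms=30):
    """Truncated series for exp(beta * H), with H = A - D (self-loops ignored)."""
    n = len(adj)
    h = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                h[i][j] = float(adj[i][j])
                h[i][i] -= adj[i][j]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]
    for k in range(1, terms):
        # term now holds (beta * H)^k / k!
        term = [[beta * x / k for x in row] for row in matmul(term, h)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return kernel

# Toy graph: three proteins in a chain, 0 - 1 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
K = diffusion_kernel(path)
for row in K:
    print([round(x, 4) for x in row])
```

Directly linked nodes receive a larger kernel value than nodes two steps apart, which is the pair-wise similarity behaviour the abstract relies on.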

Deriving a Novel Codon Index by Combining Period-3 and Fractal Features of DNA Sequences
Jianbo Gao, Yinhe Cao, Yan Qi, and Wen-wen Tung
University of Florida

When a gene finding algorithm incorporates multiple useful sources of information about coding regions, it becomes more successful. It is thus highly desirable to find new and efficient codon indices. Here we propose a novel codon index, which we call the period-3 fractal deviation (PFD). This is obtained by incorporating two incompatible features of DNA sequences, the period-3 feature in coding regions and the fractal feature in both coding and non-coding regions. The former is due to the fact that in coding regions, three nucleotide bases encode an amino acid and that the usage of bases at the three reading frames is highly biased. The fractal feature comes from the fact that the background of a DNA sequence is fairly random. These two features are incompatible because period-3 defines a specific scale of three nucleotide bases while fractal means there are not any specific scales. The PFD is very different for coding and non-coding sequences. The accuracy of the PFD is evaluated by studying all of the 16 yeast chromosomes. It is found that when all the coding and non-coding sequences are counted, the accuracy is over 80%. The accuracy is improved to almost 94% when only long coding and non-coding sequences are considered. In particular, we show that the fractal deviation method automatically and correctly identifies which of the three reading frames is the one that contains a gene. It is further shown that the PFD is complementary to other codon indices such as codon adaptation index and Fourier measures of period-3, and that integration of the PFD measure with those indices can significantly improve the accuracy of gene finding algorithms.
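
The period-3 feature referred to above can be illustrated with a simple Fourier measure: the spectral power of the four base-indicator sequences at frequency 1/3. This is a sketch of a generic Fourier period-3 measure only, not of the PFD, whose definition is not given in the abstract; the function name, the normalization by the squared length, and the toy sequences are assumptions.

```python
import cmath

def period3_power(seq):
    """Power at frequency 1/3, summed over A, C, G, T indicator sequences."""
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # Discrete Fourier component at frequency 1/3 for this base's indicator.
        s = sum(cmath.exp(-2j * cmath.pi * k / 3)
                for k, ch in enumerate(seq) if ch == base)
        total += abs(s) ** 2
    return total / n ** 2

# Toy sequences: a strict 3-periodic repeat and a 4-periodic repeat.
print(period3_power("GCA" * 30))   # strong period-3 signal
print(period3_power("ACGT" * 24))  # essentially no period-3 signal
```

Real coding regions show a much weaker bias than the strict repeat used here, so in practice the measure is computed over windows and compared against a background level.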

Using Stem-loop Characteristics as Means to Locate Structural RNA Genes
Kirt Noel and Kay C. Wiese
School of Interactive Arts and Technology, Simon Fraser University, Surrey, British Columbia, Canada

To date, an effective and efficient structural RNA gene-finder has been elusive. The difficulty largely results from a lack of sequence conservation. This work aims to evaluate a stem-loop focused approach to identifying structural RNA genes along genomic sequences. Previous attempts to develop a structural RNA gene-finder are largely dependent on measuring the free energy of secondary structures formed by segments, or windows, along a genomic sequence. The size of these segments carries no biological relevance; instead, segment size is an abstraction related to the performance of the RNA folding algorithm. Though not strongly supported, the reasoning is that domains of interest will display atypically low free energy values. Problematic, however, is the high GC content of the genomic background in GC-rich organisms. Lastly, the polynomial computational complexity associated with RNA folding algorithms is not attractive. Our simplified approach has an average computational complexity of O(n) for sequences of length n. The search algorithm is designed to identify stem-loops of the kind typically found in ribosomal RNAs. Our goal is to determine whether a stem-loop metric, or a combination of stem-loop related metrics, correlates sufficiently with the presence of ribosomal RNA genes along genomic sequences across the entire GC content spectrum. At this juncture, we can report a degree of success in distinguishing ribosomal RNAs from their genomic counterparts in AT-rich organisms based on a stem-loop frequency metric alone.
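
One way a linear-time stem-loop scan could look is sketched below: for each start position it checks whether a fixed-length stem is followed, within a bounded loop size, by its reverse complement. Because the stem length and loop bounds are constants, the work is O(n) in the sequence length. This is not the authors' algorithm, which the abstract does not describe; the function name, the stem length, the loop bounds, and the restriction to Watson-Crick pairs (no G-U wobble) are all assumptions.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def count_stem_loops(seq, stem=4, min_loop=3, max_loop=8):
    """Count start positions that open a perfect hairpin (hypothetical metric)."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    hits = 0
    for i in range(n - 2 * stem - min_loop + 1):
        left = seq[i:i + stem]
        # Reverse complement of the 5' arm; unknown bases never match.
        target = "".join(COMPLEMENT.get(b, "?") for b in reversed(left))
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if j + stem > n:
                break
            if seq[j:j + stem] == target:
                hits += 1
                break  # count each start position at most once
    return hits

def stem_loop_frequency(seq, **kwargs):
    """Hairpin starts per nucleotide, a simple frequency-style metric."""
    return count_stem_loops(seq, **kwargs) / len(seq) if seq else 0.0

print(count_stem_loops("GGGGAAAACCCC"))  # 1
print(count_stem_loops("A" * 12))        # 0
```

A frequency of this kind could then be computed in windows along a genome and compared between annotated ribosomal RNA genes and the surrounding sequence.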

RNA Gene Prediction via Positive Sample Only Learning (PSOL)
Richard F. Meraz, Xiaofeng He, Chris Ding, and Stephen R. Holbrook
Lawrence Berkeley National Laboratory

RNA genes lack most of the signals used for protein gene identification. The major shortcoming of previous discriminative methods to distinguish functional RNA (fRNA) genes from other non-coding genomic sequences is that only positive examples of fRNAs are known; there are no confirmed negatives -- only genomic sequence that may be positive or negative. To address this problem we developed the Positive Sample Only Learning (PSOL) method. Known fRNA gene sequences and intergenic sequences from a given genome are divided into overlapping windows and parameterized using sequence statistics, computed free-energy of folding, and conservation in related genomes. The dataset contains a small number of positive samples (known fRNAs) and a large number of unlabeled samples, which include mainly true negatives (sequences not encoding fRNA) and a small number of potentially positive examples (putative fRNAs) that we would like to identify. PSOL has the following protocol: (1) Examples that are (a) far away in parameter space from the known positive examples and (b) mutually far away among themselves are initially identified as true negative examples. (2) Support Vector Machines are iteratively trained to label more true negative examples. (3) A final SVM trained using the known positives and the resulting negative set is used to identify a given number of samples as positive. Predicted sequence windows in the E. coli K12 genome were assembled into a few hundred putative RNA gene predictions which were consistent with a previous whole-genome microarray analysis of transcription and other computational assays for functional RNA genes.
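
Step (1) of the protocol above can be sketched as a greedy selection: repeatedly pick the unlabeled sample whose smallest distance to any known positive or to any already chosen negative is largest. This is a simplified stand-in under stated assumptions, not the authors' implementation; the function name, the Euclidean distance, the greedy max-min rule, and the toy points are hypothetical, and the iterative SVM training of steps (2) and (3) is not shown.

```python
import math

def initial_negatives(positives, unlabeled, k):
    """Greedy max-min pick of k unlabeled samples as initial negatives."""
    # Score = distance to the nearest known positive (criterion a).
    score = [min(math.dist(u, p) for p in positives) for u in unlabeled]
    remaining = set(range(len(unlabeled)))
    chosen = []
    for _ in range(min(k, len(unlabeled))):
        best = max(sorted(remaining), key=lambda i: score[i])
        chosen.append(best)
        remaining.discard(best)
        # Criterion b: demote samples that sit close to a chosen negative.
        for i in remaining:
            score[i] = min(score[i], math.dist(unlabeled[i], unlabeled[best]))
    return chosen

# Toy 2-D feature vectors (hypothetical).
positives = [(0.0, 0.0), (0.1, 0.0)]
unlabeled = [(0.05, 0.05), (5.0, 5.0), (5.2, 5.0), (-6.0, 6.0)]
print(initial_negatives(positives, unlabeled, 2))  # [3, 2]
```

The sample at index 0 lies next to the known positives and is never selected early, while indices 1 and 2 are near each other, so only one of them enters a two-element initial negative set.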
