CSB2004

POSTER ABSTRACTS:
EVOLUTION AND PHYLOGENETICS

Comparative Genomic Analysis of Cyclin-dependent Kinases and the Evolution of the RNAP II CTD Kinases
Zhenhua Guo and John W. Stiller
Department of Biology, East Carolina University, Greenville, NC 27858

Cyclin-dependent kinases (CDKs) are a large family of proteins that function in a variety of key regulatory pathways in eukaryotic cells, including regulation of the cell cycle and gene transcription. Among the most important and broadly studied of these roles is reversible phosphorylation of the C-terminal domain (CTD) of RNA polymerase II (RNAP II), which results in cycling between a hypo-phosphorylated form of RNAP II capable of entering the preinitiation complex and a hyper-phosphorylated form required for processive elongation and co-transcriptional pre-mRNA processing. Here we present comparative genomic and phylogenetic analyses showing that transcription-related kinases group separately from cell cycle-related kinases and may have a single common kinase ancestor that evolved from cell cycle-related CDKs. Evolution within specific kinase clades is discussed on the basis of the unrooted phylogenetic tree, and some suggestions are offered. CDK7, CDK8 and CDK9, which are involved in CTD phosphorylation, are all recovered as individual monophyletic groups. Most significantly, CDK7 and CDK8 are restricted to those organisms that possess a strongly conserved RNAP II CTD, suggesting a co-origin and tightly linked co-evolution of these kinases with the more complex mechanisms for control over gene expression conveyed by the CTD.

Embedded Computation of Maximum-likelihood Phylogeny Inference Using Platform FPGA
Terrence S. T. Mak and K. P. Lam
The Chinese University of Hong Kong

Computation for maximum-likelihood phylogeny inference is time consuming, not only because the tree search space grows exponentially, but also because the tree likelihood evaluation is computationally intensive [2]. In [1], a hardware/software (HW/SW) co-design scheme was proposed for implementing Genetic Algorithm for Maximum Likelihood (GAML) phylogeny inference. The idea is to partition GAML into a genetic algorithm (GA) and a fitness evaluation function, with the SW handling the GA and the HW evaluating the maximum likelihood function. This approach exploits the flexibility of software for tree topology searching and speeds up the tree likelihood computation. However, the communication overhead between the loosely coupled HW and SW remains a critical obstacle to higher computational speed. Our work has recently been extended to a more powerful embedded platform, in which (*) a microprocessor is immersed in a field-programmable gate array (FPGA), providing an effective environment for our HW/SW co-design implementation. The internal bus infrastructure provides a high-throughput communication gateway between the microprocessor and the FPGA [5]. Compared with the JBits interface in [1], a significant improvement in the data transmission rate between hardware and software and a higher FPGA clock frequency have been realized. In addition, the embedded platform provides greater flexibility in partitioning hardware and software tasks. These new features lead to much faster computation. The FPGA logic design for tree likelihood evaluation has also been improved to tackle larger-scale problems by adopting the idea of partial likelihood [4]. We propose a fine-grained partitioning scheme, in which the maximum likelihood computation is partitioned into a number of subtasks. Under this new scheme, the system can support a wider range of evolutionary models for phylogeny inference and provide more accurate results than the previous implementation.
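
The partial-likelihood idea mentioned above can be illustrated in software. The following is a minimal sketch of Felsenstein's pruning recursion [2] for a single site, assuming a Jukes-Cantor substitution model and a nested-tuple tree encoding; both are choices made here for illustration and are not taken from the abstract, and the function names are hypothetical. It does not model the FPGA design itself.

```python
import math

BASES = "ACGT"

def jc69_prob(same: bool, t: float) -> float:
    """Jukes-Cantor transition probability along a branch of length t
    (expected substitutions per site). Assumed model, for illustration."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def partial_likelihood(node):
    """Return the partial likelihoods [L(A), L(C), L(G), L(T)] of a subtree.

    A leaf is a one-letter string. An internal node is a tuple
    (left, left_branch_length, right, right_branch_length).
    """
    if isinstance(node, str):
        # Observed base: likelihood 1 for that state, 0 for the others.
        return [1.0 if b == node else 0.0 for b in BASES]
    left, t_left, right, t_right = node
    pl_left = partial_likelihood(left)
    pl_right = partial_likelihood(right)
    result = []
    for i in range(4):
        # Sum over the possible child states on each branch, then multiply.
        s_left = sum(jc69_prob(i == j, t_left) * pl_left[j] for j in range(4))
        s_right = sum(jc69_prob(i == j, t_right) * pl_right[j] for j in range(4))
        result.append(s_left * s_right)
    return result

def site_likelihood(root) -> float:
    """Likelihood of one site, with uniform base frequencies at the root."""
    return sum(0.25 * p for p in partial_likelihood(root))

# Hypothetical three-taxon tree observed at one site.
tree = (("A", 0.1, "A", 0.2), 0.05, "G", 0.3)
print(site_likelihood(tree))
```

Each internal node needs only the partial likelihoods of its two children, and sites are evaluated independently of one another, which is one reason this computation divides naturally into small subtasks.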

References
[1] Mak T. and K. P. Lam, High Speed GAML-based Phylogenetic Tree Reconstruction Using HW/SW Codesign, in Proc. of IEEE Computer Society Bioinformatics Conference (CSB'03), 2003
[2] Felsenstein J., Evolutionary trees from DNA sequences: A maximum likelihood approach, J. Mol. Evol., 17:368-376, 1981
[3] Lewis P., A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data, Molecular Biology and Evolution, 15(3):277-283, 1998
[4] Mak T. and K. P. Lam, FPGA-based Computation for Maximum Likelihood Phylogenetic Tree Evaluation, submitted to Field-Programmable Logic and Applications conference, 2004
[5] Virtex-II Pro™ Platform FPGA Handbook, Xilinx, 2002
(*) The Virtex-II Pro Platform FPGA is used in our study. In this embedded platform, an IBM PowerPC-405 microprocessor is immersed in the field-programmable gate array.


Gene Lengths and Alternative Transcription of Fruit Fly
Boris Budagyan
CalState University, Hayward

A significant source of the vast diversity of protein functions is alternative transcription. In this work, different statistical tests were applied to the fruit fly genome annotation (Release 3) to analyze the relationship between alternative transcription and gene size. Only for genes producing 1 to 4 transcript variants was there strong statistical evidence, at the 5% significance level (p <= 7.6e-07), that longer genes produce more transcript variants than smaller, more compact genes. This relationship is closely described by a linear regression model (R^2 = 0.99). To understand this strong correlation, the relationship between gene size and the number of exons in these genes was analyzed. Gene size was found to increase exponentially with the number of exons (R^2 = 1). This result prompts a reexamination of the statement, made in a number of publications, that the size of genes is mainly determined by the length of introns. Moreover, an exponential dependence (R^2 = 0.98) of the number of transcripts on the mean exon count was also found. These strong relationships for genes producing 1 to 4 transcript variants are important, as smaller genes make up the overwhelming majority of the gene population. It was established that the number of genes in the fruit fly genome decreases as the number of transcript variants increases, following a power-law function, which points to an underlying DNA duplication process.
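
As an illustration of how a power-law relationship of this kind can be checked, the sketch below fits y = a * x^b by ordinary least squares on log-log axes and reports R^2 on the log scale. The function name and the input values are hypothetical: the counts are synthetic and constructed to follow an exact power law, and they are not data from the study.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b on log-log axes.

    Returns (a, b, r_squared), where r_squared is computed on the
    log-transformed values.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    syy = sum((v - my) ** 2 for v in ly)
    b = sxy / sxx                    # slope on log-log axes = exponent
    a = math.exp(my - b * mx)        # intercept on log-log axes = log(a)
    r_squared = (sxy * sxy) / (sxx * syy)
    return a, b, r_squared

# Synthetic example only (NOT values from the fruit fly annotation):
# number of genes as an exact power law of the number of transcript variants.
variants = [1, 2, 3, 4, 5, 6]
gene_counts = [8000 * v ** -2.0 for v in variants]

a, b, r2 = fit_power_law(variants, gene_counts)
print(f"a = {a:.1f}, b = {b:.2f}, R^2 = {r2:.3f}")
```

A straight line on log-log axes is consistent with a power law but does not by itself rule out other heavy-tailed forms, so a fit like this is best read as a first check.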


Protein Classification into Domains of Life Using Markov Chain Models
Francisca Zanoguera and Massimo de Francesco
Serono Pharmaceutical Research Institute

It has recently been shown that oligopeptide composition allows the proteomes of different organisms to be clustered into the main domains of life. In this paper, we go a step further by showing that, given a single protein, it is possible to predict whether it has a bacterial or eukaryotic origin with 85% accuracy. We obtain this result after ensuring that no important homologies exist between the sequences in the test set and the sequences in the training set. To do this, we model the sequence as a Markov chain. A bacterial and a eukaryotic model are produced using the training sets. Each input sequence is then classified by calculating the log-odds ratio of the sequence probabilities under the two models. The method does not use any sequence alignment, and is thus computationally cheap. By analyzing the models obtained, we extract a set of the most discriminant signatures, many of which are part of known functional motifs.
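
The classification step described above can be sketched as follows. This is a minimal illustration, assuming a first-order Markov chain over the 20 standard amino acids with additive smoothing; the abstract does not state the chain order or the smoothing used, the function names are hypothetical, and the training strings are toy inputs, not real proteins.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition log-probabilities.

    The chain order and the additive smoothing are assumptions of this
    sketch. Residues outside the 20 standard amino acids are skipped.
    """
    counts = {a: {b: pseudocount for b in AMINO_ACIDS} for a in AMINO_ACIDS}
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            if prev in counts and cur in counts[prev]:
                counts[prev][cur] += 1.0
    model = {}
    for a, row in counts.items():
        total = sum(row.values())
        model[a] = {b: math.log(c / total) for b, c in row.items()}
    return model

def log_prob(model, seq):
    """Log-probability of the transitions in seq (initial state ignored)."""
    return sum(
        model[p][c] for p, c in zip(seq, seq[1:])
        if p in model and c in model[p]
    )

def log_odds(bacterial_model, eukaryotic_model, seq):
    """Positive values favor the bacterial model, negative the eukaryotic."""
    return log_prob(bacterial_model, seq) - log_prob(eukaryotic_model, seq)

def classify(bacterial_model, eukaryotic_model, seq):
    score = log_odds(bacterial_model, eukaryotic_model, seq)
    return "bacterial" if score > 0 else "eukaryotic"

# Toy training sets (hypothetical strings, for illustration only).
bact = train_markov(["AAAAAAAA"])
euk = train_markov(["SSSSSSSS"])
print(classify(bact, euk, "AAAA"), classify(bact, euk, "SSSS"))
```

Because scoring is a single pass over the sequence with table lookups, the cost is linear in sequence length, which fits the abstract's point that no alignment is required.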


EcMLST: An Online Database for Multilocus Sequence Typing of Pathogenic Escherichia coli
Weihong Qi, David W. Lacher, Alyssa C. Bumbaugh, Katie E. Hyma, Lindsey M. Ouellette, Teresa M. Large, Cheryl L. Tarr, and Thomas S. Whittam
Microbial Evolution Laboratory, National Food Safety and Toxicology Center, Michigan State University

To provide the scientific community with a portable and accurate typing system for the unambiguous characterization of pathogenic Escherichia coli isolates, we have constructed an online database for MultiLocus Sequence Typing (MLST) of pathogenic E. coli (EcMLST) using current internet and open source technology. The system consists of an XML specification of the E. coli MLST system and a set of Perl modules that define the database tables and generate dynamic web pages for querying the database. It is implemented on a Sun server running the Apache web server. The underlying tier is the MySQL database system. Currently, the database contains nucleotide sequence data, predicted Restriction Fragment Length Polymorphism (RFLP) data, and annotated allelic profile data of 15 housekeeping genes for 600 representative E. coli isolates. Access to the centrally held typing and epidemiology data is supported by parametric searching and full-text searching, as well as by query interface links to the reference center of Shiga Toxin-producing E. coli (STEC). EcMLST has been used by public health laboratories and researchers for evolutionary and epidemiological studies. The system is accessible at http://www.shigatox.net/mlst.


On Complexity Measures for Biological Sequences
Fei Nan and Don Adjeroh
West Virginia University

The complexity of an organism has a direct manifestation in the general organization of its genomic structures. Given the primary genomic sequence for an organism, we can make certain predictions about the organism based on the randomness of the sequence, or the level of difficulty in predicting or compressing the sequence. Thus sequence complexity plays an important role in various application areas, such as biological sequence compression for compact storage, construction of phylogenetic trees, and comparative genomics. In this work, we study different published measures of complexity for general sequences to determine their effectiveness in dealing with biological sequences. By effectiveness, we refer to how closely a given complexity measure is able to identify known biologically relevant relationships, such as closeness on a phylogenetic tree. In particular, we studied three complexity measures, namely the traditional Shannon's entropy [1], linguistic complexity [2], and the T-complexity [3], using the complete genomes from different organisms, and we measured their effectiveness based on the extent to which they can distinguish between (or relate) different known organisms. For each complexity measure, we construct the complexity profile for each organism in our test set, and based on the profiles we compare the organisms using similarity metrics based on the information-theoretic Kullback-Leibler divergence [1] and on the apparent periodicity in the complexity profile. The preliminary results show that the T-complexity was the least effective in identifying previously established associations between genomes from known organisms. Shannon's entropy and linguistic complexity provided better results, although there was no clear winner. We are considering a combination of the different complexity measures, which we expect could improve the results.
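
Two of the ingredients named above, a Shannon entropy profile and the Kullback-Leibler divergence [1], can be sketched briefly. This is an illustrative sketch only: the window size, step, pseudocount, and the choice to compare single-symbol distributions are assumptions made here, and the abstract does not specify how its profiles or similarity metrics were constructed. The function names and the input strings are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy, in bits per symbol, of the symbols in seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_profile(seq, window, step):
    """Entropy of each sliding window; one simple kind of complexity profile."""
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]

def symbol_distribution(seq, alphabet="ACGT", pseudocount=1.0):
    """Smoothed symbol frequencies, so that no probability is zero."""
    counts = Counter(seq)
    total = len(seq) + pseudocount * len(alphabet)
    return {s: (counts[s] + pseudocount) / total for s in alphabet}

def kl_divergence(p, q):
    """D(p || q) in bits; q must be nonzero wherever p is nonzero."""
    return sum(pv * math.log2(pv / q[s]) for s, pv in p.items() if pv > 0)

# Toy sequences (hypothetical, for illustration only).
seq_a = "AAAAACGT"
seq_b = "ACGTACGT"
print(entropy_profile(seq_a, window=4, step=4))
print(kl_divergence(symbol_distribution(seq_a), symbol_distribution(seq_b)))
```

The Kullback-Leibler divergence is not symmetric, so a similarity metric built on it would typically combine D(p || q) and D(q || p) in some way; which combination to use is left open here.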

References:
[1] Cover T.M. and Thomas J. A., Elements of Information Theory, Wiley, 1991.
[2] Olga G. Troyanskaya, Ora Arbell, Yair Koren, Gad M. Landau and Alexander Bolshoy, Sequence complexity profiles of prokaryotic genomic sequences: A fast algorithm for calculating linguistic complexity, Bioinformatics, 18 (5), 679-688, 2002.
[3] Ebeling W, Steuer R, and Titchener MR, Partition-based entropies of deterministic and stochastic maps, Stochastics and Dynamics, 1, 1 1 17, 2001.

