Comparative Genomic Analysis of Cyclin-dependent Kinases and the Evolution of the RNAP II CTD Kinases
Zhenhua Guo and John W. Stiller
Department of Biology, East Carolina University, Greenville, NC 27858
Cyclin-dependent kinases (CDKs) are a large family of proteins that function in a variety of key regulatory
pathways in eukaryotic cells, including regulation of the cell cycle and gene transcription. Among the most
important and broadly studied of these roles is reversible phosphorylation of the C-terminal domain (CTD) of
RNA polymerase II, which results in a cycling between a hypo-phosphorylated form of RNAP II capable of
entering the preinitiation complex, and a hyper-phosphorylated form required for processive elongation
and co-transcriptional pre-mRNA processing. Here we present comparative genomic and phylogenetic analyses
showing that transcription-related kinases group separately from cell cycle-related kinases and may have a
single common kinase ancestor that evolved from cell cycle-related CDKs. Evolution within specific kinase
clades is discussed on the basis of the unrooted phylogenetic tree, and some suggestions are offered.
CDK7, CDK8 and CDK9, which are involved in CTD phosphorylation, are all recovered as individual
monophyletic groups. Most significantly, CDK7 and CDK8 are restricted to those organisms that possess a
strongly conserved RNAP II CTD, suggesting a co-origin and tightly linked co-evolution of these kinases with
more complex mechanisms for control over gene expression conveyed by the CTD.
Embedded Computation of Maximum-likelihood Phylogeny Inference Using Platform FPGA
Terrence S. T. Mak and K. P. Lam
The Chinese University of Hong Kong
Computation for maximum-likelihood phylogeny inference is time consuming, not only because the tree search
space grows exponentially, but also because evaluating the tree likelihood is computationally intensive
[2]. In [1], a Hardware/Software (HW/SW) co-design scheme was proposed for implementing the Genetic Algorithm
for Maximum Likelihood (GAML) phylogeny inference. The idea is to partition the GAML into a genetic algorithm
(GA) and a fitness evaluation function, with the SW handling the GA and the HW evaluating the maximum likelihood
function. This approach exploits the flexibility of software for tree topology searching and is able to speed up
the tree likelihood computation. However, the communication overhead between the loosely coupled HW and SW remains a
critical concern in realizing higher computational speed. Our work has recently been extended to a more powerful
embedded platform, in which (*) a microprocessor is immersed in a field-programmable gate array (FPGA),
providing an effective environment for our HW/SW co-design implementation. The internal bus infrastructure provides
a high-throughput communication gateway between the microprocessor and the FPGA [5]. Compared with the JBits
interface in [1], the data transmission rate between hardware and software is significantly improved and the FPGA
runs at a higher clock frequency. In addition, the embedded platform provides greater flexibility in partitioning
hardware and software tasks. These new features lead to much faster computation. The FPGA logic design for
the tree likelihood evaluation has also been improved to tackle larger problems by adopting the idea of
partial likelihood [4]. We propose a fine-grained partitioning scheme, in which the maximum likelihood computation
is partitioned into a number of subtasks. Under this new scheme, the system can support a wider range of evolutionary
models for phylogeny inference and provide more accurate results than the previous implementation.
References
[1] Mak T. and K. P. Lam, High Speed GAML-based Phylogenetic Tree Reconstruction Using HW/SW Codesign,
in Proc. of IEEE Computer Society Bioinformatics Conference (CSB'03), 2003.
[2] Felsenstein J., Evolutionary trees from DNA sequences: A maximum likelihood approach, J. Mol. Evol.,
17:368-376, 1981.
[3] Lewis P., A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data,
Molecular Biology and Evolution, 15(3):277-283, 1998.
[4] Mak T. and K. P. Lam, FPGA-based Computation for Maximum Likelihood Phylogenetic Tree Evaluation,
submitted to Field-Programmable Logic and Applications conference, 2004.
[5] Virtex-II Pro™ Platform FPGA Handbook, Xilinx, 2002.
(*) The Virtex-II Pro Platform FPGA is used in our study. In this embedded platform, an IBM PowerPC-405
microprocessor is immersed in the field-programmable gate array.
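The tree likelihood evaluation and the partial-likelihood idea mentioned in the abstract can be illustrated in software. The following is only a minimal sketch of Felsenstein's pruning recursion [2], not the authors' FPGA design: it assumes the Jukes-Cantor substitution model, a rooted binary tree encoded as nested tuples, and function names (jc69_matrix, partial_likelihood, log_likelihood) that are hypothetical and chosen for illustration.

```python
import math

BASES = "ACGT"

def jc69_matrix(t):
    """Jukes-Cantor transition probabilities for branch length t
    (expected substitutions per site). Assumed model, for illustration."""
    e = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * e
    diff = 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def partial_likelihood(node, site):
    """Partial likelihoods (one value per base) at `node` for one alignment column.

    `node` is a leaf name (str) or a tuple (left, t_left, right, t_right).
    `site` maps each leaf name to its observed base in this column.
    """
    if isinstance(node, str):
        observed = site[node]
        return [1.0 if b == observed else 0.0 for b in BASES]
    left, t_left, right, t_right = node
    p_left = partial_likelihood(left, site)
    p_right = partial_likelihood(right, site)
    m_left, m_right = jc69_matrix(t_left), jc69_matrix(t_right)
    result = []
    for x in range(4):
        # Sum over the unknown states at each child, then combine the subtrees.
        sum_left = sum(m_left[x][y] * p_left[y] for y in range(4))
        sum_right = sum(m_right[x][y] * p_right[y] for y in range(4))
        result.append(sum_left * sum_right)
    return result

def log_likelihood(tree, alignment):
    """Log-likelihood of an alignment (dict: leaf name -> sequence) on `tree`."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    total = 0.0
    for k in range(n_sites):
        site = {name: alignment[name][k] for name in names}
        root = partial_likelihood(tree, site)
        # Uniform base frequencies (0.25 each) under Jukes-Cantor.
        total += math.log(sum(0.25 * v for v in root))
    return total
```

Each internal node depends only on the partial likelihoods of its two children, which is what makes the computation divisible into independent subtasks of the kind a fine-grained HW/SW partitioning could exploit.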
Gene Lengths and Alternative Transcription of Fruit Fly
Boris Budagyan
CalState University, Hayward
A significant source of the vast diversity of protein functions is alternative transcription. In this work,
different statistical tests were applied to the fruit fly genome annotation (Release 3) to analyze the
relationship between alternative transcription and gene size. Only for genes producing 1 to 4 transcript
variants was there strong statistical evidence, at a significance level of 5% (p <= 7.6e-07), that longer
genes produce more transcript variants than smaller, more compact genes. This relationship is closely
described by a linear regression model (R^2=0.99). To understand this strong correlation, the relationship
between gene size and the number of exons in these genes was analyzed. Gene size was found to increase
exponentially with the number of exons (R^2=1). This result prompts a re-examination of the statement in a
number of publications that the size of genes is mainly determined by the length of introns. Moreover, an
exponential dependence (R^2=0.98) of the number of transcripts on the mean exon count was also found. These
strong relationships for genes producing 1 to 4 transcript variants are very important, because smaller
genes dominate the overall population of genes. It was established that the number of genes in the fruit fly
genome decreases as the number of transcript variants increases, following a power-law function, which
points to an underlying DNA duplication process.
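A power-law relationship of the kind reported above is commonly checked by ordinary least squares on log-transformed values. The sketch below shows that procedure only; the abstract does not state how the fit was done, the function name fit_power_law is hypothetical, and the input values are synthetic numbers generated from an exact power law, not data from the fruit fly annotation.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) + b*log(x).

    Returns (a, b, r_squared) for the model y = a * x**b.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    syy = sum((v - mean_y) ** 2 for v in ly)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    r_squared = (sxy * sxy) / (sxx * syy)
    return a, b, r_squared

# Synthetic illustration only: counts generated from y = 1000 * k**-2.
variants = [1, 2, 3, 4, 5, 6]
genes = [1000.0 * k ** -2.0 for k in variants]
a, b, r_squared = fit_power_law(variants, genes)
print(f"a={a:.1f}, exponent b={b:.2f}, R^2={r_squared:.3f}")
```

With real counts, the exponent b and R^2 from the log-log fit indicate how well a power law describes the decrease.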
Protein Classification into Domains of Life Using Markov Chain Models
Francisca Zanoguera and Massimo de Francesco
Serono Pharmaceutical Research Institute
It has recently been shown that oligopeptide composition allows clustering proteomes of different
organisms into the main domains of life. In this paper, we go a step further by showing that, given
a single protein, it is possible to predict whether it has a bacterial or eukaryotic origin with 85%
accuracy, and we obtain this result after ensuring that no important homologies exist between the
sequences in the test set and the sequences in the training set. To do this, we model the sequence
as a Markov chain. A bacterial and a eukaryotic model are built from the training sets. Each input
sequence is then classified by calculating the log-odds ratio of its probability under the two
models. The method does not use any sequence alignment and is thus computationally cheap. By analyzing
the models obtained, we extract a set of the most discriminant signatures, many of which are part of known
functional motifs.
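The classification step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the abstract does not give the order of the Markov chain or the smoothing used, so a first-order chain with add-one pseudocounts is assumed, and all function names are hypothetical.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_markov_model(sequences, pseudocount=1.0):
    """Estimate first-order transition log-probabilities from training sequences.

    Returns model[prev][cur] = log P(cur | prev). Transitions involving
    non-standard residues are ignored.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    model = {}
    for prev in AMINO_ACIDS:
        total = sum(counts[prev][c] for c in AMINO_ACIDS)
        total += pseudocount * len(AMINO_ACIDS)
        model[prev] = {
            c: math.log((counts[prev][c] + pseudocount) / total)
            for c in AMINO_ACIDS
        }
    return model

def log_odds(seq, model_a, model_b):
    """Sum over transitions of log P(cur | prev, A) - log P(cur | prev, B)."""
    score = 0.0
    for prev, cur in zip(seq, seq[1:]):
        if prev in model_a and cur in model_a[prev]:
            score += model_a[prev][cur] - model_b[prev][cur]
    return score

def classify(seq, bacterial_model, eukaryotic_model):
    """Label a protein by the sign of its bacterial-vs-eukaryotic log-odds."""
    score = log_odds(seq, bacterial_model, eukaryotic_model)
    return "bacterial" if score > 0 else "eukaryotic"
```

Scoring needs only one table lookup per residue pair, which is consistent with the abstract's point that avoiding alignment keeps the method computationally cheap. The individual transitions with the largest absolute log-odds are one simple way to look for discriminant signatures.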
EcMLST: An Online Database for Multilocus Sequence Typing of Pathogenic Escherichia coli
Weihong Qi, David W. Lacher, Alyssa C. Bumbaugh, Katie E. Hyma, Lindsey M. Ouellette, Teresa M.
Large, Cheryl L. Tarr, and Thomas S. Whittam
Microbial Evolution Laboratory, National Food Safety and Toxicology Center, Michigan State University
In order to provide the scientific community with a portable and accurate typing system for the unambiguous
characterization of pathogenic Escherichia coli isolates, we have constructed an online database for MultiLocus
Sequence Typing (MLST) of pathogenic E. coli (EcMLST) using current internet and open source technology. The
system consists of an XML specification of the E. coli MLST system and a set of Perl modules that define the
database tables and generate dynamic web pages for querying the database. It is implemented on a Sun server
running the Apache web server. The underlying tier is the MySQL database system. Currently, the database
contains nucleotide sequence data, predicted Restriction Fragment Length Polymorphism (RFLP) data, and annotated
allelic profile data of 15 housekeeping genes for 600 representative E. coli isolates. Access to the centrally held
typing and epidemiology data is supported by parametric searching, full-text searching, and query interface
links to the reference center of Shiga Toxin-producing E. coli (STEC). EcMLST has been used by public health
laboratories and researchers for evolutionary and epidemiological studies. The system can be accessed at
http://www.shigatox.net/mlst.
On Complexity Measures for Biological Sequences
Fei Nan and Don Adjeroh
West Virginia University
The complexity of an organism has a direct manifestation in the general organization of its genomic
structures. Given the primary genomic sequence for an organism, we can make certain predictions about
the organism based on the randomness of the sequence, or the level of difficulty in predicting or
compressing the sequence. Thus sequence complexity plays an important role in various application areas,
such as biological sequence compression for compact storage, construction of phylogenetic trees, and
comparative genomics. In this work, we study different published measures of complexity for
general sequences, to determine their effectiveness in dealing with biological sequences. By effectiveness,
we refer to how closely the given complexity measure is able to identify known biologically relevant
relationships, such as closeness on a phylogenetic tree. In particular, we studied three complexity measures,
namely the traditional Shannon's entropy [1], linguistic complexity [2], and the T-complexity [3], using the
complete genomes from different organisms, and we measured their effectiveness based on the extent to which
they can distinguish between (or relate) different known organisms. For each complexity measure, we
construct the complexity profile for each organism in our test set, and based on the profiles we compare
the organisms using similarity metrics based on the information-theoretic Kullback-Leibler divergence [1]
and on the apparent periodicity in the complexity profile. The preliminary results show that the T-complexity
was the least effective in identifying previously established associations between genomes of known
organisms. Shannon's entropy and linguistic complexity provided better results, although there was no clear
winner. We are considering a combination of the different complexity measures, which we expect could
improve the results.
References:
[1] Cover T. M. and Thomas J. A., Elements of Information Theory, Wiley, 1991.
[2] Troyanskaya O. G., Arbell O., Koren Y., Landau G. M. and Bolshoy A., Sequence complexity profiles of
prokaryotic genomic sequences: A fast algorithm for calculating linguistic complexity, Bioinformatics,
18(5), 679-688, 2002.
[3] Ebeling W., Steuer R. and Titchener M. R., Partition-based entropies of deterministic and stochastic maps,
Stochastics and Dynamics, 1, 1 1 17, 2001.
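Two of the measures named in the abstract, and the notion of a complexity profile, can be sketched as below. This is a simplified illustration rather than the authors' procedure: the window size, step, and word lengths are assumptions, the linguistic complexity here is a basic ratio of observed to maximum possible distinct subwords (not the fast algorithm of [2]), and the function names are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(seq, k=1):
    """Shannon entropy (bits) of the k-mer distribution of `seq`."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def linguistic_complexity(seq, max_k=3, alphabet_size=4):
    """Ratio of distinct subwords observed to the maximum possible,
    summed over word lengths 1..max_k (simplified variant)."""
    n = len(seq)
    observed = 0
    possible = 0
    for k in range(1, max_k + 1):
        observed += len({seq[i:i + k] for i in range(n - k + 1)})
        # A window of length n holds at most n-k+1 words of length k,
        # and no more than alphabet_size**k distinct ones.
        possible += min(alphabet_size ** k, n - k + 1)
    return observed / possible

def complexity_profile(seq, measure, window=8, step=4):
    """Apply `measure` to sliding windows along `seq`."""
    return [
        measure(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]
```

For example, complexity_profile(genome, shannon_entropy, window=1000, step=500) would give one entropy value per window, and profiles from two genomes could then be compared with a similarity metric such as the Kullback-Leibler divergence mentioned above.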