Poster Abstracts for
Section B Poster Sessions II and III
- Genomic Annotation -
Prediction of DtxR Regulon:
A Computational Approach to Identify Physiological Processes
Controlled by Iron in Corynebacterium Species
A New Approach to
Gene Prediction Using the Self-Organizing Map
On Gene Prediction
by Cross-Species Comparative Sequence Analysis
CUBIC: Identification of Regulatory Binding Sites Through Data Clustering
- Molecular Simulation -
What Makes IgG Binding Domain of Protein L Fold Up to Native State: A Simulation
Study with Physical Oriented Energy Functions Coupled to Topology Induced Terms
New Computational Methods for Electrostatics in Macromolecular Simulation
The Approximate Algorithm for Analysis of the Strand Separation
Transition in Superhelical DNA Using Nearest Neighbor Energetics
Substrate Recognition by Enzymes: A Theoretical Study
Computational Simulation of Lipid Bilayer Reorientation at Gaps
- Phylogeny and Evolution -
Search for Evolution-related-oligonucleotides and Conservative Words in rRNA Sequences
High Speed GAML-based Phylogenetic Tree Reconstruction Using HW/SW Codesign
Evidence for Growth of Microbial Genomes by Short Segmental Duplications
PTC: An Interactive Tool for Phylogenetic Tree Construction
Iterative Rank Based Methods for Clustering
Analysis of Phylogenetic Profiles by Bayesian Decomposition
Automating Recognition of Regions of Intrinsically Poor Multiple Alignment
for Phylogenetic Analysis Using Machine Learning
Reconstruction of Ancestral Gene Order Following Segmental
Duplication and Gene Loss
Reconstruction of Ancient Operons From Complete Microbial Genome Sequences
- Predictive Methods -
An Evolving Approach to Finding Schemas for Protein Secondary Structure Prediction
Gene Selection for Multi-class Prediction of Microarray Data
Molecular Evaluation Using Comparative Molecular Interaction Profile Analysis System
Probe Design for Large-scale In Situ Hybridization and Oligo DNA Microarrays:
An Application of the New NIA Gene Index
Gene Selection for Cancer Classification Using Bootstrapped
Genetic Algorithms and Support Vector Machines
A Statistical Model of Proteolytic Digestion
Preliminary Wavelet Analysis of Genomic Sequences
Using Easel for Modeling and Simulating the Interactions
of Cells in Order to Better Understand the Basics of Biological Processes and to Predict Their Likely Behaviors
Fold Recognition Using Sequence Fingerprints of Protein Local Substructures
An Iterative Loop Matching Approach to the Prediction of RNA Secondary Structures with Pseudoknots
A New Method for Predicting RNA Secondary Structure
Minimum-Redundancy
Feature Selection from Microarray Gene Expression
- Sequence Comparison -
A Linear Programming Based Algorithm for Multiple Sequence Alignment
Alignment-Free Sequence Comparison with Vector Quantization and Hidden Markov Models
Prediction of Protein Function Using Signal Processing of Biochemical Properties
Genomic Sequence Analysis Using Gap Sequences and Pattern Filtering
An Optimal DNA Segmentation Based on the MDL Principle
The Cybertory Sequence File System (SFS) for Managing Large DNA Sequences
Implementing Parallel Hmm-pfam on the EARTH Multithreaded Architecture
Genome on Demand: Interactive Substring Searching
Automatic Selection of Parameters for Sequence Alignment
CoMRI: A Compressed Multi-Resolution Index Structure for
Sequence Similarity Queries
- Systems Biology -
Pathway Logic Modeling of Protein Functional Domains in Signal Transduction
Development of a Massively-Parallel, Biological Circuit Simulator
Representing and Reasoning About Signal Networks: An Illustration Using NFkappaB Dependent Signaling Pathways
Noise Attenuation in Artificial Genetic Networks
Computational Inference of Regulatory Pathways in Microbes:
An Application to Phosphorus Assimilation Pathways in Synechococcus WH8102
- Miscellaneous -
Development and Assessment of Bioinformatics Tools to
Enhance Species Conservation and Habitat Management
MedfoLink: Bridging the Gap between IT and the Medical Community
TC-DB: A Membrane Transport Protein Classification Database
Genomic Annotation
Prediction of DtxR Regulon: A Computational Approach to Identify Physiological Processes
Controlled by Iron in Corynebacterium Species
Sailu Yellaboina, Prachee, Akash Ranjan, and Syed E. Hasnain
Centre for DNA Fingerprinting and Diagnostics, India
Abstract
Background: The diphtheria toxin repressor (DtxR) of
Corynebacterium diphtheriae has been shown to be an iron-sensitive transcription regulator that controls the
expression of iron uptake genes and diphtheria toxin. This
study aimed to increase understanding of the DtxR-regulated genes and their role in the physiology of
Corynebacterium.
Result: We developed a user-friendly online software tool
to identify the potential binding sites for any regulator
based on a Shannon relative entropy scoring scheme. Known
DtxR binding sites of C. diphtheriae were used to generate
a position-specific preference profile for DtxR, which was
used by our method to identify potential DNA binding
sites within the C. glutamicum genome. In addition, DtxR-regulated
operons were identified by taking into account
the predicted DtxR regulatory sites, transcription
termination sites and gene order conservation. The analysis
predicted a number of DtxR-regulated operons/genes whose
orthologues code for iron uptake systems, such as
siderophore transport systems, hemin transport systems,
hemolysins and heme transporters. The analysis also
predicts genes whose orthologues code for ferritin and the
starvation-inducible DNA binding protein (Dps), which are
involved in iron storage and oxidative stress defence in
various bacteria. In addition, we found that genes coding
for orthologues of the adaptive response regulator
(Ada) and endonuclease VIII (Nei), both involved in DNA
repair, could be regulated by the diphtheria toxin repressor.
Conclusions:
The methodology used to predict DtxR-iron regulated genes
proved highly effective, as our results agreed with
experimental data observed in other bacterial systems. In
addition, our finding of many new DtxR-iron regulated genes
reveals the physiological processes controlled by DtxR-iron
in various Corynebacterium species. Our findings show that
hemolysis, hemin uptake and heme oxidation might be an
essential iron-acquisition pathway for various
Corynebacterium pathogens at low levels of extracellular
iron. Our data analysis reveals that DtxR coordinately
controls the genes involved in iron uptake, iron
storage, oxidative stress defence and DNA repair to counter
the low levels of iron as well as ferrous-iron-induced
oxidative damage via the Fenton reaction.
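A relative-entropy profile scan of the kind described above can be sketched compactly. The following Python is a minimal illustration only, not the authors' online tool; the pseudocount and threshold are assumptions for exposition.

```python
# Minimal sketch (not the authors' tool): build a position-specific profile
# from known binding sites, then score genome windows with a log-odds
# (relative-entropy-style) score against background base frequencies.
import math
from collections import Counter

def build_profile(sites, pseudocount=0.5):
    """Column-wise base frequencies from aligned known binding sites."""
    profile = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        total = len(sites) + 4 * pseudocount
        profile.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return profile

def site_score(window, profile, background=0.25):
    """Sum of per-position log-odds; high scores mark candidate sites."""
    return sum(math.log2(profile[i][b] / background)
               for i, b in enumerate(window))

def scan(genome, profile, threshold):
    w = len(profile)
    for pos in range(len(genome) - w + 1):
        s = site_score(genome[pos:pos + w], profile)
        if s >= threshold:
            yield pos, s
```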
A New Approach to Gene Prediction Using the Self-Organizing Map
Shaun Mahony1, Aaron Golden2, James McInerney3, and Terry Smith1
1National Centre for Biomedical Engineering Science, NUI,
Galway, Ireland
2Dept. of Information Technology, NUI, Galway, Ireland
3Bioinformatics and Pharmacogenomics Laboratory, NUI, Maynooth,
Ireland
Abstract
Computational gene prediction methods have yet to achieve a
perfect accuracy rate, and many make a substantial number
of false-positive predictions, even in prokaryotic genomes.
One of the most obvious reasons for inaccurate gene
predictions is the high degree of compositional variation
that exists within most genomic sequences. For example, it
has long been recognized that synonymous codon usage is
highly variable and subject to many evolutionary pressures.
Many Markov-model-based gene-finding tools use only one
model to represent protein coding regions in any given
genome, and so are less likely to predict genes with an
unusual composition. Indeed, it has been shown that using
two or three models substantially increases the accuracy of
a Markov-based gene-finder (Hayes & Borodovsky, 1998).
In this poster we present a new neural network based
approach to gene prediction that has the ability to
automatically identify all the major patterns of content
variation within a genome. The genome may then be scanned
for regions displaying the same properties as one of these
automatically identified models. Even using a relatively
simple coding measure (relative synonymous codon usage),
this method can predict the location of protein-coding
sequences with a reasonably high accuracy, and with a
specificity that is higher than Markov-based approaches in
many cases. We also show other advantages of the approach,
such as the ability to indicate genes that contain frame-
shifts. We believe that this method has the potential to
become a useful addition to the genome annotation process.
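The coding measure used here, relative synonymous codon usage, is straightforward to compute. A minimal sketch follows (not the poster's implementation; the genetic-code table is abbreviated to three synonymous families for brevity):

```python
# Minimal sketch (illustration only): relative synonymous codon usage
# (RSCU) for a reading-frame window -- the kind of feature vector a
# self-organizing map could be trained on.
from collections import Counter

SYNONYMS = {
    "F": ["TTT", "TTC"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}

def rscu(seq):
    """RSCU(c) = count(c) / mean count over c's synonymous family."""
    codons = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    values = {}
    for family in SYNONYMS.values():
        total = sum(codons[c] for c in family)
        for c in family:
            values[c] = codons[c] * len(family) / total if total else 0.0
    return values
```

Each window's RSCU vector would then be mapped to its best-matching node of the trained SOM, and windows resembling one of the automatically identified compositional models flagged as putative coding regions.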
On Gene Prediction by Cross-Species Comparative Sequence Analysis
Rong Chen and Hesham Ali
Department of Computer Science, College of Information Science and Technology,
University of Nebraska at Omaha, Omaha, NE 68182-0116
Abstract
Sequencing of large fragments of genomic DNA, as well as
complete eukaryotic chromosomes of many organisms, makes it
possible to apply comparison of genomic sequences for
identification of protein-coding regions. This approach is
based on the fact that protein-coding regions evolve much
slower than non-coding regions. Thus, candidate exons are
seen as islands of similarity in alignment to genomic
sequences that harbor homologous genes. Recently, several
algorithms have been described for automated gene
recognition by genomic comparison. Most of these programs
are specifically designed for the comparison of closely
related species. However, non-coding regions may also be
conserved between species whose evolutionary distance is
too small. We conducted a comparative analysis of
homologous genomic sequences of organisms at different
evolutionary distances and found that non-coding regions
are conserved between closely related organisms. In
contrast, more distant comparisons show much less intron
similarity, but also less conservation of exon structure
(in terms of exon number, length and sequence similarity).
Based on this finding and training on data sets, we
proposed a model by which coding sequence could be
identified by comparing sequences of multiple species, both
close and appropriately distant. The reliability of the proposed method is
evaluated in terms of sensitivity and specificity, and
results are compared to the ones obtained by other popular
gene prediction programs. Provided sequences can be found
from other species at appropriate evolutionary distances,
this approach could be applied in newly sequenced organisms
where no species-dependent statistical models are
available.
CUBIC: Identification of Regulatory Binding Sites Through Data Clustering
Victor Olman1, Dong Xu1 and Ying Xu1,2
1Protein Informatics Group, Life Sciences Division
2Computer Science and Mathematics Division
Oak Ridge National Laboratory, Oak Ridge, TN
Abstract
Transcription factor binding sites are short fragments in
the upstream regions of genes, to which transcription
factors bind to regulate the transcription of genes into
mRNA. Computational identification of transcription factor
binding sites remains an unsolved challenging problem
though a great amount of effort has been put into the study
of this problem. We have recently developed a novel
technique for identification of binding sites from a set of
upstream regions of genes, that could possibly be
transcriptionally co-regulated and hence might share
similar transcription factor binding sites. By utilizing
two key features of such binding sites (i.e., their high
sequence similarities and their relatively high frequencies
compared to other sequence fragments), we have formulated
this problem as a cluster identification problem, that is,
identifying and extracting data clusters from a noisy
background. While the classical data clustering problem
(partitioning a data set into clusters sharing common or
similar features) has been extensively studied, there is no
general algorithm for solving the problem of identifying
data clusters from a noisy background. In this paper, we
present a novel algorithm for solving such a problem. We
have proved that a cluster identification problem, under
our definition, can be rigorously and efficiently solved
through searching for substrings with special properties in
a linear sequence. We have also developed a method for
assessing the statistical significance of each identified
cluster, which can be used to rule out accidental data
clusters. We have implemented the cluster identification
algorithm and the statistical significance analysis method
in the software package CUBIC. Extensive testing of CUBIC
has been carried out. We present here a few applications of
CUBIC on challenging cases of binding site identification.
Molecular Simulation
What Makes IgG Binding Domain of Protein L Fold Up to Native State: A Simulation
Study with Physical Oriented Energy Functions Coupled to Topology Induced Terms
Seung Yup Lee, Yoshimi Fujitsuka,
Do Hyun Kim and Shoji Takada
Department of Chemical and Biomolecular Engineering and Center for Ultramicrochemical
Process Systems, Korea Advanced Institute of Science and Technology
Abstract
Finding relationships between the folding mechanisms of
proteins and their linear sequences remains a fundamental
question despite extensive research. The native
topology is found to control the folding mechanisms and
structural distributions of small proteins indicating that
interactions within native topology bias a protein to
native state. In addition, proteins sharing similar native
topology show almost the same folding pathways although the
sequence homology between them is quite low. Therefore,
protein topology may be more important than any other
specific interactions related to sequences. On the other
hand, some proteins show different folding
scenarios in which native topology no longer plays a
dominant role in selecting folding pathways and kinetics.
In particular, owing to the structural symmetry of its
native state, the IgG binding domain of protein L, which
consists of two beta hairpins and a central alpha helix, is
known to have folding dynamics determined not primarily by
native topology but by sequence-specific interactions. In this study, the folding
of protein L is characterized by molecular simulation with
a coarse-grained peptide chain representation and physical
effective energy functions (EEFs) taking into account
solvent induced effects coupled to topological energies.
The propensities of secondary structure formations as well
as structural distributions along folding pathways are
analyzed in detail. Moreover, predicted results for
protein L are also compared with the folding of its structural
analog, the IgG binding domain of protein G, to find general
rules in determining the folding of small globular proteins.
New Computational Methods for Electrostatics in Macromolecular Simulation
Igor Tsukerman
Dept. of Electrical & Computer Eng., The University of Akron
Abstract
Electrostatic effects are known to play a major role in the
behavior of macromolecules in solvents. Computer simulation
of these effects remains quite challenging due to a large
number of charges or particles involved, varying dielectric
constants, and ionic interactions in the solvent.
The paper introduces new difference schemes that can
incorporate any desired analytical approximation of the
electrostatic potential (for example, its singular
Coulombic or dipole terms) exactly, and with little
computational overhead.
For EXPLICIT solvent models, the schemes have several
salient advantages: (i) optimal computational cost with
respect to the number of atoms; (ii) real arithmetic only;
(iii) any boundary conditions (not necessarily periodic)
easily implemented; (iv) no FFTs (more efficient
parallelization); (v) 10-15 times higher accuracy in the
computed energy and force, as compared to the published
results on conventional methods with similar parameters
(e.g. smooth PME with 4th order interpolation). Numerical
experiments for varying mesh sizes and number of atoms will
be presented.
For IMPLICIT solvent models, not only the singular
Coulombic terms but also derivative jumps of the potential
at solute-solvent interfaces can be analytically
incorporated into the computational scheme through an
auxiliary coordinate mapping. One version of the scheme
employs a regular mesh and another one is meshless, with an
adaptively chosen distribution of nodes.
I gratefully acknowledge collaboration with Prof. Gary
Friedman's group of Drexel University on magnetostatic and
electrostatic models of nanoparticles in colloidal
solutions and with Dr. Achim Basermann of NEC Europe Ltd.
on parallel implementation of the algorithms.
The Approximate Algorithm for Analysis of the Strand Separation Transition
in Superhelical DNA Using Nearest Neighbor Energetics
Chengpeng Bi and Craig J. Benham
UC Davis Genome Center, University of California, One Shields Ave.,
Davis CA 95616
Abstract
Accurate methods to computationally search genomic DNA sequences for
regulatory regions have been difficult to develop. Conventional string-based
methods have not been successful because many types of regulatory
regions do not have recognizable motifs. Even when a sequence
pattern is known to be associated with a class of regulatory regions,
it commonly is necessary, but not sufficient, for function. This suggests
that other attributes, not necessarily strongly correlated with the
details of base sequence, are involved in regulation. Here we present
a computational method to analyze the propensity of superhelically
stressed DNA to undergo strand separation events, as is required for
the initiation of both transcription and replication. We build in
silico models to analyze the statistical mechanical equilibrium distribution
of a population of identical, stressed DNA molecules among its states
of strand separation. In this phenomenon, which we call stress induced
duplex destabilization (SIDD), a state energy is determined by the
energy cost of opening the specific separated base pairs in that state,
and the energy relief from the relaxation of stress this affords.
We use experimentally measured values of all energy parameters, including
the nearest neighbor energetics known to govern DNA base pair stability.
We perform a statistical mechanical analysis in which the approximate
equilibrium distribution is calculated from all states whose free
energies do not exceed a user-defined threshold. This provides the
most general and efficient computational approach to the analysis
of this phenomenon. The algorithm is implemented in C++, and its performance
is analyzed.
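The equilibrium calculation described above reduces, once the low-energy states are enumerated, to Boltzmann weighting. A minimal Python illustration follows (an expository assumption, not the authors' C++ implementation; state enumeration, the hard part, is taken as given):

```python
# Minimal sketch: Boltzmann equilibrium probabilities over an enumerated
# set of strand-separation states, keeping only states within a
# user-defined free-energy threshold of the minimum.
import math

RT = 0.5921  # kcal/mol near 298 K (approximate)

def equilibrium(states, threshold):
    """states: list of (label, free_energy); returns {label: probability}."""
    g_min = min(g for _, g in states)
    kept = [(s, g) for s, g in states if g - g_min <= threshold]
    weights = {s: math.exp(-(g - g_min) / RT) for s, g in kept}
    z = sum(weights.values())  # partition function over retained states
    return {s: w / z for s, w in weights.items()}
```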
Substrate Recognition by Enzymes: A Theoretical Study
Kaori Ueno-Noto1, Keiko Takano1,
and Miki Hara-Yokoyama2
1Graduate School of Humanities and Sciences, Ochanomizu University, 2-1-1 Otsuka, Bunkyo-ku, Tokyo 112-8610, Japan
2Graduate School of Dental, Tokyo Medical and Dental University, 1-5-45 Yushima, Bunkyo-ku, Tokyo 113-8549, Japan
Abstract
Oligosaccharides of glycosphingolipids on the cell surface
show diversity in both primary and tertiary structures,
which is closely associated with specific interactions in
the extracellular domain. Reports on the higher-order
structures of glycosphingolipids have been limited because
of the experimental difficulty of working with
oligosaccharides; most reports document only their primary
structures. In this study, we investigated these structures
by applying computational chemistry.
Gangliosides are sphingolipids containing sialic acids
and their oligosaccharides can be recognized by some
enzymes. We previously reported that gangliosides inhibited
the activity of the enzyme NAD glycohydrolase (CD38), and
that those with tandem sialic acid residues in the sugar
chain had a strong inhibitory effect. We describe the results
of computer simulations on three-dimensional structures and
electronic structures of gangliosides in order to clarify
the causative factors of difference in the inhibitory
effect and the recognition mechanisms of the enzyme.
Similarities between dimeric sialic acid and NAD in
the calculated structures were assessed by conformational
analyses and molecular orbital calculations, on the
assumption that CD38, an NAD reacting enzyme, cross-reacts
with the tandem sialic acid residues in gangliosides. It
was found that the calculated dipole moments and HOMOs were
correlated with the inhibitory effect. CD38 is likely to
recognize the two carboxyl groups in tandem sialic acid
residues of gangliosides, as well as the phosphate groups
in NAD. Solvation effects were also considered to
interpret the substrate recognition mechanisms in the
biological system, which supported the above results.
Computational Simulation of Lipid Bilayer Reorientation at Gaps
Peter M. Kasson and Vijay S. Pande
Stanford University Medical School, Stanford, CA
Abstract
Understanding cellular membrane processes is critical for
the study of events such as viral entry, neurotransmitter
exocytosis, and immune activation. Supported lipid
bilayers serve as a model system for many membrane
processes, offering the potential to study them in a more
controllable setting. Despite the relative experimental
simplicity of this model system, many important structural
and dynamic parameters are not experimentally observable
with current techniques. The orientation of individual
lipid molecules within the bilayer is one of these
experimentally indeterminable parameters. Computational
approaches allow the development of a high-resolution model
of bilayer processes. We have performed molecular dynamics
simulations of dimyristoylphosphatidylcholine (DMPC)
bilayers to model the creation of bilayer gaps--a common
process in bilayer patterning--and to analyze their
structure and dynamics. Molecular simulation of these
large systems was made computationally tractable using
parallel processing. Based on our observations, we propose
a model for gap formation in which the bilayer edges form
metastable micelle-like structures on a nanosecond time
scale. Lipid molecules near the edges are structurally
similar to lipids in ungapped bilayers but undergo small-
scale motions on a more rapid timescale. These data
suggest that lipids may undergo rapid local rearrangements
in bilayers undergoing membrane fusion, thus facilitating
the formation of the fusion structures postulated to be
intermediates in the infection cycle of viruses such as
influenza, Ebola, and HIV.
Phylogeny and Evolution
Search for Evolution-related-oligonucleotides and Conservative Words in rRNA Sequences
Liaofu Luo, Fengmin Ji, Mengwen Jia,
Li-Ching Hsieh and H.C. Lee
Inner Mongolia University, China
Abstract
We describe a method for finding ungapped conserved words
in rRNA sequences that is effective, utilizes evolutionary
information and does not depend on multiple sequence
alignment. Evolutionary distance (called n-distance)
between a pair of 16S or 18S rRNA sequences is defined in
terms of the difference in the two sets of frequencies of
occurrence of oligonucleotides n bases long (n-mers) given
by the sequences. These n-distances are used to
reconstruct phylogenetic trees for 35 representative
organisms from all three kingdoms. The quality of the tree
generally improves with increasing n and reaches a plateau
of best fit at n=7 or 8. Hence the 7-mer or 8-mer
(oligonucleotide of 7 or 8 bases) frequencies provide a
basis to describe rRNA evolution. Based on the analysis of
the contribution of particular 7-mers to 7-distances, a
set of 612 7-mers (called evolution-related-
oligonucleotides, EROs) that are critical to the topology
of the best phylogenetic tree are identified. Expanding
from this set of EROs, evolution-related conservative words
longer than 7 bases in 16S rRNA sequences from an enlarged
set of 98 organisms in Bacteria and Archaea are identified
based on two criteria: (1) the word is highly conserved in
nearly all species of a kingdom (or a sub-kingdom); (2) the
word is located at nearly the same site in each sequence.
Three examples of words thus found are: the 13-mer
ggattagataccc, located at the end of a loop near H24 (in
E. coli), is conservative in almost all species in Archaea
and Bacteria. The 8-mer aacgagcg located on H35 is also
conservative in Archaea and Bacteria. Its expansion, the 32-
mer tgttgggttaagtcccgcaacgagcgcaaccc, is conservative in
Bacteria but not in Archaea.
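The n-distance itself admits a very short sketch. The abstract defines it only as a function of the difference between the two n-mer frequency sets, so the Euclidean form below is an assumption for illustration:

```python
# Minimal sketch (Euclidean form assumed; the abstract does not fix the
# metric): n-distance between two rRNA sequences computed from their
# n-mer frequency vectors.
import math
from collections import Counter

def nmer_freqs(seq, n):
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def n_distance(seq_a, seq_b, n=7):
    fa, fb = nmer_freqs(seq_a, n), nmer_freqs(seq_b, n)
    return math.sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2
                         for w in set(fa) | set(fb)))
```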
High Speed GAML-based Phylogenetic Tree Reconstruction Using HW/SW Codesign
Terrence S.T. Mak and K. P. Lam
The Chinese University of Hong Kong
Abstract
With the accumulation of genetic information for biological
organisms, different computational techniques are applied
to inferring meaningful patterns from the alignment of DNA
sequences. A phylogenetic relationship between different
organisms can be inferred by making use of a maximum
likelihood approach that has been demonstrated to solve the
problems effectively. Heuristics for calculating the
phylogenetic trees based on maximum likelihood are
computationally expensive. The tree evaluation function
that calculates the likelihood value for each tree topology
dominates the time in searching the optimal tree. We
developed a hybrid Hardware/Software (HW/SW) system for
solving the phylogenetic tree reconstruction problem using
the Genetic Algorithm for Maximum Likelihood (GAML)
approach. The GAML is partitioned into a genetic algorithm
(GA) and a fitness evaluation function, with the SW
handling the GA and the hardware evaluating the maximum
likelihood function. This approach exploits the flexibility
of software and the speed up of high performance hardware.
An efficient Field Programmable Gate Array (FPGA)
implementation for the required computation on evolution
tree topology fitness evaluation is proposed. The complete
high-level digital design is developed using Xilinx's
System Generator for DSP toolbox, which provides an
efficient generation of VHDL code for hardware circuitry
programming. In addition, Xilinx's Java-based JBits toolkit
is used for constructing a BRAM interface for digital
process synchronization control and GA chromosomes
transmission between a workstation and the Xilinx Virtex-
800 FPGA Processor. This implementation provides a
speedup of approximately 30 to 100 times compared
to a software-only solution.
Evidence for Growth of Microbial Genomes by Short Segmental Duplications
Li-Ching Hsieh, Liaofu Luo and Hoong-Chien Lee
Department of Physics, National Central University, Taiwan
Abstract
We show that textual analysis of microbial complete genomes
reveals telling footprints of their early evolution. If a
DNA sequence considered as a text in its four bases is
sufficiently random, the distribution of frequencies of
words of a fixed length from the text should be
Poissonian. We point out that in reality, for words of fewer
than nine letters, complete microbial genomes universally
have distributions that are uniformly many times wider than
those of corresponding Poisson distributions. We interpret
this phenomenon as follows: the genome is a large system
that possesses the statistical characteristics of a much
smaller random system, and certain textual statistical
properties of genomes observable now are remnants of those
of their ancestral genomes, which were much shorter than
genomes today. This interpretation motivates a simple
biologically plausible model for the growth of genomes: the
genome first grew randomly to an initial length of not more
than one thousand bases (1 kb), and thereafter grew mainly by
random segmental duplication. Setting the lengths of
duplicated segments to average around 25 bases, we have
generated model sequences in silico whose statistical
properties emulate those of present-day genomes. The small
size of the initial random sequence and the shortness of
the duplicated segments both dictate an RNA
world at the time growth by duplication began. Growth by
duplication allowed the genome repeated use of hard-to-
come-by codes, thereby enormously increasing the rates of
evolution and species diversification.
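The growth model lends itself to a few lines of simulation. In the illustrative sketch below, the exponential segment-length distribution is an assumption; the abstract specifies only an average of about 25 bases:

```python
# Minimal sketch (not the authors' model code): grow a sequence from a
# short random seed by repeatedly duplicating a random segment and
# inserting the copy elsewhere.
import random

def grow_genome(seed_len=1000, target_len=100_000, mean_seg=25):
    genome = [random.choice("ACGT") for _ in range(seed_len)]
    while len(genome) < target_len:
        seg_len = max(1, round(random.expovariate(1 / mean_seg)))
        start = random.randrange(max(1, len(genome) - seg_len))
        segment = genome[start:start + seg_len]
        insert_at = random.randrange(len(genome))
        genome[insert_at:insert_at] = segment  # the duplication event
    return "".join(genome)
```

Word-frequency distributions of such model sequences can then be compared against the Poisson expectation, as is done for real genomes above.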
PTC: An Interactive Tool for Phylogenetic Tree Construction
Sami Khuri and Chen Yang
San Jose State University
Abstract
A phylogenetic tree represents the evolutionary history of
a group of organisms. Constructing phylogenetic trees is a
crucial step for finding out how today's species are
related to one another in terms of common ancestors.
Numerous computer tools, such as PHYLIP and PAUP, have been
developed to construct such trees. The algorithms
implemented in these tools are generally either character
or distance based. The first group uses individual
substitutions among the sequences, while distance based
algorithms construct trees based on pairwise distances
between the sequences.
In our work, we introduce the Phylogenetic Tree
Construction package (PTC): a novel interactive tool for
constructing phylogenetic trees. PTC currently supports
four well-known algorithms: Unweighted Pair Group Method
with Arithmetic Mean (UPGMA), Neighbor Joining, Fitch-
Margoliash, and Maximum Parsimony. The main reason behind
our project is the lack of interactivity in existing tools.
The existing packages, unlike our tool, only visualize the
resulting phylogenetic tree. Furthermore, the interaction
with those packages occurs only at the beginning, before
the program's execution. We strongly believe that
interactive tree construction can be extremely valuable to
bioinformaticians. Through interacting with PTC, users can
gain a deeper understanding of the algorithms, of the input
and output data and not just see the final tree generated
by an algorithm. We provide the capability to edit input
data, to view the tree construction step-by-step, and to
compare two consecutive states of the algorithm. Trees are
dynamically drawn and can be resized by the user. PTC is
implemented in Java and can be extended to include
additional algorithms.
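As a flavor of the algorithms PTC animates, here is a compact UPGMA sketch (an illustration only; branch lengths are omitted, and PTC itself is implemented in Java):

```python
# Minimal sketch: UPGMA repeatedly merges the closest pair of clusters
# and size-averages their distances to the remaining clusters.
def upgma(dist, labels):
    """dist: symmetric dict-of-dicts of pairwise distances; labels: taxa."""
    clusters = {l: 1 for l in labels}      # newick string -> cluster size
    d = {a: dict(row) for a, row in dist.items()}
    while len(clusters) > 1:
        a, b = min(((x, y) for x in clusters for y in clusters if x < y),
                   key=lambda p: d[p[0]][p[1]])
        na, nb = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        d[merged] = {}
        for c in clusters:
            # size-weighted average distance to the new cluster
            d[merged][c] = d[c][merged] = (na * d[a][c] + nb * d[b][c]) / (na + nb)
        clusters[merged] = na + nb
    return next(iter(clusters))

dist = {"A": {"B": 2.0, "C": 4.0}, "B": {"A": 2.0, "C": 4.0},
        "C": {"A": 4.0, "B": 4.0}}
print(upgma(dist, ["A", "B", "C"]))   # ((A,B),C)
```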
Iterative Rank Based Methods for Clustering
S.W. Perrey, H. Brinck and A. Zielesny
University of Applied Sciences of Gelsenkirchen,
Recklinghausen, Germany
Abstract
Recently a new clustering algorithm was developed that is useful
in phylogenetic systematics and taxonomy. It derives a
hierarchy from (dis)similarity data in a simple and rather
natural way, transforming a given dissimilarity by an
iterative approach. Each iteration step consists of ranking
the objects under consideration according to their pairwise
dissimilarity and calculating the Euclidean distance of the
resulting rank vectors. We investigate alterations of this
order of steps, as well as substituting the Euclidean distance
with standard statistical measures for series of estimates. We
evaluate the resulting procedures on biological
and other data sets of different structure with regard to their
underlying cluster systems. Thereby, the potential and limits
of this kind of iterative approach become apparent.
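One iteration of the transform can be written directly from the description above (an illustration, not the authors' implementation; the diagonal self-rank is retained for simplicity):

```python
# Minimal sketch: rank every object's dissimilarities to all others, then
# take Euclidean distances between the resulting rank vectors.
import math

def rank_vector(row):
    """Ranks of the entries of one dissimilarity row (1 = most similar)."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0] * len(row)
    for r, j in enumerate(order):
        ranks[j] = r + 1
    return ranks

def rank_transform(d):
    """d: square dissimilarity matrix (list of lists); returns new matrix."""
    ranks = [rank_vector(row) for row in d]
    n = len(d)
    return [[math.sqrt(sum((ranks[i][k] - ranks[j][k]) ** 2 for k in range(n)))
             for j in range(n)] for i in range(n)]
```

Iterating rank_transform until the induced hierarchy stabilizes yields the clustering; the variants studied here swap the Euclidean step for other statistical measures.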
Analysis of Phylogenetic Profiles by Bayesian Decomposition
Ghislain Bidaut, Karsten Suhre,
Jean-Michel Claverie and Michael Ochs
Bioinformatics Working Group, Fox Chase Cancer Center, Philadelphia, PA, USA;
Structural and Genetic Information Laboratory, UMR 1889 CNRS-AVENTIS, Marseille, France
Abstract
Antibiotic resistance together with the side effects of
broad spectrum antibacterials makes development of targeted
antibiotics of great interest. Here we demonstrate a
Bayesian approach for the analysis of phylogenetic data
aimed at identifying targets specific to species and genus
in bacteria. The data comprise a series of BLAST scores for
selected genes from multiple bacterial species compared to
the well-known genomes of E. coli and M. tuberculosis. As
organisms adapted to new niches during evolution, new genes
were created from lateral gene transfers, duplication and
mutation of existing genes, or merging of multiple genes. The
phylogenetic profiles observed today therefore result from
the superposition of fundamental genetic relationships that
cannot be inferred from single gene similarity. We have
applied Bayesian Decomposition (BD), an algorithm developed
to identify fundamental signals in mixtures, to identify the
fundamental patterns comprising multiple, overlapping
functional units of related genes.
Preliminary analysis shows that certain patterns are highly
conserved (correlation of 70-90% in the genes included) as
we increase the number of patterns used by BD. Some
patterns appear genus-specific, suggesting that sets of
genes have evolved early and been maintained in all related
species. In addition, a pattern linking a functional core
of genes common to all the studied organisms has been
identified. Further analysis should yield specific sets of
genes tied in functional units that are unique to species
and genus, providing potential targets for therapeutics at
the level of disrupting protein interactions or of
individual proteins.
Automating Recognition of Regions of Intrinsically Poor Multiple Alignment for Phylogenetic
Analysis Using Machine Learning
Yunfeng Shan and Evangelos E. Milios1;
Andrew J. Roger and Christian Blouin2; Edward Susko3
1Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, NS, Canada B3H 1W5
2Dept. of Biochemistry and Molecular Biology, Genome Atlantic/Genome Canada, Dalhousie University,
Halifax, Nova Scotia, Canada, B3H 1X5
3Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada, B3H 4H7
Abstract
Phylogenetic analysis requires alignment of gene
sequences. Automatic alignment programs produce regions
of intrinsically poor alignment that are currently
detected and deleted manually. We present the results of a
machine learning approach to detection of these regions of
the alignment. We compare naive Bayes, standard decision
trees, and support vector machines.
The results show that all three algorithms can accurately
identify the bad sites of a multiple sequence alignment
based on three attributes: the gap ratio, the site
likelihood and the degree of homoplasy (consistency
index). Among the three algorithms, naive Bayes and SVM
provide the best performance for bad-site prediction, but
C4.5 decision trees provide the best performance for the
ambiguous- and good-site predictions. Among the three
classes, the ambiguous sites and the good sites are the
most difficult to distinguish, while the bad sites are the
easiest. No evident difference is observed among the three
parsimony count indices as attributes for reducing
classification error. Generally, the naive Bayes and C4.5
decision tree classifiers learnt from a subset with a
balanced class distribution perform best compared with the
natural class distributions of the random subset and the
entire data set.
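A minimal sketch of the classification setup on the three attributes, using scikit-learn's Gaussian naive Bayes (the feature values below are invented for illustration; this is not the authors' pipeline):

```python
# Minimal sketch: classify alignment columns as good/ambiguous/bad from
# [gap_ratio, site_log_likelihood, consistency_index].
from sklearn.naive_bayes import GaussianNB

X_train = [[0.0, -1.2, 0.9],   # well-aligned column (illustrative values)
           [0.4, -3.5, 0.5],   # ambiguous column
           [0.8, -6.0, 0.2]]   # poorly aligned column
y_train = ["good", "ambiguous", "bad"]

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict([[0.7, -5.1, 0.3]]))   # -> ['bad']
```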
Reconstruction of Ancestral Gene Order Following Segmental Duplication and Gene Loss
Jun Huan, Jan F. Prins, Wei Wang and Todd J. Vision
Dept. of Computer Science, Univ. of North Carolina, Chapel Hill, NC
Abstract
As gene order evolves through a variety of chromosomal
rearrangements, conserved segments provide important
insight into evolutionary relationships and functional
roles of genes. However, gene loss within otherwise
conserved segments, as typically occurs following large
scale genome duplication, has received limited algorithmic
study. This has been a major impediment to comparative
genomics in certain taxa, such as plants and fish. When
large scale genome duplication and gene loss occur, how
well can we infer both the true gene order within ancestral
chromosomal segments and the ancestral ordering of those
segments?
We propose a heuristic algorithm for the inference of
ancestral gene order in a set of genomes for which at least
some genomic segments are partially related by common
ancestry to two or more different segments. It does not
require gene content and order to be perfectly conserved
among segments. First, conserved chromosomal regions are
identified using existing pairwise genomic alignment
algorithms. Second, segments are iteratively clustered
under the control of two parameters, (1) the minimal
required number of shared genes between two segments or
clusters and (2) the maximal allowed number of
rearrangement breakpoints along the lineage leading to each
descendant segment. Finally, we compute the estimated
ancestral gene order for each cluster.
We evaluate the performance of this algorithm on simulated
data that models a genome evolving by large-scale
duplication, duplicate gene loss, transposition,
translocation, and inversion. The results suggest that
ancestral gene orders may be estimated with sufficient
accuracy to substantially improve the detection sensitivity
of pairwise genomic alignment algorithms.
Reconstruction of Ancient Operons From Complete Microbial Genome Sequences
Yuhong Wang1, John Rose2, Bi-cheng Wang2 and Dawei Lin2
1Department of Molecular Biology, Jilin University, Changchun 130023, PRC
2Southeast Collaboratory, Biochemistry and Molecular Biology
Department, University of Georgia, Athens, GA 30602, USA
Abstract
Completed genomes not only provide DNA sequence information, but also
reveal the relative locations of genes. In this paper, we propose a new
method for reconstruction of "ancient operons" by taking advantage of
the evolutionary information in both orthologous genes and their
locations in a genome. The basic assumption is that the closer two genes
were in an ancient genome, the more likely they are to remain close in
the current genome. An assembly of non-random neighboring pairs of genes in
current genomes should be able to reconstruct the gene groups that were
together at a certain point in time during evolution. Given that
genes that are close neighbors are more likely to be functionally
related, the gene groups generated by this assembly process are named
"ancient operons."
The assembly is only meaningful when enough non-random pairs can be
found. This was made possible by over 100 microbial genomes available in
recent years. For proof of concept, we chose 63 non-redundant complete
microbial genomes from the RefSeq database [May 2003 release] at NCBI. In
order to normalize the effect of protein sequence mutations and other
changes due to evolution, we only consider assembly of COGs (Clusters of
Orthologous Groups) in these genomes. In total, 4901 COGs from the NCBI
COG database are used.
The assembly process is similar to the one that assembles DNA sequences
into contigs. In our case, the neighbor COG pairs are used as basic
assembly units. A target function is defined based on the neighbor
frequency of pairwise links among all 4901 COGs after analysis of all 63 genomes.
We use the random cost algorithm, a global optimization method, to
minimize the target function and assemble COGs into contigs. The
significance of these contigs is then assessed by statistical methods.
The results suggest that the assembled contigs are statistically and
biologically significant. This method and the assembled ancient operons
provide a new way of studying microbial genomes and their evolution, and
of annotating proteins of unknown function.
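The basic assembly units, neighboring COG pairs tallied across genomes, are easy to sketch. The genome representation below (an ordered list of COG identifiers) is an assumed simplification, not the authors' data model:

```python
# Minimal sketch: tally how often two COGs occur as chromosomal
# neighbors across a set of genomes.
from collections import Counter

def neighbor_counts(genomes):
    counts = Counter()
    for genome in genomes:                 # genome: ordered COG ids
        for a, b in zip(genome, genome[1:]):
            counts[frozenset((a, b))] += 1
    return counts

genomes = [["COG0001", "COG0002", "COG0003"],
           ["COG0002", "COG0001", "COG0004"]]
print(neighbor_counts(genomes)[frozenset(("COG0001", "COG0002"))])  # 2
```

Pairs that occur more often than random gene order would predict become the units that the random cost optimization chains into contigs.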
Predictive Methods
An Evolving Approach to Finding Schemas for Protein Secondary Structure Prediction
Huang, Hsiang Chi
Institute for Information Industry, Taiwan
Abstract
Determining accurate secondary structures of a protein
experimentally involves preparing a protein crystal, X-ray
scanning and computation, all of which are costly. Researchers have
developed methods to predict the secondary structure of a
protein since the 1960s. Recent methods predict protein
secondary structure through the use of new algorithms such
as HMMs (3), neural networks (2), new evolutionary databases
(2), etc.
These algorithms do help to predict protein secondary
structure. However, some algorithms are like "black boxes":
researchers don't understand the meanings of the enormous numbers
of parameters or how the prediction results come about, but
only accept them. This study intends to predict protein
secondary structure schemas by genetic algorithms.
In this research, a genetic algorithm has been applied to
predict building schemas of protein secondary structure.
This GAPS (Genetic Algorithm for Protein
Secondary Structure) achieved an average Q3 score of 55%-
65%. Although the highest Q3 of this study is not the
highest score among published studies, some fundamental and useful
building schemas of protein secondary structure
have been found.
Previous studies (e.g. those focused on the global free-energy
minimum of protein secondary structure) could not give us a
complete understanding of the driving forces behind protein
folding. Why?
Previous studies take every amino acid in the sequence
into consideration. However, the results of this
study suggest that not all the residues in a schema affect the
conformation of protein secondary structure. Only a few amino
acids actually affect the conformation of protein secondary
structure folding.
Gene Selection for Multi-class Prediction of Microarray Data
Dechang Chen, Dong Hua, Xiuzhen Cheng and Jaques Reifman
Uniformed Services University of the Health Sciences, Bethesda, MD
Abstract
Gene expression data from microarrays have been
successfully applied to class prediction, where the
purpose is to classify and predict the diagnostic category
of a sample by its gene expression profile. A typical
microarray dataset consists of expression levels for a
large number of genes on a relatively small number of
samples. As a consequence, one basic and important
question associated with class prediction is: how do we
identify a small subset of informative genes contributing
the most to the classification task? Many methods have
been proposed but most focus on two-class problems, such
as discrimination between normal and disease samples. This
paper addresses selecting informative genes for multi-
class prediction problems by jointly considering all the
classes simultaneously. Our approach is based on the power of
genes to discriminate among tumor types (classes) and on
the correlation between genes. We formulate the expression
levels of a given gene by a one-way analysis of variance
model with heterogeneity of variances, and determine the
discrimination power of the gene by a test statistic
designed to test the equality of the class means. In other
words, the discrimination power of a gene is associated
with a Behrens-Fisher problem. Informative genes are
chosen such that each selected gene has a high
discrimination power and the correlation between any pair
of selected genes is low. Several test statistics are
studied in this paper, such as the Brown-Forsythe test
statistic and the Welch test statistic. Their performances are
evaluated over a number of classification methods applied
to publicly available microarray datasets.
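For concreteness, the Welch statistic for a single gene can be computed as follows (a sketch of the standard formula, not the authors' code; every class needs at least two samples):

```python
# Minimal sketch: Welch's test statistic for one gene's expression levels
# across k classes with unequal variances -- one of the discrimination-
# power measures studied in the poster.
import statistics

def welch_statistic(groups):
    """groups: list of per-class expression-level lists for one gene."""
    k = len(groups)
    n = [len(g) for g in groups]
    m = [statistics.fmean(g) for g in groups]
    v = [statistics.variance(g) for g in groups]
    w = [n_i / v_i for n_i, v_i in zip(n, v)]          # precision weights
    grand = sum(wi * mi for wi, mi in zip(w, m)) / sum(w)
    num = sum(wi * (mi - grand) ** 2 for wi, mi in zip(w, m)) / (k - 1)
    lam = sum((1 - wi / sum(w)) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    den = 1 + (2 * (k - 2) / (k * k - 1)) * lam
    return num / den
```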
Molecular Evaluation Using Comparative Molecular Interaction Profile Analysis System
Yoshiharu Hayashi, Katsuyoshi Sakaguchi, Nao Iwata and Masaki Kobayashi
KLIMERS Co., Ltd., Japan
Abstract
Creating a new molecular description factor based on the
results of computational docking studies will add new
dimensions to molecular evaluation. We propose a new
molecular description factor analysis system named
Comparative Molecular Interaction Profile Analysis system
(CoMIPA) in which the AutoDock program is used for docking
evaluation of small molecule compound-protein complexes.
Interaction energies are calculated, and the data sets
obtained are named interaction profiles (IPFs). Using IPF
as a scoring indicator, the system could be a powerful tool
to cluster the interacting properties between small
molecules and bio macromolecules such as ligand-receptor
bindings. For clustering, we used hierarchical
clustering, which has visual advantages but a number
of shortcomings for the CoMIPA study. So, in addition to
hierarchical clustering, we tried Kohonen's Self-
Organizing Map (SOM), a kind of neural network
that learns the features of multidimensional data without
supervision. SOMs have a number of features that make
them particularly well suited to clustering and analysis of
interaction profiles. Further development of the system
will enable us to predict the adverse effects of a drug
candidate.
Probe Design for Large-scale In Situ Hybridization and Oligo DNA Microarrays: An Application of the New NIA Gene Index
Vincent VanBuren, Toshiyuki Yoshikawa, Toshio Hamatani, Alexei A. Sharov,
Yong Qian, Dawood B. Dudekula, Minoru S.H. Ko
NIH/NIA
Abstract
Large-scale molecular biology techniques such as the use of
oligo DNA microarrays are widely used to gain an
appreciation of global transcription in biological time
series, tissue contrast, classification and prognosis, and
response-to-treatment experiments. Recent efforts to
perform large-scale in situ hybridizations have used cDNA
probes. An approach that makes use of designed oligo probes
should offer improved consistency at uniform hybridization
conditions and improved specificity, as demonstrated by
various oligo microarray platforms. While many tools exist
to aid in probe design, most tools are not suitable for
large-scale automation of selection and there are no freely
available tools that optimize probe selection by
considering the complex interaction of the physical
properties and cross-reactivity of the probe as measured by
microarray studies. We describe a new Web-based application
that takes FASTA-formatted sequence and some simple
parameters as input, and returns both a list of the best
choices for probes and a full report containing possible
alternatives. Designing probes for microarrays uses a
specialized probe scoring routine that optimizes probe
intensity based upon an artificial neural network (ANN)
trained to predict the average probe intensity from the
physical properties of the probe. Determining probe cross-
reactivity requires a high-quality gene index that includes
both gene and transcript information. The new NIA gene
index was applied to this effort by creating a BLAST
database of the index and searching all potential probes
against it. This new tool should provide a reliable way to
construct probes that maximize signal intensity while
minimizing cross-reactivity.
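As a toy stand-in for the physical-property side of probe scoring (the tool itself scores probes with an ANN trained on measured intensities), one might enumerate candidate windows and compute simple properties such as GC fraction and a rough Wallace-rule melting temperature:

```python
# Illustrative sketch only -- these two properties are examples, not the
# tool's actual feature set.
def gc_fraction(probe):
    return sum(probe.count(b) for b in "GC") / len(probe)

def wallace_tm(probe):
    """Tm ~ 2(A+T) + 4(G+C); a rough rule intended for short oligos."""
    return (2 * sum(probe.count(b) for b in "AT")
            + 4 * sum(probe.count(b) for b in "GC"))

def candidate_probes(seq, length=20):
    for i in range(len(seq) - length + 1):
        p = seq[i:i + length]
        yield i, p, gc_fraction(p), wallace_tm(p)
```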
Gene Selection for Cancer Classification Using Bootstrapped Genetic Algorithms and Support Vector Machines
Xue-wen Chen
California State University
Abstract
The gene expression data obtained from microarrays have
proven useful in cancer classification. DNA microarray data
have extremely high dimensionality compared to the small
number of available samples. An important step in
microarray based cancer classification is to remove genes
irrelevant to the learning problem and to select a small
number of genes expressed in biological samples under
specific conditions. In this paper, we propose a novel
system for selecting a set of genes for cancer
classification. A wrapper method is employed: a linear
support vector machine is used to evaluate the fitness of
subsets of genes and a genetic algorithm is employed to
search for a subset of genes which discriminate samples in
two classes well. To overcome the problem of the small
size of training samples, bootstrap methods are incorporated
into the genetic search. This new system is very efficient for
selecting sets of genes in very high dimensional feature
space for cancer classification. Two databases are
considered: the colon cancer database and the leukemia
database. Our experimental results show that the proposed
method is capable of finding genes that discriminate
between normal cells and cancer cells and it generalizes
well (this is particularly important in microarray data
classifications, since only very limited training samples
are available).
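The wrapper fitness can be sketched as the bootstrap-averaged out-of-bag accuracy of a linear SVM on a candidate gene subset (an illustration under assumed details, not the author's system):

```python
# Minimal sketch: GA fitness of a gene subset = mean out-of-bag accuracy
# of a linear SVM over bootstrap resamples of the small training set.
import numpy as np
from sklearn.svm import LinearSVC

def fitness(X, y, gene_subset, n_boot=10, seed=0):
    rng = np.random.RandomState(seed)
    Xs = X[:, gene_subset]
    y = np.asarray(y)
    scores = []
    for _ in range(n_boot):
        idx = rng.choice(len(y), size=len(y), replace=True)
        oob = np.setdiff1d(np.arange(len(y)), idx)   # out-of-bag samples
        if len(oob) == 0:
            continue
        clf = LinearSVC().fit(Xs[idx], y[idx])
        scores.append(clf.score(Xs[oob], y[oob]))
    return float(np.mean(scores))
```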
A Statistical Model of Proteolytic Digestion
I-Jeng Wang, Christopher P. Diehl and Fernando J. Pineda
Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Rd., Laurel, MD
Abstract
We present a stochastic model of proteolytic digestion of a
proteome, assuming (1) the distribution of parent protein
lengths in the proteome, (2) the relative abundances of the
20 amino acids in the proteome and (3) the
digestion “rules” of the enzyme used in the digestion.
Unlike hypothesis tests in most protein identification
software, the developed model accounts for the fact that
digestion products come from the mixture of proteins that
constitutes a microorganism’s proteome. We believe that
incorporation of a more rigorous model of proteolytic
digestion in the hypothesis test would significantly
improve the selectivity and sensitivity of the assay. Our
approach is based on the observation that, when overlaid
with the cleavage process, a protein sequence can be
equivalently modeled as a regenerative process with the
cleavage sites as the regeneration points. Hence the
regenerative cycles between these points (the fragments)
are i.i.d. Therefore, Wald's first lemma can be applied to
the stochastic process of fragmentation, leading to closed
form expressions for the distribution of fragment lengths
with a minor approximation to account for the first
fragment. With this methodology we derived a closed form
expression for the fragment mass distribution for a large
class of enzymes, including the widely used trypsin. The
expression uses the distribution of lengths in a mixture of
proteins taken from a proteome, as well as the relative
abundances of the 20 amino acids in the proteome. The
agreement between theory and the in silico digest is
excellent.
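The model's central observation can be made concrete with a simplified trypsin rule (cut after K or R, ignoring the proline exception the real enzyme obeys), under which internal fragment lengths are geometric. An illustration, not the authors' derivation:

```python
# Minimal sketch: in silico digestion with a simplified trypsin rule, plus
# the geometric length law implied by i.i.d. residues with per-position
# cleavage probability p = P(K) + P(R).
import re

def tryptic_fragments(protein):
    """Cut after K or R (simplified rule, no proline exception)."""
    return re.findall(r".*?[KR]|.+?$", protein)

def geometric_length_pmf(length, p):
    """P(fragment length = L) under the i.i.d. model."""
    return (1 - p) ** (length - 1) * p

print(tryptic_fragments("MKWVTFISLLR"))   # ['MK', 'WVTFISLLR']
p = 0.058 + 0.054                         # illustrative K and R abundances
print(geometric_length_pmf(10, p))
```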
Preliminary Wavelet Analysis of Genomic Sequences
Jianchang Ning, Charles N. Moore and James C. Nelson
University of Delaware, Delaware Biotechnology Institute, 15 Innovation Way Newark, DE
Abstract
Large genome-sequencing projects have made urgent the
development of accurate methods for annotation of DNA
sequences. Existing methods combine ab initio pattern
searches with knowledge gathered from comparison with
sequence databases or from training sets of known genes.
However, the accuracy of these methods is still far from
satisfactory. In the present study, wavelet algorithms, in
combination with an entropy method, are being developed as an
alternative way to determine gene locations in genomic DNA
sequences. Wavelet methods seek periodicity present in
sequences. Periodicity, due in general to nonrandomness of
nucleotide usage associated with the triplet nature of
codons, is an important target of most gene-parsing computer
programs. A promising advantage of wavelets is their
adaptivity to varying lengths of coding/non-coding regions.
Moreover, the wavelet methods integrated with the entropy
method simply search the information content of the
sequences and do not need to be trained. Since
this development is not yet complete, only
preliminary results are presented. These results, based on
a sample of genomic DNA sequences, show that the wavelet
approach is feasible and may be better than some
knowledge-dependent approaches.
Using Easel for Modeling and Simulating the Interactions of Cells in Order to Better Understand
the Basics of Biological Processes and to Predict their Likely Behaviors
Vojislav Stojkovic, Grace Steele and William Lupton
Morgan State University, Baltimore, MD 21251
Abstract
Modeling and simulation play a significant role in
bioinformatics, helping researchers to better understand
biological processes and to predict their likely behaviors.
Easel (Emergent Algorithm Simulation Language and
Environment) is a new general-purpose modeling and simulation
programming language. Easel is designed to model and
simulate systems with large numbers of interacting actors.
It can be used to model and simulate problems
with large numbers of interacting components that lack
central control and have incomplete and imprecise information.
We use Easel for modeling and simulating the interactions
of cells in order to better understand the basics of
biological processes. We have developed many Easel
programs to solve different biological problems.
For example the "Message Passing" Easel program simulates
the transmission of signal from one cell to another based
on the distance between cells.
The following scenario summarizes the model (a Python analog of it appears after the list):
- There is a passive cell population; cells are
graphically represented by circles of different radii and colors;
- Cells move randomly within their own limited "areas";
- The researcher can randomly click any passive cell to
activate the cell (change radius and color). This action
initiates the transmission of a signal between cells;
- The signal is transmitted to the closest passive cell;
- When the new cell is activated (change color and
radius), the old cell will be deactivated;
- The new cell continues the signal transmission to the
closest passive cell;
- The researcher may continue to click on passive cells to
activate them and start the parallel signal transmission.
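Easel source is not shown in the abstract; as a language-neutral illustration of the scenario, here is a Python analog of the nearest-neighbor signal handoff:

```python
# Minimal sketch (a Python analog; the poster's programs are written in
# Easel): an activated cell passes the signal to the closest passive cell
# and then deactivates.
import math
import random

cells = [{"pos": (random.random(), random.random()), "active": False}
         for _ in range(20)]

def step(active):
    """Pass the signal from the active cell to the nearest passive cell."""
    passive = [c for c in cells if not c["active"]]
    nxt = min(passive, key=lambda c: math.dist(active["pos"], c["pos"]))
    active["active"], nxt["active"] = False, True
    return nxt

cell = cells[0]
cell["active"] = True          # the researcher "clicks" a cell
for _ in range(5):             # the signal hops between nearest neighbors
    cell = step(cell)
```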
Fold Recognition Using Sequence Fingerprints of Protein Local Substructures
Andriy Kryshtafovych, Torgeir R. Hvidsten,
Jan Komorowski and Krzysztof Fidelis
Lawrence Livermore National Lab, Livermore, CA 94550
Abstract
Modern structure prediction methods can consistently
produce reliable structural models for protein sequences
with more than 25% sequence identity to proteins with known
structure. But even if no proteins with significant
similarity can be detected for the protein of interest,
there is still a chance that it can be assembled with local
blocks existing in structural archives.
We have developed the method of local descriptors of
protein structure that could detect common local structural
environments in proteins and organize them into a limited
number of shape similarity classes so that representatives
from these classes could be used as elementary building
blocks to reconstruct native protein structures or model
unknown folds. Here we discuss application of this approach
to fold recognition problem.
The descriptors consist of several short backbone fragments
that are close to each other in 3D space and do not overlap
on the sequence. Their number (usually up to 6) and length
(5-25 residues) are not fixed and depend on the backbone's
local conformation and relative positioning of amino acid
side chains. We built 374,558 descriptors that encompass 3
or more backbone segments for 4006 structural domains
represented in SCOP and sharing less than 40% sequence
identity with each other. Grouping descriptors according
to their SCOP attributes and gathering them into classes of
similarity, we built a library of groups containing sets of
sequence fragments with geometrically similar local
structures.
Analysis of the groups was carried out to establish
relationships between sequences of the segments and
specific geometrical conformations. Using the detected
sequence-based fingerprints, groups were assigned to target
sequences.
The ability of the approach to recognize correct SCOP folds
was tested on 273 sequences from the 49 most popular folds.
Good predictions were obtained in 86% of cases. No
performance drop was observed with decreasing sequence
similarity between target sequences and protein sequences
used to build the library.
An Iterative Loop Matching Approach to the Prediction of RNA Secondary Structures with Pseudoknots
Jianhua Ruan and Weixiong Zhang
Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63112
Abstract
Pseudoknots have generally been excluded from the
prediction of RNA secondary structures due to the
difficulty in modeling and complexity in computing.
Although several dynamic programming algorithms exist for
the prediction of pseudoknots using thermodynamic
approaches, they are neither reliable nor efficient. On
the other hand, comparative methods are more reliable, but
are often done in an ad hoc manner and require expert
intervention. Maximum weighted matching (Tabaska et al.,
Bioinformatics, 14:691-9, 1998), an algorithm for
pseudoknot prediction with comparative analysis, suffers
from low prediction accuracy in many cases. Here we
present an algorithm, iterative loop matching, for
predicting RNA secondary structures including pseudoknots
reliably and efficiently. The method can utilize either
thermodynamic or comparative information, or both, and is
thus able to make predictions for both aligned sequences and
individual sequences. We have tested the algorithm on a number of RNA
families, including structures both with and without
pseudoknots. Using 8-12 homologous sequences, the
algorithm correctly identifies more than 90% of base-pairs
for short sequences and 80% overall. It correctly predicts
nearly all pseudoknots. Furthermore, it produces very few
spurious base-pairs for sequences without pseudoknots.
Comparisons show that our algorithm is both more sensitive
and more specific than the maximum weighted matching
method. In addition, our algorithm has high prediction
accuracy on individual sequences, comparable to the PKNOTS
algorithm (Rivas & Eddy, J Mol Biol, 285:2053-68, 1999),
while using far fewer computational resources.
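For context, the maximum weighted matching baseline that the authors compare against can be sketched with a general-purpose matching routine; this is not the authors' ILM code, and the complementarity scores and the networkx call below are illustrative stand-ins for covariation or thermodynamic scores:

    import networkx as nx

    # Toy base-pair scores; covariation or stacking scores would go here.
    SCORE = {frozenset("AU"): 2, frozenset("GC"): 3, frozenset("GU"): 1}

    def mwm_pairs(seq, min_loop=3):
        G = nx.Graph()
        for i in range(len(seq)):
            for j in range(i + min_loop + 1, len(seq)):
                w = SCORE.get(frozenset(seq[i] + seq[j]), 0)
                if w:
                    G.add_edge(i, j, weight=w)
        # Matched pairs need not be nested, so pseudoknots appear naturally.
        return nx.max_weight_matching(G)

Because the matching is computed in one shot, spurious pairs are common; as we read the abstract, the iterative loop matching idea is to alternate between committing to confident pairings and re-matching the remainder.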
A New Method for Predicting RNA Secondary Structure
Hirotoshi Taira, Tomonori Izumitani, Takeshi Suzuki and Eisaku Maeda
NTT Communication Science Laboratories, 2-4, Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237 Japan
Abstract
It has recently become clear that many RNAs
are not translated into proteins but instead act as
important functional molecules in organisms. These RNAs are
called "non-coding RNAs."
If their secondary structures can be identified correctly,
the functions of specific RNAs can be predicted more easily,
which will be useful in the development of new medicines.
Nussinov's and Zuker's algorithms are the two conventional
algorithms for predicting RNA secondary structure.
In this paper, we focus on Nussinov's algorithm, which uses
dynamic programming to search for remote base pairs.
To improve the algorithm's performance, we added a new
scoring mechanism and controlled the loop's minimum length.
For the experiment, we used the original Nussinov (NA) and
improved Nussinov (NB) algorithm, as well as the Nussinov
algorithm using SCFG (SA) and the improved Nussinov
algorithm using SCFG (SB). The algorithms were evaluated by
accuracy and F-measure. On average, the F-measure rises from
0.333 to 0.711 due to the improvements from NA to NB.
Moreover, the F-measure rises from 0.577 to 0.714 due to the
improvements from SA to SB. Accuracy evaluation shows
similar results.
In going from NA to NB, the F-measure increased for 11 of
the 18 sequences under the loop-length restriction, and for
9 sequences when G-U pairs were considered. Our experimental
results indicate that the proposed approach effectively
improves the performance of Nussinov's algorithm.
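The abstract does not reproduce the modified recurrence, but the underlying Nussinov dynamic program with a controllable minimum loop length and G-U pairs admitted can be sketched as follows; the pair_score hook marks where a new scoring mechanism could plug in (all names here are ours):

    PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
             ("G", "U"), ("U", "G")}          # G-U wobble pairs included

    def nussinov(seq, min_loop=3, pair_score=lambda a, b: 1):
        """Maximum-score base pairing; at least `min_loop` unpaired
        bases are required inside every hairpin loop."""
        n = len(seq)
        dp = [[0] * n for _ in range(n)]
        for span in range(1, n):
            for i in range(n - span):
                j = i + span
                best = max(dp[i + 1][j], dp[i][j - 1])  # i or j unpaired
                if j - i > min_loop and (seq[i], seq[j]) in PAIRS:
                    best = max(best, dp[i + 1][j - 1]
                               + pair_score(seq[i], seq[j]))
                for k in range(i + 1, j):               # bifurcation
                    best = max(best, dp[i][k] + dp[k + 1][j])
                dp[i][j] = best
        return dp[0][n - 1]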
Minimum-Redundancy Feature Selection from Microarray Gene Expression
Chris Ding and Hanchuan Peng
Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720
Abstract
Selecting a small subset of genes out of the thousands of
genes in microarray data is important for
accurate classification of phenotypes. These "marker" genes also
provide additional biological insights.
Widely used methods typically rank genes according to their
differential expression among phenotypes and pick the
top-ranked genes. Feature sets so obtained often have
considerable redundancy: the mutual information or
correlation among the selected features is high.
In this paper, we propose the minimum-redundancy
maximum-relevance (MRMR) methods to select features that
represent a broader spectrum of phenotype characteristics
than those obtained through standard ranking methods; such
features are more robust and generalize well to unseen data.
We perform comprehensive tests on three well-known
multi-class gene expression data sets, NCI cancer cell
lines (9 classes), Lung cancer (7 classes), and Lymphoma
(9 classes), and on two 2-class data sets, Leukemia and
Colon cancer.
Feature sets selected via our MRMR method lead to
leave-one-out cross-validation (LOOCV) accuracies of
98.3% for NCI, 97.3% for Lung, 96.9% for Lymphoma, 100% for
Leukemia, and 93.5% for Colon. These results are
substantially better than those of standard feature
selection methods, and also better than those reported in
the literature.
We also performed forward feature selection with a wrapper
approach on these data sets. The resulting feature sets,
although obtained with far more computation, are not as good
as those selected via our MRMR method. Our extensive tests
also show that discretizing the gene expression values leads
to clearly better classification accuracy than using the
original continuous data.
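A minimal sketch of greedy minimum-redundancy maximum-relevance selection in its difference ("relevance minus redundancy") form, assuming discretized expression values; function and variable names are ours:

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def mrmr(X, y, k):
        """Greedily pick k columns of the discretized matrix X that have
        high mutual information with the class labels y but low average
        mutual information with the already-selected columns."""
        n_genes = X.shape[1]
        relevance = np.array([mutual_info_score(X[:, g], y)
                              for g in range(n_genes)])
        selected = [int(np.argmax(relevance))]
        while len(selected) < k:
            best_g, best_score = None, -np.inf
            for g in range(n_genes):
                if g in selected:
                    continue
                redundancy = np.mean([mutual_info_score(X[:, g], X[:, s])
                                      for s in selected])
                score = relevance[g] - redundancy
                if score > best_score:
                    best_g, best_score = g, score
            selected.append(best_g)
        return selected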
Sequence Comparison
A Linear Programming Based Algorithm for Multiple Sequence Alignment
Fern Y. Hunt, Anthony J. Kearsley and Abigail O'Gallagher
National Institute of Standards and Technology, MD
Abstract
In this talk we will discuss an approach to multiple
sequence alignment based on treating the alignment
process as a stochastic control problem. The
use of a model based on a Markov decision process (MDP)
leads to a linear programming problem. Our goal is to avoid
the expense in time and computation associated with the use
of dynamic programming methods when aligning large numbers
of sequences. The formulation naturally allows for various
constraints, e.g., on the number of indels, and in some cases
one can characterize the sensitivity of the alignment to
changes in the cost function. The expected discounted cost
for both the primal and dual problem will be discussed.
Under some ergodicity assumptions, one can show pathwise
optimality in an asymptotic sense with probability one.
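The abstract leaves the linear program implicit; for a discounted-cost MDP the textbook formulation, which we assume is the kind of LP meant, maximizes a weighted sum of state values subject to one constraint per state-action pair:

    \max_V \; \sum_s \alpha(s)\, V(s)
    \quad \text{subject to} \quad
    V(s) \;\le\; c(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s')
    \quad \forall s, a

Here V(s) is the optimal expected discounted cost from alignment state s, c(s,a) the immediate cost of an edit action a (match, mismatch, or indel), gamma the discount factor, and alpha a positive weighting over states; the dual problem is over discounted state-action occupation measures.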
Alignment-Free Sequence Comparison with Vector Quantization and Hidden Markov Models
Tuan Pham
Griffith University, Australia
Abstract
Alignment-free sequence comparison of biological data is
still a very recent area of research relative to
alignment-based methods [1]. Apart from clustering
analysis [2], there are two main categories, known as the
word (n-gram) frequency-based and the information-based
methods [1]. This paper presents a new approach to
alignment-free biological sequence comparison that overcomes
at least some of the problems encountered by the
frequency-based method in both computational speed and
numerical analysis. It has been reported [1,3] that for long
sequences the frequency-based approach encounters memory
problems, that the use of the Mahalanobis distance causes
singularity in the inversion of covariance matrices, and
that the computational process takes a considerably long
time.
In this study, a given long sequence of alphabets is
transformed into numbers and then vectorized according to a
prescribed vector size. This collection of vectors is then
represented by a codebook that contains a number of
template vectors as its own identity. Based on the
codebook for each individual sequence generated by vector
quantization, a hidden Markov model (HMM) is then built for
each sequence. Since an HMM has been built for each
sequence, comparisons of similarity/dissimilarity between
sequences can be performed in terms of a distance measure on
the likelihoods, such as cross-entropy, divergence, and
discrimination information.
The proposed approach is tested on several publicly
available biological datasets [2]. Comparisons
are also made with the frequency-based method with
different distance measures. The proposed approach has
been found to be efficient for real-time processing and
provides an effective alternative for alignment-free
sequence comparison.
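A minimal sketch of such a VQ-plus-HMM pipeline, assuming scikit-learn for the codebook and the hmmlearn package for the HMMs, and simplifying to a single shared codebook (the method described above builds one codebook per sequence); the window size, code counts, and distance form are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans
    from hmmlearn import hmm

    CODE = {c: i for i, c in enumerate("ACDEFGHIKLMNPQRSTVWY")}

    def vectorize(seq, w=8):
        """Map letters to numbers and cut into non-overlapping vectors."""
        x = np.array([CODE[c] for c in seq], dtype=float)
        n = (len(x) // w) * w
        return x[:n].reshape(-1, w)

    def hmm_distance(seq_a, seq_b, n_codes=16, n_states=4):
        va, vb = vectorize(seq_a), vectorize(seq_b)
        codebook = KMeans(n_clusters=n_codes, n_init=10).fit(np.vstack([va, vb]))
        sa = codebook.predict(va).reshape(-1, 1)
        sb = codebook.predict(vb).reshape(-1, 1)
        ma = hmm.CategoricalHMM(n_components=n_states, n_iter=50).fit(sa)
        mb = hmm.CategoricalHMM(n_components=n_states, n_iter=50).fit(sb)
        # Symmetrized per-symbol log-likelihood gap, a divergence-style measure.
        return 0.5 * ((ma.score(sa) - mb.score(sa)) / len(sa)
                      + (mb.score(sb) - ma.score(sb)) / len(sb))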
REFERENCES
[1] S. Vinga and J. Almeida, Alignment-free sequence comparison - a review, Bioinformatics, 19(4) (2003) 513-523.
[2] E.R. Dougherty et al., Inference from clustering with application to gene-expression microarrays, J. Computational Biology, 9(1) (2002) 105-126.
[3] http://www.bioinformatics.musc.edu/resources.html
Prediction of Protein Function Using Signal Processing of Biochemical Properties
Krishna Gopalakrishnan, Kayvan Najarian and Dean Scott Warren
University of North Carolina, Charlotte, NC
Abstract
We present a technique for determining the biological
function of a protein from its primary sequence using
advanced signal processing. Currently, the majority of
protein classification methods make use of multiple
alignments. We instead use signal-processing features
obtained from the primary sequence of the protein to predict
its biological function. The primary sequence is first
converted to signals based on the encoding of biochemical
properties such as the hydrophobicity, solubility, and
molecular weight of the constituent amino acids.
Signal-processing features such as complexity, mobility, and
fractal dimension are then extracted from these signals and
used to classify the proteins. Sample studies conducted on
lipase, protease, and isomerase proteins between 100 and 200
amino acids long support the proposed method.
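As an illustration of the kind of features involved (the abstract does not define them exactly), the Hjorth mobility and complexity of a hydrophobicity-encoded sequence can be computed as follows; the Kyte-Doolittle scale is one common hydrophobicity encoding:

    import numpy as np

    # Kyte-Doolittle hydropathy values for the 20 amino acids.
    KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9,
          "A": 1.8, "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3,
          "P": -1.6, "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5,
          "K": -3.9, "R": -4.5}

    def hjorth(signal):
        """Hjorth mobility and complexity of a 1-D signal."""
        d1, d2 = np.diff(signal), np.diff(np.diff(signal))
        v0, v1, v2 = np.var(signal), np.var(d1), np.var(d2)
        mobility = np.sqrt(v1 / v0)
        complexity = np.sqrt(v2 / v1) / mobility
        return mobility, complexity

    def sequence_features(seq):
        return hjorth(np.array([KD[a] for a in seq]))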
Genomic Sequence Analysis Using Gap Sequences and Pattern Filtering
Shih-Chieh Su, Chia H. Yeh, and C. C. Jay Kuo
University of Southern California, Los Angeles, CA
Abstract
In this research, a new pattern filtering technique based on
gap sequences is developed for genomic sequence analysis; in
a gap sequence, the distances between consecutive
occurrences of the same symbol are recorded as a sequence of
integers. A joint decision over a family of gap sequences,
generated using selected patterns, is used to perform genomic
sequence alignment and similarity checking. The effects of
basic operations on genomic sequences are studied under
the gap sequence framework. Several tools are developed,
based on a Poisson-distribution approximation of the gap
values, to analyze the gap sequences, such as histogram-aided
alignment, conditional entropy over gap knowledge, and the
merging of unwanted segments. Simulation results show that
the extension of gap match indicates the corresponding
segment extension in the original genomic sequence. With
the proposed algorithm, we are able to generalize the
conventional sequence alignment methods in a more adaptive
way. A match between gap sequences is considered a frame
match, while a true match requires both a frame match and a
stuffing match. A conditional entropy measure has been
proposed to predict the probability of a true match
conditioned on a frame match. Extensive experimental
results will be presented to demonstrate the proposed
method.
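The basic gap-sequence transformation itself is straightforward; a minimal sketch:

    def gap_sequence(seq, symbol):
        """Distances between consecutive occurrences of `symbol` in `seq`."""
        positions = [i for i, c in enumerate(seq) if c == symbol]
        return [b - a for a, b in zip(positions, positions[1:])]

    # gap_sequence("ACGTAGA", "A") -> [4, 2]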
An Optimal DNA Segmentation Based on the MDL Principle
Wojciech Szpankowski1,
Wenhui Ren1, and Lukasz Szpankowski2
1Purdue University
2University of Michigan, Ann Arbor
Abstract
A major challenge facing computational biology is the post-sequencing analysis of genomic
DNA and other biological sequences. The biological world is highly stochastic as well as inhomogeneous
in its behavior. Indeed, there are regions in DNA with high
concentration of G or C bases; stretches of sequences with an abundance of CG dinucleotide (CpG islands);
coding regions with a strong periodicity-of-three pattern, and so forth. The transitions between homogeneous
and inhomogeneous regions, also known as change points, carry important biological information. Computational
methods used to
identify these homogeneous regions are called segmentation. Our goal is to employ rigorous methods of
information theory to quantify structural properties of DNA
sequences. In particular, we adopt the Stein-Ziv lemma to find an asymptotically optimal discriminant
function that determines whether two DNA segments are generated by the same source while assuring
exponentially small false positives. Then we apply the Minimum Description Length (MDL) principle
to select parameters of our
segmentation algorithm so that the underlying DNA sequence has the smallest possible length (best
compression). The MDL principle is adopted since recent analyses indicate that biological sequences
are well compressed (by nature). Finally, we perform extensive
experimental work on human chromosome 22 data. In particular, we observe that
grouping A and G (purines) and T and C (pyrimidines) leads to better segmentation of chromosome
22 and identification of biologically meaningful transition points.
Let us give a more precise description of our methods. We partition a DNA sequence (e.g., chromosome 22)
into fixed length blocks. Guided by universal data compression, we shall follow Shamir and Costello and
set the length of the block to minimize the average redundancy (MDL principle) which turns out to be slightly bigger than
log N, where N is the length of the DNA sequence in base pairs (bps).
Then we invoke the Stein-Ziv lemma (of hypothesis testing and universal data compression) and apply an
asymptotically optimal discriminant (based on
Jensen-Shannon) to determine whether two blocks are generated by the same
source. If the discriminant function is positive, then a change point (i.e., change of distributions)
between these blocks is expected. Since the length of a block is of order O(log N) bps, we subdivide
the blocks that potentially contain the change points
into subblocks of smaller length loglog N (to assure the best fit of data from the MDL principle point of
view). The optimal discriminant function is again applied to these
subblocks. Once the change points are found we compute the entropy of these
segments (between change points) to identify (hidden) sources that generate (segments of) the sequence.
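A minimal sketch of the Jensen-Shannon discriminant between two blocks (the decision threshold, which the paper derives via the Stein-Ziv lemma and the MDL principle, is left as a free parameter here):

    import numpy as np

    def dist(block, alphabet="ACGT"):
        counts = np.array([block.count(c) for c in alphabet], dtype=float)
        return counts / counts.sum()

    def js_divergence(p, q):
        """Jensen-Shannon divergence (base-2) between two distributions."""
        m = 0.5 * (p + q)
        kl = lambda a, b: np.sum(a[a > 0] * np.log2(a[a > 0] / b[a > 0]))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def change_point_suspected(block1, block2, threshold):
        return js_divergence(dist(block1), dist(block2)) > threshold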
The Cybertory Sequence File System (SFS) for Managing Large DNA Sequences
Carl E. McMillin and Robert H. Horton
Attotron Biosensor Corporation
Abstract
Many queries that researchers run against
DNA sequences involve matching short patterns against large
targets, e.g., finding restriction sites or transcription
factor binding sites in a genome. If patterns are known
beforehand, the target may be pre-indexed to facilitate
rapid execution of queries that combine several of the
smaller patterns. The Cybertory Sequence File System is a
cross-platform hybrid file-management/database designed for
storage and pre-indexing of large DNA sequences. Using a
combination of C++ and Java components, the SFS addresses
three major functional domains: "Resource Management" for
constraining runtime CPU and virtual-memory
utilization; "Data Translation/Segmentation" for minimizing
persistent-storage requirements and improving access
performance; and "Query Processing" for fast index- and
heuristics-based processing. Resource Management handles
dynamically allocated resources: CPUs and large memory
buffers, operating-system file handles, and
compressors/encoders are allocated, reassigned, and
destroyed as necessary. The Data Translation/Segmentation
domain imports and exports DNA sequence data from/to FASTA
files and adaptively encodes this data into an internal
binary format, organized for speed of access, flexibility
of use, and density of storage. Adaptable modules are used
for compression and encoding. The Query-Processing domain
searches for patterns in the imported data, persists the
results, and combines results in various ways. The source
code is available under the GNU General Public License. Funded by
NIH grant #R44 RR13645 02A2 to Attotron Biosensor
Corporation.
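To illustrate the pre-indexing idea at the simplest level (this is not the SFS's internal format), a toy k-mer position index over a target sequence:

    from collections import defaultdict

    def build_kmer_index(target, k=6):
        """Map every k-mer of `target` to the positions where it starts."""
        index = defaultdict(list)
        for i in range(len(target) - k + 1):
            index[target[i:i + k]].append(i)
        return index

    index = build_kmer_index("ACGTACGTGAATTC")
    print(index["GAATTC"])   # EcoRI restriction site -> [8]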
Implementing Parallel Hmm-pfam on the EARTH Multithreaded Architecture
Weirong Zhu, Yanwei Niu, Jizhu Lu and Guang R. Gao
Department of Electrical and Computer Engineering, University of Delaware
Abstract
Hmmpfam is a widely used bioinformatics program for
sequence classification, provided by Sean Eddy's lab at
Washington University in St. Louis. Using state-of-the-art
multithreading concepts, we implement a new
parallel version of hmmpfam on EARTH (Efficient
Architecture for Running Threads), an event-driven,
fine-grain multithreaded architecture and program
execution model. In order to parallelize
hmmpfam, we develop two parallel schemes: one pre-determines
the job distribution across all computing nodes using a
round-robin algorithm; the other takes advantage of the
dynamic load-balancing support of the EARTH Runtime System
2.5, which makes the job distribution completely transparent
to the user. In this poster, we analyze the hmmpfam program
and the different parallel schemes, and show our test
results on various computing platforms in comparison with
the PVM version of hmmpfam. When matching 250 sequences
against a 585-family HMMER database on 18 dual-CPU
computing nodes, the PVM version achieves an absolute
speedup of 18.50, while the EARTH version achieves 30.91, a
40.1% improvement in execution time. We also tested our program on
the advanced supercomputing cluster Chiba City at Argonne
National Laboratory. When the seqfile contains 38192
sequences, and the HMMer database has 50 families, the
EARTH version achieves an absolute speedup of 222.8 on 128
dual-CPU nodes, reducing the total execution time from 15.9
hours (for the serial program) to only 4.3 minutes.
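The first scheme is easy to sketch (a hypothetical helper, not the EARTH implementation):

    def round_robin(jobs, n_nodes):
        """Statically assign jobs to nodes in round-robin order."""
        buckets = [[] for _ in range(n_nodes)]
        for i, job in enumerate(jobs):
            buckets[i % n_nodes].append(job)
        return buckets

The dynamic scheme instead lets the runtime hand the next unprocessed sequence to whichever node is idle, which is what makes the distribution transparent to the user.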
Genome on Demand: Interactive Substring Searching
Tamer Kahveci and Ambuj K. Singh
University of California Santa Barbara
Abstract
Motivation: The explosive growth of genome databases makes similarity
search a challenging problem. Current search tools are
non-interactive in the sense that the user must wait a long
time while the entire database is inspected.
Results: We consider the problem of interactive string searching,
and propose two innovative k-NN (k-Nearest Neighbor) search
algorithms. For a given query, our techniques start
reporting the best results found so far immediately. Later,
these partial results are periodically refined depending on
the user satisfaction. We split the genome strings into
substrings, and maintain a small fingerprint for these
substrings. This fingerprint is then stored using an index
structure. We propose a new model to compute the distance
distribution of a query string to a set of strings with the
help of their fingerprints. Using this distribution, our
first technique orders the MBRs (Minimum Bounding
Rectangles) of the index structure based on their order
statistics. We also propose an early pruning strategy to
reduce the total search time for this technique. Our second
technique exploits existing statistical models to define
an order on the index structure MBRs. We also propose a
method to compute the confidence levels for the partial
results. Our experiments show that our techniques can
achieve 75% accuracy within the first 2.5-35% of the
iterations and 90% accuracy within the first 12-45% of
the iterations for both DNA and protein strings.
Furthermore, the reported confidence levels reflect the
quality of the partial results accurately.
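A toy sketch of the interactive behavior: cheap fingerprint distances order the candidate substrings, and an exact (expensive) distance refines a best-so-far list that is reported after every step. The k-mer-count fingerprints and plain edit distance below are simplified stand-ins for the paper's fingerprints and distance model:

    import numpy as np
    from itertools import product

    def fingerprint(s, k=2, alphabet="ACGT"):
        kmers = ["".join(p) for p in product(alphabet, repeat=k)]
        return np.array([s.count(m) for m in kmers], dtype=float)

    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def interactive_knn(query, substrings, k=5):
        """Yield progressively refined k-NN results, cheapest-looking first."""
        qf = fingerprint(query)
        order = sorted(range(len(substrings)), key=lambda i:
                       np.linalg.norm(qf - fingerprint(substrings[i])))
        best = []
        for i in order:
            best = sorted(best + [(edit_distance(query, substrings[i]), i)])[:k]
            yield [idx for _, idx in best]   # partial result, refined each step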
Automatic Selection of Parameters for Sequence Alignment
Jiangping Zhou and Gary Livingston
University of Massachusetts, Lowell
Abstract
We have shown that simulated annealing search can be used
to find close to optimal settings for parameters used in a
modified version of the DDS (DNA-DNA search) program. The
modified DDS algorithm is a very efficient biological
sequence alignment algorithm with linear space complexity.
However, the DDS algorithm appears to be seldom used. We
conjecture that this is because it has 7 parameters, or
cutoffs, that must be set manually, which requires
heuristics learned through experience and usually
involves several trials. The optimal settings for these
parameters are highly dependent on the type of sequence
being compared and on the sequences themselves. We use the
average high-scoring chain score to measure the "goodness"
of the resulting alignments, and then use a simulated
annealing algorithm to search automatically within a
space of parameter values to maximize this goodness
measure. We tested our method using pairs of DNA sequences
and compared our method to manually selecting parameters
for DDS. Our results show that close to optimal parameter
settings are very difficult to find manually. Our simulated
annealing search was able to find good settings for DDS
parameters for all DNA sequences we tried. A surprising
finding is that there may be several combinations of
parameters that yield close to optimal alignments. We
conclude that our approach can quickly and automatically
find parameters which will allow DDS to make close to
optimal alignments.
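A generic simulated annealing loop of the kind described, maximizing a goodness score over parameter settings; the score function, which would run the modified DDS and return the average high-scoring chain score, and the neighbor move are left abstract:

    import math, random

    def anneal(score, params, neighbor, t0=1.0, alpha=0.95, steps=500):
        """Maximize `score` by annealed local search over parameter settings."""
        cur, cur_s = params, score(params)
        best, best_s, t = cur, cur_s, t0
        for _ in range(steps):
            cand = neighbor(cur)
            s = score(cand)
            # Always accept improvements; accept worse moves with
            # probability exp((s - cur_s) / t), which shrinks as t cools.
            if s > cur_s or random.random() < math.exp((s - cur_s) / t):
                cur, cur_s = cand, s
                if s > best_s:
                    best, best_s = cand, s
            t *= alpha
        return best, best_s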
CoMRI: A Compressed Multi-Resolution Index Structure for Sequence Similarity Queries
Hong Sun, Ozgur Ozturk and Hakan Ferhatosmanouglu
Ohio State University
Abstract
In this paper, we present CoMRI (Compressed Multi-Resolution
Index), our system for fast sequence similarity search in
DNA sequence databases. We employ the concept of Virtual
Bounding Rectangles (VBRs) to build a compressed, grid-style
index structure. An advantage of the grid format over trees
is that subsequence location information is given by the
position of the corresponding VBR in the VBR list. Thanks to
VBRs, our index structure fits easily into a reasonable
amount of memory. Together with a new optimized
multi-resolution search algorithm, query speed is improved
significantly. Extensive performance evaluations on human
chromosome sequence data show that VBRs save 80%-93% of the
index storage size compared to MBRs, and that the new search
algorithm prunes almost all unnecessary VBRs, guaranteeing
efficient disk I/O and CPU cost. According to our
experiments, CoMRI is at least 100 times faster than MRS,
another grid index structure introduced very recently.
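As a 1-D stand-in for the idea that a flat, grid-ordered list of bounding boxes encodes subsequence locations implicitly (the actual index uses multi-resolution feature vectors):

    import numpy as np

    def build_window_boxes(signal, w=16):
        """One (min, max) box per consecutive window; the list position
        itself encodes the window offset: boxes[k] covers
        signal[k*w : k*w + w], so no location pointers need be stored."""
        return [(signal[i:i + w].min(), signal[i:i + w].max())
                for i in range(0, len(signal) - w + 1, w)]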
Systems Biology
Pathway Logic Modeling of Protein Functional Domains in Signal Transduction
C. Talcott, S. Eker, M. Knapp, P. Lincoln and K. Laderoute
SRI International
Abstract
Cells respond to changes in their environment through
biochemical signaling
pathways that detect and transmit information to effector
molecules within
different cellular compartments. Protein functional domains
(PFDs) are
consensus sequences within signaling molecules that
recognize and bind other
signaling components to make complexes.
Pathway Logic is an application of techniques from formal
methods to the
modeling and analysis of signal transduction networks in
mammalian cells.
These signaling network models are developed using
Maude [http://maude.cs.uiuc.edu], a symbolic language founded
on rewriting
logic. Models can be queried (analyzed) using the
execution, search and
model-checking tools of the Maude system. We show how
signal transduction
processes can be modeled using Maude at very different
levels of abstraction
involving either an overall state of a protein or its PFDs
and their
interactions.
Pathway Logic is an example of how formal modeling
techniques can be used to
develop a new science of symbolic systems biology. This
computational
science will provide researchers with powerful tools to
facilitate the
understanding of complex biological systems and accelerate
the design of
experiments to test hypotheses about their functions in vivo.
We will illustrate our approach using the biochemistry of
signaling
involving the ubiquitous Raf-1 serine/threonine protein
kinase. The
Raf-1 kinase is a proximal effector of EGFR and other RTK
signaling
through the ERK1/2 MAPK pathway, which contains the kinase
cascade
Raf-1 => Mek => Erk.
Examples of queries and graphical representations of the
results will be shown.
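Pathway Logic models are written in Maude itself; as a language-neutral illustration of the rewriting idea only (the rules and species names below are ours and grossly simplified, skipping Ras and other intermediates, and this is not Pathway Logic syntax), a toy forward-chaining run:

    # Each rule: (premise set of species, species produced when it fires).
    RULES = [
        ({"EGF", "EGFR"}, "active-Raf1"),
        ({"active-Raf1"}, "active-Mek"),
        ({"active-Mek"}, "active-Erk"),
    ]

    def run(state):
        """Fire rules until no rule adds anything new (a reachability query)."""
        changed = True
        while changed:
            changed = False
            for premise, product in RULES:
                if premise <= state and product not in state:
                    state.add(product)
                    changed = True
        return state

    print(run({"EGF", "EGFR"}))   # reaches active-Erk via the cascade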
Development of a Massively-Parallel, Biological Circuit Simulator
Richard Schiek and Elebeoba May
Sandia National Laboratories
Abstract
Genetic expression and control pathways can be successfully modeled as
electrical circuits. Given the vast quantity of genomic data, very large
and complex genetic circuits can be constructed. To tackle such
problems, the massively-parallel, electronic circuit simulator, Xyce (TM),
is being adapted to address biological problems. Unique to this
biocircuit simulator is the ability to simulate not just one or a set of
genetic circuits in a cell, but many cells and their internal circuits
interacting through a common environment.
Currently, electric circuit analogs for common biological and chemical
machinery have been created. Using such analogs, one can
construct expression, regulation and reaction networks. Individual
species can be connected to other networks or cells via non-diffusive or
diffusive channels (i.e. regions where species diffusion limits mass
transport). Within any cell, a hierarchy of networks may exist operating
at different time-scales to represent different aspects of cellular
processes.
Though still under development, this simulator can already model
interesting biological and chemical systems. Prokaryotic genetic and
metabolic regulatory circuits have been constructed and their
interactions simulated for Escherichia coli's tryptophan biosynthesis
pathway. Additionally, groups of cells each containing an internal
reaction network and communicating via a diffusion limited
environment can produce periodic concentration waves. Thus, this
biological circuit simulator has the potential to explore large, complex
systems and environmentally coupled problems.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a
Lockheed Martin Company, for the United States Department of Energy
under Contract DE-AC04-94AL85000.
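As a toy illustration of the circuit analogy (not Xyce input): two mutually repressing genes written as leaky integrators, with Hill-type production playing the role of a controlled current source and first-order degradation that of a resistive leak; parameters are arbitrary:

    from scipy.integrate import solve_ivp

    def toggle(t, x, k=4.0, n=2.0, deg=1.0):
        a, b = x
        dadt = k / (1 + b**n) - deg * a   # b represses production of a
        dbdt = k / (1 + a**n) - deg * b   # a represses production of b
        return [dadt, dbdt]

    sol = solve_ivp(toggle, (0, 50), [1.0, 0.2])
    print(sol.y[:, -1])   # from an asymmetric start, one gene dominates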
Representing and Reasoning About Signal Networks: An Illustration Using NFkappaB Dependent Signaling Pathways
Chitta Baral1, Karen Chancellor1,
Nam Tran1 and Nhan Tran2
1Arizona State University
2Translational Genomics Research Institute
Abstract
We propose a formal language to represent and reason about
signal transduction networks. The existing approaches using
graphical representations, Petri nets, and computer formal
languages fall short in many ways, and our work suggests that
an artificial intelligence (AI) approach is well suited for
the task. We applied a form of action language to represent
and reason about NFkappaB dependent signaling pathways. Our
language supports essential features of signal transduction
knowledge, namely reasoning with partial (or incomplete)
knowledge, non-monotonic reasoning, reasoning with
uncertainty, reasoning about triggered evolutions of the
world and elaboration tolerance. We selected NFkappaB
dependent signaling as the test bed because of its
increasingly important role in cellular functions. NFkappaB
is a central mediator of the immune response. It can also
regulate stress responses, as well as cell death/survival,
in several cell types. While many extracellular signals may
lead to the activation of NFkappaB, few of the related
pathways have been elucidated. We can then study different problems:
representation of pathways, reasoning with pathways,
planning to alter the outcomes, and predicting new pathways.
We show that all these problems can be well formulated in
our framework. Many interesting problems also emerge. Taken
together, our work shows that a formal representation of
signal networks is feasible and practical with AI reasoning
approaches. The work also shows that it is important to be
able to formally formulate the signal transduction problems,
as it provides a solid research methodology as well as
promising research directions.
Noise Attenuation in Artificial Genetic Networks
Yoshihiro Morishita and Kazuyuki Aihara
University of Tokyo, Japan
Abstract
Owing to advances in genetic engineering techniques, we
are now able to artificially create genetic networks that
have specific functions such as switching and oscillation.
This area is very important from the viewpoint of
engineering and medical applications.
However, the dynamics of genetic networks are known to be
noisy and unstable because of the discreteness and
stochasticity that arise from the small numbers of
molecules involved.
Exploring methods to control these fluctuations is therefore
an important problem in designing networks that operate with
high stability and reliability.
In this study, we propose a new and plausible method for
controlling fluctuations in artificial genetic networks.
The main idea is that fluctuations in gene expression are
reduced through interactions between the synthesized
proteins and noise-attenuating molecules, which are
designed, on the basis of the proteins' domain structure,
to interact with them specifically.
We analytically derive the noise-attenuation level by
approximating the master equation that governs the dynamics
of a single gene's expression.
We also show that the amount of noise attenuator and the
rates of its interaction with the proteins are important for
the attenuation level.
In addition, we demonstrate the method's efficiency in
stabilizing system dynamics with multiple stable equilibrium
points, using the toggle switch model.
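For intuition about the fluctuations being attenuated, a minimal Gillespie simulation of a single birth-death expression process; the attenuator of the proposed method would add reversible binding reactions, omitted here:

    import numpy as np

    def gillespie_birth_death(k_prod, k_deg, t_end, seed=0):
        """Exact stochastic simulation of production and degradation
        of one molecular species; returns the sampled trajectory."""
        rng = np.random.default_rng(seed)
        t, n, traj = 0.0, 0, []
        while t < t_end:
            rates = np.array([k_prod, k_deg * n])
            total = rates.sum()
            t += rng.exponential(1.0 / total)
            n += 1 if rng.random() < rates[0] / total else -1
            traj.append((t, n))
        return traj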
Computational Inference of Regulatory Pathways in Microbes: An Application to Phosphorus
Assimilation Pathways in Synechococcus WH8102
Z. Su1, P. Dam1, X. Chen2,
V. Olman1, T. Jiang1, B. Palenik3 and Y. Xu1
1Protein Informatics Group, Divisions of Life Sciences and Computer Sciences & Mathematics,
Oak Ridge National Laboratory
2Department of Computer Science and Engineering, Univ. of California at Riverside
3Scripps Institute of Oceanography, University of California at San Diego
Abstract
We present a computational protocol for inference of
regulatory and signaling pathways in a microbial cell,
through mining "high-throughput" biological data of various
types, literature search, and computer-assisted human
inference. This protocol consists of four key components:
(a) construction of template pathways for microbial gnomes
related to the target genome, which have either been
thoroughly studied or had a significant amount of relevant
experimental data, (b) inference of target pathways for the
target genome, by combining the template pathway models and
target genome-specific information, (c) assignment of
genes of the target genome to each individual functional
roles in the pathway model, and (d) validation and
refinement of the pathway models using pathway-specific
experimental data or other information. To demonstrate the
effectiveness of this procedure, we have applied this
computational protocol to the construction of the
phosphorus assimilation pathways in Synechococcus WH8102.
In this paper, we present a model of the core component of
this pathway and provide our justification in support of
the predicted model.
Miscellaneous
Development and Assessment of Bioinformatics Tools to Enhance Species Conservation and Habitat Management
Melanie A. Sutton, Lori Deneke, John Eme, Wayne Bennett and Frank
Wray
Departments of Biology and Computer Science, University of West Florida
Abstract
This project encompasses our interdisciplinary approach to
integrating computational methods into the knowledge-
discovery process associated with understanding biological
systems impacted by the loss or destruction of sensitive
habitats. Data mining is used to intelligently query
databases to extract meaningful information and to
elucidate broad patterns that facilitate overall data
interpretation. Developed visualization techniques present
mined data in ways where context, perceptual cues, and
spatial reasoning skills can be applied to help reveal
potential impacts of conservation efforts and to uncover
significant trends in behavioral patterns, habitat use,
species diversity, and community composition. In this
context, we explore the following systems:
Marginal Fish Habitats: This project involves the
development of an Internet-searchable database for data
associated with the investigation of fish diversity and
community structure in coral reefs in Indonesia. This
database will be used to assist in conducting
biodiversity assessments of a sensitive habitat,
facilitating the development of resource management
practices for reef-associated systems.
Beach Mouse Communities: This project involves the
development of an Internet-searchable database and
innovative multi-media-based visualization strategies
(e.g., virtual tours) for tracking the population
dynamics of the Santa Rosa Beach Mouse (SRBM). The
bioinformatics tools associated with this project will be
used to study how habitat fragmentation impacts the
population dynamics of the SRBM, with the goal of
facilitating the development of better management
practices for related subspecies.
The design and assessment of our tools emphasizes support
for basic research as well as the creation of mechanisms
for using the tools to support regional and international
inter-institutional educational objectives in the areas of
pattern recognition and human-computer interaction.
MedfoLink: Bridging the Gap between IT and the Medical Community
Vezen Wu1, Mitchell Berman, M.D.2, Armen Kherlopian3 and
Joseph Gerrein3
1Columbia Univ. IT Programs
2Columbia Presbyterian Hospital
3Columbia Univ. Biomedical Engineering
Abstract
MedfoLink is a new software technology that applies a
novel design to address the medical-records issues that are
now part of the national agenda. The majority of medical
records are still kept on paper, which raises problems of
patient privacy, lost patient histories, and clerical errors
that can cause deaths. Our software
overcomes the vocabulary and performance limitations that
hinder the adoption of existing medical language processing
technology for handling medical records. MedfoLink is a
Java technology that uses medical language processing and
the UMLS to enable a computer to accurately record and
interpret data from patient records. Benefits include
security to ensure patient privacy, consolidated patient
histories, real-time access to patient records, and the
elimination of clerical errors. Applications range from
enhanced individual patient care to public health concerns
like bioterrorism.
MedfoLink unifies different technologies to transform
the data from patient records into a relational database
for computer analysis. MedfoLink draws its knowledge from
the Unified Medical Language System (UMLS), a medical
language database created by the NIH. MedfoLink's design
incorporates two novel approaches: 1. intelligent
algorithms that improve its comprehension of medical records
with practice; 2. an Adaptive Learning Interface (ALI)
architecture that enhances performance by interfacing with
existing technologies. MedfoLink generates reports
containing statistics on patient data, which will improve
the ability of healthcare professionals to identify trends
in the patient population.
Our interdisciplinary team from Columbia University
developed MedfoLink to address both a technological
challenge and a market need, with broad implications for
saving lives.
TC-DB: A Membrane Transport Protein Classification Database
Can Van Tran, Nelson M. Yang and Milton H. Saier Jr.
University of California, San Diego
Abstract
TC-DB is a comprehensive relational database containing
functional and
evolutionary information on transmembrane transport proteins. The
database contains data extracted from more than 9000 references,
covering approximately 3000 representative proteins, classified in over
380 different families. TC-DB is the primary resource for the retrieval
of transporter data classified by the TC system. The TC system, which has
been adopted by the IUBMB (International Union of Biochemistry and
Molecular Biology), is analogous to the EC (Enzyme Commission)
system's classification of enzymes. The functional ontology developed
for TC-DB serves as an infrastructure to develop powerful queries that
yield new biological insight. The TC-DB website offers a plethora of
software tools that facilitate the analysis of membrane transporters.
Multiple avenues of access are supported by the web interface
including classification drill-down, parametric searching, and full-
text searching. Various uses of TC-DB include software applications for the
annotation of transport proteins in newly sequenced genomes as well as
tools to trace evolutionary pathways of transport proteins. The
database as well as web services are accessible free of charge at http://
tcdb.ucsd.edu.