Poster Abstracts for
Section B Poster Sessions II and III
- Genomic Annotation -
Prediction of DtxR Regulon:
A Computational Approach to Identify Physiological Processes
Controlled by Iron in Corynebacterium Species
A New Approach to
Gene Prediction Using the Self-Organizing Map
On Gene Prediction
by Cross-Species Comparative Sequence Analysis
CUBIC: Identification of Regulatory Binding Sites Through Data Clustering
- Molecular Simulation -
What Makes IgG Binding Domain of Protein L Fold Up to Native State: A Simulation
Study with Physical Oriented Energy Functions Coupled to Topology Induced Terms
New Computational Methods for Electrostatics in Macromolecular Simulation
The Approximate Algorithm for Analysis of the Strand Separation
Transition in Superhelical DNA Using Nearest Neighbor Energetics
Substrate Recognition by Enzymes: A Theoretical Study
Computational Simulation of Lipid Bilayer Reorientation at Gaps
- Phylogeny and Evolution -
Search for Evolution-related-oligonucleotides and Conservative Words in rRNA Sequences
High Speed GAML-based Phylogenetic Tree Reconstruction Using HW/SW Codesign
Evidence for Growth of Microbial Genomes by Short Segmental Duplications
PTC: An Interactive Tool for Phylogenetic Tree Construction
Iterative Rank Based Methods for Clustering
Analysis of Phylogenetic Profiles by Bayesian Decomposition
Automating Recognition of Regions of Intrinsically Poor Multiple Alignment
for Phylogenetic Analysis Using Machine Learning
Reconstruction of Ancestral Gene Order Following Segmental
Duplication and Gene Loss
Reconstruction of Ancient Operons From Complete Microbial Genome Sequences
- Predictive Methods -
An Evolving Approach to Finding Schemas for Protein Secondary Structure Prediction
Gene Selection for Multi-class Prediction of Microarray Data
Molecular Evaluation Using Comparative Molecular Interaction Profile Analysis System
Probe Design for Large-scale In Situ Hybridization and Oligo DNA Microarrays:
An Application of the New NIA Gene Index
Gene Selection for Cancer Classification Using Bootstrapped
Genetic Algorithms and Support Vector Machines
A Statistical Model of Proteolytic Digestion
Preliminary Wavelet Analysis of Genomic Sequences
Using Easel for Modeling and Simulating the Interactions
of Cells in Order to Better Understand the Basics of Biological Processes and to Predict Their Likely Behaviors
Fold Recognition Using Sequence Fingerprints of Protein Local Substructures
An Iterative Loop Matching Approach to the Prediction of RNA Secondary Structures with Pseudoknots
A New Method for Predicting RNA Secondary Structure
Minimum-Redundancy
Feature Selection from Microarray Gene Expression
- Sequence Comparison -
A Linear Programming Based Algorithm for Multiple Sequence Alignment
Alignment-Free Sequence Comparison with Vector Quantization and Hidden Markov Models
Prediction of Protein Function Using Signal Processing of Biochemical Properties
Genomic Sequence Analysis Using Gap Sequences and Pattern Filtering
An Optimal DNA Segmentation Based on the MDL Principle
The Cybertory Sequence File System (SFS) for Managing Large DNA Sequences
Implementing Parallel Hmm-pfam on the EARTH Multithreaded Architecture
Genome on Demand: Interactive Substring Searching
Automatic Selection of Parameters for Sequence Alignment
CoMRI: A Compressed Multi-Resolution Index Structure for
Sequence Similarity Queries
- Systems Biology -
Pathway Logic Modeling of Protein Functional Domains in Signal Transduction
Development of a Massively-Parallel, Biological Circuit Simulator
Representing and Reasoning About Signal Networks: An Illustration Using NFkappaB Dependent Signaling Pathways
Noise Attenuation in Artificial Genetic Networks
Computational Inference of Regulatory Pathways in Microbes:
An Application to Phosphorus Assimilation Pathways in Synechococcus WH8102
- Miscellaneous -
Development and Assessment of Bioinformatics Tools to
Enhance Species Conservation and Habitat Management
MedfoLink: Bridging the Gap between IT and the Medical Community
TC-DB: A Membrane Transport Protein Classification Database
Genomic Annotation
Prediction of DtxR Regulon: A Computational Approach to Identify Physiological Processes
Controlled by Iron in Corynebacterium Species
Sailu Yellaboina, Prachee, Akash Ranjan, and Syed E. Hasnain
Centre for DNA Fingerprinting and Diagnostics, India
Abstract
Background: The diphtheria toxin repressor (DtxR) of
Corynebacterium diphtheriae has been shown to be an iron-sensitive transcription regulator that controls the
expression of iron uptake genes and diphtheria toxin. This
study aimed to increase understanding of the DtxR-regulated genes and their role in the physiology of
Corynebacterium.
Result: We developed a user-friendly online software tool
to identify the potential binding sites for any regulator
based on a Shannon relative entropy scoring scheme. Known
DtxR binding sites of C. diphtheriae were used to generate
a position-specific preference profile for DtxR, which was
used by our method to identify potential DNA binding
sites within the C. glutamicum genome. In addition, DtxR-regulated
operons were identified by taking into account
the predicted DtxR regulatory sites, transcription
termination sites and gene order conservation. The analysis
predicted a number of DtxR-regulated operons/genes whose
orthologues code for iron uptake systems, such as
siderophore transport systems, hemin transport systems,
hemolysins and heme transporters. The analysis also
predicts genes whose orthologues code for ferritin and the
starvation-inducible DNA binding protein (Dps), which are
involved in iron storage and oxidative stress defence in
various bacteria. In addition, we found that genes coding
for orthologues of the adaptive response regulator
(Ada) and endonuclease VIII (Nei), both involved in DNA
repair, could be regulated by the diphtheria toxin repressor.
Conclusions:
The methodology used to predict DtxR-iron regulated genes
proved highly effective, as our results agreed with
experimental data observed in other bacterial systems. In
addition, our finding of many new DtxR-iron regulated genes
reveals the physiological processes controlled by DtxR-iron
in various Corynebacterium species. Our findings show that
hemolysis, hemin uptake and heme oxidation might be an
essential iron-acquisition pathway for various
Corynebacterium pathogens at low levels of extracellular
iron. Our data analysis reveals that DtxR coordinately
controls the genes involved in iron uptake, iron
storage, oxidative stress defence and DNA repair to counter
the low levels of iron as well as ferrous-iron-induced
oxidative damage via the Fenton reaction.
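A relative-entropy profile scan of the kind described above can be sketched compactly. The following Python is a minimal illustration only, not the authors' online tool; the pseudocount and threshold are assumptions for exposition.

```python
# Minimal sketch (not the authors' tool): build a position-specific profile
# from known binding sites, then score genome windows with a log-odds
# (relative-entropy-style) score against background base frequencies.
import math
from collections import Counter

def build_profile(sites, pseudocount=0.5):
    """Column-wise base frequencies from aligned known binding sites."""
    profile = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        total = len(sites) + 4 * pseudocount
        profile.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return profile

def site_score(window, profile, background=0.25):
    """Sum of per-position log-odds; high scores mark candidate sites."""
    return sum(math.log2(profile[i][b] / background)
               for i, b in enumerate(window))

def scan(genome, profile, threshold):
    w = len(profile)
    for pos in range(len(genome) - w + 1):
        s = site_score(genome[pos:pos + w], profile)
        if s >= threshold:
            yield pos, s
```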
A New Approach to Gene Prediction Using the Self-Organizing Map
Shaun Mahony1, Aaron Golden2, James McInerney3, and Terry Smith1
1National Centre for Biomedical Engineering Science, NUI,
Galway, Ireland
2Dept. of Information Technology, NUI, Galway, Ireland
3Bioinformatics and Pharmacogenomics Laboratory, NUI, Maynooth,
Ireland
Abstract
Computational gene prediction methods have yet to achieve a
perfect accuracy rate, and many make a substantial number
of false-positive predictions, even in prokaryotic genomes.
One of the most obvious reasons for inaccurate gene
predictions is the high degree of compositional variation
that exists within most genomic sequences. For example, it
has long been recognized that synonymous codon usage is
highly variable and subject to many evolutionary pressures.
Many Markov-model-based gene-finding tools use only one
model to represent protein coding regions in any given
genome, and so are less likely to predict genes with an
unusual composition. Indeed, it has been shown that using
two or three models substantially increases the accuracy of
a Markov-based gene-finder (Hayes & Borodovsky, 1998).
In this poster we present a new neural network based
approach to gene prediction that has the ability to
automatically identify all the major patterns of content
variation within a genome. The genome may then be scanned
for regions displaying the same properties as one of these
automatically identified models. Even using a relatively
simple coding measure (relative synonymous codon usage),
this method can predict the location of protein-coding
sequences with a reasonably high accuracy, and with a
specificity that is higher than Markov-based approaches in
many cases. We also show other advantages of the approach,
such as the ability to indicate genes that contain frame-
shifts. We believe that this method has the potential to
become a useful addition to the genome annotation process.
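The coding measure used here, relative synonymous codon usage, is straightforward to compute. A minimal sketch follows (not the poster's implementation; the genetic-code table is abbreviated to three synonymous families for brevity):

```python
# Minimal sketch (illustration only): relative synonymous codon usage
# (RSCU) for a reading-frame window -- the kind of feature vector a
# self-organizing map could be trained on.
from collections import Counter

SYNONYMS = {
    "F": ["TTT", "TTC"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}

def rscu(seq):
    """RSCU(c) = count(c) / mean count over c's synonymous family."""
    codons = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    values = {}
    for family in SYNONYMS.values():
        total = sum(codons[c] for c in family)
        for c in family:
            values[c] = codons[c] * len(family) / total if total else 0.0
    return values
```

Each window's RSCU vector would then be mapped to its best-matching node of the trained SOM, and windows resembling one of the automatically identified compositional models flagged as putative coding regions.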
On Gene Prediction by Cross-Species Comparative Sequence Analysis
Rong Chen and Hesham Ali
Department of Computer Science, College of Information Science and Technology,
University of Nebraska at Omaha, Omaha, NE 68182-0116
Abstract
Sequencing of large fragments of genomic DNA, as well as
complete eukaryotic chromosomes of many organisms, makes it
possible to apply comparison of genomic sequences for
identification of protein-coding regions. This approach is
based on the fact that protein-coding regions evolve much
slower than non-coding regions. Thus, candidate exons are
seen as islands of similarity in alignment to genomic
sequences that harbor homologous genes. Recently, several
algorithms have been described for automated gene
recognition by genomic comparison. Most of these programs
are specifically designed for the comparison of closely
related species. However, non-coding regions may also be
conserved between species whose evolutionary distance is
too small. We conducted a comparative analysis of
homologous genomic sequences of organisms at different
evolutionary distances and found that non-coding regions
are conserved between closely related organisms. In
contrast, more distant comparisons show much less intron
similarity, but also less conservation of exon structure
(in terms of exon number, length and sequence similarity).
Based on this finding and training on data sets, we
proposed a model by which coding sequence could be
identified by comparing sequences of multiple species, both
close and appropriately distant. The reliability of the proposed method is
evaluated in terms of sensitivity and specificity, and
results are compared to the ones obtained by other popular
gene prediction programs. Provided sequences can be found
from other species at appropriate evolutionary distances,
this approach could be applied in newly sequenced organisms
where no species-dependent statistical models are
available.
CUBIC: Identification of Regulatory Binding Sites Through Data Clustering
Victor Olman1, Dong Xu1 and Ying Xu1,2
1Protein Informatics Group, Life Sciences Division
2Computer Science and Mathematics Division
Oak Ridge National Laboratory, Oak Ridge, TN
Abstract
Transcription factor binding sites are short fragments in
the upstream regions of genes, to which transcription
factors bind to regulate the transcription of genes into
mRNA. Computational identification of transcription factor
binding sites remains an unsolved challenging problem
though a great amount of effort has been put into the study
of this problem. We have recently developed a novel
technique for identification of binding sites from a set of
upstream regions of genes, that could possibly be
transcriptionally co-regulated and hence might share
similar transcription factor binding sites. By utilizing
two key features of such binding sites (i.e., their high
sequence similarities and their relatively high frequencies
compared to other sequence fragments), we have formulated
this problem as a cluster identification problem, that is,
identifying and extracting data clusters from a noisy
background. While the classical data clustering problem
(partitioning a data set into clusters sharing common or
similar features) has been extensively studied, there is no
general algorithm for solving the problem of identifying
data clusters from a noisy background. In this paper, we
present a novel algorithm for solving such a problem. We
have proved that a cluster identification problem, under
our definition, can be rigorously and efficiently solved
through searching for substrings with special properties in
a linear sequence. We have also developed a method for
assessing the statistical significance of each identified
cluster, which can be used to rule out accidental data
clusters. We have implemented the cluster identification
algorithm and the statistical significance analysis method
in the software package CUBIC. Extensive testing of CUBIC
has been carried out. We present here a few applications of
CUBIC on challenging cases of binding site identification.
Molecular Simulation
What Makes IgG Binding Domain of Protein L Fold Up to Native State: A Simulation
Study with Physical Oriented Energy Functions Coupled to Topology Induced Terms
Seung Yup Lee, Yoshimi Fujitsuka,
Do Hyun Kim and Shoji Takada
Department of Chemical and Biomolecular Engineering and Center for Ultramicrochemical
Process Systems, Korea Advanced Institute of Science and Technology
Abstract
Finding relationships between the folding mechanisms of
proteins and their linear sequences remains a fundamental
question despite extensive research. The native
topology is found to control the folding mechanisms and
structural distributions of small proteins indicating that
interactions within native topology bias a protein to
native state. In addition, proteins sharing similar native
topology show almost the same folding pathways although the
sequence homology between them is quite low. Therefore,
protein topology may be more important than any other
specific interactions related to sequences. On the other
hand, some proteins show different folding
scenarios in which native topology no longer plays a
dominant role in selecting folding pathways and kinetics.
In particular, owing to the structural symmetry of its
native state, the IgG binding domain of protein L, which
consists of two beta hairpins and a central alpha helix, is
known to have folding dynamics determined not primarily by
native topology but by sequence-specific interactions. In this study, the folding
of protein L is characterized by molecular simulation with
a coarse-grained peptide chain representation and physical
effective energy functions (EEFs) taking into account
solvent induced effects coupled to topological energies.
The propensities of secondary structure formations as well
as structural distributions along folding pathways are
analyzed in detail. Moreover, predicted results for
protein L are also compared with the folding of its structural
analog, the IgG binding domain of protein G, to find general
rules in determining the folding of small globular proteins.
New Computational Methods for Electrostatics in Macromolecular Simulation
Igor Tsukerman
Dept. of Electrical & Computer Eng., The University of Akron
Abstract
Electrostatic effects are known to play a major role in the
behavior of macromolecules in solvents. Computer simulation
of these effects remains quite challenging due to a large
number of charges or particles involved, varying dielectric
constants, and ionic interactions in the solvent.
The paper introduces new difference schemes that can
incorporate any desired analytical approximation of the
electrostatic potential (for example, its singular
Coulombic or dipole terms) exactly, and with little
computational overhead.
For EXPLICIT solvent models, the schemes have several
salient advantages: (i) optimal computational cost with
respect to the number of atoms; (ii) real arithmetic only;
(iii) any boundary conditions (not necessarily periodic)
easily implemented; (iv) no FFTs (more efficient
parallelization); (v) 10-15 times higher accuracy in the
computed energy and force, as compared to the published
results on conventional methods with similar parameters
(e.g. smooth PME with 4th order interpolation). Numerical
experiments for varying mesh sizes and number of atoms will
be presented.
For IMPLICIT solvent models, not only the singular
Coulombic terms but also derivative jumps of the potential
at solute-solvent interfaces can be analytically
incorporated into the computational scheme through an
auxiliary coordinate mapping. One version of the scheme
employs a regular mesh and another one is meshless, with an
adaptively chosen distribution of nodes.
I gratefully acknowledge collaboration with Prof. Gary
Friedman's group of Drexel University on magnetostatic and
electrostatic models of nanoparticles in colloidal
solutions and with Dr. Achim Basermann of NEC Europe Ltd.
on parallel implementation of the algorithms.
The Approximate Algorithm for Analysis of the Strand Separation Transition
in Superhelical DNA Using Nearest Neighbor Energetics
Chengpeng Bi and Craig J. Benham
UC Davis Genome Center, University of California, One Shields Ave.,
Davis CA 95616
Abstract
Accurate methods to computationally search genomic DNA sequences for
regulatory regions have been difficult to develop. Conventional string-based
methods have not been successful because many types of regulatory
regions do not have recognizable motifs. Even when a sequence
pattern is known to be associated with a class of regulatory regions,
it commonly is necessary, but not sufficient, for function. This suggests
that other attributes, not necessarily strongly correlated with the
details of base sequence, are involved in regulation. Here we present
a computational method to analyze the propensity of superhelically
stressed DNA to undergo strand separation events, as is required for
the initiation of both transcription and replication. We build in
silico models to analyze the statistical mechanical equilibrium distribution
of a population of identical, stressed DNA molecules among its states
of strand separation. In this phenomenon, which we call stress induced
duplex destabilization (SIDD), a state energy is determined by the
energy cost of opening the specific separated base pairs in that state,
and the energy relief from the relaxation of stress this affords.
We use experimentally measured values of all energy parameters, including
the nearest neighbor energetics known to govern DNA base pair stability.
We perform a statistical mechanical analysis in which the approximate
equilibrium distribution is calculated from all states whose free
energies do not exceed a user-defined threshold. This provides the
most general and efficient computational approach to the analysis
of this phenomenon. The algorithm is implemented in C++, and its performance
is analyzed.
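The equilibrium calculation described above reduces, once the low-energy states are enumerated, to Boltzmann weighting. A minimal Python illustration follows (an expository assumption, not the authors' C++ implementation; state enumeration, the hard part, is taken as given):

```python
# Minimal sketch: Boltzmann equilibrium probabilities over an enumerated
# set of strand-separation states, keeping only states within a
# user-defined free-energy threshold of the minimum.
import math

RT = 0.5921  # kcal/mol near 298 K (approximate)

def equilibrium(states, threshold):
    """states: list of (label, free_energy); returns {label: probability}."""
    g_min = min(g for _, g in states)
    kept = [(s, g) for s, g in states if g - g_min <= threshold]
    weights = {s: math.exp(-(g - g_min) / RT) for s, g in kept}
    z = sum(weights.values())  # partition function over retained states
    return {s: w / z for s, w in weights.items()}
```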
Substrate Recognition by Enzymes: A Theoretical Study
Kaori Ueno-Noto1, Keiko Takano1,
and Miki Hara-Yokoyama2
1Graduate School of Humanities and Sciences, Ochanomizu University, 2-1-1 Otsuka, Bunkyo-ku, Tokyo 112-8610, Japan
2Graduate School of Dental, Tokyo Medical and Dental University, 1-5-45 Yushima, Bunkyo-ku, Tokyo 113-8549, Japan
Abstract
Oligosaccharides of glycosphingolipids on the cell surface
show diversity in both primary and tertiary structures,
which is closely associated with specific interactions in
the extracellular domain. Reports on the higher-order
structures of glycosphingolipids have been limited because
of the experimental difficulty of working with
oligosaccharides; most reports document only their primary
structures. In this study, we investigated these structures
by applying computational chemistry.
Gangliosides are sphingolipids containing sialic acids
and their oligosaccharides can be recognized by some
enzymes. We previously reported that gangliosides inhibited
the activity of the enzyme NAD glycohydrolase (CD38), and
that those with tandem sialic acid residues in the sugar
chain had a strong inhibitory effect. We describe the results
of computer simulations on three-dimensional structures and
electronic structures of gangliosides in order to clarify
the causative factors of difference in the inhibitory
effect and the recognition mechanisms of the enzyme.
Similarities between dimeric sialic acid and NAD in
the calculated structures were assessed by conformational
analyses and molecular orbital calculations, on the
assumption that CD38, an NAD reacting enzyme, cross-reacts
with the tandem sialic acid residues in gangliosides. It
was found that the calculated dipole moments and HOMOs were
correlated with the inhibitory effect. CD38 is likely to
recognize the two carboxyl groups in tandem sialic acid
residues of gangliosides, as well as the phosphate groups
in NAD. Solvation effects were also considered to
interpret the substrate recognition mechanisms in the
biological system, which supported the above results.
Computational Simulation of Lipid Bilayer Reorientation at Gaps
Peter M. Kasson and Vijay S. Pande
Stanford University Medical School, Stanford, CA
Abstract
Understanding cellular membrane processes is critical for
the study of events such as viral entry, neurotransmitter
exocytosis, and immune activation. Supported lipid
bilayers serve as a model system for many membrane
processes, offering the potential to study them in a more
controllable setting. Despite the relative experimental
simplicity of this model system, many important structural
and dynamic parameters are not experimentally observable
with current techniques. The orientation of individual
lipid molecules within the bilayer is one of these
experimentally indeterminable parameters. Computational
approaches allow the development of a high-resolution model
of bilayer processes. We have performed molecular dynamics
simulations of dimyristoylphosphatidylcholine (DMPC)
bilayers to model the creation of bilayer gaps--a common
process in bilayer patterning--and to analyze their
structure and dynamics. Molecular simulation of these
large systems was made computationally tractable using
parallel processing. Based on our observations, we propose
a model for gap formation in which the bilayer edges form
metastable micelle-like structures on a nanosecond time
scale. Lipid molecules near the edges are structurally
similar to lipids in ungapped bilayers but undergo small-
scale motions on a more rapid timescale. These data
suggest that lipids may undergo rapid local rearrangements
in bilayers undergoing membrane fusion, thus facilitating
the formation of the fusion structures postulated to be
intermediates in the infection cycle of viruses such as
influenza, Ebola, and HIV.
Phylogeny and Evolution
Search for Evolution-related-oligonucleotides and Conservative Words in rRNA Sequences
Liaofu Luo, Fengmin Ji, Mengwen Jia,
Li-Ching Hsieh and H.C. Lee
Inner Mongolia University, China
Abstract
We describe a method for finding ungapped conserved words
in rRNA sequences that is effective, utilizes evolutionary
information and does not depend on multiple sequence
alignment. Evolutionary distance (called n-distance)
between a pair of 16S or 18S rRNA sequences is defined in
terms of the difference in the two sets of frequencies of
occurrence of oligonucleotides n bases long (n-mers) given
by the sequences. These n-distances are used to
reconstruct phylogenetic trees for 35 representative
organisms from all three kingdoms. The quality of the tree
generally improves with increasing n and reaches a plateau
of best fit at n=7 or 8. Hence the 7-mer or 8-mer
(oligonucleotide of 7 or 8 bases) frequencies provide a
basis to describe rRNA evolution. Based on the analysis of
the contribution of particular 7-mers to 7-distances, a
set of 612 7-mers (called evolution-related-
oligonucleotides, EROs) that are critical to the topology
of the best phylogenetic tree are identified. Expanding
from this set of EROs, evolution-related conservative words
longer than 7 bases in 16S rRNA sequences from an enlarged
set of 98 organisms in Bacteria and Archaea are identified
based on two criteria: (1) the word is highly conserved in
nearly all species of a kingdom (or a sub-kingdom); (2) the
word is located at nearly the same site in each sequence.
Three examples of words thus found are: the 13-mer
ggattagataccc, located at the end of a loop near H24 (in
E. coli), is conservative in almost all species in Archaea
and Bacteria. The 8-mer aacgagcg located on H35 is also
conservative in Archaea and Bacteria. Its expansion, the 32-
mer tgttgggttaagtcccgcaacgagcgcaaccc, is conservative in
Bacteria but not in Archaea.
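The n-distance itself admits a very short sketch. The abstract defines it only as a function of the difference between the two n-mer frequency sets, so the Euclidean form below is an assumption for illustration:

```python
# Minimal sketch (Euclidean form assumed; the abstract does not fix the
# metric): n-distance between two rRNA sequences computed from their
# n-mer frequency vectors.
import math
from collections import Counter

def nmer_freqs(seq, n):
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def n_distance(seq_a, seq_b, n=7):
    fa, fb = nmer_freqs(seq_a, n), nmer_freqs(seq_b, n)
    return math.sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2
                         for w in set(fa) | set(fb)))
```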
High Speed GAML-based Phylogenetic Tree Reconstruction Using HW/SW Codesign
Terrence S.T. Mak and K. P. Lam
The Chinese University of Hong Kong
Abstract
With the accumulation of genetic information for biological
organisms, different computational techniques are applied
to inferring meaningful patterns from the alignment of DNA
sequences. A phylogenetic relationship between different
organisms can be inferred by making use of a maximum
likelihood approach that has been demonstrated to solve the
problems effectively. Heuristics for calculating the
phylogenetic trees based on maximum likelihood are
computationally expensive. The tree evaluation function
that calculates the likelihood value for each tree topology
dominates the time in searching the optimal tree. We
developed a hybrid Hardware/Software (HW/SW) system for
solving the phylogenetic tree reconstruction problem using
the Genetic Algorithm for Maximum Likelihood (GAML)
approach. The GAML is partitioned into a genetic algorithm
(GA) and a fitness evaluation function, with the SW
handling the GA and the hardware evaluating the maximum
likelihood function. This approach exploits the flexibility
of software and the speed up of high performance hardware.
An efficient Field Programmable Gate Array (FPGA)
implementation for the required computation on evolution
tree topology fitness evaluation is proposed. The complete
high-level digital design is developed using Xilinx's
System Generator for DSP toolbox, which provides an
efficient generation of VHDL code for hardware circuitry
programming. In addition, Xilinx's Java-based JBits toolkit
is used for constructing a BRAM interface for digital
process synchronization control and GA chromosomes
transmission between a workstation and the Xilinx Virtex-
800 FPGA Processor. This implementation provides a
speedup of approximately 30 to 100 times compared
to a software-only solution.
Evidence for Growth of Microbial Genomes by Short Segmental Duplications
Li-Ching Hsieh, Liaofu Luo and Hoong-Chien Lee
Department of Physics, National Central University, Taiwan
Abstract
We show that textual analysis of microbial complete genomes
reveals telling footprints of their early evolution. If a
DNA sequence considered as a text in its four bases is
sufficiently random, the distribution of frequencies of
words of a fixed length from the text should be
Poissonian. We point out that in reality, for words of fewer
than nine letters, complete microbial genomes universally
have distributions that are uniformly many times wider than
those of corresponding Poisson distributions. We interpret
this phenomenon as follows: the genome is a large system
that possesses the statistical characteristics of a much
smaller random system, and certain textual statistical
properties of genomes observable now are remnants of those
of their ancestral genomes, which were much shorter than
genomes today. This interpretation motivates a simple
biologically plausible model for the growth of genomes: the
genome first grew randomly to an initial length of not more
than one thousand bases (1 kb), and thereafter grew mainly by
random segmental duplication. Setting the lengths of
duplicated segments to average around 25 bases, we have
generated model sequences in silico whose statistical
properties emulate those of present-day genomes. The small
size of the initial random sequence and the shortness of
the duplicated segments both dictate an RNA
world at the time growth by duplication began. Growth by
duplication allowed the genome repeated use of hard-to-
come-by codes, thereby enormously increasing the rates of
evolution and species diversification.
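The growth model lends itself to a few lines of simulation. In the illustrative sketch below, the exponential segment-length distribution is an assumption; the abstract specifies only an average of about 25 bases:

```python
# Minimal sketch (not the authors' model code): grow a sequence from a
# short random seed by repeatedly duplicating a random segment and
# inserting the copy elsewhere.
import random

def grow_genome(seed_len=1000, target_len=100_000, mean_seg=25):
    genome = [random.choice("ACGT") for _ in range(seed_len)]
    while len(genome) < target_len:
        seg_len = max(1, round(random.expovariate(1 / mean_seg)))
        start = random.randrange(max(1, len(genome) - seg_len))
        segment = genome[start:start + seg_len]
        insert_at = random.randrange(len(genome))
        genome[insert_at:insert_at] = segment  # the duplication event
    return "".join(genome)
```

Word-frequency distributions of such model sequences can then be compared against the Poisson expectation, as is done for real genomes above.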
PTC: An Interactive Tool for Phylogenetic Tree Construction
Sami Khuri and Chen Yang
San Jose State University
Abstract
A phylogenetic tree represents the evolutionary history of
a group of organisms. Constructing phylogenetic trees is a
crucial step for finding out how today's species are
related to one another in terms of common ancestors.
Numerous computer tools, such as PHYLIP and PAUP, have been
developed to construct such trees. The algorithms
implemented in these tools are generally either character
or distance based. The first group uses individual
substitutions among the sequences, while distance based
algorithms construct trees based on pairwise distances
between the sequences.
In our work, we introduce the Phylogenetic Tree
Construction package (PTC): a novel interactive tool for
constructing phylogenetic trees. PTC currently supports
four well-known algorithms: Unweighted Pair Group Method
with Arithmetic Mean (UPGMA), Neighbor Joining, Fitch-
Margoliash, and Maximum Parsimony. The main reason behind
our project is the lack of interactivity in existing tools.
The existing packages, unlike our tool, only visualize the
resulting phylogenetic tree. Furthermore, the interaction
with those packages occurs only at the beginning, before
the program's execution. We strongly believe that
interactive tree construction can be extremely valuable to
bioinformaticians. Through interacting with PTC, users can
gain a deeper understanding of the algorithms, of the input
and output data and not just see the final tree generated
by an algorithm. We provide the capability to edit input
data, to view the tree construction step-by-step, and to
compare two consecutive states of the algorithm. Trees are
dynamically drawn and can be resized by the user. PTC is
implemented in Java and can be extended to include
additional algorithms.
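As a flavor of the algorithms PTC animates, here is a compact UPGMA sketch (an illustration only; branch lengths are omitted, and PTC itself is implemented in Java):

```python
# Minimal sketch: UPGMA repeatedly merges the closest pair of clusters
# and size-averages their distances to the remaining clusters.
def upgma(dist, labels):
    """dist: symmetric dict-of-dicts of pairwise distances; labels: taxa."""
    clusters = {l: 1 for l in labels}      # newick string -> cluster size
    d = {a: dict(row) for a, row in dist.items()}
    while len(clusters) > 1:
        a, b = min(((x, y) for x in clusters for y in clusters if x < y),
                   key=lambda p: d[p[0]][p[1]])
        na, nb = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        d[merged] = {}
        for c in clusters:
            # size-weighted average distance to the new cluster
            d[merged][c] = d[c][merged] = (na * d[a][c] + nb * d[b][c]) / (na + nb)
        clusters[merged] = na + nb
    return next(iter(clusters))

dist = {"A": {"B": 2.0, "C": 4.0}, "B": {"A": 2.0, "C": 4.0},
        "C": {"A": 4.0, "B": 4.0}}
print(upgma(dist, ["A", "B", "C"]))   # ((A,B),C)
```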
Iterative Rank Based Methods for Clustering
S.W. Perrey, H. Brinck and A. Zielesny
University of Applied Sciences of Gelsenkirchen,
Recklinghausen, Germany
Abstract
Recently a new clustering algorithm was developed that is useful
in phylogenetic systematics and taxonomy. It derives a
hierarchy from (dis)similarity data in a simple and rather
natural way, transforming a given dissimilarity by an
iterative approach. Each iteration step consists of ranking
the objects under consideration according to their pairwise
dissimilarity and calculating the Euclidean distance of the
resulting rank vectors. We investigate alterations of this
order of steps, as well as substituting the Euclidean distance
with standard statistical measures for series of estimates. We
evaluate the resulting procedures on biological
and other data sets of different structure with regard to their
underlying cluster systems. Thereby, the potential and limits
of this kind of iterative approach become apparent.
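One iteration of the transform can be written directly from the description above (an illustration, not the authors' implementation; the diagonal self-rank is retained for simplicity):

```python
# Minimal sketch: rank every object's dissimilarities to all others, then
# take Euclidean distances between the resulting rank vectors.
import math

def rank_vector(row):
    """Ranks of the entries of one dissimilarity row (1 = most similar)."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0] * len(row)
    for r, j in enumerate(order):
        ranks[j] = r + 1
    return ranks

def rank_transform(d):
    """d: square dissimilarity matrix (list of lists); returns new matrix."""
    ranks = [rank_vector(row) for row in d]
    n = len(d)
    return [[math.sqrt(sum((ranks[i][k] - ranks[j][k]) ** 2 for k in range(n)))
             for j in range(n)] for i in range(n)]
```

Iterating rank_transform until the induced hierarchy stabilizes yields the clustering; the variants studied here swap the Euclidean step for other statistical measures.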
Analysis of Phylogenetic Profiles by Bayesian Decomposition
Ghislain Bidaut, Karsten Suhre,
Jean-Michel Claverie and Michael Ochs
Bioinformatics Working Group, Fox Chase Cancer Center, Philadelphia, PA, USA;
Structural and Genetic Information Laboratory, UMR 1889 CNRS-AVENTIS, Marseille, France
Abstract
Antibiotic resistance together with the side effects of
broad spectrum antibacterials makes development of targeted
antibiotics of great interest. Here we demonstrate a
Bayesian approach for the analysis of phylogenetic data
aimed at identifying targets specific to species and genus
in bacteria. The data comprise a series of BLAST scores for
selected genes from multiple bacterial species compared to
the well-known genomes of E. coli and M. tuberculosis. As
organisms adapted to new niches during evolution, new genes
were created from lateral gene transfers, duplication and
mutation of existing genes, or merging of multiple genes. The
phylogenetic profiles observed today therefore result from
the superposition of fundamental genetic relationships that
cannot be inferred from single gene similarity. We have
applied Bayesian Decomposition (BD), an algorithm developed
to identify fundamental signals in mixtures, to identify the
fundamental patterns comprising multiple, overlapping
functional units of related genes.
Preliminary analysis shows that certain patterns are highly
conserved (correlation of 70-90% in the genes included) as
we increase the number of patterns used by BD. Some
patterns appear genus-specific, suggesting that sets of
genes have evolved early and been maintained in all related
species. In addition, a pattern linking a functional core
of genes common to all the studied organisms has been
identified. Further analysis should yield specific sets of
genes tied in functional units that are unique to species
and genus, providing potential targets for therapeutics at
the level of disrupting protein interactions or of
individual proteins.
Automating Recognition of Regions of Intrinsically Poor Multiple Alignment for Phylogenetic
Analysis Using Machine Learning
Yunfeng Shan and Evangelos E. Milios1;
Andrew J. Roger and Christian Blouin2; Edward Susko3
1Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, NS, Canada B3H 1W5
2Dept. of Biochemistry and Molecular Biology, Genome Atlantic/Genome Canada, Dalhousie University,
Halifax, Nova Scotia, Canada, B3H 1X5
3Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada, B3H 4H7
Abstract
Phylogenetic analysis requires alignment of gene
sequences. Automatic alignment programs produce regions
of intrinsically poor alignment that are currently
detected and deleted manually. We present the results of a
machine learning approach to detection of these regions of
the alignment. We compare naive Bayes, standard decision
trees, and support vector machines.
The results show that all three algorithms can accurately
identify the bad sites of a multiple sequence alignment
based on three attributes: the gap ratio, the site
likelihood and the degree of homoplasy (consistency
index). Among the three algorithms, naive Bayes and SVM
provide the best performance for bad-site prediction, but
C4.5 decision trees provide the best performance for the
ambiguous- and good-site predictions. Among the three
classes, the ambiguous sites and the good sites are the
most difficult to distinguish, while the bad sites are the
easiest. No evident difference is observed among the three
parsimony count indices as attributes for reducing
classification error. Generally, the naive Bayes and C4.5
decision tree classifiers learnt from a subset with a
balanced class distribution perform best compared with the
natural class distributions of the random subset and the
entire data set.
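A minimal sketch of the classification setup on the three attributes, using scikit-learn's Gaussian naive Bayes (the feature values below are invented for illustration; this is not the authors' pipeline):

```python
# Minimal sketch: classify alignment columns as good/ambiguous/bad from
# [gap_ratio, site_log_likelihood, consistency_index].
from sklearn.naive_bayes import GaussianNB

X_train = [[0.0, -1.2, 0.9],   # well-aligned column (illustrative values)
           [0.4, -3.5, 0.5],   # ambiguous column
           [0.8, -6.0, 0.2]]   # poorly aligned column
y_train = ["good", "ambiguous", "bad"]

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict([[0.7, -5.1, 0.3]]))   # -> ['bad']
```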
Reconstruction of Ancestral Gene Order Following Segmental Duplication and Gene Loss
Jun Huan, Jan F. Prins, Wei Wang and Todd J. Vision
Dept. of Computer Science, Univ. of North Carolina, Chapel Hill, NC
Abstract
As gene order evolves through a variety of chromosomal
rearrangements, conserved segments provide important
insight into evolutionary relationships and functional
roles of genes. However, gene loss within otherwise
conserved segments, as typically occurs following large
scale genome duplication, has received limited algorithmic
study. This has been a major impediment to comparative
genomics in certain taxa, such as plants and fish. When
large scale genome duplication and gene loss occur, how
well can we infer both the true gene order within ancestral
chromosomal segments and the ancestral ordering of those
segments?
We propose a heuristic algorithm for the inference of
ancestral gene order in a set of genomes for which at least
some genomic segments are partially related by common
ancestry to two or more different segments. It does not
require gene content and order to be perfectly conserved
among segments. First, conserved chromosomal regions are
identified using existing pairwise genomic alignment
algorithms. Second, segments are iteratively clustered
under the control of two parameters, (1) the minimal
required number of shared genes between two segments or
clusters and (2) the maximal allowed number of
rearrangement breakpoints along the lineage leading to each
descendant segment. Finally, we compute the estimated
ancestral gene order for each cluster.
We evaluate the performance of this algorithm on simulated
data that models a genome evolving by large-scale
duplication, duplicate gene loss, transposition,
translocation, and inversion. The results suggest that
ancestral gene orders may be estimated with sufficient
accuracy to substantially improve the detection sensitivity
of pairwise genomic alignment algorithms.
Reconstruction of Ancient Operons From Complete Microbial Genome Sequences
Yuhong Wang1, John Rose2, Bi-cheng Wang2 and Dawei Lin2
1Department of Molecular Biology, Jilin University, Changchun 130023, PRC
2Southeast Collaboratory, Biochemistry and Molecular Biology
Department, University of Georgia, Athens, GA 30602, USA
Abstract
Completed genomes not only provide DNA sequence information, but also
reveal the relative locations of genes. In this paper, we propose a new
method for reconstruction of "ancient operons" by taking advantage of
the evolutionary information in both orthologous genes and their
locations in a genome. The basic assumption is that the closer two genes
were in an ancient genome, the more likely they are to remain close in
the current genome. An assembly of non-random neighboring pairs of genes in
current genomes should be able to reconstruct the gene groups that were
together at a certain point in time during evolution. Given that
genes that are close neighbors are more likely to be functionally
related, the gene groups generated by this assembly process are named
"ancient operons."
The assembly is only meaningful when enough non-random pairs can be
found. This was made possible by over 100 microbial genomes available in
recent years. For proof of concept, we chose 63 non-redundant complete
microbial genomes from the RefSeq database [May 2003 release] at NCBI. In
order to normalize the effect of protein sequence mutations and other
changes due to evolution, we only consider assembly of COGs (Clusters of
Orthologous Groups) in these genomes. In total, 4901 COGs from the NCBI
COG database are used.
The assembly process is similar to the one that assembles DNA sequences
into contigs. In our case, the neighbor COG pairs are used as basic
assembly units. A target function is defined based on the neighbor
frequency of pairwise links among all 4901 COGs after analysis of all 63 genomes.
We use the random cost algorithm, a global optimization method, to
minimize the target function and assemble COGs into contigs. The
significance of these contigs is then assessed by statistical methods.
The results suggest that the assembled contigs are statistically and
biologically significant. This method and the assembled ancient operons
provide a new way of studying microbial genomes and their evolution, and
of annotating proteins of unknown function.
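The basic assembly units, neighboring COG pairs tallied across genomes, are easy to sketch. The genome representation below (an ordered list of COG identifiers) is an assumed simplification, not the authors' data model:

```python
# Minimal sketch: tally how often two COGs occur as chromosomal
# neighbors across a set of genomes.
from collections import Counter

def neighbor_counts(genomes):
    counts = Counter()
    for genome in genomes:                 # genome: ordered COG ids
        for a, b in zip(genome, genome[1:]):
            counts[frozenset((a, b))] += 1
    return counts

genomes = [["COG0001", "COG0002", "COG0003"],
           ["COG0002", "COG0001", "COG0004"]]
print(neighbor_counts(genomes)[frozenset(("COG0001", "COG0002"))])  # 2
```

Pairs that occur more often than random gene order would predict become the units that the random cost optimization chains into contigs.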
Predictive Methods
An Evolving Approach to Finding Schemas for Protein Secondary Structure Prediction
Huang, Hsiang Chi
Institute for Information Industry, Taiwan
Abstract
Determining accurate secondary structures of a protein
experimentally involves preparing a protein crystal, X-ray
scanning and computation, all of which are costly. Researchers have
developed methods to predict the secondary structure of a
protein since the 1960s. Recent methods predict protein
secondary structure through the use of new algorithms such
as HMMs (3), neural networks (2), new evolutionary databases
(2), etc.
These algorithms do help to predict protein secondary
structure. However, some algorithms are like "black boxes":
researchers don't understand the meanings of the enormous numbers
of parameters or how the prediction results come about, but
only accept them. This study intends to predict protein
secondary structure schemas by genetic algorithms.
In this research, a genetic algorithm has been applied to
predict building schemas of protein secondary structure.
This GAPS (Genetic Algorithm for Protein
Secondary Structure) achieved an average Q3 score of 55%-
65%. Although the highest Q3 of this study is not the
highest score among published studies, some fundamental and useful
building schemas of protein secondary structure
have been found.
Previous studies (e.g. those focused on the global free-energy
minimum of protein secondary structure) could not give us a
complete understanding of the driving forces behind protein
folding. Why?
Previous studies take every amino acid in the sequence
into consideration. However, the results of this
study suggest that not all the residues in a schema affect the
conformation of protein secondary structure. Only a few amino
acids actually affect the conformation of protein secondary
structure folding.
Gene Selection for Multi-class Prediction of Microarray Data
Dechang Chen, Dong Hua, Xiuzhen Cheng and Jaques Reifman
Uniformed Services University of the Health Sciences, Bethesda, MD
Abstract
Gene expression data from microarrays have been
successfully applied to class prediction, where the
purpose is to classify and predict the diagnostic category
of a sample by its gene expression profile. A typical
microarray dataset consists of expression levels for a
large number of genes on a relatively small number of
samples. As a consequence, one basic and important
question associated with class prediction is: how do we
identify a small subset of informative genes contributing
the most to the classification task? Many methods have
been proposed but most focus on two-class problems, such
as discrimination between normal and disease samples. This
paper addresses selecting informative genes for multi-
class prediction problems by jointly considering all the
classes simultaneously. Our approach is based on the power of
genes to discriminate among tumor types (classes) and on
the correlation between genes. We formulate the expression
levels of a given gene by a one-way analysis of variance
model with heterogeneity of variances, and determine the
discrimination power of the gene by a test statistic
designed to test the equality of the class means. In other
words, the discrimination power of a gene is associated
with a Behrens-Fisher problem. Informative genes are
chosen such that each selected gene has a high
discrimination power and the correlation between any pair
of selected genes is low. Several test statistics are
studied in this paper, such as the Brown-Forsythe test
statistic and the Welch test statistic. Their performances are
evaluated over a number of classification methods applied
to publicly available microarray datasets.
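For concreteness, the Welch statistic for a single gene can be computed as follows (a sketch of the standard formula, not the authors' code; every class needs at least two samples):

```python
# Minimal sketch: Welch's test statistic for one gene's expression levels
# across k classes with unequal variances -- one of the discrimination-
# power measures studied in the poster.
import statistics

def welch_statistic(groups):
    """groups: list of per-class expression-level lists for one gene."""
    k = len(groups)
    n = [len(g) for g in groups]
    m = [statistics.fmean(g) for g in groups]
    v = [statistics.variance(g) for g in groups]
    w = [n_i / v_i for n_i, v_i in zip(n, v)]          # precision weights
    grand = sum(wi * mi for wi, mi in zip(w, m)) / sum(w)
    num = sum(wi * (mi - grand) ** 2 for wi, mi in zip(w, m)) / (k - 1)
    lam = sum((1 - wi / sum(w)) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    den = 1 + (2 * (k - 2) / (k * k - 1)) * lam
    return num / den
```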
Molecular Evaluation Using Comparative Molecular Interaction Profile Analysis System
Yoshiharu Hayashi, Katsuyoshi Sakaguchi, Nao Iwata and Masaki Kobayashi
KLIMERS Co., Ltd., Japan
Abstract
Creating a new molecular description factor based on the
results of computational docking studies will add new
dimensions to molecular evaluation. We propose a new
molecular description factor analysis system named
Comparative Molecular Interaction Profile Analysis system
(CoMIPA) in which the AutoDock program is used for docking
evaluation of small molecule compound-protein complexes.
Interaction energies are calculated, and the data sets
obtained are named interaction profiles (IPFs). Using IPF
as a scoring indicator, the system could be a powerful tool
to cluster the interacting properties between small
molecules and bio macromolecules such as ligand-receptor
bindings. For clustering, we used hierarchical
clustering, which has visual advantages but a number
of shortcomings for the CoMIPA study. So, in addition to
hierarchical clustering, we tried Kohonen's Self-
Organizing Map (SOM), a kind of neural network
that learns the features of multidimensional data without
supervision. SOMs have a number of features that make
them particularly well suited to clustering and analysis of
interaction profiles. Further development of the system
will enable us to predict the adverse effects of a drug
candidate.
Probe Design for Large-scale In Situ Hybridization and Oligo DNA Microarrays: An Application of the New NIA Gene Index
Vincent VanBuren, Toshiyuki Yoshikawa, Toshio Hamatani, Alexei A. Sharov,
Yong Qian, Dawood B. Dudekula, Minoru S.H. Ko
NIH/NIA
Abstract
Large-scale molecular biology techniques such as the use of
oligo DNA microarrays are widely used to gain an
appreciation of global transcription in biological time
series, tissue contrast, classification and prognosis, and
response-to-treatment experiments. Recent efforts to
perform large-scale in situ hybridizations have used cDNA
probes. An approach that makes use of designed oligo probes
should offer improved consistency at uniform hybridization
conditions and improved specificity, as demonstrated by
various oligo microarray platforms. While many tools exist
to aid in probe design, most tools are not suitable for
large-scale automation of selection and there are no freely
available tools that optimize probe selection by
considering the complex interaction of the physical
properties and cross-reactivity of the probe as measured by
microarray studies. We describe a new Web-based application
that takes FASTA-formatted sequence and some simple
parameters as input, and returns both a list of the best
choices for probes and a full report containing possible
alternatives. Designing probes for microarrays uses a
specialized probe scoring routine that optimizes probe
intensity based upon an artificial neural network (ANN)
trained to predict the average probe intensity from the
physical properties of the probe. Determining probe cross-
reactivity requires a high-quality gene index that includes
both gene and transcript information. The new NIA gene
index was applied to this effort by creating a BLAST
database of the index and searching all potential probes
against it. This new tool should provide a reliable way to
construct probes that maximize signal intensity while
minimizing cross-reactivity.
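As a toy stand-in for the physical-property side of probe scoring (the tool itself scores probes with an ANN trained on measured intensities), one might enumerate candidate windows and compute simple properties such as GC fraction and a rough Wallace-rule melting temperature:

```python
# Illustrative sketch only -- these two properties are examples, not the
# tool's actual feature set.
def gc_fraction(probe):
    return sum(probe.count(b) for b in "GC") / len(probe)

def wallace_tm(probe):
    """Tm ~ 2(A+T) + 4(G+C); a rough rule intended for short oligos."""
    return (2 * sum(probe.count(b) for b in "AT")
            + 4 * sum(probe.count(b) for b in "GC"))

def candidate_probes(seq, length=20):
    for i in range(len(seq) - length + 1):
        p = seq[i:i + length]
        yield i, p, gc_fraction(p), wallace_tm(p)
```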
Gene Selection for Cancer Classification Using Bootstrapped Genetic Algorithms and Support Vector Machines
Xue-wen Chen
California State University
Abstract
The gene expression data obtained from microarrays have
proven useful in cancer classification. DNA microarray data
have extremely high dimensionality compared to the small
number of available samples. An important step in
microarray based cancer classification is to remove genes
irrelevant to the learning problem and to select a small
number of genes expressed in biological samples under
specific conditions. In this paper, we propose a novel
system for selecting a set of genes for cancer
classification. A wrapper method is employed: a linear
support vector machine is used to evaluate the fitness of
subsets of genes and a genetic algorithm is employed to
search for a subset of genes which discriminate samples in
two classes well. To overcome the problem of the small
size of training samples, bootstrap methods are incorporated
into the genetic search. This new system is very efficient for
selecting sets of genes in very high dimensional feature
space for cancer classification. Two databases are
considered: the colon cancer database and the leukemia
database. Our experimental results show that the proposed
method is capable of finding genes that discriminate
between normal cells and cancer cells and it generalizes
well (this is particularly important in microarray data
classifications, since only very limited training samples
are available).
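The wrapper fitness can be sketched as the bootstrap-averaged out-of-bag accuracy of a linear SVM on a candidate gene subset (an illustration under assumed details, not the author's system):

```python
# Minimal sketch: GA fitness of a gene subset = mean out-of-bag accuracy
# of a linear SVM over bootstrap resamples of the small training set.
import numpy as np
from sklearn.svm import LinearSVC

def fitness(X, y, gene_subset, n_boot=10, seed=0):
    rng = np.random.RandomState(seed)
    Xs = X[:, gene_subset]
    y = np.asarray(y)
    scores = []
    for _ in range(n_boot):
        idx = rng.choice(len(y), size=len(y), replace=True)
        oob = np.setdiff1d(np.arange(len(y)), idx)   # out-of-bag samples
        if len(oob) == 0:
            continue
        clf = LinearSVC().fit(Xs[idx], y[idx])
        scores.append(clf.score(Xs[oob], y[oob]))
    return float(np.mean(scores))
```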
A Statistical Model of Proteolytic Digestion
I-Jeng Wang, Christopher P. Diehl and Fernando J. Pineda
Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Rd., Laurel, MD
Abstract
We present a stochastic model of proteolytic digestion of a
proteome, assuming (1) the distribution of parent protein
lengths in the proteome, (2) the relative abundances of the
20 amino acids in the proteome and (3) the
digestion “rules” of the enzyme used in the digestion.
Unlike hypothesis tests in most protein identification
software, the developed model accounts for the fact that
digestion products come from the mixture of proteins that
constitutes a microorganism’s proteome. We believe that
incorporation of a more rigorous model of proteolytic
digestion in the hypothesis test would significantly
improve the selectivity and sensitivity of the assay. Our
approach is based on the observation that, when overlaid
with the cleavage process, a protein sequence can be
equivalently modeled as a regenerative process with the
cleavage sites as the regeneration points. Hence the
regenerative cycles between these points (the fragments)
are i.i.d. Therefore, Wald's first lemma can be applied to
the stochastic process of fragmentation, leading to closed
form expressions for the distribution of fragment lengths
with a minor approximation to account for the first
fragment. With this methodology we derived a closed form
expression for the fragment mass distribution for a large
class of enzymes, including the widely used trypsin. The
expression uses the distribution of lengths in a mixture of
proteins taken from a proteome, as well as the relative
abundances of the 20 amino acids in the proteome. The
agreement between theory and the in silico digest is
excellent.
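The model's central observation can be made concrete with a simplified trypsin rule (cut after K or R, ignoring the proline exception the real enzyme obeys), under which internal fragment lengths are geometric. An illustration, not the authors' derivation:

```python
# Minimal sketch: in silico digestion with a simplified trypsin rule, plus
# the geometric length law implied by i.i.d. residues with per-position
# cleavage probability p = P(K) + P(R).
import re

def tryptic_fragments(protein):
    """Cut after K or R (simplified rule, no proline exception)."""
    return re.findall(r".*?[KR]|.+?$", protein)

def geometric_length_pmf(length, p):
    """P(fragment length = L) under the i.i.d. model."""
    return (1 - p) ** (length - 1) * p

print(tryptic_fragments("MKWVTFISLLR"))   # ['MK', 'WVTFISLLR']
p = 0.058 + 0.054                         # illustrative K and R abundances
print(geometric_length_pmf(10, p))
```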
Preliminary Wavelet Analysis of Genomic Sequences
Jianchang Ning, Charles N. Moore and James C. Nelson
University of Delaware, Delaware Biotechnology Institute, 15 Innovation Way Newark, DE
Abstract
Large genome-sequencing projects have made urgent the
development of accurate methods for annotation of DNA
sequences. Existing methods combine ab initio pattern
searches with knowledge gathered from comparison with
sequence databases or from training sets of known genes.
However, the accuracy of these methods is still far from
satisfactory. In the present study, wavelet algorithms, in
combination with an entropy method, are being developed as an
alternative way to determine gene locations in genomic DNA
sequences. Wavelet methods seek periodicity present in
sequences. Periodicity, due in general to nonrandomness of
nucleotide usage associated with the triplet nature of
codons, is an important target of most gene-parsing computer
programs. A promising advantage of wavelets is their
adaptivity to varying lengths of coding/non-coding regions.
Moreover, the wavelet methods integrated with the entropy
method simply search the information content of the
sequences and do not need to be trained. Since
this development is not yet complete, only
preliminary results are presented. These results, based on
a sample of genomic DNA sequences, show that the wavelet
approach is feasible and may be better than some
knowledge-dependent approaches.
Using Easel for Modeling and Simulating the Interactions of Cells in Order to Better Understand
the Basics of Biological Processes and to Predict their Likely Behaviors
Vojislav Stojkovic, Grace Steele and William Lupton
Morgan State University, Baltimore, MD 21251
Abstract
Modeling and simulation play a significant role in
bioinformatics, helping researchers to better understand
biological processes and to predict their likely behaviors.
Easel (Emergent Algorithm Simulation Language and
Environment) is a new general-purpose modeling and simulation
programming language. Easel is designed to model and
simulate systems with large numbers of interacting actors.
It can be used to model and simulate problems
with large numbers of interacting components that lack
central control and have incomplete and imprecise information.
We use Easel for modeling and simulating the interactions
of cells in order to better understand the basics of
biological processes. We have developed many Easel
programs to solve different biological problems.
For example the "Message Passing" Easel program simulates
the transmission of signal from one cell to another based
on the distance between cells.
The following scenario summarizes the model (a Python analog of it appears after the list):
- There is a passive cell population; cells are
graphically represented by circles of different radii and colors;
- Cells move randomly within their own limited "areas";
- The researcher can randomly click any passive cell to
activate the cell (change radius and color). This action
initiates the transmission of a signal between cells;
- The signal is transmitted to the closest passive cell;
- When the new cell is activated (change color and
radius), the old cell will be deactivated;
- The new cell continues the signal transmission to the
closest passive cell;
- The researcher may continue to click on passive cells to
activate them and start the parallel signal transmission.
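Easel source is not shown in the abstract; as a language-neutral illustration of the scenario, here is a Python analog of the nearest-neighbor signal handoff:

```python
# Minimal sketch (a Python analog; the poster's programs are written in
# Easel): an activated cell passes the signal to the closest passive cell
# and then deactivates.
import math
import random

cells = [{"pos": (random.random(), random.random()), "active": False}
         for _ in range(20)]

def step(active):
    """Pass the signal from the active cell to the nearest passive cell."""
    passive = [c for c in cells if not c["active"]]
    nxt = min(passive, key=lambda c: math.dist(active["pos"], c["pos"]))
    active["active"], nxt["active"] = False, True
    return nxt

cell = cells[0]
cell["active"] = True          # the researcher "clicks" a cell
for _ in range(5):             # the signal hops between nearest neighbors
    cell = step(cell)
```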
Fold Recognition Using Sequence Fingerprints of Protein Local Substructures
Andriy Kryshtafovych, Torgeir R. Hvidsten,
Jan Komorowski and Krzysztof Fidelis
Lawrence Livermore National Lab, Livermore, CA 94550
Abstract
Modern structure prediction methods can consistently
produce reliable structural models for protein sequences
with more than 25% sequence identity to proteins with known
structure. But even if no proteins with significant
similarity can be detected for the protein of interest,
there is still a chance that it can be assembled with local
blocks existing in structural archives.
We have developed the method of local descriptors of
protein structure that could detect common local structural
environments in proteins and organize them into a limited
number of shape similarity classes so that representatives
from these classes could be used as elementary building
blocks to reconstruct native protein structures or model
unknown folds. Here we discuss application of this approach
to fold recognition problem.
The descriptors consist of several short backbone fragments
that are close to each other in 3D space and do not overlap
on the sequence. Their number (usually up to 6) and length
(5-25 residues) are not fixed and depend on the backbone's
local conformation and relative positioning of amino acid
side chains. We built 374,558 descriptors that encompass 3
or more backbone segments for 4006 structural domains
represented in SCOP and sharing less than 40% sequence
identity with each other. Grouping descriptors according
to their SCOP attributes and gathering them into classes of
similarity, we built a library of groups containing sets of
sequence fragments with geometrically similar local
structures.
Analysis of the groups was carried out to establish
relationships between sequences of the segments and
specific geometrical conformations. Using the detected
sequence-based fingerprints, groups were assigned to target
sequences.
The ability of the approach to recognize correct SCOP folds
was tested on 273 sequences from the 49 most popular folds.
Good predictions were obtained in 86% of cases. No
performance drop was observed with decreasing sequence
similarity between target sequences and protein sequences
used to build the library.
An Iterative Loop Matching Approach to the Prediction of RNA Secondary Structures with Pseudoknots
Jianhua Ruan and Weixiong Zhang
Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63112
Abstract
Pseudoknots have generally been excluded from the
prediction of RNA secondary structures due to the
difficulty in modeling and complexity in computing.
Although several dynamic programming algorithms exist for
the prediction of pseudoknots using thermodynamic
approaches, they are neither reliable nor efficient. On
the other hand, comparative methods are more reliable, but
are often done in an ad hoc manner and require expert
intervention. Maximum weighted matching (Tabaska et al.,
Bioinformatics, 14:691-9, 1998), an algorithm for
pseudoknot prediction with comparative analysis, suffers
from low prediction accuracy in many cases. Here we
present an algorithm, iterative loop matching, for
predicting RNA secondary structures including pseudoknots
reliably and efficiently. The method can utilize either
thermodynamic or comparative information, or both, and is
thus able to make predictions for both aligned sequences and
individual sequences. We have tested the algorithm on a number of RNA
families, including structures both with and without
pseudoknots. Using 8-12 homologous sequences, the
algorithm correctly identifies more than 90% of base-pairs
for short sequences and 80% overall. It correctly predicts
nearly all pseudoknots. Furthermore, it produces very few
spurious base-pairs for sequences without pseudoknots.
Comparisons show that our algorithm is both more sensitive
and more specific than the maximum weighted matching
method. In addition, our algorithm has high prediction
accuracy on individual sequences, comparable to the PKNOTS
algorithm (Rivas & Eddy, J Mol Biol, 285:2053-68, 1999),
while using far fewer computational resources.
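For context, the maximum weighted matching baseline that the authors compare against can be sketched with a general-purpose matching routine; this is not the authors' ILM code, and the complementarity scores and the networkx call below are illustrative stand-ins for covariation or thermodynamic scores:

    import networkx as nx

    # Toy base-pair scores; covariation or stacking scores would go here.
    SCORE = {frozenset("AU"): 2, frozenset("GC"): 3, frozenset("GU"): 1}

    def mwm_pairs(seq, min_loop=3):
        G = nx.Graph()
        for i in range(len(seq)):
            for j in range(i + min_loop + 1, len(seq)):
                w = SCORE.get(frozenset(seq[i] + seq[j]), 0)
                if w:
                    G.add_edge(i, j, weight=w)
        # Matched pairs need not be nested, so pseudoknots appear naturally.
        return nx.max_weight_matching(G)

Because the matching is computed in one shot, spurious pairs are common; as we read the abstract, the iterative loop matching idea is to alternate between committing to confident pairings and re-matching the remainder.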
A New Method for Predicting RNA Secondary Structure
Hirotoshi Taira, Tomonori Izumitani, Takeshi Suzuki and Eisaku Maeda
NTT Communication Science Laboratories, 2-4, Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237 Japan
Abstract
It has recently become clear that many RNAs
are not translated into proteins but instead act as
important functional molecules in organisms. These RNAs are
called "non-coding RNAs."
If their secondary structures can be identified correctly,
the functions of specific RNAs can be predicted more easily,
which will be useful in the development of new medicines.
Nussinov's and Zuker's algorithms are the two conventional
algorithms for predicting RNA secondary structure.
In this paper, we focus on Nussinov's algorithm, which uses
dynamic programming to search for remote base pairs.
To improve the algorithm's performance, we added a new
scoring mechanism and controlled the loop's minimum length.
For the experiment, we used the original Nussinov (NA) and
improved Nussinov (NB) algorithm, as well as the Nussinov
algorithm using SCFG (SA) and the improved Nussinov
algorithm using SCFG (SB). The algorithms were evaluated by
accuracy and F-measure. On average, the F-measure rises from
0.333 to 0.711 due to the improvements from NA to NB.
Moreover, the F-measure rises from 0.577 to 0.714 due to the
improvements from SA to SB. Accuracy evaluation shows
similar results.
In going from NA to NB, the F-measure increased for 11 of
the 18 sequences under the loop-length restriction, and for
9 sequences when G-U pairs were considered. Our experimental
results indicate that the proposed approach effectively
improves the performance of Nussinov's algorithm.
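The abstract does not reproduce the modified recurrence, but the underlying Nussinov dynamic program with a controllable minimum loop length and G-U pairs admitted can be sketched as follows; the pair_score hook marks where a new scoring mechanism could plug in (all names here are ours):

    PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
             ("G", "U"), ("U", "G")}          # G-U wobble pairs included

    def nussinov(seq, min_loop=3, pair_score=lambda a, b: 1):
        """Maximum-score base pairing; at least `min_loop` unpaired
        bases are required inside every hairpin loop."""
        n = len(seq)
        dp = [[0] * n for _ in range(n)]
        for span in range(1, n):
            for i in range(n - span):
                j = i + span
                best = max(dp[i + 1][j], dp[i][j - 1])  # i or j unpaired
                if j - i > min_loop and (seq[i], seq[j]) in PAIRS:
                    best = max(best, dp[i + 1][j - 1]
                               + pair_score(seq[i], seq[j]))
                for k in range(i + 1, j):               # bifurcation
                    best = max(best, dp[i][k] + dp[k + 1][j])
                dp[i][j] = best
        return dp[0][n - 1]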
Minimum-Redundancy Feature Selection from Microarray Gene Expression
Chris Ding and Hanchuan Peng
Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720
Abstract
Selecting a small subset of genes out of the thousands of
genes in microarray data is important for
accurate classification of phenotypes. These "marker" genes also
provide additional biological insights.
Widely used methods typically rank genes according to their
differential expression among phenotypes and pick the
top-ranked genes. Feature sets so obtained often have
considerable redundancy: the mutual information or
correlation among the selected features is high.
In this paper, we propose the minimum-redundancy
maximum-relevance (MRMR) methods to select features that
represent a broader spectrum of phenotype characteristics
than those obtained through standard ranking methods; such
features are more robust and generalize well to unseen data.
We perform comprehensive tests on three well-known
multi-class gene expression data sets, NCI cancer cell
lines (9 classes), Lung cancer (7 classes), and Lymphoma
(9 classes), and on two 2-class data sets, Leukemia and
Colon cancer.
Feature sets selected via our MRMR method lead to
leave-one-out cross-validation (LOOCV) accuracies of
98.3% for NCI, 97.3% for Lung, 96.9% for Lymphoma, 100% for
Leukemia, and 93.5% for Colon. These results are
substantially better than those of standard feature
selection methods, and also better than those reported in
the literature.
We also performed forward feature selection with a wrapper
approach on these data sets. The resulting feature sets,
although obtained with far more computation, are not as good
as those selected via our MRMR method. Our extensive tests
also show that discretizing the gene expression values leads
to clearly better classification accuracy than using the
original continuous data.
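A minimal sketch of greedy minimum-redundancy maximum-relevance selection in its difference ("relevance minus redundancy") form, assuming discretized expression values; function and variable names are ours:

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def mrmr(X, y, k):
        """Greedily pick k columns of the discretized matrix X that have
        high mutual information with the class labels y but low average
        mutual information with the already-selected columns."""
        n_genes = X.shape[1]
        relevance = np.array([mutual_info_score(X[:, g], y)
                              for g in range(n_genes)])
        selected = [int(np.argmax(relevance))]
        while len(selected) < k:
            best_g, best_score = None, -np.inf
            for g in range(n_genes):
                if g in selected:
                    continue
                redundancy = np.mean([mutual_info_score(X[:, g], X[:, s])
                                      for s in selected])
                score = relevance[g] - redundancy
                if score > best_score:
                    best_g, best_score = g, score
            selected.append(best_g)
        return selected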
Sequence Comparison
A Linear Programming Based Algorithm for Multiple Sequence Alignment
Fern Y. Hunt, Anthony J. Kearsley and Abigail O'Gallagher
National Institute of Standards and Technology, MD
Abstract
In this talk we will discuss an approach to multiple
sequence alignment based on treating the alignment
process as a stochastic control problem. The
use of a model based on a Markov decision process (MDP)
leads to a linear programming problem. Our goal is to avoid
the expense in time and computation associated with the use
of dynamic programming methods when aligning large numbers
of sequences. The formulation naturally allows for various
constraints, e.g., on the number of indels, and in some cases
one can characterize the sensitivity of the alignment to
changes in the cost function. The expected discounted cost
for both the primal and dual problem will be discussed.
Under some ergodicity assumptions, one can show pathwise
optimality in an asymptotic sense with probability one.
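The abstract leaves the linear program implicit; for a discounted-cost MDP the textbook formulation, which we assume is the kind of LP meant, maximizes a weighted sum of state values subject to one constraint per state-action pair:

    \max_V \; \sum_s \alpha(s)\, V(s)
    \quad \text{subject to} \quad
    V(s) \;\le\; c(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s')
    \quad \forall s, a

Here V(s) is the optimal expected discounted cost from alignment state s, c(s,a) the immediate cost of an edit action a (match, mismatch, or indel), gamma the discount factor, and alpha a positive weighting over states; the dual problem is over discounted state-action occupation measures.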
Alignment-Free Sequence Comparison with Vector Quantization and Hidden Markov Models
Tuan Pham
Griffith University, Australia
Abstract
Alignment-free sequence comparison of biological data is
still a very recent area of research relative to
alignment-based methods [1]. Apart from clustering
analysis [2], there are two main categories, known as the
word (n-gram) frequency-based and the information-based
methods [1]. This paper presents a new approach to
alignment-free biological sequence comparison that overcomes
at least some of the problems encountered by the
frequency-based method in both computational speed and
numerical analysis. It has been reported [1,3] that for long
sequences the frequency-based approach encounters memory
problems, that the use of the Mahalanobis distance causes
singularity in the inversion of covariance matrices, and
that the computational process takes a considerably long
time.
In this study, a given long sequence of alphabets is
transformed into numbers and then vectorized according to a
prescribed vector size. This collection of vectors is then
represented by a codebook that contains a number of
template vectors as its own identity. Based on the
codebook for each individual sequence generated by vector
quantization, a hidden Markov model (HMM) is then built for
each sequence. Since an HMM has been built for each
sequence, comparisons of similarity/dissimilarity between
sequences can be performed in terms of a distance measure on
the likelihoods, such as cross-entropy, divergence, and
discrimination information.
The proposed approach is tested on several publicly
available biological datasets [2]. Comparisons
are also made with the frequency-based method with
different distance measures. The proposed approach has
been found to be efficient for real-time processing and
provides an effective alternative for alignment-free
sequence comparison.
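A minimal sketch of such a VQ-plus-HMM pipeline, assuming scikit-learn for the codebook and the hmmlearn package for the HMMs, and simplifying to a single shared codebook (the method described above builds one codebook per sequence); the window size, code counts, and distance form are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans
    from hmmlearn import hmm

    CODE = {c: i for i, c in enumerate("ACDEFGHIKLMNPQRSTVWY")}

    def vectorize(seq, w=8):
        """Map letters to numbers and cut into non-overlapping vectors."""
        x = np.array([CODE[c] for c in seq], dtype=float)
        n = (len(x) // w) * w
        return x[:n].reshape(-1, w)

    def hmm_distance(seq_a, seq_b, n_codes=16, n_states=4):
        va, vb = vectorize(seq_a), vectorize(seq_b)
        codebook = KMeans(n_clusters=n_codes, n_init=10).fit(np.vstack([va, vb]))
        sa = codebook.predict(va).reshape(-1, 1)
        sb = codebook.predict(vb).reshape(-1, 1)
        ma = hmm.CategoricalHMM(n_components=n_states, n_iter=50).fit(sa)
        mb = hmm.CategoricalHMM(n_components=n_states, n_iter=50).fit(sb)
        # Symmetrized per-symbol log-likelihood gap, a divergence-style measure.
        return 0.5 * ((ma.score(sa) - mb.score(sa)) / len(sa)
                      + (mb.score(sb) - ma.score(sb)) / len(sb))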
REFERENCES
[1] S. Vinga and J. Almeida, Alignment-free sequence comparison - a review, Bioinformatics, 19(4) (2003) 513-523.
[2] E.R. Dougherty et al., Inference from clustering with application to gene-expression microarrays, J. Computational Biology, 9(1) (2002) 105-126.
[3] http://www.bioinformatics.musc.edu/resources.html
Prediction of Protein Function Using Signal Processing of Biochemical Properties
Krishna Gopalakrishnan, Kayvan Najarian and Dean Scott Warren
University of North Carolina, Charlotte, NC
Abstract
We present a technique for determining the biological
function of a protein from its primary sequence using
advanced signal processing. Currently, the majority of
protein classification methods make use of multiple
alignments. We instead use signal-processing features
obtained from the primary sequence of the protein to predict
its biological function. The primary sequence is first
converted to signals based on the encoding of biochemical
properties such as the hydrophobicity, solubility, and
molecular weight of the constituent amino acids.
Signal-processing features such as complexity, mobility, and
fractal dimension are then extracted from these signals and
used to classify the proteins. Sample studies conducted on
lipase, protease, and isomerase proteins between 100 and 200
amino acids long support the proposed method.
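As an illustration of the kind of features involved (the abstract does not define them exactly), the Hjorth mobility and complexity of a hydrophobicity-encoded sequence can be computed as follows; the Kyte-Doolittle scale is one common hydrophobicity encoding:

    import numpy as np

    # Kyte-Doolittle hydropathy values for the 20 amino acids.
    KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9,
          "A": 1.8, "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3,
          "P": -1.6, "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5,
          "K": -3.9, "R": -4.5}

    def hjorth(signal):
        """Hjorth mobility and complexity of a 1-D signal."""
        d1, d2 = np.diff(signal), np.diff(np.diff(signal))
        v0, v1, v2 = np.var(signal), np.var(d1), np.var(d2)
        mobility = np.sqrt(v1 / v0)
        complexity = np.sqrt(v2 / v1) / mobility
        return mobility, complexity

    def sequence_features(seq):
        return hjorth(np.array([KD[a] for a in seq]))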
Genomic Sequence Analysis Using Gap Sequences and Pattern Filtering
Shih-Chieh Su, Chia H. Yeh, and C. C. Jay Kuo
University of Southern California, Los Angeles, CA
Abstract
In this research, a new pattern filtering technique based on
gap sequences is developed for genomic sequence analysis; in
a gap sequence, the distances between consecutive
occurrences of the same symbol are recorded as a sequence of
integers. A joint decision over a family of gap sequences,
generated using selected patterns, is used to perform genomic
sequence alignment and similarity checking. The effects of
basic operations on genomic sequences are studied under
the gap sequence framework. Several tools are developed,
based on a Poisson-distribution approximation of the gap
values, to analyze the gap sequences, such as histogram-aided
alignment, conditional entropy over gap knowledge, and the
merging of unwanted segments. Simulation results show that
the extension of gap match indicates the corresponding
segment extension in the original genomic sequence. With
the proposed algorithm, we are able to generalize the
conventional sequence alignment methods in a more adaptive
way. A match between gap sequences is considered a frame
match, while a true match requires both a frame match and a
stuffing match. A conditional entropy measure has been
proposed to predict the probability of a true match
conditioned on a frame match. Extensive experimental
results will be presented to demonstrate the proposed
method.
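The basic gap-sequence transformation itself is straightforward; a minimal sketch:

    def gap_sequence(seq, symbol):
        """Distances between consecutive occurrences of `symbol` in `seq`."""
        positions = [i for i, c in enumerate(seq) if c == symbol]
        return [b - a for a, b in zip(positions, positions[1:])]

    # gap_sequence("ACGTAGA", "A") -> [4, 2]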
An Optimal DNA Segmentation Based on the MDL Principle
Wojciech Szpankowski1,
Wenhui Ren1, and Lukasz Szpankowski2
1Purdue University
2University of Michigan, Ann Arbor
Abstract
A major challenge facing computational biology is the post-sequencing analysis of genomic
DNA and other biological sequences. The biological world is highly stochastic as well as inhomogeneous
in its behavior. Indeed, there are regions in DNA with high
concentration of G or C bases; stretches of sequences with an abundance of CG dinucleotide (CpG islands);
coding regions with a strong periodicity-of-three pattern, and so forth. The transitions between homogeneous
and inhomogeneous regions, also known as change points, carry important biological information. Computational
methods used to
identify these homogeneous regions are called segmentation. Our goal is to employ rigorous methods of
information theory to quantify structural properties of DNA
sequences. In particular, we adopt the Stein-Ziv lemma to find an asymptotically optimal discriminant
function that determines whether two DNA segments are generated by the same source while assuring
exponentially small false positives. Then we apply the Minimum Description Length (MDL) principle
to select parameters of our
segmentation algorithm so that the underlying DNA sequence has the smallest possible length (best
compression). The MDL principle is adopted since recent analyses indicate that biological sequences
are well compressed (by nature). Finally, we perform extensive
experimental work on human chromosome 22 data. In particular, we observe that
grouping A and G (purines) and T and C (pyrimidines) leads to better segmentation of chromosome
22 and identification of biologically meaningful transition points.
Let us give a more precise description of our methods. We partition a DNA sequence (e.g., chromosome 22)
into fixed length blocks. Guided by universal data compression, we shall follow Shamir and Costello and
set the length of the block to minimize the average redundancy (MDL principle) which turns out to be slightly bigger than
log N, where N is the length of the DNA sequence in base pairs (bps).
Then we invoke the Stein-Ziv lemma (of hypothesis testing and universal data compression) and apply an
asymptotically optimal discriminant (based on
Jensen-Shannon) to determine whether two blocks are generated by the same
source. If the discriminant function is positive, then a change point (i.e., change of distributions)
between these blocks is expected. Since the length of a block is of order O(log N) bps, we subdivide
the blocks that potentially contain the change points
into subblocks of smaller length loglog N (to assure the best fit of data from the MDL principle point of
view). The optimal discriminant function is again applied to these
subblocks. Once the change points are found we compute the entropy of these
segments (between change points) to identify (hidden) sources that generate (segments of) the sequence.
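A minimal sketch of the Jensen-Shannon discriminant between two blocks (the decision threshold, which the paper derives via the Stein-Ziv lemma and the MDL principle, is left as a free parameter here):

    import numpy as np

    def dist(block, alphabet="ACGT"):
        counts = np.array([block.count(c) for c in alphabet], dtype=float)
        return counts / counts.sum()

    def js_divergence(p, q):
        """Jensen-Shannon divergence (base-2) between two distributions."""
        m = 0.5 * (p + q)
        kl = lambda a, b: np.sum(a[a > 0] * np.log2(a[a > 0] / b[a > 0]))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def change_point_suspected(block1, block2, threshold):
        return js_divergence(dist(block1), dist(block2)) > threshold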
The Cybertory Sequence File System (SFS) for Managing Large DNA Sequences
Carl E. McMillin and Robert H. Horton
Attotron Biosensor Corporation
Abstract
Many queries that researchers run against
DNA sequences involve matching short patterns against large
targets, e.g., finding restriction sites or transcription
factor binding sites in a genome. If patterns are known
beforehand, the target may be pre-indexed to facilitate
rapid execution of queries that combine several of the
smaller patterns. The Cybertory Sequence File System is a
cross-platform hybrid file-management/database designed for
storage and pre-indexing of large DNA sequences. Using a
combination of C++ and Java components, the SFS addresses
three major functional domains: "Resource Management" for
constraining runtime CPU and virtual-memory
utilization; "Data Translation/Segmentation" for minimizing
persistent-storage requirements and improving access
performance; and "Query Processing" for fast index- and
heuristics-based processing. Resource Management handles
dynamically allocated resources: CPUs and large memory
buffers, operating-system file handles, and
compressors/encoders are allocated, reassigned, and
destroyed as necessary. The Data Translation/Segmentation
domain imports and exports DNA sequence data from/to FASTA
files and adaptively encodes this data into an internal
binary format, organized for speed of access, flexibility
of use, and density of storage. Adaptable modules are used
for compression and encoding. The Query-Processing domain
searches for patterns in the imported data, persists the
results, and combines results in various ways. The source
code is available under the GNU General Public License. Funded by
NIH grant #R44 RR13645 02A2 to Attotron Biosensor
Corporation.
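To illustrate the pre-indexing idea at the simplest level (this is not the SFS's internal format), a toy k-mer position index over a target sequence:

    from collections import defaultdict

    def build_kmer_index(target, k=6):
        """Map every k-mer of `target` to the positions where it starts."""
        index = defaultdict(list)
        for i in range(len(target) - k + 1):
            index[target[i:i + k]].append(i)
        return index

    index = build_kmer_index("ACGTACGTGAATTC")
    print(index["GAATTC"])   # EcoRI restriction site -> [8]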
Implementing Parallel Hmm-pfam on the EARTH Multithreaded Architecture
Weirong Zhu, Yanwei Niu, Jizhu Lu and Guang R. Gao
Department of Electrical and Computer Engineering, University of Delaware
Abstract
Hmmpfam is a widely used bioinformatics program for
sequence classification, provided by Sean Eddy's lab at
Washington University in St. Louis. Using state-of-the-art
multithreading concepts, we implement a new
parallel version of hmmpfam on EARTH (Efficient
Architecture for Running Threads), an event-driven,
fine-grain multithreaded architecture and program
execution model. In order to parallelize
hmmpfam, we develop two parallel schemes: one pre-determines
the job distribution across all computing nodes using a
round-robin algorithm; the other takes advantage of the
dynamic load-balancing support of the EARTH Runtime System
2.5, which makes the job distribution completely transparent
to the user. In this poster, we analyze the hmmpfam program
and the different parallel schemes, and show our test
results on various computing platforms in comparison with
the PVM version of hmmpfam. When matching 250 sequences
against a 585-family HMMER database on 18 dual-CPU
computing nodes, the PVM version achieves an absolute
speedup of 18.50, while the EARTH version achieves 30.91, a
40.1% improvement in execution time. We also tested our program on
the advanced supercomputing cluster Chiba City at Argonne
National Laboratory. When the seqfile contains 38192
sequences, and the HMMer database has 50 families, the
EARTH version achieves an absolute speedup of 222.8 on 128
dual-CPU nodes, reducing the total execution time from 15.9
hours (for the serial program) to only 4.3 minutes.
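The first scheme is easy to sketch (a hypothetical helper, not the EARTH implementation):

    def round_robin(jobs, n_nodes):
        """Statically assign jobs to nodes in round-robin order."""
        buckets = [[] for _ in range(n_nodes)]
        for i, job in enumerate(jobs):
            buckets[i % n_nodes].append(job)
        return buckets

The dynamic scheme instead lets the runtime hand the next unprocessed sequence to whichever node is idle, which is what makes the distribution transparent to the user.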
Genome on Demand: Interactive Substring Searching
Tamer Kahveci and Ambuj K. Singh
University of California Santa Barbara
Abstract
Motivation: The explosive growth of genome databases makes similarity
search a challenging problem. Current search tools are
non-interactive in the sense that the user must wait a long
time while the entire database is inspected.
Results: We consider the problem of interactive string searching,
and propose two innovative k-NN (k-Nearest Neighbor) search
algorithms. For a given query, our techniques start
reporting the best results found so far immediately. Later,
these partial results are periodically refined depending on
the user satisfaction. We split the genome strings into
substrings, and maintain a small fingerprint for these
substrings. This fingerprint is then stored using an index
structure. We propose a new model to compute the distance
distribution of a query string to a set of strings with the
help of their fingerprints. Using this distribution, our
first technique orders the MBRs (Minimum Bounding
Rectangles) of the index structure based on their order
statistics. We also propose an early pruning strategy to
reduce the total search time for this technique. Our second
technique exploits existing statistical models to define
an order on the index structure MBRs. We also propose a
method to compute the confidence levels for the partial
results. Our experiments show that our techniques can
achieve 75% accuracy within the first 2.5-35% of the
iterations and 90% accuracy within the first 12-45% of
the iterations for both DNA and protein strings.
Furthermore, the reported confidence levels reflect the
quality of the partial results accurately.
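A toy sketch of the interactive behavior: cheap fingerprint distances order the candidate substrings, and an exact (expensive) distance refines a best-so-far list that is reported after every step. The k-mer-count fingerprints and plain edit distance below are simplified stand-ins for the paper's fingerprints and distance model:

    import numpy as np
    from itertools import product

    def fingerprint(s, k=2, alphabet="ACGT"):
        kmers = ["".join(p) for p in product(alphabet, repeat=k)]
        return np.array([s.count(m) for m in kmers], dtype=float)

    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def interactive_knn(query, substrings, k=5):
        """Yield progressively refined k-NN results, cheapest-looking first."""
        qf = fingerprint(query)
        order = sorted(range(len(substrings)), key=lambda i:
                       np.linalg.norm(qf - fingerprint(substrings[i])))
        best = []
        for i in order:
            best = sorted(best + [(edit_distance(query, substrings[i]), i)])[:k]
            yield [idx for _, idx in best]   # partial result, refined each step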
Automatic Selection of Parameters for Sequence Alignment
Jiangping Zhou and Gary Livingston
University of Massachusetts, Lowell
Abstract
We have shown that simulated annealing search can be used
to find close to optimal settings for parameters used in a
modified version of the DDS (DNA-DNA search) program. The
modified DDS algorithm is a very efficient biological
sequence alignment algorithm with linear space complexity.
However, the DDS algorithm appears to be seldom used. We
conjecture that this is because it has 7 parameters, or
cutoffs, that must be set manually, which requires
heuristics learned through experience and usually
involves several trials. The optimal settings for these
parameters are highly dependent on the type of sequence
being compared and on the sequences themselves. We use the
average high-scoring chain score to measure the "goodness"
of the resulting alignments, and then use a simulated
annealing algorithm to search automatically within a
space of parameter values to maximize this goodness
measure. We tested our method using pairs of DNA sequences
and compared our method to manually selecting parameters
for DDS. Our results show that close to optimal parameter
settings are very difficult to find manually. Our simulated
annealing search was able to find good settings for DDS
parameters for all DNA sequences we tried. A surprising
finding is that there may be several combinations of
parameters that yield close to optimal alignments. We
conclude that our approach can quickly and automatically
find parameters which will allow DDS to make close to
optimal alignments.
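A generic simulated annealing loop of the kind described, maximizing a goodness score over parameter settings; the score function, which would run the modified DDS and return the average high-scoring chain score, and the neighbor move are left abstract:

    import math, random

    def anneal(score, params, neighbor, t0=1.0, alpha=0.95, steps=500):
        """Maximize `score` by annealed local search over parameter settings."""
        cur, cur_s = params, score(params)
        best, best_s, t = cur, cur_s, t0
        for _ in range(steps):
            cand = neighbor(cur)
            s = score(cand)
            # Always accept improvements; accept worse moves with
            # probability exp((s - cur_s) / t), which shrinks as t cools.
            if s > cur_s or random.random() < math.exp((s - cur_s) / t):
                cur, cur_s = cand, s
                if s > best_s:
                    best, best_s = cand, s
            t *= alpha
        return best, best_s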
CoMRI: A Compressed Multi-Resolution Index Structure for Sequence Similarity Queries
Hong Sun, Ozgur Ozturk and Hakan Ferhatosmanouglu
Ohio State University
Abstract
In this paper, we present CoMRI (Compressed Multi-Resolution
Index), our system for fast sequence similarity search in
DNA sequence databases. We employ the concept of Virtual
Bounding Rectangles (VBRs) to build a compressed, grid-style
index structure. An advantage of the grid format over trees
is that subsequence location information is given by the
position of the corresponding VBR in the VBR list. Thanks to
VBRs, our index structure fits easily into a reasonable
amount of memory. Together with a new optimized
multi-resolution search algorithm, query speed is improved
significantly. Extensive performance evaluations on human
chromosome sequence data show that VBRs save 80%-93% of the
index storage size compared to MBRs, and that the new search
algorithm prunes almost all unnecessary VBRs, guaranteeing
efficient disk I/O and CPU cost. According to our
experiments, CoMRI is at least 100 times faster than MRS,
another grid index structure introduced very recently.
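As a 1-D stand-in for the idea that a flat, grid-ordered list of bounding boxes encodes subsequence locations implicitly (the actual index uses multi-resolution feature vectors):

    import numpy as np

    def build_window_boxes(signal, w=16):
        """One (min, max) box per consecutive window; the list position
        itself encodes the window offset: boxes[k] covers
        signal[k*w : k*w + w], so no location pointers need be stored."""
        return [(signal[i:i + w].min(), signal[i:i + w].max())
                for i in range(0, len(signal) - w + 1, w)]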
Systems Biology
Pathway Logic Modeling of Protein Functional Domains in Signal Transduction
C. Talcott, S. Eker, M. Knapp, P. Lincoln and K. Laderoute
SRI International
Abstract
Cells respond to changes in their environment through
biochemical signaling
pathways that detect and transmit information to effector
molecules within
different cellular compartments. Protein functional domains
(PFDs) are
consensus sequences within signaling molecules that
recognize and bind other
signaling components to make complexes.
Pathway Logic is an application of techniques from formal
methods to the
modeling and analysis of signal transduction networks in
mammalian cells.
These signaling network models are developed using
Maude [http://maude.cs.uiuc.edu], a symbolic language founded
on rewriting
logic. Models can be queried (analyzed) using the
execution, search and
model-checking tools of the Maude system. We show how
signal transduction
processes can be modeled using Maude at very different
levels of abstraction
involving either an overall state of a protein or its PFDs
and their
interactions.
Pathway Logic is an example of how formal modeling
techniques can be used to
develop a new science of symbolic systems biology. This
computational
science will provide researchers with powerful tools to
facilitate the
understanding of complex biological systems and accelerate
the design of
experiments to test hypotheses about their functions in vivo.
We will illustrate our approach using the biochemistry of
signaling
involving the ubiquitous Raf-1 serine/threonine protein
kinase. The
Raf-1 kinase is a proximal effector of EGFR and other RTK
signaling
through the ERK1/2 MAPK pathway, which contains the kinase
cascade
Raf-1 => Mek => Erk.
Examples of queries and graphical representations of the
results will be shown.
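Pathway Logic models are written in Maude itself; as a language-neutral illustration of the rewriting idea only (the rules and species names below are ours and grossly simplified, skipping Ras and other intermediates, and this is not Pathway Logic syntax), a toy forward-chaining run:

    # Each rule: (premise set of species, species produced when it fires).
    RULES = [
        ({"EGF", "EGFR"}, "active-Raf1"),
        ({"active-Raf1"}, "active-Mek"),
        ({"active-Mek"}, "active-Erk"),
    ]

    def run(state):
        """Fire rules until no rule adds anything new (a reachability query)."""
        changed = True
        while changed:
            changed = False
            for premise, product in RULES:
                if premise <= state and product not in state:
                    state.add(product)
                    changed = True
        return state

    print(run({"EGF", "EGFR"}))   # reaches active-Erk via the cascade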
Development of a Massively-Parallel, Biological Circuit Simulator
Richard Schiek and Elebeoba May
Sandia National Laboratories
Abstract
Genetic expression and control pathways can be successfully modeled as
electrical circuits. Given the vast quantity of genomic data, very large
and complex genetic circuits can be constructed. To tackle such
problems, the massively-parallel, electronic circuit simulator, Xyce (TM),
is being adapted to address biological problems. Unique to this
biocircuit simulator is the ability to simulate not just one or a set of
genetic circuits in a cell, but many cells and their internal circuits
interacting through a common environment.
Currently, electric circuit analogs for common biological and chemical
machinery have been created. Using such analogs, one can
construct expression, regulation and reaction networks. Individual
species can be connected to other networks or cells via non-diffusive or
diffusive channels (i.e. regions where species diffusion limits mass
transport). Within any cell, a hierarchy of networks may exist operating
at different time-scales to represent different aspects of cellular
processes.
Though still under development, this simulator can already model
interesting biological and chemical systems. Prokaryotic genetic and
metabolic regulatory circuits have been constructed and their
interactions simulated for Escherichia coli's tryptophan biosynthesis
pathway. Additionally, groups of cells each containing an internal
reaction network and communicating via a diffusion limited
environment can produce periodic concentration waves. Thus, this
biological circuit simulator has the potential to explore large, complex
systems and environmentally coupled problems.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a
Lockheed Martin Company, for the United States Department of Energy
under Contract DE-AC04-94AL85000.
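As a toy illustration of the circuit analogy (not Xyce input): two mutually repressing genes written as leaky integrators, with Hill-type production playing the role of a controlled current source and first-order degradation that of a resistive leak; parameters are arbitrary:

    from scipy.integrate import solve_ivp

    def toggle(t, x, k=4.0, n=2.0, deg=1.0):
        a, b = x
        dadt = k / (1 + b**n) - deg * a   # b represses production of a
        dbdt = k / (1 + a**n) - deg * b   # a represses production of b
        return [dadt, dbdt]

    sol = solve_ivp(toggle, (0, 50), [1.0, 0.2])
    print(sol.y[:, -1])   # from an asymmetric start, one gene dominates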
Representing and Reasoning About Signal Networks: An Illustration Using NFkappaB Dependent Signaling Pathways
Chitta Baral1, Karen Chancellor1,
Nam Tran1 and Nhan Tran2
1Arizona State University
2Translational Genomics Research Institute
Abstract
We propose a formal language to represent and reason about
signal transduction networks. The existing approaches using
graphical representations, Petri nets, and computer formal
languages fall short in many ways, and our work suggests that
an artificial intelligence (AI) approach is well suited for
the task. We applied a form of action language to represent
and reason about NFkappaB dependent signaling pathways. Our
language supports essential features of signal transduction
knowledge, namely reasoning with partial (or incomplete)
knowledge, non-monotonic reasoning, reasoning with
uncertainty, reasoning about triggered evolutions of the
world and elaboration tolerance. We selected NFkappaB
dependent signaling as the test bed because of its
increasingly important role in cellular functions. NFkappaB
is a central mediator of the immune response. It can also
regulate stress responses, as well as cell death/survival,
in several cell types. While many extracellular signals may
lead to the activation of NFkappaB, few of the related
pathways have been elucidated. We can then study different problems:
representation of pathways, reasoning with pathways,
planning to alter the outcomes, and predicting new pathways.
We show that all these problems can be well formulated in
our framework. Many interesting problems also emerge. Taken
together, our work shows that a formal representation of
signal networks is feasible and practical with AI reasoning
approaches. The work also shows that it is important to be
able to formally formulate the signal transduction problems,
as it provides a solid research methodology as well as
promising research directions.
Noise Attenuation in Artificial Genetic Networks
Yoshihiro Morishita and Kazuyuki Aihara
University of Tokyo, Japan
Abstract
Owing to advances in genetic engineering techniques, we
are now able to artificially create genetic networks that
have specific functions such as switching and oscillation.
This area is very important from the viewpoint of
engineering and medical applications.
However, the dynamics of genetic networks are known to be
noisy and unstable because of the discreteness and
stochasticity that arise from the small numbers of
molecules involved.
Exploring methods to control these fluctuations is therefore
an important problem in designing networks that operate with
high stability and reliability.
In this study, we propose a new and plausible method for
controlling fluctuations in artificial genetic networks.
The main idea is that fluctuations in gene expression are
reduced through interactions between the synthesized
proteins and noise-attenuating molecules, which are
designed, on the basis of the proteins' domain structure,
to interact with them specifically.
We analytically derive the noise-attenuation level by
approximating the master equation that governs the dynamics
of a single gene's expression.
We also show that the amount of noise attenuator and the
rates of its interaction with the proteins are important for
the attenuation level.
In addition, we demonstrate the method's efficiency in
stabilizing system dynamics with multiple stable equilibrium
points, using the toggle switch model.
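For intuition about the fluctuations being attenuated, a minimal Gillespie simulation of a single birth-death expression process; the attenuator of the proposed method would add reversible binding reactions, omitted here:

    import numpy as np

    def gillespie_birth_death(k_prod, k_deg, t_end, seed=0):
        """Exact stochastic simulation of production and degradation
        of one molecular species; returns the sampled trajectory."""
        rng = np.random.default_rng(seed)
        t, n, traj = 0.0, 0, []
        while t < t_end:
            rates = np.array([k_prod, k_deg * n])
            total = rates.sum()
            t += rng.exponential(1.0 / total)
            n += 1 if rng.random() < rates[0] / total else -1
            traj.append((t, n))
        return traj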
Computational Inference of Regulatory Pathways in Microbes: An Application to Phosphorus
Assimilation Pathways in Synechococcus WH8102
Z. Su1, P. Dam1, X. Chen2,
V. Olman1, T. Jiang1, B. Palenik3 and Y. Xu1
1Protein Informatics Group, Divisions of Life Sciences and Computer Sciences & Mathematics,
Oak Ridge National Laboratory
2Department of Computer Science and Engineering, Univ. of California at Riverside
3Scripps Institute of Oceanography, University of California at San Diego
Abstract
We present a computational protocol for inference of
regulatory and signaling pathways in a microbial cell,
through mining "high-throughput" biological data of various
types, literature search, and computer-assisted human
inference. This protocol consists of four key components:
(a) construction of template pathways for microbial gnomes
related to the target genome, which have either been
thoroughly studied or had a significant amount of relevant
experimental data, (b) inference of target pathways for the
target genome, by combining the template pathway models and
target genome-specific information, (c) assignment of
genes of the target genome to each individual functional
roles in the pathway model, and (d) validation and
refinement of the pathway models using pathway-specific
experimental data or other information. To demonstrate the
effectiveness of this procedure, we have applied this
computational protocol to the construction of the
phosphorus assimilation pathways in Synechococcus WH8102.
In this paper, we present a model of the core component of
this pathway and provide our justification in support of
the predicted model.
Miscellaneous
Development and Assessment of Bioinformatics Tools to Enhance Species Conservation and Habitat Management
Melanie A. Sutton, Lori Deneke, John Eme, Wayne Bennett and Frank
Wray
Departments of Biology and Computer Science, University of West Florida
Abstract
This project encompasses our interdisciplinary approach to
integrating computational methods into the knowledge-
discovery process associated with understanding biological
systems impacted by the loss or destruction of sensitive
habitats. Data mining is used to intelligently query
databases to extract meaningful information and to
elucidate broad patterns that facilitate overall data
interpretation. Developed visualization techniques present
mined data in ways where context, perceptual cues, and
spatial reasoning skills can be applied to help reveal
potential impacts of conservation efforts and to uncover
significant trends in behavioral patterns, habitat use,
species diversity, and community composition. In this
context, we explore the following systems:
Marginal Fish Habitats: This project involves the
development of an Internet-searchable database for data
associated with the investigation of fish diversity and
community structure in coral reefs in Indonesia. This
database will be used to assist in conducting
biodiversity assessments of a sensitive habitat,
facilitating the development of resource management
practices for reef-associated systems.
Beach Mouse Communities: This project involves the
development of an Internet-searchable database and
innovative multi-media-based visualization strategies
(e.g., virtual tours) for tracking the population
dynamics of the Santa Rosa Beach Mouse (SRBM). The
bioinformatics tools associated with this project will be
used to study how habitat fragmentation impacts the
population dynamics of the SRBM, with the goal of
facilitating the development of better management
practices for related subspecies.
The design and assessment of our tools emphasizes support
for basic research as well as the creation of mechanisms
for using the tools to support regional and international
inter-institutional educational objectives in the areas of
pattern recognition and human-computer interaction.
MedfoLink: Bridging the Gap between IT and the Medical Community
Vezen Wu1, Mitchell Berman, M.D.2, Armen Kherlopian3 and
Joseph Gerrein3
1Columbia Univ. IT Programs
2Columbia Presbyterian Hospital
3Columbia Univ. Biomedical Engineering
Abstract
MedfoLink is a new software technology that applies a
novel design to address the medical-records issues that are
now part of the national agenda. The majority of medical
records are still kept on paper, which raises problems of
patient privacy, lost patient histories, and clerical errors
that can cause deaths. Our software
overcomes the vocabulary and performance limitations that
hinder the adoption of existing medical language processing
technology for handling medical records. MedfoLink is a
Java technology that uses medical language processing and
the UMLS to enable a computer to accurately record and
interpret data from patient records. Benefits include
security to ensure patient privacy, consolidated patient
histories, real-time access to patient records, and the
elimination of clerical errors. Applications range from
enhanced individual patient care to public health concerns
like bioterrorism.
MedfoLink unifies different technologies to transform
the data from patient records into a relational database
for computer analysis. MedfoLink draws its knowledge from
the Unified Medical Language System (UMLS), a medical
language database created by the NIH. MedfoLink's design
incorporates two novel approaches: 1. intelligent
algorithms that improve its comprehension of medical records
with practice; 2. an Adaptive Learning Interface (ALI)
architecture that enhances performance by interfacing with
existing technologies. MedfoLink generates reports
containing statistics on patient data, which will improve
the ability of healthcare professionals to identify trends
in the patient population.
Our interdisciplinary team from Columbia University
developed MedfoLink to address both a technological
challenge and a market need, with broad implications for
saving lives.
TC-DB: A Membrane Transport Protein Classification Database
Can Van Tran, Nelson M. Yang and Milton H. Saier Jr.
University of California, San Diego
Abstract
TC-DB is a comprehensive relational database containing
functional and
evolutionary information on transmembrane transport proteins. The
database contains data extracted from more than 9000 references,
covering approximately 3000 representative proteins, classified in over
380 different families. TC-DB is the primary resource for the retrieval
of transporter data classified by the TC system. The TC system, which has
been adopted by the IUBMB (International Union of Biochemistry and
Molecular Biology), is analogous to the EC (Enzyme Commission)
system's classification of enzymes. The functional ontology developed
for TC-DB serves as an infrastructure to develop powerful queries that
yield new biological insight. The TC-DB website offers a plethora of
software tools that facilitate the analysis of membrane transporters.
Multiple avenues of access are supported by the web interface
including classification drill-down, parametric searching, and full-
text searching. Various uses of TC-DB include software applications for the
annotation of transport proteins in newly sequenced genomes as well as
tools to trace evolutionary pathways of transport proteins. The
database as well as web services are accessible free of charge at http://
tcdb.ucsd.edu.