Poster Abstracts for
Section A: Poster Sessions I and IV
Data Integration
A Semantic Mediation Approach for Problems in Computational Molecular Biology
A Laboratory Information Management System for Flow Cytometry
Supporting Scientific Discovery with a Scientific Query Language
Data Mining
Application of Singular Value Decomposition and Functional Clustering to Analyzing
Gene Expression Profiles of Renal Cell Carcinoma
Statistical Resynchronization and Bayesian Detection of Periodically Expressed Genes
Understanding Macrophage Signaling Pathways through DNA Microarray Data Using
Competitive Learning Neural Network and Fuzzy Logic Data Mining Approach
A Method for Tight Clustering: With Application to Microarray
Riptide: Fast Protein Identification from Mass Spectrometer Data
Accelerating the Drug Design Process through Parallel ILP Data Mining
Estimating Recombination Rate Distribution by Optimal Quantization
Statistical Issues in the Analysis of the CGH Data
Identification of Contaminants in Proteomics Mass Spectrometry Data
A Flexible Pipeline for Experimental Design, Processing, and Analysis of Microarray Data
Computational Approach to Reconstructing Gene Regulatory Networks
Probability Profiles: Novel Approach in Mass Spectrometry De Novo Sequencing
A Symbolic Logic Strategy For Mapping Biological Concepts in the HERBE Prototype
A Computational Method for Assessing Peptide-Identification Reliability in
Tandem Mass Spectrometry Analysis with SEQUEST
Functional Genomics
Tomato Expression Database (TED): An Interactive Management Tool for
Tomato Expression Profiling Data
Oncogenetic Tree Models: An Estimation
The GeneCards Family of Databases: GeneCards, GeneLoc, GeneNote and GeneAnnot
Wavelet Transforms for the Analysis of Microarray Experiments
MageBuilder: A Schema Translation Tool for Generating MAGE-ML from Tabular Microarray Data
OptiRNAi, a Web-based Program to Select siRNA Sequences
A Rule-based Framework for Gene Regulation Pathways Discovery
Gene Function and Metabolic Pathways in the Yeast
Using Rule Induction to Analyze Gene Expression Data
Genotyping and SNPs
SNP Analyzing System for Detecting Multifactorial Disease Associated Site
Haplotype Pattern Mining and Classification for Detecting Disease Associated SNPs
Strings, Graphs, and Algorithms
Genetic Algorithm Approach to the Center Sequence Problem Arising in
Post-transcriptional Gene Silencing
GenericBioMatch: A Novel Generic Pattern Match Algorithm for Biological Sequences
Aligning ESTs to Genome Using Multi-Layer Unique Markers
A Parallel Evolutionary Method for Physical Mapping of Chromosomes
On Finding Common Intervals among k Protein Sequences Containing Redundant Elements
Algorithms for Bounded-Error Correlation of High Dimensional Data in
Microarray Experiments
Exact and Heuristic Algorithms for the DNA Fragment Assembly Problem
Finding Higher Order Motifs under the Levenshtein Measure
Mapping Discontinuous Antibody Epitopes to Reveal Protein Structure
and Changes in Structure Related to Function
The SCP and Compressed Domain Analysis of Biological Sequences
Identifying Regulatory Signals in DNA-sequences with a Non-statistical
Approximation Approach
A Genetic Algorithm for Simplifying the Amino Acid Alphabet
Structural Biology
Spectral Decomposition of the Laplacian Matrix Applied to RNA Folding Prediction
Identification of Non-random Patterns in Structural and Mutational
Data: The Case of Prion Protein
Multiple Protein Structure Alignment by Deterministic Annealing
An Exact Algorithm For Determining Protein Backbone Structure
From NH Residual Dipolar Couplings
Automatic Construction of 3D Structural Motifs for Protein Function Prediction
Local Minima-Based Exploration for Off-Lattice Protein Folding
Text Mining and Ontologies
HyBrow: A System for Hypothesis Validation
Identifying Gene and Protein Names from Biological Texts
Refining the Extraction of Relevant Documents from Biomedical
Literature to Create a Corpus for Pathway Text Mining
Using Natural Language Processing and the Gene Ontology to
Populate a Structured Pathway Database
Text Visualization for Natural Language Analysis of Biology Full Text
Data Integration
A Semantic Mediation Approach for Problems in Computational Molecular Biology
M. Chagoyen, M.E. Kurul, P.A. De-Alarcón, S. Santini, B. Ludäscher,
J.M. Carazo, and A. Gupta
Centro Nacional de Biotecnologia - CSIC, Madrid, Spain
Abstract
A computational method of problem solving needs to
interleave information access and algorithm execution by
putting them together in a problem-specific "workflow."
In a complex domain like molecular biology, this workflow often
involves executing a number of algorithms requiring access
to multiple information sources (or multiple times to the
same information source) to provide parameters at different
steps of its execution. To translate this requirement into
reality, one needs to develop a general-purpose system for
scientific workflows that contains an information management
component providing access to a wide variety of information
sources, including local databases, public data banks with
a query interface, web sites that publish information
without any explicit means to "query" them, and
computational resources. Equally important is the need to ensure that
different pieces of information acquired from different
information sources are semantically compatible so that a
consumer of information in the workflow can meaningfully
accept the information object produced by some other source
at a prior step of execution.
We present two system components that are parts of a larger
computation infrastructure to execute a scientific workflow.
The first is a programmable integrator that can access
information from multiple, heterogeneous sources in a
concerted manner. The second component is a semantic
mediator that can integrate information from different
information sources by bridging over the semantic
differences among them. The key characteristic of this
mediator is its ability to encode and execute
domain-specific expert rules in order to join information
from multiple sources.
A Laboratory Information Management System for Flow
Cytometry
Jeffrey D. Grant, Elizabeth Goralczyk, Frank J. Manion, R.
Randy Hardy,
J. Robert Beck, and Michael F. Ochs Fox Chase Cancer
Center, Philadelphia, PA
Abstract
Flow cytometry has evolved into an important tool for
biological and medical research. In order to maximize
throughput on a cytometer and fully explore data sets,
scientists need a way to design protocols (including cell
types and fluorochrome-stain combinations), have these
protocols conveyed to the flow cytometer, review their
experiments, and efficiently analyze results using
visualization tools. To meet these needs, we have created a
Laboratory Information Management System (Flow LIMS) that
enables users to design and manage experimental protocols
for the Becton Dickinson DiVA flow cytometer. Results are
accessed by the automatic linking of FCS output files to the
Flowjo application (Tree Star, Inc., San Carlos, CA) for
visual analysis. At the core of the Flow LIMS is a
collection of Java classes and JavaServer Pages that handle
user authentication using LDAP, record and search protocol
details in an Oracle database, and generate XML files for
conveying protocol details to the DiVA machine. A Perl
script periodically copies all DiVA output to a regularly
backed-up archive space. We are currently at work
integrating the Flow LIMS with the National Cancer
Institute's caLIMS and caBIO, which will allow the LIMS to
easily access NCI Center for Bioinformatics resources
including single nucleotide polymorphisms (SNPs), tissue
specific expression information, molecular targets of
specific therapeutics, details of signaling pathways and
protein interactions, and genomic data. Ultimately, the Flow
LIMS will become part of a larger LIMS that will encompass
many types of biological data including microarray
and proteomics data, among others.
Supporting Scientific Discovery with a Scientific Query Language
Barbara Eckman1, Kerry Deutsch2, Marta Janer2,
Zoé Lacroix3 and Louiqa Raschid4
1IBM Life Science
2Systems Biology
3Arizona State University
4University of Maryland
Abstract
The deficiencies of relational databases for supporting
biological investigations have often been noted. The
relational data model is too impoverished to accurately
represent richly hierarchical biological structures. In a
federated context, wide-ranging multi-source queries,
particularly those containing joins on the results of BLAST
searches, often return unmanageably large result sets,
requiring non-traditional methods to identify and exclude
extraneous data. Such filtering often requires more
sophisticated conditions than the relational algebra can
express, e.g., regular expression pattern matching. It is
widely accepted that in order to support biological
investigations a query language must provide functions
wrapping analytic tools not supported by the relational
query languages (e.g., sequence comparisons, motif
searches, chemical sub-structure searches). In addition,
the mapping between relational operators and the biological
semantics of a scientific query is not obvious. For
example, querying a table of gene expression results for
genes relatively specific to pancreas (i.e., expressed in
pancreas and only, say, 3 other tissues) requires a sub-query with GROUP BY and HAVING operators; the ubiquity of
variant types in biological databases often requires outer
joins and the use of functions like coalesce(A,B), which
returns A if it is non-null, but B otherwise; and since
greater confidence is placed in data attested by multiple
sources, biologists often want to return only objects that
are found in multiple selection sets defined by multiple
sub-queries. On the other hand, a query language is
attractive for encoding biological investigations and may
be relatively easy to optimize. A declarative language is
also in theory easier for biologists to use in implementing
their own investigations, without relying on professional
programmers. The scientific query language proposed in the
poster is composed of four operators: collection,
filtering, ranking and validation. We illustrate our
approach with the study of SNP profiles to characterize
diseases.
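As a toy illustration of the pancreas example above (not the authors' system), the GROUP BY/HAVING logic and the coalesce function can be sketched in plain Python over hypothetical (gene, tissue, level) records; the expression threshold and data are invented.

```python
# Sketch of the "relatively pancreas-specific" query and coalesce(A, B),
# over hypothetical (gene, tissue, level) records; not the authors' system.
from collections import defaultdict

records = [
    ("INS", "pancreas", 9.1), ("INS", "liver", 0.2),
    ("GAPDH", "pancreas", 5.0), ("GAPDH", "liver", 4.8), ("GAPDH", "brain", 5.2),
    ("GAPDH", "lung", 4.9), ("GAPDH", "kidney", 5.1), ("GAPDH", "heart", None),
]

def coalesce(a, b):
    """SQL-style coalesce: return a if it is non-null, otherwise b."""
    return a if a is not None else b

# Genes expressed in pancreas and in at most 3 other tissues:
# the GROUP BY ... HAVING logic expressed as a dictionary aggregation.
tissues_by_gene = defaultdict(set)
for gene, tissue, level in records:
    if coalesce(level, 0.0) > 1.0:        # invented "expressed" threshold
        tissues_by_gene[gene].add(tissue)

specific = [g for g, ts in tissues_by_gene.items()
            if "pancreas" in ts and len(ts - {"pancreas"}) <= 3]
print(specific)                           # ['INS'] for this toy data
```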
Data Mining
Application of Singular Value Decomposition and Functional Clustering
to Analyzing Gene Expression Profiles of Renal Cell Carcinoma
Zhong-Hui Duan, Joseph DiDonato, Louis S. Liou and Ting Shi
University of Akron
Abstract
The microarray gene expression profiles of both clear cell
renal cell carcinoma tissues and a renal cell carcinoma
cell line were analyzed using singular value decomposition
and functional clustering analysis. The expression matrix
of the tissue and cell line samples based on the 3145
selected genes was decomposed using singular value
decomposition. The 2-D projection of the expression
profiles revealed significant differences between the
profiles of renal cell carcinoma tissue and cell line
samples. Selected genes were annotated based on biological
processes and clustered into functional groups. The
expression levels of genes in each group were analyzed. The
analyses revealed remarkable gene expression alterations in
the biological pathways of renal cell carcinoma and
provided insights into understanding the molecular
mechanism of renal cell tumorigenesis. Based on the
expression levels, 74 commonly differentially expressed
genes in renal cell carcinoma tissue samples were
identified. This group of genes may be useful as molecular
tumor markers that can potentially be used for more
accurate diagnosis, prognosis and possibly can serve as
drug targets for effective therapies. The detailed
comparison of gene expression patterns in renal cell
carcinoma tissue samples and the renal cell carcinoma cell line
showed significant differences between the two types of
samples, although many important expression patterns were
consistent.
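For readers unfamiliar with the technique, a minimal numpy sketch of the 2-D SVD projection step follows; the random matrix merely stands in for the paper's 3145-gene expression matrix.

```python
# Sketch of a 2-D SVD projection of an expression matrix (genes x samples);
# dimensions are illustrative, not the paper's actual dataset.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3145, 20))          # rows: genes, columns: samples

# Center each gene, then decompose: X = U S V^T.
Xc = X - X.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project the samples onto the first two left singular vectors
# ("eigengenes"); plotting these coordinates separates sample groups.
coords = (np.diag(S[:2]) @ Vt[:2, :]).T   # shape: (20 samples, 2)
print(coords.shape)
```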
Statistical Resynchronization and Bayesian Detection of Periodically
Expressed Genes
Xin Lu, Wen Zhang, Zhaohui S. Qin, and Jun S. Liu
Department of Statistics, Harvard University
Abstract
A major challenge of identifying periodically expressed
genes from cell-cycle microarray data is to eliminate the
substantial amount of noise caused by synchrony decay,
block/release effects and experimental variation. We
propose a Periodic-Normal Mixture (PNM) model to improve the
reliability of measuring cell-cycle length and gene
expression periodicity by re-synchronizing the
transcription profiles with progressively mixed Fourier
decomposition. Subsequently, we devised a two-component
Beta mixture model to approximate the PNM fitting residues
of three datasets (cdc28, alpha and cdc15) in order to
combine information gathered from different experiments
and detect genes that are periodically expressed. The
results show that about 1780 genes (32.3% of the 5510
genes examined) are likely to be periodically expressed.
We identified 822 genes whose posterior probability of
periodicity is greater than 0.95, among which 282 are
absent from the list of 800 genes reported by Spellman et al. (1998). When
matching the 822 resynchronized expression profiles of the
three independent experiments, little phase shift was
observed, which indicated that the three synchronization
methods bring cells to the same cell-cycle phase at the
time of release. The consensus expression profiles of the
822 genes were elucidated by weighted average of the
expression profiles in the three datasets, and the
expression stage of these genes was estimated by matching
the consensus profiles with typical profiles drawn from
five groups of well-known phase-specific genes. Checking
with the Gene Ontology annotation, it was observed that
many of the 822 genes are involved in important cell-
cycle-related processes, functions and components.
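A much-simplified sketch of the underlying periodicity measurement is given below: it fits a single Fourier component at a trial cell-cycle period by least squares and reports the variance explained. The full PNM model with progressive mixing and the Beta mixture for residues is not reproduced; the time course and noise level are invented.

```python
# Hedged sketch: least-squares fit of one Fourier component at a trial period.
import numpy as np

def periodicity_fit(expr, times, period):
    w = 2 * np.pi / period
    A = np.column_stack([np.ones_like(times), np.cos(w * times), np.sin(w * times)])
    coef, *_ = np.linalg.lstsq(A, expr, rcond=None)
    resid = expr - A @ coef
    return 1 - resid.var() / expr.var()     # fraction of variance explained

times = np.arange(0, 120, 7.0)              # minutes, e.g. an alpha-factor series
expr = (np.sin(2 * np.pi * times / 60)
        + 0.2 * np.random.default_rng(3).normal(size=times.size))
print(round(periodicity_fit(expr, times, period=60.0), 2))  # near 1 for periodic genes
```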
Understanding Macrophage Signaling Pathways through DNA Microarray Data
Using Competitive Learning Neural Network and Fuzzy Logic Data Mining Approach
Xin Feng, Chin-Fu Chen and Peter J. Tonellato
Marquette University, Dept. of Electrical and Computer Engineering
Abstract
Understanding the response of human white blood cells
(macrophages) to pathogens may provide insights to both the
mechanisms of host defenses and the tactics used by
pathogens to circumvent these defenses. The DNA microarray
method has evolved to become one of the most powerful tools
to understand the dynamics of gene response. Currently
there is no standard approach to systematically analyze the
data and the interpretation of results can vary
dramatically. In this paper, we employed a new combined
Competitive Learning neural network and fuzzy logic data
mining approach to explore patterns of time sequence data
from eight bacteria with 977 gene responses that showed
significant changes on a microarray chip. We annotated the
genes by their "biological processes" of Gene Ontology
(GO). Our result suggests that the CL-Fuzzy logic data
mining approach is effective in exploring how human
macrophages respond to each bacterium with a unique
combination of shared biological processes. The shared
processes include: signal transduction, transcription,
metabolism, and cell cycle and proliferation. Our result
also suggests that there are similar responses (identical
genes) to several bacteria and the similarities may be
related to the shared mechanism of bacterial pathogenesis.
A Method for Tight Clustering: With Application to Microarray
George C. Tseng and Wing H. Wong
Department of Biostatistics, Harvard University
Abstract
In this paper we propose a method for clustering that produces tight
and stable clusters without forcing all points into clusters. The
methodology is initially motivated from cluster analysis of microarray
experiments. Many existing clustering algorithms have been applied
in microarray data to search for gene clusters with similar expression
patterns. However, none has provided a way to deal with an essential
feature of array data: many genes are expressed sporadically and do
not belong to any of the significant biological functions (clusters)
of interest. In fact, most current algorithms aim to assign all genes
into clusters. For many biological studies, however, we are mainly
interested in the most informative, tight and stable clusters with
sizes of, say, 20-60 genes for further investigation. Tight Clustering
has been developed specifically to address this problem. It applies
K-means clustering as an intermediate clustering engine. Early truncation
of the hierarchical clustering tree is used to overcome the local minimum
problem in K-means clustering. The tightest and most stable clusters
are identified in a sequential manner through an analysis of the tendency
of genes to be grouped together under repeated resampling. We validated
this method in a simulated example and applied it to analyze expression
profiles of the Drosophila life cycle. The result of this new
method is shown to better serve biological needs in microarray analysis.
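A minimal sketch of the resampling idea follows, using scikit-learn's K-means as the intermediate engine: gene pairs that repeatedly co-cluster under subsampling form tight-cluster candidates. The subsample fraction, k, and threshold are illustrative, not the authors' settings.

```python
# Sketch: accumulate a co-membership matrix over K-means runs on subsamples.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))           # 200 genes x 10 conditions (toy data)
n, k, n_resamples = X.shape[0], 8, 30

co = np.zeros((n, n))
for _ in range(n_resamples):
    idx = rng.choice(n, size=int(0.7 * n), replace=False)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
    for c in range(k):
        members = idx[labels == c]
        co[np.ix_(members, members)] += 1

co /= n_resamples
# A "tight" cluster is a set of genes whose pairwise co-membership is high
# (e.g. > 0.9); sporadically expressed genes never reach this and stay unassigned.
```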
Riptide: Fast Protein Identification from Mass Spectrometer Data
Richard J. Carter
Hewlett Packard Labs
Abstract
The biotech firm Target Discovery Inc. (TDI) has developed
a 100X faster approach to producing a fragmentation mass
spectrum from a labeled protein sample. This speed
increase has put pressure for similar speed improvements in
the company's algorithm for determining the unknown
protein's terminal amino acid sequence from its mass
spectrum. This work presents the novel algorithm Riptide,
which demonstrates a 125X speed improvement over TDI's web-
published algorithm as measured from coded C programs.
Given a mass spectrum dataset of an unknown protein, TDI's
algorithm assigns a score to each of the 20^n sequence
possibilities for the protein's n terminal amino acids (for
some desired n). The highest scoring sequence is reported
as the most likely terminal identity of the protein.
Central to the TDI algorithm is a mass spec look-up
function MS() that reports a normalized occurrence count
for a given mass/charge based on the current dataset.
Riptide, by design, produces an identical answer to TDI's
algorithm with far less computational complexity. Riptide
capitalizes on the fact that the MS() return value,
although dependent on the combination of amino acids in the
fragment being evaluated, is independent of the amino acid
order. By performing the MS() call once for a given amino
acid combination, Riptide is able to reduce the MS() calls
from 49 million to 177K for the desirable case of a 6-deep
sequencing. In addition, the mean and standard deviation
score parameters required by the normalization operation
can be calculated efficiently in Riptide's combination
space sequencing approach.
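The combinatorial saving can be checked directly: because MS() depends only on the amino-acid combination, one evaluation per multiset suffices, and C(25, 6) = 177,100 matches the 177K figure above. A small Python check (the alphabet and depth are the standard 20 amino acids and the 6-deep case named in the abstract):

```python
from itertools import combinations_with_replacement
from math import comb

AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
n = 6                                  # depth of terminal sequencing

# Number of amino-acid combinations (multisets) of length 6:
print(comb(len(AA_ALPHABET) + n - 1, n))     # 177100, the "177K" above

# Riptide-style idea: evaluate MS-based scoring once per combination;
# every ordered candidate reuses the value for its sorted residue multiset.
combos = list(combinations_with_replacement(AA_ALPHABET, n))
print(len(combos))                           # 177100
```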
Accelerating the Drug Design Process through Parallel ILP Data Mining
James Graham1, C. David Page2 and Ahmed Kamal3
1Univ. of Wisconsin
2Systems Biology
3Western Kentucky Univ.
Abstract
Present drug design methodologies often require several
years in the initial chemical evaluation stages before
compounds can be formulated for animal and human testing.
This poster presents a new parallel approach for inductive
logic programming (ILP) that has the potential to streamline
this process. The system is based on effective data
partitioning and message passing in the parallel ILP search
protocols. Two of the key design features of this approach
are its portability between different parallel machine
architectures, and the ease of parallelizing important ILP
search problems. The system has been tested on a Beowulf
cluster and on an IBM SP2 supercomputer to successfully
search for new pharmacophores for a group of
angiotensin-converting enzyme (ACE) inhibitor drugs. The
system has shown an approximately linear speedup of 60 on a
64 processor machine during our initial testing. This
parallel ILP architecture should be applicable to a wide
range of other bioinformatics and biochemistry problems, in
addition to pharmacophore identification.
Estimating Recombination Rate Distribution by Optimal Quantization
Mingzhou Song1, Stephane Boissinot2, Robert M. Haralick3 and Ihsin T. Phillips1
1Department of Computer Science, Queens College, Flushing, NY 11367
2Department of Biology, Queens College, Flushing, NY 11367
3Doctoral Program in Computer Science, Graduate Center,
City University of New York, New York, NY 10016
Abstract
Evolutionary biologists are interested in a high resolution
recombination map that depicts accurately how often a
recombination event occurs at a specific location in the
genome. With the availability of the human genome physical map
and fast sequencing technology, researchers have started to estimate
recombination rate distributions. We obtain recombination
rate distribution functions for all the chromosomes in the
human genome using an optimal quantization method. In this
method, we are able to explicitly control over-fitting
and under-fitting. The obtained piecewise-constant
recombination rate distribution functions are convenient to
store and retrieve. Our experimental results showed more
abrupt distribution functions than two recently published
results. In the previous results, the over-/under-fitting
issues were not addressed explicitly. We also achieved better
quantitative performance than the Parzen window used in a
previous approach. This suggests that optimal
quantization may be of great advantage for estimating other
genome feature distributions.
Statistical Issues in the Analysis of the CGH Data
Jane Fridlyand, Antoine Snijders, Donna Albertson and Ajay Jain
UCSF Cancer Center, San Francisco, CA
Abstract
The development of solid tumors is associated with the acquisition of complex
genetic alterations, indicating that failures in the mechanisms that
maintain the integrity of the genome contribute to tumor evolution. Thus,
one expects that the particular types of genomic derangement seen in tumors
reflect underlying failures in maintenance of genetic stability, as well as
selection for changes that provide growth advantage. In order to investigate
genomic alterations we are using microarray-based comparative genomic
hybridization (array CGH). The computational task is to map and characterize
the number and types of copy number alterations present in the tumors, and so
define copy number phenotypes as well as to associate them with known
biological markers.
To utilize the spatial coherence between nearby clones, we use an unsupervised Hidden
Markov Model approach. The clones are partitioned into states that represent
the underlying copy number of groups of clones. We discuss the model and demonstrate
its performance using simulated and real data. The biological conclusions drawn
from the analyses are discussed.
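As a rough modern stand-in for the authors' unsupervised HMM (not their implementation), a Gaussian HMM from the third-party hmmlearn package can segment a log2-ratio profile into copy-number states; the toy profile and state count below are invented.

```python
# Sketch: segment a CGH log2-ratio profile with a Gaussian HMM (hmmlearn).
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(2)
# Toy profile: 100 normal clones, a 30-clone gain, then 70 normal clones.
ratios = np.concatenate([rng.normal(0.0, 0.1, 100),
                         rng.normal(0.6, 0.1, 30),
                         rng.normal(0.0, 0.1, 70)])

model = GaussianHMM(n_components=3, covariance_type="diag", n_iter=100)
model.fit(ratios.reshape(-1, 1))             # clones ordered along the genome
states = model.predict(ratios.reshape(-1, 1))
# Runs of identical states partition the clones into loss/normal/gain segments.
```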
Identification of Contaminants in Proteomics Mass Spectrometry Data
M. Duncan1, K. Fung1, H. Wang2, C. Yen2 and K.
Cios2
1University of Colorado Health Sciences Center
2University of Colorado at Denver
Abstract
Mass spectrometry (MS) is a widely used method for protein
identification. Peptide mass fingerprinting is the protein
identification technique in which MS is employed to
determine the masses of peptide fragments generated
following enzymatic digestion of proteins. The masses of
peptides are then submitted to a recognition program,
e.g., MASCOT or MSFIT, for identification of a protein.
The strategy is hampered, however, because not only are
the peptide masses determined, but also the masses of
multiple contaminants that are also present in the sample.
Although the masses of some common and known contaminants
are removed (e.g., peptides generated by trypsin
autolysis), many others are inadvertently incorporated
into the analysis. In this paper we present an approach
for automatic identification of contaminant masses so that
they can be removed prior to the identification process.
For this purpose we have developed an algorithm that
clusters mass values. We calculate the frequencies of all
masses and then identify contaminants. We propose that
masses with frequency higher than a given value are
contaminants. In our analysis of 3,029 digested proteins,
yielding 78,384 masses, we identified 16 possible
contaminants. Of these 16, four are known trypsin
autolysis peptides. Removing these contaminant masses from
the database search will lead to more accurate and
reliable protein identification.
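A minimal sketch of the frequency-based flagging described above, with invented tolerance and threshold values; the authors' clustering of mass values is reduced here to simple binning.

```python
# Sketch: bin peptide masses, count occurrences across digests, flag recurrers.
from collections import Counter

def flag_contaminants(mass_lists, tol=0.1, min_fraction=0.5):
    """mass_lists: one list of peptide masses per digested protein."""
    counts = Counter()
    for masses in mass_lists:
        # Bin each mass so values within `tol` Da cluster together.
        counts.update({round(m / tol) for m in masses})
    n = len(mass_lists)
    return sorted(b * tol for b, c in counts.items() if c / n >= min_fraction)

# Example: mass 842.51 (a known trypsin autolysis peptide) recurs everywhere.
spectra = [[842.51, 1100.2], [842.50, 990.7], [842.52, 1234.5], [842.51, 777.3]]
print(flag_contaminants(spectra))   # approximately [842.5]
```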
A Flexible Pipeline for Experimental Design, Processing, and Analysis of Microarray Data
Stephen Osborn, Scot Kennedy and Daniel Chin
PPD Discovery, Menlo Park, CA
Abstract
Microarray data is nearly unrivalled in the volume of data
generated, as well as the variety of comparisons and
analyses that can be made. It is thus imperative to have a
computational infrastructure capable of more than storing
the raw data: aiding the planning of an experiment;
facilitating communication between the groups involved in
its execution; and analyzing its results in an organized,
consistent manner. Standards like MIAME/MAGE exist for the
exchange of data, but are insufficient for efficient and
flexible processing. Software tools exist for performing a
multitude of analyses, but they often do not provide an
overall framework, leaving scientists to apply post hoc any
of a wide array of analyses to data and organize the
results. We created a database and "pipeline" attempting to
address these issues. A hierarchy with raw datasets at the
"bottom" and high-level multi-array analyses at the "top"
allows the specification of comparisons and relationships in
a generalized, flexible manner. This structure implements
'groupings' of datasets such as replicates, dye-swaps,
treatments, etc., and associates an extensible range of
analyses with each. Data is viewed naturally as features,
probes, genes, or arbitrary sets. Researchers can establish
the structure of a microarray experiment before beginning,
while modifying as needed using a simple web interface. The
system adapts to many applications and platforms: the
modular pipeline allows calculations to be performed by
"plug-ins", built-in or external software, and the
hierarchical object data structure maps naturally to XML/DOM
format standards while
maintaining a scalable efficient relational structure.
Computational Approach to Reconstructing Gene Regulatory Networks
Xutao Deng and Hesham Ali
Computer Science Department, University of Nebraska at Omaha
Abstract
With the rapid accumulation of gene expression data in
publicly accessible databases, computational study of gene
regulation has become an attainable goal. Intrinsic to
this task will be data mining tools for inferring knowledge
from biological data. In this project, we have developed a
new data mining technique in which we adapt the
connectivity of a recurrent neural network model by
indexing regulatory elements and including nonlinear
interaction terms. As a result, the new technique reduces
the number of parameters by O(n), therefore increasing the
chance of recovering the underlying regulatory network. In
order to fit the model from data, we have developed a
genetic fitting algorithm with O(n) time complexity
that adapts the connectivity during the fitting process
until a satisfactory fit is obtained. We have implemented
this fitting algorithm and applied it to three data sets:
simulated data with 5 genes, rat central nervous system
development data with 112 genes, and yeast whole genome
data with 2467 genes. With multiple runs of the fitting
algorithm, we were able to efficiently generate a
statistical pattern of the model parameters from the data.
Because of its adaptive features, this method will be
especially useful for reconstructing coarse-grained gene
regulatory networks from large-scale or whole-genome-scale
gene expression data sets.
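For orientation, a generic recurrent gene-network model of the kind being fit is sketched below; the paper's specific parameter reduction (indexed regulatory elements, nonlinear interaction terms) is not reproduced, and the toy network is invented.

```python
# Sketch of a generic recurrent gene-network model and its simulation.
import numpy as np

def simulate(W, decay, x0, steps=50, dt=0.1):
    """x(t+dt) = x + dt * (tanh(W @ x) - decay * x), elementwise."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x + dt * (np.tanh(W @ x) - decay * x)
        traj.append(x.copy())
    return np.array(traj)

# 5-gene toy network: gene 0 activates gene 1, gene 1 represses gene 2.
W = np.zeros((5, 5)); W[1, 0] = 2.0; W[2, 1] = -2.0
traj = simulate(W, decay=np.ones(5), x0=[1, 0, 0.5, 0, 0])
print(traj[-1].round(2))
```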
Probability Profiles: Novel Approach in Mass Spectrometry De Novo
Sequencing
T. Fridman1, R. Day1, J. Razumovskaya2, D. Xu2 and A. Gorin1
1Computer Science and Mathematics Division
2Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831
Abstract
Development of robust computational algorithms for
protein identification in mass spectrometry (MS) experiments
is a challenging data analysis problem with a typical set of
"biological attributes": abundance of experimental noise,
incomplete data, and possibility of wild card factors such
as posttranslational modifications. Novel and robust MS
analysis approaches are crucially important for emerging
Genomes-to-Life Program efforts to compile a comprehensive
catalog of cellular protein-protein complexes through
high-throughput mass spectrometry. The established methods
rely on sequence database lookups and will not be adequate
for the identification of cross-linked peptides from two or
more subunits of protein-protein complexes.
Here we present a novel "probability profile" approach for de novo
peptide sequencing from experimental tandem mass spectra. A large
database of previously resolved peptide
spectra was used to determine "neighborhood patterns" for
each peak category: C- or N-terminus ions, their dehydrated
fragments, etc. The established patterns are applied to
assign probabilities for new spectra peaks to fit into those
categories. Usually at least a few peaks can be
identified with fair confidence, creating strong "anchor
points" for the de novo algorithm assembling sequence subgraphs.
Our approach utilizes the full informational content of
a given MS experimental data set, including peak intensities,
weak and noisy peaks, and unusual fragments. We also discuss
ways to provide learning features in our method: adjustments
for a specific MS device and user initiated changes in the
list of considered peak categories.
A Symbolic Logic Strategy For Mapping Biological Concepts in the HERBE Prototype
Eric G. Stephan, George Chin, Kyle R. Klicker, Abigail L. Corrigan and Heidi J. Sofia
Pacific Northwest National Lab, Richland, WA
Abstract
Current database strategies are not powerful enough to support
biologists in the efficient use of complex, large scale, and rapidly
changing systems biology data. We are engaged in research to
advance database technologies so that biologists can capture,
manage, compute upon, and search biological concepts in ways
that are presently not possible.
The Heuristic Entity Relationship Building Environment (HERBE)
project is developing mechanisms to automate the translation of
concepts into computable forms. HERBE features an open visual
environment to collect diverse, complex data and construct models
graphically, thus allowing users to apply their own symbolic logic.
As data are entered, HERBE captures the data, then integrates
and heuristically interprets underlying concepts into a computable
form using a multi-tiered data management infrastructure.
We have constructed a prototype as an early proof of concept to
demonstrate the positive impact this approach can make. The
prototype features a fully functional visual concept-mapping layer,
extraction components to translate the user's symbolic logic into
computational logic, and a concept repository. The prototype also
has a derivation feature that automatically builds an overarching
data model that is connected to the concept repository so that
complex queries between concept maps and the resulting data
model are possible.
A Computational Method for Assessing Peptide-Identification Reliability
in Tandem Mass Spectrometry Analysis with SEQUEST
Jane Razumovskaya1,3, Victor Olman1, Dong Xu1,3,
Ed Uberbacher1,3, Nathan VerBerkmoes3, Ying Xu1,2,3
1Life Sciences Division and 2Computer Sciences and
Mathematics Division, Oak Ridge National Laboratory, TN 37830-6480, USA
3School of Genome Science and Technology, University of
Tennessee, Knoxville, TN 37922, USA
Abstract
High throughput protein identification in mass spectrometry
is predominantly achieved by first identifying tryptic
peptides using SEQUEST and then by combining the peptide
hits for protein identification. Peptide identification is
typically carried out by selecting SEQUEST hits above a
specified threshold, the value of which is typically chosen
empirically in an attempt to separate true identifications
from the false ones. These SEQUEST scores are not
normalized with respect to the composition, length and
other parameters of the peptides. Furthermore, there is
no rigorous reliability estimate assigned to the protein
identifications derived from these scores. Hence the
interpretation of SEQUEST hits generally requires human
involvement, making it difficult to scale up the
identification process for genome-scale applications. To
overcome these limitations, we have developed a method,
which combines a neural network and a statistical model,
for "normalizing" SEQUEST scores, and also for providing a
reliability estimate for each SEQUEST hit. This method
improves the sensitivity and specificity of peptide
identification compared to the standard filtering procedure
used in the SEQUEST package, and provides a basis for
estimating the reliability of protein identifications.
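A heavily simplified sketch of the normalization idea follows: express a raw score as a z-score against a peptide-dependent null model, then map it to a [0, 1] reliability. The actual method trains a neural network and statistical model; the length-dependent null below is purely hypothetical.

```python
# Sketch: normalize a raw SEQUEST-style score, then map to a reliability.
import math

def null_params(length):
    # Hypothetical length-dependent null model (would be fit from data).
    return 0.15 * length, 0.05 * length

def reliability(xcorr, length):
    mu, sigma = null_params(length)
    z = (xcorr - mu) / sigma                 # "normalized" score
    return 1.0 / (1.0 + math.exp(-z))        # logistic reliability

print(round(reliability(2.8, 12), 3))   # same raw score, short peptide ...
print(round(reliability(2.8, 24), 3))   # ... long peptide: lower reliability
```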
Functional Genomics
Tomato Expression Database (TED): An Interactive Management Tool
for Tomato Expression Profiling Data
Zhangjun Fei, Xuemei Tang, Rob Alba, Paxton Payton and Jim Giovannoni
Cornell University
Abstract
We are embarking on large-scale functional genomics using
cDNA microarrays to obtain expression profiles of tomato
genes during fruit development and ripening. Expression
data for approximately 12,000 ESTs over a time course of
fruit development were generated. The ESTs employed
represent sequences unique to specific tissues (e.g. fruit,
flower), as well as defined genetic, biochemical and
physiological functions (e.g. protein kinases,
transcriptional factors). In order to provide the research
community access to our normalized and replicated
microarray data as a tool to assess relative expression of
genes of interest, we developed a publicly accessible
online database Tomato Expression Database (TED:
http://ted.bti.cornell.edu). Through this database, we
provide multiple approaches to pursue analysis of specific
genes of interest and/or access the larger microarray data
set to identify sets of genes that may behave in a pattern
of interest to the user. To achieve this goal, a set of
useful data mining and data visualization tools were
developed and are under continuing expansion according to
users' requirements. Developed initially as a data mining
and analysis resource, TED also contains comprehensive
annotation of each EST including homology derived from
sequence similarity searches of GenBank and GO terms
assigned manually according to putative functions. Future
expansions will include integration of expression data from
tomato fruit ripening mutants and from normal tissues in
response to treatments that influence ripening, integration
of tomato metabolite data, and addition of interfaces for
expanded data analyses including clustering.
Oncogenetic Tree Models: An Estimation
Harpreet Kaur Gill
Christian Medical College
Abstract
The development of human cancers is believed to be driven
by a sequence of genetic changes that lead to an abnormal
behavior of cells. For many tumor types, characteristic
alterations (gene mutations, partial gains or losses of
chromosomes) are known, but in many cases, the function of
these alterations in tumor development is yet unknown.
Thus it is desirable to infer which genetic changes are
typically early or late events in tumor development, and
whether there are preferred sequential orders of their
occurrence. The modelling of tumorigenesis as a sequence
of genetic changes was pioneered by Fearon and Vogelstein.
However, the heterogeneity encountered in many tumor
types suggests that more complex models might be needed to
model the dependencies between alterations. Here we show
how the method of maximum likelihood can be applied to
estimate tree models for oncogenesis.
The GeneCards Family of Databases: GeneCards, GeneLoc, GeneNote and GeneAnnot
Marilyn Safran, Vered Chalifa-Caspi, Orit Shmueli, Naomi Rosen,
Hila Benjamin-Rodrig, Ron Ophir, Itai Yanai, Shirley Horn-Saban,
Pavel Kats, Michael Shmoish and Doron Lancet
Weizmann Institute of Science
Abstract
The popular GeneCards integrated database of human genes,
genomic maps, proteins and diseases has recently spawned
three related functional genomics efforts: GeneLoc,
GeneNote and GeneAnnot. As sequence data rapidly
accumulates, the bottleneck in biology shifts from data
production to analysis; researchers seek a profound
understanding of the role of each gene and of the way genes
function together.
GeneLoc integrates human gene collections by comparing
genomic coordinates at the exon level, eliminating
redundancies, and assigning unique and meaningful location-
based identifiers to each GeneCards gene. When upgrading to
new versions, GeneCards has grappled with keeping these
identifier associations persistent and consistent, while
taking into account changes in gene location due to new
elucidations of the human genome.
GeneCards expression tissue vectors are provided by
GeneNote, the first effort to present complex expression
analyses (including clustering) for a variety of normal
human tissues using the full complement of gene
representations (Affymetrix arrays HG-U95A-E). GeneNote
variation plots are presented for each gene, visualizing
Pearson's correlations between individual probe-set vectors
and the average tissue vector for the gene, as well as
relative scalar lengths of individual vectors. Probe-sets
are mapped to GeneCards genes via the GeneAnnot system,
which revises and improves their annotation by aligning
them to major sets of human mRNA sequences. Each GeneCards
gene is annotated with relevant probe-set associations,
including their identifiers, arrays, sensitivity and
specificity scores, and genes-to-probe-sets counts.
GeneLoc, GeneNote and GeneAnnot have their own respective
in-depth websites, in addition to providing summary data
for enriching GeneCards.
Wavelet Transforms for the Analysis of Microarray Experiments
Taku Tokuyasu, Donna Albertson, Ajay Jain and Dan Pinkel
UCSF Cancer Center, San Francisco, CA
Abstract
Array comparative genomic hybridization (CGH) is a
microarray technology for measuring the relative copy
number of thousands of genomic regions. Continuing
advancements allow entire genomes to be probed with
increasing resolution and precision. Array CGH is proving
to be a sensitive tool for measuring genomic abnormalities
in diseases such as cancer.
The measurements from an array CGH experiment have a
natural linear ordering, namely the genome ordering of the
spotted clones. Visual examination of CGH profiles shows
that genomic changes occur on a variety of length scales.
It has been hypothesized that such changes are
characteristic of phenotypic variables such as tumor type
and gene mutational status.
To aid in identifying such features and exploring their
relationship with phenotypic outcomes, we are applying
wavelet transforms to the analysis of CGH profiles. This
allows us to decompose a CGH signal into short- and long-
length-scale components, which are characteristic of
aberrations over a few clones and on the scale of an entire
chromosome. We have begun to address the question of which
features of the genomic changes are most highly correlated
with outcome.
Wavelet transforms may also be useful in the realm of gene
expression. Genes can be ordered as a result of
hierarchical clustering, permitting an expression signal to
be defined that can be analyzed by wavelets. The wavelet
transform can compress the signal from multiple genes into
a few components, possibly aiding in the development of new
tumor classifiers.
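A minimal sketch of the short/long length-scale split, using the third-party PyWavelets package as a stand-in (not the authors' code); the toy profile with a narrow gain is invented.

```python
# Sketch: wavelet decomposition of a CGH profile into coarse and fine parts.
import numpy as np
import pywt

rng = np.random.default_rng(4)
profile = (np.concatenate([np.zeros(64), 0.8 * np.ones(8), np.zeros(56)])
           + rng.normal(0, 0.05, 128))    # narrow gain on a flat background

coeffs = pywt.wavedec(profile, "haar", level=4)

# Reconstruct only the coarse (long-length-scale) part by zeroing details:
coarse = pywt.waverec([coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]], "haar")
fine = profile - coarse                   # what remains is the short-scale signal
```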
MageBuilder: A Schema Translation Tool for Generating MAGE-ML from Tabular Microarray Data
William Martin and Robert M. Horton
Attotron Biosensors Corporation, East Palo Alto, CA
Abstract
The MicroArray Gene Expression Markup Language (MAGE-ML)
is a standardized XML format for representing the results of
microarray experiments (www.mged.org). It is designed to include
sufficient structural metainformation to make this data useful for
purposes such as meta-analysis and data mining. The MAGE data
model is sufficiently complex that specialized software tools, such
as the MAGE software toolkit (MAGEstk), are invaluable for dealing
with the structured data. Laboratories conducting microarray
experiments typically store results in tabular format, either in
spreadsheets or relational databases, and conversion of this
tabular data to MAGE format may be daunting. Here we present
tools to facilitate mapping from table-based schemas to MAGE-ML.
A 'MageBuilder' object takes a set of 'MageMap' objects and a set
of data streams as input, and produces a MAGEstk object
representation, which is then serialized as MAGE-ML. A
'MageMap' object encapsulates the rules of how data records from
an input stream relate to one MAGE object. Mapping rules may be
as simple as copying a column to an attribute, while custom
subroutines may be required to calculate attribute values for more
complex mappings. Each input "stream" is an anonymous
subroutine that supplies records whose fields represent columns
in the input table. The input tables can be delimited text files,
database queries, or essentially any source that can be coerced
into a set of records with fixed fields. Combining abstract input
streams with flexible mapping of streams to objects provides an
extensively customizable approach to generating MAGE-ML from
sets of data tables.
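A hedged sketch of the map-then-build design described above; the class and method names are hypothetical and do not reflect the actual MAGEstk or MageBuilder API.

```python
# Sketch: a MageMap-like object holds per-attribute rules (column copies or
# custom subroutines); a builder applies it to each record from a stream.
class MageMap:
    def __init__(self, target_class, rules):
        self.target_class = target_class
        self.rules = rules               # attribute -> column name or callable

    def apply(self, record):
        obj = {"_class": self.target_class}
        for attr, rule in self.rules.items():
            obj[attr] = rule(record) if callable(rule) else record[rule]
        return obj

def stream(rows):
    """An input 'stream': any iterable of dict records with fixed fields."""
    yield from rows

bioassay_map = MageMap("BioAssay", {
    "name": "array_id",                                               # copy a column
    "value": lambda r: float(r["signal"]) / float(r["background"]),   # custom rule
})

rows = [{"array_id": "A1", "signal": "1200", "background": "100"}]
print([bioassay_map.apply(r) for r in stream(rows)])
```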
OptiRNAi, a Web-based Program to Select siRNA Sequences
Wenwu Cui1, Jianchang Ning2, Ulhas P Naik1
and Melinda K. Duncan1
1Department of Biological Sciences, 2Delaware
Biotechnology Institute, University of Delaware, Newark, Delaware
19716
Abstract
RNA interference (RNAi) is a recently developed reverse
genetic tool to knockdown expression of specific genes in
eukaryotes. Successful implementation of this technology
requires the design of double stranded RNAs identical to
specific regions of the target gene. However, not every
possible small interfering RNA (siRNA) is able to reduce
target gene expression. Recently, Elbashir et al. (Methods,
26:199-213, 2002) established empirical criteria for
siRNA sequence selection that significantly improved the
success rate for RNAi attempts. We have developed
OptiRNAi, a computational tool, which uses the Elbashir et
al criteria to predict appropriate sequences for siRNA
production. The output of the program provides researchers
with a number of potential siRNA targets, a score that
reflects how well the sequence fulfills the empirical
criteria, and the location of the target sequence in the
input. Specificity of these siRNAs for the target of
interest can then be assessed by the investigator using
the embedded BLAST server optimized for RNAi design. Thus,
OptiRNAi is a labor saving tool for RNAi design using
criteria that are more stringent than other available
tools.
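A minimal sketch of Elbashir-style candidate selection (AA-initiated 21-mers filtered by GC content); the exact criteria and scoring used by OptiRNAi follow Elbashir et al. (2002) and are not reproduced here, so the GC window and helper name below are illustrative.

```python
# Sketch: scan a cDNA for AA(N19) windows and filter by GC content.
def candidate_sirnas(cdna, gc_range=(0.30, 0.60)):   # hypothetical helper
    cdna = cdna.upper()
    hits = []
    for i in range(len(cdna) - 21):
        if cdna[i:i + 2] != "AA":        # prefer AA-initiated 21-mers
            continue
        target = cdna[i:i + 21]          # AA + 19-nt core
        gc = (target.count("G") + target.count("C")) / 21
        if gc_range[0] <= gc <= gc_range[1]:
            hits.append((i, target, round(gc, 2)))
    return hits

seq = "GCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTG"
for pos, target, gc in candidate_sirnas(seq):
    print(pos, target, gc)
```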
A Rule-based Framework for Gene Regulation Pathways Discovery
Bartosz Wilczynski, Torgeir Hvidsten, Andriy Kryshtafovych, Lisa Stubbs,
Jan Komorowski and Krzysztof Fidelis
Lawrence Livermore National Lab, Livermore, CA 94550
Abstract
We have developed a novel approach to discover the rules
that govern gene regulation mechanisms. The method is based
on supervised machine learning and is designed to reveal
relationships between transcription factors and gene
promoters. As the representation of the gene regulatory
circuit we have chosen a special form of IF-THEN rules
associating certain features (TFBS) in gene promoters with
specific gene expression profiles.
A given set of genes with known expression levels under
certain conditions is used as an input to our procedure. We
start from gene expression clustering and then iteratively
perform the following steps:
- searching for the significant features (TFBS) in the gene
promoter groups
- inducing rules from clustering and a feature set
- evaluating the rules
- refining the set of features
until no further progress can be achieved.
The core part of the algorithm is the induction of the
rules from expression clustering and feature set. To reach
this goal we create a binary matrix with rows representing
genes associated with an expression profile class and columns
representing features and then apply the Rosetta software
to the created matrix. The results of applying our
algorithm to the yeast gene expression data are shown.
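A small sketch of the decision table handed to rule induction: rows are genes labeled with their expression cluster, columns indicate TFBS presence in the promoter. (The authors apply the Rosetta software to such a matrix; the toy genes and features below are invented, and only the table layout is the point.)

```python
# Sketch: build the binary genes-x-TFBS decision matrix for rule induction.
promoter_features = {
    "GENE1": {"MCB", "SCB"},
    "GENE2": {"MCB"},
    "GENE3": {"STRE"},
}
cluster_of = {"GENE1": "G1-peak", "GENE2": "G1-peak", "GENE3": "stress"}
features = sorted({f for fs in promoter_features.values() for f in fs})

rows = [[int(f in promoter_features[g]) for f in features] + [cluster_of[g]]
        for g in sorted(promoter_features)]
print(features + ["class"])
for row in rows:
    print(row)
# A rule such as IF MCB=1 THEN class=G1-peak can be read off this table.
```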
Gene Function and Metabolic Pathways in the Yeast
Qing Dong, Rama Balakrishnan, Gail Binkley, Karen R. Christie, Maria Costanzo,
Kara Dolinski, Selina S. Dwight, Stacia Engel, Dianna G. Fisk, Jodi Hirschman,
Eurie L. Hong, Laurie Issel-Tarver, Rob Nash, Anand Sethuraman, Chandra L.
Theesfeld, Shuai Weng, David Botstein and J. Michael Cherry
Saccharomyces Genome Database, Department of Genetics, Stanford University, Stanford, CA
Abstract
The budding yeast, Saccharomyces cerevisiae, has been
experimentally manipulated for several decades. A wealth
of information is available from the Saccharomyces Genome
Database (SGD, http://www.yeastgenome.org/). This poster
will highlight two datasets that are maintained by the
SGD. A large dataset of hand curated information is
provided in machine readable format for each gene of the
Saccharomyces genome. These annotations use the Gene
Ontology (GO) controlled vocabularies for biological
process, molecular function and cellular component. Each
hand curated annotation includes an evidence code to a
literature reference. Tools are provided at SGD for the
analysis of sets of genes. The GO Term Finder helps
identify GO terms that a group of genes has in common. The
GO Term Mapper allows you to map the specific, granular GO
terms used to annotate a list of genes to their more
general parent GO terms. The second area of focus is
metabolic pathways. A new dataset of hand curated
information on metabolic pathways within budding yeast was
released in May 2003 and can be downloaded from our ftp
site. This resource can be searched to view biochemical
reactions and pathways as they occur in S. cerevisiae.
This resource also maps data from genome-wide expression
analyses onto the pathway overview so that changes in gene
expression can be viewed in the context of cellular
metabolism. These pathways are created and edited using
the Pathway Tools software that is developed and
maintained by Peter Karp and his colleagues at SRI
International. SGD is funded by the US National Human
Genome Research Institute.
Using Rule Induction to Analyze Gene Expression Data
Gary Livingston and Xiao Li
University of Massachusetts-Lowell, Lowell, MA
Abstract
We have applied rule induction to a publicly available
adenocarcinoma gene expression data set. The typical
approach to the analysis of gene expression data is to
cluster the genes, which requires a lot of laborious
analysis to interpret the clusters and examine their
significance. With rules, the interpretation is more
obvious (e.g., (CDKN3 > 253) -> (tumor-stage = 3)); in
addition, the rules establish ranges for values, also
aiding interpretation. The data set we used contains 7,129
genes for 86 lung tumor and 10 normal lung samples as well
as patient data, such as tumor stage (1 or 3), survival
time (months), etc. We used standard approaches to reduce
the number of genes to 485. Our rule induction tool was a
new semi-autonomous discovery system we are developing
called HAMB, and we used it to learn rules for survival
status, survival time, and tumor stage. When we searched
the world-wide web for publications relating our top 20
genes from our discovered rules to lung cancer, we found 5
of them are known to be associated with lung cancer (GLRX,
PCNA, CDH17, KAL1, and SCTR), 5 of them are known to be
associated with other types of cancer (CDKN3, KRT6A, FABP4,
ADH1, and SCTR), and the remaining 10 were not found to be
associated with cancer: ACP5, CP, FMO2, ABLIM, CLECSF2,
REG1A, EMP2, AOC3, AGTR1, ALDH2, CPA3. Our results suggest
that the latter two groups of genes be examined more
closely for their association with lung cancer.
Genotyping and SNPs
SNP Analyzing System for Detecting Multifactorial Disease Associated Site
Yoko Higashi, Hirotaka Higuchi, Takashi Kido, Hirohito Matsumine,
Toshihiko Morimoto and Masaaki Muramatsu
NTT Data Corporation, Kayaba-cho Tower, 1-21-2, Shinkawa, Chuo-ku, Tokyo, Japan
Abstract
We have built a system that can efficiently detect genes
that cause multifactorial diseases, such as diabetes and
high blood pressure. Its main function is to detect
polymorphisms in which there are significant differences
between the genotypes of a Case group and a Control group
by means of statistical tests.
Since many factors are thought to contribute together to a
multifactorial disease, single-point analyses are not
sufficient, and multiple testing across single-point analyses
poses a significant problem. To improve the capability to
detect such factors, the new
system has a function for extracting the functional and
heredity units (referred to as ldblocks) of a genome and
for performing association analysis on each unit.
Moreover, the existing software for linkage-disequilibrium
analysis requires rearranged genotype data in order of a
row of polymorphisms. In contrast, the developed system
automatically rearranges this data according to the
positional information held internally. In addition, the
system automatically detects the ldblock. Owing to these
new functions, the number of polymorphisms it can analyze
is about 30 to 50 times higher than by standard manual
procedures.
The results of the analyses are stored in a database, and
can be browsed as an association-analyses result table.
Furthermore, the system produces a polymorphisms-and-
ldblock map with the positions of genes and polymorphisms
recorded in a public database. This makes investigation of
disease-associated sites more efficient.
During a poster session, we will present our results on
operability and performance of the system.
Haplotype Pattern Mining and Classification for Detecting Disease Associated SNPs
Takashi Kido
Joware Hanzonmon, 2F 2-19 Hayabusa-cho, Chiyoda-ku, Tokyo, Japan
Abstract
Finding the causative genes for common diseases using SNP markers is now
becoming a truly challenging issue. Although conventional single-locus SNP
association tests exist, they can neither explain the effects of abundant
SNP combinations nor the evolution of disease chromosomes. Haplotype analysis of
disease chromosomes can not only provide more powerful information than
single-locus SNP analysis, but can also
help identify probable causative mutations.
In this paper, we introduce a new method for effective haplotype pattern
mining to detect disease-associated mutations. Our strategy is summarized as
follows:
- We detect similar patterns that frequently appear in disease chromosomes
from all possible haplotype combinations (haplotype mining).
- We also compare all possible pairs of haplotypes that differ at only one
SNP (mutation); a sketch of this test follows the abstract. If we detect a significant difference in association tests on these two
haplotypes, we can assume that this mutation may be a causative SNP for the disease.
- We classify haplotypes based on phylogenetic similarities and represent
their relationships by networks. We then map the phenotype association to each node of the
network (haplotypes).
- By comparing the results of these three strategies repeatedly, we can
localize the causative mutations.
Our method is inspired by data mining methods and fully utilizes haplotype
information. Using this method, we have discovered several new disease-associated
SNPs that could not be detected by conventional methods. In this poster, we
introduce the details of our methods and tools with some worked examples.
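A minimal sketch of the pairwise comparison in step two above, using a chi-square test on invented case/control counts for two haplotypes that differ at a single SNP:

```python
# Sketch: contrast two one-SNP-apart haplotypes in cases vs. controls.
from scipy.stats import chi2_contingency

# Rows: haplotype A-T-G vs. A-C-G (differ only at the middle SNP).
# Columns: counts in case chromosomes, counts in control chromosomes.
table = [[120, 60],    # A-T-G: enriched in cases
         [40, 95]]     # A-C-G
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2g}")
# A small p-value suggests the T/C mutation itself may be disease-associated.
```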
Strings, Graphs, and Algorithms
Genetic Algorithm Approach to the Center Sequence Problem Arising in
Post-transcriptional Gene Silencing
Holger Mauch
University of Hawaii at Manoa
Abstract
A fundamental aspect of post-transcriptional gene silencing
(PTGS) is the requirement of homology between the transgene
sequence and the virus or mRNA sequence being silenced.
For example, virus-resistant plants are resistant only to
viruses that are very closely related to the virus from
which the transgene was derived. One idea is to devise an
artificial sequence that is at least 90-95% homologous to
all virus genes. This requires an algorithm which can
determine an artificial sequence with an optimal homology
(or at least a 90-95% homology) to all of the virus
sequences. The genetic algorithm presented in this poster
serves this purpose. It should be of great value to all
researchers who utilize PTGS, gene quelling, or RNAi.
Mathematically, the task is to find the radius of a code $S
\subset \{A,C,G,T\}^{n}$. Experimental results suggest
that this NP-complete optimization problem can be
approached well with a customized genetic algorithm (GA).
Alternative approaches to the problem are discussed and put
in contrast to the GA approach. Heuristics produce results
that are usually far from a global optimum and are
therefore often useless in practice. Exhaustive
search and branch and bound search have exponential time
requirements and are not suitable for real world problem
sizes. The instance studied in this poster consists of
m=116 virus sequences each having a sequence length of
n=626. The GA quickly found a solution with 94.57%
homology, which comes very close to the theoretical upper
bound of 94.73% for this problem instance.
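A minimal GA sketch for this objective, maximizing the minimum homology between a candidate center sequence and all input sequences; the population size, mutation rate, and toy input are illustrative only, not the poster's tuned settings.

```python
# Sketch: genetic algorithm for the center-sequence (radius) objective.
import random

BASES = "ACGT"

def min_homology(center, seqs):
    """Worst-case fraction of matching positions against any input sequence."""
    return min(sum(a == b for a, b in zip(center, s)) / len(s) for s in seqs)

def evolve(seqs, pop_size=50, generations=200, mut_rate=0.02):
    n = len(seqs[0])
    pop = ["".join(random.choice(BASES) for _ in range(n)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: min_homology(c, seqs), reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(n)                    # one-point crossover
            child = [random.choice(BASES) if random.random() < mut_rate else ch
                     for ch in a[:cut] + b[cut:]]        # point mutations
            children.append("".join(child))
        pop = survivors + children
    return max(pop, key=lambda c: min_homology(c, seqs))

seqs = ["ACGTACGTAC", "ACGTACGTTC", "ACGAACGTAC"]
best = evolve(seqs)
print(best, round(min_homology(best, seqs), 3))
```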
GenericBioMatch: A Novel Generic Pattern Match Algorithm for Biological Sequences
Youlian Pan and A. Fazel Famili
Integrated Reasoning Group, Institute for Information Technology,
National Research Council Canada, 1200 Montreal Road, M-50, Ottawa, ON K1A 0R6, Canada
Abstract
The grammar of biological sequence language varies markedly
from any of the natural languages. The schema of DNA and
protein sequences is less strict than that of natural
languages, such that it allows deletion and insertion of
letters in a sequence pattern. It also allows alternation
of letters within a "word." Many traditional text-searching
algorithms are unsuitable for biological sequences. This
has encouraged many researchers to apply probabilistic
algorithms, such as Hidden Markov Models and Gibbs sampling,
to biological sequences. However, there are very few
algorithms that are suitable for quick exact pattern matching
in biological sequences. This poster presents a novel
algorithm (called GenericBioMatch) for exact match of
biological sequences with provided sequence motifs. This
algorithm allows wildcard letters (e.g., Y, R, W in DNA
sequence) and one or more gaps of any number of base pairs
within the provided sequence motif pattern. GenericBioMatch
is a relatively fast algorithm [O(mn), m=motif_length,
n=sequence_length], as compared to Hidden Markov Model and
other probabilistic algorithms, and has very little
computational overhead. It is able to perform exact match
of protein motifs as well as DNA motifs. This algorithm can
serve as a quick validation tool for implementation of
other algorithms, and can also serve as a supporting tool
for probabilistic algorithms. This algorithm has been
implemented in the BioMiner
(http://iit-iti.nrc-cnrc.gc.ca/biomine_e.trx), a suite of tools for integrated
data mining, and tested successfully with DNA sequences
from human, yeast, and Arabidopsis.
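To illustrate the behavior described above (IUPAC ambiguity codes plus bounded gaps), here is a regex-based sketch; it reproduces the matching semantics only, not the GenericBioMatch algorithm or its O(mn) complexity, and the gap syntax x(min,max) is an invented notation.

```python
# Sketch: compile an IUPAC-style motif with bounded gaps to a regex.
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "W": "[AT]", "S": "[CG]", "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}

def motif_to_regex(motif):
    """'TATAWA x(2,5) GGYCAA' -> TATA[AT]A, a 2-5 bp gap, then GG[CT]CAA."""
    parts = []
    for token in motif.split():
        m = re.fullmatch(r"x\((\d+),(\d+)\)", token)
        if m:
            parts.append("[ACGT]{%s,%s}" % m.groups())     # bounded gap
        else:
            parts.append("".join(IUPAC[ch] for ch in token))
    return re.compile("".join(parts))

pattern = motif_to_regex("TATAWA x(2,5) GGYCAA")
print([m.start() for m in pattern.finditer("CCTATAAAGCGGGTCAACC")])   # [2]
```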
Aligning ESTs to Genome Using Multi-Layer Unique Markers
F. R. Hsu and J. F. Chen
Department of information technology, Taichung Healthcare and Management University, Taiwan
Abstract
As more and more genomic sequences are sequenced and the
number of ESTs (Expressed Sequence Tags) grows rapidly,
we need very fast and efficient algorithms to align ESTs
to a genome. ESTs reflect the exon/intron structure of genes,
so this alignment is different from general alignment between two
sequences.
Let G and E denote the given genome and the given EST
sequence respectively. In order to reduce time complexity,
it is quite important to locate the position of E in G
efficiently. A unique marker is a sequence appearing in G
only once. We can locate the position of E in G by finding
unique markers contained in E. Let Uk denote the set of
unique markers of G with length k. The larger of k, the
probability of E contains any element in Uk is higher.
Yet, the larger of k, the number of elements in Uk is
more. The time complexity to align ESTs to genome grows
rapidly when we use larger unique markers. In this paper,
we propose a multi-layer unique marker method to align
ESTs to genome. Our method combines the benefit of both
smaller and larger unique markers. Our algorithm can also
use to locate the position of single nucleotide
polymorphism (SNP). By using four personal computers, our
software has successfully located all SNPs in dbSNP in one
day. Using only personal computers, our software also
successfully aligns the entire EST set to the human genome.
Experimental result shows that our method is much faster
than common tools such as SIM4 and BLAT.
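One plausible reading of the marker scheme, sketched in
Python below: collect the k-mers that occur exactly once in
the genome, then anchor an EST by any unique marker it
contains, trying several marker lengths (the multi-layer
idea). The lengths, the fallback order, and the toy data are
illustrative assumptions, not the authors' implementation.

    def unique_markers(genome, k):
        # U_k: k-mers occurring exactly once in the genome, with position.
        occ = {}
        for i in range(len(genome) - k + 1):
            occ.setdefault(genome[i:i + k], []).append(i)
        return {kmer: pos[0] for kmer, pos in occ.items() if len(pos) == 1}

    def anchor_est(est, genome, ks=(12, 16, 20)):
        # Combine several marker lengths: short markers give a small
        # index, longer ones are more likely to occur inside the EST.
        for k in ks:
            uk = unique_markers(genome, k)
            for i in range(len(est) - k + 1):
                pos = uk.get(est[i:i + k])
                if pos is not None:
                    return pos - i   # approximate genomic start of the EST
        return None

    genome = "ATGCGTACGTTAGCCTAGGATCCA"
    est = "CGTTAGCCTAGG"
    print(anchor_est(est, genome, ks=(6, 8)))   # -> 7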
A Parallel Evolutionary Method for Physical Mapping of Chromosomes
Suchendra M. Bhandarkar, Jinling Huang and Jonathan Arnold
Department of Computer Science, The University of Georgia
Abstract
Reconstructing a physical map of a chromosome from a genomic
library presents a central computational problem in
genetics. Physical map reconstruction in the presence of
errors is a problem of high computational complexity. A
parallel evolutionary method for a maximum likelihood
estimation-based approach to physical map reconstruction is
presented. The estimation procedure entails a gradient
descent search for determining the optimal spacings between
probes for a given probe ordering. The optimal probe
ordering is determined using an evolutionary algorithm. A
two-tier parallelization strategy is proposed wherein the
gradient descent search is parallelized at the lower level
and the evolutionary algorithm is simultaneously
parallelized at the higher level. Implementation and
experimental results on a network of shared-memory symmetric
multiprocessors (SMPs) are presented. The evolutionary
algorithm is seen to result in physical maps with fewer
contig breaks when compared to Monte Carlo algorithms such
as simulated annealing and the large-step Markov chain
algorithm.
On Finding Common Intervals among k Protein Sequences Containing Redundant Elements
Xiaolu Huang and Hesham Ali
University of Nebraska at Omaha
Abstract
Predicting protein structure from the amino acid sequence
has been one of the most challenging problems in
computational biology. Many statistics-based
machine-learning methods are available to biologists these
days; however, no single method has proven to solve the
problem in all its instances. In this project, we address
protein structure prediction using a common interval search
algorithm. The proposed algorithm, the
Redundant-element-sequence Common Interval Searching (RCIS)
algorithm, finds all the common intervals within one amino
acid sequence or among k amino acid sequences. Given a
sequence of elements, each of which belongs to a finite
element set (in the case of amino acid sequences, the set of
20 amino acids), intervals within this sequence that consist
of the same set and number of elements are called common
intervals. Given a sequence of size n, RCIS has space
complexity O(n^2) and time complexity O(n^3). This common
interval searching algorithm provides a new way of
approaching protein secondary structure prediction.
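The definition can be made concrete with a small Python
sketch: fingerprint every interval by its element
composition and group intervals with identical fingerprints.
This naive enumeration is for illustration only and does not
achieve the stated RCIS bounds.

    from collections import defaultdict

    def common_intervals(seq, min_len=2):
        # Group intervals by composition: intervals with the same set
        # and number of elements (the same multiset) are common intervals.
        groups = defaultdict(list)
        n = len(seq)
        for i in range(n):
            comp = defaultdict(int)
            for j in range(i, n):
                comp[seq[j]] += 1
                if j - i + 1 >= min_len:
                    key = tuple(sorted(comp.items()))  # composition fingerprint
                    groups[key].append((i, j))
        return {k: v for k, v in groups.items() if len(v) > 1}

    # Intervals (0,3) and (4,7) share the multiset {G,A,V,L}:
    print(common_intervals("GAVLGAVL"))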
Algorithms for Bounded-Error Correlation of High Dimensional Data in Microarray Experiments
Mehmet Koyuturk, Ananth Grama and Wojciech Szpankowski
Purdue University
Abstract
The problem of clustering continuous-valued data has been
well studied in the literature. Its application to microarray
analysis relies on such algorithms as k-means,
dimensionality reduction techniques (including SVD,
principal components, KL transforms, etc.), and graph-based
approaches for building dendrograms of sample data. In
contrast, similar problems for discrete-attributed data are
relatively unexplored. An instance of analysis of
discrete-attributed data arises in detecting co-regulated
samples in microarrays. Here, our interest is in detecting
samples that are up- and down-regulated together. Efficient
solutions to this problem enable more refined correlation
between presence of sequence motifs and underlying
regulation patterns in microarray data.
One of the key challenges associated with clustering
discrete-attributed data is that these problems typically
are NP-hard and few effective heuristics are known for
solving them. In this paper, we present an algorithm and
a software framework, Proximus, for error-bounded clustering
of high-dimensional discrete-attributed datasets. In our
past work, we have demonstrated the excellent performance of
Proximus in terms of runtime, scalability to large datasets,
and ability to represent data in a compact form. In this
work, we demonstrate its application
to microarray data in extracting co-regulated samples. We
show that Proximus delivers outstanding performance in
extracting accurate patterns of gene-expression.
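Proximus is built on recursive rank-one approximations of
0/1 matrices (as described in the authors' prior work). The
Python sketch below shows one alternating rank-one step on a
toy up/down-regulation matrix; the update rule and the data
are illustrative assumptions, not the released software.

    import numpy as np

    def rank_one_binary(A, iters=20):
        # Approximate A ~ outer(x, y) with binary x, y: fixing y, set
        # x_i = 1 iff row i agrees with y on a majority of y's support
        # (this minimizes the Hamming error of the rank-one term).
        x = np.zeros(A.shape[0], dtype=int)
        y = A[0].copy()                  # seed the pattern with row 0
        for _ in range(iters):
            if y.sum():
                x = (A @ y > y.sum() / 2).astype(int)
            if x.sum():
                y = (A.T @ x > x.sum() / 2).astype(int)
        return x, y

    A = np.array([[1, 1, 0, 0],      # rows: samples; columns: genes
                  [1, 1, 0, 0],
                  [0, 0, 1, 1]])
    x, y = rank_one_binary(A)
    print(x, y)   # one regulation pattern y and the samples x carrying it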
Exact and Heuristic Algorithms for the DNA Fragment Assembly Problem
Sami Khuri and Yimin Jiang
San Jose State University
Abstract
As more and more public and private genomic assemblies
become available, the need to compare the different
computational methods used in DNA fragment assembly becomes
apparent. In this work, we design, implement, and test four
heuristic algorithms for assembling DNA fragments. We also
implemented and tested an exact algorithm for the Fragment
Assembly problem. As expected, the exact algorithm gives the
best results, but at a high price: it is very slow compared
to our heuristics.
The first heuristic implemented in this work, gaFRAG, is a
hybrid genetic algorithm that first uses a semiglobal
dynamic programming approach to find pairwise overlaps, then
applies genetic algorithm operators to order the fragments,
and finally builds the layout progressively. The second
heuristic, mwcFRAG, forms maximum-weight contigs and then
uses a greedy technique to find Hamiltonian paths, followed
by a progressive technique to order the fragments and build
the layout. The third algorithm, mFRAG, which considers each
fragment a contig, merges the contigs based on pairwise
overlaps computed with the semiglobal dynamic programming
algorithm. The last heuristic, bbFRAG, is derived from the
exact algorithm: it builds a tree with a single path to find
a heuristic solution instead of building the full tree of
the branch-and-bound technique.
In order to accurately compare the performance of the
different assembly heuristics, we used a collection of
benchmark datasets. The experimental results show that our
mFRAG and bbFRAG algorithms are as efficient and accurate as
PHRAP and CAP3.
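All four heuristics rest on pairwise overlaps computed by
semiglobal dynamic programming. The Python sketch below
scores the best suffix-prefix overlap between two fragments;
the unit match/mismatch/gap scores are placeholder
assumptions.

    def overlap(a, b, match=1, mismatch=-1, gap=-1):
        # Semiglobal DP: leading gaps in a are free (column 0 = 0) and
        # trailing gaps in b are free (take the best cell in the last
        # row), so the score reflects aligning a suffix of a with a
        # prefix of b.
        m, n = len(a), len(b)
        prev = [j * gap for j in range(n + 1)]
        for i in range(1, m + 1):
            cur = [0] * (n + 1)
            for j in range(1, n + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            prev = cur
        best_j = max(range(n + 1), key=lambda j: prev[j])
        return prev[best_j], best_j      # overlap score, b-prefix length

    print(overlap("ACGTT", "GTTAC"))     # suffix GTT overlaps prefix GTT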
Finding Higher Order Motifs under the Levenshtein Measure
Ezekiel F. Adebiyi and Tinuke Dipe
Mathematics and Computer Science Department, University of
Ilorin, Ilorin, Nigeria
Abstract
Pattern discovery in unaligned DNA sequences is a
fundamental problem in computational biology with important
applications in finding regulatory signals. Existing
approaches to finding patterns focus on monad patterns,
otherwise known as common motifs (Sagot, 1998) or simply
motifs (Pevzner and Sze, 2000), which correspond to
relatively short contiguous strings (Adebiyi, 2002). A new
problem, that of finding composite motifs, otherwise known
as structured motifs (Vanet, Marsan, Labigne and Sagot,
2000) or higher order motifs (Sinha, 2002), arises when the
number of monad motifs, $p$, that participate in a
biological process is greater than one. The relative
position of each monad motif is then important: it is not
random but sterically defined. In eukaryotes, these monad
motifs may interact with one another. Recently, increasing
efforts have been made to tackle this problem, but with
respect to the Hamming distance (Sinha, 2002; Eskin, 2001).
It is known (Adebiyi, 2002; Pevzner and Sze, 2000; Sand,
2002) that sequences of organisms exist where insertions and
deletions are important for finding regulatory signals. It
is this challenge that we consider in this project. In the
problem set-up, we are given $N$ sequences, each of average
length $n$, over a finite alphabet $\Sigma$, and thresholds
$D$ and $q$; we are to find composite motifs that contain
motifs of length $P$ (each occurring with at most $D$
differences) in $1\leq q \leq N$ distinct sequences.
Two interesting but involved algorithms for finding higher
order motifs under the edit distance were presented by
Marsan and Sagot (RECOMB 2000). Their second algorithm is
much more complicated and its complexity is asymptotically
no better. The running time of their first algorithm under
the edit distance is
$O(M\cdot N^2n^{1+\alpha\cdot p\cdot pow(\epsilon)})$,
where $p\geq 2$, $\alpha >0$, $pow(\epsilon)$ is a concave
function that is less than 1, $\epsilon = D/P$ and $M$ is
the expected number of all monad motifs.
$M$ is omitted in the complexity analysis given in Marsan
and Sagot (RECOMB 2000), but $M$ is not constant. Under the
Hamming distance, Buhler and Tompa (2001) estimated $M=
{|\Sigma|}^P(1-(1-P_D)^{n-P+1})^q$, where $P_D$ is the
probability that a given motif of length $P$ occurs with up
to $D$ substitutions at a given position of a random
sequence. We showed (Adebiyi, 2002) that under the edit
distance, $P_D<2N^{\alpha\cdot (pow(\epsilon)-1)}$.
We present an alternative algorithmic approach, also for the
edit distance, based on the concept described in Adebiyi,
Jiang and Kaufmann (ISMB 2001) and Adebiyi and Kaufmann
(WABI 2002). The resulting algorithm is simpler
and runs in $O(N^2n^{1+ p\cdot pow(\epsilon)})$ expected
time, where $p\geq 1$, $\epsilon = D/P$ and $pow(\epsilon)$
is a concave function less than zero.
Mapping Discontinuous Antibody Epitopes to Reveal Protein
Structure and Changes in Structure Related to Function
Mumey, B.1, Angel, T.2, Kirkpatrick, B.1,
Bailey, B.4, Hargrave, P.5, Jesaitis, A.3
and Dratz, E.2
Departments of 1Computer Sciences, 2Chemistry & Biochemistry, and
3Microbiology, Montana State Univ., Bozeman, MT
4NIH-NIAAA
5Univ. of Florida, Gainesville, FL.
Abstract
Antibody imprinting is a new method for protein structure
determination that overcomes limitations in traditional
methods, such as X-ray crystallography or nuclear magnetic
resonance, and requires about a thousand-fold less protein.
The method uses antibodies against target
proteins and random peptide probe libraries to map the
epitopes of antibody binding sites on target proteins.
The first step in the process is to experimentally
determine a subset of the probes that have a high affinity
to the antibody. Once this set of probes is found, we apply
an algorithm we developed previously [Mumey et al., RECOMB
2002] to find the best alignments of each probe sequence to
the target protein sequence. In this
poster, we report on new work to synthesize the set of
probe-target alignments into a single map of the epitope
surface of the protein. We have recently developed a new
algorithm that creates a "surface neighbor graph" on the
epitope residues and finds a putative planar embedding of
this graph which is most consistent with the alignments
found.
We have used hen egg lysozyme as a validation case and
present surface neighbor graphs that correctly identify
regions of the protein that are exposed to solvent and
distant regions in the sequence that are in close
proximity in the folded protein. We have also identified
monoclonal antibodies (mAbs) that stabilize light-excited
conformations of rhodopsin and antibody imprinting is
providing information on the conformations of the light-
excited protein. We have characterized the epitopes of 25
mAbs that recognize at least 5 distinct regions of human
neutrophil flavocytochrome b, an integral membrane
electron transferase that produces host defensive
superoxide. We are using antibody imprint analysis to help
predict its structure in advance of attempts using X-ray
crystallography. We are now testing the ability of
antibody imprinting to screen out incorrect ab initio
protein structure predictions with a goal of arriving at
much more accurate protein structure predictions.
The SCP and Compressed Domain Analysis of Biological Sequences
Don Adjeroh and Jianan Feng
West Virginia University
Abstract
We present new methods to rapidly analyze biological
sequences by examining their compressed representations. We
focus on BWT-based compression, introduce the SCP (the
sorted common prefix), and show how the SCP can be used for
improved algorithms for anchor-based sequence alignment and
for locating tandem repeats. Using the relationship between
the Burrows-Wheeler Transform (BWT) and the suffix tree, in
addition to our new auxiliary vectors, we can obtain the
sorted suffixes directly from the compressed sequence. Based
on the properties of the SCP, we develop algorithms to
compute the SCP in O(u + |A|K) comparisons and O(K) extra
space, where u is the sequence length, K is the maximum SCP
value, and |A| is the alphabet size. The SCP is a natural
data structure for the analysis of regularities and
repetitions in a sequence. The areas that contain tandem
arrays form a specific geometric structure in the SCP. For
each primitive repeat, we can construct an isosceles
triangle such that points along the sides correspond to SCP
values in the area. We call this structure the p-periodic
(triangular) neighborhood in the SCP. By analyzing this
triangular structure, we can provide specific information
about the repeat. Based on the above, we can find all the n
canonical tandem arrays in a given sequence T in
O(u + n + |A|K) time on average. For a fixed |A|, this
implies O(u + n) time. This can be compared with previous
non-compressed-domain algorithms for exact tandem repeats,
which run in O(u log u).
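For orientation, the SCP over the sorted suffixes is
analogous to an LCP array: the value at rank r is the length
of the common prefix shared by adjacent suffixes in sorted
order. The naive Python sketch below computes such values
directly from the suffixes; the poster's contribution is
obtaining them from the BWT-compressed sequence, which this
sketch does not attempt.

    def sorted_common_prefixes(t):
        # Sort suffixes, then record the common-prefix length between
        # each suffix and its predecessor in sorted order.
        suffixes = sorted(range(len(t)), key=lambda i: t[i:])
        scp = [0] * len(t)
        for r in range(1, len(t)):
            a, b = t[suffixes[r - 1]:], t[suffixes[r]:]
            k = 0
            while k < min(len(a), len(b)) and a[k] == b[k]:
                k += 1
            scp[r] = k
        return suffixes, scp

    sa, scp = sorted_common_prefixes("abababab")
    print(scp)   # runs of high values betray the period-2 tandem array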
Identifying Regulatory Signals in DNA-sequences with a Non-statistical Approximation Approach
Cun-Quan Zhang1, Yunkai Liu2, Elaine M. Eschen2 and Keqiang Wu3
1Dept. of Mathematics, 2Lane Dept. of Computer
Science and Electrical Engineering, 3Dept. of Biology, West Virginia University
Abstract
With the development of gene-profiling technology, the
expression patterns of thousands of genes under a variety of
conditions are now available. This gives us the opportunity
to analyze the parts of the genome believed to be
responsible for transcription control: the transcription
factor DNA-binding motifs. We give a mathematical model for
the motif finding problem and a non-statistical
approximation algorithm to find motifs in a set of genome
sequences.
Let S be a set of length-n strings and s be a similarity
function between pairs of strings. Let K be the complete
graph with vertex set S and weight s(x,y) for each edge xy.
Given constants c and m, the goal of motif finding is to
find a subgraph H of K of size m such that s(x,y) is at
least c for every pair of vertices of H. This is equivalent
to finding a clique of size at least m in the graph K'
obtained from K by deleting all edges with weight less than
c, which is NP-complete. Thus, we introduce the notion of an
R-group to develop an approximation approach.
For a subgraph H of K we construct a vertex R(H) that
represents the distribution of symbols in all strings of H,
and we extend the similarity function s to R(H). An R-group
of K is a subgraph G of K such that s(R(G),x) is at least c
for every x of G. R-groups approximate the desired cliques
of K'. Our algorithm finds an R-group in O(n|S|^3) time.
This yields a practical approximation to the clique (motif)
problem.
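A minimal Python sketch of the R-group idea follows:
maintain a per-position symbol distribution R(G) for the
current group and admit a string only while every member
keeps similarity at least c to the profile. The similarity
function (expected per-position agreement) and the threshold
are illustrative assumptions, not the exact similarity
function of the paper.

    def profile(strings, alphabet="ACGT"):
        # R(H): per-position distribution of symbols over the strings.
        n = len(strings[0])
        p = [{a: 0.0 for a in alphabet} for _ in range(n)]
        for s in strings:
            for i, ch in enumerate(s):
                p[i][ch] += 1.0 / len(strings)
        return p

    def sim_to_profile(p, s):
        # Extension of s to the representative vertex: expected
        # fraction of positions on which s agrees with R(H).
        return sum(p[i][ch] for i, ch in enumerate(s)) / len(s)

    def grow_r_group(strings, seed, c):
        group = [seed]
        for s in strings:
            if s == seed:
                continue
            cand = group + [s]
            p = profile(cand)
            if all(sim_to_profile(p, x) >= c for x in cand):
                group = cand
        return group

    S = ["ACGT", "ACGA", "ACTT", "TTTT"]
    print(grow_r_group(S, "ACGT", c=0.7))   # TTTT is rejected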
A Genetic Algorithm for Simplifying the Amino Acid Alphabet
Matthew Palensky and Hesham Ali
University of Nebraska at Omaha
Abstract
Simplified amino acid alphabets have been successful in
several areas of bioinformatics, including predicting
protein structure, predicting protein function, and protein
classification. Simplification can be done by clustering the
symbols and creating a symbol representing each cluster.
Since there are twenty symbols in the amino acid alphabet,
the number of possible simplifications is large. It is not
practical to search through all possible simplifications to
find one suitable for a specific application. A previous
study conducted by the authors identified and evaluated four
types of computational simplification algorithms. That
study's initial findings indicate that algorithms relying
heavily on randomness and heuristics tend to produce poor
simplifications. Genetic algorithms have generally been
successful in producing quality solutions to problems with a
large solution space, though their reliance on randomness
makes it difficult to create quality simplifications. This
study's goal is to overcome these difficulties and create a
genetic simplification algorithm. The presented results
include the genetic simplification algorithm, as well as the
difficulties of creating such an algorithm. The described
algorithm has led to the development of a computer program
that uses a genetic algorithm to produce simplified
alphabets, and these outputs will be listed and analyzed.
The evaluation metrics will include pairwise alignment
scoring, physical characteristics of the groupings,
algorithm running time, and suitability for particular
domains.
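The search space can be encoded directly: a candidate
simplification is a vector assigning each of the twenty
amino acids a cluster label, and standard GA operators act
on that vector. The Python sketch below uses a deliberately
artificial fitness (prefer balanced clusters) as a stand-in
for the study's metrics such as pairwise alignment scoring.

    import random

    AMINO = "ACDEFGHIKLMNPQRSTVWY"

    def random_partition(k):
        # Genome: a cluster label for each of the 20 amino acids.
        return [random.randrange(k) for _ in AMINO]

    def crossover(a, b):
        cut = random.randrange(1, len(AMINO))
        return a[:cut] + b[cut:]

    def mutate(g, k, rate=0.05):
        return [random.randrange(k) if random.random() < rate else x
                for x in g]

    def ga_simplify(fitness, k=5, pop_size=40, generations=100):
        pop = [random_partition(k) for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            elite = pop[: pop_size // 2]
            pop = elite + [mutate(crossover(random.choice(elite),
                                            random.choice(elite)), k)
                           for _ in elite]
        return max(pop, key=fitness)

    # Placeholder fitness: penalize the largest cluster (illustration).
    best = ga_simplify(lambda g: -max(g.count(c) for c in set(g)))
    print({a: c for a, c in zip(AMINO, best)})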
Structural Biology
Spectral Decomposition of the Laplacian Matrix Applied to RNA Folding Prediction
Danny Barash
Genome Diversity Center, Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel
Abstract
RNA secondary structure consists of elements such as stems,
bulges, and loops. The most obvious and important scalar number
that can be attached to an RNA structure is its free energy,
with a landscape that governs the folding pathway. However,
because of the unique geometry of RNA secondary structure,
another interesting single-signed scalar number based on
geometrical scales exists that can assist in RNA structure
computations. This scalar number is the second eigenvalue of
the Laplacian matrix corresponding to a tree-graph
representation of the RNA secondary structure. Because of
the mathematical properties of the Laplacian matrix, the
first eigenvalue is always zero, and the second eigenvalue
(often denoted as the Fiedler eigenvalue) is a measure of
the compactness of the associated tree-graph. The concept of
using the Fiedler eigenvalue/eigenvector is borrowed from
domain decomposition in parallel computing. Thus, along with
the free energy, the Fiedler eigenvalue can be used as a
signature in a clever search among a collection of
structures by providing a similarity measure between RNA
secondary structures. This can also be used for mutation
predictions, classification of RNA secondary folds,
filtering and clustering. Furthermore, the Fiedler
eigenvector may be used to chop large RNAs into smaller
fragments by using spectral graph partitioning, based on the
geometry of the secondary structure. Each fragment may then
be treated differently for the folding prediction of the
entire domain.
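A few lines of NumPy suffice to compute the quantity in
question: build the Laplacian L = D - A of a tree graph and
take its second-smallest eigenvalue. The two toy trees below
are placeholders for coarse-grained RNA tree-graphs,
illustrating that the more compact tree has the larger
Fiedler eigenvalue.

    import numpy as np

    def fiedler_value(edges, n):
        # Laplacian L = D - A; its smallest eigenvalue is 0 and the
        # second smallest is the Fiedler eigenvalue.
        L = np.zeros((n, n))
        for i, j in edges:
            L[i, i] += 1
            L[j, j] += 1
            L[i, j] -= 1
            L[j, i] -= 1
        return np.linalg.eigvalsh(L)[1]

    path = [(0, 1), (1, 2), (2, 3)]   # elongated tree
    star = [(0, 1), (0, 2), (0, 3)]   # compact tree
    print(fiedler_value(path, 4))     # ~0.586
    print(fiedler_value(star, 4))     # 1.0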
Identification of Non-random Patterns in Structural and Mutational
Data: The Case of Prion Protein
Igor B. Kuznetsov and S.Rackovsky
Dept. of Biomathematical Sciences, Mount Sinai School of Medicine
Abstract
Prion diseases (mad cow disease, CJD, etc.) are a group of
fatal neurodegenerative disorders associated with
structural conversion of a normal, mostly α-helical cellular
prion protein, PrPC, into a pathogenic β-sheet-rich
conformation, PrPSc. The structure of PrPC is well
characterized, whereas little is known about the structure
of PrPSc, what parts of PrPC undergo conformational
transition, or how known disease promoting mutations
facilitate this transition. In this work we address these
problems by using a computational statistical analysis of
patterns observed in both sequence and structural
databases, as well as in PrP mutational data.
1. We compare sequence properties of PrP to the
distributions of properties observed in a non-redundant
database of known protein structures. Using a z-statistic we
identify PrP fragments that show unusual intrinsic
structural propensities.
2. We construct a novel entropic index which provides a
quantitative measure of context-dependent conformational
flexibility of a sequence fragment. We use this index to
identify PrP fragments whose sequence context has an
unusually high degree of conformational flexibility compared
to the sequence contexts of all fragments with similar
properties.
3. We test for statistically significant clusters of PrP
mutations and statistically significant patterns of changes
in physico-chemical properties associated with these
mutations. Estimates of statistical significance for these
tests are based on a stochastic model of mutational process
with unequal substitution rates and context-dependent
mutational hot spots, which allows us to take into account
a non-uniform distribution of the events.
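To illustrate the z-statistic screen in step 1, the short
Python sketch below standardizes an observed fragment
property against the distribution of that property in a
reference database; the numbers are placeholders, not PrP
data.

    import math

    def z_score(value, background):
        # z-statistic of one observation against a background sample.
        n = len(background)
        mean = sum(background) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in background) / (n - 1))
        return (value - mean) / sd

    # Placeholder: a propensity score for one fragment versus scores of
    # comparable fragments drawn from a non-redundant structure database.
    db = [0.42, 0.55, 0.47, 0.51, 0.49, 0.44, 0.53, 0.50]
    print(z_score(0.71, db))   # |z| well above 2 flags the fragment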
Multiple Protein Structure Alignment by Deterministic Annealing
Luonan Chen, Mituyasu Seo and Junya Nakai
Osaka Sangyo University, Japan
Abstract
Aligning multiple protein structures is one of the most
important computational problems in molecular biology and
plays a key role in protein structure prediction, fold
family classification, motif finding, tree reconstruction
and so on. Rather than a pairwise sequence alignment, which
simply compares two strings of characters, a pairwise
structure alignment is a more difficult problem that
optimally superimposes two structures and further finds the
regions of closest overlap in three-dimensional space. From
the viewpoint of computational complexity, even a pairwise
structure alignment is an NP-hard problem due to the
matching of rigid bodies, in contrast to the existence of a
polynomial time algorithm for pairwise sequence alignment by
dynamic programming (DP).
Although many algorithms have been developed so far for the
structure alignment problem, based on either distance-based
or vector-based methods, such as iterative dynamic
programming, Monte Carlo simulation, graph theory, and mean
field equations, only a few methods have been published for
multiple structure alignment, such as progressive structure
alignment and Monte Carlo optimization.
In this paper, we propose a novel method for solving the
multiple structure alignment problem, based on a
deterministic or mean field annealing technique. We define
the structure alignment as a mixed integer programming (MIP)
problem with the inter-atomic distances between two or more
structures as the objective function. The integer variables
represent the matchings among structures, whereas the
continuous variables are the translation vectors and
rotation matrices, with each protein structure treated as a
rigid body. By exploiting the special structure of the
continuous subproblem, we transform the MIP into a nonlinear
optimization problem (NOP) with a nonlinear objective
function and linear constraints, based on mean field
equations. To optimize the NOP, a mean field annealing
procedure is adopted with a modified Potts spin model. Since
all linear constraints are embedded in the mean field
equations, we do not need to add any penalty terms for the
constraints to the error function. In other words, there are
no "soft constraints" in our mean field model, and all
constraints are automatically satisfied during the annealing
process, thereby not only making the optimization more
efficient but also eliminating unnecessary penalty
parameters that usually require careful tuning for each
problem.
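Once a matching is fixed, the continuous variables (a
rotation matrix and translation vector per structure) admit
a closed-form optimum, which is what makes the continuous
subproblem tractable. The NumPy sketch below shows the
standard Kabsch/SVD superposition for a fixed matching; it
is a reference point only, not the poster's mean field
formulation.

    import numpy as np

    def superimpose(P, Q):
        # Rotation R and translation t minimizing ||R P_i + t - Q_i||^2
        # over matched coordinate rows of P and Q (Kabsch algorithm).
        cp, cq = P.mean(axis=0), Q.mean(axis=0)
        H = (P - cp).T @ (Q - cq)
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflection
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        return R, cq - R @ cp

    P = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
    theta = np.pi / 3
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1.0]])
    Q = P @ Rz.T + np.array([1.0, 2.0, 3.0])     # rotated + translated copy
    R, t = superimpose(P, Q)
    print(np.allclose(R @ P.T + t[:, None], Q.T))   # True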
An Exact Algorithm for Determining Protein Backbone Structure from NH
Residual Dipolar Couplings
Lincong Wang, Ramgopal R. Mettu, Ryan H. Lilien and Bruce Randall Donald
Department of Computer Science, Dartmouth College, Hanover, NH 03755
Abstract
We have developed a novel algorithm for protein backbone
structure determination using global orientational
restraints on internuclear vectors derived from residual
dipolar couplings (RDCs) measured in solution NMR. The
algorithm is based on two low-degree polynomial equations
for computing the backbone $(\phi,\psi)$ angles from two
vectors in consecutive peptide planes; each vector is
obtained by solving a quartic equation on two RDCs. Given
RDCs measured on any \emph{single} backbone vector type,
these equations make it possible to compute $(\phi,\psi)$
\emph{exactly} and \emph{in constant time per residue}.
Using these equations, our algorithm computes $(\phi,\psi)$
angles that simultaneously maximize structural agreement
with experimental RDCs and minimize deviation from standard
secondary structure geometry. Our algorithm samples a
Gaussian error interval around each RDC, and then, in a
depth-first manner, searches the discrete sets of
$(\phi,\psi)$ solutions. The search tree has a branching
factor of at most 16 (typically 4), since for each pair of
consecutive RDCs our equations have multiple solutions. The
depth-first search incorporates a pruning strategy based on
favorable Ramachandran regions. We give evidence that under
a Gaussian error model for RDCs, our algorithm has an
expected runtime that is considerably lower than the
worst-case exponential runtime. In practice, our
implementation runs quickly; it has been demonstrated on
Human Ubiquitin using backbone NH RDCs. The secondary
structure elements were subsequently positioned with twelve
hydrogen bonds and four NOE distance restraints. Compared
with other RDC-based automated fold determination
algorithms, our exact method uses fewer restraints but
achieves similar or better accuracy.
Automatic Construction of 3D Structural Motifs for Protein Function Prediction
Mike Hsin-Ping Liang, Doug Brutlag and Russ Altman
Stanford Biomedical Informatics
Abstract
Structural genomics initiatives are on the verge of
generating a vast number of protein structures. The biological roles
for many of these proteins are still unknown, and high-throughput
methods for determining their function are necessary. Understanding
the function of these proteins will have profound impact in drug
development and protein engineering.
Current methods for protein function prediction from
structures require manual creation of structural motifs, so
only a few structural motifs are available. The lack of
structural motifs limits the use of these methods for
function prediction at a structural-genomics scale.
To overcome this limitation, I present a method for
automatically creating a library of three-dimensional
structural motifs from amino acid sequence motif databases.
Such an automatically generated library of structural motifs
can be used for structural-genomics-scale function
prediction on protein structures.
Local Minima-Based Exploration for Off-Lattice Protein Folding
Eugene Santos Jr., Keum Joo Kim and Eunice E. Santos
Dept. of Computer Science and Engineering, University of Connecticut
Abstract
Proteins have specific native folding structures, but
without proper folding, these proteins function
incorrectly and lead to disease. The major challenge lies
in effectively predicting such native conformations. We
present a new and simple algorithmic approach to help
predict protein structures from amino acid sequences based
on energy minimization. In the search for the minimal
energy conformation, we analyze and exploit the protein
structures found at the various local minima to direct the
search toward the global minimum. As such, we explore the
energy
landscape efficiently by considering only the space of
local minima instead of the whole feasible space of
conformations. Our specific algorithmic approach is
comprised of two different elements: local minimization
and operators from genetic algorithms. Unlike existing
hybrid approaches where the local optimization is used to
fine-tune the solutions, we focus primarily on the local
optimization and employ stochastic sampling through
genetic operators for diversification. Our empirical
results indicate that each local minimum is representative
of the substructures contained in the set of solutions
surrounding it. We applied our approach to
determining the minimal energy conformation of proteins
from the Protein Data Bank (PDB) using the UNRES energy
model. We compared against standard genetic algorithms and
Monte Carlo approaches as well as the conformations found
in the PDB as the baseline. In all cases, our new approach
computed the lowest energy conformation.
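A compressed Python sketch of the scheme: every candidate
conformation is immediately relaxed to a local minimum, and
genetic operators then recombine and perturb those minima. A
toy multi-welled energy stands in for the UNRES model;
population sizes and rates are placeholder assumptions.

    import random
    import numpy as np
    from scipy.optimize import minimize

    def energy(x):
        # Toy multi-welled landscape standing in for UNRES.
        return float(np.sum(x ** 4 - 3 * x ** 2 + 0.5 * x))

    def to_local_minimum(x):
        # The search lives in the space of local minima: every
        # candidate is relaxed before it enters the population.
        return minimize(energy, x, method="BFGS").x

    def explore(dim=6, pop_size=12, generations=30):
        pop = [to_local_minimum(np.random.uniform(-2, 2, dim))
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=energy)
            parents = pop[: pop_size // 2]
            children = []
            for _ in parents:
                a, b = random.sample(parents, 2)
                mask = np.random.rand(dim) < 0.5     # uniform crossover
                child = np.where(mask, a, b) + np.random.normal(0, 0.1, dim)
                children.append(to_local_minimum(child))
            pop = parents + children
        return min(pop, key=energy)

    print(energy(explore()))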
Text Mining and Ontologies
HyBrow: A System for Hypothesis Validation
Stephen A. Racunas, Nigam H. Shah, Shashi Phoha and Nina V. Fedoroff
Penn State University, Dept of Electrical Engineering
Abstract
Representation of diverse data and biological processes in
a single formalism is a major challenge in bioinformatics
because of the variety of available data and the different
levels of detail at which biological processes can be
considered. We have developed an event-based ontology for
representing both biological processes and data from
different sources at varying levels of detail. We have
integrated this ontology with a discrete event systems
framework to support qualitative reasoning. The framework
adopts a hierarchical, contradiction-based approach to
evaluate competing hypotheses based on available data and
provide a ranking among them. We have designed a database
backend to structure information in the ontology and
programs to perform contradiction-based evaluation of
hypotheses constructed using the ontology. We have also
designed command line, web and custom interfaces for the
system. We have named our system HyBrow (from Hypothesis
Browser), which includes the event-based ontology, an
associated database, the hypothesis evaluation framework
and the interface programs. We demonstrate the feasibility
of HyBrow, using the galactose metabolism gene network in
Saccharomyces cerevisiae as our test system, for ranking
alternative hypotheses in an automated manner using
available data. The HyBrow system for modeling biological
systems is highly extensible and can facilitate the process
of understanding how a biological system works by providing
an integrated view of the biological process under study.
Identifying Gene and Protein Names from Biological Texts
Weijian Xuan, Stanley J. Watson, Huda Akil and Fan Meng
Mental Health Research Institute and Department of Psychiatry, University of Michigan
Abstract
Identifying gene and protein names from literature is a
critical step for mining functional information of genes
and proteins. While extensive efforts have been devoted to
this important task, most of them aim at extracting gene and
protein names per se without paying much attention to
associating the extracted names with existing gene and
protein database entries. We developed a simple and
efficient method for identifying gene and protein names in
literature using a combination of heuristic and statistical
strategies. We build a large dynamic dictionary based on
LocusLink and UMLS. We use handcrafted rules and natural
language processing techniques to transform the dictionary
and text, and use an efficient algorithm to search for
interesting patterns. A series of procedures are applied to
disambiguate the detected entities. Our approach maps the
identified entities to individual LocusLink entries, thus
enabling the seamless integration of literature information
with existing gene and protein databases.
Evaluation on two different corpora shows that our method
achieves around 90% precision and recall, and outperforms
two other publicly available systems for protein name
recognition. The analysis of the entire set of Medline
abstracts can be finished in less than 15 hours on a single
PC. Our method does not require any training data, and can
be used as a building block for large biomedical literature
mining systems.
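A stripped-down Python sketch of the dictionary strategy:
normalize both dictionary synonyms and text phrases the same
way, then map matches back to database identifiers. The
two-entry dictionary is a placeholder; the actual system
builds its dictionary from LocusLink and UMLS and layers
rules and disambiguation on top.

    import re

    SYNONYMS = {"TNF": "LocusID:7124", "TNF-alpha": "LocusID:7124",
                "IL6": "LocusID:3569", "interleukin 6": "LocusID:3569"}

    def normalize(term):
        # Collapse case, spacing and punctuation variants so that
        # 'TNF-alpha', 'TNF alpha' and 'tnfalpha' all collide.
        return re.sub(r"[\s\-_/]+", "", term.lower())

    NORMALIZED = {normalize(k): v for k, v in SYNONYMS.items()}

    def find_entities(text, max_words=3):
        words = text.split()
        hits = []
        for i in range(len(words)):
            for j in range(i + 1, min(i + 1 + max_words, len(words) + 1)):
                phrase = " ".join(words[i:j]).strip(".,;()")
                locus = NORMALIZED.get(normalize(phrase))
                if locus:
                    hits.append((phrase, locus))
        return hits

    print(find_entities("TNF alpha induces interleukin 6 expression."))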
Refining the Extraction of Relevant Documents from Biomedical
Literature to Create a Corpus for Pathway Text Mining
Rachel Harte, Yan Lu, David Dehoney, Stephen Osborn and Daniel Chin
PPD Discovery, Inc.
Abstract
For biologists to keep up with developments in their field
or related fields, automation is desirable to more
efficiently read and interpret a rapidly growing
literature. Identification of proteins or genes and their
interactions can facilitate the mapping of canonical or
evolving pathways from the literature. In order to mine
such data, we developed procedures and tools to pre-qualify
documents for further analysis. Initially, a corpus of
documents for proteins of interest was built using
alternate symbols from LocusLink and the Stanford SOURCE as
MEDLINE search terms. The query was refined using the
optimum keywords together with MeSH terms combined in a
Boolean query to minimize false positives. The document
space was examined using a Latent Semantic Indexing (LSI)
tool, which uses Entrez's "related papers" utility for
MEDLINE. Documents' relationships are then visualized using
an undirected graph and scored depending on their degree of
relatedness. Distinct document clusters, formed by the most
highly connected related papers, are mostly composed of
abstracts relating to one aspect of research. This feature
was used to filter irrelevant abstracts, which resulted in
a reduction in corpus size of 10% to 20% depending on the
domain. The excluded documents were examined to confirm
their lack of relevance. Corpora consisted of the most
relevant documents thus reducing the number of false
positives and irrelevant examples in the training set for
pathway mapping. Documents were tagged, using a modified
version of GATE2, with terms based on GO for rule induction
using RAPIER.
Using Natural Language Processing and the Gene Ontology to Populate a Structured Pathway Database
David Dehoney, Rachel Harte, Yan Lu and Daniel Chin
PPD Discovery
Abstract
Reading the literature is one of the most time-consuming
tasks a busy scientist has to contend with. As the volume of
literature continues to grow, there is a need to sort
through this information more efficiently. Mapping the
pathways of genes and proteins of interest is one goal that
requires frequent reference to the literature. Pathway
databases can help here, and scientists currently have a
choice between buying access to externally curated pathway
databases or building their own in-house. However, such
databases are either expensive to license or slow to
populate manually.
Building upon easily available, open-source tools we have
developed a pipeline to automate the collection,
structuring and storage of gene and protein interaction
data from the literature. As a team of both biologists and
computer scientists we integrated our natural language
processing (NLP) software with the Gene Ontology (GO) to
collect and translate unstructured text data into
structured interaction data. For NLP we used a machine
learning approach with a rule induction program, RAPIER
(http://www.cs.utexas.edu/users/ml/rapier.html). RAPIER was
modified to learn rules from tagged documents, and then it
was trained on a corpus tagged by expert curators. The
resulting rules were used to extract information from a
test corpus automatically. Extracted genes and proteins were
mapped onto LocusLink, and extracted interactions were
mapped onto GO. Once information was structured in this way,
it was stored in a pathway database and this formal
structure allowed us to perform advanced data mining and
visualization.
Text Visualization for Natural Language Analysis of Biology Full Text
Andrea Elaina Grimes and Robert P. Futrelle
Northeastern University
Abstract
Full text articles in the biology literature are a rich
source of knowledge. Our focus is on discovering enough of
the text structure to answer natural language queries such
as, "What are the mutants of the TBP dimer interface?"
Rather than view text as grammar-based, we view it as
"data," just as DNA sequences are. We have developed novel
visualization tools to help us discover significant language
patterns. The corpus is 300M words of text from Highwire
Press. One important class of "Framework" patterns consists
of a left anchor, an endogram and a right anchor. An example
is, "... expression of [e] was ..." in which the endogram
[e] is typically a gene name such as "csgA" or "AcuF-LacZ."
In our Framework Viewer, the anchors are aligned in columns,
with the remainder of each sentence also displayed. The
sentences can be sorted in various ways, e.g., by endogram
length or alphabetically by the word preceding the right
anchor. It is possible to manually mark interesting
Framework instances for later processing. The endogram
multisets generated can be combined to broaden or sharpen
them. We are cataloging every Framework in the corpus to
discover the most frequent expressions that authors use.
Queries are processed by computing their distance from items
in the text using measures based on word insertions,
deletions and substitutions. This surface view of language
is still not powerful enough, so extensions of Frameworks
are being developed to induce natural language constituent
structure.
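The distance computation lends itself to a compact Python
sketch: a Levenshtein distance over words rather than
characters, so a query and a sentence are close when few
word insertions, deletions or substitutions separate them.
Uniform costs are assumed here for simplicity.

    def word_edit_distance(query, sentence):
        q, s = query.split(), sentence.split()
        m, n = len(q), len(s)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if q[i - 1] == s[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # delete a word
                              d[i][j - 1] + 1,         # insert a word
                              d[i - 1][j - 1] + cost)  # substitute
        return d[m][n]

    print(word_edit_distance("expression of csgA was",
                             "expression of the csgA gene was"))   # 2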