POSTER ABSTRACTS:
DATA MINING


Search-space Reduction of a Non-redundant Peptide Database Using De Novo Sequencing
Ian Shadforth, Daniel Crowther, Conrad Bessant
Cranfield University and GlaxoSmithKline

Mass spectrometric peptide assignment, using peptide mass fingerprinting, tandem mass mapping or de novo sequencing, is commonly employed to identify proteins in a sample. However, up to 90% of peptides remain unidentified for a number of reasons, including post-translational modifications (PTMs) and alternative splice variants. Searching protein databases iteratively using many permutations of PTMs, or searching the six-frame translation of the genome for variants, prevents confident peptide identification due to the increased statistical potential for false positive matches. A search-space filter using amino acids identified by a novel de novo sequencing methodology is presented. This methodology provides a high-accuracy set of amino acid predictions by exploiting the internal fragmentation of amino acid chains during tandem mass spectrometry. These predictions are then used to reduce the number of peptides in a non-redundant peptide database. Such databases are created through the in silico tryptic digestion of any proteomic, or translated genomic, database, then stored in a relational schema uniquely keyed by peptide sequence rather than by protein. The presence of one confirmed amino acid can be used to reduce the search database to between 67% (leucine) and 17% (tryptophan) of its original size, with multiple identifications further reducing the initial pool. One or more accurate amino acid identifications are shown to be made in up to 18% of the simulated and 9% of the experimental peptide spectra considered. Given the large proportion of currently unidentified peptides, this method represents a useful tool for increasing peptide identification rates.
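
To make the filtering step concrete, here is a minimal sketch of such a search-space filter over a peptide-keyed relational table. The SQLite schema, example peptides, and query are assumptions for illustration; the abstract does not specify the actual schema.

    import sqlite3

    # Toy non-redundant peptide table, uniquely keyed by sequence.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE peptides (sequence TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO peptides VALUES (?)",
                     [("PEPTIDEK",), ("GLSDGEWQLVLNVWGK",), ("AAAAK",)])

    def filter_by_confirmed_residues(conn, residues):
        # Keep only peptides containing every de novo confirmed amino acid.
        clause = " AND ".join("sequence LIKE ?" for _ in residues)
        params = ["%{}%".format(r) for r in residues]
        rows = conn.execute("SELECT sequence FROM peptides WHERE " + clause,
                            params)
        return [seq for (seq,) in rows]

    # A confirmed tryptophan (W) prunes the pool most aggressively (~17%).
    print(filter_by_confirmed_residues(conn, ["W"]))  # -> ['GLSDGEWQLVLNVWGK']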


Automated and Rapid Bacterial Identification Using LC-Mass Spectrometry with a Relational Database Management System
Samir V. Deshpande (1), Rabih E. Jabbour (2), Waleed M. Maswadeh (3), and A. Peter Snyder (3)
(1) Science & Technology Corporation, Edgewood, MD 21040; (2) Geo-Centers, Inc., Aberdeen Proving Ground, MD 21010; (3) US Army Edgewood Chemical Biological Center, Aberdeen Proving Ground, MD 21010

We have developed an integrated and automated software application for rapid bacterial identification using a relational database management system and electrospray liquid chromatography-ion trap mass spectrometry (ESI-LC/MS). ESI-LC/MS is used to generate chromatographic profiles of the proteins in a bacterial sample, and the accompanying software program, ProMAPDB, automates the data analysis: data collection, peak identification, spectral purification, mass spectral integration of scans within a peak, and assignment of molecular weights to observed proteins using the deconvolution algorithm described by Zhang and Marshall. The approach generates a list of biomarker masses along with retention time and relative abundance for all masses obtained by the algorithm. The list of masses is stored in a relational database as a reference library, together with sample information such as growth conditions and experimental details. Unknown samples are identified by correlation against the relational database. The bacteria studied include E. coli, Bacillus subtilis, B. thuringiensis, and B. megaterium. The approach has been tested for bacterial discrimination and identification using mass spectra of mixtures of microorganisms and of organisms grown under different conditions. Experimental factors such as sample preparation, reproducibility, mass range, and mass accuracy tolerance are also addressed and evaluated. This approach has the potential for reliable and accurate automated data analysis.
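
As an illustration of the library-correlation step, the following sketch scores an unknown biomarker mass list against reference entries by the fraction of masses matched within a tolerance. The scoring function, masses, and tolerance are invented for illustration, not ProMAPDB's actual correlation metric.

    def match_score(unknown_masses, library_masses, tol_da=2.0):
        # Fraction of unknown biomarker masses found in a library entry.
        hits = sum(any(abs(m - ref) <= tol_da for ref in library_masses)
                   for m in unknown_masses)
        return hits / len(unknown_masses)

    library = {
        "E. coli":     [9064.0, 9226.0, 9536.0, 15409.0],
        "B. subtilis": [7331.0, 9884.0, 10545.0],
    }
    unknown = [9063.5, 9225.8, 15410.1]
    best = max(library, key=lambda org: match_score(unknown, library[org]))
    print(best)  # -> E. coli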



Development of a Knowledge-based Multi-scheme Cancer Microarray Data Analysis System
John H. Phan (1), Chang F. Quo (1), Kejiao Guo (1), Weimin Feng (1), Geoffrey Wang (2), May D. Wang (1)
(1) Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University; (2) School of Biology, Georgia Institute of Technology, 313 Ferst Drive, Atlanta, GA 30332, USA

Comparing genes expressed in normal and diseased states assists the understanding of cancer pathophysiology, detection, prognosis, and the search for therapeutic targets. Many existing expression analysis studies show that microarray data suffer from inconsistency, small sample (patient) sizes, high gene dimensionality, and case dependence. We have therefore been developing a robust multi-parameter, multi-scheme, knowledge-based optimization system that integrates the strengths of statistics, pattern recognition (PR), support vector machines (SVMs), and fuzzy set theory. Statistical methods are well suited to processing large-sample data. PR can process small-sample data with various distance measures. SVMs find an optimal hyperplane separating clinically diagnosed groups. Fuzzy sets better model cases where some genes belong to multiple groups rather than one. The optimization logic decides on optimal cancer signature genes by jointly utilizing two data types, four analysis models, three tiers of processing (Rank-Order, Marker-Identification, and Diagnosis-Prediction), and clinical knowledge feedback. In Diagnosis-Prediction, we evaluate the system's power for correct classification using leave-one-out cross-validation. Our system is being finalized through testing on public and in-house datasets. It has provided advantages by being able to (1) process heterogeneous data sets from different types of microarray chips; (2) handle multiple types of data sets from the same microarray technology; (3) normalize data with multiple options; (4) perform both standard statistical hypothesis tests and permutation tests; (5) perform pattern classification with different feature distance measures; (6) use different kernels for constructing the SVM hyperplanes; (7) apply multiple one-dimensional search methods; and (8) apply multiple knowledge rules in diagnosis prediction.
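
A minimal sketch of the leave-one-out evaluation step in Diagnosis-Prediction follows, using scikit-learn as a stand-in (the authors' system is their own implementation) and random placeholder data in place of real expression profiles and diagnoses.

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 200))   # 40 patients x 200 candidate genes
    y = rng.integers(0, 2, size=40)  # placeholder clinical diagnoses

    # Leave-one-out: train on all patients but one, predict the held-out one.
    correct = 0
    for train, test in LeaveOneOut().split(X):
        clf = SVC(kernel="linear").fit(X[train], y[train])
        correct += int(clf.predict(X[test])[0] == y[test][0])
    print("LOOCV accuracy: {:.2f}".format(correct / len(y)))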



FigSearch: A Figure Legend Indexing and Classification System
Fang Liu, Tor-Kristian Jenssen, Vegard Nygaard, John Sack, and Eivind Hovig
Dept. of Tumor Biology, The Norwegian Radium Hospital

FigSearch is a prototype text-mining and classification system for figures from any corpus of XML-formatted full-text biological papers. By classification, we mean that figures are classified as relevant or non-relevant to a certain figure type; in our study, that type is schematic representations of protein interactions and signaling events. Our system is composed of three modules: an XML parser for figure extraction from articles; a supervised machine learning module that, based on maximum entropy criteria in a feature-word vector space, ranks figures by their relevance to the defined figure type; and a gene/protein symbol indexing scheme used to create figure-to-gene indexes, with PubGene's integrated gene identifier/alias resources adopted as the dictionary. A web interface was implemented, allowing users to retrieve a ranked list of figures illustrating protein interactions that contain the user-queried gene in their captions. We tested the system on a set of approximately 50,000 figures. Based on a training set of 4,500 figures, we chose a classification threshold of 0.53, at which the system achieved 70.5% recall, 86% precision, and 77.4% effectiveness. The classifier then categorized 414 of the ~50,000 figures as relevant. According to domain experts, the system showed satisfactory performance. A distinguishing feature of the system is that it applies text mining to figure legends so that graphical content can facilitate experts' manual work. Our pilot study reveals an existing link between graphics and text. Such a system can be useful in applications such as publishers' websites, bio-picture gallery construction, and as an aid to other complex text-mining projects.
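
The reported 77.4% "effectiveness" is consistent with the F-measure, the harmonic mean of precision and recall, as this quick check shows:

    # Sanity check of the reported figures at the 0.53 threshold.
    precision, recall = 0.86, 0.705
    f_measure = 2 * precision * recall / (precision + recall)
    print("{:.3f}".format(f_measure))  # -> 0.775, matching the reported ~77.4%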



GeneTIDE: Terra Incognita Discovery Endeavor - Mining ESTs and Expression Data to Elucidate Known and De Novo GeneCards Genes
Maxim Shklar, Orit Shmueli, Liora Strichman-Almashanu, Michael Shmoish, Tsippi Iny-Stein, Marilyn Safran and Doron Lancet
Weizmann Institute of Science

GeneTIDE, the Gene Terra Incognita Discovery Endeavor (http://genecards.weizmann.ac.il/genetide/) and the newest addition to the GeneCards suite of databases, comprehensively associates over 4.5 million of the ~5.5 million human ESTs currently available at dbEST with known or newly defined putative human genes. Association of transcripts with existing genes is done by integrating data from UniGene, DoTS, AceView, BLAT & GeneLoc, and GeneAnnot into a consensus- and uniqueness-based scoring scheme. In the quest for novel genes, groups of previously unannotated non-singleton transcripts spanning the same UniGene, DoTS, and AceView clusters are defined as EST-based Gene Candidates (EGCs). EGCs are then annotated with various parameters that help determine evidence levels for whether or not they truly form new genes. Currently, GeneTIDE defines ~20,000 EGCs with varying degrees of confidence. The annotation of each EGC includes the genomic locations of its constituent transcripts, an indication of splicing (to rule out contaminations such as genomic sequences), and whether or not the transcripts include repetitive elements according to RepeatMasker. Furthermore, expression tissue vectors, according to GeneNote data, of probe sets derived from the EGC's member transcripts are given, offering initial insight into the possible function of the proposed EGC. An immediate application of GeneTIDE is in microarray annotation. Gene associations given to transcripts by GeneTIDE were applied to the probe sets derived from those transcripts. In a specific example, we were able to provide gene identities for an additional ~15,000 unassigned EST-based Affymetrix HGU95A-E GeneChip microarray probe sets, a 50% increase relative to previous annotations.
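
The abstract does not detail the scoring scheme, so the following is a purely hypothetical illustration of a consensus-based association score: a transcript-to-gene assignment is trusted in proportion to how many source systems agree. The weighting is an assumption, not GeneTIDE's published scheme.

    def consensus_score(assignments):
        # assignments: mapping of source system -> gene id (or None).
        votes = {}
        for system, gene in assignments.items():
            if gene is not None:
                votes[gene] = votes.get(gene, 0) + 1
        if not votes:
            return None, 0.0
        gene, count = max(votes.items(), key=lambda kv: kv[1])
        return gene, count / len(assignments)

    print(consensus_score({"UniGene": "GC123", "DoTS": "GC123",
                           "AceView": "GC123", "GeneAnnot": None}))
    # -> ('GC123', 0.75): three of four systems agree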



ProtExt, a Web-based Protein-protein Interaction Extraction System for PubMed Abstracts
Chin-Lin Peng (1), Yung-Chung Lin (1), Hsuan-Cheng Huang (2,3), Cheng-Yan Kao (1), Shui-Tein Chen (2), and Hsueh-Fen Juan (1,4)
(1) Department of Computer Science and Information Engineering, National Taiwan University; (2) Institute of Biochemical Sciences, National Taiwan University; (3) Impac Inc., Taiwan; (4) Department of Chemical Engineering and Institute of Biotechnology, National Taipei University of Technology, Taiwan

Functional proteomics aims at understanding protein-protein and gene-protein interactions. Networks of interacting proteins provide a first level of understanding of cellular mechanisms. Furthermore, the cellular functions of uncharacterized proteins are revealed through their linkages to characterized proteins. The networks of linkages offer a new view of the meaning of protein function and a deepened understanding of the functioning of cells. However, information about protein-protein interactions is still primarily buried in the scientific literature, so efficient processing of large numbers of interactions requires an intelligent information extraction method. We present a web-based software package, ProtExt, which automatically extracts information on protein-protein interactions from the literature abstracts available through the NCBI Entrez-PubMed system and provides a visualization tool for constructing protein-protein interaction networks. The engine of ProtExt is based on a Naïve Bayes learning method that efficiently processes sentences from abstracts and generates possible protein-protein interactions. Users can specify a protein of interest, and a network of the predicted interacting proteins is generated and visualized on the web page. This system will provide a valuable resource for the prediction of protein-protein interactions and further study of the related proteins. ProtExt can be accessed freely at http://protext.csie.org.
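
A minimal sketch of the Naïve Bayes step follows: classifying sentences as describing an interaction or not. scikit-learn is used as a stand-in for ProtExt's own engine, and the training sentences are invented examples.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train = ["RAD51 interacts with BRCA2 in vivo",
             "p53 binds to MDM2 and is degraded",
             "samples were incubated overnight at 4C",
             "cells were lysed and centrifuged"]
    labels = [1, 1, 0, 0]  # 1 = sentence describes an interaction

    # Bag-of-words features feeding a multinomial Naive Bayes classifier.
    clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(train, labels)
    print(clf.predict(["BRCA1 interacts with BARD1"]))  # -> [1]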



Biclustering Gene-Feature Matrices for Statistically Significant Patterns
Mehmet Koyuturk, Wojciech Szpankowski, and Ananth Grama
Purdue University

Biclustering is an important problem that arises in diverse applications, including the analysis of gene expression and drug interaction data. The problem can be formalized in various ways through different interpretations of the data and the associated optimization functions. We focus on the problem of finding unusually dense patterns in binary (0-1) matrices. This formulation is appropriate for analyzing experimental datasets that come not only from binary quantization of gene expression data, but also from more comprehensive datasets such as gene-feature matrices that include functions of coded proteins and motifs in the coding sequence. We formalize the notion of an "unusually" dense submatrix to evaluate the interestingness of a pattern in terms of statistical significance, based on the assumption of a uniform memoryless source. We then derive simplified formulations to assess the statistical significance of discovered patterns. Using statistical significance as an objective function, we formulate the problem as one of finding maximally significant dense submatrices of a large sparse matrix. Adapting a simple iterative heuristic along with randomized initialization techniques, we derive fast algorithms for discovering binary biclusters. We conduct experiments on a binary gene-feature matrix, in which each entry indicates the existence of a feature (a function or motif) on a gene. Our experimental results show that the proposed method quickly discovers all interesting patterns in this dataset.
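
As an illustration of finding a dense submatrix of a binary matrix, here is one simple greedy heuristic, in the spirit of, but not identical to, the authors' randomized iterative algorithm; the density target and planted pattern are invented.

    import numpy as np

    def dense_submatrix(M, target_density=0.9):
        # Greedy: repeatedly drop the sparsest row or column until the
        # remaining submatrix reaches the target density.
        rows = list(range(M.shape[0])); cols = list(range(M.shape[1]))
        while rows and cols:
            sub = M[np.ix_(rows, cols)]
            if sub.mean() >= target_density:
                break
            r = int(sub.mean(axis=1).argmin())   # sparsest remaining row
            c = int(sub.mean(axis=0).argmin())   # sparsest remaining column
            if sub[r, :].mean() <= sub[:, c].mean():
                rows.pop(r)
            else:
                cols.pop(c)
        return rows, cols

    rng = np.random.default_rng(1)
    M = (rng.random((60, 40)) < 0.05).astype(int)   # sparse background noise
    M[10:20, 5:13] = 1                              # planted dense pattern
    rows, cols = dense_submatrix(M)
    print(rows, cols)  # should roughly recover the planted block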



Recognition of Exon/Intron Boundaries Using Dynamic Ensembles
Xuanfu Wu and Zhengxin Chen
University of Nebraska at Omaha

Many studies have been carried out on the recognition of exon/intron boundaries. Since this problem can be cast as a classification task, ensemble learning can be applied. An ensemble consists of a set of individually trained classifiers whose decisions are combined in a certain way for classification purposes. However, existing studies typically take static approaches, which hampers the flexibility needed for improved accuracy. To overcome this problem we have proposed the concept of a dynamic ensemble and developed a new algorithm, BAGA, which combines bagging and genetic algorithm techniques. Empirical studies have been carried out on the StatLog DNA data set using the proposed BAGA and its variations. Experiments showed that BAGA achieves better performance than traditional bagging in recognizing exon/intron or intron/exon boundaries, while using only a small number of individual classifiers chosen at execution time. In this paper, we describe the settings used in the experiments and the results of our empirical study. We also discuss the potential of using dynamic ensembles for other classification tasks in bioinformatics, and compare the proposed approach with existing approaches.
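
As a rough sketch of the dynamic-ensemble idea (classifiers chosen at execution time from a bagged pool), the following replaces BAGA's genetic algorithm with a simple greedy selection on held-out data; scikit-learn, the random data, and the selection rule are all stand-ins for illustration.

    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 60)); y = rng.integers(0, 3, size=300)
    X_val, y_val = X[240:], y[240:]   # held-out data guides selection

    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                            random_state=0).fit(X[:240], y[:240])

    # Greedily keep only estimators that improve the majority vote.
    chosen, best = [], 0.0
    for est in bag.estimators_:
        trial = chosen + [est]
        votes = np.array([e.predict(X_val) for e in trial])
        pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
        acc = (pred == y_val).mean()
        if acc > best:
            chosen, best = trial, acc
    print(len(chosen), "classifiers selected, accuracy", best)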



Probabilistic Consistency Analysis for Gene Selection
S. N. Mukherjee and S. J. Roberts
Dept. of Engineering Science, University of Oxford

A great deal of recent research has focused on the challenging task of selecting differentially expressed genes from microarray data (gene selection). Recent theoretical work* has shown that the effectiveness of a selection method can be captured as a probability of success (called selection accuracy), which is jointly determined by the statistical properties of the biological system under study and the form of the selection function. Subtle features of the underlying system can have serious effects on selection accuracy, to the extent that many widely used methods may produce quite spurious results under certain conditions. Unfortunately, in practical situations, relatively little tends to be known about the very features upon which algorithm performance depends, making it difficult to choose an appropriate method. This poster presents a simple, but fully probabilistic, measure of consistency in selection results which turns out to be an effective guide to algorithm choice. The practical utility of the approach lies in the fact that it can be used to choose good methods for experimental data and to calculate levels of confidence in those choices (in effect, making statements of the form "Algorithm A is 85% likely to be better suited to this dataset than Algorithm B"). Results on microarray data provide estimates of selection accuracies for several well-known algorithms and show, among other things, that a very simple fold-type method often outperforms the widely used t-test. Drawing on theoretical results, we conjecture that this effect may be due to systematic differences in variances across genes in microarray datasets. [*our paper submission to CSB2004, entitled "A Theoretical Analysis of Gene Selection"]
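
For concreteness, here is an illustrative comparison of the two selection functions mentioned above, a simple fold-type score versus a two-sample t statistic, applied to the same simulated two-class expression matrix; the simulation parameters are invented and do not reproduce the poster's experiments.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a = rng.normal(5.0, 1.0, size=(1000, 10))   # class A, 1000 genes
    b = rng.normal(5.0, 1.0, size=(1000, 10))   # class B
    b[:50] += 1.5                                # 50 truly differential genes

    fold = np.abs(a.mean(axis=1) - b.mean(axis=1))        # fold-type score
    t = np.abs(stats.ttest_ind(a, b, axis=1).statistic)   # t-test score

    for name, score in [("fold", fold), ("t-test", t)]:
        top = np.argsort(score)[::-1][:50]
        print(name, "recovers", np.sum(top < 50), "of 50 true genes")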



Automating the Biological Data Collection Process with Agents
Zoe Lacroix, Kaushal Parekh, Hasan Davulcu, I. V. Ramakrishnan, and Nikeeta Julasana
Arizona State University

Scientists spend a significant amount of time accessing Web resources, extracting information of interest, and filtering and integrating relevant data from multiple heterogeneous Web sites to support their data collection needs. This tedious collection process is typically performed manually, as available technology does not allow scientists to explore and control their data collection process step by step. However, most of the process can be automated. While scripts (e.g., in Perl) may be written to retrieve, parse, and extract data of interest, many scientists are not programmers and do not have IT support to develop such scripts. In contrast, we propose an approach based on Personal Information Agents (PIAs) that provides scientists a user-friendly mechanism for automating their data collection processes without the need for any programming. Our agent, WinAgent, works like a recording machine for the data collection protocol. It sits as a small toolbar in Internet Explorer, with intuitive buttons such as record, play, and stop. Once the scientist has designed the data collection protocol (selecting the various resources, identifying relevant information on each retrieved page, and specifying the navigation process, e.g., the selection of hyperlinks to go from one page to the next), he or she launches WinAgent, which records the overall process step by step. When all steps are recorded, an agent is automatically generated. The agent can then collect data automatically, following the data collection protocol step by step, and store all retrieved data in an XML document or a database for further data transformation, integration, and annotation.
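
As a purely hypothetical sketch of what a generated agent might do at replay time (the abstract does not describe WinAgent's internals), the following replays one recorded step and stores the extracted items as XML; the URL, extraction pattern, and fetch stub are all invented.

    import re
    import xml.etree.ElementTree as ET

    # One recorded step: a URL and an extraction rule for the retrieved page.
    protocol = [{"url": "https://example.org/genes?id=BRCA1",
                 "pattern": r"<td class='symbol'>(.*?)</td>"}]

    def fetch(url):
        # Stand-in for the HTTP retrieval a real agent would perform.
        return ("<table><td class='symbol'>BRCA1</td>"
                "<td class='symbol'>BARD1</td></table>")

    root = ET.Element("collected")
    for step in protocol:
        page = fetch(step["url"])
        for match in re.findall(step["pattern"], page):
            ET.SubElement(root, "item", source=step["url"]).text = match
    print(ET.tostring(root, encoding="unicode"))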



An Intelligent Digital Library System for Biologists
Jeffrey Stone, Xindong Wu, and Marc Greenblatt
Department of Computer Science, University of Vermont

To aid researchers in obtaining, organizing, and managing biological data, we have developed a sophisticated digital library system that utilizes advanced data mining techniques to guide the researcher. Our digital library system is centralized, with Web links to publicly accessible data repositories. The library is based on a framework used for conventional libraries and an object-oriented paradigm, and will provide personalized, user-centered services based on the user's areas of interest and preferences. To make personalized service possible, a user profile representing the preferences of an individual user is constructed from past activities, goals indicated by the user, and options specified by the user. Utilizing these user profiles, our system makes relevant information available to the user in an appropriate form, amount, and level of detail with minimal user effort. The core of our project is an agent architecture that provides advanced services by combining the data mining capabilities of the popular Weka data mining package with domain knowledge in the form of a semantic network. The semantic network imparts a knowledge structure through which the system can reason and draw conclusions about biological data objects, and provides a federated view of the many disparate databases of interest to biologists. In developing our semantic network, we have included concepts from several established controlled vocabularies, chief among them the National Library of Medicine's Unified Medical Language System (UMLS). Our complete semantic network consists of 183 semantic types and 69 relationships.
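
As a toy illustration of reasoning over a semantic network of typed concepts and named relationships, the following checks reachability between concepts; the triples are invented examples, not UMLS content or the authors' actual network.

    edges = {
        ("Gene", "codes_for", "Protein"),
        ("Protein", "participates_in", "Biological Process"),
    }

    def related(src, dst, edges):
        # Breadth-first check of whether dst is reachable from src.
        frontier, seen = [src], {src}
        while frontier:
            node = frontier.pop()
            for s, _, d in edges:
                if s == node and d not in seen:
                    if d == dst:
                        return True
                    seen.add(d); frontier.append(d)
        return False

    print(related("Gene", "Biological Process", edges))  # -> True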



A New Approach to Clustering Biological Data Using Message Passing
Huimin Geng (1), Dhundy Bastola (2), and Hesham Ali (1)
(1) Department of Computer Science, University of Nebraska at Omaha; (2) Department of Pathology and Microbiology, University of Nebraska Medical Center

Clustering algorithms are widely used in bioinformatics, having been applied to a range of problems from the analysis of gene expression to the building of phylogenetic trees. Biological data often describe parallel and spontaneous processes such as molecular interactions and genome evolution. To capture these features, we propose a new clustering algorithm that employs the concept of message passing. Inspired by a real-world situation in which initially unacquainted people form groups by exchanging messages, Message Passing Clustering (MPC) allows data objects to communicate with each other and produces clusters in parallel, thereby making parallelism intrinsic to the clustering process. Other advantages of MPC over traditional clustering methods are that it is relatively straightforward to understand and implement and that it takes into account both local and global structure. We have proved that MPC shares similarities with Hierarchical Clustering (HC) but offers significantly improved performance. To validate MPC, we analyzed thirty-five sets of simulated dynamic gene expression data. In our experiments, a 95% hit rate was achieved, with 639 of 674 genes correctly clustered. We have also applied MPC to two real data sets, building phylogenetic trees from aligned Mycobacterium sequences. The results show higher classification accuracies compared with traditional clustering methods such as HC.
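
A minimal sketch of the message-passing idea follows: in each round, every cluster "messages" its nearest neighbor, and mutually nearest pairs merge in parallel. This parallels hierarchical clustering but can merge many pairs per round; the details are an illustrative assumption, not the authors' exact MPC.

    import numpy as np

    def mpc_round(clusters, X):
        centers = np.array([X[c].mean(axis=0) for c in clusters])
        d = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
        np.fill_diagonal(d, np.inf)
        nearest = d.argmin(axis=1)
        merged, used = [], set()
        for i, j in enumerate(nearest):
            if i in used or j in used:
                continue
            if nearest[j] == i:           # mutual nearest: messages exchanged
                merged.append(clusters[i] + clusters[j]); used.update({i, j})
        merged += [c for k, c in enumerate(clusters) if k not in used]
        return merged

    X = np.array([[0., 0.], [0.1, 0.], [5., 5.], [5.1, 5.], [9., 0.]])
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 2:
        clusters = mpc_round(clusters, X)   # many merges happen per round
    print(clusters)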



Ontology Specific Data Mining Based on Dynamic Grammars
Daniel Quest and Hesham Ali
University of Nebraska at Omaha

A general approach to sequence comparison and data mining is introduced: grammar-centric ontology searching. This is a formal-grammar-based approach to data mining that can do what previous tools, such as alignment search tools and regular expression engines, can do, and much more. A common hypothesis is that biological sequences contain elements, and that these elements are the functional units that determine the interactions of the molecule. These elements may not be detectable by a simple homology search (alignment) because of the interference and noise produced by mutations over the course of evolution. For many biological problems, however, these consensus subsequences or expressions are the key to the sequence's function or its relationship to other biological units. In this paper, we introduce a grammar and a corresponding data mining engine capable of extracting records from the public databases, selecting a subset of those records for mining, and then sorting those records by similarity to a model grammar designed by the user. This model is based on the objective (ontology) of the user, and scoring is dynamic at runtime.
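
As a toy illustration of matching sequences against a user-defined model grammar, the following compiles an ordered list of motif patterns with bounded gaps into a single matcher; the grammar syntax, motifs, and sequences are invented for illustration and are simpler than the authors' formalism.

    import re

    def grammar_to_regex(elements, max_gap=30):
        # elements: ordered motif patterns; allow up to max_gap residues between.
        return (".{0,%d}" % max_gap).join("(%s)" % e for e in elements)

    grammar = ["GxGxxG".replace("x", "."),  # P-loop-like element
               "HxxEH".replace("x", ".")]   # hypothetical catalytic element
    pattern = re.compile(grammar_to_regex(grammar))

    sequences = {"seq1": "MKTGAGLSGAAHLVEHAAA", "seq2": "MKTAAAAAAAAAA"}
    for name, seq in sequences.items():
        print(name, "match" if pattern.search(seq) else "no match")
    # -> seq1 match, seq2 no match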



Application of the Relief-F Feature Filtering Algorithm to Selecting Informative Genes for Cancer Classification Using Microarray Data
Yuhang Wang and Fillia Makedon
Dartmouth College

Numerous recent studies have shown that microarray gene expression data are useful for cancer classification. Classification based on microarray data differs from previous classification problems in that the number of features (genes) greatly exceeds the number of instances (tissue samples). Typically, there are thousands of genes and fewer than one hundred samples. It has been shown that selecting a small set of informative genes can lead to improved classification accuracy. It is thus important to apply feature selection methods prior to classification. In the machine learning field, one of the most successful feature filtering algorithms is Relief-F. This algorithm has been successfully used in many feature selection tasks involving large feature sets. However, to our knowledge, its performance on gene selection has not been evaluated. In this work, we empirically evaluate its performance on three published cancer classification data sets: ALL-AML leukemia, MLL leukemia, and colon tumor. We use the linear Support Vector Machine (SVM) and k-NN as classifiers in the experiments. We compare the performance of Relief-F with other feature filtering methods, including Information Gain, one-dimensional SVM, Gain Ratio, and the chi-squared statistic. Using leave-one-out cross-validation, experimental results show that the performance of Relief-F is comparable with that of the other methods.
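
A sketch of the core Relief idea follows (a simplified two-class Relief rather than full Relief-F, which also handles multiple classes and k nearest neighbors): a feature's weight rises when it separates nearest "misses" and falls when it separates nearest "hits". The data are random placeholders.

    import numpy as np

    def relief_scores(X, y, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            i = rng.integers(len(X))
            d = np.abs(X - X[i]).sum(axis=1)   # Manhattan distances
            d[i] = np.inf
            same, diff = y == y[i], y != y[i]
            hit = np.where(same)[0][d[same].argmin()]    # nearest same class
            miss = np.where(diff)[0][d[diff].argmin()]   # nearest other class
            w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
        return w / n_iter

    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 50)); y = rng.integers(0, 2, size=80)
    X[:, 0] += 2 * y                      # make gene 0 informative
    print(relief_scores(X, y).argmax())   # -> 0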



Probability Profile Method - A New Road for Data Analysis in Tandem Mass Spectrometry
Andrey Gorin, Robert Day, Andrey Borziak, Jane Razumovskaya, and Tema Fridman
Oak Ridge National Laboratory

Mass spectrometry is the tool of choice of modern proteomics, applied to crucially important experimental problems ranging from the reconstruction of whole-cell proteome content to the high-throughput discovery of protein-protein complexes. Currently only ~10% of all tandem MS (MS/MS) spectra lead to peptide identification, and the rate of false positives at the peptide level is high. Our Probability Profile Method (PPM) opens a new road toward reliable and comprehensive peptide/protein identification. The principal idea can be described as micro-analysis of the spectra: analysis of the patterns typical of individual peak categories and of the relationships between individual peaks. Results confirmed on large and diverse data sets (~60,000 spectra) indicate that a large majority of the useful peaks in MS/MS spectra can be identified with a surprising level of confidence, providing the foundation for a range of new algorithmic capabilities: (1) spectra can be edited by sorting out desirable peak categories; (2) overall characteristics of MS/MS spectra, such as the parent ion charge or the total number of ions present, can be rapidly estimated with high precision; (3) labeled peaks of the same category, e.g., b-ion peaks, can be connected into de novo peptides with a quantitative measure of the confidence level.
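
As a hypothetical illustration of "micro-analysis" around one peak, the following gathers evidence from a peak's neutral-loss neighborhood (water, ammonia, and CO losses, which commonly accompany fragment ions). The tolerance and spectrum are invented, and PPM's actual probability model is more elaborate than this presence check.

    NEUTRAL_LOSSES = {"H2O": 18.011, "NH3": 17.027, "CO": 27.995}

    def neighborhood_evidence(peak_mz, spectrum, tol=0.4):
        # Record which expected neutral-loss satellites accompany a peak.
        found = {}
        for name, loss in NEUTRAL_LOSSES.items():
            found[name] = any(abs(mz - (peak_mz - loss)) <= tol
                              for mz, _ in spectrum)
        return found

    spectrum = [(500.28, 120.0), (482.27, 35.0), (472.29, 18.0)]  # (m/z, intensity)
    print(neighborhood_evidence(500.28, spectrum))
    # -> {'H2O': True, 'NH3': False, 'CO': True}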



Assigning Gene Ontology (GO) Categories to Yeast Genes Using Text-Based Supervised Learning Methods
Tomonori Izumitani, Hirotoshi Taira, Hideto Kazawa and Eisaku Maeda
NTT Communication Science Laboratories

In genome database projects such as the Saccharomyces Genome Database (SGD), curators manually assign Gene Ontology (GO) codes to each gene by reading an enormous number of relevant documents. Automatic assignment of GO codes is therefore in demand, and challenging. We propose a method for assigning upper-level GO codes (GO categories) to each gene using documents relevant to the gene. This method combines the documents relevant to each gene and represents the gene as a vector of weighted word frequencies. A binary classifier is then learned for each category by supervised learning; we tried support vector machines (SVMs) and the maximum entropy method (MEM) as learning methods. We applied this method to assigning GO categories to yeast genes and compared its performance with the existing method developed by Raychaudhuri et al. Twelve GO categories from their study were tested, and performance was measured by F-measure (the harmonic mean of precision and recall). Our method achieved an average F-measure of 0.67, which is 0.3 higher than that of Raychaudhuri et al.'s method, when SVMs with first-order polynomial kernels were used as classifiers. We also applied the method to genome-wide annotation for yeast using all GO Slim categories provided by SGD. The proposed method achieved average F-measures of 0.58, 0.72, and 0.60 for the three GO parts (Cellular Component, Molecular Function, and Biological Process), respectively. These results indicate that the proposed method can provide provisional annotations for genes that have not yet been checked by curators, and can contribute to the efficiency of GO annotation.
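
A minimal sketch of the pipeline described above follows: pool each gene's documents into one text, weight word frequencies, and train one binary SVM per GO category with a first-order polynomial kernel. scikit-learn, TF-IDF weighting, and the texts and labels are assumptions for illustration, not the authors' exact setup.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline

    gene_docs = {
        "YFG1": "kinase activity phosphorylates substrates in the nucleus",
        "YFG2": "ribosomal subunit involved in translation elongation",
        "YFG3": "protein kinase cascade signal transduction",
    }
    in_category = {"YFG1": 1, "YFG2": 0, "YFG3": 1}  # one GO category's labels

    texts = list(gene_docs.values())
    labels = [in_category[g] for g in gene_docs]
    # One binary classifier per category; repeat for each GO category.
    clf = make_pipeline(TfidfVectorizer(),
                        SVC(kernel="poly", degree=1)).fit(texts, labels)
    print(clf.predict(["uncharacterized kinase with signaling role"]))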



PPM-chain: An Efficient and Precise Tool for De Novo Peptide Identification in Tandem Mass Spectrometry
Robert Day, Andrey Borziak and Andrey Gorin
Computer Science and Mathematics Division, ORNL

Recently we introduced the Probability Profile Method (PPM), which utilizes the noisy neutral-loss neighborhood around each peak to assign a specific probability to each peak category under consideration. Here we present PPM-chain, a PPM-based tool for de novo protein tag identification. De novo peptide identification involves finding a connected sequence of ion peaks separated by intervals of amino acid masses, corresponding to a tag of the source protein. However, the number of possible connected sequences in a raw spectrum can be in the hundreds of thousands, and it increases exponentially with the desired length of the tag. PPM can be used to locate high-probability islands containing practically only b- and y-ion peaks, thereby greatly reducing computational complexity and sharply increasing the precision of tagging. In addition, the obtained tags can be reliably ranked using the PPM-derived probabilities assigned to the connected peaks. The value of peptide tags was demonstrated on a large database of ~90,000 proteins. Even without taking into account protease digestion restrictions or additional available mass constraints, the obtained tags were efficient at eliminating incorrect proteins. For example, of all constructed tags of length 4, 90% reduced the number of candidate proteins to fewer than 400. With additional flanking mass constraints, PPM-chain shows precision exceeding that of the field's gold standard, the SEQUEST program, while providing a set of unique new capabilities and outperforming SEQUEST in speed by a factor of 200 to 2000.
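
A minimal sketch of the tag-construction step follows: a depth-first search connects peaks whose m/z gaps match amino acid residue masses, yielding sequence tags. The residue subset, tolerance, and peak list are invented, and PPM-chain's island location and probability ranking are not reproduced here.

    # Monoisotopic residue masses (a small subset, for illustration).
    RESIDUES = {"G": 57.021, "A": 71.037, "S": 87.032, "V": 99.068, "L": 113.084}

    def find_tags(peaks, length, tol=0.02):
        peaks = sorted(peaks)
        def extend(i, tag):
            if len(tag) == length:
                yield tag; return
            for j in range(i + 1, len(peaks)):
                gap = peaks[j] - peaks[i]
                for aa, mass in RESIDUES.items():
                    if abs(gap - mass) <= tol:
                        yield from extend(j, tag + aa)
        for i in range(len(peaks)):
            yield from extend(i, "")

    peaks = [200.10, 257.12, 328.16, 441.24, 540.31]
    print(list(find_tags(peaks, 4)))  # -> ['GALV'] from the planted ladder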



