Poster Abstracts for
Section A: Poster Sessions I and IV
Data Integration
A Semantic Mediation Approach for Problems in Computational Molecular Biology
A Laboratory Information Management System for Flow Cytometry
Supporting Scientific Discovery with a Scientific Query Language
Data Mining
Application of Singular Value Decomposition and Functional Clustering to Analyzing
Gene Expression Profiles of Renal Cell Carcinoma
Statistical Resynchronization and Bayesian Detection of Periodically Expressed Genes
Understanding Macrophage Signaling Pathways through DNA Microarray Data Using
Competitive Learning Neural Network and Fuzzy Logic Data Mining Approach
A Method for Tight Clustering: With Application to Microarray
Riptide: Fast Protein Identification from Mass Spectrometer Data
Accelerating the Drug Design Process through Parallel ILP Data Mining
Estimating Recombination Rate Distribution by Optimal Quantization
Statistical Issues in the Analysis of the CGH Data
Identification of Contaminants in Proteomics Mass Spectrometry Data
A Flexible Pipeline for Experimental Design, Processing, and Analysis of Microarray Data
Computational Approach to Reconstructing Gene Regulatory Networks
Probability Profiles: Novel Approach in Mass Spectrometry De Novo Sequencing
A Symbolic Logic Strategy For Mapping Biological Concepts in the HERBE Prototype
A Computational Method for Assessing Peptide-Identification Reliability in
Tandem Mass Spectrometry Analysis with SEQUEST
Functional Genomics
Tomato Expression Database (TED): An Interactive Management Tool for
Tomato Expression Profiling Data
Oncogenetic Tree Models: An Estimation
The GeneCards Family of Databases: GeneCards, GeneLoc, GeneNote and GeneAnnot
Wavelet Transforms for the Analysis of Microarray Experiments
MageBuilder: A Schema Translation Tool for Generating MAGE-ML from Tabular Microarray Data
OptiRNAi, a Web-based Program to Select siRNA Sequences
A Rule-based Framework for Gene Regulation Pathways Discovery
Gene Function and Metabolic Pathways in the Yeast
Using Rule Induction to Analyze Gene Expression Data
Genotyping and SNPs
SNP Analyzing System for Detecting Multifactorial Disease Associated Site
Haplotype Pattern Mining and Classification for Detecting Disease Associated SNPs
Strings, Graphs, and Algorithms
Genetic Algorithm Approach to the Center Sequence Problem Arising in
Post-transcriptional Gene Silencing
GenericBioMatch: A Novel Generic Pattern Match Algorithm for Biological Sequences
Aligning ESTs to Genome Using Multi-Layer Unique Markers
A Parallel Evolutionary Method for Physical Mapping of Chromosomes
On Finding Common Intervals among k Protein Sequences Containing Redundant Elements
Algorithms for Bounded-Error Correlation of High Dimensional Data in
Microarray Experiments
Exact and Heuristic Algorithms for the DNA Fragment Assembly Problem
Finding Higher Order Motifs under the Levenshtein Measure
Mapping Discontinuous Antibody Epitopes to Reveal Protein Structure
and Changes in Structure Related to Function
The SCP and Compressed Domain Analysis of Biological Sequences
Identifying Regulatory Signals in DNA-sequences with a Non-statistical
Approximation Approach
A Genetic Algorithm for Simplifying the Amino Acid Alphabet
Structural Biology
Spectral Decomposition of the Laplacian Matrix Applied to RNA Folding Prediction
Identification of Non-random Patterns in Structural and Mutational
Data: The Case of Prion Protein
Multiple Protein Structure Alignment by Deterministic Annealing
An Exact Algorithm For Determining Protein Backbone Structure
From NH Residual Dipolar Couplings
Automatic Construction of 3D Structural Motifs for Protein Function Prediction
Local Minima-Based Exploration for Off-Lattice Protein Folding
Text Mining and Ontologies
HyBrow: A System for Hypothesis Validation
Identifying Gene and Protein Names from Biological Texts
Refining the Extraction of Relevant Documents from Biomedical
Literature to Create a Corpus for Pathway Text Mining
Using Natural Language Processing and the Gene Ontology to
Populate a Structured Pathway Database
Text Visualization for Natural Language Analysis of Biology Full Text
Data Integration
A Semantic Mediation Approach for Problems in Computational Molecular Biology
M. Chagoyen, M.E. Kurul, P.A. De-Alarcón, S. Santini, B. Ludäscher,
J.M. Carazo, and A. Gupta
Centro Nacional de Biotecnologia - CSIC, Madrid, Spain
Abstract
A computational method of problem solving needs to
interleave information access and algorithm execution by
putting them together in a problem-specific "workflow."
In a complex domain like molecular biology, this workflow often
involves executing a number of algorithms requiring access
to multiple information sources (or multiple times to the
same information source) to provide parameters at different
steps of its execution. To translate this requirement into
reality, one needs to develop a general-purpose system for
scientific workflows that contains an information management
component providing access to a wide variety of information
sources, including local databases, public data banks with
a query interface, web sites that publish information
without any explicit means to "query" them, and
computational resources. Equally important is the need to ensure that
different pieces of information acquired from different
information sources are semantically compatible so that a
consumer of information in the workflow can meaningfully
accept the information object produced by some other source
at a prior step of execution.
We present two system components that are parts of a larger
computation infrastructure to execute a scientific workflow.
The first is a programmable integrator that can access
information from multiple, heterogeneous sources in a
concerted manner. The second component is a semantic
mediator that can integrate information from different
information sources by bridging over the semantic
differences among them. The key characteristic of this
mediator is its ability to encode and execute
domain-specific expert rules in order to join information
from multiple sources.
A Laboratory Information Management System for Flow
Cytometry
Jeffrey D. Grant, Elizabeth Goralczyk, Frank J. Manion, R.
Randy Hardy,
J. Robert Beck, and Michael F. Ochs Fox Chase Cancer
Center, Philadelphia, PA
Abstract
Flow cytometry has evolved into an important tool for
biological and medical research. In order to maximize
throughput on a cytometer and fully explore data sets,
scientists need a way to design protocols (including cell
types and fluorochrome-stain combinations), have these
protocols conveyed to the flow cytometer, review their
experiments, and efficiently analyze results using
visualization tools. To meet these needs, we have created a
Laboratory Information Management System (Flow LIMS) that
enables users to design and manage experimental protocols
for the Becton Dickinson DiVA flow cytometer. Results are
accessed by the automatic linking of FCS output files to the
Flowjo application (Tree Star, Inc., San Carlos, CA) for
visual analysis. At the core of the Flow LIMS is a
collection of Java classes and JavaServer Pages that handle
user authentication using LDAP, record and search protocol
details in an Oracle database, and generate XML files for
conveying protocol details to the DiVA machine. A Perl
script periodically copies all DiVA output to a regularly
backed-up archive space. We are currently at work
integrating the Flow LIMS with the National Cancer
Institute's caLIMS and caBIO, which will allow the LIMS to
easily access NCI Center for Bioinformatics resources
including single nucleotide polymorphisms (SNPs), tissue
specific expression information, molecular targets of
specific therapeutics, details of signaling pathways and
protein interactions, and genomic data. Ultimately, the Flow
LIMS will become part of a larger LIMS that will encompass
many types of biological data including microarray
and proteomics data, among others.
Supporting Scientific Discovery with a Scientific Query Language
Barbara Eckman1, Kerry Deutsch2, Marta Janer2,
Zoé Lacroix3 and Louiqa Raschid4
1IBM Life Science
2Systems Biology
3Arizona State University
4University of Maryland
Abstract
The deficiencies of relational databases for supporting
biological investigations have often been noted. The
relational data model is too impoverished to accurately
represent richly hierarchical biological structures. In a
federated context, wide-ranging multi-source queries,
particularly those containing joins on the results of BLAST
searches, often return unmanageably large result sets,
requiring non-traditional methods to identify and exclude
extraneous data. Such filtering often requires more
sophisticated conditions than the relational algebra can
express, e.g., regular expression pattern matching. It is
widely accepted that in order to support biological
investigations a query language must provide functions
wrapping analytic tools not supported by the relational
query languages (e.g., sequence comparisons, motif
searches, chemical sub-structure searches). In addition,
the mapping between relational operators and the biological
semantics of a scientific query is not obvious. For
example, querying a table of gene expression results for
genes relatively specific to pancreas (i.e., expressed in
pancreas and only, say, 3 other tissues) requires a sub-query with GROUP BY and HAVING operators; the ubiquity of
variant types in biological databases often requires outer
joins and the use of functions like coalesce(A,B), which
returns A if it is non-null, but B otherwise; and since
greater confidence is placed in data attested by multiple
sources, biologists often want to return only objects that
are found in multiple selection sets defined by multiple
sub-queries. On the other hand, a query language is
attractive for encoding biological investigations and may
be relatively easy to optimize. A declarative language is
also in theory easier for biologists to use in implementing
their own investigations, without relying on professional
programmers. The scientific query language proposed in the
poster is composed of four operators: collection,
filtering, ranking and validation. We illustrate our
approach with the study of SNP profiles to characterize
diseases.
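As a toy illustration of the pancreas example above (not the authors' system), the GROUP BY/HAVING logic and the coalesce function can be sketched in plain Python over hypothetical (gene, tissue, level) records; the expression threshold and data are invented.

```python
# Sketch of the "relatively pancreas-specific" query and coalesce(A, B),
# over hypothetical (gene, tissue, level) records; not the authors' system.
from collections import defaultdict

records = [
    ("INS", "pancreas", 9.1), ("INS", "liver", 0.2),
    ("GAPDH", "pancreas", 5.0), ("GAPDH", "liver", 4.8), ("GAPDH", "brain", 5.2),
    ("GAPDH", "lung", 4.9), ("GAPDH", "kidney", 5.1), ("GAPDH", "heart", None),
]

def coalesce(a, b):
    """SQL-style coalesce: return a if it is non-null, otherwise b."""
    return a if a is not None else b

# Genes expressed in pancreas and in at most 3 other tissues:
# the GROUP BY ... HAVING logic expressed as a dictionary aggregation.
tissues_by_gene = defaultdict(set)
for gene, tissue, level in records:
    if coalesce(level, 0.0) > 1.0:        # invented "expressed" threshold
        tissues_by_gene[gene].add(tissue)

specific = [g for g, ts in tissues_by_gene.items()
            if "pancreas" in ts and len(ts - {"pancreas"}) <= 3]
print(specific)                           # ['INS'] for this toy data
```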
Data Mining
Application of Singular Value Decomposition and Functional Clustering
to Analyzing Gene Expression Profiles of Renal Cell Carcinoma
Zhong-Hui Duan, Joseph DiDonato, Louis S. Liou and Ting Shi
University of Akron
Abstract
The microarray gene expression profiles of both clear cell
renal cell carcinoma tissues and a renal cell carcinoma
cell line were analyzed using singular value decomposition
and functional clustering analysis. The expression matrix
of the tissue and cell line samples based on the 3145
selected genes was decomposed using singular value
decomposition. The 2-D projection of the expression
profiles revealed significant differences between the
profiles of renal cell carcinoma tissue and cell line
samples. Selected genes were annotated based on biological
processes and clustered into functional groups. The
expression levels of genes in each group were analyzed. The
analyses revealed remarkable gene expression alterations in
the biological pathways of renal cell carcinoma and
provided insights into understanding the molecular
mechanism of renal cell tumorigenesis. Based on the
expression levels, 74 commonly differentially expressed
genes in renal cell carcinoma tissue samples were
identified. This group of genes may be useful as molecular
tumor markers that can potentially be used for more
accurate diagnosis, prognosis and possibly can serve as
drug targets for effective therapies. The detailed
comparison of gene expression patterns in renal cell
carcinoma tissue samples and the renal cell carcinoma cell line
showed significant differences between the two types of
samples, although many important expression patterns were
consistent.
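For readers unfamiliar with the technique, a minimal numpy sketch of the 2-D SVD projection step follows; the random matrix merely stands in for the paper's 3145-gene expression matrix.

```python
# Sketch of a 2-D SVD projection of an expression matrix (genes x samples);
# dimensions are illustrative, not the paper's actual dataset.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3145, 20))          # rows: genes, columns: samples

# Center each gene, then decompose: X = U S V^T.
Xc = X - X.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project the samples onto the first two left singular vectors
# ("eigengenes"); plotting these coordinates separates sample groups.
coords = (np.diag(S[:2]) @ Vt[:2, :]).T   # shape: (20 samples, 2)
print(coords.shape)
```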
Statistical Resynchronization and Bayesian Detection of Periodically
Expressed Genes
Xin Lu, Wen Zhang, Zhaohui S. Qin, and Jun S. Liu
Department of Statistics, Harvard University
Abstract
A major challenge of identifying periodically expressed
genes from cell-cycle microarray data is to eliminate the
substantial amount of noise caused by synchrony decay,
block/release effects and experimental variation. We
propose a Periodic-Normal Mixture (PNM) model to improve the
reliability of measuring cell-cycle length and gene
expression periodicity by re-synchronizing the
transcription profiles with progressively mixed Fourier
decomposition. Subsequently, we devised a two-component
Beta mixture model to approximate the PNM fitting residues
of three datasets (cdc28, alpha and cdc15) in order to
combine information gathered from different experiments
and detect genes that are periodically expressed. The
results show that about 1780 genes (32.3% of the 5510
genes examined) are likely to be periodically expressed.
We identified 822 genes whose posterior probability of
periodicity is greater than 0.95, among which 282 are
absent from the list of 800 genes reported by Spellman et al. (1998). When
matching the 822 resynchronized expression profiles of the
three independent experiments, little phase shift was
observed, which indicated that the three synchronization
methods bring cells to the same cell-cycle phase at the
time of release. The consensus expression profiles of the
822 genes were elucidated by weighted average of the
expression profiles in the three datasets, and the
expression stage of these genes was estimated by matching
the consensus profiles with typical profiles drawn from
five groups of well-known phase-specific genes. Checking
with the Gene Ontology annotation, it was observed that
many of the 822 genes are involved in important cell-
cycle-related processes, functions and components.
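A much-simplified sketch of the underlying periodicity measurement is given below: it fits a single Fourier component at a trial cell-cycle period by least squares and reports the variance explained. The full PNM model with progressive mixing and the Beta mixture for residues is not reproduced; the time course and noise level are invented.

```python
# Hedged sketch: least-squares fit of one Fourier component at a trial period.
import numpy as np

def periodicity_fit(expr, times, period):
    w = 2 * np.pi / period
    A = np.column_stack([np.ones_like(times), np.cos(w * times), np.sin(w * times)])
    coef, *_ = np.linalg.lstsq(A, expr, rcond=None)
    resid = expr - A @ coef
    return 1 - resid.var() / expr.var()     # fraction of variance explained

times = np.arange(0, 120, 7.0)              # minutes, e.g. an alpha-factor series
expr = (np.sin(2 * np.pi * times / 60)
        + 0.2 * np.random.default_rng(3).normal(size=times.size))
print(round(periodicity_fit(expr, times, period=60.0), 2))  # near 1 for periodic genes
```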
Understanding Macrophage Signaling Pathways through DNA Microarray Data
Using Competitive Learning Neural Network and Fuzzy Logic Data Mining Approach
Xin Feng, Chin-Fu Chen and Peter J. Tonellato
Marquette University, Dept. of Electrical and Computer Engineering
Abstract
Understanding the response of human white blood cells
(macrophages) to pathogens may provide insights to both the
mechanisms of host defenses and the tactics used by
pathogens to circumvent these defenses. The DNA microarray
method has evolved to become one of the most powerful tools
to understand the dynamics of gene response. Currently
there is no standard approach to systematically analyze the
data and the interpretation of results can vary
dramatically. In this paper, we employed a new combined
Competitive Learning neural network and fuzzy logic data
mining approach to explore patterns of time sequence data
from eight bacteria with 977 gene responses that showed
significant changes on a microarray chip. We annotated the
genes by their "biological processes" of Gene Ontology
(GO). Our result suggests that the CL-Fuzzy logic data
mining approach is effective in exploring how human
macrophages respond to each bacterium with a unique
combination of shared biological processes. The shared
processes include: signal transduction, transcription,
metabolism, and cell cycle and proliferation. Our result
also suggests that there are similar responses (identical
genes) to several bacteria and the similarities may be
related to the shared mechanism of bacterial pathogenesis.
A Method for Tight Clustering: With Application to Microarray
George C. Tseng and Wing H. Wong
Department of Biostatistics, Harvard University
Abstract
In this paper we propose a method for clustering that produces tight
and stable clusters without forcing all points into clusters. The
methodology is initially motivated from cluster analysis of microarray
experiments. Many existing clustering algorithms have been applied
in microarray data to search for gene clusters with similar expression
patterns. However, none has provided a way to deal with an essential
feature of array data: many genes are expressed sporadically and do
not belong to any of the significant biological functions (clusters)
of interest. In fact, most current algorithms aim to assign all genes
into clusters. For many biological studies, however, we are mainly
interested in the most informative, tight and stable clusters with
sizes of, say, 20-60 genes for further investigation. Tight Clustering
has been developed specifically to address this problem. It applies
K-means clustering as an intermediate clustering engine. Early truncation
of the hierarchical clustering tree is used to overcome the local minimum
problem in K-means clustering. The tightest and most stable clusters
are identified in a sequential manner through an analysis of the tendency
of genes to be grouped together under repeated resampling. We validated
this method in a simulated example and applied it to analyze expression
profiles of the Drosophila life cycle. The result of this new
method is shown to better serve biological needs in microarray analysis.
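A minimal sketch of the resampling idea follows, using scikit-learn's K-means as the intermediate engine: gene pairs that repeatedly co-cluster under subsampling form tight-cluster candidates. The subsample fraction, k, and threshold are illustrative, not the authors' settings.

```python
# Sketch: accumulate a co-membership matrix over K-means runs on subsamples.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))           # 200 genes x 10 conditions (toy data)
n, k, n_resamples = X.shape[0], 8, 30

co = np.zeros((n, n))
for _ in range(n_resamples):
    idx = rng.choice(n, size=int(0.7 * n), replace=False)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
    for c in range(k):
        members = idx[labels == c]
        co[np.ix_(members, members)] += 1

co /= n_resamples
# A "tight" cluster is a set of genes whose pairwise co-membership is high
# (e.g. > 0.9); sporadically expressed genes never reach this and stay unassigned.
```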
Riptide: Fast Protein Identification from Mass Spectrometer Data
Richard J. Carter
Hewlett Packard Labs
Abstract
The biotech firm Target Discovery Inc. (TDI) has developed
a 100X faster approach to producing a fragmentation mass
spectrum from a labeled protein sample. This speed
increase has put pressure for similar speed improvements in
the company's algorithm for determining the unknown
protein's terminal amino acid sequence from its mass
spectrum. This work presents the novel algorithm Riptide,
which demonstrates a 125X speed improvement over TDI's web-
published algorithm as measured from coded C programs.
Given a mass spectrum dataset of an unknown protein, TDI's
algorithm assigns a score to each of the 20^n sequence
possibilities for the protein's n terminal amino acids (for
some desired n). The highest scoring sequence is reported
as the most likely terminal identity of the protein.
Central to the TDI algorithm is a mass spec look-up
function MS() that reports a normalized occurrence count
for a given mass/charge based on the current dataset.
Riptide, by design, produces an identical answer to TDI's
algorithm with far less computational complexity. Riptide
capitalizes on the fact that the MS() return value,
although dependent on the combination of amino acids in the
fragment being evaluated, is independent of the amino acid
order. By performing the MS() call once for a given amino
acid combination, Riptide is able to reduce the MS() calls
from 49 million to 177K for the desirable case of a 6-deep
sequencing. In addition, the mean and standard deviation
score parameters required by the normalization operation
can be calculated efficiently in Riptide's combination
space sequencing approach.
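The combinatorial saving can be checked directly: because MS() depends only on the amino-acid combination, one evaluation per multiset suffices, and C(25, 6) = 177,100 matches the 177K figure above. A small Python check (the alphabet and depth are the standard 20 amino acids and the 6-deep case named in the abstract):

```python
from itertools import combinations_with_replacement
from math import comb

AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
n = 6                                  # depth of terminal sequencing

# Number of amino-acid combinations (multisets) of length 6:
print(comb(len(AA_ALPHABET) + n - 1, n))     # 177100, the "177K" above

# Riptide-style idea: evaluate MS-based scoring once per combination;
# every ordered candidate reuses the value for its sorted residue multiset.
combos = list(combinations_with_replacement(AA_ALPHABET, n))
print(len(combos))                           # 177100
```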
Accelerating the Drug Design Process through Parallel ILP Data Mining
James Graham1, C. David Page2 and Ahmed Kamal3
1Univ. of Wisconsin
2Systems Biology
3Western Kentucky Univ.
Abstract
Present drug design methodologies often require several
years in the initial chemical evaluation stages before
compounds can be formulated for animal and human testing.
This poster presents a new parallel approach for inductive
logic programming (ILP) that has the potential to streamline
this process. The system is based on effective data
partitioning and message passing in the parallel ILP search
protocols. Two of the key design features of this approach
are its portability between different parallel machine
architectures, and the ease of parallelizing important ILP
search problems. The system has been tested on a Beowulf
cluster and on an IBM SP2 supercomputer to successfully
search for new pharmacophores for a group of
angiotensin-converting enzyme (ACE) inhibitor drugs. The
system has shown an approximately linear speedup of 60 on a
64 processor machine during our initial testing. This
parallel ILP architecture should be applicable to a wide
range of other bioinformatics and biochemistry problems, in
addition to pharmacophore identification.
Estimating Recombination Rate Distribution by Optimal Quantization
Mingzhou Song1, Stephane Boissinot2, Robert M. Haralick3 and Ihsin T. Phillips1
1Department of Computer Science, Queens College, Flushing, NY 11367
2Department of Biology, Queens College, Flushing, NY 11367
3Doctoral Program in Computer Science, Graduate Center,
City University of New York, New York, NY 10016
Abstract
Evolutionary biologists are interested in a high resolution
recombination map that depicts accurately how often a
recombination event occurs at a specific location in the
genome. With the availability of the human genome physical map
and fast sequencing technology, researchers have started to estimate
recombination rate distributions. We obtain recombination
rate distribution functions for all the chromosomes in the
human genome using an optimal quantization method. In this
method, we are able to explicitly control over-fitting
and under-fitting. The obtained piecewise-constant
recombination rate distribution functions are convenient to
store and retrieve. Our experimental results showed more
abrupt distribution functions than two recently published
results. In the previous results, the over-/under-fitting
issues were not addressed explicitly. We also achieved better
quantitative performance than the Parzen window used in a
previous approach. This suggests that optimal
quantization may be of great advantage for estimating other
genome feature distributions.
Statistical Issues in the Analysis of the CGH Data
Jane Fridlyand, Antoine Snijders, Donna Albertson and Ajay Jain
UCSF Cancer Center, San Francisco, CA
Abstract
The development of solid tumors is associated with the acquisition of complex
genetic alterations, indicating that failures in the mechanisms that
maintain the integrity of the genome contribute to tumor evolution. Thus,
one expects that the particular types of genomic derangement seen in tumors
reflect underlying failures in maintenance of genetic stability, as well as
selection for changes that provide growth advantage. In order to investigate
genomic alterations we are using microarray-based comparative genomic
hybridization (array CGH). The computational task is to map and characterize
the number and types of copy number alterations present in the tumors, and so
define copy number phenotypes as well as to associate them with known
biological markers.
To utilize the spatial coherence between nearby clones, we use an unsupervised Hidden
Markov Model approach. The clones are partitioned into states that represent
the underlying copy number of groups of clones. We discuss the model and demonstrate
its performance using simulated and real data. The biological conclusions drawn
from the analyses are discussed.
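As a rough modern stand-in for the authors' unsupervised HMM (not their implementation), a Gaussian HMM from the third-party hmmlearn package can segment a log2-ratio profile into copy-number states; the toy profile and state count below are invented.

```python
# Sketch: segment a CGH log2-ratio profile with a Gaussian HMM (hmmlearn).
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(2)
# Toy profile: 100 normal clones, a 30-clone gain, then 70 normal clones.
ratios = np.concatenate([rng.normal(0.0, 0.1, 100),
                         rng.normal(0.6, 0.1, 30),
                         rng.normal(0.0, 0.1, 70)])

model = GaussianHMM(n_components=3, covariance_type="diag", n_iter=100)
model.fit(ratios.reshape(-1, 1))             # clones ordered along the genome
states = model.predict(ratios.reshape(-1, 1))
# Runs of identical states partition the clones into loss/normal/gain segments.
```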
Identification of Contaminants in Proteomics Mass Spectrometry Data
M. Duncan1, K. Fung1, H. Wang2, C. Yen2 and K.
Cios2
1University of Colorado Health Sciences Center
2University of Colorado at Denver
Abstract
Mass spectrometry (MS) is a widely used method for protein
identification. Peptide mass fingerprinting is the protein
identification technique in which MS is employed to
determine the masses of peptide fragments generated
following enzymatic digestion of proteins. The masses of
peptides are then submitted to a recognition program,
e.g., MASCOT or MSFIT, for identification of a protein.
The strategy is hampered, however, because not only are
the peptide masses determined, but also the masses of
multiple contaminants that are also present in the sample.
Although the masses of some common and known contaminants
are removed (e.g., peptides generated by trypsin
autolysis), many others are inadvertently incorporated
into the analysis. In this paper we present an approach
for automatic identification of contaminant masses so that
they can be removed prior to the identification process.
For this purpose we have developed an algorithm that
clusters mass values. We calculate the frequencies of all
masses and then identify contaminants. We propose that
masses with frequency higher than a given value are
contaminants. In our analysis of 3,029 digested proteins,
yielding 78,384 masses, we identified 16 possible
contaminants. Of these 16, four are known trypsin
autolysis peptides. Removing these contaminant masses from
the database search will lead to more accurate and
reliable protein identification.
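A minimal sketch of the frequency-based flagging described above, with invented tolerance and threshold values; the authors' clustering of mass values is reduced here to simple binning.

```python
# Sketch: bin peptide masses, count occurrences across digests, flag recurrers.
from collections import Counter

def flag_contaminants(mass_lists, tol=0.1, min_fraction=0.5):
    """mass_lists: one list of peptide masses per digested protein."""
    counts = Counter()
    for masses in mass_lists:
        # Bin each mass so values within `tol` Da cluster together.
        counts.update({round(m / tol) for m in masses})
    n = len(mass_lists)
    return sorted(b * tol for b, c in counts.items() if c / n >= min_fraction)

# Example: mass 842.51 (a known trypsin autolysis peptide) recurs everywhere.
spectra = [[842.51, 1100.2], [842.50, 990.7], [842.52, 1234.5], [842.51, 777.3]]
print(flag_contaminants(spectra))   # approximately [842.5]
```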
A Flexible Pipeline for Experimental Design, Processing, and Analysis of Microarray Data
Stephen Osborn, Scot Kennedy and Daniel Chin
PPD Discovery, Menlo Park, CA
Abstract
Microarray data is nearly unrivalled in the volume of data
generated, as well as the variety of comparisons and
analyses that can be made. It is thus imperative to have a
computational infrastructure capable of more than storing
the raw data: aiding the planning of an experiment;
facilitating communication between the groups involved in
its execution; and analyzing its results in an organized,
consistent manner. Standards like MIAME/MAGE exist for the
exchange of data, but are insufficient for efficient and
flexible processing. Software tools exist for performing a
multitude of analyses, but they often do not provide an
overall framework, leaving scientists to apply post hoc any
of a wide array of analyses to data and organize the
results. We created a database and "pipeline" attempting to
address these issues. A hierarchy with raw datasets at the
"bottom" and high-level multi-array analyses at the "top"
allows the specification of comparisons and relationships in
a generalized, flexible manner. This structure implements
'groupings' of datasets such as replicates, dye-swaps,
treatments, etc., and associates an extensible range of
analyses with each. Data is viewed naturally as features,
probes, genes, or arbitrary sets. Researchers can establish
the structure of a microarray experiment before beginning,
while modifying as needed using a simple web interface. The
system adapts to many applications and platforms: the
modular pipeline allows calculations to be performed by
"plug-ins", built-in or external software, and the
hierarchical object data structure maps naturally to XML/DOM
format standards while
maintaining a scalable efficient relational structure.
Computational Approach to Reconstructing Gene Regulatory Networks
Xutao Deng and Hesham Ali
Computer Science Department, University of Nebraska at Omaha
Abstract
With the rapid accumulation of gene expression data in
publicly accessible databases, computational study of gene
regulation has become an attainable goal. Intrinsic to
this task will be data mining tools for inferring knowledge
from biological data. In this project, we have developed a
new data mining technique in which we adapt the
connectivity of a recurrent neural network model by
indexing regulatory elements and including nonlinear
interaction terms. As a result, the new technique reduces
the number of parameters by O(n), therefore increasing the
chance of recovering the underlying regulatory network. In
order to fit the model from data, we have developed a
genetic fitting algorithm with O(n) time complexity
that adapts the connectivity during the fitting process
until a satisfactory fit is obtained. We have implemented
this fitting algorithm and applied it to three data sets:
simulated data with 5 genes, rat central nervous system
development data with 112 genes, and yeast whole genome
data with 2467 genes. With multiple runs of the fitting
algorithm, we were able to efficiently generate a
statistical pattern of the model parameters from the data.
Because of its adaptive features, this method will be
especially useful for reconstructing coarse-grained gene
regulatory networks from large-scale or whole-genome-scale
gene expression data sets.
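For orientation, a generic recurrent gene-network model of the kind being fit is sketched below; the paper's specific parameter reduction (indexed regulatory elements, nonlinear interaction terms) is not reproduced, and the toy network is invented.

```python
# Sketch of a generic recurrent gene-network model and its simulation.
import numpy as np

def simulate(W, decay, x0, steps=50, dt=0.1):
    """x(t+dt) = x + dt * (tanh(W @ x) - decay * x), elementwise."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x + dt * (np.tanh(W @ x) - decay * x)
        traj.append(x.copy())
    return np.array(traj)

# 5-gene toy network: gene 0 activates gene 1, gene 1 represses gene 2.
W = np.zeros((5, 5)); W[1, 0] = 2.0; W[2, 1] = -2.0
traj = simulate(W, decay=np.ones(5), x0=[1, 0, 0.5, 0, 0])
print(traj[-1].round(2))
```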
Probability Profiles: Novel Approach in Mass Spectrometry De Novo
Sequencing
T. Fridman1, R. Day1, J. Razumovskaya2, D. Xu2 and A. Gorin1
1Computer Science and Mathematics Division
2Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831
Abstract
Development of robust computational algorithms for
protein identification in mass spectrometry (MS) experiments
is a challenging data analysis problem with a typical set of
"biological attributes": abundance of experimental noise,
incomplete data, and possibility of wild card factors such
as posttranslational modifications. Novel and robust MS
analysis approaches are crucially important for emerging
Genomes-to-Life Program efforts to compile a comprehensive
catalog of cellular protein-protein complexes through
high-throughput mass spectrometry. The established methods
rely on sequence database lookups and will not be adequate
for the identification of cross-linked peptides from two or
more subunits of protein-protein complexes.
Here we present a novel "probability profile" approach for de novo
peptide sequencing from experimental tandem mass spectra. A large
database of previously resolved peptide
spectra was used to determine "neighborhood patterns" for
each peak category: C- or N-terminus ions, their dehydrated
fragments, etc. The established patterns are applied to
assign probabilities for new spectra peaks to fit into those
categories. Usually at least a few peaks can be
identified with fair confidence, creating strong "anchor
points" for the de novo algorithm assembling sequence subgraphs.
Our approach utilizes the full informational content of
a given MS experimental data set, including peak intensities,
weak and noisy peaks, and unusual fragments. We also discuss
ways to provide learning features in our method: adjustments
for a specific MS device and user initiated changes in the
list of considered peak categories.
A Symbolic Logic Strategy For Mapping Biological Concepts in the HERBE Prototype
Eric G. Stephan, George Chin, Kyle R. Klicker, Abigail L. Corrigan and Heidi J. Sofia
Pacific Northwest National Lab, Richland, WA
Abstract
Current database strategies are not powerful enough to support
biologists in the efficient use of complex, large scale, and rapidly
changing systems biology data. We are engaged in research to
advance database technologies so that biologists can capture,
manage, compute upon, and search biological concepts in ways
that are presently not possible.
The Heuristic Entity Relationship Building Environment (HERBE)
project is developing mechanisms to automate the translation of
concepts into computable forms. HERBE features an open visual
environment to collect diverse, complex data and construct models
graphically, thus allowing users to apply their own symbolic logic.
As data are entered, HERBE captures the data, then integrates
and heuristically interprets underlying concepts into a computable
form using a multi-tiered data management infrastructure.
We have constructed a prototype as an early proof of concept to
demonstrate the positive impact this approach can make. The
prototype features a fully functional visual concept-mapping layer,
extraction components to translate the user's symbolic logic into
computational logic, and a concept repository. The prototype also
has a derivation feature that automatically builds an overarching
data model that is connected to the concept repository so that
complex queries between concept maps and the resulting data
model are possible.
A Computational Method for Assessing Peptide-Identification Reliability
in Tandem Mass Spectrometry Analysis with SEQUEST
Jane Razumovskaya1,3, Victor Olman1, Dong Xu1,3,
Ed Uberbacher1,3, Nathan VerBerkmoes3, Ying Xu1,2,3
1Life Sciences Division and 2Computer Sciences and
Mathematics Division, Oak Ridge National Laboratory, TN 37830-6480, USA
3School of Genome Science and Technology, University of
Tennessee, Knoxville, TN 37922, USA
Abstract
High throughput protein identification in mass spectrometry
is predominantly achieved by first identifying tryptic
peptides using SEQUEST and then by combining the peptide
hits for protein identification. Peptide identification is
typically carried out by selecting SEQUEST hits above a
specified threshold, the value of which is typically chosen
empirically in an attempt to separate true identifications
from the false ones. These SEQUEST scores are not
normalized with respect to the composition, length and
other parameters of the peptides. Furthermore, there is
no rigorous reliability estimate assigned to the protein
identifications derived from these scores. Hence the
interpretation of SEQUEST hits generally requires human
involvement, making it difficult to scale up the
identification process for genome-scale applications. To
overcome these limitations, we have developed a method,
which combines a neural network and a statistical model,
for "normalizing" SEQUEST scores, and also for providing a
reliability estimate for each SEQUEST hit. This method
improves the sensitivity and specificity of peptide
identification compared to the standard filtering procedure
used in the SEQUEST package, and provides a basis for
estimating the reliability of protein identifications.
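A heavily simplified sketch of the normalization idea follows: express a raw score as a z-score against a peptide-dependent null model, then map it to a [0, 1] reliability. The actual method trains a neural network and statistical model; the length-dependent null below is purely hypothetical.

```python
# Sketch: normalize a raw SEQUEST-style score, then map to a reliability.
import math

def null_params(length):
    # Hypothetical length-dependent null model (would be fit from data).
    return 0.15 * length, 0.05 * length

def reliability(xcorr, length):
    mu, sigma = null_params(length)
    z = (xcorr - mu) / sigma                 # "normalized" score
    return 1.0 / (1.0 + math.exp(-z))        # logistic reliability

print(round(reliability(2.8, 12), 3))   # same raw score, short peptide ...
print(round(reliability(2.8, 24), 3))   # ... long peptide: lower reliability
```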
Functional Genomics
Tomato Expression Database (TED): An Interactive Management Tool
for Tomato Expression Profiling Data
Zhangjun Fei, Xuemei Tang, Rob Alba, Paxton Payton and Jim Giovannoni
Cornell University
Abstract
We are embarking on large-scale functional genomics using
cDNA microarrays to obtain expression profiles of tomato
genes during fruit development and ripening. Expression
data for approximately 12,000 ESTs over a time course of
fruit development were generated. The ESTs employed
represent sequences unique to specific tissues (e.g. fruit,
flower), as well as defined genetic, biochemical and
physiological functions (e.g. protein kinases,
transcriptional factors). In order to provide the research
community access to our normalized and replicated
microarray data as a tool to assess relative expression of
genes of interest, we developed a publicly accessible
online database Tomato Expression Database (TED:
http://ted.bti.cornell.edu). Through this database, we
provide multiple approaches to pursue analysis of specific
genes of interest and/or access the larger microarray data
set to identify sets of genes that may behave in a pattern
of interest to the user. To achieve this goal, a set of
useful data mining and data visualization tools were
developed and are under continuing expansion according to
users' requirements. Developed initially as a data mining
and analysis resource, TED also contains comprehensive
annotation of each EST including homology derived from
sequence similarity searches of GenBank and GO terms
assigned manually according to putative functions. Future
expansions will include integration of expression data from
tomato fruit ripening mutants and from normal tissues in
response to treatments that influence ripening, integration
of tomato metabolite data, and addition of interfaces for
expanded data analyses including clustering.
Oncogenetic Tree Models: An Estimation
Harpreet Kaur Gill
Christian Medical College
Abstract
The development of human cancers is believed to be driven
by a sequence of genetic changes that lead to an abnormal
behavior of cells. For many tumor types, characteristic
alterations (gene mutations, partial gains or losses of
chromosomes) are known, but in many cases, the function of
these alterations in tumor development is yet unknown.
Thus it is desirable to infer which genetic changes are
typically early or late events in tumor development, and
whether there are preferred sequential orders of their
occurrence. The modelling of tumorigenesis as a sequence
of genetic changes was pioneered by Fearon and Vogelstein.
However, the heterogeneity encountered in many tumor
types suggests that more complex models might be needed to
model the dependencies between alterations. Here we show
how the method of maximum likelihood can be applied to
estimate tree models for oncogenesis.
The GeneCards Family of Databases: GeneCards, GeneLoc, GeneNote and GeneAnnot
Marilyn Safran, Vered Chalifa-Caspi, Orit Shmueli, Naomi Rosen,
Hila Benjamin-Rodrig, Ron Ophir, Itai Yanai, Shirley Horn-Saban,
Pavel Kats, Michael Shmoish and Doron Lancet
Weizmann Institute of Science
Abstract
The popular GeneCards integrated database of human genes,
genomic maps, proteins and diseases has recently spawned
three related functional genomics efforts: GeneLoc,
GeneNote and GeneAnnot. As sequence data rapidly
accumulates, the bottleneck in biology shifts from data
production to analysis; researchers seek a profound
understanding of the role of each gene and of the way genes
function together.
GeneLoc integrates human gene collections by comparing
genomic coordinates at the exon level, eliminating
redundancies, and assigning unique and meaningful location-
based identifiers to each GeneCards gene. When upgrading to
new versions, GeneCards has grappled with keeping these
identifier associations persistent and consistent, while
taking into account changes in gene location due to new
elucidations of the human genome.
GeneCards expression tissue vectors are provided by
GeneNote, the first effort to present complex expression
analyses (including clustering) for a variety of normal
human tissues using the full complement of gene
representations (Affymetrix arrays HG-U95A-E). GeneNote
variation plots are presented for each gene, visualizing
Pearson's correlations between individual probe-set vectors
and the average tissue vector for the gene, as well as
relative scalar lengths of individual vectors. Probe-sets
are mapped to GeneCards genes via the GeneAnnot system,
which revises and improves their annotation by aligning
them to major sets of human mRNA sequences. Each GeneCards
gene is annotated with relevant probe-set associations,
including their identifiers, arrays, sensitivity and
specificity scores, and genes-to-probe-sets counts.
GeneLoc, GeneNote and GeneAnnot have their own respective
in-depth websites, in addition to providing summary data
for enriching GeneCards.
Wavelet Transforms for the Analysis of Microarray Experiments
Taku Tokuyasu, Donna Albertson, Ajay Jain and Dan Pinkel
UCSF Cancer Center, San Francisco, CA
Abstract
Array comparative genomic hybridization (CGH) is a
microarray technology for measuring the relative copy
number of thousands of genomic regions. Continuing
advancements allow entire genomes to be probed with
increasing resolution and precision. Array CGH is proving
to be a sensitive tool for measuring genomic abnormalities
in diseases such as cancer.
The measurements from an array CGH experiment have a
natural linear ordering, namely the genome ordering of the
spotted clones. Visual examination of CGH profiles shows
that genomic changes occur on a variety of length scales.
It has been hypothesized that such changes are
characteristic of phenotypic variables such as tumor type
and gene mutational status.
To aid in identifying such features and exploring their
relationship with phenotypic outcomes, we are applying
wavelet transforms to the analysis of CGH profiles. This
allows us to decompose a CGH signal into short- and long-
length-scale components, which are characteristic of
aberrations over a few clones and on the scale of an entire
chromosome. We have begun to address the question of which
features of the genomic changes are most highly correlated
with outcome.
Wavelet transforms may also be useful in the realm of gene
expression. Genes can be ordered as a result of
hierarchical clustering, permitting an expression signal to
be defined that can be analyzed by wavelets. The wavelet
transform can compress the signal from multiple genes into
a few components, possibly aiding in the development of new
tumor classifiers.
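A minimal sketch of the short/long length-scale split, using the third-party PyWavelets package as a stand-in (not the authors' code); the toy profile with a narrow gain is invented.

```python
# Sketch: wavelet decomposition of a CGH profile into coarse and fine parts.
import numpy as np
import pywt

rng = np.random.default_rng(4)
profile = (np.concatenate([np.zeros(64), 0.8 * np.ones(8), np.zeros(56)])
           + rng.normal(0, 0.05, 128))    # narrow gain on a flat background

coeffs = pywt.wavedec(profile, "haar", level=4)

# Reconstruct only the coarse (long-length-scale) part by zeroing details:
coarse = pywt.waverec([coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]], "haar")
fine = profile - coarse                   # what remains is the short-scale signal
```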
MageBuilder: A Schema Translation Tool for Generating MAGE-ML from Tabular Microarray Data
William Martin and Robert M. Horton
Attotron Biosensors Corporation, East Palo Alto, CA
Abstract
The MicroArray Gene Expression Markup Language (MAGE-ML)
is a standardized XML format for representing the results of
microarray experiments (www.mged.org). It is designed to include
sufficient structural metainformation to make this data useful for
purposes such as meta-analysis and data mining. The MAGE data
model is sufficiently complex that specialized software tools, such
as the MAGE software toolkit (MAGEstk), are invaluable for dealing
with the structured data. Laboratories conducting microarray
experiments typically store results in tabular format, either in
spreadsheets or relational databases, and conversion of this
tabular data to MAGE format may be daunting. Here we present
tools to facilitate mapping from table-based schemas to MAGE-ML.
A 'MageBuilder' object takes a set of 'MageMap' objects and a set
of data streams as input, and produces a MAGEstk object
representation, which is then serialized as MAGE-ML. A
'MageMap' object encapsulates the rules of how data records from
an input stream relate to one MAGE object. Mapping rules may be
as simple as copying a column to an attribute, while custom
subroutines may be required to calculate attribute values for more
complex mappings. Each input "stream" is an anonymous
subroutine that supplies records whose fields represent columns
in the input table. The input tables can be delimited text files,
database queries, or essentially any source that can be coerced
into a set of records with fixed fields. Combining abstract input
streams with flexible mapping of streams to objects provides an
extensively customizable approach to generating MAGE-ML from
sets of data tables.
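A hedged sketch of the map-then-build design described above; the class and method names are hypothetical and do not reflect the actual MAGEstk or MageBuilder API.

```python
# Sketch: a MageMap-like object holds per-attribute rules (column copies or
# custom subroutines); a builder applies it to each record from a stream.
class MageMap:
    def __init__(self, target_class, rules):
        self.target_class = target_class
        self.rules = rules               # attribute -> column name or callable

    def apply(self, record):
        obj = {"_class": self.target_class}
        for attr, rule in self.rules.items():
            obj[attr] = rule(record) if callable(rule) else record[rule]
        return obj

def stream(rows):
    """An input 'stream': any iterable of dict records with fixed fields."""
    yield from rows

bioassay_map = MageMap("BioAssay", {
    "name": "array_id",                                               # copy a column
    "value": lambda r: float(r["signal"]) / float(r["background"]),   # custom rule
})

rows = [{"array_id": "A1", "signal": "1200", "background": "100"}]
print([bioassay_map.apply(r) for r in stream(rows)])
```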
OptiRNAi, a Web-based Program to Select siRNA Sequences
Wenwu Cui1, Jianchang Ning2, Ulhas P Naik1
and Melinda K. Duncan1
1Department of Biological Sciences, 2Delaware
Biotechnology Institute, University of Delaware, Newark, Delaware
19716
Abstract
RNA interference (RNAi) is a recently developed reverse
genetic tool to knockdown expression of specific genes in
eukaryotes. Successful implementation of this technology
requires the design of double stranded RNAs identical to
specific regions of the target gene. However, not every
possible small interfering RNA (siRNA) is able to reduce
target gene expression. Recently, Elbashir et al. (Methods,
26:199-213, 2002) established empirical criteria for
siRNA sequence selection that significantly improved the
success rate for RNAi attempts. We have developed
OptiRNAi, a computational tool, which uses the Elbashir et
al criteria to predict appropriate sequences for siRNA
production. The output of the program provides researchers
with a number of potential siRNA targets, a score that
reflects how well the sequence fulfills the empirical
criteria, and the location of the target sequence in the
input. Specificity of these siRNAs for the target of
interest can then be assessed by the investigator using
the embedded BLAST server optimized for RNAi design. Thus,
OptiRNAi is a labor saving tool for RNAi design using
criteria that are more stringent than other available
tools.
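A minimal sketch of Elbashir-style candidate selection (AA-initiated 21-mers filtered by GC content); the exact criteria and scoring used by OptiRNAi follow Elbashir et al. (2002) and are not reproduced here, so the GC window and helper name below are illustrative.

```python
# Sketch: scan a cDNA for AA(N19) windows and filter by GC content.
def candidate_sirnas(cdna, gc_range=(0.30, 0.60)):   # hypothetical helper
    cdna = cdna.upper()
    hits = []
    for i in range(len(cdna) - 21):
        if cdna[i:i + 2] != "AA":        # prefer AA-initiated 21-mers
            continue
        target = cdna[i:i + 21]          # AA + 19-nt core
        gc = (target.count("G") + target.count("C")) / 21
        if gc_range[0] <= gc <= gc_range[1]:
            hits.append((i, target, round(gc, 2)))
    return hits

seq = "GCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTG"
for pos, target, gc in candidate_sirnas(seq):
    print(pos, target, gc)
```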
A Rule-based Framework for Gene Regulation Pathways Discovery
Bartosz Wilczynski, Torgeir Hvidsten, Andriy Kryshtafovych, Lisa Stubbs,
Jan Komorowski and Krzysztof Fidelis
Lawrence Livermore National Lab, Livermore, CA 94550
Abstract
We have developed a novel approach to discover the rules
that govern gene regulation mechanisms. The method is based
on supervised machine learning and is designed to reveal
relationships between transcription factors and gene
promoters. As the representation of the gene regulatory
circuit we have chosen a special form of IF-THEN rules
associating certain features (TFBS) in gene promoters with
specific gene expression profiles.
A given set of genes with known expression levels under
certain conditions is used as an input to our procedure. We
start from gene expression clustering and then iteratively
perform the following steps:
- searching for the significant features (TFBS) in the gene
promoter groups
- inducing rules from clustering and a feature set
- evaluating the rules
- refining the set of features
until no further progress can be achieved.
The core part of the algorithm is the induction of the
rules from expression clustering and feature set. To reach
this goal we create a binary matrix with rows representing
genes associated with an expression profile class and columns
representing features and then apply the Rosetta software
to the created matrix. The results of applying our
algorithm to the yeast gene expression data are shown.
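A small sketch of the decision table handed to rule induction: rows are genes labeled with their expression cluster, columns indicate TFBS presence in the promoter. (The authors apply the Rosetta software to such a matrix; the toy genes and features below are invented, and only the table layout is the point.)

```python
# Sketch: build the binary genes-x-TFBS decision matrix for rule induction.
promoter_features = {
    "GENE1": {"MCB", "SCB"},
    "GENE2": {"MCB"},
    "GENE3": {"STRE"},
}
cluster_of = {"GENE1": "G1-peak", "GENE2": "G1-peak", "GENE3": "stress"}
features = sorted({f for fs in promoter_features.values() for f in fs})

rows = [[int(f in promoter_features[g]) for f in features] + [cluster_of[g]]
        for g in sorted(promoter_features)]
print(features + ["class"])
for row in rows:
    print(row)
# A rule such as IF MCB=1 THEN class=G1-peak can be read off this table.
```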
Gene Function and Metabolic Pathways in the Yeast
Qing Dong, Rama Balakrishnan, Gail Binkley, Karen R. Christie, Maria Costanzo,
Kara Dolinski, Selina S. Dwight, Stacia Engel, Dianna G. Fisk, Jodi Hirschman,
Eurie L. Hong, Laurie Issel-Tarver, Rob Nash, Anand Sethuraman, Chandra L.
Theesfeld, Shuai Weng, David Botstein and J. Michael Cherry
Saccharomyces Genome Database, Department of Genetics, Stanford University, Stanford, CA
Abstract
The budding yeast, Saccharomyces cerevisiae, has been
experimentally manipulated for several decades. A wealth
of information is available from the Saccharomyces Genome
Database (SGD, http://www.yeastgenome.org/). This poster
will highlight two datasets that are maintained by the
SGD. A large dataset of hand curated information is
provided in machine readable format for each gene of the
Saccharomyces genome. These annotations use the Gene
Ontology (GO) controlled vocabularies for biological
process, molecular function and cellular component. Each
hand curated annotation includes an evidence code to a
literature reference. Tools are provided at SGD for the
analysis of sets of genes. The GO Term Finder helps
identify GO terms that a group of genes has in common. The
GO Term Mapper allows you to map the specific, granular GO
terms used to annotate a list of genes to their more
general parent GO terms. The second area of focus is
metabolic pathways. A new dataset of hand curated
information on metabolic pathways within budding yeast was
released in May 2003 and can be downloaded from our ftp
site. This resource can be searched to view biochemical
reactions and pathways as they occur in S. cerevisiae.
This resource also maps data from genome-wide expression
analyses onto the pathway overview so that changes in gene
expression can be viewed in the context of cellular
metabolism. These pathways are created and edited using
the Pathway Tools software that is developed and
maintained by Peter Karp and his colleagues at SRI
International. SGD is funded by the US National Human
Genome Research Institute.
Using Rule Induction to Analyze Gene Expression Data
Gary Livingston and Xiao Li
University of Massachusetts-Lowell, Lowell, MA
Abstract
We have applied rule induction to a publicly available
adenocarcinoma gene expression data set. The typical
approach to the analysis of gene expression data is to
cluster the genes, which requires a lot of laborious
analysis to interpret the clusters and examine their
significance. With rules, the interpretation is more
obvious (e.g., (CDKN3 > 253) -> (tumor-stage = 3)); in
addition, the rules establish ranges for values, also
aiding interpretation. The data set we used contains 7,129
genes for 86 lung tumor and 10 normal lung samples as well
as patient data, such as tumor stage (1 or 3), survival
time (months), etc. We used standard approaches to reduce
the number of genes to 485. Our rule induction tool was a
new semi-autonomous discovery system we are developing
called HAMB, and we used it to learn rules for survival
status, survival time, and tumor stage. When we searched
the world-wide web for publications relating our top 20
genes from our discovered rules to lung cancer, we found 5
of them are known to be associated with lung cancer (GLRX,
PCNA, CDH17, KAL1, and SCTR), 5 of them are known to be
associated with other types of cancer (CDKN3, KRT6A, FABP4,
ADH1, and SCTR), and the remaining 10 were not found to be
associated with cancer: ACP5, CP, FMO2, ABLIM, CLECSF2,
REG1A, EMP2, AOC3, AGTR1, ALDH2, CPA3. Our results suggest
that the latter two groups of genes be examined more
closely for their association with lung cancer.
Genotyping and SNPs
SNP Analyzing System for Detecting Multifactorial Disease Associated Site
Yoko Higashi, Hirotaka Higuchi, Takashi Kido, Hirohito Matsumine,
Toshihiko Morimoto and Masaaki Muramatsu
NTT Data Corporation, Kayaba-cho Tower, 1-21-2, Shinkawa, Chuo-ku, Tokyo, Japan
Abstract
We have built a system that can efficiently detect genes
that cause multifactorial diseases, such as diabetes and
high blood pressure. Its main function is to detect
polymorphisms in which there are significant differences
between the genotypes of a Case group and a Control group
by means of statistical tests.
Since many factors are thought to contribute together to a
multifactorial disease, single-point analyses are not
sufficient, and multiple testing across single-point analyses
poses a significant problem. To improve the capability to
detect such factors, the new
system has a function for extracting the functional and
heredity units (referred to as ldblocks) of a genome and
for performing association analysis on each unit.
Moreover, the existing software for linkage-disequilibrium
analysis requires rearranged genotype data in order of a
row of polymorphisms. In contrast, the developed system
automatically rearranges this data according to the
positional information held internally. In addition, the
system automatically detects the ldblock. Owing to these
new functions, the number of polymorphisms it can analyze
is about 30 to 50 times higher than by standard manual
procedures.
The results of the analyses are stored in a database, and
can be browsed as an association-analyses result table.
Furthermore, the system produces a polymorphisms-and-
ldblock map with the positions of genes and polymorphisms
recorded in a public database. This makes investigation of
disease-associated sites more efficient.
During a poster session, we will present our results on
operability and performance of the system.
Haplotype Pattern Mining and Classification for Detecting Disease Associated SNPs
Takashi Kido
Joware Hanzonmon, 2F 2-19 Hayabusa-cho, Chiyoda-ku, Tokyo, Japan
Abstract
Finding the causative genes for common diseases using SNP markers is now
becoming a truly challenging issue. Although conventional single-locus SNP
association tests exist, they can neither explain the effects of abundant
SNP combinations nor the evolution of disease chromosomes. Haplotype analysis of
disease chromosomes can not only provide more powerful information than
single-locus SNP analysis, but can also
help identify probable causative mutations.
In this paper, we introduce a new method for effective haplotype pattern
mining to detect disease-associated mutations. Our strategy is summarized as
follows:
- We detect similar patterns that frequently appear in disease chromosomes
from all possible haplotype combinations (haplotype mining).
- We also compare all possible pairs of haplotypes that differ at only one
SNP (mutation); a sketch of this test follows the abstract. If we detect a significant difference in association tests on these two
haplotypes, we can assume that this mutation may be a causative SNP for the disease.
- We classify haplotypes based on phylogenetic similarities and represent
their relationships by networks. We then map the phenotype association to each node of the
network (haplotypes).
- By comparing the results of these three strategies repeatedly, we can
localize the causative mutations.
Our method is inspired by data mining methods and fully utilizes haplotype
information. Using this method, we have discovered several new disease-associated
SNPs that could not be detected by conventional methods. In this poster, we
introduce the details of our methods and tools with some worked examples.
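A minimal sketch of the pairwise comparison in step two above, using a chi-square test on invented case/control counts for two haplotypes that differ at a single SNP:

```python
# Sketch: contrast two one-SNP-apart haplotypes in cases vs. controls.
from scipy.stats import chi2_contingency

# Rows: haplotype A-T-G vs. A-C-G (differ only at the middle SNP).
# Columns: counts in case chromosomes, counts in control chromosomes.
table = [[120, 60],    # A-T-G: enriched in cases
         [40, 95]]     # A-C-G
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2g}")
# A small p-value suggests the T/C mutation itself may be disease-associated.
```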
Strings, Graphs, and Algorithms
Genetic Algorithm Approach to the Center Sequence Problem Arising in
Post-transcriptional Gene Silencing
Holger Mauch
University of Hawaii at Manoa
Abstract
A fundamental aspect of post-transcriptional gene silencing
(PTGS) is the requirement of homology between the transgene
sequence and the virus or mRNA sequence being silenced.
For example, virus-resistant plants are resistant only to
viruses that are very closely related to the virus from
which the transgene was derived. One idea is to devise an
artificial sequence that is at least 90-95% homologous to
all virus genes. This requires an algorithm which can
determine an artificial sequence with an optimal homology
(or at least a 90-95% homology) to all of the virus
sequences. The genetic algorithm presented in this poster
serves this purpose. It should be of great value to all
researchers who utilize PTGS, gene quelling, or RNAi.
Mathematically, the task is to find the radius of a code $S
\subset \{A,C,G,T\}^{n}$. Experimental results suggest
that this NP-complete optimization problem can be
approached well with a customized genetic algorithm (GA).
Alternative approaches to the problem are discussed and put
in contrast to the GA approach. Heuristics produce results
that are usually far from a global optimum and are
therefore often useless in practice. Exhaustive
search and branch and bound search have exponential time
requirements and are not suitable for real world problem
sizes. The instance studied in this poster consists of
m=116 virus sequences each having a sequence length of
n=626. The GA quickly found a solution with 94.57%
homology, which comes very close to the theoretical upper
bound of 94.73% for this problem instance.
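A minimal GA sketch for this objective, maximizing the minimum homology between a candidate center sequence and all input sequences; the population size, mutation rate, and toy input are illustrative only, not the poster's tuned settings.

```python
# Sketch: genetic algorithm for the center-sequence (radius) objective.
import random

BASES = "ACGT"

def min_homology(center, seqs):
    """Worst-case fraction of matching positions against any input sequence."""
    return min(sum(a == b for a, b in zip(center, s)) / len(s) for s in seqs)

def evolve(seqs, pop_size=50, generations=200, mut_rate=0.02):
    n = len(seqs[0])
    pop = ["".join(random.choice(BASES) for _ in range(n)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: min_homology(c, seqs), reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(n)                    # one-point crossover
            child = [random.choice(BASES) if random.random() < mut_rate else ch
                     for ch in a[:cut] + b[cut:]]        # point mutations
            children.append("".join(child))
        pop = survivors + children
    return max(pop, key=lambda c: min_homology(c, seqs))

seqs = ["ACGTACGTAC", "ACGTACGTTC", "ACGAACGTAC"]
best = evolve(seqs)
print(best, round(min_homology(best, seqs), 3))
```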
GenericBioMatch: A Novel Generic Pattern Match Algorithm for Biological Sequences
Youlian Pan and A. Fazel Famili
Integrated Reasoning Group, Institute for Information Technology,
National Research Council Canada, 1200 Montreal Road, M-50, Ottawa, ON K1A 0R6, Canada
Abstract
The grammar of biological sequence language varies markedly
from any of the natural languages. The schema of DNA and
protein sequences is less strict than that of natural
languages, such that it allows deletion and insertion of
letters in a sequence pattern. It also allows alternation
of letters within a "word." Many traditional text-searching
algorithms are unsuitable for biological sequences. This
has encouraged many researchers to apply probabilistic
algorithms, such as Hidden Markov Models and Gibbs sampling,
to biological sequences. However, there are very few
algorithms that are suitable for quick exact pattern matching
in biological sequences. This poster presents a novel
algorithm (called GenericBioMatch) for exact match of
biological sequences with provided sequence motifs. This
algorithm allows wildcard letters (e.g., Y, R, W in DNA
sequence) and one or more gaps of any number of base pairs
within the provided sequence motif pattern. GenericBioMatch
is a relatively fast algorithm [O(mn), m=motif_length,
n=sequence_length], as compared to Hidden Markov Model and
other probabilistic algorithms, and has very little
computational overhead. It is able to perform exact match
of protein motifs as well as DNA motifs. This algorithm can
serve as a quick validation tool for implementation of
other algorithms, and can also serve as a supporting tool
for probabilistic algorithms. This algorithm has been
implemented in the BioMiner
(http://iit-iti.nrc-cnrc.gc.ca/biomine_e.trx), a suite of tools for integrated
data mining, and tested successfully with DNA sequences
from human, yeast, and Arabidopsis.
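To illustrate the behavior described above (IUPAC ambiguity codes plus bounded gaps), here is a regex-based sketch; it reproduces the matching semantics only, not the GenericBioMatch algorithm or its O(mn) complexity, and the gap syntax x(min,max) is an invented notation.

```python
# Sketch: compile an IUPAC-style motif with bounded gaps to a regex.
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "W": "[AT]", "S": "[CG]", "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}

def motif_to_regex(motif):
    """'TATAWA x(2,5) GGYCAA' -> TATA[AT]A, a 2-5 bp gap, then GG[CT]CAA."""
    parts = []
    for token in motif.split():
        m = re.fullmatch(r"x\((\d+),(\d+)\)", token)
        if m:
            parts.append("[ACGT]{%s,%s}" % m.groups())     # bounded gap
        else:
            parts.append("".join(IUPAC[ch] for ch in token))
    return re.compile("".join(parts))

pattern = motif_to_regex("TATAWA x(2,5) GGYCAA")
print([m.start() for m in pattern.finditer("CCTATAAAGCGGGTCAACC")])   # [2]
```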
Aligning ESTs to Genome Using Multi-Layer Unique Markers
F. R. Hsu and J. F. Chen
Department of information technology, Taichung Healthcare and Management University, Taiwan
Abstract
As more and more genomic sequences are sequenced and the
number of ESTs (Expressed Sequence Tags) grows rapidly,
we need very fast and efficient algorithms to align ESTs
to a genome. ESTs reflect the exon/intron structure of genes,
so this alignment is different from general alignment between two
sequences.
Let G and E denote the given genome and the given EST
sequence respectively. In order to reduce time complexity,
it is quite important to locate the position of E in G
efficiently. A unique marker is a sequence appearing in G
only once. We can locate the position of E in G by finding
unique markers contained in E. Let Uk denote the set of
unique markers of G with length k. The larger of k, the
probability of E contains any element in Uk is higher.
Yet, the larger of k, the number of elements in Uk is
more. The time complexity to align ESTs to genome grows
rapidly when we use larger unique markers. In this paper,
we propose a multi-layer unique marker method to align
ESTs to genome. Our method combines the benefit of both
smaller and larger unique markers. Our algorithm can also
use to locate the position of single nucleotide
polymorphism (SNP). By using four personal computers, our
software has successfully located all SNPs in dbSNP in one
day. Using only personal computers, our software also
successfully aligns the entire EST set to the human genome.
Experimental result shows that our method is much faster
than common tools such as SIM4 and BLAT.
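One plausible reading of the marker scheme, sketched in
Python below: collect the k-mers that occur exactly once in
the genome, then anchor an EST by any unique marker it
contains, trying several marker lengths (the multi-layer
idea). The lengths, the fallback order, and the toy data are
illustrative assumptions, not the authors' implementation.

    def unique_markers(genome, k):
        # U_k: k-mers occurring exactly once in the genome, with position.
        occ = {}
        for i in range(len(genome) - k + 1):
            occ.setdefault(genome[i:i + k], []).append(i)
        return {kmer: pos[0] for kmer, pos in occ.items() if len(pos) == 1}

    def anchor_est(est, genome, ks=(12, 16, 20)):
        # Combine several marker lengths: short markers give a small
        # index, longer ones are more likely to occur inside the EST.
        for k in ks:
            uk = unique_markers(genome, k)
            for i in range(len(est) - k + 1):
                pos = uk.get(est[i:i + k])
                if pos is not None:
                    return pos - i   # approximate genomic start of the EST
        return None

    genome = "ATGCGTACGTTAGCCTAGGATCCA"
    est = "CGTTAGCCTAGG"
    print(anchor_est(est, genome, ks=(6, 8)))   # -> 7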
A Parallel Evolutionary Method for Physical Mapping of Chromosomes
Suchendra M. Bhandarkar, Jinling Huang and Jonathan Arnold
Department of Computer Science, The University of Georgia
Abstract
Reconstructing a physical map of a chromosome from a genomic
library presents a central computational problem in
genetics. Physical map reconstruction in the presence of
errors is a problem of high computational complexity. A
parallel evolutionary method for a maximum likelihood
estimation-based approach to physical map reconstruction is
presented. The estimation procedure entails a gradient
descent search for determining the optimal spacings between
probes for a given probe ordering. The optimal probe
ordering is determined using an evolutionary algorithm. A
two-tier parallelization strategy is proposed wherein the
gradient descent search is parallelized at the lower level
and the evolutionary algorithm is simultaneously
parallelized at the higher level. Implementation and
experimental results on a network of shared-memory symmetric
multiprocessors (SMPs) are presented. The evolutionary
algorithm is seen to result in physical maps with fewer
contig breaks when compared to Monte Carlo algorithms such
as simulated annealing and the large-step Markov chain
algorithm.
On Finding Common Intervals among k Protein Sequences Containing Redundant Elements
Xiaolu Huang and Hesham Ali
University of Nebraska at Omaha
Abstract
Predicting protein structure from the amino acid sequence
has been one of the most challenging problems in
computational biology. Many statistics-based
machine-learning methods are available to biologists these
days; however, no single method has proven to solve the
problem in all its instances. In this project, we address
protein structure prediction using a common interval search
algorithm. The proposed algorithm, the
Redundant-element-sequence Common Interval Searching (RCIS)
algorithm, finds all the common intervals within one amino
acid sequence or among k amino acid sequences. Given a
sequence of elements, each of which belongs to a finite
element set (in the case of amino acid sequences, the set of
20 amino acids), intervals within this sequence that consist
of the same set and number of elements are called common
intervals. Given a sequence of size n, RCIS has space
complexity O(n^2) and time complexity O(n^3). This common
interval searching algorithm provides a new way of
approaching protein secondary structure prediction.
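The definition can be made concrete with a small Python
sketch: fingerprint every interval by its element
composition and group intervals with identical fingerprints.
This naive enumeration is for illustration only and does not
achieve the stated RCIS bounds.

    from collections import defaultdict

    def common_intervals(seq, min_len=2):
        # Group intervals by composition: intervals with the same set
        # and number of elements (the same multiset) are common intervals.
        groups = defaultdict(list)
        n = len(seq)
        for i in range(n):
            comp = defaultdict(int)
            for j in range(i, n):
                comp[seq[j]] += 1
                if j - i + 1 >= min_len:
                    key = tuple(sorted(comp.items()))  # composition fingerprint
                    groups[key].append((i, j))
        return {k: v for k, v in groups.items() if len(v) > 1}

    # Intervals (0,3) and (4,7) share the multiset {G,A,V,L}:
    print(common_intervals("GAVLGAVL"))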
Algorithms for Bounded-Error Correlation of High Dimensional Data in Microarray Experiments
Mehmet Koyuturk, Ananth Grama and Wojciech Szpankowski
Purdue University
Abstract
The problem of clustering continuous-valued data has been
well studied in the literature. Its application to microarray
analysis relies on such algorithms as k-means,
dimensionality reduction techniques (including SVD,
principal components, KL transforms, etc.), and graph-based
approaches for building dendrograms of sample data. In
contrast, similar problems for discrete-attributed data are
relatively unexplored. An instance of analysis of
discrete-attributed data arises in detecting co-regulated
samples in microarrays. Here, our interest is in detecting
samples that are up- and down-regulated together. Efficient
solutions to this problem enable more refined correlation
between presence of sequence motifs and underlying
regulation patterns in microarray data.
One of the key challenges associated with clustering
discrete-attributed data is that these problems typically
are NP-hard and few effective heuristics are known for
solving them. In this paper, we present an algorithm and
a software framework, Proximus, for error-bounded clustering
of high-dimensional discrete-attributed datasets. In our
past work, we have demonstrated the excellent performance of
Proximus in terms of runtime, scalability to large datasets,
and ability to represent data in a compact form. In this
work, we demonstrate its application
to microarray data in extracting co-regulated samples. We
show that Proximus delivers outstanding performance in
extracting accurate patterns of gene-expression.
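Proximus is built on recursive rank-one approximations of
0/1 matrices (as described in the authors' prior work). The
Python sketch below shows one alternating rank-one step on a
toy up/down-regulation matrix; the update rule and the data
are illustrative assumptions, not the released software.

    import numpy as np

    def rank_one_binary(A, iters=20):
        # Approximate A ~ outer(x, y) with binary x, y: fixing y, set
        # x_i = 1 iff row i agrees with y on a majority of y's support
        # (this minimizes the Hamming error of the rank-one term).
        x = np.zeros(A.shape[0], dtype=int)
        y = A[0].copy()                  # seed the pattern with row 0
        for _ in range(iters):
            if y.sum():
                x = (A @ y > y.sum() / 2).astype(int)
            if x.sum():
                y = (A.T @ x > x.sum() / 2).astype(int)
        return x, y

    A = np.array([[1, 1, 0, 0],      # rows: samples; columns: genes
                  [1, 1, 0, 0],
                  [0, 0, 1, 1]])
    x, y = rank_one_binary(A)
    print(x, y)   # one regulation pattern y and the samples x carrying it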
Exact and Heuristic Algorithms for the DNA Fragment Assembly Problem
Sami Khuri and Yimin Jiang
San Jose State University
Abstract
As more and more public and private genomic assemblies
become available, the need to compare the different
computational methods used in DNA fragment assembly becomes
apparent. In this work, we design, implement, and test four
heuristic algorithms for assembling DNA fragments. We also
implemented and tested an exact algorithm for the Fragment
Assembly problem. As expected, the exact algorithm gives the
best results, but at a high price: it is very slow compared
to our heuristics.
The first heuristic implemented in this work, gaFRAG, is a
hybrid genetic algorithm that first uses a semiglobal
dynamic programming approach to find pairwise overlaps, then
applies genetic algorithm operators to order the fragments,
and finally builds the layout progressively. The second
heuristic, mwcFRAG, forms maximum-weight contigs and then
uses a greedy technique to find Hamiltonian paths, followed
by a progressive technique to order the fragments and build
the layout. The third algorithm, mFRAG, which considers each
fragment a contig, merges the contigs based on pairwise
overlaps computed with the semiglobal dynamic programming
algorithm. The last heuristic, bbFRAG, is derived from the
exact algorithm: it builds a tree with a single path to find
a heuristic solution instead of building the full tree of
the branch-and-bound technique.
In order to accurately compare the performance of the
different assembly heuristics, we used a collection of
benchmark datasets. The experimental results show that our
mFRAG and bbFRAG algorithms are as efficient and accurate as
PHRAP and CAP3.
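All four heuristics rest on pairwise overlaps computed by
semiglobal dynamic programming. The Python sketch below
scores the best suffix-prefix overlap between two fragments;
the unit match/mismatch/gap scores are placeholder
assumptions.

    def overlap(a, b, match=1, mismatch=-1, gap=-1):
        # Semiglobal DP: leading gaps in a are free (column 0 = 0) and
        # trailing gaps in b are free (take the best cell in the last
        # row), so the score reflects aligning a suffix of a with a
        # prefix of b.
        m, n = len(a), len(b)
        prev = [j * gap for j in range(n + 1)]
        for i in range(1, m + 1):
            cur = [0] * (n + 1)
            for j in range(1, n + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            prev = cur
        best_j = max(range(n + 1), key=lambda j: prev[j])
        return prev[best_j], best_j      # overlap score, b-prefix length

    print(overlap("ACGTT", "GTTAC"))     # suffix GTT overlaps prefix GTT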
Finding Higher Order Motifs under the Levenshtein Measure
Ezekiel F. Adebiyi and Tinuke Dipe
Mathematics and Computer Science Department, University of
Ilorin, Ilorin, Nigeria
Abstract
Pattern discovery in unaligned DNA sequences is a
fundamental problem in computational biology with important
applications in finding regulatory signals. Existing
approaches to finding patterns focus on monad patterns,
otherwise known as common motifs (Sagot, 1998) or simply
motifs (Pevzner and Sze, 2000), which correspond to
relatively short contiguous strings (Adebiyi, 2002). A new
problem, that of finding composite motifs, otherwise known
as structured motifs (Vanet, Marsan, Labigne and Sagot,
2000) or higher order motifs (Sinha, 2002), arises when the
number of monad motifs, $p$, that participate in a
biological process is greater than one. The relative
position of each monad motif is then important: it is not
random but sterically defined. In eukaryotes, these monad
motifs may interact with one another. Recently, increasing
efforts have been made to tackle this problem, but with
respect to the Hamming distance (Sinha, 2002; Eskin, 2001).
It is known (Adebiyi, 2002; Pevzner and Sze, 2000; Sand,
2002) that sequences of organisms exist where insertions and
deletions are important for finding regulatory signals. It
is this challenge that we consider in this project. In the
problem set-up, we are given $N$ sequences, each of average
length $n$, over a finite alphabet $\Sigma$, and thresholds
$D$ and $q$; we are to find composite motifs that contain
motifs of length $P$ (each occurring with at most $D$
differences) in $1\leq q \leq N$ distinct sequences.
Two interesting but involved algorithms for finding higher
order motifs under the edit distance were presented by
Marsan and Sagot (RECOMB 2000). Their second algorithm is
much more complicated and its complexity is asymptotically
no better. The running time of their first algorithm under
the edit distance is
$O(M\cdot N^2n^{1+\alpha\cdot p\cdot pow(\epsilon)})$,
where $p\geq 2$, $\alpha >0$, $pow(\epsilon)$ is a concave
function that is less than 1, $\epsilon = D/P$ and $M$ is
the expected number of all monad motifs.
$M$ is omitted in the complexity analysis given in Marsan
and Sagot (RECOMB 2000), but $M$ is not constant. Under the
Hamming distance, Buhler and Tompa (2001) estimated $M=
{|\Sigma|}^P(1-(1-P_D)^{n-P+1})^q$, where $P_D$ is the
probability that a given motif of length $P$ occurs with up
to $D$ substitutions at a given position of a random
sequence. We showed (Adebiyi, 2002) that under the edit
distance, $P_D<2N^{\alpha\cdot (pow(\epsilon)-1)}$.
We present an alternative algorithmic approach, also for the
edit distance, based on the concept described in Adebiyi,
Jiang and Kaufmann (ISMB 2001) and Adebiyi and Kaufmann
(WABI 2002). The resulting algorithm is simpler
and runs in $O(N^2n^{1+ p\cdot pow(\epsilon)})$ expected
time, where $p\geq 1$, $\epsilon = D/P$ and $pow(\epsilon)$
is a concave function less than zero.
Mapping Discontinuous Antibody Epitopes to Reveal Protein
Structure and Changes in Structure Related to Function
Mumey, B.1, Angel, T.2, Kirkpatrick, B.1,
Bailey, B.4, Hargrave, P.5, Jesaitis, A.3
and Dratz, E.2
Departments of 1Computer Sciences, 2Chemistry & Biochemistry, and
3Microbiology, Montana State Univ., Bozeman, MT
4NIH-NIAAA
5Univ. of Florida, Gainesville, FL.
Abstract
Antibody imprinting is a new method for protein structure
determination that overcomes limitations in traditional
methods, such as X-ray crystallography or nuclear magnetic
resonance, and requires about a thousand-fold less protein.
The method uses antibodies against target
proteins and random peptide probe libraries to map the
epitopes of antibody binding sites on target proteins.
The first step in the process is to experimentally
determine a subset of the probes that have a high affinity
to the antibody. Once this set of probes is found, we apply
an algorithm we developed previously [Mumey et al., RECOMB
2002] to find the best alignments of each probe sequence to
the target protein sequence. In this
poster, we report on new work to synthesize the set of
probe-target alignments into a single map of the epitope
surface of the protein. We have recently developed a new
algorithm that creates a "surface neighbor graph" on the
epitope residues and finds a putative planar embedding of
this graph which is most consistent with the alignments
found.
We have used hen egg lysozyme as a validation case and
present surface neighbor graphs that correctly identify
regions of the protein that are exposed to solvent and
distant regions in the sequence that are in close
proximity in the folded protein. We have also identified
monoclonal antibodies (mAbs) that stabilize light-excited
conformations of rhodopsin and antibody imprinting is
providing information on the conformations of the light-
excited protein. We have characterized the epitopes of 25
mAbs that recognize at least 5 distinct regions of human
neutrophil flavocytochrome b, an integral membrane
electron transferase that produces host defensive
superoxide. We are using antibody imprint analysis to help
predict its structure in advance of attempts using X-ray
crystallography. We are now testing the ability of
antibody imprinting to screen out incorrect ab initio
protein structure predictions with a goal of arriving at
much more accurate protein structure predictions.
The SCP and Compressed Domain Analysis of Biological Sequences
Don Adjeroh and Jianan Feng
West Virginia University
Abstract
We present new methods to rapidly analyze biological
sequences by examining their compressed representations. We
focus on BWT-based compression, introduce the SCP (the
sorted common prefix), and show how the SCP can be used for
improved algorithms for anchor-based sequence alignment and
for locating tandem repeats. Using the relationship between
the Burrows-Wheeler Transform (BWT) and the suffix tree, in
addition to our new auxiliary vectors, we can obtain the
sorted suffixes directly from the compressed sequence. Based
on the properties of the SCP, we develop algorithms to
compute the SCP in O(u + |A|K) comparisons and O(K) extra
space, where u is the sequence length, K is the maximum SCP
value, and |A| is the alphabet size. The SCP is a natural
data structure for the analysis of regularities and
repetitions in a sequence. The areas that contain tandem
arrays form a specific geometric structure in the SCP. For
each primitive repeat, we can construct an isosceles
triangle such that points along the sides correspond to SCP
values in the area. We call this structure the p-periodic
(triangular) neighborhood in the SCP. By analyzing this
triangular structure, we can provide specific information
about the repeat. Based on the above, we can find all the n
canonical tandem arrays in a given sequence T in
O(u + n + |A|K) time on average. For a fixed |A|, this
implies O(u + n) time. This can be compared with previous
non-compressed-domain algorithms for exact tandem repeats,
which run in O(u log u).
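For orientation, the SCP over the sorted suffixes is
analogous to an LCP array: the value at rank r is the length
of the common prefix shared by adjacent suffixes in sorted
order. The naive Python sketch below computes such values
directly from the suffixes; the poster's contribution is
obtaining them from the BWT-compressed sequence, which this
sketch does not attempt.

    def sorted_common_prefixes(t):
        # Sort suffixes, then record the common-prefix length between
        # each suffix and its predecessor in sorted order.
        suffixes = sorted(range(len(t)), key=lambda i: t[i:])
        scp = [0] * len(t)
        for r in range(1, len(t)):
            a, b = t[suffixes[r - 1]:], t[suffixes[r]:]
            k = 0
            while k < min(len(a), len(b)) and a[k] == b[k]:
                k += 1
            scp[r] = k
        return suffixes, scp

    sa, scp = sorted_common_prefixes("abababab")
    print(scp)   # runs of high values betray the period-2 tandem array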
Identifying Regulatory Signals in DNA-sequences with a Non-statistical Approximation Approach
Cun-Quan Zhang1, Yunkai Liu2, Elaine M. Eschen2 and Keqiang Wu3
1Dept. of Mathematics, 2Lane Dept. of Computer
Science and Electrical Engineering, 3Dept. of Biology, West Virginia University
Abstract
With the development of gene-profiling technology, the
expression patterns of thousands of genes under a variety of
conditions are now available. This gives us the opportunity
to analyze the parts of the genome believed to be
responsible for transcription control: the transcription
factor DNA-binding motifs. We give a mathematical model for
the motif finding problem and a non-statistical
approximation algorithm to find motifs in a set of genome
sequences.
Let S be a set of length-n strings and s be a similarity
function between pairs of strings. Let K be the complete
graph with vertex set S and weight s(x,y) for each edge xy.
Given constants c and m, the goal of motif finding is to
find a subgraph H of K of size m such that s(x,y) is at
least c for every pair of vertices of H. This is equivalent
to finding a clique of size at least m in the graph K'
obtained from K by deleting all edges with weight less than
c, which is NP-complete. Thus, we introduce the notion of an
R-group to develop an approximation approach.
For a subgraph H of K we construct a vertex R(H) that
represents the distribution of symbols in all strings of H,
and we extend the similarity function s to R(H). An R-group
of K is a subgraph G of K such that s(R(G),x) is at least c
for every x of G. R-groups approximate the desired cliques
of K'. Our algorithm finds an R-group in O(n|S|^3) time.
This yields a practical approximation to the clique (motif)
problem.
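A minimal Python sketch of the R-group idea follows:
maintain a per-position symbol distribution R(G) for the
current group and admit a string only while every member
keeps similarity at least c to the profile. The similarity
function (expected per-position agreement) and the threshold
are illustrative assumptions, not the exact similarity
function of the paper.

    def profile(strings, alphabet="ACGT"):
        # R(H): per-position distribution of symbols over the strings.
        n = len(strings[0])
        p = [{a: 0.0 for a in alphabet} for _ in range(n)]
        for s in strings:
            for i, ch in enumerate(s):
                p[i][ch] += 1.0 / len(strings)
        return p

    def sim_to_profile(p, s):
        # Extension of s to the representative vertex: expected
        # fraction of positions on which s agrees with R(H).
        return sum(p[i][ch] for i, ch in enumerate(s)) / len(s)

    def grow_r_group(strings, seed, c):
        group = [seed]
        for s in strings:
            if s == seed:
                continue
            cand = group + [s]
            p = profile(cand)
            if all(sim_to_profile(p, x) >= c for x in cand):
                group = cand
        return group

    S = ["ACGT", "ACGA", "ACTT", "TTTT"]
    print(grow_r_group(S, "ACGT", c=0.7))   # TTTT is rejected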
A Genetic Algorithm for Simplifying the Amino Acid Alphabet
Matthew Palensky and Hesham Ali
University of Nebraska at Omaha
Abstract
Simplified amino acid alphabets have been successful in
several areas of bioinformatics, including predicting
protein structure, predicting protein function, and protein
classification. Simplification can be done by clustering the
symbols and creating a symbol representing each cluster.
Since there are twenty symbols in the amino acid alphabet,
the number of possible simplifications is large. It is not
practical to search through all possible simplifications to
find one suitable for a specific application. A previous
study conducted by the authors identified and evaluated four
types of computational simplification algorithms. That
study's initial findings indicate that algorithms relying
heavily on randomness and heuristics tend to produce poor
simplifications. Genetic algorithms have generally been
successful in producing quality solutions to problems with a
large solution space, though their reliance on randomness
makes it difficult to create quality simplifications. This
study's goal is to overcome these difficulties and create a
genetic simplification algorithm. The presented results
include the genetic simplification algorithm, as well as the
difficulties of creating such an algorithm. The described
algorithm has led to the development of a computer program
that uses a genetic algorithm to produce simplified
alphabets, and these outputs will be listed and analyzed.
The evaluation metrics will include pairwise alignment
scoring, physical characteristics of the groupings,
algorithm running time, and suitability for particular
domains.
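The search space can be encoded directly: a candidate
simplification is a vector assigning each of the twenty
amino acids a cluster label, and standard GA operators act
on that vector. The Python sketch below uses a deliberately
artificial fitness (prefer balanced clusters) as a stand-in
for the study's metrics such as pairwise alignment scoring.

    import random

    AMINO = "ACDEFGHIKLMNPQRSTVWY"

    def random_partition(k):
        # Genome: a cluster label for each of the 20 amino acids.
        return [random.randrange(k) for _ in AMINO]

    def crossover(a, b):
        cut = random.randrange(1, len(AMINO))
        return a[:cut] + b[cut:]

    def mutate(g, k, rate=0.05):
        return [random.randrange(k) if random.random() < rate else x
                for x in g]

    def ga_simplify(fitness, k=5, pop_size=40, generations=100):
        pop = [random_partition(k) for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            elite = pop[: pop_size // 2]
            pop = elite + [mutate(crossover(random.choice(elite),
                                            random.choice(elite)), k)
                           for _ in elite]
        return max(pop, key=fitness)

    # Placeholder fitness: penalize the largest cluster (illustration).
    best = ga_simplify(lambda g: -max(g.count(c) for c in set(g)))
    print({a: c for a, c in zip(AMINO, best)})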
Structural Biology
Spectral Decomposition of the Laplacian Matrix Applied to RNA Folding Prediction
Danny Barash
Genome Diversity Center, Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel
Abstract
RNA secondary structure consists of elements such as stems,
bulges, and loops. The most obvious and important scalar number
that can be attached to an RNA structure is its free energy,
with a landscape that governs the folding pathway. However,
because of the unique geometry of RNA secondary structure,
another interesting single-signed scalar number based on
geometrical scales exists that can assist in RNA structure
computations. This scalar number is the second eigenvalue of
the Laplacian matrix corresponding to a tree-graph
representation of the RNA secondary structure. Because of
the mathematical properties of the Laplacian matrix, the
first eigenvalue is always zero, and the second eigenvalue
(often denoted as the Fiedler eigenvalue) is a measure of
the compactness of the associated tree-graph. The concept of
using the Fiedler eigenvalue/eigenvector is borrowed from
domain decomposition in parallel computing. Thus, along with
the free energy, the Fiedler eigenvalue can be used as a
signature in a clever search among a collection of
structures by providing a similarity measure between RNA
secondary structures. This can also be used for mutation
predictions, classification of RNA secondary folds,
filtering and clustering. Furthermore, the Fiedler
eigenvector may be used to chop large RNAs into smaller
fragments by using spectral graph partitioning, based on the
geometry of the secondary structure. Each fragment may then
be treated differently for the folding prediction of the
entire domain.
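A few lines of NumPy suffice to compute the quantity in
question: build the Laplacian L = D - A of a tree graph and
take its second-smallest eigenvalue. The two toy trees below
are placeholders for coarse-grained RNA tree-graphs,
illustrating that the more compact tree has the larger
Fiedler eigenvalue.

    import numpy as np

    def fiedler_value(edges, n):
        # Laplacian L = D - A; its smallest eigenvalue is 0 and the
        # second smallest is the Fiedler eigenvalue.
        L = np.zeros((n, n))
        for i, j in edges:
            L[i, i] += 1
            L[j, j] += 1
            L[i, j] -= 1
            L[j, i] -= 1
        return np.linalg.eigvalsh(L)[1]

    path = [(0, 1), (1, 2), (2, 3)]   # elongated tree
    star = [(0, 1), (0, 2), (0, 3)]   # compact tree
    print(fiedler_value(path, 4))     # ~0.586
    print(fiedler_value(star, 4))     # 1.0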
Identification of Non-random Patterns in Structural and Mutational
Data: The Case of Prion Protein
Igor B. Kuznetsov and S.Rackovsky
Dept. of Biomathematical Sciences, Mount Sinai School of Medicine
Abstract
Prion diseases (mad cow disease, CJD, etc.) are a group of
fatal neurodegenerative disorders associated with
structural conversion of a normal, mostly α-helical cellular
prion protein, PrPC, into a pathogenic β-sheet-rich
conformation, PrPSc. The structure of PrPC is well
characterized, whereas little is known about the structure
of PrPSc, what parts of PrPC undergo conformational
transition, or how known disease promoting mutations
facilitate this transition. In this work we address these
problems by using a computational statistical analysis of
patterns observed in both sequence and structural
databases, as well as in PrP mutational data.
1. We compare sequence properties of PrP to the
distributions of properties observed in a non-redundant
database of known protein structures. Using a z-statistic we
identify PrP fragments that show unusual intrinsic
structural propensities.
2. We construct a novel entropic index which provides a
quantitative measure of context-dependent conformational
flexibility of a sequence fragment. We use this index to
identify PrP fragments whose sequence context has an
unusually high degree of conformational flexibility compared
to the sequence contexts of all fragments with similar
properties.
3. We test for statistically significant clusters of PrP
mutations and statistically significant patterns of changes
in physico-chemical properties associated with these
mutations. Estimates of statistical significance for these
tests are based on a stochastic model of mutational process
with unequal substitution rates and context-dependent
mutational hot spots, which allows us to take into account
a non-uniform distribution of the events.
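To illustrate the z-statistic screen in step 1, the short
Python sketch below standardizes an observed fragment
property against the distribution of that property in a
reference database; the numbers are placeholders, not PrP
data.

    import math

    def z_score(value, background):
        # z-statistic of one observation against a background sample.
        n = len(background)
        mean = sum(background) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in background) / (n - 1))
        return (value - mean) / sd

    # Placeholder: a propensity score for one fragment versus scores of
    # comparable fragments drawn from a non-redundant structure database.
    db = [0.42, 0.55, 0.47, 0.51, 0.49, 0.44, 0.53, 0.50]
    print(z_score(0.71, db))   # |z| well above 2 flags the fragment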
Multiple Protein Structure Alignment by Deterministic Annealing
Luonan Chen, Mituyasu Seo and Junya Nakai
Osaka Sangyo University, Japan
Abstract
Aligning multiple protein structures is one of the most
important computational problems in molecular biology and
plays a key role in protein structure prediction, fold
family classification, motif finding, tree reconstruction
and so on. Rather than a pairwise sequence alignment, which
simply compares two strings of characters, a pairwise
structure alignment is a more difficult problem that
optimally superimposes two structures and further finds the
regions of closest overlap in three-dimensional space. From
the viewpoint of computational complexity, even a pairwise
structure alignment is an NP-hard problem due to the
matching of rigid bodies, in contrast to the existence of a
polynomial time algorithm for pairwise sequence alignment by
dynamic programming (DP).
Although many algorithms have been developed so far for the
structure alignment problem, based on either distance-based
or vector-based methods, such as iterative dynamic
programming, Monte Carlo simulation, graph theory, and mean
field equations, only a few methods have been published for
multiple structure alignment, such as progressive structure
alignment and Monte Carlo optimization.
In this paper, we propose a novel method for solving the
multiple structure alignment problem, based on a
deterministic or mean field annealing technique. We define
the structure alignment as a mixed integer programming (MIP)
problem with the inter-atomic distances between two or more
structures as the objective function. The integer variables
represent the matchings among structures, whereas the
continuous variables are the translation vectors and
rotation matrices, with each protein structure treated as a
rigid body. By exploiting the special structure of the
continuous subproblem, we transform the MIP into a nonlinear
optimization problem (NOP) with a nonlinear objective
function and linear constraints, based on mean field
equations. To optimize the NOP, a mean field annealing
procedure is adopted with a modified Potts spin model. Since
all linear constraints are embedded in the mean field
equations, we do not need to add any penalty terms for the
constraints to the error function. In other words, there are
no "soft constraints" in our mean field model, and all
constraints are automatically satisfied during the annealing
process, thereby not only making the optimization more
efficient but also eliminating unnecessary penalty
parameters that usually require careful tuning for each
problem.
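Once a matching is fixed, the continuous variables (a
rotation matrix and translation vector per structure) admit
a closed-form optimum, which is what makes the continuous
subproblem tractable. The NumPy sketch below shows the
standard Kabsch/SVD superposition for a fixed matching; it
is a reference point only, not the poster's mean field
formulation.

    import numpy as np

    def superimpose(P, Q):
        # Rotation R and translation t minimizing ||R P_i + t - Q_i||^2
        # over matched coordinate rows of P and Q (Kabsch algorithm).
        cp, cq = P.mean(axis=0), Q.mean(axis=0)
        H = (P - cp).T @ (Q - cq)
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflection
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        return R, cq - R @ cp

    P = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
    theta = np.pi / 3
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1.0]])
    Q = P @ Rz.T + np.array([1.0, 2.0, 3.0])     # rotated + translated copy
    R, t = superimpose(P, Q)
    print(np.allclose(R @ P.T + t[:, None], Q.T))   # True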
An Exact Algorithm for Determining Protein Backbone Structure from NH
Residual Dipolar Couplings
Lincong Wang, Ramgopal R. Mettu, Ryan H. Lilien and Bruce Randall Donald
Department of Computer Science, Dartmouth College, Hanover, NH 03755
Abstract
We have developed a novel algorithm for protein backbone
structure determination using global orientational
restraints on internuclear vectors derived from residual
dipolar couplings (RDCs) measured in solution NMR. The
algorithm is based on two low-degree polynomial equations
for computing the backbone $(\phi,\psi)$ angles from two
vectors in consecutive peptide planes; each vector is
obtained by solving a quartic equation on two RDCs. Given
RDCs measured on any \emph{single} backbone vector type,
these equations make it possible to compute $(\phi,\psi)$
\emph{exactly} and \emph{in constant time per residue}.
Using these equations, our algorithm computes $(\phi,\psi)$
angles that simultaneously maximize structural agreement
with experimental RDCs and minimize deviation from standard
secondary structure geometry. Our algorithm samples a
Gaussian error interval around each RDC, and then, in a
depth-first manner, searches the discrete sets of
$(\phi,\psi)$ solutions. The search tree has a branching
factor of at most 16 (typically 4), since for each pair of
consecutive RDCs our equations have multiple solutions. The
depth-first search incorporates a pruning strategy based on
favorable Ramachandran regions. We give evidence that under
a Gaussian error model for RDCs, our algorithm has an
expected runtime that is considerably lower than the
worst-case exponential runtime. In practice, our
implementation runs quickly; it has been demonstrated on
Human Ubiquitin using backbone NH RDCs. The secondary
structure elements were subsequently positioned with twelve
hydrogen bonds and four NOE distance restraints. Compared
with other RDC-based automated fold determination
algorithms, our exact method uses fewer restraints but
achieves similar or better accuracy.
Automatic Construction of 3D Structural Motifs for Protein Function Prediction
Mike Hsin-Ping Liang, Doug Brutlag and Russ Altman
Stanford Biomedical Informatics
Abstract
Structural genomics initiatives are on the verge of
generating a vast number of protein structures. The biological roles
for many of these proteins are still unknown, and high-throughput
methods for determining their function are necessary. Understanding
the function of these proteins will have profound impact in drug
development and protein engineering.
Current methods for protein function prediction from
structures require manual creation of structural motifs, so
only a few structural motifs are available. The lack of
structural motifs limits the use of these methods for
function prediction at a structural-genomics scale.
To overcome this limitation, I present a method for
automatically creating a library of three-dimensional
structural motifs from amino acid sequence motif databases.
Such an automatically generated library of structural motifs
can be used for structural-genomics-scale function
prediction on protein structures.
Local Minima-Based Exploration for Off-Lattice Protein Folding
Eugene Santos Jr., Keum Joo Kim and Eunice E. Santos
Dept. of Computer Science and Engineering, University of Connecticut
Abstract
Proteins have specific native folding structures, but
without proper folding, these proteins function
incorrectly and lead to disease. The major challenge lies
in effectively predicting such native conformations. We
present a new and simple algorithmic approach to help
predict protein structures from amino acid sequences based
on energy minimization. In the search for the minimal
energy conformation, we analyze and exploit the protein
structures found at the various local minima to direct the
search toward the global minimum. As such, we explore the
energy
landscape efficiently by considering only the space of
local minima instead of the whole feasible space of
conformations. Our specific algorithmic approach is
comprised of two different elements: local minimization
and operators from genetic algorithms. Unlike existing
hybrid approaches where the local optimization is used to
fine-tune the solutions, we focus primarily on the local
optimization and employ stochastic sampling through
genetic operators for diversification. Our empirical
results indicate that each local minimum is representative
of the substructures contained in the set of solutions
surrounding it. We applied our approach to
determining the minimal energy conformation of proteins
from the Protein Data Bank (PDB) using the UNRES energy
model. We compared against standard genetic algorithms and
Monte Carlo approaches as well as the conformations found
in the PDB as the baseline. In all cases, our new approach
computed the lowest energy conformation.
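A compressed Python sketch of the scheme: every candidate
conformation is immediately relaxed to a local minimum, and
genetic operators then recombine and perturb those minima. A
toy multi-welled energy stands in for the UNRES model;
population sizes and rates are placeholder assumptions.

    import random
    import numpy as np
    from scipy.optimize import minimize

    def energy(x):
        # Toy multi-welled landscape standing in for UNRES.
        return float(np.sum(x ** 4 - 3 * x ** 2 + 0.5 * x))

    def to_local_minimum(x):
        # The search lives in the space of local minima: every
        # candidate is relaxed before it enters the population.
        return minimize(energy, x, method="BFGS").x

    def explore(dim=6, pop_size=12, generations=30):
        pop = [to_local_minimum(np.random.uniform(-2, 2, dim))
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=energy)
            parents = pop[: pop_size // 2]
            children = []
            for _ in parents:
                a, b = random.sample(parents, 2)
                mask = np.random.rand(dim) < 0.5     # uniform crossover
                child = np.where(mask, a, b) + np.random.normal(0, 0.1, dim)
                children.append(to_local_minimum(child))
            pop = parents + children
        return min(pop, key=energy)

    print(energy(explore()))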
Text Mining and Ontologies
HyBrow: A System for Hypothesis Validation
Stephen A. Racunas, Nigam H. Shah, Shashi Phoha and Nina V. Fedoroff
Penn State University, Dept of Electrical Engineering
Abstract
Representation of diverse data and biological processes in
a single formalism is a major challenge in bioinformatics
because of the variety of available data and the different
levels of detail at which biological processes can be
considered. We have developed an event-based ontology for
representing both biological processes and data from
different sources at varying levels of detail. We have
integrated this ontology with a discrete event systems
framework to support qualitative reasoning. The framework
adopts a hierarchical, contradiction-based approach to
evaluate competing hypotheses based on available data and
provide a ranking among them. We have designed a database
backend to structure information in the ontology and
programs to perform contradiction-based evaluation of
hypotheses constructed using the ontology. We have also
designed command line, web and custom interfaces for the
system. We have named our system HyBrow (from Hypothesis
Browser), which includes the event-based ontology, an
associated database, the hypothesis evaluation framework
and the interface programs. We demonstrate the feasibility
of HyBrow, using the galactose metabolism gene network in
Saccharomyces cerevisiae as our test system, for ranking
alternative hypotheses in an automated manner using
available data. The HyBrow system for modeling biological
systems is highly extensible and can facilitate the process
of understanding how a biological system works by providing
an integrated view of the biological process under study.
Identifying Gene and Protein Names from Biological Texts
Weijian Xuan, Stanley J. Watson, Huda Akil and Fan Meng
Mental Health Research Institute and Department of Psychiatry, University of Michigan
Abstract
Identifying gene and protein names from literature is a
critical step for mining functional information of genes
and proteins. While extensive efforts have been devoted to
this important task, most of them aim at extracting gene and
protein names per se without paying much attention to
associating the extracted names with existing gene and
protein database entries. We developed a simple and
efficient method for identifying gene and protein names in
literature using a combination of heuristic and statistical
strategies. We build a large dynamic dictionary based on
LocusLink and UMLS. We use handcrafted rules and natural
language processing techniques to transform the dictionary
and text, and use an efficient algorithm to search for
interesting patterns. A series of procedures are applied to
disambiguate the detected entities. Our approach maps the
identified entities to individual LocusLink entries, thus
enabling the seamless integration of literature information
with existing gene and protein databases.
Evaluation on two different corpora shows that our method
achieves around 90% precision and recall, and outperforms
two other publicly available systems for protein name
recognition. The analysis of the entire set of Medline
abstracts can be finished in less than 15 hours on a single
PC. Our method does not require any training data, and can
be used as a building block for large biomedical literature
mining systems.
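A stripped-down Python sketch of the dictionary strategy:
normalize both dictionary synonyms and text phrases the same
way, then map matches back to database identifiers. The
two-entry dictionary is a placeholder; the actual system
builds its dictionary from LocusLink and UMLS and layers
rules and disambiguation on top.

    import re

    SYNONYMS = {"TNF": "LocusID:7124", "TNF-alpha": "LocusID:7124",
                "IL6": "LocusID:3569", "interleukin 6": "LocusID:3569"}

    def normalize(term):
        # Collapse case, spacing and punctuation variants so that
        # 'TNF-alpha', 'TNF alpha' and 'tnfalpha' all collide.
        return re.sub(r"[\s\-_/]+", "", term.lower())

    NORMALIZED = {normalize(k): v for k, v in SYNONYMS.items()}

    def find_entities(text, max_words=3):
        words = text.split()
        hits = []
        for i in range(len(words)):
            for j in range(i + 1, min(i + 1 + max_words, len(words) + 1)):
                phrase = " ".join(words[i:j]).strip(".,;()")
                locus = NORMALIZED.get(normalize(phrase))
                if locus:
                    hits.append((phrase, locus))
        return hits

    print(find_entities("TNF alpha induces interleukin 6 expression."))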
Refining the Extraction of Relevant Documents from Biomedical
Literature to Create a Corpus for Pathway Text Mining
Rachel Harte, Yan Lu, David Dehoney, Stephen Osborn and Daniel Chin
PPD Discovery, Inc.
Abstract
For biologists to keep up with developments in their field
or related fields, automation is desirable to more
efficiently read and interpret a rapidly growing
literature. Identification of proteins or genes and their
interactions can facilitate the mapping of canonical or
evolving pathways from the literature. In order to mine
such data, we developed procedures and tools to pre-qualify
documents for further analysis. Initially, a corpus of
documents for proteins of interest was built using
alternate symbols from LocusLink and the Stanford SOURCE as
MEDLINE search terms. The query was refined using the
optimum keywords together with MeSH terms combined in a
Boolean query to minimize false positives. The document
space was examined using a Latent Semantic Indexing (LSI)
tool, which uses Entrez's "related papers" utility for
MEDLINE. Documents' relationships are then visualized using
an undirected graph and scored depending on their degree of
relatedness. Distinct document clusters, formed by the most
highly connected related papers, are mostly composed of
abstracts relating to one aspect of research. This feature
was used to filter irrelevant abstracts, which resulted in
a reduction in corpus size of 10% to 20% depending on the
domain. The excluded documents were examined to confirm
their lack of relevance. Corpora consisted of the most
relevant documents thus reducing the number of false
positives and irrelevant examples in the training set for
pathway mapping. Documents were tagged, using a modified
version of GATE2, with terms based on GO for rule induction
using RAPIER.
Using Natural Language Processing and the Gene Ontology to Populate a Structured Pathway Database
David Dehoney, Rachel Harte, Yan Lu and Daniel Chin
PPD Discovery
Abstract
Reading the literature is one of the most time-consuming
tasks a busy scientist has to contend with. As the volume of
literature continues to grow, there is a need to sort
through this information more efficiently. Mapping the
pathways of genes and proteins of interest is one goal that
requires frequent reference to the literature. Pathway
databases can help here, and scientists currently have a
choice between buying access to externally curated pathway
databases or building their own in-house. However, such
databases are either expensive to license or slow to
populate manually.
Building upon easily available, open-source tools we have
developed a pipeline to automate the collection,
structuring and storage of gene and protein interaction
data from the literature. As a team of both biologists and
computer scientists we integrated our natural language
processing (NLP) software with the Gene Ontology (GO) to
collect and translate unstructured text data into
structured interaction data. For NLP we used a machine
learning approach with a rule induction program, RAPIER
(http://www.cs.utexas.edu/users/ml/rapier.html). RAPIER was
modified to learn rules from tagged documents, and then it
was trained on a corpus tagged by expert curators. The
resulting rules were used to extract information from a
test corpus automatically. Extracted genes and proteins were
mapped onto LocusLink, and extracted interactions were
mapped onto GO. Once information was structured in this way,
it was stored in a pathway database and this formal
structure allowed us to perform advanced data mining and
visualization.
Text Visualization for Natural Language Analysis of Biology Full Text
Andrea Elaina Grimes and Robert P. Futrelle
Northeastern University
Abstract
Full text articles in the biology literature are a rich
source of knowledge. Our focus is on discovering enough of
the text structure to answer natural language queries such
as, "What are the mutants of the TBP dimer interface?"
Rather than view text as grammar-based, we view it as
"data," just as DNA sequences are. We have developed novel
visualization tools to help us discover significant language
patterns. The corpus is 300M words of text from Highwire
Press. One important class of "Framework" patterns consists
of a left anchor, an endogram and a right anchor. An example
is, "... expression of [e] was ..." in which the endogram
[e] is typically a gene name such as "csgA" or "AcuF-LacZ."
In our Framework Viewer, the anchors are aligned in columns,
with the remainder of each sentence also displayed. The
sentences can be sorted in various ways, e.g., by endogram
length or alphabetically by the word preceding the right
anchor. It is possible to manually mark interesting
Framework instances for later processing. The endogram
multisets generated can be combined to broaden or sharpen
them. We are cataloging every Framework in the corpus to
discover the most frequent expressions that authors use.
Queries are processed by computing their distance from items
in the text using measures based on word insertions,
deletions and substitutions. This surface view of language
is still not powerful enough, so extensions of Frameworks
are being developed to induce natural language constituent
structure.
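The distance computation lends itself to a compact Python
sketch: a Levenshtein distance over words rather than
characters, so a query and a sentence are close when few
word insertions, deletions or substitutions separate them.
Uniform costs are assumed here for simplicity.

    def word_edit_distance(query, sentence):
        q, s = query.split(), sentence.split()
        m, n = len(q), len(s)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if q[i - 1] == s[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # delete a word
                              d[i][j - 1] + 1,         # insert a word
                              d[i - 1][j - 1] + cost)  # substitute
        return d[m][n]

    print(word_edit_distance("expression of csgA was",
                             "expression of the csgA gene was"))   # 2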