Computational Methods for Predicting Intramolecular G-quadruplexes in Nucleotide Sequences
Lawrence D'Antonio
Ramapo College of New Jersey
The ability of single stranded nucleic acid molecules to form three-dimensional G-quadruplex structures
is well documented. These structures arise from a motif consisting of four clusters of repeated guanines
that form tetrads stacked upon one another. Guanine-rich sequences, including those that form G-quadruplexes,
are abundant in various regions of biological significance. For example, we have determined that a conserved
guanine-rich sequence could enhance 3′ end processing efficiency of mammalian pre-mRNAs through interaction
with an RNA-binding protein of the hnRNP H family. Members of this protein family are known to be involved in
alternative and tissue-specific regulated splicing events. Our findings suggest that G-quadruplexes play a
regulatory role in differential RNA processing. To map putative G-quadruplex elements within
mammalian genes we have created a suite of computational tools. The suite contains algorithms to search
genes for occurrences of the G-quadruplex motif and analyze their distribution patterns near RNA processing
sites. Z-scores for the number of such occurrences indicate that they are overrepresented near these
processing sites. To determine which sequences are most likely to form G-quadruplexes we have developed a
scoring method based on known properties of these structures. Once a candidate sequence has been identified
and scored, a p-value is computed indicating the likelihood that such a score would occur by chance.
Correlations are then computed between these scores and sequence position relative to RNA processing sites.
These computational tools will be used to develop a database of G-quadruplex forming sequences drawn from a
large number of mammalian genes.
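A motif search of the kind described above can be sketched with a regular expression. The pattern below (four runs of at least three guanines separated by loops of one to seven bases) is a common G-quadruplex heuristic, not necessarily the exact definition used by the authors' suite, and the function name is hypothetical.

```python
import re

# Hypothetical motif definition: four G-tracts of >= 3 guanines
# separated by loops of 1-7 arbitrary bases.
G4_PATTERN = re.compile(r"(G{3,})([ACGT]{1,7}G{3,}){3}")

def find_g4_motifs(seq):
    """Return (start, end, matched subsequence) for each putative
    intramolecular G-quadruplex motif found in seq."""
    seq = seq.upper()
    return [(m.start(), m.end(), m.group(0)) for m in G4_PATTERN.finditer(seq)]

print(find_g4_motifs("TTGGGATGGGACGGGTTGGGTT"))
# [(2, 20, 'GGGATGGGACGGGTTGGG')]
```

Candidate hits found this way would then be scored and assigned p-values, as the abstract describes.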
An Investigation of Gene Recovery and Biases Observed in Two Reduced Representation Sequencing
Techniques When Applied to the Genome of Zea mays
X. Xu, W. B. Barbazuk, G. Pertea, Y. Feng, A. P. Chan, F. Cheung, Y. Lee, L. Zheng, and K. Schubert
Donald Danforth Plant Science Center
The large size of the maize genome and the expectation that upwards of 80% of the genome consists of
repetitive elements have prompted the examination of sequencing technologies expected to target
gene-rich regions. The objective of the Consortium for Maize Genomics is to evaluate two approaches to sequencing
the maize genespace (methylation filtration and high Cot selection) which may provide rapid and cost-effective
alternatives to sequencing the whole genome. Methylation filtration involves shotgun sequencing of the
hypomethylated fraction of the maize genome, which is thought to contain most of the genes, while high Cot
selection involves denaturation of total genomic DNA, followed by controlled renaturation to differentiate
between high copy (repetitive) and low copy (genic) fractions of the genome. Approximately 450,000 sequence
reads have been obtained from both methylation filtered clones and high Cot clones, and these sequences
have been clustered and assembled. We observe a five-fold reduction in effective genome size and at least
a four-fold increase in gene hit rate for the two enrichment techniques compared to a non-enriched
library. Our analysis suggests that methylation filtration and high Cot selection target non-identical but
overlapping portions of the maize sequence space, and that >90% of maize genes have been identified. The extent
of maize gene coverage and bias observed for methyl-filtered and high-Cot derived sequences, and their potential
utility in whole genome sequencing of large crop genomes, will be discussed.
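The gene hit rate comparison above amounts to a simple ratio of rates. The sketch below illustrates the calculation with made-up counts; the abstract reports only the resulting ratios, not the underlying numbers.

```python
# Illustrative fold-enrichment calculation for a gene-enriched library
# versus an unfiltered shotgun control. All counts below are invented.
def gene_hit_rate(reads_hitting_genes, total_reads):
    return reads_hitting_genes / total_reads

def fold_enrichment(enriched_hits, enriched_total, control_hits, control_total):
    """Ratio of the enriched library's gene hit rate to the control's."""
    return (gene_hit_rate(enriched_hits, enriched_total)
            / gene_hit_rate(control_hits, control_total))

# Hypothetical: 40% of enriched reads hit genes vs. 10% of control reads.
print(fold_enrichment(400, 1000, 100, 1000))  # 4.0 -> a four-fold increase
```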
Sequences Meet Biology: Integration of Sequences as Database Objects in Mouse Genome Informatics
Michael B. Walker, James A. Kadin, Richard M. Baldarelli, Benjamin L. King, Lori E. Corbani,
Sharon L. Cousins, Jonathan S. Beal, Jill Lewis, David B. Miers, Josh Winslow, Joel E. Richardson,
Judith A. Blake, Martin Ringwald, Janan T. Eppig, Carol J. Bult, and the Mouse Genome Informatics Group
The Jackson Laboratory
The Mouse Genome Informatics (MGI) Database is a public resource which provides curated and integrated
information on the biology and genetics of the laboratory mouse to the scientific and medical community
worldwide. For more than ten years, MGI has integrated richly curated domains such as genes, gene
expression, gene function, phenotypes, strains, mammalian orthology, and chromosomal positioning, with
SNPs soon to come. With the release of MGI 3.0, MGI has begun representing nucleotide and protein
sequence information within the database as distinct database objects rather than simple accession IDs.
MGI will now integrate all mouse sequence data from GenBank, SWISS-PROT, LocusLink,
RefSeq, Ensembl and NCBI gene models, along with the TIGR, DoTS and NIA Mouse Gene Indices. In conjunction
with this new release comes a comprehensive system architecture for data retrieval, integration and storage
called the Data Load Architecture (DLA). A robust software engineering process was applied throughout the
requirements gathering, design, development and testing of this system. At the core of the software is a
portable and extensible Java framework which provides polymorphism in data parsing, caching,
and update strategies, and includes a code-generated data access layer for transparent database persistence
at the application level. This system was developed with a design-for-maintenance philosophy so that this
release becomes only the first step toward a much more precise representation of sequence features and
attributes in the future through the development of sequence feature maps. The MGI database can be accessed
freely at http://www.informatics.jax.org.
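The pluggable-strategy idea attributed to the (Java) Data Load Architecture can be sketched briefly; the example below is in Python for compactness, and every class and method name here is hypothetical, not taken from the DLA itself.

```python
# Minimal sketch of a load framework with a swappable parsing strategy:
# the driver is fixed while the parser (and, by extension, caching or
# update strategies) can be substituted. Names are invented.
class RecordParser:
    def parse(self, line):
        raise NotImplementedError

class TabParser(RecordParser):
    """Parses tab-delimited 'accession<TAB>description' records."""
    def parse(self, line):
        acc, desc = line.rstrip("\n").split("\t", 1)
        return {"accession": acc, "description": desc}

class SequenceLoader:
    def __init__(self, parser, store):
        self.parser = parser  # any RecordParser implementation
        self.store = store    # dict standing in for the persistence layer

    def load(self, lines):
        for line in lines:
            record = self.parser.parse(line)
            self.store[record["accession"]] = record

store = {}
SequenceLoader(TabParser(), store).load(["AB000001\tmouse cDNA clone"])
print(store["AB000001"]["description"])  # mouse cDNA clone
```

A new source format would be handled by adding another `RecordParser` subclass, leaving the loader untouched.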
Navigating through the Biological Maze
Zoe Lacroix, Kaushal Parekh, Louiqa Raschid, Maria Esther Vidal
Arizona State University
Advances in genome science have created a surge of data. These data, critical to scientific discovery, are
made available in thousands of heterogeneous public resources. Each resource provides biological data with
its own organization, format, quality, and object identification, along with a variety of capabilities that
allow scientists to access, analyze, cluster, visualize, and navigate the datasets.
The heterogeneity of biological resources and their growing number make it difficult for scientists to
exploit and understand them. Learning the properties of a new resource is a tedious and time-consuming process,
often made harder by frequent changes to the resources (new or altered information and capabilities) that
force scientists to continually update their knowledge. As a result, many scientists master a few resources
while ignoring others that may provide additional data and useful capabilities. The BioNavigation system
complements existing data integration approaches by allowing users to explore biological resources. It supports
scientists in the selection of resources when expressing their data collection protocols: (1) scientists
express queries against a conceptual model (e.g., sequence, gene); (2) the system evaluates all resources
that may be used to answer the query; (3) the scientist browses through the solutions, accesses each
resource's properties (data organization, format, etc., compiled in a capability map) by clicking on it, and
chooses the one that best matches the protocol requirements. The BioNavigation system can be coupled with a
data integration query tool that allows users to collect the data automatically once the resources are selected.
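The selection step in the middle of this process can be sketched as a lookup over a capability map. The map below is invented for illustration; a real capability map would compile each resource's actual organization, format, and access methods.

```python
# Toy capability map: resource name -> recorded properties.
# Entries and property values are illustrative, not real metadata.
CAPABILITY_MAP = {
    "OMIM":    {"provides": "gene",     "format": "flat file"},
    "GenBank": {"provides": "sequence", "format": "flat file"},
    "Ensembl": {"provides": "sequence", "format": "relational"},
}

def candidate_resources(concept):
    """Step 2 of the protocol: all resources able to answer a query
    expressed against a conceptual model (e.g., 'sequence', 'gene')."""
    return sorted(name for name, caps in CAPABILITY_MAP.items()
                  if caps["provides"] == concept)

# Step 1: the scientist asks for 'sequence'; step 3 would be browsing
# the candidates and their capability entries to pick the best match.
print(candidate_resources("sequence"))  # ['Ensembl', 'GenBank']
```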
How Biological Source Capabilities May Affect Data Collection
Zoe Lacroix and Vidyadghari Edupuganti
Arizona State University
Scientific discovery relies partially on the collection of information related to multiple scientific
objects of interest (e.g., Retrieve all genes involved in brain cancer, Retrieve all citations related
to diabetes). Scientists explore multiple data sources in order to uncover relationships between
scientific objects. Each data source provides specific capabilities that allow
scientists to access, navigate, and analyze the data. This work addresses the impact of resource selection
(data source and capability) on the data collection process, as it can significantly affect the quality and
completeness of the collected data. We present preliminary research that demonstrates that the data collection
process depends on two orthogonal variables: the selection of data sources involved in the process, and
the selection of capabilities available at these resources. We report the results for four commonly used
biological resources: the NCBI Nucleotide, Protein, PubMed and OMIM databases. Keywords were used to
retrieve relevant genes from OMIM via the ESearch utility of NCBI. Then PubMed citations related to
these genes were recorded. The results collected include the number of entries, attributes and cost
(time and space). We study three paths starting with OMIM and ending with PubMed, with Nucleotide and
Protein databases as intermediate resources, and three capabilities provided by the ESearch utility. Our
preliminary results indicate that (1) for a given path, the three capabilities retrieve different results,
and (2) for a given capability, the three paths retrieve different results. We analyze how the nine data
collection processes produce nine distinct datasets with different characteristics.
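The experimental design above is a simple cross product: three source paths times three capabilities gives nine collection processes. In the sketch below the path list and capability names are placeholders, since the abstract does not spell out the third path or the specific ESearch options used.

```python
from itertools import product

# Three OMIM-to-PubMed paths crossed with three ESearch capabilities.
# The third path and all capability names are hypothetical placeholders.
paths = [
    ("OMIM", "Nucleotide", "PubMed"),
    ("OMIM", "Protein", "PubMed"),
    ("OMIM", "Nucleotide", "Protein", "PubMed"),  # hypothetical third path
]
capabilities = ["capability_1", "capability_2", "capability_3"]

# Each (path, capability) pair defines one data collection process.
processes = list(product(paths, capabilities))
print(len(processes))  # 9 distinct data collection processes
```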
Labeling and Enhancing Life Sciences Links
Stephan Heymann (1), Felix Naumann (1), Peter Rieger (1), and Louiqa Raschid (2)
(1) Humboldt-Universitat zu Berlin; (2) University of Maryland
Life sciences data sources contain data about scientific objects such as genes and sequences that are richly
interconnected, i.e., a gene object may have links to sequences, proteins, SNPs, citations, etc. Scientific
knowledge is enhanced through exploring all possible relationships between scientific objects and this requires
traversing both links and paths (informally, concatenations of links). Such exploration faces significant
limitations and challenges, since the links are inherently poor with respect to both syntactic representation
and semantic knowledge. The links are syntactically poor because the source and the target of a link are
typically specified at the level of data objects (data entries). However, most scientists understand that
the source and the target of a link are potentially at a finer level of granularity, and may correspond to
specific sub-elements or fields within these data entries. The links are semantically poor because they carry
no explicit meaning beyond the fact that the data entries are "related." This lack of syntactic and semantic
knowledge prevents the development of tools that can help scientists fully explore data sources and their interconnections.
In this project, we develop a methodology to enhance the structure and meaning associated with links: (i) a
data model that captures both the syntactic representation and the semantic knowledge associated with a link;
(ii) a concatenation operator whose properties prescribe when it is meaningful to concatenate links into
paths, and when links and paths are equivalent; and (iii) a query language and interpreter with which
scientists can meaningfully explore links and paths.
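The link model outlined above can be sketched as a small data structure: each link records sub-element granularity on both ends plus an explicit semantic label, and concatenation is defined only when one link's target meets the next link's source. All field names and identifiers below are invented for illustration.

```python
from dataclasses import dataclass

# Hedged sketch of an enhanced link: source/target objects, the
# sub-element (field) on each end, and an explicit semantic label.
@dataclass(frozen=True)
class Link:
    source: str        # data entry the link starts from
    source_field: str  # sub-element within the source entry
    target: str
    target_field: str
    meaning: str       # explicit semantic label, e.g. "encodes"

def concat(a, b):
    """Form the path a.b when a's target entry is b's source entry;
    return None when concatenation is not meaningful."""
    if a.target == b.source:
        return (a, b)
    return None

gene_to_seq = Link("Gene:G1", "id", "Seq:S1", "accession", "has_sequence")
seq_to_prot = Link("Seq:S1", "CDS", "Prot:P1", "id", "encodes")

print(concat(gene_to_seq, seq_to_prot) is not None)  # True: a valid path
print(concat(seq_to_prot, gene_to_seq) is None)      # True: not composable
```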