Computational Methods for Predicting Intramolecular G-quadruplexes in Nucleotide Sequences
Lawrence D'Antonio
Ramapo College of New Jersey
The ability of single stranded nucleic acid molecules to form three-dimensional G-quadruplex structures
is well documented. These structures arise from a motif consisting of four clusters of repeated guanines
that form tetrads stacked upon one another. Guanine-rich sequences, including those that form G-quadruplexes,
are abundant in various regions of biological significance. For example, we have determined that a conserved
guanine-rich sequence could enhance 3′ end processing efficiency of mammalian pre-mRNAs through interaction
with an RNA-binding protein of the hnRNP H family. Members of this protein family are known to be involved in
alternative and tissue-specific regulated splicing events. Our findings suggest that G-quadruplexes play a
regulatory role in differential RNA processing. To map putative G-quadruplex elements within
mammalian genes we have created a suite of computational tools. The suite contains algorithms to search
genes for occurrences of the G-quadruplex motif and analyze their distribution patterns near RNA processing
sites. Z-scores for the number of such occurrences indicate that they are overrepresented near these
processing sites. To determine which sequences are most likely to form G-quadruplexes we have developed a
scoring method based on known properties of these structures. Once a candidate sequence has been identified
and scored, a p-value is computed indicating the likelihood that such a score would occur by chance.
Correlations are then computed between these scores and sequence position relative to RNA processing sites.
These computational tools will be used to develop a database of G-quadruplex forming sequences drawn from a
large number of mammalian genes.
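A motif search of the kind described above can be sketched with a regular expression. The pattern below (four runs of at least three guanines separated by loops of one to seven bases) is a common G-quadruplex heuristic, not necessarily the exact definition used by the authors' suite, and the function name is hypothetical.

```python
import re

# Hypothetical motif definition: four G-tracts of >= 3 guanines
# separated by loops of 1-7 arbitrary bases.
G4_PATTERN = re.compile(r"(G{3,})([ACGT]{1,7}G{3,}){3}")

def find_g4_motifs(seq):
    """Return (start, end, matched subsequence) for each putative
    intramolecular G-quadruplex motif found in seq."""
    seq = seq.upper()
    return [(m.start(), m.end(), m.group(0)) for m in G4_PATTERN.finditer(seq)]

print(find_g4_motifs("TTGGGATGGGACGGGTTGGGTT"))
# [(2, 20, 'GGGATGGGACGGGTTGGG')]
```

Candidate hits found this way would then be scored and assigned p-values, as the abstract describes.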
An Investigation of Gene Recovery and Biases Observed in Two Reduced Representation Sequencing
Techniques When Applied to the Genome of Zea mays
X. Xu, W. B. Barbazuk, G. Pertea, Y. Feng, A. P. Chan, F. Cheung, Y. Lee, L. Zheng, and K. Schubert
Donald Danforth Plant Science Center
The large size of the maize genome and the expectation that upwards of 80% of the genome consists of
repetitive elements have prompted the examination of sequencing technologies expected to target
gene-rich regions. The objective of the Consortium for Maize Genomics is to evaluate two approaches to sequencing
the maize genespace (methylation filtration and high Cot selection) which may provide rapid and cost-effective
alternatives to sequencing the whole genome. Methylation filtration involves shotgun sequencing of the
hypomethylated fraction of the maize genome, which is thought to contain most of the genes, while high Cot
selection involves denaturation of total genomic DNA, followed by controlled renaturation to differentiate
between high copy (repetitive) and low copy (genic) fractions of the genome. Approximately 450,000 sequence
reads have been obtained from both methylation filtered clones and high Cot clones, and these sequences
have been clustered and assembled. We observe a five-fold reduction in effective genome size and at least
a four-fold increase in gene hit rate for the two enrichment techniques compared to a non-enriched
library. Our analysis suggests that methylation filtration and high Cot selection target non-identical but
overlapping portions of the maize sequence space, and that >90% of maize genes have been identified. The extent
of maize gene coverage and bias observed for methyl-filtered and high-Cot derived sequences, and their potential
utility in whole genome sequencing of large crop genomes, will be discussed.
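The gene hit rate comparison above amounts to a simple ratio of rates. The sketch below illustrates the calculation with made-up counts; the abstract reports only the resulting ratios, not the underlying numbers.

```python
# Illustrative fold-enrichment calculation for a gene-enriched library
# versus an unfiltered shotgun control. All counts below are invented.
def gene_hit_rate(reads_hitting_genes, total_reads):
    return reads_hitting_genes / total_reads

def fold_enrichment(enriched_hits, enriched_total, control_hits, control_total):
    """Ratio of the enriched library's gene hit rate to the control's."""
    return (gene_hit_rate(enriched_hits, enriched_total)
            / gene_hit_rate(control_hits, control_total))

# Hypothetical: 40% of enriched reads hit genes vs. 10% of control reads.
print(fold_enrichment(400, 1000, 100, 1000))  # 4.0 -> a four-fold increase
```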
Sequences Meet Biology: Integration of Sequences as Database Objects in Mouse Genome Informatics
Michael B. Walker, James A. Kadin, Richard M. Baldarelli, Benjamin L. King, Lori E. Corbani,
Sharon L. Cousins, Jonathan S. Beal, Jill Lewis, David B. Miers, Josh Winslow, Joel E. Richardson,
Judith A. Blake, Martin Ringwald, Janan T. Eppig, Carol J. Bult, and the Mouse Genome Informatics Group
The Jackson Laboratory
The Mouse Genome Informatics (MGI) Database is a public resource which provides curated and integrated
information on the biology and genetics of the laboratory mouse to the scientific and medical community
worldwide. For more than ten years, MGI has integrated richly curated domains such as genes, gene
expression, gene function, phenotypes, strains, mammalian orthology, and chromosomal positioning, with
SNPs soon to come. With the release of MGI 3.0, MGI has begun representing nucleotide and protein
sequence information within the database as distinct database objects rather than simple accession IDs.
MGI will now integrate all mouse sequence data from GenBank, SWISS-PROT, LocusLink,
RefSeq, Ensembl and NCBI gene models, along with the TIGR, DoTS and NIA Mouse Gene Indices. In conjunction
with this new release comes a comprehensive system architecture for data retrieval, integration and storage
called the Data Load Architecture (DLA). A robust software engineering process was applied throughout the
requirements gathering, design, development and testing of this system. At the core of the software is a
portable and extensible Java framework which provides polymorphism in data parsing, caching,
and update strategies, and includes a code-generated data access layer for transparent database persistence
at the application level. This system was developed with a design-for-maintenance philosophy so that this
release becomes only the first step toward a much more precise representation of sequence features and
attributes in the future through the development of sequence feature maps. The MGI database can be accessed
freely at http://www.informatics.jax.org.
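The pluggable-strategy idea attributed to the (Java) Data Load Architecture can be sketched briefly; the example below is in Python for compactness, and every class and method name here is hypothetical, not taken from the DLA itself.

```python
# Minimal sketch of a load framework with a swappable parsing strategy:
# the driver is fixed while the parser (and, by extension, caching or
# update strategies) can be substituted. Names are invented.
class RecordParser:
    def parse(self, line):
        raise NotImplementedError

class TabParser(RecordParser):
    """Parses tab-delimited 'accession<TAB>description' records."""
    def parse(self, line):
        acc, desc = line.rstrip("\n").split("\t", 1)
        return {"accession": acc, "description": desc}

class SequenceLoader:
    def __init__(self, parser, store):
        self.parser = parser  # any RecordParser implementation
        self.store = store    # dict standing in for the persistence layer

    def load(self, lines):
        for line in lines:
            record = self.parser.parse(line)
            self.store[record["accession"]] = record

store = {}
SequenceLoader(TabParser(), store).load(["AB000001\tmouse cDNA clone"])
print(store["AB000001"]["description"])  # mouse cDNA clone
```

A new source format would be handled by adding another `RecordParser` subclass, leaving the loader untouched.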
Navigating through the Biological Maze
Zoe Lacroix, Kaushal Parekh, Louiqa Raschid, Maria Esther Vidal
Arizona State University
Advances in genome science have created a surge of data. These data, critical to scientific discovery, are
made available in thousands of heterogeneous public resources. Each resource provides biological data with
its own organization, format, quality, and object identification, along with a variety of capabilities that
allow scientists to access, analyze, cluster, visualize, and navigate the datasets.
The heterogeneity of biological resources and their growing number make it difficult for scientists to
exploit and understand them. Learning the properties of a new resource is a tedious and time-consuming process,
often made harder by frequent changes to the resources (new or altered information and capabilities) that
force scientists to continually update their knowledge. As a result, many scientists master a few resources
while ignoring others that may provide additional data and useful capabilities. The BioNavigation system
complements existing data integration approaches by allowing users to explore biological resources. It supports
scientists in the selection of resources when expressing their data collection protocols: (1) scientists
express queries against a conceptual model (e.g., sequence, gene); (2) the system evaluates all resources
that may be used to answer the query; (3) the scientist browses through the solutions, accesses each
resource's properties (data organization, format, etc., compiled in a capability map) by clicking on it, and
chooses the one that best matches the protocol requirements. The BioNavigation system can be coupled with a
data integration query tool that allows users to collect the data automatically once the resources are selected.
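The selection step in the middle of this process can be sketched as a lookup over a capability map. The map below is invented for illustration; a real capability map would compile each resource's actual organization, format, and access methods.

```python
# Toy capability map: resource name -> recorded properties.
# Entries and property values are illustrative, not real metadata.
CAPABILITY_MAP = {
    "OMIM":    {"provides": "gene",     "format": "flat file"},
    "GenBank": {"provides": "sequence", "format": "flat file"},
    "Ensembl": {"provides": "sequence", "format": "relational"},
}

def candidate_resources(concept):
    """Step 2 of the protocol: all resources able to answer a query
    expressed against a conceptual model (e.g., 'sequence', 'gene')."""
    return sorted(name for name, caps in CAPABILITY_MAP.items()
                  if caps["provides"] == concept)

# Step 1: the scientist asks for 'sequence'; step 3 would be browsing
# the candidates and their capability entries to pick the best match.
print(candidate_resources("sequence"))  # ['Ensembl', 'GenBank']
```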
How Biological Source Capabilities May Affect Data Collection
Zoe Lacroix and Vidyadghari Edupuganti
Arizona State University
Scientific discovery relies partially on the collection of information related to multiple scientific
objects of interest (e.g., Retrieve all genes involved in brain cancer, Retrieve all citations related
to diabetes). Scientists explore multiple data sources in order to uncover relationships between
scientific objects. Each data source provides specific capabilities that allow
scientists to access, navigate, and analyze the data. This work addresses the impact of resource selection
(data source and capability) on the data collection process, as it can significantly affect the quality and
completeness of the collected data. We present preliminary research that demonstrates that the data collection
process depends on two orthogonal variables: the selection of data sources involved in the process, and
the selection of capabilities available at these resources. We report the results for four commonly used
biological resources: the NCBI Nucleotide, Protein, PubMed and OMIM databases. Keywords were used to
retrieve relevant genes from OMIM via the ESearch utility of NCBI. Then PubMed citations related to
these genes were recorded. The results collected include the number of entries, attributes and cost
(time and space). We study three paths starting with OMIM and ending with PubMed, with Nucleotide and
Protein databases as intermediate resources, and three capabilities provided by the ESearch utility. Our
preliminary results indicate that (1) for a given path, the three capabilities retrieve different results,
and (2) for a given capability, the three paths retrieve different results. We analyze how the nine data
collection processes produce nine distinct datasets with different characteristics.
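The experimental design above is a simple cross product: three source paths times three capabilities gives nine collection processes. In the sketch below the path list and capability names are placeholders, since the abstract does not spell out the third path or the specific ESearch options used.

```python
from itertools import product

# Three OMIM-to-PubMed paths crossed with three ESearch capabilities.
# The third path and all capability names are hypothetical placeholders.
paths = [
    ("OMIM", "Nucleotide", "PubMed"),
    ("OMIM", "Protein", "PubMed"),
    ("OMIM", "Nucleotide", "Protein", "PubMed"),  # hypothetical third path
]
capabilities = ["capability_1", "capability_2", "capability_3"]

# Each (path, capability) pair defines one data collection process.
processes = list(product(paths, capabilities))
print(len(processes))  # 9 distinct data collection processes
```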
Labeling and Enhancing Life Sciences Links
Stephan Heymann (1), Felix Naumann (1), Peter Rieger (1), and Louiqa Raschid (2)
(1) Humboldt-Universitat zu Berlin; (2) University of Maryland
Life sciences data sources contain data about scientific objects such as genes and sequences that are richly
interconnected, i.e., a gene object may have links to sequences, proteins, SNPs, citations, etc. Scientific
knowledge is enhanced through exploring all possible relationships between scientific objects and this requires
traversing both links and paths (informally, concatenations of links). Such exploration faces significant
limitations and challenges, since the links are inherently poor with respect to both syntactic representation
and semantic knowledge. The links are syntactically poor because the source and the target of a link are
typically specified at the level of data objects (data entries). However, most scientists understand that
the source and the target of a link are potentially at a finer level of granularity, and may correspond to
specific sub-elements or fields within these data entries. The links are semantically poor because they carry
no explicit meaning beyond the fact that the data entries are "related." This lack of syntactic and semantic
knowledge prevents the development of tools that can help scientists fully explore data sources and their interconnections.
In this project, we develop a methodology to enhance the structure and meaning associated with links: (i) a
data model that captures both the syntactic representation and the semantic knowledge associated with a link;
(ii) a concatenation operator whose properties prescribe when it is meaningful to concatenate links into
paths, and when links and paths are equivalent; and (iii) a query language and interpreter with which
scientists can meaningfully explore links and paths.
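The link model outlined above can be sketched as a small data structure: each link records sub-element granularity on both ends plus an explicit semantic label, and concatenation is defined only when one link's target meets the next link's source. All field names and identifiers below are invented for illustration.

```python
from dataclasses import dataclass

# Hedged sketch of an enhanced link: source/target objects, the
# sub-element (field) on each end, and an explicit semantic label.
@dataclass(frozen=True)
class Link:
    source: str        # data entry the link starts from
    source_field: str  # sub-element within the source entry
    target: str
    target_field: str
    meaning: str       # explicit semantic label, e.g. "encodes"

def concat(a, b):
    """Form the path a.b when a's target entry is b's source entry;
    return None when concatenation is not meaningful."""
    if a.target == b.source:
        return (a, b)
    return None

gene_to_seq = Link("Gene:G1", "id", "Seq:S1", "accession", "has_sequence")
seq_to_prot = Link("Seq:S1", "CDS", "Prot:P1", "id", "encodes")

print(concat(gene_to_seq, seq_to_prot) is not None)  # True: a valid path
print(concat(seq_to_prot, gene_to_seq) is None)      # True: not composable
```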