IEEE Computer Society Bioinformatics Conference

Tutorial Abstracts

AM 1:
Michele Markstein
UC Berkeley

Computing Non-coding cis-regulatory DNAs

Genes comprise a surprisingly small fraction of every animal genome sequenced to date. For example, in the human genome, less than 1% of the 3 billion basepairs codes for proteins. The remainder of the genome falls into two classes: 50% is composed of highly repetitive DNA (a fossil record of past invasions by transposable elements) and the other 50% contains non-coding cis-regulatory DNA elements which control gene expression both globally and locally.

Predicting which DNA sequences function as cis-regulatory elements is one of the greatest challenges of the post-genomics era. Several types of cis-regulatory DNAs have been defined experimentally--such as enhancers, core-promoters, insulators, silencers, and scaffold-attachment regions. Although we have good descriptions and several examples of these elements, they are very difficult to predict computationally.

The goal of the tutorial is to introduce you to the problem of computing non-coding cis-regulatory DNAs. In the first half of the tutorial we will survey the different categories of DNA with a special focus on cis-regulatory DNAs controlling gene expression. This part of the discussion will serve as a very brief review of basic molecular biology and should give you a solid understanding of the different types of cis-regulatory DNAs. In the second half of the tutorial we will discuss ab initio and comparative computational approaches to predicting cis-regulatory DNAs.

This tutorial is designed for computer scientists who are interested in learning about good computational problems from the perspective of a biologist. You are not expected to know any biology, but if you have some time I highly recommend reading sections of the fourth edition of Alberts' Molecular Biology of the Cell. In particular, you may enjoy the conference more if you familiarize yourself with the flow of biological information from DNA to RNA to protein.

Michele will be receiving her doctorate in Developmental Biology from the University of Chicago this May. She is currently a visiting scholar at UC Berkeley, working in the laboratory of Mike Levine. Her research focuses on deciphering cis-regulatory DNAs using the fruitfly as a model genome and model experimental system. Her work involves collaborating with computer scientists to build user-friendly bioinformatics tools and conducting wet-lab experiments to test predictions derived from those tools. She is a co-founder of opengenomics.org, which makes tools for genome biology freely available on the web.

AM 2:
Randy Gobbel, Ph.D. and Suzanne Paley, Ph.D.
SRI International, CA

The Pathway Tools Software

The Pathway Tools software, produced by Dr. Peter Karp's group at SRI International, is the software behind EcoCyc, MetaCyc, and the other BioCyc databases. It is used to generate new model organism databases that integrate information about the genome, metabolic pathways, and genetic network of an organism in the form of a Pathway/Genome Database. To date, close to 20 such databases have been created by a broad user community. The software's functionality includes prediction of metabolic pathways and operons; interactive editing tools for curation of pathways, genomes, operons, etc.; and query, visualization, analysis, and web publishing of Pathway/Genome Databases.

Topics will include:

Query, visualization, and editing capabilities
How to use Pathway Tools to predict the metabolic pathways of an organism from its genome
Global computations with Pathway/Genome databases using flatfiles, the new PerlCyc query API, and the Lisp API

Randy Gobbel is a Computer Scientist in the Bioinformatics Research Group of the Artificial Intelligence Center at SRI International. Randy has almost 30 years of experience in systems development, artificial intelligence, and computational neuroscience. His career includes 12 years as a member of the research staff at Xerox PARC, a postdoctoral fellowship at the Center for the Neural Basis of Cognition at Carnegie Mellon University, and several years as a systems architect at MDL Information Systems. Randy has a Ph.D. in Cognitive Science from the University of California at San Diego.

AM 3:
Tom Madden, Ph.D.
National Center for Biotechnology Information

BLAST: Getting the Most from Your Cycles

BLAST is one of the most popular sequence similarity search programs, and some users may do thousands of runs a day. Very few users are really aware of all BLAST features, and often it turns out that minor changes can lead to improvements such as better throughput, more sensitive searches, or more robust post-processing of results.

First we'll look at some of the different output options available with BLAST (which include the standard report, a tabular format, XML, and ASN.1) and discuss the optimal format for different situations. Second we'll discuss how to improve the throughput of BLAST searches by changing the command-line options or the program used. In this context we'll also look at how scheduling and machine resources can affect throughput. Finally we'll look at how to use some of the lesser-known features of the BLAST databases to achieve a more targeted search and discuss the overall design of the BLAST databases.
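
As a rough illustration of why the tabular format suits post-processing, the sketch below reads tab-separated hits and keeps the best-scoring hit per query. It assumes the commonly used 12-column layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score); the sample rows, the `Hit` type, and the function names are made up for this example and are not part of BLAST itself.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    """A subset of the fields in one tabular BLAST row (assumed 12-column layout)."""
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float


def parse_tabular(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per data row, skipping blank lines and '#' comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 tab-separated columns, got {len(fields)}")
        yield Hit(
            query=fields[0],
            subject=fields[1],
            percent_identity=float(fields[2]),
            alignment_length=int(fields[3]),
            evalue=float(fields[10]),
            bit_score=float(fields[11]),
        )


def best_hits(hits: Iterable[Hit], max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Keep the highest bit-score hit per query among hits passing the E-value cutoff."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query] = hit
    return best


# Made-up rows for illustration only; real input would come from a BLAST run.
SAMPLE = [
    "# hypothetical tabular output",
    "\t".join(["q1", "s1", "98.50", "200", "3", "0", "1", "200", "11", "210", "1e-100", "370"]),
    "\t".join(["q1", "s2", "80.00", "150", "30", "0", "1", "150", "5", "154", "2e-30", "120"]),
    "\t".join(["q2", "s3", "60.00", "50", "20", "0", "1", "50", "1", "50", "0.5", "25.0"]),
]

for query, hit in best_hits(parse_tabular(SAMPLE)).items():
    print(query, hit.subject, hit.bit_score)
```

Because each row is self-describing, a filter like this can run line by line without holding a whole report in memory, which is one reason a tabular format tends to be more robust for large batches than parsing the standard report.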

Tom Madden leads a team of software engineers at the NCBI that develops and maintains BLAST, a sequence similarity tool. He has been working on BLAST since 1994 at the NCBI. Before coming to the NCBI he performed post-doctoral work in Biophysics.

AM 4:
Jeffrey Chang
Biomedical Informatics, Stanford University

Python and Biopython Scripting for Busy Bioinformaticians

Scripting languages can simplify many common tasks for bioinformaticians. Language features such as built-in data types and automatic memory management free researchers from developing infrastructure not directly related to their research goals. In addition, an interpreted scripting language simplifies the traditional program development cycle and more closely matches the dynamic nature of scientific research.

Python (http://www.python.org) is an open source scripting language that is suitable for general development. It was built to be elegant, simple, and powerful; its design encourages modularity, reusability, and object-oriented programming. The designer's emphasis on readability and simplicity of syntax makes it well-suited for developing large maintained projects. Because of these strengths, Python is used increasingly in the bioinformatics community.

This tutorial will introduce the Python language. At the end of the session, attendees will be able to develop Python scripts and be familiar with modules that work with string handling, regular expressions, and web/CGI programming. In addition, the tutorial will introduce the Biopython (http://www.biopython.org) libraries for accessing bioinformatics databases and algorithms.
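
To give a flavor of the string handling and regular expression work mentioned above, here is a minimal sketch that uses only the Python standard library. The function names and the example sequence are invented for illustration; Biopython offers its own sequence classes, which are not shown here.

```python
import re

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Return the fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_sites(seq, pattern):
    """Return the 0-based start positions of non-overlapping regex matches."""
    return [match.start() for match in re.finditer(pattern, seq.upper())]


# Hypothetical sequence containing two copies of the EcoRI site GAATTC.
dna = "ATGAATTCGGCCGAATTCTA"

print(reverse_complement(dna))
print(round(gc_content(dna), 2))
print(find_sites(dna, "GAATTC"))
```

Built-in strings, slicing, and the `re` module cover a good deal of everyday sequence manipulation before any specialized library is needed.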

I am targeting this tutorial towards people without prior Python experience. Programming experience is not required, although some familiarity with another language will be helpful.

Jeffrey is finishing his Ph.D. in the Biomedical Informatics program at Stanford University. His dissertation work is focused on machine learning methods for mining and extracting information from the biological literature. He is the co-founding coordinator of the Biopython project, a popular library that simplifies many bioinformatics tasks. He discovered the joys of Python while developing software for pharmaceutical companies at the Molecular Applications Group and has never looked back.

AM 5:
M. Elizabeth Corey, Ph.D.
UCSC Extension

How Can We Measure the Distance Between Sequences?

Schrödinger once asked the question "What is life?", and the answer may be found in terms of time. To the extent that biological systems insulate their components from environmental influences, they also isolate their component rate structures. In this tutorial, we will look at the rates underlying different aspects of evolution and examine the statistical system time that emerges when components with intrinsic times interact.

Taking this as our foundation, we will then revisit various scoring formulas, distance finding methods and substitution matrices.
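
As one concrete example of a distance formula, the sketch below computes the observed proportion of differing sites (the p-distance) between two aligned sequences and applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), which estimates substitutions per site under the assumption of equal rates among all four bases. The choice of this particular model is ours for illustration; the abstract does not say which formulas the tutorial covers.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    mismatches = sum(1 for x, y in pairs if x != y)
    return mismatches / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion of differences p.

    The correction is undefined for p >= 0.75 (saturation), so infinity is returned.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


# Made-up aligned sequences differing at one of eight sites.
p = p_distance("ACGTACGT", "ACGTACGA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the observed proportion, because it accounts for multiple substitutions at the same site that a simple count cannot see.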

M. Elizabeth Corey, Ph.D. has worked at Sony Research, Oracle Corporation, Incyte Genomics, Monsanto and Finnigan as a software developer, database architect and Bioinformatics manager. She is named in patent applications for database designs at Incyte and MS quantification techniques at Finnigan. She currently teaches Bioinformatics courses at UC Santa Cruz Extension and UC Berkeley Extension and holds a Ph.D. from Northwestern University in Computer Science and Biomedical Engineering, jointly. Her primary research publications have been in the area of nonlinear dynamic systems in biology.

PM 6:
Steven Bennett, Ph.D.
Postdoc Stanford University, Department of Biochemistry

Functional Analysis of Proteins and Proteomes

Steve received his Ph.D. in Biochemistry from Stanford University. His research has focused on computational approaches for structural and functional analyses of conserved protein sequence motifs as well as visualization techniques of genomic data. His current research interests include ab initio protein structure prediction and improving data visualization techniques for genomics and structural genomics applications.

PM 7:
Michael G. Walker, Ph.D.
President, Walker Bioscience

Introduction to Statistics for Bioinformatics, with Applications to Expression Microarrays

This tutorial will introduce the most widely used statistical methods for bioinformatics, including descriptive statistics, probability, analysis of variance, discriminant analysis and cluster analysis. Examples will be drawn from biomedical cases, particularly expression microarray data analysis.

PM 8:
Xiaole Shirley Liu, Ph.D.
Biostatistical Science
Harvard School of Public Health/Dana Farber Cancer Institute

Bioinformatics Approaches for Studying Transcription Regulation and Protein-DNA Interaction

This tutorial will introduce the basic biology of transcription regulation, experimental techniques for studying it, and widely used algorithms and strategies for finding transcription factor binding motifs and sites. We will cover de novo motif finding algorithms, such as finding over-represented words, the dictionary method, the Gibbs sampler, expectation maximization, and regression methods, most of which have been successfully applied to yeast. We will also introduce strategies developed to study transcription regulation in bacteria, such as finding gapped palindrome motifs, and in higher eukaryotes, such as phylogenetic footprinting and finding motif site clusters.
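
The simplest of these ideas, finding over-represented words, can be sketched in a few lines. The example below counts every word of length k in a set of sequences and compares each count with the number expected if bases were drawn independently at their observed frequencies. This background model, the function names, and the toy sequences are assumptions made for illustration; published methods use more careful statistics.

```python
from collections import Counter
from math import prod


def count_kmers(seqs, k):
    """Count every overlapping word of length k across all sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enrichment(seqs, k):
    """Return observed/expected ratios for each word under an independent-base model."""
    base_counts = Counter("".join(seqs).upper())
    total = sum(base_counts.values())
    freqs = {base: n / total for base, n in base_counts.items()}
    # Number of positions at which a word of length k could start.
    positions = sum(max(len(seq) - k + 1, 0) for seq in seqs)
    scores = {}
    for word, observed in count_kmers(seqs, k).items():
        expected = positions * prod(freqs[base] for base in word)
        scores[word] = observed / expected
    return scores


# Toy upstream regions, each containing a planted copy of the word TGACTC.
seqs = ["AATGACTCGG", "CCTGACTCAT", "GTTGACTCTA"]

counts = count_kmers(seqs, 6)
scores = enrichment(seqs, 6)
top_word = max(scores, key=scores.get)
print(top_word, counts[top_word])
```

Word counting scales well and needs no alignment, but it treats a motif as an exact string; the sampling and expectation maximization approaches listed above relax that by modeling position-specific base preferences.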

X. Shirley Liu received her Ph.D. in Biomedical Informatics from Stanford. After finishing a research fellowship, she will become an assistant professor of Biostatistical Science at the Harvard School of Public Health and the Dana Farber Cancer Institute this July. She has been working on DNA sequence analysis, mostly on transcription factor binding site discovery and transcription regulation, and has developed many computational statistics algorithms for TF motif finding.

PM 9:
Frank Olken, Ph.D.
Lawrence Berkeley National Laboratory

Graph Data Management for Biology

Many kinds of data arising in bioinformatics can be usefully represented as graphs. Biopathways (metabolic pathways, signaling pathways, or gene regulatory networks) are among the most important examples of such graph data. Other examples include: taxonomies, e.g., of enzymes or organisms (directed acyclic graphs), protein interaction networks (undirected graphs), DNA, RNA or protein sequences (linear graphs), chemical structure graphs (undirected graphs), sequence fragment overlap graphs (interval graphs) for shotgun sequence assembly, experimental protocols (directed graphs), bibliographic citations (directed graphs), bibliographic co-citations (undirected graphs), gene co-expression (undirected graphs), genetic maps (partial orders), multiple sequence alignments (partial orders), contact graphs for protein structures (undirected graphs), etc. Graph data models have been studied in the database community for semantic data modeling, hypertext, geographic information systems, semi-structured data, XML, multi-media, etc. Graph data management has long formed the basis for chemical information retrieval systems. This tutorial is concerned with techniques for the modeling and management of such graph data.

The tutorial will cover a variety of biological graph data examples but will concentrate on issues related to biopathways. We will discuss the details of various graph data models: directed vs. undirected graphs, nested vs. unnested graphs, graphs vs. hypergraphs, labeling and type systems on edges and nodes, and specification of attributes. We will discuss both edge list and incidence (adjacency) matrix representations of graphs. We will also discuss related work on binary relational data models (NIAM), and on mappings between relational (or entity-relationship) schemas and graph data model schemas. We will discuss various measures used to characterize graphs: average node degrees, graph diameter, number of connected components, and various topological indices.
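
To make the two representations concrete, the following sketch stores a small undirected graph as an edge list, derives an adjacency matrix from it, and computes two of the measures mentioned above: average node degree and the number of connected components. The protein names and function names are hypothetical, and the code assumes a simple graph with no self-loops or repeated edges.

```python
from collections import deque


def adjacency_matrix(nodes, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return matrix


def average_degree(nodes, edges):
    """Each undirected edge contributes to the degree of two nodes."""
    return 2 * len(edges) / len(nodes)


def connected_components(nodes, edges):
    """Group nodes into connected components using breadth-first search."""
    neighbors = {node: set() for node in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in neighbors[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical protein interaction network: an edge list over five proteins.
nodes = ["P1", "P2", "P3", "P4", "P5"]
edges = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]

matrix = adjacency_matrix(nodes, edges)
components = connected_components(nodes, edges)
print(average_degree(nodes, edges), len(components))
```

The edge list grows with the number of edges and suits sparse biological networks, while the matrix grows with the square of the number of nodes but answers "are these two nodes adjacent?" in constant time.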

We will also explain a variety of graph queries: path queries, Boolean graph queries, subgraph matching queries, graph characterization queries, approximate subgraph matching queries. Then we will compare various proposals for graph query languages.

We will introduce some of the algorithms for processing graph queries: e.g., transitive closures, subgraph isomorphism queries, etc. We will introduce basic concepts of query plan specification and optimization.
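
A transitive closure answers the question "which nodes can be reached from which?" for every pair at once. The sketch below is a set-based variant of Warshall's algorithm on a directed graph; the pathway and its metabolite names are invented for illustration, and a production system would likely use a more scalable method.

```python
def transitive_closure(nodes, edges):
    """Return, for each node, the set of nodes reachable by one or more directed edges.

    Set-based form of Warshall's algorithm: after processing intermediate node k,
    reach[i] contains every node reachable from i through the intermediates seen so far.
    """
    reach = {node: set() for node in nodes}
    for u, v in edges:
        reach[u].add(v)
    for k in nodes:
        for i in nodes:
            if k in reach[i]:
                reach[i] |= reach[k]
    return reach


# Hypothetical linear pathway with one branch: A -> B -> C -> D and B -> E.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "E")]

reach = transitive_closure(nodes, edges)
print(sorted(reach["A"]))
print("A" in reach["D"])
```

Once the closure is stored, a path-existence query reduces to a set membership test, at the cost of computing and keeping the reachability sets up front.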

We will discuss the relative merits of several proposed implementation strategies: building a graph data manager on top of a relational backend, extending object-relational DBMS with graph data types and functions, etc.

We will briefly survey several graph data exchange formats.

Audience   This tutorial is intended primarily for computer scientists, bioinformaticists, database designers, and DBMS developers concerned with graph data management for biological applications. Computer-literate biologists working with graph data will also find the course useful.

Prerequisites   It will be useful for students to have some previous exposure to concepts of relational databases (e.g., relational or entity-relationship schemas, SQL, relational algebra). No previous knowledge of graph theory is assumed. See also the discussion of the related morning tutorial below.

Related Tutorial   Some students may also wish to take the morning tutorial AM 2, The Pathway Tools Software, by Randy Gobbel and Suzanne Paley of SRI. Both tutorials are concerned with storing and querying biopathways information. Our tutorial approaches these topics from a database management perspective, while their tutorial approaches similar topics from a knowledge representation perspective.

Notes   Enrolled tutorial attendees will be provided with hardcopies of the tutorial slides.

Frank Olken holds a Ph.D. from U.C. Berkeley in Computer Science and works in the scientific data management group at Lawrence Berkeley National Laboratory in Berkeley, California. He has taught data management for UC Berkeley, and tutorials on XML Schema Language and Work Flow Management Systems for UC Berkeley Extension. He also served on the W3C committees for standardization of RDF Schemas and XML Schema Language. He has worked on various topics in file migration, statistical and scientific data management, bioinformatics and computational biology, and workflow management. He currently leads the Biopathways Graph Data Manager Project at LBNL for the Arkin Lab.

PM 10:
David Keys, Ph.D.
Joint Genome Institute, Walnut Creek, CA

An Introduction to Comparative Genomics

The human genome is only one of a steadily increasing number of animal, plant and microorganism genomes to be sequenced. Comparison of these genomes has proven to be one of the most powerful ways to find and annotate functionally important DNA domains. This tutorial will introduce both evolutionary models and bioinformatic tools being used to identify protein coding and regulatory sequences using comparative genomics. A background in biology/genetics will not be assumed.

Lecture outline   This tutorial will not assume more than a basic biology/genetics background.

Large number of genomes becoming available
The reason for sequencing multiple genomes is that comparisons are informative.
Phylogenetics is the science of comparing genetic data in an evolutionary context.
Tools for sequence comparison.
Genetic drift and neutral selection.
Model that evolutionary conservation implies functional importance.
Evolutionary models and conservation of gene families.
Searching for sequence conservation/similarity.
Conservation of coding sequence.
Conservation of genomic structure.
Conservation of regulatory sequence.
Tutorial will concentrate on animal genomes.
Tutorial will utilize real and experimental data.
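
The idea that conservation implies functional importance can be illustrated with a sliding-window scan over a pairwise alignment. The sketch below reports the fraction of identical columns in each window and lists the windows at or above a threshold. The sequences, window size, threshold, and function names are all made up for this example; real comparative tools use proper alignment and evolutionary models.

```python
def window_identity(a, b, window, step=1):
    """Return (start, fraction identical) for each window over two aligned sequences.

    Columns containing a gap ('-') are counted as non-identical.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if window <= 0 or window > len(a):
        raise ValueError("window must be between 1 and the alignment length")
    matches = [x == y and x != "-" for x, y in zip(a.upper(), b.upper())]
    return [
        (start, sum(matches[start:start + window]) / window)
        for start in range(0, len(a) - window + 1, step)
    ]


def conserved_windows(a, b, window, threshold):
    """Return start positions of windows whose identity meets the threshold."""
    return [start for start, identity in window_identity(a, b, window) if identity >= threshold]


# Hypothetical aligned sequences from two species: identical for the first
# eight columns, then diverged.
species_a = "ACGTACGTTTTT"
species_b = "ACGTACGTAAAA"

print(window_identity(species_a, species_b, window=4, step=4))
print(conserved_windows(species_a, species_b, window=4, threshold=1.0))
```

Plotting such window scores along a genomic region is a common first look at where coding or regulatory sequence may have been preserved between species.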

David N. Keys received his doctorate from the Department of Genetics at the University of Wisconsin at Madison. He did his postdoc work in the Molecular and Cellular Biology Department at the University of California at Berkeley where he was a campus fellow with the Miller Institute for Research in Basic Science. He currently works at the Department of Energy's Joint Genome Institute in Walnut Creek, CA, where he is a group leader in the Genomics Technology Division. His research involves large scale experimental screens for DNA with cis-regulatory activity.
