Tutorial Abstracts
AM 1: Michele Markstein UC Berkeley
Computing Non-coding cis-regulatory DNAs
Genes comprise a surprisingly small fraction of every animal genome
sequenced to date. For example, in the human genome, less than 1% of the 3
billion base pairs codes for proteins. The remainder of the genome falls
into two classes: 50% is composed of highly repetitive DNA (a fossil record
of past invasions by transposable elements) and the other 50% contains
non-coding cis-regulatory DNA elements which control gene expression both
globally and locally.
Predicting which DNA sequences function as cis-regulatory elements
is one of the greatest challenges of the post-genomics era. Several
types of cis-regulatory DNAs have been defined experimentally, such
as enhancers, core promoters, insulators, silencers, and scaffold-attachment
regions. Although we have good descriptions and several examples of
these elements, they are very difficult to predict computationally.
The goal of the tutorial is to introduce you to the problem of computing
non-coding cis-regulatory DNAs. In the first half of the tutorial
we will survey the different categories of DNA with a special focus
on cis-regulatory DNAs controlling gene expression. This part of the
discussion will serve as a very brief review of basic molecular biology
and should give you a solid understanding of the different types of
cis-regulatory DNAs. In the second half of the tutorial we will discuss
ab initio and comparative computational approaches to predicting cis-regulatory
DNAs.
This tutorial is designed for computer scientists who are interested in
learning about good computational problems from the perspective of a
biologist. You are not expected to know any biology, but if you have some
time I highly recommend reading sections of the fourth edition of Alberts'
Molecular Biology of the Cell. In particular, you may enjoy the conference
more if you familiarize yourself with the flow of biological information
from DNA to RNA to Protein.
Michele will be receiving her doctorate in Developmental Biology from the
University of Chicago this May. She is currently a visiting scholar at UC
Berkeley, working in the laboratory of Mike Levine. Her research focuses
on deciphering cis-regulatory DNAs using the fruit fly as a model genome
and model experimental system. Her work involves collaborating with
computer scientists to build user-friendly bioinformatics tools and
conducting wet-lab experiments to test predictions derived
from those tools. She is a co-founder of opengenomics.org, which makes
tools for genome biology freely available on the web.
AM 2:
Randy Gobbel, Ph.D. and Suzanne Paley, Ph.D.
SRI International, CA
The Pathway Tools Software
The Pathway Tools software, produced by Dr. Peter Karp's group at
SRI International, is the software behind EcoCyc,
MetaCyc, and the other BioCyc databases. The software is used to
generate new model organism databases that integrate information about
the genome, metabolic pathways, and genetic network of an organism in
the form of a Pathway/Genome Database. To date, close to 20 such
databases have been created by a broad user community. The software
provides a range of functions, including prediction of metabolic
pathways and operons; interactive editing tools for curation of
pathways, genomes, operons, etc.; and query, visualization, analysis,
and web publishing of Pathway/Genome Databases.
Topics will include:
- Query, visualization, and editing capabilities
- How to use Pathway Tools to predict the metabolic pathways of an
organism from its genome
- Global computations with Pathway/Genome databases using flatfiles,
the new PerlCyc query API, and the Lisp API
Randy Gobbel is a Computer Scientist in the Bioinformatics Research Group of the
Artificial Intelligence Center at SRI International. Randy has almost 30 years of
experience in systems development, artificial intelligence, and computational neuroscience.
His career includes 12 years as a member of the research staff at Xerox PARC, a postdoctoral
fellowship at the Center for the Neural Basis of Cognition at Carnegie Mellon University,
and several years as a systems architect at MDL Information Systems. Randy has a Ph.D.
in Cognitive Science from the University of California at San Diego.
AM 3: Tom Madden, Ph.D. National Center for Biotechnology Information
BLAST: Getting the Most from Your Cycles
BLAST is one of the most popular sequence similarity search programs, and
some users may do thousands of runs a day. Few users are aware of all of
BLAST's features, and minor changes can often lead to
improvements such as better throughput, more sensitive searches, or more
robust post-processing of results.
First we'll look at some of the different output options available with
BLAST (which include the standard report, a tabular format, XML, and ASN.1)
and discuss the optimal format for different situations. Second we'll
discuss how to improve the throughput of BLAST searches by changing the
command-line options or the program used. In this context we'll also look
at how scheduling and machine resources can affect throughput. Finally we'll
look at how to use some of the lesser-known features of the BLAST
databases to achieve a more targeted search and discuss the overall design
of the BLAST databases.
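As a hedged illustration of why the tabular format suits automated post-processing, the sketch below parses tab-delimited hits in Python and keeps the best hit per query. It assumes the conventional 12-column tabular layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, e-value, bit score). The sample hits, the cutoff, and the names Hit, parse_tabular and best_hits are made up for illustration and are not taken from the tutorial.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float
    length: int
    evalue: float
    bitscore: float


def parse_tabular(handle) -> Iterator[Hit]:
    """Yield one Hit per line of 12-column tabular BLAST output (assumed layout)."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        yield Hit(row[0], row[1], float(row[2]), int(row[3]),
                  float(row[10]), float(row[11]))


def best_hits(hits, max_evalue=1e-5):
    """Keep the highest-scoring hit per query among hits under an e-value cutoff."""
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


# Made-up example data, not real BLAST results.
SAMPLE = (
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t370\n"
    "q1\ts2\t80.0\t150\t30\t0\t1\t150\t9\t158\t2e-20\t95.1\n"
    "q2\ts3\t60.0\t50\t20\t0\t1\t50\t1\t50\t0.5\t22.0\n"
)

if __name__ == "__main__":
    for query, hit in best_hits(parse_tabular(io.StringIO(SAMPLE))).items():
        print(query, hit.subject, hit.evalue)
```

Reading the hits as a stream keeps memory use flat even for very large result files, which is one reason a line-oriented format can be preferable to the standard report when results feed another program.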
Tom Madden leads a team of software engineers at the NCBI that develops
and maintains BLAST, a sequence similarity tool. He has been working on
BLAST since 1994 at the NCBI. Before coming to the NCBI he performed
post-doctoral work in Biophysics.
AM 4:
Jeffrey Chang Biomedical Informatics, Stanford University
Python and Biopython Scripting for Busy Bioinformaticians
Scripting languages can simplify many common tasks for bioinformaticians.
Language features such as built-in data types and automatic memory management free researchers from
developing infrastructure not directly related to their research
goals. In addition, an interpreted scripting language simplifies the
traditional program development cycle and more closely matches the
dynamic nature of scientific research.
Python (http://www.python.org) is an open source scripting language that is suitable for general
development. It was built to be elegant, simple, and powerful; its design encourages
modularity, reusability, and object-oriented programming. The
designer's emphasis on readability and simplicity of syntax makes it
well-suited for developing large maintained projects. Because of these
strengths, Python is used increasingly in the bioinformatics
community.
This tutorial will introduce the Python language. At the
end of the session, attendees will be able to develop Python scripts
and be familiar with modules that work with string handling,
regular expressions, and web/CGI programming. In addition, the
tutorial will introduce the Biopython (http://www.biopython.org)
libraries for accessing bioinformatics databases and algorithms.
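To give a flavor of the string handling and regular expression features mentioned above, here is a minimal sketch that uses only the Python standard library. The function names and the example sequence are illustrative assumptions, not material from the tutorial, and the Biopython libraries are deliberately not used here.

```python
import re
from typing import List

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def find_sites(seq: str, pattern: str) -> List[int]:
    """0-based start positions of all (possibly overlapping) regex matches."""
    # Wrapping the pattern in a lookahead lets overlapping matches be reported.
    return [m.start() for m in re.finditer(f"(?={pattern})", seq)]


if __name__ == "__main__":
    dna = "GGAATTCCGAATTC"  # made-up sequence
    print(reverse_complement(dna))
    print(gc_content(dna))
    print(find_sites(dna, "GAATTC"))  # GAATTC is the EcoRI recognition site
```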
This tutorial is aimed at people without prior Python
experience. Programming experience is not required,
although some familiarity with another language will be helpful.
Jeffrey is finishing his Ph.D. in the Biomedical Informatics program
at Stanford University. His dissertation work is focused on machine
learning methods for mining and extracting information from the
biological literature. He is the co-founding coordinator of the
Biopython project, a popular library that simplifies many
bioinformatics tasks. He discovered the joys of Python while
developing software for pharmaceutical companies at the Molecular
Applications Group and has never looked back.
AM 5:
M. Elizabeth Corey, Ph.D.
UCSC Extension
How Can We Measure the Distance Between Sequences?
Schrödinger once asked the question "What is life?" and it may be that
the answer can be found in terms of time. To the extent that biological
systems insulate their components from environmental influences, they
also isolate their component rate structures. In this tutorial, we will
look at the rates underlying different aspects of evolution and examine
the statistical system time that emerges when components with intrinsic
times interact.
Taking this as our foundation, we will then revisit various scoring
formulas, distance finding methods and substitution matrices.
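As one concrete, hedged example of a distance measure, the sketch below computes the proportion of differing sites (the p-distance) between two aligned sequences and applies the standard Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p). This is only one textbook model, chosen for illustration; the abstract does not say which scoring formulas or distance methods the tutorial covers, and the function names and sequences are assumptions.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences.

    Columns containing a gap ('-') in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    if p >= 0.75:
        return math.inf  # correction is undefined: the sites look saturated
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


if __name__ == "__main__":
    s1 = "ACGTACGT"  # made-up aligned sequences
    s2 = "ACGTACGA"
    p = p_distance(s1, s2)
    print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the raw proportion, because it estimates substitutions that occurred but were hidden by later changes at the same site.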
M. Elizabeth Corey, Ph.D. has worked at Sony Research, Oracle Corporation, Incyte
Genomics, Monsanto and Finnigan as a software developer, database
architect and Bioinformatics manager. She is named in patent
applications for database designs at Incyte and MS quantification
techniques at Finnigan. She currently teaches Bioinformatics courses at
UC Santa Cruz Extension and UC Berkeley Extension and holds a Ph.D. from
Northwestern University in Computer Science and Biomedical Engineering,
jointly. Her primary research publications have been in the area of
nonlinear dynamic systems in biology.
PM 6: Steven Bennett, Ph.D.
Postdoc Stanford University,
Department of Biochemistry
Functional Analysis of Proteins and Proteomes
Steve received his Ph.D. in Biochemistry from Stanford University. His
research has focused on computational approaches for structural and
functional analyses of conserved protein sequence motifs as well as
visualization techniques of genomic data. His current research interests
include ab initio protein structure prediction and improving data
visualization techniques for genomics and structural genomics applications.
PM 7: Michael G. Walker, Ph.D. President, Walker Bioscience
Introduction to Statistics for Bioinformatics, with Applications to Expression Microarrays
This tutorial will introduce the most widely used statistical methods for bioinformatics, including descriptive statistics, probability, analysis of variance, discriminant analysis and cluster analysis.
Examples will be drawn from biomedical cases, particularly expression microarray data analysis.
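As a small, hedged illustration of one of the listed methods, the sketch below computes the F statistic of a one-way analysis of variance for a single gene measured under several conditions. The expression values are invented, the function name is an assumption, and the code stops at the F statistic rather than computing a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples (one per group)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand = mean([x for g in groups for x in g])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within  # fails if there is no within-group variation


if __name__ == "__main__":
    # Made-up expression values for one gene under three conditions.
    conditions = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
    print(one_way_anova_f(conditions))
```

A large F indicates that the differences between condition means are large relative to the scatter within each condition.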
PM 8:
Xiaole Shirley Liu, Ph.D.
Biostatistical Science
Harvard School of Public Health/Dana Farber Cancer Institute
Bioinformatics Approaches for Studying Transcription Regulation and Protein-DNA Interaction
This tutorial will introduce the basic biology of transcription
regulation, experimental techniques for studying transcription regulation,
and widely used algorithms and strategies for finding transcription
factor binding motifs and sites. We will cover de novo motif finding
algorithms such as finding over-represented words, the dictionary method,
the Gibbs sampler, expectation maximization, regression methods, etc., most
of which have been successfully applied to yeast. We will also introduce
strategies developed to study transcription regulation in bacteria
such as finding gapped palindrome motifs, and in higher eukaryotes
such as phylogenetic footprinting and finding motif site clusters.
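The simplest of the de novo approaches named above, finding over-represented words, can be sketched as follows. This is a deliberately crude illustration: it ranks k-mers by the ratio of observed to expected counts, where the expected count assumes independent bases with the frequencies seen in the input. That background model, the function names, and the example sequences are assumptions made here, not the tutorial's method.

```python
from collections import Counter
from math import prod


def count_kmers(seqs, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enrichment(seqs, k):
    """Rank k-mers by observed/expected count, highest ratio first.

    Expected counts assume independent bases drawn with the single-base
    frequencies observed in the input (a crude background model).
    """
    seqs = [s.upper() for s in seqs]
    bases = Counter("".join(seqs))
    total = sum(bases.values())
    freq = {b: n / total for b, n in bases.items()}
    windows = sum(max(len(s) - k + 1, 0) for s in seqs)
    scored = []
    for kmer, observed in count_kmers(seqs, k).items():
        expected = windows * prod(freq[b] for b in kmer)
        scored.append((observed / expected, observed, kmer))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Made-up upstream sequences, not real data.
    upstream = ["TTGGGAAATTCC", "ACGGGAAATCCA", "GGGAAATCCGTA"]
    for ratio, observed, kmer in enrichment(upstream, 6)[:3]:
        print(kmer, observed, round(ratio, 1))
```

In practice the background is usually estimated from a separate set of sequences rather than from the input itself, since a strongly enriched motif distorts the base frequencies it is being compared against.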
X. Shirley Liu received her Ph.D. in Biomedical Informatics from
Stanford. After finishing a research fellowship, she will become an
assistant professor in Biostatistical Science at the Harvard School of
Public Health and the Dana Farber Cancer Institute this July. She has been
working on DNA sequence analysis, mostly on transcription factor binding
site discovery and transcription regulation, and has developed many
computational statistics algorithms for TF motif finding.
PM 9:
Frank Olken, Ph.D.
Lawrence Berkeley National Laboratory
Graph Data Management for Biology
Many kinds of data arising in bioinformatics can be usefully represented as graphs. Biopathways
(metabolic pathways, signaling pathways, or gene regulatory networks) are among the most
important examples of such graph data. Other examples include: taxonomies, e.g., of enzymes
or organisms (directed acyclic graphs), protein interaction networks (undirected graphs), DNA,
RNA or protein sequences (linear graphs), chemical structure graphs (undirected graphs), sequence
fragment overlap graphs (interval graphs) for shotgun sequence assembly, experimental protocols
(directed graphs), bibliographic citations (directed graphs), bibliographic co-citations
(undirected graphs), gene co-expression (undirected graphs), genetic maps (partial orders),
multiple sequence alignments (partial orders), contact graphs for protein structures
(undirected graphs), etc. Graph data models have been studied in the database community
for semantic data modeling, hypertext, geographic information systems, semi-structured data,
XML, multi-media, etc. Graph data management has long formed the basis for chemical information
retrieval systems. This tutorial is concerned with techniques for the modeling and management
of such graph data.
The tutorial will cover a variety of biological graph data examples, but concentrate on issues
related to biopathways. We will discuss the details of various graph data models: directed vs.
undirected graphs, nested vs. unnested graphs, graphs vs. hypergraphs, labeling and type systems
on edges and nodes, and specification of attributes. We will discuss both edge list and incidence
(adjacency) matrix representations of graphs. We will also discuss related work on binary
relational data models (NIAM), and on mappings between relational (or entity-relationship)
schemas and graph data model schemas. We will discuss various measures used to characterize
graphs: average node degrees, graph diameter, number of connected components, various topological indices.
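The two representations and a few of the measures mentioned above can be sketched in a few lines of Python. This is an illustrative sketch for simple undirected graphs only; the function names and the toy graph are assumptions, and a real graph data manager would use far more scalable structures.

```python
from collections import deque


def adjacency_matrix(nodes, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return matrix


def neighbors(nodes, edges):
    """Adjacency sets, a compact alternative to the matrix for sparse graphs."""
    adj = {node: set() for node in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def bfs_distances(adj, start):
    """Shortest-path length (in edges) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def graph_measures(nodes, edges):
    """Average degree, number of connected components, and the longest finite
    shortest path (which is the diameter when the graph is connected)."""
    if not nodes:
        raise ValueError("graph has no nodes")
    adj = neighbors(nodes, edges)
    avg_degree = sum(len(n) for n in adj.values()) / len(nodes)
    seen, components, longest = set(), 0, 0
    for node in nodes:
        dist = bfs_distances(adj, node)
        longest = max(longest, max(dist.values()))
        if node not in seen:  # first node met in a new component
            components += 1
            seen.update(dist)
    return avg_degree, components, longest


if __name__ == "__main__":
    # Toy graph: a chain A-B-C-D plus an isolated node E.
    nodes = ["A", "B", "C", "D", "E"]
    edges = [("A", "B"), ("B", "C"), ("C", "D")]
    print(adjacency_matrix(nodes, edges))
    print(graph_measures(nodes, edges))
```

The edge list is compact and maps naturally onto a two-column relational table, while the matrix gives constant-time edge lookups at the cost of space quadratic in the number of nodes.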
We will also explain a variety of graph queries: path queries, Boolean graph queries, subgraph
matching queries, graph characterization queries, approximate subgraph matching queries. Then
we will compare various proposals for graph query languages.
We will introduce some of the algorithms for processing graph queries: e.g., transitive closures,
subgraph isomorphism queries, etc. We will introduce basic concepts of query plan specification and optimization.
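As a hedged example of one such algorithm, the sketch below computes the transitive closure of a small directed graph with Warshall's algorithm, using Python sets. Once the closure is known, a path query such as "can D be reached from A?" becomes a simple membership test. The function name and the toy pathway are illustrative assumptions; this is not a description of any particular graph data manager.

```python
def transitive_closure(nodes, edges):
    """Return the set of (u, v) pairs such that v is reachable from u
    along one or more directed edges (Warshall's algorithm)."""
    nodes = list(nodes)
    reach = {u: set() for u in nodes}
    for u, v in edges:
        reach[u].add(v)
    for k in nodes:  # allow k as an intermediate node on a path
        for i in nodes:
            if k in reach[i]:
                reach[i] |= reach[k]
    return {(u, v) for u in nodes for v in reach[u]}


if __name__ == "__main__":
    # Toy pathway: A is converted to B, B to C, and C to D.
    steps = [("A", "B"), ("B", "C"), ("C", "D")]
    closure = transitive_closure("ABCD", steps)
    print(("A", "D") in closure)  # path query: is D downstream of A?
    print(("D", "A") in closure)
```

Materializing the whole closure trades space for query speed, so for large graphs it may be preferable to answer individual path queries by search instead.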
We will discuss the relative merits of several proposed implementation strategies: building a
graph data manager on top of a relational backend, extending object-relational DBMS with graph
data types and functions, etc.
We will briefly survey several graph data exchange formats.
Audience
This tutorial is intended primarily for computer scientists, bioinformaticists, database designers,
and DBMS developers concerned with graph data management for biological applications.
Computer-literate biologists working with graph data will also find the course useful.
Prerequisites
It will be useful for students to have some previous exposure to concepts of relational databases
(e.g., relational or entity-relationship schemas, SQL, relational algebra). No previous knowledge
of graph theory is assumed. See also discussion of the related morning tutorial below.
Related Tutorial
Some students may also wish to take the AM 2 tutorial, The Pathway Tools Software, by Randy Gobbel
and Suzanne Paley from SRI, in the morning session. Both tutorials are concerned with storing and
querying biopathways information. Our tutorial approaches these topics from a database management
perspective, while their tutorial approaches similar topics from a knowledge representation perspective.
Notes
Enrolled tutorial attendees will be provided with hard copies of the tutorial slides.
Frank Olken holds a Ph.D. from U.C. Berkeley in Computer Science and works in the scientific data
management group at Lawrence Berkeley National Laboratory in Berkeley, California. He has taught
data management for UC Berkeley, and tutorials on XML Schema Language and Work Flow Management
Systems for UC Berkeley Extension. He also served on the W3C Committees for standardization of
RDF Schemas and XML Schema Language. He has worked on various topics in file migration, statistical
and scientific data management, bioinformatics and computational biology, and workflow management.
He currently leads the Biopathways Graph Data Manager Project at LBNL for the Arkin Lab.
PM 10:
David Keys, Ph.D.
Joint Genome Institute, Walnut Creek, CA
An Introduction to Comparative Genomics
The human genome is only one of a steadily increasing number of animal, plant and microorganism
genomes to be sequenced. Comparison of these genomes has proven to be one of the most powerful
ways to find and annotate functionally important DNA domains. This tutorial will introduce both
evolutionary models and bioinformatic tools being used to identify protein coding and regulatory
sequences using comparative genomics. No more than a basic background in biology/genetics
will be assumed.
Lecture outline
- Large number of genomes becoming available
- The reason for sequencing multiple genomes is that comparisons are informative.
- Phylogenetics is the science of comparing genetic data in an evolutionary context.
- Tools for sequence comparison.
- Genetic drift and neutral selection.
- Model that evolutionary conservation implies functional importance.
- Evolutionary models and conservation of gene families.
- Searching for sequence conservation/similarity.
- Conservation of coding sequence.
- Conservation of genomic structure.
- Conservation of regulatory sequence.
- Tutorial will concentrate on animal genomes.
- Tutorial will utilize real and experimental data.
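One simple way to search for sequence conservation, offered here only as an illustrative sketch, is to slide a window along a pairwise alignment and report the percent identity in each window. The function names, window size, threshold, and toy alignment below are assumptions, and the abstract does not say which tools the tutorial itself will use.

```python
def window_identity(aln_a, aln_b, window=10, step=1):
    """Percent identity in each sliding window of a pairwise alignment.

    Returns a list of (start_column, percent_identity). Columns where either
    sequence has a gap ('-') count as mismatches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    a, b = aln_a.upper(), aln_b.upper()
    match = [x == y and x != "-" for x, y in zip(a, b)]
    result = []
    for start in range(0, len(match) - window + 1, step):
        identical = sum(match[start:start + window])
        result.append((start, 100.0 * identical / window))
    return result


def conserved_windows(aln_a, aln_b, window=10, threshold=80.0):
    """Start columns of windows at or above the identity threshold."""
    return [start for start, pid in window_identity(aln_a, aln_b, window)
            if pid >= threshold]


if __name__ == "__main__":
    # Toy alignment: a conserved left half and a diverged right half.
    species_1 = "AAAAACCCCC"
    species_2 = "AAAAAGGGGG"
    print(window_identity(species_1, species_2, window=5))
    print(conserved_windows(species_1, species_2, window=5, threshold=80.0))
```

Under the model listed in the outline, windows that stay above the threshold across distantly related species are candidates for functionally important sequence, whether coding or regulatory.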
David N. Keys received his doctorate from the Department of Genetics at the
University of Wisconsin at Madison. He did his postdoc work in the Molecular
and Cellular Biology Department at the University of California at Berkeley where
he was a campus fellow with the Miller Institute for Research in Basic Science.
He currently works at the Department of Energy's Joint Genome Institute in Walnut
Creek, CA, where he is a group leader in the Genomics Technology Division. His research
involves large scale experimental screens for DNA with cis-regulatory activity.