Sandrine Dudoit
"Statistical Approaches for Meeting Challenges in the Analysis and Interpretation of mRNA-Seq"
Abstract
For the past decade, microarrays have been the assays of choice for high-throughput studies of gene expression. Recent improvements in the
efficiency, quality, and cost of genome-wide sequencing have prompted biologists to rapidly abandon microarrays in favor of ultra high-throughput
sequencing, also known as second-generation or next-generation sequencing: e.g., Applied Biosystems' SOLiD, Helicos BioSciences' HeliScope, Illumina's
Genome Analyzer, and Roche's 454 Life Sciences sequencing systems. These high-throughput sequencing technologies have already been applied to
monitor genome-wide transcription levels (mRNA-Seq), DNA-protein interactions (ChIP-Seq), DNA copy number (DNA-Seq), chromatin structure, and
DNA methylation. While sequencing-based gene expression studies have been touted as overcoming longstanding limitations of microarray-based
studies, these new biotechnologies raise similar as well as novel statistical and computational challenges, in areas such as image analysis,
base-calling, read-mapping, and (differential) expression inference.
This talk concerns statistical methods and software for the analysis of high-throughput transcriptome sequencing (mRNA-Seq) data, with emphasis on mapped reads
from the Illumina Genome Analyzer. We address the following main questions, which trace the process of deriving accurate measures of (differential) expression
for genomic regions of interest (ROI) such as individual exons or multiple isoforms of a given gene:
1. Experimental design: Guidelines for the effective allocation of input mRNA samples (e.g., in terms of library preparation, flow-cells, lanes) and the use of control sequences.
2. Exploratory data analysis: Toolbox of numerical and graphical summaries for mapped reads to detect the main as well as aberrant features of mRNA-Seq data.
3. Normalization and expression quantitation: Methods for inferring ROI-level expression from base-level mapped read counts, while adjusting for experimental/technical
effects (e.g., library preparation/flow-cell/lane) and sequence-specific effects (e.g., GC-content).
4. Differential expression: Methods for inferring differential expression between ROI and/or input samples.
5. Software: Open-source statistical software implementing the methodology discussed above.
We report on our investigation of several mRNA-Seq datasets, in organisms from yeast to human: inference of (differential) gene expression in reference samples
from the MicroArray Quality Control (MAQC) Project; genome annotation and the discovery of novel transcripts in Saccharomyces cerevisiae; evolutionary genetics
based on allele-specific expression in a Saccharomyces diploid hybrid; and regulation of alternative splicing in Drosophila melanogaster.
References (manuscripts and presentations) are posted on the website: www.stat.berkeley.edu/~sandrine.
Biography
Sandrine Dudoit is Associate Professor of Biostatistics and Statistics
and Chair of the Graduate Group in Biostatistics at the University of
California, Berkeley. Professor Dudoit's research and teaching
activities concern the development and application of statistical and
computational methods for the analysis of biomedical and genomic data.
Her methodological research interests concern high-dimensional
inference and include loss-based estimation with cross-validation
(classification and regression, density estimation, model selection)
and multiple hypothesis testing. Much of her methodological work is
motivated by statistical inference questions arising in biological
research, including: the design and analysis of high-throughput
microarray and sequencing gene expression experiments; nucleotide and
protein sequence analysis; the genetic mapping of complex human
traits; and biological annotation metadata analysis. Professor Dudoit is
also interested in statistical computing and is a founding core
developer of the Bioconductor Project, an open-source software project
for the analysis of biological data (www.bioconductor.org). She is a
co-author of the book "Multiple Testing Procedures with Applications
to Genomics" and a co-editor of the book "Bioinformatics and
Computational Biology Solutions Using R and Bioconductor". She is
Associate Editor of six journals, including "The Annals of Applied
Statistics", "BMC Bioinformatics", "Statistical Applications in
Genetics and Molecular Biology", and "IEEE/ACM Transactions on
Computational Biology and Bioinformatics".
Professor Dudoit obtained Bachelor's (1992) and Master's (1994)
degrees in Mathematics from Carleton University, Ottawa, Canada. She
first came to UC Berkeley as a graduate student and earned a PhD
degree in 1999 from the Department of Statistics. Her doctoral
research, under the supervision of Professor Terence P. Speed,
concerned the linkage analysis of complex human traits. From 1999 to
2000, she was a postdoctoral fellow at the Mathematical Sciences
Research Institute, Berkeley. Before joining the Faculty at UC
Berkeley in July 2001, she underwent a year of postdoctoral training
in genomics in the laboratory of Professor Patrick O. Brown,
Department of Biochemistry, Stanford University. Her work in the
Brown Lab involved the development and application of statistical
methods and software for the analysis of microarray gene expression
data.