Q-value Method May Not Always Control False Discovery Rate in Genomics Applications
Xiao Yang
Monsanto Company
The q-value method of Storey (2002) has been shown to be theoretically sound for controlling the false
discovery rate (FDR) in many functional genomics applications. However, empirical evidence suggests
that this method can be more stringent than alternatives such as the Bonferroni adjustment and the FDR
procedure of Benjamini and Hochberg (1995). We compare these methods for detecting differential gene
expression in microarray data analysis. For gene-discovery experiments, where many genes are expected
to be differentially expressed across experimental conditions, the q-value method generally performs well.
However, for experiments in which only a few genes are expected to be differentially expressed, the
q-value method performs much worse than the other methods. Some insights are provided to explain this
discrepancy, and adjustments to the q-value method are recommended to accommodate a wider range of
applications.
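For reference, the two comparison procedures named in the abstract can be sketched as follows. This is a generic illustration with synthetic p-values, not the authors' code; the Bonferroni adjustment controls family-wise error, while the Benjamini-Hochberg step-up procedure controls FDR.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0 where p <= alpha / m (family-wise error rate control)."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg (1995) step-up procedure controlling FDR at alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # find the largest k with p_(k) <= (k/m) * alpha, then reject p_(1)..p_(k)
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))
        reject[order[: k + 1]] = True
    return reject
```

With p-values (0.01, 0.02, 0.03, 0.5) at alpha = 0.05, Bonferroni rejects only the first hypothesis, while Benjamini-Hochberg rejects the first three, illustrating the difference in stringency the abstract discusses.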
Application of A Genetic Algorithm to the Classification of Renal Cell Carcinoma
Dongqing Liu, Zhong-Hui Duan, Jianping Zhu, and Ting Shi
University of Akron, Department of Computer Science
Renal cell carcinoma (RCC) consists of several subtypes, of which clear cell RCC is the most common.
Its genetic alterations are characterized by mutation or hypermethylation of the Von Hippel-Lindau gene.
Papillary RCC, the second most common subtype, shows different genetic alterations, involving activation
of the MET proto-oncogene and trisomy of chromosomes 7 and 17. It is expected that
the gene expression profiles of the two subtypes of tumors are also distinctive and the subtypes can be
identified on the basis of the expressions of a panel of genes. The objective of this study is to identify
the panel of discriminator genes using a genetic algorithm (GA). In this study, we apply a GA to a set of
microarray gene expression profiles of nine samples (three clear cell tumors and six papillary tumors).
We show that the GA can be efficiently used in identifying a set of discriminator genes. To test the robustness
of the algorithm, we perform a bootstrapping analysis that removes one sample from the data set at a time
and uses the remaining samples for gene selection. We show that each of the removed samples can be classified
correctly. We also demonstrate the effect of different distance metrics and evolutionary changes in the
algorithm such as mutation and cross-over on the fitness function and convergence rate. Furthermore, we use
hierarchical clustering to investigate the selected genes and the singular value decomposition to visualize
the correlation of the expressions of these genes.
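A minimal GA of the kind described above can be sketched as follows. The fitness function (class-centroid separation over the selected genes) and all parameters are assumptions for illustration; the abstract does not specify the authors' exact choices.

```python
import numpy as np

def fitness(mask, X, y):
    """Separation of the two class centroids over the selected genes,
    normalized by the number of genes chosen. This fitness function is
    an assumption; the abstract does not give the authors' exact one."""
    k = int(mask.sum())
    if k == 0:
        return 0.0
    Xs = X[:, mask.astype(bool)]
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    return float(np.linalg.norm(c0 - c1) / np.sqrt(k))

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Evolve binary gene-inclusion masks by truncation selection,
    one-point crossover, and bit-flip mutation."""
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_genes))
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(scores)[-(pop_size // 2):]]  # better half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = int(rng.integers(1, n_genes))          # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flips = rng.random(n_genes) < p_mut          # bit-flip mutation
            children.append(np.where(flips, 1 - child, child))
        pop = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[int(np.argmax(scores))]
```

The leave-one-out test in the abstract would wrap `ga_select` in a loop that drops one sample, selects genes on the rest, and classifies the held-out sample.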
Mining Estrogen Microarray Data: An Approach Using Contrast Data Analysis
Haili Jiao, Peixin Yang, Z. Chen
University of Nebraska at Omaha
Although data mining methods have been widely applied to analyze microarray data, existing data mining
techniques such as classification, clustering, association and discriminative analysis can only accomplish
part of the job. Unique features of microarray data demand a more systematic way of applying data mining
techniques. New frameworks or guidelines are thus needed. Toward this goal, we have explored an approach
called contrast data analysis, which makes use of an extended relational algebra operator and incorporates
considerations from granular computing (GrC). We have applied this approach for mining estrogen regulation
from the microarray data obtained from Affymetrix chips. A two-phase methodology is employed, and two
levels of granulations are implemented. At the first level, classification rules mining and association
rules mining are used to obtain a meaningful general view of the microarray data. At the second level,
comparative queries are designed to retrieve individual genes satisfying certain regulation patterns.
Through the analysis, we found that estrogen regulated many key components of several important protein
kinase signaling pathways, mainly PI-3 kinase and NF-kappaB, which are involved in cell survival.
Furthermore, two estrogen-responsive genes of importance to ovarian function were discovered: inhibin A
and TGF. We provide experimental evidence that estrogen acting through ER- impacted genes related to cell
differentiation, whereas ER- regulated cell proliferation genes. More specific results have also been obtained. The
methodology used in this case study, the queries performed, as well as related analysis, are reported in this paper.
A Comprehensive Genomics and Proteomics Data Analysis and Biological Function Interpretation System
Weimin Feng, Peter Henning, and May D. Wang
The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology
and Emory University
EGOMiner (Enhanced Gene Ontology Miner) is a CORBA-based tool developed for genomic and proteomic data
analysis. One key function of EGOMiner is to provide biological interpretation of data based on the
Gene Ontology, with quantitative and statistical analysis and visualization by directed acyclic graph.
Cross-comparison of multiple experimental studies is supported. The other function is quality analysis
of microarray chip images: the input is raw chip data and the output is a visualization of the data
quality of the chip image. This system significantly improves on the GoMiner system that was designed
and developed by the authors.
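Quantitative Gene Ontology interpretation of the kind described typically rests on two steps: propagating annotations up the GO DAG (the true-path rule) and testing term over-representation, commonly with a hypergeometric tail. Whether EGOMiner uses exactly this statistic is an assumption; the sketch below is illustrative, with made-up term names.

```python
from math import comb

def annotate_with_ancestors(direct, parents):
    """Propagate each gene's direct GO annotations to all ancestor terms
    in the DAG (the true-path rule). `parents` maps term -> parent terms."""
    full = {}
    for gene, terms in direct.items():
        seen = set()
        stack = list(terms)
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(parents.get(t, []))
        full[gene] = seen
    return full

def hypergeom_enrichment(N, K, n, k):
    """P(X >= k) when drawing n genes from N of which K carry the term:
    the standard over-representation p-value."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)
```

For example, drawing all 5 annotated genes out of 20 in a 5-gene hit list gives p = 1/C(20,5), a strong enrichment signal.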
A Java Program to Create Simulated Microarray Images
Bill Martin and Robert M. Horton
Bioinformatics consultant, Attotron Corporation
We have developed a program to create images representing "virtual microarray experiments," and are developing
exercises using these images to teach microarray technology in undergraduate biology courses. Actual microarray
images usually represent large amounts of data containing a wide assortment of imperfections, and can be daunting
to students seeking an introduction to the technology. Our simulator can easily generate "toy" images customized
for particular teaching points. For example, a set of idealized images makes it easy for students to perform
measurements quickly, so they can move to clustering and visualization within a single laboratory period. In
more advanced exercises, anomalies can be simulated to produce various degrees of realism. Feature intensities
can be either specified by an instructor, or extracted from actual experimental data. If experimentally observed
data are used, students can compare their results from analyzing "reverse-engineered" images to the original
papers in which the experiments were described. Scientific image analysis relies on conceptual models that
encapsulate assumptions about factors comprising the image. Because simulated images can exactly instantiate
such models, they may prove useful for "black box" evaluation of analytical software products, or for testing
of new software. The image generator is available at www.cybertory.org. It employs the Java Advanced Imaging
(JAI) API, and can be used as a class library, a stand-alone program that reads XML, or on the web. This work
is funded by NIH SBIR grant 2R44RR013645-02A2 to Attotron Corporation.
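In the same spirit as the Java/JAI simulator described above (though not its implementation), a toy spot-grid generator can be written in a few lines: each feature is rendered as a 2-D Gaussian spot at its grid position, with intensities supplied by the instructor. All parameters below are illustrative.

```python
import numpy as np

def spot(size, intensity, radius):
    """Render one circular spot as a 2-D Gaussian patch of side `size`."""
    yy, xx = np.mgrid[:size, :size]
    c = (size - 1) / 2.0
    r2 = (xx - c) ** 2 + (yy - c) ** 2
    return intensity * np.exp(-r2 / (2.0 * radius ** 2))

def virtual_array(intensities, cell=16, radius=3.0):
    """Lay out a grid of spots from a 2-D array of feature intensities,
    producing an idealized 'virtual microarray' image."""
    rows, cols = intensities.shape
    img = np.zeros((rows * cell, cols * cell))
    for i in range(rows):
        for j in range(cols):
            img[i*cell:(i+1)*cell, j*cell:(j+1)*cell] = \
                spot(cell, intensities[i, j], radius)
    return img
```

Anomalies such as streaks or background gradients would be added on top of this clean image to produce the varying degrees of realism mentioned above.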
Inferring Genetic Networks from Microarray Data
Shawn Martin, George Davidson, Elebeoba May, Margaret Werner Washburne, and Jean-Loup Faulon
Sandia National Laboratory
In theory, it should be possible to infer realistic genetic networks from time series microarray data.
In practice, however, this has proved problematic. The three major challenges are 1) inferring the network;
2) estimating the stability of the inferred network; and 3) making the network visually accessible to the user.
Here we describe a method, tested on the time series microarray data in (Spellman, 1998), that addresses these
concerns. Network inference begins by aggregating small groups of genes with similar expression in the original,
noisy, time- series data. The composite group expression is derived using support vector regression and the
resulting continuous values are divided into on/off expression states. All self-consistent Boolean networks
are then inferred using these states. To assess the stability of our method, the networks are clustered by
dynamic behavior; confidence levels are assigned (via bootstrapping) to components in the networks; and the
most robust networks are simulated using continuous valued electronic circuits. Finally, a visualization
environment was developed for further network investigation by the end user. This environment includes a
clickable drawing of the network, hot-links to plots showing the original time series data, and an annotated,
hot-linked spreadsheet that provides network and gene information. The development of this network and
visualization environment required the collaboration of researchers in math (JL, SM), computer sciences (GD, EM),
and yeast genomics (MWW). The entire process has so far yielded two testable hypotheses, one concerning exit
from arrested states, and one concerning the level of control present in genetic networks.
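The discretization and self-consistent-rule steps above can be illustrated in miniature: threshold each gene's time series at its median to get on/off states, then exhaustively enumerate the Boolean rules consistent with every observed transition. The sketch below handles only single-regulator rules, a deliberate simplification of the method described.

```python
import numpy as np
from itertools import product

def binarize(ts):
    """Threshold each gene's time series at its median -> on/off states.
    `ts` has shape (timepoints, genes)."""
    return (ts > np.median(ts, axis=0, keepdims=True)).astype(int)

def consistent_rules(states, target):
    """Enumerate all single-regulator Boolean rules (regulator index plus
    truth table (f(0), f(1))) that reproduce every observed transition
    of `target` -- a toy version of self-consistent Boolean inference."""
    T, G = states.shape
    rules = []
    for reg in range(G):
        for table in product([0, 1], repeat=2):
            if all(table[states[t, reg]] == states[t + 1, target]
                   for t in range(T - 1)):
                rules.append((reg, table))
    return rules
```

Bootstrapping, as in the abstract, would rerun this inference on resampled data and report how often each rule recurs.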
ChipQC: A Microarray Chip Quality Control Tool
May Wang
Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and
Emory University, Atlanta, Georgia, USA
Recent advances in genetics, most notably the completion of the human genome project, have fueled an
interest in genetic disposition as the root of many diseases. Chip quality control is often overlooked
in microarray experiments. As such, a new web-based software tool, ChipQC, is developed to perform
standard normalization, error analysis, and statistical techniques on multiple array sets (technical
or biological replicates), which include calculating the coefficient of variation, standard deviation,
mean intensity, fold change, and statistical significance. The purpose of the ChipQC project was to
build an automated method to detect areas of high systematic or random error introduced either by
manufacturing processes or experimental technique. ChipQC, using a heat-map scheme, graphically represents
the error analysis and various metrics of each gene at its proper chip coordinates, thus mimicking the
analyzed chip's configuration to enable the removal of subtle artifacts. Use of this visualization tool
revealed localized areas of high variability in some arrays, consistent with block spotting effects,
streaking, air bubbles, and edge effects, all of which persisted even after lowess or linear normalization
had been applied. This demonstrates the need to select and flag specific spots that would potentially yield
false positive or false negative gene expression results. Reproducibility and the reduction of systematic
errors are tackled by extensive quality control, which lays the foundation that may ultimately allow
microarray systems to migrate from an experimental research tool to a clinical diagnostic device.
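The core per-spot metric behind such a heat map, the coefficient of variation across replicate arrays laid out in chip coordinates, can be sketched as follows. The flagging threshold is a hypothetical parameter; the abstract does not state ChipQC's actual cutoffs.

```python
import numpy as np

def cv_map(replicates):
    """Per-spot coefficient of variation across replicate arrays.
    `replicates` has shape (n_arrays, rows, cols); the output keeps the
    chip layout, so high-variability regions show up in place."""
    reps = np.asarray(replicates, dtype=float)
    mean = reps.mean(axis=0)
    sd = reps.std(axis=0, ddof=1)          # sample standard deviation
    return np.where(mean > 0, sd / mean, np.inf)

def flag_spots(cv, threshold=0.5):
    """Boolean mask of spots whose CV exceeds `threshold` (an assumed
    cutoff for illustration)."""
    return cv > threshold
```

Rendering `cv_map` with any heat-map plotting routine reproduces the in-place visualization described above; spatially clustered flags suggest block or edge artifacts rather than random noise.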
Extracting Characteristic Patterns from Genome Wide Expression Data by Non-Negative Matrix Factorization
Nini Rao, Dezhong Yao, and Simon J. Shepherd
University of Electronic Science and Technology of China
Signal processing approaches such as singular value decomposition (SVD) and independent component
analysis (ICA) have been proposed to analyze genome-wide expression data. In this paper, we propose a
novel approach, non-negative matrix factorization (NMF), for analyzing genome-wide expression data.
One advantage of NMF is that it can directly process these data without normalization. Firstly, we
design an optimal algorithm for the NMF approach. Compared with existing NMF algorithms, our algorithm
is more stable and converges very fast. We have coded the final algorithm in highly optimized C.
Secondly, we describe the use of NMF to extract characteristic patterns from genome-wide expression
data. Thirdly, simulation experiments are performed to verify the efficiency of the NMF algorithm. In
these simulations, NMF is employed to analyze a set of synthetic microarray data with known characteristic
patterns, as well as two biological datasets that have been processed by the SVD approach and are thus
considered gold standards. The simulation experiments show that NMF performs very well on these datasets
and is robust under high noise levels. We found that, like SVD, an important ability distinguishing NMF
and related methods from other analysis methods is the capability to detect weak signals in the data.
Even when the structure of the data does not allow separation of data points, causing clustering
algorithms to fail, NMF can extract biologically meaningful patterns. We conclude that NMF can be used
as a powerful tool to extract biologically meaningful expression patterns from genome-wide expression data.
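The authors' optimized algorithm is not given, but the standard baseline it improves on, Lee-Seung multiplicative updates for V ≈ WH with non-negative factors, can be sketched as follows. Factors W give the characteristic expression patterns; H gives per-sample weights.

```python
import numpy as np

def nmf(V, rank, iters=500, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||V - WH||_F with
    W, H >= 0. This is the classical algorithm, not the authors'
    optimized C variant."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H, keeping it >= 0
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W, keeping it >= 0
    return W, H
```

Because the updates are multiplicative, non-negativity is preserved automatically, which is why NMF can work directly on raw (non-negative) intensity data without normalization, as noted above.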
Imputation of Missing Values in DNA Microarray Gene Expression Data
Hyunsoo Kim (1), Haesun Park(1,2), and Gene H. Golub (3)
(1) Department of Computer Science and Engineering, University of Minnesota;
(2) The National Science Foundation, Arlington, VA; (3) Computer Science Department, Stanford University
Gene expression data sets often contain missing values due to various reasons, e.g. insufficient resolution,
image corruption, dust or scratches on the slides, or experimental error during the laboratory process. Since
it is often very costly or time consuming to repeat the experiment, many algorithms have been developed to
recover the missing values. Moreover, estimating unknown elements in a given matrix has many potential
applications in other fields. In this paper, imputation methods based on the least squares formulation
are proposed to estimate missing values in gene expression data; they exploit local similarity structures
in the data as well as a least squares optimization process. The proposed local least squares imputation
method (LLSimpute) represents a target gene that has missing values as a linear combination of similar
genes. The similar genes are chosen as the $k$ nearest neighbors or the $k$ coherent genes with the
largest absolute Pearson correlation coefficients. A nonparametric version of LLSimpute is obtained by
introducing an automatic $k$-value estimator. In our experiments, the proposed LLSimpute method showed
better performance than other imputation methods, such as $k$-nearest neighbors imputation (KNNimpute)
and an imputation method based on Bayesian principal component analysis (BPCA), on various data sets and
percentages of missing values in the data.
Disease Gene Explorer: Displaying Disease Gene Dependency by Combining Bayesian Networks with Clustering
Qian Diao, We Hu, Hao Zhong, Feng Xue, Yimin Zhang, and Tao Wang
Intel China Research Centre
Constructing gene networks is one of the hot topics in the analysis of the microarray gene expression data.
When combined with the output of disease gene finding, the generated gene networks will give a recommendation
mechanism and an intuitive form for biologists to identify the underlying relationship among those
biomarkers of the disease. In this paper, we present a display system, Disease Gene Explorer, which can
graphically display the dependency among genes, especially those biomarkers of a disease. It combines
Bayesian networks (BN) learning with clustering and disease gene selection. The clustering algorithm is
a novel method based on K-means and the t-test. It performs better overall than standard K-means and is
less dependent on, or even independent of, the initial partition in both temporal and time-course gene
expression data. The BN learning includes BN structure learning for genes within a cluster and module
network learning across the different clusters. We test the system on a colon cancer data set and obtain
some interesting results: most high-scoring biomarkers of the disease are partitioned into one group; the
dependency among these disease genes is displayed as a directed acyclic graph (DAG); and two genes,
Human mRNA for snRNP E protein (X12466) and Human HF.12 gene mRNA (07290), have causal relations to the group.
DNA Microarray Image Analysis Using Active Contour Model
Yuan-Kai Wang and Cheng-Wei Huang
Department of Electronic Engineering, Fu Jen University, Taiwan
The DNA microarray is a tool for examining the expression levels of thousands of genes simultaneously.
However, DNA microarray experiments produce large amounts of data, so automatic and precise analysis
tools are urgently needed. Image analysis is the first important step in the automatic processing of
microarray experiments. In this paper, a novel approach to accurate spot finding in DNA microarrays
using an active contour model is proposed. In our approach, a microarray image is first transformed into
a gray-level image. Wavelet-based profile analysis is then performed to build an orthogonal grid system
for the image, so that each spot is enclosed within a grid cell. The boundary of the grid cell is taken
as the initial contour of the spot. By shrinking the initial contour through energy minimization, a more
precise contour of the spot is located. The approach is tested on microarray images obtained from the
Stanford Microarray Database, and comprehensive experiments show highly encouraging results. Major
contributions of the proposed approach include the following: (1) it handles image rotation without
manual adjustment; (2) detection is robust to irregular spot gaps; (3) it is not affected by variations
in spot color and size; and (4) it performs well for different numbers of input channels. Moreover, the
proposed approach not only handles multiple spots in microarray images but also finds more precise
spot contours.
Meanshift Clustering for DNA Microarray Analysis
Danny Barash and Dorin Comaniciu
Department of Computer Science, Ben-Gurion University
Mean shift clustering is a well-established algorithm that has been applied successfully in image
processing and computer vision. Cluster centers are derived by local mode seeking, identifying maxima
in the normalized density of the data set. Recently, quantum clustering, which closely resembles mean
shift clustering, has been proposed for analyzing microarray expression data. Quantum clustering is
based on physical intuition derived from quantum mechanics: through an iterative gradient descent
procedure, the potential energy V belonging to the Hamiltonian of the time-independent Schrödinger
equation develops minima that are identified with cluster centers. The analogies between the wavefunction
in quantum clustering and the multivariate kernel density estimator in mean shift clustering lead
to closely related formulations. However, the approach toward the minima of the potential in quantum
clustering must be performed separately from the formulation, by gradient descent steps. In contrast,
in mean shift clustering the approach toward the maxima of the normalized density is performed by the
mean shift vector, which is derived directly from the formulation and points in the direction of the
maximum increase in the underlying density. Based on these observations, we propose implementing
mean shift clustering to improve the efficiency of local mode seeking in analyzing expression data.
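The mean-shift iteration described above can be sketched as follows: every point repeatedly moves to the Gaussian-kernel weighted mean of the data around it, i.e. uphill on the kernel density estimate, until it settles at a local mode. Bandwidth and iteration counts below are illustrative parameters.

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, iters=100, tol=1e-6):
    """Move each point along the mean-shift vector (toward the
    kernel-weighted mean of the data) until convergence; the settled
    positions are local modes of the density, i.e. cluster centers."""
    X = np.asarray(X, dtype=float)
    pts = X.copy()
    for _ in range(iters):
        new = np.empty_like(pts)
        for i, p in enumerate(pts):
            d2 = ((X - p) ** 2).sum(axis=1)
            w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # Gaussian kernel weights
            new[i] = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.abs(new - pts).max() < tol:
            pts = new
            break
        pts = new
    return pts
```

Points that converge to the same mode form a cluster; unlike the gradient descent of quantum clustering, no step size needs to be tuned, which is the efficiency point the abstract makes.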
The Dynamic Range of Gene Expressions Depends on Their Ontology
Yizhou Xie, Parthav Jailwalia, Soumitra Ghosh and Xujing Wang
Medical College of Wisconsin
Comprehensive profiling of gene expressions during physiological and/or pathological processes is
becoming a standard research approach; yet quantitative interpretation of data is still difficult.
This problem is mainly due to the current poor understanding of the intrinsic normal variation in
expression of individual genes. It is known that naturally some transcripts are present at relatively
constant levels, while others are expressed at highly variable levels. However, a comprehensive
characterization of the gene-dependent expression variability is still missing. In this study we
investigate the correlation between the normal variation of gene expressions and their gene ontologies.
Three sets of time-series microarray data are utilized that represent three different but typical
physiological/developmental processes: the yeast cell cycle data from Stanford University; the pancreas
perinatal developmental data from the EpconDB; and the apoptosis progression of pancreatic islet b-cells
data from our own lab. We analyze the dynamic range and the variation of expression levels during these
processes for genes in each of the top-level ontologies of biological process, molecular function and
cellular component, as defined by GO. Significant and consistent differences are observed. For example,
transcription factors were found to have a consistently narrower dynamic range of expression in all three
processes. The findings will shed light on study design, statistical evaluation, and data mining of genetic
data including microarray data.
Subspace Clustering for Microarray Data Analysis: Multiple Criteria and Significance Assessment
Hui Fang, Chengxiang Zhai, Lei Liu, and Jiong Yang
University of Illinois at Urbana-Champaign
We present a new clustering method for analyzing microarray data that improves existing approaches in
three aspects -- capturing gene similarities under a subset of gene expression conditions, combining
multiple criteria to capture trend similarity, and assigning statistical significance to detected clusters.
Our algorithm first aims at discovering non-traditional subspace clusters. A subspace cluster is a subset
of genes that exhibit similar expression patterns over a subset of conditions. To capture biologically
meaningful clusters, we apply multiple criteria for constraining a subspace cluster. Specifically, we
define our subspace cluster as a submatrix satisfying two distinct constraints -- fluctuation constraint
and trend constraint. The fluctuation constraint requires that, for all genes in a cluster, the difference
in expression levels between any two conditions be similar. The trend constraint captures the correlation
between genes, i.e. when the expression level of one gene goes up under some conditions, the expression level
of the correlated genes should also go up accordingly. In general, we are advocating constraining a cluster
with multiple criteria capturing biological requirements from different perspectives. Our second method is
to exploit the original sample data points to assess the statistical significance of the discovered clusters.
We propose a method to compute the confidence level for each generated cluster based on the original variances
of cell values. The clusters can then be ranked according to their confidence levels. The proposed new method
has a great potential for helping biologists discover meaningful gene clusters through generating more coherent
clusters and ranking clusters based on statistical significance.
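The two constraints above can be stated as executable checks on a candidate submatrix (genes × conditions). The tolerance `delta` is a hypothetical parameter, and these checks are one plausible reading of the constraints, not the authors' exact definitions.

```python
import numpy as np

def fluctuation_ok(sub, delta):
    """Fluctuation constraint: for every pair of conditions, the change in
    expression between them must agree across all genes in the submatrix
    to within `delta` (an assumed tolerance)."""
    diffs = sub[:, :, None] - sub[:, None, :]       # gene x cond x cond
    spread = diffs.max(axis=0) - diffs.min(axis=0)  # disagreement across genes
    return bool((spread <= delta).all())

def trend_ok(sub):
    """Trend constraint: between consecutive conditions, all genes move in
    the same direction (up together, down together)."""
    signs = np.sign(np.diff(sub, axis=1))
    return bool((signs == signs[0]).all())
```

A subspace-cluster search would enumerate candidate gene/condition subsets and keep those passing both checks, then rank the survivors by the confidence levels described above.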
Comparative Analysis of Gene Expression and DNA Copy Number Data for Pancreatic and Breast
Cancers Using an Orthogonal Decomposition
John A. Berger, Sampsa Hautaniemi, and Sanjit K. Mitra
UCSB
We present the use of a comparative mathematical framework for simultaneously analyzing relations
between DNA copy number and mRNA-expression data. For many diseases, the causes of over-expression
are typically unknown, but current studies show that copy number aberrations may be strong candidates
for driving gene over-expression. The generalized singular value decomposition (GSVD) is utilized here
to formulate similarities in these measurements and locate specific biological processes present in
both inputs. Accordingly, the problem of how to systematically analyze multiple input data is presented.
The GSVD locates relevant gene influences common to only gene expression, copy numbers, or both
measurements in conjunction. These groups are graphically reported and gene ontology annotations are
used as a functional assessment of the groupings. Statistical analysis using combinatorics is presented
to assess and estimate probabilistic significance. We illustrate this method for two independently
published studies of pancreatic cancer and breast cancer, in which public gene expression and DNA copy
number data are provided and measured across numerous tumor cell lines. For both cancer data sets, we
locate specific inference patterns and processes common to both genome-wide input measurements across
10k unique clones. Our results suggest that several genes are strongly influenced by both copy number
amplification and gene over-expression; lists of about 100 such genes were compared to the cancer
literature for both types of cancer. Accordingly, these genes are likely to be closely
involved in cancer development and progression, and could be promising targets for therapeutic intervention.
High-throughput Microarray-Based Genotyping
Geoffrey Yang, Ming-Hsiu Ho, and Earl Hubbell
Affymetrix, Inc.
A high throughput genotyping platform that scores ~10,000 single-nucleotide polymorphisms (SNPs) per
individual on a single GeneChip high-density oligonucleotide microarray has been developed to conduct
studies to elucidate the genetic basis for complex diseases. Currently, the genotyping of individual
SNPs relies on the summary statistics (based on the observed intensities of probes on the microarray)
for the entirety of the sample set. A clustering scheme of those statistics across hundreds of
individual DNA samples is obtained to make individual genotyping calls (Liu et al., Algorithms for Large
Scale Genotyping Microarrays. Bioinformatics. 2003 19: 2397-2403.). In contrast to the current approach,
the method in this work makes individual genotyping calls by finding the minimum residual, i.e. highest
likelihood, among four possible states corresponding to the three genotypes and no-call. Initially, the
residual is calculated based on a given set of probe affinities for that individual sample. The
multitude of samples is then used iteratively to refine the probe affinities, sample concentrations,
and background intensities. These refined parameters yield an improved call rate while maintaining high
accuracy. Comparison of a subset of SNP genotypes against calls determined by single-base extension
showed > 99% concordance. Mendelian inheritance errors were also checked on SNP genotypes determined in
samples from CEPH pedigrees.
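The minimum-residual calling idea can be illustrated with a deliberately simplified model: compare the observed A-allele signal fraction against idealized expectations for the three genotypes, and return no-call when even the best model fits poorly. The allele-fraction models and residual threshold below are illustrative, not Affymetrix's actual parameters.

```python
# Idealized expected A-allele signal fractions for the three genotypes.
# These values and the residual cutoff are illustrative assumptions.
MODELS = {"AA": 1.0, "AB": 0.5, "BB": 0.0}

def call_genotype(a_signal, b_signal, max_residual=0.1):
    """Call the genotype whose model leaves the minimum residual against
    the observed A-allele fraction; fall back to NoCall when the best
    residual still exceeds the threshold."""
    total = a_signal + b_signal
    if total <= 0:
        return "NoCall"
    frac = a_signal / total
    best, res = min(((g, abs(frac - f)) for g, f in MODELS.items()),
                    key=lambda t: t[1])
    return best if res <= max_residual else "NoCall"
```

The iterative refinement described above would re-estimate the per-probe affinities feeding `a_signal`/`b_signal` across all samples, shrinking residuals and converting some no-calls into confident calls.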