Q-value Method May Not Always Control False Discovery Rate in Genomics Applications
Xiao Yang
Monsanto Company
The q-value method of Storey (2002) has been shown to be theoretically sound for controlling the false
discovery rate (FDR) in many functional genomics applications. However, empirical evidence suggests
that this method can be more stringent than alternatives such as the Bonferroni adjustment and the FDR
procedure of Benjamini and Hochberg (1995). We compare these methods for detecting differential gene
expression in microarray data analysis. For gene-discovery experiments, where many genes are expected
to be differentially expressed across experimental conditions, the q-value method generally performs well.
However, for experiments in which only a few genes are expected to be differentially expressed, the
q-value method performs much worse than the other methods. Some insights are provided to explain this
discrepancy, and adjustments to the q-value method are recommended to accommodate a wider range of
applications.
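For reference, the two comparison procedures named in the abstract can be sketched as follows. This is a generic illustration with synthetic p-values, not the authors' code; the Bonferroni adjustment controls family-wise error, while the Benjamini-Hochberg step-up procedure controls FDR.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0 where p <= alpha / m (family-wise error rate control)."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg (1995) step-up procedure controlling FDR at alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # find the largest k with p_(k) <= (k/m) * alpha, then reject p_(1)..p_(k)
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))
        reject[order[: k + 1]] = True
    return reject
```

With p-values (0.01, 0.02, 0.03, 0.5) at alpha = 0.05, Bonferroni rejects only the first hypothesis, while Benjamini-Hochberg rejects the first three, illustrating the difference in stringency the abstract discusses.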
Application of A Genetic Algorithm to the Classification of Renal Cell Carcinoma
Dongqing Liu, Zhong-Hui Duan, Jianping Zhu, and Ting Shi
University of Akron, Department of Computer Science
Renal cell carcinoma (RCC) consists of several subtypes, of which clear cell RCC is the most common.
Its genetic alterations are characterized by mutation or hypermethylation of the Von Hippel-Lindau gene.
Papillary RCC, the second most common subtype, shows different genetic alterations, involving activation
of the MET proto-oncogene and trisomy of chromosomes 7 and 17. It is expected that
the gene expression profiles of the two subtypes of tumors are also distinctive and the subtypes can be
identified on the basis of the expressions of a panel of genes. The objective of this study is to identify
the panel of discriminator genes using a genetic algorithm (GA). In this study, we apply a GA to a set of
microarray gene expression profiles of nine samples (three clear cell tumors and six papillary tumors).
We show that the GA can be efficiently used in identifying a set of discriminator genes. To test the robustness
of the algorithm, we perform a bootstrapping analysis that removes one sample from the data set at a time
and uses the remaining samples for gene selection. We show that each of the removed samples can be classified
correctly. We also demonstrate the effect of different distance metrics and evolutionary changes in the
algorithm such as mutation and cross-over on the fitness function and convergence rate. Furthermore, we use
hierarchical clustering to investigate the selected genes and the singular value decomposition to visualize
the correlation of the expressions of these genes.
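A minimal GA of the kind described above can be sketched as follows. The fitness function (class-centroid separation over the selected genes) and all parameters are assumptions for illustration; the abstract does not specify the authors' exact choices.

```python
import numpy as np

def fitness(mask, X, y):
    """Separation of the two class centroids over the selected genes,
    normalized by the number of genes chosen. This fitness function is
    an assumption; the abstract does not give the authors' exact one."""
    k = int(mask.sum())
    if k == 0:
        return 0.0
    Xs = X[:, mask.astype(bool)]
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    return float(np.linalg.norm(c0 - c1) / np.sqrt(k))

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Evolve binary gene-inclusion masks by truncation selection,
    one-point crossover, and bit-flip mutation."""
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_genes))
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(scores)[-(pop_size // 2):]]  # better half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = int(rng.integers(1, n_genes))          # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flips = rng.random(n_genes) < p_mut          # bit-flip mutation
            children.append(np.where(flips, 1 - child, child))
        pop = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[int(np.argmax(scores))]
```

The leave-one-out test in the abstract would wrap `ga_select` in a loop that drops one sample, selects genes on the rest, and classifies the held-out sample.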
Mining Estrogen Microarray Data: An Approach Using Contrast Data Analysis
Haili Jiao, Peixin Yang, Z. Chen
University of Nebraska at Omaha
Although data mining methods have been widely applied to analyze microarray data, existing data mining
techniques such as classification, clustering, association and discriminative analysis can only accomplish
part of the job. Unique features of microarray data demand a more systematic way of applying data mining
techniques. New frameworks or guidelines are thus needed. Toward this goal, we have explored an approach
called contrast data analysis, which makes use of an extended relational algebra operator and incorporates
considerations from granular computing (GrC). We have applied this approach for mining estrogen regulation
from the microarray data obtained from Affymetrix chips. A two-phase methodology is employed, and two
levels of granulations are implemented. At the first level, classification rules mining and association
rules mining are used to obtain a meaningful general view of the microarray data. At the second level,
comparative queries are designed to retrieve individual genes satisfying certain regulation patterns.
Through the analysis, we found that estrogen regulated many key components of several important protein
kinase signaling pathways, mainly PI-3 kinase and NF-kappaB, which are involved in cell survival.
Furthermore, two estrogen-responsive genes of importance to ovarian function were discovered: inhibin A
and TGF. We provide experimental evidence that estrogen acting through ER- impacted genes related to cell
differentiation, whereas ER- regulated cell proliferation genes. More specific results have also been obtained. The
methodology used in this case study, the queries performed, as well as related analysis, are reported in this paper.
A Comprehensive Genomics and Proteomics Data Analysis and Biological Function Interpretation System
Weimin Feng, Peter Henning, and May D. Wang
The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology
and Emory University
EGOMiner (Enhanced Gene Ontology Miner) is a CORBA-based tool developed for genomic and proteomic data
analysis. One key function of EGOMiner is to provide biological interpretation of data based on the
Gene Ontology, with quantitative and statistical analysis and visualization by directed acyclic graph.
Cross-comparison of multiple experimental studies is supported. The other function is quality analysis
of microarray chip images: the input is raw chip data and the output is a visualization of the data
quality of the chip image. This system significantly improves on the GoMiner system that was designed
and developed by the authors.
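Quantitative Gene Ontology interpretation of the kind described typically rests on two steps: propagating annotations up the GO DAG (the true-path rule) and testing term over-representation, commonly with a hypergeometric tail. Whether EGOMiner uses exactly this statistic is an assumption; the sketch below is illustrative, with made-up term names.

```python
from math import comb

def annotate_with_ancestors(direct, parents):
    """Propagate each gene's direct GO annotations to all ancestor terms
    in the DAG (the true-path rule). `parents` maps term -> parent terms."""
    full = {}
    for gene, terms in direct.items():
        seen = set()
        stack = list(terms)
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(parents.get(t, []))
        full[gene] = seen
    return full

def hypergeom_enrichment(N, K, n, k):
    """P(X >= k) when drawing n genes from N of which K carry the term:
    the standard over-representation p-value."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)
```

For example, drawing all 5 annotated genes out of 20 in a 5-gene hit list gives p = 1/C(20,5), a strong enrichment signal.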
A Java Program to Create Simulated Microarray Images
Bill Martin and Robert M. Horton
Bioinformatics consultant, Attotron Corporation
We have developed a program to create images representing "virtual microarray experiments," and are developing
exercises using these images to teach microarray technology in undergraduate biology courses. Actual microarray
images usually represent large amounts of data containing a wide assortment of imperfections, and can be daunting
to students seeking an introduction to the technology. Our simulator can easily generate "toy" images customized
for particular teaching points. For example, a set of idealized images makes it easy for students to perform
measurements quickly, so they can move to clustering and visualization within a single laboratory period. In
more advanced exercises, anomalies can be simulated to produce various degrees of realism. Feature intensities
can be either specified by an instructor, or extracted from actual experimental data. If experimentally observed
data are used, students can compare their results from analyzing "reverse-engineered" images to the original
papers in which the experiments were described. Scientific image analysis relies on conceptual models that
encapsulate assumptions about factors comprising the image. Because simulated images can exactly instantiate
such models, they may prove useful for "black box" evaluation of analytical software products, or for testing
of new software. The image generator is available at www.cybertory.org. It employs the Java Advanced Imaging
(JAI) API, and can be used as a class library, a stand-alone program that reads XML, or on the web. This work
is funded by NIH SBIR grant 2R44RR013645-02A2 to Attotron Corporation.
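In the same spirit as the Java/JAI simulator described above (though not its implementation), a toy spot-grid generator can be written in a few lines: each feature is rendered as a 2-D Gaussian spot at its grid position, with intensities supplied by the instructor. All parameters below are illustrative.

```python
import numpy as np

def spot(size, intensity, radius):
    """Render one circular spot as a 2-D Gaussian patch of side `size`."""
    yy, xx = np.mgrid[:size, :size]
    c = (size - 1) / 2.0
    r2 = (xx - c) ** 2 + (yy - c) ** 2
    return intensity * np.exp(-r2 / (2.0 * radius ** 2))

def virtual_array(intensities, cell=16, radius=3.0):
    """Lay out a grid of spots from a 2-D array of feature intensities,
    producing an idealized 'virtual microarray' image."""
    rows, cols = intensities.shape
    img = np.zeros((rows * cell, cols * cell))
    for i in range(rows):
        for j in range(cols):
            img[i*cell:(i+1)*cell, j*cell:(j+1)*cell] = \
                spot(cell, intensities[i, j], radius)
    return img
```

Anomalies such as streaks or background gradients would be added on top of this clean image to produce the varying degrees of realism mentioned above.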
Inferring Genetic Networks from Microarray Data
Shawn Martin, George Davidson, Elebeoba May, Margaret Werner Washburne, and Jean-Loup Faulon
Sandia National Laboratory
In theory, it should be possible to infer realistic genetic networks from time series microarray data.
In practice, however, this has proved problematic. The three major challenges are 1) inferring the network;
2) estimating the stability of the inferred network; and 3) making the network visually accessible to the user.
Here we describe a method, tested on the time series microarray data in (Spellman, 1998), that addresses these
concerns. Network inference begins by aggregating small groups of genes with similar expression in the original,
noisy, time- series data. The composite group expression is derived using support vector regression and the
resulting continuous values are divided into on/off expression states. All self-consistent Boolean networks
are then inferred using these states. To assess the stability of our method, the networks are clustered by
dynamic behavior; confidence levels are assigned (via bootstrapping) to components in the networks; and the
most robust networks are simulated using continuous valued electronic circuits. Finally, a visualization
environment was developed for further network investigation by the end user. This environment includes a
clickable drawing of the network, hot-links to plots showing the original time series data, and an annotated,
hot-linked spreadsheet that provides network and gene information. The development of this network and
visualization environment required the collaboration of researchers in math (JL, SM), computer sciences (GD, EM),
and yeast genomics (MWW). The entire process has so far yielded two testable hypotheses, one concerning exit
from arrested states, and one concerning the level of control present in genetic networks.
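The discretization and self-consistent-rule steps above can be illustrated in miniature: threshold each gene's time series at its median to get on/off states, then exhaustively enumerate the Boolean rules consistent with every observed transition. The sketch below handles only single-regulator rules, a deliberate simplification of the method described.

```python
import numpy as np
from itertools import product

def binarize(ts):
    """Threshold each gene's time series at its median -> on/off states.
    `ts` has shape (timepoints, genes)."""
    return (ts > np.median(ts, axis=0, keepdims=True)).astype(int)

def consistent_rules(states, target):
    """Enumerate all single-regulator Boolean rules (regulator index plus
    truth table (f(0), f(1))) that reproduce every observed transition
    of `target` -- a toy version of self-consistent Boolean inference."""
    T, G = states.shape
    rules = []
    for reg in range(G):
        for table in product([0, 1], repeat=2):
            if all(table[states[t, reg]] == states[t + 1, target]
                   for t in range(T - 1)):
                rules.append((reg, table))
    return rules
```

Bootstrapping, as in the abstract, would rerun this inference on resampled data and report how often each rule recurs.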
ChipQC: A Microarray Chip Quality Control Tool
May Wang
Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and
Emory University, Atlanta, Georgia, USA
Recent advances in genetics, most notably the completion of the human genome project, have fueled an
interest in genetic disposition as the root of many diseases. Chip quality control is often overlooked
in microarray experiments. As such, a new web-based software tool, ChipQC, is developed to perform
standard normalization, error analysis, and statistical techniques on multiple array sets (technical
or biological replicates), which include calculating the coefficient of variation, standard deviation,
mean intensity, fold change, and statistical significance. The purpose of the ChipQC project was to
build an automated method to detect areas of high systematic or random error introduced either by
manufacturing processes or experimental technique. ChipQC, using a heat-map scheme, graphically represents
the error analysis and various metrics of each gene at its proper chip coordinates, thus mimicking the
analyzed chip's configuration to enable the removal of subtle artifacts. Use of this visualization tool
revealed localized areas of high variability in some arrays, consistent with block spotting effects,
streaking, air bubbles, and edge effects, all of which persisted even after lowess or linear normalization
had been applied. This demonstrates the need to select and flag specific spots that would potentially yield
false positive or false negative gene expression results. Reproducibility and the reduction of systematic
errors are tackled by extensive quality control, which lays the foundation that may ultimately allow
microarray systems to migrate from an experimental research tool to a clinical diagnostic device.
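The core per-spot metric behind such a heat map, the coefficient of variation across replicate arrays laid out in chip coordinates, can be sketched as follows. The flagging threshold is a hypothetical parameter; the abstract does not state ChipQC's actual cutoffs.

```python
import numpy as np

def cv_map(replicates):
    """Per-spot coefficient of variation across replicate arrays.
    `replicates` has shape (n_arrays, rows, cols); the output keeps the
    chip layout, so high-variability regions show up in place."""
    reps = np.asarray(replicates, dtype=float)
    mean = reps.mean(axis=0)
    sd = reps.std(axis=0, ddof=1)          # sample standard deviation
    return np.where(mean > 0, sd / mean, np.inf)

def flag_spots(cv, threshold=0.5):
    """Boolean mask of spots whose CV exceeds `threshold` (an assumed
    cutoff for illustration)."""
    return cv > threshold
```

Rendering `cv_map` with any heat-map plotting routine reproduces the in-place visualization described above; spatially clustered flags suggest block or edge artifacts rather than random noise.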
Extracting Characteristic Patterns from Genome Wide Expression Data by Non-Negative Matrix Factorization
Nini Rao, Dezhong Yao, and Simon J. Shepherd
University of Electronic Science and Technology of China
Signal processing approaches such as singular value decomposition (SVD) and independent component
analysis (ICA) have been proposed to analyze genome-wide expression data. In this paper, we propose a
novel approach, non-negative matrix factorization (NMF), for analyzing genome-wide expression data.
One advantage of NMF is that it can directly process these data without normalization. Firstly, we
design an optimal algorithm for the NMF approach. Compared with existing NMF algorithms, our algorithm
is more stable and converges very fast. We have coded the final algorithm in highly optimized C.
Secondly, we describe the use of NMF to extract characteristic patterns from genome-wide expression
data. Thirdly, simulation experiments are performed to verify the efficiency of the NMF algorithm. In
these simulations, NMF is employed to analyze a set of synthetic microarray data with known characteristic
patterns, as well as two biological datasets that have been processed by the SVD approach and are thus
considered gold standards. The simulation experiments show that NMF performs very well on these datasets
and is robust under high noise levels. We found that, like SVD, an important ability distinguishing NMF
and related methods from other analysis methods is the capability to detect weak signals in the data.
Even when the structure of the data does not allow separation of data points, causing clustering
algorithms to fail, NMF can extract biologically meaningful patterns. We conclude that NMF can be used
as a powerful tool to extract biologically meaningful expression patterns from genome-wide expression data.
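The authors' optimized algorithm is not given, but the standard baseline it improves on, Lee-Seung multiplicative updates for V ≈ WH with non-negative factors, can be sketched as follows. Factors W give the characteristic expression patterns; H gives per-sample weights.

```python
import numpy as np

def nmf(V, rank, iters=500, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||V - WH||_F with
    W, H >= 0. This is the classical algorithm, not the authors'
    optimized C variant."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H, keeping it >= 0
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W, keeping it >= 0
    return W, H
```

Because the updates are multiplicative, non-negativity is preserved automatically, which is why NMF can work directly on raw (non-negative) intensity data without normalization, as noted above.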
Imputation of Missing Values in DNA Microarray Gene Expression Data
Hyunsoo Kim (1), Haesun Park(1,2), and Gene H. Golub (3)
(1) Department of Computer Science and Engineering, University of Minnesota;
(2) The National Science Foundation, Arlington, VA; (3) Computer Science Department, Stanford University
Gene expression data sets often contain missing values due to various reasons, e.g. insufficient resolution,
image corruption, dust or scratches on the slides, or experimental error during the laboratory process. Since
it is often very costly or time consuming to repeat the experiment, many algorithms have been developed to
recover the missing values. Moreover, estimating unknown elements in a given matrix has many potential
applications in other fields. In this paper, imputation methods based on the least squares formulation
are proposed to estimate missing values in gene expression data; they exploit local similarity structures
in the data as well as a least squares optimization process. The proposed local least squares imputation
method (LLSimpute) represents a target gene that has missing values as a linear combination of similar
genes. The similar genes are chosen as the $k$ nearest neighbors or the $k$ coherent genes with the
largest absolute Pearson correlation coefficients. A nonparametric version of LLSimpute is obtained by
introducing an automatic $k$-value estimator. In our experiments, the proposed LLSimpute method showed
better performance than other imputation methods, such as $k$-nearest neighbors imputation (KNNimpute)
and an imputation method based on Bayesian principal component analysis (BPCA), on various data sets and
percentages of missing values in the data.
Disease Gene Explorer: Displaying Disease Gene Dependency by Combining Bayesian Networks with Clustering
Qian Diao, We Hu, Hao Zhong, Feng Xue, Yimin Zhang, and Tao Wang
Intel China Research Centre
Constructing gene networks is one of the hot topics in the analysis of the microarray gene expression data.
When combined with the output of disease gene finding, the generated gene networks will give a recommendation
mechanism and an intuitive form for biologists to identify the underlying relationship among those
biomarkers of the disease. In this paper, we present a display system, Disease Gene Explorer, which can
graphically display the dependency among genes, especially those biomarkers of a disease. It combines
Bayesian networks (BN) learning with clustering and disease gene selection. The clustering algorithm is
a novel method based on K-means and the t-test. It performs better overall than standard K-means and is
less dependent on, or even independent of, the initial partition in both temporal and time-course gene
expression data. The BN learning includes BN structure learning for genes within a cluster and module
network learning across the different clusters. We test the system on a colon cancer data set and obtain
some interesting results: most high-scoring biomarkers of the disease are partitioned into one group; the
dependency among these disease genes is displayed as a directed acyclic graph (DAG); and two genes,
Human mRNA for snRNP E protein (X12466) and Human HF.12 gene mRNA (07290), have causal relations to the group.
DNA Microarray Image Analysis Using Active Contour Model
Yuan-Kai Wang and Cheng-Wei Huang
Department of Electronic Engineering, Fu Jen University, Taiwan
The DNA microarray is a tool for examining the expression levels of thousands of genes simultaneously.
However, DNA microarray experiments produce large amounts of data, so automatic and precise analysis
tools are urgently needed. Image analysis is the first important step in the automatic processing of
microarray experiments. In this paper, a novel approach to accurate spot finding in DNA microarrays
using an active contour model is proposed. In our approach, a microarray image is first transformed into
a gray-level image. Wavelet-based profile analysis is then performed to build an orthogonal grid system
for the image, so that each spot is enclosed within a grid cell. The boundary of the grid cell is taken
as the initial contour of the spot. By shrinking the initial contour through energy minimization, a more
precise contour of the spot is located. The approach is tested on microarray images obtained from the
Stanford Microarray Database, and comprehensive experiments show highly encouraging results. Major
contributions of the proposed approach include the following: (1) it handles image rotation without
manual adjustment; (2) detection is robust to irregular spot gaps; (3) it is not affected by variations
in spot color and size; and (4) it performs well for different numbers of input channels. Moreover, the
proposed approach not only handles multiple spots in microarray images but also finds more precise
spot contours.
Meanshift Clustering for DNA Microarray Analysis
Danny Barash and Dorin Comaniciu
Department of Computer Science, Ben-Gurion University
Mean shift clustering is a well-established algorithm that has been applied successfully in image
processing and computer vision. Cluster centers are derived by local mode seeking, identifying maxima
in the normalized density of the data set. Recently, quantum clustering, which closely resembles mean
shift clustering, has been proposed for analyzing microarray expression data. Quantum clustering is
based on physical intuition derived from quantum mechanics: through an iterative gradient descent
procedure, the potential energy V belonging to the Hamiltonian of the time-independent Schrödinger
equation develops minima that are identified with cluster centers. The analogies between the wavefunction
in quantum clustering and the multivariate kernel density estimator in mean shift clustering lead
to closely related formulations. However, the approach toward the minima of the potential in quantum
clustering must be performed separately from the formulation, by gradient descent steps. In contrast,
in mean shift clustering the approach toward the maxima of the normalized density is performed by the
mean shift vector, which is derived directly from the formulation and points in the direction of the
maximum increase in the underlying density. Based on these observations, we propose implementing
mean shift clustering to improve the efficiency of local mode seeking in analyzing expression data.
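The mean-shift iteration described above can be sketched as follows: every point repeatedly moves to the Gaussian-kernel weighted mean of the data around it, i.e. uphill on the kernel density estimate, until it settles at a local mode. Bandwidth and iteration counts below are illustrative parameters.

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, iters=100, tol=1e-6):
    """Move each point along the mean-shift vector (toward the
    kernel-weighted mean of the data) until convergence; the settled
    positions are local modes of the density, i.e. cluster centers."""
    X = np.asarray(X, dtype=float)
    pts = X.copy()
    for _ in range(iters):
        new = np.empty_like(pts)
        for i, p in enumerate(pts):
            d2 = ((X - p) ** 2).sum(axis=1)
            w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # Gaussian kernel weights
            new[i] = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.abs(new - pts).max() < tol:
            pts = new
            break
        pts = new
    return pts
```

Points that converge to the same mode form a cluster; unlike the gradient descent of quantum clustering, no step size needs to be tuned, which is the efficiency point the abstract makes.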
The Dynamic Range of Gene Expressions Depends on Their Ontology
Yizhou Xie, Parthav Jailwalia, Soumitra Ghosh and Xujing Wang
Medical College of Wisconsin
Comprehensive profiling of gene expressions during physiological and/or pathological processes is
becoming a standard research approach; yet quantitative interpretation of data is still difficult.
This problem is mainly due to the current poor understanding of the intrinsic normal variation in
expression of individual genes. It is known that naturally some transcripts are present at relatively
constant levels, while others are expressed at highly variable levels. However, a comprehensive
characterization of the gene-dependent expression variability is still missing. In this study we
investigate the correlation between the normal variation of gene expressions and their gene ontologies.
Three sets of time-series microarray data are utilized that represent three different but typical
physiological/developmental processes: the yeast cell cycle data from Stanford University; the pancreas
perinatal developmental data from the EpconDB; and the apoptosis progression of pancreatic islet b-cells
data from our own lab. We analyze the dynamic range and the variation of expression levels during these
processes for genes in each of the top-level ontologies of biological process, molecular function and
cellular component, as defined by GO. Significant and consistent differences are observed. For example,
transcription factors were found to have a consistently narrower dynamic range of expression in all three
processes. The findings will shed light on study design, statistical evaluation, and data mining of genetic
data including microarray data.
Subspace Clustering for Microarray Data Analysis: Multiple Criteria and Significance Assessment
Hui Fang, Chengxiang Zhai, Lei Liu, and Jiong Yang
University of Illinois at Urbana-Champaign
We present a new clustering method for analyzing microarray data that improves existing approaches in
three aspects -- capturing gene similarities under a subset of gene expression conditions, combining
multiple criteria to capture trend similarity, and assigning statistical significance to detected clusters.
Our algorithm first aims at discovering non-traditional subspace clusters. A subspace cluster is a subset
of genes that exhibit similar expression patterns over a subset of conditions. To capture biologically
meaningful clusters, we apply multiple criteria for constraining a subspace cluster. Specifically, we
define our subspace cluster as a submatrix satisfying two distinct constraints -- fluctuation constraint
and trend constraint. The fluctuation constraint requires that, for all genes in a cluster, the difference
in expression levels between any two conditions be similar. The trend constraint captures the correlation
between genes, i.e. when the expression level of one gene goes up under some conditions, the expression level
of the correlated genes should also go up accordingly. In general, we are advocating constraining a cluster
with multiple criteria capturing biological requirements from different perspectives. Our second method is
to exploit the original sample data points to assess the statistical significance of the discovered clusters.
We propose a method to compute the confidence level for each generated cluster based on the original variances
of cell values. The clusters can then be ranked according to their confidence levels. The proposed new method
has a great potential for helping biologists discover meaningful gene clusters through generating more coherent
clusters and ranking clusters based on statistical significance.
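The two constraints above can be stated as executable checks on a candidate submatrix (genes × conditions). The tolerance `delta` is a hypothetical parameter, and these checks are one plausible reading of the constraints, not the authors' exact definitions.

```python
import numpy as np

def fluctuation_ok(sub, delta):
    """Fluctuation constraint: for every pair of conditions, the change in
    expression between them must agree across all genes in the submatrix
    to within `delta` (an assumed tolerance)."""
    diffs = sub[:, :, None] - sub[:, None, :]       # gene x cond x cond
    spread = diffs.max(axis=0) - diffs.min(axis=0)  # disagreement across genes
    return bool((spread <= delta).all())

def trend_ok(sub):
    """Trend constraint: between consecutive conditions, all genes move in
    the same direction (up together, down together)."""
    signs = np.sign(np.diff(sub, axis=1))
    return bool((signs == signs[0]).all())
```

A subspace-cluster search would enumerate candidate gene/condition subsets and keep those passing both checks, then rank the survivors by the confidence levels described above.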
Comparative Analysis of Gene Expression and DNA Copy Number Data for Pancreatic and Breast
Cancers Using an Orthogonal Decomposition
John A. Berger, Sampsa Hautaniemi, and Sanjit K. Mitra
UCSB
We present the use of a comparative mathematical framework for simultaneously analyzing relations
between DNA copy number and mRNA-expression data. For many diseases, the causes of over-expression
are typically unknown, but current studies show that copy number aberrations may be strong candidates
for driving gene over-expression. The generalized singular value decomposition (GSVD) is utilized here
to formulate similarities in these measurements and locate specific biological processes present in
both inputs. Accordingly, the problem of how to systematically analyze multiple input data is presented.
The GSVD locates relevant gene influences common to only gene expression, copy numbers, or both
measurements in conjunction. These groups are graphically reported and gene ontology annotations are
used as a functional assessment of the groupings. Statistical analysis using combinatorics is presented
to assess and estimate probabilistic significance. We illustrate this method for two independently
published studies of pancreatic cancer and breast cancer, in which public gene expression and DNA copy
number data are provided and measured across numerous tumor cell lines. For both cancer data sets, we
locate specific inference patterns and processes common to both genome-wide input measurements across
10k unique clones. Our results suggest that several genes are strongly influenced by both copy number
amplification and gene over-expression; lists of about 100 such genes were compared to the cancer
literature for both types of cancer. Accordingly, these genes are likely to be closely
involved in cancer development and progression, and could be promising targets for therapeutic intervention.
High-throughput Microarray-Based Genotyping
Geoffrey Yang, Ming-Hsiu Ho, and Earl Hubbell
Affymetrix, Inc.
A high throughput genotyping platform that scores ~10,000 single-nucleotide polymorphisms (SNPs) per
individual on a single GeneChip high-density oligonucleotide microarray has been developed to conduct
studies to elucidate the genetic basis for complex diseases. Currently, the genotyping of individual
SNPs relies on the summary statistics (based on the observed intensities of probes on the microarray)
for the entirety of the sample set. A clustering scheme of those statistics across hundreds of
individual DNA samples is obtained to make individual genotyping calls (Liu et al., Algorithms for Large
Scale Genotyping Microarrays. Bioinformatics. 2003 19: 2397-2403.). In contrast to the current approach,
the method in this work makes individual genotyping calls by finding the minimum residual, i.e. highest
likelihood, among four possible states corresponding to the three genotypes and no-call. Initially, the
residual is calculated based on a given set of probe affinities for that individual sample. The
multitude of samples is then used iteratively to refine the probe affinities, sample concentrations,
and background intensities. These refined parameters yield an improved call rate while maintaining high
accuracy. Comparison of a subset of SNP genotypes against calls determined by single-base extension
showed > 99% concordance. Mendelian inheritance errors were also checked on SNP genotypes determined in
samples from CEPH pedigrees.
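The minimum-residual calling idea can be illustrated with a deliberately simplified model: compare the observed A-allele signal fraction against idealized expectations for the three genotypes, and return no-call when even the best model fits poorly. The allele-fraction models and residual threshold below are illustrative, not Affymetrix's actual parameters.

```python
# Idealized expected A-allele signal fractions for the three genotypes.
# These values and the residual cutoff are illustrative assumptions.
MODELS = {"AA": 1.0, "AB": 0.5, "BB": 0.0}

def call_genotype(a_signal, b_signal, max_residual=0.1):
    """Call the genotype whose model leaves the minimum residual against
    the observed A-allele fraction; fall back to NoCall when the best
    residual still exceeds the threshold."""
    total = a_signal + b_signal
    if total <= 0:
        return "NoCall"
    frac = a_signal / total
    best, res = min(((g, abs(frac - f)) for g, f in MODELS.items()),
                    key=lambda t: t[1])
    return best if res <= max_residual else "NoCall"
```

The iterative refinement described above would re-estimate the per-probe affinities feeding `a_signal`/`b_signal` across all samples, shrinking residuals and converting some no-calls into confident calls.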