Analysis of Time Course Microarray Experiments
Claudia Angelini
Istituto per le Applicazioni del Calcolo
[email protected]
BioinfoGRID Symposium 2007, Thursday 13 December 2007, Milan
Metodi e Strumenti per l’Analisi dei Dati di Espressione Genica (Methods and Tools for Gene Expression Data Analysis), Tuesday 18 December 2007, CNR-Naples

Outline
• Background
• From steady-state to time-course microarray experiments
• From biological to statistical questions
• Detection and estimation of gene expression profiles
• Statistical methods for time course microarrays and their software: EDGE, TimeCourse, BATS
• Real data application
• Other related problems
• Conclusion

Background on Gene Expression
• Each cell contains a complete copy of the organism's genome, yet cells are of many different types and states.
• What makes cells different is the way they synthesize proteins and develop their biological functions.
• Understanding the differences among types of cells, and the way they react to a given stimulus or treatment, is the key to understanding functional genomics.
• In most cases the difference among biological samples is proportional to the “gene expression” (i.e., the transcription level or abundance; transcription is the step during which DNA is transcribed into mRNA).
• Differential gene expression: when, where, and how much each gene is expressed.

Background on Microarray
Microarray experiments are high-throughput biological assays that measure the abundance of DNA or mRNA sequences in different types of cell samples (several thousand sequences simultaneously) and hence yield information on “gene expression levels”.
Based on hybridization.

Background on Microarray
Gene expression assays:
• Spotted cDNA arrays (Brown/Botstein)
• Short oligonucleotide arrays (Affymetrix)
• Long oligonucleotide arrays (Agilent Inkjet)
• Fibre optic arrays (Illumina BeadChips)
• Etc.
In the following we focus on cDNA microarray experiments; however, most of the statistical approaches also apply to other platforms.
The fluorescent spectrum is a picture of the relative DNA/mRNA abundance in the two samples under a given condition (or at a specific time).

cDNA microarray: experiment description for case-control design
[Figure: “Treated Sample” and “Control Sample” hybridized on the same array]
• GREEN represents high Control hybridization.
• RED represents high Treated Sample hybridization.
• YELLOW represents a combination of Control and Treated Sample where both hybridized equally.
• BLACK represents areas where neither the Control nor the Treated Sample hybridized.
Note (“statistical level”): after some pre-processing, $\mathrm{LogRatio} = \log_2(\mathrm{Red}/\mathrm{Green})$ is computed for each spot (gene).
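The per-spot log-ratio from the note above can be sketched in a few lines. This is only an illustration: the intensity values are made up, and the small `eps` offset that guards against zero intensities is an assumption, not something stated in the slides.

```python
import numpy as np

def log_ratio(red, green, eps=1e-9):
    """Return log2(red / green) for each spot.

    eps is a hypothetical guard against zero intensities.
    """
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    return np.log2((red + eps) / (green + eps))

# Hypothetical pre-processed intensities for four spots.
red = [1200.0, 300.0, 800.0, 50.0]     # treated sample channel
green = [300.0, 1200.0, 800.0, 50.0]   # control sample channel
m = log_ratio(red, green)

# Sign convention: positive values mean higher signal in the treated sample
# (red spot), negative values mean higher signal in the control (green spot),
# and values near zero mean equal hybridization (yellow spot).
```

Working on the log2 scale makes a two-fold increase and a two-fold decrease symmetric around zero, which is why the log-ratio rather than the raw ratio is the usual input to the statistical methods discussed next.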
Static microarray experiments
Experiments are replicated (either technical or biological replicates).

          Array 1   Array 2   …   Array k
Gene 1     0.133     0.055    …   -0.210
Gene 2     1.77      1.13     …    1.54
Gene 3    -2.06     -1.98     …   -1.76
  :          :         :      …     :
Gene N    -0.77      0.53     …   -0.12

Statistical problems:
• Testing
• Estimation
• Classification
• Clustering
• Gene networks
Several methods have been proposed and implemented in standard software.

More general experimental designs
The expression level of genes in a given cell can be influenced by a pathological status or by a pharmacological or medical treatment.
The response to a given stimulus usually differs from gene to gene and may depend on time; in fact, gene expression is often a dynamic process.
Time course microarray experiments can be carried out in order to study these dynamics.
Study of dynamic biological processes (Source: Ernst & Bar-Joseph, 2006):
• Cell cycle
• Response to temperature/environment changes or treatments
• Developmental studies, immune response
• Etc. (e.g., age or dose response)
About 30% of microarray experiments are time course experiments.

Some comments
• About 70-80% of microarray time series experiments are short: 5-10 time points and very few replicates.
• Often samples are not taken on a regularly spaced grid.
• Cost of microarrays.
• Limited availability of biological material.
• Presence of missing data.
• High level of noise in the data (even after preprocessing).
Statistical problems related to time series microarray experiments:
• Automatic detection of periodic (cell cycle) genes
• Automatic detection of differentially expressed profiles (focus of the talk)
• Estimation (focus of the talk)
• Clustering (or also classification)
• Gene networks

Testing and Estimation in time course experiments
Several experimental designs can be considered; each of them may require tailored statistical methods.
[Figure: “one (statistical) sample” design and “two (statistical) samples” design]
“Simple and preliminary” statistical questions, given the observations:
• For which indices i are the curves different from zero? Or, with two samples, for which indices i do the two curves differ from each other?
• If the curve is different from zero (or from the curve of the other sample), how can we estimate the “treatment effect”, i.e., the temporal response of each gene to the treatment?
“Standard” Software for Microarray Data Analysis
Most analysis software is designed for “static gene expression data”:
• It does not take advantage of the sequential information in time series data.
• It is often invariant under permutation of the time points.
• It does not use the values of the time points explicitly.
Statistical methods not specifically designed for time course data can lack statistical power.

“Standard” methods for time series analysis
On the other hand, most methods for time series data are based on transformations such as Fourier or wavelet transforms, or on asymptotic approaches. They cannot be applied to microarray experiments because of the limited number of observations available.
Moreover, when testing significance a “global” answer is required, and since thousands of curves are compared simultaneously, some “multiple comparisons control procedure” is mandatory (i.e., to control the false positives).
[Figure: Estimation of GENE6485] Global answer vs. point-by-point analysis.
NEED FOR “NEW” STATISTICAL METHODS AND SPECIFIC SOFTWARE

“New” approaches
Recently, new methods have been proposed in the literature for the identification and the estimation of time-course gene expression profiles from microarray data. Among others:
• EDGE
• TimeCourse
• BATS
• Others are currently under development…
These methods are based on different assumptions and can usually be applied in different contexts.
For a comparison among different methods:
M. Mutarelli, L.
Cicatiello, M. Ravo, O.L. Grober, A. Facchiano, C. Angelini, A. Weisz (2007). Time course whole-genome microarray analysis of estrogen effects on hormone-responsive breast cancer cells. BMC Bioinformatics (to appear).

What is EDGE?
EDGE is user-friendly software for the Extraction and Analysis of Differential Gene Expression. It implements the novel approach proposed in Storey et al. (2005).
EDGE allows the user to automatically identify and rank differentially expressed genes, but it does not explicitly estimate their expression profiles. It also controls the false discovery rate (FDR).

What is EDGE used for?
EDGE can be applied to both ‘one sample’ and ‘two sample’ designs, and to both ‘longitudinal’ and independent-samples data.
EDGE implements a functional approach in which gene expression is expanded in a B-spline basis with a fixed (common) degree; an F-test similar to the one used in ANOVA models serves as the test statistic, the p-values are estimated by bootstrap, and an FDR-type control is applied.
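The pipeline just described (basis expansion, F-type statistic, bootstrap p-value) can be sketched for a single gene in the one-sample setting. This is a minimal illustration under stated assumptions, not EDGE's actual implementation: a polynomial basis stands in for the B-spline basis, the statistic is assumed to have the form (SS0 - SS1) / SS1 with the null curve identically zero, the null distribution is obtained by resampling residuals of the alternative fit, and all data are simulated.

```python
import numpy as np

def f_statistic(t, z, degree=3):
    """F-type statistic for H0: curve = 0 against a least-squares curve fit.

    A polynomial basis is used as a simple stand-in for a B-spline basis.
    """
    t = np.asarray(t, dtype=float)
    z = np.asarray(z, dtype=float)
    u = (t - t.min()) / (t.max() - t.min())         # rescale time to [0, 1]
    X = np.vander(u, degree + 1, increasing=True)   # basis design matrix
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)    # least-squares coefficients
    resid = z - X @ coef
    ss0 = np.sum(z ** 2)        # sum of squares under the null (curve = 0)
    ss1 = np.sum(resid ** 2)    # sum of squares under the alternative
    return (ss0 - ss1) / ss1, resid

def bootstrap_pvalue(t, z, degree=3, n_boot=500, seed=0):
    """Estimate a p-value by resampling residuals around the null curve."""
    rng = np.random.default_rng(seed)
    f_obs, resid = f_statistic(t, z, degree)
    f_null = np.empty(n_boot)
    for b in range(n_boot):
        # Null data set: zero curve plus residuals resampled with replacement.
        z_star = rng.choice(resid, size=resid.size, replace=True)
        f_null[b], _ = f_statistic(t, z_star, degree)
    p = (1 + np.sum(f_null >= f_obs)) / (n_boot + 1)
    return f_obs, p

# Simulated example: 11 time points, 2 replicates each.
rng = np.random.default_rng(42)
t = np.repeat([1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32], 2).astype(float)
z_resp = 1.5 * np.sin(2 * np.pi * t / 32) + rng.normal(scale=0.2, size=t.size)
z_flat = rng.normal(scale=0.2, size=t.size)

f_resp, p_resp = bootstrap_pvalue(t, z_resp)   # gene with a real time profile
f_flat, p_flat = bootstrap_pvalue(t, z_flat)   # gene with noise only
```

Because the p-value is a count over `n_boot` resampled data sets, it can only take values on a grid of spacing 1 / (n_boot + 1); this is the source of the “granularity” issue and of the computational cost when thousands of genes are tested.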
The statistical model in EDGE
• $i = 1, \ldots, N$: number of genes
• $j = 1, \ldots, n$: number of time points
• $k = 1, \ldots, k_i$: number of replicates
Functional approach: the expression curve of each gene is expanded in a B-spline basis of degree p; the coefficients are estimated from the data Z using least squares.
For the “one sample” problem, with $\mu_i(t)$ the mean expression curve of gene $i$:
$H_{0,i}: \mu_i(t) \equiv 0$ vs. $H_{1,i}: \mu_i(t) \not\equiv 0$

Testing significance with EDGE
• The test statistic compares the sum of squares under the null with the sum of squares under the alternative.
• The p-values are estimated using bootstrap/permutations.
• Automatic detection is carried out by controlling q-values.
For the “two sample” problem:
$H_{0,i}: \mu_{1i}(t) = \mu_{2i}(t)$ vs. $H_{1,i}: \mu_{1i}(t) \neq \mu_{2i}(t)$

What else about EDGE?
EDGE was the first to propose a truly functional approach and was found to perform well in several applications; however:
• Since p-values are evaluated via bootstrap, EDGE can be computationally intensive and can suffer from the so-called “granularity” problem.
• EDGE does not allow missing values; however, it implements a KNN procedure to fill them in.
• EDGE imposes that all genes are expanded in a B-spline basis with fixed degree p (the degree can be estimated from the data or chosen by the user).
EDGE has been developed at the University of Washington.
Leek JT, Monsen EC, Dabney AR, and Storey JD. (2006) EDGE: Extraction and analysis of differential gene expression. Bioinformatics, 22: 507-508.
Storey JD, Xiao W, Leek JT, Tompkins RG, and Davis RW. (2005) Significance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences, 102: 12837-12842.
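EDGE selects genes by thresholding q-values. As a simpler, related illustration of FDR-type control, the Benjamini-Hochberg step-up procedure can be sketched as below. Note the swap: this is not Storey's q-value estimator, which additionally estimates the proportion of true null hypotheses, and the p-values are made up.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean mask of the hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                           # indices of sorted p-values
    thresholds = alpha * np.arange(1, m + 1) / m    # alpha * rank / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank meeting its threshold
        rejected[order[:k + 1]] = True     # reject everything up to that rank
    return rejected

# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
mask = benjamini_hochberg(pvals, alpha=0.05)
n_selected = int(mask.sum())
```

With several thousand genes tested at once, a per-gene threshold of 0.05 would let through hundreds of false positives, which is why a global procedure of this kind is applied to the whole list of p-values.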
EDGE web site
EDGE can be downloaded at http://www.biostat.washington.edu/software/jstorey/edge//index.php; it requires R > 2.5 and Bioconductor.
EDGE is free for academic, non-commercial use.

What is TimeCourse?
TimeCourse is a new R package that implements the novel Multivariate Empirical Bayes approach described in Tai and Speed (2006).
• TimeCourse only allows the user to rank the genes' expression profiles. It does not provide any automatic cutoff for selecting differentially expressed genes, nor does it control the multiple comparisons error.
• TimeCourse can be applied to both ‘one sample’ and ‘two sample’ designs. However, in the latter case it is applicable only to data sets with identical time grids.
• TimeCourse is specifically designed for purely ‘longitudinal’ data.

The statistical model in TimeCourse
$Z_{ik} = (z_{ik,1}, \ldots, z_{ik,n})^T$: values observed at $t_1, \ldots, t_n$ for gene $i$ and replicate/individual $k$.
• $i = 1, \ldots, N$: number of genes
• $k = 1, \ldots, K$: number of replicates
Note: TimeCourse requires the number of replicates to be the same at every time point; it can differ from gene to gene.
Missing values are not allowed.
Designed for longitudinal studies, i.e., the “same individual” k is assumed to be recorded at all time points.
Note: in the “two sample” design the grid has to be the same for all samples and all replicates.

Multivariate Bayesian model in TimeCourse
Note: the Bayesian approach is elicited in the time domain! The “values” of the time points do not enter the model.
$Z_{ik} \mid \mu_i, \Sigma_i \sim N(\mu_i, \Sigma_i)$
$\mu_i \mid \Sigma_i \sim \pi_0\, \delta(0, \ldots, 0) + (1 - \pi_0)\, N(0, \eta^{-1} \Sigma_i)$
$\Sigma_i \sim \text{Inv-Wishart}_{\nu}\big((\nu \Lambda)^{-1}\big)$
The point mass at zero corresponds to a gene “not affected” by the treatment; the Gaussian component corresponds to a gene “affected” by the treatment.
All hyper-parameters can be estimated from the data.

Testing significance with TimeCourse
For the “one sample” problem:
$H_{0,i}: \mu_i = 0,\ \Sigma_i > 0$ vs. $H_{1,i}: \mu_i \neq 0,\ \Sigma_i > 0$
Model + observed data + prior information → posterior distribution. The posterior distribution is analytically known.
TimeCourse ranks the gene expression profiles using the Hotelling $T^2$ statistic if all genes have the same number of replicates, or the MB-statistic if genes have different numbers of replicates. Explicit forms of both the Hotelling and the MB statistics are given in Tai and Speed (2006) (not shown here for brevity).
For the “two sample” problem:
$H_{0,i}: \mu_{1i} = \mu_{2i},\ \Sigma_i > 0$ vs. $H_{1,i}: \mu_{1i} \neq \mu_{2i},\ \Sigma_i > 0$

What else about TimeCourse?
• TimeCourse only ranks the genes' expression profiles; it does not automatically select the ones that are “differentially expressed”.
• All the parameters of the method can be estimated from the data or chosen by the user.
• The ‘two sample’ design is applicable only to data sets with identical time grids.
• No missing data are allowed (a preliminary procedure for filling in the missing values or filtering the data is necessary).
• Since the Multivariate Bayesian approach is elicited in the physical domain, the time variable does not enter the model explicitly.
TimeCourse has been developed by the Speed research group at Berkeley.
Tai, Y.C. and Speed, T.P. (2006) A multivariate empirical Bayes statistic for replicated microarray time course data. Annals of Statistics, 34, 2387-2412.
Tai, Y.C. and Speed, T.P. (2007) On the gene ranking of replicated microarray time course data. Tech. rep. 735, Berkeley University.

What is BATS?
BATS is user-friendly software for Bayesian Analysis of Time Series microarray experiments, based on the novel functional Bayesian approach proposed in Angelini et al. (2007).
BATS allows the user to automatically identify and rank differentially expressed genes, to control the multiple comparisons error, and to estimate their expression profiles.
BATS successfully handles various technical difficulties that arise in microarray time-course experiments, such as a small number of available observations, non-uniform sampling intervals, the presence of missing or multiple data, and temporal dependence between observations for each gene.
BATS is suited for the “one sample” statistical design; the “two sample” statistical design (without any grid restriction) is under implementation.
The statistical model in BATS
[Figure: Estimation of GENE6485]
$Z_i^{j,k} = S_i(t_j) + \varepsilon_i^{j,k}$
$Z_i^{j,k} = \log_2 \dfrac{\text{Treated at time } t = t_j \text{ on array } k}{\text{Control at time } t = 0 \text{ on array } k}$
where we assume
$S_i(t) = \sum_{l=0}^{L_i} c_i^{l}\, \phi_l(t)$
• $S_i(t)$: the “true” functional profile of gene $i$. The gene expression time profile is a smooth curve expanded in an orthogonal system on $[0, T]$.
• $i = 1, \ldots, N$: number of genes
• $j = 1, \ldots, n$: number of time points
• $k = 1, \ldots, k_i^{j}$: number of replicates
• Noise: i.i.d. with $E(\varepsilon_i^{j,k}) = 0$ and $\mathrm{Var}(\varepsilon_i^{j,k}) = \sigma^2$
For the “one sample” problem:
$H_{0,i}: S_i(t) \equiv 0$ vs. $H_{1,i}: S_i(t) \not\equiv 0$, where $S_i(t) \equiv 0 \iff c_i = 0$

Bayesian model in BATS
We assume that genes are conditionally independent:
$Z_i \mid L_i, c_i, \sigma^2 \sim N(D_i c_i, \sigma^2 I_{M_i})$, i.e., $Z_i = D_i c_i + \varepsilon_i$
and we place a prior on the unknown parameters:
$L_i \sim g_\lambda(\cdot, L_{\max})$, i.e., $\mathrm{Pois}^*(\lambda, L_{\max})$, a Poisson distribution truncated at $L_{\max}$
$c_i \mid L_i, \sigma^2 \sim \pi_0\, \delta(0, \ldots, 0) + (1 - \pi_0)\, N(0, \sigma^2 \tau_i^2 Q_i^{-1})$
• The point mass at zero corresponds to a gene “not affected” by the treatment; the Gaussian component corresponds to a gene “affected” by the treatment.
• $\tau_i^2$: gene-specific variance
• $\pi_0$: prior probability of not being affected by the treatment (to be estimated from the data)

Noise model in BATS
$\sigma^2 \sim \rho(\sigma^2)$. We distinguish 3 models:
• Model 1) $\rho(\sigma^2) = \delta(\sigma^2 - \sigma_0^2)$, i.e., the marginal distribution of the noise is Gaussian
• Model 2) $\rho(\sigma^2) = IG(\gamma, b)$, i.e., the marginal distribution of the noise is Student T
• Model 3) $\rho(\sigma^2) = c_\mu\, \sigma^{M_i - 1} e^{-\sigma^2 \mu / 2}$, i.e., the marginal distribution of the noise is double-exponential
It is therefore possible to model “non-Gaussian noise”.
Model + observed data + prior information → posterior distribution. In cases 1)-3) it is analytically known.
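The matrix form $Z_i = D_i c_i + \varepsilon_i$ used in the BATS slides above can be made concrete for one gene. The sketch below rests on several assumptions that are not in the slides: Legendre polynomials rescaled to [0, T] are taken as the orthogonal system, the degree L is fixed rather than random, plain least squares replaces the Bayesian posterior mean, and the coefficients and noise are simulated. The time grid is the non-uniform grid of the case study presented later in the talk.

```python
import numpy as np
from numpy.polynomial import legendre

def design_matrix(times, T, L):
    """Build D: one row per observation, columns phi_0..phi_L evaluated at the
    observation times (Legendre polynomials mapped from [-1, 1] to [0, T])."""
    x = 2.0 * np.asarray(times, dtype=float) / T - 1.0
    return legendre.legvander(x, L)

# Non-uniform grid in hours, with 3 replicates at t = 2, 8, 16 and 2 elsewhere.
grid = [1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32]
reps = [3 if t in (2, 8, 16) else 2 for t in grid]
times = np.repeat(grid, reps)        # one entry per array, 25 in total

T, L = 32.0, 3                       # L is fixed here for illustration only
D = design_matrix(times, T, L)       # shape (25, L + 1)

# Simulated gene: hypothetical coefficients plus Gaussian noise.
rng = np.random.default_rng(1)
c_true = np.array([0.0, 1.0, -0.5, 0.3])
z = D @ c_true + rng.normal(scale=0.05, size=times.size)

# Least-squares estimate, a stand-in for the posterior mean of c_i.
c_hat, *_ = np.linalg.lstsq(D, z, rcond=None)
profile = D @ c_hat                  # fitted S_i(t) at the observed times
```

Replicated and missing observations only add or remove rows of D, which is one way to see why this formulation copes with irregular grids and unequal numbers of replicates.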
Testing significance with BATS
$H_{0i}: c_i = 0$ vs. $H_{1i}: c_i \neq 0$, $i = 1, \ldots, N$
In general, given the posterior distribution, testing can be carried out by looking at the posterior probability of being significant or, equivalently, at the Bayes Factor.
The degree is estimated as $\hat{L}_i = E(L_i \mid Z_i)$ or $\hat{L}_i = \arg\max_{L_i} p(L_i \mid Z_i)$.
For the models under consideration the Bayes Factor (BF) can be evaluated analytically.

Multiple comparisons control with BATS
In order to account for multiple comparisons, BATS implements the Bayesian Multiple Testing Procedure by Abramovich & Angelini (2006). The procedure is based on “ordered Bayes Factors” and is similar in spirit to the Benjamini and Hochberg FDR control.

Estimating the treatment effect with BATS
$\hat{c}_i = E(c_i \mid Z_i, \hat{L}_i)$

BATS website
BATS Version 1.0 is freely downloadable at http://www.na.iac.cnr.it/~bats/

BATS: Main windows
• “About” and “Help” buttons are available for each section.
• A section for running a simulation study.
• A section for analyzing a given dataset.
• Several additional tools for filtering data, displaying profiles and comparing results.

What about BATS
BATS can carry out analyses with both simulated and real experimental data.
BATS is written in MATLAB; executable files for Windows, Linux and Macintosh are now available.
BATS is currently implemented for a single processor; however, future releases of the software will also include versions for multi-processor workstations.
BATS has been developed at IAC-CNR and is part of the CNR-Bioinformatics Interdepartmental Project.
C. Angelini, D. De Canditiis, M. Mutarelli, M. Pensky (2007). Bayesian Approach to Estimation and Testing in Time Course Microarray Experiments. Statistical Applications in Genetics and Molecular Biology, vol. 6, iss. 1, article 24.
C. Angelini, L. Cutillo, D. De Canditiis, M. Mutarelli, M. Pensky (2007). BATS: A Bayesian User-Friendly software for analyzing time series microarray data. (Technical report IAC CNR 331/07)
C. Angelini, D. De Canditiis, M. Pensky (2008). Bayesian models for the two-sample time-course microarray experiments. (Technical report IAC CNR, in preparation)
M. Mutarelli, L. Cicatiello, M. Ravo, O.L. Grober, A. Facchiano, C. Angelini, A. Weisz (2007). Time course whole-genome microarray analysis of estrogen effects on hormone-responsive breast cancer cells. BMC Bioinformatics (to appear).

Time course experiments: a Case Study
The aim of the experiment is to identify estrogen-responsive genes in a human breast cancer cell line.
• Control sample: ZR-75.1 (human breast cancer cells)
• Treated sample: ZR-75.1 cells stimulated with a mitogenic dose of 17ß-estradiol
• Control samples were always taken at time t = 0.
• Treated samples were taken at times $t_j \in [0, T]$, $j = 1, \ldots, n$.
• For each time point $t_j$ the experiment was replicated $k_j$ times.
Biological questions: Which genes are activated or repressed due to the treatment? And if a gene is affected by the treatment, what is the treatment effect?
Cicatiello et al. (2004) dataset description
In the experiment, ZR-75.1 cells were stimulated with a mitogenic dose of 17ß-estradiol after 5 days of starvation on a hormone-free medium, and samples were taken after t = 1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32 hours (non-regular grid), for a total of 11 time points covering the completion of a full mitotic cycle in hormone-stimulated cells.
For each time point at least 2 replicates were available (3 replicates at t = 2, 8, 16 hours).
After suitable filtering and preprocessing (Yang et al., 2002; Cui et al., 2002), N = 8161 genes were analyzed by our method in order to detect estrogen-responsive genes. (Note that about 350 genes had at least one missing value.)
The normalized dataset is included in BATS as an example for a guided analysis.
Cicatiello, L., Scarfoglio, C., Altucci, L., Cancemi, M., Natoli, G., Facchiano, A., Iazzetti G., Calogero, R., Biglia, N., De Bortoli, M., Sfiligol, C., Sismondi, P., Bresciani, F. and Weisz, A. (2004). A genomic view of estrogen actions in human breast cancer cells by expression profiling of the hormone-responsive transcriptome. Journal of Molecular Endocrinology, 32, 719-775.

Results using BATS
BATS is very robust with respect to the list of genes detected as significant: 574 genes were common to all 28 lists, while 958 genes were selected by at least one combination of methods/parameters.
Compared with the 344 genes selected by hand in Cicatiello et al. (2004), the list of 574 common genes includes 270 of them; among the remaining 74 genes, 16 were filtered out in our analysis because of a more stringent quality selection before processing the data.
On the other hand, 309 out of 344 were selected by at least one combination.
Interestingly, 17 out of the 304 newly selected genes were replicate spots of genes already selected in Cicatiello et al. (2004), and most of the remaining ones are known to be involved in biological processes related to estrogen response, such as cell cycle and cell proliferation (AREG, NOLC1, cyclin D1), DNA replication (MCM7, RFC5), mRNA processing (SFRS1) and lipid metabolism (APOD and LDHA).

Comparisons with EDGE and TimeCourse
We compare the results of the BATS analysis with the newly available, user-friendly software EDGE (Bioinformatics, 2006) and with the R package timecourse (Tai & Speed, Annals of Statistics, 2006).
On real data BATS shows a much wider overlap with the “biologist-inspired selection” than EDGE and the R timecourse package.
Similar results are also confirmed by simulations using FDR, FNR, etc. as “goodness” measures.

Other related problems: Clustering
The identification of genes that respond to a given treatment is often a preliminary step towards answering other questions of interest.
Biological question: Which genes show a similar response to the treatment?
Several methods have been proposed for gene clustering; however, very few of them are designed for time course microarray data. As a consequence, most of the information contained in the data cannot be properly used and the results are often not stable.
Heard, N. A., Holmes, C. C., Stephens, D. A., Hand, D. J. and Dimopoulos, G. (2005) Bayesian Coclustering of Anopheles Gene Expression Time Series: A Study of Immune Defense Response To Multiple Experimental Challenges.
Proceedings of the National Academy of Sciences USA, 102(47), 16939-16944.
Heard, N.A., Holmes, C.C., and Stephens, D. A. (2006). A quantitative study of gene regulation involved in the Immune response of Anopheline Mosquitoes: An application of Bayesian hierarchical clustering of curves. J. Amer. Statist. Assoc., 101, 18-29.

Clustering
SplineClust: a “functional” Bayesian approach with the possibility of estimating the number of clusters from the posterior distribution, but:
• No missing data are allowed.
• Only one observation per time point is allowed.
• Same degree for all functions.
• Computationally fast, but at a price: it uses hierarchical clustering (often not optimal).
• No “goodness” measure is available.
There is still a shortage of specifically designed methods and of careful analyses of their performance.

Conclusion
Time-course microarray experiments are becoming extremely popular as a tool for investigating gene expression dynamics; however, they pose new challenges to statisticians and computer scientists, who have to develop specifically designed tools for handling and analyzing them.
We have presented and compared several currently available methods, and the related software, for analyzing time course microarray data, with particular focus on the problem of the automatic identification and estimation of gene expression profiles.
For any information contact me: [email protected]