Time series analysis for microarrays - ICAR-CNR

Analysis of Time Course Microarray Experiments
Claudia Angelini
Istituto per le Applicazioni del Calcolo
[email protected]
Metodi e Strumenti per l’Analisi dei Dati di Espressione Genica (Methods and Tools for Gene Expression Data Analysis), Tuesday 18 December 2007, CNR-Naples
BioinfoGRID Symposium 2007, Thursday 13 December 2007, Milan
Outline
• Background
• From steady-state to time-course microarray experiments
• From biological to statistical questions
• Detection and estimation of gene expression profiles
• Statistical methods for time-course microarrays and their software: EDGE, TimeCourse, BATS
• Real data application
• Other related problems
• Conclusion
Background on Gene Expression
Each cell contains a complete copy of the organism's genome, yet cells come in many different types and states.
What makes cells different is the way they synthesize proteins and carry out their biological functions. Understanding the differences among cell types, and the way they react to a given stimulus or treatment, is the key to functional genomics.
• In most cases the difference among biological samples is reflected in “gene expression”, i.e., the transcription level: the abundance of the mRNA produced when DNA is transcribed.
• Differential gene expression describes when, where, and how much each gene is expressed.
Background on Microarray
Microarray experiments are high-throughput biological assays that measure the abundance of DNA or mRNA sequences in different types of cell samples, for several thousand sequences simultaneously, and hence yield information on “gene expression levels”.
The technology is based on hybridization.
Background on Microarray
Gene Expression assays
Spotted cDNA arrays (Brown/Botstein)
Short oligonucleotide arrays (Affymetrix)
Long oligonucleotide arrays (Agilent Inkjet)
Fibre optic arrays (Illumina BeadChips)
Etc.
In the following we focus on cDNA microarray experiments; however, most of the statistical approaches apply to other platforms.
The fluorescence spectrum is a picture of the relative DNA/mRNA abundance in the two samples under a given condition (or at a specific time).
cDNA- microarray: experiment description for case-control design
[Figure: a “Treated Sample” and a “Control Sample” hybridized on the same array]
Note: after some pre-processing, $\mathrm{LogRatio} = \log_2(\mathrm{Red}/\mathrm{Green})$ is computed for each spot (gene); this is the input at the “statistical level”.
GREEN represents High Control hybridization
RED represents High Treated Sample hybridization
YELLOW represents a combination of Control and Treated
Sample where both hybridized equally.
BLACK represents areas where neither the Control nor the Treated Sample hybridized.
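The log-ratio computed for each spot can be illustrated with a minimal sketch. The intensities and gene names below are hypothetical, and the sketch assumes both channels are already background-corrected and positive; real pipelines also normalize the two channels first.

```python
import math

def log_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive after background correction")
    return math.log2(red / green)

# Hypothetical intensities: (red = treated channel, green = control channel).
spots = {
    "gene_up": (2400.0, 600.0),     # treated signal dominates (red spot)
    "gene_flat": (800.0, 800.0),    # equal hybridization (yellow spot)
    "gene_down": (150.0, 1200.0),   # control signal dominates (green spot)
}

ratios = {name: log_ratio(r, g) for name, (r, g) in spots.items()}
for name, value in ratios.items():
    print(name, value)
```

A positive value indicates higher abundance in the treated sample, a negative value higher abundance in the control, and zero indicates equal hybridization.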
Static microarray experiments
Experiments are replicated (either technical or biological replicates)
          Array 1   Array 2   …   Array k
Gene 1     0.133     0.055    …   -0.210
Gene 2     1.77      1.13     …    1.54
Gene 3    -2.06     -1.98     …   -1.76
:           :         :       …     :
Gene N    -0.77      0.53     …   -0.12
Statistical problems
• Testing
• Estimation
• Classification
• Clustering
• Gene networks
Several methods have been proposed and implemented in standard software.
More general experimental designs
The expression level of genes in a given cell can be influenced by a pathological status or by a pharmacological or medical treatment.
• The response to a given stimulus usually differs from gene to gene and may depend on time; gene expression is often a dynamic process.
• Time-course microarray experiments can be carried out in order to study these dynamics.
Study of dynamic biological processes
• Cell cycle
Source: Ernst & Bar-Joseph, 2006
• Response to temperature or environmental changes, or to treatments
• Developmental studies, immune response
• Etc. (e.g., age or dose response)
About 30% of microarray experiments are time course.
Some comments
About 70-80% of microarray time series experiments are short:
• 5-10 time points and very few replicates
• Samples are often not taken on a regularly spaced grid
• Cost of microarrays
• Limited availability of biological material
• Presence of missing data
• High level of noise in the data (even after preprocessing)
Statistical problems related to time series microarray experiments
• Automatic detection of periodic (cell cycle) genes
• Automatic detection of differentially expressed profiles (focus of the talk)
• Estimation (focus of the talk)
• Clustering (or also classification)
• Gene networks
Testing and Estimation in time course experiments
Several experimental designs can be considered; each of them may require tailored statistical methods.
• “One (statistical) sample” design
• “Two (statistical) samples” design
“Simple and preliminary” statistical questions, given the observations:
• For which gene index i is the curve different from zero? Or, for which index i do the curves in the two samples differ from each other?
• If the curve is different from zero (or from the curve in the other sample), how can we estimate the “treatment effect”, i.e., the temporal response of each gene to the treatment?
“Standard” Software for Microarray Data Analysis
Most analysis software is designed for “static gene expression data”. Such software:
• Does not take advantage of the sequential information in time series data
• Is often invariant under permutation of the time points
• Does not use the time values explicitly
Statistical methods not specifically designed for time-course data can lack statistical power.
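The permutation-invariance point can be made concrete with a small sketch. It assumes a one-way ANOVA F statistic as a stand-in for a "static" method (time treated as an unordered factor) and a least-squares slope as a stand-in for a statistic that uses the time values; the data are invented for illustration.

```python
def anova_f(groups):
    """One-way ANOVA F statistic; `groups` is a list of lists of replicate values."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def ls_slope(times, groups):
    """Least-squares slope of expression against time, using the actual time values."""
    xs = [t for t, g in zip(times, groups) for _ in g]
    ys = [v for g in groups for v in g]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

times = [1, 2, 4, 8]                                           # hypothetical time grid
profile = [[0.1, 0.2], [0.6, 0.5], [1.1, 0.9], [1.6, 1.8]]     # two replicates per time point
scrambled = [profile[2], profile[0], profile[3], profile[1]]   # same data, time order shuffled

f_original = anova_f(profile)
f_scrambled = anova_f(scrambled)
slope_original = ls_slope(times, profile)
slope_scrambled = ls_slope(times, scrambled)
print(f_original, f_scrambled)          # identical up to rounding
print(slope_original, slope_scrambled)  # clearly different
```

The F statistic cannot tell a steadily increasing profile from a scrambled one, whereas a statistic that uses the time values can.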
“Standard” methods for time series analysis
On the other hand, most methods for time series data are based on transformations, such as Fourier or wavelet transforms, or on asymptotic approaches. They cannot be applied to microarray experiments because of the limited number of observations available.
Moreover, when testing significance a “global” answer is required, and since thousands of curves are compared simultaneously, some “multiple comparisons control procedure” is mandatory (i.e., to control the false positives).
[Figure: estimated expression profile of GENE6485 over time, illustrating a global answer vs. a point-by-point analysis]
NEED FOR “NEW” STATISTICAL METHODS AND SPECIFIC SOFTWARE
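As an illustration of a multiple comparisons control procedure, here is a minimal sketch of the Benjamini-Hochberg step-up rule for controlling the FDR. The p-values are invented, and the sketch is a generic textbook version, not the procedure of any specific package discussed in this talk.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= k * alpha / m.
    last_rejected = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            last_rejected = rank
    # Reject every hypothesis whose p-value rank is at most that k.
    rejected = [False] * m
    for idx in order[:last_rejected]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values, one per gene.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(p, alpha=0.05)
print(decisions)
```

Testing each gene at level 0.05 without correction would flag five genes here; the step-up rule keeps only the two with the strongest evidence.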
“New” approaches
Recently, new methods have been proposed in the literature for the identification and estimation of time-course gene expression profiles from microarray data.
Among others:
• EDGE
• TimeCourse
• BATS
• Others are currently under development
These methods are based on different assumptions and usually apply in different contexts.
For a comparison among different methods:
M. Mutarelli, L. Cicatiello, M. Ravo, O.L. Grober, A. Facchiano, C. Angelini, A.
Weisz (2007). Time course whole-genome microarray analysis of estrogen
effects on hormone-responsive breast cancer cells. BMC Bioinformatics (to
appear)
What is EDGE?
EDGE is user-friendly software for the Extraction and Analysis of Differential Gene Expression. It implements the novel approach proposed in Storey et al. (2005).
EDGE allows the user to automatically identify and rank differentially expressed genes and controls the FDR, but it does not explicitly estimate their expression profiles.
What is EDGE used for?
EDGE can be applied to both ‘one sample’ and ‘two sample’ designs, and to both ‘longitudinal’ and independent-samples data.
EDGE implements a functional approach in which gene expression is expanded in a B-spline basis with fixed (common) degree. An F-type statistic, similar to the one used in ANOVA models, serves as the test statistic; the p-values are estimated by bootstrap, and an FDR-type control is applied.
The statistical model in EDGE
$i = 1, \ldots, N$ (number of genes)
$j = 1, \ldots, n$ (number of time points)
$k = 1, \ldots, k_i$ (number of replicates)
Functional approach: the expression curve $\mu_i(t)$ of gene i is expanded in a B-spline basis of degree p; the coefficients are estimated from the data Z using “least squares”.
For the “one sample” problem
$H_{0,i}: \mu_i(t) \equiv 0$
$H_{1,i}: \mu_i(t) \not\equiv 0$
Testing significance with EDGE
Statistic: $F_i = \dfrac{SS_i^0 - SS_i^1}{SS_i^1}$, where $SS_i^0$ is the sum of squares under the null and $SS_i^1$ is the sum of squares under the alternative.
The p-values are estimated using bootstrap/permutations.
Automatic detection is carried out by controlling q-values.
For the “two sample” problem
$H_{0,i}: \mu_{1i}(t) \equiv \mu_{2i}(t)$
$H_{1,i}: \mu_{1i}(t) \not\equiv \mu_{2i}(t)$
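The idea of comparing the sum of squares under the null with the sum of squares under the alternative can be sketched for the “one sample” problem as follows. This is only an illustration under stated assumptions: a low-degree polynomial basis is swapped in for the B-spline basis, the null curve is identically zero, the data are invented, and the bootstrap step for the p-values is omitted. It is not the EDGE implementation.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def f_statistic(times, values, degree=2):
    """(SS0 - SS1) / SS1 with a zero null curve and a polynomial least-squares alternative."""
    basis = [[t ** d for d in range(degree + 1)] for t in times]
    p = degree + 1
    # Normal equations of the least-squares problem.
    gram = [[sum(row[a] * row[b] for row in basis) for b in range(p)] for a in range(p)]
    rhs = [sum(row[a] * z for row, z in zip(basis, values)) for a in range(p)]
    coef = solve(gram, rhs)
    fitted = [sum(c * x for c, x in zip(coef, row)) for row in basis]
    ss0 = sum(z ** 2 for z in values)                            # residuals under the null
    ss1 = sum((z - f) ** 2 for z, f in zip(values, fitted))      # residuals under the alternative
    return (ss0 - ss1) / ss1

# Hypothetical data: 5 time points rescaled to [0, 1], two replicates each.
times = [0.0, 0.0, 0.25, 0.25, 0.5, 0.5, 0.75, 0.75, 1.0, 1.0]
responsive = [0.05, -0.05, 0.9, 1.1, 1.45, 1.55, 1.1, 0.9, 0.05, -0.05]
flat = [0.05, -0.05, -0.1, 0.1, 0.05, -0.05, 0.1, -0.1, -0.05, 0.05]

f_responsive = f_statistic(times, responsive)
f_flat = f_statistic(times, flat)
print(f_responsive, f_flat)
```

A gene whose profile departs from zero yields a large statistic, while pure noise around zero yields a statistic near zero; in practice the reference distribution would be obtained by resampling, as stated above.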
What else about EDGE?
EDGE was the first to propose a truly functional approach and was found to perform well in several applications; however:
• Since p-values are evaluated via bootstrap, EDGE can be computationally intensive and can suffer from the so-called “granularity” problem.
• EDGE does not allow “missing values”; however, it implements a KNN procedure to fill them in.
• EDGE requires all genes to be expanded in a B-spline basis with fixed degree p (the degree can be estimated from the data or chosen by the user).
EDGE has been developed at the University of Washington.
Leek JT, Monsen EC, Dabney AR, and Storey JD. (2006) EDGE: Extraction and analysis of differential gene expression. Bioinformatics, 22: 507-508.
Storey JD, Xiao W, Leek JT, Tompkins RG, and Davis RW. (2005) Significance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences, 102: 12837-12842.
EDGE web site
EDGE can be downloaded at
http://www.biostat.washington.edu/software/jstorey/edge//index.php
It requires R > 2.5 and Bioconductor.
EDGE is free for academic, non-commercial use.
What is TimeCourse?
TimeCourse is a new R package that implements the novel multivariate empirical Bayes approach described in Tai and Speed (2006).
• TimeCourse only allows the user to rank the genes’ expression profiles. It provides no automatic cut-off for selecting differentially expressed genes, nor does it control the multiple comparisons error.
• TimeCourse can be applied to both ‘one sample’ and ‘two sample’ designs. However, in the latter case it is applicable only to data sets with identical time grids.
• TimeCourse is designed specifically for purely ‘longitudinal’ data.
The statistical model in TimeCourse
$Z_{ik} = (z_{ik,1}, \ldots, z_{ik,n})^T$: values observed at $t_1, \ldots, t_n$ for gene i and replicate/individual k
$i = 1, \ldots, N$ (number of genes)
$k = 1, \ldots, K$ (number of replicates)
Note: TimeCourse requires the number of replicates to be the same at each time point; it can differ from gene to gene. Missing values are not allowed.
Designed for longitudinal studies, i.e., the “same individual” k is assumed to be recorded at all time points.
Note: in the “two sample” design the time grid has to be the same for all samples and all replicates.
Multivariate Bayesian model in TimeCourse
Note: the Bayesian approach is elicited in the time domain! The “value” of the time points does not enter the model.
$Z_{ik} \mid \mu_i, \Sigma_i \sim N(\mu_i, \Sigma_i)$
$\mu_i \mid \Sigma_i \sim \pi_0\, \delta(0, \ldots, 0) + (1 - \pi_0)\, N(0, \eta^{-1} \Sigma_i)$
The first component corresponds to a gene “non affected” by the treatment, the second to a gene “affected” by the treatment.
$\Sigma_i \sim \text{Inv-Wishart}_{\nu}\big((\nu \Lambda)^{-1}\big)$
All hyper-parameters can be estimated from the data.
Testing significance with TimeCourse
For the “one sample” problem
$H_{0,i}: \mu_i = 0,\ \Sigma_i > 0$ vs $H_{1,i}: \mu_i \neq 0,\ \Sigma_i > 0$
Model + observed data + prior information → posterior distribution (analytically known)
TimeCourse ranks the gene expression profiles using the Hotelling $T^2$ statistic if all genes have the same number of replicates, or the MB-statistic if genes have different numbers of replicates. Explicit forms of both the Hotelling and MB statistics are available in Tai and Speed (2006) (not shown here for brevity).
For the “two sample” problem
$H_{0,i}: \mu_{1i} = \mu_{2i},\ \Sigma_i > 0$ vs $H_{1,i}: \mu_{1i} \neq \mu_{2i},\ \Sigma_i > 0$
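To give a feeling for a Hotelling-type ranking statistic, here is a minimal sketch of the classical one-sample Hotelling $T^2$ for a single gene. It is an assumption-laden illustration: it uses the plain sample covariance rather than the moderated (empirical Bayes) covariance of Tai and Speed, the data are invented, and it needs more replicates than time points for the covariance to be invertible.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def hotelling_t2(replicates):
    """T^2 = K * mean' S^{-1} mean; `replicates` holds K vectors, one value per time point."""
    k = len(replicates)
    n = len(replicates[0])
    mean = [sum(r[j] for r in replicates) / k for j in range(n)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in replicates) / (k - 1)
            for b in range(n)] for a in range(n)]
    weights = solve(cov, mean)  # S^{-1} applied to the mean vector
    return k * sum(w * m for w, m in zip(weights, mean))

# Hypothetical log-ratios: 5 individuals, each observed at the same 3 time points.
replicates = [
    [0.9, 1.8, 0.4],
    [1.1, 2.1, 0.6],
    [1.0, 1.9, 0.3],
    [1.2, 2.3, 0.5],
    [0.8, 2.0, 0.7],
]
# The same data with the time points reordered.
permuted = [[r[2], r[0], r[1]] for r in replicates]

t2 = hotelling_t2(replicates)
t2_permuted = hotelling_t2(permuted)
print(t2, t2_permuted)
```

The two values coincide up to rounding, which reflects that the time values and their order do not enter a model of this kind.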
What else about TimeCourse?
TimeCourse only ranks the genes’ expression profiles; it does not automatically select the ones that are “differentially expressed”.
All the parameters of the method can be estimated from the data or chosen by the user.
• The ‘two sample’ design is applicable only to data sets with identical time grids.
• No missing data are allowed (a preliminary procedure for filling in the missing values or filtering the data is necessary).
• Since the multivariate Bayesian approach is elicited in the physical domain, the time variable does not enter the model explicitly.
TimeCourse has been developed by the Speed research group at Berkeley.
Tai, Y.C. and Speed, T.P. (2006) A multivariate empirical Bayes statistic for replicated microarray time course data. Annals of Statistics, 34, 2387-2412.
Tai, Y.C. and Speed, T.P. (2007) On the gene ranking of replicated microarray time course data. Technical Report 735, University of California, Berkeley.
What is BATS ?
BATS is user-friendly software for Bayesian Analysis of Time Series microarray experiments, based on the novel functional Bayesian approach proposed in Angelini et al. (2007).
BATS allows the user to automatically identify and rank differentially expressed genes, to control the multiple comparisons error, and to estimate their expression profiles.
BATS successfully handles various technical difficulties that arise in microarray time-course experiments, such as the small number of observations available, non-uniform sampling intervals, missing or multiple data, and temporal dependence between observations for each gene.
BATS is suited to the “one sample” statistical design; the “two sample” statistical design (without any grid restriction) is under implementation.
The statistical model in BATS
[Figure: estimated expression profile of GENE6485 over time]
$Z_i^{j,k} = S_i(t_j) + \varepsilon_i^{j,k}$
where
$Z_i^{j,k} = \log_2\!\left(\dfrac{\text{Treated at time } t = t_j \text{ on array } k}{\text{Control at time } t = 0 \text{ on array } k}\right)$
$i = 1, \ldots, N$ (number of genes)
$j = 1, \ldots, n$ (number of time points)
$k = 1, \ldots, k_i^j$ (number of replicates)
Noise: i.i.d., with $E(\varepsilon_i^{j,k}) = 0$ and $\mathrm{Var}(\varepsilon_i^{j,k}) = \sigma^2$.
We assume that the “true” functional profile of gene i is
$S_i(t) = \sum_{l=0}^{L_i} c_i^{l}\, \phi_l(t)$
i.e., the gene expression time profile is a smooth curve expanded in an orthogonal system on [0, T].
For the “one sample” problem
$H_{0,i}: S_i(t) \equiv 0$
$H_{1,i}: S_i(t) \not\equiv 0$
Note that $S_i(t) \equiv 0 \iff c_i = 0$.
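The expansion of a profile in an orthogonal system on [0, T] can be sketched as follows. The sketch assumes Legendre polynomials rescaled to [0, T] as the orthogonal system (unnormalized, for simplicity); the coefficients and the maximum degree are hypothetical, and the time grid is an example non-uniform grid in hours.

```python
def legendre(degree, x):
    """Legendre polynomial P_degree(x) on [-1, 1], via the three-term recurrence."""
    p_prev, p_curr = 1.0, x
    if degree == 0:
        return p_prev
    for l in range(1, degree):
        # (l + 1) P_{l+1}(x) = (2l + 1) x P_l(x) - l P_{l-1}(x)
        p_prev, p_curr = p_curr, ((2 * l + 1) * x * p_curr - l * p_prev) / (l + 1)
    return p_curr

def profile(coefficients, t, t_max):
    """S(t) = sum_l c_l * phi_l(t), with phi_l a Legendre polynomial rescaled to [0, t_max]."""
    x = 2.0 * t / t_max - 1.0  # map [0, t_max] onto [-1, 1]
    return sum(c * legendre(l, x) for l, c in enumerate(coefficients))

# Example non-uniform grid (hours) and a hypothetical maximum degree L = 3.
times = [1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32]
t_max = 32.0
max_degree = 3

# Design matrix: one row per time point, one column per basis function.
design = [[legendre(l, 2.0 * t / t_max - 1.0) for l in range(max_degree + 1)] for t in times]

coefficients = [0.4, -0.8, 0.3, 0.1]  # hypothetical c_i for one gene
curve = [profile(coefficients, t, t_max) for t in times]
print(curve)
```

With all coefficients equal to zero the curve is identically zero, which is why testing the profile reduces to testing the coefficient vector.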
Bayesian model in BATS
We assume that genes are conditionally independent:
$Z_i = D_i c_i + \varepsilon_i$, so that $Z_i \mid L_i, c_i, \sigma^2 \sim N(D_i c_i, \sigma^2 I_{M_i})$
We place a prior on the unknown parameters:
$L_i \sim g(\lambda, L_{\max})$, i.e., $\mathrm{Pois}^*(\lambda, L_{\max})$, a Poisson distribution truncated at $L_{\max}$
$c_i \mid L_i, \sigma^2 \sim \pi_0\, \delta(0, \ldots, 0) + (1 - \pi_0)\, N(0, \sigma^2 \tau_i^2 Q_i^{-1})$
The first component corresponds to a gene “non affected” by the treatment, the second to a gene “affected” by the treatment. $\tau_i^2$ is the gene-specific variance, and $\pi_0$ is the prior probability of not being affected by the treatment (to be estimated from the data).
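The truncated Poisson prior on the degree can be illustrated with a short sketch. It assumes the truncation keeps the support {0, ..., L_max} and renormalizes the Poisson probabilities; the values of lambda and L_max are hypothetical.

```python
import math

def truncated_poisson_pmf(lam, l_max):
    """P(L = l) for l = 0..l_max: Poisson(lam) probabilities renormalized on {0, ..., l_max}."""
    weights = [math.exp(-lam) * lam ** l / math.factorial(l) for l in range(l_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

prior = truncated_poisson_pmf(lam=2.5, l_max=6)   # hypothetical hyper-parameters
prior_mode = max(range(len(prior)), key=lambda l: prior[l])
prior_mean = sum(l * p for l, p in enumerate(prior))
print(prior, prior_mode, prior_mean)
```

Truncation removes the upper tail, so the prior mean falls slightly below lambda while low degrees keep most of the mass, which favours smooth profiles.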
Noise model in BATS
$\sigma^2 \sim \rho(\sigma^2)$
We distinguish 3 models:
Model 1) $\rho(\sigma^2) = \delta(\sigma^2 - \sigma_0^2)$, i.e., the marginal distribution of the noise is Gaussian
Model 2) $\rho(\sigma^2) = IG(\gamma, b)$, i.e., the marginal distribution of the noise is Student T
Model 3) $\rho(\sigma^2) = c_\mu\, \sigma^{M_i - 1} e^{-\sigma^2 \mu / 2}$, i.e., the marginal distribution of the noise is double-exponential
It is thus possible to model “non-Gaussian noise”.
Model + observed data + prior information → posterior distribution, analytically known in cases 1)-3)
Testing significance with BATS
$H_{0,i}: c_i = 0$ vs $H_{1,i}: c_i \neq 0$, $i = 1, \ldots, N$
In general, given the posterior distribution, testing can be carried out by looking at the posterior probability of being significant or, equivalently, at the Bayes factor.
For the models under consideration the Bayes factor (BF) can be evaluated analytically.
Multiple comparisons control with BATS
In order to account for multiple comparisons, BATS implements the Bayesian multiple testing procedure of Abramovich & Angelini (2006). The procedure is based on “ordered Bayes factors” and is similar in spirit to the Benjamini and Hochberg FDR control.
Estimating treatment’s effect with BATS
$\hat{L}_i = E(L_i \mid Z_i)$ or $\hat{L}_i = \arg\max_{L_i} p(L_i \mid Z_i)$
$\hat{c}_i = E(c_i \mid Z_i, \hat{L}_i)$
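The equivalence between the posterior probability and the Bayes factor can be sketched with the standard identity "posterior odds = Bayes factor × prior odds". The sketch assumes the Bayes factor is expressed in favour of the null hypothesis (small values indicate a gene affected by the treatment); the gene names, Bayes factors and the prior probability pi0 are hypothetical, and the ordering step only mimics the idea of ordered Bayes factors, not the actual multiple testing procedure.

```python
def posterior_null_probability(bayes_factor_01, pi0):
    """P(H0 | data) from a Bayes factor in favour of H0 and the prior probability pi0 of H0."""
    prior_odds = pi0 / (1.0 - pi0)
    posterior_odds = bayes_factor_01 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical Bayes factors (in favour of "not affected") for three genes.
bayes_factors = {"gene_b": 0.5, "gene_c": 20.0, "gene_a": 0.001}
pi0 = 0.9  # hypothetical prior probability of not being affected

# Order genes from strongest to weakest evidence of being affected.
ranked = sorted(bayes_factors, key=lambda g: bayes_factors[g])
posterior = {g: posterior_null_probability(bf, pi0) for g, bf in bayes_factors.items()}
for g in ranked:
    print(g, bayes_factors[g], posterior[g])
```

For a fixed pi0 the posterior probability is a monotone function of the Bayes factor, so ranking by either quantity gives the same order.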
BATS website
BATS version 1.0 is freely downloadable at
http://www.na.iac.cnr.it/~bats/
BATS: main window
• A section for running a simulation study
• A section for analyzing a given dataset
• Several additional tools for filtering data, displaying profiles and comparing results
• “About” and “Help” buttons are available for each section
What about BATS?
BATS can carry out analyses with both simulated and real experimental data.
BATS is written in MATLAB; executable files for Windows, Linux and Macintosh are now available.
BATS is currently implemented for a single processor; future releases of the software will also include a version for multi-processor workstations.
BATS has been developed at IAC-CNR and is part of the CNR-Bioinformatics Interdepartmental Project.
C. Angelini, D. De Canditiis, M. Mutarelli, M. Pensky (2007). Bayesian Approach to Estimation and Testing in Time Course Microarray Experiments. Statistical Applications in Genetics and Molecular Biology, vol. 6, iss. 1, article 24.
C. Angelini, L. Cutillo, D. De Canditiis, M. Mutarelli, M. Pensky (2007). BATS: A Bayesian User-Friendly software for analyzing time series microarray data. (Technical report IAC CNR 331/07)
C. Angelini, D. De Canditiis, M. Pensky (2008). Bayesian models for the two-sample time-course microarray experiments. (Technical report IAC CNR, in preparation)
M. Mutarelli, L. Cicatiello, M. Ravo, O.L. Grober, A. Facchiano, C. Angelini, A. Weisz (2007). Time course whole-genome microarray analysis of estrogen effects on hormone-responsive breast cancer cells. BMC Bioinformatics (to appear)
Time course experiments: a Case Study
The aim of the experiment is to identify estrogen-responsive genes in a human breast cancer cell line.
Control sample: ZR-75.1 (human breast cancer cells)
Treated sample: ZR-75.1 cells stimulated with a mitogenic dose of 17ß-estradiol
Control samples were always taken at time t = 0.
Treated samples were taken at times $t_j \in [0, T]$, $j = 1, \ldots, n$.
For each time point $t_j$ the experiment was replicated $k_j$ times.
Biological questions:
Which genes are activated or repressed by the treatment?
And if a gene is affected by the treatment, what is the treatment effect?
Cicatiello et al (2004) dataset description
In the experiment, ZR-75.1 cells were stimulated with a mitogenic dose of 17ß-estradiol after 5 days of starvation on a hormone-free medium, and samples were taken after t = 1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32 hours (a non-regular grid), for a total of 11 time points covering the completion of a full mitotic cycle in hormone-stimulated cells. For each time point at least 2 replicates were available (3 replicates at t = 2, 8, 16 hours).
After suitable filtering and preprocessing (Yang et al., 2002; Cui et al., 2002), N = 8161 genes were analyzed by our method in order to detect estrogen-responsive genes (note that about 350 genes had at least one missing value).
The normalized dataset is included in BATS as an example for a guided analysis.
Cicatiello, L., Scarfoglio, C., Altucci, L., Cancemi, M., Natoli, G., Facchiano, A., Iazzetti G., Calogero, R., Biglia, N., De Bortoli, M., Sfiligol, C., Sismondi, P., Bresciani, F. and Weisz, A. (2004). A genomic view of estrogen actions in human breast cancer cells by expression profiling of the hormone-responsive transcriptome. Journal of Molecular Endocrinology, 32, 719--775.
Results using BATS
BATS is very robust with respect to the list of genes detected as significant: 574 genes were common to all 28 lists, while 958 genes were selected by at least one combination of methods/parameters.
Compared with the 344 genes selected by hand in Cicatiello et al. (2004), the list of 574 common genes includes 270 of them; among the remaining 74 genes (344 − 270), 16 were filtered out in our analysis because of a more stringent quality selection before processing the data. On the other hand, 309 out of the 344 were selected by at least one combination.
Interestingly, 17 of the 304 newly selected genes (574 − 270) were replicate spots of genes already selected in Cicatiello et al. (2004), and most of the remaining ones are known to be involved in biological processes related to estrogen response, such as cell cycle and cell proliferation (AREG, NOLC1, cyclin D1), DNA replication (MCM7, RFC5), mRNA processing (SFRS1) and lipid metabolism (APOD and LDHA).
Comparisons with EDGE and TimeCourse
We compare the results of the BATS analysis with those of the recently released user-friendly software EDGE (Leek et al., Bioinformatics, 2006) and of the R package TimeCourse (Tai & Speed, Annals of Statistics, 2006).
• On real data BATS shows a much wider overlap with the “biologist-inspired selection” than EDGE and the R package TimeCourse.
• Similar results are also confirmed by simulations, using FDR, FNR, etc. as “goodness” measures.
Other related problems: Clustering
The identification of genes that respond to a given treatment is often a preliminary step towards answering other questions of interest.
Biological question:
Which genes show a similar response to the treatment?
Several methods have been proposed for gene clustering; however, very few of them are designed for time-course microarray data.
As a consequence, most of the information contained in the data cannot be properly used and the results are often unstable.
Heard, N. A., Holmes, C. C., Stephens, D. A., Hand, D. J. and Dimopoulos, G. (2005) Bayesian Co-clustering of Anopheles Gene Expression Time Series: A Study of Immune Defense Response To Multiple Experimental Challenges. Proceedings of the National Academy of Sciences USA, 102, 47, 16939-16944.
Heard, N.A., Holmes, C.C., and Stephens, D. A. (2006). A quantitative study of gene regulation involved in the Immune response of Anopheline Mosquitoes: An application of Bayesian hierarchical clustering of curves. J. Amer. Statist. Assoc., 101, 18--29.
Clustering
SplineClust
A “functional” Bayesian approach with the possibility of estimating the number of clusters from the posterior distribution, but:
• No missing data are allowed
• Only one observation per time point is allowed
• Same degree for all functions
• Computationally fast, but at a price: it uses hierarchical clustering (often not optimal)
• No “goodness” measure is available
There is still a shortage of specifically designed methods and of careful analyses of their performance.
Conclusion
Time-course microarray experiments are becoming extremely popular as a tool for investigating gene expression dynamics; however, they pose new challenges to statisticians and computer scientists, who have to develop specifically designed tools for handling and analyzing them.
We have presented and compared several currently available methods and related software for analyzing time-course microarray data, with particular focus on the problem of the automatic identification and estimation of gene expression profiles.
For any information contact me: [email protected]