Comparative genomics
The comparative analysis of genomes
examines what is shared and what is unique among different
species at the genomic level
(comparative molecular anatomy?)
it can be the safest and most reliable way to
identify genes (Genscan, Orpheus, etc.) and to predict their
functions and interactions
for example, the function of some human genes or DNA
sequences can be clarified by studying their counterparts in
simpler organisms
Scope of
Comparative Genomics
comparative genomics includes the comparison of:
• all known or predicted proteins (proteomes),
• the position of genes in genomes,
• the number and position of repeats,
a further area of comparative genomics covers:
• the analysis of intra-specific diversity
(SNPs, variability of microsatellites and of gene expression
levels)
• the association between these variations and the
response to the environment and/or to disease
What can be done with
Comparative Genomics
the relative location of related genes can be compared
within a single genome and across separate genomes,
as can that of any local group of genes that may have a
functional or regulatory significance
orthologs are genes so well conserved in different genomes that
the encoded proteins are predicted to have the same
structure and function and to have arisen from a
common ancestor following speciation
Identification of “orthologous” genes
to identify orthologs, each gene of organism 1
is used as the query in a similarity search against
all the genes of organism 2
as a first approximation, the gene of genome 2 that is
most similar to the “query” gene is assumed to be its ortholog
[Figure: a query gene from Genome 1 searched against all genes of Genome 2]
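The one-way best-hit rule described above can be sketched in a few lines. This is only an illustration: the `similarity` function is a toy stand-in (a `difflib` ratio) for a real alignment score such as one reported by BLAST, and the gene IDs and sequences are invented.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy stand-in for an alignment score: matching-character ratio in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_hit(query_seq: str, target_proteome: dict[str, str]) -> str:
    """Return the ID of the target sequence most similar to the query."""
    return max(
        target_proteome,
        key=lambda gene_id: similarity(query_seq, target_proteome[gene_id]),
    )


# Hypothetical toy proteomes (IDs and sequences are invented for illustration).
genome1 = {"g1_A": "MKTAYIAKQRQISFVK", "g1_B": "MSLNEEDLPHWQRT"}
genome2 = {
    "g2_X": "MKTAYLAKQRQISFIK",
    "g2_Y": "MSLNDEDLPHWQKT",
    "g2_Z": "MPPGGHHWWCC",
}

# Each gene of genome 1 is used as a query against all genes of genome 2.
candidate_orthologs = {gid: best_hit(seq, genome2) for gid, seq in genome1.items()}
print(candidate_orthologs)
```

The result is only a first approximation: nothing here checks that the relationship also holds in the opposite direction.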
Whole-genome comparison
to make the prediction more reliable, for each pair of
orthologs each of the two genes is required to be
the one most similar to the other when compared against the whole
genome
[Figure: reciprocal similarity searches between Genome 1 and Genome 2]
if both relationships hold ==> the two proteins (yellow in the
figure) can be proposed as orthologs
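A minimal sketch of this reciprocal best-hit criterion, assuming the pairwise similarity scores have already been computed (for example by an all-against-all search). The score table and gene names below are hypothetical.

```python
def best_hits(scores, axis):
    """For each gene on one side, return its highest-scoring partner.

    scores maps (gene1, gene2) -> similarity score;
    axis=0 indexes by the genome 1 gene, axis=1 by the genome 2 gene.
    """
    best = {}
    for pair, score in scores.items():
        gene, partner = pair[axis], pair[1 - axis]
        if gene not in best or score > best[gene][1]:
            best[gene] = (partner, score)
    return {gene: partner for gene, (partner, _) in best.items()}


def reciprocal_best_hits(scores):
    """Keep only pairs where each gene is the other's best hit."""
    forward = best_hits(scores, axis=0)   # genome 1 -> genome 2
    backward = best_hits(scores, axis=1)  # genome 2 -> genome 1
    return sorted((g1, g2) for g1, g2 in forward.items() if backward.get(g2) == g1)


# Hypothetical score table: a2's best hit is b1, but b1's best hit is a1.
scores = {
    ("a1", "b1"): 950.0, ("a1", "b2"): 400.0,
    ("a2", "b1"): 500.0, ("a2", "b2"): 300.0,
    ("a3", "b3"): 700.0,
}
rbh_pairs = reciprocal_best_hits(scores)
print(rbh_pairs)
```

Here `a2` is left without a proposed ortholog, because its best hit `b1` points back to `a1`.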
Similarity and Position
in general, low E-values (< 10^-20) are also required, and
the alignment must cover at least 60-80% of the query
sequence (to avoid running into paralogs)
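The two thresholds can be applied as a simple filter on search hits. The sketch below assumes each hit already carries its E-value and the number of query residues covered by the alignment; the `Hit` record and its field names are hypothetical, not the output format of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    subject: str
    evalue: float
    aligned_length: int  # query residues covered by the alignment
    query_length: int


def passes_filters(hit: Hit, max_evalue: float = 1e-20, min_coverage: float = 0.6) -> bool:
    """Require a low E-value and an alignment spanning most of the query."""
    coverage = hit.aligned_length / hit.query_length
    return hit.evalue < max_evalue and coverage >= min_coverage


# Invented example hits.
hits = [
    Hit("q1", "s1", 1e-50, 280, 300),  # strong and nearly full-length
    Hit("q2", "s2", 1e-30, 90, 300),   # significant, but covers only 30% of the query
    Hit("q3", "s3", 1e-5, 290, 300),   # full-length, but E-value too high
]
kept = [h.subject for h in hits if passes_filters(h)]
print(kept)
```

The coverage requirement is what rejects the second hit, which would typically be a match limited to a single shared domain.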
in related organisms, both gene content and gene order
are generally conserved across genomes
similarity of sequence and of position can then
confirm orthology
in progressively less related organisms, local groups of genes
remain linked, but chromosomal rearrangements
can move the clusters to other positions in the genome
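Conservation of gene order can be checked with a simple scan for collinear blocks. The sketch below assumes a one-to-one ortholog map is already available and ignores genes without an ortholog; the function name, gene names and the `min_size` parameter are all hypothetical.

```python
def collinear_blocks(order1, order2, orthologs, min_size=2):
    """Find runs of genome 1 genes whose orthologs are adjacent in genome 2.

    order1, order2: gene IDs in chromosomal order.
    orthologs: one-to-one map from genome 1 genes to genome 2 genes.
    A run may be in the same or in the reversed orientation.
    """
    position2 = {gene: i for i, gene in enumerate(order2)}
    # Genome 2 positions of the orthologs, following the gene order of genome 1.
    mapped = [(g, position2[orthologs[g]]) for g in order1 if orthologs.get(g) in position2]

    blocks, current = [], []
    for gene, pos in mapped:
        if current and abs(pos - current[-1][1]) == 1:
            current.append((gene, pos))
        else:
            if len(current) >= min_size:
                blocks.append([g for g, _ in current])
            current = [(gene, pos)]
    if len(current) >= min_size:
        blocks.append([g for g, _ in current])
    return blocks


# Invented example: one cluster is inverted, the other moved to the start of genome 2.
order1 = ["a1", "a2", "a3", "a4", "a5"]
order2 = ["b4", "b5", "bx", "b3", "b2", "b1"]
orthologs = {f"a{i}": f"b{i}" for i in range(1, 6)}
blocks = collinear_blocks(order1, order2, orthologs)
print(blocks)
```

Both local clusters are recovered even though a rearrangement has changed their position and orientation in the second genome.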
Metabolic pathways in prokaryotes
in the genomes of microorganisms, the genes of a single
metabolic pathway can be contiguous because they are transcriptionally co-regulated in an operon or placed under the
control of the same promoter
Metabolic pathways in
eukaryotes
in eukaryotes too, related genes can
be found in clusters
it follows that, when the function of contiguous genes is known,
the function of a gene can be inferred from them
to make comparisons between genomes easier, a
shared vocabulary is needed
(Gene Ontology)
intra-genome comparison
• identifies gene families (distinguishes unique genes from
genes belonging to gene families)
• identifies paralogs (pairs of similar genes can
be paralogs)
comparison between genomes
• identifies orthologs
• identifies gene families
• identifies domains
Intra-Genome Comparison
INTRA-genome comparison helps avoid cases of
incorrectly assessed orthology
E.g. => is X (genome A) the ortholog of Y (genome B)?
It could instead be that Y and Z are paralogs,
and that X is the ortholog of Z
[Figure: gene X in Genome A compared with genes Y and Z in Genome B]
We define a graph as a mathematical structure
that represents a binary relation between
elements of one or more sets
graphs can be used to describe binary
relations (for example, similarity) between proteins of the
same proteome, shifting the focus of the comparison from the
pair to the cluster
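One simple way to move from pairs to clusters is to treat proteins as nodes, similar pairs as edges, and take the connected components of the graph. This is a sketch of single-linkage clustering under that assumption, not necessarily the method used by any specific tool; protein names are invented.

```python
from collections import defaultdict


def similarity_clusters(proteins, similar_pairs):
    """Group proteins into connected components of the similarity graph."""
    neighbours = defaultdict(set)
    for a, b in similar_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, clusters = set(), []
    for start in proteins:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Invented example: p1 and p3 are not directly similar but share p2 as a neighbour.
proteins = ["p1", "p2", "p3", "p4", "p5", "p6"]
pairs = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
clusters = similarity_clusters(proteins, pairs)
print(clusters)
```

Clusters with more than one member are candidate gene families, while singletons such as `p6` correspond to unique genes.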
Tools for Comparative
Genomics
• UCSC Browser: This site contains the reference sequence and
working draft assemblies for a large collection of genomes.
• Ensembl: The Ensembl project produces genome databases
for vertebrates and other eukaryotic species, and makes this
information freely available online.
• MapView: The Map Viewer provides a wide variety of genome
mapping and sequencing data.
• VISTA: A comprehensive suite of programs and databases for
comparative analysis of genomic sequences. It was built to
visualize the results of comparative analysis based on DNA
alignments. The presentation of comparative data generated by
VISTA can easily suit both small and large scales of data.
Bacteria Genomes
Alignments
This alignment of eight Yersinia bacteria genomes reveals 78
locally collinear blocks conserved among all eight taxa.
Evolution of the FOXP2 gene
The human FOXP2 gene and its evolutionary conservation are shown in a
multiple alignment (at bottom of figure) in this image from the UCSC
Genome Browser. Conservation tends to cluster around exons.
Gene Ontology (GO)
[Diagram: a GENE PRODUCT linked to MOLECULAR FUNCTION, BIOLOGICAL PROCESS and CELLULAR COMPONENT]
A gene product can have one or more molecular functions,
take part in one or more biological processes, and be associated with one or
more cellular components.
(DAG, directed acyclic graph)
The categories shown in red are
represented at a frequency
significantly (p < 0.01) higher
than expected.
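A common way to decide whether a category is over-represented is a one-sided hypergeometric test. The sketch below assumes that test; the source does not say which test produced the p-values in the figure. No multiple-testing correction is applied, and the counts are invented.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) under the hypergeometric distribution.

    N: genes in the background set
    K: background genes annotated with the category
    n: genes in the study set
    k: study-set genes annotated with the category
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Invented counts: 4 of 5 study genes carry a term that annotates 5 of 20 background genes.
p = enrichment_p_value(k=4, n=5, K=5, N=20)
print(f"p = {p:.4f}, over-represented at p < 0.01: {p < 0.01}")
```

In a real analysis many categories are tested at once, so the raw p-values would normally be corrected before applying a threshold.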
COMPARATIVE GENOMICS AT THE
VERTEBRATE EXTREMES
Dario Boffelli, Marcelo A. Nobrega and Edward M. Rubin
NATURE REVIEWS | GENETICS
VOLUME 5 | JUNE 2004 | 457
Annotators of the human genome are increasingly exploiting
comparisons with genomes at both the distal and proximal
evolutionary edges of the vertebrate tree. Despite the sequence
similarity between primates, comparisons among members of this
clade are beginning to identify primate- as well as human-specific
functional elements. At the distal evolutionary extreme, comparing
the human genome to that of non-mammal vertebrates such as
fish has proved to be a powerful filter to prioritize sequences that
most probably have significant functional activity in all vertebrates.
Human–Fugu rubripes conserved non-coding
sequences (CNS) in the human genome
51 human–F. rubripes
CNSs (in purple).
One cluster of human–F. rubripes
CNS in more detail. DACH is
involved in embryonic development
some of the non-coding
sequences conserved in
humans and F. rubripes act as
enhancers in mouse embryos.
In this assay, the sequence being tested is cloned upstream of a β-galactosidase reporter
gene. If the cloned sequence is an enhancer, it will activate the reporter gene, which can
be detected in an assay that stains the tissues that express β-galactosidase (in blue).
Extreme conservation in enhancers
shared by human and fish
A core enhancer in an intron
in DACH is >98% identical
for 350 bp in humans, mice
and rats. In the ~1 billion
years of parallel evolutionary
time that separates human,
mouse, rat, chicken, frog
and fish, only 6 substitutions
occurred in a 120-bp
fragment that corresponds to
an enhancer, 4 of which
occurred in the frog lineage
alone, and none occurred in
the mammalian lineage.
Sonic hedgehog expression in the limbs is
regulated by an enhancer at a distance of 1
Mb
a | Human–Fugu rubripes sequence comparisons, generated by VISTA, identify a
conserved non-coding sequence in intron 5 of LMBR1 (red box), which drives the
expression of a reporter gene in a pattern that resembles the expression of sonic
hedgehog (SHH) (arrows in b). Insertional mutagenesis in this region in mice results in
preaxial polydactyly (arrows in c). In humans, mutations in this enhancer are also
associated with preaxial polydactyly (arrows in d).
*Between mammals and fish. The molecular function and biological
process of each gene were obtained from the Gene Ontology
Phylogenetic shadowing
Phylogenetic shadowing analyses sequence variation in a
multiple alignment to identify regions that accumulate
variation at a slower rate.
Each position in the multiple alignment is fitted to a
phylogenetic model to calculate the likelihood that the
position is evolving at a fast or a slow rate.
Generally, positions with several sequence differences
in multiple branches of the phylogenetic tree are more
likely to be evolving at a fast rate, and in turn identify the
least variable regions.
The slowly evolving regions often correspond to functional
sequences.
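The idea of locating slowly evolving regions can be illustrated with a much cruder proxy than the likelihood model described above. The sketch below scores each alignment column by how many sequences differ from the most common base, then reports windows with low average variability. It ignores the phylogenetic tree entirely; the function names, window size and threshold are arbitrary assumptions.

```python
def column_variability(alignment):
    """Fraction of sequences differing from the most common base, per column."""
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        most_common = max(set(column), key=column.count)
        scores.append(1 - column.count(most_common) / n)
    return scores


def slow_windows(alignment, window=5, max_mean=0.05):
    """Start positions of windows whose mean column variability is low."""
    scores = column_variability(alignment)
    starts = []
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window <= max_mean:
            starts.append(start)
    return starts


# Invented gap-free alignment: the first six columns are invariant.
alignment = [
    "ACGTACGTACGT",
    "ACGTACTTCCGA",
    "ACGTACGAAGGT",
    "ACGTACCTATGC",
]
conserved = slow_windows(alignment)
print(conserved)
```

A tree-aware method would weigh differences by the branches on which they occur, which this column count cannot do.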
The use of highly similar
sequences minimizes
ambiguity in the
computation of the
multiple alignment.
Moreover, the
phylogenetic tree that
relates the data is easy
to infer and facilitates the
comparative assembly of
draft sequence from nonhuman primates to the
reference human
genome.
Identification of adaptively evolving
genes
[Figure labels: dN/dS > 1; "can result from a decrease in population size or a relaxation of selection"]
The ratio of non-synonymous to synonymous substitutions, dN/dS, indicates the type
of selection that a gene is subject to. An excess of non-synonymous substitutions in
pairwise sequence comparisons (dN/dS > 1) indicates that 1 of the 2 sequences is
undergoing positive selection. To determine in which lineage positive selection
occurred, however, the sequence of a third species is needed.
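The ingredients of this ratio can be sketched with a simplified counting scheme in the style of Nei and Gojobori. The code assumes two aligned, in-frame, gap-free coding sequences and the standard genetic code. It skips codons that differ at more than one position, counts changes to stop codons as non-synonymous, and applies no correction for multiple substitutions, so it returns the uncorrected proportions pN and pS rather than dN and dS proper. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon: str) -> float:
    """Number of synonymous sites in a codon (between 0 and 3)."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    total += 1 / 3
    return total


def pn_ps(seq1: str, seq2: str):
    """Return (pN, pS): non-synonymous and synonymous differences per site.

    Assumes at least one synonymous and one non-synonymous site overall.
    """
    syn_sites = syn_diffs = nonsyn_diffs = 0.0
    n_codons = 0
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        mismatches = sum(a != b for a, b in zip(c1, c2))
        if mismatches > 1:
            continue  # simplification: skip codons with multiple differences
        n_codons += 1
        syn_sites += (synonymous_sites(c1) + synonymous_sites(c2)) / 2
        if mismatches == 1:
            if CODON_TABLE[c1] == CODON_TABLE[c2]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    nonsyn_sites = 3 * n_codons - syn_sites
    return nonsyn_diffs / nonsyn_sites, syn_diffs / syn_sites


# Invented sequences: GGT -> GGC is synonymous (Gly), GGT -> GAT is not (Gly -> Asp).
print(pn_ps("GGTGGTGGT", "GGCGGTGGT"))
print(pn_ps("GGTGGTGGT", "GATGGTGGT"))
```

For closely related sequences pN/pS approximates dN/dS; dedicated packages add the multiple-hit correction and handle codons with several differences.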
Defining functional DNA elements
in the human genome
Manolis Kellis et al.
PNAS April 29, 2014 | vol. 111 | no. 17 | 6131–6138
With the completion of the human genome sequence, attention turned to identifying and
annotating its functional DNA elements. As a complement to genetic and comparative
genomics approaches, the Encyclopedia of DNA Elements Project was launched to
contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin
states in many cell types. The resulting genome-wide data reveal sites of biochemical
activity with high positional resolution and cell type specificity that facilitate studies of
gene regulation and interpretation of noncoding variants associated with human disease.
However, the biochemically active regions cover a much larger fraction of the genome
than do evolutionarily conserved regions, raising the question of whether nonconserved
but biochemically active regions are truly functional. Here, we review the strengths and
limitations of biochemical, evolutionary, and genetic approaches for defining functional
DNA segments, potential sources for the observed differences in estimated genomic
coverage, and the biological implications of these discrepancies. We also analyze the
relationship between signal intensity, genomic coverage, and evolutionary conservation.
Our results reinforce the principle that each approach provides complementary
information and that we need to use combinations of all three to elucidate genome
function in human biology and disease.
The complementary nature of
evolutionary, biochemical and genetic
evidence
Encyclopedia of DNA Elements
(ENCODE) Project
[Figure labels: DNA that produces a phenotype upon alteration; GERP++ elements from 34 mammal alignments; human genome coverage by ENCODE; RNA-seq signal in fragments per kilobase of exon per million reads (FPKM); genomic footprint / ChIP-seq]
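The FPKM unit named above follows directly from its definition: fragments mapped to a gene, divided by the exon length in kilobases and by the total number of mapped fragments in millions. A minimal sketch with invented numbers and hypothetical parameter names:

```python
def fpkm(fragments: int, exon_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of exon per million mapped fragments."""
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Invented example: 500 fragments on 2 kb of exon, in a library of 20 million mapped fragments.
value = fpkm(fragments=500, exon_length_bp=2_000, total_mapped_fragments=20_000_000)
print(value)
```

Normalising by both length and library size is what makes signal levels comparable between genes and between experiments.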
Relationship between ENCODE signals
and conservation
Only 5% of mammalian genomes are under strong evolutionary
constraint across multiple species
At present, we cannot distinguish which
low-abundance transcripts are functional
Regions with higher signals generally exhibit
higher levels of evolutionary conservation
Epigenetic and evolutionary signals in cis-regulatory modules (CRMs) of the HBB
complex
Loops and long-range interactions
in the β-globin locus
The fetal-to-adult “switch” in the
β-globin locus
The Albumin / Alpha-fetoprotein locus
(ALB/AFP)
[Figure: restriction map of the ALB/AFP locus, scale in kb, with Hind III (AAGCTT) and Sau3A (GATC) sites, the enhancers Ealb and Eafp, and the ALB and AFP promoters (ALB Prom, AFP Prom). Two panels, "Before birth" and "After birth", mark each promoter as ACTIVE or inactive.]
THAT'S ALL!!