Arabidopsis

Genomes 5
Homology
Homology: two or more DNA or protein sequences are said to be
homologous if they derive from a common ancestor (evolution).
Orthology
Speciation
Hemoglobin in two different species (human and monkey) has the
same function, even though the two proteins differ slightly.
The hemoglobin genes of human and monkey derive from a single
ancestral gene.
Orthologous genes: homologous genes found in different but related
species, encoding proteins with similar structures and functions.
Orthologous genes did not separate through a duplication event but
through the separation of species (speciation) that took place
during evolution.
Paralogy
Duplication
During human evolution the gene for the sodium pump protein was
duplicated; one of the two copies then mutated, becoming a
sodium-potassium pump. Both nevertheless derive from a single
ancestral gene.
Paralogous genes: homologous genes within the same species that
encode different products. They originate from a duplication event.
Synteny
In classical genetics, synteny describes the physical co-localization of genetic loci on the
same chromosome within an individual or species. The concept is related to genetic linkage:
Linkage between two loci is established by the observation of lower-than-expected
recombination frequencies between them.
Shared synteny (also known as conserved synteny) describes preserved co-localization of
genes on chromosomes of different species.
During evolution, genome rearrangements such as chromosome translocations may
separate two loci, resulting in the loss of synteny between them. Conversely,
translocations can also join two previously separate pieces of chromosomes together, resulting
in a gain of synteny between loci. Stronger-than-expected shared synteny can reflect selection
for functional relationships between syntenic genes, such as combinations of alleles that are
advantageous when inherited together, or shared regulatory mechanisms.
The analysis of synteny in the gene order sense has several applications in genomics. Shared
synteny is one of the most reliable criteria for establishing the orthology of genomic
regions in different species. Additionally, exceptional conservation of synteny can reflect
important functional relationships between genes. For example, the order of genes in the "Hox
cluster", which are key determinants of the animal body plan and which interact with each
other in critical ways, is essentially preserved throughout the animal kingdom.
A qualitative distinction is sometimes drawn between macrosynteny, preservation of synteny
in large portions of a chromosome, and microsynteny, preservation of synteny for only a few
genes at a time.
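The idea of conserved gene order can be made concrete with a toy example. The following is a minimal sketch in Python, assuming each chromosome is given as an ordered list of gene names and that orthologs share the same name in both species (both are simplifying assumptions, and the function name `collinear_blocks` is hypothetical). It reports runs of genes that are adjacent and in the same order on both chromosomes; inverted blocks are not handled.

```python
def collinear_blocks(chrom_a, chrom_b, min_genes=2):
    """Return runs of genes that appear consecutively, in the same order,
    on both chromosomes (orthologs are assumed to share a name)."""
    pos_b = {gene: i for i, gene in enumerate(chrom_b)}
    blocks, current = [], []
    for gene in chrom_a:
        if gene not in pos_b:
            # No ortholog on chrom_b: close the current run.
            if len(current) >= min_genes:
                blocks.append(current)
            current = []
            continue
        if current and pos_b[gene] == pos_b[current[-1]] + 1:
            # Immediately follows the previous gene on chrom_b as well.
            current.append(gene)
        else:
            if len(current) >= min_genes:
                blocks.append(current)
            current = [gene]
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


# Hypothetical gene orders, for illustration only.
species_1 = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
species_2 = ["g5", "g6", "g7", "x1", "g1", "g2", "g3"]
blocks = collinear_blocks(species_1, species_2)
print(blocks)
```

In this toy case the two three-gene blocks are conserved even though their relative position on the chromosome has changed, which is the kind of pattern a translocation would leave behind.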
Synteny relationships between the human and mouse IRG genes (a) Synteny between mouse chromosome 7 and human chromosome 19 in
the region of the IRGC and IRGQ genes. The figures indicate distances from the centromere in megabases. The locations of three further
syntenic markers are given. Gene orientation is given by black arrows. (b) Complex synteny relationship between human chromosome 5 and
mouse chromosomes 11 and 18 in the regions containing the mouse Irg genes. Figures indicate distances from the centromere in
megabases. The locations of IRG genes are shown in the yellow panels. Positions of diagnostic syntenic markers are also indicated. Syntenic
blocks are given in full color, and the rest is shaded.
Bekpen et al. Genome Biology 2005 6:R92 doi:10.1186/gb-2005-6-11-r92
An international collaboration (the Arabidopsis Genome Initiative, AGI) began
sequencing the genome in 1996.
Why sequence a plant
• To understand the genetic basis of the differences between plants
and other eukaryotes
• Comparison with other plant genomes (comparative genetics)
• Identification of bacterial genes integrated into the eukaryotic
genome (evolutionary studies)
• Genetic improvement of species of interest (crop improvement)
• Study of hereditary mutations (clinical analyses)
Why A. thaliana?
• small genome (5 chromosomes)
• rapid life cycle (about 6 weeks)
• produces many seeds
• grows in small spaces
• easy to manipulate (Agrobacterium-mediated transformation)
• large collection of available mutants
• no agronomic importance, but advantageous for basic research
SEQUENCING STRATEGY
TOP-DOWN TECHNOLOGY
(hierarchical sequencing)
• Construction of BAC and YAC libraries
• Construction of a tiling path of aligned and ordered sequences
• Fragmentation and sequencing of the individual clones
ADVANTAGES
• favors the reconstruction of high-resolution physical and
genetic maps
• allows research groups around the world to form consortia and
work together without redundancy
The transformation-competent bacterial artificial chromosome vector (TAC) contains the P1 bacteriophage
replicon, which maintains the vector in a single copy, and therefore renders foreign DNA fragments stable, in
E. coli cells. The vector also contains the pRiA4 replicon of the Ri plasmid, which ensures a vector copy
number of 1 in Agrobacterium tumefaciens. The kanamycin resistance marker gene (NPTI), modified by
removal of the Hind III site, is included in the vector to allow selection of clones in both E. coli and A.
tumefaciens by culture in the presence of kanamycin.
Fig. 1. Structure and characteristics of the TAC vector pYLTAC7. (A) Physical map of pYLTAC7. The map shows the locations of recognition
sites for endonucleases that cleave the vector once or twice. Abbreviations not defined in text: KmR, kanamycin resistance gene (NPT1); PE.
coli, a synthetic E. coli promoter.
(B) Sequence of the cloning site region of the vector upstream of the sacB gene. The indicated primer sequences (R1, R2, R3, L1, L2, and L3)
are designed for isolation of end fragments of the inserted DNA by TAIL-PCR. The initiation codon (atg) of sacB is indicated in lowercase letters.
http://www.kazusa.or.jp/en/plant/TAC/Fig1_TACvector.html
The Minimal Tiling Path
How is the tiling path built?
1. HYBRIDIZATION WITH A LABELED PROBE (chromosome walking)
2. FINGERPRINTING
3. END SEQUENCING: sequence the ends of the BAC clone collection and
verify that the sequences have been reconstructed correctly
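Once clones have been placed on a physical map, choosing a minimal tiling path is an interval-covering problem. The following is a minimal sketch, assuming each clone is described by a hypothetical `(name, start, end)` tuple in map coordinates; it uses the standard greedy strategy and is not a description of the procedure actually used by the sequencing consortium.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy interval cover: at each step pick, among the clones that start
    at or before the position covered so far, the one reaching furthest."""
    clones = sorted(clones, key=lambda c: c[1])  # sort by start coordinate
    path, covered, i = [], region_start, 0
    while covered < region_end:
        best = None
        while i < len(clones) and clones[i][1] <= covered:
            if best is None or clones[i][2] > best[2]:
                best = clones[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical BAC clones with (name, start, end) in kb.
bacs = [
    ("BAC_A", 0, 100),
    ("BAC_B", 20, 150),
    ("BAC_C", 90, 220),
    ("BAC_D", 140, 260),
    ("BAC_E", 200, 300),
]
path = minimal_tiling_path(bacs, 0, 300)
print(path)
```

Here three of the five clones are enough to span the region, so only those three would need to be fragmented and sequenced.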
Sequencing strategy for the A. thaliana genome
Telomere sequence was obtained from specific yeast artificial chromosome
(YAC) and phage clones, and from inverse polymerase chain reaction (IPCR)
products derived from genomic DNA.
In the centromeric regions, these physical mapping methods were
supplemented with genetic mapping to identify contig positions and
orientation.
Sequencing strategies
• CHROMOSOME ARMS: BAC, YAC, TAC, cosmid and P1 clones
• TELOMERES: YAC and the IPCR technique
• CENTROMERES: BAC and PCR
The physical map of the genome was built by combining
restriction fragment "fingerprint" analysis of BAC clones,
hybridization/PCR of regions with known molecular markers,
and Southern blotting.
Features of the A. thaliana genome
• 125 Mb (115.4 Mb sequenced + ~10 Mb of unsequenced centromeric/repeated
regions)
• 25,498 genes encode 11,601 types of protein (about 150 families)
Roughly 30% of the 25,498 predicted gene products, comprising both plant-specific proteins and proteins with similarity
to genes of unknown function from other organisms, could not be assigned to functional categories.
A combination of algorithms, all optimized with parameters based on known
Arabidopsis gene structures, was used to define gene structure.
We used similarities to known protein and expressed sequence tag (EST)
sequence to refine gene models.
• Eighty per cent of the gene structures predicted by the three centres
involved were completely consistent, 93% of ESTs matched gene models,
and less than 1% of ESTs matched predicted non-coding regions, indicating
that most potential genes were identified.
• The 25,498 genes predicted was the largest gene set published to that
date: C. elegans has 19,099 genes and Drosophila 13,601 genes.
• Arabidopsis and C. elegans have similar gene density, whereas
Drosophila has a lower gene density.
Comparison with the genomes of other organisms:
• Common functions → highly conserved genes
• Specific functions (photosynthesis, tropism, etc.) → lower level
of conservation
The proportion of Arabidopsis proteins having related counterparts in eukaryotic genomes varies by a factor of 2 to 3
depending on the functional category. Only 8-23% of Arabidopsis proteins involved in transcription have related genes
in other eukaryotic genomes, reflecting the independent evolution of many plant transcription factors. In contrast,
48-60% of genes involved in protein synthesis have counterparts in the other eukaryotic genomes, reflecting highly
conserved gene functions.
The absolute number of Arabidopsis gene families and singletons (types) is in the
same range as the other multicellular eukaryotes, indicating that a proteome of
11,000-15,000 types is sufficient for a wide diversity of multicellular life.
The proportion of gene families with more than two members is considerably more
pronounced in Arabidopsis than in other eukaryotes.
Pronounced redundancy in the Arabidopsis genome is evident in segmental
duplications and tandem arrays, and many other genes with high levels of sequence
conservation are also scattered over the genome.
Segmental duplication is responsible for 6,303 gene duplications, and
tandem gene duplication accounts for a significant proportion of the increased family
size.
Gene duplication may indicate more relaxed constraints on genome size in plants, or a
more prominent role of unequal crossing over in generating new copies.
Genome Organization
The Arabidopsis genome sequence provides a complete view of chromosomal organization and clues to its
evolutionary history.




• 17% of genes are arranged in tandem arrays;
• 24 long (> 100 kb) duplicated segments (~58% of the genome);
• 60% of genes are duplicated;
• A speciation event from a tetraploid ancestral plant may have
occurred (~112 million years ago).
What does the Arabidopsis genome tell us about the
ancestry of the species?
Polyploidy occurs widely in plants and is proposed to be a key factor in plant
evolution. As the majority of the Arabidopsis genome is represented in
duplicated (but not triplicated) segments, it appears most likely that
Arabidopsis, like maize, had a tetraploid ancestor. A comparative sequence
analysis of Arabidopsis and tomato estimated that a duplication occurred
~112 Myr ago to form a tetraploid.
It is also possible, however, that several independent segmental duplication
events took place instead of tetraploid formation and stabilization.
Comparative analysis of Arabidopsis
accessions
Comparing the multiple accessions of Arabidopsis allows us to identify commonly occurring changes in
genome microstructure. It also enables the development of new molecular markers for genetic
mapping.
- Columbia (Col-0)
- Landsberg erecta (Ler)
High rates of polymorphism between Arabidopsis accessions,
including both DNA sequence and copy number of tandem arrays,
are prevalent at loci involved in disease resistance. This has been
observed for other plant species, and such loci are thought to serve as
templates for illegitimate recombination to create new pathogen
response specificities.
A comparative analysis between 82Mb of the genome sequence of
Arabidopsis accession Columbia (Col-0) and 92.1Mb of nonredundant
low-pass (twofold redundant) sequence data of the genomic DNA of
accession Landsberg erecta (Ler) revealed two classes of differences:
• InDels (14,570 InDels at an average spacing of 6.1 kb)
• SNPs ( 1 SNP every 3.3 kb)
InDels ranged from 2 bp to over 38 kilobase-pairs, although 95% were
smaller than 50 bp. Many InDels contained entire active genes not
related to transposons. Half of such genes absent from corresponding
positions in the Col-0 sequence were found elsewhere on the genome
of Ler. This indicates that genes have been transferred to new genomic
locations.
SNPs were found in exons, introns and intergenic regions at frequencies
of 1 SNP per 3.1, 2.2 and 3.5 kb, respectively.
These analyses show that sequence polymorphisms between accessions
of Arabidopsis are common, and that they occur in both coding and non-coding
regions.
Integration of organelle genomes into the Arabidopsis
thaliana genome
Transposable elements
10% of the Arabidopsis genome consists of transposable elements
- CLASS I: replicate through RNA intermediates (2,109)
• LTR (long terminal repeat) retrotransposons / LINEs / SINEs
- CLASS II: move as DNA (2,203)
• hAT-like elements / CACTA-like elements / MITEs / MULEs
- NEW GROUPS (1,209)
• Basho / Katytid
Telomeres and centromeres
• ~15% of transporters are channel proteins
• 50% of channel proteins are aquaporins
• 10 times more MIP (major intrinsic protein) family members:
hydraulics are fundamental to many processes
• transporters of inorganic anions (phosphate, sulfate, nitrate
and chloride) and metal cation channels
• ~1,000 genes encoding Ser/Thr protein kinases → peptides are
important for signal transport in the plant
• ~12% are sugar transporters
• Surprisingly, it has homologs of the human ABC TAP transporters
of antigenic peptides for presentation to the major
histocompatibility complex (MHC)
The Arabidopsis genome has a complexity of gene regulation
comparable to that of other eukaryotes.
TRANSCRIPTION FACTORS
o Identified using similarity searches and domain matches;
o 1,709 proteins identified as transcription factors;
o 29 classes of Arabidopsis transcription factors (16 unique to plants);
o only 8-23% have related transcription factors in other eukaryotes.
CELLULAR ORGANIZATION
The evolutionary divergence in cytoskeleton organization and cytokinesis
appears to derive from the presence of the cell wall
examples:
• lack of proteins connecting the cytoskeleton to the extracellular
matrix as in animal cells;
• presence of plasmodesmata;
• cell plate formed de novo;
• absence of counterparts of the centrosome or the contractile ring of
animal cells.
DEVELOPMENT
The regulation of development in Arabidopsis involves
• Cell-cell communication
• Transcription factors
• Regulation of chromatin state
Comparison with the animal kingdom:
SAME DEVELOPMENTAL MECHANISMS, but DIFFERENT "TOOLS"
examples:
• Antero-posterior development of the embryo involves the spatially
specific activation of members of a gene family
  • Animals → homeobox
  • Plants → MADS box
• Cell-cell communication
  • Animals → receptor Tyr kinases
  • Plants → receptor Ser/Thr kinases
SIGNAL TRANSDUCTION
The plant responds to environmental changes through
1. SIGNAL RECEPTION
2. SIGNAL TRANSDUCTION
3. MODIFICATION OF THE EXPRESSION PATTERN
Messengers such as hormones (auxin, ethylene, brassinosteroids)
and peptides (CLV3)
Plants have EVOLVED their own SIGNAL TRANSDUCTION PATHWAYS
Examples:
• ETHYLENE RESPONSE → two-component system (combination of
bacterial and animal pathways)
• MAPK CASCADES → components of the His-Asp phosphorelay system
that differ from those of mammals
  – Response regulators (ARRs)
  – Pseudo-response regulators (PRRs)
  – Histidine phosphotransfer proteins (HPt)
PATHOGEN RECOGNITION AND RESPONSE
• In mammals, the polymorphism for parasite recognition is encoded
in the MHC genes and contributes to resistance.
• In plants, the disease resistance (R) genes that confer
parasite recognition are extremely polymorphic.
• Unlike MHC genes, plant resistance genes are found at different
loci, and the complete genome sequence makes it possible to analyse
their full complement and structure.
• The Arabidopsis genome contains diverse resistance genes distributed
over many loci, together with the components of signalling pathways and
many other genes whose role in disease resistance has been deduced
from mutant phenotypes.
• The evolution of resistance genes may involve duplication and
divergence of linked genes; however, most (46) resistance genes
are singletons, 50 are in pairs, 21 are in 7 clusters of 3 members, with single
clusters of 4, 5, 7, 8 and 9 members, respectively.
PHOTOMORPHOGENESIS AND PHOTOSYNTHESIS
• 100 candidate genes involved in light perception and
signalling
– new proteins similar to the photomorphogenesis regulators: COP / DET / FUS, PKS1,
PIF3, NDPK2, Spa1, FAR1, gigantea, FIN219, HY5, CCA1, ATHB-2, Zeitlupe, FKF1, LKP1,
NPH3 and RPT2.
• 139 nuclear-encoded genes that potentially function
in photosynthesis
– 11 core proteins of photosystem I, including the eukaryote-specific components
PsaG and PsaH, and 8 proteins of photosystem II, as well as
a member (psbW) of the photosystem II core
– 26 proteins similar to the chlorophyll a/b-binding
proteins (8 Lhca and 18 LHCb)
Other analyses identified some components of the light perception
pathway and showed that the components of the complexes of the
photosynthetic apparatus are divided between the nuclear and plastid genomes
METABOLOME
• Substantial number of genes encoding enzymes involved in
metabolic processes (photosynthesis, respiration, mineral
acquisition…);
• Genes often redundant (mostly tissue-specific);
• Synthesis of over 100,000 secondary metabolites (metabolic
engineering);

Important definitions
Contig: The result of joining an overlapping collection of sequences
or clones.
Scaffold: The result of connecting contigs by linking information from
paired-end reads from plasmids, paired-end reads from BACs, known
messenger RNAs or other sources. The contigs in a scaffold are
ordered and oriented with respect to one another.
N50 length: A measure of the contig length (or scaffold length)
containing a 'typical' nucleotide. Specifically, it is the maximum
length L such that 50% of all nucleotides lie in contigs (or scaffolds)
of size at least L.
Equivalently, a contig N50 is calculated by first ordering every contig by length from longest to shortest.
Next, starting from the longest contig, the lengths of each contig are summed, until this running sum
reaches or exceeds one-half of the total length of all contigs in the assembly. The contig N50 of the assembly is the
length of the shortest contig in this list. The scaffold N50 is calculated in the same fashion but uses
scaffolds rather than contigs. The longer the scaffold N50 is, the more contiguous the assembly is. The N90 statistic
is smaller than or equal to the N50 statistic; it is the length for which the collection of all contigs of that
length or longer contains at least 90% of the total of the lengths of the contigs, and for which the
collection of all contigs of that length or shorter contains at least 10% of the total of the lengths of the
contigs.
Note that N50 is calculated in the context of the assembly size rather than the genome size. The NG50
statistic is the same as the N50 except that the genome size is used rather than the assembly size.
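The definitions above translate directly into a few lines of code. The following is a minimal sketch in Python (the function name `nx` and the contig lengths are hypothetical): contigs are sorted from longest to shortest and summed until the running total reaches the chosen fraction of the assembly size, or of an estimated genome size for NG50.

```python
def nx(lengths, fraction=0.5, total=None):
    """Length of the shortest contig in the smallest set of longest contigs
    whose summed length reaches `fraction` of `total`.

    total=None uses the assembly size (N50, N90); passing an estimated
    genome size gives the corresponding NG statistic instead.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return None  # assembly too small to reach the target (possible for NG50)


contigs = [100, 80, 60, 40, 20]        # hypothetical contig lengths, in kb
n50 = nx(contigs)                      # threshold: 150 of 300 kb
n90 = nx(contigs, 0.9)                 # threshold: 270 of 300 kb
ng50 = nx(contigs, 0.5, total=400)     # assumed genome size of 400 kb
print(n50, n90, ng50)
```

With these values N90 is smaller than N50, as stated above, and NG50 is smaller than N50 because the assumed genome is larger than the assembly.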
What sequencing the human genome revealed
Background to the Human Genome Project
Choice of sequencing method
Coordination and data sharing
Public server with unrestricted access
“…we felt that the human genome sequence is the common heritage of all humanity, and the work should
transcend national boundaries…”
“We believed that scientific progress would be most rapidly advanced by immediate and free availability
of the human genome sequence.”
Generation of the draft sequence
1) CLONE SELECTION
2) SEQUENCING
3) ASSEMBLY
CLONE SELECTION
8 large-insert libraries containing BAC and PAC clones
• constructed by partial digestion of genomic DNA
• prepared from DNA obtained from anonymous donors
• together they represent about 65-fold coverage of the genome
For the large-scale sequence production phase, a genome-wide physical map of
overlapping clones was also constructed by systematic analysis of BAC clones
representing 20-fold coverage of the human genome
“Volunteers of diverse backgrounds were accepted on a first-come, first-taken basis”
BAC DNAs are digested with HindIII and
visualized on a SYBR-green-stained 1%
agarose gel. Every fifth lane contains a
mixture of marker DNAs; the sizes of
selected marker fragments are indicated.
0, origin of fragment migration.
CLONE SEQUENCING
ASSEMBLY:
Divided into 3 phases:
A) FILTERING
Remove contamination from non-human sequences
B) LAYOUT
Associate the sequenced clones with specific clones on a
physical map, to produce a "layout".
Clones obtained: 29,298
METHOD: NUMBER OF CLONES PLACED
• Associate the sequenced clones with the corresponding fingerprint clone contigs of the
physical map on the basis of "in silico" digestion profiles: 16,193
• Place the sequenced clones on a physical map using databases of end
sequences from fingerprinted BACs (Table 1)
• Combination of the two approaches: 22,566
• Exploit sequence overlap with clones already placed: 25,403
• 29,298 (- 152)
The fingerprint clone contigs were then positioned on the chromosomes using sequence
matches to STSs mapped on 2 genetic maps and 4 radiation hybrid maps, together with
FISH data. The mapping was refined throughout by comparing the order and
orientation of the STSs in the fingerprint clone contigs with the various STS-based maps.
Misoriented,
reoriented in B
In all, 942 fingerprint clone contigs contained sequenced clones. Of these, 852 (99.2%
of the DNA) had been assigned to specific positions on the chromosomes, 51 (0.5% of the DNA)
had been assigned to specific chromosomes but not to precise positions, and 39 (0.3% of the
DNA) had no localization.
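An "in silico" digestion profile is simply the list of fragment lengths obtained by cutting a known sequence at every restriction site, which can then be compared with the fragment sizes measured on a fingerprint gel. The following is a minimal sketch, assuming a linear sequence, a complete digest and HindIII (recognition site AAGCTT, cut after the first A); the function name `digest` and the test sequence are hypothetical.

```python
import re

HINDIII_SITE = "AAGCTT"   # HindIII recognition site
CUT_OFFSET = 1            # the enzyme cuts A^AGCTT, one base into the site


def digest(sequence, site=HINDIII_SITE, cut_offset=CUT_OFFSET):
    """Return the fragment lengths from a complete digest of a linear sequence."""
    sequence = sequence.upper()
    # Lookahead so that every occurrence of the site is reported.
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Hypothetical clone sequence containing two HindIII sites.
clone = "GGGG" + "AAGCTT" + "CCCCCCCC" + "AAGCTT" + "TT"
fragments = digest(clone)
print(sorted(fragments, reverse=True))
```

Sorting the fragment lengths gives a profile in the same form as the bands read off a gel lane, which is what makes the comparison with fingerprint clone contigs possible.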
ASSEMBLY:
A) FILTERING
B) LAYOUT
C) MERGING
Merge the sequences from overlapping sequenced clones using
GigAssembler.
Sequence-contig
Merged Sequence-contig
Sequence-contig scaffold
The result of the assembly is a draft sequence of the human genome.
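The merging step can be illustrated with a toy example. The sketch below is not GigAssembler's algorithm; it only shows, under the assumption of error-free sequences, how two sequences are joined when the end of one exactly matches the start of the other. The function name `merge_overlapping` and the sequences are hypothetical.

```python
def merge_overlapping(left, right, min_overlap=3):
    """Join two sequences if a suffix of `left` equals a prefix of `right`.

    Tries the longest possible overlap first; returns None when no overlap
    of at least `min_overlap` bases exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


merged = merge_overlapping("ACGTTGCA", "TGCAGGT")
print(merged)
```

Real assemblers must also tolerate sequencing errors and repeats, which is why overlaps are scored by alignment rather than by exact string matching.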
Broad genomic landscape
GC content
The vertebrate genome can be regarded as a mosaic of
isochores, that is, long DNA segments with a homogeneous
nucleotide composition
It has been proposed that the long-range variation in GC content may reflect that the genome
is composed of a mosaic of compositionally homogeneous regions that have been dubbed
`isochores'.
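Long-range variation in GC content is usually examined by computing the GC fraction in consecutive windows along the sequence. The following is a minimal sketch, with hypothetical function names and a toy sequence; real analyses use windows of tens of kilobases or more.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window, step=None):
    """GC fraction in windows of fixed size; non-overlapping by default."""
    step = step or window
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: an AT-rich stretch followed by a GC-rich stretch.
toy = "ATATATATAT" + "GCGCGCGCGC"
profile = gc_windows(toy, window=10)
print(profile)
```

A compositionally homogeneous region would show up as a run of windows with similar values, while a boundary between regions appears as a sharp change.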
CpG island
CpG sites are regions of DNA where a cytosine nucleotide occurs next to a guanine nucleotide in the
linear sequence of bases along its length. "CpG" is shorthand for "—C—phosphate—G—", that is,
cytosine and guanine separated by a phosphate, which links the two nucleosides together in DNA. The
"CpG" notation is used to distinguish this linear sequence from the CG base-pairing of cytosine and
guanine.
CpG islands typically occur at or near the transcription start site of genes, particularly housekeeping
genes, in vertebrates. Normally a C base followed immediately by a G base (a CpG) is rare in
vertebrate DNA because the cytosines in such an arrangement tend to be methylated. However,
over evolutionary time methylated cytosines tend to turn into thymines because of spontaneous
deamination. As a result residual CpG islands are created in areas where methylation is rare, and CpG
sites stick.
Methylation of CpG sites is followed by spontaneous deamination leading to a lack of CpG sites in
methylated DNA.
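The depletion of CpG sites can be quantified by comparing the observed number of CpG dinucleotides with the number expected from the base composition: obs/exp = (number of CpG × N) / (number of C × number of G), where N is the sequence length. The following is a minimal sketch of that calculation; the function name `cpg_stats` is hypothetical, and the thresholds often quoted for calling an island (length of at least 200 bp, GC content of at least 50%, obs/exp of at least 0.6) are mentioned here as an assumption, not as something stated in these slides.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")                      # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0         # expected if C and G were independent
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


island_like = cpg_stats("CGCGCGCG")       # CpG-rich toy sequence
depleted = cpg_stats("TGCATGCATGCA")      # same bases present, but no CpG
print(island_like, depleted)
```

The second toy sequence has a GC fraction of 0.5 but a ratio of zero, which is the pattern expected in bulk methylated DNA.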
Maintenance methylation: during DNA replication, the newly synthesized strand is methylated
at the positions corresponding to the methylated sites on the parental strand, ensuring that the daughter molecules have the
same methylation profile as the original molecule.
De novo methylation: the dynamic form of methylation that changes the profile of the methylome.
The count of 28,890 CpG islands is reasonably
close to the previous estimate of about
35,000.
Most of the islands are short, with 60-70% GC
content. More than 95% of the islands are
less than 1,800 bp long, and more than 75%
are less than 850 bp.
The longest CpG island (on chromosome 10) is
36,619 bp long, and 322 are longer than 3,000
bp.
The density of CpG islands varies substantially
among some of the chromosomes. Most
chromosomes have 5-15 islands per Mb, with a
mean of 10.5 islands per Mb. However,
chromosome Y has an unusually low 2.9 islands
per Mb, and chromosomes 16, 17 and 22 have
19-22 islands per Mb. The extreme outlier is
chromosome 19, with 43 islands per Mb.
Comparison of genetic and physical distance
Repeated sequences in the human genome
In the human, coding sequences comprise less than 5% of the genome, whereas repeat sequences
account for at least 50% and probably much more.
Broadly, the repeats fall into five classes:
1. transposon-derived repeats, often referred to as interspersed repeats;
2. inactive (partially) retroposed copies of cellular genes (including protein-coding genes and small
structural RNAs), usually referred to as processed pseudogenes;
3. simple sequence repeats, consisting of direct repetitions of relatively short k-mers such as (A)n, (CA)n
or (CGG)n;
4. segmental duplications, consisting of blocks of around 10-300 kb that have been copied from one
region of the genome into another region;
5. blocks of tandemly repeated sequences, such as at centromeres, telomeres, the short arms of
acrocentric chromosomes and ribosomal gene clusters.
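Of the five classes, simple sequence repeats are the easiest to find computationally, because they are direct repetitions of a short unit such as (A)n, (CA)n or (CGG)n. The following is a minimal sketch using a regular expression with a backreference; the function name `find_ssrs`, the unit-length limit and the minimum copy number are all assumptions chosen for illustration.

```python
import re


def find_ssrs(seq, max_unit=3, min_repeats=4):
    """Find direct tandem repeats of a unit of 1..max_unit bases.

    Returns (start position, repeat unit, number of copies) for each run
    with at least `min_repeats` copies. The lazy quantifier makes the
    shortest possible unit win, so AAAA is reported as (A)4, not (AA)2.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Toy sequence containing (CA)5 and (CGG)4.
toy_seq = "TTG" + "CA" * 5 + "GT" + "CGG" * 4 + "T"
ssrs = find_ssrs(toy_seq)
print(ssrs)
```

The other repeat classes cannot be found this way: interspersed repeats and segmental duplications are detected by similarity searches against repeat libraries or against the genome itself.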
Repeats are often described as `junk' and dismissed as uninteresting. However, they actually represent
an extraordinary trove of information about biological processes. The repeats constitute a rich
palaeontological record, holding crucial clues about evolutionary events and forces. As passive markers,
they provide assays for studying processes of mutation and selection.
Transposable elements
Most human repeat sequence is derived from transposable elements. We can currently recognize about
45% of the genome as belonging to this class. Much of the remaining ‘unique’ DNA must also be derived
from ancient transposable element copies that have diverged too far to be recognized as such.
In mammals, almost all transposable elements fall into one of four types, of which three transpose
through RNA intermediates and one transposes directly as DNA. These are long interspersed elements
(LINEs), short interspersed elements (SINEs), LTR retrotransposons and DNA transposons. SINEs, LINEs,
LTR retroposons and DNA transposon copies comprise 13%, 20%, 8% and 3% of the sequence,
respectively.
Long Interspersed Elements (LINEs)
LINEs are one of the most ancient and successful inventions in eukaryotic genomes. In humans, these
transposons are about 6 kb long, harbour an internal polymerase II promoter and encode two open
reading frames (ORFs).
ORF1 encodes an RNA binding protein and ORF2 encodes a protein having an endonuclease (e.g. RNase
H) as well as a reverse transcriptase. Upon translation, a LINE RNA assembles with its own encoded
proteins and moves to the nucleus. The reverse transcriptase makes a DNA copy of the RNA that can be
integrated into the genome at a new site. Because LINEs (and other class I transposons, e.g. LTR
retrotransposons and SINEs) move by copying themselves (instead of moving by a cut and paste like
mechanism, as class II transposons do), they enlarge the genome.
Three distantly related LINE families are found in the human genome: LINE1, LINE2 and LINE3.
Only LINE1 is still active.
Short Interspersed Elements (SINE)
SINEs are wildly successful freeloaders on the backs of LINE elements. They are short (about 100-400
bp), harbour an internal polymerase III promoter and encode no proteins. These nonautonomous
transposons are thought to use the LINE machinery for transposition. Indeed, most SINEs `live' by
sharing the 3’ end with a resident LINE element. The promoter regions of all known SINEs are derived
from tRNA sequences, with the exception of a single monophyletic family of SINEs derived from the
signal recognition particle component 7SL. This family, which also does not share its 3’ end with a LINE,
includes the only active SINE in the human genome: the Alu element. By contrast, the mouse has both
tRNA-derived and 7SL-derived SINEs.
The human genome contains three distinct monophyletic families of SINEs: the active Alu, and the
inactive MIR and Ther2/MIR3.
Short Interspersed Elements (SINEs): The Alu
Example
Alu elements are highly repetitive DNA sequences that can be classified as SINEs (short interspersed
elements), which are themselves a type of "nonautonomous" retrotransposon. An Alu element is
transcribed into RNA by RNA polymerase III and then converted into a double-stranded DNA
molecule by reverse transcriptase. The new double-stranded DNA molecule is then inserted into a
new location in the genome.
Because they are nonautonomous, like all SINEs, Alu elements don't have the genetic capacity to
produce DNA copies of themselves or to integrate into new chromosomal locations. For those activities,
they rely on another type of transposon, called L1.
Most Alu elements are approximately 300 base pairs long, with considerable sequence variation.
Alu elements frequently duplicate when they jump, and scientists estimate that the human genome
acquires one new Alu insert in approximately every 200 births.
Alu TEs are believed to have emerged in primates around 65 million years ago. Today, they are the
most abundant type of human TE, making up an amazing 10% of the (diploid) human genome. Thus,
in a mere 65 million years, these transposons have gone from zero to about 1 million copies per cell!
These elements are spread throughout the genome and occur at varying densities in different loci.
LTR retroposons
LTR retroposons are flanked by long terminal direct repeats (LTR) that contain all of the necessary
transcriptional regulatory elements. The autonomous elements (retrotransposons) contain gag and pol
genes, which encode a protease, reverse transcriptase, RNAse H and integrase.
Exogenous retroviruses seem to have arisen from endogenous retrotransposons by acquisition of a
cellular envelope gene (env). Transposition occurs through the retroviral mechanism with reverse
transcription occurring in a cytoplasmic virus-like particle, primed by a tRNA (in contrast to the nuclear
location and chromosomal priming of LINEs). Although a variety of LTR retrotransposons exist, only the
vertebrate-specific endogenous retroviruses (ERVs) appear to have been active in the mammalian
genome.
Mammalian retroviruses fall into three classes (I-III), each comprising many families with independent
origins. Most (85%) of the LTR retroposon-derived `fossils' consist only of an isolated LTR, with the
internal sequence having been lost by homologous recombination between the flanking LTRs.
DNA transposons
DNA transposons resemble bacterial transposons, having terminal inverted repeats and encoding a
transposase that binds near the inverted repeats and mediates mobility through a 'cut-and-paste'
mechanism.
The human genome contains at least seven major classes of DNA transposon, which can be subdivided
into many families with independent origins. DNA transposons tend to have short life spans within a
species. This can be explained by contrasting the modes of transposition of DNA transposons and LINE
elements. LINE transposition tends to involve only functional elements by which LINE proteins assemble
with the RNA from which they were translated. For DNA transposons, the encoded transposase is
produced in the cytoplasm and, when it returns to the nucleus, it cannot distinguish active from
inactive elements. As inactive copies accumulate in the genome, transposition becomes less efficient.
This checks the expansion of any DNA transposon family and in due course causes it to die out. To
survive, DNA transposons must eventually move by horizontal transfer to virgin genomes, and there is
considerable evidence for such transfer.
Average age of transposable elements in the human genome
Human
The overall activity of all transposons has declined markedly over the past 35-50 Myr, with the
possible exception of LINE1. Indeed, apart from an exceptional burst of activity of Alu peaking
around 40 Myr ago, there would appear to have been a fairly steady decline in activity in the
hominid lineage since the mammalian radiation.
Mouse
Transposon activity in the mouse genome has not undergone the decline seen in humans and
proceeds at a much higher rate. LTR retroposons are alive and LINE1 and a variety of SINEs are
quite active.
[Figure: age of transposable elements in human and mouse; horizontal axis from today to 200 Myr ago]
Comparison with other organisms
• The euchromatic portion of the human genome has a much higher density of transposable element
copies than the euchromatic DNA of the other three organisms.
• The human genome is filled with copies of ancient transposons, whereas the transposons in the
other genomes tend to be of more recent origin.
• Whereas in the human two repeat families (LINE1 and Alu) account for 60% of all interspersed repeat
sequence, the other organisms have no dominant families.
Distribution of repeats in the genome
Some regions of the genome are extraordinarily dense in repeats. The prizewinner appears to be a
525-kb region on chromosome Xp11, with an overall transposable element density of 89%. This region
contains a 200-kb segment with 98% density, as well as a segment of 100 kb in which LINE1 sequences
alone comprise 89% of the sequence.
In contrast, some genomic regions are nearly devoid of repeats. The absence of repeats may be a sign
of large-scale cis-regulatory elements that cannot tolerate being interrupted by insertions. The four
regions with the lowest density of interspersed repeats in the human genome are the four homeobox
gene clusters, HOXA, HOXB, HOXC and HOXD.
Distribution of GC content
LINEs: found in the more AT-rich regions.
SINEs (MIR, Alu): the opposite trend.
Positive selection for Alus in GC-rich regions would imply that they benefit the organism
(because that is where the genes are).
This hypothesis is based on the observation that in many species SINEs are transcribed
under conditions of stress, and the resulting RNAs specifically bind a particular protein
kinase (PKR) and block its ability to inhibit protein translation. SINE RNAs would thus
promote protein translation under stress. SINE RNA may be well suited to such a role in
regulating protein translation, because it can be quickly transcribed in large quantities from
thousands of elements and it can function without protein translation. Therefore, there
could be positive selection for SINEs in readily transcribed open chromatin such as is found
near genes.
This could explain the retention of Alus in gene-rich GC-rich regions. It is also consistent
with the observation that SINE density in AT-rich DNA is higher near genes.
Chromosome Y
The genetic material on chromosome Y is unusually young, probably owing to a high tolerance
for gain of new material by insertion and loss of old material by deletion.
Several lines of evidence support this picture:
• LINE elements on chromosome Y are on average much younger than those on autosomes.
• MaLR family retroposons on chromosome Y are younger than those on autosomes.
• Chromosome Y has a relative over-representation of the younger retroviral class II (ERVK) and
a relative under-representation of the primarily older class III (ERVL) compared with other
chromosomes.
Mutation rate in males and females
Interspersed repeats on chromosome Y can also be used to estimate the relative mutation rates
in the male and female germlines. Chromosome Y always resides in males, whereas
chromosome X resides in females twice as often as in males.
Repeat elements from recent subfamilies (effectively, birth cohorts dating from the past 50 Myr)
were identified, and the substitution rates of subfamily members on chromosomes X and Y were
measured (Fig. 29). There is a clear linear relationship, corresponding to a mutation rate
in the male germline about 2.1 times higher than in the female germline.
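The link between the X and Y substitution rates and the male-to-female ratio can be made explicit with a small calculation. The sketch below assumes the usual sex-averaged model, which is not spelled out in these notes: chromosome Y spends all of its time in the male germline, while chromosome X spends two thirds of its time in females and one third in males. The function names are illustrative only.

```python
def y_to_x_rate_ratio(alpha: float) -> float:
    """Expected mu_Y / mu_X for a male-to-female mutation rate ratio alpha."""
    # Y is always in males:            mu_Y = mu_m = alpha * mu_f
    # X is 2/3 in females, 1/3 in males: mu_X = (2 * mu_f + mu_m) / 3
    return 3.0 * alpha / (2.0 + alpha)


def alpha_from_ratio(ratio: float) -> float:
    """Invert the relation above: alpha = 2r / (3 - r), valid for 0 <= r < 3."""
    if not 0.0 <= ratio < 3.0:
        raise ValueError("ratio must lie in [0, 3)")
    return 2.0 * ratio / (3.0 - ratio)


# Forward calculation for the male-to-female ratio quoted in the notes.
expected_ratio = y_to_x_rate_ratio(2.1)
print(round(expected_ratio, 3))
```

Under these assumptions, a male-to-female ratio of 2.1 corresponds to chromosome Y accumulating substitutions roughly 1.5 times faster than chromosome X.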
Active transposons
- 950 LINEs
- sequences found with full-length elements and intact ORFs: 61 potentially active LINEs
TRANSPOSONS AS A CREATIVE FORCE
- 47 human genes probably derived from transposons (the RAG1 and RAG2 recombinases, etc.)
- LTR retroposons used as transcription terminators (a few hundred genes)
- Repeated sequences turned into regulatory elements
Simple Sequence Repeats
Simple sequence repeats (SSRs) are a rather different type of repetitive structure that is common
in the human genome - perfect or slightly imperfect tandem repeats of a particular k-mer. SSRs
with a short repeat unit (n = 1-13 bases) are often termed microsatellites, whereas those with
longer repeat units (n = 14-500 bases) are often termed minisatellites.
SSRs comprise about 3% of the human genome,
with the greatest single contribution coming from
dinucleotide repeats (0.5%). Trinucleotide SSRs are
much less frequent than dinucleotide SSRs.
There is approximately one SSR per 2 kb.
SSRs have been extremely important in human genetic studies, because they show a high degree of
length polymorphism in the human population owing to frequent slippage by DNA polymerase
during replication. Genetic markers based on SSRs, particularly (CA)n repeats, have been the
workhorse of most human disease mapping studies. The availability of a comprehensive catalogue
of SSRs is thus a boon for human genetic studies.
Segmental duplications
Segmental duplications involve the transfer of blocks of genomic sequence of 1-200 kb
to one or more positions in the genome.
- The draft human genome contains at least 3.3% segmental duplications.
- Final estimate: 5% segmental duplications.
They can be divided into two categories.
1) Interchromosomal duplications are defined as segments that are duplicated
between non-homologous chromosomes.
2) Intrachromosomal duplications occur within a particular
chromosome or chromosome arm.
This category includes several repeated segments, also known as low-copy
repeat sequences, that mediate the recurrent structural rearrangements
of chromosomes associated with numerous genetic diseases.
The high proportion of large duplications clearly distinguishes the
human genome from other sequenced genomes.
Distribution of segmental duplications
Chromosome 22 contains a region of 1.5Mb adjacent to the centromere in which 90% of sequence can
now be recognized to consist of interchromosomal duplication. Conversely, 52% of the
interchromosomal duplications on chromosome 22 were located in this region, which comprises only 5%
of the chromosome. Also, the subtelomeric end consists of a 50-kb region consisting almost entirely of
interchromosomal duplications. Chromosome 21 presents a similar landscape (these were the two
best-assembled chromosomes).
Interchromosomal duplications
Intrachromosomal duplications
Gene content of the human genome
In organisms with small genomes, it is straightforward to identify most genes by the presence of
long ORFs.
In contrast, human genes tend to have small exons (encoding an average of only 50 codons)
separated by long introns (some exceeding 10 kb).
This creates a signal-to-noise problem, with the result that computer programs for direct gene
prediction have only limited accuracy. Instead, computational prediction of human genes must rely
largely on the availability of cDNA sequences or on sequence conservation with genes and
proteins from other organisms. This approach is adequate for strongly conserved genes (such as
histones or ubiquitin), but may be less sensitive to rapidly evolving genes (including many crucial
to speciation, sex determination and fertilization).
Non-coding RNAs
Although biologists often speak of a tight coupling between “genes and their encoded protein
products”, it is important to remember that thousands of human genes produce noncoding RNAs
(ncRNAs) as their ultimate product.
There are several major classes of ncRNA.
1. Transfer RNAs (tRNAs) are the adapters that translate the triplet nucleic acid code of RNA into
the amino-acid sequence of proteins;
2. Ribosomal RNAs (rRNAs) are also central to the translational machinery, and recent X-ray
crystallography results strongly indicate that peptide bond formation is catalysed by rRNA, not
protein;
3. Small nucleolar RNAs (snoRNAs) are required for rRNA processing and base modification in the
nucleolus;
4. Small nuclear RNAs (snRNAs) are critical components of spliceosomes, the large
ribonucleoprotein (RNP) complexes that splice introns out of pre-mRNAs in the nucleus.
ncRNAs do not have translated ORFs, are often small and are not polyadenylated. Accordingly,
novel ncRNAs cannot readily be found by computational gene-finding techniques (which search
for features such as ORFs) or experimental sequencing of cDNA or EST libraries.
tRNAs
Although 61 sense codons need to be decoded,
not all 61 different anticodons are present in
tRNAs. Rather, tRNAs generally follow stereotyped
and conserved wobble rules.
Wobble reduces the number of required
anticodons substantially, and provides a
connection between the genetic code and the
hybridization stability of modified and unmodified
RNA bases.
In eukaryotes, it has been predicted that about 46
tRNA species will be sufficient to read the 61
sense codons (counting the initiator and elongator
methionine tRNAs as two species). According to
these rules, in the codon's third (wobble)
position, U and C are generally decoded by a
single tRNA species, whereas A and G are decoded
by two separate tRNA species.
Wobble
tRNAs
The classical experimental estimate of the number of
human tRNA genes is 1,310. In the draft genome sequence,
only 497 human tRNA genes were found, plus 324 tRNA-derived putative pseudogenes.
This indicates that the human has fewer tRNA genes than
the worm, but more than the fly. This may seem
surprising, but tRNA gene number in metazoans is thought
to be related not to organismal complexity, but more to the
demand for tRNA abundance in certain tissues or stages of
embryonic development.
The tRNA genes are dispersed throughout the human
genome, but this dispersal is nonrandom. More than 25%
of the tRNA genes (140) are found in a region of only
about 4Mb on chromosome 6. This small region, only
about 0.1% of the genome, contains an almost sufficient
set of tRNA genes all by itself.
Nevertheless, the human tRNA gene set predicted from the draft
genome sequence appears to include most of the known
human tRNA species. The draft genome sequence contains
37 of 38 human tRNA species listed in a tRNA database.
Satisfyingly, the human tRNA set follows these wobble
rules almost perfectly.
In eukaryotes, 46 tRNA species are sufficient to
read the 61 sense codons, because the wobble hypothesis
applies to the third position:
• one tRNA for U/C
• A and G decoded by two separate tRNAs
Ribosomal RNAs, snoRNAs and snRNAs
Ribosomal RNA genes.
The ribosome, the protein synthetic machine of the cell, is made up of two subunits and contains four
rRNA species and many proteins. The large ribosomal subunit contains 28S and 5.8S rRNAs (collectively
called 'large subunit' rRNA) and also a 5S rRNA. The small ribosomal subunit contains 18S rRNA ('small
subunit' rRNA). The genes for large subunit and small subunit rRNA occur in the human genome as a
44-kb tandem repeat unit. There are about 150-200 copies of this repeat unit arrayed on the short
arms of acrocentric chromosomes 13, 14, 15, 21 and 22.
The 5S rDNA genes also occur in tandem arrays, the largest of which is on chromosome 1 close to the
telomere. There are 200-300 true 5S genes in these arrays.
Small nucleolar RNA genes.
Eukaryotic rRNA is extensively processed and modified in the nucleolus. Much of this activity is directed
by numerous snoRNAs. There is a compiled set of 97 known human snoRNA gene sequences; 84 of
these (87%) have at least one copy in the draft genome sequence, almost all as single-copy genes.
Spliceosomal RNAs and other ncRNA genes.
At least one copy was found for 21 (95%) of 22 known ncRNAs, including the spliceosomal snRNAs.
There were multiple copies for several ncRNAs, as expected; for example, 44 dispersed genes for U6
snRNA, and 16 for U1 snRNA.
Non-coding RNA genes
Characterization of the properties of known protein-coding genes
Identifying the protein-coding genes in the human genome is one of the most important applications of
the sequence data, but also one of the most difficult challenges.
Before attempting to identify new genes, the authors explored what could be learned by aligning the
cDNA sequences of known genes to the draft genome sequence. Genomic alignments allow one to
study exon-intron structure and local GC content.
The 'known' genes studied were those in the RefSeq database, a manually curated collection designed
to contain non-redundant representatives of most full-length human mRNA sequences in GenBank
(RefSeq intentionally contains some alternative splice forms of the same genes). The version of RefSeq
used contained 10,272 mRNAs.
The RefSeq genes were aligned with the draft genome sequence. Because this sequence is incomplete
and contains errors, not all genes could be fully aligned and some may have been incorrectly aligned.
More than 92% of the RefSeq entries could be aligned at high stringency over at least part of their
length, and 85% could be aligned over more than half of their length. Some genes (16%) had high
stringency alignments to more than one location in the draft genome sequence owing, for example,
to paralogues or pseudogenes.
Protein-coding genes
There is considerable variation in overall gene size and intron size, with both distributions having very
long tails.
Many genes are over 100 kb long, the largest known example being the dystrophin gene (DMD) at
2.4Mb.
The titin gene has the longest currently known coding sequence at 80,780 bp; it also has the largest
number of exons (178) and longest single exon (17,106 bp).
Properties of human genes compared to those from worm and fly
For all three organisms, the typical length of a
coding sequence is similar (1,311 bp for
worm, 1,497 bp for fly and 1,340 bp for
human), and most internal exons fall within a
common peak between 50 and 200 bp.
However, the worm and fly exon distributions
have a fatter tail, resulting in a larger mean
size for internal exons (218 bp for worm
versus 145 bp for human)
In contrast to the exons, the intron size
distributions differ substantially among the
three species. The worm and fly have most
introns near the preferred minimum intron
length (47 bp for worm, 59 bp for fly) and an
extended tail (overall average length of 267 bp
for worm and 487 bp for fly). Intron size is
much more variable in humans, with a peak
at 87 bp but a very long tail resulting in a
mean of more than 3,300 bp. The variation in
intron size results in great variation in gene
size.
Distribution of GC content in genes and in the genome
The variation in gene size and intron size can
partly be explained by the fact that GC-rich
regions tend to be gene-dense with many
compact genes, whereas AT-rich regions tend
to be gene-poor with many sprawling genes
containing large introns
The correlation appears to be due primarily
to intron size, which drops markedly with
increasing GC content. In contrast, coding
properties such as exon length or exon
number (data not shown) vary little.
Alternative splicing
To investigate the prevalence of alternative splicing, reconstructed mRNA transcripts
covering the entire coding regions of genes on chromosome 22 were analysed (omitting
small genes with coding regions of less than 240 bp). Potential transcripts identified by
alignments of ESTs and cDNAs to genomic sequence were verified by human inspection.
A total of 642 transcripts were found, covering 245 genes (average of 2.6 distinct transcripts per
gene). Two or more alternatively spliced transcripts were found for 145 (59%) of these
genes.
A similar analysis for the gene-rich chromosome 19 gave 1,859 transcripts, corresponding to
544 genes (average 3.2 distinct transcripts per gene).
Because only a subset of all transcripts has been sampled, the true extent of alternative
splicing is likely to be greater. Nevertheless, these figures are considerably higher than
those for worm, in which analysis reveals alternative splicing for 22% of genes for which
ESTs have been found, with an average of 1.34 (12,816/9,516) splice variants per gene.
Seventy per cent of alternative splice forms found in the genes on chromosomes 19 and 22
affect the coding sequence, rather than merely changing the 3' or 5' UTR.
Gene prediction
Gene identification is almost trivial in bacteria and yeast, because the absence of introns in
bacteria and their paucity in yeast means that most genes can be readily recognized by ab
initio analysis as unusually long ORFs. It is not as simple, but still relatively straightforward,
to identify genes in animals with small genomes and small introns, such as worm and fly. A
major factor is the high signal-to-noise ratio: coding sequences comprise a large
proportion of the genome and a large proportion of each gene (about 50% for worm and
fly), and exons are relatively large.
Gene identification is more difficult in human DNA. The signal-to-noise ratio is lower:
coding sequences comprise only a few per cent of the genome and an average of about 5%
of each gene; internal exons are smaller than in worms; and genes appear to have more
alternative splicing.
Previous estimates of human gene number
• Early estimates based on reassociation kinetics estimated the mRNA complexity of typical vertebrate
tissues to be 10,000-20,000, and were extrapolated to suggest around 40,000 for the entire
genome.
• In the mid-1980s, Gilbert suggested that there might be about 100,000 genes, based on the
approximate ratio of the size of a typical gene (3 x 10^4 bp) to the size of the genome (3 x 10^9 bp).
• An estimate of 70,000-80,000 genes was made by extrapolating from the number of CpG islands
and the frequency of their association with known genes.
• As human sequence information has accumulated, it has been possible to derive estimates on the
basis of ESTs. Such calculations consistently produce low estimates, in the region of 35,000.
Gene prediction
Gene prediction methods employed combinations of three basic approaches:
• direct evidence of transcription provided by ESTs or mRNAs
• indirect evidence based on sequence similarity to previously identified genes and
proteins
• ab initio recognition of groups of exons on the basis of hidden Markov models (HMMs)
that combine statistical information about splice sites, coding bias and exon and intron
lengths
The process resulted in version 1 of the integrated gene index (IGI). The composition of the
corresponding integrated protein index (IPI) 1, obtained by translating IGI 1, is given below. There
are 31,778 protein predictions, with 14,882 from known genes, 4,057 predictions from Ensembl
merged with Genie and 12,839 predictions from Ensembl alone. The IGI set thus contains about
15,000 known genes and about 17,000 gene predictions.
The average lengths are 469 amino acids for the known proteins, 443 amino acids for protein
predictions from the Ensembl-Genie merge, and 187 amino acids for those from Ensembl alone.
Ensembl starts from ab initio predictions made by Genscan and looks for confirmation by similarity to proteins, mRNAs and ESTs.
Genie starts from matches to mRNAs or ESTs and extends the matches using Markov models.
Chromosomal distribution of genes
The average density of gene predictions is 11.1 per Mb across the genome, with the
extremes being chromosome 19 at 26.8 per Mb and chromosome Y at 6.4 per Mb. It is
likely that a significant number of the predictions on chromosome Y are pseudogenes
(this chromosome is known to be rich in pseudogenes) and thus that the density for
chromosome Y is an overestimate.
The density of both genes and Alu on chromosome 19 is much higher than expected,
even accounting for the high GC content of the chromosome; this supports the idea
that Alu density is more closely correlated with gene density than with GC content
itself.
“If there are 30,000-35,000 genes, with an average coding length of
about 1,400 bp and average genomic extent of about 30 kb, then about
1.5% of the human genome would consist of coding sequence and one-third of the genome would be transcribed in genes.”
“The human thus appears to have only about twice as many genes as
worm or fly. However, human genes differ in important respects from
those in worm and fly. They are spread out over much larger regions of
genomic DNA, and they are used to construct more alternative
transcripts. This may result in perhaps five times as many primary
protein products in the human as in the worm or fly.”
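The arithmetic behind the first quoted estimate can be checked quickly. The sketch below assumes a genome size of 3.2 Gb, a figure that is not stated in this passage, and the function name is illustrative.

```python
GENOME_BP = 3.2e9  # assumed genome size in bp (not given in the quoted passage)


def genome_fractions(n_genes, coding_bp=1_400, gene_extent_bp=30_000,
                     genome_bp=GENOME_BP):
    """Return (coding fraction, transcribed fraction) of the genome."""
    coding = n_genes * coding_bp / genome_bp
    transcribed = n_genes * gene_extent_bp / genome_bp
    return coding, transcribed


for n_genes in (30_000, 35_000):
    coding, transcribed = genome_fractions(n_genes)
    print(f"{n_genes} genes: {coding:.1%} coding, {transcribed:.0%} transcribed")
```

With these inputs the coding fraction comes out at about 1.3-1.5% and the transcribed fraction at about 28-33%, in line with the quoted "about 1.5%" and "one-third".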
Evaluation/validation of IGI/IPI
Program based mainly on Ensembl:
- Sensitivity
- Specificity
- Fragmentation
• Comparison with new genes not yet annotated
31 new genes → 28 were in the draft sequence → 19 identified in IGI/IPI
Sensitivity = 68% (19/28)
Specificity = 61% (19/31)
Fragmentation = 1.4 (27/19)
Genes    Predictions per gene    Genes x predictions
14       1                       14
3        2                       6
1        3                       3
1        4                       4
19       (total)                 27
Thirty-one genes discovered in the meantime but not yet annotated were taken: 28 were in the
draft and 19 had been predicted. Fragmentation looks at genes that correspond to
more than one prediction.
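The three quality measures can be recomputed from the counts in the table. This is only a worked check of the slide's numbers; the variable names are illustrative.

```python
# predictions per gene -> number of genes with that many predictions (from the table)
genes_by_prediction_count = {1: 14, 2: 3, 3: 1, 4: 1}

new_genes = 31   # newly discovered genes used as the test set
in_draft = 28    # of those, genes present in the draft sequence

genes_found = sum(genes_by_prediction_count.values())
total_predictions = sum(k * n for k, n in genes_by_prediction_count.items())

sensitivity = genes_found / in_draft              # 19/28, as defined on the slide
specificity = genes_found / new_genes             # 19/31, as defined on the slide
fragmentation = total_predictions / genes_found   # predictions per detected gene

print(f"sensitivity   = {sensitivity:.0%}")
print(f"specificity   = {specificity:.0%}")
print(f"fragmentation = {fragmentation:.1f}")
```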
Comparative proteome analysis
Compared with the two invertebrates, humans appear to have many proteins involved
in cytoskeleton, defence and immunity, and transcription and translation. These
expansions are clearly related to aspects of vertebrate physiology. Humans also have
many more proteins that are classified as falling into more than one functional
category (426 in human versus 80 in worm and 57 in fly, data not shown).
Interestingly, 32% of these are transmembrane receptors.
Comparative proteome analysis
Probable horizontal transfer
An interesting category is a set of 223 proteins that have significant similarity to proteins from bacteria,
but no comparable similarity to proteins from yeast, worm, fly and mustard weed, or indeed from any
other (nonvertebrate) eukaryote.
These sequences should not represent bacterial contamination in the draft human sequence, because
the sequences were filtered to eliminate those essentially identical to known bacterial plasmid,
transposon or chromosomal DNA (such as the host strains for the large-insert clones).
To investigate whether these were genuine human sequences, PCR primers were designed for 35 of
these genes, and it was confirmed that most could be readily detected directly in human genomic DNA.
Orthologues of many of these genes have also been detected in other vertebrates.
A more detailed computational analysis indicated that at least 113 of these genes are widespread
among bacteria, but, among eukaryotes, appear to be present only in vertebrates. It is possible that
the genes encoding these proteins were present in both early prokaryotes and eukaryotes, but were lost
in each of the lineages of yeast, worm, fly, mustard weed and, possibly, from other nonvertebrate
eukaryote lineages. A more parsimonious explanation is that these genes entered the vertebrate (or
prevertebrate) lineage by horizontal transfer from bacteria. Many of these genes contain introns,
which presumably were acquired after the putative horizontal transfer event.
Flow diagram for sequencing pipeline
Library construction and sequencing
“Celera believed that the initial version of a completed human genome should be a
composite derived from multiple donors of diverse ethnic backgrounds. Prospective donors
were asked, on a voluntary basis, to self-designate an ethnogeographic category (e.g.,
African-American, Chinese, Hispanic, Caucasian, etc.). We enrolled 21 donors.”
• From females, 130 ml of whole, heparinized blood was collected.
• From males, 130 ml of whole, heparinized blood was collected, as well as five specimens
of semen, collected over a 6-week period.
• Permanent lymphoblastoid cell lines were created by Epstein-Barr virus immortalization.
DNA from five subjects was selected for genomic DNA sequencing: two males and three
females — one African-American, one Asian-Chinese, one Hispanic-Mexican, and two
Caucasians
DNA from each donor was used to construct plasmid libraries in one or more of three size
classes: 2 kbp, 10 kbp, and 50 kbp. This was done at the Celera facility, which occupied
about 30,000 square feet of laboratory space and produced sequence data continuously at a
rate of 175,000 total reads per day
After quality and vector trimming, the average trimmed sequence length was 543 bp, and
the sequencing accuracy was exponentially distributed with a mean of 99.5% and with less
than 1 in 1000 reads being less than 98% accurate
«We used automated high-throughput DNA sequencing and the computational infrastructure to enable
efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9
billion bp of sequence).
Sequencing and tracking from both ends of plasmid clones from 2-, 10-, and 50-kbp libraries were
essential to the computational reconstruction of the genome. Our evidence indicates that the accurate
pairing rate of end sequences was greater than 98%.»
Data used for the assembly
Celera
Two independent sets of data were used for the assemblies.
• The first was a random shotgun data set of 27.27 million reads of average length 543 bp
produced at Celera. This consisted largely of mate-pair reads from 16 libraries constructed
from DNA samples taken from five different donors. Libraries with insert sizes of 2, 10, and
50 kbp were used. Assuming a genome size of 2.9 Gbp, the Celera trimmed sequences gave
a 5.13X coverage of the genome, and clone coverage was 3.42X, 16.40X, and 18.84X for the
2-, 10-, and 50-kbp libraries, respectively, for a total of 38.7X clone coverage.
• The second data set was from the publicly funded Human Genome Project (PFP) and is
primarily derived from BAC clones. For the whole-genome assembly, the PFP data was
first disassembled or “shredded” into a synthetic shotgun data set of 550-bp reads that
form a perfect 2X covering of the bactigs. This resulted in 16.05 million “faux” reads that
were sufficient to cover the genome 2.96X because of redundancy in the BAC data set,
without incorporating the biases inherent in the PFP assembly process.
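The coverage figures above follow from simple ratios, sketched below using only numbers quoted in this section. Sequence coverage is taken here as total read length divided by genome size; the constant and function names are illustrative.

```python
GENOME_BP = 2.9e9            # genome size assumed in the text
CELERA_READS = 27_270_000    # Celera shotgun reads
FAUX_READS = 16_050_000      # synthetic reads shredded from PFP bactigs
MEAN_READ_BP = 543           # average trimmed Celera read length


def sequence_coverage(n_reads, mean_read_bp, genome_bp=GENOME_BP):
    """Average number of times each base is covered by read sequence."""
    return n_reads * mean_read_bp / genome_bp


celera_coverage = sequence_coverage(CELERA_READS, MEAN_READ_BP)

# Clone coverage of the 2-, 10- and 50-kbp libraries, as quoted in the text.
clone_coverage = [3.42, 16.40, 18.84]
total_clone_coverage = sum(clone_coverage)

print(f"Celera sequence coverage ~ {celera_coverage:.1f}X")
print(f"total clone coverage     ~ {total_clone_coverage:.1f}X")
print(f"combined read count      = {CELERA_READS + FAUX_READS:,}")
```

Using the rounded read count and mean length gives a Celera sequence coverage of about 5.1X, close to the quoted 5.13X, and the three clone coverages sum to about 38.7X.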
Assembly strategies and genome characterization
Two different approaches to assembly were pursued:
• a whole-genome assembly process that used Celera data and the PFP data in the form of additional
synthetic shotgun data
• a compartmentalized assembly process that first partitioned the Celera and PFP data into sets
localized to large chromosomal segments and then performed ab initio shotgun assembly on each
set.
(clone coverage)
Whole Genome Assembly (WGA)
Adaptation of the algorithm developed for Drosophila to a more complex genome such as
the human one.
The combined data set of 43.32 million reads, and all associated mate-pair information, were then
subjected to our whole-genome assembly algorithm to produce a reconstruction of the genome. Neither
the location of a BAC in the genome nor its assembly of bactigs was used in this process. Bactigs were
shredded into reads because we found strong evidence that 2.13% of them were misassembled.
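Shredding a bactig into synthetic reads can be pictured as tiling it with fixed-length windows. The sketch below is a hypothetical illustration, not Celera's actual shredder: it cuts 550-bp windows every 275 bp, so that interior bases are covered twice (in this simple version the two ends are covered only once).

```python
def shred(seq: str, read_len: int = 550, coverage: int = 2) -> list[str]:
    """Tile seq with overlapping windows of read_len, spaced read_len // coverage apart."""
    if len(seq) <= read_len:
        return [seq]
    step = read_len // coverage
    starts = list(range(0, len(seq) - read_len + 1, step))
    # Add a final window flush with the end if the regular tiling stops short.
    if starts[-1] + read_len < len(seq):
        starts.append(len(seq) - read_len)
    return [seq[s:s + read_len] for s in starts]


bactig = "acgt" * 550            # toy 2,200-bp sequence standing in for a bactig
faux_reads = shred(bactig)
print(len(faux_reads), "faux reads of", len(faux_reads[0]), "bp")
```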
The WGA algorithm consists of five main stages:
1- SCREENER
2- OVERLAPPER
3- UNITIGGER/DISCRIMINATOR
4- SCAFFOLDER
5- REPEAT RESOLVER
1- SCREENER:
Finds and marks all microsatellites with a repeat unit shorter than 6 bp, and finds all
known repeat elements (Alu, LINE and ribosomal DNA);
2- OVERLAPPER:
Compares all reads, looking for end-to-end overlaps of at least 40 bp
with no more than 6% differences in the match;
3- UNITIGGER/DISCRIMINATOR
Eliminates overlaps that are due to repeated regions
UNITIGGER: forms unitigs (UNIquely assembled conTIGS): contigs formed by
overlapping reads whose overlaps are not contested by any other read (true
overlaps, not false ones due, for example, to repeated regions);
Unfortunately, although empirically many of these assemblies are correct (and thus involve only true
overlaps), some are in fact collections of reads from several copies of a repetitive element that have
been overcollapsed into a single subassembly. However, the overcollapsed unitigs are easily identified
because their average coverage depth is too high to be consistent with the overall level of sequence
coverage.
DISCRIMINATOR: distinguishes between true overlaps and overlaps due to repeated
elements;
Screener
...finds and “masks” microsatellite repeats, known repeated
regions and ribosomal DNA,
– “masked” regions not used to make contigs,
– “marks” the rest for overlapping.
read:
atgacttacttactgcatatttatttatttatttatttatttatttatttatttat
ttatttatttatttatttatttgacgtgtacgtgtacgtgtagctgtacgtgtacg
tgacgggccgcattatcgtgatgctacgtgtacgttatatctgatcgtgcatgtga
atgacttacttactgcatatttatttatttatttatttatttatttatttatttat
masked: ttatttatttatttatttatttgacgtgtacgtgtacgtgtagctgtacgtgtacg
tgacgggccgcattatcgtgatgctacgtgtacgttatatctgatcgtgcatgtga
atgacttacttactgcatatttatttatttatttatttatttatttatttatttat
marked: ttatttatttatttatttatttgacgtgtacgtgtacgtgtagctgtacgtgtacg
tgacgggccgcattatcgtgatgctacgtgtacgttatatctgatcgtgcatgtga
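The masking step for microsatellites can be sketched with a regular expression. This is a minimal illustration, not the Screener itself: the repeat-unit limit (shorter than 6 bp) comes from the notes, while the minimum of four tandem copies and the use of "n" as the mask character are assumptions made for the example.

```python
import re


def mask_microsatellites(read: str, max_unit: int = 5, min_copies: int = 4) -> str:
    """Replace perfect tandem repeats (unit of 1..max_unit bp) with 'n' characters."""
    # Group 1 lazily captures the shortest candidate unit; the backreference
    # then demands at least (min_copies - 1) further copies right after it.
    pattern = re.compile(r"([acgt]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return pattern.sub(lambda m: "n" * len(m.group(0)), read.lower())


toy_read = "gatc" + "ttta" * 6 + "gacgtg"   # hypothetical read with a (ttta)6 repeat
print(mask_microsatellites(toy_read))
```

Masked positions keep their length, so read coordinates are unchanged and the unmasked flanks can still be used for overlapping.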
Overlapper
...looks for end-to-end overlaps of at least 40 bp
with no more than 6% differences in match,
<--tactgtacgtagctgtgatgttcctcggatatagcgggcatatttattacgctattgtacgtgt-3’
5’- gttcctcggatatagcgggcatatttattacgctattgtacgtgtaaagtatcgt-->
> 40 bp, < 6% mismatch
What’s the significance?
...a one in 10^17 event.
…given perfect randomness.
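The overlap criterion (at least 40 bp, no more than 6% differences) can be sketched as a direct suffix-prefix comparison. This is a simplified illustration under stated assumptions: only substitutions are counted (insertions and deletions are ignored), reads are assumed to be on the same strand, and the quadratic scan is for clarity rather than speed.

```python
def best_end_overlap(a: str, b: str, min_len: int = 40, max_diff: float = 0.06):
    """Longest suffix of a that matches a prefix of b within the mismatch budget.

    Returns (overlap_length, mismatches), or None if no overlap qualifies.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length, mismatches
    return None


# Toy reads sharing a 50-bp overlap (hypothetical sequences).
shared = "acgttgcaagtcctaggcataatcgttagccgatagtcaactggatcgta"
read_a = "gg" + shared
read_b = shared + "cc"
print(best_end_overlap(read_a, read_b))
```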
Unitigger
...differentiates between a true overlap and an overlap that includes more
than one locus.
...in a world where real data matches expected data, each locus would
have 8X coverage,
...if there are genomic repeats, then sequences would be
“over-represented” (over-collapsed): on average, 8 more
per repeat, per contig.
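The idea of spotting over-collapsed unitigs by their depth can be illustrated with a simple ratio test. This is not a reproduction of the DISCRIMINATOR score mentioned in these notes: the 8X expectation is taken from the slide, and the tolerance factor of 1.5 is an arbitrary value chosen for the example.

```python
def mean_depth(n_reads: int, mean_read_len: float, unitig_len: int) -> float:
    """Average read depth over a unitig."""
    return n_reads * mean_read_len / unitig_len


def looks_overcollapsed(depth: float, expected_depth: float = 8.0,
                        tolerance: float = 1.5) -> bool:
    """Flag a unitig whose depth clearly exceeds the genome-wide expectation."""
    return depth > tolerance * expected_depth


unique_depth = mean_depth(n_reads=16, mean_read_len=500, unitig_len=1_000)
repeat_depth = mean_depth(n_reads=32, mean_read_len=500, unitig_len=1_000)
print(unique_depth, looks_overcollapsed(unique_depth))
print(repeat_depth, looks_overcollapsed(repeat_depth))
```

A unitig built from two collapsed copies of a repeat shows roughly twice the expected depth, which is what the flag picks up.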
Discriminator
- DISCRIMINATOR: using sequence outside the
overlap region, it identifies and removes the reads
that belong to different contigs, generating the
U-unitigs
4- SCAFFOLDER:
Joins U-unitigs into scaffolds using the information from mate-pair reads
- information from the 2- and 10-kb libraries is used to pair U-unitigs;
at least 2 mate pairs are required for an accurate pairing
- then information from the 50-kb libraries and BAC ends is used to further join the
scaffolds
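The requirement of at least two mate pairs per link can be sketched as a counting step. The example below is a simplification under stated assumptions: it ignores read orientation and insert-size constraints, and the data structures (a list of mate pairs and a read-to-unitig map) are hypothetical.

```python
from collections import Counter


def confirmed_links(mate_pairs, read_to_unitig, min_pairs=2):
    """Return unitig pairs joined by at least min_pairs mate pairs."""
    counts = Counter()
    for left, right in mate_pairs:
        u, v = read_to_unitig.get(left), read_to_unitig.get(right)
        # Skip pairs with an unplaced read or with both reads in the same unitig.
        if u is None or v is None or u == v:
            continue
        counts[tuple(sorted((u, v)))] += 1
    return {pair for pair, n in counts.items() if n >= min_pairs}


mate_pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
read_to_unitig = {"r1": "U1", "r2": "U2", "r3": "U2", "r4": "U1",
                  "r5": "U1", "r6": "U3"}
print(confirmed_links(mate_pairs, read_to_unitig))
```

Here U1 and U2 are linked by two mate pairs and are kept, while the single pair joining U1 and U3 is not enough.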
5- REPEAT RESOLVER:
ROCKS substage: unitigs with a good but not definitive DISCRIMINATOR score can
fill gaps if supported by more than 2 mate pairs
STONES substage: unitigs with a good but not definitive DISCRIMINATOR score can
fill gaps if supported by at least 1 mate pair
GAP WALKING substage: the remaining gaps are covered using BAC-end data
Gaps are filled with the reads of mate pairs. With stones, it is enough for one read to fall in a contig for the other read
to be considered good for filling the gap.
For the assembly operations, the total compute infrastructure consisted of 10 four-processor SMPs with
4 gigabytes of memory per cluster and a 16-processor NUMA machine with 64 gigabytes of memory. The
total compute for a run of the assembler was roughly 20,000 CPU hours.
The assembly of Celera’s data, together with the shredded bactig data, produced a set of scaffolds
totaling 2.848 Gbp in span and consisting of 2.586 Gbp of sequence
WGA result
Compartmentalized Shotgun Assembly
(CSA)
In addition to the WGA approach, we pursued a
localized assembly approach that was intended to
subdivide the genome into segments, each of
which could be shotgun assembled individually.
We expected that this would help in resolution of
large interchromosomal duplications and improve
the statistics for calculating U-unitigs.
The compartmentalized assembly process
involved clustering Celera reads and bactigs into
large, multiple megabase regions of the genome
(components), and then running the WGA
assembler on the Celera data and shredded, faux
reads obtained from the bactig data to ensure an
independent ab initio assembly of the component.
By subsetting the data in this way, the overall
computational effort was reduced and the effect of
interchromosomal duplications was ameliorated.
27.27 million Celera reads were mapped onto the HGP bactigs
with MATCHER, then assembled together
with the BACs by the COMBINING ASSEMBLER
MATCHER
Of Celera’s 27.27 million reads, 20.76 million
matched a bactig and another 0.62 million reads,
which did not have any matches, were nonetheless
identified as belonging in the region of the bactig’s
BAC because their mate matched the bactig.
COMBINING ASSEMBLER
The quality of the partitioning into components was
crucial so that different genome regions were not
mixed together.
We constructed components from (i) the longest
scaffolds of the sequence from each BAC and (ii)
assembled scaffolds of data unique to Celera’s data
set. The BAC assemblies were obtained by a
combining assembler that used the bactigs and the
5X Celera data mapped to those bactigs as input.
The 5.89 million Celera fragments not matching
the GenBank data were assembled with the
whole-genome assembler. The assembly
resulted in a set of scaffolds totaling 442 Mbp in
span and consisting of 326 Mbp of sequence.
More than 20% of the scaffolds were >5 kbp
long, and these averaged 63% sequence and 37%
gaps with a total of 302 Mbp of sequence.
All scaffolds >5 kbp were forwarded along with
all scaffolds produced by the combining
assembler to the subsequent tiling phase.
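As a quick consistency check, the read counts quoted for MATCHER and for the Celera-unique assembly can be reconciled with a few lines of arithmetic. The variable names are illustrative; the numbers come from the text above.

```python
# Read counts in units of 10,000 reads, taken from the text above,
# so that the arithmetic stays in exact integers.
TOTAL_READS = 2727        # 27.27 million Celera reads
MATCHED_BACTIG = 2076     # 20.76 million matched a bactig
PLACED_BY_MATE = 62       # 0.62 million placed through their mate

unmatched = TOTAL_READS - MATCHED_BACTIG - PLACED_BY_MATE
print(f"unmatched reads: {unmatched / 100:.2f} million")

# Share of the Celera-unique scaffold span that is actual sequence.
SPAN_MBP, SEQUENCE_MBP = 442, 326
sequence_share = SEQUENCE_MBP / SPAN_MBP
print(f"sequence share of span: {sequence_share:.1%}")
```

The remainder matches the 5.89 million fragments that were passed to the whole-genome assembler.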
At this stage, they typically had one or two scaffolds
for every BAC region constituting at least 95% of the
relevant sequence, and a collection of disjoint
Celera-unique scaffolds.
The next step in developing the genome
components was to determine the order and
overlap tiling of these BAC and Celera-unique
scaffolds across the genome. For this, they used
Celera’s 50-kbp mate-pairs information, and BAC-end
pairs and sequence tagged site (STS) markers to
provide long-range guidance and chromosome
separation.
The result of this process was a collection of
“components,” where each component was a tiled
set of BAC and Celera-unique scaffolds that had
been curator-approved.
The process resulted in 3845 components with an
estimated span of 2.922 Gbp.
Finally, each component was assembled with the
WGA algorithm. As was done in the WGA process,
the bactig data were shredded into a synthetic 2X
shotgun data set in order to give the assembler the
freedom to independently assemble the data.
By using faux reads rather than bactigs, the assembly
algorithm could correct errors in the assembly of
bactigs and remove chimeric content in a PFP data
entry.
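The idea of shredding a bactig into synthetic ("faux") reads at about 2X can be sketched as overlapping windows taken at a fixed step. The read length of 550 bp and the function name are assumptions for this example; the slide does not state the length used.

```python
def shred(bactig: str, read_len: int = 550, coverage: float = 2.0) -> list[str]:
    """Cut a bactig into overlapping faux reads at roughly the given coverage.

    read_len is an assumed value; coverage=2.0 follows the 2X figure in the text.
    """
    if not bactig:
        return []
    if len(bactig) <= read_len:
        return [bactig]
    step = max(1, int(read_len / coverage))   # window offset giving ~coverage-fold depth
    last_start = len(bactig) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:              # make sure the bactig end is covered
        starts.append(last_start)
    return [bactig[s:s + read_len] for s in starts]


reads = shred("ACGT" * 500, read_len=500)
print(len(reads), len(reads[0]))
```

Because the faux reads carry no memory of how the bactig was put together, the assembler is free to join them differently, which is what allows misassembled or chimeric bactigs to be corrected.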
In effect, the previous steps in the CSA process
served only to bring together Celera fragments and
PFP data relevant to a large contiguous segment of
the genome, wherein we applied the assembler used
for WGA to produce an ab initio assembly of the
region.
WGA assembly of the components resulted in a set
of scaffolds totaling 2.906 Gbp in span and
consisting of 2.654 Gbp of sequence.
22% of reads were not incorporated into the
assembly. More than 90.0% of the genome was
covered by scaffolds spanning >100 kbp.
Comparison of the WGA and CSA scaffolds
2,218 WGA scaffolds were compared with 1,714 CSA scaffolds.
- Coverage consistency was evaluated
Result:
1.982 Gbp of the WGA is covered by the CSA (95%).
2.169 Gbp of the CSA is covered by the WGA (87.69%).

- Inconsistencies in scaffold order and orientation were also evaluated
Result:
2.1 Mbp (0.11%) of the WGA assembly is inconsistent with the CSA
295 kbp (0.012%) of the CSA assembly is inconsistent with the WGA

CSA can be considered better than WGA in terms of coverage and consistency
Mapping scaffolds to the genome
Main steps:
1) Grouping of the scaffolds according to their order in the CSA “components”
2) Mapping of the scaffold groups onto the chromosomes using physical maps
Maps used:
- STS maps [GeneMap99]
- BAC clone fingerprint map [WashU]
Two scaffold orderings were built, one from each map; comparing them
identifies different classes of scaffolds:
- Anchor scaffolds: the scaffold order is the same in both maps ----> 70% of the genome
- Ordered scaffolds: scaffolds left unmapped by WashU that were
ordered using GM99 -----> 13.9% of the genome
- Bounded scaffolds: scaffolds that can be placed but not ordered between anchors;
they were only assigned to the interval between the anchor scaffolds
Result: 98% of the genome falls in the three scaffold classes listed
Placing the scaffolds on the chromosomes
Problems were encountered, mainly because 978 BACs turned out to be chimeric, and because of
pseudogenes, genomic duplications, and so on: the assembly needs to be validated!
Mapping scaffolds to the genome
The final step in assembling the genome was to order and orient the scaffolds on the chromosomes.
First, scaffolds were grouped together on the basis of their order in the components from CSA. Next,
the scaffold groups were mapped onto the chromosomes using physical mapping data.
There were two genome-wide types of map information available: high-density STS maps and fingerprint maps of BAC
clones developed at Washington University. Among the genome-wide STS maps, GeneMap99 (GM99) has the most
markers and therefore was most useful for mapping scaffolds.
1. Where the order of scaffolds agreed between GM99 and the WashU BAC map, these scaffolds were
termed “anchor scaffolds.” 70.1% of the genome was in anchored scaffolds, more than 99% of
which are also oriented.
2. Scaffolds anchored with GM99 but not the WashU map (because of occasional WashU global ordering
discrepancies) were termed “ordered scaffolds.” 13.9% of the genome was in ordered scaffolds,
and thus 84.0% of the genome was ordered unambiguously.
3. All scaffolds that could be placed, but not ordered, between anchors were assigned to the interval
between the anchored scaffolds and were deemed to be “bounded” between them. These scaffolds
were termed “bounded scaffolds.”
Using the above approaches, > 98% of the genome was anchored, ordered, or bounded.
Finally, a location for each scaffold placed on the chromosome was assigned by spreading out the
scaffolds per chromosome.
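The three scaffold classes described above amount to a small decision rule, sketched here. The boolean inputs and the function name are hypothetical simplifications; the real pipeline compared full map orderings rather than flags.

```python
def classify_scaffold(ordered_by_gm99: bool,
                      agrees_with_washu: bool,
                      placeable_between_anchors: bool) -> str:
    """Assign a scaffold to one of the classes described in the text.

    The three flags are an assumed summary of the map comparison.
    """
    if ordered_by_gm99 and agrees_with_washu:
        return "anchor"
    if ordered_by_gm99:
        return "ordered"
    if placeable_between_anchors:
        return "bounded"
    return "unplaced"


# Genome fractions quoted in the text (percent).
ANCHOR_PCT, ORDERED_PCT = 70.1, 13.9
unambiguous_pct = round(ANCHOR_PCT + ORDERED_PCT, 1)
print(classify_scaffold(True, False, False), unambiguous_pct)
```

The sum of the anchor and ordered fractions reproduces the 84.0% of the genome reported as unambiguously ordered.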
Gene Prediction and Annotation
To enumerate the gene inventory, Celera developed an integrated, evidence-based approach named
Otto. The evidence used to increase the likelihood of identifying genes included:
• regions conserved between the mouse and human genomes
• similarity to ESTs or other mRNA-derived data
• similarity to other proteins.
1. Otto can promote observed evidence to a gene annotation in one of two ways. First, if the evidence
includes a high quality match to the sequence of a known gene (a RefSeq gene), then Otto can
promote this to a gene annotation. Second, Otto evaluates a broad spectrum of evidence and
determines if this evidence is adequate to support promotion to a gene annotation. Otto predicted
17,968 genes
2. Otto-predicted genes were complemented with a set of genes from three gene-prediction
programs that exhibited weaker, but still significant, evidence that they may be expressed.
Conservative criteria, requiring at least two lines of evidence, were used to define a set of 26,383
genes with good confidence.
OTTO
1. Initially, Otto searches the scaffold sequences against protein, EST, and genome-sequence databases
to define regions of sequence similarity in order to predict the gene boundaries
2. Known genes (RefSeq genes with exact matches of a full-length cDNA sequence to the genome) were
identified, and the region corresponding to the cDNA was annotated as a predicted transcript. A total
of 6538 genes were identified and transcripts predicted in this way.
3. Regions that have a substantial amount of sequence similarity, but do not match known genes,
were analyzed by Otto: conservation between mouse and human genomic DNA, similarity to human
transcripts (ESTs and cDNAs), similarity to rodent transcripts (ESTs and cDNAs), and similarity of the
translation of human genomic DNA to known proteins to predict potential genes in the human
genome.
4. Regions covered by homology evidence were analyzed by Genscan to see if a consistent
gene model could be generated. This procedure simplified the gene-prediction task by first
establishing the boundary for the gene and by eliminating regions with no supporting evidence. The
final Genscan predictions were often quite different from the prediction that Genscan returned on
the same region of native genomic sequence.
5. Finally, Otto compares each predicted transcript with the homology-based evidence that was used in
previous steps to evaluate the depth of evidence for each exon in the prediction. Otto predicted
11,226 additional genes by means of sequence similarity.
1a) Identification of HOMOLOGS OF KNOWN GENES
HIGH-QUALITY MATCH to the sequence of a known gene
If a RefSeq transcript matches the sequence under examination over at least 50% of its
length with >92% identity, the SIM4 alignment of the transcript to the genomic region
is recorded as an Otto annotation
A total of 6538 genes were identified and predicted in this way.
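The promotion rule for known genes (at least 50% of the transcript length, identity above 92%) can be written as a one-line predicate. The function and parameter names are assumptions for illustration; this is not Otto's actual code.

```python
def is_high_quality_refseq_match(aligned_len: int,
                                 transcript_len: int,
                                 identity: float,
                                 min_coverage: float = 0.50,
                                 min_identity: float = 0.92) -> bool:
    """Apply the thresholds quoted in the slide to one transcript alignment.

    identity is a fraction between 0 and 1; coverage must reach min_coverage
    and identity must be strictly greater than min_identity.
    """
    if transcript_len <= 0:
        raise ValueError("transcript_len must be positive")
    coverage = aligned_len / transcript_len
    return coverage >= min_coverage and identity > min_identity


print(is_high_quality_refseq_match(1000, 2000, 0.95))
```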
1b) Identification of homologous genes that are NOT KNOWN
Regions that have SUBSTANTIAL SEQUENCE SIMILARITY BUT DO NOT MATCH KNOWN
GENES were analyzed by the part of the Otto system that uses a broad SPECTRUM OF EVIDENCE to
predict a transcript
Evidence generated by the computational pipeline:
- Sequence similarity to other human transcripts (ESTs and cDNAs);
- Similarity to known proteins;
- Similarity to rodent transcripts;
- Conservation between human and mouse genomic DNA;
• Within the sequence of a gene bin, the regions
supported by any
homology evidence were
marked;
• Bases not covered by any homology
evidence were replaced with N nucleotides;
• The resulting segment is re-analyzed by
Genscan.
Otto predicted a further 11,226 genes by means of sequence similarity.
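The masking step before the Genscan re-analysis can be sketched as follows, assuming the unsupported bases are written as `N` and that evidence is given as 0-based half-open intervals. Both the interval format and the function name are assumptions for this example.

```python
def mask_unsupported(sequence: str,
                     evidence: list[tuple[int, int]],
                     mask_char: str = "N") -> str:
    """Keep bases inside any evidence interval, replace all others.

    evidence: 0-based half-open (start, end) intervals (assumed format).
    """
    keep = [False] * len(sequence)
    for start, end in evidence:
        for i in range(max(0, start), min(len(sequence), end)):
            keep[i] = True
    return "".join(base if k else mask_char for base, k in zip(sequence, keep))


print(mask_unsupported("ACGTACGTAC", [(2, 5), (4, 7)]))
```

The gene finder then sees real sequence only where some homology evidence exists, which is why its predictions can differ from those made on the native genomic sequence.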
De novo
Recognizing that the Otto system is quite conservative, a different gene-prediction strategy was used in
regions where the homology evidence was less strong.
This final class of predicted genes is a subset of the predictions made by the three gene-finding programs
that were used in the computational pipeline. For these, there was not sufficient sequence similarity
information for Otto to attempt to predict a gene structure.
The three de novo gene-finding programs resulted in about 76,410 predictions that were nonredundant
(non-overlapping with one another). Of these, 57,935 did not overlap known genes or predictions made
by Otto.
For these genes, Celera insisted that a predicted transcript have at least two of the following types of
evidence to be included in the gene set for further analysis:
1. protein
2. human EST
3. rodent EST
4. mouse genome fragment matches
Only 21,350 of the gene predictions that did not overlap Otto predictions were partially supported by at
least one type of sequence similarity evidence, and 8619 were partially supported by two types of
evidence (Table 8).
Gene prediction
Four kinds of evidence (conservation in 3X mouse genomic DNA, similarity to human EST or cDNA, similarity to rodent
EST or cDNA, and similarity to known proteins) were considered to support gene predictions from the different
methods
The sum of this number (21,350) and the number of Otto annotations (17,968) gives 39,318, which is comparable to the
upper limit for the number of genes in the human genome.
If the requirement for additional supporting evidence is made more stringent, this number drops sharply:
- requiring at least two types of evidence reduces it to about 26,000
- requiring at least three reduces it to about 23,000
In a further attempt to identify genes that were not found by the autoannotation
process or any of the de novo gene finders, we examined regions outside of gene
predictions that were similar to the EST sequence, and where the EST matched the
genomic sequence across a splice junction. After correcting for potential 3′ UTRs of
predicted genes, about 2500 such regions remained. Addition of a requirement for
at least one of the following evidence types—homology to mouse genomic sequence
fragments, rodent ESTs, or cDNAs—or similarity to a known protein reduced this
number to 1010.
Adding this to the numbers from the previous paragraph would give us estimates of
about 40,000, 27,000, and 24,000 potential genes in the human genome, depending
on the stringency of evidence considered.
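The three estimates follow from simple addition of the figures quoted above; the sketch below reproduces them. The constant names are illustrative, and the two- and three-evidence totals use the rounded values given in the slides.

```python
OTTO_GENES = 17_968            # Otto annotations
DE_NOVO_ONE_EVIDENCE = 21_350  # de novo predictions with at least one evidence type
EST_SPLICE_REGIONS = 1_010     # extra regions found through spliced EST matches

one_evidence_total = OTTO_GENES + DE_NOVO_ONE_EVIDENCE

# Totals before adding the EST-derived regions, by minimum number of evidence types.
base_totals = {1: one_evidence_total, 2: 26_000, 3: 23_000}

# Add the extra regions and round to the nearest thousand.
estimates = {k: round(v + EST_SPLICE_REGIONS, -3) for k, v in base_totals.items()}
print(one_evidence_total, estimates)
```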
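Ohno’s postulate
The mammalian genome is likened to a desert
with small oases of “genes”. (1985)
Desert: > 500 kbp without a gene
605 Mbp, 20% of the genome, is “desert”
Gene-rich chromosomes 17, 19, and 22: 12% of their 171 Mbp in deserts
Gene-poor chromosomes 4, 13, 18, and X: 27.5% of their 492 Mbp in deserts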
Ohno’s postulate
Ohno’s postulate: Mammalian genomes consist of oases of genes in otherwise essentially empty
deserts. (1985)
How valid is Ohno’s postulate? It appears that the human genome does indeed contain deserts, or large,
gene-poor regions.
If we define a desert as a region > 500 kbp without a gene, then we see that 605 Mbp, or about 20% of
the genome, is in deserts.
These are not uniformly distributed over the various chromosomes. Gene-rich chromosomes 17, 19, and
22 have only about 12% of their collective 171 Mbp in deserts, whereas gene-poor chromosomes 4, 13,
18, and X have 27.5% of their 492 Mbp in deserts (Table 11).
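The desert definition used here (a region > 500 kbp without a gene) translates directly into a scan over sorted gene coordinates. The sketch below assumes genes are given as 0-based half-open intervals on one chromosome; the function name and input format are illustrative.

```python
def desert_bases(genes: list[tuple[int, int]],
                 chrom_len: int,
                 min_desert: int = 500_000) -> int:
    """Total bases lying in gene-free stretches longer than min_desert.

    genes: (start, end) half-open intervals on one chromosome (assumed format).
    """
    total = 0
    cursor = 0  # end of the right-most gene seen so far
    for start, end in sorted(genes):
        if start - cursor > min_desert:
            total += start - cursor
        cursor = max(cursor, end)
    if chrom_len - cursor > min_desert:  # stretch after the last gene
        total += chrom_len - cursor
    return total


toy_genes = [(100_000, 150_000), (800_000, 900_000), (950_000, 1_000_000)]
print(desert_bases(toy_genes, chrom_len=2_000_000))
```

In this toy chromosome two stretches qualify (650 kbp and 1 Mbp), so 1.65 Mbp of the 2 Mbp counts as desert.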
SNPs
Genomic region class       Size of region examined (Mb)   SNP density in Celera-PFP (SNP/Mb)
Intergenic regions         2185                           707
Genes (introns + exons)    646                            917
Introns                    615                            921
  first intron             164                            808
Exons                      31                             529
  first exon               10                             592
About 3 million SNPs were identified
Most SNPs lie in non-coding regions
- The SNP rate is higher in introns than in intergenic regions
- SNPs in exons can cause silent or
missense changes
- Changes that significantly affect proteins are a very small
fraction of SNPs (< 1%)
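Multiplying each region size by its SNP density gives the approximate number of SNPs implied by the table for that class. The sketch below does only this arithmetic; the dictionary layout is an assumption, and the counts cover only the regions tabulated, so they need not add up to the overall total.

```python
# (size in Mb, SNP density in SNP/Mb) for the Celera-PFP comparison, from the table.
SNP_TABLE = {
    "intergenic": (2185, 707),
    "genes": (646, 917),
    "introns": (615, 921),
    "exons": (31, 529),
}


def implied_snp_counts(table: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Approximate SNP count per region class: size (Mb) times density (SNP/Mb)."""
    return {name: size * density for name, (size, density) in table.items()}


counts = implied_snp_counts(SNP_TABLE)
for name, n in counts.items():
    print(f"{name:12s} {n:>10,d}")
```

Although the density is higher in introns than in intergenic regions, the far larger intergenic span means most of the tabulated SNPs still fall outside genes.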
The 1000 Genomes Project
The goal of the 1000 Genomes Project is to find most genetic variants that have
frequencies of at least 1% in the populations studied. This goal can be attained by
sequencing many individuals lightly. The Project currently plans to sequence each
sample to about 4X coverage; at this depth sequencing cannot provide the complete
genotype of each sample, but should allow the detection of most variants with
frequencies as low as 1%. Combining the data from 2500 samples should allow highly
accurate estimation (imputation) of the variants and genotypes for each sample that
were not seen directly by the light sequencing.
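A rough, idealized model shows why light sequencing of many samples can still find 1% variants. It assumes Poisson-distributed read depth, heterozygous carriers, and no sequencing errors; none of these assumptions comes from the Project itself, and the function names are illustrative.

```python
import math


def expected_carrier_chromosomes(n_samples: int, allele_freq: float) -> float:
    """Expected number of chromosomes carrying the variant among diploid samples."""
    return 2 * n_samples * allele_freq


def p_variant_read_seen(depth: float, heterozygous: bool = True) -> float:
    """Probability that at least one read shows the variant in a single carrier.

    Assumes Poisson read depth; a heterozygous carrier contributes half the depth.
    """
    lam = depth / 2 if heterozygous else depth
    return 1 - math.exp(-lam)


carriers = expected_carrier_chromosomes(2500, 0.01)
p_single = p_variant_read_seen(4.0)
print(f"expected carrier chromosomes: {carriers:.0f}")
print(f"P(variant read seen in one 4X carrier): {p_single:.3f}")
```

Under these assumptions a single 4X sample misses the variant fairly often, but with dozens of carriers across 2500 samples the variant is very likely to be observed somewhere, which is what makes imputation of the unseen genotypes feasible.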
Disease researchers will use 1000 Genomes data in two ways:
• They will combine the 1000 Genomes data with the genotype data in their disease
GWA study to impute the genotypes in their samples for millions of additional
variants beyond those they genotyped directly. They will do this computationally,
with no genotyping cost. The additional genotype data will allow the researchers to
localize the disease-associated regions more precisely.
• They will compare the frequency of the variant with the frequency of the disease to
see if they are compatible.
The samples for the 1000 Genomes Project mostly are anonymous and have no
associated medical or phenotype data.
Variant Frequency in the population