
Next Generation
Sequencing
Giulio Pavesi
University of Milano
Next generation sequencing vs Sanger sequencing
http://en.wikipedia.org/wiki/DNA_sequencing
Next Generation Sequencing

Applications:

De novo genome sequencing
Genome resequencing for
variant identification
Metagenomics
Transcriptome sequencing and
quantification
Sequencing of DNA/RNA “samples”
(extracted according to different
criteria)
“Epigenetics”

Epigenetics (from the Greek επί, epì =
"over" and γεννετικός, gennetikòs =
"relating to family inheritance")
refers to those changes that
affect the phenotype without altering the
genotype; it is a branch of
genetics that describes all the
heritable modifications that change
gene expression without
altering the DNA sequence
What does DNA sequencing have to do
with something that does *not*
concern the DNA sequence?!
“Nucleosome”



The nucleosome core particle
consists of approximately 147 base
pairs of DNA wrapped in 1.67 left-handed
superhelical turns around a
histone octamer
Octamer: 2 copies each of the core
histones H2A, H2B, H3, and H4
Core particles are connected by
stretches of "linker DNA", which can
be up to about 80 bp long
The histone code





Example: H3K4me3
H3 is the histone
K4 is the residue that is modified
and its position (K = lysine, in
position 4 of the sequence)
me3 is the modification (three
methyl groups attached to K4)
If there is no number at the end, as in
H3K9ac, only one group is attached
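The naming rule above can be sketched as a small parser. This is only an illustration: the regular expression, the `HistoneMark` type, and the `parse_mark` function are hypothetical names introduced here, and the pattern covers only names of the form shown on the slide (histone, residue letter, position, modification, optional count).

```python
import re
from typing import NamedTuple


class HistoneMark(NamedTuple):
    histone: str       # e.g. "H3" or "H2B"
    residue: str       # one-letter amino acid code, e.g. "K" for lysine
    position: int      # position of the residue in the histone sequence
    modification: str  # e.g. "me" (methyl) or "ac" (acetyl)
    count: int         # number of groups attached; 1 when no number is given


# Hypothetical pattern for names such as H3K4me3, H3K9ac, H2BK5me1.
_MARK = re.compile(r"^(H\d[A-Z]?)([A-Z])(\d+)([a-z]+)(\d*)$")


def parse_mark(name: str) -> HistoneMark:
    """Split a histone modification name into its components."""
    match = _MARK.match(name)
    if match is None:
        raise ValueError(f"unrecognised histone mark name: {name!r}")
    histone, residue, position, modification, count = match.groups()
    return HistoneMark(histone, residue, int(position), modification,
                       int(count) if count else 1)
```

For instance, `parse_mark("H3K9ac")` yields a count of 1, following the "no number means one group" convention stated above.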
Different chromatin states
Chromatin structure (and thus, gene expression) depends
also on the post-translational modifications associated
with the histones forming nucleosomes
“ChIP”




If we have the “right”
antibody, we can extract
(“immunoprecipitate”)
from living cells the
protein of interest bound
to the DNA
And - we can try to
identify which were the
DNA regions bound by
the protein
Can be done for
transcription factors
But can be done also for
histones - and separately
for each modification
ChIPSeq
TF ChIP
Histone ChIP
Many cells: many copies
of the same
region bound
by the protein
After ChIP
Size selection: only
fragments of the
“right size” (200 bp)
are kept
Identification of the
DNA fragment bound
by the protein
Sequencing
So - if we found
that a region has
been sequenced
many times, then
we can suppose
that it was bound
by the protein, but…
Platform           Sequencing      Amplification         Mb/run   Time/run  Read length  Cost per run  Cost per Mb
Roche (454)        Pyrosequencing  Emulsion PCR          100 Mb   7 h       250 bp       $8439         $84.39
Solexa - Illumina  By-synthesis    Bridge amplification  1300 Mb  4 days    32–40 bp     $8950         $5.97
ABI SOLiD          Ligation-based  Emulsion PCR          3000 Mb  5 days    35 bp        $17,447       $5.81
Only a short fragment of the extracted DNA region can
be sequenced, at either or both ends
(“single” vs “paired end” sequencing),
for no more than
35 (before) / 50 (yesterday) / 100 (now) bp
Thus, the original regions have to be “reconstructed”
Read Mapping



Each sequence read has to be assigned to
its original position in the genome
A typical ChIP-Seq experiment produces
from 6 (before) to 100 million (now) reads
of 50-70 and more base pairs for each
sequencing “lane” (Solexa/Illumina)
Efficient “sequence mappers” exist
for aligning NGS reads to the genome
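To make the idea of read mapping concrete, here is a deliberately naive sketch that slides a read along a toy genome string and reports every position matching with at most a given number of mismatches, on either strand. It is not how real mappers work (they rely on indexes to handle millions of reads), and the function names and the `max_mismatches` parameter are assumptions made for this example.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def map_read(read: str, genome: str,
             max_mismatches: int = 2) -> List[Tuple[int, str, int]]:
    """Return (position, strand, mismatches) for every hit.

    Positions are 0-based on the forward strand; strand is "F" when the
    read itself matches and "R" when its reverse complement matches.
    """
    hits = []
    for strand, query in (("F", read), ("R", reverse_complement(read))):
        for pos in range(len(genome) - len(query) + 1):
            mismatches = hamming(query, genome[pos:pos + len(query)])
            if mismatches <= max_mismatches:
                hits.append((pos, strand, mismatches))
    return hits


toy_genome = "ACGTTGCATGCCGATAGGCTTAACG"
exact_hits = map_read("GCCGATAG", toy_genome, max_mismatches=0)
```

A read with exactly one hit would be "uniquely mapped"; a read with several hits (for example one falling in a repeat) cannot be assigned to a single position.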
Read Mapping “Typical” Output

ID                             Sequence                             Code  #0mm  #1mm  #2mm  CHR    HIT POS    STR  MM
>HWI-EAS413_4:1:100:825:1989   CTAGAAGCAGAAGCAGGTATTTGGGGGGAGGGTTG  R0    3     0     0
>HWI-EAS413_4:1:100:1076:1671  AACTGCTTTGAGATAGGGTCTCTCTTGTTCACTTT  NM    0     0     0
>HWI-EAS413_4:1:100:573:1957   TCGAGACGTAAACTAGCTAACCTACATTATCCCCT  NM    0     0     0
>HWI-EAS413_4:1:100:1784:660   AATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  R0    204   255   255
>HWI-EAS413_4:1:100:133:987    CGCGATGATGTCTCAATACACCCCCCCGCTACCAG  NM    0     0     0
>HWI-EAS413_4:1:100:1361:1636  CATGTCATGCGCTCTAATCTCTGGGCATCTTGAGA  NM    0     0     0
>HWI-EAS413_4:1:100:1733:932   CCGAACTTCTGACAGGTTTGAGCCTTCTGCTCAAG  U1    0     1     0     chr9   110761807  F    13A
>HWI-EAS413_4:1:100:992:1902   CAATTAAATAATAATAAACTAACACACAATACAAA  NM    0     0     0
>HWI-EAS413_4:1:100:1230:1718  TCAGCAAACAAACCCCCAACATAAAATCCATTATG  NM    0     0     0
>HWI-EAS413_4:1:100:324:130    TCATCGAGAGGGGACTGAAGTGGAAGCTAGTCAGC  U0    1     0     0     chr14  33191761   F

@12_10_2007_SequencingRun_3_1_119_647 (actual sequence)
TTTGAATATATTGAGAAAATATGACCATTTTT
+12_10_2007_SequencingRun_3_1_119_647 (“quality” scores)
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 27 40 40 4 27 40
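The four-line record shown above (name, sequence, name again, one quality score per base) can be read with a few lines of code. The sketch below is illustrative only: `parse_record` and `error_probability` are hypothetical helper names, the slide's parenthetical annotations are left out of the header lines, and the conversion from score to error probability assumes the Phred convention, which the slides do not state.

```python
from typing import List, Tuple


def parse_record(lines: List[str]) -> Tuple[str, str, List[int]]:
    """Parse one four-line record with space-separated numeric qualities."""
    header, sequence, plus, quals = (line.strip() for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a four-line FASTQ-style record")
    qualities = [int(q) for q in quals.split()]
    if len(qualities) != len(sequence):
        raise ValueError("one quality score is expected per base")
    return header[1:], sequence, qualities


def error_probability(q: int) -> float:
    """Assumed Phred convention: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


record = [
    "@12_10_2007_SequencingRun_3_1_119_647",
    "TTTGAATATATTGAGAAAATATGACCATTTTT",
    "+12_10_2007_SequencingRun_3_1_119_647",
    "40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 "
    "40 40 40 40 40 39 27 40 40 4 27 40",
]
name, seq, quals = parse_record(record)

# Positions (0-based), bases and scores with quality below 20.
low_quality = [(i, seq[i], q) for i, q in enumerate(quals) if q < 20]
```

In this record only one base falls below the threshold of 20; a mapper may treat such a base as unreliable when counting mismatches.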
“Peak finding”



The critical part of any ChIP-Seq analysis is the
identification of the genomic regions that produced
a significantly high number of sequence reads,
corresponding to the regions where the protein
(nucleosome) of interest was bound to DNA
Since a graphical visualization of the “pile” of reads
mapping on the genome shows a “peak” at
these regions, the problem is
often referred to as “peak finding”
A “peak” then marks a region that was enriched in
the original DNA sample
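The "piling" of reads can be sketched as a per-base coverage count, with a peak being a stretch where the pile reaches a chosen height. This is a minimal illustration, not a real peak caller: it assumes forward-strand reads extended to a fixed fragment length, a single region indexed from 0, and a fixed height threshold with no statistical test. The names `coverage`, `call_peaks`, `fragment_len` and `min_height` are introduced here for the example.

```python
from typing import List, Tuple


def coverage(starts: List[int], fragment_len: int,
             region_len: int) -> List[int]:
    """Per-base number of fragments covering each position.

    Uses a difference array: +1 where a fragment starts, -1 where it ends,
    followed by a running sum.
    """
    diff = [0] * (region_len + 1)
    for s in starts:
        if 0 <= s < region_len:
            diff[s] += 1
            diff[min(s + fragment_len, region_len)] -= 1
    cov, running = [], 0
    for d in diff[:region_len]:
        running += d
        cov.append(running)
    return cov


def call_peaks(cov: List[int], min_height: int) -> List[Tuple[int, int, int]]:
    """Return (start, end, max_height) for each maximal run of positions
    with coverage >= min_height (end is exclusive; min_height >= 1)."""
    peaks, start = [], None
    for i, c in enumerate(cov + [0]):  # trailing 0 closes a run at the end
        if c >= min_height and start is None:
            start = i
        elif c < min_height and start is not None:
            peaks.append((start, i, max(cov[start:i])))
            start = None
    return peaks


cov = coverage([10, 12, 15, 60], fragment_len=10, region_len=100)
peaks = call_peaks(cov, min_height=2)
```

The three overlapping fragments form one peak whose height and width answer the "how tall, how wide" questions; the isolated fragment at position 60 stays below the threshold.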
“Peak finding”
Peaks:
How tall?
How wide?
How much
enriched?
“Peak finding”

The main issue: the DNA sample
sequenced (apart from sequencing
errors/artifacts) contains a lot of “noise”





Sample “contamination” - the DNA of the PhD
student performing the experiment
DNA shearing is not uniform: open chromatin
regions tend to be fragmented more easily and
thus are more likely to be sequenced
Repetitive sequences might be artificially
enriched due to inaccuracies in genome
assembly
Amplification pushed too much: you see a single
DNA fragment amplified, not enriched
As yet unknown problems that nonetheless seem to
produce “noisy” sequencing runs and screw the
experiment up
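One of the noise sources listed above, over-amplification, is commonly handled by collapsing reads that map to exactly the same position and strand, since many identical copies are more likely to come from amplification of one fragment than from enrichment. The sketch below shows that idea under simple assumptions: reads are represented as `(chromosome, start, strand)` tuples, and `collapse_duplicates` and `max_copies` are names chosen for this example.

```python
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, int, str]  # (chromosome, start position, strand)


def collapse_duplicates(reads: Iterable[Read],
                        max_copies: int = 1) -> List[Read]:
    """Keep at most max_copies reads per (chromosome, start, strand)."""
    kept: List[Read] = []
    seen: Dict[Read, int] = {}
    for read in reads:
        copies = seen.get(read, 0)
        if copies < max_copies:
            kept.append(read)
        seen[read] = copies + 1
    return kept


mapped = [("chr9", 100, "F")] * 5 + [("chr9", 100, "R"), ("chr14", 100, "F")]
unique_reads = collapse_duplicates(mapped)
```

Here the five identical reads on chr9 are reduced to one, while reads on the other strand or on another chromosome are kept.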
ChIP-Seq histone data


Histone modifications tend to be located at
preferred locations with respect to gene
annotations/transcribed regions
Hence, enrichment can be assessed in two
ways


Enrichment with respect to a control
experiment and peak identification
“Local” enrichment in given regions with respect
to gene annotations




Promoters (active/non active)
Upstream of transcribed/non transcribed genes
Within transcribed/not transcribed regions
Enhancers, whatever else
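Enrichment with respect to a control can be sketched as a ratio of read densities in the same region, each scaled by the total number of reads in its experiment. This is only one simple way to express it, not a method prescribed by the slides: the function names, the half-open region convention and the pseudocount are assumptions made here.

```python
from bisect import bisect_left
from typing import List


def count_in_region(sorted_positions: List[int], start: int, end: int) -> int:
    """Number of read positions p with start <= p < end (input must be sorted)."""
    return bisect_left(sorted_positions, end) - bisect_left(sorted_positions, start)


def fold_enrichment(chip_count: int, control_count: int,
                    chip_total: int, control_total: int,
                    pseudocount: float = 1.0) -> float:
    """Ratio of library-size-scaled read counts, ChIP over control.

    The pseudocount (an assumption of this sketch) avoids division by
    zero when the control has no reads in the region.
    """
    chip_rate = (chip_count + pseudocount) / chip_total
    control_rate = (control_count + pseudocount) / control_total
    return chip_rate / control_rate


chip_positions = [5, 10, 10, 20, 30]
n_chip = count_in_region(chip_positions, 10, 30)
```

The same two functions can be applied to annotation-based regions (promoters, transcribed regions, and so on) instead of peaks, which corresponds to the "local" enrichment described above.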
Experiment

Perform a ChIP-Seq for several
histone modifications, starting from
the most “classic” ones
Check:

Whether each modification has its own
“preferential” localization on the
genome or relative to genes (e.g. in the
promoter, in the transcribed region, etc.)
Whether each modification is “correlated” in
some way with
gene transcription/expression
Genome-wide histone
modification maps through
ChIP-Seq

Barski et al., Cell 129, 823-837, 2007
20 histone lysine and arginine methylations in CD4+ T
cells









H3K27
H3K9
H3K36
H3K79
H3R2
H4K20
H4R3
H2BK5
Plus:



Pol II binding
H2A.Z (replaces H2A in some nucleosomes)
insulator-binding protein (CTCF)
Experiment

ChIP-Seq associated with a particular modification
(e.g. H3K4me3)
Question: can the modification be “correlated” with
gene transcription?
That is, does the modification “mark” particular
nucleosomes relative to the transcription start, or to
the transcribed region?
Example: there could be modifications that:
- Mark the transcription start
- Mark all and only the transcribed region
- “Silence” particular gene loci, preventing
transcription
- Have nothing to do with transcription itself
and are located elsewhere
Experiment

Sequences obtained by ChIP-Seq for the
modification under study
Input: genomic coordinates of the positions where
each of the sequences maps (see example
file)
Input: genomic coordinates of the annotated
RefSeq genes
A nucleosome marked by the modification
should correspond to a “pile” of
overlapping reads (a “peak”)
We count, nucleosome by
nucleosome, how tall the “pile” is,
that is, how many reads can be associated with the
nucleosome
Example: if the modification were found in the nucleosome upstream
of the TSS of transcribed genes, we would find a “pile” like this
Modification
Nucleosome
Example: if the modification were found in the nucleosomes associated
with transcribed regions, we would find “piles” like this
Modification
Nucleosome
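Counting the height of the pile nucleosome by nucleosome, relative to the TSS, can be sketched by binning mapped read positions around each annotated TSS. The code below is a rough illustration under several assumptions: bins of 200 bp stand in for "one nucleosome plus linker", genes are given as `(chromosome, TSS, strand)` tuples, and every read is compared with every gene, which is fine for a toy example but far too slow for real data. The name `tss_profile` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Gene = Tuple[str, int, str]  # (chromosome, TSS position, strand "+" or "-")


def tss_profile(reads: Dict[str, List[int]], genes: Iterable[Gene],
                bin_size: int = 200, n_bins: int = 5) -> Dict[int, int]:
    """Count reads in bins around each TSS, summed over all genes.

    Bin 0 starts at the TSS and extends downstream; negative bins are
    upstream. Offsets follow the direction of transcription, so they are
    flipped for genes on the "-" strand.
    """
    profile = Counter({b: 0 for b in range(-n_bins, n_bins)})
    for chrom, tss, strand in genes:
        for pos in reads.get(chrom, []):
            offset = pos - tss if strand == "+" else tss - pos
            b = offset // bin_size  # floor division keeps upstream bins negative
            if -n_bins <= b < n_bins:
                profile[b] += 1
    return dict(profile)


toy_reads = {"chr1": [950, 1010, 1020, 1450]}
plus_profile = tss_profile(toy_reads, [("chr1", 1000, "+")], n_bins=2)
minus_profile = tss_profile(toy_reads, [("chr1", 1000, "-")], n_bins=2)
```

A modification marking the nucleosome just upstream of the TSS would show up as a high count in bin -1, while one marking the transcribed region would fill the non-negative bins.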
“Transcription starts”

Laboratory techniques such as “CAGE”
(Cap-Analysis-Gene-Expression)
allow:

Exact mapping of the 5’ ends of RNAs on the
genome, that is, locating the exact TSSs
Quantifying the level of transcript produced
from each of the identified TSSs
Since we are looking for the precise
localization of histone modifications
relative to the TSSs, it is important to locate
the TSSs precisely as well