PhD-IBMG-II-Matinée
Genetic information
Genomes, genes, comparative organization
Functions, families, mutations, evolution
DNA mutation and repair
NGS
Alternative splicing

The human genome: 3,200 Mb, 23 chromosomes (x2)

The Human Genome Project
Animated tutorials on the Human Genome Project: http://www.genome.gov/Pages/EducationKit/ (free downloads or on-line view)
1. DNA sequence. NCBI site: Genomes, Eukaryotic (Mammals), Genomic Data

1990-2003 Human Genome Project
2001 The HGP consortium and Celera release the first draft (95%) of the human genome
2003 Sequencing is completed
2001-today Several other genomes sequenced; NGS technology
2007 The ENCODE project releases results on 1% of the human genome
2012 The ENCODE project publishes complete results

Table 1-1 (part 2 of 2), Molecular Biology of the Cell, Fifth Edition (© Garland Science 2008)

Eukaryotes and Prokaryotes: different genome organization, different gene structure
The rationale for genome organization in Prokaryotes and Eukaryotes is fundamentally divergent.
Prokaryotes: essentials, reproductive speed, high mutation rate
Eukaryotes: diversification, adaptation, reassortment

Let’s first see some common traits:
• The overall «order» of transcription units (see below for definition) does not follow a simple logic.
• Transcription Units are in both orientations on the chromosomes.
• Noncoding parts are present that are either functional to chromosome dynamics (e.g. origins of replication) or mobile genetic elements (e.g. viruses, transposons).

A Transcription Unit (T.U. or TU) is a stretch of DNA that is found transcribed into RNA in at least some circumstances. A «promoter» or «promoter-like» sequence always flanks a TU on its 5’ side.
Why do we say «TU» and not «gene»? A TU is a physical entity, experimentally proven.
The «gene» is a concept.
Definitions of «gene»:
• One gene, one character
• One gene, one protein
• One gene, some protein isoforms (after alternative splicing)
• One gene, one molecular function
• One gene, one functional module (Molecular Biology)
• One gene, one Transcriptional Unit (Genomics)

Protein-coding genes
Even though proteins that perform similar functions are extraordinarily conserved, genes may be remarkably different in genomic organization.
Essentials: promoter, ORF, terminator

Prokaryotic genes essentially follow the operon model.
Operon diagram labels: promoter (positioning site for RNA polymerase); operator (regulatory element); leader or 5’-UTR; ATG (1st codon); ORF1, ORF2 ... ORFn (Coding A, Coding B ... Coding N), separated by spacers (0 to a few bp); stop codon; tail or 3’-UTR; transcriptional termination. The part between the leader and the terminator is copied into RNA (transcription).
The transcribed RNA is «polycistronic» (i.e. it contains more than one string of information).

A well-known example in E. coli: the lactose operon
LacZ: β-galactosidase, cleaves lactose to galactose + glucose
LacY: lactose permease, pumps in lactose against electrochemical gradients
LacA: thiogalactoside transacetylase
LacI: lac repressor

Eukaryotes
Protein-coding genes are often discontinuous: coding sequences (exons) are interrupted by noncoding sections (introns).
Transcribed RNAs can be monocistronic or polycistronic.
For large eukaryotic genomes, the presence of introns is a complication. In this case we should consider separately two distinct entities:
1. The genomic sequence (i.e. the Transcription Unit)
2.
The RNA produced after processing of the primary RNA transcript (mRNA in the case of protein-coding genes)

Eukaryotes: monocistronic
Gene diagram labels: regulatory region and promoter; 5’; Exon 1, Intron 1, Exon 2, Intron 2, Exon 3 ... Intron N, Exon N; poly(A) signal; 3’-UTR (regulatory). The whole region from Exon 1 to Exon N is copied into the primary RNA; only the exons are the parts remaining in the mRNA.
The splicing process joins the different parts that will be found in the messenger RNA.

Comparative: Human, Yeast, Drosophila, Maize. A 50 kb tract of the human genome (scale marks at 25K and 50K).

H. sapiens: composition of the human genome (3200 Mb)
• Genes and gene-related sequences: 1200 Mb
  • Coding: 48 Mb
  • Related sequences: 1152 Mb (pseudogenes, gene fragments, introns, UTRs)
• Intergenic DNA: 2000 Mb
  • Interspersed repeats: 1400 Mb (LINE 640 Mb, SINE 420 Mb, LTR elements 250 Mb, DNA transposons 90 Mb)
  • Other intergenic regions: 600 Mb (microsatellites 90 Mb, various 510 Mb)

Interspersed repetitive elements: mobile genetic elements
Pseudogenes: a second class of pseudogenes are gene copies inactivated by multiple mutations, or generated by retrotranscription-insertion.

[Table residue: size ranges for genes, protein-coding (mRNA) and noncoding (ncRNA). Ranges listed, column assignment not recoverable: 1 to hundreds; 0 to hundreds; 30 to some kb; 300 to some Mb; 100 to some kb; 2 kb to 100 Mb (approx.). The values shown in the table are averages.]

While genes vary enormously in size from bacteria to mammals, due to the prevalence of introns, coding regions (ORFs) are quite uniform, possibly due to protein structural constraints.
Mean size of predicted ORF products in completely sequenced organisms. Average a.a.
128 Da; in peptides: 110 Da

Functional catalogue: gene functions
Protein-coding genes are organized in gene families.
Bacillus subtilis: Figure 1-24, Molecular Biology of the Cell, Fifth Edition (© Garland Science 2008)
Duplications by unequal recombination; translocations; exon duplications
Figure 1-23, Figure 1-25a and Figure 1-25c, Molecular Biology of the Cell, Fifth Edition (© Garland Science 2008)

Similarities among proteins have revealed an additional level of organization: the domain.
A domain is a substructure produced by any part of the polypeptide that can fold into a stable conformation independently of the rest of the protein.
The concept of domain is very important in genomics, because protein domains are often encoded by single exons, which supports the “exon shuffling” theory of protein evolution.

Mutation & Repair
M-Phase: review from Cell Biology

The questions we address today:
How is the possibility of error during DNA replication dealt with?
How is chemical or physical damage to the DNA of stationary cells resolved?

DNA mutation and repair
Keeping one's genetic heritage unaltered, and passing it unaltered to the progeny, is an absolute necessity for living organisms.
Any irregularity during DNA replication, or any chemical or physical damage to DNA in the non-replicative phase, is a potential source of mutation.
All organisms keep their DNA as unaltered as possible by means of complex and costly repair and maintenance mechanisms.

[Figure: a base change gives mutated DNA; slippage during replication.]

Outcome of a mutation
In haploids (e.g. E.
coli):
• if tolerated, it is fixed in the progeny
• if not tolerated, the cell dies
In multicellular eukaryotes (e.g. humans):
• In somatic cells: if tolerated, a) nothing happens, or b) the cell transforms (cancer); if not tolerated, the cell dies
• In germ cells: if tolerated, it is fixed in the progeny; if not tolerated, a) the cell dies, or b) development is not possible

Examples: rearrangements or recombinations (large mutations)
Transposition = change in the order of two or more sequences
Crossing-over = exchange of fragments between DNA molecules
Deletion = loss of a fragment of DNA
Duplication = replication of a fragment of DNA
Amplification = repeated replication of a DNA fragment
Insertion = addition of a fragment of DNA within a sequence

The “weight” of a mutation depends on the context in which it occurs.
Within a coding sequence:
• The genetic code is redundant.
• Deletions & insertions (indels) of 3 nt (= 1 a.a.): the protein loses (or gains) one amino acid, but the rest is unchanged.
• Indels that are not a multiple of 3 nt (reading frame changed): the protein changes completely from the point of the mutation onwards.

Frequency of changes
The rate at which mutations become fixed depends on how strongly the function of each particular stretch of DNA (e.g. a gene) depends on its nucleotide sequence.
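The three cases above (a synonymous change, an in-frame 3-nt deletion, a 1-nt frameshift deletion) can be checked on a toy sequence. This is a minimal sketch: the coding sequence `cds` is made up for illustration, and `translate` is a hypothetical helper that stops at the first stop codon; only the standard genetic code table is real.

```python
# Standard genetic code, built from the canonical TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Hypothetical toy coding sequence: ATG GCT GAA AAG CTG TTC TAA
cds = "ATGGCTGAAAAGCTGTTCTAA"

wild_type = translate(cds)                      # MAEKLF
synonymous = translate(cds[:5] + "C" + cds[6:]) # GCT -> GCC, still Ala (redundant code)
in_frame = translate(cds[:6] + cds[9:])         # 3-nt deletion: one residue lost
frameshift = translate(cds[:6] + cds[7:])       # 1-nt deletion: frame changed

print(wild_type, synonymous, in_frame, frameshift)
```

The in-frame deletion removes a single residue and leaves the downstream sequence intact, whereas the 1-nt deletion changes every codon after the mutation and, in this toy case, also loses the stop codon.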
Faithful maintenance of one's own DNA sequence is essential for every form of life.

Replication fidelity (error frequency per nucleotide)
Replication without proofreading (3’-5’ exonuclease): 1 / 10^4 - 10^5
Replication with proofreading: 1 / 10^7 - 10^8
Repair of replication errors (mismatch repair): 1 / 10^9
Repair of accidental damage in the non-replicative phase (various mechanisms): keeps the mutation frequency at < 1 / 10^9

1st category: mutations arising during replication, from replisome errors that are not immediately repaired by the DNA polymerase itself through its proofreading mechanism (3’-5’ exonuclease).
Errors due to misincorporation: proofreading activity (3’-5’ exonuclease) of the DNA polymerase (see: Replication).
1st category: residual replication errors (1 / 10^7 - 10^8). These are repaired by the Mismatch Repair mechanism (repair of incorrect base pairing).
E. coli: MutS, MutL, MutH enzyme system. Humans: MSH (MutS Homologs), MLH, PMS enzymes.
[Figure labels: mismatch T:G; MutS; MutL; MutH; ATP; DNA Pol. I; DNA ligase. DNA ligase seals the last nick.]

Distinguishing the “old” strand from the newly synthesized one in E. coli:
E. coli methylates all 5’-GATC-3’ sequences (Dam methylase).
Newly synthesized DNA remains unmethylated for a while.
MutH cuts the unmethylated strand.

Eukaryotes: very similar systems exist.
Humans: genetic predisposition to colon cancer is due to mutations in one of the genes encoding proteins of this system (MSH2).
The Dam methylase system, however, is limited to E. coli. It is not yet known exactly how, in higher organisms, the system recognizes the newly synthesized strand.

2nd category: post-replicative mutations
1. Spontaneous alterations
2.
Alterations caused by mutagens
Post-replicative DNA repair is active in all cells, including terminally differentiated cells such as myocytes or neurons that will never replicate.

Spontaneous alterations
Hydrolysis of the glycosidic bond and loss of the base (leaving the deoxyribose): purine loss (most often G) is massive. In one day, about 10,000 purines per cell are lost from a mammalian genome.
In higher organisms, many Cs are methylated: a methylated C, if deaminated, gives a T. To avoid the accumulation of mutations, a repair system exists that, in the presence of T:G mispairs, selectively removes the T and restores the normal situation.
Base deamination is spontaneous: about 100 bases / day / cell. Some chemical compounds increase the rate of deamination, such as nitrous acid (deaminates A, C and G) or sodium bisulfite (deaminates C).
• Adenine deamination: hypoxanthine (pairs with C rather than T)
• Cytosine deamination: uracil (pairs with A rather than G)
• Guanine deamination: xanthine (blocks replication)

Mutagens: UV radiation, base analogues, intercalating agents

Repair
Some modifications can be corrected in situ by dedicated enzymes.
Repair of thymine dimers by photolyase (E. coli; not in humans).

Base excision repair (BER)
Most small base modifications, such as deaminations, alkylations etc., are recognized by a battery of dedicated enzymes, the DNA glycosylases, that continuously scan the DNA, find the alterations and remove the damaged bases by cutting the glycosidic bond.
After the cut, the site is recognized by special endonucleases, called AP-endonucleases, that remove the sugar from the DNA backbone. DNA polymerase and DNA ligase then repair the “gap”.

Some physical agents or mutagenic compounds induce larger changes, which cause more severe distortions in the structure of the double helix.
(Alkylations, thymine dimers, intercalating agents.) These are repaired by the NER mechanism (nucleotide excision repair).

NER: nucleotide excision repair
E. coli: UvrA, UvrB, UvrC, UvrD
H. sapiens: XPC, XPA, XPD, XPF (with ERCC1), XPG (where “XP” stands for xeroderma pigmentosum)

Point mutations and rearrangements
1) During DNA replication: 1. Proofreading; 2. Mismatch repair
2) Independent of DNA replication (spontaneous, or from external agents): 1. Base excision repair; 2. Nucleotide excision repair; 3. Direct reversal

If the DNA lesion is not repaired:
• “Translesion” DNA polymerases (Y family)
• Generation of double strand breaks (DSB)
The figure shows the E. coli proteins.

What if the lesion affects both strands? Recombinational repair.
Homologous recombination: concept of homology vs. concept of identity
General recombination, or homologous recombination
In the case of damage to both strands (DSB), the repair mechanism works, schematically, as if each of the two damaged strands retrieved its “template” from the homologous chromosome.
Model of DSB repair by homologous recombination: strand ligation; 2 Holliday junctions. Resolution of the junctions leads to: [figure]
Which enzymes carry out recombinational repair?

Homologous recombination in general:
• Meiotic crossing-over (only in meiosis). Cell division: http://www.cellsalive.com/mitosis.htm
• Experimental transfection of cells with recombinant DNA
Site-specific recombination: in the recombination of mobile elements.
The main feature of homologous recombination is that the two recombining DNA molecules must be “homologous”, i.e. their sequences must share a high degree of identity, to allow strand exchange and branch migration over long tracts.
Transpositions, retrotranspositions, deletions and insertions depend on a type of recombination different from homologous recombination:
• heterologous recombination
• site-specific recombination

Site-specific recombination: CSSR (conservative site-specific recombination)
[Figure labels: mobile element; short identical sequences; recipient DNA; mobile element “integrated” in the recipient DNA.]
This type of recombination carries out different events: insertion, deletion, inversion.
“Recombinases” are enzymes that recognize short sequences found, as inverted repeats, on either side of the insertion point.
The integration of phage lambda into the E. coli chromosome is a case of site-specific recombination. Lysogen: if the phage follows the lysogenic pathway, the phage DNA is integrated into the E. coli chromosome (site-specific recombination).

The Human Genome Project: hierarchical shotgun
The sequencing phase
Fragments are cloned into appropriate vectors.
Individual recombinants (bacterial clones) are grown, purified and sequenced (Sanger).
Classical cloning: vector-host system in E. coli

Dideoxy sequencing: the “chain-terminator” method of Sanger
The most successful method; it is based on the synthesis of complementary DNA chains truncated at a given base. Synthesis takes place in the presence of special molecules: the dideoxynucleotides.
Reaction with dideoxyguanosine triphosphate (ddGTP).
The label is usually 32P, so that detection requires autoradiography.
Separation by electrophoresis.
A later evolution of the method allowed DNA sequencing to be automated. In this case, the dideoxy terminator nucleotides are labelled by the addition of a fluorogenic chemical group, different for each base. The reactions are run together, and each “terminated” molecule is labelled with the colour corresponding to its terminal base.
Automated DNA sequencing with fluorophore-labelled dideoxynucleotides.
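The chain-termination principle described above can be sketched in a few lines. This is a simplified model under stated assumptions: the newly synthesized strand `new_strand` is a made-up example, primer length is ignored, and `terminated_lengths` and `read_gel` are hypothetical helper names, not part of any sequencing software.

```python
def terminated_lengths(new_strand, dd_base):
    """Fragment lengths produced by a reaction containing one dideoxynucleotide.

    Synthesis can stop wherever dd_base is incorporated, so one fragment
    ends at every position of the new strand that carries that base.
    """
    return [i + 1 for i, base in enumerate(new_strand) if base == dd_base]

def read_gel(lanes):
    """Read the sequence from four lanes (base -> fragment lengths).

    Electrophoresis sorts fragments by length; reading the lanes from the
    shortest to the longest fragment gives the sequence 5' to 3'.
    """
    base_by_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_by_length[n] for n in sorted(base_by_length))

new_strand = "GATTCGGA"  # hypothetical newly synthesized strand, 5' to 3'
lanes = {base: terminated_lengths(new_strand, base) for base in "ACGT"}

print(lanes["G"])       # fragments from the ddGTP reaction
print(read_gel(lanes))  # sequence reconstructed from the four lanes
```

In the automated version, the four lanes collapse into one: each fragment carries the colour of its terminal base, so the same length-ordered read-out is obtained from a single capillary.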
Chain-termination reactions are carried out in a single tube, with each dideoxynucleotide labelled with a different fluorophore. Fractionation is performed by capillary electrophoresis.

Post-genome
1. Human genetic variation: 1000 Genomes Project; exome project (exon-targeted resequencing)
2. Comparative genomic analysis
3. Functional analysis (ENCODE): transcriptome, proteome, interactome, epigenome, …

NGS library workflow: fragmentation → adapter ligation → amplification of fragments (PCR) → library → NGS sequencing → reads. No cloning step.
Reads are mapped to the reference genome.
In re-sequencing, the number of independent sequences (called «reads») is more important than their length.
The % of the reference genome that is represented in «reads» is the «coverage».
Other essential aspects: 1) speed 2) cost 3) error-to-depth ratio

Next Generation Sequencing (deep sequencing / mass sequencing)
• generation of “DNA nanoclones” on distinct solid surfaces by PCR or single-molecule isolation
• highly parallel in situ sequencing
• record read-out, i.e.
millions of short sequences (“reads”)
• align reads on genomes or assembly

Next generation sequencing methods:
Number of molecules per sequence
• Amplification
• Single-molecule
Biochemical measurement
• Sequencing by synthesis (Sanger is synthesis + termination)
• Nucleotide chemistry
• Associated chemistry
• Sequencing by annealing and ligation
• Sequencing by direct physico-chemical measurements
Detection
• Optical detection
• Ion or conductance detection

Amplification by emulsion PCR (Roche 454, Polonator): biotinylated template
http://www.pyrosequencing.com/DynPage.aspx?id=7454
In the second case, surface sequencing is used.
[Figure: orientation of template and primer on the surface, with the 5’ and 3’ ends and the sequencing direction marked.]

The slide is sealed to form a microfluidic chamber. Arrows indicate the flow of the chemical reagents needed for the reactions that progressively reveal the order of nucleotides in each fragment.
A laser scanner continuously records fluorescence over the whole slide; each signal corresponds to a «base» (e.g. red = A, green = T, yellow = G, blue = C), thereby reading all the sequences.
Reading: 2 spots are exemplified here.
Single-colour imaging: for each position, four successive cycles are run, one with each base (as a reversible terminator).
The «readout» is millions of short sequences (from 30 to 500 nucleotides, depending on the platform) called «reads».
Typically, each sample yields between 10 and 100 million «reads».
The «reads» are then aligned to the genome and assigned accordingly to the different genes.
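The mapping step and the slide's definition of coverage (the % of the reference represented in reads) can be illustrated with a toy example. This is only a sketch under strong assumptions: the reference and reads are invented, and mapping is done by exact string match with `str.find`, whereas real aligners use indexed references and tolerate mismatches. `map_reads` and `coverage_percent` are hypothetical helper names.

```python
def map_reads(reference, reads):
    """Map each read by exact match; return (start, end) pairs, end exclusive.

    Reads that do not occur in the reference are left unmapped.
    """
    hits = []
    for read in reads:
        start = reference.find(read)
        if start != -1:
            hits.append((start, start + len(read)))
    return hits

def coverage_percent(reference_length, hits):
    """Percentage of reference positions covered by at least one read."""
    covered = [False] * reference_length
    for start, end in hits:
        for pos in range(start, end):
            covered[pos] = True
    return 100.0 * sum(covered) / reference_length

reference = "ACGTTGCAAGGCTTAACCGG"                   # hypothetical 20-nt reference
reads = ["ACGTTG", "TTGCAA", "AACCGG", "GGGGGG"]     # the last read does not map

hits = map_reads(reference, reads)
print(hits)
print(coverage_percent(len(reference), hits))
```

Two of the reads overlap, which adds depth at those positions but no extra coverage; this is why the number of independent reads matters more than their length in re-sequencing.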
sequence                                  # of reads   map                           region
CTAGTCATGCTCTCGATCGGTCATAGTTTAGTCTGACT    12           Chr 1: 12,345,678-12,345,711   CHD1 coding
TTAAAGTACTGTCATGATTTCATGCTAGCTTTTCAAAG    3            Chr 7: 1,987,654-1,987,694     PTT1 intron
GGCTACTAGTCTATTACAAGGGCATCGCGGATCGCGT     1            Chr 16: 23,456,711-23,456,755  intergenic
TACTGCTGACGCCGCATGCATTTACGCTGCGGCATCGG    22           Chr 1: 12,345,123-12,345,183   CHD1 coding
....                                      ....

Illumina-Solexa Genome Analyzer
Read lengths: 36 bp, 50 bp, 75 bp, 100 bp for fragment or paired-end sequencing
Throughput (reads): 120 million reads per run, fragment
SOLiD - Applied Biosystems
Read lengths: 50 bp fragment, 25 bp and 35 bp paired
Throughput (reads): > 160 million reads per slide, fragment
Roche - 454
Read lengths: averaging 350 - 400 bp
Throughput (reads): ~1 million reads per run

Extensive sequencing of RNA from several cell types and tissues has provided an answer to a long-standing question: why do humans have so few genes?
To be sequenced, RNA must be copied to DNA (called cDNA, complementary DNA). This is done using an enzyme called Reverse Transcriptase (RT).
cDNA is then fragmented, linked to adaptors and used to generate libraries for NGS.
Wang et al. (2009). Nat. Rev. Genet. 10:57-69
Figure 1 | A typical RNA-Seq experiment. Briefly, long RNAs are first converted into a library of cDNA fragments through either RNA fragmentation or DNA fragmentation (see main text). Sequencing adaptors (blue) are subsequently added to each cDNA fragment and a short sequence is obtained from each cDNA using high-throughput sequencing technology. The resulting sequence reads are aligned with the reference genome or transcriptome, and classified as three types: exonic reads, junction reads and poly(A) end-reads. These three types are used to generate a base-resolution expression profile for each gene, as illustrated at the bottom; a yeast ORF with one intron is shown.
Information is dependent on the kind of RNA preparation.
Due to the uneven representation of each RNA species, it is impractical to run RNA-Seq analysis using «total RNA preparations» (most of the reads would map to highly represented RNAs, whereas rare RNAs would not be represented at all in the sequencing library).

Cells, tissue → total RNA → fractionation
• By type: non-rRNA (all nonribosomal)
• By size: small, < 200 nt (small noncoding); long, > 200 nt (coding + other)
• By adds: poly(A)+, poly(A)-. Content labels, in slide order: >99% coding; large number noncoding; noncoding
• By location: nuclear (snRNA, snoRNA, lncRNA); cytoplasmic (coding, noncoding)
• By function: exome (microarray-selected coding exons)

[Figure labels: guanine-N7-methyltransferase; elongation factor.]
The EJC is deposited 20-24 nt upstream of the exon-exon junction.
Splicing factors and snRNPs are replaced by the EJC (exon junction complex) proteins.
From Aguilera 2005, Curr Op Cell Biol, 17:242.

How prevalent is splicing (i.e. exon-intron gene organization) in different organisms?
S. cerevisiae has only 253 introns (3% of genes); only 6 genes have 2 introns.
S. pombe: 43% of the genes have introns; many of them contain >1 intron.
H. sapiens: >99% of genes contain multiple introns (40-75 nt).
Average human gene:
Length: 28,000 bp
No. of exons: 8.8
Exon length: 120 bp
No. of introns: 7.8
Intron length: 10 to >100,000 bp
From «intronless» genes, to a few introns, to several tens: one intron in the human neurexin gene is approx. 480,000 nt!

Alternative splicing consists in the choice of treating certain sequences as exons or as introns.
History: 1983. A human gene undergoes differential processing.
Calcitonin is a hormone consisting of a 32-amino acid polypeptide that is produced, in humans, by the parafollicular cells of the thyroid. The main function of calcitonin is to lower the calcium concentration in the blood.
CGRP is produced in both peripheral and central neurons. It is a potent peptide vasodilator and can function in the transmission of pain.
In the spinal cord, the function and expression of CGRP may differ depending on the location of synthesis.

Alternative splicing may concern one or more exons. Quite often many isoforms are coexpressed; sometimes there are tissue-specific isoforms.
Alternative splicing of fibronectin pre-mRNA (introns are not drawn to scale)

Alternative splicing
H. sapiens:
Estimated number of protein-coding genes: < 22,000
Estimated number of proteins: > 90,000
S. cerevisiae:
253 genes contain introns
only 3 genes shown experimentally to undergo alternative splicing
H. sapiens:
>99% predicted to have exon-intron structure
>90% predicted to undergo alternative splicing

AS in many cases gives rise to proteins with different functions and roles. One example is already well known in this course: ERBB4.

(e) Isoforms of the Slo protein lacking sequences encoded by the STREX exon have fast deactivation kinetics and low Ca2+ sensitivity, whereas isoforms containing STREX-encoded sequences have slower deactivation kinetics and higher Ca2+ sensitivity.
From: Graveley BR (2001) Trends Genet., 17:100-106.
Fig. 1. Alternative splicing of the slo gene. (a) The mammalian cochlea. The cochlea is a snail-like structure of the inner ear that contains hair cells organized along a basilar membrane. The basilar membrane traverses the length of the curled-up cochlea. (b) The cochlea is sliced transversely as shown in (a) and the section of the cochlea containing the basilar membrane and the hair cells depicted. There are four rows of hair cells, one inner hair cell and three outer hair cells, situated above the basilar membrane. (c) The cochlea is unrolled to reveal the basilar membrane viewed from above. The four hair cells are arranged in rows along the length of the basilar membrane. The hair cells are tuned to unique narrow sound frequencies along the basilar membrane creating a tonotopic gradient.
At one end of the membrane, hair cells are tuned to respond to a frequency of 20 Hz, whereas hair cells at the other end respond to 20,000 Hz. (d) Organization of the human slo gene. The exon-intron organization of the slo gene (determined by an analysis of draft sequence of the human genome) is depicted. The constitutive splicing events are indicated below the gene and alternative splicing events are depicted above the gene. The constitutive exons are white and the alternative exons are shaded. The STREX exon is purple.

Neurexins
[...] present in the postsynaptic cell, and thus function to initiate synaptogenesis. In contrast, β-neurexin I containing exon 20 encoded sequences cannot interact with neuroligins. This form of β-neurexin I might indirectly function in releasing synapses.
From: Graveley BR (2001) Trends Genet., 17:100-106.

The Drosophila Dscam gene provides probably the most extreme example of alternative splicing.
Perhaps the most complex event that takes place during development is the migration and connection of neurons. Even in a ‘simple’ organism such as Drosophila melanogaster, which contains only ~250,000 neurons, accurately wiring neurons together would appear to be a daunting task. In flies, the gene encoding the Down syndrome cell adhesion molecule (Dscam) appears to fulfill at least part of this role. Dscam encodes an axon guidance receptor with an extracellular domain that contains ten immunoglobulin (Ig) repeats. The most striking feature of the Dscam gene is that its pre-mRNA can be alternatively spliced into over 38,000 different mRNA isoforms (Fig. 3a). This is 2-3 times the number of predicted genes in the entire organism! Each mRNA encodes a distinct receptor with the potential ability to interact with different molecular guidance cues, directing the growing axon to its proper location.
[...] as an axon guidance receptor. It is thought that each Dscam variant will interact with a unique set of axon guidance cues.
The form of Dscam shown on the left will interact with guidance cue A. The form of Dscam shown on the right contains different sequences encoded by exons 4, 6 and 9 and thus interacts with guidance cue B, rather than guidance cue A. Neurons expressing the form of Dscam shown on the right will be attracted in a different direction than neurons expressing the form shown on the left.
From: Graveley BR (2001) Trends Genet., 17:100-106.
Potentially 38,000 splicing variants

How extensive is alternative splicing usage in humans?
Exon-exon junction microarrays
Oligonucleotide probes, typically 25-60 nucleotides in length, can be designed to hybridize to isoform-specific mRNA regions. Recently, alternative splicing microarrays have been designed with probes that are specific to both exons and exon-exon junctions. Probes e1, e2 and e3 are exon specific, whereas j1-2, j2-3 and j1-3 are isoform-specific junction probes. Some arrays also contain intron probes (i1 and i2) to indicate signals from pre-mRNA. Various array design and data processing strategies facilitate the quantitative analysis of alternative splicing patterns, some of which have been subsequently confirmed by PCR after reverse transcription of RNA (RT-PCR). Johnson et al. (2003) used arrays with probes for all adjacent exon-exon junctions in 10,000 human genes and hybridized these with samples from 52 human tissues and cell lines. This revealed cell-type-specific clustering of alternative splicing events, and allowed the discovery of new alternative splicing events. Pan et al. analysed 3,126 known cassette-type alternative splicing events in mouse using exon-specific and exon-exon junction probes. Analysis of RNAs in ten tissues showed clustering of alternative splicing events by tissue type, and further revealed that tissue-specific programmes of transcription and alternative splicing operate on different subsets of genes.
A direct comparison also showed that computational prediction of tissue-specific alternative splicing based on ESTs and cDNAs performed poorly compared with the alternative splicing microarray and RT-PCR. From: Matlin et al. (2005), Nature Rev Mol Cell Biol, 6: 386. RNA-Seq (NGS) Figure 1 | Frequency and relative abundance of alternative splicing isoforms in human genes. a, mRNA-Seq reads mapping to a portion of the SLC25A3 gene locus. The number of mapped reads starting at each nucleotide position is displayed (log10) for the tissues listed at the right. Arcs represent junctions detected by splice junction reads. Bottom: exon/intron structures of representative transcripts containing mutually exclusive exons 3A and 3B (GenBank accession numbers shown at the right). b, Mean fraction of multi-exon genes with detected alternative splicing in bins of 500 genes, grouped by total read count per gene. A gene was considered as alternatively spliced if splice junction reads joining the same 5’ splice site (5’SS) to different 3’ splice sites (3’SS) (with at least two independently mapping reads supporting each junction), or joining the same 3’SS to different 5’SS, were observed. The true extent of alternative splicing was estimated from the upper asymptote of the best-fit sigmoid curve (red curve). Circles show the fraction of alternatively spliced genes. «Pure» alternative splicing Figure 3 Types of alternative splicing. In all five examples of alternative splicing, constitutive exons are shown in red and alternatively spliced regions in green, introns are represented by solid lines, and dashed lines indicate splicing activities. Relative abundance of alternative splicing events that are conserved between human and mouse transcriptomes are shown above each example (in % of total alternative splicing events). From: Ast G. (2004) Nature Rev Genetics 5: 773. 
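The criterion quoted in the Figure 1 caption (a gene counts as alternatively spliced when junction reads join the same 5’SS to different 3’SS, or the same 3’SS to different 5’SS, with at least two reads supporting each junction) can be written down as a small function. This is a sketch, not the authors' pipeline: junctions are given as hypothetical coordinate pairs, and "independently mapping reads" is simplified to a plain read count per junction.

```python
from collections import Counter, defaultdict

def is_alternatively_spliced(junction_reads, min_reads=2):
    """Apply the junction-read criterion to one gene.

    junction_reads: iterable of (five_ss, three_ss) pairs, one per junction read.
    A junction is kept only if at least min_reads reads support it
    (simplifying assumption: any two reads count as independent).
    """
    counts = Counter(junction_reads)
    supported = [junction for junction, n in counts.items() if n >= min_reads]

    acceptors_by_donor = defaultdict(set)
    donors_by_acceptor = defaultdict(set)
    for five_ss, three_ss in supported:
        acceptors_by_donor[five_ss].add(three_ss)
        donors_by_acceptor[three_ss].add(five_ss)

    # Same 5'SS joined to different 3'SS, or same 3'SS joined to different 5'SS.
    return (any(len(s) > 1 for s in acceptors_by_donor.values())
            or any(len(s) > 1 for s in donors_by_acceptor.values()))

# Hypothetical genes: coordinates are made up for illustration.
gene_a = [(100, 200)] * 3 + [(100, 350)] * 2 + [(400, 500)] * 5
gene_b = [(100, 200)] * 3 + [(100, 350)] * 1   # second junction has a single read

print(is_alternatively_spliced(gene_a))
print(is_alternatively_spliced(gene_b))
```

Because detection requires a minimum number of reads per junction, genes with low total read counts are more likely to be scored as not alternatively spliced, which is why the caption estimates the true extent from the asymptote of the fitted curve rather than from the raw fraction.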
Note that the indicated percentages derive from older studies and are slightly different from those obtained in recent, RNA-Seq based evaluations.

Figure 2 | Pervasive tissue-specific regulation of alternative mRNA isoforms. Rows represent the eight different alternative transcript event types diagrammed. Mapped reads supporting expression of upper isoform, lower isoform or both isoforms are shown in blue, red and grey, respectively. Columns 1-4 show the numbers of events of each type: (1) supported by cDNA and/or EST data; (2) with ≥ 1 isoform supported by mRNA-Seq reads; (3) with both isoforms supported by reads; and (4) events detected as tissue regulated (Fisher’s exact test) at an FDR of 5% (assuming negligible technical variation). Columns 5 and 6 show: (5) the observed percentage of events with both isoforms detected that were observed to be tissue-regulated; and (6) the estimated true percentage of tissue-regulated isoforms after correction for power to detect tissue bias (Supplementary Fig. 6) and for the FDR. For some event types, ‘common reads’ (grey bars) were used in lieu of (for tandem 3’UTR events) or in addition to ‘exclusion’ reads for detection of changes in isoform levels between tissues.
Note that the authors use the following definition of “tissue-specific”: at least 10% variation in isoforms.

The ENCODE project has shown that most human genes give rise to multiple transcripts, using:
• Alternative start sites
• Wide variations in the splicing pattern
• Alternative 3’ ends
The average number tends towards 9 to 12 transcripts per gene.
Figure 1 from Licatalosi et al., 2010

Some genes display “alternative promoters”
[Diagram: distal promoter and proximal promoter upstream of exons 1 and 2, 5’ to 3’.]
Sometimes an exon is present between the two promoters (alternative parts): a coding or noncoding exon, if an acceptor site and a donor site are present.
This is different from the case of multiple TSS.
[Diagram: unique TSS vs. multiple TSS upstream of exon 1.]
Other genes possess “alternative polyadenylation sites”
[Diagram: proximal pA site and distal pA site, each downstream of a stop codon; coding or noncoding exon.]

How is alternative splicing regulated?
Constitutive: snRNPs (U1, U2); «conserved» splice sites
Alternative: cis elements (ESE, ESS, ISE, ISS) recognized by SR proteins, hnRNPs, activators, repressors

Basics of the mechanisms of alternative splicing. (a) The architecture of a pre-mRNA and the important cis-acting sequence elements that direct the splicing reaction. The consensus sequences for the 5’ splice site, branchpoint and 3’ splice site for human introns are shown. (b) Schematic diagram of the sequences and proteins involved in regulating alternative splicing. Four types of regulatory sequences are known: intronic splicing enhancers (ISEs), intronic splicing silencers (ISSs), exonic splicing enhancers (ESEs) and exonic splicing silencers (ESSs). The enhancer elements are recognized by activator proteins. Within exons, these activators are most commonly members of the SR protein family. The silencer elements are bound by repressor proteins. Within exons, these repressors tend to be members of the hnRNP protein family. Regardless of their binding location, activators tend to enhance the binding of spliceosomal components to the regulated splice site while repressors tend to inhibit binding or function of the spliceosomal components.
From: McManus & Graveley, COGD, 2011

Pervasive Transcriptome

RNA-Seq is used for both quantitative and qualitative evaluation of transcriptomes.

Read mapping

Quantitative (measuring gene expression): reads are mapped to a reference «transcriptome», then the number of reads for each known RNA is counted to obtain a table of expression or a density graph.

Qualitative: discovering
• new RNA transcripts
• new exon-exon junctions (i.e. alternative splicing forms)
• aberrant transcripts and splicing forms
• new RNA forms (e.g. circular RNA)

Experiments published to date using RNA-Seq show extensive «pervasive» transcription of genomes, with more than 20,000 noncoding RNA transcripts in human.
Noncoding RNAs are a high percentage qualitatively (number of distinct transcripts) but a low percentage quantitatively (abundance).

Note: the numbers are slightly different from those later presented by ENCODE. The difference is due to the number of experiments considered, which here are relatively few (2007-2008).
Ponting & Belgard, 2009

ENCODE
Today: many RNA-Seq experiments on tissues and cell lines have been published.
ENCODE reports results from 15 cell lines:
• RNA-Seq from either «cytoplasmic» or «nuclear» fractions, poly(A+) or poly(A-) RNA
• 62% of genomic bases are represented in long RNA transcripts (>200 nt)
• only 5.5% map to GENCODE exons
• only 31% were classified «intergenic»
• CAGE analysis: 62,403 TSS
• Nuclear versus cytoplasmic analysis showed many transcripts processed to give short RNAs (<200 nt)
• Very high prevalence of alternative splicing
• Each gene shows several transcripts (6-9 per gene, depending on classification), plateauing at 12.
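The quantitative use of RNA-Seq described above (map reads to a reference transcriptome, count reads per known RNA, build a table of expression) can be sketched in a few lines. This is a toy illustration: the read assignments are assumed to come from an aligner that is not shown, the transcript names and lengths are invented, and TPM (transcripts per million) is one common length normalization chosen here as an assumption, not something the slides prescribe.

```python
from collections import Counter


def expression_table(read_assignments, transcript_lengths):
    """Build a per-transcript table of raw read counts and TPM values.

    read_assignments: one transcript name per mapped read
    transcript_lengths: transcript name -> length in nucleotides
    Reads assigned to names missing from transcript_lengths are ignored.
    """
    counts = Counter(read_assignments)
    # Reads per kilobase: corrects for longer transcripts yielding more reads
    rates = {
        name: counts.get(name, 0) / (length / 1000)
        for name, length in transcript_lengths.items()
    }
    total = sum(rates.values())
    return {
        name: {
            "reads": counts.get(name, 0),
            "tpm": rates[name] / total * 1e6 if total else 0.0,
        }
        for name in transcript_lengths
    }


# Hypothetical data: one abundant mRNA and two rarely sampled noncoding RNAs
lengths = {"mRNA_1": 2000, "lncRNA_1": 1000, "lncRNA_2": 500}
reads = ["mRNA_1"] * 960 + ["lncRNA_1"] * 30 + ["lncRNA_2"] * 10

for name, row in expression_table(reads, lengths).items():
    print(f"{name}\t{row['reads']}\t{row['tpm']:.1f}")
```

In this invented example two of the three transcripts are noncoding, yet together they receive only 40 of 1,000 reads, which mirrors the point that noncoding RNAs can be numerous as distinct species while contributing little to total abundance.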
Short noncoding
• snRNA: small U-RNA components of spliceosome particles
• snoRNA: small RNAs that guide chemical modifications of other RNAs
• micro-RNA (miRNA): single-stranded, 21-24 nt, post-transcriptional regulators
• siRNA: double-stranded, 21-25 bp, silencing RNA, various pathways
• piRNA: 26-31 nt, involved in transcriptional silencing (transposons)
• sRNA: various transcription-associated small RNAs, function unknown

Long noncoding (15-20,000 transcripts identified)
The simplest classification of long ncRNAs is based on their loci of origin.
• Antisense transcripts: transcribed antisense to a coding RNA
• Intronic transcripts (sense or a/s): spanning introns
• Divergent transcripts: starting from promoters, in the opposite direction
• Long intergenic transcripts: transcripts in gene-desert regions
• e-RNA: in either direction, starting from active enhancers

No class-specific function identified. Sporadic functions described:
• Transcriptional regulators, connecting activators or repressors
• Silencing, by acting as scaffolds for chromatin-modifying enzymes
• «Sponges» for micro RNA
• Translational regulators

From the standpoint of regulation, the two newest and most interesting classes are:
• micro RNA
• lnc RNA
In the second part on regulation
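The locus-based classification of long ncRNAs listed above can be expressed as a small decision procedure on genomic coordinates. The sketch below is a simplification under explicit assumptions: a single annotated gene on one chromosome, 0-based half-open coordinates, exons sorted by start, and an arbitrary 1,000 bp window for calling a transcript divergent. The `Gene` class, the `classify_lncrna` function and the coordinates are hypothetical, and e-RNAs are left out because they would require an enhancer annotation.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    start: int
    end: int
    strand: str  # "+" or "-"
    exons: list  # sorted (start, end) pairs; first and last exons bound the gene


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def classify_lncrna(t_start, t_end, t_strand, gene, window=1000):
    """Assign a transcript to a locus-based class relative to one gene."""
    same_strand = t_strand == gene.strand

    # Overlap with an exon: antisense if on the opposite strand
    if any(overlaps(t_start, t_end, s, e) for s, e in gene.exons):
        return "sense overlap (possible isoform)" if same_strand else "antisense"

    # Entirely contained in an intron, in either orientation
    introns = [
        (gene.exons[i][1], gene.exons[i + 1][0])
        for i in range(len(gene.exons) - 1)
    ]
    if any(s <= t_start and t_end <= e for s, e in introns):
        return "intronic (sense)" if same_strand else "intronic (antisense)"

    # Opposite strand, starting just upstream of the gene's TSS and
    # transcribed away from it
    if not same_strand:
        if gene.strand == "+" and 0 <= gene.start - t_end <= window:
            return "divergent"
        if gene.strand == "-" and 0 <= t_start - gene.end <= window:
            return "divergent"

    return "intergenic"


# Hypothetical three-exon gene on the + strand
gene = Gene(1000, 5000, "+", [(1000, 1500), (3000, 3500), (4500, 5000)])
print(classify_lncrna(1200, 1400, "-", gene))
print(classify_lncrna(2000, 2500, "+", gene))
print(classify_lncrna(200, 900, "-", gene))
print(classify_lncrna(20000, 21000, "+", gene))
```

A real annotation pipeline would compare each transcript against every gene on the chromosome and would need a distance cutoff to separate "intergenic" from "near a gene"; here any transcript that fits no other rule is simply called intergenic.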