Genomes 5

Homology
Homology: two or more DNA or protein sequences are said to be homologous if they derive from a common ancestor (evolution).

Orthology (speciation)
Hemoglobin in two different species (human and monkey) has the same function, even though the two proteins differ slightly. The human and monkey hemoglobin genes originate from a single ancestral gene.
Orthologous genes: homologous genes found in different but related species, which encode proteins with similar structures and functions. Orthologous genes were separated not by a duplication event but by the separation of species (speciation) in the course of evolution.

Paralogy (duplication)
During human evolution the gene for the sodium pump protein was duplicated; one of the two copies then mutated and became a sodium-potassium pump. Both copies nevertheless originate from a single ancestral gene.
Paralogous genes: homologous genes within the same species that encode different products. They originate from a duplication event.

Synteny
In classical genetics, synteny describes the physical co-localization of genetic loci on the same chromosome within an individual or species. The concept is related to genetic linkage: linkage between two loci is established by the observation of lower-than-expected recombination frequencies between them.
Shared synteny (also known as conserved synteny) describes preserved co-localization of genes on chromosomes of different species. During evolution, genome rearrangements such as chromosome translocations may separate two loci, resulting in the loss of synteny between them. Conversely, translocations can also join two previously separate pieces of chromosomes, resulting in a gain of synteny between loci.
Stronger-than-expected shared synteny can reflect selection for functional relationships between syntenic genes, such as combinations of alleles that are advantageous when inherited together, or shared regulatory mechanisms.
The analysis of synteny in the gene order sense has several applications in genomics. Shared synteny is one of the most reliable criteria for establishing the orthology of genomic regions in different species. Additionally, exceptional conservation of synteny can reflect important functional relationships between genes. For example, the order of genes in the "Hox cluster", which are key determinants of the animal body plan and which interact with each other in critical ways, is essentially preserved throughout the animal kingdom.
A qualitative distinction is sometimes drawn between macrosynteny, preservation of synteny in large portions of a chromosome, and microsynteny, preservation of synteny for only a few genes at a time.

Figure: Synteny relationships between the human and mouse IRG genes. (a) Synteny between mouse chromosome 7 and human chromosome 19 in the region of the IRGC and IRGQ genes. The figures indicate distances from the centromere in megabases. The locations of three further syntenic markers are given. Gene orientation is given by black arrows. (b) Complex synteny relationship between human chromosome 5 and mouse chromosomes 11 and 18 in the regions containing the mouse Irg genes. Figures indicate distances from the centromere in megabases. The locations of IRG genes are shown in the yellow panels. Positions of diagnostic syntenic markers are also indicated. Syntenic blocks are given in full color, and the rest is shaded. Bekpen et al. Genome Biology 2005 6:R92 doi:10.1186/gb-2005-6-11-r92

An international collaboration (the Arabidopsis Genome Initiative, AGI) began sequencing the Arabidopsis thaliana genome in 1996.
Why sequence a plant
• To understand the genetic basis of the differences between plants and other eukaryotes
• Comparison with other plant genomes (comparative genetics)
• Identification of bacterial genes integrated into the eukaryotic genome (evolutionary studies)
• Genetic improvement of species of interest (crop improvement)
• Study of hereditary mutations (clinical analyses)

Why A. thaliana?
• small genome (5 chromosomes)
• rapid life cycle (about 6 weeks)
• produces many seeds
• grows in limited space
• easy to manipulate (Agrobacterium-mediated transformation)
• large collection of available mutants
• no agronomic importance, but advantageous for basic research

SEQUENCING STRATEGY
TOP-DOWN TECHNOLOGY (hierarchical sequencing)
• Construction of BAC and YAC libraries
• Construction of a tiling path of aligned and ordered sequences
• Fragmentation and sequencing of the individual clones
ADVANTAGES
• favours the reconstruction of high-resolution physical and genetic maps
• allows research groups around the world to form consortia and work together without redundancy

The transformation-competent artificial chromosome (TAC) vector contains the P1 bacteriophage replicon, which maintains the vector in a single copy, and therefore renders foreign DNA fragments stable, in E. coli cells. The vector also contains the pRiA4 replicon of the Ri plasmid, which ensures a vector copy number of 1 in Agrobacterium tumefaciens. The kanamycin resistance marker gene (NPTI), modified by removal of the HindIII site, is included in the vector to allow selection of clones in both E. coli and A. tumefaciens by culture in the presence of kanamycin.
Fig. 1. Structure and characteristics of the TAC vector pYLTAC7. (A) Physical map of pYLTAC7. The map shows the locations of recognition sites for endonucleases that cleave the vector once or twice. Abbreviations not defined in text: KmR, kanamycin resistance gene (NPTI); PE. coli, a synthetic E. coli promoter.
(B) Sequence of the cloning site region of the vector upstream of the sacB gene. The indicated primer sequences (R1, R2, R3, L1, L2, and L3) are designed for isolation of end fragments of the inserted DNA by TAIL-PCR. The initiation codon (atg) of sacB is indicated in lowercase letters. http://www.kazusa.or.jp/en/plant/TAC/Fig1_TACvector.html

The Minimal Tiling Path
How is the tiling path built?
1. HYBRIDIZATION WITH A LABELLED PROBE (chromosome walking)
2. FINGERPRINTING
3. END SEQUENCING: sequence the ends of the collection of BAC clones and verify that the sequences have been reconstructed correctly

Sequencing strategy for the A. thaliana genome
Telomere sequence was obtained from specific yeast artificial chromosome (YAC) and phage clones, and from inverse polymerase chain reaction (IPCR) products derived from genomic DNA. In the centromeric regions, these physical mapping methods were supplemented with genetic mapping to identify contig positions and orientation.

Sequencing strategies
• CHROMOSOME ARMS: BAC, YAC, TAC, cosmid and P1 clones
• TELOMERES: YAC clones and IPCR
• CENTROMERES: BAC clones and PCR
The physical map of the genome was built by combining restriction fragment "fingerprint" analysis of BAC clones, hybridization/PCR of regions carrying known molecular markers, and Southern blotting.

Features of the A. thaliana genome
• 125 Mb (115.4 Mb sequenced + about 10 Mb of unsequenced centromeric/repeated regions)
• 25,498 genes encoding 11,601 protein types (about 150 families)
Roughly 30% of the 25,498 predicted gene products, comprising both plant-specific proteins and proteins with similarity to genes of unknown function from other organisms, could not be assigned to functional categories.
A combination of algorithms, all optimized with parameters based on known Arabidopsis gene structures, was used to define gene structure. Similarities to known protein and expressed sequence tag (EST) sequences were used to refine gene models.
• Eighty per cent of the gene structures predicted by the three centres involved were completely consistent, 93% of ESTs matched gene models, and less than 1% of ESTs matched predicted non-coding regions, indicating that most potential genes were identified.
• The 25,498 genes predicted was the largest gene set published to that date: C. elegans has 19,099 genes and Drosophila 13,601 genes.
• Arabidopsis and C. elegans have similar gene density, whereas Drosophila has a lower gene density.

Comparison with the genomes of other organisms:
• Common functions: highly conserved genes
• Specific functions (photosynthesis, tropism, etc.): lower level of conservation
The proportion of Arabidopsis proteins having related counterparts in eukaryotic genomes varies by a factor of 2 to 3 depending on the functional category. Only 8-23% of Arabidopsis proteins involved in transcription have related genes in other eukaryotic genomes, reflecting the independent evolution of many plant transcription factors. In contrast, 48-60% of genes involved in protein synthesis have counterparts in the other eukaryotic genomes, reflecting highly conserved gene functions.
The absolute number of Arabidopsis gene families and singletons (types) is in the same range as the other multicellular eukaryotes, indicating that a proteome of 11,000-15,000 types is sufficient for a wide diversity of multicellular life. The proportion of gene families with more than two members is considerably higher in Arabidopsis than in other eukaryotes. Pronounced redundancy in the Arabidopsis genome is evident in segmental duplications and tandem arrays, and many other genes with high levels of sequence conservation are also scattered over the genome.
Segmental duplication is responsible for 6,303 gene duplications, and tandem gene duplication accounts for a significant proportion of the increased family size. Gene duplication indicates either more relaxed constraints on genome size in plants or a more prominent role of unequal crossing over in generating new copies.

Genome Organization
The Arabidopsis genome sequence provides a complete view of chromosomal organization and clues to its evolutionary history.
• 17% of the genes are arranged in tandem arrays
• 24 long (> 100 kb) duplicated segments (~58% of the genome)
• 60% of the genes are duplicated
• A speciation event from a tetraploid ancestral plant may have occurred (~112 million years ago)

What does the Arabidopsis genome tell us about the ancestry of the species?
Polyploidy occurs widely in plants and is proposed to be a key factor in plant evolution. As the majority of the Arabidopsis genome is represented in duplicated (but not triplicated) segments, it appears most likely that Arabidopsis, like maize, had a tetraploid ancestor. A comparative sequence analysis of Arabidopsis and tomato estimated that a duplication occurred ~112 Myr ago to form a tetraploid. It is also possible, however, that several independent segmental duplication events took place instead of tetraploid formation and stabilization.

Comparative analysis of Arabidopsis accessions
Comparing multiple accessions of Arabidopsis allows us to identify commonly occurring changes in genome microstructure. It also enables the development of new molecular markers for genetic mapping.
• Columbia (Col-0)
• Landsberg erecta (Ler)
High rates of polymorphism between Arabidopsis accessions, in both DNA sequence and copy number of tandem arrays, are prevalent at loci involved in disease resistance. This has been observed for other plant species, and such loci are thought to serve as templates for illegitimate recombination to create new pathogen response specificities.
A comparative analysis between 82 Mb of the genome sequence of Arabidopsis accession Columbia (Col-0) and 92.1 Mb of nonredundant low-pass (twofold redundant) sequence data from the genomic DNA of accession Landsberg erecta (Ler) revealed two classes of differences:
• InDels (14,570 InDels at an average spacing of 6.1 kb)
• SNPs (1 SNP every 3.3 kb)
InDels ranged from 2 bp to over 38 kb, although 95% were smaller than 50 bp. Many InDels contained entire active genes not related to transposons. Half of such genes absent from corresponding positions in the Col-0 sequence were found elsewhere in the genome of Ler. This indicates that genes have been transferred to new genomic locations.
SNPs were found in exons, introns and intergenic regions at frequencies of 1 SNP per 3.1, 2.2 and 3.5 kb, respectively.
These analyses show that sequence polymorphisms between accessions of Arabidopsis are common, and that they occur in both coding and non-coding regions.
Integration of organelle genomes into the Arabidopsis thaliana genome

Transposable elements
10% of the Arabidopsis genome consists of transposable elements:
- CLASS I, which replicate through RNA intermediates (2109)
  • LTR (long terminal repeat) retrotransposons / LINEs / SINEs
- CLASS II, which move as DNA (2203)
  • hAT-like elements / CACTA-like elements / MITEs / MULEs
- NEW GROUPS (1209)
  • Basho / Katytid

Telomeres and centromeres

Membrane transporters
• ~15% of the transporters are channel proteins
• 50% of the channel proteins are aquaporins
• 10 times more MIP (major intrinsic protein) family members: a hydraulic system that is fundamental to numerous processes
• inorganic anion transporters (phosphate, sulphate, nitrate and chloride) and metal-cation channels
• ~1,000 genes encoding Ser/Thr protein kinases; peptides important for signal transport in the plant
• ~12% are sugar transporters
• Surprisingly, Arabidopsis has homologues of the human ABC TAP transporters of antigenic peptides for presentation to the major histocompatibility complex (MHC)

The Arabidopsis genome has a complexity of gene regulation comparable to that of other eukaryotes.
TRANSCRIPTION FACTORS
o Identified using similarity searches and domain matches;
o 1,709 proteins annotated as transcription factors;
o 29 classes of Arabidopsis transcription factors (16 unique to plants);
o 8-23% have related counterparts among the transcription factors of other eukaryotes.

CELLULAR ORGANIZATION
The evolutionary divergence in cytoskeleton organization and cytokinesis appears to derive from the presence of the cell wall. Examples:
• lack of the proteins that, in animal cells, connect the cytoskeleton to the extracellular matrix;
• presence of plasmodesmata;
• cell plate formed de novo;
• absence of analogues of the centrosome or of the contractile ring of animal cells.
DEVELOPMENT
The regulation of development in Arabidopsis involves
• Cell-cell communication
• Transcription factors
• Regulation of chromatin state
Comparison with the animal kingdom: SAME DEVELOPMENTAL MECHANISMS, but DIFFERENT "TOOLS". Examples:
• Antero-posterior development of the embryo involves the spatially specific activation of members of a gene family
  • Animals: homeobox
  • Plants: MADS box
• Cell-cell communication
  • Animals: receptor Tyr kinases
  • Plants: receptor Ser/Thr kinases

SIGNAL TRANSDUCTION
The plant responds to environmental changes through
1. SIGNAL RECEPTION
2. SIGNAL TRANSDUCTION
3. MODIFICATION OF THE EXPRESSION PATTERN
Messengers include hormones (auxin, ethylene, brassinosteroids) and peptides (CLV3).
Plants have EVOLVED their own SIGNAL TRANSDUCTION PATHWAYS. Examples:
• ETHYLENE RESPONSE: two-component system (a combination of bacterial and animal pathways)
• MAPK CASCADES: components of the His-Asp phosphorelay system that differ from those of mammals
  - Response regulators (ARRs)
  - Pseudo-response regulators (PRRs)
  - Phosphotransfer proteins (HPt)

PATHOGEN RECOGNITION AND RESPONSE
• In mammals, polymorphism for parasite recognition is encoded in the MHC genes and contributes to resistance.
• In plants, the disease resistance (R) genes that confer parasite recognition are extremely polymorphic.
• Unlike the MHC genes, plant resistance genes are found at many different loci, and the complete genome sequence allows analysis of their complement and structure.
• The Arabidopsis genome contains diverse resistance genes distributed over many loci, together with components of the signalling pathways and many other genes whose role in disease resistance has been inferred from mutant phenotypes.
• The evolution of resistance genes may involve duplication and divergence of linked genes; however, many (46) resistance genes are singletons, 50 are in pairs, 21 are in 7 clusters of 3 members, and there are single clusters of 4, 5, 7, 8 and 9 members, respectively.

PHOTOMORPHOGENESIS AND PHOTOSYNTHESIS
• 100 candidate genes involved in light perception and signalling
  - new proteins similar to the photomorphogenesis regulators COP/DET/FUS, PKS1, PIF3, NDPK2, SPA1, FAR1, GIGANTEA, FIN219, HY5, CCA1, ATHB-2, ZEITLUPE, FKF1, LKP1, NPH3 and RPT2
• 139 nuclear-encoded genes that potentially function in photosynthesis
  - 11 core proteins of photosystem I, including the eukaryote-specific components PsaG and PsaH, and 8 photosystem II proteins, including a member (PsbW) of the photosystem II core
  - 26 proteins similar to the chlorophyll a/b binding proteins (8 Lhca and 18 Lhcb)
Further analyses identified some components of the light perception pathway and showed that the components of the photosynthetic apparatus complexes are partitioned between the nuclear and plastid genomes.

METABOLOME
• A substantial number of genes encode enzymes involved in metabolic processes (photosynthesis, respiration, mineral acquisition...);
• Genes are often redundant (mostly tissue-specific);
• Synthesis of more than 100,000 secondary metabolites (metabolic engineering).

Important definitions
Contig: The result of joining an overlapping collection of sequences or clones.
Scaffold: The result of connecting contigs by linking information from paired-end reads from plasmids, paired-end reads from BACs, known messenger RNAs or other sources. The contigs in a scaffold are ordered and oriented with respect to one another.
N50 length: A measure of the contig length (or scaffold length) containing a 'typical' nucleotide.
Specifically, it is the maximum length L such that 50% of all nucleotides lie in contigs (or scaffolds) of size at least L.
N50 length (computation): A contig N50 is calculated by first ordering every contig by length from longest to shortest. Next, starting from the longest contig, the lengths of the contigs are summed until the running sum reaches or exceeds one-half of the total length of all contigs in the assembly. The contig N50 of the assembly is the length of the last (and therefore shortest) contig added to the sum. The scaffold N50 is calculated in the same fashion but uses scaffolds rather than contigs. In general, the longer the scaffold N50, the more contiguous the assembly.
The N90 statistic is smaller than or equal to the N50 statistic; it is the length for which the collection of all contigs of that length or longer contains at least 90% of the total of the lengths of the contigs, and for which the collection of all contigs of that length or shorter contains at least 10% of the total of the lengths of the contigs.
Note that N50 is calculated in the context of the assembly size rather than the genome size. The NG50 statistic is the same as the N50 except that the genome size is used rather than the assembly size.
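The N50, N90 and NG50 definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name nx_length, its parameters and the contig lengths are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def nx_length(lengths, fraction=0.5, total=None):
    """Return the Nx length of an assembly.

    lengths  -- contig (or scaffold) lengths
    fraction -- 0.5 for N50, 0.9 for N90
    total    -- reference size; defaults to the assembly size (sum of lengths).
                Passing an estimated genome size gives the NG-style statistic.
    """
    if total is None:
        total = sum(lengths)
    threshold = fraction * total
    running = 0
    # Walk from the longest contig to the shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            # Last contig added when the running sum reaches the threshold.
            return length
    return None  # can happen for NGx when the assembly is too small


# Hypothetical contig lengths (total assembly size = 300).
contigs = [80, 70, 50, 40, 30, 20, 10]

n50 = nx_length(contigs)                    # 80 + 70 = 150 >= 150
n90 = nx_length(contigs, fraction=0.9)      # sum reaches 270 at the 30 contig
ng50 = nx_length(contigs, total=400)        # assumed genome size of 400
print(n50, n90, ng50)
```

With these toy values the assembly is smaller than the assumed genome, so the NG50 is shorter than the N50, which is why the two statistics should not be compared directly.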
What sequencing the human genome revealed

Background to the Human Genome Project
• Choice of sequencing method
• Coordination and data sharing
• Public server with unrestricted access
"...we felt that the human genome sequence is the common heritage of all humanity, and the work should transcend national boundaries..."
"We believed that scientific progress would be most rapidly advanced by immediate and free availability of the human genome sequence."

Generation of the draft sequence
1) CLONE SELECTION
2) SEQUENCING
3) ASSEMBLY

CLONE SELECTION
• 8 large-insert libraries containing BAC and PAC clones
• built by partial digestion of genomic DNA
• prepared from DNA obtained from anonymous donors
• together they represent about 65-fold coverage of the genome
For the large-scale sequence production phase, a genome-wide physical map of overlapping clones was also constructed by systematic analysis of BAC clones representing 20-fold coverage of the human genome.
"Volunteers of diverse backgrounds were accepted on a first-come, first-taken basis"
Figure: BAC DNAs are digested with HindIII and visualized on a SYBR-green-stained 1% agarose gel. Every fifth lane contains a mixture of marker DNAs; the sizes of selected marker fragments are indicated. 0, origin of fragment migration.

CLONE SEQUENCING

ASSEMBLY: divided into 3 phases
A) FILTERING
Remove contamination from non-human sequences.
B) LAYOUT
Associate the sequenced clones with specific clones on a physical map, to produce a "layout".
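The HindIII fingerprinting of BAC DNA described above has a computational counterpart: a known clone sequence can be digested "in silico" and its fragment sizes compared with another clone's fingerprint. The sketch below is only an illustration under simplifying assumptions (complete digestion, linear molecule, exact fragment sizes); the function names and the toy sequence are hypothetical. HindIII recognizes AAGCTT and cuts after the first A.

```python
HINDIII_SITE = "AAGCTT"   # HindIII recognition sequence (palindromic)
HINDIII_CUT_OFFSET = 1    # the enzyme cuts after the first A: A^AGCTT


def in_silico_digest(seq, site=HINDIII_SITE, cut_offset=HINDIII_CUT_OFFSET):
    """Return fragment lengths from a complete digest of a linear sequence."""
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    boundaries = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


def shared_fragments(frags_a, frags_b, tolerance=0):
    """Count fragments of A matched by a distinct fragment of similar size in B."""
    remaining = sorted(frags_b)
    shared = 0
    for frag in sorted(frags_a):
        for i, other in enumerate(remaining):
            if abs(frag - other) <= tolerance:
                shared += 1
                del remaining[i]  # each fragment of B may be matched only once
                break
    return shared


# Hypothetical toy clone: two HindIII sites in a 22-bp sequence.
toy_clone = "GGG" + "AAGCTT" + "CCCCC" + "AAGCTT" + "TT"
fragments = in_silico_digest(toy_clone)
print(fragments)
print(shared_fragments(fragments, [7, 11, 30]))
```

Two clones that share many fragment sizes are likely to overlap, which is the idea behind assembling fingerprint clone contigs; real fingerprint comparison also has to allow for gel sizing error, which the tolerance parameter only hints at.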
Clones obtained: 29,298

METHOD: NUMBER OF CLONES POSITIONED
• Matching the sequenced clones to the corresponding fingerprint clone contigs of the physical map on the basis of "in silico" digestion profiles: 16,193
• Positioning the sequenced clones on a physical map using databases of end sequences from fingerprinted BACs (Table 1)
• Combination of the two approaches: 22,566
• Exploiting sequence overlap with clones already positioned: 25,403
• 29,298 (-152)

The fingerprint clone contigs were then positioned on the chromosomes using sequence matches to STSs mapped on 2 genetic maps and 4 radiation hybrid maps, together with FISH data. The mapping was refined throughout by comparing the order and orientation of the STSs in the fingerprint clone contigs with the various STS-based maps.
[Figure: contigs in the wrong orientation, reoriented in panel B.]
In all, 942 fingerprint clone contigs contained sequenced clones. Of these, 852 (99.2% of the DNA) were assigned to specific positions on the chromosomes, 51 (0.5% of the DNA) were assigned to specific chromosomes but not to precise positions, and 39 (0.3% of the DNA) had no localization.

ASSEMBLY: A) FILTERING B) LAYOUT C) MERGING
C) MERGING
Merge the sequences from overlapping sequenced clones using GigAssembler.
[Figure: sequence contigs, merged sequence contigs and sequence-contig scaffolds.]
The result of the assembly is a draft sequence of the human genome.

Broad genomic landscape

GC content
The vertebrate genome can be regarded as a mosaic of isochores, that is, large DNA segments with a homogeneous nucleotide composition. It has been proposed that the long-range variation in GC content may reflect that the genome is composed of a mosaic of compositionally homogeneous regions that have been dubbed 'isochores'.

CpG islands
CpG sites are regions of DNA where a cytosine nucleotide occurs next to a guanine nucleotide in the linear sequence of bases along its length. "CpG" is shorthand for "C-phosphate-G", that is, cytosine and guanine separated by a phosphate, which links the two nucleosides together in DNA. The "CpG" notation is used to distinguish this linear sequence from the CG base-pairing of cytosine and guanine.
CpG islands typically occur at or near the transcription start site of genes, particularly housekeeping genes, in vertebrates. Normally a C base followed immediately by a G base (a CpG) is rare in vertebrate DNA because the cytosines in such an arrangement tend to be methylated, and over evolutionary time methylated cytosines tend to turn into thymines through spontaneous deamination. As a result, residual CpG islands are found in areas where methylation is rare and CpG sites persist.
Methylation of CpG sites is followed by spontaneous deamination, leading to a lack of CpG sites in methylated DNA.
Maintenance methylation: during DNA replication, the newly synthesized strand is methylated at the positions corresponding to the methylated sites on the parental strand, so that the daughter molecules have the same methylation profile as the original molecule.
De novo methylation: the dynamic methylation that changes the profile of the methylome.

The count of 28,890 CpG islands is reasonably close to the previous estimate of about 35,000. Most of the islands are short, with 60-70% GC content. More than 95% of the islands are less than 1,800 bp long, and more than 75% are less than 850 bp. The longest CpG island (on chromosome 10) is 36,619 bp long, and 322 are longer than 3,000 bp.
The density of CpG islands varies substantially among some of the chromosomes. Most chromosomes have 5-15 islands per Mb, with a mean of 10.5 islands per Mb. However, chromosome Y has an unusually low 2.9 islands per Mb, and chromosomes 16, 17 and 22 have 19-22 islands per Mb. The extreme outlier is chromosome 19, with 43 islands per Mb.
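The two quantities discussed above, GC content and the relative rarity of CpG dinucleotides, can be computed directly from a sequence. The sketch below is illustrative only. The observed/expected CpG ratio it uses is a commonly used measure of CpG depletion and is an assumption here, not something taken from the slides; the function names and toy sequences are hypothetical.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G).

    A value near 1 means CpG occurs about as often as base composition
    predicts; values well below 1 indicate CpG depletion.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


island_like = "CGCGCGAT"   # hypothetical CpG-rich toy sequence
depleted = "CATGCATG"      # hypothetical toy sequence with no CpG

print(gc_content(island_like), cpg_obs_exp(island_like))
print(gc_content(depleted), cpg_obs_exp(depleted))
```

In practice these values are computed in sliding windows along a chromosome, and island calls depend on length and threshold choices that this sketch does not attempt to reproduce.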
Comparison of genetic and physical distance

Repeated sequences in the human genome
In the human, coding sequences comprise less than 5% of the genome, whereas repeat sequences account for at least 50% and probably much more. Broadly, the repeats fall into five classes:
1. transposon-derived repeats, often referred to as interspersed repeats;
2. inactive (partially) retroposed copies of cellular genes (including protein-coding genes and small structural RNAs), usually referred to as processed pseudogenes;
3. simple sequence repeats, consisting of direct repetitions of relatively short k-mers such as (A)n, (CA)n or (CGG)n;
4. segmental duplications, consisting of blocks of around 10-300 kb that have been copied from one region of the genome into another region;
5. blocks of tandemly repeated sequences, such as at centromeres, telomeres, the short arms of acrocentric chromosomes and ribosomal gene clusters.
Repeats are often described as 'junk' and dismissed as uninteresting. However, they actually represent an extraordinary trove of information about biological processes. The repeats constitute a rich palaeontological record, holding crucial clues about evolutionary events and forces. As passive markers, they provide assays for studying processes of mutation and selection.

Transposable elements
Most human repeat sequence is derived from transposable elements. We can currently recognize about 45% of the genome as belonging to this class. Much of the remaining 'unique' DNA must also be derived from ancient transposable element copies that have diverged too far to be recognized as such.
In mammals, almost all transposable elements fall into one of four types, of which three transpose through RNA intermediates and one transposes directly as DNA. These are long interspersed elements (LINEs), short interspersed elements (SINEs), LTR retrotransposons and DNA transposons.
SINEs, LINEs, LTR retroposons and DNA transposon copies comprise 13%, 20%, 8% and 3% of the sequence, respectively.

Long Interspersed Elements (LINEs)
LINEs are one of the most ancient and successful inventions in eukaryotic genomes. In humans, these transposons are about 6 kb long, harbour an internal polymerase II promoter and encode two open reading frames (ORFs). ORF1 encodes an RNA binding protein and ORF2 encodes a protein having an endonuclease (e.g. RNase H) as well as a reverse transcriptase. Upon translation, a LINE RNA assembles with its own encoded proteins and moves to the nucleus. The reverse transcriptase makes a DNA copy of the RNA that can be integrated into the genome at a new site. Because LINEs (and other class I transposons, e.g. LTR retrotransposons and SINEs) move by copying themselves (instead of moving by a cut-and-paste-like mechanism, as class II transposons do), they enlarge the genome.
Three distantly related LINE families are found in the human genome: LINE1, LINE2 and LINE3. Only LINE1 is still active.

Short Interspersed Elements (SINEs)
SINEs are wildly successful freeloaders on the backs of LINE elements. They are short (about 100-400 bp), harbour an internal polymerase III promoter and encode no proteins. These nonautonomous transposons are thought to use the LINE machinery for transposition. Indeed, most SINEs 'live' by sharing the 3' end with a resident LINE element. The promoter regions of all known SINEs are derived from tRNA sequences, with the exception of a single monophyletic family of SINEs derived from the signal recognition particle component 7SL. This family, which also does not share its 3' end with a LINE, includes the only active SINE in the human genome: the Alu element. By contrast, the mouse has both tRNA-derived and 7SL-derived SINEs.
The human genome contains three distinct monophyletic families of SINEs: the active Alu, and the inactive MIR and Ther2/MIR3.
Short Interspersed Elements (SINEs): The Alu Example
Alu elements are highly repetitive DNA sequences that can be classified as SINEs (short interspersed elements), which are themselves a type of "nonautonomous" retrotransposon. An Alu element is transcribed into RNA by RNA polymerase III and then converted into a double-stranded DNA molecule by reverse transcriptase. The new double-stranded DNA molecule is then inserted into a new location in the genome. Because they are nonautonomous, like all SINEs, Alu elements do not have the genetic capacity to produce DNA copies of themselves or to integrate into new chromosomal locations. For those activities, they rely on another type of transposon, called L1.
Most Alu elements are approximately 300 base pairs long, with considerable sequence variation. Alu elements frequently duplicate when they jump, and scientists estimate that the human genome acquires one new Alu insert in approximately every 200 births.
Alu TEs are believed to have emerged in primates around 65 million years ago. Today, they are the most abundant type of human TE, making up about 10% of the (diploid) human genome. Thus, in a mere 65 million years, these transposons have gone from zero to about 1 million copies per cell. These elements are spread throughout the genome and occur at varying densities in different loci.

LTR retroposons
LTR retroposons are flanked by long terminal direct repeats (LTRs) that contain all of the necessary transcriptional regulatory elements. The autonomous elements (retrotransposons) contain gag and pol genes, which encode a protease, reverse transcriptase, RNase H and integrase. Exogenous retroviruses seem to have arisen from endogenous retrotransposons by acquisition of a cellular envelope gene (env).
Transposition occurs through the retroviral mechanism, with reverse transcription occurring in a cytoplasmic virus-like particle, primed by a tRNA (in contrast to the nuclear location and chromosomal priming of LINEs).
Although a variety of LTR retrotransposons exist, only the vertebrate-specific endogenous retroviruses (ERVs) appear to have been active in the mammalian genome. Mammalian retroviruses fall into three classes (I-III), each comprising many families with independent origins. Most (85%) of the LTR retroposon-derived 'fossils' consist only of an isolated LTR, with the internal sequence having been lost by homologous recombination between the flanking LTRs.

DNA transposons
DNA transposons resemble bacterial transposons, having terminal inverted repeats and encoding a transposase that binds near the inverted repeats and mediates mobility through a 'cut-and-paste' mechanism. The human genome contains at least seven major classes of DNA transposon, which can be subdivided into many families with independent origins.
DNA transposons tend to have short life spans within a species. This can be explained by contrasting the modes of transposition of DNA transposons and LINE elements. LINE transposition tends to involve only functional elements, because LINE proteins preferentially assemble with the RNA from which they were translated. For DNA transposons, the encoded transposase is produced in the cytoplasm and, when it returns to the nucleus, it cannot distinguish active from inactive elements. As inactive copies accumulate in the genome, transposition becomes less efficient. This checks the expansion of any DNA transposon family and in due course causes it to die out.
To survive, DNA transposons must eventually move by horizontal transfer to virgin genomes, and there is considerable evidence for such transfer.

Average age of transposable elements in the human genome
Human: The overall activity of all transposons has declined markedly over the past 35-50 Myr, with the possible exception of LINE1. Indeed, apart from an exceptional burst of activity of Alu peaking around 40 Myr ago, there would appear to have been a fairly steady decline in activity in the hominid lineage since the mammalian radiation.
Mouse: Transposon activity in the mouse genome has not undergone the decline seen in humans and proceeds at a much higher rate. LTR retroposons are alive, and LINE1 and a variety of SINEs are quite active.
[Figure: age distribution of transposable elements in human and mouse; time axis from today to 200 Myr ago]

Comparison with other organisms
• The euchromatic portion of the human genome has a much higher density of transposable element copies than the euchromatic DNA of the other three organisms.
• The human genome is filled with copies of ancient transposons, whereas the transposons in the other genomes tend to be of more recent origin.
• Whereas in the human two repeat families (LINE1 and Alu) account for 60% of all interspersed repeat sequence, the other organisms have no dominant families.

Distribution of repeats in the genome
Some regions of the genome are extraordinarily dense in repeats. The prizewinner appears to be a 525-kb region on chromosome Xp11, with an overall transposable element density of 89%. This region contains a 200-kb segment with 98% density, as well as a segment of 100 kb in which LINE1 sequences alone comprise 89% of the sequence. In contrast, some genomic regions are nearly devoid of repeats. The absence of repeats may be a sign of large-scale cis-regulatory elements that cannot tolerate being interrupted by insertions. The four regions with the lowest density of interspersed repeats in the human genome are the four homeobox gene clusters, HOXA, HOXB, HOXC and HOXD.
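The densities quoted for these regions (for example 89% over 525 kb) are simply the fraction of bases in a window that are annotated as interspersed repeats. A minimal sketch of that calculation follows. It assumes a soft-masked sequence in which repeat-derived bases are written in lowercase; that convention, the function name `repeat_density` and the toy sequence are assumptions made for illustration, not something stated in these notes.

```python
def repeat_density(seq: str, window: int) -> list[float]:
    """Fraction of lowercase (assumed repeat-masked) bases in each
    non-overlapping window of the sequence."""
    if window <= 0:
        raise ValueError("window must be positive")
    densities = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        # Count bases flagged as repeat (lowercase by assumption)
        masked = sum(1 for base in chunk if base.islower())
        densities.append(masked / len(chunk))
    return densities


# Toy example with 10-bp windows: unique, mostly repeat, all repeat
toy = "ACGTACGTAC" + "acgtacgtAC" + "acgtacgtac"
print(repeat_density(toy, 10))
```

On real data the window would be tens or hundreds of kilobases, and a repeat-poor region such as a HOX cluster would show up as a run of windows with density near zero.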
Distribution of GC content
LINEs: denser in AT-rich regions. SINEs (MIR, Alu): the opposite trend. Positive selection for Alus in GC-rich regions would imply that they benefit the organism (GC-rich regions are where the genes are). This hypothesis is based on the observation that in many species SINEs are transcribed under conditions of stress, and the resulting RNAs specifically bind a particular protein kinase (PKR) and block its ability to inhibit protein translation. SINE RNAs would thus promote protein translation under stress. SINE RNA may be well suited to such a role in regulating protein translation, because it can be quickly transcribed in large quantities from thousands of elements and it can function without protein translation. Therefore, there could be positive selection for SINEs in readily transcribed open chromatin such as is found near genes. This could explain the retention of Alus in gene-rich, GC-rich regions. It is also consistent with the observation that SINE density in AT-rich DNA is higher near genes.

The Y chromosome
The genetic material on chromosome Y is unusually young, probably owing to a high tolerance for gain of new material by insertion and loss of old material by deletion. Several lines of evidence support this picture:
• LINE elements on chromosome Y are on average much younger than those on autosomes
• MaLR family retroposons on chromosome Y are younger than those on autosomes
• Chromosome Y has a relative over-representation of the younger retroviral class II (ERVK) and a relative under-representation of the primarily older class III (ERVL) compared with other chromosomes.

Mutation rate in males and females
Interspersed repeats on chromosome Y can also be used to estimate the relative mutation rates in the male and female germlines. Chromosome Y always resides in males, whereas chromosome X resides in females twice as often as in males.
Repeat elements from recent subfamilies (effectively, birth cohorts dating from the past 50 Myr) were identified, and the substitution rates for subfamily members on chromosomes X and Y were measured (Fig. 29). There is a clear linear relationship, which corresponds to a mutation rate in the male germline about 2.1 times higher than in the female germline.

Active transposons
- 950 LINEs
- sequences found with full-length elements and intact ORFs
- 61 potentially active LINEs

TRANSPOSONS AS A CREATIVE FORCE
- 47 human genes probably derived from transposons (RAG1 and RAG2 recombinases, etc.)
- LTR retroposons used as transcription terminators (a few hundred genes)
- Repeated sequences turned into regulatory elements

Simple Sequence Repeats
Simple sequence repeats (SSRs) are a rather different type of repetitive structure that is common in the human genome: perfect or slightly imperfect tandem repeats of a particular k-mer. SSRs with a short repeat unit (n = 1-13 bases) are often termed microsatellites, whereas those with longer repeat units (n = 14-500 bases) are often termed minisatellites. SSRs comprise about 3% of the human genome, with the greatest single contribution coming from dinucleotide repeats (0.5%). Trinucleotide SSRs are much less frequent than dinucleotide SSRs. There is approximately one SSR per 2 kb. SSRs have been extremely important in human genetic studies, because they show a high degree of length polymorphism in the human population owing to frequent slippage by DNA polymerase during replication. Genetic markers based on SSRs, particularly (CA)n repeats, have been the workhorse of most human disease mapping studies. The availability of a comprehensive catalogue of SSRs is thus a boon for human genetic studies.

Segmental duplications
Segmental duplications involve the transfer of blocks of genomic sequence of 1-200 kb to one or more positions in the genome.
- The draft human genome contains at least 3.3% segmental duplications.
- Final estimate: 5% segmental duplications.
They can be divided into two categories.
1) Interchromosomal duplications are defined as segments that are duplicated between nonhomologous chromosomes.
2) Intrachromosomal duplications occur within a particular chromosome or chromosome arm. This category includes several repeated segments, also known as low-copy repeats, that mediate the recurrent structural chromosomal rearrangements associated with numerous genetic diseases.
The high proportion of large duplications clearly distinguishes the human genome from other sequenced genomes.

Distribution of segmental duplications
Chromosome 22 contains a region of 1.5 Mb adjacent to the centromere in which 90% of sequence can now be recognized to consist of interchromosomal duplication. Conversely, 52% of the interchromosomal duplications on chromosome 22 were located in this region, which comprises only 5% of the chromosome. Also, the subtelomeric end consists of a 50-kb region consisting almost entirely of interchromosomal duplications. Chromosome 21 presents a similar landscape (these were the two best-assembled chromosomes).
[Figures: interchromosomal duplications; intrachromosomal duplications]

Gene content of the human genome
In organisms with small genomes, it is straightforward to identify most genes by the presence of long ORFs. In contrast, human genes tend to have small exons (encoding an average of only 50 codons) separated by long introns (some exceeding 10 kb). This creates a signal-to-noise problem, with the result that computer programs for direct gene prediction have only limited accuracy. Instead, computational prediction of human genes must rely largely on the availability of cDNA sequences or on sequence conservation with genes and proteins from other organisms.
This approach is adequate for strongly conserved genes (such as histones or ubiquitin), but may be less sensitive to rapidly evolving genes (including many crucial to speciation, sex determination and fertilization).

Non-coding RNAs
Although biologists often speak of a tight coupling between “genes and their encoded protein products”, it is important to remember that thousands of human genes produce noncoding RNAs (ncRNAs) as their ultimate product. There are several major classes of ncRNA.
1. Transfer RNAs (tRNAs) are the adapters that translate the triplet nucleic acid code of RNA into the amino-acid sequence of proteins;
2. Ribosomal RNAs (rRNAs) are also central to the translational machinery, and recent X-ray crystallography results strongly indicate that peptide bond formation is catalysed by rRNA, not protein;
3. Small nucleolar RNAs (snoRNAs) are required for rRNA processing and base modification in the nucleolus;
4. Small nuclear RNAs (snRNAs) are critical components of spliceosomes, the large ribonucleoprotein (RNP) complexes that splice introns out of pre-mRNAs in the nucleus.
ncRNAs do not have translated ORFs, are often small and are not polyadenylated. Accordingly, novel ncRNAs cannot readily be found by computational gene-finding techniques (which search for features such as ORFs) or by experimental sequencing of cDNA or EST libraries.

tRNAs
Although 61 sense codons need to be decoded, not all 61 different anticodons are present in tRNAs. Rather, tRNAs generally follow stereotyped and conserved wobble rules. Wobble reduces the number of required anticodons substantially, and provides a connection between the genetic code and the hybridization stability of modified and unmodified RNA bases. In eukaryotes, it has been predicted that about 46 tRNA species will be sufficient to read the 61 sense codons (counting the initiator and elongator methionine tRNAs as two species).
According to these rules, in the codon's third (wobble) position, U and C are generally decoded by a single tRNA species, whereas A and G are decoded by two separate tRNA species.

Wobble tRNAs
The classical experimental estimate of the number of human tRNA genes is 1,310. In the draft genome sequence, only 497 human tRNA genes were found, plus 324 tRNA-derived putative pseudogenes. This indicates that the human has fewer tRNA genes than the worm, but more than the fly. This may seem surprising, but tRNA gene number in metazoans is thought to be related not to organismal complexity, but more to the demand for tRNA abundance in certain tissues or stages of embryonic development. The tRNA genes are dispersed throughout the human genome, but this dispersal is nonrandom. More than 25% of the tRNA genes (140) are found in a region of only about 4 Mb on chromosome 6. This small region, only about 0.1% of the genome, contains an almost sufficient set of tRNA genes all by itself. In any case, the human tRNA gene set predicted from the draft genome sequence appears to include most of the known human tRNA species. The draft genome sequence contains 37 of 38 human tRNA species listed in a tRNA database. Satisfyingly, the human tRNA set follows these wobble rules almost perfectly.
In eukaryotes, 46 tRNA species are sufficient to read the 61 sense codons, because the wobble hypothesis applies to the third position:
• one tRNA for U/C
• A and G decoded by two separate tRNAs

Ribosomal RNAs, snoRNAs and snRNAs
Ribosomal RNA genes. The ribosome, the protein synthetic machine of the cell, is made up of two subunits and contains four rRNA species and many proteins. The large ribosomal subunit contains 28S and 5.8S rRNAs (collectively called 'large subunit' rRNA) and also a 5S rRNA. The small ribosomal subunit contains 18S rRNA ('small subunit' rRNA). The genes for large subunit and small subunit rRNA occur in the human genome as a 44-kb tandem repeat unit.
There are about 150-200 copies of this repeat unit arrayed on the short arms of acrocentric chromosomes 13, 14, 15, 21 and 22. The 5S rDNA genes also occur in tandem arrays, the largest of which is on chromosome 1 close to the telomere. There are 200-300 true 5S genes in these arrays.
Small nucleolar RNA genes. Eukaryotic rRNA is extensively processed and modified in the nucleolus. Much of this activity is directed by numerous snoRNAs. There is a compiled set of 97 known human snoRNA gene sequences; 84 of these (87%) have at least one copy in the draft genome sequence, almost all as single-copy genes.
Spliceosomal RNAs and other ncRNA genes. At least one copy was found for 21 (95%) of 22 known ncRNAs, including the spliceosomal snRNAs. There were multiple copies for several ncRNAs, as expected; for example, 44 dispersed genes for U6 snRNA, and 16 for U1 snRNA.

Non-coding RNA genes
Characterization of known protein-coding genes
Identifying the protein-coding genes in the human genome is one of the most important applications of the sequence data, but also one of the most difficult challenges. Before attempting to identify new genes, the authors explored what could be learned by aligning the cDNA sequences of known genes to the draft genome sequence. Genomic alignments allow one to study exon-intron structure and local GC content. The 'known' genes studied were those in the RefSeq database, a manually curated collection designed to contain non-redundant representatives of most full-length human mRNA sequences in GenBank (RefSeq intentionally contains some alternative splice forms of the same genes). The version of RefSeq used contained 10,272 mRNAs. The RefSeq genes were aligned with the draft genome sequence. Because this sequence is incomplete and contains errors, not all genes could be fully aligned and some may have been incorrectly aligned.
More than 92% of the RefSeq entries could be aligned at high stringency over at least part of their length, and 85% could be aligned over more than half of their length. Some genes (16%) had high stringency alignments to more than one location in the draft genome sequence owing, for example, to paralogues or pseudogenes. Protein-coding genes There is considerable variation in overall gene size and intron size, with both distributions having very long tails. Many genes are over 100 kb long, the largest known example being the dystrophin gene (DMD) at 2.4Mb. The titin gene has the longest currently known coding sequence at 80,780 bp; it also has the largest number of exons (178) and longest single exon (17,106 bp). Properties of human genes compared to those from worm and fly For all three organisms, the typical length of a coding sequence is similar (1,311 bp for worm, 1,497 bp for fly and 1,340 bp for human), and most internal exons fall within a common peak between 50 and 200 bp. However, the worm and fly exon distributions have a fatter tail, resulting in a larger mean size for internal exons (218 bp for worm versus 145 bp for human) In contrast to the exons, the intron size distributions differ substantially among the three species. The worm and fly have most introns near the preferred minimum intron length (47 bp for worm, 59 bp for fly) and an extended tail (overall average length of 267 bp for worm and 487 bp for fly). Intron size is much more variable in humans, with a peak at 87 bp but a very long tail resulting in a mean of more than 3,300 bp. The variation in intron size results in great variation in gene size. 
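The exon and intron sizes discussed above come from gene models aligned to the genome. As a small illustration of how such figures are derived, the sketch below takes one gene as a sorted list of exon coordinates and returns exon and intron lengths. The half-open `(start, end)` convention, the function name and the toy coordinates are assumptions chosen for the example, not data from the notes.

```python
from statistics import mean, median


def exon_intron_lengths(exons):
    """Given sorted, non-overlapping exons as half-open (start, end)
    pairs, return (exon_lengths, intron_lengths)."""
    exon_lens = [end - start for start, end in exons]
    # An intron spans from the end of one exon to the start of the next
    intron_lens = [
        next_start - end
        for (_, end), (next_start, _) in zip(exons, exons[1:])
    ]
    return exon_lens, intron_lens


# Toy gene: four short exons, one very long intron (made-up coordinates)
gene = [(0, 150), (1150, 1290), (1377, 1520), (11520, 11700)]
exon_lens, intron_lens = exon_intron_lengths(gene)
print(exon_lens, intron_lens)
print(mean(intron_lens), median(intron_lens))
```

Even in this toy gene a single long intron pulls the mean intron length far above the median, which is the same long-tail effect that gives human introns a peak at 87 bp but a mean above 3,300 bp.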
Distribution of GC content in genes and in the genome
The variation in gene size and intron size can partly be explained by the fact that GC-rich regions tend to be gene-dense with many compact genes, whereas AT-rich regions tend to be gene-poor with many sprawling genes containing large introns. The correlation appears to be due primarily to intron size, which drops markedly with increasing GC content. In contrast, coding properties such as exon length or exon number (data not shown) vary little.

Alternative splicing
To investigate the prevalence of alternative splicing, reconstructed mRNA transcripts covering the entire coding regions of genes on chromosome 22 were analysed (omitting small genes with coding regions of less than 240 bp). Potential transcripts identified by alignments of ESTs and cDNAs to genomic sequence were verified by human inspection. In total, 642 transcripts were found, covering 245 genes (average of 2.6 distinct transcripts per gene). Two or more alternatively spliced transcripts were found for 145 (59%) of these genes. A similar analysis for the gene-rich chromosome 19 gave 1,859 transcripts, corresponding to 544 genes (average 3.2 distinct transcripts per gene). Because only a subset of all transcripts has been sampled, the true extent of alternative splicing is likely to be greater. Nevertheless, these figures are considerably higher than those for worm, in which analysis reveals alternative splicing for 22% of genes for which ESTs have been found, with an average of 1.34 (12,816/9,516) splice variants per gene. Seventy per cent of alternative splice forms found in the genes on chromosomes 19 and 22 affect the coding sequence, rather than merely changing the 3' or 5' UTR.

Gene prediction
Gene identification is almost trivial in bacteria and yeast, because the absence of introns in bacteria and their paucity in yeast means that most genes can be readily recognized by ab initio analysis as unusually long ORFs.
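The "unusually long ORF" criterion can be sketched as a scan of the reading frames for stretches that run from an ATG to the first in-frame stop codon. The code below is a minimal illustration under stated assumptions: it scans only the three forward frames, the `min_codons` threshold and the toy sequence are arbitrary, and it is not the method of any particular gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 100):
    """Return (start, end, frame) for forward-strand ORFs that run from
    an ATG to the first in-frame stop codon. `end` is exclusive and
    includes the stop; the length threshold excludes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the current candidate ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Toy sequence: ATG + 5 codons + stop, placed in frame 2
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))
```

In an intron-free genome a threshold of a few hundred codons separates real genes from chance ORFs reasonably well; with human exons averaging about 50 codons, the same scan would mostly return noise, which is the signal-to-noise problem described in the text.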
Gene identification is not as simple, but still relatively straightforward, in animals with small genomes and small introns, such as worm and fly. A major factor is the high signal-to-noise ratio: coding sequences comprise a large proportion of the genome and a large proportion of each gene (about 50% for worm and fly), and exons are relatively large. Gene identification is more difficult in human DNA. The signal-to-noise ratio is lower: coding sequences comprise only a few per cent of the genome and an average of about 5% of each gene; internal exons are smaller than in worms; and genes appear to have more alternative splicing.

Previous estimates of human gene number
• Early estimates based on reassociation kinetics estimated the mRNA complexity of typical vertebrate tissues to be 10,000-20,000, and were extrapolated to suggest around 40,000 for the entire genome.
• In the mid-1980s, Gilbert suggested that there might be about 100,000 genes, based on the approximate ratio of the size of the genome (3 x 10^9 bp) to the size of a typical gene (3 x 10^4 bp).
• An estimate of 70,000-80,000 genes was made by extrapolating from the number of CpG islands and the frequency of their association with known genes.
• As human sequence information has accumulated, it has been possible to derive estimates on the basis of ESTs. Such calculations consistently produce low estimates, in the region of 35,000.

Gene prediction
Gene prediction methods employed combinations of three basic approaches:
• direct evidence of transcription provided by ESTs or mRNAs
• indirect evidence based on sequence similarity to previously identified genes and proteins
• ab initio recognition of groups of exons on the basis of hidden Markov models (HMMs) that combine statistical information about splice sites, coding bias and exon and intron lengths
The process resulted in version 1 of the integrated gene index (IGI).
The composition of the corresponding integrated protein index (IPI) 1, obtained by translating IGI 1, is given below. There are 31,778 protein predictions, with 14,882 from known genes, 4,057 predictions from Ensembl merged with Genie and 12,839 predictions from Ensembl alone. The IGI set thus contains about 15,000 known genes and about 17,000 gene predictions. The average lengths are 469 amino acids for the known proteins, 443 amino acids for protein predictions from the Ensembl-Genie merge, and 187 amino acids for those from Ensembl alone.
Ensembl starts from ab initio predictions made by Genscan and looks for confirmation by similarity to proteins, mRNAs and ESTs. Genie starts from matches to mRNAs or ESTs and extends the matches with Markov models.

Chromosomal distribution of genes
The average density of gene predictions is 11.1 per Mb across the genome, with the extremes being chromosome 19 at 26.8 per Mb and chromosome Y at 6.4 per Mb. It is likely that a significant number of the predictions on chromosome Y are pseudogenes (this chromosome is known to be rich in pseudogenes) and thus that the density for chromosome Y is an overestimate. The density of both genes and Alu on chromosome 19 is much higher than expected, even accounting for the high GC content of the chromosome; this supports the idea that Alu density is more closely correlated with gene density than with GC content itself.
“If there are 30,000-35,000 genes, with an average coding length of about 1,400 bp and average genomic extent of about 30 kb, then about 1.5% of the human genome would consist of coding sequence and one-third of the genome would be transcribed in genes.”
“The human thus appears to have only about twice as many genes as worm or fly. However, human genes differ in important respects from those in worm and fly. They are spread out over much larger regions of genomic DNA, and they are used to construct more alternative transcripts.
This may result in perhaps five times as many primary protein products in the human as in the worm or fly.”

Evaluation/validation of IGI/IPI
The pipeline, mainly based on Ensembl, was assessed for:
- Sensitivity
- Specificity
- Fragmentation
• Comparison with new genes not yet annotated
31 new genes; 28 were in the draft sequence; 19 were identified in IGI/IPI
Sensitivity = 68% (19/28)
Specificity = 61% (19/31)
Fragmentation = 1.4 (27/19)

Genes    Predictions per gene    Genes x predictions
14       1                       14
3        2                       6
1        3                       3
1        4                       4
19       (total)                 27

Thirty-one genes discovered in the meantime but not yet annotated were taken; 28 of them were present in the draft and 19 had been predicted. Fragmentation measures genes that correspond to more than one prediction.

Comparative proteome analysis
Compared with the two invertebrates, humans appear to have many proteins involved in cytoskeleton, defence and immunity, and transcription and translation. These expansions are clearly related to aspects of vertebrate physiology. Humans also have many more proteins that are classified as falling into more than one functional category (426 in human versus 80 in worm and 57 in fly, data not shown). Interestingly, 32% of these are transmembrane receptors.

Probable horizontal transfer
An interesting category is a set of 223 proteins that have significant similarity to proteins from bacteria, but no comparable similarity to proteins from yeast, worm, fly and mustard weed, or indeed from any other (nonvertebrate) eukaryote. These sequences should not represent bacterial contamination in the draft human sequence, because the sequences were filtered to eliminate those essentially identical to known bacterial plasmid, transposon or chromosomal DNA (such as the host strains for the large-insert clones). To investigate whether these were genuine human sequences, PCR primers were designed for 35 of these genes, and most could be readily detected directly in human genomic DNA.
Orthologues of many of these genes have also been detected in other vertebrates. A more detailed computational analysis indicated that at least 113 of these genes are widespread among bacteria, but, among eukaryotes, appear to be present only in vertebrates. It is possible that the genes encoding these proteins were present in both early prokaryotes and eukaryotes, but were lost in each of the lineages of yeast, worm, fly, mustard weed and, possibly, from other nonvertebrate eukaryote lineages. A more parsimonious explanation is that these genes entered the vertebrate (or prevertebrate) lineage by horizontal transfer from bacteria. Many of these genes contain introns, which presumably were acquired after the putative horizontal transfer event.

[Figure: flow diagram for the sequencing pipeline]

Library construction and sequencing
“Celera believed that the initial version of a completed human genome should be a composite derived from multiple donors of diverse ethnic backgrounds. Prospective donors were asked, on a voluntary basis, to self-designate an ethnogeographic category (e.g., African-American, Chinese, Hispanic, Caucasian, etc.). We enrolled 21 donors.”
• From females, 130 ml of whole, heparinized blood was collected.
• From males, 130 ml of whole, heparinized blood was collected, as well as five specimens of semen, collected over a 6-week period.
• Permanent lymphoblastoid cell lines were created by Epstein-Barr virus immortalization.
DNA from five subjects was selected for genomic DNA sequencing: two males and three females (one African-American, one Asian-Chinese, one Hispanic-Mexican, and two Caucasians). DNA from each donor was used to construct plasmid libraries in one or more of three size classes: 2 kbp, 10 kbp, and 50 kbp.
This was done at the Celera facility, which occupied about 30,000 square feet of laboratory space and produced sequence data continuously at a rate of 175,000 total reads per day. After quality and vector trimming, the average trimmed sequence length was 543 bp, and the sequencing accuracy was exponentially distributed with a mean of 99.5% and with less than 1 in 1000 reads being less than 98% accurate.
“We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9 billion bp of sequence). Sequencing and tracking from both ends of plasmid clones from 2-, 10-, and 50-kbp libraries were essential to the computational reconstruction of the genome. Our evidence indicates that the accurate pairing rate of end sequences was greater than 98%.”

Data used for the Celera assembly
Two independent sets of data were used for the assemblies.
• The first was a random shotgun data set of 27.27 million reads of average length 543 bp produced at Celera. This consisted largely of mate-pair reads from 16 libraries constructed from DNA samples taken from five different donors. Libraries with insert sizes of 2, 10, and 50 kbp were used. Assuming a genome size of 2.9 Gbp, the Celera trimmed sequences gave a 5.13X coverage of the genome, and clone coverage was 3.42X, 16.40X, and 18.84X for the 2-, 10-, and 50-kbp libraries, respectively, for a total of 38.7X clone coverage.
• The second data set was from the publicly funded Human Genome Project (PFP) and is primarily derived from BAC clones. For the whole-genome assembly, the PFP data was first disassembled or “shredded” into a synthetic shotgun data set of 550-bp reads that form a perfect 2X covering of the bactigs.
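The shredding step can be pictured as cutting each bactig into fixed-length reads whose start positions advance by half a read length, so that every interior base is covered twice. The sketch below works under that assumption only; the exact tiling scheme used by Celera is not given in these notes, and the function name `shred` and the toy sizes are hypothetical.

```python
def shred(seq: str, read_len: int = 550, coverage: int = 2) -> list[str]:
    """Cut seq into overlapping fixed-length 'faux' reads. Read starts
    advance by read_len // coverage, giving roughly `coverage`-fold depth."""
    if read_len <= 0 or coverage <= 0:
        raise ValueError("read_len and coverage must be positive")
    if len(seq) <= read_len:
        return [seq]
    step = max(1, read_len // coverage)
    last_start = len(seq) - read_len
    reads = [seq[s:s + read_len] for s in range(0, last_start + 1, step)]
    # Add a final read flush with the end if the tiling stopped short
    if last_start % step != 0:
        reads.append(seq[last_start:])
    return reads


# Toy bactig of 20 bp, 4-bp reads at 2X: starts at 0, 2, 4, ..., 16
toy_reads = shred("ACGT" * 5, read_len=4, coverage=2)
print(len(toy_reads), toy_reads[:3])
```

Because the faux reads carry no information about how the bactig was originally put together, an assembler given only these reads is free to reassemble the region differently.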
The shredding resulted in 16.05 million “faux” reads that were sufficient to cover the genome 2.96X because of redundancy in the BAC data set, without incorporating the biases inherent in the PFP assembly process.

Assembly strategies and genome characterization
Two different approaches to assembly were pursued:
• a whole-genome assembly process that used Celera data and the PFP data in the form of additional synthetic shotgun data
• a compartmentalized assembly process that first partitioned the Celera and PFP data into sets localized to large chromosomal segments and then performed ab initio shotgun assembly on each set. (clone coverage)

Whole Genome Assembly (WGA)
An adaptation of the algorithm developed for Drosophila to a more complicated genome such as the human one. The combined data set of 43.32 million reads, and all associated mate-pair information, were then subjected to our whole-genome assembly algorithm to produce a reconstruction of the genome. Neither the location of a BAC in the genome nor its assembly of bactigs was used in this process. Bactigs were shredded into reads because we found strong evidence that 2.13% of them were misassembled.
The assembler consists of five main stages:
1- SCREENER
2- OVERLAPPER
3- UNITIGGER/DISCRIMINATOR
4- SCAFFOLDER
5- REPEAT RESOLVER

1- SCREENER: finds and marks all microsatellites with a repeat unit of less than 6 bp, and finds all known repeated elements (Alu, LINE and ribosomal DNA);
2- OVERLAPPER: compares all reads in search of an end-to-end overlap of at least 40 bp with no more than 6% differences in the match;
3- UNITIGGER/DISCRIMINATOR: removes the overlaps due to repeated regions.
UNITIGGER: forms unitigs (UNIquely assembled conTIGS), that is, contigs formed from overlapping reads whose overlaps are not contested by other reads (true overlaps, not false ones due, for example, to repeated regions). Unfortunately, although empirically many of these assemblies are correct (and thus involve only true overlaps), some are in fact collections of reads from several copies of a repetitive element that have been overcollapsed into a single subassembly. However, the overcollapsed unitigs are easily identified because their average coverage depth is too high to be consistent with the overall level of sequence coverage.
DISCRIMINATOR: distinguishes between true overlaps and overlaps due to repeated elements.

Screener
...finds and “masks” microsatellite repeats, known repeated regions and ribosomal DNA:
- “masked” regions are not used to make contigs,
- the rest is “marked” for overlapping.
[Figure: one example read shown three times, labelled read, masked and marked; the highlighting that distinguished the three versions is lost. The read is:]
atgacttacttactgcatatttatttatttatttatttatttatttatttatttat
ttatttatttatttatttatttgacgtgtacgtgtacgtgtagctgtacgtgtacg
tgacgggccgcattatcgtgatgctacgtgtacgttatatctgatcgtgcatgtga

Overlapper
...looks for end-to-end overlaps of at least 40 bp with no more than 6% differences in the match:
<--tactgtacgtagctgtgatgttcctcggatatagcgggcatatttattacgctattgtacgtgt-3'
                   5'-gttcctcggatatagcgggcatatttattacgctattgtacgtgtaaagtatcgt-->
> 40 bp, < 6% mismatch
What's the significance? ...a one in 10^17 event, given perfect randomness.

Unitigger
...differentiates between a true overlap and an overlap that includes more than one locus (over-collapsed).
...in a world where real data matches expected data, each locus would have 8X coverage,
...if there are genomic repeats, then sequences would be “over-represented”: on average, 8 more per repeat, per contig.
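The overlapper criterion stated earlier (an end-to-end overlap of at least 40 bp with no more than 6% differences) can be written down directly for a single pair of reads. The sketch below is a brute-force illustration, not Celera's implementation: it counts substitutions only (no insertions or deletions), checks one orientation, and the function name `best_overlap` is made up. The two test reads are the ones shown on the Overlapper slide.

```python
def best_overlap(a: str, b: str, min_len: int = 40, max_diff: float = 0.06):
    """Longest overlap between a suffix of `a` and a prefix of `b` that is
    at least min_len long with at most max_diff fraction of mismatches.
    Returns (length, mismatches), or None if no overlap qualifies."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # Try the longest candidate first so the first hit is the best one
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length, mismatches
    return None


read1 = "tactgtacgtagctgtgatgttcctcggatatagcgggcatatttattacgctattgtacgtgt"
read2 = "gttcctcggatatagcgggcatatttattacgctattgtacgtgtaaagtatcgt"
print(best_overlap(read1, read2))
```

Comparing every pair of reads this way is quadratic in the number of reads, so a production overlapper would need an index or seed-and-extend strategy to handle tens of millions of reads; that part is outside this sketch.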
Discriminator
- DISCRIMINATOR: using sequence outside the overlap region, it identifies and removes the reads that belong to different contigs, generating the U-unitigs.

4- SCAFFOLDER: joins U-unitigs into scaffolds using the information from mate-pair reads
- information from the 2-10 kb libraries is used to pair the U-unitigs; at least 2 mate pairs are required for an accurate pairing
- then information from the 50-kb libraries and BAC ends is used to join the scaffolds further

5- REPEAT RESOLVER:
ROCKS substage: unitigs with a good but not definitive DISCRIMINATOR score can fill gaps if supported by more than 2 mate pairs
STONES substage: unitigs with a good but not definitive DISCRIMINATOR score can fill gaps if supported by at least 1 mate pair
GAP WALKING substage: the remaining gaps are covered with BAC-end data
Gaps are filled with the reads of mate pairs. With stones, it is enough for one read to fall in a contig for the other read to be considered good for filling the gap.

For the assembly operations, the total compute infrastructure consisted of 10 four-processor SMPs with 4 gigabytes of memory per cluster and a 16-processor NUMA machine with 64 gigabytes of memory. The total compute for a run of the assembler was roughly 20,000 CPU hours. The assembly of Celera's data, together with the shredded bactig data, produced a set of scaffolds totaling 2.848 Gbp in span and consisting of 2.586 Gbp of sequence.
[Figure: WGA result]

Compartmentalized Shotgun Assembly (CSA)
In addition to the WGA approach, we pursued a localized assembly approach that was intended to subdivide the genome into segments, each of which could be shotgun assembled individually. We expected that this would help in resolution of large interchromosomal duplications and improve the statistics for calculating U-unitigs.
The compartmentalized assembly process involved clustering Celera reads and bactigs into large, multiple-megabase regions of the genome (components), and then running the WGA assembler on the Celera data and shredded, faux reads obtained from the bactig data to ensure an independent ab initio assembly of the component. By subsetting the data in this way, the overall computational effort was reduced and the effect of interchromosomal duplications was ameliorated.

The 27.27 million Celera reads were mapped onto the HGP bactigs with MATCHER and then assembled together with the BACs by the COMBINING ASSEMBLER.

MATCHER
Of Celera’s 27.27 million reads, 20.76 million matched a bactig and another 0.62 million reads, which did not have any matches, were nonetheless identified as belonging in the region of the bactig’s BAC because their mate matched the bactig.

COMBINING ASSEMBLER
The quality of the partitioning into components was crucial so that different genome regions were not mixed together. We constructed components from (i) the longest scaffolds of the sequence from each BAC and (ii) assembled scaffolds of data unique to Celera’s data set. The BAC assemblies were obtained by a combining assembler that used the bactigs and the 5X Celera data mapped to those bactigs as input. The 5.89 million Celera fragments not matching the GenBank data were assembled with the whole-genome assembler. The assembly resulted in a set of scaffolds totaling 442 Mbp in span and consisting of 326 Mbp of sequence. More than 20% of the scaffolds were >5 kbp long, and these averaged 63% sequence and 27% gaps with a total of 302 Mbp of sequence. All scaffolds >5 kbp were forwarded along with all scaffolds produced by the combining assembler to the subsequent tiling phase. At this stage, they typically had one or two scaffolds for every BAC region constituting at least 95% of the relevant sequence, and a collection of disjoint Celera-unique scaffolds.
The next step in developing the genome components was to determine the order and overlap tiling of these BAC and Celera-unique scaffolds across the genome. For this, they used Celera’s 50-kbp mate-pair information, BAC-end pairs, and sequence tagged site (STS) markers to provide long-range guidance and chromosome separation. The result of this process was a collection of “components,” where each component was a tiled set of BAC and Celera-unique scaffolds that had been curator-approved. The process resulted in 3845 components with an estimated span of 2.922 Gbp.

Finally, each component was assembled with the WGA algorithm. As was done in the WGA process, the bactig data were shredded into a synthetic 2X shotgun data set in order to give the assembler the freedom to independently assemble the data. By using faux reads rather than bactigs, the assembly algorithm could correct errors in the assembly of bactigs and remove chimeric content in a PFP data entry. In effect, the previous steps in the CSA process served only to bring together Celera fragments and PFP data relevant to a large contiguous segment of the genome, wherein we applied the assembler used for WGA to produce an ab initio assembly of the region.

WGA assembly of the components resulted in a set of scaffolds totaling 2.906 Gbp in span and consisting of 2.654 Gbp of sequence. 22% of reads were not incorporated into the assembly. More than 90.0% of the genome was covered by scaffolds spanning >100 kbp.

Comparison of the WGA and CSA scaffolds
2,218 WGA scaffolds were compared with 1,714 CSA scaffolds.
- Consistency of coverage was evaluated
Result:
1.982 Gbp of the WGA is covered by the CSA (95%).
2.169 Gbp of the CSA is covered by the WGA (87.69%).
- Inconsistencies in the order and orientation of the scaffolds were also evaluated
Result:
2.1 Mbp (0.11%) of the WGA assembly is inconsistent with the CSA
295 kbp (0.012%) of the CSA assembly is inconsistent with the WGA
CSA can be considered better than WGA in terms of coverage and consistency.

Mapping scaffolds to the genome
Key steps:
1) Grouping of the scaffolds on the basis of their order in the CSA “components”
2) Mapping of the scaffold groups onto the chromosomes on the basis of physical maps
Maps used:
- STS maps [GeneMap99]
- BAC clone fingerprint map [WashU]
Two scaffold orders were created from the two maps; comparing them identifies different types of scaffold ordering:
- Anchor scaffolds: the order of the scaffolds is the same in both maps (70% of the genome)
- Ordered scaffolds: some scaffolds defined as unmapped on the basis of WashU were ordered on the basis of GM99 (13.9% of the genome)
- Bounded scaffolds: scaffolds that can be placed but not ordered between anchors; they were only assigned to the interval between the anchored scaffolds
Result: 98% of the genome falls in the three scaffold types listed.

Localization of the scaffolds on the chromosomes
Problems were encountered, mainly because 978 BACs turned out to be chimeric, and because of pseudogenes, genomic duplications, and so on: the assembly needs to be validated!

The final step in assembling the genome was to order and orient the scaffolds on the chromosomes. First, scaffolds were grouped together on the basis of their order in the components from CSA. Next, the scaffold groups were mapped onto the chromosomes using physical mapping data. There were two genome-wide types of map information available: high-density STS maps and fingerprint maps of BAC clones developed at Washington University.
Among the genome-wide STS maps, GeneMap99 (GM99) has the most markers and therefore was most useful for mapping scaffolds.
1. Where the order of scaffolds agreed between GM99 and the WashU BAC map, these scaffolds were termed “anchor scaffolds.” 70.1% of the genome was in anchored scaffolds, more than 99% of which are also oriented.
2. Scaffolds anchored with GM99 but not with the WashU map (because of occasional WashU global ordering discrepancies) were termed “ordered scaffolds.” 13.9% of the genome was in ordered scaffolds, and thus 84.0% of the genome was ordered unambiguously.
3. All scaffolds that could be placed, but not ordered, between anchors were assigned to the interval between the anchored scaffolds and were deemed to be “bounded” between them. These scaffolds were termed “bounded scaffolds.”
Using the above approaches, > 98% of the genome was anchored, ordered, or bounded. Finally, a location for each scaffold placed on the chromosome was assigned by spreading out the scaffolds per chromosome.

Gene Prediction and Annotation
To enumerate the gene inventory, Celera developed an integrated, evidence-based approach named Otto. The evidence used to increase the likelihood of identifying genes included:
• regions conserved between the mouse and human genomes
• similarity to ESTs or other mRNA-derived data
• similarity to other proteins.
1. Otto can promote observed evidence to a gene annotation in one of two ways. First, if the evidence includes a high-quality match to the sequence of a known gene (a RefSeq gene), then Otto can promote this to a gene annotation. Second, Otto evaluates a broad spectrum of evidence and determines if this evidence is adequate to support promotion to a gene annotation. Otto predicted 17,968 genes.
2. Otto-predicted genes were complemented with a set of genes from three gene-prediction programs that exhibited weaker, but still significant, evidence that they may be expressed.
Conservative criteria, requiring at least two lines of evidence, were used to define a set of 26,383 genes with good confidence.

OTTO
1. Initially, Otto searches the scaffold sequences against protein, EST, and genome-sequence databases to define regions of sequence similarity in order to predict the gene boundaries.
2. Known genes (RefSeq genes with exact matches of a full-length cDNA sequence to the genome) were identified, and the region corresponding to the cDNA was annotated as a predicted transcript. A total of 6538 genes were identified and transcripts predicted in this way.
3. Regions that have a substantial amount of sequence similarity, but do not match known genes, were analyzed by Otto using conservation between mouse and human genomic DNA, similarity to human transcripts (ESTs and cDNAs), similarity to rodent transcripts (ESTs and cDNAs), and similarity of the translation of human genomic DNA to known proteins to predict potential genes in the human genome.
4. The regions covered by homology evidence were analyzed by Genscan to see if a consistent gene model could be generated. This procedure simplified the gene-prediction task by first establishing the boundary for the gene and by eliminating regions with no supporting evidence. The final Genscan predictions were often quite different from the prediction that Genscan returned on the same region of native genomic sequence.
5. Finally, Otto compares each predicted transcript with the homology-based evidence that was used in previous steps to evaluate the depth of evidence for each exon in the prediction.
Otto predicted 11,226 additional genes by means of sequence similarity.
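Step 4 above (keeping only the regions backed by homology evidence before handing the sequence to a gene finder) can be illustrated with a small masking routine. This is a minimal sketch under stated assumptions, not Otto's code: the function name, the half-open `(start, end)` interval convention, and the use of `N` as the placeholder for unsupported bases are choices made here for illustration.

```python
from typing import List, Tuple


def mask_unsupported(seq: str,
                     evidence: List[Tuple[int, int]],
                     mask_char: str = "N") -> str:
    """Keep bases inside any evidence interval [start, end) and replace
    every other base with `mask_char`."""
    keep = [False] * len(seq)
    for start, end in evidence:
        # Clip each interval to the sequence bounds before marking it.
        for i in range(max(0, start), min(len(seq), end)):
            keep[i] = True
    return "".join(base if kept else mask_char
                   for base, kept in zip(seq, keep))


if __name__ == "__main__":
    genomic = "ACGTACGTAC"
    # Two overlapping evidence hits (hypothetical coordinates).
    hits = [(2, 5), (4, 7)]
    print(mask_unsupported(genomic, hits))
```

The masked sequence has the same length as the input, so coordinates of any gene model predicted on it still refer to the original scaffold.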
1a) Identification of HOMOLOGS OF KNOWN GENES
HIGH-QUALITY MATCH with the sequence of a known gene.
If a RefSeq transcript matches the sequence under examination over at least 50% of its length with identity >92%, the SIM4 alignment of the transcript to the genomic region is recorded as an Otto annotation.
A total of 6538 genes were identified and predicted in this way.

1b) Identification of UNKNOWN homologous GENES
Regions that have SUBSTANTIAL SEQUENCE SIMILARITY, BUT DO NOT MATCH KNOWN GENES, were analyzed by the part of the Otto system that uses a broad SPECTRUM OF EVIDENCE to predict a transcript.
Evidence generated by the computational pipeline:
- sequence similarity to other human transcripts (ESTs and cDNAs);
- similarity to known proteins;
- similarity to rodent transcripts;
- conservation between human and mouse genomic DNA.
• Regions of a sequence from a gene bin that are supported by any homology evidence were marked;
• Bases not covered by any homology evidence were replaced with N nucleotides;
• The resulting segment was re-analyzed by Genscan.
Otto predicted another 11,226 genes by means of sequence similarity.

De novo
Recognizing that the Otto system is quite conservative, a different gene-prediction strategy was used in regions where the homology evidence was less strong. This final class of predicted genes is a subset of the predictions made by the three gene-finding programs that were used in the computational pipeline. For these, there was not sufficient sequence similarity information for Otto to attempt to predict a gene structure. The three de novo gene-finding programs resulted in about 76,410 predictions that were nonredundant (non-overlapping with one another). Of these, 57,935 did not overlap known genes or predictions made by Otto.
For these genes, Celera insisted that a predicted transcript have at least two of the following types of evidence to be included in the gene set for further analysis:
1. protein
2. human EST
3. rodent EST
4. mouse genome fragment matches
Only 21,350 of the gene predictions that did not overlap Otto predictions were partially supported by at least one type of sequence similarity evidence, and 8619 were partially supported by two types of evidence (Table 8).

Gene prediction
Four kinds of evidence (conservation in 3X mouse genomic DNA, similarity to human EST or cDNA, similarity to rodent EST or cDNA, and similarity to known proteins) were considered to support gene predictions from the different methods.
The sum of this number (21,350) and the number of Otto annotations (17,968) gives 39,318, which is comparable to the upper limit for the number of genes in the human genome. If the requirement for other supporting evidence is made more stringent, this number drops:
- requiring at least two types of evidence reduces the number to about 26,000
- requiring at least three reduces it to about 23,000

In a further attempt to identify genes that were not found by the autoannotation process or any of the de novo gene finders, we examined regions outside of gene predictions that were similar to the EST sequence, and where the EST matched the genomic sequence across a splice junction. After correcting for potential 3′ UTRs of predicted genes, about 2500 such regions remained. Adding a requirement for at least one further evidence type (homology to mouse genomic sequence fragments, rodent ESTs, or cDNAs, or similarity to a known protein) reduced this number to 1010. Adding this to the numbers from the previous paragraph would give us estimates of about 40,000, 27,000, and 24,000 potential genes in the human genome, depending on the stringency of evidence considered.

Ohno’s postulate
The mammalian genome is compared to a desert with small oases, the genes.
(1985)
Desert: > 500 kbp without a gene
- 605 Mbp, 20% of the genome, is “desert”
- gene-rich chromosomes 17, 19, and 22: 12% of 171 Mbp
- gene-poor chromosomes 4, 13, 18, and X: 27.5% of 492 Mbp

Ohno’s postulate: Mammalian genomes consist of oases of genes in otherwise essentially empty deserts. (1985)
How valid is Ohno’s postulate? It appears that the human genome does indeed contain deserts, or large, gene-poor regions. If we define a desert as a region > 500 kbp without a gene, then we see that 605 Mbp, or about 20% of the genome, is in deserts. These are not uniformly distributed over the various chromosomes. Gene-rich chromosomes 17, 19, and 22 have only about 12% of their collective 171 Mbp in deserts, whereas gene-poor chromosomes 4, 13, 18, and X have 27.5% of their 492 Mbp in deserts (Table 11).

SNPs
Class of genomic region      Size of region examined (Mb)   SNP density in Celera-PFP (SNP/Mb)
Intergenic regions           2185                           707
Genes (introns + exons)      646                            917
Introns                      615                            921
  first intron               164                            808
Exons                        31                             529
  first exon                 10                             592

About 3 million SNPs were identified.
Most SNPs lie in non-coding regions.
The SNP rate is higher in introns than in intergenic regions.
SNPs located in exons can cause silent or missense changes.
Changes with a significant effect on proteins are a very small fraction of the SNPs (< 1%).

The 1000 Genomes Project
The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. The Project currently plans to sequence each sample to about 4X coverage; at this depth sequencing cannot provide the complete genotype of each sample, but should allow the detection of most variants with frequencies as low as 1%. Combining the data from 2500 samples should allow highly accurate estimation (imputation) of the variants and genotypes for each sample that were not seen directly by the light sequencing.
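The trade-off described above (incomplete genotypes per sample, but good detection across many samples) can be made concrete with a back-of-the-envelope calculation. The sketch below is not the Project's method: it assumes read depth at a site follows a Poisson distribution, that each chromosome copy receives half of the reads on average, and that a variant counts as detectable once a minimum number of supporting reads is seen. All function names are introduced here for illustration.

```python
import math


def p_variant_seen_in_het(depth: float) -> float:
    """Probability that a heterozygous carrier sequenced at `depth` has at
    least one read showing the variant allele (Poisson mean depth / 2)."""
    return 1.0 - math.exp(-depth / 2.0)


def expected_variant_reads(n_samples: int, allele_freq: float,
                           depth: float) -> float:
    """Expected number of variant-carrying reads pooled over all samples:
    2 * n_samples chromosomes, each covered by depth / 2 reads on average."""
    return 2 * n_samples * allele_freq * depth / 2.0


def p_at_least_k(mean_reads: float, k: int) -> float:
    """Poisson probability of observing at least k reads."""
    p_fewer = sum(math.exp(-mean_reads) * mean_reads ** i / math.factorial(i)
                  for i in range(k))
    return 1.0 - p_fewer


if __name__ == "__main__":
    depth, samples, freq = 4.0, 2500, 0.01
    print(f"carrier shows the variant: {p_variant_seen_in_het(depth):.3f}")
    pooled = expected_variant_reads(samples, freq, depth)
    print(f"expected pooled variant reads: {pooled:.0f}")
    print(f"P(at least 2 pooled reads): {p_at_least_k(pooled, 2):.6f}")
```

Under these assumptions a single carrier misses the variant allele a noticeable fraction of the time, while the pooled data set contains many supporting reads for a 1% variant, which is why the missing per-sample genotypes are recovered by imputation rather than by deeper sequencing.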
Disease researchers will use 1000 Genomes data in two ways:
• They will combine the 1000 Genomes data with the genotype data in their disease GWA study to impute the genotypes in their samples for millions of additional variants beyond those they genotyped directly. They will do this computationally, with no genotyping cost. The additional genotype data will allow the researchers to localize the disease-associated regions more precisely.
• They will compare the frequency of the variant with the frequency of the disease to see if they are compatible.
The samples for the 1000 Genomes Project mostly are anonymous and have no associated medical or phenotype data.

Variant Frequency in the population