GENOMICS
Using the Genome Browser to annotate
genomic sequences.
UCSC
University of California Santa Cruz
Genome Browser
Dr. Inga Prokopenko
Genome sequencing:
assembling the available sequences
A sequence is called “finished” when its error rate is
below 1 in 10,000 bases and it has no gaps.
The Human Genome Project was complex from a
technical standpoint and also from a computational one.
The output of a single sequencing reaction (a read) is
500-800 bp. All the individual fragments had to be
assembled into a single linear string.
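To make the assembly idea concrete, here is a toy sketch of merging overlapping reads into one string. It assumes error-free reads from a single strand and uses a simple greedy strategy; the function names and the `min_len` parameter are illustrative only, and real assemblers are far more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two reads with the largest overlap.

    Returns the remaining contigs: one string if everything joined up.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))
```

With real reads of 500-800 bp, sequencing errors and repeats make the overlaps ambiguous, which is part of why the problem was computationally hard.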
Biological Databases
Genome annotation is an active area of research
involving many organizations that publish their
results in biological databases available to the whole
community:
ENCyclopedia Of DNA Elements (ENCODE)
Entrez Gene
Ensembl
Gene Ontology Consortium
GeneRIF
RefSeq
Uniprot
Vertebrate Genome Annotation Project (Vega)
Genome browsers
help visualize complete annotated genomes (including
genes and their structures, proteins, expression, regulation,
variation and comparative analyses)
There are multiple annotation resources.
Genome Browsers
UCSC Genome Bioinformatics Genome Browser and
Tools (UCSC)
NCBI-UniGene (National Center for Biotechnology
Information)
Ensembl The Ensembl Genome Browser (Sanger
Institute and EBI)
Copyright OpenHelix. No use or reproduction without express written consent
3 billion bases in plain-text format = no use at all
What is needed:
• Annotation of the information in the sequence
• The ability to quickly retrieve the sequence of
specific regions of the genome based on
• Information content
• Sequence features
UCSC Genome Browser
A system for “navigating” genome sequence and
annotation, which lets you view the information at
“different magnifications” and retrieve portions of
sequence together with their annotation, such as:
Known genes and predicted genes
ESTs, mRNAs
CpG islands
Assembly gaps and coverage, chromosome bands
Homology with other genomes
…
Available
genomes
Human
Homo sapiens
assembly
• 99% of gene-containing
regions
• 99.99% accuracy
• 2.84 Gb finished, “highly
contiguous”
Species
A. gambiae
A. mellifera
C. briggsae
C. elegans
C. intestinalis
Chicken
Chimp
Cow
D. ananassae
D. erecta
D. grimshawi
D. melanogaster
D. mojavensis
D. persimilis
D. pseudoobscura
D. sechellia
D. simulans
D. virilis
Organization of genomic data…
Annotation Tracks
Sequence
Genome backbone: base position number
chromosome band
sts sites
gap locations
known genes
Links out to
more data
predicted genes
microarray/expression data
evolutionary conservation
SNPs
repeated regions
more…
The UCSC Home page: genome.ucsc.edu
navigate
General information
navigate
Specific information:
new features, current status, etc.
A sample of what we will find:
The Genome Browser Gateway
start page, basic search
text/ID
searches
Helpful search examples, suggestions below
Use this Gateway to search by:
Gene names, symbols
Chromosome number: chr7, or region: chr11:1038475-1075482
Keywords: kinase, receptor
IDs: NP, NM, OMIM, and more…
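As a small illustration of the position syntax, the sketch below parses strings such as `chr7` or `chr11:1038475-1075482` into their parts. The regular expression and the function name `parse_position` are assumptions made for this example, not the browser's own parser; it only handles the chromosome and chromosome:start-end forms, not gene names, keywords or IDs.

```python
import re

# Assumed pattern: "chrN" optionally followed by ":start-end" (commas allowed).
POSITION_RE = re.compile(r"^(chr[0-9A-Za-z_]+)(?::([\d,]+)-([\d,]+))?$")


def parse_position(text: str):
    """Return (chrom, start, end); start and end are None for a whole chromosome."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chromosome or region: {text!r}")
    chrom, start, end = match.groups()
    if start is None:
        return chrom, None, None
    start_i = int(start.replace(",", ""))
    end_i = int(end.replace(",", ""))
    if start_i < 1 or end_i < start_i:
        raise ValueError(f"invalid coordinates in {text!r}")
    return chrom, start_i, end_i


if __name__ == "__main__":
    chrom, start, end = parse_position("chr11:1038475-1075482")
    # Browser positions are 1-based and inclusive, hence the +1.
    print(chrom, start, end, "length:", end - start + 1)
    print(parse_position("chr7"))
```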
UCSC Genome Browser
Many ways to search for a specific region:
• chr7
an entire chromosome
• 20p13
a region (band p13 of chr. 20)
• chr3:1-1000000
the first million bases of chr. 3, starting from the p-arm telomere (ptel)
• D16S3046
the region around the marker (100,000 bases on each side)
• RH18061;RH80175
the region between the two markers
• AA205474
the genomic region that aligns with the sequence with
this GenBank accession number
• PRNP
the region of the genome that contains the PRNP gene
• NM_017414
the region of the genome with this RefSeq identifier
• NP_059110
the region of the genome with this protein accession number
• 11274 (LLID)
Or for lists of regions:
• pseudogene mRNA
Lists transcribed pseudogenes, but not cDNAs
• homeobox caudal
Lists mRNAs for caudal homeobox genes
• zinc finger
Lists many zinc finger mRNAs
• huntington
Lists candidate genes associated with Huntington's
disease
The Genome Browser Gateway
start page choices, February 2005
Make your Gateway choices:
1. Select clade
2. Select species: search 1 species at a time
3. Assembly: the official backbone DNA sequence
4. Position: location in the genome to examine
5. Image width: how many pixels in display window; 5000 max
6. Configure: make fonts bigger + other choices
The Genome Browser Gateway
sample search for Human BRCA1
Sample search: human, May 2004 assembly, BRCA1
select
• Often you will have to select the
right gene from a results list
• Sometimes you will go directly
to a browser image (use an ID)
• AF005068, breast cancer 1, early onset
Overview of the whole
Genome Browser page
(first day, new human release)
Genome viewer section
Track and image controls
(day 1 = 40 tracks)
Overview of the whole
Genome Browser page
(mature release)
Genome viewer section
Groups of data
Mapping and Sequencing Tracks
Genes and Gene Prediction Tracks
mRNA and EST Tracks
Expression and Regulation
Comparative Genomics
ENCODE Tracks
Variation and Repeats
Different species, different tracks, same software
Sample Genome Viewer image, BRCA1 region
Genome backbone
STS markers
Known genes
RefSeq genes
MGC clones
Gene predictions
GenBank mRNAs
GenBank ESTs
conservation
SNPs
repeats
Annotation Track options, defined
Hide: removes a track from view
Dense: all items collapsed into a single line
Squish: each item = separate line, but 50% height + packed
Pack: each item separate, but efficiently stacked (full height)
Full: each item on separate line
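These display modes can also be set from a link. The sketch below builds a Genome Browser URL in which each track name is paired with one of the five modes. The `hgTracks` address and the `db`, `position` and `trackName=mode` parameters follow the browser's URL conventions as far as I know; the assembly name `hg17` and the track names `knownGene` and `snp` are assumptions chosen for illustration, so check them against the assembly you actually use.

```python
from urllib.parse import urlencode

# The five display modes defined on this slide.
VISIBILITIES = ("hide", "dense", "squish", "pack", "full")


def browser_url(db: str, position: str, tracks: dict[str, str]) -> str:
    """Build a Genome Browser link that sets a display mode per track."""
    for name, mode in tracks.items():
        if mode not in VISIBILITIES:
            raise ValueError(f"unknown display mode {mode!r} for track {name!r}")
    params = {"db": db, "position": position, **tracks}
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + urlencode(params)


if __name__ == "__main__":
    # Assumed assembly and track names; the region is the Gateway example.
    print(browser_url("hg17", "chr11:1038475-1075482",
                      {"knownGene": "pack", "snp": "dense"}))
```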
Clicking an annotation line,
new page of detailed information
You will get detail for that single item you click
Example: click on the BRCA1 Black “Known Genes” line
Click the line
New
web page
opens
Many details
and links
to more data
about BRCA1
informative
description
Click annotation track = BRCA1
“Known gene” detail page
other resource links
links to sequences
microarray data
Not all genes have this much detail.
Different annotation tracks carry different detail data.
mRNA secondary structure
protein domains/structure
homologs in other species
SNP
detail page
sample
Gene Ontology™ descriptions
mRNA descriptions
pathways
Getting the sequences
Get DNA, with Extended Options; or Details pages
Use the DNA link at
the top
Plain or Extended
options
Change colors,
fonts, etc.
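Sequence can also be fetched without the web page. The sketch below only builds a request address for the UCSC REST API `getData/sequence` endpoint, which is newer than these slides; using it here is my own addition, and the helper name `sequence_url` and the assembly name `hg17` are assumptions. The API takes a 0-based start, so the 1-based browser start is shifted by one.

```python
def sequence_url(genome: str, chrom: str, start_1based: int, end_1based: int) -> str:
    """Build a REST API address for the DNA of a browser-style region.

    Browser positions are 1-based and inclusive; the API expects a 0-based
    start and an exclusive end, so only the start needs adjusting.
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("invalid region")
    params = {
        "genome": genome,
        "chrom": chrom,
        "start": start_1based - 1,
        "end": end_1based,
    }
    # The API documentation separates arguments with semicolons.
    query = ";".join(f"{key}={value}" for key, value in params.items())
    return "https://api.genome.ucsc.edu/getData/sequence?" + query


if __name__ == "__main__":
    # Assumed assembly name; the region is the Gateway example.
    print(sequence_url("hg17", "chr11", 1038475, 1075482))
```

Opening the printed address in a browser should return JSON containing the sequence, provided the assembly is available through the API.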
Accessing the BLAT tool
BLAT = BLAST-like Alignment Tool
Rapid searches by INDEXING the entire genome
Works best with high similarity matches
BLAT: BLAST-like Alignment Tool
BLAT is very fast: it has been optimized to search
whole genomes more quickly than BLAST does.
UCSC has built an INDEX of the genome's 11-mers for
DNA and 4-mers for protein (that is, stretches of 11
nucleotides or 4 amino acids).
BLAT looks up the k-mers of your query in that index, finds
a match, and works outward from there.
BLAST does it the other way round: it indexes your query and
then runs that smaller index over everything. That is the
essential difference between the algorithms.
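The index-then-extend idea can be sketched in a few lines. This is a toy version, not BLAT's actual implementation: it indexes every overlapping k-mer, extends only exact ungapped matches, and ignores repeats, mismatches and introns. The function names are illustrative.

```python
from collections import defaultdict


def build_index(genome: str, k: int = 11) -> dict[str, list[int]]:
    """Map every k-mer in the genome to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def seed_and_extend(query: str, genome: str, index, k: int = 11):
    """Look up each query k-mer in the genome index, then extend the hit.

    Returns sorted (genome_start, query_start, length) tuples for exact,
    ungapped matches.
    """
    hits = set()
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            # Extend to the left while the bases keep matching.
            g_start, q_start = g, q
            while g_start > 0 and q_start > 0 and genome[g_start - 1] == query[q_start - 1]:
                g_start -= 1
                q_start -= 1
            # Extend to the right while the bases keep matching.
            g_end, q_end = g + k, q + k
            while g_end < len(genome) and q_end < len(query) and genome[g_end] == query[q_end]:
                g_end += 1
                q_end += 1
            hits.add((g_start, q_start, g_end - g_start))
    return sorted(hits)


if __name__ == "__main__":
    genome = "GGGTTTACGGATCCAGTCAAAACC"
    index = build_index(genome)  # built once, reused for every query
    print(seed_and_extend("ACGGATCCAGTCA", genome, index))
```

The genome index is built once and reused, so each new query costs only a handful of lookups; that is the design choice the slide contrasts with BLAST.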
BLAT
UCSC documentation:
“On DNA queries, BLAT is designed to quickly find
sequences with 95% or greater similarity of length 40
bases or more. It may miss genomic sequences that are
more divergent or shorter than these minimums, although
it will find perfect sequence matches of 33 bases and
sometimes as few as 22 bases. The tool is capable of
aligning sequences that contain large introns.
On protein queries, BLAT rapidly locates genomic
sequences with 80% or greater similarity of length 20
amino acids or more. In general, gene family members
that arose within the last 350 million years can generally
be detected.”
BLAT tool overview:
www.openhelix.com/sampleseqs.html
Make
choices
Paste one
or more
sequences
DNA limit 25000 bases
Protein limit 10000 aa
25 total sequences
Or
upload
Submit
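Before pasting a batch of sequences it can help to check them against the limits listed on this slide. The sketch below assumes FASTA-formatted input and reads the DNA and protein limits as per-sequence limits; both the interpretation and the helper names are assumptions for illustration.

```python
MAX_DNA_BASES = 25_000      # "DNA limit 25000 bases"
MAX_PROTEIN_AA = 10_000     # "Protein limit 10000 aa"
MAX_SEQUENCES = 25          # "25 total sequences"


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Split FASTA text into (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            if name is None:
                name = "unnamed"  # sequence pasted without a header line
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_blat_input(text: str, is_protein: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the input fits."""
    records = parse_fasta(text)
    limit = MAX_PROTEIN_AA if is_protein else MAX_DNA_BASES
    problems = []
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences, limit is {MAX_SEQUENCES}")
    for name, seq in records:
        if len(seq) > limit:
            problems.append(f"{name}: {len(seq)} letters, limit is {limit}")
    return problems


if __name__ == "__main__":
    print(check_blat_input(">demo1\nACGTACGT\n>demo2\nGGCCGGCC\n"))
```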
BLAT results, with links
sorting
Results with demo sequences, settings default; sort = Query, Score
Score is a count of matches—higher number, better match
Click browser to go to Genome Browser image location (next slide)
Click details to see the alignment to genomic sequence (2nd slide)
BLAT results, alignment details browser
Click to flip frame
query
matches
From browser click in BLAT results
A new line with your Sequence from BLAT Search appears!
Watch out for reading frame! Click - - - > to flip frame
Base position = full and zoomed in enough to see
amino acids
BLAT results,
alignment details
Your query
Genomic match, color cues
Side-by-side alignment
In Silico PCR:
find genomic sequence using primers
Select genome
Enter primers
Minimum 15 bases
Flip reverse primer?
Submit
(note: the tool does not handle ambiguous bases at this time—don’t use Ns)
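The principle behind the tool can be sketched as a plain string search: find the forward primer, then find the reverse complement of the reverse primer downstream of it. This toy version assumes exact matches, searches only one strand of the template, and uses an illustrative `max_size` default; it is not how the UCSC tool is implemented.

```python
MIN_PRIMER_LENGTH = 15  # "Minimum 15 bases"


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence made of A, C, G and T."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def in_silico_pcr(template: str, forward: str, reverse: str, max_size: int = 4000):
    """Return (start, end, product) for each exact-match amplicon.

    `start` is 0-based and `end` is exclusive. Only the given strand of the
    template is searched, with the forward primer upstream.
    """
    template, forward, reverse = template.upper(), forward.upper(), reverse.upper()
    for primer in (forward, reverse):
        if len(primer) < MIN_PRIMER_LENGTH:
            raise ValueError("primers must be at least 15 bases long")
        if set(primer) - set("ACGT"):
            raise ValueError("ambiguous bases such as N are not handled")
    reverse_site = reverse_complement(reverse)
    products = []
    start = template.find(forward)
    while start != -1:
        site = template.find(reverse_site, start + len(forward))
        if site != -1:
            end = site + len(reverse_site)
            if end - start <= max_size:
                products.append((start, end, template[start:end]))
        start = template.find(forward, start + 1)
    return products


if __name__ == "__main__":
    fwd = "ATGCGTACGTTAGCA"
    site = "GATTACAGATTACAG"
    template = "CC" + fwd + "TTTTGGGGCCCC" + site + "AA"
    for start, end, product in in_silico_pcr(template, fwd, reverse_complement(site)):
        print(start, end, len(product), product)
```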
In Silico PCR: results
location
size
your primers
Tm for primers
Genomic location shown, links to Genome Viewer
Product size shown
Your primers displayed, flipped if necessary
Predicted genomic sequence shown
Primer melting temperatures provided
Proteome Browser
more
protein
data
Access from homepage or Known Gene pages
Exon diagram, amino acids
Many protein properties (pI, mw, composition, 3D…)
Gene Sorter
From homepage select ‘Gene sorter’
Gene Sorter interface
Sorts genes by several criteria
Gene Sorter interface
Choose from 11 sorting options
Gene Sorter results