The PAISA’ Project
Marco Baroni
HLT for Italian Workshop
PAISA’
Coming soon at www.corpusitaliano.it
I
Piattaforma per l’Apprendimento dell’Italiano Su Corpora
Annotati
I
MIUR FIRB project 2009-2012
Partners:
I
I
I
I
I
University of Bologna: Sergio Scalise, Claudia Borghetti,
Emiliano Guevara
University of Trento: Marco Baroni, Marco Brunello, Sara
Castagnoli, Egon Stemle
ILC, Pisa: Vito Pirrelli, Alessandro Lenci, Felice Dell’Orletta
EURAC, Bolzano: Andrea Abel, Verena Lyding, Christopher
Culy
Goals
I
Build and annotate Web-derived “reference” corpus of
Italian language
I
Innovative corpus visualization and exploration tools
I
Applications in Italian language and culture teaching
The WaCky initiative
http://wacky.sslmit.unibo.it
I
I
Very large Web-crawled, automatically cleaned and
annotated corpora for English, French, German and Italian
(2005-2009)
Lessons learnt:
I
I
I
I
Copyright
Corpus composition
Cleaning
Annotating Web text
Copyright
I
PAISA’ corpus constructed by crawling Web pages
published under CreativeCommons License only:
I
I
I
I
Attribution
Derived works OK
Non-commercial derived works only
Same licensing propagates to PAISA’ corpus
Corpus composition
0
50000
100000
150000
200000
250000
470K documents
wikipedia
blogs
news
edu
books
other
Web cleaning:
The boilerplate problem
Web cleaning on steroids: krdwrd
Cues from text
Main Index Page
General Ratings Page
One true sign of a truly great band is when said band
ardently defies categorisation, that is, when for every
"well, they sound like this reggae-influenced heavy metal
band playing avantgarde bebop" remark you can have
yourself a "funny, I thought they were this raw punk
outfit doing acoustic folk" counterproposal. And I don’t
simply mean "being diverse" here, I mean "being different".
Blazing off every colour of the spectrum. Baring one’s
soul in all of its existing aspects. That sort of thing.
READER COMMENTS SECTION
Return to the Index page! NOW!
Web cleaning on steroids: krdwrd
Cues from the DOM tree (figure from http://www.w3schools.com/)
Web cleaning on steroids: krdwrd
Cues from visual rendering
Web cleaning on steroids
Cross-validation on gold standard
wacky heuristics
text cues only
DOM cues only
visual cues only
full krdwrd
features
NA
21
13
8
42
precision
80%
92%
89%
90%
93%
recall
99%
93%
91%
93%
92%
F
88%
92%
90%
91%
92%
Corpus comparison
I
ITWAC: 1.9B tokens
I
PAISA’: 700M tokens
I
La Repubblica: 380M tokens
PAISA’ vs. La Repubblica
nouns
PAISA’
punto
blog
commento
settembre
guida
articolo
galleria
video
set
pubblicità
Rep.
miliardo
anno
presidente
governo
ministro
lira
paese
giorno
partito
stato
verbs
PAISA’
registrare
pubblicare
leggere
segnare
inserire
riservare
inviare
usare
contattare
mostrare
Rep.
dire
chiedere
andare
spiegare
arrivare
restare
parlare
sembrare
cominciare
mettere
adjs
PAISA’
informatico
simile
esterno
hi-tech
opzionale
famoso
successivo
gay
digitale
bello
Rep.
scorso
politico
pubblico
primo
stesso
grande
generale
finanziario
americano
lungo
PAISA’ vs. ITWAC
nouns
PAISA’
punto
blog
commento
guida
settembre
galleria
set
video
pubblicità
e-mail
ITWAC
numero
articolo
legge
comma
servizio
lavoro
presidente
decreto
commissione
caso
verbs
PAISA’
registrare
pubblicare
segnare
leggere
riservare
commentare
mostrare
contattare
continuare
deselezionare
adjs
ITWAC
svolgere
chiedere
ritenere
presentare
dire
tenere
riguardare
andare
consentire
prevedere
PAISA’
ultimo
pubblicitario
hi-tech
simile
opzionale
esterno
informatico
famoso
nuovo
grande
ITWAC
pubblico
regionale
presente
sociale
previsto
nazionale
necessario
generale
seguente
legislativo
Ongoing work
I
Annotation (POS tagging, dependency parsing. . . )
I
Classification by genre and topic
I
Visualization, user-friendly interface
I
E-mail me ([email protected]) if you want a
copy of the raw corpus!