The PAISA’ Project Marco Baroni HLT for Italian Workshop PAISA’ Coming soon at www.corpusitaliano.it I Piattaforma per l’Apprendimento dell’Italiano Su Corpora Annotati I MIUR FIRB project 2009-2012 Partners: I I I I I University of Bologna: Sergio Scalise, Claudia Borghetti, Emiliano Guevara University of Trento: Marco Baroni, Marco Brunello, Sara Castagnoli, Egon Stemle ILC, Pisa: Vito Pirrelli, Alessandro Lenci, Felice Dell’Orletta EURAC, Bolzano: Andrea Abel, Verena Lyding, Christopher Culy Goals I Build and annotate Web-derived “reference” corpus of Italian language I Innovative corpus visualization and exploration tools I Applications in Italian language and culture teaching The WaCky initiative http://wacky.sslmit.unibo.it I I Very large Web-crawled, automatically cleaned and annotated corpora for English, French, German and Italian (2005-2009) Lessons learnt: I I I I Copyright Corpus composition Cleaning Annotating Web text Copyright I PAISA’ corpus constructed by crawling Web pages published under CreativeCommons License only: I I I I Attribution Derived works OK Non-commercial derived works only Same licensing propagates to PAISA’ corpus Corpus composition 0 50000 100000 150000 200000 250000 470K documents wikipedia blogs news edu books other Web cleaning: The boilerplate problem Web cleaning on steroids: krdwrd Cues from text Main Index Page General Ratings Page One true sign of a truly great band is when said band ardently defies categorisation, that is, when for every "well, they sound like this reggae-influenced heavy metal band playing avantgarde bebop" remark you can have yourself a "funny, I thought they were this raw punk outfit doing acoustic folk" counterproposal. And I don’t simply mean "being diverse" here, I mean "being different". Blazing off every colour of the spectrum. Baring one’s soul in all of its existing aspects. That sort of thing. READER COMMENTS SECTION Return to the Index page! NOW! Web cleaning on steroids: krdwrd Cues from the DOM tree (figure from http://www.w3schools.com/) Web cleaning on steroids: krdwrd Cues from visual rendering Web cleaning on steroids Cross-validation on gold standard wacky heuristics text cues only DOM cues only visual cues only full krdwrd features NA 21 13 8 42 precision 80% 92% 89% 90% 93% recall 99% 93% 91% 93% 92% F 88% 92% 90% 91% 92% Corpus comparison I ITWAC: 1.9B tokens I PAISA’: 700M tokens I La Repubblica: 380M tokens PAISA’ vs. La Repubblica nouns PAISA’ punto blog commento settembre guida articolo galleria video set pubblicità Rep. miliardo anno presidente governo ministro lira paese giorno partito stato verbs PAISA’ registrare pubblicare leggere segnare inserire riservare inviare usare contattare mostrare Rep. dire chiedere andare spiegare arrivare restare parlare sembrare cominciare mettere adjs PAISA’ informatico simile esterno hi-tech opzionale famoso successivo gay digitale bello Rep. scorso politico pubblico primo stesso grande generale finanziario americano lungo PAISA’ vs. ITWAC nouns PAISA’ punto blog commento guida settembre galleria set video pubblicità e-mail ITWAC numero articolo legge comma servizio lavoro presidente decreto commissione caso verbs PAISA’ registrare pubblicare segnare leggere riservare commentare mostrare contattare continuare deselezionare adjs ITWAC svolgere chiedere ritenere presentare dire tenere riguardare andare consentire prevedere PAISA’ ultimo pubblicitario hi-tech simile opzionale esterno informatico famoso nuovo grande ITWAC pubblico regionale presente sociale previsto nazionale necessario generale seguente legislativo Ongoing work I Annotation (POS tagging, dependency parsing. . . ) I Classification by genre and topic I Visualization, user-friendly interface I E-mail me ([email protected]) if you want a copy of the raw corpus!