How to Become a Data Scientist

HOW TO BECOME A DATA SCIENTIST
FIRST TIPS FOR STUDENTS AND BEGINNERS
Paolo Pellegrini, Senior Consultant
June 2016
BIG DATA & DATA SCIENCE
AGENDA
Why become a Data Scientist?
Who is a Data Scientist?
How is a Data Scientist selected?
What is an algorithm?
An example of a Data Science problem
Some Data Science tools
Where to keep up to date and get training?
56% OF ITALIAN COMPANIES NAME BIG
DATA AND DATA SCIENCE AS THEIR PRIMARY
STRATEGIC DEVELOPMENT AREA FOR 2016/17
Big Data and Analytics: 56%
Dematerialization: 53%
ERP systems: 48%
Mobile devices and mobile apps: 40%
CRM systems: 31%
Application consolidation: 31%
Public and private cloud: 25%
Mobile marketing and CRM: 25%
Data center: 18%
Mobile and eCommerce: 18%
Storage and virtualization: 17%
Collaboration: 17%
Compliance and risk management: 10%
Cyber security: 10%
Web and social commercial projects: 7%
Smart working: 6%
Internet of Things: 5%
Smart manufacturing: 3%
HARVARD, YEARS AGO, ALREADY CALLED IT THE
SEXIEST JOB OF OUR CENTURY... AND IT IS
ALSO WELL PAID!
GOOGLE TREND
«DATA SCIENTIST»
AVERAGE SALARY
123,000 $
WHO IS A DATA SCIENTIST?
A DATA SCIENTIST CAN DO ANYTHING!
Nate Silver is the person who changed the
concept of “psephology”, using Big Data
& Data Science to predict the results
of American elections.
Today he is one of the most famous Data Scientists
in the world
A DATA SCIENTIST IS A STRONGLY
INTERDISCIPLINARY FIGURE WHO COMBINES STATISTICS,
PROGRAMMING AND BUSINESS LOGIC
«On any given day a team member might author a
multistage processing pipeline in Python, design a
hypothesis test, perform a regression analysis over
data samples with R, design and implement an
algorithm for some data-intensive product or service
in Hadoop, or communicate the results of an
analysis to other members of the organization in a
clear and concise fashion»
2009 – Jeff Hammerbacher | Data Scientist @ Facebook
ADOPTION AROUND THE WORLD KEEPS
GROWING, WITH ITALY PLAYING A LEADING ROLE
RJ Metrics analysis of 11,400 Data Scientist profiles on LinkedIn
THE SECTORS EMPLOYING THE MOST DATA SCIENTISTS
ARE THE MOST IT-ORIENTED ONES,
BUT ADOPTION IS INCREASINGLY WIDESPREAD
RJ Metrics analysis of 11,400 Data Scientist profiles on LinkedIn
THE MOST COMMON SKILLS CENTER ON
LANGUAGES AND TOOLS SUCH AS «R» AND «PYTHON»
RJ Metrics analysis of 11,400 Data Scientist profiles on LinkedIn
MODELING AND PROGRAMMING
SKILLS ARE ESSENTIAL FOR
A JUNIOR RESOURCE
RJ Metrics analysis of 11,400 Data Scientist profiles on LinkedIn
A DATA SCIENTIST CAN COME FROM ANY KIND
OF BACKGROUND: ALL THAT COUNTS IS THE DRIVE AND
APTITUDE TO WORK WITH DATA
RJ Metrics analysis of 11,400 Data Scientist profiles on LinkedIn
SOON EVERY COMPANY WILL HAVE A DATA SCIENTIST
Data Scientist
2%
11%
14%
73%
Present, with a well defined role
Present, but without a well defined role
Introduction planned for 2016
Possible introduction in the future
HOW IS A DATA SCIENTIST SELECTED?
EVALUATION PROCESS
Job posting
Test development
CV selection
Test assignment
Test evaluation
Technical interview
Data Scientist evaluation
Junior Data Scientist Selection
1) You have two tables in an existing RDBMS. One contains
information about the products you sell (name, size, color, etc.)
The other contains images of the products in JPEG format.
These tables are frequently joined in queries to your database.
You would like to move this data into HBase. What is the most
efficient schema design for this scenario?
• Create a single table, with two column families
• Create a single table, with one column family
• Create two tables, with one column family
2) A sandwich shop studies the number of men, and women, that
enter the shop during the lunch hour from noon to 1pm each day.
They find that the number of men that enter can be modeled as a
random variable with distribution Poisson(M), and likewise the
number of women that enter as Poisson(W). What is likely to be
the best model of the total number of customers that enter during
the lunch hour?
• Poisson (M+W)
• Poisson (M/W)
• Poisson (M*W)
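The second question presumably rests on a standard result: if the two counts are independent, the sum of a Poisson(M) and a Poisson(W) variable is Poisson(M+W). The sketch below checks this numerically by convolving the two distributions. The rates M = 3 and W = 2 are illustrative assumptions, not values from the test.

```python
import math

def poisson_pmf(k: int, rate: float) -> float:
    """P(X = k) for X ~ Poisson(rate)."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def sum_pmf(k: int, m: float, w: float) -> float:
    """P(men + women = k), assuming the two counts are independent."""
    return sum(poisson_pmf(j, m) * poisson_pmf(k - j, w) for j in range(k + 1))

M, W = 3.0, 2.0  # illustrative rates, not taken from the slides

# Largest difference between the convolution and Poisson(M + W) over k = 0..19.
max_gap = max(abs(sum_pmf(k, M, W) - poisson_pmf(k, M + W)) for k in range(20))
print(max_gap)  # numerically zero: the two distributions coincide
```

The independence assumption matters: if men and women tended to arrive together, the total would no longer be Poisson in general.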
Senior Data Scientist Selection
Candidates receive a data set, by e-mail or through platforms
such as University2Business, and must analyze it
in order to develop a predictive model
This is followed by general questions on model building, or
detailed discussion of the work done in the test. For example:
• Data cleaning
• Model building
• Algorithm development
• …
Skill areas considered in the evaluation:
• Socio-economic
• Business
• Soft (business generic)
• Soft (role specific)
• Math
• Statistics
• Computer Science
DESCRIPTION OF SKILLS
TECHNICAL SKILLS
• Mathematical: ability to systematize reality through classifications and models that take into account the interactions among elements
• Computer science: ability to process information by developing automated procedures (e.g. algorithms) and the supporting HW/SW
• Statistical: ability to draw logical deductions and extract knowledge from the study of a particular non-deterministic phenomenon
• Sector-specific: knowledge of processes and market, and anticipation of the impact of exogenous variables on the specific sector
• Socio-economic: ability to read the social context and how it affects the economic context
SOFT, BUSINESS GENERIC
• Management: ability to lead and coordinate a group of resources and to take decisions that ensure business results are achieved
• Teamwork: ability to work in a group, dividing roles and pooling skills, in order to reach a common goal
• Coaching/Mentoring: ability to train less experienced resources in order to develop their potential, starting from the uniqueness of the individual
• Interpersonal relations: ability to relate to other people, adopting the appropriate approach depending on status, hierarchical relationships, circumstances, etc.
SOFT, ROLE SPECIFIC
• Storytelling: inventiveness in creating scenarios to explore, and ability to place information within a framework that makes it easier to convey and understand externally, including through the ability to summarize and present information
• Ethics: ability to use data conscientiously, including when holding sensitive data
• Hacking: ability to use creativity and imagination in the pursuit of knowledge
TYPICAL PROFILES
Charts compare four profiles (Junior Data Scientist, Senior Data Scientist, Chief Data Scientist, Business Manager) along the same axes: Socio-Economic, Business, Soft (Business Generic), Soft (Role Specific), Math, Statistics, Computer Science
WHAT IS AN ALGORITHM?
FROM DATA TO ALGORITHM
“Data is inherently dumb - Algorithms are where
the real value lies. Algorithms define action”
Peter Sondergaard
Senior Vice President
Gartner Research
Graphical expression of Euclid's algorithm to find
the greatest common divisor of 1599 and 650
An algorithm is a self-contained step-by-step
set of operations to be performed
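The Euclid example from the slide can be written in a few lines. This is a minimal sketch of the textbook remainder-based version, which may differ in form from the diagram shown in the original slide.

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor by Euclid's algorithm (repeated remainders)."""
    while b != 0:
        # Replace the pair (a, b) with (b, a mod b) until the remainder is zero.
        a, b = b, a % b
    return a

print(gcd(1599, 650))  # 13
```

The intermediate remainders are 299, 52, 39 and 13: each step is a fixed, mechanical operation, which is what makes it an algorithm.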
HOW ALGORITHMS SUPPORT BUSINESS
Diagram: types of analysis by increasing decisional support
• Descriptive: What happened?
• Diagnostic: Why did it happen?
• Predictive: What future?
• Prescriptive: How to react to recent events?
• Preemptive: How to avoid bad events?
Stages: Information, Insights, Decision, Action
Strategies: Old-style strategy, Analytics strategy, Data-driven strategy, Optimization strategy
HOW «MACHINE LEARNING» IS DEFINED
• Machine learning is a subfield of computer science that evolved from the study of pattern recognition
and computational learning theory in artificial intelligence
• In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability
to learn without being explicitly programmed"
• Machine learning explores the study and construction of algorithms that can learn from and
make predictions on data. Such algorithms operate by building a model from an example training set
of input observations in order to make data-driven predictions or decisions expressed as outputs,
rather than following strictly static program instructions
• Machine learning is closely related to (and often overlaps with) computational statistics, a discipline
that also focuses on prediction-making through the use of computers. It has strong ties to
mathematical optimization, which delivers methods, theory and application domains to the field
SOME EXAMPLES OF ALGORITHMS
• C4.5 - Constructs a classifier in the form of a decision tree. In order to do this, C4.5 is
given a set of data representing things that are already classified. This is supervised
learning, since the training dataset is labeled with classes
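To pick the attribute to split on, C4.5 uses the gain ratio: information gain divided by the split information. The sketch below computes it for categorical attributes only. The toy data is invented for illustration, and a full C4.5 implementation also handles numeric attributes, missing values and pruning.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain of splitting on a categorical attribute, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# Invented toy data: which attribute better separates the classes?
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

print(gain_ratio(outlook, play))  # 1.0: outlook separates the classes perfectly
print(gain_ratio(windy, play))    # 0.0: windy carries no information here
```

A tree builder would split on the attribute with the highest gain ratio and then repeat the procedure inside each branch.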
• k-means - creates k groups from a set of objects so that the members of a group are
more similar to each other. It’s a popular cluster analysis technique for exploring a dataset. Most would
classify k-means as unsupervised. Other than specifying the number of clusters, k-means
“learns” the clusters on its own without any information about which cluster an observation
belongs to. k-means can also be semi-supervised
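A minimal sketch of the usual k-means procedure (Lloyd's algorithm) follows. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The points are invented, and the starting centroids are picked by hand for reproducibility. Real implementations typically use random or k-means++ initialization.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate the assignment step and the centroid update step."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assignment step: index of the nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: mean of each cluster (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Invented toy data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

No labels are passed in: the only human input is k (here 2, through the two starting centroids), which matches the "unsupervised" description above.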
• Support vector machines - SVM learns a hyperplane to classify data into 2 classes. At
a high level, SVM performs a task similar to C4.5, except SVM doesn’t use decision trees
at all. This is supervised learning, since a dataset is used to first teach the SVM about the
classes
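The sketch below does not train an SVM. It only illustrates what the learned hyperplane does: classify a point by the sign of w·x + b, and how the margin, which SVM training maximizes, distinguishes two hyperplanes that both separate the data. The points and the two hand-picked hyperplanes are invented for illustration.

```python
import math

def decision(w, b, x):
    """Signed score of point x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Class +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision(w, b, x) >= 0 else -1

def margin(w, b, points, labels):
    """Smallest distance from any point to the hyperplane (negative if a point is misclassified)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, x) for x, y in zip(points, labels)) / norm

# Invented, linearly separable toy data.
points = [(2, 2), (3, 3), (3, 2), (0, 0), (-1, 0), (0, -1)]
labels = [1, 1, 1, -1, -1, -1]

# Two hand-picked separating hyperplanes, written as (w, b).
diagonal = ((0.5, 0.5), -1.0)
vertical = ((1.0, 0.0), -1.0)

for name, (w, b) in [("diagonal", diagonal), ("vertical", vertical)]:
    print(name, round(margin(w, b, points, labels), 3))
```

Both hyperplanes classify every point correctly, but the diagonal one keeps the data farther away, so it is the one a maximum-margin method would favor.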
• Naive Bayes - it is not a single algorithm, but a family of classification algorithms that
share one common assumption: every feature of the data being classified is independent
of all other features given the class. This is supervised learning, since Naive Bayes is
provided a labeled training dataset in order to construct the tables
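The "tables" mentioned above are simply counts. The sketch below shows a categorical Naive Bayes under two stated assumptions: add-one smoothing, and two possible values per feature. The rows and labels are invented toy data, not the San Francisco dataset.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Build the count tables: class counts and per-class counts of each feature value."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def predict(row, class_counts, feature_counts, n_values=2):
    """Pick the class maximizing prior * product of per-feature likelihoods."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior
        for i, value in enumerate(row):
            # Independence assumption: multiply one likelihood per feature (add-one smoothing).
            score *= (feature_counts[(label, i)][value] + 1) / (count + n_values)
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy data: (time of day, area) -> class.
rows = [("night", "center"), ("night", "suburb"), ("day", "center"),
        ("day", "suburb"), ("night", "center")]
labels = ["theft", "theft", "fraud", "fraud", "theft"]

tables = train(rows, labels)
print(predict(("night", "suburb"), *tables))  # theft
print(predict(("day", "center"), *tables))    # fraud
```

The product inside `predict` is where the "naive" independence assumption enters: each feature contributes its own likelihood, regardless of the others.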
• PCA - Principal component analysis uses an orthogonal transformation to convert a set of
observations of possibly correlated variables into a set of values of linearly uncorrelated
variables called principal components. The number of principal components is less than or
equal to the number of original variables. The first principal component has the largest
possible variance, and each succeeding component in turn has the highest variance
possible under the constraint that it is orthogonal to the preceding components. The
resulting vectors are an uncorrelated orthogonal basis set. The principal components are
orthogonal because they are the eigenvectors of the covariance matrix, which is
symmetric. PCA is sensitive to the relative scaling of the original variables. This is
unsupervised learning
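As a small illustration of the eigenvector view described above, the sketch below builds the covariance matrix of four invented 2-D points and finds the first principal component by power iteration. Power iteration is a choice made here for brevity. Libraries normally use a full eigendecomposition or SVD. It also assumes the starting vector is not orthogonal to the leading eigenvector.

```python
import math

def covariance_matrix(data):
    """Sample covariance matrix of rows of observations."""
    n, dims = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]

def first_component(cov, iterations=100):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data along v, which equals the largest eigenvalue.
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))
    return v, variance

# Invented data: most of the spread lies along the diagonal x = y.
data = [(2.0, 2.0), (-2.0, -2.0), (0.5, -0.5), (-0.5, 0.5)]
cov = covariance_matrix(data)
component, variance = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
print(component, variance / total_variance)
```

Here the first component points along the diagonal and accounts for 16/17 of the total variance. The remaining variance lies along the orthogonal direction.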
LOGICAL PROCESS FOR USING ALGORITHMS
Dataset receipt
Exploratory data analysis
Data cleaning
Logical model building
Development of an ad-hoc algorithm to support the business
Testing & Tuning
Use of algorithms to find the most predictive variables:
• Random Forest
• Decision Tree
• SVM
• …
AN EXAMPLE OF A DATA SCIENCE PROBLEM
DATA SCIENTIST: MAGICIAN OR SUPERHERO?
Can a Data Scientist predict crimes in San Francisco?
Can a Data Scientist help make the city safer?
THE SAN FRANCISCO CHALLENGE
From 1934 to 1963, San Francisco was infamous for housing some of the world's
most notorious criminals on the inescapable island of Alcatraz.
Today, the city is known more for its tech scene than its criminal past. But, with
rising wealth inequality, housing shortages, and a proliferation of expensive digital
toys riding BART to work, there is no scarcity of crime in the city by the bay.
Given time and location, you must predict the category of crime that can occur
• A dataset of 12 years of incidents from across all of San
Francisco's neighborhoods is provided, from 01/01/2003 to 13/05/2015.
• The dataset has been divided into two parts: a training set, used for model
development, and a test set, used to verify the predictive algorithm.
TRAINING SET STRUCTURE
878,049 INCIDENTS
WITH
39 CATEGORIES OF CRIME
For every incident, the following are provided:
• Date and time
• Category
• Description
• Day of week
• Pd District
• Resolution
• Address
• Latitude
• Longitude
AN EXAMPLE OF DATA VISUALIZATION
Q1 – HOW TO HANDLE THE DATASET?
• Datasets are managed as CSV files, or also JSON. No xls!
• 800,000 records is Big Data!
• You can use only variables known when the model is applied
Target: Category
Variables: Date and time, Day of week, Pd District, Address, Latitude,
Longitude
Variables not to be included: Description, Resolution
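The variable selection above can be sketched with Python's standard csv module. The column names and the two sample rows are assumptions made up for illustration, modeled on the field list in the slides. The real file may use different headers.

```python
import csv
import io

# Hypothetical two-row sample; headers and values are illustrative assumptions.
SAMPLE = """DateTime,Category,Description,DayOfWeek,PdDistrict,Resolution,Address,Latitude,Longitude
13/05/2015 23:53:00,CATEGORY_A,some text,Wednesday,DISTRICT_1,NONE,100 EXAMPLE ST,37.77,-122.42
13/05/2015 23:33:00,CATEGORY_B,other text,Wednesday,DISTRICT_2,NONE,200 SAMPLE AV,37.80,-122.40
"""

TARGET = "Category"
EXCLUDED = {"Description", "Resolution"}  # not known when the model is applied

def load(text):
    """Split each record into usable features and the target to predict."""
    features, targets = [], []
    for row in csv.DictReader(io.StringIO(text)):
        targets.append(row.pop(TARGET))
        features.append({k: v for k, v in row.items() if k not in EXCLUDED})
    return features, targets

features, targets = load(SAMPLE)
print(sorted(features[0]))
print(targets)
```

Description and Resolution are dropped because they are recorded only after the incident, so a model that used them could not be applied at prediction time.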
STEP 1: DATA CLEANSING
Prepare the dataset so that every valid variable can be used by a predictive model:
• Generate an ID for every record
• Verify the structure of every variable and look for data that needs to be
cleaned up (e.g. empty records, double spaces, etc.)
• Split “Date” (13/05/2015 23:53:00) into single variables (Month, Year, Hour)
• Merge “Latitude” and “Longitude” to identify unique places
Verify the distribution of every variable to detect “non-normal
distributions” or other kinds of problems to be fixed
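Three of the cleansing steps above can be sketched with the standard library alone. The function names are hypothetical, and rounding coordinates to four decimals is an assumption about what counts as the same place.

```python
from datetime import datetime

def clean_text(value: str) -> str:
    """Trim the value and collapse repeated whitespace (e.g. double spaces)."""
    return " ".join(value.split())

def split_date(raw: str) -> dict:
    """Split a day/month/year timestamp into the single variables used by the model."""
    moment = datetime.strptime(raw, "%d/%m/%Y %H:%M:%S")
    return {"Year": moment.year, "Month": moment.month, "Hour": moment.hour}

def place_key(latitude: float, longitude: float, digits: int = 4) -> tuple:
    """Merge latitude and longitude into one key identifying a place (rounding is an assumption)."""
    return (round(latitude, digits), round(longitude, digits))

print(split_date("13/05/2015 23:53:00"))  # {'Year': 2015, 'Month': 5, 'Hour': 23}
print(clean_text("  800  EXAMPLE   ST "))  # 800 EXAMPLE ST
print(place_key(37.77491, -122.41942))
```

Counting how many records share each `place_key` is one way to check for repeated locations before modeling.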
STEP 2: LAUNCH THE FIRST EXPLORATORY MODEL
You can use professional (and free) tools like RapidMiner, Weka, KNIME, etc.
• Tool: IBM Watson
• Algorithm: Decision Tree CHAID
• Predictive Strength: 17% - the crime category is correctly predicted in fewer than 1 case in 4
Decision Tree for Arson
Top Predictors
Q1 – HOW TO HANDLE THE DATASET?
• Imagine how to transform the Date variable
• Imagine how to transform the Time variable
• Imagine how to transform the Address variable
• Imagine other external information to be included in the model
• Try to select sources from which to import this information
• Date
Weekend&HolidayDummy [using historical calendar holidays]
• Time
NightDummy, Cold/RainDummy, HotDummy,
WorkingTimeDummy [using weather, sunrise/sunset, etc. information]
• Address
StreetType [managing strings]
• Other
UnemploymentRateByMonth, VisitorsRateByMonth,
PopulationDensityByDistrict, HouseCostByDistrict,
EducationLevelByDistrict [desk analysis]
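A few of the derived variables above need only the timestamp and the address string. The sketch below rests on assumptions: night runs from 22:00 to 06:00, working time is 09:00 to 18:00 on weekdays, and the street type is the last token of the address. Holiday, weather and district-level variables need external sources and are left out.

```python
from datetime import datetime

def engineer(raw_datetime: str, address: str) -> dict:
    """Derive dummy variables and a street type from the raw fields."""
    moment = datetime.strptime(raw_datetime, "%d/%m/%Y %H:%M:%S")
    tokens = address.split()
    return {
        # weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        "WeekendDummy": int(moment.weekday() >= 5),
        # Assumed definition of night: 22:00 to 06:00.
        "NightDummy": int(moment.hour >= 22 or moment.hour < 6),
        # Assumed working hours: 09:00 to 18:00, Monday to Friday.
        "WorkingTimeDummy": int(moment.weekday() < 5 and 9 <= moment.hour < 18),
        # Assumes the street type is the last token, e.g. "ST" or "AV".
        "StreetType": tokens[-1] if tokens else "",
    }

print(engineer("13/05/2015 23:53:00", "100 EXAMPLE ST"))
print(engineer("16/05/2015 10:00:00", "200 SAMPLE AV"))
```

The thresholds are worth tuning against the data: the exploratory model can show whether, say, a different night window is more predictive.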
STEP 3: LAUNCH THE FINAL EXPLORATORY MODEL
• Tool: IBM Watson
• Algorithm: Decision Tree CHAID
• Predictive Strength: 32% - the crime category is correctly predicted in about 1 case in 3
Top Predictors
SOME DATA SCIENCE TOOLS
BIG DATA = A HUGE NUMBER OF TECHNOLOGIES
CHOOSING A DATA SCIENCE TOOL
MAGIC QUADRANT 2016 - GARTNER
Quadrants: Challengers, Leaders, Niche Players, Visionaries (axes: ability to execute, completeness of vision)
Vendors shown: SAS, IBM, SAP, Angoss, KNIME, RapidMiner, Dell, Microsoft, FICO, Predixion Software, Prognoz, Alteryx, Alpine Data, Lavastorm, Megaputer, Accenture
Legend: Paid / Free
Languages
• R
• Python
Other tools
• HP Vertica
• Weka
• Tableau
• Neo4j
RAPIDMINER: A VISUAL LABORATORY
IBM WATSON: ONE OF THE MOST FAMOUS TOOLS,
AT LESS THAN €50/MONTH PER USER
WHERE TO KEEP UP TO DATE AND GET TRAINING?
MOOC