Introduction to NoSQL and document systems - e-Learning

12/2/2015
NoSQL
Outline
• Definition of NoSQL and the reasons behind it
• Document-based models
• Graph models
• Key/value models
• Wide-column models
• Comparisons
In the beginning there was…
Storage!
ACID properties
• Atomicity
• Consistency
• Isolation
• Durability
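As a minimal sketch of atomicity, the example below uses Python's built-in sqlite3 module. The accounts table, the names and the amounts are assumptions made up for illustration; the point is only that a failed transfer leaves no partial update behind.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one atomic unit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            row = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                               (src,)).fetchone()
            if row[0] < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # fails and is rolled back as a whole
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer both balances are unchanged: neither of the two UPDATE statements survives on its own.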
Application development
[Figure: Code, XML Config, DB Schema; Application; Object Relational Mapping; Relational Database]
©Massimo Brignoli, Mongodb
And Even Harder To Iterate
[Figure: a relational schema with fields Name, Pet, Phone, Email that keeps growing through new tables and new columns; caption: "3 months later…"]
©Massimo Brignoli, Mongodb
Performance
• RDBMSs scale by adding processing capacity to the servers
– Scale up
Performance
• Up to a limit…
History
• MultiValue databases at TRW in 1965.
• DBM is released by AT&T in 1979.
• Lotus Domino released in 1989.
• Carlo Strozzi used the term NoSQL in 1998 to name his lightweight, open-source relational database that did not expose the standard SQL interface.
• Graph database Neo4j is started in 2000.
• Google BigTable is started in 2004. Paper published in 2006.
• CouchDB is started in 2005.
• The research paper on Amazon Dynamo is released in 2007.
• The document database MongoDB is started in 2007 as part of an open-source cloud computing stack, with its first standalone release in 2009.
• Facebook open-sources the Cassandra project in 2008.
• Project Voldemort started in 2008.
• The term NoSQL was reintroduced in early 2009.
NoSQL
• Not only SQL
• A set of data representation models and the software that manages them
• Schema-free (or schemaless)
• CAP theorem
• BASE
– Basically Available, Soft state, Eventual consistency
CAP theorem
CAP theorem
• RDBMSs are essentially CA
– There are attempts to provide P as well
• NoSQL systems are mostly CP or AP
– CP -> the system waits until the data are consistent
– AP -> the data are accepted to be inconsistent from time to time
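The CP/AP trade-off can be illustrated with a toy pair of replicas. This is only a sketch under strong assumptions (two nodes, synchronous replication, a boolean flag standing in for a network partition); the class and attribute names are invented for the example and do not belong to any real system.

```python
class Replica:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.data = {}
        self.peer = None
        self.partitioned = False  # True when the peer cannot be reached

    def write(self, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # CP: give up availability rather than let replicas diverge
                raise RuntimeError("unavailable: cannot reach peer")
            self.data[key] = value  # AP: accept locally, the peer becomes stale
            return
        self.data[key] = value
        self.peer.data[key] = value  # replicate synchronously

    def read(self, key):
        return self.data.get(key)

def make_pair(mode):
    a, b = Replica(mode), Replica(mode)
    a.peer, b.peer = b, a
    return a, b

ap_a, ap_b = make_pair("AP")
ap_a.write("x", 1)
ap_a.partitioned = ap_b.partitioned = True
ap_a.write("x", 2)            # accepted on one side only
stale_read = ap_b.read("x")   # the other side still returns the old value

cp_a, cp_b = make_pair("CP")
cp_a.partitioned = cp_b.partitioned = True
try:
    cp_a.write("x", 1)
    cp_rejected = False
except RuntimeError:
    cp_rejected = True
```

The AP pair keeps answering but returns inconsistent values during the partition; the CP pair stays consistent by refusing the write.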
NoSQL
http://blog.nahurst.com/visual-guide-to-nosql-systems
NoSQL models
• Key-Value Stores
• Column Family Stores
• Document Databases
• Graph Databases
• RDF databases as well as Tuple stores
Key-value
• Dynamo, Voldemort, Rhino DHT ...
– DeCandia et al. "Dynamo: Amazon's Highly Available Key-value Store", 2007
• Key-value stores are hash tables in which the key points to a particular value
• The key-to-value mapping is backed by hashing mechanisms to maximize performance
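A minimal sketch of the idea, assuming a fixed number of buckets and Python's built-in hash function (real stores such as Dynamo use far more elaborate schemes; the class name is hypothetical):

```python
class KeyValueStore:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash of the key selects the bucket, so lookups avoid a full scan
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

store = KeyValueStore()
store.put("session:42", {"user": "bob"})
store.put("session:42", {"user": "alice"})
```

The store knows nothing about the structure of the value: it can only be fetched or replaced as a whole through its key.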
Column
• BigTable, Cassandra, HBase, Hadoop ...
– Chang et al. "Bigtable: A Distributed Storage System for Structured Data", 2006
• They store very large amounts of data
– "Petabytes of data across hundreds of servers"
• The key points to multiple columns
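One way to picture "the key points to multiple columns" is a nested map from row key to column family to column. The sketch below is an in-memory illustration only; the table, family and column names are assumptions, not the API of BigTable, Cassandra or HBase.

```python
from collections import defaultdict

class WideColumnTable:
    def __init__(self):
        # row key -> column family -> column name -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        self.rows[row_key][family][column] = value

    def get_row(self, row_key, family):
        return dict(self.rows[row_key][family])

users = WideColumnTable()
users.put("user:1", "profile", "name", "Alice")
users.put("user:1", "profile", "city", "London")
users.put("user:2", "profile", "name", "Bob")  # rows need not share the same columns
```

Each row can carry a different set of columns, which is what distinguishes this model from a fixed relational table.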
Graph DB
• Neo4J, FlockDB, GraphBase, InfoGrid, ...
• Graph databases are built from nodes and relationships between nodes (edges).
• Nodes have properties
– Nodes represent entities (e.g. "Bob" or "Alice").
– Properties are pieces of information attached to nodes (e.g. age: 18).
• Edges connect nodes to nodes or nodes to properties
• Graph DBs do not scale well
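A small in-memory sketch of this model, reusing the "Bob", "Alice" and age 18 examples from the slide. The KNOWS label, Alice's age and the class itself are assumptions for illustration, not the API of any graph database.

```python
class Graph:
    def __init__(self):
        self.nodes = {}  # node id -> dict of properties
        self.edges = []  # (source, label, target) triples

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, label, target):
        self.edges.append((source, label, target))

    def neighbors(self, node_id, label):
        """Follow outgoing edges with the given label."""
        return [t for s, l, t in self.edges if s == node_id and l == label]

g = Graph()
g.add_node("Bob", age=18)
g.add_node("Alice", age=21)         # hypothetical value
g.add_edge("Bob", "KNOWS", "Alice")  # hypothetical relationship
```

Queries are expressed as traversals: start from a node and follow edges, instead of joining tables.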
Document-based
• CouchDB, MongoDB, Lotus Notes, Redis ...
• Documents are addressed in the database through a unique key
• Semi-structured model, such as JSON or XML
• Besides lookup by key, documents can also be searched
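The two access paths, by unique key and by search, can be sketched with a plain dictionary of documents. The DocumentStore class and its methods are hypothetical names for this illustration; the sample fields come from the employee document shown in the slides.

```python
import uuid

class DocumentStore:
    def __init__(self):
        self.docs = {}  # unique key -> document

    def insert(self, doc):
        doc_id = doc.get("_id") or uuid.uuid4().hex  # generate a key if none is given
        self.docs[doc_id] = {**doc, "_id": doc_id}
        return doc_id

    def get(self, doc_id):
        """Direct access through the unique key."""
        return self.docs.get(doc_id)

    def find(self, **criteria):
        """Search on top-level fields (no index: a full scan)."""
        return [d for d in self.docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]

db = DocumentStore()
key = db.insert({"employee_name": "Dunham, Justin", "department": "Marketing"})
marketing = db.find(department="Marketing")
```

A real document database would back `find` with indexes rather than scanning every document.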
MongoDB
[Figure: an RDBMS representation compared with the MongoDB document below]
{
  _id: ObjectId("4c4ba5e5e8aabf3"),
  employee_name: "Dunham, Justin",
  department: "Marketing",
  title: "Product Manager, Web",
  report_up: "Neray, Graham",
  pay_band: "C",
  benefits: [
    { type: "Health", plan: "PPO Plus" },
    { type: "Dental", plan: "Standard" }
  ]
}
Enterprise Big Data Stack
[Figure: stack diagram with the following layers]
• Applications: CRM, ERP, Collaboration, Mobile, BI
• Data Management: Online Data, Offline Data (RDBMS, Hadoop, EDW)
• Infrastructure: OS & Virtualization, Compute, Storage, Network
• Security & Auditing; Management & Monitoring
Data model
[Figure: relational tables compared with the MongoDB document below]
{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: [45.123, 47.232],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}
Document
_id
• N-dimensional storage
• Each field can contain 0, 1,
many, or embedded values
• Query on any field & level
• Flexible schema
• Inline updates *
• Embedding related data has optimal data
locality, requires fewer indexes, has better
performance
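"Query on any field & level" can be imitated in a few lines with dotted paths that descend into embedded documents and arrays. This is a simplified sketch in the spirit of dot-notation queries, not MongoDB's actual matching logic; the function names are hypothetical and the sample data is the Paul Miller document from the data model slide.

```python
def values_at(doc, path):
    """Yield every value reachable at a dotted path, descending into lists."""
    head, _, rest = path.partition(".")
    if isinstance(doc, list):
        for item in doc:
            yield from values_at(item, path)
    elif isinstance(doc, dict) and head in doc:
        if rest:
            yield from values_at(doc[head], rest)
        else:
            value = doc[head]
            if isinstance(value, list):
                yield from value  # an array field matches on any of its elements
            else:
                yield value

def matches(doc, path, expected):
    return expected in values_at(doc, path)

paul = {
    "first_name": "Paul",
    "surname": "Miller",
    "city": "London",
    "location": [45.123, 47.232],
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}
```

For example, `matches(paul, "cars.model", "Bentley")` looks inside every embedded car, while `matches(paul, "city", "London")` checks a top-level field.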
Referencing
Contacts
{
  "_id": …,
  "name": …,
  "title": …,
  "company": …,
  "phone": …,
  "address_id": …
}
Addresses
{
  "_id": …,
  "street": …,
  "city": …,
  "state": …,
  "zip_code": …,
  "country": …
}
Embedding
Contacts
{
  "_id": …,
  "name": …,
  "title": …,
  "company": …,
  "address": {
    "street": …,
    "city": …,
    "state": …,
    "zip_code": …,
    "country": …
  },
  "phone": …
}
Relational Schema
Contact
• name
• company
• title
• phone
Address
• street
• city
• state
• zip_code
Document Schema
Contact
• name
• company
• address
  • street
  • city
  • state
  • zip_code
• title
• phone
[Figure: the relational schema (Contact and Address) and the document schema (Contact with an embedded address), side by side]
How are they different? Why?
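One practical difference is the number of lookups needed to answer a question such as "in which city does this contact live?". The sketch below models both designs with plain dictionaries; all sample values are made up for illustration.

```python
# Referencing: two collections linked by address_id (hypothetical sample data)
contacts_ref = {1: {"_id": 1, "name": "Ann", "address_id": 10}}
addresses = {10: {"_id": 10, "street": "1 Main St", "city": "Rome"}}

# Embedding: the address lives inside the contact document
contacts_emb = {
    1: {"_id": 1, "name": "Ann", "address": {"street": "1 Main St", "city": "Rome"}}
}

def city_by_reference(contact_id):
    contact = contacts_ref[contact_id]          # first lookup
    address = addresses[contact["address_id"]]  # second lookup, the application-side "join"
    return address["city"]

def city_embedded(contact_id):
    return contacts_emb[contact_id]["address"]["city"]  # a single lookup
```

Both functions return the same answer. Embedding reads everything in one access, while referencing avoids duplicating an address that is shared by several contacts.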
Schema Flexibility
{
  "name": …,
  "title": …,
  "company": …,
  "address": {
    "street": …,
    "city": …,
    "state": …,
    "zip_code": …
  },
  "phone": …
}
{
  "name": …,
  "url": …,
  "title": …,
  "company": …,
  "email": …,
  "address": {
    "street": …,
    "city": …,
    "state": …,
    "zip_code": …
  },
  "phone": …,
  "fax": …
}
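Because documents in the same collection may carry different fields, application code has to tolerate missing ones. A minimal sketch with made-up sample values:

```python
# Two contacts in the same collection with different sets of fields (hypothetical data)
contacts = [
    {"name": "Ann", "title": "CTO", "phone": "555-0100"},
    {"name": "Bob", "url": "http://example.com", "email": "bob@example.com",
     "fax": "555-0101"},
]

def field_usage(docs):
    """Count how many documents carry each field."""
    counts = {}
    for doc in docs:
        for field in doc:
            counts[field] = counts.get(field, 0) + 1
    return counts

usage = field_usage(contacts)
emails = [doc.get("email") for doc in contacts]  # a missing field reads as None
```

No schema migration is needed to add `url`, `email` or `fax`; the cost is that readers must handle documents that lack them.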
MongoDB
• One of the document-based databases
• Data stored as BSON
• Indexes for data access
• Does not perform joins
MongoDB
• Scalability
• Three types of sharding: hash-based, range-based, tag-aware
• Pay-as-you-go growth or reduction of capacity
• Automatic balancing
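To give a feel for hash-based sharding, the sketch below hashes the shard key and uses the result to pick a shard. It is an illustration under simplifying assumptions (MD5 plus modulo, hypothetical shard names), not the algorithm MongoDB uses internally.

```python
import hashlib
from collections import Counter

def hash_shard(key, shards):
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.md5(str(key).encode()).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]

shards = ["shard0", "shard1", "shard2"]  # hypothetical shard names

# Consecutive keys (e.g. auto-incrementing ids) are spread over all shards
placement = Counter(hash_shard(user_id, shards) for user_id in range(1000))
```

Hashing spreads monotonically increasing keys evenly, at the price of turning range queries on the key into queries that touch every shard; range-based sharding makes the opposite trade-off.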
Query Routing
• Multiple query optimization models
• Each sharding option appropriate for different apps
Availability Considerations
• High Availability – Ensure application availability during
many types of failures
• Disaster Recovery – Address the RTO and RPO goals
for business continuity
• Maintenance – Perform upgrades and other
maintenance operations with no application downtime
Replica Sets
• Replica Set – two or more copies
• “Self-healing” shard
• Addresses many concerns:
- High Availability
- Disaster Recovery
- Maintenance
Single Data Center
[Figure: shards A, B and C, each with one primary and two secondaries placed on different servers]
• Automated failover
• Tolerates server failures
• Tolerates rack failures
• Number of replicas defines failure tolerance
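The relation between the number of replicas and failure tolerance follows from majority voting: a primary can only be elected while a majority of the members is reachable. A small worked computation, assuming every member votes:

```python
def majority(members):
    """Votes needed to elect a primary."""
    return members // 2 + 1

def fault_tolerance(members):
    """Members that can fail while a primary can still be elected."""
    return members - majority(members)

tolerance = {n: fault_tolerance(n) for n in range(1, 8)}
```

A three-member set survives one failure and a five-member set survives two; going from three to four members adds no tolerance, which is why odd sizes are the usual choice.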
Active/Standby Data Center
[Figure: replica sets A, B and C, each with one primary and two secondaries, distributed across Data Center - West and Data Center - East]
• Tolerates server and rack failure
• Standby data center
Active/Active Data Center
[Figure: replica sets A, B and C, each with one primary, three secondaries and one arbiter, distributed across Data Center - West, Data Center - Central and Data Center - East]
• Tolerates server, rack, data center failures, network partitions
Global Data Distribution
[Figure: one primary replicating in real time to several secondaries in different locations]
Read Global/Write Local
[Figure: three replica sets with primaries in LON, NYC and SYD, each with secondaries in the other two cities]
Scaling Data
Working Set Exceeds Physical Memory
Partitioning
• User defines shard key
• Shard key defines range of data
• Key space is like points on a line
• Range is a segment of that line
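The "points on a line" picture maps directly onto a sorted list of split points: finding the shard for a key is a binary search for the segment that contains it. A minimal sketch, with made-up split points and shard names:

```python
import bisect

# Hypothetical split points on the key line: each one starts a new segment
split_points = [100, 200, 300]
shards = ["shard0", "shard1", "shard2", "shard3"]
# shard0: key < 100, shard1: 100 <= key < 200, shard2: 200 <= key < 300, shard3: key >= 300

def shard_for(key):
    """Return the shard whose range (segment of the line) contains the key."""
    return shards[bisect.bisect_right(split_points, key)]
```

Keys that are close together land in the same segment, which keeps range queries on the shard key local to few shards.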
Data Distribution
• Initially 1 chunk
• Default max chunk size: 64 MB
• MongoDB automatically splits & migrates chunks when max reached
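The splitting behaviour can be sketched by letting a chunk be a sorted list of keys and using a small document count in place of the 64 MB limit. Both the limit of 4 and the split-in-the-middle policy are assumptions for illustration; migration between shards is not modelled here.

```python
MAX_CHUNK_DOCS = 4  # stand-in for the 64 MB size limit

def insert(chunks, key):
    """Insert a key into the ordered list of chunks, splitting an oversized chunk."""
    # The owning chunk is the last one whose first key is <= key
    idx = 0
    for i, chunk in enumerate(chunks):
        if chunk and chunk[0] <= key:
            idx = i
    chunk = chunks[idx]
    chunk.append(key)
    chunk.sort()
    if len(chunk) > MAX_CHUNK_DOCS:
        mid = len(chunk) // 2
        chunks[idx:idx + 1] = [chunk[:mid], chunk[mid:]]  # one chunk becomes two

chunks = [[]]  # initially 1 chunk
for k in [5, 1, 9, 3, 7, 2, 8]:
    insert(chunks, k)
```

The fifth insert pushes the single chunk over the limit and splits it; later keys go to whichever of the two resulting ranges covers them.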
Routing and Balancing
• Queries routed to specific shards
• MongoDB balances cluster
• MongoDB migrates data to new nodes
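Balancing can be pictured as repeatedly moving one chunk from the most loaded shard to the least loaded one until the difference is small. The sketch below is a simplified model; the threshold value and the shard names are assumptions, not MongoDB's actual balancer settings.

```python
def balance(chunk_counts, threshold=1):
    """Move chunks one at a time from the fullest to the emptiest shard."""
    counts = dict(chunk_counts)
    migrations = []
    while True:
        src = max(counts, key=counts.get)
        dst = min(counts, key=counts.get)
        if counts[src] - counts[dst] <= threshold:
            return counts, migrations
        counts[src] -= 1
        counts[dst] += 1
        migrations.append((src, dst))

# A new, empty node joins a cluster whose only shard holds 6 chunks
balanced, moves = balance({"shard0": 6, "shard1": 0})
```

Here three chunks migrate to the new node, after which both shards hold the same number of chunks.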
MongoDB Auto-Sharding
• Minimal effort required
– Same interface as single mongod
• Two steps
– Enable Sharding for a database
– Shard collection within database
Architecture
What is a Shard?
• Shard is a node of the cluster
• Shard can be a single mongod or a replica set
Meta Data Storage
• Config Server
– Stores cluster chunk ranges and locations
– Can have only 1 or 3 (production must have 3)
– Not a replica set
Routing and Managing Data
• Mongos
– Acts as a router / balancer
– No local data (persists to config database)
– Can have 1 or many
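The router's role can be sketched as follows: it holds no data, consults the chunk map kept by the config servers, and sends a query either to the one shard that owns the key or to all shards. The Router class, the `user_id` shard key and the sample documents are hypothetical and only model the idea.

```python
class Router:
    def __init__(self, chunk_map, shards, shard_key="user_id"):
        self.chunk_map = chunk_map  # (low, high, shard name) ranges, as kept by the config servers
        self.shards = shards        # shard name -> documents stored on that shard
        self.shard_key = shard_key

    def find(self, query):
        if self.shard_key in query:
            value = query[self.shard_key]
            # Targeted query: only the shard owning the key range is contacted
            targets = [s for low, high, s in self.chunk_map if low <= value < high]
        else:
            # No shard key in the query: scatter-gather across all shards
            targets = list(self.shards)
        hits = [doc for s in targets for doc in self.shards[s]
                if all(doc.get(k) == v for k, v in query.items())]
        return targets, hits

router = Router(
    chunk_map=[(0, 100, "shard0"), (100, 200, "shard1")],
    shards={
        "shard0": [{"user_id": 7, "city": "London"}],
        "shard1": [{"user_id": 150, "city": "London"}],
    },
)
targeted = router.find({"user_id": 150})
broadcast = router.find({"city": "London"})
```

Queries that include the shard key touch a single shard, while queries on other fields have to be sent everywhere and merged.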