12/2/2015
NoSQL

Outline
• Definition of and motivation for NoSQL
• Document-based models
• Graph models
• Key/value models
• Wide-column models
• Comparisons

In the beginning there was… Storage!

ACID properties
• Atomicity
• Consistency
• Isolation
• Durability (persistence)

Application development
[Diagram labels: Application, Code, XML Config, DB Schema, Object Relational Mapping, Relational Database. ©Massimo Brignoli, MongoDB]

And Even Harder To Iterate
[Diagram labels: Name, Pet, Phone, Email, New Table, New Column, "3 months later…". ©Massimo Brignoli, MongoDB]

Performance
• RDBMS systems scale by adding processing capacity to the servers
  – Scale up
• Up to a limit…

History
• MultiValue databases at TRW in 1965.
• DBM is released by AT&T in 1979.
• Lotus Domino released in 1989.
• Carlo Strozzi used the term NoSQL in 1998 to name his lightweight, open-source relational database that did not expose the standard SQL interface.
• Graph database Neo4j is started in 2000.
• Google BigTable is started in 2004. Paper published in 2006.
• CouchDB is started in 2005.
• The research paper on Amazon Dynamo is released in 2007.
• The document database MongoDB is started in 2007 as part of an open-source cloud computing stack, with its first standalone release in 2009.
• Facebook open-sources the Cassandra project in 2008.
• Project Voldemort started in 2008.
• The term NoSQL was reintroduced in early 2009.

NoSQL
• Not only SQL
• A set of data representation models and the software that manages them
• Schema free (or schemaless)
• CAP theorem
• BASE
  – Basically Available, Soft state, Eventual consistency

CAP theorem
• RDBMSs are essentially CA
  – There are attempts to add P as well
• NoSQL systems are mostly CP or AP
  – CP -> the system waits until the data are consistent
  – AP -> the data are allowed to be inconsistent every now and then

NoSQL
[Figure: visual guide to NoSQL systems, http://blog.nahurst.com/visual-guide-to-nosql-systems]

NoSQL models
• Key-Value Stores
• Column Family Stores
• Document Databases
• Graph Databases
• RDF databases as well as Tuple stores

Key value
• Dynamo, Voldemort, Rhino DHT ...
  – DeCandia et al. "Dynamo: Amazon's Highly Available Key-value Store", 2007
• Key-value stores are hash tables in which a key points to a particular value
• The key-to-value mapping is backed by hashing mechanisms to maximize performance

Column
• BigTable, Cassandra, HBase, Hadoop ...
  – Chang et al. "Bigtable: A Distributed Storage System for Structured Data", 2006
• They store very large amounts of data
  – "Petabytes of data across hundreds of servers"
• The key points to multiple columns

Graph DB
• Neo4J, FlockDB, GraphBase, InfoGrid, ...
• Graph databases are built from nodes and relationships between nodes (edges).
• Nodes have properties
  – Nodes represent entities (e.g. "Bob" or "Alice").
  – Properties are pieces of information attached to nodes (e.g. age: 18).
• Edges connect nodes to nodes or nodes to properties
• Graph DBs do not scale well

Document Based
• CouchDB, MongoDB, Lotus Notes, Redis ...
• Documents are addressed in the DB through a unique key
• Semi-structured model such as JSON or XML
• Besides lookup by key, documents can also be searched by their content

MongoDB
[Figure: the same record in an RDBMS and in MongoDB]
{
  _id: ObjectId("4c4ba5e5e8aabf3"),
  employee_name: "Dunham, Justin",
  department: "Marketing",
  title: "Product Manager, Web",
  report_up: "Neray, Graham",
  pay_band: "C",
  benefits: [
    { type: "Health", plan: "PPO Plus" },
    { type: "Dental", plan: "Standard" }
  ]
}

Enterprise Big Data Stack
[Diagram labels: Applications (CRM, ERP, Collaboration, Mobile, BI); Data Management (Online Data, Offline Data; RDBMS, Hadoop, EDW); Infrastructure (OS & Virtualization, Compute, Storage, Network); Security & Auditing; Management & Monitoring]

Data model
[Figure: Relational vs. MongoDB]
{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: [45.123, 47.232],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}

Document
• N-dimensional storage
• Each field can contain 0, 1, many, or embedded values
• Query on any field & level
• Flexible schema
• Inline updates *
• Embedding related data has optimal data locality, requires fewer indexes, has better performance

Referencing
Contacts
{
  "_id": ,
  "name": ,
  "title": ,
  "company": ,
  "phone": ,
  "address_id":
}
Addresses
{
  "_id": ,
  "street": ,
  "city": ,
  "state": ,
  "zip_code": ,
  "country":
}

Embedding
Contacts
{
  "_id": ,
  "name": ,
  "title": ,
  "company": ,
  "address": {
    "street": ,
    "city": ,
    "state": ,
    "zip_code": ,
    "country":
  },
  "phone":
}

Relational Schema
Contact
• name
• company
• title
• phone
Address
• street
• city
• state
• zip_code

Document Schema
Contact
• name
• company
• address
  – street
  – city
  – state
  – zip_code
• title
• phone

How are they different? Why?
[Figure: the relational schema (Contact, Address) next to the document schema (Contact with embedded address)]

Schema Flexibility
{
  "name": ,
  "title": ,
  "company": ,
  "address": {
    "street": ,
    "city": ,
    "state": ,
    "zip_code":
  },
  "phone":
}
{
  "name": ,
  "url": ,
  "title": ,
  "company": ,
  "email": ,
  "address": {
    "street": ,
    "city": ,
    "state": ,
    "zip_code":
  },
  "phone": ,
  "fax":
}

MongoDB
• One of the document-based DBs
• Data is stored as BSON
• Indexes for data access
• Does not perform joins

MongoDB
• Scalability
• Three types of sharding: hash-based, range-based, tag-aware
• Capacity can grow or shrink in a pay-as-you-go fashion
• Automatic balancing

Query Routing
• Multiple query optimization models
• Each sharding option appropriate for different apps

Availability Considerations
• High Availability
  – Ensure application availability during many types of failures
• Disaster Recovery
  – Address the RTO and RPO goals for business continuity
• Maintenance
  – Perform upgrades and other maintenance operations with no application downtime

Replica Sets
• Replica Set
  – two or more copies
• "Self-healing" shard
• Addresses many concerns:
  – High Availability
  – Disaster Recovery
  – Maintenance

Single Data Center
[Diagram: shards A, B and C, each with one primary and two secondaries]
• Automated failover
• Tolerates server failures
• Tolerates rack failures
• Number of replicas defines failure tolerance

Active/Standby Data Center
[Diagram: primaries and secondaries of A, B and C spread over Data Center - West and Data Center - East]
• Tolerates server and rack failure
• Standby data center

Active/Active Data Center
[Diagram: primaries, secondaries and arbiters of A, B and C spread over Data Center - West, Data Center - Central and Data Center - East]
• Tolerates server, rack, data center failures, network partitions

Global Data Distribution
[Diagram: one primary with real-time replication to several secondaries]

Read Global/Write Local
[Diagram: three replica sets with primaries in LON, NYC and SYD, each with secondaries in the other two locations]

Scaling Data
[Figure]

Working Set Exceeds Physical Memory
[Figure]

Partitioning
• User defines shard key
• Shard key defines range of data
• Key space is like points on a line
• Range is a segment of that line

Data Distribution
• Initially 1 chunk
• Default max chunk size: 64 MB
• MongoDB automatically splits & migrates chunks when max reached

Routing and Balancing
• Queries routed to specific shards
• MongoDB balances cluster
• MongoDB migrates data to new nodes

MongoDB Auto-Sharding
• Minimal effort required
  – Same interface as single mongod
• Two steps
  – Enable Sharding for a database
  – Shard collection within database

Architecture
[Figure]

What is a Shard?
• Shard is a node of the cluster
• Shard can be a single mongod or a replica set

Meta Data Storage
• Config Server
  – Stores cluster chunk ranges and locations
  – Can have only 1 or 3 (production must have 3)
  – Not a replica set

Routing and Managing Data
• Mongos
  – Acts as a router / balancer
  – No local data (persists to config database)
  – Can have 1 or many
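
Example (sketch)
The Partitioning, Data Distribution and Routing slides above can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not how MongoDB is implemented: the names Chunk and ToyRouter are hypothetical, the maximum chunk size is counted in keys rather than in MB, keys are assumed to be distinct, and "migration" simply means assigning the new chunk to the least loaded shard.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass(eq=False)
class Chunk:
    """A segment [low, high) of the shard-key line, stored on one shard."""
    low: float
    high: float
    shard: str
    keys: list = field(default_factory=list)


class ToyRouter:
    """Hypothetical stand-in for a query router; not the real mongos."""

    def __init__(self, shards, max_keys_per_chunk=4):
        self.shards = list(shards)
        # Assumption: chunk size is measured in number of keys, not in MB.
        self.max_keys = max_keys_per_chunk
        # Initially a single chunk covers the whole key space.
        self.chunks = [Chunk(float("-inf"), float("inf"), self.shards[0])]

    def find_chunk(self, key):
        # Chunks stay sorted by lower bound, so a binary search finds the range.
        lows = [c.low for c in self.chunks]
        return self.chunks[bisect_right(lows, key) - 1]

    def route(self, key):
        """Return the name of the shard that owns this shard-key value."""
        return self.find_chunk(key).shard

    def insert(self, key):
        chunk = self.find_chunk(key)
        chunk.keys.append(key)
        if len(chunk.keys) > self.max_keys:
            self._split(chunk)

    def _least_loaded_shard(self):
        load = {s: 0 for s in self.shards}
        for c in self.chunks:
            load[c.shard] += len(c.keys)
        return min(self.shards, key=lambda s: load[s])

    def _split(self, chunk):
        # Split at the median key and place the upper half on the
        # least loaded shard (a crude model of split + migrate).
        chunk.keys.sort()
        mid = chunk.keys[len(chunk.keys) // 2]
        upper = Chunk(mid, chunk.high, self._least_loaded_shard(),
                      [k for k in chunk.keys if k >= mid])
        chunk.keys = [k for k in chunk.keys if k < mid]
        chunk.high = mid
        self.chunks.insert(self.chunks.index(chunk) + 1, upper)


router = ToyRouter(["shardA", "shardB"], max_keys_per_chunk=4)
for user_id in range(1, 11):
    router.insert(user_id)
for c in router.chunks:
    print(c.low, c.high, c.shard, c.keys)
```

In this model the client only ever calls insert and route, which mirrors the "same interface as single mongod" point: where a key lives is decided entirely by the chunk ranges held by the router.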