Musings on NoSQL

nosql Musings on NoSQL
Following my article on high performance at massive scale, I’ve started to get really interested in the type of distributed databases the big web players are using to handle current and future volume of data. Some of the products developed in my current organization have to cope with large amount of data (personal, financial, marketing, etc.) with an increasing need to aggregate and link this data to get the most complete picture about individuals and businesses. This is especially true in credit bureau, fraud detection, Marketing, customer management etc.

As the web is becoming more ‘social’, we’ve seen a lot of change in the data space:

  • Data volume is becoming larger and larger: Google manipulate 20 petabyte of data every day. Facebook handles 20 petabyte for their 10 billions photos !
  • Data is getting more and more connected due to its social trend. It means a huge number of joins (I don’t have to remind you that ‘joins’ is the ennemy in the DB performance world)
  • Data is becoming less structured.
  • Need for scalability and fault tolerance is exploding. Geographical distribution is also key if you want to be global.

While relational databases bring a lot of benefit especially around Atomicity, consistency, isolation and durability (ACID), is well proven and the industry is very familiar with it (programers, DBA), they unfortunately hit a wall when dealing with Web 2.0 requirement. Their schema is rigid (lack of flexibility), joins are really slow, it’s difficult to distribute the data across nodes, the optimization you can introduce break the benefit of normalization ie. data integrity and downtime is not acceptable. But the major issue today is scalability. RDBMS scale reasonably well vertically but there is so much you can do with one huge system. If you want to meet the social web requirement, you will have to scale horizontally. It does offer flexibility but is definitely more complex: You need to group your data by function and spread your functional groups across databases. You then split your data within functional areas across multiple database (sharding).

It’s been demonstrated by Eric Brewer, who is a professor at the university of California that distributed system can have only 2 of the following characteristics:

  • Consistency: Perception from the users that a set of operations occurs all at once.
  • Availability. All operation must be performed with an appropriate response time.
  • Partition tolerance. All operation must complete, even if one component/node is down/broken.

These characteristics are known as the CAP theorem. Since horizontal scaling is based on data partitioning, there is a trade-off remaining between consistency and availability. ACID database transaction doesn’t allow this tradeoff but non-relational databases address this limitation by introducing a new principle: BASE which trades some amount of consistency for availability.

  • Basically Available: Appears to work all the time.
  • Soft state: It doesn’t have to be consistent all the time.
  • Eventually consistent: At some stage it will reach consistency !

Don’t you love it? While ACID is pessimistic and forces consistency for all operation, BASE has an optimistic view and assumes that inconsistent operation will occur (hell, shit happens !) but will reach a consistent state at some point. It seems a bit loose and difficult to manage but this is why non-relational database come to the rescue to implement smart consistency pattern and help you reach scalability you couldn’t dream about a few years ago !

There are today 4 main trends in the non-relational database world which dominate the space:

Key-Value databases
Entries are stored as key-value pairs in large hash tables. Domains (possible values of an attribute) are similar to those found in table but no specific schema is defined. Keys are arbitrary while values are blobs. There are no explicit relationships between domains. You access keys and values through API (SOAP, RESTful). Integrity is guarantee by the application itself.

Major open-source and commercial Key-values databases:

  • Dynamo (Amazon)
  • SimpleDB (Amazon web services). Written in Erlang !
  • Voldemort (LinkedIn)
  • Memcached: In  memory key-value store. All the major web players are using it: Facebook, Twitter, YouTube etc.

Column-oriented databases
Entries are stored by column versus row. It brings you a big performance uplift when you need to query many rows for smaller sets of data (not all columns) and it maximizes disk performance (read scans). It’s definitely not the right choice if you need to query all columns of a single row or need to write a new row with all column data supplied.
columnorienteddatabase Musings on NoSQL
Major open-source and commercial column oriented databases:

Document databases
couchdb Musings on NoSQL
This was inspired by Lotus Notes and very similar to key-values stores. Each DB record is stored as a document (JSON). DB is schema-less and highly denormalized.

Major open-source and commercial column oriented databases:

Graph databases
In this model, entities are stored as nodes and edges. Nodes represents entities while edges represent relationships. It’s basically a key-value store with full support for relationship.
 Musings on NoSQL
Major open-source and commercial column oriented databases:

the whole noSQL (ie. Not Only SQL)  hype is picking up a lot of steam right now with the acceleration of the social web. The big relational DB players are already playing with it (IBM’s M2 corrals massive data sets with Hadoop). What will happen to your favorite RDBMS? Should they leverage Memcached or JBoss cache to get an uplift in scalability and performance while relying on their RDBMS engine to maintain ACID properties? is this the right balance if you don’t have Facebook or YouTube performance requirement?

If you’re interested to read more about noSQL (I know I’m hungry for more !), I definitely recommend the following read:

 Musings on NoSQL

2 thoughts on “Musings on NoSQL

Comments are closed.