A Few Words About 'NoSQL' and Other Unstructured Databases
No one likes this term. Attempting to describe something by what it isnt typically doesnt work and, to make matters worse, this is about data-store relationships and not about SQL at all. Yet NoSQL databases have significant advantages, including:
- Seemingly infinite scalability (Facebook is using Cassandra to store and query 50TB of user inbox data).
- Extraordinary fault tolerance.
- High availability.
- A design-friendly lack of schema.
- Integration of both RESTful and cloud computing technologies.
Disadvantages revolve around a basic fact: These are not relational databases built to rapidly process transactions, perform error checking, and maintain data integrity.
For the past 30 or 40 years, database design has focused on adding more controls and scaling vertically. These are two great things for transaction-heavy environments. But times have changed, and some enterprises no longer require expensive, highly redundant hardware on the data storage, server and network. Its become less expensive computationally to accept that failures cant be completely prevented.
Todays large-scale databases are designed to dynamically repair node failures by partitioning and replicating data across clusters. Partitioning the data not only minimizes the impact of any single hardware failure but also distributes the load of database operations. Non-relational databases typically have the ability to maintain multiple hot copies of data. Nodes can fail or be added and replications compiled and moved on the fly. Some NoSQL databases are flexible enough to allow for control over which objects are stored on which replicas to improve performance and scalability.
As the term NoSQL implies, not all non-relational databases support SQL queries. In fact, there are significant differences between how SQL queries are handled between different products. At a minimum, they all offer simple key-value matching (as in a hash table). Typically, stored datas key-value attributes can be queried directly and the document returned. For NoSQL databases that do not fully support SQL, some programming is required to convert a SQL query into something that will run against the data store. The ease of this programming and a data stores support for SQL are gating factors in adoption since SQL is a core business IT skill in many enterprises. The degree to which countless SQL admins can be brought to bear in the new world of non-relational databases will ease implementation and increase adoption.
Matt's short list:
SimpleDB is a key component of Amazons Cloud computing offering, along with Elastic Compute Cloud (EC2) and Simple Storage Service (S3). SimpleDB is a mature data store with the goal of simplicity. SimpleDB supports eventual consistency via asynchronous replication. Replication is read-only and there are no auto-sharding features. SimpleDB organizes documents into domains that contain their own indices and metadata. Domains may be stored on different Amazon nodes.
Cassandra, written in Java and available under Apache licensing, uses column groups for partitioning and replication. Updates are cached in memory and then flushed to disk, where the files must be periodically compacted. Failure detection and recovery are fully automated. Cluster membership is managed via a gossip-style algorithm. Cassandra provides eventual consistency. There is also some support for versioning and conflict resolution.
The database can track intersections and unions between enormous numbers of data objects. Cassandra saves query time by denormalizing the data before it is stored and precomputing join operations between records.
CouchDB is an Apache project written in Erlang. According to Apache, CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. There are libraries for different languages such as Java, C, PHP, and Python, that convert native API calls into RESTful calls. Data is stored in documents", which are essentially key-value maps themselves, using JSON data types. Documents are grouped into collections where schema reside.
CouchDB scales through asynchronous replication but lacks an auto-sharding mechanism. Reads are distributed to any server while writes must be propagated to all servers. CoucheDB does not guarantee consistency, although it does implement MVCC on individual documents. If someone else has updated the document since it was fetched, then CouchDB relies on the application to resolve versioning issues. CouchDB has limited transaction processing functionality. All document and index updates are flushed to disk on commit.
MongoDB, a GPL open source document store, is one of the most feature-rich of the non-relational NoSQL databases. Written in C++ and sponsored by 10gen, it provides indices on collections and provides a sophisticated document query mechanism. SQL support is excellent and dynamic queries are supported by automatic indexing.
MongoDB supports automatic sharding. Master-slave replication is used mostly for failover, not for scalability. MongoDB maintains eventual consistency and global consistency with a current local primary copy of a document that then replicates throughout the data store nodes.
MongoDB stores data in a binary format called BSON, which supports Boolean, integer, float, date, string and binary types. Client drivers encode the document data structure into BSON and send it via socket connection to the MongoDB server. MongoDB supports a GridFS specification for large binary objects such as images and videos that are stored in chunks and can be streamed to clients. MongoDB supports map-reduce to aggregate queries across documents.
Riak is an open-source Erlang based project that uses a RESTful client interface. Objects can be fetched and stored in JSON and can have multiple fields but the document store itself cannot be queried. The only lookup possible is on the primary key.
Riak supports object replication and sharding by hashing on the primary key. Replica values are eventually consistent. Riak relies on vector clocks for version control and includes functionality for reconstituting out-of-synch data.
Architecture is simple and symmetric, relying on consistent hashing to distribute data throughout a ring of nodes. Shards are distributed around virtual and physical nodes. There is no master node to track system status. All of the nodes use a gossip protocol to track node status and data location. Any node can service a request from any client, plus a map/reduce mechanism splits work across nodes.
Riaks storage can be in memory or on disk or a combination of the two. This flexibility allows for commonly accessed key-value pairs to reside in memory while the rest of the data is on disk.
An important feature that sets Riak apart from others is that it can store links between documents. For example, documents about authors can be linked directly to documents about their books without the need for secondary indices.
Tokyo Cabinet/Tokyo Tyrant
Part of the larger Tokyo Product, these represent C libraries. The front end is Tokyo Tyrant, the multi-threaded back end server is Tokyo Cabinet. The Tokyo Cabinet library creates a key-value store with language bindings for Java, Ruby, PERL, and more. Tokyo Cabinet is an extremely fast embedded database.
Tokyo Tyrant supports get, set, and update operations; asynchronous replication with master/slave or dual master; and record locking, ACID transactions, binary array data types, and complex update operations. Tokyo Tyrant manages Tokyo Cabinets three network interfaces: the binary protocol, HTTP for RESTful communications, and Memcached.
Closed solutions that require heavy customization may ultimately constrain data stores. Evaluate non-relational database solutions thoroughly and pilot test before product rollout. Design solutions around architectures that scale based on the type of data to be housed and its requirements. Planning requires an understanding of key differences between centralized RDBMS system design and distributed non-relational system design.
SQL RDBMS transaction processing is not going to disappear. Traditional database design principles still hold true transactional integrity and immediate consistency are required. However, where horizontal scaling to millions of concurrent users is a requirement, non-relational or NoSQL databases warrant serious consideration.
Matt Sarrel is executive director of Sarrel Group, a technology product test lab, editorial services and consulting practice specializing in gathering and leveraging competitive intelligence. He has over 20 years of experience in IT and focuses on high-speed large scale networking, information security, and enterprise storage. E-mail firstname.lastname@example.org, Twitter: @msarrel.