Thursday, October 27, 2011

This Is Not the DBMS You Are Looking For.

NoSQL databases such as CouchDB can be great tools for some jobs. Their concept of "denormalization" shouldn't be thought of as a rebellion against repressive traditional SQL data structures, but as a formalization of the usual workarounds in the special cases where SQL sucks. If you are trying to represent nested objects with possibly unknown fields, SQL is a nightmare. Before, you might have given up, cursing Boyce and Codd, and rolled your own object store (store blobs in fields, or [gasp] a text file). Now, CouchDB gives you a way to store your complicated data and query it efficiently.

That efficiency comes with a price, however. The problem, and the reason CouchDB won't ever be useful for anything but playing around and special-purpose internal storage for applications, is security, which basically can't be done. A user can see an entire database or none of it, a fact fundamental to the design of NoSQL document stores like CouchDB, and one many a n00b on Stack Overflow seems unwilling to accept.

The reason is simple. CouchDB views are efficient because it precomputes the answer, sorted by the view keys. A request grabs a contiguous block of memory, and the server doesn't have to do anything. Incidentally, this is why you cannot filter by one thing and sort by another. The server is lazy (and hence fast) -- it gets to "relax" while putting the work of sorting, etc., on you. The problem is that the server can't do any reorganization of the rows it returns, including authentication to validate which rows it should. That would be work, and would ruin the magical storage paradigm and the performance. Since you can't filter rows by user, any user who can read the database can read the entire thing -- they can use the /_all_docs view, they can scan document IDs, etc. You can stop users from writing by implementing document update validation, but everyone reads everything.

This means you cannot do per-document authentication with lists, shows, or views. You cannot rely on  a "valid_user" key emitted in a view. Attempting to use URL rewriting to hide parts of the HTTP API is asking for trouble. You can only set per-database security, with so-called "readers" who may in general read and write everything but design documents, and "admins" who can read and write everything.

This is fine if you trust all your users and have nothing to hide, but CouchDB likes to posture as a real web framework for the real internet. It has a web server, lists and shows can effectively be used for templating, it has module support, user authentication, support for running middleware daemons, and more. CouchApps --  web applications contained in a database's design documents -- are a thing.

These are only practical in totally controlled situations: namely, localhost. You can't really deploy them anywhere. Even the user authentication that is built into CouchDB is a red herring -- it is fundamentally flawed because it is implemented at the server level and not the database level. You can't write Twitter as a CouchApp. You can make a blog, but you can't make a Blogger. Nor a Facebook, nor a chat server, nor much of anything else.

Consider the chat server, since "Toast" is a famous example described at this page with an ugly URL: If you really wanted to deploy Toast, you would quickly find yourself scratching your head, thinking that there surely must be a way make two separate chatrooms, readable only by the participants, without making a separate database for each. There is not a way. The "official" solution is to create a separate database for each user and copy the chat content into each user's db with filtered replication. This doesn't exactly bespeak scalability. Despite Couchbase co-founder J. Chris Anderson's 1999-esque claims that "disk is free," (few bytes * n_tweets * n_users!) is not free if you're Twitter-scale.

With CouchDB, the only way to implement per-document security is in middleware -- you can add security objects to your documents (like those at the DB level), intercept requests to the CouchDB server, and filter CouchDB's responses there. There are a few problems with this approach:

  1. It sacrifices all the performance of CouchDB, since you have to iterate over all the rows.
  2. You can't do anything about list functions. Lists operate on view results, and it's all done internally -- you can't intercept the view rows before they get to the list.
  3. URL rewrites must be handled upstream of middleware -- it has to know what it's requesting
The right answer is for NoSQL document-based database systems to recognize real-world problems and evolve. The DBMS of the future should do the above-described middleware's function internally. It may not be as fast as CouchDB, but at least it would be usable.