MongoDB and CouchDB: vastly different queries

Both MongoDB and CouchDB are document-oriented datastores. They both work with JSON documents. They both are usually thrown into the NoSQL bucket. They’re both hip. But that’s where the similarities, for the most part, stop.

When it comes to queries, the two couldn’t be more different. CouchDB requires pre-defined views (which are essentially JavaScript MapReduce functions), whereas MongoDB supports dynamic queries (basically what you’re used to with ad-hoc SQL queries in a normal RDBMS). What’s more, CouchDB’s API is RESTful, while MongoDB’s API is more native: you essentially issue a query using a driver in the language of your choice.

For example, with CouchDB, in order to insert some data, I can use a tool like Groovy’s RESTClient and issue a RESTful PUT like so:

import static groovyx.net.http.ContentType.JSON
import groovyx.net.http.RESTClient

// 5984 is CouchDB's default port; adjust if your instance listens elsewhere
def client = new RESTClient("http://localhost:5984/")
response = client.put(path: "parking_tickets/1234334325",
  contentType: JSON,
  requestContentType:  JSON,
  body: [officer: "Robert Grey",
         location: "199 Castle Dr",
         vehicle_plate: "New York 77777",
         offense: "Parked in no parking zone",
         date: "2010/07/31"])

Note that, in this case, I have to specify an ID for this parking ticket (1234334325); incidentally, I can also ask CouchDB for a UUID by issuing an HTTP GET to the /_uuids path.

If I wish to find all tickets issued by Officer Grey, for example, I must define a view. Views are simply URLs that execute JavaScript MapReduce functions. Accordingly, I can quickly code a function to grab any document whose officer property is “Robert Grey” like so:

function(doc) {
  if(doc.officer == "Robert Grey"){
    emit(null, doc);
  }
}

I have to give this view a name; consequently, when I issue an HTTP GET request to that view’s name, I can expect at least one document:

response = client.get(path: "parking_tickets/_view/by_name/officer_grey",
        contentType: JSON, requestContentType: JSON)

assert response.data.total_rows == 1
response.data.rows.each{
   assert it.value.officer == "Robert Grey"
}
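To make the map step concrete, here is a toy, in-memory sketch in Java of what a view like the one above does. It is not CouchDB's API: the ViewSketch class, its Row type, and the officerGreyView method are hypothetical names used only for illustration, documents are modeled as plain maps, and the second ticket is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: models a CouchDB-style map function over in-memory documents.
final class ViewSketch {

    // One emitted row: a key/value pair, like emit(key, value) in the JavaScript view.
    static final class Row {
        final Object key;
        final Map<String, Object> value;

        Row(Object key, Map<String, Object> value) {
            this.key = key;
            this.value = value;
        }
    }

    // The "view": a fixed function, defined ahead of time, applied to every document.
    static List<Row> officerGreyView(List<Map<String, Object>> docs) {
        List<Row> rows = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            if ("Robert Grey".equals(doc.get("officer"))) {
                rows.add(new Row(null, doc)); // mirrors emit(null, doc)
            }
        }
        return rows;
    }

    // Builds a minimal ticket document with just two properties.
    static Map<String, Object> ticket(String officer, String location) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("officer", officer);
        doc.put("location", location);
        return doc;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(ticket("Robert Grey", "199 Castle Dr"));
        docs.add(ticket("Jane Doe", "12 Main St")); // hypothetical second ticket

        for (Row row : officerGreyView(docs)) {
            System.out.println(row.value);
        }
    }
}
```

The point of the sketch is that the filter is baked into the view itself: asking a different question means writing and naming another function first.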

In summary, with CouchDB, I can’t quickly issue an ad-hoc RESTful call to obtain some bit of information; I must first define a query (aka a view) and then expose it to the outside world. In contrast, MongoDB works much like you’ve been used to with normal databases: you can query for whatever your heart desires at runtime.

For example, I can add the same instance of a parking ticket using MongoDB’s native Java driver (there are better options for working with MongoDB, by the way) like so:

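import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

// db is assumed to be a com.mongodb.DB instance obtained from your MongoDB connection
DBCollection coll = db.getCollection("parking_tickets");
BasicDBObject doc = new BasicDBObject();

doc.put("officer", "Robert Grey");
doc.put("location", "199 Castle Dr");
doc.put("vehicle_plate", "New York 77777");
//...
coll.insert(doc);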

I can subsequently find any ticket issued by Officer Robert Grey by simply issuing a query on the officer property like so:

BasicDBObject query = new BasicDBObject();
query.put("officer", "Robert Grey");
DBCursor cur = coll.find(query);
while (cur.hasNext()) {
  System.out.println(cur.next());
}

Thus, while both document-oriented datastores have some similarities, when it comes to querying, they are vastly different. CouchDB requires the usage of MapReduce, while MongoDB allows for more dynamically oriented queries (MongoDB also supports MapReduce). Can you dig it?

Concrete concurrency werewolves

Cédric Beust has an interesting blog post entitled “Clojure, concurrency and silver bullets” where he takes issue with the notion that Clojure can yield code that

is multithread safe and it will automatically scale.

Cédric goes on to state that the concurrency problem doesn’t need a new language as

hundreds of thousands of lines written in C, C++, C#, Java and who knows what other non functional programming languages are running concurrently, and they are doing just fine

In fact, Cédric is quick to point out that Java already has libraries (in the form of java.util.concurrent) that make concurrent coding easier, and I don’t disagree with him. What’s more, he goes on to point out that Actors aren’t the be-all and end-all of concurrent programming; he even points to an excellent post on Stephan Schmidt’s blog entitled “Actor Myths”, which is loaded with a fruitful discussion worthy of a close read.
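As a small illustration of the kind of help java.util.concurrent provides, the sketch below splits a sum across a fixed thread pool using ExecutorService and Future. The ParallelSum class and its chunking scheme are assumptions made up for this example; they don't come from Cédric's post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelSum {

    // Sums 1..n by handing contiguous ranges to a fixed pool of worker threads.
    static long sum(int n, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (n + workers - 1) / workers; // ceiling division
            for (int start = 1; start <= n; start += chunk) {
                final int from = start;
                final int to = Math.min(n, start + chunk - 1);
                Callable<Long> task = () -> {
                    long s = 0;
                    for (int i = from; i <= to; i++) {
                        s += i;
                    }
                    return s;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that range is done
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while summing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(1000, 4)); // 1 + 2 + ... + 1000 = 500500
    }
}
```

Notice there isn't a single lock or synchronized block in sight: each task works on its own range and the only hand-off is through the Future.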

I tend to agree with both Cédric and Stephan — there are no silver bullets which will kill the concurrency werewolf. Yet, I’d like to point out a few things regarding concurrency and specifically actors that might shed some light on why people are espousing something like Clojure and why the Actor Model has gained some mind share.

First, as Herb Sutter noted in his article entitled “The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software,” obtaining an appreciable speedup in application performance requires taking advantage of multi-core chip architectures, which, for myriad applications running today, isn’t happening. That is, when these applications were written, concurrency wasn’t necessarily tackled because, let’s face it, for the average Joe, thread programming can be difficult to get right.

Accordingly, I suspect that the “thousands of lines written in C, C++, C#, Java and who knows what other non functional programming languages [that] are running concurrently” today were written that way on purpose. These programs were written with threads by smart people. Yet, I’m willing to bet that even those programs have subtle bugs that might not have shown up yet.

These applications (and their authors) aren’t going to see things scale up. That is, throwing better chips or more memory at them, like we could do in the past, won’t yield a performance increase. These applications will instead need to start running on multi-core chips, where they can begin to take advantage of parallelism (if written correctly to use it!).

Second, threaded programs aren’t terribly difficult to write (no one disputes that); what’s difficult is getting them written correctly. Let’s face it, most testing strategies today rely on deterministic behavior: “given foo, then bar should be 23”. But threads, and those nefarious bugs that creep in when shared state and mutability butt heads, add inconsistency to this mindset. All of a sudden, the phrase “given foo, then bar should be 23” doesn’t always hold true. Sometimes bar is 89, and other times things blow up or, worse, lock up and bar is doomed.
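Here is a minimal sketch of how “bar should be 23” stops holding. Two threads increment a plain shared int and an AtomicInteger side by side; the plain counter may lose updates, while the atomic one can't. The CounterRace class and its field names are hypothetical and exist only for this demonstration.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CounterRace {

    static int unsafeCount = 0;                                 // shared, mutable, unsynchronized
    static final AtomicInteger safeCount = new AtomicInteger(); // atomic read-modify-write

    // Runs two threads that each increment both counters perThread times.
    static void run(int perThread) {
        unsafeCount = 0;
        safeCount.set(0);
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                unsafeCount++;               // not atomic: updates can be lost
                safeCount.incrementAndGet(); // atomic: never loses an update
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
    }

    public static void main(String[] args) {
        run(100_000);
        // The unsafe total varies from run to run and may come up short of 200000.
        System.out.println("unsafe: " + unsafeCount);
        System.out.println("safe:   " + safeCount.get());
    }
}
```

Run it a few times: the safe total is the same on every run, whereas the unsafe one is exactly the kind of value a deterministic test can't pin down.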

Thus, people have started evaluating alternate ways to leverage threads without actually using threads directly because they can be hard to use correctly. If you haven’t read Edward A. Lee’s paper “The Problem with Threads” then go read it now. Mr. Lee does an excellent job of pointing out that our programming model based upon threads doesn’t

vaguely resemble the concurrency of the physical world

Yet, he makes a subtle statement regarding potential solutions that I’m sure Cédric would appreciate:

We should not replace established languages. We should instead build on them.

For some developers the java.util.concurrent library will be good enough, but others will prefer the Actor Model, which essentially hides locks and synchronized blocks and, rather than sharing variables in memory, leverages a mailbox that effectively separates distinct processes from each other. And as it turns out, you can start using Actors in Java quite easily via a number of libraries.
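To give a feel for the mailbox idea without pulling in any of those libraries, here is a hand-rolled sketch built on a BlockingQueue. It is not how any particular actor library is implemented; the CounterActor class, its STOP sentinel, and its method names are all hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy actor: private state that is touched only by the actor's own thread.
final class CounterActor {

    private static final int STOP = Integer.MIN_VALUE; // hypothetical sentinel message

    private final BlockingQueue<Integer> mailbox = new LinkedBlockingQueue<>();
    private final Thread worker;
    private int total = 0; // never shared while the actor is running

    CounterActor() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    int message = mailbox.take(); // blocks until a message arrives
                    if (message == STOP) {
                        return;
                    }
                    total += message;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
    }

    // Senders never touch total; they only drop messages in the mailbox.
    void send(int amount) {
        mailbox.add(amount);
    }

    // Asks the actor to stop, waits for it to finish, then reads the final state.
    int stopAndGet() {
        mailbox.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return total;
    }

    public static void main(String[] args) {
        CounterActor actor = new CounterActor();
        for (int i = 1; i <= 10; i++) {
            actor.send(i);
        }
        System.out.println(actor.stopAndGet()); // 1 + 2 + ... + 10 = 55
    }
}
```

The queue does the locking internally, so the code that holds the state reads like ordinary single-threaded code, which is a big part of the model's appeal.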

Finally, I suspect that for some people, Clojure’s (and, by relation, functional programming’s) manifestation of concurrent programming is easier to grasp because it resembles the “concurrency of the physical world.” For me, Actors embody concurrency in a concrete manner: I can visualize a solution more easily.

There are no silver bullets in software development. Thus, the concurrency werewolf won’t be slain easily; however, the options available to subdue the beast are manifold and those options that provide a concrete model of parallel programming, in my opinion, will be more successful than those that don’t.

Stu Halloway on Clojure

I recently had the opportunity to chat with Stu Halloway (the author of “Programming Clojure” and the CTO and co-founder of Relevance) about, as you can probably guess, Clojure.

Briefly, Clojure is a “dialect of Lisp” and “predominantly a functional programming language” and thus, has a lot of smart people excited. As Stu himself states in the podcast, Clojure “unleashes the power of the JVM” and (in my interpretation of his words) allows a singular focus on solving a problem. That is, Clojure facilitates expressing the essence of a solution with elegant and maintainable code.

I must admit, I’ve been a bit of a skeptic of Lispy languages. I guess the fact that I had to learn and program some Lisp for a CS course in college has left a veritable scar on my conscience. You see, back then, C++ and this up and coming slow language for the web, dubbed Java, were “hot” and Lisp wasn’t even on the map of “cool” (at least for the people and companies I was hanging out with). Stu and the surrounding community’s excitement and passion for Clojure, however, has me re-engaging Lisp. I’ve even been reading Stu’s book!

If you’re curious about Clojure, I highly recommend listening to Stu. He’s a super interesting person, and his opinions on object-oriented programming, patterns, and languages in general are well worth hearing.

High performance SimpleDB

Sid Anand, who writes the Practical Cloud Computing blog, has a series of posts entitled “SimpleDB Essentials for High Performance Users” in which he outlines a set of best practices and conventions for effectively leveraging SimpleDB. If you are using SimpleDB or are planning to, I highly recommend reading his points as they are super hip.

In particular, he advocates a form of sharding. That is, rather than putting all data into one SimpleDB domain, he recommends splitting the data across several smaller domains so as to increase throughput. This makes a lot of sense; what’s more, sharding in this case isn’t terribly dangerous, as SimpleDB doesn’t support cross-domain queries to begin with and id management is up to an application anyway. Lastly, there are limits to the amount of data you can store in a domain; thus, sharding can facilitate growth nicely.
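A minimal sketch of what such domain sharding could look like at the application level follows. It simply hashes an item's ID to pick one of N domain names; the DomainSharder class, the "tickets" base name, and the naming scheme are assumptions for illustration and aren't taken from Sid's posts or from Amazon's SDK.

```java
final class DomainSharder {

    private final String baseName;
    private final int domainCount;

    DomainSharder(String baseName, int domainCount) {
        if (domainCount <= 0) {
            throw new IllegalArgumentException("domainCount must be positive");
        }
        this.baseName = baseName;
        this.domainCount = domainCount;
    }

    // Maps an item ID to one of the domains; the same ID always maps to the same domain.
    String domainFor(String itemId) {
        // floorMod keeps the bucket non-negative even when hashCode() is negative.
        int bucket = Math.floorMod(itemId.hashCode(), domainCount);
        return baseName + "_" + bucket;
    }

    public static void main(String[] args) {
        DomainSharder sharder = new DomainSharder("tickets", 4); // hypothetical names
        System.out.println(sharder.domainFor("1234334325"));
    }
}
```

One caveat with a scheme this simple: changing the number of domains later remaps most IDs, so the count is worth choosing with growth in mind.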

While not an entry in the aforementioned series, his article entitled “SimpleDB Performance: 5 Steps to Achieving High Write Throughput” is excellent too. Don’t forget to check out my two articles on SimpleDB.

Finally, I highly recommend reading Werner Vogels’ (the CTO of Amazon) “Eventually Consistent – Revisited” as it provides a base of knowledge for what’s behind SimpleDB.

Guidance on Git podcast

I’m excited to announce that IBM developerWorks has launched a new series of podcasts hosted by yours truly. These podcasts feature technical discussions with various (opinionated) luminaries on a diverse set of subjects ranging from Git to Clojure to Griffon and even .NET (just to name a few!).

The first podcast in the series is a discussion about Git with my friend, Matthew McCullough. I had the pleasure of attending a presentation Matthew gave on Git at an NFJS conference; I was thoroughly impressed with his passion and depth of knowledge regarding how to get started and use Git effectively.

I think you’ll find, as I did, his excitement regarding Git is infectious — if you don’t want to start using Git after listening to Matthew, then you probably never will! So what are you waiting for? Have a listen!

The next big JVM language?

There’s an interesting thread of comments related to a blog post by Stephen Colebourne, who is giving a talk at this year’s JavaOne entitled “Next Big JVM language.” In particular, he and others note that the Fantom language could be the answer (I find this interesting, as Fantom really wasn’t even on my radar until now). Moreover, many of the comments claim Scala to be the next big language. It seems people still prefer static typing over dynamic-ness. Either way, I got the distinct impression, based upon those individuals who left comments (who by no means reflect the community at large), that Groovy isn’t it.

Principally, the arguments against Groovy can be summarized as its lack of performance (compared to Scala, for instance). Not to be outdone, a few folks brought up Groovy++ (which attempts to add a bit of static-ness to Groovy, ostensibly to increase performance). Nevertheless, the comments are quite interesting to read, if for nothing else than the sense that Fantom is gaining mind share, perhaps at the cost of other more mainstream alternatives like Groovy.

Leveraging JPA with Amazon’s SimpleDB

Modeling domain objects for almost any type of application is a breeze using a relational framework like Grails, but what about SimpleDB? This article published by IBM developerWorks, entitled “Cloud storage with Amazon’s SimpleDB, Part 2,” shows you how to use SimpleJPA, rather than the Amazon SDK, to persist objects in SimpleDB’s cloud storage.

SimpleJPA automatically converts primitive types to the string objects that SimpleDB recognizes; what’s more, SimpleJPA also handles SimpleDB’s no-join rules for you automatically, making it easier to model relationships. The bottom line: SimpleJPA can help you access significant, inexpensive scalability quickly and easily.

Don’t forget to check out the previous article in this short series on SimpleDB, entitled “Cloud storage with Amazon’s SimpleDB, Part 1,” and while you are at it, check out my other articles related to NoSQL and the like!

Think twice before sharding

I recently saw that Grails supports sharding via a nifty plugin. Briefly, sharding (as defined in Wikipedia) is a method of horizontal partitioning

whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.

That is, sharding duplicates a database model (or schema) across multiple database servers (this is different than traditional partitioning, which is row based and usually on the same server). Consequently, one must then decide on what data to shard; for example, (as Wikipedia states) one may shard by customers — European customers live in shard one, US based customers in shard two, and so on. The obvious benefits are:

  • increased read performance (simply put, there is less data to scan in a table)
  • increased reliability (that is, if shard one from above goes down, conceivably, US based customers (in shard two) aren’t affected)

From a scalability standpoint, which has certainly become quite in vogue in the age of Aquarius with applications like Twitter, Facebook, Flickr and the like, sharding appears to be an easily justifiable approach. Yet, before you decide to shard, you’d be wise to consider just what you’re getting yourself into.

Most sharding implementations are at the application level — that is, the database itself doesn’t know you’ve decided to shard its data. Accordingly, primary keys can become problematic. If you leave it to the database to generate keys (i.e. sequences) then the possibility of two shards having the same primary key is real. Thus, you’ll most likely need to leave primary key generation to the application. This isn’t such a bad thing and probably not an issue for most apps. Keys are the easy issue to solve, unfortunately.
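For illustration, here are two common ways an application might generate keys that can't collide across shards: random UUIDs, and a per-shard counter prefixed with the shard number. This is only a sketch; the ShardKeys class and the shard numbering are hypothetical, and a real counter would have to be persisted rather than held in memory.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

final class ShardKeys {

    // Option 1: random UUIDs, which need no coordination between shards.
    static String uuidKey() {
        return UUID.randomUUID().toString();
    }

    // Option 2: a per-shard counter prefixed with the shard number,
    // so shard 1 and shard 2 can never produce the same key.
    private final int shardId;
    private final AtomicLong sequence = new AtomicLong(); // in-memory only in this sketch

    ShardKeys(int shardId) {
        this.shardId = shardId;
    }

    String nextKey() {
        return shardId + "-" + sequence.incrementAndGet();
    }

    public static void main(String[] args) {
        ShardKeys europe = new ShardKeys(1); // shard numbering is hypothetical
        ShardKeys us = new ShardKeys(2);
        System.out.println(europe.nextKey()); // 1-1
        System.out.println(us.nextKey());     // 2-1
        System.out.println(uuidKey());
    }
}
```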

Sharding works best when shards act in isolation. For instance, in keeping with Wikipedia’s example, so long as US customers don’t, in any way, relate to European customers, everything is hip. If, however, these two entities require joining, sharding becomes a dilemma: querying across shards is not easy. In fact, you must handle joining shards at the application level (because as stated earlier, the database doesn’t know about shards). And you had better have already decided to generate cross-shard unique keys at that level too.

As another example to further demonstrate this headache, the Grails sharding plugin is described with a simple application that stores users and comments. This problem domain works well so long as nothing ever changes; however, what if, in the future, this fictitious application needs to provide threadable comments (so as to form a conversation)? That is, comments would need parents (e.g. comment #24 is in reply to comment #2).

One way of easily modeling this requirement is to store the comment parent id in the comment child (i.e. a property on comment), yet already this application is now broken because initially ids are “not unique across shards.” Thus, if this were a real life application (I know it’s not and is used simply for demonstration purposes) a decision to shard early on has already locked the application into a data storage model that requires a lot of heavy lifting to evolve. Think about it: because ids aren’t guaranteed to be unique across shards, what if a customer in shard one leaves a comment in reply to a comment from a customer in shard two? How would an application efficiently link (i.e. join) the two in either a read or a write?
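One possible shape of an answer, sketched entirely in memory, is to make the parent reference carry both the shard and the id, and have the application perform the second lookup itself. Nothing here comes from the Grails plugin; the CrossShardComments class and its fields are hypothetical, and each inner map merely stands in for a separate database.

```java
import java.util.HashMap;
import java.util.Map;

final class CrossShardComments {

    // A comment whose parent may live in a different shard, so the reference
    // must carry both the shard number and the id within that shard.
    static final class Comment {
        final String text;
        final Integer parentShard; // null for a top-level comment
        final Long parentId;       // null for a top-level comment

        Comment(String text, Integer parentShard, Long parentId) {
            this.text = text;
            this.parentShard = parentShard;
            this.parentId = parentId;
        }
    }

    // shard number -> (comment id -> comment); each inner map stands in for one database.
    private final Map<Integer, Map<Long, Comment>> shards = new HashMap<>();

    private Map<Long, Comment> shard(int number) {
        return shards.computeIfAbsent(number, n -> new HashMap<>());
    }

    void save(int shardNumber, long id, Comment comment) {
        shard(shardNumber).put(id, comment);
    }

    // The "join": a second lookup, possibly against another shard, done by the application.
    Comment parentOf(int shardNumber, long id) {
        Comment child = shard(shardNumber).get(id);
        if (child == null || child.parentShard == null) {
            return null;
        }
        return shard(child.parentShard).get(child.parentId);
    }

    public static void main(String[] args) {
        CrossShardComments store = new CrossShardComments();
        store.save(2, 2L, new Comment("original comment", null, null));
        store.save(1, 24L, new Comment("a reply", 2, 2L));
        System.out.println(store.parentOf(1, 24L).text); // prints "original comment"
    }
}
```

Even in this toy form, every reply costs two lookups against two different stores, and that is before considering what happens when one of them is unavailable.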

I don’t dispute that the issue above can be solved; however, the resolution creates artificial complexity (which tends to yield more defects). Most importantly, chances are that any solution to this problem will negate the first benefit listed above (reads are supposed to be fast!).

Lastly, sharding creates additional work when it comes to data management. Operational maintenance of shards is a lot of work, especially if shards require bifurcation (that is, what if your US customer shard is so big you need to break it into east and west sub-shards?) and/or migrations.

This isn’t to say sharding is nefarious. There are scenarios where sharding can speed up application reads and writes; what’s more, sharding is certainly in use in varying domains. But, as my friend Tim Berglund dared to muse:

relational sharding is a smell

In plain English: sharding without considering the long term consequences is dangerous. Think twice before sharding, especially if you’re considering it before all else. In fact, if you do find yourself considering sharding at the outset, then perhaps you should be looking at a NoSQL alternative to the relational model. Because it’s their bag, both Twitter and Facebook did.

More GAE datastore resources

There’s an interesting interview with the creators of Twig, Objectify-Appengine, and SimpleDS, which are all ORM-like frameworks built for the GAE that facilitate working with the underlying datastore (an abstraction of Bigtable). If you haven’t worked with GAE, you need to know that the exposed hip datastore isn’t relational — it’s schema-less and more like a key/value store; consequently, the JDO features exposed by default tend to leave people a bit distressed (especially when it comes to relationships).

The questions asked and the answers these developers provide are quite helpful in understanding both the pains and the beauties of the Bigtable abstraction. And while these developers are obviously biased towards their respective framework and the details of the interview are focused on the datastore itself, this conversation is a worthwhile read for anyone new or considering using the GAE.

What’s more, the interview provides a link to a GAE forum where the creators of both Twig and Objectify-Appengine square off regarding their respective frameworks’ implementations. The two frameworks have distinctly different mechanisms for dealing with relationships, and while I tend to prefer Twig’s relaxed syntax, Jeff Schnitzer, the creator of Objectify-Appengine, makes a cogent case for why dealing directly with GAE keys is safer.

Interestingly enough, a full-stack framework targeting the GAE has yet (to my knowledge) to fully emerge. Indeed, Grails works on the GAE, but this is basically an afterthought (i.e. Grails wasn’t built for the GAE). Moreover, Gaelyk is specifically built for the GAE but lacks a fully fleshed out ORM implementation, preferring to expose an enhanced version of the low-level Entity API. The Play framework, which is somewhat like Grails but without a lot of Groovy, has a GAE module (aka plugin) along with an Objectify-Appengine one. It should be interesting to see how things play out (no pun intended).

In the clouds with Amazon’s SimpleDB

As part of the Amazon Web Services family, Amazon’s SimpleDB is a massively scalable and reliable key/value datastore, which is exposed via a web interface and can be accessed using any language you’d like, from Java to Ruby to Perl to C#. In fact, Amazon has recently released a standardized SDK for both the .NET and Java platforms.

Check out IBM DeveloperWorks’ newest article entitled “Cloud storage with Amazon’s SimpleDB, Part 1” — in this article, you’ll see firsthand how to leverage Amazon’s Java SDK to work with SimpleDB. In fact, this is the first of two articles exploring SimpleDB’s unique approach to schemaless data storage, including a demonstration of one of the datastore’s most unusual features: lexicographic searching.
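To see why lexicographic comparison is unusual to work with, consider what happens when numbers are compared as strings. The sketch below shows the surprise and the usual workaround of zero-padding to a fixed width. The LexicographicDemo class and its pad helper are hypothetical and plain Java; they aren't part of Amazon's SDK, and the width of 5 is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class LexicographicDemo {

    // Left-pads a non-negative number with zeros so string order matches numeric order.
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need an offset first");
        }
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Compared as plain strings, "10" sorts before "9".
        System.out.println("10".compareTo("9") < 0);             // true
        // Padded to a fixed width, the comparison matches the numbers.
        System.out.println(pad(9, 5).compareTo(pad(10, 5)) < 0); // true

        List<String> padded = new ArrayList<>();
        for (int n : new int[] {100, 9, 10}) {
            padded.add(pad(n, 5));
        }
        Collections.sort(padded);
        System.out.println(padded); // [00009, 00010, 00100]
    }
}
```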

Stay tuned for part 2, where I’ll cover using JPA to work with SimpleDB. Until then, happy reading!