mongoDB: 9 months on – "Take it or leave it" technical choices
Article part of my "mongoDB 9 months" series:
- Setting the stage – introducing why and how I started to use mongoDB, which provides some context of the stuff spoken off after,
- The good – stuff I really appreciate with mongoDB,
- The bad – hurdles which could/should be fixed – mostly minor yet irritating points about mongoDB stack,
- mongoDB "Take it or leave it" technical choices -the current article,
- The good to know – side discoveries which doesn’t change the world but are better known before hand,
- Conclusion – a small attempt at concluding over all of this while taking a step back.
Anyway, let’s have a look at mongoDB "Take it or leave it" technical choices.
Coming from a key value store background, mongoDB provides no equivalent of RDBMs relationships. Actually, the only help provided is the DBRef. However, DBRefs are very limited: it’s a convention to store a collection name and a document identifier. They don’t behave at all like foreign keys: mongoDB keeps no count/trace of them and a document can be deleted whatever the DBRefs pointing at it.
Actually, query can act upon only one collection type at a time. It means it’s possible to have multiple documents from the same collections in one query but fetching 2 documents of 2 different collections requires 2 queries. Be it for retrieval or only to take part in some query, joining is simply not available.
In more details, when displaying the document, it means as many queries as distinct collection types referenced. Let say you have a building collection and you want to display it and where it is. Maybe for the location you have the City, Country and Region. Then you have to query the City, Country and Region collections one a a time.
Doing four queries instead of one for displaying a document might sound "not perfect but bearable". However, the matter get worse if you want to display a list of buildings and their locations. There, for each building, you would have to query 3 times on top. For a list of 20 buildings, you would end up with 61 queries. This most likely won’t fit anymore.
Starting from there, a long list of improvements is possible:
- You could have some kind of unit of work on the client side in order to avoid to request twice the same document. While this again involves some tooling, you can’t rely on it to make queries fast enough, since they may all reference different locations as well as referencing many other documents in your database.
- Going further on the unit of work business, one could have a second level cache, and do the relevant caching on the application server side. As such one could avoid hitting the database for each document, skipping the network overhead and potentially some serialization/deserialization on the way.
- Another improvement would be to somehow collect first all the City, Country and Region ids, in order to do only 3 extra queries on top, then to rewire the fetched locations to their original buildings. While doable, this once again requires extra tooling
- Depending on the use case, loading lazily the referenced documents could also be of some help.
In the end, each of these improvements requires some server side tooling, and sometime some serious one. Applying them all isn’t this trivial neither. In the case of the list of entities, lazy loading of references could in fact be a real downer (if lazy loads happen a lot).
These options are even better combined actually. An unit of work extending over the lazy loading, for example, sounds like the way to go. Similarly for the second level cache.
On top of that, the query to get the building list in the first place matters as well. It’s not just about fetching entities.
Indeed, when creating the building/location list, you may want to order by countries and cities names. You most likely want to offer pagination as well, to make it all user friendly. If you stick to plain DBRefs and the like, it means:
- fetching all the buildings of interest (and not only the 20 first – hopefully you don’t have too many of them),
- for all of these buildings, fetching in one row all the relevant countries and then all the relevant cities,
- sorting the original list of building according to the names of the countries and cities fetched,
- displaying only the 20 first.
On the way, you’ll also need to use field selection.
However, field selection/filtering comes with its issue as well. Indeed, in the given example, you may have to first work on all buildings with filtered out fields, but then for the 20 remaining you may have to fetch them fully, adding one more query on the way.
Overall, the improvements to tackle the lack of joins are quite something to build, use and maintain properly. Furthermore, in the end, the number of queries might still be too important, depending on the use case.
In the end, using only documents and references won’t always fit. Actually, you need to de-normalize your documents to embed the relevant part of the referenced documents. With the list of buildings example, it would most probably mean having the country and city names stored with each building. As such ordering and limiting can be done in one query.
However, all this now de-normalized data requires special cares. If the referenced city name changes, you would like the de-normalized names in building collection to reflect this, isn’t it? So, how to do it?
Well, first of all, that’s where, IMHO, mongoDB documentation lacks by quite some extend. The documentation and "company line" is fairly easy. One can sum it up in 4 words: "you should rather embed", as seen in their Schema design page. If you’re to de-normalize, you’re on your own. From what I got/read, they didn’t provide more hints on how to proceed there because it’s a client side matter with many options. Yet, personally, I would have love at least some presentation of the main ones. No guidance at all is really not enough.
Currently, I see these options:
- ad hoc de normalization support: when saving/updating/deleting some Foo entity, one explicitly goes at linked documents and updates there the de normalized data. Since synchronous when saving, this solution is only possible on small scales. It is usually the first approach.
- self made tools for update propagation: you somehow have some way of declaring that the Bar entities contain de normalized content of the Foo entities. The tools in turn take care of propagating changes. This approach scales better than the first one. It works well as long as no complex logic is needed to build the de normalized content. In this case, one has to switch to the 3rd option which follows. Furthermore, one may want to avoid polluting some documents with plenty of use cases specific de normalized data. Again, the 3rd option might be the way to go.
- pre aggregation for some specific query/views. The idea there is to build a document for a given query/view in order to fulfill it in one query. The document’s content is then highly specific and efficient. Tools are still required to maintain it and likely these pre aggregated query/view can do with different update delays (some might be ok with night batch for example).
The 2 last options need some sort of modification/insertion/deletion propagation. Ideally, some kind of trigger would be welcomed, but they aren’t available in mongoDB. It means you’ve to do it yourself application server side.
Regarding tooling for de-normalization, you’re on your own as well. As said before, DBRefs is the only "tool" provided. You won’t get a database provided list of documents referencing some other one. There’s no tool to propagate update or handled deletion.
While not impossible to achieve, de-normalization handling requires skills and time to do right. This extra work on the client side isn’t common in database work. Having to do such a update/deletion propagation mechanism is not trivial, especially with the limited toolset available.
While not casting a shadow over mongoDB as a whole, the complexity of this extra work shouldn’t be underestimated . You should really plan to develop your de-normalization tools early and test them thoroughly to know how it performs and what you can expect from them. Same stands for the improvements (first/second level cache, lazy loading, collect and then fetch strategy) you may choose. And don’t forget mongoDB – The bad, because some other current limitations of mongoDB may bit you as well: better find them before having to deploy some update propagation mechanism in a hurry for a business critical reason!
Due to the lack of transaction, mongoDB provides limited support for ACID:
- atomicity applies only for changes on one document. If your update batch affecting 10 documents fails somewhere, some of the document will be updated but not the others. Rollbacking the already changes is left to the application server.
- consistency isn’t present either. A long running update could see new documents inserted somewhere in the collection and not affect them. A failed batch update leaves the database in a inconsistent as long as some one doesn’t clean it.
- finally, isolation is not present at all. Ongoing updates can be seen in between of their execution by other operations.
Some implementations and API choices privileging performances can also have unexpected behaviors. For example, when listing all the documents of a collection, having twice the same is possible with the default settings.
To me, usually, transaction implies mostly two things:
- a certain level of atomicity of multiples operations: you can make sure some operations won’t be seen before their are fully completed, avoiding stale data to be taken in account anywhere,
- an "easy" safety net: without thinking of it, if one operation of a transaction fails then all of them will be roll backed. It comes "out of the box".
For the atomicity part, while the documentation provides some workarounds, namely through a two phase commit approach, this approach comes with its price: complexity and ad hoc aspect of it, since it needs to be done each time you need such 2 phase commit). This solution also is limited to update of documents of the same collection (and queries not involving other collection neither).
IMHO, mongoDB simply doesn’t fit if you have many update involving related documents, like the classic Account transfer example. And while having transaction and some ACID properties for sure doesn’t come for cheap in RDBMs neither, it can still be done there when needed, and without this many hoop jumping. Actually, having read the two phase commit page, I almost feel compelled of saying that transactions are possible even in high load environment. Indeed, I know of some finance company having transactions with read committed isolation level for write on their DB2, even for the tables handling stock exchanges back office operations, so with a hell of a lot of transactions. Sure, it ain’t be cheap. But possible.
Regarding the safety net part, it’s still somehow doable. Indeed, with the proper write concern. Indeed, the SAFE write concern makes db operation synchronous and throws exception in case of issue. As such, one can and should care for this application server side. In case of multiple operations in a row, one should then handle roll backing them, manually. However, while possible, this isn’t always easy. Indeed, one doesn’t necessarily know which documents where actually updated, in the failing operation or some before. As such, properly roll backing doesn’t come for free, and might even, depending on the use case, be fairly hard to achieve.
On the 27/05/2011: enumeration of the possible de normalization strategies.
On the 06/06/2011: introduction of the part about the isolation and safety net aspects of transactions.