Keeping it all together while spreading it all out
The days of the traditional database are far from over: in many (perhaps most) situations, building an application component on top of a relational data store like MySQL, SQL Server, or Oracle remains the best tool for the job. However, relational databases are inherently limited by their architecture in the scale they can achieve for problems that deal with extremely large data sets.
You can scale a single database table into gigabytes of data and millions of rows while keeping acceptable performance, and with the right (read: expensive) hardware perhaps into hundreds of gigabytes and billions of rows. But when you need to store and process single data sets that are terabytes in size and grow into trillions of rows (or, perhaps more interesting, tables that might have millions of columns), you move beyond what is practical with a traditional relational database.
The NoSQL movement has helped popularize and legitimize building complex applications without using a traditional RDBMS. However, it's easy to take for granted what the relational database provided. Working with a NoSQL database requires approaching application architecture and design in a very different way.
Need to find all products within a category? You can't join tables. Need to find all user records from a specific city? There is no SQL WHERE clause. Need to query records efficiently by a particular column? You can't add indexes. Need to update multiple tables atomically? Transactional options are limited. Need to ensure data consistency between tables? There are no foreign keys (or even schemas)!
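Without joins or secondary indexes, the usual answer is to denormalize: duplicate data at write time so each query becomes a single key lookup at read time. Here is a minimal sketch of that pattern, using plain Python dicts to stand in for a key-value store (the function names and key format are illustrative, not a real store's API):

```python
# Plain dicts simulate a key-value store; in a real NoSQL store, each
# store[...] access would be a get/put call against the database.
store = {}

def put_product(product_id, name, category):
    """Write the product record AND index it under its category key."""
    store[f"product:{product_id}"] = {"name": name, "category": category}
    # Maintain the denormalized "index" row ourselves -- the store won't.
    store.setdefault(f"category:{category}", []).append(product_id)

def products_in_category(category):
    """One key lookup replaces 'SELECT * FROM products WHERE category = ?'."""
    ids = store.get(f"category:{category}", [])
    return [store[f"product:{pid}"] for pid in ids]

put_product(1, "Widget", "tools")
put_product(2, "Gadget", "tools")
put_product(3, "Novel", "books")

print([p["name"] for p in products_in_category("tools")])  # ['Widget', 'Gadget']
```

The price of this design is that the application, not the database, is now responsible for keeping the duplicated data in sync on every write.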
We're familiar with these kinds of technologies and architectures, are actively using them to create real software solutions, and our consulting and development practice is available to help you solve similar problems.
Many technology options exist for handling updates and queries to large data sets, and each choice brings with it tradeoffs that need to be considered. The CAP theorem helps bring attention to one aspect of these choices.
Is it acceptable for some users to see data that might be out of date (thereby sacrificing consistency) in order to guarantee that the system can handle any kind of hardware failure (perhaps of an entire data center)? Or is some downtime acceptable to ensure the user always sees correct information? The answer to that question will affect whether a tool like HBase or Cassandra is the better fit.
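That tradeoff can be made concrete with a toy model: three replicas, one of which misses an update while down. Reading from any single replica is always available but may return stale data; requiring a majority (quorum) gives the freshest answer but fails when too few replicas respond. This is only a sketch of the idea, not how any particular database implements it:

```python
class Replica:
    """A single copy of one value, tagged with a version number."""
    def __init__(self):
        self.version, self.value = 0, None
        self.up = True

    def write(self, version, value):
        if self.up:  # a downed replica silently misses the write
            self.version, self.value = version, value

replicas = [Replica(), Replica(), Replica()]

def write_all(version, value):
    for r in replicas:
        r.write(version, value)

def read_one():
    """Availability-first: answer from the first live replica, stale or not."""
    for r in replicas:
        if r.up:
            return r.value
    raise RuntimeError("no replicas available")

def read_quorum():
    """Consistency-first: need a majority; return the newest value seen."""
    live = [r for r in replicas if r.up]
    if len(live) < 2:
        raise RuntimeError("quorum unavailable")
    return max(live, key=lambda r: r.version).value

write_all(1, "v1")
replicas[0].up = False   # replica 0 goes down and misses the next update
write_all(2, "v2")
replicas[0].up = True    # it comes back, still holding the stale "v1"

print(read_one())        # "v1" -- fast and available, but out of date
print(read_quorum())     # "v2" -- correct, but needs a majority alive
```

Systems like Cassandra expose this dial directly as per-request consistency levels; systems like HBase sit closer to the consistency-first end.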
Efficiently storing data in a scalable fashion is only part of the challenge. Processing those records to provide useful applications to your users, and the analytics your business needs, requires other tools. Need to query all of the data and transform it en masse? Then Hive may be the best fit. Need to set up the data so that it can be efficiently queried in aggregate? A columnar database like LucidDB will let you run complex aggregate queries in milliseconds. Need to allow business users to slice and dice the data at runtime for analytics? Then an OLAP engine like Mondrian may be the way to go. Need to use machine learning techniques for predictive modeling? Then a prediction API will let you do just that.
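The "query everything and transform it en masse" style that Hive compiles down to MapReduce boils down to a map phase that emits key-value pairs and a reduce phase that aggregates them. A minimal sketch in plain Python (the records and field names are made up for illustration):

```python
from collections import defaultdict

# Toy data set standing in for a table scanned in full by the job.
orders = [
    {"city": "Austin", "amount": 20.0},
    {"city": "Austin", "amount": 5.0},
    {"city": "Boston", "amount": 12.5},
]

def map_phase(records):
    """Emit (key, value) pairs -- here, order amounts keyed by city."""
    for rec in records:
        yield rec["city"], rec["amount"]

def reduce_phase(pairs):
    """Sum values per key, like 'SELECT city, SUM(amount) ... GROUP BY city'."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

print(reduce_phase(map_phase(orders)))  # {'Austin': 25.0, 'Boston': 12.5}
```

In a real cluster the map and reduce phases run in parallel across many machines over terabytes of data; the shape of the computation is the same.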