One problem with the Big Data paradigm is the shortage of software engineers who can work with Big Data. Why is this? Let's check off the reasons:
- It is hard to reason about parallel computation. Steve Jobs was famous for saying that "nobody knows how to program [multi-core]. I mean two, yeah; four, not really; eight, forget it." Now take those eight cores and multiply them across tens, hundreds, or in some cases, thousands of machines. This is a difficult problem.
- The ecosystem of tools, languages, and workflows is constantly changing. This leaves newcomers confused about which ones to learn and use, and which are irrelevant.
As the Big Data space moves forward, we will need to overcome these barriers in tandem. But if we take a note from history, we’ll notice that both of these problems have been solved before - let's take a step back in time.
SQL, the Forefather
In 1974, IBM began building one of the first relational database management systems (RDBMS): System R. They also introduced SQL, a declarative language based on relational algebra. A few years later, Larry Ellison and co. created Oracle 2, building their first relational database out of his garage. They decided to follow suit and use SQL for a simple reason: if customers switched to Oracle and decided they didn't like it, they could easily switch back to IBM without rewriting their queries.
SQL is a special kind of language - I like to think of it as "Mother Knows Best". You submit a description of what you want the machine to do, and the database system designs a plan for how to do it. The programmer need not think about parallel programming, as the database does it for her. In theory, these query plans would be optimal; in practice, they are far from it (why else would SQL Server need a "USE PLAN" query hint?). Regardless, SQL ruled the roost for over three decades, and countless systems are built on that foundation. In fact, it became an American National Standards Institute (ANSI) standard in 1986.
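To make the "what" versus "how" distinction concrete, here is a minimal sketch that computes the same per-customer totals twice: once with a hand-written loop and once with SQL. It uses Python's built-in sqlite3 module as a stand-in for a full RDBMS; the `orders` table and its sample rows are made up for illustration.

```python
import sqlite3

# Made-up sample data: (customer, amount) pairs.
orders = [("ann", 10), ("bob", 5), ("ann", 7)]

# Imperative: we spell out HOW to loop, look up, and accumulate.
totals_by_hand = {}
for customer, amount in orders:
    totals_by_hand[customer] = totals_by_hand.get(customer, 0) + amount

# Declarative: we state WHAT we want and leave the strategy to the engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
totals_by_sql = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
conn.close()

print(totals_by_hand == totals_by_sql)  # prints True
```

The loop commits to one strategy. The query leaves the engine free to pick a different one (hashing, sorting, an index) without the programmer changing a character.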
As time went on, disk sizes and data sizes grew. These monolithic databases ran on single machines, which were becoming increasingly expensive. There were distributed data warehousing options (see Teradata, et al.), but these solutions were prohibitively expensive for many businesses.
Onward to NoSQL
In 2003, the Google File System (GFS) was introduced. This was a distributed data store that focused on large, append-only files and ran on a cluster of relatively small machines. A year later came MapReduce, a simple way for programmers to write large-scale parallel processing jobs on these files. With these developments came the open source version, Hadoop. What did this mean for the data processing world?
In hindsight, these were seminal moments; at the time, they were largely ignored by the database community. The itch had been scratched, however, and companies noticed the cost savings that this open-source storage system, running on small machines, could bring.
Within the paradigm, programmers write MapReduce jobs. The key to MapReduce is that programmers write single-threaded, imperative programs which run simultaneously on multiple machines. While this makes writing parallel programs easy, it limits the types of computations that can be performed. As an example, consider the difficulties of graph processing with MapReduce. The real question was how to effectively overcome this limitation.
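For readers who have never written one, the shape of a MapReduce job can be sketched in a few lines. This is a single-process illustration, not Hadoop's actual API: `map_phase`, `reduce_phase`, and `run_job` are names I made up, and `run_job` stands in for the framework that would normally distribute the work.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: sees one line at a time and emits (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    """Reducer: sees every value that was emitted for a single key."""
    return word, sum(counts)

def run_job(lines):
    """Stand-in for the framework: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:  # a real cluster would spread these calls over machines
        for key, value in map_phase(line):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reduce_phase(key, values) for key, values in groups.items())

word_counts = run_job(["big data big problems", "data about data"])
print(word_counts)
```

Notice that the mapper and the reducer are both plain single-threaded functions. Notice also the limitation: anything that does not fit "emit pairs, group by key, combine" has to be bent into that shape, often as a chain of several jobs.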
Microsoft thought they had the answer, and in 2007 released Dryad, an alternative to both those distributed databases and MapReduce. It gave developers complete control: they could define any data flow necessary, decide when to store parts of a computation to disk, choose when to replicate data streams, and more. But of course, with great power comes great responsibility, and few programmers were up to the task. Microsoft found most users were just writing MapReduce jobs, and eventually the project went the way of the dodo.
There and Back Again
Microsoft's solution was on target, but implemented poorly. While we need the power that a generalized data distribution graph brings, what engineers do not need is the responsibility. We've been here before! The system needs to let us tell it what to do, but decide for itself how to do it.
Since then, the Hadoop ecosystem has been moving toward more possibilities within the paradigm (e.g. not restricting programs to just map and reduce) combined with simpler programming. We've crossed into Spark's RDDs, Google Dataflow's (now Apache Beam's) windows, and Flink's DataStreams and DataSets. I won’t dive into the technicalities of these APIs here, but they each bring their own distinctive flair to the Big Data genre, while still allowing for the imperative programming that MapReduce requires.
But SQL fits the bill too, and we've seen a resurgence of SQL-like APIs. Hive quickly came out of Facebook to be SQL on MapReduce, Phoenix is SQL on HBase, and both Spark’s and Flink's Dataset APIs look very close to a typical SQL API.
This is, of course, part of the problem that we discussed earlier. Too many options! Whenever one of these new, amazing, fast, and easy tools comes out, programmers may need to either rewrite their old code or move to a new API or language for all new code. Either way is a hindrance to both the codebase and the programmer. The codebase loses consistency in its toolset, and the programmer is forced into a continuous effort to keep up.
Finding the Right Abstraction
The Big Data community needs to decide which abstraction is correct, and iterate on building out new tools that use it. This way new tools can be used on older codebases, and programmers won't have to learn an endless number of new paradigms to stay relevant. New frameworks should focus first on supporting the existing API, and only if necessary expand on it to allow for new programming options. Database vendors have done this for decades; in general, SQL looks the same, but each vendor has its own set of functions that differ from its competitors'. Just as early Oracle customers could switch back to IBM if they felt they needed to, our Big Data codebases should run on a variety of engines without any modification.
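Here is a toy sketch of what "one codebase, many engines" could look like. Everything in it is hypothetical: `JOB` is a job described as plain data, and `run_serial` and `run_threaded` are two made-up engines that execute it with different strategies. The threaded engine only illustrates a different execution plan; it is not meant as a performance claim.

```python
from concurrent.futures import ThreadPoolExecutor

# The job is described as data: WHAT to compute, not HOW to run it.
JOB = {
    "keep": lambda n: n % 2 == 0,  # filter step
    "transform": lambda n: n * n,  # map step
    "combine": sum,                # aggregate step (must be associative)
}

def run_serial(job, data):
    """Hypothetical engine 1: a plain single-threaded pass over the data."""
    return job["combine"](job["transform"](n) for n in data if job["keep"](n))

def run_threaded(job, data, workers=4):
    """Hypothetical engine 2: split the data into chunks, process each chunk
    on a thread pool, then combine the partial results."""
    items = list(data)
    chunks = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: run_serial(job, chunk), chunks)
        return job["combine"](partials)

print(run_serial(JOB, range(10)))    # prints 120
print(run_threaded(JOB, range(10)))  # prints 120
```

The job definition never changes. Swapping engines is a one-word edit at the call site, which is the property I am arguing our real tools should have.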
While SQL is one opportunity for this abstraction, another that is seen in both Spark and Dataflow (now Apache Beam) is a functional API. MapReduce took an idea out of the functional notebook, and these newer frameworks are extending that idea even further. Note that, similar to SQL, functional languages are considered declarative, and not imperative.
In the end, the best language will be one that is able to specify the greatest number of possible programs in as easy and concise a syntax as possible. I'm willing to admit here that this may not be SQL - there is a reason that the database world had Datalog for recursive queries.
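If "recursive query" sounds abstract, the classic example is reachability in a graph, which Datalog expresses in two rules. The sketch below evaluates those two rules naively in Python by repeating them until nothing new is derived. The `reachable` function and the sample edges are my own illustration, not part of any Datalog engine.

```python
def reachable(edges):
    """Naive fixpoint evaluation of the Datalog program:
         path(X, Y) :- edge(X, Y).
         path(X, Z) :- path(X, Y), edge(Y, Z).
    """
    paths = set(edges)  # first rule: every edge is a path
    while True:
        # second rule: extend every known path by one more edge
        derived = {(x, z) for (x, y) in paths for (y2, z) in edges if y == y2}
        if derived <= paths:  # nothing new was derived: fixpoint reached
            return paths
        paths |= derived

edges = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(reachable(edges)))
```

The two rules in the docstring are the whole program in Datalog. The loop around them is the "how" that a Datalog engine would supply on its own.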
The Future of Big Data Programming
There are quite a few frameworks, ideas, papers, moments, and intricacies that I left out of the above discussion. My goal is to help people understand what the current Big Data ecosystem looks like, how we got here, and what comes next - specifically for end-users in the Big Data space. There are quite a few problems left to be solved, and the above discusses two that affect programmers specifically.
In the near term, we need better tools to help engineers make full use of Big Data systems. For example, how should I be storing my data? Which indexes, partitions, or clusters should I use? Perhaps a future computation engine could understand my needs over time and help de-normalize, partition, and index my data itself, in order to give the fastest average performance per query.
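To see why storage decisions matter so much, here is a small sketch using Python's built-in sqlite3 module as a stand-in for a larger system. The `events` table, the `idx_events_user` index, and the `plan_for` helper are made up for illustration. The query text never changes; only the physical layout does.

```python
import sqlite3

def plan_for(conn, query):
    """Return the plan steps SQLite reports (the wording varies by version)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)", [(i % 100, "x") for i in range(1000)]
)

query = "SELECT payload FROM events WHERE user_id = 42"
before = plan_for(conn, query)  # no index yet: the engine must scan the table

conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
after = plan_for(conn, query)   # same query, now able to use the index

print(before)
print(after)
```

Today a human has to know that this index is worth creating. The engine I am wishing for would watch the queries I actually run and make that call itself.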
New computation engines will need to take advantage of these storage layers as well - so a generalized index may well be in order, one that can be used across many different tools as teams find out which ones meet their needs. The same goes for file storage types (Avro, ORC, etc.) and partitioning strategies.
Consider the vast number of tools I mentioned in this article alone, and realize that these are only the tip of the iceberg in the data processing world. It's time to make these tools more accessible to the rest of the world.
If you are having trouble finding Big Data Engineers, let us help! We have experience with a wide range of Big Data tools both inside and outside the Hadoop ecosystem, and are known for delivering timely and professional projects for our clients.