The Vast Majority of Big Data is Lunacy

It's 2017 and we still don't properly understand the collective human endeavour of software construction. I've speculated elsewhere that the original sin was borrowing semantics from engineering and construction, when in fact software construction is more like R&D than building bridges or airports.

At any rate, one of the consequences of our poor understanding is that we continue to reach for ill-fitting metaphors, and we try to compartmentalise what little we think we understand into patterns.

Some of these patterns are often useful, such as software design patterns. Other patterns are driven more by marketing and business architecture. They aren't called patterns; I think it's fairer to call them fads, to emphasise that the vast majority are not, from an engineering perspective, new at all, and have no use without an engineering context.

How is your SOA doing? Are you Web 2.0 ready? Is your ERP healthy?

When complex information systems hit the boardroom, they transform into what Richard Dawkins termed memes: ideas which transmit themselves through a population like a virus. Gartner's hype cycle is a good accelerator of this particular type of induced confusion.

Owing to the nature of work I have been doing for a few years now, the corporate IT meme I'm interested in here is the Big Data meme.

One sign that an IT meme has originated in the drunken echo chambers of the boardroom is that either no two people can supply the same definition of it, or, if they can, it means nothing to the engineers who are going to be implementing it. Here, have a look at Gartner's definition of Big Data:

Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

This is a problematic definition. When is our data high-volume, and what does high-velocity mean? What on God's green earth does process automation have to do with Big Data? This is the IT equivalent of blinker fluid.

I have a very specific definition which has served me well. You don't have to agree with the definition, but you cannot refuse to acknowledge that its simplicity and actionability make it useful.

Data is Big when it does not fit in core memory, on disk, or even on remote disk.

I'm not clever enough to have invented that myself. It's a paraphrase of the first paper to mention the term Big Data in the library of the Association for Computing Machinery, all the way back in 1997.

So there are a lot of invalid reasons to call your data "Big". Some that I've seen in the wild (names withheld to protect the otherwise innocent and well-meaning):

  • "Our data is Big Data because we pay an eight-digit licensing fee for Oracle databases." - My condolences. It's still not Big Data.
  • "This is Big Data because it takes 14 hours to do a DB restore. That's a long time because there's so much of it." - Nope, it takes 14 hours because you persist in XML and your bulk loader is using one core. Also your sysadmin is playing foosball.
  • "We implemented MongoDB as a data store, so we're Big Data." - No, you implemented MongoDB out of a desire to sporadically lose data silently.
  • "These 24 GB of customer records crashed Microsoft Access when we wanted to use Crystal Reports on it." - My cat can crash Access, and he's not Big Data.

The reality is that in mainstream commercial operations, datasets which do not fit in memory or on disk are very rare. Even in an interpreted language such as Python, you can fit something like 250 million integers into 16 GB of core memory (back-of-napkin calculation), which is a decent amount. And 16 GB is what a modern desktop comes with, never mind a server.
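For anyone who wants to redo the napkin maths, here is a rough sketch. It assumes a 64-bit CPython build, where each integer in a plain list costs one int object plus an 8-byte pointer; the exact sizes vary by interpreter version, so treat the output as an order-of-magnitude estimate rather than a measurement. The helper name ints_that_fit is made up for this illustration.

```python
import sys


def ints_that_fit(budget_bytes: int, bytes_per_int: int) -> int:
    """How many integers fit in a memory budget at a given per-integer cost."""
    return budget_bytes // bytes_per_int


BUDGET = 16 * 1024**3  # 16 GB of core memory

# Cost of one integer kept in a plain Python list: the int object itself
# plus the 8-byte list slot pointing at it (assumes 64-bit CPython).
list_cost = sys.getsizeof(10**6) + 8

# Cost of one integer in a packed 64-bit array (array module, NumPy, etc.).
packed_cost = 8

print(f"plain list:   ~{ints_that_fit(BUDGET, list_cost) / 1e6:,.0f} million integers")
print(f"packed array: ~{ints_that_fit(BUDGET, packed_cost) / 1e6:,.0f} million integers")
```

Even the wasteful plain-list layout lands in the hundreds of millions, and a packed array stretches the same 16 GB considerably further.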

And here comes the fun part. If your data clearly does not fit in memory or on disk, this is STILL not a green light to invest in Big Data infrastructure for analytical or data science purposes.

Understanding your datasets properly allows you to transform nominally "Big" datasets into smaller data subsets which are disproportionately easier to work with.

There's a common saying: "Those who do not understand Unix are condemned to reinvent it, poorly." Back in the 70s and 80s, we weren't swimming in a sea of cheap South Korean RAM. Back then, there were still fairly big datasets. As a result, the typical UNIX system today (Linux being the obvious example) ships with more or less the same tools which back then were designed to operate on datasets which would not fit in memory. They do this by streaming the data from disk, and they do it so well that they're still shipped today not for legacy support but because They Do Their Job Really Well®.

They had 640 KB of memory (figuratively) and the engineers worked with it. And this still works today. These tools are used to sample data, slice it, recode it, or perform other operations which take a large dataset and cut it down to the size of the specific analysis which is to be performed (for the propellerheads, we're talking tools like awk, sed, grep, cut, tr, etc.).
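The streaming idea is not tied to the UNIX tools themselves. As a minimal sketch, here is the same grep-plus-cut approach in Python: read one row at a time, keep only the rows and columns the analysis needs, and never hold the whole file in memory. The function name stream_slice, the column names and the sample rows are all invented for illustration.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, Sequence


def stream_slice(
    lines: Iterable[str],
    keep_columns: Sequence[str],
    keep_row: Callable[[Dict[str, str]], bool],
) -> Iterator[Dict[str, str]]:
    """Yield only the wanted columns of the wanted rows, one row at a time."""
    for row in csv.DictReader(lines):
        if keep_row(row):
            yield {name: row[name] for name in keep_columns}


# Hypothetical data standing in for a file far larger than memory.
raw = io.StringIO(
    "date,customer,amount,notes\n"
    "1972-03-01,alice,10,old\n"
    "2017-02-11,bob,25,recent\n"
    "2017-04-02,carol,40,recent\n"
)

# ISO-formatted dates compare correctly as plain strings.
recent = stream_slice(raw, ["date", "amount"], lambda r: r["date"] >= "2017-01-01")
result = list(recent)
for row in result:
    print(row)
```

With a real file you would pass an open file handle instead of the StringIO object. Because stream_slice is a generator, memory use depends on the size of one row, not on the size of the file.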

Sometimes your dataset spans 45 years, and you discover it's really only the last 9 months' data you need. Other times you might have a large dataset, but discover you only need 3 columns out of 800. Maybe you can take a sample of 500 records from a dataset with fifteen million records, and still achieve essentially the same model accuracy. Diminishing returns on analytical accuracy are a very real thing.
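Drawing a fixed-size random sample from a dataset too large to load is a solved problem. One standard technique is reservoir sampling, sketched below; this is an illustration of one way to do it, not a claim about how any particular tool implements sampling. The record stream here is just a range of numbers standing in for real records.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Uniformly sample up to k records from a stream in a single pass."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# A stand-in stream; in practice this would be rows read from disk.
sample = reservoir_sample(range(150_000), 500, seed=42)
print(len(sample), "records sampled")
```

Only the 500 sampled records are ever held in memory, so the same code works whether the stream has a hundred thousand records or fifteen million.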

All of this is accomplished using a combination of traditional "Not Big Data" technology and traditional "Knowing Your Data". It's healthy to recognise when a pinch of data self-reflection can save a large and often very expensive Big Data initiative.

Big Data is very real; some data genuinely does not fit in memory or on disk. The technology exists to work with this data, and the skills required to work with this data are still rare and costly (even if the systems themselves aren't; the credible big data technologies are all open source).

Fortunately UNIX and some good old common sense have got your back. Know your data and consult your engineers.