Gartner has recently put forth a formal definition of Big Data: “Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” For those of you new to Big Data problems, the definition includes three critical “V”s, often referred to as the “3 V’s of Big Data”: volume, velocity and variety. And the fourth critical piece of the definition is “…require[s] new forms of processing…” – i.e., a novel approach to the management of data.
Here’s why that’s critical. Vendors, companies and individuals love to throw out the dictum that they were handling big data problems before big data was big. But it just ain’t so! Yes, 10 years ago, a gigabyte of data was a lot for a system to handle efficiently. Kudos to those who successfully manipulated data sets of that size. And 10 years from now, a petabyte may not seem like a lot of data, so the future you may question why you struggled with “only a petabyte”. But just because you have “volume” doesn’t mean that you know how to work with big data problems.
The reality is that these big data problems are of a class that is just plain difficult or impossible to solve with traditional approaches. (For those of you who need specifics: I’m talking about relational data models!) Big data doesn’t mean that you must use NoSQL or other new technologies, but it does mean that you need to take a fresh approach to the problem, typically by implementing a highly parallel processing algorithm such as MapReduce (though, again, it doesn’t have to be MapReduce). A minimal sketch of the MapReduce pattern follows.
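For readers who haven’t seen the pattern before, here is a toy, single-machine sketch of the MapReduce idea in Python. The function names, the word-count task and the tiny in-memory “corpus” are my own illustrative assumptions, not any particular framework’s API; the point is simply that the work is expressed as independent map tasks whose outputs are grouped by key and then reduced, which is what lets a real framework spread the same job across many machines.

from collections import defaultdict
from itertools import chain

# Toy corpus standing in for a large, partitioned data set.
documents = [
    "big data is high volume",
    "big data is high velocity",
    "variety is the third v of big data",
]

def map_phase(doc):
    """Map step: emit (key, value) pairs for each word in one partition."""
    return [(word, 1) for word in doc.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle step: group all emitted values by key across partitions."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Each document could be mapped on a separate node; here we run them in sequence.
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
counts = reduce_phase(shuffle_phase(mapped))
print(counts)  # e.g. {'big': 3, 'data': 3, 'is': 3, ...}

Because each map call touches only its own slice of the data, and each reduce call touches only one key’s group, neither step needs the whole data set in one place – and that independence, not the word counting itself, is what makes the approach suitable for volumes a single relational database struggles with.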
So if you are talking to a vendor who is touting their big data experience, and they say they’ve been solving these types of problems for years and years, be wary: you are getting a “marketing answer”. Dig in. Understand what they are really doing, and whether their solution actually meets your needs. They may well be a great choice for you, but equally, they may not.