Actually, it’s been around for a while, but recently I keep seeing it being talked about, so perhaps it’s time I put my contribution into the mix!
Firstly, unlike a lot of jargon or buzzwords, ‘big data’ defines the problem as opposed to the solution, and it is genuinely ‘what it says on the tin’, to borrow a horrible and over-used phrase! Big data is, quite simply, big data. This means it gives us all sorts of problems with collection, collation, storage, analysis and, of course, reporting.
Oddly enough, data has proliferated as a result of the analysis that we do on already existing datasets. Because we are always looking for interesting correlations, and to a certain extent just because we can, we tend to create dataset after dataset all based on slightly different permutations of the underlying information. Not surprisingly, this gets out of hand quite quickly – to the tune, in fact, of 2.5 quintillion bytes (roughly 2.5 billion gigabytes) of new data generated daily, according to some sources (not sure who’s counting, though)! It is also claimed that the volume of data generated by businesses on a worldwide basis is now doubling every 13 to 14 months. IBM state that 90% of all the data we possess was created in the last two years.
Companies like Walmart process over a million transactions an hour, all of which go into the data warehouse for analysis. When the Sloan Digital Sky Survey began to collect astronomical data in 2000, it outstripped all data collected in the previous centuries of astronomy in a matter of weeks. There are many more such examples on Wikipedia and elsewhere.
The thing about this data is that it is difficult to work with using ‘traditional’ relational database tools, like SQL Server or Oracle. Partly, this is a capacity issue, as data volumes of this size require parallel systems to rack up the necessary resources, whether that be tens or hundreds of servers, maybe more. It would be too time-consuming and costly to try to process this data in a traditional way, hence the concept of big data and the processing and analysis of it.
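For anyone who likes to see an idea rather than just read about it, here is a minimal sketch in Python of what ‘parallel’ means here: cut the data into chunks, let each worker count its own chunk, then add up the partial answers. Everything in it is an assumption made for illustration: the transactions are invented, the names split, count_large and parallel_count are hypothetical, and threads on one machine merely stand in for the tens or hundreds of servers a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented data: (store_id, amount) pairs, purely for illustration.
transactions = [(i % 5, float(i % 40)) for i in range(1000)]


def split(items, parts):
    """Cut the list into roughly equal chunks, one per pretend server."""
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_large(chunk, threshold=30.0):
    """Count the transactions in one chunk whose amount exceeds the threshold."""
    return sum(1 for _, amount in chunk if amount > threshold)


def parallel_count(items, workers=4):
    """Count each chunk independently, then add up the partial results."""
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_large, chunks))
    return sum(partials)


total = parallel_count(transactions)
print(total)
```

The point of the design is that no worker needs to see anyone else's chunk, so adding more workers (or, in real life, more servers) lets you get through more data in the same amount of time.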
The focus is on answering questions that were previously beyond your reach – like (in Walmart’s case) how many of those million transactions an hour had certain characteristics that might inform some future business decisions? Answering these questions can make your business more responsive, and answering them quickly – as close to real time as possible – may give you some big advantages.
This is the reason that big data is the current big thing, and in other posts I will summarise (without getting too technical) some of the solutions to the problem which have become buzzwords in themselves – Hadoop, anyone?