Data lakes: Look before you leap
21 November 2016
It’s a feature of developing technology that buzzwords tend to appear all of a sudden, and then everyone is talking about them. At the moment it’s ‘data lakes’, a method of storing Big Data for which Apache Hadoop is the best-known platform.
The term ‘data lake’ has been used to mean many things in industry, but the ‘lake’ terminology is quite a useful way of getting your head around the idea. In a data warehouse, the data is pre-sorted, cleaned and packaged according to what you might need it for. A data lake, by contrast, is a vast repository of unpackaged structured and unstructured data that any user can access and mine. James Dixon, who’s credited with inventing the term, likened a data warehouse to a store of bottled water, whereas the lake is a body of water in its natural state.
The overriding advantage of a data lake is that the data isn’t forced into a relational structure before it’s stored. With a data warehouse you need to know upfront what questions you want to ask of the data, and then structure the data accordingly. A data lake brings the processing to the data, so it can respond to an entirely new question without the need for an expensive rewrite.
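To make the contrast concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions rather than how any particular platform works: the sample events, the field names and the two helper functions (`load_warehouse_rows`, `total_by`) are all hypothetical. The first function fixes its columns when the data is loaded, as a warehouse would; the second keeps the raw records and only interprets them when a question is asked.

```python
import json
from collections import defaultdict

# Hypothetical raw events, kept exactly as they arrived (one JSON document per line).
RAW_EVENTS = """\
{"customer": "a", "amount": 20.0, "channel": "web", "note": "gift"}
{"customer": "b", "amount": 5.5, "channel": "store"}
{"customer": "a", "amount": 7.5, "channel": "web", "device": "mobile"}
"""


def load_warehouse_rows(raw):
    """Warehouse style: keep only the columns that were chosen upfront."""
    rows = []
    for line in raw.splitlines():
        record = json.loads(line)
        # Anything outside the agreed structure (channel, device, note) is dropped here.
        rows.append((record["customer"], record["amount"]))
    return rows


def total_by(raw, field):
    """Lake style: read the raw records and apply structure at query time."""
    totals = defaultdict(float)
    for line in raw.splitlines():
        record = json.loads(line)
        # Records that lack the field are grouped under "unknown" rather than rejected.
        totals[record.get(field, "unknown")] += record["amount"]
    return dict(totals)


if __name__ == "__main__":
    print(load_warehouse_rows(RAW_EVENTS))
    print(total_by(RAW_EVENTS, "channel"))
    print(total_by(RAW_EVENTS, "device"))
```

In this sketch, a new question such as ‘how much revenue came from each device?’ can be answered from the raw records straight away, whereas the warehouse-style rows would first need to be restructured and reloaded to include the extra column.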
Data lakes are an appealing idea for organisations. Our latest global CEO Survey showed that business leaders see Big Data as important ammunition for the future; 68% said that Big Data and analytics technologies were likely to generate the biggest return on investment for their organisation. Given what’s been written so far about the advantages of a data lake approach, it must be tempting for business leaders to just dive in and set aside a capital investment, thinking it will solve all their problems.
But as with any technology project, it’s a good idea to stop and ask yourself – what do I really need this for?
It’s essential to remember that implementing a data repository isn’t purely a technology project. The technology is there to help you make the best use of your data and to move up the curve in terms of understanding and driving your business. So don’t start down the data repository road until you have a clear idea of what you want to do with the data you have.
We have a vastly increased capacity to collect structured and unstructured data, but just because you can collect it doesn’t mean that you should, or that you’re obliged to use everything you collect. It’s possible to explore data relatively cheaply, because cloud technology has brought us that ability. So why spend millions on technology when you can do what you need to do for a fraction of the price?
As the saying goes, don’t boil the ocean. The only way you can solve a problem is by understanding what problem you want to solve in the first place.