Data lakes: Look before you leap

21 November 2016

It’s a feature of developing technology that buzzwords tend to appear all of a sudden, and then everyone is talking about them. At the moment it’s ‘data lakes’, a method of storing Big Data for which Apache Hadoop is the best-known platform.

The term ‘data lake’ has been used to mean many things in industry, but the ‘lake’ terminology is quite a useful way of getting your head around the idea. In contrast to a data warehouse, where the data is pre-sorted, cleaned and packaged according to what you might need it for, a data lake is a vast repository of unpackaged structured and unstructured data that any user can access and mine. James Dixon, who’s credited with inventing the term, likened a data warehouse to a store of bottled water, whereas the lake is a body of water in its natural state.

The overriding advantage of a data lake is that the data doesn’t have to fit a predefined relational structure. With a data warehouse you need to know what questions you want to ask of the data upfront, and then structure the data accordingly. A data lake brings the processing to the data, so it can respond to an entirely new question without the need for an expensive rewrite.
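To make the contrast concrete, here is a minimal sketch in Python of answering two different questions from the same raw records by applying structure only at read time. The records, field names and helper functions are all hypothetical, invented for illustration; a real lake would sit on a platform such as Hadoop, not an in-memory list.

```python
import json

# Hypothetical raw events, kept exactly as they arrived.
# Records do not have to share the same fields.
RAW_EVENTS = [
    '{"customer": "a17", "type": "purchase", "amount": 40.0, "channel": "web"}',
    '{"customer": "b02", "type": "support_call", "duration_min": 12}',
    '{"customer": "a17", "type": "purchase", "amount": 15.5, "channel": "store"}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: keep records that have every requested field."""
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in fields):
            yield {field: record[field] for field in fields}


def spend_by(raw_lines, key):
    """Total the 'amount' field, grouped by whichever key the question needs."""
    totals = {}
    for row in read_with_schema(raw_lines, [key, "amount"]):
        totals[row[key]] = totals.get(row[key], 0.0) + row["amount"]
    return totals


# The question we planned for: spend per customer.
spend_by_customer = spend_by(RAW_EVENTS, "customer")

# A question nobody planned for: spend per channel.
# The stored data is untouched; only the read-time schema changes.
spend_by_channel = spend_by(RAW_EVENTS, "channel")
```

In a warehouse, the second question might mean changing the table design and reloading the data. Here it is only a different argument at read time, at the cost of doing the structuring work every time the data is read.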

Data lakes are an appealing idea for organisations. Our latest global CEO Survey showed that business leaders see Big Data as important ammunition for the future; 68% said that Big Data and analytics technologies were likely to generate the biggest return on investment for their organisation. Given what’s been written so far about the advantages of a data lake approach, it must be tempting for business leaders to just dive in and set aside a capital investment, thinking it will solve all their problems.

But as with any technology project, it’s a good idea to stop and ask yourself – what do I really need this for?

It’s essential to remember that implementing a data repository isn’t purely a technology project. The technology allows you to make the best use of your data; it allows you to move up the curve in terms of understanding and driving your business. So don’t start down the data repository road until you have a clear idea of what you want to do with the data you have.

We have a vastly increased capacity to collect structured and unstructured data, but just because you can collect it doesn’t mean that you should, or that you’re obliged to use everything you collect. It’s possible to explore data relatively cheaply; cloud technology has brought us that ability. So why spend millions on technology when you can do what you need to do for a fraction of the price?

As the saying goes, don’t boil the ocean. The only way you can solve a problem is by understanding what problem you want to solve in the first place.

Comments

The problem with a data lake (to continue the analogy) is that you still end up wanting bottles of water, or perhaps other, more structured ways of holding the water. The effort to get the water into an organised state can be high.

So you end up trying to add structure to the data generically, building reusable data assets so that the work is done consistently and not repeated. I like to call the collection of these assets an information lake.

Then you take a step back and realise it starts to be reminiscent of a warehouse, only much bigger and able to scale out as required.

Data Scientists will still want the data lake, but it alone will not solve your organisation’s data challenges.

Yes, Mark, totally agree. If you are using the lake to bring in Data Warehouse (i.e. structured) data from multiple sources, then, yes, you do have to invest in making sure that the data has the right structure and granularity in order to be able to compare apples with apples.

The beauty of the data lake should be that you can largely cover both types of use case with the same technology, avoiding many fragmented databases across the firm and, as you said, working at much bigger scale.

Your information lake term is really key; it allows an enterprise to create a strategy for defining each type of information in a global way, so that one can look at data across its business landscape in a consistent way.
