Data Lakes vs. Databases vs. Data Warehouses: Why Building a Data Lake is Crucial for Modern Organisations

August 29, 2024

Haran Shivanan

In today’s data-driven world, building a data lake has become essential for organisations looking to harness the power of raw data. Unlike traditional databases and data warehouses, which are limited in their ability to store and process vast amounts of unstructured data, data lakes offer a more flexible solution. By storing data in a data lake, organisations can manage, access, and analyse large volumes of raw data and make it available to AI and other advanced analytics tools. Understanding the differences between databases, data warehouses, and data lakes is crucial for businesses aiming to stay competitive. In this post, we’ll look at all three and help you work out which one you need.


Summary

Data Lakes invert how you think of data.

Previously, your data team spent time working out how to store data in exactly the right shape: minimising compute and storage requirements, and making expensive data warehouses faster and more efficient.

With data lakes, you store data in whatever form is easiest to capture, then apply tools to transform it later.

This is possible because cloud platforms have made simple storage cheap, durable, and fast, so you no longer need a database just to store data.

This idea of storing unstructured and semi-structured data now and cleaning it later is especially relevant today: AI is changing how we think about data, so to be safe we store everything we can, because we don’t know how useful it might be for use cases we haven’t thought of yet.

Let’s dive in!


We get questions from customers and potential customers – big and small – about data lakes and their aspirations to build one.

The term means different things to different people, and there’s generally only a vague understanding of what it is, so let’s clarify it here.

Today, in 2024, there are 3 different ways you usually store data:

  • A database
  • A data warehouse
  • A data lake

Let’s briefly look at how they differ:

Databases

These are what most people use for day-to-day operations and reporting.

There are many kinds of databases, including relational databases (that use SQL), NoSQL databases (that let you store data in a less structured way), time series databases (optimised for sensors and other historical data), and document databases (that store semi-structured records, often as JSON).

Your running IT systems use some combination of the above – sometimes one of them, often more than one. They get you 90% of the way to your final goal.
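As a rough illustration, here’s a minimal Python sketch of the difference between the relational and document styles, using the built-in sqlite3 module for the relational side and a plain JSON document for the document side (the table and field names are invented):

```python
import json
import sqlite3

# Relational: a fixed schema you define up front (the table name is
# invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_readings (device_id TEXT, ts TEXT, temp_c REAL)"
)
conn.execute(
    "INSERT INTO sensor_readings VALUES (?, ?, ?)",
    ("device-2004929", "2024-08-29T10:00:00Z", 21.4),
)

# Document-style: the same reading as a free-form JSON document, where
# each record can carry different fields without a schema change.
reading = {
    "device_id": "device-2004929",
    "ts": "2024-08-29T10:00:00Z",
    "temp_c": 21.4,
    "battery_pct": 87,  # extra field, no schema change needed
}
print(json.dumps(reading))
```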

Data Warehouses

Oftentimes you have complex reporting needs, and the way you store data for regular day-to-day use doesn’t match the way you need to access it to create reports.

So your reports have to pull data from different places in different ways, and generating them becomes really slow.

So what do you do? You make a copy of your regular day-to-day data into a separate place, then structure and summarise it in a way that makes reports fast and easy to generate.

Congratulations – you have a data warehouse!

Because we like to invent complicated ways of describing things, the process of copying your regular day-to-day data into the new place is called an ETL (extract, transform, load) pipeline.

The place where you store the final data used for reporting is called a data warehouse.

This can even be a different place in your existing database.

It could also be a different database. If your volume of data is high, you may choose a dedicated data warehouse application like Snowflake or Redshift.

These are purpose-built databases for doing aggregations and summaries over large volumes of data. They are not really suitable for regular day-to-day data operations but are geared towards reporting over large volumes of data.
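To make the ETL idea concrete, here’s a minimal Python sketch using SQLite for both sides (all table and column names are invented): it extracts raw sales rows from the operational database, transforms them into daily summaries, and loads the result into a separate reporting database.

```python
import sqlite3

# Hypothetical operational database with a raw 'sales' table.
ops = sqlite3.connect("operational.db")
ops.execute(
    "CREATE TABLE IF NOT EXISTS sales (sale_date TEXT, store_id TEXT, amount REAL)"
)

# Extract + Transform: aggregate raw rows into daily totals per store.
daily_totals = ops.execute(
    """
    SELECT sale_date, store_id, SUM(amount) AS total
    FROM sales
    GROUP BY sale_date, store_id
    """
).fetchall()

# Load: write the summaries into a separate 'warehouse' database that
# reports read from instead of the live system.
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS daily_sales (sale_date TEXT, store_id TEXT, total REAL)"
)
warehouse.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", daily_totals)
warehouse.commit()
```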

Data Lakes

With cheap cloud storage, AI taking over the world, and IoT becoming commonplace, there is a need to capture and store large volumes of data so that they can be leveraged later in some way (that ‘some way’ is usually ‘AI’). Crucially, right now we don’t know how we will use that data, but we want to capture and store everything.

This is where data lakes come in.

Since we don’t know how we want to use that data, we don’t know how we want to store it. So we just want to keep it in some raw format so we don’t lose any information and we can figure it out later.

Furthermore, even though storage is cheap, data warehouses are not. They are expensive to procure and run, requiring dedicated infrastructure and carrying high licensing costs, and those costs grow as your data volume grows.

On the other hand, S3 and other object storage services are cheap, durable, fast, and ultra-reliable. Every cloud provider has an S3-equivalent service (Azure Blob Storage, Google Cloud Storage), and there are on-prem alternatives like MinIO as well.

So the thinking is: why not just store all the raw data in some easily accessible format?

So that’s what a data lake is.

It’s not really a thing that you install or provision and run. It’s more of an architecture or concept for how you want to store data.

Data Lakes invert how you think of data

With databases and data warehouses, you think about how you want to use data first, then carefully store it in that shape so it’s easy to read and analyse.

With data lakes, you don’t really know yet how you want to use the data, so you store it the easiest way you can and figure out how to analyse it later.

To think of it another way, data warehouses use ETL (extract, transform, load), whereas data lakes use ELT (extract, load, transform).
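As a minimal sketch of the ELT side – assuming the boto3 and duckdb Python libraries, an invented bucket and key layout, and AWS credentials already configured in the environment – you might load raw JSON events straight into object storage and only transform them at query time:

```python
import json

import boto3
import duckdb

# Load: write the raw event to object storage exactly as received.
s3 = boto3.client("s3")
event = {"device_id": "device-2004929", "ts": "2024-08-29T10:00:00Z", "temp_c": 21.4}
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/sensors/2024/08/29/device-2004929.json",
    Body=json.dumps(event).encode("utf-8"),
)

# Transform: later, query the raw files in place (here via DuckDB's
# httpfs extension), shaping the data only when you actually need it.
duckdb.execute("INSTALL httpfs")
duckdb.execute("LOAD httpfs")
rows = duckdb.sql(
    """
    SELECT device_id, avg(temp_c) AS avg_temp
    FROM read_json_auto('s3://my-data-lake/raw/sensors/2024/08/29/*.json')
    GROUP BY device_id
    """
).fetchall()
```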

So how do I store data in a data lake?

In theory, you could just dump the data in whatever format you receive it in – PDFs, email exports, images.

In practice, a lot of the data you get may not come in a ‘format’ at all – for example, data you pull from an energy meter or an HR application.

There’s no right answer, but generally there are a few industry-standard formats everyone has more or less agreed on for storing data in a data lake:

JSON

This is easy – just serialise your structured data as JSON and write it to a JSON file in some directory structure.

It’s easy to write, but verbose and not as easy to read or analyse later.
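For example, here’s a minimal sketch (the directory layout is an invented convention) that serialises readings into a date-partitioned directory, one JSON object per line:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

readings = [
    {"device_id": "device-2004929", "temp_c": 21.4},
    {"device_id": "device-2004930", "temp_c": 19.8},
]

# Partition the lake by date so later tools can find and prune files easily.
now = datetime.now(timezone.utc)
path = Path("lake/raw/sensors") / now.strftime("%Y/%m/%d")
path.mkdir(parents=True, exist_ok=True)

# One JSON object per line ("JSON Lines") is a common convention because
# it lets tools stream the file record by record.
with open(path / "readings.jsonl", "w") as f:
    for r in readings:
        f.write(json.dumps(r) + "\n")
```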

CSV

This is another popular one – you probably already get some data in this format. Older point-of-sale cash registers, or energy systems from the bronze age, upload CSV files to something called ‘FTP’.

Parquet

This is a popular alternative to CSV. It stores the same kind of tabular data, but in a compressed, columnar format optimised for storage size and fast aggregation.
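As a small sketch – assuming the pandas library with a Parquet engine such as pyarrow installed, and invented file paths – converting one of those old CSV uploads into Parquet is just a couple of lines:

```python
import pandas as pd

# Read a CSV file as it arrived from, say, an old point-of-sale upload.
df = pd.read_csv("lake/raw/pos/2024-08-29.csv")

# Write it back out as Parquet: columnar, compressed, and much faster
# to aggregate over than the original CSV.
df.to_parquet("lake/curated/pos/2024-08-29.parquet")
```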

There are others, like ORC and Avro, that work in a similar fashion.

The metadata layer

Since your data lake is a big soup of data from different places, it becomes important to catalogue and classify all that data in some way so you know what ‘device-2004929.csv’ actually represents.

So you need to create a metadata layer to catalogue and classify all the data in your lake.
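At its simplest, this can be a registry that records where each file came from and what it contains. A hand-rolled sketch in Python (all names invented for illustration):

```python
import json
from pathlib import Path

# A minimal, hand-rolled catalogue: one entry per file in the lake,
# recording its source, schema, and tags, so that 'device-2004929.csv'
# is discoverable later.
catalog = {
    "raw/sensors/2024/08/29/device-2004929.csv": {
        "source": "building-3 energy meter",
        "format": "csv",
        "columns": {"ts": "ISO-8601 timestamp", "kwh": "energy used (kWh)"},
        "tags": ["energy", "iot", "building-3"],
    }
}

Path("lake").mkdir(exist_ok=True)
with open("lake/catalog.json", "w") as f:
    json.dump(catalog, f, indent=2)
```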

Usually this is where some kind of platform comes in.

There are general platforms for working with data lakes, and there are domain-specific platforms tailored to a vertical industry.

Platforms

Lucy is tailored to the workplace and property domain, and platforms like it make it easy to build a data lake and actually use it.

How so?

  • It makes it easy to get data into the system. You can ingest data through webhooks and APIs, pull it from cloud and on-premise systems, enter it manually through forms, or upload arbitrary data – Lucy can even pull it out of your email and chat history.
  • It makes it easy to tag and classify your data, including data that doesn’t sit directly in our system.
  • It makes it possible to query and analyse data from multiple stores through a single unified layer.
  • It makes it easy to publish this data for others to consume.

What makes Lucy tailored to certain verticals?

  • It has out-of-the-box connectors that are industry-specific, like support for standard building management protocols and IoT systems, and connectors to commonly used applications.
  • It has tools to model data in a way that is natural to the domain, like modelling sensors that hold a real-time value and a trend history.
  • It has tools to query and analyse this data in a way that is natural to the property industry.

Interested in building your own data lake?

Start by reading up on how to create a data strategy for your portfolio. Contact us to see how we can help you!
