It's really easy to get lost in the technical jargon that vendors selling products throw around, but this article has missed the important part and spent all its time talking about the relatively unimportant part (data formats).
You need to step back and look from a broader perspective to understand this domain.
Talking about Arrow/Parquet/Iceberg is like talking about InnoDB vs MyISAM when you're talking about databases; yes, those are technically storage engines for MySQL/MariaDB, but no, you probably do not care about them until you need them, and you most certainly do not care about them when you want to understand what a relational DB vs. a NoSQL DB is.
They are technical details.
...
So, if you step back, what you need to read about is STAR SCHEMAS. Here are some links (1), (2).
This is what people used to build before data lakes.
So the tl;dr: you have a big database which contains condensed and annotated versions of your data, which is easy to query, and structured in a way that visualization tools such as Power BI, Tableau, MicroStrategy (ugh, but people do use it), etc. can consume.
This means you can generate reports and insights from your data.
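To make "star schema" concrete, here's a toy sketch: one fact table of measurements surrounded by dimension tables you join to for reporting. The tables and columns are made up, and DuckDB is used purely because it's easy to run locally.

```python
# Toy star schema: one fact table plus two dimension tables, queried with DuckDB.
# All table/column names are made up for illustration.
import duckdb

con = duckdb.connect()  # in-memory database

con.sql("CREATE TABLE dim_date (date_key INTEGER, day DATE, month TEXT, year INTEGER)")
con.sql("CREATE TABLE dim_product (product_key INTEGER, name TEXT, category TEXT)")
con.sql("""CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER,
                                    quantity INTEGER, revenue DOUBLE)""")

con.sql("INSERT INTO dim_date VALUES (1, DATE '2023-01-01', 'Jan', 2023)")
con.sql("INSERT INTO dim_product VALUES (10, 'Widget', 'Hardware')")
con.sql("INSERT INTO fact_sales VALUES (1, 10, 3, 29.97)")

# The kind of query a BI tool generates: join the fact table to its dimensions, then aggregate.
print(con.sql("""
    SELECT d.year, p.category, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_date d USING (date_key)
    JOIN dim_product p USING (product_key)
    GROUP BY d.year, p.category
"""))
```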
Great.
...the problem is that generating this structured data from absolutely massive amounts of unstructured data involves a truly colossal amount of engineering work; and it's never realtime.
That's because the process of turning raw data into a star schema was traditionally done via ETL tools that were slow and terrible. 'Were'. These tools are still slow and terrible.
Basically, the output you get is very valuable, but getting it is very difficult, very expensive and both of those problems scale as the data size scales.
So...
Datalakes.
Datalakes are the solution to this problem; you don't transform the data. You just ingest it and store it, basically raw, and when you need the data for something, you process it on the fly.
The idea was something like a dependency graph: what if, instead of processing all your data every day/hour/whatever, you defined what data you needed, and then, when you actually need it, rebuilt just that part of the database?
Certainly you don't get the nice star schema, but... you can handle a lot of data, and what you need to do to process it 'ad hoc' is mostly pretty trivial, so you don't need a huge engineering effort to support it; you just need some smart table formats, a lot of storage and on-demand compute.
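In practice, "ad hoc" processing usually means pointing a query engine straight at the raw files, something roughly like this. The bucket, paths and columns are made up, and it assumes the raw dumps already landed as Parquet with credentials configured.

```python
# Ad-hoc processing of raw files in object storage: no warehouse, just point an
# engine at the files. Bucket, paths and columns are hypothetical.
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")  # enables s3:// paths
con.sql("LOAD httpfs")

top_users = con.sql("""
    SELECT user_id, COUNT(*) AS n_events
    FROM read_parquet('s3://my-raw-bucket/events/2023/*/*.parquet')
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").df()
print(top_users)
```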
...Great?
No. Totally rubbish.
Turns out this is a stupid idea, and what you get is a lot of data you can't get any insights from.
So, along comes the 'next-gen' batch of BI companies like Databricks, and they invent this idea of a 'lakehouse' (3), (4).
What is it? Take a wild guess. I'll give you a hint: having no tables was a stupid idea.
Yes! Correct, they've invented a layer that sits on top of a data lake that presents a 'virtual database' with ACID transactions that you then build a star schema in/on.
Since the underlying implementation is (magic here, etc., technical details), this approach supports output in the form we originally had (structured data suitable for analytics tools), but it has some nice features, like streaming, that make it capable of handling very large volumes of data. It's not a 'real' database, though, so it does have some limitations which are difficult to resolve (like security and RBAC).
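For a concrete feel of that "virtual database on top of files" layer, here's roughly what it looks like via Delta Lake's Python bindings (the "deltalake" package, delta-rs). The path and columns are made up, credentials are assumed configured, and Iceberg/Hudi have their own equivalents.

```python
# A table that is "just files" in object storage, but with ACID commits and
# snapshots layered on top. Path and columns are hypothetical.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})

# Each write is a transaction recorded in the table's commit log.
write_deltalake("s3://my-lakehouse/silver/orders", df, mode="append")

# Readers get a consistent snapshot, and older versions stay queryable.
table = DeltaTable("s3://my-lakehouse/silver/orders")
print(table.version())
print(table.to_pandas())
```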
...
Of course, the promise, that you just pour all your data in and 'magic!' you have insights, is still just as much nonsense as it ever was.
If you use any of these tools now, you'll see that they require you to transform your data; usually as some kind of batch process.
If you closed your eyes and said "ETL?", you'd win a cookie.
All a 'lakehouse' is, is a traditional BI data warehouse built on a different type of database.
Almost without exception, everything else is marketing fluff.
* exception: Kafka and streaming is actually fundamentally different for real-time aggregated metrics, but it's also fabulously difficult to do well, so most people still don't, as far as I'm aware.
...and I'll go out on a limb here and say that really, you probably do not care whether your implementation uses Delta tables or Iceberg; that's an implementation detail.
I guarantee that correctly understanding your domain data and modelling a form of it suitable for reporting and insights is more important and more valuable than what storage engine you use.
All very good and useful points. One additional thing to mention is that because you are querying across the raw data with a data lake(house), performance is fundamentally worse, even if a lot of the marketing material will tell you otherwise. In practice, usually significantly worse than if your data were in a columnar database.
Depending on your use case this may or may not be a problem. For most companies I'd wager that it is a bigger problem than it first appears.
> The problem is that generating this structured data from absolutely massive amounts of unstructured data involves a truly colossal amount of engineering work
Isn't this kind of obsolete in 2023 with LLMs?
Sure, AI is currently slow, and very expensive, but eventually the idea of needing to query a massive unstructured data source is something that will go the way of the dodo bird when you have a technology that can magically turn unstructured data into structured data quite efficiently. And in that case, when your data is properly structured, there are database technologies that are 1000x more efficient than all of these MapReduce-esque solutions for BI.
I am familiar with traditional JSON-to-relational-table ETL pipelines, but how do ETL pipelines differ in a data lake/house when processing data dumps such as PDFs/images?
I think I agree with this to some extent in that it's hard for me to imagine a use case where I have a bunch of clean Parquet files, nicely partitioned, in some kind of cloud storage system.
If I'm already going through the trouble of doing ELT/ETL and making a clean copy of the raw data, why would I do that in cloud storage and not in an actual database?
I don't echo your dismissal of the idea because a whole lot of people seem to be excited about it. But I personally feel like I'm missing the use case compared to the lake + warehouse setup.
Is it about distributing responsibility across teams? Reducing storage cost? Open source good vibes?
Maybe a legitimate use case is being able to use the same data source for multiple query engine frontends? That is, you can use both Spark and Snowflake on the same physical data files.
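Something like this, I imagine, with DuckDB and Polars standing in for the heavier engines (the path and columns are made up):

```python
# Two engines reading the same physical Parquet files: no copy, no import step.
# DuckDB and Polars stand in for heavier engines; path and columns are made up.
import duckdb
import polars as pl

path = "warehouse/events/*.parquet"

# Engine 1: DuckDB
per_user = duckdb.sql(
    f"SELECT user_id, COUNT(*) AS n FROM read_parquet('{path}') GROUP BY user_id"
).df()

# Engine 2: Polars, lazily scanning the very same files
per_day = (
    pl.scan_parquet(path)
    .group_by("day")
    .agg(pl.col("user_id").count().alias("n_events"))
    .collect()
)
```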
I'd be interested to hear about this from someone who's using or planning to use a lakehouse.
* Storing large amounts of data, like petabytes, in any database is phenomenally expensive, just for the storage alone.
* For some kinds of data, like image data, databases are generally the wrong tool.
* The consumers of these kinds of systems may have really dynamic workloads. Imagine ML jobs that kick off 1K machines simultaneously to hammer your DB and read from it as fast as possible. Cloud-managed object stores have solved this scaling issue already. If you can get infrastructure you manage out of the way, you get to leverage that work. If your DB is in the middle, you're on call for it.
> If I'm already going through the trouble of doing ELT/ETL and making a clean copy of the raw data, why would I do that in cloud storage and not in an actual database?
Well, depends on your requirements. You can definitely go point-to-point straight into another DB.
One reason to keep data in object storage is that it gives you a sort of “db independent” storage layer. At a previous $work, we had a tiered system: data would come in from source systems (primary application DBs, marketing systems, etc.), and would be serialised verbatim in structured format in S3 (layer 1). Data eng systems would then process that data, refining it, enriching it, ensuring types and schemas, etc., which would be serialised into the next tier (layer 2). At this level the data would be nice to use, so the data analysts would operate against it in their Spark notebooks.
BI, reporting, and other applications could either use data from this layer directly, or, if they had special requirements or performed computationally heavy enough tasks, we would add another layer (layer 3) for specialised workloads and presentation layers. Layer 2 and 3 data might also be synced into data warehouses like ClickHouse.
This gave us complete lineage of data (no more mystery tables, no more “where did you get this data from”, etc.), and the storage itself is reasonably cheap. Many services can query these storage layers directly, so setting up views, or projections into different layouts, even for huge quantities of data, becomes feasible and achievable with no more engineering effort than a query.
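As a rough sketch, a layer-1 to layer-2 promotion step looked conceptually like this (local paths standing in for the S3 layers, column names invented):

```python
# Promoting data from the raw layer (verbatim JSON) into the refined layer
# (typed, schema-enforced Parquet). Local paths stand in for the S3 buckets;
# column names are invented for illustration.
import polars as pl

raw = pl.read_ndjson("layer1/orders/2023-06-01.jsonl")

refined = (
    raw.with_columns(
        pl.col("order_id").cast(pl.Int64),
        pl.col("amount").cast(pl.Float64),
    )
    .drop_nulls(subset=["order_id"])
)

# Layer 2: typed and analyst-friendly; downstream tools query this directly.
refined.write_parquet("layer2/orders/2023-06-01.parquet")
```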
Was it a lot? Yep. Would I recommend or do it everywhere? Absolutely, 100% no I would not. Was it a good fit for that org? Yeah, arguably better than they could utilise, but for them, other approaches were anaemic and fragile at best.
Could it be done simpler now? Yep, but it got the job done then haha.
Edit to add: it was also language agnostic, which was a huge win and is an understated part of these new Parquet-based solutions: you're no longer limited to “fragile python app” or “spark cluster” to interact with your data. Rust, C#/F#, various FE tools for JS/TS (Cube, etc.). This is a huge win because you're no longer tied to keeping around an aging Spark/Hadoop cluster that has gradually encrusted more garbage into it until it's this massive, ultra-fragile time bomb nobody dares touch that powers mass amounts of back-office business needs.
All of this is correct, but it misses the main point of the new table formats - they are open source and the data can be stored on very low-cost storage - S3.
So, having a data warehouse that stores TBs or even PBs of data is not as expensive as it used to be (by an order of magnitude or more). And the formats for storing the data (Parquet), its metadata (Iceberg, Hudi, Delta Lake), and the query engines (DuckDB, Polars, Ibis) are all open source.
> it misses the main point of the new table formats
I didn't miss it; it's irrelevant.
It makes almost no difference in practice between a competent implementation in one and a competent implementation in the other.
It makes absolutely no difference that they are open source.
Understanding the details of each of the individual components will give you no meaningful insight into how to build a lakehouse.
...because when you slap all those parts together, in whatever configuration you've picked, what you end up with is a database.
A big, powerful cloud database.
Well, you have a database now, and you still have zero insights and zero idea how to get any of them; that's because you didn't understand that you need to build some kind of data warehouse on top of that database. You need to load the data. You need to transform the data. You need to visualize the data and build reports on it. If you're good, you probably need to preprocess the data to use as training inputs.
I'll say it more clearly and explicitly one. more. time:
- Having a database != having a data warehouse.
- Having a big cloud database built out of cloud storage, table formats, metadata engines and query engines != a lakehouse.
Having an empty database is of no value to anyone, no matter how good it is.
All of those parts, all of those things, are only the first step. It's like installing Postgres. Right, good job. Are we done here? Reports? Oh, you can probably import something or other, or, I know, Power BI is good, let's install that. It'll tell you you have no data... but... we've got the infra now, right? Basically done.
[1] https://learn.microsoft.com/en-us/power-bi/guidance/star-sch...
[2] https://www.kimballgroup.com/data-warehouse-business-intelli...
[3] https://www.snowflake.com/guides/what-data-lakehouse
[4] https://www.databricks.com/glossary/data-lakehouse