It's really easy to get lost in the technical jargon that vendors selling products throw around, but this article has missed the important part and spent all its time talking about the relatively unimportant part (data formats).
You need to step back and look from a broader perspective to understand this domain.
Talking about Arrow/Parquet/Iceberg is like talking about InnoDB vs MyISAM when you're talking about databases; yes, those are technically storage engines for MySQL/MariaDB, but no, you probably do not care about them until you need them, and you most certainly do not care about them when you want to understand what a relational DB vs. a NoSQL DB is.
They are technical details.
...
So, if you step back, what you need to read about is STAR SCHEMAS. Here are some links (1), (2).
This is what people used to build before data lakes.
So the tl;dr: you have a big database which contains condensed and annotated versions of your data, which is easy to query, and structured in a way that visualization tools such as Power BI, Tableau, MicroStrategy (ugh, but people do use it), etc. can consume.
This means you can generate reports and insights from your data.
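To make "star schema" concrete, here's a toy sketch: one fact table of measurements surrounded by dimension tables you join to for reporting. The tables and columns are made up, and DuckDB is used purely because it's easy to run locally.

```python
# Toy star schema: one fact table plus two dimension tables, queried with DuckDB.
# All table/column names are made up for illustration.
import duckdb

con = duckdb.connect()  # in-memory database

con.sql("CREATE TABLE dim_date (date_key INTEGER, day DATE, month TEXT, year INTEGER)")
con.sql("CREATE TABLE dim_product (product_key INTEGER, name TEXT, category TEXT)")
con.sql("""CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER,
                                    quantity INTEGER, revenue DOUBLE)""")

con.sql("INSERT INTO dim_date VALUES (1, DATE '2023-01-01', 'Jan', 2023)")
con.sql("INSERT INTO dim_product VALUES (10, 'Widget', 'Hardware')")
con.sql("INSERT INTO fact_sales VALUES (1, 10, 3, 29.97)")

# The kind of query a BI tool generates: join the fact table to its dimensions, then aggregate.
print(con.sql("""
    SELECT d.year, p.category, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_date d USING (date_key)
    JOIN dim_product p USING (product_key)
    GROUP BY d.year, p.category
"""))
```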
Great.
...the problem is that generating this structured data from absolutely massive amounts of unstructured data involves a truly colossal amount of engineering work; and it's never realtime.
That's because the process of turning raw data into a star schema was traditionally done via ETL tools that were slow and terrible. 'Were'. These tools are still slow and terrible.
Basically, the output you get is very valuable, but getting it is very difficult, very expensive and both of those problems scale as the data size scales.
So...
Datalakes.
Datalakes are the solution to this problem; you don't transform the data. You just ingest it and store it, basically raw, and when you need the data for something, you process it on the fly.
The idea was something like a dependency graph: what if, instead of processing all your data every day/hour/whatever, you defined what data you needed, and then, when you actually need it, rebuilt just that part of the database?
Certainly you don't get the nice star schema, but... you can handle a lot of data, and what you need to do to process it 'ad hoc' is mostly pretty trivial, so you don't need a huge engineering effort to support it; you just need some smart table formats, a lot of storage and on-demand compute.
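In practice, "ad hoc" processing usually means pointing a query engine straight at the raw files, something roughly like this. The bucket, paths and columns are made up, and it assumes the raw dumps already landed as Parquet with credentials configured.

```python
# Ad-hoc processing of raw files in object storage: no warehouse, just point an
# engine at the files. Bucket, paths and columns are hypothetical.
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")  # enables s3:// paths
con.sql("LOAD httpfs")

top_users = con.sql("""
    SELECT user_id, COUNT(*) AS n_events
    FROM read_parquet('s3://my-raw-bucket/events/2023/*/*.parquet')
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").df()
print(top_users)
```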
...Great?
No. Totally rubbish.
Turns out this is a stupid idea, and what you get is a lot of data you can't get any insights from.
So, along comes the 'next-gen' batch of BI companies like Databricks, and they invent this idea of a 'lakehouse' (3), (4).
What is it? Take a wild guess. I'll give you a hint: having no tables was a stupid idea.
Yes! Correct, they've invented a layer that sits on top of a data lake that presents a 'virtual database' with ACID transactions that you then build a star schema in/on.
Since the underlying implementation is (magic here, etc., technical details), this approach supports output in the form we originally had (structured data suitable for analytics tools), but it has some nice features, like streaming, that make it capable of handling very large volumes of data. It's not a 'real' database, though, so it does have some limitations which are difficult to resolve (like security and RBAC).
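For a concrete feel of that "virtual database on top of files" layer, here's roughly what it looks like via Delta Lake's Python bindings (the "deltalake" package, delta-rs). The path and columns are made up, credentials are assumed configured, and Iceberg/Hudi have their own equivalents.

```python
# A table that is "just files" in object storage, but with ACID commits and
# snapshots layered on top. Path and columns are hypothetical.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})

# Each write is a transaction recorded in the table's commit log.
write_deltalake("s3://my-lakehouse/silver/orders", df, mode="append")

# Readers get a consistent snapshot, and older versions stay queryable.
table = DeltaTable("s3://my-lakehouse/silver/orders")
print(table.version())
print(table.to_pandas())
```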
...
Of course, the promise, that you just pour all your data in and 'magic!' you have insights, is still just as much nonsense as it ever was.
If you use any of these tools now, you'll see that they require you to transform your data; usually as some kind of batch process.
If you closed your eyes and said "ETL?", you'd win a cookie.
All a 'lakehouse' is, is a traditional BI data warehouse built on a different type of database.
Almost without exception, everything else is marketing fluff.
* exception: Kafka and streaming is actually fundamentally different for real-time aggregated metrics, but it's also fabulously difficult to do well, so most people still don't, as far as I'm aware.
...and I'll go out on a limb here and say that really, you probably do not care whether your implementation uses Delta tables or Iceberg; that's an implementation detail.
I guarantee that correctly understanding your domain data and modelling a form of it suitable for reporting and insights is more important and more valuable than what storage engine you use.
All very good and useful points. One additional thing to mention is that because you are querying across the raw data with a data lake(house), performance is fundamentally worse, even if a lot of the marketing material will tell you otherwise. In practice, usually significantly worse than if your data were in a columnar database.
Depending on your use case this may or may not be a problem. For most companies I'd wager that it is a bigger problem than it first appears.
> The problem is that generating this structured data from absolutely massive amounts of unstructured data involves a truly colossal amount of engineering work
Isn't this kind of obsolete in 2023 with LLMs?
Sure, AI is currently slow, and very expensive, but eventually the idea of needing to query a massive unstructured data source is something that will go the way of the dodo bird when you have a technology that can magically turn unstructured data into structured data quite efficiently. And in that case, when your data is properly structured, there are database technologies that are 1000x more efficient than all of these MapReduce-esque solutions for BI.
I am familiar with traditional JSON-to-relational-table ETL pipelines, but how do ETL pipelines differ in a data lake/house when processing data dumps such as PDFs/images?
I think I agree with this to some extent in that it's hard for me to imagine a use case where I have a bunch of clean Parquet files, nicely partitioned, in some kind of cloud storage system.
If I'm already going through the trouble of doing ELT/ETL and making a clean copy of the raw data, why would I do that in cloud storage and not in an actual database?
I don't echo your dismissal of the idea because a whole lot of people seem to be excited about it. But I personally feel like I'm missing the use case compared to the lake + warehouse setup.
Is it about distributing responsibility across teams? Reducing storage cost? Open source good vibes?
Maybe a legitimate use case is being able to use the same data source for multiple query engine frontends? That is, you can use both Spark and Snowflake on the same physical data files.
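Something like this, I imagine, with DuckDB and Polars standing in for the heavier engines (the path and columns are made up):

```python
# Two engines reading the same physical Parquet files: no copy, no import step.
# DuckDB and Polars stand in for heavier engines; path and columns are made up.
import duckdb
import polars as pl

path = "warehouse/events/*.parquet"

# Engine 1: DuckDB
per_user = duckdb.sql(
    f"SELECT user_id, COUNT(*) AS n FROM read_parquet('{path}') GROUP BY user_id"
).df()

# Engine 2: Polars, lazily scanning the very same files
per_day = (
    pl.scan_parquet(path)
    .group_by("day")
    .agg(pl.col("user_id").count().alias("n_events"))
    .collect()
)
```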
I'd be interested to hear about this from someone who's using or planning to use a lakehouse.
* Storing large amounts of data, like petabytes, in any database is phenomenally expensive, just for the storage alone.
* For some kinds of data, like image data, databases are generally the wrong tool.
* The consumers of these kinds of systems may have really dynamic workloads. Imagine ML jobs that kick off 1K machines simultaneously to hammer your DB and read from it as fast as possible. Cloud-managed object stores have solved this scaling issue already. If you can get infrastructure you manage out of the way, you get to leverage that work. If your DB is in the middle, you're on call for it.
> If I'm already going through the trouble of doing ELT/ETL and making a clean copy of the raw data, why would I do that in cloud storage and not in an actual database?
Well, depends on your requirements. You can definitely go point-to-point straight into another DB.
One reason to keep data in object storage is that it gives you a sort of “db independent” storage layer. At a previous $work, we had a tiered system: data would come in from source systems (primary application DBs, marketing systems, etc.), and would be serialised verbatim in structured format in S3 (layer 1). Data eng systems would then process that data, refining it, enriching it, ensuring types and schemas, etc., which would be serialised into the next tier (layer 2). At this level the data would be nice to use, so the data analysts would operate against it in their Spark notebooks.
BI, reporting, and other applications could either use data from this layer directly, or, if they had special requirements or performed computationally heavy enough tasks, we would add another layer (layer 3) for specialised workloads and presentation layers. Layer 2 and 3 data might also be synced into data warehouses like ClickHouse.
This gave us complete lineage of data (no more mystery tables, no more “where did you get this data from”, etc.), and the storage itself is reasonably cheap. Many services can query these storage layers directly, so setting up views, or projections into different layouts, even for huge quantities of data, becomes feasible and achievable with no more engineering effort than a query.
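As a rough sketch, a layer-1 to layer-2 promotion step looked conceptually like this (local paths standing in for the S3 layers, column names invented):

```python
# Promoting data from the raw layer (verbatim JSON) into the refined layer
# (typed, schema-enforced Parquet). Local paths stand in for the S3 buckets;
# column names are invented for illustration.
import polars as pl

raw = pl.read_ndjson("layer1/orders/2023-06-01.jsonl")

refined = (
    raw.with_columns(
        pl.col("order_id").cast(pl.Int64),
        pl.col("amount").cast(pl.Float64),
    )
    .drop_nulls(subset=["order_id"])
)

# Layer 2: typed and analyst-friendly; downstream tools query this directly.
refined.write_parquet("layer2/orders/2023-06-01.parquet")
```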
Was it a lot? Yep. Would I recommend or do it everywhere? Absolutely, 100% no I would not. Was it a good fit for that org? Yeah, arguably better than they could utilise, but for them, other approaches were anaemic and fragile at best.
Could it be done simpler now? Yep, but it got the job done then haha.
Edit to add: it was also language agnostic, which was a huge win and is an understated part of these new Parquet-based solutions: you're no longer limited to “fragile python app” or “spark cluster” to interact with your data. Rust, C#/F#, various FE tools for JS/TS (Cube, etc.). This is a huge win because you're no longer tied to keeping around an aging Spark/Hadoop cluster that has gradually encrusted more garbage into it until it's this massive, ultra-fragile time bomb nobody dares touch that powers mass amounts of back-office business needs.
All of this is correct, but it misses the main point of the new table formats - they are open source and the data can be stored on very low-cost storage - S3.
So, having a data warehouse that stores TBs or even PBs of data is not as expensive as it used to be (by an order of magnitude or more). And the formats for storing the data (Parquet), its metadata (Iceberg, Hudi, Delta Lake), and the query engines (DuckDB, Polars, Ibis) are all open source.
> it misses the main point of the new table formats
I didn't miss it; it's irrelevant.
It makes almost no difference in practice between a competent implementation in one and a competent implementation in the other.
It makes absolutely no difference that they are open source.
Understanding the details of each of the individual components will give you no meaningful insight into how to build a lakehouse.
...because when you slap all those parts together, in whatever configuration you've picked, what you end up with is a database.
A big, powerful cloud database.
Well, you have a database now, and you still have zero insights and zero idea how to get any of them; that's because you didn't understand that you need to build some kind of data warehouse on top of that database. You need to load the data. You need to transform the data. You need to visualize the data and build reports on it. If you're good, you probably need to preprocess the data to use as training inputs.
I'll say it more clearly and explicitly one. more. time:
- Having a database != having a data warehouse.
- Having a big cloud database built out of cloud storage, table formats, metadata engines and query engines != a lakehouse.
Having an empty database is of no value to anyone, no matter how good it is.
All of those parts, all of those things, are only the first step. It's like installing Postgres. Right, good job. Are we done here? Reports? Oh, you can probably import something or other, or, I know, Power BI is good, let's install that. It'll tell you you have no data... but... we've got the infra now, right? Basically done.
[1] https://learn.microsoft.com/en-us/power-bi/guidance/star-sch...
[2] https://www.kimballgroup.com/data-warehouse-business-intelli...
[3] https://www.snowflake.com/guides/what-data-lakehouse
[4] https://www.databricks.com/glossary/data-lakehouse