Thank you for this excellent post! I've been developing [my own platform](https://github.com/MattTriano/analytics_data_where_house) that curates a data warehouse, mostly of Census and Socrata datasets, but I haven't had a good way to share the products with anyone because it's a bit too heavyweight. I've been looking for alternative solutions to that issue (I'm currently building a much smaller [platform](https://github.com/MattTriano/fbi_cde_data) to process the FBI's NIBRS datasets), and your post has given me a few great implementations to study and experiment with.
Because it means you need to keep another copy of your data in a special format just for DuckDB. The point of Parquet is that it’s an open format queryable by multiple tools. You don’t need to wait to load every table into a new format, you don’t need to retain multiple copies, and you don’t need to keep them in sync.
If DuckDB is the only query engine in your analytics stack, then it makes sense to use its specialized format. But that’s not the typical Lakehouse use case.
Yeah, that’s one of many issues with ClickBench. It’s also a single table, so it can’t test joins.
TPC-H is okay but not Lakehouse specific. I’m not aware of any benchmarks that specifically test the performance of engines under common setups like external storage or scalable compute. It would be hard to design one that’s easily reproducible. (And in fairness to ClickBench, it’s intentionally simple for that exact reason: to generate a baseline score for any query engine that can query tabular data.)
Hi, I am working on something similar and was looking for ways to host my open data. The approach seems interesting. Can I reach out to you somewhere to discuss this further?
That note should be named "Thinking" to be honest. The goal is to collect a bunch of bullet points that I can read at any time to remind me how to think better or to flag common gotchas.
They go by several names. I like to call mine a Personal Handbook[1], but I've also seen this referred to as a Digital Garden or a Personal Knowledge Base.