Shodata aims to solve a different problem: lightweight versioning for small-to-medium datasets with zero infrastructure setup. Think "GitHub for CSV files" rather than a full data lakehouse.
Iceberg is excellent for production data lakes with Spark/Trino, but it requires running a catalog, configuring S3/Glue, and knowing SQL. For many ML teams working with <100GB datasets, that's overkill.
Our sweet spot is teams who need:
Drag-and-drop versioning (no CLI/SDK required)
Instant previews and diff visualization
Collaboration features (comments, access control)
Public sharing (like the LLM hallucinations dataset)
I'll definitely look at Iceberg's catalog design for inspiration on metadata management. Appreciate the pointer!