I have a tool[1] that only worked with Docker before, and I had been putting off supporting Podman for a while because I thought it would take some time. But it turned out to work straight out of the box, with no tweaking. Essentially frictionless.
I have a small open-source project that uses docker compose behind the scenes to help start up any service. You could look at adding it in (or I am also happy to add it in), and then users are one command away from running it (insta moose). I recently added Lakekeeper and various data annotation tools.
Interesting. How do you handle dependencies between those pieces of infrastructure, if there are any? For example, in our Docker Compose file, we have temporal, which depends on postgres, and then moose depends on temporal. How is that expressed in Insta-Infra?
It leverages docker compose 'depends_on' for the dependencies (https://docs.docker.com/compose/how-tos/startup-order/). For example, airflow depends on the airflow-init container completing successfully, which in turn depends on postgres.
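That chain can be sketched in Compose like this (image tags and the healthcheck command are illustrative, not taken from the actual insta-infra config):

```yaml
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s

  airflow-init:
    image: apache/airflow:2.9.0
    depends_on:
      postgres:
        condition: service_healthy  # wait for the healthcheck to pass

  airflow:
    image: apache/airflow:2.9.0
    depends_on:
      airflow-init:
        condition: service_completed_successfully  # init must exit 0 first
```

With `docker compose up airflow`, Compose resolves the chain and starts postgres first, then runs airflow-init to completion, then starts airflow.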
This looks really cool. I'm surprised I didn't find this before when I was searching for something like this. I've been using jpackage[1] for a while now, but this seems like it would be easier to manage with JReleaser, given there is Gradle support.
Would this be a simple lift-and-shift job to move to JReleaser (as it seems to just use jpackage behind the scenes)? With jpackage, if you want to create a Windows exe, it needs to be built on Windows; similarly, a dmg has to be built on macOS and a deb on Linux. Does JReleaser also require this?
Given that JReleaser relies on jpackage to create native installers, yes, you must run it on the target platform. Luckily, it's not that complicated to do on GitHub Actions. JReleaser offers plenty of examples for different setups; here's how to do it for jpackage: https://jreleaser.org/guide/latest/examples/java/jpackage.ht...
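The usual pattern is an OS matrix so each installer is built on its native runner. A minimal sketch (the Gradle task name and versions are assumptions; check the JReleaser examples linked above for the real setup):

```yaml
jobs:
  package:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '21'
      # Each runner produces only its own platform's installer
      # (deb on Linux, dmg on macOS, exe/msi on Windows).
      - name: Assemble native installer
        run: ./gradlew jreleaserAssemble
```

A later job can then collect the three artifacts and run the release step once.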
I've taken a stab at a solution via https://github.com/data-catering/data-caterer. It focuses on making integration tests easier by generating data across batch and real-time data sources while maintaining any relationships across the datasets. You can set it to automatically pick up the schema definition from your database's metadata and generate data for it. Once your app/job/data consumers use the data, you can run data validations to ensure everything behaved as expected. Then you can clean up the data at the end (including data pushed to downstream data sources) if running in a shared test environment or locally. All of this runs within 60 seconds.
It also gives you the option of running other types of tests, such as load/performance/stress testing, by generating larger amounts of data.
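The "maintaining relationships across datasets" part is the core idea. A stripped-down sketch of it in plain Python (not Data Caterer's actual API; names and schemas here are made up for illustration): child rows draw their foreign keys from the generated parent rows, so referential integrity holds by construction, and a post-run validation confirms it.

```python
import random
import string


def gen_customers(n):
    """Generate parent rows with unique ids."""
    return [
        {"id": i, "name": "".join(random.choices(string.ascii_lowercase, k=8))}
        for i in range(1, n + 1)
    ]


def gen_orders(customers, n):
    """Generate child rows whose customer_id is drawn from existing
    parent ids, preserving referential integrity by construction."""
    ids = [c["id"] for c in customers]
    return [
        {
            "order_id": i,
            "customer_id": random.choice(ids),
            "amount": round(random.uniform(1, 500), 2),
        }
        for i in range(1, n + 1)
    ]


def validate(customers, orders):
    """Post-run check: every order points at a real customer."""
    ids = {c["id"] for c in customers}
    return all(o["customer_id"] in ids for o in orders)


customers = gen_customers(10)
orders = gen_orders(customers, 50)
assert validate(customers, orders)
```

ML-based generators have to learn these constraints from data; a metadata-driven approach gets them for free because the relationship is part of the generation plan.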
Too many times I've been frustrated when an installation doesn't work the first time, or it has dependencies you haven't installed (or worse, you have a different version of). Then you end up in a deep rabbit hole you can't dig out of. Now every tool I make must have a quick start with a single command.
The world of mock data generation is now flooded with ML/AI solutions for generating data, but this one understands that it is better to generate metadata to guide the data generation. I found this to be the case because the ML/AI solutions rely on production data and retraining, are slow, demand huge resources, offer no guarantee against leaking sensitive data, and struggle to retain referential integrity.
As mentioned in the article, I think there is a lot of room for improvement in this area. I've been working on a tool called Data Caterer (https://github.com/data-catering/data-caterer), a metadata-driven data generator that can also run validations on the generated data. That gives you full end-to-end testing with a single tool. There are also other metadata sources besides LLMs that can help drive these kinds of tools (i.e. data catalogs, data quality tools).
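"Metadata-driven" concretely means the generator reads the schema out of the target system instead of being told it. A minimal sketch using SQLite (this is generic illustration, not Data Caterer's implementation): pull column names and types via `PRAGMA table_info`, then generate rows that fit them.

```python
import random
import sqlite3
import string

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, age INTEGER)")


def columns(conn, table):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]


def gen_value(sql_type):
    # Type-directed generation: the schema metadata decides what to produce.
    if sql_type.upper() == "INTEGER":
        return random.randint(0, 100)
    return "".join(random.choices(string.ascii_lowercase, k=10))


def gen_rows(conn, table, n):
    cols = columns(conn, table)
    return [tuple(gen_value(t) for _, t in cols) for _ in range(n)]


rows = gen_rows(conn, "users", 5)
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
```

The same idea scales up when the metadata source is a data catalog or data quality rules instead of the database itself: richer metadata (formats, ranges, allowed values) just means more precise generators per column.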
I recently went down the rabbit hole of using PyScript for running a Python CLI app in the browser.
It felt hacky the whole time, especially once dependencies were involved. I had to create wrapper classes to work around Pydantic 2.x not being available. I tried to put all the logic into the Python files but found some things missing that I had to implement in JavaScript instead.
I think it could be a good fit where you want a simple UI with custom UI logic on top of your Python code, but Streamlit or Gradio might be more suitable otherwise.
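For anyone curious what the dependency story looks like, the basic shape is declaring packages in a `<py-config>` block. A hedged sketch (the release URL and package are placeholders; PyScript's tags and loading mechanism have changed across versions, so check the docs for whatever version you pin):

```html
<html>
  <head>
    <!-- Pin a specific PyScript release; URL is illustrative -->
    <script defer src="https://pyscript.net/releases/X.Y.Z/core.js"></script>
  </head>
  <body>
    <py-config>
      packages = ["numpy"]
    </py-config>
    <py-script>
      import numpy as np
      print(np.arange(3))
    </py-script>
  </body>
</html>
```

Only packages with Pyodide-compatible builds resolve this way, which is exactly where the wrapper-class workarounds come in for anything that doesn't.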
[1] Tool for reference: https://github.com/data-catering/insta-infra