ark__n's comments

I need help choosing tools for the following problem:

The system is a set of batch jobs scheduled to run at different intervals. Each job can be modelled as a directed acyclic graph (DAG) of steps. The jobs download files from vendors and map the rows inside them into a generic format (for generating reports). There are many vendors, and each vendor can have a different file format containing different fields, so custom business logic is required to populate (map) the corresponding generic file (aggregating fields, fetching values from the DB, etc.). These vendors' files also sometimes contain errors or are dropped late for download, so failures happen, and it should be possible to rerun failed job instances.
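
As a rough illustration of the "DAG of steps" shape described above, here is a minimal sketch in plain Python (used only for brevity; the same structure ports to Java). Everything in it is hypothetical: the step names, the `map_acme_row` mapper, and the sample rows are stand-ins, not part of the real system.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def map_acme_row(row):
    # Hypothetical vendor-specific mapping into the generic format.
    return {"id": row["acct"], "amount": float(row["amt"])}


def download(ctx):
    # Stand-in for fetching the vendor file; real code would use SFTP/HTTP.
    ctx["rows"] = [{"acct": "A1", "amt": "10.5"}, {"acct": "A2", "amt": "3"}]


def map_rows(ctx):
    ctx["generic"] = [map_acme_row(r) for r in ctx["rows"]]


def write_report(ctx):
    ctx["report"] = sum(r["amount"] for r in ctx["generic"])


STEPS = {"download": download, "map_rows": map_rows, "write_report": write_report}

# Each step name maps to the set of steps it depends on.
DEPENDS_ON = {
    "download": set(),
    "map_rows": {"download"},
    "write_report": {"map_rows"},
}


def run_job(steps, depends_on):
    """Run every step after its dependencies; return the order and the context."""
    ctx = {}
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        steps[name](ctx)
    return order, ctx
```

Calling `run_job(STEPS, DEPENDS_ON)` runs the three steps in dependency order. Only the mapper is vendor-specific, which is roughly where the custom business logic would live.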

The existing system is built using Spring Batch and Spring Integration. The problems with the existing system are:

1. there are more than 200 jobs and most of them have their own custom mapping logic, which cannot be made generic

2. a lot of manual work is needed to onboard new vendors

3. jobs are synchronous and run on only one node, typically for many hours

4. rerunning jobs is a nightmare

Dream state for this system:

1. Dynamically add jobs to the runtime using generic components that can be reused -- maybe through an API / UI

2. Preferably, the records of a single file would be processed across distributed nodes to generate a single generic output file

3. Rerunning should be easier
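
One way the three points above could fit together, sketched in plain Python purely for illustration (not as a tool recommendation): reusable steps live in a registry, a job is just data that could arrive through an API or UI, records are processed in chunks, and a rerun only touches chunks that failed. All names here (`STEP_REGISTRY`, `rename_fields`, `to_float`, the `JOB` layout) are assumptions made up for the sketch, and the loop over chunks stands in for distributing them to worker nodes.

```python
# Registry of reusable step implementations, keyed by name.
STEP_REGISTRY = {}


def step(name):
    """Decorator that registers a reusable step under `name`."""
    def register(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return register


@step("rename_fields")
def rename_fields(rows, mapping):
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


@step("to_float")
def to_float(rows, field):
    return [{**row, field: float(row[field])} for row in rows]


def chunks(rows, size):
    """Yield (chunk_id, rows_in_chunk) pairs."""
    for i in range(0, len(rows), size):
        yield i // size, rows[i:i + size]


def run_job(job_def, rows, done):
    """Process chunks not yet in `done`; return the ids of chunks that failed.

    `done` maps chunk_id -> output rows and would be persisted in a real system.
    """
    # Resolve steps up front so an unknown step name fails the whole job loudly.
    pipeline = [(STEP_REGISTRY[s["use"]], s.get("args", {})) for s in job_def["steps"]]
    failed = []
    for chunk_id, chunk in chunks(rows, job_def["chunk_size"]):
        if chunk_id in done:
            continue  # already succeeded in an earlier attempt
        try:
            out = chunk
            for fn, args in pipeline:
                out = fn(out, **args)
            done[chunk_id] = out
        except (KeyError, ValueError):
            failed.append(chunk_id)  # bad data in this chunk; retry later
    return failed


# A job definition is plain data, so it could be added at runtime.
JOB = {
    "chunk_size": 2,
    "steps": [
        {"use": "rename_fields", "args": {"mapping": {"amt": "amount"}}},
        {"use": "to_float", "args": {"field": "amount"}},
    ],
}
```

Running `run_job(JOB, rows, done)` a second time with the same `done` dictionary skips chunks that already succeeded, so a rerun after the vendor fixes a bad file only redoes the failed part. Vendor-specific logic would still need its own registered step.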

I am a noob at CS. I have done a good bit of research over the past month and found a few data-science tools in Python, which is a no-no for a production system. I also know that the steps can only be made generic up to a point, since custom mapping logic is required for almost every vendor. But I am asking to see what is possible. Any pointers to prospective tools and technologies for solving the above would be much appreciated.

Thanks


Use Airflow maybe?


Looks very promising. Can I add new jobs (tasks in Airflow's jargon) reusing my custom steps (operators in Airflow's jargon) during runtime? Also, is there something similar in Java, Go, etc.?


Aside: how do you go about writing these blogs? As a new CompSci grad, I find the level of detail these blogs go into overwhelming.


I guess people mainly write about what they're passionate about, and for the same reason they know a lot about those topics: they have already spent a lot of time thinking about them.


It's a mix of personal interest and seeking to develop an expertise. It's also a lot of discipline. I wrote more publicly in the early part of my career, but my writing is more company-internal these days.


The only problem I had with Bitwarden was that you cannot add or update entries on mobile when you're offline. This might not be a big issue for many, but it was a deal-breaker for me. I'm now rocking a local KeePassXC (PC) + Keepass2Android + Syncthing setup that syncs when I'm on my home network.

