
I'm going to disagree heavily here. The world of cloud computing, microservices, and hosted/managed services has made the analyst's and data engineer's jobs easier than ever. If the software team builds a new DynamoDB table, they simply grant the analytics team's AWS account the appropriate IAM permissions and the analytics team sets up an off-peak bulk extract. A single analyst can easily run an entire data warehouse and analytics pipeline basically part-time, without a single server, using hosted services and microservices. With a team of analysts, the load sharing should be such that the ETL infrastructure is only touched when adding new pipelines or a new feature transformation.
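To make the off-peak bulk extract concrete, here's a minimal sketch using boto3. The table name, bucket, and S3 key layout are illustrative assumptions, not anything from the thread; the deserializer handles only the common attribute types.

```python
import json

def deserialize_item(item):
    """Flatten DynamoDB's attribute-value JSON ({"S": "x"}, {"N": "1"})
    into plain Python values suitable for loading into a warehouse."""
    out = {}
    for key, av in item.items():
        (kind, value), = av.items()
        if kind == "N":
            out[key] = float(value) if "." in value else int(value)
        elif kind == "BOOL":
            out[key] = bool(value)
        else:  # "S" and other types are passed through as-is in this sketch
            out[key] = value
    return out

def bulk_extract(table_name, bucket, dynamodb, s3):
    """Paginate a full table scan and stage the rows in S3 as JSON lines.
    Intended to run from a scheduled job during the off-peak window."""
    rows, start_key = [], None
    while True:
        kwargs = {"TableName": table_name}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = dynamodb.scan(**kwargs)
        rows.extend(deserialize_item(i) for i in page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break
    body = "\n".join(json.dumps(r) for r in rows)
    s3.put_object(Bucket=bucket, Key=f"extracts/{table_name}.json", Body=body)
```

The clients are passed in so the scan/staging logic stays testable without touching AWS.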

And for data scientists working on production models used within production software, most inference is packaged as containers in something like ECS or Fargate, which are then scaled up and down automatically. E.g., they are basically running a microservice for the software teams to consume.

Real-time reporting, in my opinion, is not the domain of analysts; it's the domain of the software team. For one, it's rarely useful outside of something like a NOC (or a similar control-room setting) and should be considered a software feature of that control room. If real time has to be on the analysts (been there), then the software team should dual-publish their transactions to Kinesis Firehose and the analytics team can take it from there.
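The dual-publish is a one-liner on the software side. A sketch, assuming boto3 and a hypothetical stream name; the newline delimiter matters because Firehose concatenates records when it delivers a batch.

```python
import json

def encode_for_firehose(transaction):
    """Firehose concatenates records on delivery, so delimit each JSON
    document with a newline to keep the stream parseable downstream."""
    return (json.dumps(transaction, separators=(",", ":")) + "\n").encode("utf-8")

def dual_publish(transaction, firehose, stream_name="txn-mirror"):
    """Called alongside the normal commit path: the software team writes
    to their store as usual, then mirrors the event to the analytics stream."""
    firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_for_firehose(transaction)},
    )
```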

Of course, all of this relies heavily on buy-in to the world of cloud computing. Come on in, we all float down here.



Cloud computing helps here, but microservices still make this harder. Some of the data is in Dynamo, some of it is in Aurora, some of it is in MySQL RDS, some of it is in S3, and nobody knows where all of it is at once.


From a project-management perspective, each data source should have some requirements behind it from the business team. Those requirements should be prioritized, meaning you can prioritize which data source to tackle first. You automate the process in AWS Data Pipeline for that data source, write the documentation for the next analyst, and move on to the next data source.

The complexity you and the OP seem to be describing is more in the management and prioritization of analytics projects than in the actual "this is a hard technical problem" domain. It's just that a lot of it is tedious, especially compared to the "everyone just put all your data in the Oracle RACs and bug the DBA until they give you permission" model of the past.


Also, one of the service teams might need to change their schema, and the reporting team needs to adjust their process to handle that. That's fine, but they need to know about it in advance; meanwhile they might have a backlog of other things to do, and then some other team's schema changes without notice, so now they're always playing catch-up.
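One cheap mitigation is a drift check at the top of each load, so a silent schema change fails loudly instead of corrupting the pipeline. A minimal sketch (the column names are made up for illustration):

```python
def check_schema(expected_columns, record):
    """Compare an incoming record's keys against the columns the pipeline
    was built for, and report any drift before the load proceeds."""
    actual = set(record)
    return {
        "added": sorted(actual - set(expected_columns)),
        "removed": sorted(set(expected_columns) - actual),
    }

# Usage: raise (or page the owning team) if anything drifted.
drift = check_schema(["id", "amount"], {"id": 1, "amount": 2, "currency": "USD"})
```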


What/where do you run this mythical one-analyst pipeline, though? Is that in cloud services too? Airflow? Kubeflow? Apache Beam? It sounds like you're just pushing the problem around.


AWS Data Pipeline and AWS Lambda. It's cloud services the whole way down.

https://aws.amazon.com/datapipeline/


I saved a company $20k a month by creating a job server in AWS. Lambda isn't cheap when you start using it hard.


Lambda is mostly used for its trigger functionality, for data or artifacts that are created at irregular intervals. E.g., an object is uploaded to S3, which triggers a Lambda, which runs a COPY command for that object into Redshift. The kind of stuff that's well below the threshold for leaving the free tier.
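That trigger pattern fits in a few lines. A sketch of such a handler, where the table name, IAM role ARN, and the `run_on_redshift` helper are illustrative assumptions; the event shape is the standard S3 ObjectCreated notification.

```python
def build_copy_statement(bucket, key, table, iam_role):
    """Assemble the Redshift COPY command for a newly landed S3 object.
    (Table, role ARN, and JSON format are assumptions for this sketch.)"""
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS JSON 'auto';"
    )

def handler(event, context):
    """Lambda entry point: fired by the S3 ObjectCreated notification,
    issues one COPY per uploaded object."""
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        sql = build_copy_statement(
            bucket, key, "events_staging",
            "arn:aws:iam::123456789012:role/redshift-copy",
        )
        run_on_redshift(sql)  # hypothetical helper, e.g. via the Redshift Data API
```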



