
I'm going to disagree heavily here. The world of cloud computing, microservices, and hosted/managed services has made the analyst's and data engineer's jobs easier than ever. If the software team builds a new DynamoDB table, they simply grant the analytics team's AWS account the appropriate IAM permissions and the analytics team sets up an off-peak bulk extract. A single analyst can easily run an entire data warehouse and analytics pipeline basically part-time, without a single server, using hosted services and microservices. With a team of analysts, the load sharing should be such that the ETL infrastructure is only touched when adding new pipelines or a new feature transformation.
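To make the off-peak bulk extract concrete, here's a minimal sketch using boto3. The table name, bucket, and S3 key layout are illustrative assumptions, not anything from the thread; the deserializer handles only the common attribute types.

```python
import json

def deserialize_item(item):
    """Flatten DynamoDB's attribute-value JSON ({"S": "x"}, {"N": "1"})
    into plain Python values suitable for loading into a warehouse."""
    out = {}
    for key, av in item.items():
        (kind, value), = av.items()
        if kind == "N":
            out[key] = float(value) if "." in value else int(value)
        elif kind == "BOOL":
            out[key] = bool(value)
        else:  # "S" and other types are passed through as-is in this sketch
            out[key] = value
    return out

def bulk_extract(table_name, bucket, dynamodb, s3):
    """Paginate a full table scan and stage the rows in S3 as JSON lines.
    Intended to run from a scheduled job during the off-peak window."""
    rows, start_key = [], None
    while True:
        kwargs = {"TableName": table_name}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = dynamodb.scan(**kwargs)
        rows.extend(deserialize_item(i) for i in page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break
    body = "\n".join(json.dumps(r) for r in rows)
    s3.put_object(Bucket=bucket, Key=f"extracts/{table_name}.json", Body=body)
```

The clients are passed in so the scan/staging logic stays testable without touching AWS.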

And for data scientists working on production models used within production software, most inference is packaged as containers in something like ECS or Fargate, which are then scaled up and down automatically. E.g., they are basically running a microservice for the software teams to consume.

Real-time reporting, in my opinion, is not the domain of analysts; it's the domain of the software team. For one, it's rarely useful outside of something like a NOC (or a similar control-room setting) and should be considered a software feature of that control room. If real time has to be on the analysts (been there), then the software team should dual-publish their transactions to Kinesis Firehose and the analytics team can take it from there.
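The dual-publish is a one-liner on the software side. A sketch, assuming boto3 and a hypothetical stream name; the newline delimiter matters because Firehose concatenates records when it delivers a batch.

```python
import json

def encode_for_firehose(transaction):
    """Firehose concatenates records on delivery, so delimit each JSON
    document with a newline to keep the stream parseable downstream."""
    return (json.dumps(transaction, separators=(",", ":")) + "\n").encode("utf-8")

def dual_publish(transaction, firehose, stream_name="txn-mirror"):
    """Called alongside the normal commit path: the software team writes
    to their store as usual, then mirrors the event to the analytics stream."""
    firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_for_firehose(transaction)},
    )
```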

Of course, all of this relies heavily on buy-in to the world of cloud computing. Come on in, we all float down here.



Cloud computing helps here, but microservices still make this harder. Some of the data is in Dynamo, some of it is in Aurora, some of it is in MySQL RDS, some of it is in S3, and nobody knows where all of it is at once.


From a project-management perspective, each data source should have some requirements behind it from the business team. Those requirements should be prioritized, meaning you can prioritize which data source to tackle first. You automate the process in AWS Data Pipeline for that data source, write the documentation for the next analyst, and move on to the next data source.

The complexity you and the OP seem to be describing is more in the management and prioritization of analytics projects than in the actual "this is a hard technical problem" domain. It's just that a lot of it is tedious, especially compared to the "everyone just put all your data in the Oracle RACs and bug the DBA until they give you permission" model of the past.


Also, one of the service teams might need to change their schema, and the reporting team needs to adjust their process to handle that. That's fine, but they need to know about it in advance; meanwhile they might have a backlog of other things to do, and then some other team's schema changes without notice, so now they're always playing catch-up.
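One cheap mitigation is a drift check at the top of each load, so a silent schema change fails loudly instead of corrupting the pipeline. A minimal sketch (the column names are made up for illustration):

```python
def check_schema(expected_columns, record):
    """Compare an incoming record's keys against the columns the pipeline
    was built for, and report any drift before the load proceeds."""
    actual = set(record)
    return {
        "added": sorted(actual - set(expected_columns)),
        "removed": sorted(set(expected_columns) - actual),
    }

# Usage: raise (or page the owning team) if anything drifted.
drift = check_schema(["id", "amount"], {"id": 1, "amount": 2, "currency": "USD"})
```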


What/where do you run this mythical one-analyst pipeline, though? Is that in cloud services too? Airflow? Kubeflow? Apache Beam? It sounds like you're just pushing the problem around.


AWS Data Pipeline and AWS Lambda. It's cloud services the whole way down.

https://aws.amazon.com/datapipeline/


I saved a company $20k a month by creating a job server in AWS. Lambda isn't cheap when you start using it hard.


Lambda is mostly used for its trigger functionality, for data or artifacts that are created at irregular intervals. E.g., an object is uploaded to S3, which triggers a Lambda, which runs a COPY command for that object into Redshift. The kind of stuff that's well below the threshold for leaving the free tier.
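That trigger pattern fits in a few lines. A sketch of such a handler, where the table name, IAM role ARN, and the `run_on_redshift` helper are illustrative assumptions; the event shape is the standard S3 ObjectCreated notification.

```python
def build_copy_statement(bucket, key, table, iam_role):
    """Assemble the Redshift COPY command for a newly landed S3 object.
    (Table, role ARN, and JSON format are assumptions for this sketch.)"""
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS JSON 'auto';"
    )

def handler(event, context):
    """Lambda entry point: fired by the S3 ObjectCreated notification,
    issues one COPY per uploaded object."""
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        sql = build_copy_statement(
            bucket, key, "events_staging",
            "arn:aws:iam::123456789012:role/redshift-copy",
        )
        run_on_redshift(sql)  # hypothetical helper, e.g. via the Redshift Data API
```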



