Completely agree with you re: Bloomberg's somewhat shady history of reporting. However, in this case, the article is citing a research note written by TD Cowen equity research analysts.
"The Internal Revenue Service's (IRS) legacy IT environment includes applications, software, and hardware, which are outdated but still critical to day-to-day operations. Specifically, GAO's analysis showed that about 33 percent of the applications, 23 percent of the software instances in use, and 8 percent of hardware assets were considered legacy. This includes applications ranging from 25 to 64 years in age, as well as software up to 15 versions behind the current version. As GAO has previously noted, and IRS has acknowledged, these legacy assets will continue to contribute to security risks, unmet mission needs, staffing issues, and increased costs."
"Snowflake has no incentive to push a code change that makes things 20% faster because that can correspond to 10–20% drop in short-term revenue. In a typical Innovator’s Dilemma, Snowflake prioritizes other things that generate an ever larger menu of compute options, like Snowpark and data apps built on Streamlit, that will bleed your organization dry."
This is not true. Snowflake has done just that: it has continuously improved performance, reducing credit consumption, and therefore revenue, per unit of compute and storage. And it has negatively impacted their revenue and stock price. Snowflake's incentive is to strengthen its competitive position and, it hopes, generate more long-term revenue from its customers.
When guiding for 2022 revenue, the CFO forecast a $97 million shortfall resulting from product improvements, and Snowflake's stock dropped immediately after.
"Similarly, phased throughout this year, we are rolling out platform improvements within our cloud deployments. No two customers are the same, but our initial testing has shown performance improvements ranging on average from 10% to 20%. We have assumed an approximately $97 million revenue impact in our full-year forecast, but there is still uncertainty around the full impact these improvements can have. While these efforts negatively impact our revenue in the near term, over time, they lead customers to deploy more workloads to Snowflake due to the improved economics."
"Snowflake Inc., a software company that helps businesses organize data in the cloud, dropped the most ever in a single day Thursday after projecting that annual product sales growth would slow from its previous triple-digit-percentage pace.
Executives said improvements to the company’s data storage and analysis products will let customers get the same results by spending less, which will hurt revenue in the short term, but attract more clients in the future.
“The full-year impact of that next year is quite significant,” Chief Executive Officer Frank Slootman said on a conference call Wednesday after the results were released. But “when customers see their performance per credit get cheaper, they realize they can do other things cheaper in Snowflake and they move more data into us to run more queries.”"
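To put rough numbers on the mechanism the CFO describes above: under consumption billing, a query that finishes X% faster burns roughly X% fewer credits. A toy sketch, where the 10-20% range comes from the quote but the revenue baseline is a made-up illustration, not Snowflake's actual figure:

```go
package main

import "fmt"

func main() {
	// Every input here is a made-up illustration except the 10-20%
	// improvement range, which comes from Snowflake's guidance above.
	const affectedRevenue = 1.0e9 // $/yr of consumption revenue touched by the improvements (assumed)

	for _, speedup := range []float64{0.10, 0.20} {
		// Consumption billing: same queries, less compute time,
		// so fewer credits billed on unchanged workloads.
		loss := affectedRevenue * speedup
		fmt.Printf("%.0f%% faster => ~$%.0fM/yr less revenue if workloads don't grow\n",
			speedup*100, loss/1e6)
	}
	// The $97M in the guidance is this effect netted against expected
	// workload growth; the long-term bet is that cheaper per-query
	// economics pull more workloads onto the platform.
}
```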
I would not have guessed Roblox was on-prem with so little redundancy. Later in the post, they address the obvious “why not the public cloud?” question. They argue that running their own hardware gives them cost and performance advantages. But those seem irrelevant if usage and revenue go to zero when you can't keep the service up. It will be interesting to see how well this architectural decision ages if they keep scaling toward their ambitions. I also wonder about their ability to recruit the level of talent required to run a service at this scale.
Since the issue's root cause was a pathological database software issue, Roblox would have suffered the same issue in the public cloud. (I am assuming for this analysis that their software stack would be identical.) Perhaps they would have been better off with other distributed databases than Consul (e.g., DynamoDB), but at their scale, that's not guaranteed, either. Different choices present different potential difficulties.
Playing "what-if" thought experiments is fun, but when the rubber hits the road, you often find that things that are stable for 99.99%+ of load patterns encounter previously unforeseen problems once you get into that far-right-hand side of the scale. And it's not like we've completely mastered squeezing performance out of huge CPU core counts on NUMA architectures while avoiding bottlenecking on critical sections in software. This shit is hard, man.
This is not true if they had handled the rollout properly. Companies like Uber have two entirely separate datacenters, and during outages they fail over to the other one.
Everything is duplicated, which is potentially wasteful but ensures complete redundancy; it's an insurance policy. When you roll out a change, you roll it out to each datacenter separately. So in this case, rolling the Consul streaming change out to one complete datacenter and waiting a day probably would have caught it.
The parent poster said it would have happened even if they were in the cloud, i.e., had another datacenter. That's the assumption behind my comment.
As far as I can tell from the post, Roblox doesn't have multiple datacenters. I find that really hard to believe, so if it's not true, then my point is incorrect. If it is true, then with fully duplicated datacenters they could have switched one datacenter to streaming while keeping the other on the old setting until they validated that everything was fine. That slow rollout across datacenters would have caught the problem.
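A minimal sketch of that kind of staged rollout, assuming hypothetical helpers (enableStreaming, healthy, etc.), since we don't know what Roblox's or Uber's deploy tooling actually looks like:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins; real deploy tooling would look different.
func enableStreaming(dc string)  { fmt.Println("streaming enabled in", dc) }
func disableStreaming(dc string) { fmt.Println("rolled back in", dc) }

// healthy stands in for watching dashboards and alerts during the bake.
func healthy(dc string, bake time.Duration) bool {
	time.Sleep(bake)
	return true
}

func main() {
	dcs := []string{"dc-a", "dc-b"}
	// The parent comment suggests ~1 day per datacenter; shortened
	// here so the sketch actually finishes.
	const bakeTime = 2 * time.Second

	for _, dc := range dcs {
		enableStreaming(dc)
		if !healthy(dc, bakeTime) {
			// The untouched datacenter keeps serving traffic
			// while this one is rolled back.
			disableStreaming(dc)
			return
		}
	}
	fmt.Println("streaming enabled everywhere")
}
```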
Uber is also a service with a much lower tolerance for downtime: if people can't play a game, they're sad. If they're trying to get a ride and it doesn't work, or drivers' apps suddenly stop working, the stranded people get very upset in a hurry, and the company loses a lot of customers.
It can be totally reasonable for Uber to pay for twice the infrastructure it needs to serve its products, while the same isn't worth it for a company like Roblox.
You didn't read it properly. The changes were rolled out months before, but the switch to enable streaming on top of that rollout was made one day before the incident. That was the root cause.
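That distinction matters because in flag-gated rollouts, shipping the code and turning it on are separate events. A hypothetical illustration of the pattern, not Consul's actual configuration:

```go
package main

import "fmt"

// Dark launch, hypothetically: the new code path ships long before it
// is exercised, so "deployed months ago" and "switched on yesterday"
// are different events with different blast radii.
var useStreaming = false // shipped with the binary months earlier, default off

func fetchUpdates() string {
	if useStreaming {
		return "streaming path (dormant until the flag flips)"
	}
	return "long-polling path (the old, well-tested behavior)"
}

func main() {
	fmt.Println(fetchUpdates()) // months of production traffic exercise only this branch
	useStreaming = true         // the config flip, not the deploy, changes behavior
	fmt.Println(fetchUpdates()) // a latent bug in the new path only becomes reachable now
}
```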
I think the public cloud is a good choice for startups, teams, and projects that don't have infrastructure experience. But plenty of companies still have their own infrastructure expertise and, for example, roll their own CDNs.
Not only can doing so save a significant amount of money; it can also be simpler to troubleshoot and resolve issues when you have a simpler backend stack. Perhaps that doesn't apply in this case, but there are plenty of use cases that don't need a hundred microservices on AWS, none of which anyone fully understands.
> But those seem irrelevant if usage and revenue go to zero when you can’t keep a service up
You're assuming the average profits lost are more than the average cost of doing things differently, which, according to their statement, is not the case.
A great survey of the current landscape of ML tool innovation.
This reminds me of the explosion of "big data" tech, which feels like it took off around 2010 and peaked maybe five years later.
If ML follows a similar cycle, then in perhaps five years most tools will have receded into obscurity while a few frameworks and approaches become dominant. The stakes are big, which justifies the prevalence of, and investment in, OSS.
For example, here's another article from MarketWatch citing this same research note -- https://www.marketwatch.com/story/the-research-note-thats-ra...
That leads me to lend more credence to the news, even though OP is linking to Bloomberg.