I guess it's computationally expensive? OP states on the website that their solution takes 1 hour of processing for a 30-minute video.
Now imagine offering this for all the YouTube videos available:
- either it's done on their servers (hard to believe, given the cost)
- or it's done on the client side (also difficult, given the lack of processing power)
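To put rough numbers on that, here's a back-of-envelope sketch. The only figure taken from OP's site is the 1 hour per 30-minute video (a real-time factor of 2); the function name and the catalogue size are made up purely for illustration, not real YouTube numbers.

```python
def processing_hours(video_hours: float, rtf: float = 2.0) -> float:
    """Compute hours needed to process `video_hours` of footage.

    rtf (real-time factor) = processing time / video duration.
    The default of 2.0 comes from "1 hour per 30-minute video".
    """
    return video_hours * rtf


# Hypothetical catalogue size, chosen only to show how the cost scales.
catalogue_hours = 1_000_000

print(processing_hours(0.5))              # 1.0
print(processing_hours(catalogue_hours))  # 2000000.0
```

At that rate the compute bill grows linearly with the catalogue, so whoever runs it pays two hours of processing for every hour of video.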
Well, I'm also shamefully unoptimized at the moment.
YouTube added auto-captions years ago. Long before there was Whisper, let alone things like Whisper.cpp. I imagine what I'm doing now is computationally no more expensive than what they did back then.