Founder of Modal here. We've spent a ton of time on this, including building our own distributed file system optimized for low-latency, high-throughput workloads. We don't use K8s or Docker; we built our own custom infrastructure instead.
Cold starting containers quickly is a fascinating problem. We've come a long way, but there's still a lot more to do. For GPU-based inference, starting the container quickly isn't enough – you also need to get the model onto the GPU quickly. We're working on a long list of things that will bring cold-start latency down even further.
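For anyone curious what separating container start from model load looks like in user code, here's a minimal sketch using Modal's container lifecycle hooks, so the weights hit the GPU once per container rather than once per request. The app name, model size, and GPU type are illustrative assumptions, not a statement of what Modal does internally:

```python
import modal

app = modal.App("whisper-coldstart-sketch")  # hypothetical app name

# openai-whisper needs ffmpeg for audio decoding
image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")
    .pip_install("openai-whisper")
)

@app.cls(gpu="A10G", image=image)  # GPU type is an illustrative choice
class WhisperTranscriber:
    @modal.enter()  # runs once per container start, not once per request
    def load_model(self):
        import whisper
        # Weights land on the GPU during cold start, so each
        # subsequent request only pays for inference.
        self.model = whisper.load_model("base", device="cuda")

    @modal.method()
    def transcribe(self, audio: bytes) -> str:
        import tempfile
        # Arguments are serialized over the wire, so we accept raw
        # bytes and write them to a local file for ffmpeg to decode.
        with tempfile.NamedTemporaryFile(suffix=".wav") as f:
            f.write(audio)
            f.flush()
            return self.model.transcribe(f.name)["text"]
```

Calling `WhisperTranscriber().transcribe.remote(audio_bytes)` reuses a warm container when one is available, so only the first request on a fresh container pays the model-load cost.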
Is Modal a good solution for running fine-tuned LLMs and Whisper models? If the cold-start time is low, we're more than willing to modify our code to use Modal's infra.
Happy to follow up via email but didn't see one in your profile.