We've been using EC2 Container Service for a few months now, and while it looks great at first glance, we've run into quite a few problems with it:
- You can only bind one port and one ELB to a service. For example, if you have nginx listening on ports 80 and 443, you need to configure the ELB manually and can't take advantage of automatically assigned port numbers on the host (so you can effectively run only one HTTP(S) server per host).
- There is no way to cleanly decommission a host from a cluster. If you want to reboot or replace a server in the cluster, you can't tell ECS to drain its connections from the ELB and move its containers off the host.
- You can't specify rules for which hosts a service should run on. For example, you can't require a service to have instances in X AZs, or forbid running multiple instances of the same service on the same host.
- There's no easy way to implement any kind of service discovery; you either roll your own or set up lots of internal ELBs as a makeshift service discovery layer.
- Worst of all, the ecs-agent is very buggy. Problems have ranged from releases where it keeps crashing [1] and leaves untracked containers running on the host, to a release that bundled a newer Docker library which changed how certain parameters were handled. That one caused our entire cluster to fail until they hotfixed it by changing how their API sent data to the client.
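To make the placement and decommissioning complaints concrete, here's a minimal in-memory sketch of the kind of rules we wanted: spread copies of a service across AZs, never run two copies on the same host, and re-place a host's containers when it's drained. Everything here (`Host`, `place`, `drain`) is hypothetical Python that only models the scheduling logic; it is not the ECS API and doesn't talk to AWS or an ELB.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Host:
    # Hypothetical model of a container host; not an ECS type.
    name: str
    az: str
    tasks: list = field(default_factory=list)  # names of services running here


def place(service, replicas, hosts):
    """Place `replicas` copies of `service`, preferring the AZ with the fewest
    existing copies and never putting two copies on the same host."""
    chosen = []
    for _ in range(replicas):
        # One-copy-per-host rule: skip hosts already running this service.
        candidates = [h for h in hosts if service not in h.tasks]
        if not candidates:
            raise RuntimeError(f"cannot place {service}: no eligible host")
        # Count existing copies per AZ so we can pick the emptiest AZ.
        per_az = Counter(h.az for h in hosts if service in h.tasks)
        # Tie-break on host load, then name, so the choice is deterministic.
        target = min(candidates, key=lambda h: (per_az[h.az], len(h.tasks), h.name))
        target.tasks.append(service)
        chosen.append(target.name)
    return chosen


def drain(host_name, hosts):
    """Take a host out of the pool and re-place every service it was running.

    A real system would first deregister the host from the load balancer and
    wait for connections to drain; this sketch only models placement.
    """
    host = next(h for h in hosts if h.name == host_name)
    remaining = [h for h in hosts if h is not host]
    moved = {}
    for service in list(host.tasks):
        host.tasks.remove(service)
        moved[service] = place(service, 1, remaining)[0]
    return remaining, moved
```

For example, with hosts `a1` and `a2` in one AZ and `b1`, `c1` in two others, placing three copies of `web` lands one in each AZ, and draining `b1` moves its copy to `a2`, the only host not already running `web`.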
We're currently looking at alternatives for our docker infrastructure.
This is shameless self-promotion, but check out rancher.com. It's open source and lets you spin up a container service like GKE or ECS, but independent of any cloud provider.
Wow, my company is getting into the EC2/Docker idea pretty heavily at the moment (and I'm going to wind up in charge of monitoring it). This is... not great to hear.
I'm very interested in ECS's startup latency characteristics. Do you know offhand about how long it takes to provision a new container, and if this is subject to wide fluctuation?
I haven't done any latency measurements, but in my experience the agent on the host asks the Docker daemon to start a new container on the order of seconds after an API request to ECS calls for an additional container to be provisioned.
I'd recommend setting up a private registry within EC2, though; otherwise you'll have a fairly significant delay while it pulls the image from wherever it's hosted, and you'll incur bandwidth charges for doing so. You'll still have some delay while it pulls the image from your registry on EC2, but it's not as significant as it would be with an externally hosted one.
[1]: https://github.com/aws/amazon-ecs-agent/issues/156