I work primarily in operations on an entertainment service not too dissimilar from Netflix. Every couple of months I have pager duty for a week. During that week I'm lucky to get 6 consecutive hours of sleep in a day, and it seems to take a week to recover from it. Everyone I know who works in operations on services with similar availability requirements seems to have the same problem. It can be extremely difficult at times, especially when you're under pressure to get something resolved and your cognitive abilities are shot.
There isn't much point to this comment I guess other than to say that yes, good sleep is very important and more important than people realize.
You should push for split shifts, 12 hours each or maybe rotating 16-hour shifts or something. Nobody should be a pager monkey during their sleep hours.
Of course, you should also be pushing for higher quality systems and software :-)
Have you countered some of the sleep debt with exercise? I recommend a brisk, highly intense workout (e.g., the scientific 7-minute workout) as soon as you wake up. It will help you get over cortisol levels and give you a nice rush of endorphins, plus a nice boost of blood flow to the brain. I've found that it even helps me get over grogginess and headaches when I'm very sleep deprived.
I have not, but in speaking with other people on my team, it has been mentioned as helping a lot. It's probably a lot more effective than a bunch of coffee.
Do the developers not share in pager duty for their apps at your company? I thought this was standard practice now. I've never seen a company where the devs are on pager duty with the kind of reliability issues you describe. And yes, these were all companies with very tight uptime requirements.
If I am guessing his company correctly, then most development teams there do indeed handle their own pager duty, with a typical on-call rotation lasting one week. Some teams have an extremely heavy ops load and have support people or even support teams that focus on doing ops and reducing the ops load.
There are several possible causes of such an unusually high ops load. Maybe it's a failing in that team (teams typically track operational load carefully, as it can serve as an early indicator of something going wrong in the team dynamics). The team might be under pressure from management to push out new features on a tight schedule, leaving less time to get things right or to fix known issues. They may have had the bad luck of inheriting a service written quick and dirty by one of the VP's favored "get shit done fast" teams (a problematic phenomenon that fortunately seems to be becoming less common). Or it may just be the nature of their particular work, for whatever reason.
I have seen teams with extremely high operation load get support teams based out of other timezones, so that 24/7 ops coverage can be done mostly by people working during their regional business hours. Why that hasn't been set up in his case, I cannot say.
There are many, many companies that provide customer-facing services that don't develop their own software in house and rely on vendors for all their systems. The TV services provided by telcos and cablecos are one example. Of all the "major" telcos in the world that provide a TV service, I only know of one that does anything in house.
Out of curiosity, how much sleep are you getting on average per night in the following week?
For myself personally, I've found that there's pretty much nothing that 2 (perhaps 3 max) consecutive nights of unlimited sleep can't fix. Sometimes I'll sleep for 12 or 13 hours in a row each night, and then by the last day I usually feel normal again.
I get about 8 hours a night following my week of sleep deprivation. I would probably try and get more but my problem is compounded by having young children.