I've encountered several situations firsthand where subtitles would have been ambiguous about who was speaking without speaker annotations, despite the voices being distinctive and me being able to hear them clearly. Just think of a rapid exchange where neither speaker is on-screen for more than two or three lines and replies.
You can probably get 99% of the way there without annotations for a lot of content, but I'd challenge the notion that this is somehow only important for hearing-impaired viewers (or people watching without clearly audible sound for other reasons).
I guess you can run into the same thing in real life, where you don't have subtitles at all (hello AR) and still manage. Do you want your subtitles stuffed with metadata for the whole movie, every movie, every day, just for those few situations where the director made a mess of the scene? You can always replay the confusing scene.
I definitely prefer subtitles to be as helpful as possible, yes. That includes having situationally appropriate metadata (which is different from "all the metadata, all the time").
I don't think that's an unrealistic goal for AI; models are already extremely good at semantic scene description, after all. By looking at the video in addition to the audio track, they also pick up a lot more context, which a refined world model should eventually be able to use the way a human subtitle editor does today.
So you mean the AI should figure out when a problematic scene is coming, and only then add labels and whatnot? Not impossible, but somebody has to teach it, same as with subtitle positioning.
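For what it's worth, the gating logic itself is simple once you have the signals; the hard part is the models producing them. Here's a minimal sketch of the "label only when ambiguous" idea, assuming each subtitle cue already carries a diarized speaker ID and a flag for whether that speaker is visible on screen (all names, fields, and the ambiguity rule are made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Cue:
    start: float            # seconds
    end: float
    text: str
    speaker: str             # diarization label, e.g. "ANNA" (hypothetical)
    speaker_on_screen: bool   # from a hypothetical vision model

def annotate(cues: list[Cue]) -> list[str]:
    """Prefix a cue with its speaker only when the speaker changed
    and isn't visible on screen; otherwise leave the line untouched."""
    out, prev_speaker = [], None
    for cue in cues:
        ambiguous = (cue.speaker != prev_speaker) and not cue.speaker_on_screen
        prefix = f"[{cue.speaker}] " if ambiguous else ""
        out.append(f"{cue.start:06.2f} --> {cue.end:06.2f}  {prefix}{cue.text}")
        prev_speaker = cue.speaker
    return out

if __name__ == "__main__":
    cues = [
        Cue(1.0, 2.5, "Did you take it?", "ANNA", speaker_on_screen=True),
        Cue(2.6, 3.8, "Of course not.", "BEN", speaker_on_screen=False),
        Cue(3.9, 5.0, "Then who did?", "ANNA", speaker_on_screen=False),
    ]
    print("\n".join(annotate(cues)))
```

The point is just that "situationally appropriate metadata" doesn't require labeling every line; the teaching effort goes into the diarization and on-screen detection feeding a rule like this, not the rule itself.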