For reasonably long downloads (so the algorithm has a chance to calibrate), why don't congestion-control algorithms increase the number of in-flight packets to a high enough number that bandwidth is fully utilized, even over high-latency connections?
It seems like it should never be the case that two parallel downloads to the same host perform better than a single one.
There are two places a packet can be ‘in-flight’. One is light travelling down cables (or the electrical equivalent), or sitting in memory while some piece of hardware like a switch processes it; the other is sitting in a buffer in some networking appliance because the downstream link is busy (e.g. sending packets that are further up the queue at a slower rate than they arrive). If you just increase the sending rate, it is easy to end up with lots of in-flight packets in the second state, which increases latency (admittedly that doesn’t matter so much for long downloads) and the chance of packet loss from overly full buffers.
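To put a rough number on how much latency a full buffer adds, here is a quick sketch (the link speed and buffer size below are made up for illustration):

```python
# Rough illustration: a packet arriving behind a full buffer has to wait for
# everything ahead of it to drain at the bottleneck link's rate.

link_rate_bps = 10e6            # assumed 10 Mbit/s bottleneck link
buffer_bytes = 1 * 1024 * 1024  # assumed 1 MiB buffer in the bottleneck device

queueing_delay_ms = (buffer_bytes * 8) / link_rate_bps * 1000
print(f"a full buffer adds ~{queueing_delay_ms:.0f} ms of queueing delay")
# -> roughly 840 ms on top of the propagation delay
```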
CUBIC tries to increase its rate until it hits packet loss, then cuts the rate (to let buffers drain a bit), ramps back up and hangs around close to the rate that led to loss, before probing at a higher rate and filling up buffers again. CUBIC is very sensitive to packet loss, which makes things particularly difficult on very high-bandwidth links with moderate latency, as you need very low rates of non-congestion-related loss to reach that bandwidth.
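For the curious, that probe/back-off shape comes from CUBIC's window growth function in RFC 8312; here is a minimal sketch using the RFC's constants (the starting window is made up):

```python
# Minimal sketch of CUBIC's window growth curve (RFC 8312):
#   W_cubic(t) = C*(t - K)^3 + W_max
# where W_max is the window when loss last occurred and K is the time needed
# to climb back to it after the multiplicative decrease.

C = 0.4      # scaling constant from RFC 8312
BETA = 0.7   # multiplicative-decrease factor from RFC 8312

def cubic_window(t: float, w_max: float) -> float:
    """Congestion window (in segments) t seconds after the last loss event."""
    k = ((w_max * (1 - BETA)) / C) ** (1 / 3)
    return C * (t - k) ** 3 + w_max

w_max = 1000.0  # hypothetical window (segments) at the last loss
for t in range(0, 12, 2):
    print(f"t={t:>2}s  cwnd ~ {cubic_window(t, w_max):7.0f} segments")
# The window drops to BETA*w_max, flattens out as it approaches w_max
# ("hangs around close to the rate that led to loss"), then accelerates again.
```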
BBR tries to do the thing you describe while also modelling buffers and trying to keep them empty. It goes through a cycle of sending at the estimated bandwidth, sending at a lower rate to see if buffers got full, and sending at a higher rate to see if that’s possible, and the second step can be somewhat harmful if you don’t need the advantages of BBR.
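That cycle can be pictured as a repeating list of pacing gains applied to the estimated bottleneck bandwidth; a sketch with BBRv1's ProbeBW gains (the bandwidth estimate is made up):

```python
# Sketch of BBRv1's ProbeBW pacing-gain cycle: probe for more bandwidth at
# 1.25x, drain any queue that built up at 0.75x, then cruise at the estimate.

PACING_GAINS = [1.25, 0.75, 1, 1, 1, 1, 1, 1]  # BBRv1 ProbeBW cycle

estimated_btl_bw_mbps = 100.0  # hypothetical bottleneck-bandwidth estimate

for phase, gain in enumerate(PACING_GAINS):
    print(f"phase {phase}: pace at {gain * estimated_btl_bw_mbps:6.1f} Mbit/s (gain {gain})")
```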
I think the main thing that tends to prevent what you describe is flow control rather than congestion control. In particular, the sender needs a sufficiently large send buffer to store all unacked data (which can be a lot, due to various kinds of ack-delaying) in case it needs to resend packets, and if some packets do need resending, the send buffer would need to be roughly twice as large to keep sending at full rate. On the receive side, you need big enough buffers to keep accepting data from the network while waiting for an earlier packet to be retransmitted.
On a high-latency fast connection, those buffers need to be big (roughly the bandwidth-delay product) to get full bandwidth, and that requires (a) growing a lot, which can take many round trips, and (b) being allowed by the operating system to grow big enough.
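Back-of-the-envelope, "big" means at least the bandwidth-delay product, plus headroom for retransmissions (the link numbers below are hypothetical):

```python
# Bandwidth-delay product: how much unacked data must be in flight (and thus
# buffered at the sender) to keep a fast, high-latency link full.

link_gbps = 1.0   # hypothetical 1 Gbit/s link
rtt_ms = 100.0    # hypothetical 100 ms round-trip time

bdp_bytes = (link_gbps * 1e9 / 8) * (rtt_ms / 1000)
print(f"BDP: {bdp_bytes / 1e6:.1f} MB in flight per connection")
# ~12.5 MB just to fill the pipe; roughly double that to keep sending at full
# rate while a packet lost in the previous round trip is retransmitted.
```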
I've run a big webserver that served decent-sized apk/other app downloads (and a bunch of small files and whatnot). I had to cap the maximum outgoing window to keep the overall memory within limits.
IIRC, servers had 64 GB of RAM and sendbufs were capped at 2 MB. I was also dealing with a kernel deficiency that would leave the sendbuf allocated if the client disappeared in LAST_ACK. (This stems from a deficiency in the state description in the 1981 RFC, written before my birth.)
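The arithmetic behind a cap like that is simple, and the per-socket knob is SO_SNDBUF; here is a sketch using the figures above (everything else is assumed):

```python
# Why per-socket send-buffer caps matter: with many high-BDP downloads in
# flight, full sendbufs dominate the server's memory use.

import socket

ram_bytes = 64 * 2**30   # 64 GB of RAM (figure from the comment above)
sndbuf_cap = 2 * 2**20   # 2 MB per-socket cap (figure from the comment above)

# If every connection's send buffer were full, roughly this many connections
# fit in RAM before squeezing out everything else:
print(f"~{ram_bytes // sndbuf_cap} fully-buffered connections")

# One way to impose the cap per socket (Linux may round or double the value;
# system-wide autotuning limits live in the net.ipv4.tcp_wmem sysctl instead):
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf_cap)
print("SO_SNDBUF is now", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```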
I wonder if there’s some way to reduce this server-side memory requirement. I thought that was part of the point of sendfile, but I might be mistaken. Unfortunately sendfile isn’t so suitable nowadays because of TLS. But maybe with TLS offload plus sendfile, an OS could get away with needing less memory for sendbufs.
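For the plain-HTTP case the idea looks roughly like the sketch below: sendfile lets the kernel pull bytes straight from the page cache rather than through an application buffer (kernel TLS offload is what would make something similar possible under TLS, which isn't shown here; the helper name is made up):

```python
# Minimal plain-TCP sendfile loop: the kernel moves file bytes from the page
# cache to the socket without an extra userspace copy. Unacked data still has
# to stay referenced in the kernel until it is acked, but the payload is not
# duplicated into an application buffer.

import os
import socket

def send_file(conn: socket.socket, path: str) -> None:
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # os.sendfile returns how many bytes were actually handed off.
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent
```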
You can in theory. You just need an accurate model of your available bandwidth and enough buffering/storage to avoid stalls while you wait for acknowledgement. It is, frankly, not even that hard to do it right. But in practice many implementations are terrible, so good luck.