You are right that a kind of multi-threading can be useful to mitigate the effects of branch mispredictions.
However, for this, fine-grained multi-threading is enough. Simultaneous multi-threading does not bring any advantage, because the thread with the mispredicted branch cannot progress.
Out-of-order execution cannot help during a branch misprediction, so, as I have said, both SMT and OoOE are techniques that are useful only when a data cache memory exists.
Any CPU with pipelined instruction execution needs a branch predictor, and it needs to speculatively execute the instructions on the predicted path, in order to avoid the pipeline stalls caused by control dependencies between instructions. Such a CPU also always needs an instruction cache memory, to ensure that the instruction fetch rate is high enough.
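To put rough numbers on the cost of control dependencies, here is a minimal sketch of the usual textbook CPI model. The function name and all parameter values are illustrative assumptions, not measurements, and treating "no predictor" as "pay the full penalty on every branch" is a simplification.

```python
def effective_cpi(branch_fraction, penalty_cycles, mispredict_rate=1.0, base_cpi=1.0):
    """Average cycles per instruction for a simple pipeline model.

    branch_fraction: fraction of instructions that are branches (assumed).
    penalty_cycles:  cycles lost each time the pipeline must wait for or
                     recover from a branch (assumed).
    mispredict_rate: fraction of branches that pay the penalty; 1.0 models
                     a pipeline that stalls on every branch (no predictor).
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 10-cycle penalty.
no_predictor = effective_cpi(0.2, 10)                          # stall on every branch
with_predictor = effective_cpi(0.2, 10, mispredict_rate=0.05)  # 95% accurate predictor
print(no_predictor, with_predictor)
```

Under these assumed numbers the pipeline without prediction needs about three cycles per instruction, while the one with prediction stays close to one.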
Unlike simultaneous multi-threading, fine-grained multi-threading is useful in a CPU without a data cache memory, not only because it can hide the latencies of branch mispredictions, but also because it can hide the latencies of any long operations, as is done in all GPUs.
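The latency-hiding effect can be illustrated with a toy model. This is only a sketch under strong assumptions: a single-issue pipeline, round-robin thread selection, and every instruction making its thread unavailable for a fixed number of cycles (standing in for a misprediction or any other long operation). The function and parameter names are made up for the example.

```python
def utilization(num_threads, stall_cycles, total_cycles=10_000):
    """Fraction of cycles in which a single-issue pipeline issues an instruction.

    Each cycle the pipeline issues from the next ready thread in round-robin
    order. After issuing, a thread cannot issue again for `stall_cycles`
    further cycles.
    """
    ready_at = [0] * num_threads  # first cycle at which each thread may issue
    issued = 0
    next_thread = 0               # round-robin starting point
    for cycle in range(total_cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + stall_cycles
                issued += 1
                next_thread = (t + 1) % num_threads
                break  # at most one instruction issues per cycle
    return issued / total_cycles


for threads in (1, 2, 4, 8):
    print(threads, utilization(threads, stall_cycles=3))
```

In this model the pipeline is fully occupied once the number of threads reaches `stall_cycles + 1`; with fewer threads, utilization falls in proportion. No data cache or out-of-order machinery is involved, only switching threads every cycle.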
Fine-grained multi-threading is significantly simpler to implement than simultaneous multi-threading.