It may be possible to do this with an LTSD (long-term spectral divergence) VAD. I've always had really good luck with that. I tried a few random ones in here for silence removal, no quality guarantee [0]
I found LTSD pretty robust compared to simpler energy-based approaches, as long as you have a small chunk of background sound at the start (it uses that to estimate the noise spectrum). The LTSD implementation is largely from my friend Joao, so I can't take credit for the cool part, only the bugs.
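For anyone curious what LTSD roughly looks like, here is a minimal numpy sketch of the general idea: take the max spectrum over a window of neighboring frames, then compare it to a noise spectrum estimated from the first few frames. This is not Joao's code or what is in the gist. The function name `ltsd_vad`, its parameters, and the default values (especially the fixed 10 dB threshold) are assumptions for illustration and would need tuning on real audio.

```python
import numpy as np

def ltsd_vad(signal, frame_len=512, hop=256, order=6,
             noise_frames=10, threshold_db=10.0):
    """Hypothetical LTSD-style VAD sketch.

    Returns (ltsd_db, is_speech), one value per frame.
    Assumes the first `noise_frames` frames contain only background sound.
    """
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = 1 + (len(signal) - frame_len) // hop
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")

    # Slice the signal into overlapping, Hann-windowed frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    spec = np.abs(np.fft.rfft(signal[idx] * np.hanning(frame_len), axis=1))

    # Noise power spectrum from the leading background-only frames.
    noise_power = np.mean(spec[:noise_frames] ** 2, axis=0) + 1e-12

    # Long-term spectral envelope: per-bin max over frames l-order..l+order.
    padded = np.pad(spec, ((order, order), (0, 0)), mode="edge")
    ltse = np.max(
        np.stack([padded[j:j + n_frames] for j in range(2 * order + 1)]),
        axis=0,
    )

    # Divergence between the envelope and the noise estimate, in dB.
    # The max operator pushes noise-only frames a few dB above 0,
    # so the threshold has to sit above that baseline.
    ltsd_db = 10.0 * np.log10(np.mean(ltse ** 2 / noise_power, axis=1))
    return ltsd_db, ltsd_db > threshold_db
```

For silence removal you would keep only the frames where `is_speech` is True, probably with some hangover so word endings don't get clipped. A real implementation would likely also adapt the threshold and update the noise estimate over time instead of fixing both up front.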
[0] https://gist.github.com/kastnerkyle/a3661d6be10a0ae9e01fd429...