An interesting alternative approach to instrument sound separation is to use a fused audio + video model: given video of the instruments being played in addition to the audio, you can perform the separation with higher fidelity.
I was fascinated by the work done by “The Sound of Pixels” project at MIT.
The Dockerfile in the swift-jupyter repo is a superset of what you need. You could remove the lines dealing with Jupyter, and you'd be left with a Docker container containing just the S4TF compiler.
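A minimal sketch of what such a trimmed-down Dockerfile might look like (the package list and toolchain URL here are illustrative assumptions, not copied from the swift-jupyter repo; check the actual Dockerfile for the exact dependencies and release URL):

```dockerfile
# Hypothetical trimmed Dockerfile: keep only the toolchain install,
# drop all of the Jupyter/kernel setup.
FROM ubuntu:18.04

# Dependencies the Swift toolchain typically needs (assumed list;
# verify against the real swift-jupyter Dockerfile).
RUN apt-get update && apt-get install -y \
    clang libpython-dev libblocksruntime-dev curl \
 && rm -rf /var/lib/apt/lists/*

# Download and unpack a Swift for TensorFlow toolchain release
# (this URL is a placeholder; substitute a real release artifact).
ARG S4TF_URL=https://example.com/swift-tensorflow-RELEASE-ubuntu18.04.tar.gz
RUN curl -fSL "$S4TF_URL" | tar xz -C /opt

ENV PATH="/opt/usr/bin:${PATH}"

# Sanity check: the compiler should now be on PATH.
CMD ["swift", "--version"]
```

The idea is just to keep the base image, system dependencies, and toolchain download/unpack steps, and delete everything that installs or registers the Jupyter kernel.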
http://sound-of-pixels.csail.mit.edu/