Right now SelfieSegmentation is just a thin wrapper around the selfie segmentation solution provided by MediaPipe (https://google.github.io/mediapipe). It operates on RGB frames, which is why I need the conversion. The model inference also runs on the CPU. Interestingly, there is a GPU MediaPipe graph available, but I haven't looked into what's needed to use it yet.
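For anyone curious what a thin wrapper like this might look like, here is a minimal sketch, assuming MediaPipe's legacy Python Solutions API (`mp.solutions.selfie_segmentation`). The function name `segment_rgb` and the input check are made up for illustration and are not the actual element's code. It expects a frame that has already been converted to RGB.

```python
import numpy as np


def segment_rgb(frame_rgb: np.ndarray, model_selection: int = 1) -> np.ndarray:
    """Return a float mask (H x W, values in [0, 1]) for an RGB uint8 frame.

    Hypothetical helper: assumes the legacy MediaPipe Solutions API.
    """
    if frame_rgb.ndim != 3 or frame_rgb.shape[2] != 3 or frame_rgb.dtype != np.uint8:
        raise ValueError("expected an H x W x 3 uint8 RGB frame")

    # Imported here so the input check above works without MediaPipe installed.
    import mediapipe as mp

    # Creating the model on every call is wasteful; a real wrapper would
    # construct it once and reuse it for each frame.
    with mp.solutions.selfie_segmentation.SelfieSegmentation(
        model_selection=model_selection
    ) as segmenter:
        results = segmenter.process(frame_rgb)

    # Higher values mean "more likely to be the person".
    return results.segmentation_mask
```

Since `process()` is handed RGB data, anything arriving in another pixel format has to be converted first, which is where the conversion step comes from.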
And yes, Boxfilter is just a wrapper around OpenCV's box filter. This is probably the lowest-hanging fruit that could be moved to the GPU.
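To make it concrete why a box filter is an easy GPU candidate, here is a rough NumPy sketch of what a normalized box filter computes. This is not OpenCV's implementation, just an illustration using a summed-area table. The name `box_filter` and the `radius` parameter are my own, and the reflect padding is chosen on the assumption that it mirrors OpenCV's default border mode.

```python
import numpy as np


def box_filter(image: np.ndarray, radius: int) -> np.ndarray:
    """Mean over each (2*radius + 1)^2 window of a 2-D image."""
    k = 2 * radius + 1

    # Reflect at the borders without repeating the edge pixel.
    padded = np.pad(image.astype(np.float64), radius, mode="reflect")

    # Summed-area table with a leading row and column of zeros.
    sat = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    sat[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)

    # Each window sum comes from four table lookups.
    window_sum = sat[k:, k:] - sat[:-k, k:] - sat[k:, :-k] + sat[:-k, :-k]
    return window_sum / (k * k)
```

Every output pixel depends only on a fixed-size neighbourhood of the input, so the work is independent per pixel and maps naturally onto GPU threads.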
Would it be possible to avoid the conversion to RGB? (This forum thread says the conversion is CPU-only: https://forums.developer.nvidia.com/t/videoconverts-performa...)