TL;DR: using a KD-tree, I find the face whose UV triangle contains the UV coordinate. I then convert the UV coordinate to barycentric coordinates within that face, use those weights to interpolate the face's vertex positions into a local-space point, and put that point through the local -> world -> view -> perspective transform matrices.
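To make those steps concrete, here is a minimal sketch of everything after the KD-tree lookup, assuming the containing face has already been found. The function names, the row-major 4x4 matrix layout, and the single combined matrix `mvp` (perspective * view * world) are all my own assumptions for illustration, not taken from any particular engine.

```python
def barycentric_2d(p, a, b, c):
    """Barycentric weights (wa, wb, wc) of 2D point p in UV triangle a, b, c."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        raise ValueError("degenerate UV triangle")
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb


def interpolate(weights, va, vb, vc):
    """Blend the three local-space vertex positions with the weights."""
    wa, wb, wc = weights
    return tuple(wa * x + wb * y + wc * z for x, y, z in zip(va, vb, vc))


def transform_point(m, p):
    """Apply a row-major 4x4 matrix to a 3D point, then divide by w."""
    v = (p[0], p[1], p[2], 1.0)
    out = [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]
    w = out[3]
    return (out[0] / w, out[1] / w, out[2] / w)


# Hypothetical face: UV corners and the matching local-space vertices.
uv_a, uv_b, uv_c = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
pos_a, pos_b, pos_c = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)

# Identity stands in for the combined perspective * view * world matrix.
mvp = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

weights = barycentric_2d((0.25, 0.25), uv_a, uv_b, uv_c)
local_point = interpolate(weights, pos_a, pos_b, pos_c)
projected = transform_point(mvp, local_point)
```

The same weights double as the containment test: the UV point lies inside the face only when all three are non-negative, which is a cheap way to confirm the candidate face returned by the KD-tree.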
A common approach in rendering engines for mapping screen-space coordinates to objects is to render a second image, with lighting and shadows disabled, in which each object's color encodes a unique ID. With 24 bits of color you can then distinguish up to 2^24 (about 16.7 million) objects without needing to maintain a KD-tree.
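The ID-to-color mapping itself is just bit packing. A minimal sketch, assuming an 8-bit-per-channel RGB target; `id_to_rgb` and `rgb_to_id` are hypothetical helper names, and the render pass and pixel readback are left out because they depend on the engine.

```python
def id_to_rgb(object_id):
    """Pack a 24-bit object ID into an (r, g, b) tuple of 8-bit channels."""
    if not 0 <= object_id < (1 << 24):
        raise ValueError("object_id must fit in 24 bits")
    return ((object_id >> 16) & 0xFF, (object_id >> 8) & 0xFF, object_id & 0xFF)


def rgb_to_id(rgb):
    """Recover the object ID from the pixel read back under the cursor."""
    r, g, b = rgb
    return (r << 16) | (g << 8) | b


color = id_to_rgb(0x123456)
picked = rgb_to_id(color)
```

For this to round-trip, the ID pass has to write the colors unmodified, so anything that blends or filters pixels (anti-aliasing, alpha blending, tone mapping) needs to be off for that pass as well.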