I wonder if it'll use the same architecture as visionOS, where the vision-tracking events and UI affordances are processed and composited out-of-process, with the app never seeing them.
That's probably how it'll go, because it's the path of least resistance. A button already has a listener for taps, so the OS can translate the vision tracking into a "tap" and trigger the relevant code. There's no point telling the app about vision tracking, because apps don't already have a handler for that kind of event, and for privacy reasons there's no reason to start now.
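If it does work that way, the app side wouldn't need to change at all. Here's a minimal SwiftUI sketch of the contract being described (just an illustration, not Apple's actual plumbing; `ContentView` and the counter are made up): the app declares an ordinary tap handler, and the system decides what the user is looking at, composites any hover highlight out-of-process, and only delivers a synthesized activation when the user confirms.

```swift
import SwiftUI

// Sketch of the app-side contract: the app registers a plain tap
// handler and never receives raw gaze data. The OS maps
// look-and-confirm onto the same action a finger tap would trigger.
struct ContentView: View {
    @State private var count = 0  // illustrative state, not from the original

    var body: some View {
        // The Button's action is the same handler the app would use
        // for a touch tap; no new gaze API is exposed to the app.
        Button("Tapped \(count) times") {
            count += 1  // the app only learns "the button was activated"
        }
        // The hover highlight is rendered by the system, so the app
        // can't observe when the user merely looks at the button.
        .hoverEffect()
    }
}
```

The privacy property falls out of the design: since the only event that crosses the process boundary is the synthesized "tap", the app never learns where the user's eyes went before they committed.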