There is a lot of pretraining data available around screen recordings and mouse movements (Loom, YouTube, etc.), but much less around navigating accessibility trees or DOM structures. Many use cases also need to be image-aware (parsing document scans, interpreting images), and keyboard/video/mouse-based models generalize to more applications.