Suppose 3 UI changes go out. How do you decide how much each increased users?
Measure the change in usage for each screen or section of the app that each team worked on.
Perhaps one was trying to drop support calls and another was trying to increase users.
Measure support calls per user (which is the right way to measure it).
You thought that developers saw QA as a roadblock before? Wait until they see QA keeping them from the check they think they are going to get.
This has never been a problem for us.
How do you gently break the news that you're not even going to let the developer try to get that dangled reward?
These kinds of tests can be easily randomized. If product management really thought the UI was a bad idea, they could keep the first test's sample really small and only increase it if the design proved itself.
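As a concrete illustration of how such a small, randomized rollout might work, here's a minimal sketch assuming hash-based bucketing; the experiment name, the user id, and the 1% starting allocation are made up for the example.

    import hashlib

    def in_experiment(user_id: str, experiment: str, rollout_percent: float) -> bool:
        """Deterministically assign a user to an experiment at the given traffic share."""
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 10_000        # stable bucket in [0, 9999]
        return bucket < rollout_percent * 100    # e.g. 1.0 -> buckets 0..99, i.e. 1% of users

    # Start the risky UI at 1% of traffic and raise the allocation only if it proves itself.
    print(in_experiment("user-42", "new-checkout-ui", rollout_percent=1.0))

Because the assignment is a pure function of the user id and experiment name, each user keeps seeing the same variant as the allocation is dialed up.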
If I worked in an organization that announced a policy like this, I'd start shopping my resume around.
We would definitely only do it for one small part of the team that volunteered for the experiment.
Overall, any kind of performance-based pay depends on the metric. If the metric is robust, the scheme can succeed. If the metric is not robust, the whole thing fails. Will the metrics I proposed be robust? I can answer your concerns and attempt to make the metrics more robust, but I don't know if they would be robust enough. I think it would be interesting to run the experiment and find out for real.
Let's see. Your response to the 3 UI changes question is "Measure the change in usage for each screen or section of the app that each team worked on." From that I conclude that you've probably never done A/B testing and discovered how difficult those deltas are to accurately measure, or discovered how changes upstream waterfall down through your site. The amount of data you need to narrow down performance changes to within, say, 5% is much more than most people realize. And from a business view you don't want to do that, for the simple reason that once you have a demonstrated improvement, continuing the measurement to figure out the exact win is costing you real money.
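To put a rough number on that, here's a back-of-the-envelope sketch using the standard two-proportion sample-size formula; the 10% baseline conversion rate, 5% significance level, and 80% power are assumptions chosen purely to illustrate the scale.

    from statistics import NormalDist

    def samples_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
        """Approximate users needed in EACH arm to detect the given relative lift."""
        p1 = baseline_rate
        p2 = baseline_rate * (1 + relative_lift)
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
        z_power = NormalDist().inv_cdf(power)           # desired statistical power
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2

    # A screen that converts 10% of users, where we want to detect a 5% relative lift:
    print(round(samples_per_arm(0.10, 0.05)))   # roughly 58,000 users per arm

That's close to 60,000 users per variant just to reliably see a 5% relative change on one screen, before any of the attribution problems described next.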
If, as at several of my employers, there's a long lag between a user engaging with your website and actually making money for you, the tracking problems get MUCH more challenging. When you add complex social dynamics on top of that (such as person A getting person B to make a purchase), you're in for a world of fun. (Yes, I've worked on a site that had to solve exactly this problem.)
Measuring support calls per user is also not as simple as it looks. Different parts of your site generate support calls at different rates. And it isn't always obvious what part of the site generated the call. This gets worse if you have the long lead time issue that I've faced multiple times.
As for whether QA is viewed as a roadblock, that depends strongly on team dynamics. But in the places I've seen where that mindset took hold, things can get very, very bad.
On randomizing the tests, sure, you can limit how many users are shown the option. But when someone makes changes that introduce technical debt on the back end, that debt and the resulting bugs can't be segregated out so easily. Sometimes you really just need to say no to features. Anything that limits your freedom to do that is a Bad Thing.
Now don't get me wrong. I like the idea of performance-based pay. My current employer (Google) is all about performance pay. But you have to structure the incentives right. Paying people after the fact, based on feedback from the people around them, creates incentives not just to optimize a flawed metric, but to work together as a team and get stuff done.
This matters enough to me that, while I'd read with interest what the experience of running this experiment was like, I personally wouldn't want to be involved in anything like it until I'd been convinced that it really works the way you hope it will.
From that I conclude that you've probably never done A/B testing and discovered how difficult those deltas are to accurately measure, or discovered how changes upstream waterfall down through your site.
We've done it before. We've launched new screens, measured the usage, then made a bunch of changes to the screen and seen usage go up. We also have a lot of good data on which screens generate support calls, data on the conversions of all the different screens, etc.
But again, your points are all good. We have pretty good data, but it's nowhere near as simple and clear as the same type of measurement for a salesperson. Would the data be good enough to base pay on? I don't know.