
Hi, I'm one of the creators of http://www.abtests.com. The issue of statistical significance has come up over and over, so I'll try to explain our view of it.

We ask people to input their raw data...both trials and conversions. If they do this honestly (anybody can fake data about anything) then in our view the results speak for themselves. We've had folks upload data that was obviously not statistically significant, and we've had people write blog posts denouncing those results. We've also had folks upload test data that was statistically significant and people say they're learning a lot.

So we've had both solid and suspect data uploaded to the site with good discussion around it. This is exactly what we hoped for...I think in the future as more tests get uploaded the wheat will be separated from the chaff, so to speak, and those tests with significant data will get lots more attention than those that don't. In fact, we're already seeing this in the traffic logs.

And, as several folks have mentioned, many tools do the hard stats math for you, telling you when your data is statistically significant. This helps people know when they can be confident in sharing their data with others.



Doing the math here. A/B test conversions are modeled as binomial variables, so the standard error of a conversion rate is sqrt(p(1-p)/n), where p is the conversion rate and n is the number of trials (p(1-p) is the variance of a single Bernoulli trial). For one version this gives sqrt(0.002*(1-0.002)/2834) = 0.0008, and for the other the SE is 0.0017. Since the number of trials is large, the difference of the two binomial proportions can be modeled as a normal distribution whose standard deviation is sqrt(se_1^2 + se_2^2) = 0.0019.
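The standard-error arithmetic above can be sketched in a few lines of Python, using the 6/2834 and 24/2836 counts implied by the integral later in the thread:

```python
from math import sqrt

# Counts from the thread: 6/2834 conversions for A, 24/2836 for B
n_a, conv_a = 2834, 6
n_b, conv_b = 2836, 24

p_a = conv_a / n_a  # ~0.0021
p_b = conv_b / n_b  # ~0.0085

# Standard error of each conversion rate: sqrt(p(1-p)/n)
se_a = sqrt(p_a * (1 - p_a) / n_a)  # ~0.0009
se_b = sqrt(p_b * (1 - p_b) / n_b)  # ~0.0017

# Standard error of the difference of the two rates
se_diff = sqrt(se_a**2 + se_b**2)  # ~0.0019
```

Using the unrounded rates gives slightly different third decimals than the hand calculation, but the combined SE still comes out to about 0.0019.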

Significance is then checked with a one-tailed z-test (we are testing whether the difference between the two rates is statistically significantly greater than zero). The z-score is (p_1 - p_2)/std = (0.008 - 0.002)/0.0019 = 3.1579, which is well above the critical value of 1.65 (which corresponds to 95% confidence, one-tailed).
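A minimal sketch of the full one-tailed z-test, using the same hypothetical counts (note the thread's 3.16 comes from rounding the rates to 0.008 and 0.002; unrounded rates give a somewhat larger z, with the same conclusion):

```python
from math import sqrt, erf

# Counts from the thread
n_a, conv_a = 2834, 6
n_b, conv_b = 2836, 24
p_a, p_b = conv_a / n_a, conv_b / n_b

# Standard error of the difference of the two proportions
se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

# One-tailed z-test: is p_b - p_a greater than zero?
z = (p_b - p_a) / se_diff
p_value = 0.5 * (1 - erf(z / sqrt(2)))  # upper-tail area under N(0,1)

significant = z > 1.65  # 95% one-tailed critical value
```

For a production test you would reach for a library routine (e.g. statsmodels' `proportions_ztest`) rather than hand-rolling the normal CDF, but the arithmetic is the same.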

So, the difference is indeed statistically significant. One note of caution: a common rule of thumb says you shouldn't approximate a binomial with a normal distribution until you have at least 10 successes and 10 failures in each group; version A has only 6 conversions, so that condition isn't met and the approximation is borderline here.


See my reply lower in the thread - I worked out the numbers using Bayesian inference to find the exact probability that B is better than A, subject to a number of assumptions. The benefit of this approach is that it's exact, so you don't need a minimum number of samples for the normal approximation to hold. The answer is that B is almost certainly better than A. Here's the calculation I plugged into Wolfram Alpha:

2835 2837 choose[2834,6] choose[2836,24] NIntegrate[(f^6) (1-f)^2828 (g^24) (1-g)^2812,{f,0,1},{g,f,1}]
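The same posterior probability can be computed without Wolfram Alpha. Assuming uniform Beta(1,1) priors on both rates (which is what the integral above encodes), the double integral has a known closed-form sum over the Beta posteriors; a sketch:

```python
from math import lgamma, exp, log

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b):
    """P(p_b > p_a) under uniform Beta(1,1) priors on both rates.
    Closed-form sum equivalent to the double integral above."""
    alpha_a, beta_a = conv_a + 1, n_a - conv_a + 1
    alpha_b, beta_b = conv_b + 1, n_b - conv_b + 1
    total = 0.0
    for i in range(alpha_b):
        total += exp(log_beta(alpha_a + i, beta_a + beta_b)
                     - log(beta_b + i)
                     - log_beta(1 + i, beta_b)
                     - log_beta(alpha_a, beta_a))
    return total

p = prob_b_beats_a(6, 2834, 24, 2836)  # very close to 1
```

The log-gamma formulation avoids the huge binomial coefficients that appear in the normalizing constants, and the result agrees with the "B is almost certainly better than A" conclusion.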



