
Hi, I'm one of the creators of http://www.abtests.com. The issue of statistical significance has come up over and over, so I'll try to explain our view of it.

We ask people to input their raw data...both trials and conversions. If they do this honestly (anybody can fake data about anything) then in our view the results speak for themselves. We've had folks upload data that was obviously not statistically significant, and we've had people write blog posts denouncing those results. We've also had folks upload test data that was statistically significant and people say they're learning a lot.

So we've had both solid and suspect data uploaded to the site with good discussion around it. This is exactly what we hoped for...I think in the future as more tests get uploaded the wheat will be separated from the chaff, so to speak, and those tests with significant data will get lots more attention than those that don't. In fact, we're already seeing this in the traffic logs.

And, as several folks have mentioned, many tools do the hard stats math for you, telling you when your data is statistically significant. This helps people know when they can be confident in sharing their data with others.



Doing the math here. A/B test conversions are modeled as binomial variables, so the standard error of a conversion rate is sqrt(p(1-p)/n), where p is the conversion rate and n is the number of trials (p(1-p) is the variance of a single Bernoulli trial). For one version this gives sqrt(0.002*(1-0.002)/2834) = 0.0008, and for the other the SE is 0.0017. Since the number of trials is large, the difference of the two binomial proportions can be modeled as a normal distribution whose standard deviation is sqrt(se_1^2 + se_2^2) = 0.0019.
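The standard-error arithmetic above can be sketched in a few lines of Python, using the 6/2834 and 24/2836 counts implied by the integral later in the thread:

```python
from math import sqrt

# Counts from the thread: 6/2834 conversions for A, 24/2836 for B
n_a, conv_a = 2834, 6
n_b, conv_b = 2836, 24

p_a = conv_a / n_a  # ~0.0021
p_b = conv_b / n_b  # ~0.0085

# Standard error of each conversion rate: sqrt(p(1-p)/n)
se_a = sqrt(p_a * (1 - p_a) / n_a)  # ~0.0009
se_b = sqrt(p_b * (1 - p_b) / n_b)  # ~0.0017

# Standard error of the difference of the two rates
se_diff = sqrt(se_a**2 + se_b**2)  # ~0.0019
```

Using the unrounded rates gives slightly different third decimals than the hand calculation, but the combined SE still comes out to about 0.0019.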

Significance is then checked with a one-tailed z-test (we are testing whether the difference between the two rates is statistically significantly greater than zero). The z-score is (p_1 - p_2)/std = (0.008 - 0.002)/0.0019 = 3.1579, which is well above the critical value of 1.65 (which corresponds to 95% confidence, one-tailed).
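A minimal sketch of the full one-tailed z-test, using the same hypothetical counts (note the thread's 3.16 comes from rounding the rates to 0.008 and 0.002; unrounded rates give a somewhat larger z, with the same conclusion):

```python
from math import sqrt, erf

# Counts from the thread
n_a, conv_a = 2834, 6
n_b, conv_b = 2836, 24
p_a, p_b = conv_a / n_a, conv_b / n_b

# Standard error of the difference of the two proportions
se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

# One-tailed z-test: is p_b - p_a greater than zero?
z = (p_b - p_a) / se_diff
p_value = 0.5 * (1 - erf(z / sqrt(2)))  # upper-tail area under N(0,1)

significant = z > 1.65  # 95% one-tailed critical value
```

For a production test you would reach for a library routine (e.g. statsmodels' `proportions_ztest`) rather than hand-rolling the normal CDF, but the arithmetic is the same.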

So, the difference is indeed statistically significant. One note of caution: a common rule of thumb says you shouldn't approximate a binomial with a normal distribution until you have at least 10 successes and 10 failures in each group; version A has only 6 conversions, so that condition isn't met and the approximation is borderline here.


See my reply lower in the thread - I worked out the numbers using Bayesian inference to find the exact probability that B is better than A, subject to a number of assumptions. The benefit of this approach is that it's exact, so you don't need a minimum number of samples for the normal approximation to hold. The answer is that B is almost certainly better than A. Here's the calculation I plugged into Wolfram Alpha:

2835 2837 choose[2834,6] choose[2836,24] NIntegrate[(f^6) (1-f)^2828 (g^24) (1-g)^2812,{f,0,1},{g,f,1}]
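The same posterior probability can be computed without Wolfram Alpha. Assuming uniform Beta(1,1) priors on both rates (which is what the integral above encodes), the double integral has a known closed-form sum over the Beta posteriors; a sketch:

```python
from math import lgamma, exp, log

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b):
    """P(p_b > p_a) under uniform Beta(1,1) priors on both rates.
    Closed-form sum equivalent to the double integral above."""
    alpha_a, beta_a = conv_a + 1, n_a - conv_a + 1
    alpha_b, beta_b = conv_b + 1, n_b - conv_b + 1
    total = 0.0
    for i in range(alpha_b):
        total += exp(log_beta(alpha_a + i, beta_a + beta_b)
                     - log(beta_b + i)
                     - log_beta(1 + i, beta_b)
                     - log_beta(alpha_a, beta_a))
    return total

p = prob_b_beats_a(6, 2834, 24, 2836)  # very close to 1
```

The log-gamma formulation avoids the huge binomial coefficients that appear in the normalizing constants, and the result agrees with the "B is almost certainly better than A" conclusion.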



