5 reasons why you should be skeptical about A/B testing

Posted on 23-07-2014 by Peter W. Szabo & filed under Kaizen-UX.

Quick win

A/B tests are not evil incarnate, but their predominance and unquestioned authority made me write this article. After giving you the five reasons to take A/B testing with a grain of salt, I will show you how to tame those tests using my triple testing combo method.

As Ward Cunningham’s law goes: “The best way to get the right answer on the internet is not to ask a question, it’s to post the wrong answer.” If you don’t agree with me, I would love it if you could prove me wrong in a comment.

A/B testing?

When you need to compare how different web pages or app screens perform using a random sample of your users, A/B testing comes to the rescue. You create two versions and show each to an equal number of visitors, then measure the favorable outcomes. Examples range from the text of a button to whether you should add a cat picture to the home page.
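As a rough illustration of the mechanics, here is a minimal Python sketch of the two pieces described above: splitting visitors evenly between two versions and measuring the favorable outcomes. The function names, the experiment label and the hash-based split are assumptions made for this example, not taken from any particular tool.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "homepage-cat-picture") -> str:
    """Place a user in variant 'A' or 'B' with roughly equal probability.

    Hashing the (hypothetical) experiment label together with the user id
    keeps the assignment stable, so a returning visitor always sees the
    same version.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def conversion_rate(conversions: int, visitors: int) -> float:
    """Share of visitors who produced the favorable outcome."""
    return conversions / visitors if visitors else 0.0


# Usage: bucket some made-up visitors and count how many land in each variant.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}")] += 1
print(counts)
```

A deterministic hash is only one way to do the split; drawing a random number on the first visit and storing it in a cookie would serve the same purpose.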

While it is quite easy to program an A/B test yourself or in-house (a few lines of code in Ruby on Rails), there are dozens of tools that make A/B testing very easy and come in handy when you need to run dozens of such tests. The most popular solution comes from Google: Content Experiments in Google Analytics (formerly Google Website Optimizer). My favorite A/B testing tool is Adobe Target, a very powerful, enterprise-level solution and an integral part of Adobe Marketing Cloud. But remember, no matter how hyped an A/B testing app is, its results should be taken with a degree of skepticism.

5. A/B test results can’t be reused

For some reason, button colors seem to be a prime target for A/B testing, mostly because they can spark heated debates. Joshua Porter’s article about red buttons performing better is quite famous. In that test the red button outperformed the green button by 21%. Dan McGrady found an increase of 34%. Repeating the same test, I found a decrease of 1.3% for red and an increase of 2.6% for gray, compared to the results for green (I tried to decrease the margin of error by using a random sample size of 3000, instead of the 1000 and 600 trials the aforementioned sources used). I was not alone in conducting inconclusive button color A/B tests. Alex Turnbull reported similar inconclusiveness regarding button color.
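To see how much the sample size matters, here is a small Python sketch of the textbook margin of error for a measured proportion at a 95% confidence level. It assumes the worst case of a 50% rate and treats the 600, 1000 and 3000 figures mentioned above as simple random sample sizes; both are assumptions made for illustration, not details taken from the cited tests.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion.

    n: sample size, p: assumed true proportion (0.5 is the worst case),
    z: critical value (1.96 corresponds to a 95% confidence level).
    """
    return z * math.sqrt(p * (1 - p) / n)


for n in (600, 1000, 3000):
    print(n, round(margin_of_error(n) * 100, 1))
# 600 4.0
# 1000 3.1
# 3000 1.8
```

Under these assumptions the margin shrinks from about 4 percentage points to under 2. Real conversion rates are usually far below 50%, so the absolute margin is smaller in practice, but it is then larger relative to the rate being measured.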

The famous Signal v. Noise blog post about A/B tests even has a fair warning at the end: “What works for us may not work for you. Please do your own testing. Your conversion rates may suffer if you copy us.”

This is a prime example of why you should take A/B tests with a grain of salt. At the very least, you should conduct a similar A/B test for your specific project. Not to mention that some A/B test enthusiasts find earthshaking results from minor changes. You should tread carefully and at least re-test everything.

4. They might provide short-term results

A few years ago I suggested to a small IT webshop that, instead of switching to a Christmas theme for the holidays, they should try an Easter theme (both on the site and in their emails) and change the tagline to: “Happy Easter! – we are this much ahead of competition”. We ran A/B tests, and the email opening rate increased by 112%, while the time visitors spent on the site also increased by roughly 35%, not to mention that their sales skyrocketed. It was so successful that they decided to keep it even after Christmas. Shockingly, six months later they still had it. “Happy Easter! – we are this much ahead of competition” had a totally different message more than a month after Easter.

While this story is on the extreme side, I know from experience that stellar conversion increases dwindle over time. Sometimes change itself increases the click-through rate. If that is the case the results will be very ephemeral, so you should be very watchful of your web analytics results over time, and if need be, you should re-test the project’s crucial elements.

3. Most A/B tests are just a waste of time

Truth be told, I almost always conduct A/B tests after, or in parallel with, lab, remote or undercover user experience tests. This is mostly because my mantra is “test early, fail early”, and I iterate on the experience until it performs great in the lab. The reason that A/B tests come relatively late in the UX process is that they need to be live, so you need at least a functional high-fidelity prototype made with rapid prototyping, but most of the time a full-featured and fully functional pre-release version is used. To develop such a thing you need to commit a lot of resources and time. With this approach most of the A/B tests just provide numerical data to convince stakeholders of my teams’ skills in UX and/or visual design, and the new version has very little chance of losing to the old one.

While it is an inspiring sight to see a 201% increase in conversion in a case study, the thing is that they compare a complete redesign to the old site. After taking a peek at both the old and the new site, it is obvious even to laymen that the new one will greatly outperform the old one. While Visual Website Optimizer is a great tool, the case study shows that VWO, or any A/B testing tool, can be used to generate numbers to support the obvious, and that is a waste of time and resources, especially if you have a UX and visual design team in-house.

In quite a few other cases the A/B tests will fail to produce statistically significant results. Groove conducted and documented six popular A/B tests with no significant result.
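What “no statistically significant result” means can be illustrated with a two-proportion z-test, one common way to compare two conversion rates. The Python sketch below is a generic textbook version with made-up visitor and conversion counts; it is not the method or the data used in the Groove tests.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no conversions (or all conversions) in both groups
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


# Hypothetical data: variant B converts at 11% against 10% for A, 1500 visitors each.
z, p_value = two_proportion_z_test(150, 1500, 165, 1500)
print(f"z = {z:.2f}, p = {p_value:.2f}")
print("significant at the 5% level" if p_value < 0.05 else "not significant at the 5% level")
```

With these invented numbers a 10% relative improvement does not reach significance, which is the kind of outcome that makes an otherwise reasonable test inconclusive.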

A/B tests or multivariate tests are a waste of time because they are not the agile way of doing things. Most often you could switch to the version you found better through lab/remote/guerrilla testing and expert reviews, then closely observe changes in web analytics for the next few days, or even one day in the case of a high-traffic website. You have just won a few days if the hypothesis was true. If you have done your lab testing homework and have a few years of UX experience, most of the time it will be. While others would have conducted the test, analyzed the results, called a meeting, proposed the change, discussed the implementation and procrastinated a bit, you have just delivered something a few days faster, and that little something started earning money for the company a few days earlier. On the other hand, if your hypothesis was false, and this did not come to light during the lab/guerrilla testing, then by all means you should switch back to the old version until you find out why your hypothesis did not work. Because this case will be very rare, it is not a major gamble, more like a minor risk with only short-term effects.

This is especially important because, according to Kerry Butters, “The second important aspect to successful A/B testing is the length of time that the test runs for.” So you can’t take shortcuts, such as running the A/B test for a day or two and calling it a great success if one variant significantly outperforms the other.
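One way to get a feel for why a day or two is rarely enough is to estimate how many visitors a test needs before it can reliably detect a given improvement. The Python sketch below uses the standard normal-approximation sample size formula for comparing two proportions; the 5% baseline rate, the 10% relative lift and the 2000 daily visitors are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline: float, relative_lift: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate visitors needed per variant to detect the given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_needed(per_variant: int, daily_visitors: int, variants: int = 2) -> int:
    """Days of traffic required if every visitor enters the test."""
    return math.ceil(per_variant * variants / daily_visitors)


n = visitors_per_variant(baseline=0.05, relative_lift=0.10)
print(n, "visitors per variant")
print(days_needed(n, daily_visitors=2000), "days at 2000 visitors per day")
```

Under these assumptions the test needs roughly 31,000 visitors per variant, which is about a month of traffic. This covers only the sample size side of test length; weekly and seasonal patterns are a separate reason to let a test run longer.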

2. A/B tests proliferate at an alarming rate

A/B tests breed faster than rabbits. While the cute little mammals gestate for 30 days, an A/B test can give birth to another A/B test within a few days, or even on the same day it was completed. When you have completed an A/B test, most of the time the results are not decisive, but you get an idea (or 10) of what to test next. People can get carried away, and some companies even have A/B test teams whose only task is to invent, run and evaluate A/B test after A/B test. The prolific nature of A/B or multivariate tests will decrease the productivity of the UX team, and it definitely erodes the creative and research aspects of user experience work.

1. A/B tests will give you numbers, not reasons

A/B tests conceal the primary goal of UX: to learn more about users and make them happy. This is summarized by Jesmond Allen and James Chudley in Smashing UX Design: “Your client may decide they no longer need you because their multivariate tool is giving them all the answers, but without the reasons why these things are occurring, they can’t learn from these changes.”

An A/B test might answer the “What?” question, but never the “Why?”. During a lab-based test, followed by a short interview, you can find out the major pain points of the users, and why they are there. Moreover, you will get ideas on how to eliminate them. In the next iteration, with different test subjects, you can test your solution to the problem. On the other hand, A/B test results never give you guidelines on where to proceed and how to evolve the project. If your one and only goal is to increase the profits generated by the website, a series of A/B and multivariate tests can lead there, but then you have no right to call yourself a user experience expert. Moreover, those practices can alienate your customers in the long run. Most of the time, after the initial spike, even the highly sought-after conversion rates will drop.

Triple testing combo

I developed a testing methodology to overcome the pitfalls of A/B testing while still relying on the factual data it provides. The triple testing combo is an iterative testing process that involves three parallel tasks in each iteration: lab-based user experience testing, user interviews and A/B testing. Obviously you can use remote or undercover testing instead of lab-based testing, or to complement it, and instead of A/B testing you can use multivariate testing if need be.

If there is interest, I will write an article about it in the near future, but as a summary I can tell you that the triple testing combo constrains the A/B tests to the broader picture, while the lab testing and the interviews provide a deeper understanding of the user experience. With the combo you must test variants of a page against each other, not minor details like the color of a button. The A/B tests will not proliferate out of control, while the subjective nature of the user interviews and small-sample lab tests will be complemented by the numerical evidence provided by the A/B tests.

Futuresight

We are living in the golden age of A/B testing, and most companies consider it a magical solution that can be used to eliminate the debates from usability, conversion optimization and UX. It provides data, and we live in an age where senior managers and decision makers hoard data like mythical dragons used to hoard gold. I’m sure that soon a knight in shining UX armor will bring a solution that puts A/B testing in its place. Popularizing the triple testing combo is my quest, but I would gladly abandon it if you could show me a better way. If you have a better solution, don’t hesitate to comment or write a follow-up article.

W. Szabó Péter

Ronny Kohavi’s reply to my article: Response to: 5 reasons why you should be skeptical about A/B testing

July 31st, 2014

Chris Schmidt

Strange. Any good testing professional will tell you everything said in this article without the odd grudge seemingly held against the scientific method. Because that’s what a well designed A/B test is, it’s good science. But good science requires creativity so it’s not an either/or proposition. I understand the point of the article (use a web analytics 2.0 methodology) but the article itself seemed to be motivated by something personal. All I know is when there’s real money on the line nothing beats a well designed and executed A/B test. In the digital realm we are lucky to be able to run tests where we can control for virtually every variable.

August 4th, 2014

W. Szabó Péter

Thanks a lot for the comment, Chris. I don’t have a grudge against A/B testing, and especially not against the scientific method. Skepticism (for me) means recognizing that results of A/B tests are not always beyond doubt and that they don’t provide certain knowledge, at least not alone. That is why I propose the triple testing combo, which contains A/B testing without solely relying on it. I think that this combo beats A/B testing alone, because it adds a lot of value to it while not abandoning the practice.

August 4th, 2014

David Leese

Hi Peter, Very interesting article. Here’s my response (and please take it kindly :-))
http://davechessgames.blogspot.com/2014/08/i-am-power-tool-ab-skeptic.html

August 14th, 2014

W. Szabó Péter

Wow, thanks a lot! Great parody, I love it. The Power Tool analogy is great. I laughed so hard. 😀 Very well written!

August 14th, 2014

Chris Hedick

I would respectfully disagree with most of this, I do optimization and testing for large companies and I apply both creativity and scientific statistical methods and I have seen the results. I do agree that the most successful tests are “no brainers” like increasing visibility of CTA buttons and the like, but I have seen many cases where small changes have had significant impact. I made a half dozen changes to a product comparison engine – basically making copy and images look slicker and more professional and saw a 15% lift in add to cart clicks.

August 22nd, 2014

W. Szabó Péter

Thanks a lot for your comment, Chris. Well done increasing add-to-cart clicks by 15%, that is a nice achievement. But I would respectfully say that this only proves that you are good at your job. I never said A/B tests are useless or pointless. I just wrote an article to show people a more skeptical point of view on A/B testing. As David Leese masterfully compared power tools to A/B testing, I can follow the analogy by saying that even if you can cut down a forest with a chainsaw, that does not mean that you should handle chainsaws carelessly or without due skepticism.

August 22nd, 2014