Technical Foundations for Product Managers

A/B test on the Amazon product page

Quick theory about A/B testing and a test setup to improve the Amazon purchasing experience for Alexa-compatible products.

A PM has a great new idea
Assume you are a PM at Amazon, specifically in the Smart Devices department. And, of course, you want to bring in even more sales for the company. You have a great new idea: instead of just showing smart products in search results, you will also add a “Works with Alexa” tag (see the picture below).
Sounds reasonable, right? Why not just add this new statement to the website and move on?
Should we just apply the change for all users?
Yes!
Not so fast!
The problem with “just doing it” is that you might drop your sales instead of improving them. One potential reason: you added even more text to an already crowded page. It might sound small, but every pixel counts when millions of visitors per day browse through hundreds of products.

So what do we do? We apply science. IT giants are giants because they make decisions using a scientific approach, which moves them in the right direction.

First, I need to explain a theory quickly, so if you know it, skip the next section and jump directly to the exercise.
Basics of A/B testing: complex theory in a fun way
No worries, it is easier than you think.

Imagine you get on a bus and see many very tall people. You ignored the bus’s sign (because you were staring at your smartphone, he-he). There are two options:

  • It was a regular city bus, and it just happened that some passengers were tall
  • Or it was a special one for basketball players and their crew (you know, this week there is a World Basketball Championship in your town).
Question: without talking to anyone, can you figure out whether you are on a “normal” bus or a “basketball” bus? Let’s use science!

First, we know the height distribution of all citizens in our country (we can Google it on our smartphone). Then we see the height of all the people on the bus (let’s assume we can measure them more or less using just our eyes).

And now, magic! The science of statistics allows us to compare distributions of heights and decide whether we are observing something special (a basketball players’ bus) or just a city bus with a few tall people in it. Long story short, we put everything we know (the height distribution we googled and the heights of the people on the bus) into a boring formula and get a statement like: “With a probability of p=X, this distribution is not different from the normal one.”

This is a compelling statement! For example, suppose p=2%: there is only a 2% chance of seeing this many tall people on a “normal” bus; that is, only a 2% chance that their presence is pure coincidence. So we reject the “normal bus” explanation and accept our hypothesis: they are a basketball team indeed!

Note that there is still a 2% chance that we are wrong, and we cannot do anything about it. Practically, it means that if we enter 50 different buses on this day, there is a high chance that at some point our math will tell us, “This is the basketball team,” when it is not. That’s life; we need to deal with it.
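The bus comparison above can be sketched in code. All numbers here are made up for illustration: I assume the googled population height is normally distributed with mean 175 cm and standard deviation 7 cm, and the heights are the ones we eyeballed on the bus. A simple one-sample z-test is one way to do the comparison:

```python
import math

def one_sample_z_test(sample, pop_mean, pop_sd):
    """Two-sided one-sample z-test: how compatible is this sample
    with the known population height distribution?"""
    n = len(sample)
    sample_mean = sum(sample) / n
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p

# Hypothetical heights we measured "more or less using just our eyes":
bus_heights = [196, 201, 188, 205, 199, 193, 202, 198]
z, p = one_sample_z_test(bus_heights, pop_mean=175, pop_sd=7)
print(f"z = {z:.1f}, p = {p:.1e}")  # p is tiny: almost surely not a regular bus
```

With heights this extreme, the p-value is astronomically small, so the “basketball bus” conclusion is safe.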

Let’s now imagine the formula said p=66%.
What does it mean?
66% chance that it is a "normal" bus
66% chance that it is a "basketball" bus
It means it is quite likely (66%!) that the distribution we see is the normal one, so we have to reject our hypothesis about the basketball team. Nope, this is just a bus that happens to have tall people on it! Here in the Netherlands, for example, this is quite a common situation :)

How do we choose the threshold that defines whether p is “small enough” to state that our idea is right or wrong? It depends on the domain. For instance, in medicine, for placebo vs. new-pill tests, they use a 1% or 5% threshold (because you don’t want a pill that doesn’t work, right?).

For e-commerce and other non-critical cases, 10% is good enough. Yes, we will make a mistake every tenth time, but 9 times out of 10 we will move in the right direction. And this is why IT giants are giants.
Let's set up our A/B test
Now that you know the basics, I can reveal the magic of A/B testing. The idea is simple: you split your audience into two halves. 50% (group A) see the old experience (the old product card), and 50% (group B) see the new one (with the “Works with Alexa” tag). Assume 100000 users will see A and 100000 will see B; everything else is the same, only this tiny piece of copy differs. You must do the A/B split randomly, with no assumptions and no logic; otherwise, you mess up the science.
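A random split is easy to get wrong in practice (for example, re-randomizing a returning user on every visit). One common approach is a deterministic hash-based assignment; a minimal sketch, with illustrative names, could look like this:

```python
import random

def assign_variant(user_id, experiment="alexa-tag-test"):
    # Deterministic 50/50 split: seeding with (experiment, user) means the
    # same user always lands in the same group across sessions, while a
    # new experiment name reshuffles everyone. Names here are made up.
    return "A" if random.Random(f"{experiment}:{user_id}").random() < 0.5 else "B"

# Simulate assigning 200000 users: the split comes out roughly 50/50.
counts = {"A": 0, "B": 0}
for user_id in range(200_000):
    counts[assign_variant(user_id)] += 1
print(counts)
```

The key property is stability: `assign_variant(42)` returns the same group every time it is called, so each user sees a consistent experience for the whole week of the test.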

Okay, the experiment has started, and it took one week to gather the data! Let’s count the purchases: A got 14000, and B got 14180. So, is the new “Works with Alexa” experience better? It looks like it, right? 14180 is definitely bigger than 14000!
Is it better with the "Alexa" copy tag?
Yes
Let me check first
Well, science disagrees. If you put the numbers into an A/B test calculator (you can google a free one, like this; the link already contains the numbers we need), you will see that the p-value is actually 0.25: there is a 25% chance that the new experience is no better than the old one and the extra sales were pure randomness. Sorry, a quarter is too high a chance of a mistake. We must reject our hypothesis and state that the new Alexa tag does not add measurable value to our sales funnel.
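If you want to check the calculator’s math yourself, here is a minimal sketch of the two-sided two-proportion z-test that most online A/B calculators use under the hood (the exact tool behind the linked calculator is an assumption on my part, but this test reproduces its number):

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test: is the difference in conversion
    rates bigger than random noise would explain?"""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# Our experiment: 14000 of 100000 purchases in A, 14180 of 100000 in B.
p = two_proportion_p_value(14_000, 100_000, 14_180, 100_000)
print(f"p = {p:.2f}")  # about 0.25: the uplift could easily be noise
```

Ten lines of math, and the tempting “14180 > 14000” conclusion evaporates.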
Your turn
Imagine that A stays the same (14000 sales), but instead of 14180 in B, you see 14390 purchases. Enter this new value into the A/B test tool and find an updated p-value.
P-value is ...
0.0125
0.3
0.1
The answer
Here is a new link. You can see that the graph has turned green, and the p-value is now 0.0125, meaning there is only a 1.25% chance that our uplift is random. It looks like we actually improved the sales funnel, and science agrees with this. Cool, right?
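You can verify this one by hand too. Running the same two-proportion z-test with the new B count reproduces the calculator’s answer:

```python
import math

# Two-sided two-proportion z-test with the updated B result:
# 14000 of 100000 purchases in A, 14390 of 100000 in B.
conv_a, conv_b, n = 14_000, 14_390, 100_000
pooled = (conv_a + conv_b) / (2 * n)
se = math.sqrt(pooled * (1 - pooled) * (1 / n + 1 / n))
z = (conv_b / n - conv_a / n) / se
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
print(f"p = {p:.4f}")  # matches the calculator: about 0.0125
```

With p well below our 10% e-commerce threshold, shipping the change to everyone is the scientifically defensible call.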

Now we can indeed make the change for all our customers and move on to the next great ideas of a product manager.
What's next?
I hope you enjoyed this practical exercise about A/B tests and even tried the A/B calculator yourself. Next week, we will expand on this example and make the test results trickier (exactly as it would happen in your real PM life).

Of course, there is more to say and practice regarding the A/B experiments. If you want to learn more about A/B testing in a fun and practical way, there is a self-paced, hands-on course where you solve tasks with virtual colleagues.

Many PMs have already enjoyed the course (see reviews) and become more effective in their job, and so will you.