When creating or adding additional variations to your experiences, you need to select whether traffic is optimized by Mutiny or distributed evenly between variants. You can learn more about selecting your traffic distribution here.

# How does Mutiny optimize among your variations?

This optimization problem, also known as the multi-armed bandit problem, attempts to efficiently identify the best variation while continuing to explore all options in real time. This ensures you are always picking the best variation. It explores all options while exploiting (prioritizing) what it has currently identified as the best option. This tradeoff between "exploration" versus "exploitation" can be simply defined as:

- Exploitation: Choose the variation with the best average return based on what you know of all the variations currently.
- Exploration: Be aware that your knowledge of the different choices might be limited, and opt to participate in options that could possibly show themselves to be of higher returns.

We have implemented multiple algorithms to solve this optimization problem and performed experiments between the top performing ones to find the best algorithm for our customer needs. The two top performing algorithms are:

## Epsilon Greedy

As a simple method to solve the multi-armed bandit problem, Epsilon-Greedy chooses between exploration and exploitation randomly. After an initial period of exploration, the algorithm exploits a random variation "e" percent of the time and the best option "1-e" percent of the time. For example, if e=0.05, the algorithm will exploit the best variation 95% of the time and will explore random alternatives 5% of the time. This is actually quite effective in practice, but as we’ll come to see it can under-explore the variant space before exploiting what it estimates to be the strongest variant. This means that Epsilon-Greedy can get stuck exploiting a suboptimal variant.

## Thompson Sampling

Thompson sampling is a more sophisticated method to solve multi-armed bandit problem. This algorithm works by maintaining a distribution over the possible reward values for each variation, building up a probability model from the obtained rewards of the previous experiences. It then selects the arm with the highest expected reward by sampling from the probability model. Convergence of Thompson sampling in multi-arm bandit is proven and it has also been shown that Thompson sampling is instantaneously self-correcting. It is very effective in situations where the rewards are highly variable, as it allows the algorithm to adapt to changing conditions.

# Which algorithm is better?

Let's take a look at two simulated experiments performed on an experience with 5 variations of a CTA button on a website to compare these two algorithms.

## Experiment #1: Clear winner case

Let’s assume all of these variations have 10% probability of winning (getting a click on the CTA) except for the last variation that has 90% probability of getting a click. It is clear that the winner is the last variation. The following plots show how Epsilon Greedy algorithm (note that we tried 5 different values for e parameter in this experiment) compares to Thompson sampling algorithm in choosing the best variation (the best variation is the last one with 90% probability of winning). The Epsilon Greedy algorithm chooses the best arm approximately 60% of the time after 100 time steps of the simulation. In contrast, it is clear that Thompson sampling is converting to the optimal solution by choosing the best arm almost all the time with probability of being chosen close to 1 (y-axis). Therefore, in this case, Thompson sampling is much quicker and much more robust to find the winning variation and showing it more frequently.

## Experiment #2: No clear winner case

The above experiment was a simple case for the algorithms because there was a clear winner. Now, assume all of these variations have 80% probability of getting a click except for the last variation that has 90% probability. It is not very clear that the winner is the last variation as the other variations are also performing well. The following plots show how Epsilon Greedy algorithm (note that we tried 5 different values for e parameter in this experiment) compares to the Thompson sampling algorithm in choosing the best arm (last one with 90% probability). In this case, the Epsilon Greedy algorithm is not able to find the winner during the whole simulation and more or less works like a random algorithm. By contrast, Thompson sampling performs better by choosing the best arm approximately 70% of the time close to the end of the simulation.

In summary, the Thompson sampling algorithm usually outperforms the Epsilon Greedy algorithm. That's why we use it as the default algorithm at Mutiny. This means any new Personalized Experiences created by our customers will use our Thompson Sampling implementation in order to find winning variations quickly and in a more robust way.

## Comments

0 comments

Please sign in to leave a comment.