Running Parallel Multivariate Experiments at Licious — Part-1

Saurav Agrawal · Published in Licious Technology · 10 min read · Sep 6, 2023

In this article, we explain the inception and features of the experimentation tool at Licious. We will go deep into the need for an in-house experimentation tool, its various feature sets, and how it is changing the way India orders meat online.

Why A/B testing?

Licious is highly consumer-centric and experimental in nature. We are always trying to optimise every funnel and UI to make the end customer experience better. However, continuous tweaking requires continuous, unbiased measurement to ascertain whether we are moving in the right direction.
As they say, “Data is King”. It is imperative that we measure every small change on the platform and validate every hypothesis with data. Hence, every feature goes through an A/B test: we roll out the feature to a small set of consumers, gather their reactions, and once we have sufficient data to draw conclusions from the test, we decide whether to expand the rollout to a larger audience or discontinue the feature.

Why did we build an In-House Experimentation Platform?

With new features rolling out every month and an increasing need to test them, it was becoming difficult to carry out multiple parallel experiments with our existing experimentation platform, as it did not guarantee independent selection or distribution of the user base from one experiment to another.

Furthermore, due to Licious’ hyperlocal requirements, we aimed to conduct experiments targeted at very specific user segments or localities.

We essentially needed the following features from an experimentation platform:

  • Run multiple multivariate experiments concurrently, in an orthogonal manner.
  • Target experiments at specific user segments or localities.
  • Conduct and manage experiments on both user-facing applications and internal backend applications.

Based on our nuanced needs, the teams determined that building an In-House A/B Experimentation tool would help us overcome the limitations of our existing tool and also make it easier to augment the tool based on our current and future needs.

Architecture of Experimentation Platform

Our experimentation platform consists of four major parts:
- Experiment Data: This part handles the storage and retrieval of experiment configurations, which are stored in a MySQL database and cached in Redis.
- Audience Selection: This section defines the audience of an experiment and its filtering.
- Variant Allocation: This part handles the allocation of variants in an experiment w.r.t. an identifier.
- Experiment Analysis: This section handles the analysis of an experiment and computes its results by extracting data from the data lake.

All interactions with the Experimentation Platform occur through REST APIs. These APIs cover experiment configuration and the retrieval of the variant corresponding to an identifier for a given experiment.
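For illustration, a client call might look like the sketch below. The endpoint path, payload fields, and response shape are hypothetical; they only show the kind of request a consumer application or backend service would make to fetch its variant.

```python
import requests

# Hypothetical endpoint and payload, shown only to illustrate the interaction;
# the platform's actual API contract may differ.
response = requests.post(
    "https://experiments.internal.example/api/v1/variant",
    json={
        "experimentKey": "homepage-banner-test",   # illustrative experiment key
        "identifier": "user-12345",                # user key or any other unique key
        "attributes": {"city": "Bengaluru", "appVersion": "5.2.0"},
    },
    timeout=2,
)
print(response.json())  # e.g. {"variant": "varB", "properties": {...}}
```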

Experimentation Platform Architecture

What are the feature sets of the In-House Experimentation Platform?

The In-house A/B Experimentation tool supports multiple features that allow us to address the nuanced needs of Licious. The features are as follows.

Orthogonality: Making experiments independent while running multiple multivariate experiments.

Why is Orthogonality important?
Whenever we run multiple experiments, there is a high chance that a user is part of more than one of them, which makes it difficult to isolate the impact of one experiment from another. For example, suppose a user is part of an experiment on the Homepage and another experiment on the Checkout page. While evaluating the results, we find that the user's Day-7 retention improved by 1%; however, since the user is part of both experiments, it becomes difficult to ascertain whether this improvement came from the Checkout experiment or the Homepage experiment.

To solve this problem, we introduced orthogonality. In simpler terms, this approach seeks to distribute the test and control variants of all experiments in a manner that facilitates the isolation of the impact of a specific experiment from others, even in cases where users are participating in multiple experiments.

How is Orthogonality implemented?
When selecting users for an experiment, our process involves two steps. First, we ascertain on which identifier the experiment needs to run, which could be a user key or any other unique key.
Second, we determine whether that identifier is part of the audience defined for the experiment. We will explain the audience criteria later in this article.

We use a hashing technique to determine whether the identifier is part of the experiment or not.
- First, we hash the identifier using a deterministic hashing algorithm. The algorithm provides a high level of randomisation and uniform distribution, thanks to the “avalanche effect.”

- Second, we transform the hashed value into a numerical point that falls within the range of 0 to 1.
For instance, using MD5 as the hashing algorithm, we calculate the hash of the identifier, which produces a 32-digit hexadecimal number; this is then divided by the maximum possible 32-digit hexadecimal number to get a floating-point number between 0 and 1.

- Third, we determine the range based on the desired audience size. For instance, if we aim for an audience of 30%, we will select all points within the range of 0 to 0.3.

Identifiers presented as points on a number line
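A minimal Python sketch of this hashing step (illustrative only, not the production implementation):

```python
import hashlib

def hash_to_point(identifier: str) -> float:
    """Map an identifier to a deterministic point in the range [0, 1)."""
    digest = hashlib.md5(identifier.encode("utf-8")).hexdigest()  # 32 hex digits
    return int(digest, 16) / float(16 ** 32)                      # divide by the max 32-digit hex value

def in_audience(identifier: str, audience_fraction: float) -> bool:
    """The identifier is in the audience if its point falls within [0, audience_fraction)."""
    return hash_to_point(identifier) < audience_fraction

print(hash_to_point("user-12345"))      # a stable number between 0 and 1
print(in_audience("user-12345", 0.30))  # True for roughly 30% of identifiers
```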

For variant allocation, we then select points within the previously determined range (0 to 0.3) based on the variant distribution.

Experiment with a 30% audience and variants varA, varB, varC in a 1:1:1 distribution

To ensure orthogonality, we append a unique experiment key to each identifier. This strategy guarantees that the allocation process remains separate for every experiment. Consequently, each experiment has its own distinct number line (frame of reference), so the same identifier lands at different positions in different experiments.

The same user's position on different experiments' number lines.
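Putting the pieces together, a simplified allocation routine might look like the sketch below. It assumes the experiment key is simply prepended to the identifier and that variant sub-ranges are carved sequentially out of the audience range; both are illustrative choices, not necessarily how the platform implements it.

```python
import hashlib

def hash_to_point(key: str) -> float:
    """Map a key to a deterministic point in [0, 1) using MD5."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) / float(16 ** 32)

def assign_variant(experiment_key: str, identifier: str,
                   audience_fraction: float, variants: dict):
    """Return a variant name, or None if the identifier falls outside the audience.

    `variants` maps variant name -> weight, e.g. {"varA": 1, "varB": 1, "varC": 1}.
    The experiment key is prepended so every experiment gets its own number line.
    """
    point = hash_to_point(f"{experiment_key}:{identifier}")
    if point >= audience_fraction:
        return None  # outside the audience range [0, audience_fraction)

    # Carve the audience range into sub-ranges proportional to the variant weights.
    total = sum(variants.values())
    lower = 0.0
    for name, weight in variants.items():
        upper = lower + audience_fraction * (weight / total)
        if point < upper:
            return name
        lower = upper
    return name  # fall back to the last variant on floating-point edge cases

# Audience 30%, variants varA / varB / varC in a 1:1:1 split, as in the figure above
print(assign_variant("exp-homepage-banner", "user-12345", 0.30,
                     {"varA": 1, "varB": 1, "varC": 1}))
```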

When two experiments (conducted on the same audience) are orthogonal, meaning their variant distributions are independent and equitable, we observe that users assigned to variants in one experiment are distributed equally among the users assigned to variants in the other experiment.

For instance, consider Exp-1 with variants var-A (33%), var-B (33%), and var-C (33%), and Exp-2 with variants var-P (50%) and var-Q (50%). In this case, the population of var-P consists of roughly equal shares of var-A, var-B, and var-C users. This illustrates how orthogonal experiments result in a balanced and independent allocation of users across variants.

Exp-1 : Red || Exp-2:Blue
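One quick way to sanity-check this property is to simulate two experiments over a set of synthetic identifiers, using the same hashing idea as above, and count the cross-distribution:

```python
import hashlib
from collections import Counter

def hash_to_point(key: str) -> float:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) / float(16 ** 32)

def bucket(experiment_key: str, user: str, variants: list) -> str:
    # 100% audience and an equal split, just to illustrate the cross-distribution
    point = hash_to_point(f"{experiment_key}:{user}")
    return variants[int(point * len(variants))]

crosstab = Counter()
for i in range(100_000):
    user = f"user-{i}"
    v1 = bucket("exp-1", user, ["var-A", "var-B", "var-C"])
    v2 = bucket("exp-2", user, ["var-P", "var-Q"])
    crosstab[(v2, v1)] += 1

# Each of var-P and var-Q should contain roughly equal counts of
# var-A / var-B / var-C users (about 16,700 per combination here).
for pair, count in sorted(crosstab.items()):
    print(pair, count)
```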

Audience selection: Targeting specific users

Targeting specific user sets is important for Licious, as it allows us to reach customers living in a specific city, using a certain app version, or using a particular feature of the app such as Licious Infiniti. With the in-house platform, we can segment the target users according to business needs; the tool allows us to target users who are

  • Residing in a specific city/hub
  • Using a particular version of an app on a specific platform
  • Falling into a particular segment based on their activity on our platform

All these conditions can be applied independently.

Audience selection

Whenever we receive a request to assign a variant to a user, the initial step involves checking whether the user meets the audience criteria through a series of filter checks. If these checks are met, the user becomes part of the experiment and the corresponding variant is assigned. If the checks are not satisfied, the user is not included in the experiment.
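A simplified sketch of such filter checks is shown below; the attribute names (city, app version, segments) and the matching rules are illustrative, not the platform's actual schema.

```python
def version_tuple(version: str) -> tuple:
    """Turn '5.2.0' into (5, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def passes_audience_filters(user: dict, criteria: dict) -> bool:
    """Return True only if the user passes every configured filter."""
    if "cities" in criteria and user.get("city") not in criteria["cities"]:
        return False
    if "min_app_version" in criteria and \
            version_tuple(user.get("app_version", "0")) < version_tuple(criteria["min_app_version"]):
        return False
    if "segments" in criteria and not set(user.get("segments", [])) & set(criteria["segments"]):
        return False
    return True

user = {"city": "Bengaluru", "app_version": "5.2.0", "segments": ["licious-infiniti"]}
criteria = {"cities": ["Bengaluru", "Hyderabad"], "min_app_version": "5.0.0",
            "segments": ["licious-infiniti"]}
print(passes_audience_filters(user, criteria))  # True -> proceed to variant allocation
```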

Real-time control: Controlling feature rollout or variant distribution in real-time

We have the ability to dynamically control the distribution of variants and modify their properties in real-time, all without requiring a formal app rollout.

Variant configuration

We can increase or decrease the audience size and change the variant distribution at run-time. The variant configuration can also be modified.
In the experiment configured above, we can even introduce new variants, such as one featuring a green button, and seamlessly integrate them into apps that have already been released.
Moreover, once we reach a conclusion in our experiment, we can roll out a single variant to 100% of the audience.
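As an illustration, an experiment's variant configuration could take a shape like the following. The field names are hypothetical and only meant to show how a new variant (e.g. a green button) can be added purely through configuration, since the released apps read these properties from the platform rather than hard-coding them.

```python
# Hypothetical experiment configuration; field names are illustrative only.
experiment_config = {
    "experimentKey": "checkout-cta-colour",
    "identifierType": "userKey",
    "audienceFraction": 0.30,  # can be raised towards 1.0 at run-time
    "variants": [
        {"name": "control", "weight": 1, "properties": {"buttonColour": "red"}},
        {"name": "varB",    "weight": 1, "properties": {"buttonColour": "blue"}},
        # A new variant can be added without an app release, because released
        # apps fetch these properties from the platform at run-time.
        {"name": "varC",    "weight": 1, "properties": {"buttonColour": "green"}},
    ],
}
```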

Experiment Analysis: Making sense of data

Just running an experiment is not enough; it is of utmost importance that the impact of an experiment is analysed to quantify the final impact on business metrics. Hence, the platform provides multiple capabilities to make the analysis easier.

First, we collect data on user interactions with our test feature, which is stored in our data lake.
Next, we extract this data and run our analytics algorithm to validate and evaluate our hypotheses.

This analytics process is divided into three parts:

1. Calculate variant metrics
These can include calculation of
- Conversion of a variant
- Change in conversion between any two variants

Conversion is calculated by taking two sets of data: exposure and action.
We gather data such as how many users have viewed a particular variant (the denominator) and how many users have performed an action on this variant (the numerator). For example, if
Variant-A was viewed by 1,000 users and 100 users interacted with it, the conversion is 0.1 (or 10%).

The change in conversion is typically calculated relative to the baseline (or control) variant: it is the difference between the conversion rate of the variant in question and the baseline conversion rate, expressed as a percentage of the baseline.
For instance, if Variant-A (control) has a conversion rate of 10% and Variant-B has a conversion rate of 12%, the change in conversion would be 20% (a 2-percentage-point absolute difference relative to the 10% baseline).
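In code, using the numbers from the example above and treating the change in conversion as a relative lift:

```python
def conversion(actions: int, exposures: int) -> float:
    """Conversion = users who acted / users who were exposed."""
    return actions / exposures

def relative_lift(variant_rate: float, baseline_rate: float) -> float:
    """Change in conversion relative to the baseline (control) variant."""
    return (variant_rate - baseline_rate) / baseline_rate

rate_a = conversion(100, 1000)   # Variant-A (control): 10%
rate_b = conversion(120, 1000)   # Variant-B: 12%
print(f"{relative_lift(rate_b, rate_a):.0%}")  # 20%
```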

2. Enough exposure to the experiment
We calculate the sample size of an experiment using these parameters:

  • Baseline conversion rate
  • Statistical significance
  • The minimum detectable effect we want between two variants.

Baseline conversion is the conversion metric of the control variant against which we aim to measure our other variants.

Statistical significance is a value that defines how unlikely it is that the observed difference could have occurred by chance.

Minimum Detectable Effect (MDE) is the value that determines the sensitivity of our conversion metric. In other words, it represents the minimum change we aim to measure between two variants. Consequently, a lower MDE value corresponds to a larger sample size, and vice versa.
MDE can either be a static value or can be derived dynamically by measuring the difference in conversion between two variants.

As an example, consider an experiment with the following parameters:

  • Baseline conversion: 30%
  • Statistical significance: 95%
  • Minimum Detectable Effect (MDE): 5%

This would necessitate an exposure of approximately 14,000 users per variant.
We will explain these calculations in the next part of this article.
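One common way to arrive at a figure in this ballpark is the standard two-proportion sample-size formula. The sketch below assumes 80% statistical power and treats the 5% MDE as a relative change (30% to 31.5%); these are our own assumptions for illustration, since the exact calculation is covered in Part 2.

```python
from math import sqrt

def sample_size_per_variant(baseline: float, mde_relative: float,
                            z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Two-proportion sample-size estimate (defaults: 95% significance, 80% power)."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)        # MDE treated as a relative change
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

print(sample_size_per_variant(0.30, 0.05))  # roughly 14,800 users per variant
```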

3. Distribution is equitable
Since user allocation is managed independently by an algorithm that takes a unique identifier as input, rather than being directly controlled by us, situations may arise where the distribution is not equitable. This occurrence is referred to as Sample Size Ratio Mismatch (SSRM).
To validate this, we employ Pearson's 𝛘² (chi-squared) test with a predetermined level of significance. The test statistic is calculated as ∑((Oᵢ - Eᵢ)²/Eᵢ), where Oᵢ represents the observed value and Eᵢ the expected value. We then compare the resulting statistic with a critical value, determined by the degrees of freedom and the chosen significance (or alpha) level.

For instance, let's consider an experiment with two variants, A and B, and an exposure of 20,000, expected to split evenly at a 50:50 ratio. If the observed values are 10,100 for variant A and 9,900 for variant B, the expected value for each variant is 10,000. The chi-squared statistic is then ((10,100 - 10,000)²/10,000) + ((9,900 - 10,000)²/10,000), resulting in a value of 2.

The degrees of freedom (DF) is the number of values that are free to vary in the analysis; in this case, it is the number of variants minus 1, which is 1 in this scenario. The critical value for a significance level of 95% and DF of 1 is 3.84.
This means that, if the split is truly even, the chi-squared statistic will be 3.84 or lower in 95 out of 100 cases.

Consequently, since 2 < 3.84, we can consider our distribution acceptable at the 95% significance level.
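A minimal version of this check in code (one could equally use scipy.stats.chisquare):

```python
def chi_squared_statistic(observed: list, expected: list) -> float:
    """Pearson's chi-squared statistic: sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [10_100, 9_900]   # users actually assigned to variants A and B
expected = [10_000, 10_000]  # expected counts under the configured 50:50 split

stat = chi_squared_statistic(observed, expected)
critical_value = 3.84        # chi-squared critical value for DF = 1, alpha = 0.05

print(stat)  # 2.0
print("possible SSRM" if stat > critical_value else "distribution looks acceptable")
```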

If the user distribution is slightly inaccurate but the deviation is minimal, we might opt to wait for more exposure or make adjustments to the variant distribution. Conversely, if the deviation is substantial, we investigate whether there are errors in the experiment setup or feature implementation.

Analytics Dashboard

To conclude an experiment, we must ensure an equitable distribution among the variants and that each variant has received enough exposure.
Once the experiment is concluded, the winning variant, i.e. the variant with the highest conversion rate, is rolled out to all users.

For example, in the image above, we can observe that variant B has a higher conversion rate than variant A. Therefore, we can conclude that variant B is the winner of this experiment.

Conclusion

Currently, our platform provides robust support for experimentation and feature rollout, along with precise audience control. Moreover, we offer a basic analytics dashboard for insightful tracking.

Looking ahead, we’re determined to enhance our data collection methods for experimentation. This will enable us to gain more informed insights and, as a result, make more strategic decisions. Our evolution as a data-driven company backed by results is at the forefront of our goals!

In the next part, we will delve deeper into how we define analytic metrics and maintain a minimal impact on identifier assignment to variants, when there are changes in the experiment configuration.
