It is increasingly common for data to be collected adaptively, where experimental costs are reduced progressively by assigning promising treatments more frequently. However, adaptivity also poses great challenges on post-experiment inference, since observations are dependent, and standard estimates can be skewed and heavy-tailed. We propose a treatment-effect estimator that is consistent and asymptotically normal, allowing for constructing frequentist confidence intervals and testing hypotheses.


Adaptive data collection can optimize sample efficiency during the course of the experiment for particular objectives, such as identifying the best treatment (D. Russo 2020) or improving operational performance (Agrawal and Goyal 2013) . To achieve these efficiency gains, the experimenter—rather than staying with a fixed randomization rule—updates the data-collection policy (which maps individual characteristics/contexts to treatments/actions) in response to observed outcomes over the course of the experiment. In this way, the experimenter can resolve uncertainty about some aspects of the data generating process at the expense of learning little about others (Murphy 2005). A common family of the design algorithms are bandit algorithms (Lai and Robbins 1985), where treatment assignments are selected to trade off exploration and exploitation to maximize the cumulative performance over time.

The increasing popularity of adaptive experiments results in the growing availability of data collected from such designs. A natural query arises: can we reuse the data to answer a variety of questions that may not be originally targeted by the experiments? However, adaptivity also poses great statistical challenges if the post-experiment objective differs significantly from the original, and standard approaches used to analyze independently collected data can be plagued by bias, excessive variance, or both. This post seeks to address the problem of offline policy evaluation, which is to estimate the expected benefit of one treatment assignment policy—often termed as policy value—with data that was collected using another potentially different policy. For example, in personalized healthcare, doctors may use electronic medical records to evaluate how particular groups of patients will respond to heterogenenous treatments (e.g., different types of drugs/therapies or different dosage levels of the same drug) (Bertsimas et al. 2017), whereas in targeted advertising, retailers may want to understand how alternative product promotions (either in mail or online) affect different consumer segments (Schnabel et al. 2016).

Problem formulation

Consider that samples are collected by a multi-armed bandit algorithm, where each observation is represented by a tuple \((W_t, Y_t)\). The random variables \(W_t \in \{1, 2, \dots, K\}\) are called the arms, treatments or interventions. The reward or outcome \(Y_t\) represents the individual’s response to the treatment, for which we use the potential outcome framework and denote \(Y_t(w)\) as the random variable representing the outcome that would be observed if individual \(t\) were assigned to a treatment \(w\). We here consider a stationary setting of the potential outcomes, where \((Y_t(1),Y_t(2),\dots, Y_t(K))\) is sampled from a fixed distribution. The set of observations up to a certain time \(H_t := \{(W_s, Y_s) \}_{s=1}^T\) is called a history. The treatment assignment probabilities (also known as propensities) \(e_t(w) := \mathbb{P}[W_t = w | H_{t-1}]\) are updated over time by the experimenter, in response to the observations \(H_{t-1}\).

Our goal is to estimate the value of an arm \(w\), denoted by \(Q(w):=\mathbb{E}[Y_t(w)]\). We will provide consistent and asymptotically normal test statistics for \(Q(w)\), so that we can construct confidence intervals around the estimations to test hypotheses. We would like to do that even in data-poor situations in which the experimenter did not collect many samples around the arm \(w\).

Our approach

The main challenging in evaluating an arm \(w\) with observational data is known as the overlap issue between the target arm and the data-collection mechanism, when the arm assignments made during data collection differ substantially from the target arm \(w\). This issue becomes more severe when data is collected adaptively, since overlap with the target arm can deteriorate as the experimenter shifts the data-collection mechanism in response to what they observe. As a result, estimates from standard estimators can be skewed and heavy-tailed.

Our approach to recover the asymptotic normality is done in three steps. First, we construct an unbiased arm evaluation score of each sample, which is a transformation of the observed outcome. Second, we average these scores with non-uniform and data-adaptive weights, obtaining a new estimator with controlled variance. Finally, by dividing the estimator by its estimated standard error we obtain a test statistic that is consistent and asymptotically normal.

Step 1: Constructing an unbiased arm evaluation score for each observation

The arm evaluation scoring rule should address the sampling bias issue, which is notorious in adaptively collected data (Nie et al. 2018). One natural method is to re-weight observed outcomes based on importance sampling, which results in an inverse propensity score weighted (IPW) estimator: \[\begin{equation} \label{eq:ipw} \widehat{Q}_T^{IPW}(w) := \frac{1}{T}\sum_{t=1}^T\widehat{\Gamma}_t^{IPW}(w), \ \mbox{where} \ \widehat{\Gamma}_t^{IPW}(w) := \frac{\mathbb{I}\{W_t=w\}}{e_t(w)} Y_t. \end{equation}\] With the observation that \(\mathbb{P}({W_t = w | H_{t-1}, \, Y_t(w)} = \mathbb{P}({W_t = w | H_{t-1}} = e(w;H_{t-1})\), one can immediately see the unbiasedness of \(\widehat{\Gamma}_t^{IPW}(w)\) that has \(\mathbb{E}[\widehat{\Gamma}_t^{IPW}(w)|H_{t-1}]=Q(w)\).

The augmented inverse propensity weighted (AIPW) estimator generalizes the above by incorporating regression adjustment : \[\begin{equation} \label{eq:aipw} \begin{split} &\widehat{Q}_T^{AIPW}(w) := \frac{1}{T}\sum_{t=1}^T \widehat{\Gamma}_t^{AIPW}(w), \ \mbox{where}\ \widehat{\Gamma}_t^{AIPW}(w) := \hat{\mu}_t(w) + \frac{\mathbb{I}\{W_t=w\}}{e_t(w)} \left( Y_t - \hat{\mu}_t(w)\right). \end{split} \end{equation}\] The symbol \(\hat{\mu}_t(w)\) denotes an estimator of the conditional mean function \(\mu(w) = \mathbb{E}[Y_t(w)]\) based on the history \(H_{t-1}\), but it need not be a good one—it could be biased or even inconsistent. The second term of \(Y_t - \hat{\mu}_t(w)\) acts as a bias-correction term: adding it preserves unbiasedness but also reduces the variance, since the residual—when \(\hat{\mu}_t(w)\) is a reasonable estimator of \(\mu(w)\)—potentially has a smaller absolute mean as compared to the raw outcome \(Y_t\). We will hereby use the AIPW score \(\widehat{\Gamma}_t^{AIPW}(w)\) for each observation \(t\).

Step 2: Averaging arm evaluations with adaptive weights

When data is collected non-adaptively but by a fixed randomization rule, AIPW estimator is semiparametrically efficient (Hahn 1998). However, adaptivity makes the sampling distribution non-normal and heavy-tailed, and the variance of the AIPW scores \(\widehat{\Gamma}_t^{AIPW}(w)\) can vary hugely over the course of experiment. In fact, the conditional variances \(\mbox{Var}(\widehat{\Gamma}_t^{AIPW}(w)|H_{t-1})\) depend primarily on the behavior of the inverse propensities \(1/e_t(w)\), which may explode to infinity or fail to converge.

To address this difficulty, we consider a generalization of the AIPW estimator, which non-uniformly averages the unbiased scores \(\widehat{\Gamma}^{AIPW}_t(w)\) using a sequence of adaptive weights \(h_t(w)\). The resulting estimator is the adaptively-weighted AIPW estimator: \[\begin{align} \label{eq:aw} \widehat{Q}^{h}_T(w) = \frac{\sum_{t=1}^T h_t(w) \widehat{\Gamma}_t^{AIPW}(w)}{\sum_{t=1}^T h_t(w)}. \end{align}\]

Adaptive weights \(h_t(w)\) provide an additional degree of flexibility in controlling the variance and the tail of the sampling distribution. When chosen appropriately, these weights compensate for erratic trajectories of the assignment probabilities \(e_t(w)\), stabilizing the variance of the estimator. A natural choice of adaptive weights is to approximate the inverse standard deviation of the \(\widehat{\Gamma}_t^{AIPW}(w)\). In this way each re-weighted term \(h_t\widehat{\Gamma}_t^{AIPW}(w)\) has comparable variance, such that averaging these object may lead to a normal limiting distribution. We shall refer it to constant-allocation weighting scheme since each weighted element \(h_t\widehat{\Gamma}_t^{AIPW}(w)\) contributes to roughly the same share of variance in the final estimator. Weights of this type were proposed by Luedtke and Laan (2016) for the purpose of estimating the expected value of non-unique optimal policies that possibly depend on observable covariates.

More generally, to get an adaptively-weighted AIPW estimator \(\widehat{Q}^{h}_T(w)\) that is consistent and asymptotically normal, we require the following assumptions on our weighting schemes stated below.

Assumption 1 (Infinite sampling). \(\big(\sum_{t=1}^T h_t(w)]\big)^2 \,\big/\, \mathbb{E}\big[ \sum_{t=1}^T \frac{h^2_t(w)}{e_t(w)} \big] \xrightarrow[T \to \infty]{p} \infty.\)

Assumption 2 (Variance convergence). \(\sum_{t=1}^T \frac{h^2_t(w)}{e_t(w) } \,\bigg/\, \mathbb{E}\big[ \sum_{t=1}^T \frac{h^2_t(w)}{e_t(w)} \big]\xrightarrow[T \to \infty]{L_p} 1\), for some \(p>0\).

Assumption 3 (Bounded moments). \(\sum_{t=1}^T \frac{h^{2 + \delta}(w)}{e^{1 + \delta}(w) } \,\Big/\, \mathbb{E}\Big[ \Big(\sum_{t=1}^T \frac{h^2_t(w)}{e_t(w)} \Big)^{1 +\delta/2}\Big]\xrightarrow[T \to \infty]{p} 0\), for some \(\delta>0\).

Above, Assumption 1 requires that the effective sample size after adaptive weighting goes to infinity, which implies that the estimator converges. Assumption 3 is a Lyapunov-type regularity condition on the weights, which controls higher moments of the distribution. Assumption 2 is the more subtle condition that standard estimators such as AIPW estimator (i.e., \(h_t(w) \equiv 1\)) often fail to satisfy. We refer interesting readers to our paper Hadad et al. (2021) for a recipe on building weights that satisfy Assumption 2. In particular, we note a two-point allocation weighting scheme when the assignment probabilities \(e_t\) reflect the experimenter’s belief on arm optimality (as is the case for Thompson sampling). This weighting scheme allows us to downweight samples with small propensities more boldly but still preserve the asymptotic normality, and therefore often merits smaller variance and tighter confidence intervals as compared to the constant-allocation scheme.

Step 3: Estimating standard error and constructing a test statistic

With the evaluation weights discussed in step 2, when normalized by an estimate of its standard deviation, the adaptively-weighted AIPW estimator has a centered and normal asymptotic distribution. Similar ``self-normalization’’ schemes are often key to martingale central limit theorems (Peña, Lai, and Shao 2008).

Theorem. Suppose that we observe arms \(W_t\) and rewards \(Y_t=Y_t(W_t)\), and that the underlying potential outcomes \((Y_t(w))_{w \in \mathcal{W}}\) are independent and identically distributed with nonzero variance, and satisfy \(\mathbb{E}|Y_{t}(w)|^{2+\delta} < \infty\) for some \(\delta > 0\) and all \(w\). Suppose that the assignment probabilities \(e_t(w)\) are strictly positive and let \(\hat{\mu}_t(w)\) be any history-adapted estimator of \(Q(w)\) that is bounded and that converges almost-surely to some constant \(\mu_{\infty}(w)\). Let \(h_t(w)\) be non-negative history-adapted weights satisfying Assumptions 1-3. Suppose that either \(\hat{\mu}_t(w)\) is consistent or \(e_t(w)\) has a limit \(e_{\infty}(w) \in [0, \, 1]\), i.e., either \[\begin{equation} \label{eq:e_mu_alternative} \hat{\mu}_t(w) \xrightarrow[t \to \infty]{a.s.} Q(w) \quad \text{or} \quad e_t(w) \xrightarrow[t \to \infty]{a.s.} e_{\infty}(w) \end{equation}\] Then, for any arm \(w \in \mathcal{W}\), the adaptively-weighted estimator \(\widehat{Q}_T^{h}(w)\) is consistent for the arm value \(Q(w)\), and the following studentized statistic is asymptotically normal: \[\begin{equation} \begin{aligned} \label{eq:clt} &\frac{\widehat{Q}_T^{h}(w) - Q(w)}{\widehat{V}_T^{h}(w)^{\frac{1}{2}}} \xrightarrow{d} \mathcal{N}(0, 1), \ \ \ \text{where} \ \widehat{V}_T^{h}(w) := \frac{\sum_{t=1}^T h^2_t(w) \left( \widehat{\Gamma}_t(w) - \widehat{Q}_T(w) \right)^2}{\left( \sum_{t=1}^T h_t(w) \right)^2}. \end{aligned} \end{equation}\]


We consider three simulation settings, each with \(K = 3\) arms that yield rewards observed with additive \(\text{uniform}[-1, 1]\) noise. The settings vary in the difference between the arm values. In the no-signal case, we set arm values to \(Q(w) = 1\) for all \(w \in \{1, 2, 3\}\); in the low signal case, we set \(Q(w) = 0.9+ 0.1w\); and high signal case we set \(Q(w) = 0.5 + 0.5w\). During the experiment, treatment is assigned by a modified Thompson sampling method: first, tentative assignment probabilities are computed via Thompson sampling with normal likelihood and normal prior(D. J. Russo et al. 2018); they are then adjusted to impose the lower bound \(e_t(w) \geq (1/K)t^{-0.7}\).

We show comparison among four point estimators of arm values \(Q(w)\): the sample mean, the AIPW estimator with uniform weights (labeled as “unweighted AIPW”), and the adaptively-weighted AIPW estimator with constant and two-point allocation schemes. For the AIPW-based estimators, we use the same formula given in our theorem to construct confidence intervals. For the sample mean we use the usual variance estimate \(\smash{\widehat{V}^{AVG}(w) := T_w^{-2} \sum_{t: W_t = w}^{T} (Y_t - \widehat{Q}^{AVG}_T(w))^2}\). Approximate normality is not theoretically justified for the unweighted AIPW estimator or for the sample mean. We also consider non-asymptotic confidence intervals for the sample mean, based on the method of time-uniform confidence sequences described in Howard et al. (2021). We refer interesting readers to our paper for more results on estimating other arms as well as the contrast between arms (Hadad et al. 2021 ).

The below panel demonstrates that we are able to estimate the “good” arm value \(Q(3)\) with considerable ease in high- and low-signal settings, in that all methods produce point estimates with negligible bias and small root mean-squared error, and moreover attain roughly correct coverage for large \(T\). Conversely, estimating the “bad” arm value \(Q(1)\) is challenging. Although the AIPW estimator is unbiased, it performs very poorly in terms of RMSE and confidence interval width. In the low and high signal case, its problem is that it does not take into account the fact that the bad arm is undersampled, so its variance is high. Of our two adaptively-weighted AIPW estimators, two-point allocation better controls the variance of bad arm estimates by more aggressively downweighting `unlikely’ observations associated with large inverse propensity weights; this results in smaller RMSE and tighter confidence intervals. For the sample-mean estimator, naive confidence intervals based on the normal approximation are invalid, with severe under-coverage when there’s little or no signal. On the other hand, the non-asymptotic confidence sequences of Howard et al. (2021) are conservative and often wide.


One direct extension is to apply the above adaptive weighting technique to evaluating policies on populations with hetegeneous outcomes, using data that is collected adaptively via contextual bandit algorithms. In Zhan, Hadad, et al. (2021), we carefully choose adaptive weights to accommodate the variances of AIPW scores that may differ not only over time but also across the context space. The resulting estimator further reduces estimation variance.

Aside from policy evaluation, learning the optimal treatment assignment policies using adaptive data is also desirable, which enables personalized decision making in a wide variety of domains. In Zhan, Ren, et al. (2021), we establish the first-of-the-kind regret lower bound that characterizes the fundamental difficulty of policy learning with adaptive data. We then propose an algorithm based on re-weighted AIPW estimators and show that it is minimax optimal with the best weighting scheme.

About the author and article

  • Ruohan Zhan: postdoctoral fellow at the Stanford Graduate School of Business, incoming Assistant Professor at the Hong Kong University of Science and Technology.

Based on:

Vitor Hadad, David A. Hirschberg, Ruohan Zhan, Stefan Wager and Susan Athey. Confidence intervals for policy evaluation in adaptive experiments. Proceedings of the National Academy of Sciences, April 5, 2021, 118 (15) e2014602118, https://doi.org/10.1073/pnas.2014602118


Agrawal, Shipra, and Navin Goyal. 2013. “Thompson Sampling for Contextual Bandits with Linear Payoffs.” In International Conference on Machine Learning, 127–35. PMLR.
Bertsimas, Dimitris, Nathan Kallus, Alexander M Weinstein, and Ying Daisy Zhuo. 2017. “Personalized Diabetes Management Using Electronic Medical Records.” Diabetes Care 40 (2): 210–17.
Hadad, Vitor, David A Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey. 2021. “Confidence Intervals for Policy Evaluation in Adaptive Experiments.” Proceedings of the National Academy of Sciences 118 (15).
Hahn, Jinyong. 1998. “On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects.” Econometrica, 315–31.
Howard, Steven R, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. 2021. “Time-Uniform, Nonparametric, Non-Asymptotic Confidence Sequences.” The Annals of Statistics Forthcoming.
Lai, Tze Leung, and Herbert Robbins. 1985. “Asymptotically Efficient Adaptive Allocation Rules.” Advances in Applied Mathematics 6 (1): 4–22.
Luedtke, Alexander R, and Mark J van der Laan. 2016. “Statistical Inference for the Mean Outcome Under a Possibly Non-Unique Optimal Treatment Strategy.” Annals of Statistics 44 (2): 713.
Murphy, Susan A. 2005. “An Experimental Design for the Development of Adaptive Treatment Strategies.” Statistics in Medicine 24 (10): 1455–81.
Nie, Xinkun, Xiaoying Tian, Jonathan Taylor, and James Zou. 2018. “Why Adaptively Collected Data Have Negative Bias and How to Correct for It.” In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, edited by Amos Storkey and Fernando Perez-Cruz, 84:1261–69. New York: PMLR. http://proceedings.mlr.press/v84/nie18a.html.
Peña, Victor H de la, Tze Leung Lai, and Qi-Man Shao. 2008. Self-Normalized Processes: Limit Theory and Statistical Applications. Berlin Heidelberg: Springer-Verlag.
Russo, Daniel. 2020. “Simple Bayesian Algorithms for Best-Arm Identification.” Operations Research 68 (6): 1625–47. https://doi.org/10.1287/opre.2019.1911.
Russo, Daniel J., Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. 2018. “A Tutorial on Thompson Sampling.” Found. Trends Mach. Learn. 11 (1): 1–96. https://doi.org/10.1561/2200000070.
Schnabel, Tobias, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. “Recommendations as Treatments: Debiasing Learning and Evaluation.” In International Conference on Machine Learning, 1670–79. PMLR.
Zhan, Ruohan, Vitor Hadad, David A Hirshberg, and Susan Athey. 2021. “Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits.” In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2125–35.
Zhan, Ruohan, Zhimei Ren, Susan Athey, and Zhengyuan Zhou. 2021. “Policy Learning with Adaptively Collected Data.” arXiv Preprint arXiv:2105.02344.