This standard operating procedures (SOP) document describes the default practices of the experimental research group led by Donald P. Green at Columbia University. These defaults apply to analytic decisions that have not been made explicit in pre-analysis plans (PAPs). They are not meant to override decisions that are laid out in PAPs. For more discussion of the motivations for SOP, see Lin and Green (2016).

This is a living document. To suggest changes or additions, please feel free to e-mail us1 or submit an issue on GitHub. Also, when referencing this document, please be sure to note the version and date.

Many thanks to Peter Aronow, Susanne Baltes, Jake Bowers, Al Fang, Macartan Humphreys, Berk Özler, Anselm Rink, Cyrus Samii, Michael Schwam-Baird, Uri Simonsohn, Amber Spry, Dane Thorley, Anna Wilke, and José Zubizarreta for helpful comments and discussions.

Preliminary checks on data preparation and study implementation

After collection of the raw data, all data processing steps leading up to creation of the files for analysis will be performed by computer code. We will make these computer programs publicly available.

The following types of checks should be performed and documented before analyses are publicly disseminated or submitted for publication:2

Any unresolved data preparation or processing issues will be disclosed in a research report or supplementary materials.

Standard errors, confidence intervals, and significance tests

Bell–McCaffrey adjustment for standard errors and confidence intervals

Many scholars agree that uncertainty due to sampling variability should be communicated not solely via null hypothesis significance testing, but via methods that focus on the magnitude of the estimated effect and the range of magnitudes compatible with the data, such as confidence intervals (Moher et al. 2010; Vazire 2016; Greenland et al. 2016; Wasserstein and Lazar 2016). Some journals require or encourage reporting of confidence intervals (CIs), while in others it is more typical to report estimated standard errors (SEs) along with point estimates. SEs are of interest not so much for their own sake as for enabling the construction of CIs and tests (Efron 1986; Pustejovsky and Tipton 2016). Thus, if SEs are reported without explicit CIs, they should ideally be accompanied by sufficient information for CI construction, such as the degrees of freedom for the distribution of the t-statistic.

Our default method for SEs, degrees of freedom, and CIs is the Bell–McCaffrey (BM) adjustment (Bell and McCaffrey 2002). Imbens and Kolesár (2016) give a very helpful discussion of the procedure and “recommend that empirical researchers should, as a matter of routine, use the BM confidence intervals.”3 Similar to Welch’s unequal-variances t-test, the BM adjustment combines a bias-reduced version of the “robust” or “cluster-robust” SE with a Satterthwaite approximation for the degrees of freedom of the t-distribution. PAPs are (as always) free to deviate from this default approach. Our primary goal here is to obtain approximately valid CIs for average treatment effects (where “valid” means that the actual coverage probability equals or exceeds the nominal coverage). A researcher preparing a PAP for a specific experiment may have reason to believe, based on statistical theory and/or simulations, that a simpler method (e.g., using a critical value from the standard normal distribution instead of the t-distribution when the sample is sufficiently large, or using the classical OLS SE when subjects are randomly assigned to two equal-sized groups) will achieve the same goal, that a different approach is needed due to complexities in the experimental design or the point estimation procedure, or that a different approach will yield more precision.

Below, we give example R code (a slight adaptation of Michal Kolesár’s BMlmSE() function) to perform the BM adjustment given a fitted OLS regression.4

Example R code for Bell–McCaffrey adjustment

In the following R code, the function BMlmSE() takes these arguments:

  • model (required): An object returned by lm(), representing a fitted OLS regression.

  • clustervar (optional): A factor whose levels correspond to the clusters. If clustervar is supplied, the function uses Bell and McCaffrey (2002)’s bias-reduced cluster-robust variance estimator. If clustervar is left unspecified (meaning that the user does not want clustered SEs), the function uses the HC2 robust variance estimator.

  • ell (optional): A vector specifying a linear combination of coefficients to compute an SE and degrees of freedom for. ell must have the same length as the vector of regression coefficients. For example, if model is an object returned by lm(outcome ~ treatment + covariate), then ell = c(0, 1, 0) specifies that we are only interested in the coefficient on treatment. If ell is left unspecified, the function computes SEs and degrees of freedom for all coefficients. (The adjusted degrees of freedom will generally be different for each coefficient, unlike the classical OLS degrees of freedom.)

  • IK (optional): A logical value. If clustervar is unspecified, IK has no effect on the results. If clustervar is supplied, IK determines whether the degrees of freedom are computed using Imbens and Kolesár (2016)’s method (if IK is TRUE or unspecified) or Bell and McCaffrey (2002)’s method (if IK is FALSE). Our SOP uses Bell and McCaffrey (2002)’s method.

The function returns a list with these components:

  • Vhat: The estimated covariance matrix.

  • dof: The adjusted degrees of freedom. (If ell is specified, dof is a scalar. If ell is unspecified, dof is a named vector with one element for each regression coefficient.)

  • adj.se: The SEs derived from Vhat Imbens and Kolesár (2016)’s additional adjustment to “convert the dof adjustment into a procedure that only adjusts the standard errors.” (If ell is specified, adj.se is a scalar. If ell is unspecified, adj.se is a named vector with one element for each regression coefficient.) This is our default method. adj.se can be multiplied by 1.96 to form the margin of error for a 95% CI. However, multiplying adj.se by 1.645 is generally a valid way to form the margin of error for a 90% CI.

  • se: The SEs derived from Vhat Imbens and Kolesár (2016)’s additional adjustment to “convert the dof adjustment into a procedure that only adjusts the standard errors.” (If ell is specified, se is a scalar. If ell is unspecified, se is a named vector with one element for each regression coefficient.) This our default method. To form the margin of error for a CI at any confidence level, multiply se by the appropriate critical value from the t-distribution with dof degrees of freedom. We give example R code for a 95% CI later in this section. If we are not reporting CIs explicitly, we will report se together with dof.5

  • se.Stata: The cluster-robust SE computed by the formula that Stata has used (see Imbens and Kolesár 2016). (If clustervar is unspecified, se.Stata is NA.) This is our default method.

## This is a slightly modified version of Michal Kolesar's BM_StandardErrors.R
## The original file was downloaded from:
##   https://github.com/kolesarm/Robust-Small-Sample-Standard-Errors
##   (Latest commit 2a80873 on Aug 26, 2015)
## We made minor changes for efficiency, as well as two changes that affect the
## code's functionality:
## 1) MatSqrtInverse() now stops with an error message if its argument is not of
##    full rank:
##    "Bell-McCaffrey SE undefined. This happens, e.g., when a dummy regressor is 1
##     for one cluster and 0 otherwise."
## 2) The list returned by BMlmSE() now includes "se".

# The MIT License (MIT)
# 
# Copyright (c) 2015 Michal Kolesar
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

## Compute Bell-McCaffrey Standard Errors
library(sandwich)
library(Matrix)

message1 <- paste0(
    'Bell-McCaffrey SE undefined. This happens, e.g., when a dummy regressor is 1 ',
    'for one cluster and 0 otherwise.'
)

MatSqrtInverse <- function(A) {
    ##  Compute the inverse square root of a matrix
    if (rankMatrix(A) < NROW(A)) stop(message1)
    ei <- eigen(A, symmetric = TRUE)
    d2 <- 1/sqrt(ei$values)
    ## diag(d2) is d2 x d2 identity if d2 is scalar, instead we want 1x1 matrix
    ei$vectors %*% (if (length(d2)==1) d2 else diag(d2)) %*% t(ei$vectors)
}

BMlmSE <- function(model, clustervar=NULL, ell=NULL, IK=TRUE) {
    X <- model.matrix(model)
    sum.model <- summary.lm(model)
    n <- sum(sum.model$df[1:2])
    K <- model$rank
    XXinv <- sum.model$cov.unscaled # XX^{-1}
    u <- residuals(model)

    df <- function(GG) {                # Compute DoF given G'*Omega*G
        sum(diag(GG))^2 / sum(GG * GG)
    }

    if(is.null(clustervar)) {           # no clustering
        Vhat <- vcovHC(model, type="HC2")
        Vhat.Stata <- Vhat*NA

        M <- diag(n)-X %*% XXinv %*% t(X)       # annihilator matrix
        GOG <- function(ell) {           # G'*Omega*G
            Xtilde <- drop(X %*% XXinv %*% ell / sqrt(diag(M)))
            crossprod(M * Xtilde)
        }
    } else {
        if(!is.factor(clustervar)) stop("'clustervar' must be a factor")

        ## Stata
        S <- length(levels(clustervar)) # number clusters
        uj <- apply(u*X, 2, function(x) tapply(x, clustervar, sum))
        Vhat.Stata <- S/(S-1) * (n-1)/(n-K) * sandwich(model, meat = crossprod(uj)/n)

        ## LZ2
        tXs <- function(s) {
            Xs <- X[clustervar==s, , drop=FALSE]
            MatSqrtInverse(diag(NROW(Xs))-Xs%*% XXinv %*% t(Xs)) %*% Xs
        }
        tX <- lapply(levels(clustervar), tXs) # list of matrices

        tu <- split(u, clustervar)
        tutX <- sapply(seq_along(tu),function(i) crossprod(tu[[i]],tX[[i]]))
        Vhat <- sandwich(model, meat = tcrossprod(tutX)/n)

        ## DOF adjustment
        tHs <- function(s) {
            Xs <- X[clustervar==s, , drop=FALSE]
            index <- which(clustervar==s)
            ss <- outer(rep(0,n),index)     # n x ns matrix of 0
            ss[cbind(index,1:length(index))] <- 1
            ss-X %*% XXinv %*% t(Xs)
        }
        tH <- lapply(levels(clustervar), tHs) # list of matrices

        Moulton <- function() {
            ## Moulton estimates
            ns <- tapply(u, clustervar,length)
            ssr <- sum(u^2)
            rho <- max((sum(sapply(seq_along(tu), function(i)
                sum(tu[[i]] %o% tu[[i]])))-ssr) / (sum(ns^2)-n), 0)
            c(sig.eps=max(ssr/n - rho, 0), rho=rho)
        }

        GOG <- function(ell) {
            G <- sapply(seq_along(tX),
                        function(i)  tH[[i]] %*% tX[[i]] %*% XXinv %*% ell)
            GG <- crossprod(G)

            if (IK==TRUE) {            # IK method
                Gsums <- apply(G, 2, function(x) tapply(x, clustervar, sum)) # Z'*G
                GG <- Moulton()[1]*GG + Moulton()[2]*crossprod(Gsums)
            }
            GG
        }
    }

    if (!is.null(ell)) {
        se <- drop(sqrt(crossprod(ell,Vhat) %*% ell))
        dof <- df(GOG(ell))
        se.Stata <- drop(sqrt(crossprod(ell,Vhat.Stata) %*% ell))
    } else {
        se <- sqrt(diag(Vhat))
        dof <- sapply(seq(K), function(k) df(GOG(diag(K)[,k])))
        se.Stata <- sqrt(diag(Vhat.Stata))
    }
    names(dof) <- names(se)

    return(list(vcov=Vhat, dof=dof, adj.se=se*qt(0.975,df=dof)/qnorm(0.975),
                se=se,
                se.Stata=se.Stata))
}

Here is an example of R code using BMlmSE() to construct a 95% CI for the average treatment effect (ATE).

## For this example, assume:
## - We have an experiment with simple or complete randomization of subjects.
##   10 subjects are assigned to treatment and 5 to control.
## - Our PAP specified that we would estimate ATE using an OLS regression of the
##   outcome on the treatment dummy and a single covariate, but did not specify
##   how we would estimate the SE or construct a CI.

# Generate fake data

set.seed(1234567)
treatment <- c(rep(1, 10), rep(0, 5))
covariate <- rnorm(15)
outcome <- treatment + rnorm(15)

# Use lm() to fit an OLS regression

ols.fit <- lm(outcome ~ treatment + covariate, singular.ok = FALSE)

# Use BMlmSE() to compute the BM SE and degrees of freedom

## The argument ell must have the same length as the vector of regression coefficients.
## In this example, ell = c(0, 1, 0) indicates that we only want to compute
## the Bell-McCaffrey SE and degrees of freedom for the linear combination
##   0 * intercept + 1 * (coef on treatment) + 0 * (coef on covariate),
## i.e., the coefficient on treatment.

bm <- BMlmSE(model = ols.fit, ell = c(0, 1, 0))

# Construct a 95% confidence interval for the average treatment effect

point.estimate <- coef(ols.fit)['treatment']

critical.value <- qt(0.975, df = bm$dof)  # Critical value for 95% CI
margin.of.error <- critical.value * bm$se

ci = c(point.estimate - margin.of.error, point.estimate + margin.of.error)
names(ci) = c('lower','upper')

point.estimate  # Estimated average treatment effect
bm$se           # HC2 robust SE
bm$dof          # Bell-McCaffrey degrees of freedom
ci              # Bell-McCaffrey confidence interval

Avoiding regression models that do not allow the BM adjustment

The BM adjustment’s SE estimators are undefined for some regression models:

  • The HC2 robust SE (returned by BMlmSE() when clustervar is unspecified) is undefined if any observation has a leverage of 1, which happens when the regressors include a dummy variable that equals 1 for that observation and 0 for all others (or vice versa), and in other scenarios where the values of the regressors are such that the OLS solution will always fit that observation perfectly, no matter what the outcome values are.6 In this situation, BMlmSE() returns values of NaN (Not a Number).

  • The BM cluster-robust SE (returned by BMlmSE() when clustervar is supplied) is undefined when the matrix \(I_i - H_{ii}\) (Bell and McCaffrey 2002, 7) for some cluster \(i\) is not of full rank, which happens, for example, when the regressors include a variable that would be constant if one cluster were dropped (e.g., a dummy variable for one cluster or for a subset of one cluster). In this situation, BMlmSE() prints the following error message: “Bell-McCaffrey SE undefined. This happens, e.g., when a dummy regressor is 1 for one cluster and 0 otherwise.”7

We think these situations can usually be prevented by specifying parsimonious regression models and avoiding dummy covariates that identify one or two subjects or clusters. Also, they can be detected by attempting the BM adjustment with mock outcome data before treatment effects are estimated (since the leverages and the \(I_i - H_{ii}\) matrices depend only on the regressors, not the outcome). If the PAP has specified a regression model that makes the BM adjustment undefined, we will use mock outcome data to help determine which covariates need to be dropped to make the BM adjustment well-defined.

Use of clustered standard errors

Our default approach is to use clustered SEs if and only if a regression includes multiple observations for some randomization units, and in that case, to cluster the SEs at the level of the randomization unit (not at any higher level). For example:

  • When individual subjects are randomly assigned to treatment, the randomization unit is the subject. In the typical case where a regression includes only one observation per subject, we will not use clustered SEs. If a regression includes multiple observations per subject, we will use SEs clustered at the subject level.

  • When households are randomly assigned to treatment, the randomization unit is the household. If a regression includes only one observation per household, we will not use clustered SEs. If a regression includes multiple observations per household (e.g., one observation per person, with more than one person in some households), we will use SEs clustered at the household level (not at any higher level such as county or state).

In cases where we use clustered SEs, our default SE estimator is Bell and McCaffrey (2002)’s bias-reduced cluster-robust SE. The SE and associated degrees of freedom can be computed using the BMlmSE() function above, with IK = FALSE and clustervar set to a factor variable whose values identify the clusters. (For example, if hhid is a variable that uniquely identifies households, then one can supply the argument clustervar = as.factor(hhid) to BMlmSE() to compute robust SEs clustered at the household level.)

Taking block randomization into account in SEs and CIs

Additional issues arise in block-randomized experiments:

  1. If treatment assignment probabilities vary by block, simple comparisons of the treatment and control groups’ average outcomes do not yield unbiased or consistent estimates of treatment effects. Some methods to address this problem are discussed in Gerber and Green (2012) (sections 3.6.1 and 4.5).

  2. Regardless of whether treatment assignment probabilities vary by block, SEs and CIs may be too conservative if they do not take the blocking into account (Bruhn and McKenzie 2009).8

Weighted regression (Gerber and Green 2012, 116–17) addresses issue #1 but does not by itself address issue #2. Taking a weighted average of block-specific treatment effect estimates and estimating its SE (Gerber and Green 2012, 77) is a possible way to address both issues. It is not obvious how to calculate the degrees of freedom for the distribution of the resulting t-statistic (an issue that can matter in small samples). However, OLS regression with treatment–block interactions (Schochet 2010, sec. 4.2.3) can reproduce both the weighted average point estimate and its estimated SE. Let \(B\) be the number of blocks, and let \(X_b\) be a dummy for block \(b\), for \(b = 1, \ldots, B - 1\) (omitting the dummy for block \(B\) to avoid collinearity). To compute a weighted average of the block-specific treatment–control differences in means, with weight \(w_b\) on block \(b\) and \(\sum_{b=1}^B w_b = 1\), one can use the estimated coefficient on the treatment dummy \(T\) in an OLS regression of the outcome \(Y\) on \(T\), \(X_1, \ldots, X_{B-1}\), and \(T \cdot (X_1 - w_1), \ldots, T \cdot (X_{B-1} - w_{B-1})\). In simulations, we have found that the HC2 robust SE for the coefficient on \(T\) in this regression is equivalent to the SE estimator in Gerber and Green (2012) (p. 77).

Our default methods for analyzing block-randomized experiments rely on OLS regression, using the Bell–McCaffrey adjustment for SEs, degrees of freedom, and CIs. Whether we include block dummies and treatment–block interactions in the regression will depend on sample sizes and other aspects of the experimental design, as described below. We do not attempt to cover every possible design here, but focus on some commonly used ones. The subsections that follow assume there are only two treatment arms (i.e., either one treatment group and one control group, or two treatment groups without an additional control group). We plan to cover designs with three or more treatment arms in a future version of the SOP.

As always, PAPs are free to deviate from the default methods described below. They may also use these defaults as a starting point and add other covariates to the regression model.

In what follows, let \(\overline{X}_b\) denote the sample mean of \(X_b\), for \(b = 1, \ldots, B - 1\). That is, \(\overline{X}_b = N_b / N\), where \(N\) is the total number of subjects in the experimental sample and \(N_b\) is the number of subjects in block \(b\).

Block randomization of individuals, with treatment assignment probabilities constant across blocks

If each block has at least 5 individuals assigned to each treatment arm,9 we will estimate the average treatment effect by running an OLS regression of \(Y\) on \(T\), \(X_1, \ldots, X_{B-1}\), and \(T \cdot (X_1 - \overline{X}_1), \ldots, T \cdot (X_{B-1} - \overline{X}_{B-1})\).

If some block has fewer than 5 individuals assigned to some treatment arm:

  • If the blocks were formed by coarsening a continuous variable \(X\), we will estimate the ATE by running an OLS regression of \(Y\) on \(T\), \(X\), and \(T \cdot (X - \overline{X})\), where \(\overline{X}\) is the sample mean of \(X\).

  • Otherwise, we will not account for blocking in our SEs and CIs unless the PAP specified a method for doing so.

Block randomization of individuals, with varying treatment assignment probabilities

If each block has at least 5 individuals assigned to each treatment arm:

  1. Except in the extreme case described in item #2 below, we will use the same method as in the preceding subsection: estimate the average treatment effect by running an OLS regression of \(Y\) on \(T\), \(X_1, \ldots, X_{B-1}\), and \(T \cdot (X_1 - \overline{X}_1), \ldots, T \cdot (X_{B-1} - \overline{X}_{B-1})\).

  2. An alternative approach, sometimes called the least-squares dummy variable (LSDV) method, is to run an OLS regression of the outcome on treatment and block dummies (without interactions). The estimand for the LSDV method is a weighted ATE, giving each block \(j\) a total weight proportional to \(N_j P_j (1 - P_j)\), where \(N_j\) is the number of subjects in the block and \(P_j\) is their probability of assignment to treatment (Angrist 1998, 256; Angrist and Pischke 2009, 75; Gerber and Green 2012, 119). In contrast, the estimand for the method in item #1 above is the unweighted ATE, which gives block \(j\) a total weight proportional to \(N_j\). In the extreme case where there is at least one block \(j\) such that \[ \frac{N_j}{\sum_j N_j} > 20 \, \frac{N_j P_j (1-P_j)}{\sum_j N_j P_j (1-P_j)}, \] we will use LSDV and we will explain to readers that we pre-specified a weighted ATE as our estimand in an attempt to improve precision.10 See the section “Analysis of block randomized experiments with treatment probabilities that vary by block” under “Using covariates in analysis” for discussion of additional issues that arise when the LSDV approach is used.

If some block has fewer than 5 individuals assigned to some treatment arm, then, in the extreme case described in item #2 above, we will again use the LSDV method, but otherwise our default methods are:

  • If each block has a different treatment probability, use the same method as in item #1 above.

  • Otherwise, reduce the number of dummy variables in the regression by replacing block dummies with treatment probability dummies. Let \(C < B\) be the number of possible treatment probabilities; let \(Z_1, \ldots, Z_{C-1}\) be a set of dummy variables for all but one treatment probability; and let \(\overline{Z}_c\) be the sample mean of \(Z_c\), for \(c = 1, \ldots, C-1\). If the blocks were formed by coarsening a continuous variable \(X\), estimate the average treatment effect by running an OLS regression of \(Y\) on \(T\), \(Z_1, \ldots, Z_{C-1}\), \(X\), \(T \cdot (Z_1 - \overline{Z}_1), \ldots, T \cdot (Z_{C-1} - \overline{Z}_{C-1})\), and \(T \cdot (X - \overline{X})\), where \(\overline{X}\) is the sample mean of \(X\). If the blocks were not formed by coarsening a continuous variable, use the same approach but omit the terms \(X\) and \(T \cdot (X - \overline{X})\).

Block-randomized treatment/placebo designs

In a treatment/placebo design (Gerber and Green 2012, 161–64), the planned analysis compares treatment group compliers with placebo group compliers. Block randomization (if used) can ensure that the block composition of the entire treatment group matches that of the entire placebo group, but cannot ensure the same degree of balance between treatment group compliers and placebo group compliers. In fact, it is possible for a block to contain one or more compliers in the treatment group but none in the placebo group, or vice versa. We therefore do not use block dummies in our default methods for analyzing block-randomized treatment/placebo experiments.

If the treatment assignment probability is constant across blocks:

  • If the blocks were formed by coarsening a continuous variable \(X\), we will estimate the ATE by running an OLS regression of \(Y\) on \(T\), \(X\), and \(T \cdot (X - \overline{X})\), where \(\overline{X}\) is the mean of \(X\) among all compliers in the treatment and placebo groups.

  • Otherwise, we will not account for blocking in our SEs and CIs unless the PAP specified a method for doing so.

If the treatment assignment probabilities vary across blocks, we will use the same method as in the final bullet of the above section on “Block randomization of individuals, with varying treatment assignment probabilities,” except that \(\overline{Z}_c\) and \(\overline{X}\) should be defined as the means of \(Z_c\) and \(X\) among compliers only.

Block randomization of clusters

In a cluster-randomized experiment, if the number of individuals varies across clusters, researchers should decide whether their estimand gives equal weight to each individual or to each cluster (Green and Vavreck 2008; Imbens 2013). If the estimand gives equal weight to each cluster (or if the clusters are all of the same size, in which case the two estimands are equivalent), it is straightforward to adapt the methods from the sections above on “Block randomization of individuals”: Using cluster-level averages as the observations, run an OLS regression with one observation per cluster, and substitute “cluster” for “individual” or “subject” in the descriptions of the methods (e.g., change “If each block has at least 5 individuals assigned to each treatment arm” to “If each block has at least 5 clusters assigned to each treatment arm”).

If the cluster sizes vary and the estimand gives equal weight to each individual, our default methods use OLS regression with one observation per individual (and, as noted earlier, our default SE estimator in this case is the Bell and McCaffrey (2002) bias-reduced cluster-robust SE):

  1. If each block has at least 20 clusters assigned to each treatment arm,11 or if each block has a different treatment probability, estimate the average treatment effect by running an OLS regression of \(Y\) on \(T\), \(X_1, \ldots, X_{B-1}\), and \(T \cdot (X_1 - \overline{X}_1), \ldots, T \cdot (X_{B-1} - \overline{X}_{B-1})\).

  2. If the conditions for item #1 are not met:

  • If the treatment assignment probability is constant across blocks: If the blocks were formed by coarsening a continuous variable \(X\), we will estimate the ATE by running an OLS regression of \(Y\) on \(T\), \(X\), and \(T \cdot (X - \overline{X})\), where \(\overline{X}\) is the sample mean of \(X\).12 If the blocks were not formed by coarsening a continuous variable, we will not account for blocking in our SEs and CIs unless the PAP specified a method for doing so. (If no regression model is specified in the PAP, we will regress \(Y\) on \(T\).)

  • If the treatment assignment probability varies across blocks: If the blocks were formed by coarsening a continuous variable \(X\), we will estimate the ATE by running an OLS regression of \(Y\) on \(T\), \(Z_1, \ldots, Z_{C-1}\), \(X\), \(T \cdot (Z_1 - \overline{Z}_1), \ldots, T \cdot (Z_{C-1} - \overline{Z}_{C-1})\), and \(T \cdot (X - \overline{X})\) (where, as before, \(C\) is the the number of possible treatment probabilities, \(Z_1, \ldots, Z_{C-1}\) are a set of dummy variables for all but one treatment probability, and overlines denote sample means). If the blocks were not formed by coarsening a continuous variable, we will use the same approach but omit the terms \(X\) and \(T \cdot (X - \overline{X})\).

One-tailed or two-tailed test?

We will report two-tailed significance tests unless the PAP specifies a one-tailed test or some other approach.13

Studentized permutation test

For significance tests and p-values, our default method is a Studentized permutation test (Chung and Romano 2013; Janssen 1997; Romano 2009, sec. 3) that compares the t-statistic for the average treatment effect (i.e., the estimated ATE divided by its estimated SE) with its empirical distribution under random reassignments of treatment that follow the same randomization scheme as the actual experiment.14 We will use 10,000 randomizations and the random number seed 1234567. P-values for two-tailed tests will be computed according to the following convention: “In general, if you want a two-sided P-value, compute both one-sided P-values, double the smaller one, and take the minimum of this value and 1” (Rosenbaum 2010, 33).

Here is an example of R code for a Studentized permutation test.

## For this example, assume:
## - We have an experiment with complete randomization of 200 subjects.
##   50 subjects are assigned to treatment and 150 to control.
## - Our PAP specified that we would estimate ATE using an OLS regression of the
##   outcome on the treatment dummy and a single covariate.
## - Either the PAP specified that we would use the HC2 SE, or
##   it did not specify an SE estimator so we are following the SOP and using HC2.

library(randomizr)
library(sandwich)

set.seed(1234567)

n.rands <- 10000  # no. of randomizations for permutation test
N <- 200          # no. of subjects
N.treated <- 50   # no. of subjects assigned to treatment 

## Generate "observed" data for this example

treated <- c(rep(1, N.treated), rep(0, N - N.treated))
covariate <- rnorm(N)
outcome <- 0.1 * treated + rnorm(N)

## Run the pre-specified regression with the observed data.
## Save the observed t-statistic for ATE.

fit <- lm(outcome ~ treated + covariate, singular.ok = FALSE)

est.ate <- coef(fit)['treated']  # Estimated average treatment effect
se <- sqrt( vcovHC(fit, type = 'HC2')['treated','treated'] )  # HC2 robust SE

t.observed <- est.ate / se

# Permutation test

t.sim <- rep(NA, n.rands)

for (i in 1:n.rands) {
  ## Simulate random assignment of treatment

  treated.sim <- complete_ra(N, m = N.treated)

  ## Run the pre-specified regression with the simulated data.
  ## Save the simulated t-statistic for ATE.
  
  fit.sim <- lm(outcome ~ treated.sim + covariate, singular.ok = FALSE)
  est.ate.sim <- coef(fit.sim)['treated.sim']
  se.sim <- sqrt( vcovHC(fit.sim, type = 'HC2')['treated.sim','treated.sim'] )
  
  t.sim[i] <- est.ate.sim / se.sim
}

## "In general, if you want a two-sided P-value, compute both one-sided P-values, 
##  double the smaller one, and take the minimum of this value and 1."
##    Rosenbaum (2010), Design of Observational Studies, p. 33, note 2
## (Other options exist, but this is our default.)

p.left  <- mean(t.sim <= t.observed)
p.right <- mean(t.sim >= t.observed)

p.value <- min(2 * min(p.left, p.right), 1)
p.value

We recommend attempting the permutation test with mock outcome data and actual covariate data before analyzing the actual outcome data. The mock permutation test may reveal that on some randomizations, the t-statistic cannot be computed because the regressors are collinear or because the HC2 or BM SE is undefined (see the section above on “Avoiding regression models that do not allow the BM adjustment”). In such cases, covariates should be dropped from the model until the mock permutation test runs without errors.

Using covariates in analysis

Default methods for estimating average treatment effects

Estimation methods for the primary analysis will normally have been specified in the PAP. For reference in what follows, here we describe our default methods for an experiment with random assignment of individuals. Let \(N\) be the number of subjects, and let \(M < N\) denote the largest integer such that at least \(M\) subjects are assigned to each arm.

  • If \(M \geq 20\) , we use least squares regression of Y on T, X, and T * X, where Y is the outcome, T is the treatment indicator, and X is a set of one or more mean-centered covariates (see “Choice of covariates” below for guidelines on the choice and number of covariates). The coefficient on T estimates the average effect of assignment to treatment. See Lin (2012a) for an informal description of this estimator.

  • If \(M < 20 \leq N\) , we use least squares regression of Y on T and X.

  • If \(N < 20\), we use either difference-in-differences or difference-in-means. (Section 4.1 in Gerber and Green (2012) discusses the efficiency comparison between these two estimators. Again, the choice will typically be specified in the PAP.)

Example: Default Covariate Adjustment Procedure

suppressMessages({
  library(randomizr)
  library(sandwich)
  library(lmtest)
})

N <- 200

# Make some covariates
X1 <- rnorm(N)
X2 <- rbinom(N, size = 1, prob = 0.5)

# Make some potential outcomes
Y0 <- .6*X1 + 3*X2 + rnorm(N)
Y1 <- Y0 + .4

# Conduct a random assignment and reveal outcomes
Z <- complete_ra(N, m= 100)
Y_obs <- Y1*Z + Y0*(1-Z)

# Mean-center the covariates
X1_c <- X1 - mean(X1)
X2_c <- X2 - mean(X2)

# Conduct Estimation
fit_adj <- lm(Y_obs ~ Z + Z*(X1_c + X2_c), singular.ok = FALSE)

# Robust Standard Errors
coeftest(fit_adj, vcov = vcovHC(fit_adj, type = "HC2"))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  1.51006    0.10208 14.7936 < 2.2e-16 ***
## Z            0.18060    0.13277  1.3603    0.1753    
## X1_c         0.57328    0.12269  4.6727 5.549e-06 ***
## X2_c         3.15502    0.20427 15.4451 < 2.2e-16 ***
## Z:X1_c       0.11814    0.15223  0.7761    0.4387    
## Z:X2_c      -0.33556    0.26658 -1.2588    0.2096    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare to unadjusted model
fit_unadj <- lm(Y_obs ~ Z, singular.ok = FALSE)
coeftest(fit_unadj, vcov = vcovHC(fit_unadj, type = "HC2"))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  1.62170    0.19469  8.3296 1.329e-14 ***
## Z           -0.03118    0.26162 -0.1192    0.9053    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Choice of covariates for regression adjustment

Ordinarily our choice of covariates for adjustment will have been specified in the PAP. For voter turnout experiments, the SOP section “Issues specific to voter turnout experiments” gives a default set of covariates in case the PAP fails to specify the choice.

With M and N as defined above, we will include no more than \(M / 20\) covariates in regressions with treatment-covariate interactions, and no more than \(N / 20\) covariates in regressions without such interactions.15

If PAP has failed to specify the choice of covariates, if the experiment is not a voter turnout study, and if the number of available baseline covariates (excluding higher powers, other transformations, and interactions between covariates) is 10 or fewer and does not exceed the limits above, we will include all the covariates in our regressions.

In general, covariates should be measured before randomization. To make any exceptions to this rule, we need to have a convincing argument that either (1) the variable is a measure of pre-randomization conditions, and treatment assignment had no effect on measurement error, or (2) although the variable is wholly or partly a measure of post-randomization conditions, it could not have been affected by treatment assignment. (Rainfall on Election Day would probably satisfy #2.)

Occasionally a new source of data on baseline characteristics becomes available after random assignment (e.g., when political campaigns join forces and merge their datasets). To decide which (if any) variables derived from the new data source should be included as covariates, we will consult a “blind jury” of collaborators or colleagues. The jury should not see treatment effect estimates or any information that might suggest whether inclusion of a covariate would make the estimated effects bigger or smaller. Instead, they should be asked which covariates they would have included if the new data source had been available before the PAP was registered.

Covariates should generally be chosen on the basis of their expected ability to help predict outcomes, regardless of whether they appear well-balanced or imbalanced across treatment arms.16 But there may be occasions when the covariate list specified in the PAP omitted a potentially important covariate (due to either an oversight or the need to keep the list short when N is small) with a nontrivial imbalance. Protection against ex post bias (conditional on the observed imbalance) is then a legitimate concern.17 However, if observed imbalances are allowed to influence the choice of covariates,18 the following guidelines should be observed:

  1. If possible, the balance checks and decisions about adjustment should be finalized before we see unblinded outcome data.

  2. The direction of the observed imbalance (e.g., whether the treatment group or the control group appears more advantaged at baseline) should not be allowed to influence decisions about adjustment. We will either pre-specify criteria that depend on the size of the imbalance but not its direction, or consult a “blind jury” that will not see the direction of imbalance or any other information that suggests how the adjustment would affect the point estimates.

  3. The estimator specified in the PAP will always be reported and labeled as such, even if alternative estimates are also reported. See also “Unadjusted estimates, alternative regression specifications, and nonlinear models” below.

Missing covariate values

Observations with missing covariate values will be included in the regressions that estimate average treatment effects, as long as the outcome measure and treatment assignment are non-missing. Ordinarily, methods for handling missing values will have been specified in the PAP. If not, we will use the following approach:

  1. If no more than 10% of the covariate’s values are missing, recode the missing values to the overall mean. (Do not use arm-specific means.)

  2. If more than 10% of the covariate’s values are missing, include a missingness dummy as an additional covariate and recode the missing values to an arbitrary constant, such as 0.19 If the missingness dummies lead us to exceed the M / 20 or N / 20 maximum number of covariates (see above under “Choice of covariates”), revert to the mean-imputation method above.

Example: Recoding Missing Covariates

suppressMessages({
  library(randomizr)
  library(sandwich)
  library(lmtest)
})

N <- 200

# Make some covariates
X1 <- rnorm(N)
X2 <- rbinom(N, size = 1, prob = 0.5)

# Make some potential outcomes
Y0 <- .6*X1 + 3*X2 + rnorm(N)
Y1 <- Y0 + .4

# Conduct a random assignment and reveal outcomes
Z <- complete_ra(N, m= 100)
Y_obs <- Y1*Z + Y0*(1-Z)

# Some covariate values are missing:
X1_obs <- X1
X2_obs <- X2

X1_obs[sample(1:N, size = 10)] <- NA
X2_obs[sample(1:N, size = 50)] <- NA

# Less than 10% of X1_obs is missing, so:
X1_obs[is.na(X1_obs)] <- mean(X1_obs, na.rm = TRUE)

# More than 10% of X2_obs is missing, so:
X2_missing <- is.na(X2_obs)
X2_obs[X2_missing] <- 0

# Mean-center the covariates
X1_obs_c <- X1_obs - mean(X1_obs)
X2_obs_c <- X2_obs - mean(X2_obs)
X2_missing_c <- X2_missing - mean(X2_missing)

# Conduct Estimation
fit_adj <- lm(Y_obs ~ Z + Z*(X1_c + X2_c + X2_missing_c), singular.ok = FALSE)

# Robust Standard Errors
coeftest(fit_adj, vcov = vcovHC(fit_adj, type = "HC2"))
## 
## t test of coefficients:
## 
##                Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)     1.61287    0.19233  8.3861 1.066e-14 ***
## Z               0.64641    0.26467  2.4423    0.0155 *  
## X1_c           -0.25257    0.18965 -1.3318    0.1845    
## X2_c            0.10878    0.38592  0.2819    0.7783    
## X2_missing_c    0.65116    0.41515  1.5685    0.1184    
## Z:X1_c          0.28683    0.27746  1.0337    0.3026    
## Z:X2_c          0.46270    0.53133  0.8708    0.3849    
## Z:X2_missing_c -0.59844    0.59069 -1.0131    0.3123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unadjusted estimates, alternative regression specifications, and nonlinear models

Our primary analysis will be based on a pre-specified covariate-adjusted estimator (unless \(N < 20\)), but we will also report unadjusted estimates as a robustness check. Results from alternative regression specifications may also be reported as specified in the PAP, or as allowed under “Choice of covariates” above, or as requested by referees. We will make clear to readers which estimator was pre-specified as primary.

For binary or count-data outcomes, some referees prefer estimates based on nonlinear models such as logit, probit, or Poisson regression. Although we disagree with this preference (the robustness of least squares adjustment in RCTs is supported by both theory and simulation evidence),20 we will provide supplementary estimates derived from nonlinear models (using marginal effects calculations) if requested by referees. We prefer logits to probits because adjustment based on the probit MLE is not misspecification-robust.21

Covariate imbalance and the detection of administrative errors

We will perform a statistical test to judge whether observed covariate imbalances are larger than would normally be expected from chance alone. In an experiment with a binary treatment and a constant probability of assignment to treatment, the test involves a regression of the treatment indicator on the covariates and calculation of a heteroskedasticity-robust Wald statistic for the hypothesis that all the coefficients on the covariates are zero (Wooldridge 2010, 62). The covariates to be included in the regression should be specified in the PAP. (For voter turnout experiments, the SOP section “Issues specific to voter turnout experiments” gives a default set of covariates in case the PAP fails to specify the choice.) If the experiment is block-randomized with treatment probabilities that vary by block, we will also include dummy variables for the varying treatment probabilities in the regression, and we will test the hypothesis that all coefficients on the covariates, excluding the treatment probability dummies, are zero.

We will use a permutation test (randomization inference) to calculate the p-value associated with the Wald statistic.

In an experiment with multiple treatments, we will perform an analogous test using multinomial logistic regression of treatment on covariates.

A p-value of 0.01 or lower should prompt a thorough review of the random assignment procedure and any possible data-handling mistakes. If the review finds no errors, we will report the imbalance test, proceed on the assumption that the imbalance is due to chance, and report estimates with and without covariate adjustment.

Example: Permutation Test of Covariate Balance

suppressMessages({
  library(randomizr)
  library(sandwich)
})

# Generate Covariates

set.seed(1234567)

N <- 1000

gender <- sample(c("M", "F"), N, replace=TRUE)
age <- sample(18:65, N, replace = TRUE)
lincome <- rnorm(N, 10, 3)
party <- sample(c("D", "R", "I"), N, prob=c(.45, .35,.2), replace=TRUE)
education <- sample(10:20, N, replace=TRUE)

# Conduct Random Assignment
Z <- complete_ra(N, 500)

# Regress treatment on covariates
fit <- lm(Z ~ gender + age + lincome + party + education, singular.ok = FALSE)

# Obtain observed heteroskedasticity-robust Wald statistic
# See Wooldridge (2010), p. 62
# Null hypothesis is that the slope coefficients are all zero, i.e.
#  R beta = 0
#  where beta is the 7 x 1 vector of coefficients, including the intercept
#  and R is the 6 x 7 matrix with all elements zero except
#   R[1,2] = R[2,3] = R[3,4] = R[4,5] = R[5,6] = R[6,7] = 1

Rbeta.hat <- coef(fit)[-1]
RVR <- vcovHC(fit, type <- 'HC0')[-1,-1]
W_obs <- as.numeric(Rbeta.hat %*% solve(RVR, Rbeta.hat))  # Wooldridge, equation (4.13)

# Compare to permutation distribution of W

sims <- 10000
W_sims <- numeric(sims)

for(i in 1:sims){
  Z_sim <- complete_ra(N, 500)
  fit_sim <- lm(Z_sim ~ gender + age + lincome + party + education, singular.ok = FALSE)

  Rbeta.hat <- coef(fit_sim)[-1]
  RVR <- vcovHC(fit_sim, type <- 'HC0')[-1,-1]
  W_sims[i] <- as.numeric(Rbeta.hat %*% solve(RVR, Rbeta.hat))
}

# Obtain p-value
p <- mean(W_sims >= W_obs)
p
## [1] 0.8903
hist(W_sims)
abline(v = W_obs, lwd=3, col="red")