More Than Big Data Needed for Estimates as Good as Randomized Clinical Trials: Prospective Observational Causal Studies

A research methods post…

Bill Gardner is cautiously—very cautiously—hopeful that big data from electronic health records (EHRs) will enable unbiased estimates of the effects of medical treatment, without any randomized controlled trials.  Gardner’s hope, his caution and his description of the data needed—“all the factors that determine who gets what treatment”—are all right on the mark. And EHRs will massively increase data on detailed clinical factors that drive clinical decisions.

But I fear researchers will focus only on the data lying around in EHRs. To cure confounding, you need to go out and measure all the confounders: everything related to both treatment and outcome.

Let’s start with Gardner’s valuable message:

Here’s the problem with observational data and simply comparing people who have been treated with drug X to those who have gotten drug Y. The X people are probably different from the Y people in systematic ways. There are usually reasons why they got different drugs. So the observed differences in outcomes between X drug and Y drug patients might be due to unobserved differences between these patients, rather than the drugs….

In principle, however, we could get around the unobserved differences problem if we could measure enough of the relevant differences between the groups of patients. Then we could, for example, closely match drug X and drug Y patients. If we matched them on all the factors that determine who gets what treatment, then it would be as if a matched X and Y patients had been randomly assigned to get either X or Y. Then by averaging the differences between X patients and Y patients within matches, we could get an estimate of the causal effect of drug X relative to drug Y…

Here’s where the ‘big data’ movement comes in. We can assemble data sets with large numbers of patients from electronic health records (EHRs). Moreover, EHRs contain myriad demographic and clinical facts about these patients. It is proposed that with these large and rich data sets, we can match drug X and drug Y patients on clinically relevant variables sufficiently closely that the causal estimate of the difference between the effects of drug X and drug Y in the matched observational cohort would be similar to the estimate we would get if we had run an RCT.

If we can get treatment effect estimates from matched observational cohorts, they’ll be cheaper than RCTs. Moreover, we’ll have these estimates for the actual populations of patients who are routinely treated with these drugs, not the selected groups seen in RCTs.

Sounds great, and it is. But some variables related to both treatment and outcomes won’t be in the EHR: the patient’s drive to get better… The presence (or absence) of an energetic family member to push the docs for best-fit treatment and provide non-medical support… Readers experienced with this game can probably reel off a list of likely suspects. Critically, some factors are probably idiosyncratic to the setting, say local transportation or docs’ relationships with each other.
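To make the worry concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the variable names (`severe`, `support`), the probabilities and the effect sizes are assumptions, not estimates from any real EHR. It builds patients whose treatment depends on one recorded factor and one unrecorded factor, then averages X-minus-Y differences within matched strata, as in Gardner’s description.

```python
import random
from statistics import mean

def simulate_patients(n=20000, seed=0):
    """Simulate patients whose drug choice depends on one recorded factor
    (severity) and one unrecorded factor (family support). All numbers
    here are hypothetical."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        severe = rng.random() < 0.5    # assumed to be in the EHR
        support = rng.random() < 0.5   # assumed NOT to be in the EHR
        # Probability of getting drug X rises with severity and with support.
        p_x = 0.2 + 0.3 * severe + 0.3 * support
        got_x = rng.random() < p_x
        # The true effect of X relative to Y is +1.0 by construction.
        outcome = 1.0 * got_x - 1.0 * severe + 2.0 * support + rng.gauss(0, 1)
        patients.append((severe, support, got_x, outcome))
    return patients

def matched_estimate(patients, key):
    """Average the X-minus-Y outcome difference within strata defined by
    `key`, weighting each stratum by its share of all patients."""
    strata = {}
    for p in patients:
        strata.setdefault(key(p), []).append(p)
    total, estimate = len(patients), 0.0
    for group in strata.values():
        x = [p[3] for p in group if p[2]]
        y = [p[3] for p in group if not p[2]]
        if x and y:  # skip strata that lack one of the two drugs
            estimate += (mean(x) - mean(y)) * len(group) / total
    return estimate

patients = simulate_patients()
# Match only on what the EHR records.
ehr_only = matched_estimate(patients, key=lambda p: p[0])
# Match on everything that drives treatment choice.
full = matched_estimate(patients, key=lambda p: (p[0], p[1]))
print(f"matched on EHR variable only: {ehr_only:.2f}")
print(f"matched on all drivers:       {full:.2f}")
```

In this toy setup, matching on the recorded variable alone should land well above the true effect of 1.0, while matching on both drivers should land close to it. No amount of extra patients fixes the first estimate; only measuring the missing variable does.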

As a non-clinician outside the setting, I am not qualified to suggest specific variables. But the right clinical researchers could. Then they could measure them just before treatment choices are made. It would take a lot of work and planning—but a lot less than an RCT.

Let me show how this might work in a situation where I am qualified: higher education (my own industry) at Baruch College in City University of New York (my own institution). Like many colleges, Baruch is investing in online and hybrid (blended) courses. Unlike many colleges, Baruch has a center to investigate the effects of hybrid and online courses: Zicklin Online Learning Evaluation (ZOLE).

Trying to escape the same old observational-study problems that plague ZOLE like everyone else, Ted Joyce, ZOLE’s director, decided to run a randomized experiment (ungated). He and colleagues Sean Crockett, David Jaeger, Onur Altindag and Stephen O’Connell randomly assigned students to hybrid and traditional courses in introductory microeconomics. They found that the traditional format (more class time) was more effective, but only slightly: by 2.3 percentage points (or 0.2 standard deviations).

Causal estimates we can trust! But we don’t know if the results generalize to statistics, marketing, history… or generalize to liberal arts colleges…or to less able instructors…or to the next generation of online materials. And I won’t recount the massive time and effort Ted and administrators put into scheduling, registration, study enrollment and other practicalities for this single study.

All that means we need to do the best possible observational studies, as advocated by Don Rubin. Happily, thanks to Will Shadish, Tom Cook, Peter Steiner, M.H. Clark, Vivian Wong and other collaborators, we know that observational studies can produce the same unbiased estimates as randomized experiments—if you do them right.

In a seminal study, Shadish, Clark and Steiner randomized students to be in either a randomized experiment or an observational study (ungated). The former were randomly assigned to math or vocabulary training, while the latter got to choose. The researchers measured math and vocabulary outcomes. Critically, before any assignment or choice, they also measured everything they thought could possibly drive selection: 23 constructs based on 156 items. With those controls, they found that even ordinary regression could reproduce the randomized experiment results.

Which control variables really mattered for getting rid of selection bias? The study authors and Cook found that one construct was essential: how much people liked math. In fact, it was disliking math that really drove avoiding math training. The usual demographic and administrative variables were almost useless for removing bias in the estimated effect of math training. Unfortunately, “dislike of math” isn’t in your usual administrative databases. Nor would it be in the big data that online education systems can gather.
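Here is a small sketch of why one construct can matter so much. It is a made-up simulation, not the Shadish, Clark and Steiner data: the variables (`dislike`, `age`), the selection rule and the true effect of 2.0 are all assumptions. It compares an ordinary regression that controls for a demographic with one that controls for the construct that actually drives selection. With a single control, the OLS coefficient on treatment can be computed by residualizing treatment and outcome on the control (the Frisch-Waugh-Lovell result).

```python
import random
from statistics import mean

def slope(x, y):
    """Simple-regression slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residualize(v, control):
    """Residuals from regressing v on control (with an intercept)."""
    b = slope(control, v)
    mv, mc = mean(v), mean(control)
    return [a - mv - b * (c - mc) for a, c in zip(v, control)]

def adjusted_effect(treat, outcome, control):
    """OLS coefficient on `treat` in a regression of `outcome` on `treat`
    and one `control`, via Frisch-Waugh-Lovell."""
    return slope(residualize(treat, control), residualize(outcome, control))

rng = random.Random(0)
n = 20000
dislike = [rng.gauss(0, 1) for _ in range(n)]  # hypothetical "dislikes math" scale
age = [rng.gauss(0, 1) for _ in range(n)]      # hypothetical demographic, unrelated to choice
# Students who dislike math tend to avoid the math training.
chose_math = [1.0 if d + rng.gauss(0, 1) < 0 else 0.0 for d in dislike]
# The true effect of math training on the math score is 2.0 by construction.
math_score = [2.0 * t - 1.5 * d + rng.gauss(0, 1)
              for t, d in zip(chose_math, dislike)]

with_demographics = adjusted_effect(chose_math, math_score, age)
with_dislike = adjusted_effect(chose_math, math_score, dislike)
print(f"controlling for the demographic: {with_demographics:.2f}")
print(f"controlling for dislike of math: {with_dislike:.2f}")
```

In this toy world, the demographic control leaves the estimate far above 2.0, because the students who chose math were the ones who would have scored better anyway. Adding the one selection-driving construct should bring plain regression back near the truth.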

How do you get the variables essential for unbiased causal effects in an observational study? Tom Cook gave practical instructions at a workshop. Here they are, filtered somewhat through my own lens:*

  • Try to figure out every possible driver of treatment selection, especially any conceivably related to the outcome
    • through all possible methods (e.g., interviews with choosers, with experts)
    • in advance of treatment selection (before any outcome data seen)
    • recognizing selection may be idiosyncratic to setting
    • blanketing all possible treatment selection processes, since one doesn’t know the true process
  • Design survey instrument (and/or other data collection methods)
    • measure each construct as many ways and times as possible
  • Field survey and otherwise collect such data
    • just ahead of actual treatment selection
  • Estimate causal effects with propensity score matching
    • plain OLS regression works too with the right control variables
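The last step in the list can be sketched as follows. This is only an illustration under assumptions I made up: two standardized survey measures (`commute`, `comfort`), a logistic selection rule and a true effect of -2.0 points for hybrid. It fits a logistic regression by plain gradient ascent to estimate propensity scores, then matches each hybrid student to the traditional student with the nearest score (with replacement). A real analysis would use an established statistics package and check covariate balance after matching.

```python
import bisect
import math
import random
from statistics import mean

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, steps=200, lr=1.0):
    """Logistic regression by batch gradient ascent on the mean
    log-likelihood. Returns weights, intercept first."""
    w = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, t in zip(rows, labels):
            err = t - sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        w = [wi + lr * g / n for wi, g in zip(w, grad)]
    return w

def att_by_ps_matching(scores, treated, outcomes):
    """Match each treated unit to the control with the nearest propensity
    score (with replacement); return the mean outcome difference."""
    controls = sorted((s, y) for s, t, y in zip(scores, treated, outcomes) if not t)
    control_scores = [s for s, _ in controls]
    diffs = []
    for s, t, y in zip(scores, treated, outcomes):
        if not t:
            continue
        i = bisect.bisect_left(control_scores, s)
        candidates = [k for k in (i - 1, i) if 0 <= k < len(controls)]
        best = min(candidates, key=lambda k: abs(control_scores[k] - s))
        diffs.append(y - controls[best][1])
    return mean(diffs)

# Hypothetical students: long commutes and online comfort push toward hybrid.
rng = random.Random(1)
rows, treated, outcomes = [], [], []
for _ in range(4000):
    commute, comfort = rng.gauss(0, 1), rng.gauss(0, 1)
    hybrid = 1 if rng.random() < sigmoid(0.8 * commute + 0.8 * comfort) else 0
    # True effect of hybrid is -2.0 points by construction.
    score = 70 - 2.0 * hybrid - 3.0 * commute + 1.0 * comfort + rng.gauss(0, 2)
    rows.append((commute, comfort))
    treated.append(hybrid)
    outcomes.append(score)

naive = (mean(y for y, t in zip(outcomes, treated) if t)
         - mean(y for y, t in zip(outcomes, treated) if not t))
w = fit_logistic(rows, treated)
scores = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for x in rows]
matched = att_by_ps_matching(scores, treated, outcomes)
print(f"naive difference:          {naive:.2f}")
print(f"propensity-score matched:  {matched:.2f}")
```

The matching only works here because both drivers of selection were measured before the choice was made. Drop `commute` from `rows` and the matched estimate should drift back toward the naive one, which is the whole argument for collecting such variables up front.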

Since names help ideas spread, I would like to name this a Prospective Observational Causal Study. Okay, the acronym, POCS, sounds like pox, as in smallpox, and is off-putting. Other suggestions? In the meantime, I’m sticking with POCS.

Taking Cook’s advice to heart, I am joining Ted and his colleagues in a POCS trying to replicate their randomized experiment on hybrid microeconomics. They are running the exact same course set-up one year later—with one difference: students get to choose between hybrid and traditional.

How did we find candidate selection-driving variables? The group brainstormed using past experiences of students. Ted and I informally interviewed students. We all came up with a list of what we thought would matter to Baruch’s urban commuting undergraduates: commute time, hours worked, conscientiousness, online comfort/orientation, in-person orientation, risk aversion, importance of economics to the student’s major, how hard economics seems to them and so on… We designed a survey instrument to measure that stuff. We also added variables we wanted from administrative databases.

Of course, we may not be able to reproduce the randomized experimental results. We were short on time to do this. There might be important stuff neither the interviewed students nor we thought of. And it’s a year later, so the two cohorts may differ, which weakens the comparison with the randomized experiment. Still, I am excited by the opportunity to participate in a POCS. (Yes, POCS can be singular!) I want to advocate for POCS the same way many advocate for randomized field experiments.

We still need all the randomized field experiments, strong quasi experiments and natural experiments we can get. And we definitely need big data (like EHRs) that are likely to contain lots of important drivers of selection. But to avoid using big data badly, fold them into a POCS.


*I thank Tom Cook for a helpful email exchange on this approach when I was writing the second edition of my textbook co-authored with Gregg Van Ryzin. The articles supporting this approach are: here, here, here, here, here. But Tom’s workshop instructions were particularly clear and practical.

Modified slightly August 25, 2014

