Stats #73: Medical journals - The trouble with apples and oranges

Content: This training class will discuss the problems with selection of a control group.

Teaching strategies: Didactic lectures and small group exercises.

Abstract: Finding a good control group is an underappreciated art in research. We often don't notice this until someone makes a stunningly bad choice, leading to the classic difficulty of comparing apples and oranges. In this talk, you will learn what to look for in a control group. You will also see the knots that researchers tie themselves in when they insist on a placebo arm in a birth control study and when they try to evaluate the prognosis of patients who are already dead. You will also see an example where two bad control groups can add up to a good comparison.

Objectives: In this seminar, you will learn how to:

Notes: There are no pre-requisites for this seminar. This class qualifies for one hour of IRB Education Credits (IRBECs).

 

Where can you find this handout?

This handout and the handouts that I use for all of my seminars and training classes are a compilation of individual web pages at www.childrensmercy.org/stats. I use the "Include Page" feature of Microsoft FrontPage to combine these into a single page. You can always find the most recent version of this compilation by going to the web address listed at the bottom of this page. Links for the handouts for other seminars and classes appear at www.childrensmercy.org/stats/training.asp.

Why don't I use PowerPoint?

I stopped using PowerPoint for my presentations in the mid-1990s. This was based on Edward Tufte's advice that presenting information in a paper handout is more effective than presenting the information on a projected screen. I found this to be excellent guidance. I enjoy talking when I don't have to wrestle with a laptop computer. I look at my audience more and interact with them better. I elaborate on this in greater detail at www.childrensmercy.org/stats/weblog2004/powerpoint.asp.


Apples or Oranges?  How do you ensure a fair comparison?

This material is an excerpt from Chapter 1 of my book, Statistical Evidence in Medical Trials, with some minor adaptations and updates.

Introduction

Almost all research involves comparison. Do women who take Tamoxifen have a lower rate of breast cancer recurrence than women who take a placebo? Do left-handed people die at an earlier age than right-handed people? Are men with severe vertex balding more likely to develop heart disease than men with no balding?

In each of these situations, you are making a comparison between a control group and a treatment/exposure group. I will use the terms treatment and exposure interchangeably throughout this book, though I will reserve treatment for those conditions which represent an effort to produce a beneficial result and exposure for a condition that is potentially harmful. You would call drinking water from a natural spring a treatment, but drinking water from a contaminated well an exposure. The distinction between treatment and exposure is not that critical though, and when I discuss a generic ‘treatment’ in this book, feel free to substitute the word ‘exposure’ and vice versa.

 When you make such a comparison between a treatment group and a control group, you want a fair comparison. You want the control group to be identical to the treatment group in all respects, except for the treatment in question. You want an apples-to-apples comparison.

Covariate imbalance

Sometimes, however, you get an unfair comparison, an apples-to-oranges comparison. The control group differs on some important characteristics that might influence the outcome measure. This is known as covariate imbalance. Covariate imbalance is not an insurmountable problem, but it does make a study less authoritative.

Women who take oral contraceptives appear to have a higher risk of cervical cancer. But covariate imbalance might be producing an artificial rise in cancer rates for this group. Women who take oral contraceptives behave, as a group, differently than other women. For example, women who take oral contraceptives have a larger number of pap smears. This is probably because these women visit their doctors more regularly in order to get their prescriptions refilled and therefore have more opportunities to be offered a pap smear. This difference could lead to an increase in the number of detected cancer cases. Perhaps the other women have just as much cancer, but it is more likely to remain undetected.

The possibility that oral contraceptives cause an increase in the risk of cervical cancer is quite complex; a good summary of all the issues involved is available at: www.jhuccp.org/pr/a9/a9chap5.shtml.

There are many other variables that influence the development of cervical cancer: age of first intercourse, number of sexual partners, use of condoms, and smoking habits. If women who take oral contraceptives differ in any of these lifestyle factors, then that might also produce a difference in cervical cancer rates.
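The pap smear argument above can be made concrete with a little arithmetic. The sketch below is only an illustration. The true cancer rate and the two screening fractions are made-up numbers, not figures from any study, and it assumes (unrealistically) that a cancer is detected if and only if the woman is screened.

```python
# Illustration of how unequal screening can mimic a higher cancer risk.
# All numbers are HYPOTHETICAL assumptions chosen to show the mechanism;
# none of them come from a real study.

def detected_rate(true_rate, screened_fraction):
    """Rate of *detected* cancer in a group.

    Assumes a cancer is found only when the woman is screened, and that
    being screened is unrelated to whether she has cancer.
    """
    return true_rate * screened_fraction


TRUE_RATE = 0.010  # assumed to be identical in both groups

users = detected_rate(TRUE_RATE, screened_fraction=0.90)      # more pap smears
non_users = detected_rate(TRUE_RATE, screened_fraction=0.60)  # fewer pap smears

apparent_relative_risk = users / non_users

print(f"Detected rate, oral contraceptive users: {users:.4f}")
print(f"Detected rate, other women:              {non_users:.4f}")
print(f"Apparent relative risk:                  {apparent_relative_risk:.2f}")
```

Under these assumed numbers both groups have exactly the same amount of cancer, yet the detected rates differ by a factor of 1.5. The entire apparent excess comes from the difference in screening.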

Case study: Vitamin C and cancer

Paul Rosenbaum, in the first chapter of his book, Observational Studies, gives a fascinating example of an apples-to-oranges comparison. Ewan Cameron and Linus Pauling published an observational study of Vitamin C as a treatment for advanced cancer (Cameron 1976). For each patient, ten matched controls were selected with the same age, gender, cancer site, and histological tumor type. Patients receiving vitamin C survived four times longer than the controls (p < 0.0001).

Cameron and Pauling minimize the lack of randomization:

Even though no formal process of randomization was carried out in the selection of our two groups, we believe that they come close to representing random subpopulations of the population of terminal cancer patients in the Vale of Leven Hospital.

Ten years later, the Mayo Clinic (Moertel, et al. 1985) conducted a randomized experiment which showed no statistically significant effect of vitamin C. Why did the Cameron and Pauling study differ from the Mayo study?

The first limitation of the Cameron and Pauling study was that all of their patients received vitamin C and were followed prospectively. The control group represented a retrospective chart review. You should be cautious about any comparison of prospective data to retrospective data.

But there was a more important issue. The treatment group represented patients newly diagnosed with terminal cancer. The control group was selected from death certificate records. So this was clearly an apples-to-oranges comparison because the initial prognosis was worse in the control group than in the treatment group. As Rosenbaum says so well:

one can say with total confidence, without reservation or caveat, that the prognosis of the patient who is already dead is not good (p. 4).

The prognosis of a patient with a diagnosis of terminal cancer is also not good, but at least a few of these patients will be misdiagnosed. The ones in the control group, the ones that entered the study clutching their death certificates, had no misdiagnosis.

What steps can you take to ensure a fair (apples to apples) comparison?

When the treatment group is apples and the control group is oranges, you can't make a fair comparison. To ensure that the researchers made an apples to apples comparison, ask the following questions:

Did the authors use randomization? In some studies, the researchers control who gets the new therapy and who gets the standard (control) therapy. When the researchers have this level of control, they almost always will randomize the choice. This type of study, a randomized study, is a very effective and very simple way to prevent covariate imbalance.

If randomization was not done, how were the patients selected? Several alternative approaches are available when the researchers have control of treatment assignment, but minimization is the only credible alternative. When researchers do not have control over treatment assignments, you have an observational study. The three major types of observational studies (cohort designs, case-control designs, and historical controls) all have weaknesses, but may represent the best available approach that is practical and ethical.

Did the authors use matching to prevent covariate imbalance? Matching is a method for selecting subjects that ensures a similar set of patients for the control group. A crossover design represents the ideal form of matching because each subject serves as his or her own control. Stratification ensures that broad demographic groups are equally represented in the treatment and control group.

Did the authors use statistical adjustments to control for covariate imbalance? Covariate adjustment uses statistical methods to try to correct for any existing imbalance. These methods work well, but only on variables that can be measured easily and accurately.
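To see why the first question on this list, randomization, is so effective against covariate imbalance, here is a minimal sketch. The cohort, the 30% smoking rate, and the function names are all hypothetical, and Python's seeded pseudo-random generator merely stands in for whatever allocation procedure a real trial would use.

```python
import random


def randomize(patient_ids, seed=None):
    """Shuffle the patients and split them into two equal-sized arms."""
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# HYPOTHETICAL cohort: 10,000 patients, 30% of whom smoke.
N = 10_000
smokers = {i for i in range(N) if i % 10 < 3}

treatment, control = randomize(range(N), seed=2008)


def smoking_rate(arm):
    """Proportion of smokers among the patients in one arm."""
    return sum(1 for pid in arm if pid in smokers) / len(arm)


# Absolute difference in the proportion of smokers between the two arms.
imbalance = abs(smoking_rate(treatment) - smoking_rate(control))

print(f"Smokers in treatment arm: {smoking_rate(treatment):.3f}")
print(f"Smokers in control arm:   {smoking_rate(control):.3f}")
print(f"Imbalance:                {imbalance:.3f}")
```

Nothing in the code looks at smoking status when assigning patients, yet in a cohort this large the two arms should end up with nearly the same proportion of smokers. The same would be true, on average, of a covariate that nobody thought to measure, which is something matching and statistical adjustment cannot offer.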

This webpage was written by Steve Simon on (date unknown), edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Statistical evidence


The stubborn insistence on placebos (June 29, 2007). Category: Placebos in research

The patients who volunteer for a randomized trial are sacrificing a great deal of autonomy. They are giving up the right to determine which drug they get and they are ceding this authority not to an expert but to a random device. You should not abuse that gift by asking them to participate in a trial where they have a 50% chance of getting a treatment that is known to be inferior. This is especially difficult when one of the choices is a placebo. There is a hot debate about when a placebo arm is ethically acceptable.

This debate is the sharpest when researchers have used a placebo surgery to maintain blinding. In general, it is difficult to blind a surgical trial. If one of the arms of your research study involves a bilateral orchiectomy, sooner or later your patients are going to notice that something is missing.

Sometimes you can blind the comparison between a small incision and large incision by placing a large bandage on every patient and staining the bandage with iodine to hide the lesser blood loss with the small incision.

But perhaps the greatest effort to keep a surgical procedure blinded occurred in a study of fetal cell implants as a treatment for patients with Parkinson's disease. In that study, patients in the control group underwent a surgery where holes were drilled in the skull, but no fetal cells were inserted. There was a flurry of criticism of this trial. My favorite critique has a wonderful title.

Dr. Weijer notes that this intervention required the control patients to undergo the following:

I'm not a doctor, but I suspect that each of these steps carries a non-trivial risk of harm. Can we ethically ask patients to suffer the risk of harm in order to provide information with a greater degree of scientific credibility? The answer is yes in some circumstances, but you have to proceed cautiously. We live in a society where a few individuals are allowed to donate a kidney to a total stranger. If someone asks to do this, of course, you have to proceed cautiously and make them understand what the risks of the surgery are and what the risks are of living the rest of their life with only one kidney.

But there's an equally important consideration. Patients will not volunteer for a study where one or more of the options are perceived by them to be inferior. In the era before AZT became available (prior to 1987), AIDS was considered a death sentence. So when researchers wanted to try to test new therapies, and insisted on a placebo arm, the patients rebelled. They tried to subvert the intent of the trials by doing one of two things. Some patients would get together in small groups and would pool their medication. They would grind up all the pills and then redistribute them. They felt that a half dose of a promising new drug would be a better choice than a 50% chance of getting an ineffective placebo. Other patients would take their first batch of pills to a chemist for analysis. If they found out that they were taking sugar pills, they would drop out and re-enroll under a different name.

You can't blame the patients for this behavior. They are acting in their best interests. In fact, it was largely because of the AIDS crisis that researchers have recognized that the placebo controlled trial is not an absolute requirement in all research studies. There is now general consensus that in a disease that has close to 100% morbidity or mortality, there is no need for a control group at all. Any treatment that helps even a small fraction of patients to survive will stand out clearly against a background rate of 0% survival.



How two bad control groups can add up to one good comparison (June 28, 2007). Category: Observational studies

Many observational studies are criticized (often deservedly) for having a bad control group. If you choose a bad control group, you create an unfair (apples to oranges) comparison. But surprisingly, two control groups, even if both are imperfect, can lead to a strong conclusion.

The trick is to recognize that if one control group has a positive bias (it makes the treatment group look better than it should) and the other one has a negative bias (it makes the treatment group look worse than it should), then these two control groups bracket the ideal control group.

If the treatment group is shown to be significantly better than both control groups, then it has to be significantly better than anything in between the two control groups. Thus it has to be significantly better than the unobserved ideal control group. Thus, two imperfect comparisons can lead to a valid overall comparison.

It works the other way also. If the difference between the treatment group and the negatively biased control group is small, and the difference between the treatment group and the positively biased control group is small, then the biases are small. Another way of thinking about this is that if the treatment group is close to both control groups then it has to be close to anything in between the two control groups.
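The bracketing argument can be written out as a few lines of arithmetic. The sketch below is an illustration only. The function name and the outcome scores are hypothetical, a higher score is assumed to mean a better outcome, and sampling error is ignored entirely so that only the logic of the bracket is on display.

```python
def effect_bounds(treatment, control_pos_bias, control_neg_bias):
    """Bounds on the difference between treatment and the ideal control.

    Assumes the unobserved ideal control lies somewhere between the two
    observed control groups, and ignores sampling error.
    """
    # This control makes the treatment look better than it should.
    diff_overstated = treatment - control_pos_bias
    # This control makes the treatment look worse than it should.
    diff_understated = treatment - control_neg_bias
    return (min(diff_overstated, diff_understated),
            max(diff_overstated, diff_understated))


# HYPOTHETICAL outcome scores (higher is better).

# Treatment beats both controls, so it beats anything in between.
low, high = effect_bounds(treatment=70, control_pos_bias=55, control_neg_bias=62)
print(f"True effect is between {low} and {high}")   # between 8 and 15

# Treatment is close to both controls, so the true effect is near zero.
low, high = effect_bounds(treatment=60, control_pos_bias=59, control_neg_bias=61)
print(f"True effect is between {low} and {high}")   # between -1 and 1
```

In the first case even the pessimistic comparison favors the treatment. In the second case the two comparisons pin the true effect inside a narrow band around zero, which also tells you that neither bias can be large.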

How does this work in practice? Paul Rosenbaum has written extensively about this.

The most interesting example is from Economics, not Medicine, but it is a very instructive example. Economists have debated the impact of minimum wage laws for many decades. Conservative economists have argued that a minimum wage law leads to increased unemployment because it forces businesses to fire people when the mandated minimum wage exceeds the value that certain workers can provide to the business. This impact is felt especially hard among workers with limited skills. Liberal economists have argued that such laws have minimal impact on unemployment and that, because they raise the wages of the least skilled workers, they have a positive effect on the community.

You could look at trends in unemployment before and after the United States government raises the minimum wage, but this is a weak comparison. So many factors influence unemployment and these factors are all shifting in various directions over time. The resulting change in unemployment is biased, and the form of the bias is too complex to be effectively characterized. The problem is that there is no concurrent control group.

It is not an ideal comparison, but you can get a concurrent control group by noting that several states have passed minimum wage laws that are higher than the U.S. minimum wage. You could then compare the change in unemployment in the state raising its minimum wage to the change in unemployment in a neighboring state where the minimum wage remains unchanged. This was actually done in 1992 in New Jersey (the state that raised the minimum wage) and Pennsylvania (the neighboring state with no change in the minimum wage). There was no difference in the change in unemployment, seemingly indicating that there was little or no impact of a minimum wage increase on unemployment.

But wait, you're saying to yourself. Aren't you comparing New Jersey apples to Pennsylvania oranges? And the answer is that we are. It's an imperfect comparison. Furthermore, unless you are a super genius, you probably can't figure out if the Pennsylvania control group produces a positive bias or a negative bias.

Now 1992 was not the only time that New Jersey raised its minimum wage compared to the national wage, but each time the comparison is subject to the same biases. But we are fortunate in that there was a time that Pennsylvania raised its minimum wage. They didn't do it directly, but in 1996, an increase in the minimum wage nationwide caused an increase in Pennsylvania, but no change in New Jersey because the federal increase matched the earlier increase in New Jersey. Now New Jersey is the control group, and the change in unemployment is comparable in New Jersey and Pennsylvania. It's still an imperfect control group because we are comparing Pennsylvania apples with New Jersey oranges.

But wait, how can Pennsylvania be both an apple and an orange? We know that if the comparison PA-NJ has a positive bias, then NJ-PA must have a negative bias. If PA-NJ has a negative bias, then NJ-PA must have a positive bias. If we see the same trivial change in unemployment when the study is positively biased as when the study is negatively biased, we know that the biases are not large enough to produce an artificial effect or to mask a real effect.
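Comparing the change in one state to the change in its neighbor is sometimes called a difference in differences. Here is a minimal sketch of the two comparisons described above. The unemployment rates are invented purely for illustration and are not the actual New Jersey or Pennsylvania figures, and the function name is hypothetical.

```python
def change_vs_control(treated_before, treated_after,
                      control_before, control_after):
    """Change in the state that raised its wage, minus the change in the
    neighboring control state over the same period."""
    return (treated_after - treated_before) - (control_after - control_before)


# HYPOTHETICAL unemployment rates in percent. These are NOT real data.

# 1992: New Jersey raises its minimum wage, Pennsylvania is the control.
nj_1992 = change_vs_control(treated_before=8.0, treated_after=8.4,
                            control_before=7.5, control_after=7.8)

# 1996: Pennsylvania's minimum wage rises, New Jersey is the control.
pa_1996 = change_vs_control(treated_before=5.5, treated_after=5.2,
                            control_before=6.0, control_after=5.8)

print(f"1992, NJ relative to PA: {nj_1992:+.2f} percentage points")
print(f"1996, PA relative to NJ: {pa_1996:+.2f} percentage points")
```

With these made-up numbers both comparisons come out within a tenth of a percentage point of zero. Because the roles of the two states are reversed between 1992 and 1996, whatever bias pushed one comparison in one direction should push the other comparison in the opposite direction, so two near-zero results are hard to explain away as bias.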

There are ways to apply this in a medical setting as well. Suppose that we want to make an intervention, but once we make the change, there is no going back. This might be installing a new computerized system for check-in at a hospital, or implementing a pay-for-performance system at a clinical practice. You can't randomly turn the computer system off on certain days and you can't alternate the pay methods on different paychecks.

The first thought is that you could just measure things before and after the intervention (see figure below).

This is not an awful thing to do, and it is certainly better than doing no evaluation. How often have we made changes in our workplace and never taken the time to see if those changes actually made things better or had no effect? Sometimes, unfortunately, those changes actually make things worse, but we go along, blissfully unaware of the damage we've done. Suppose we have a multicenter trial. The above design looks like.

An immediate and obvious improvement is to intervene in half of the centers and leave the others alone (see below).

Still, you have to worry about whether Sites A, D, and E are New Jersey apples and Sites B, C, and F are Pennsylvania oranges. Another variation (see below) can work even better.

Here each center gets to participate in the intervention (no one is stuck in the boring control group), but because we stagger the intervention, we avoid some of the timing issues. If each site shows an improvement, you can't blame it on the calendar. It would be a very weird set of circumstances that would cause a positive bias at individual sites that would coincide with the six different changeover dates.
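A staggered design like this is easy to lay out as a simple schedule. The sketch below is one possible layout, not the one from the original figure. The function name, the number of time periods, and the alphabetical order of the changeovers are all assumptions, and a real study might well randomize the order in which the sites change over.

```python
def staggered_schedule(sites, n_periods):
    """Return {site: [0 or 1 for each period]}, where 1 means that the
    intervention is in place at that site during that period.

    The site at position k (counting from 0) changes over at period k + 1,
    so every site contributes at least one control period first.
    """
    if n_periods < len(sites) + 1:
        raise ValueError("need at least one more period than sites")
    return {
        site: [1 if period > k else 0 for period in range(n_periods)]
        for k, site in enumerate(sites)
    }


# Six sites, A through F, observed over seven periods (an assumption).
schedule = staggered_schedule(list("ABCDEF"), n_periods=7)

for site, row in schedule.items():
    # '.' marks a control period, 'X' marks an intervention period.
    print(f"Site {site}: " + " ".join("X" if cell else "." for cell in row))
```

Each row starts with control periods and ends with intervention periods, and no two sites change over at the same time. An outside event that happened to help every site would have to strike at six different moments to masquerade as an intervention effect.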
