Stats #44: Things You Need to Know Before Starting a Research Project

Content: This class will introduce you to the statistical issues important in developing a research study. It combines material from classes #32, 42, and 52. This class is useful for anyone who participates in the planning of research. There are no pre-requisites for this class.

Teaching strategies: Didactic lectures and small group exercises.

Objectives: In this class you will learn how to:

This class qualifies for 3 IRB Education Credits (IRBECs).


Overview of the STATS web pages (January 21, 2000)

What are the STATS web pages?

The STATS pages are a collection of handouts that I use in my job as a statistical consultant. The web provides a nice home for these handouts, because as I update my material, the newest version is immediately available to anyone who is interested.

Where can I find STATS?

If you have a web browser, like Internet Explorer or Netscape Navigator, you can surf on over to my site,

http://www.childrensmercy.org/stats

which is also found at http://internet1/stats, if you are attached to the Children's Mercy Hospital network. There are two obsolete sites: http://www.cmh.edu/stats and http://simon/stats. Do not use either of these sites.

Some of the fun stuff you can find on the STATS web pages.

Ask Professor Mean.  For the tough Statistics questions that Dear Abby won't touch.

Planning Your Research Study.  Things you need to plan for before you start collecting your data.

Selecting An Appropriate Sample Size.  How much data do you really need?

Managing Your Research Data.  Everything you want to know before you step to the keyboard.

Steps In a Typical Data Analysis.  I have my data on the computer. Now what?

How to Read a Medical Journal Article.  Reading a journal is hard work. Here's some help.

Professor Mean's Library.  Good books and good web sites about Statistics.

... and even more good stuff!!!

This webpage was written by Steve Simon, edited by Linda Foland, and was last modified on 07/08/2008. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Website details


For CMH employees only: Statistical Consulting Services.

You can get free statistical consulting if you work for Children's Mercy Hospital. Steve Simon and Ashley Sherman provide a wide range of statistical consulting services to help you with your research projects. This help can start as early as the initial planning of your research. We also help with the analysis of your data, using SPSS or other statistical software. We can also provide assistance with the preparation of your presentations and publications.

Here are some examples of the services that we have provided:

Specific statistical advice has been outlined on a series of web pages which can be found at http://www.childrensmercy.org/stats/. The pages provide advice about planning your research, selecting an appropriate sample size, managing your research data, performing a variety of data analyses, presenting research data, and writing research papers.

How to get in touch with a statistician

If you would like to meet with Steve Simon or Ashley Sherman, you can set up an appointment by emailing or calling Judy Champion (jmchampion (at) cmh (dot) edu or 816-983-6784). If you have a very simple question, send an email directly to us (ssimon (at) cmh (dot) edu and aksherman (at) cmh (dot) edu).

This webpage was written by Steve Simon on 2003-04-30, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Directions to my new office (April 25, 2008).

I have moved to a new office. It is a modular building just north of Children's Mercy Hospital. It is between 23rd and 22nd street, just off of Kenwood Avenue (Kenwood is a small north/south street just west of Holmes). If you need to get from your office to mine, here are some directions written by my Administrative Assistant, Judy Champion.

This webpage was written by Steve Simon and was last modified on 2008-07-14. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Where do research ideas come from? by Ronan Conroy (September 20, 1999)

This is an HTML format version of an email by Ronan Conroy on April 9, 1999 to edstat-l, an Internet list and to sci.stat.edu, a USENET group. This email summarized a presentation he made about how to develop ideas for research. I have made some minor formatting changes (mostly the use of bolding, bulleting, and indenting to highlight the major themes), but all of the credit for writing up this summary belongs to Ronan Conroy. Part of this presentation represents a summary of discussions on edstat-l and sci.stat.edu. Here is the acknowledgement in Dr. Conroy's original email.

I'd like to thank the many people who took part in the discussion, or who wrote to me privately, and to stress that the quotes in it are often the person who made the point most memorably, rather than the only person who said it.

Many thanks to Chris Zorn, Gabriele Susinno, Giovanni S. Leonardi, the... inimitable dennis... roberts, Joe Ward, Michael Granaas, Roland Andersson, Jay Warner, Alex Heath, The Anthonys, Bob Frick, Jerry Dallal, Tjen-Sien Lim, Tim Cole, John F. Schnell, Robert C. Knodt, Paul Velleman and Joseph L. McCrary. Did I say Herman Rubin? I did now.

This material is reproduced with permission from Dr. Conroy. For what it's worth, I have included a copy of the original email. It's pretty clear that there was some formatting in Dr. Conroy's original document that got lost when it was translated to a text format for email.

Introduction

This paper tackles one of the questions that statisticians dread most: the most basic one of all. How do you start formulating a research project? It began life as a talk at a research seminar in the Rotunda Hospital, Dublin. Trying to write it up, I decided to mail the statistics lists that I subscribe to. This paper has been greatly enriched by the ideas and discussion generated on edstat-l, the statistics teaching list, as well as contributions from subscribers to the stata list and the UK statistics list allstat. Quotes are often attributed to the person who made the point most memorably, but many of the ideas emerged repeatedly in different postings. I'd like to thank all those who took part in the discussion.

Exploring your environment

The first thing you need to do is identify your resources for research. This is often easier when you first arrive somewhere. After a while you begin seeing an environment as the place where you work or live or eat. You need to see it with a fresh eye to see it as a potential research environment.

Don't forget that your research environment includes not just your patients and your colleagues, but also includes any source of data, ideas or help that you have access to. Many of my own research projects have taken shape because my office is next door to the psychology department; a casual remark has often triggered a flurry of speculation, articles rooted out, contacts mentioned and so on.

The internet is also a valuable environment. Discussion lists abound, which can provide not just free advice but also an insight into current controversies and new directions in research. Simply subscribing to a list and reading the postings (the word for a person who does this is a lurker) without taking part in the discussions will often give you ideas.

General resources

How much time will you be able to devote to research? To what extent can you integrate it into your daily work?

Will colleagues help? For instance, if you need blood taken outside working hours, will the doctor on call oblige? Will nursing staff collaborate by collecting extra information?

Do you have access to a person, unit or department with a specific research interest? They can often be a useful source of ideas. Never underestimate the value of just going for coffee with someone who does a lot of research, or, better, a research team. The speed with which a bunch of researchers can take a vague idea and shape it into a research design is amazing. Most of these ideas go nowhere, but eavesdropping on the process can help you to do it yourself.

Giovanni Leonardi of the Environmental Epidemiology Unit at London School of Hygiene and Tropical Medicine put it like this: "There are many potential research ideas that never make it to becoming research projects, and the likelihood that a research idea will become a research project is heavily influenced by this idea having been selected and refined in an environment where potential ideas are routinely tested for their viability. Think of this as 'natural selection' of research ideas within the research environment."

Do you have access to a statistician, or someone who can advise you on study design and sample size?

What library facilities do you have access to? Skimming journals is a good ideas generator, which I will deal with in more detail later; but access to a good library, including literature searching and reprint ordering facilities, is a must. Add extra points for library staff who are willing to do literature searching with you looking over their shoulder to refine the search.

What computer facilities are available?

What are they funding this year? This sounds like a cynical point, but if there are funds available for research in specific areas, make use of them. What charities are there who might be interested in your research area? Talk to colleagues; there is often no single listing of available research sponsors, and you have to rely on the grapevine.

Specific resources

Do you have access to information already collected which could be the basis for a research project? This information could have been collected as routine clinical information. Although you probably cannot do a research project solely on the contents of patients' charts, routinely collected information may allow you to

Information may also be available as an offshoot of another research project. You may liaise with another research project and

It is a good idea to talk to people who are doing research in the setting in which you work. They will be able to spot potential difficulties in proposals, and may also have useful ideas as to what they would do if they had access to your facilities.

Potential projects

Now all you need to do is to get an idea for a project which will be realistic, given the resources available to you. This is often a stumbling block. I had one person come into the office to discuss a research project with me. 'I have 24 patients with rapid cycling mood disorder' he said. And stopped, waiting for me to say something. The trouble is that 24 patients with rapid cycling mood disorder is no more a research project than 24 trout in a shoebox. What you need to ask yourself is 'what do we not know about rapid cycling mood disorder'?

One very important piece of advice that recurred frequently in the edstat-l discussion was the need to develop many ideas simultaneously. Christie Brown, Assistant Professor of Marketing at University of Michigan Business School tells her students to imagine that inside them they have a large basket of research ideas, some better than others:

'I point students toward Donald Campbell's work on creativity. Campbell suggests one secret to generating better ideas lies in the QUANTITY of ideas generated. In other words you stand the best chance of pulling an idea from the "high" end of your good-idea basket if you make a lot of draws.' (Campbell, Donald T. "Blind variation and selective retentions in creative thought as in other knowledge processes." Psychological Review. 1960;67:380-400.)

Don't focus prematurely on a single idea. Develop a few together. It's like the process of conception: the chances of a child resulting from a single act of sexual intercourse are small. But the chances of a child not resulting from regular sexual intercourse are likewise small. Carry a notebook and write down every idea that you get, good or bad. You will learn from thinking about why the bad ones are bad as well as from why the good ones are good.

Christie Brown again: 'Write down everything. Do not self-censor. Keep a log of your baby-ideas in case they end up being worth pursuing. Get in the habit of generating at least one idea based on everything you read in your domain and even out of it.'

Bob Frick, a cognitive scientist, actually forces students to develop a number of research ideas as a learning exercise. 'The assignment was to come up with three "kernels", and the students had about a month to do it. The notion was that they were supposed to find some original idea they had. It usually ended up being an original observation. Original to them -- it didn't have to be original to the field of psychology. Their original idea would then be a kernel that could be developed into an experiment. Most people have these, but they don't pay attention.'

Extending the ideas of others

Much of the discussion on edstat-l centred around where ideas for research projects come from. The sources of ideas divide into two:

I'll take the easy one first!

Repeating research that has been done by others doesn't sound like an appealing task, but there are several important reasons why it needs to be done, and there are some other benefits too. The reasons why research needs to be replicated include:

Local research is needed to make sure that findings from other countries apply locally. Indeed, basic research is constantly needed to monitor local health needs and to evaluate the services being delivered.

All research needs extension to new contexts and development along an obvious line - Clinical trials are often done on homogeneous, idealised patient groups; they need extension to realistic groups such as those with comorbidity, or beyond the age range of the original research. Think of

Factors which have been identified in a disease may be present in other similar diseases. Since its role in peptic ulcer disease was uncovered, H pylori has been investigated for many other unsolved crimes.

Yes, there is a feeling of a bandwagon rolling along, but someone has to check out these questions.

You may spot an explanation which the original study failed to identify and test. This is, of course, classic 'stroke-of-genius' research. Just remember, though, that the explanations that are most often overlooked are the commonest, most familiar things.

You may not believe a piece of research. Not all research is good research. I have, several times, replicated and extended research because I didn't believe it. Incredible research deserves to be replicated. If you confirm the original findings, you have helped to overcome the resistance that they will find in being accepted. If you fail to confirm the findings, this in itself is interesting. Though try to make sure that the original author isn't asked to review your paper!

Even doing a straightforward replication of a previous study can be a very worthwhile exercise. As a first project, it means that you already have a 'canned' methodology, and you will learn a lot about running, analysing and presenting research. But there are often surprises too.

Chris Zorn of Emory University wrote: 'As a social scientist responsible for training grad students in statistics, one thing that I've always found useful is replications. While the main reason I use replications is to teach students statistics and/or software, these exercises often prompt them to extend the work they are replicating. These can range from the simple (e.g. testing for relationships in the data that the original investigators didn't look for) to the very involved. The result is often interesting, if a bit derivative, research projects, some of which have led to PhD theses, etc.'

Roland Andersson puts it simply: Dig where you stand. That is, make use of all the data that is already at hand and that nobody had time to analyse. Almost always there will be unexpected or unknown patterns in these data that can be detected if you analyse them with an open mind. You do not always need to have a research idea ready when you start. They will come up when you try to formulate an explanation for the patterns that you find in your data.

Alex Heath, an economist from Australia, wrote: A good way to get started thinking about research questions for me is to find things which have been done overseas (usually the US or the UK) and adapt them to Australian data. I find that once you start replicating things you find interesting twists and turns which allow you to say something completely new.

Although I have replicated several studies because I didn't believe them, this probably isn't the best spirit in which to replicate. But neither should you simply accept the original research as scripture. Paul Velleman, the person responsible for the DataDesk statistical package and ActivStats, a statistical teaching package, wrote in praise of an attitude of well-informed skepticism: This misses the most important part of the process -- an abiding skepticism. You must know your science before you can be intelligently skeptical about it, but just because you know what is common wisdom doesn't mean you should believe it. Indeed, if science is to progress, you must maintain a willingness to disbelieve. You don't do research by replicating previous results but by doubting them.

Dennis Roberts, responding to this, said: a good replication study does not have to be done BECAUSE one doubts them but rather, to bolster the case that the research findings made ...

I think that he and Paul really just differ in emphasis, with Dennis arguing that 'replication is very valuable ... we don't do enough of it ... ' while Paul cautions against literal-minded repetition. I think everyone would agree that the scientific idea of replication is doing something more intelligent than just looking for what the other guys already saw.

Paul makes the point, too, that it is hard to sit down and work carefully through a set of data without coming up with at least one pattern that needs further investigation. You may start by replicating a study, but this is almost guaranteed to act as a springboard to innovative questions of your own.

Getting a research idea by reading papers.

You can simply bury yourself in the library with a whole year's worth of your favourite journal and, starting from the most recent issue, use a series of filters to identify studies that you would be interested in and capable of extending. Even when I'm not in need of a research project, I often graze my way through a small stack of journals, picking up an interesting methodological approach here, or a useful measurement technique there. Many of my more prolific colleagues do this a lot. One, in particular, seems able to rummage out a half-a-dozen relevant journal articles from her shelves on any topic in about five minutes.

If I am looking for a potential project, I look at each article in turn and ask:

Getting your own ideas

This is an even harder subject to write about than extending and developing the ideas of others. (Did I say plagiarising? -- Never!). The secret seems to be keeping your eyes and ears open all the time. The observation doesn't have to be complicated. On the contrary, spotting an obvious question in an everyday event often has greater potential.

Jack Schnell of Department of Economics at the University of Alabama in Huntsville remembers simple advice he got as a student: 'look out of the window', meaning 'pay attention to what is happening out there in the world, look for issues that are ripe for investigation'. And since that time I have tried to do just that. For me, this has been more intellectually sustaining than, say, combing through some literature in the hopes of seeing a useful extension.

A simple observation can spark off a whole train of ideas. Roland Andersson, of the Department of Surgery in Joenkoeping, Sweden, said: For me it started like this: I observed that we had had 12 patients with appendicitis during one week. The following weeks we had only one or two. I wondered: 'Had we had an epidemic of appendicitis?'. I happened to know about Knox space-time analysis and I started off from there and finally have written a thesis about 'Appendicitis - epidemiology and diagnosis'. Lots of new questions arise and I am now involved in a (as it seems) never ending project about aspects of appendicitis. (And please, don't worry if you have no idea what Knox space-time analysis is; the important point is that Roland brought together a specialised theoretical framework which he already knew and a common everyday observation. In other words, he applied the theory he knew to the world outside the window.)

But what frame of mind, what view of the world do you need in order to have productive research ideas? A lot of discussion focussed on this question. At one extreme was Robert Hamer, who very much doubted whether you could teach anyone how to look at the world in a questioning manner. I don't think that this is true, though. We are brought up in a way that does not encourage us to question the explanations we are given for things. Don't forget that all children are hungry to find things out, to know why things are so. This voice of hunger for knowledge and delight in figuring things out is much smaller and more timid by the time we have grown up, but with patience it can be called back. It takes time to rid ourselves of this learned uncuriosity.

The trick is doing what children do: asking lots of questions and teasing out the logical consequences of the answers. Paul Velleman again: "Dennis is right that the problem is nudging the mind. We need to start that process in childhood. We must cultivate in our children and our students a broad-based skepticism coupled with a sense that there *is* order in the universe."

These are the sorts of questions that scientists and other children ask.

One must maintain an active and abiding skepticism about the explanations and models that have been proposed in science. Skepticism, which Paul Velleman identifies as a key attitude, doesn't involve simple disbelief, but rather being able to entertain a number of different explanations at once.

This struck a chord with Robert Knodt: After being involved with masters and doctoral students for over thirty years and looking back for an answer to the original post, I find that the statement above applied to over 90% of those I helped... The first person I worked with was bothered by a statement in a 10th grade Biology book which said that trees were pruned in the fall in order to make them fill out areas and become more symmetrical. This still bothered him eight years later. He finally did his work on 'wound' hormones in plants.

Says who? Many pieces of medical knowledge are folkloric, and the evidence is slender. In particular

I don't believe that! Always trust your disbelief. Often a trip to the library will put your mind at rest, but think about

Why are we doing this? At every point in clinical practice there are decision forks. Some may be invisible (we always do X when Y happens) but these are the most interesting! For example

Why are they both right? Some disagreements in the literature are because no-one has yet spotted the reason why two different sets of investigators should have observed data that were seemingly contradictory.

Can we learn from the abnormal? We learn once from describing the normal--normal course of disease, normal range of variation etc. We learn a second time by examining cases that do not fit the general picture. Rare, pathological conditions can give us an insight into how more subtle, commonplace processes work.

Final thoughts

I don't know where ideas come from, but I do know that you get more ideas if you try to remember everything that happens that doesn't have a good explanation. I carry a little black notebook which can simply be used to note phone numbers and things I have to buy next time I go shopping, but it also means that I have a way of writing down an idea the instant I spot something interesting.

The last thing I want to say is based on my experiences teaching music to adolescents, as much as teaching research methods to medical students. The biggest obstacle you encounter is a feeling that you can't do this; that you aren't the sort of person who can sing, or make interesting observations or pose original questions. Just remember: this is what you did as a child, before you were taught any different. So you already know how to do this; just think of yourself as a little rusty.

The copyright for this page belongs to Ronan Conroy. This page was formatted by Steve Simon and was last modified on 2008-04-28. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page.


Developing a research hypothesis (August 18, 1999)

Dear Professor Mean, I want to do some research, but the hospital won't approve anything until I have a protocol with a research hypothesis. I'm not sure why a research hypothesis is important. Can you help? -- Little Linda

Dear Little,

Think of it as job security for your local statistician.

Short answer

A research hypothesis provides clarity. A problem has to be stated clearly before it can be solved. The research hypothesis will also provide direction for writing the rest of your protocol.

There are several steps that you should follow:

  1. Identify the four components that most research hypotheses have.
  2. Select between a one sided and a two sided hypothesis.
  3. Use your hypothesis to guide the writing of your research protocol.

Stating a hypothesis

Ideally, your research hypothesis should be specified prior to the collection of any data. An exception would be an exploratory study. For example, if you are investigating the cause of poor morale among health care providers, you may not have enough information to specify anything more specific than a whole range of factors that might influence morale.

In general, a hypothesis will have four major components. Not every hypothesis can be fit into this framework, of course, but knowledge of these four components might help you if you have an incompletely formed hypothesis.

The first component is the subject group. In other words, who are you interested in studying? Subjects could be patients, their parents, or the health care providers.

The second component is the treatment or exposure. In other words, what is being done to part or all of your subject group? A treatment implies an action on your part, such as providing information or applying a new therapy. An exposure, on the other hand, implies some action that you do not control, such as lead poisoning or premature birth.

The third component is the outcome measure. In other words, how is the effect of the treatment or exposure going to be assessed? It is very important that the outcome measure be defined precisely and unambiguously. For example, if your outcome is breast feeding rates, you should use standard definitions of breast feeding, such as those provided by the World Health Organization.

The fourth component is the control group. In other words, who are you comparing to? It is important for the control group to be as similar as possible to those who receive a treatment or exposure.

As mentioned earlier, not every research hypothesis will have all four components. For example, a cross-over design involves applying both a new treatment and a standard treatment using the same patients. For this study, the hypothesis would not involve a separate control group. Correlational studies look at relationships within a single group, such as a study of the factors that cause medication errors. This type of study would not have a treatment/exposure. The structure mentioned here, however, is still useful for developing most research hypotheses.

One sided versus two sided hypotheses

During the planning of your research, you need to specify whether you plan to use a one sided or two sided hypothesis. A two sided hypothesis states that there is a difference between the treatment/exposure group and the control group, but does not specify in advance what direction you think this difference will be. A one sided hypothesis states a specific direction (e.g., increase).

If you expect that a change in either direction is possible and that changes in either direction are interesting, then you should use a two sided hypothesis.

If changes in one direction are uninteresting and unpublishable, then use a one sided hypothesis. Also if a change in the unexpected direction is equivalent in practice to no change, then use a one sided hypothesis.

The best example of this is when you are comparing a new therapy to an existing therapy and the new therapy is much more expensive. Your only concern is to show that the new therapy is better. If it turns out that the new therapy is equal to or worse than the standard therapy, you will not adopt it.
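To see what the choice between one sided and two sided costs you in practice, here is a minimal sketch in Python. It assumes your test produces a z statistic that is approximately standard normal, and that positive values mean the new therapy looks better. The function name p_values and the value z = 1.8 are made up for illustration; your actual analysis may use a different test entirely.

```python
from statistics import NormalDist

def p_values(z):
    """Return (two_sided, one_sided_upper) p-values for a z statistic.

    Assumes z is approximately standard normal under the null hypothesis
    and that the one sided alternative is "the treatment group is higher".
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    one_sided_upper = 1.0 - std_normal.cdf(z)
    two_sided = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return two_sided, one_sided_upper

# Hypothetical test statistic, chosen only for illustration.
two_sided, one_sided = p_values(1.8)
print(f"two sided p = {two_sided:.3f}, one sided p = {one_sided:.3f}")
```

With z = 1.8, the two sided p-value is roughly 0.07 and the one sided p-value is roughly 0.04, so the same data can land on different sides of a 0.05 cutoff. That is exactly why the choice has to be made during planning and not after you have peeked at the data.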

Some important issues involving the control group

With a treatment, where you intervene, it is often possible to select those patients who receive the treatment through the use of randomization. Randomization ensures comparability, because the random selection ensures that, on average, subjects who receive the treatment will be comparable to subjects who do not receive the treatment.
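If you have never seen a randomization list being built, the following sketch may help. It is a bare-bones simple randomization, assuming you already have a list of subject identifiers and want two groups of nearly equal size. The function name, the seed, and the subject numbers are all hypothetical, and a real trial would usually add refinements (such as blocking or stratification) that are not shown here.

```python
import random

def randomize(subject_ids, seed=None):
    """Randomly split subjects into treatment and control groups.

    The groups differ in size by at most one subject. Passing a seed
    makes the allocation reproducible, which helps with documentation.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }

# Twenty hypothetical subjects, numbered 1 through 20.
groups = randomize(range(1, 21), seed=2008)
print(groups)
```

Because the computer makes the assignment, nobody's judgement about which patients "ought" to get the new therapy can creep in. That is what makes the two groups comparable on average.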

When you have an exposure instead, it is often difficult to ensure that the subjects without the exposure are comparable to the exposed subjects. Sometimes matching will help, but you should only use matching for very important prognostic variables. For example, birth weight plays a major role in infant mortality, so it is often helpful to match your exposure group to your control group on the basis of birth weight. Matching, however, will often present difficult logistics, especially when the pool of control subjects is not much larger than the pool of exposed subjects.
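Here is one way matching on birth weight might be sketched in Python. This is a simple greedy one-to-one match, assuming each subject has a single birth weight in grams and that a control is acceptable only if it falls within a tolerance (a "caliper") of the exposed subject. The 250 gram caliper, the function name, and the subject data are invented for illustration. Other matching schemes exist and may suit your study better.

```python
def match_on_birth_weight(exposed, controls, caliper_grams=250):
    """Greedy 1:1 matching of exposed subjects to controls on birth weight.

    exposed and controls map a subject id to a birth weight in grams.
    Each control is used at most once. Exposed subjects with no unused
    control within the caliper are returned as unmatched.
    """
    available = dict(controls)  # copy, so the caller's data is untouched
    pairs = []
    unmatched = []
    for exp_id, exp_weight in exposed.items():
        if not available:
            unmatched.append(exp_id)
            continue
        # Closest remaining control by absolute difference in weight.
        best_id = min(available, key=lambda cid: abs(available[cid] - exp_weight))
        if abs(available[best_id] - exp_weight) <= caliper_grams:
            pairs.append((exp_id, best_id))
            del available[best_id]
        else:
            unmatched.append(exp_id)
    return pairs, unmatched

# Hypothetical subjects and birth weights (grams).
exposed = {"E1": 2100, "E2": 3400, "E3": 1500}
controls = {"C1": 2000, "C2": 3350, "C3": 3600, "C4": 2900}
pairs, unmatched = match_on_birth_weight(exposed, controls)
print("matched pairs:", pairs)
print("unmatched exposed subjects:", unmatched)
```

In this made-up example the lightest exposed infant (E3) finds no control close enough, which illustrates the logistical problem: with a small control pool, some exposed subjects simply cannot be matched.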

What are your next steps?

Other important issues to be considered in your protocol are

  1. determination of the sample size,
  2. identification of potential confounding variables, and
  3. what efforts at blinding will be used, if any.

Once you have a well defined research hypothesis, though, these details will fall into place. Hah, hah, did I really say that? The rest of the protocol is still pretty darn hard, but it would have been impossible if you didn't have that research hypothesis.

To determine an appropriate sample size, you need a research hypothesis, an estimate of the standard deviation of your outcome measure, and an assessment of how much change is considered clinically relevant. Hey, you're already a third of the way there! Finding a standard deviation requires either reviewing previous research on that outcome measure or running a pilot study. The clinically relevant difference is a judgement that is made solely on medical knowledge. Your statistician cannot tell you what a clinically relevant difference would be.
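To show how these three ingredients fit together, here is a rough sketch of one common textbook calculation: the sample size per group for comparing two means, using a normal approximation. It is only one of many sample size formulas and may not match your design. The significance level of 0.05 and the power of 80% are conventional defaults that I am assuming here, and the standard deviation of 10 and clinically relevant difference of 5 are invented numbers.

```python
import math
from statistics import NormalDist

def sample_size_per_group(sd, difference, alpha=0.05, power=0.80, two_sided=True):
    """Approximate subjects needed per group to compare two means.

    Uses the normal approximation n = 2 * ((z_alpha + z_beta) * sd / difference)**2,
    where sd is the standard deviation of the outcome measure and
    difference is the clinically relevant difference.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sd / difference) ** 2
    return math.ceil(n)  # round up, since you cannot recruit part of a subject

# Hypothetical inputs: standard deviation 10, clinically relevant difference 5.
print(sample_size_per_group(sd=10, difference=5))
print(sample_size_per_group(sd=10, difference=5, two_sided=False))
```

Notice that the research hypothesis shows up twice: it tells you which outcome measure the standard deviation refers to, and whether the one sided or two sided version of the calculation applies. A one sided hypothesis needs fewer subjects, and halving the clinically relevant difference roughly quadruples the sample size.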

Confounding variables are those variables which are related to your outcome measure and which may differ between your treatment/exposure group and your control group. Assessment of potential confounding variables is especially important when you cannot randomize.

Blinding means hiding information about the treatment/exposure from the patients, their parents, and any health care professional who interacts with the patients and their parents. Blinding is useful when it can be done, but blinding is not always possible. For example, in a comparison of a drug that is rectally administered to oral administration, the patient usually figures out quickly which group they are in. But even when the patients themselves know which group they are assigned to, you can sometimes still use blinding for laboratory personnel and for interviewers.

Summary

Little Linda needs to include a research hypothesis in her grant proposal, but doesn't know what it should say. Professor Mean explains that you should develop a hypothesis to give your research clarity. There are four components in most research hypotheses:

  1. a subject group,
  2. a treatment or exposure,
  3. an outcome measure, and
  4. a control or comparison group.

Other important issues to keep in mind while developing a research hypothesis:

  1. Use a one sided hypothesis when changes in the opposite direction are uninteresting.
  2. Randomization helps ensure that you have a comparable control group.
  3. Use the research hypothesis to guide the determination of sample size, the identification of confounding variables, and the efforts to blind information.

Annotated Bibliography

http://www.shef.ac.uk/~scharr/reswce/question.htm

This site provides information about evidence-based medicine, but much of the material is still relevant to developing research protocols. The four components to a research hypothesis come from this site.

Massey, V.H. (1995) Nursing Research, Second Edition, Springhouse PA: Springhouse Corporation

This book provides a "how to" framework for conducting research in an easy to skim outline format. The book includes topics on ethics, literature review, sampling techniques, data analysis, and presentation of research results. The sections that deal with planning are the best parts of this book.

Lang, T.A. and Secic, M. (1997) How to Report Statistics in Medicine. Annotated Guidelines for Authors, Editors, and Reviewers, Philadelphia, PA: American College of Physicians.

It seems ironic to recommend a book on writing the final results, but it helps to start out with your goal in mind. If you think about the information that belongs in your research paper, then you will have a good idea of what you need to specify during the planning stages of your research. This book also uses an easy to skim outline format, but it has significant narrative text under each outline element.

This webpage was written by Steve Simon on 2008-xx-xx, edited by Steve Simon, and was last modified on 2008-07-14. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Ask Professor Mean, Category: Grant writing


Getting IRB approval for your research (October 9, 2002)

Dear Professor Mean: I am submitting a proposal to our Institutional Review Board. Is there anything you can do to help me get IRB approval? --Terrified Terri

Dear Terrified:

Why not bring a freshly baked batch of chocolate chip cookies to the IRB meeting? I'd be glad to sample the batch first to make sure it tastes okay.

Disclaimer

In a perfect world, everyone would listen when Professor Mean talks and they would decide things exactly the way he would. Alas, it's not a perfect world. Our IRB here at Children's Mercy Hospital uses criteria that differ from the guidance I give below, and your IRB probably does also. I'm working with our IRB to better understand the criteria they use and when I get a better understanding, I'll update these web pages accordingly.

But don't try the PMSS defense: You should approve this protocol because Professor Mean Said So. Sadly, it does not work.

By the way, if you serve on an IRB, I'd love some feedback from you on how your IRB assesses scientific validity.

Short answer

The IRB does look at a variety of issues, but the one with particular relevance to statistics is whether the study has scientific validity. It is unethical to expose research subjects to any risks, discomforts, or inconveniences if the study has dubious validity. The Declaration of Helsinki states

Medical research involving human subjects should only be conducted if the importance of the objective outweighs the inherent risks and burdens to the subject. www.wma.net/e/policy/17-c_e.html 

Justification for scientific validity also appears in the Nuremberg Code.

The experiment should be such as to yield fruitful results for the good of society, unprocurable by other methods or means of study, and not random and unnecessary in nature. ohsr.od.nih.gov/nuremberg.php3

Good statistical design can touch on several aspects of scientific validity:

  1. Is your sample chosen appropriately?
  2. Is your sample size large enough?
  3. Are you measuring things well?
  4. Do you have a good plan for analysis of the data?

Make sure that you provide enough documentation in your proposal to convince the IRB that the answer is YES! to all these questions.

Is your sample chosen appropriately?

Who you choose to participate in your research study will say a lot about how easily you can generalize your results to the real world. No sample is perfect, and even just the process of asking for informed consent can hurt generalizability.

If you randomly select subjects and/or randomly assign them to treatment and control, that's good. But more important is the pool of subjects that you are drawing your sample from. Ideally, your pool of subjects should include the full spectrum of the rainbow. In practice, logistical constraints make this ideal impossible.

Watch out when you select subjects only when your research coordinator is on the clock, or only from a tertiary care center. These are examples where you may not have much success in extrapolating your findings to a more general group of patients. You can't generalize to all fruit when your sample is restricted to apples.

Sometimes there are hidden restrictions on your sample. Some studies may implicitly exclude patients if they:

The logistics of your research and limitations on your time and trouble may also place restrictions on your sample by excluding patients who arrive on weekends and evenings.

Sometimes these restrictions are trivial and sometimes not. It's best to acknowledge these implicit restrictions and be honest about the extent to which they hurt your ability to generalize.

Also, you need to be very careful about selecting your control group. The control group needs to be identical to the treatment group, except for the therapy or exposure being studied. If the control group differs on other factors, especially factors that affect prognosis, then you have problems. You need to control for these other factors, through randomization, matching, or covariate adjustment.

Is your sample size large enough?

The size of your sample plays a vital role in scientific validity. You can't ignore this issue. Every single research study, no matter what the type, should have an explicit justification of the sample size. Virtually every research area has identified and documented problems with inappropriate sample sizes. Failure to consider sample size represents one of the biggest problems with research today.

With a small sample size, you may not have enough precision to make any useful statements about your research data. This is a waste of research dollars, but it is also unethical. An inadequate sample size needlessly puts subjects at risk without any benefit to society.

The opposite problem can also occur. Some research studies include too many research subjects, but this problem is rarer. Including too many research subjects is also a waste of research money and it is also unethical. You are exposing more patients to the risks, discomforts, and inconveniences of the research study than you need to make precise statements about your data.

The justification of your sample size could take the form of a power calculation, if you have a formal research hypothesis. If your study will produce some simple descriptive statistics, then you should show that the confidence limits about these statistics will be reasonably narrow. Even if your study has a non-quantitative objective, you should still justify your sample size, possibly using a non-quantitative criterion.
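For a descriptive study, one way to show that the confidence limits will be "reasonably narrow" is to compute the expected half-width of the interval at several candidate sample sizes. The sketch below assumes a normal-approximation interval for a single proportion. The anticipated proportion of 30% and the candidate sample sizes are made-up planning values, not numbers from any real study.

```python
import math
from statistics import NormalDist

def proportion_ci_half_width(p, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a proportion.

    p is the anticipated proportion and n is the planned sample size.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical planning values: an anticipated proportion of 30%.
for n in (25, 100, 400):
    half_width = proportion_ci_half_width(0.30, n)
    print(f"n = {n:3d}: 95% CI is roughly 0.30 +/- {half_width:.3f}")
```

Notice that you have to quadruple the sample size to cut the width of the interval in half. The normal approximation gets shaky when the proportion is close to 0% or 100%, so treat this as a rough planning tool only.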

There are many complex formulas for determining sample size; here is some general advice.

First you need to think about the size of the difference you are trying to detect and compare that to the standard deviation of your outcome measurement. If you are trying to detect differences that are small relative to your standard deviation, then you need a very large sample size. Detecting a difference that is about one fifth of a standard deviation, for example, might require a sample size in the hundreds.

If you are trying to detect a difference that is very large relative to your standard deviation, then you can get by with a smaller sample size. Detecting a difference that is about the same size as a standard deviation would only require a few dozen subjects.
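To put rough numbers on this advice, here is a minimal sketch that uses the common normal-approximation formula for comparing two group means. It assumes a two-sided test at an alpha level of 5% and a target power of 80%. Those defaults are my assumptions for illustration, and an exact calculation based on the t distribution would give slightly larger answers.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided, two-group comparison of means.

    effect_size is the difference you want to detect divided by the
    standard deviation of the outcome measure.
    Formula: n = 2 * (z_alpha + z_power)^2 / effect_size^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

for d in (0.2, 0.5, 1.0):
    print(f"difference of {d} standard deviations: about {n_per_group(d)} per group")
```

Under these assumptions, a difference of one fifth of a standard deviation needs several hundred subjects per group, while a difference of a full standard deviation needs fewer than twenty per group.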

Be careful! You might be tempted to say that you are only looking for differences that are large relative to the standard deviation, but you may end up painting yourself into a corner. If you suspect that your control group is a full standard deviation or more away from the treatment group, then this difference is one that would be so large as to be visibly different.

For example, Jacob Cohen points out that 13 year old girls and 18 year old girls differ in average height by about 0.8 standard deviations. He also mentions that Ph.D. holders and college freshmen differ in average IQ by about the same amount. Do you really believe that your study will show such a large difference?

Second, when you are counting discrete events like deaths, it is the number of these events, not the total number of subjects studied, that determines the precision of your results. When the events are very rare, this means that you have to sample a large number of patients in order to accumulate enough events.

As a very rough guide, you should strive for at least 25 to 50 events per group. If your event occurs only 1% of the time, that means that you might need as many as 5,000 patients per group. If an event occurs one fourth of the time, you might be able to get by with one or two hundred patients per group.

Event rate    Recommended sample size (patients per group)
25%           100 to 200
5%            500 to 1,000
1%            2,500 to 5,000
0.2%          12,500 to 25,000
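The arithmetic behind this table is just the number of events you want divided by the event rate. Here is a small sketch that reproduces it. The function name is my own, and the 25 to 50 events per group is the rough guide given above, not a hard rule.

```python
import math
from fractions import Fraction

def patients_per_group(event_rate, events_needed):
    """Patients per group needed to expect a given number of events.

    event_rate is a proportion (0.01 means 1%). Fractions keep the
    division exact so that rounding up never overshoots.
    """
    rate = Fraction(str(event_rate))
    return math.ceil(events_needed / rate)

for rate in (0.25, 0.05, 0.01, 0.002):
    low = patients_per_group(rate, 25)
    high = patients_per_group(rate, 50)
    print(f"event rate {rate:.1%}: {low:,} to {high:,} patients per group")
```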

Finally, if the sample size you need is unattainable--you don't have the budget, perhaps, or the study would take too long--then consider redesigning your experiment. Find a way to reduce the variability of your outcome measure. A cross-over design, for example, will usually have much less variability because each patient serves as his/her own control. Sometimes intermediate measurements (often called surrogate measurements) will improve your sensitivity enough so you can attain a reasonable amount of precision with a limited sample size.

Sometimes research will have a qualitative rather than quantitative goal. We might be interested, for example, in the issues that children with sickle cell disease face, or teenagers' reasons for starting to smoke cigarettes. For qualitative studies, there is no mathematical formula that you can apply to justify your sample size.

The sample size needs to be large enough to ensure a rich and complete set of responses. Look for a sample size large enough to ensure that both ends of the spectrum (and the middle) are represented. If the population you are studying is very homogeneous, then as few as a dozen patients may be enough. You may also wish to depart from random sampling and use a purposive sample instead, which can also help you justify a small sample size. A purposive sample deliberately looks for patients with certain characteristics and can ensure that you have included all relevant viewpoints and perspectives in your study.

Another way to assess the sample size is by saturation. Saturation occurs when the same themes get repeated over and over and no new ideas are generated.

Are you measuring things well?

There are a lot of scientific issues that I can't answer here. Is arterial distensibility a good marker of heart disease? What is the best way to determine gestational age? Should you measure blood pressure in the left arm or the right arm?

I can, however, ask some questions that will help you determine whether your measures are clinically relevant.

Is your measure valid and reliable?

Every discipline has slightly different definitions and standards for validity and reliability. As a general rule, the issues of validity and reliability become most important when you are measuring something abstract, like stress, or something subjective, like quality of life.

The easiest way to ensure validity and reliability is to use measures that have already been established in the peer reviewed literature. You can also hedge your bets by including several measures of the same outcome.

If you have concerns about validity and reliability, you might reserve a fraction of your sample (from 5% to 20% is a good starting point) for more thorough analysis. These patients might receive additional tests to verify that your simple outcome measure actually works well. Or you might have these patients evaluated by two different people and measure the level of agreement.
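If you do have a subset of patients evaluated by two different people, you will need some way to summarize the level of agreement. One common choice (my suggestion here, and not the only option) is Cohen's kappa, which adjusts the observed agreement for the agreement you would expect by chance alone. The ratings below are invented purely for illustration.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who each classify the same patients.

    Returns 1.0 for perfect agreement and 0.0 for agreement no better
    than chance. Undefined (division by zero) if both raters put every
    patient in the same single category.
    """
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of ten patients by two evaluators.
rater_1 = ["yes", "yes", "no", "no", "yes", "no", "no", "yes", "no", "no"]
rater_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
print(f"kappa = {cohen_kappa(rater_1, rater_2):.2f}")
```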

Be cautious about sources of information that are known to be imperfect. For example, in a study of 295 deaths from child maltreatment, only half were identified as such on the death certificates. The gender of the child, whether the perpetrator was a parent, and whether the child died in a rural or urban county, had a differential impact on ascertainment.

Do you define all your terms objectively?

Research must be repeatable, so you need to use terms that are defined well enough so that another expert could reproduce your work and come up with roughly comparable findings.

You need to provide operational definitions for any events that are subject to differing interpretations. For example, the Scottish Intercollegiate Guidelines Network defines life threatening asthma as:

"Features of life threatening asthma include agitation, altered level of consciousness, fatigue, exhaustion, cyanosis, and bradycardia. Air entry is often greatly reduced, which may lead to a 'silent chest'. The peak flow, if recordable, is usually less than 33% of best or predicted."  www.sign.ac.uk/guidelines/fulltext/38/section2.html

Up to 1992, the National Center for Health Statistics defined current and former smokers by asking the following two questions:

"Have you ever smoked 100 cigarettes in your lifetime?"
"Do you smoke now?"
www.cdc.gov/nchs/datawh/nchsdefs/currentsmoker.htm

The Social Security Administration defines blindness as:

"when your vision cannot be corrected to better than 20/200 in your better eye, or if your visual field is 20 degrees or less, even with corrective lens. Many people who meet the legal definition of blindness still have some sight and may be able to read large print and get around without a cane or guide dog." www.rcep7.org/socialsecurity/faq/blind/default.html

Is your outcome important to your patients?

Patients are usually interested in one of three things: morbidity (will I develop diabetes?), mortality (will I die?), or quality of life (will I be able to lift and carry a bag of groceries?). Ideally, you should try to measure one or more of these things directly. If you can't measure them directly, then does your indirect measurement (sometimes called a surrogate measurement) have a strong link with morbidity, mortality, or quality of life?

Also, are you focusing on a short term outcome because of your convenience, when your patients are most interested in long term outcomes? It is easy to get someone to quit smoking for a week, but it is much harder to get them to persist through a full year.

Do you have a good plan for analysis of the data?

It is important to have a plan. If you don't tell the IRB what you expect to do with your data, they won't be able to decide if the goal of your research is worth the risks, discomforts, and inconveniences of the patients in the study.

This does not have to be very detailed. If all you want to do is a descriptive study where you estimate a few means and proportions, then that's all you need to say. A lot of very valuable research does nothing more than this. Here's an example:

In this research study, we will study children with severe hearing loss in order to estimate the proportion who lose a hearing aid, and the average expense associated with these losses.

It's a myth that all research requires a hypothesis specified prior to the collection of the data. Most (but not all) qualitative research lacks a formal hypothesis. A descriptive study like the one described above does not have a research hypothesis. Some other examples of research without a formal hypothesis include:

You can sometimes artificially contrive a hypothesis in these situations, but it is usually better to explicitly state that you don't have a research hypothesis. Instead identify the alternative goal you are trying to achieve or the question you are trying to answer. For example:

There is no research hypothesis for this pilot study. Our goal instead is to identify ambiguous language, missing categories, and other problems with the patient satisfaction questionnaire.

If you are testing a hypothesis, you need to specify that hypothesis as well as how you will test that hypothesis. This may appear difficult to you, but if you don't muck this up too badly, the IRB will probably give you a pass. You need to show enough detail so you don't appear totally incompetent.

If your data analysis plan is bad, it can still be fixed after the data are collected. In contrast, if you have a lousy control group or your sample size is grossly inadequate, you need to do something before you start collecting data.

So don't worry about the details too much. If you specify a Mann-Whitney test and you really needed to use a Kruskal-Wallis test instead, the IRB will probably still approve your study contingent on fixing that detail. Still, there are some statistical details that you need to worry about.

Specify what your alpha level is (usually 5%) and whether your hypothesis is one-sided or two-sided. A one-sided test looks at changes in a single direction. Changes in the opposite direction are considered either impossible or irrelevant. One-sided tests are often used when changes in the opposite direction would have the same implications as a null finding. For example, we might find that a new drug is equivalent to a placebo, or that it performs worse than a placebo. We would refuse to adopt the drug in either situation. So comparisons to a placebo are usually one-sided.

Contrast this with comparing a standard drug to a new drug. If the new drug performs worse, we would never use it, but if it is equivalent, then we would use it part of the time based on other factors like cost, convenience, and patient preference. Comparisons of two active drugs are usually two-sided. This might change, however, if the side effect profile of one drug is so harsh that you would only prescribe it when it is superior.
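The choice between one-sided and two-sided matters because the same data can lead to different conclusions. Here is a small sketch that assumes a test statistic with a standard normal distribution, where a positive value points in the direction you care about. The test statistic of 1.8 is a made-up value for illustration.

```python
from statistics import NormalDist

def p_value(z, one_sided=False):
    """P-value for a standard normal test statistic z.

    For the one-sided test, only large positive values of z count
    as evidence against the null hypothesis.
    """
    if one_sided:
        return 1 - NormalDist().cdf(z)
    return 2 * (1 - NormalDist().cdf(abs(z)))

z = 1.8  # hypothetical test statistic
print(f"one-sided p-value: {p_value(z, one_sided=True):.3f}")
print(f"two-sided p-value: {p_value(z):.3f}")
```

With this test statistic, the one-sided p-value falls below 5% and the two-sided p-value does not. That is why you need to commit to one or the other before you see the data.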

Further reading

  1. Assert: A standard for the review and monitoring of randomized clinical trials. Howard Mann. (Accessed on October 14, 2002). http://www.assert-statement.org/ Excerpt: "The ASSERT statement is the articulation of A Standard for the Scientific and Ethical Review of Trials. It proposes a structured approach whereby research ethics committees review proposals for, and monitor the conduct of, randomized controlled clinical trials. In order to ensure the ethical conduct of research involving human subjects, the ASSERT checklist comprises items that need to be addressed by investigators applying for approval to conduct a clinical trial. These items are chosen to enable fulfillment of certain universally applicable requirements for the ethical conduct of research: social and scientific value; scientific validity; fair subject selection; favorable risk-benefit ratio; and respect for potential and enrolled subjects."
  2. Content and quality of 2000 controlled trials in schizophrenia over 50 years. Thornley B and Adams C. British Medical Journal 1998:317(7167);1181-1184. [Abstract] [Full text] [PDF]
  3. Underascertainment of Child Maltreatment Fatalities by Death Certificates, 1990-1998. Crume TL, DiGuiseppi C, Byers T, Sirotnak AP and Garrett CJ. Pediatrics 2002:110(2);e18. [Abstract] [PDF]

Very bad joke: How many IRB members does it take to screw in a light bulb?

As documented in 45 CFR 46.107(a), this review board must consist of five (5) or more members, and at least one of these members must possess a background in Electrical Engineering. In addition, at least one of the members must come from a home without any electricity. Any member of the IRB who owns stock in an electrical utility or who regularly pays bills to an electrical utility should recuse themselves from participation in the review of this research.

If the bulb should burn too brightly, burn too dimly, or flicker, then an adverse event report should be sent to the IRB (21 CFR 312.32). If the light bulb is dropped, then a serious adverse event report should be sent to the FDA by telephone or by facsimile transmission no later than seven (7) calendar days after the sponsor's initial receipt of the information.

If this is a multi-center light bulb trial, then a data and safety monitoring board (DSMB) may be needed (NIH Policy for Data and Safety Monitoring, June 10, 1998, http://grants.nih.gov/grants/guide/notice-files/not98-084.html, accessed on October 9, 2002). The DSMB should review any adverse event reports and interim results. If the clinical equipoise of the light bulb is lost, then the DSMB should terminate the study and provide all previously recruited light bulbs with the best available light bulb socket.

In order to maintain scientific integrity, the use of a placebo socket may be necessary. The placebo socket should have the same taste, appearance, and smell of a regular socket and the fact that this socket has no electricity should be hidden from the light bulb and from the person screwing in the light bulb. According to the 2000 revision of the Declaration of Helsinki, paragraph 29, the use of placebo sockets is acceptable where no proven prophylactic, diagnostic, or therapeutic socket exists.

A systematic review of all previous research into light bulbs must be presented so that the IRB can determine, per 45 CFR 46.111(a)(2), that the risks to the light bulb are reasonable in relation to anticipated benefits. The IRB should also ensure that the selection of light bulbs is equitable (45 CFR 46.111(a)(3)). If the light bulb has less than 18 watts of power, then additional requirements (45 CFR 46.401 through 409) apply.

The IRB must ensure that an informed consent document be prepared in language that the light bulb understands (45 CFR 46.116). This document should explain the expected duration of the light bulb's participation in the research, any reasonably foreseeable risks, and the extent to which the confidentiality of the light bulb will be maintained. This document should also emphasize that participation is voluntary and the light bulb can withdraw itself from the socket at any time without any penalty or loss of benefits.

The clipart on this page was courtesy of the clipsahoy web site: http://www.clipsahoy.com/index2.html. The remainder of the material is licensed under a Creative Commons This page was last modified on 07/08/08 . Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Ask Professor Mean,


Three things you need for a power calculation (November 8, 2001) Category: Ask Professor Mean, Category: Sample size justification

Dear Professor Mean, I want to do research. Is forty subjects enough, or do I need more? Didn't I hear you mention something about three things you need for a power calculation? -- Eager Edward

Dear Eager,

That reminds me of a cute joke. How many research subjects does it take to screw in a light bulb? At least 300 if you want the bulb to have adequate power.

Sorry, I was digressing. Is forty subjects an adequate sample size? That depends on a lot of factors. The basic idea, though, is to select a sample size which ensures that your study has adequate power. Power is the probability that your research study will successfully detect a difference, assuming that the treatment or exposure you are examining actually can cause an important difference. If you don't care whether your experiment is successful or not, then you can use just about any sample size.

Short answer

Power is to a research design what sensitivity is to a diagnostic test. A diagnostic test with good sensitivity is normally able to detect a disease when the disease is present. A research study with good power is normally able to detect a change when your treatment is indeed effective.

The actual calculation of power requires three pieces of information:

  1. your research hypothesis,
  2. the variability of your outcome measure, and
  3. your estimate of the clinically relevant difference.

Calculating power is sometimes difficult and it may require you to go to the time and expense of running a pilot study. But you should NEVER start a research project without knowing what your power is. That would be like using a diagnostic test with unknown sensitivity.
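To see how the three pieces fit together for a study with forty subjects, here is a rough sketch. It assumes the research hypothesis calls for a two-sided comparison of two group means with twenty subjects in each group, and it combines the variability and the clinically relevant difference into a single ratio (the difference divided by the standard deviation). It uses a normal approximation, so take the results as ball park figures only.

```python
import math
from statistics import NormalDist

def approximate_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power for a two-sided, two-group comparison of means.

    effect_size is the clinically relevant difference divided by the
    standard deviation. Ignores the negligible chance of rejecting in
    the wrong direction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    return NormalDist().cdf(shift - z_alpha)

# Forty subjects split evenly into two groups of twenty.
for d in (0.2, 0.5, 0.8, 1.0):
    print(f"difference of {d} SD: power is about {approximate_power(20, d):.0%}")
```

Under these assumptions, forty subjects give you good power only when the difference you are looking for is close to a full standard deviation.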

Research hypothesis

A research hypothesis will provide specific information that will determine what type of analysis is needed. A common structure for a research hypothesis is specification of the subject group you are testing, the treatment or exposure that this group will receive, the outcome measure, and the comparison or control group.

Some exploratory studies may not have a research hypothesis, of course, and for those studies you determine an appropriate sample size in a different way (for example, by ensuring that the estimates from this exploratory study have adequate precision).

Variability of your outcome measure

You also need to have an estimate of the variability of your outcome measure. I'm assuming here that your outcome measure is a continuous variable like birth weight or cholesterol level. If you are using a categorical outcome measure like mortality or cancer remission, then you need some estimate of the rate of mortality or remission in your control group.

Your literature review (you did do a literature review before you started this research, I hope), will usually provide you with an estimate of variability. Select a study that is reasonably similar to what you plan to do, and find out what that study reported for the standard deviation for your outcome measure.

Although I prefer a standard deviation, other estimates of variability are also acceptable. If the paper reports a variance, a standard error, a confidence interval, or a coefficient of variation, then there are simple formulas for converting these into standard deviations. If the study provides a range, then you can divide the range by four to get a good approximation for the standard deviation.
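Here is a sketch of those conversions. A few assumptions are mine: the standard error and the confidence interval are for a single group mean based on n subjects, the confidence interval is a 95% interval built with the normal multiplier of 1.96, and the coefficient of variation is expressed as a proportion rather than a percentage. Check how the paper computed its numbers before you rely on any of these.

```python
import math

def sd_from_variance(variance):
    """The standard deviation is the square root of the variance."""
    return math.sqrt(variance)

def sd_from_standard_error(standard_error, n):
    """Standard error of a mean is SD / sqrt(n), so multiply back."""
    return standard_error * math.sqrt(n)

def sd_from_confidence_interval(lower, upper, n, z=1.96):
    """A 95% interval for a mean spans 2 * z standard errors."""
    return math.sqrt(n) * (upper - lower) / (2 * z)

def sd_from_coefficient_of_variation(cv, mean):
    """Coefficient of variation (as a proportion) is SD / mean."""
    return cv * mean

def sd_from_range(minimum, maximum):
    """Rule of thumb from the text: the range divided by four."""
    return (maximum - minimum) / 4

# Hypothetical published summaries, each based on 25 subjects.
print(sd_from_variance(100))
print(sd_from_standard_error(2, 25))
print(round(sd_from_confidence_interval(96.08, 103.92, 25), 2))
print(sd_from_range(80, 120))
```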

Many of the people I see have a difficult time providing any estimate of variability. This area hasn't been studied before, so no one knows what the variability will be. But don't give up too easily.

First, keep in mind that you only need a crude estimate of variability. Power calculations are capable of determining if you are "in the right ball park." They are good at specifying your sample size down to an order of magnitude perhaps but not much more than that. In other words, they might tell you whether you need dozens of subjects instead of hundreds of subjects, or possibly if you need thousands of subjects.

Second, although most research is innovative and therefore unique, this innovation is often in the treatment and not in the outcome measure. So look for studies that used the same outcome measure, even if the treatment is quite different than yours.

Third, try to characterize variability in your control group and we can try to extrapolate what the variability will be in the treatment group. A retrospective chart review, for example, will provide a rough estimate of variability of your outcome measure under the current standard of care.

Fourth, you may have to use a clearly flawed estimate, but a flawed estimate of variability may still be better than no estimate at all. An estimate of variability in adults, for example, may not be an ideal estimate for a pediatric study, but at least it tells you if your study will have adequate power assuming that the variation in a pediatric population is comparable to variation in an adult population. That's still better than having no idea whether your study has adequate power.

If you've tried and you still can't come up with an estimate of variability, then don't despair. A pilot study can provide you with an estimate of variability when all else fails. Usually 20 to 30 subjects produce a reasonably stable estimate of variability. A pilot study is also helpful for finding out how quickly you can recruit subjects. Furthermore, a pilot study will also identify any weaknesses in the logistics of your research. Finally, if the protocol remains substantially unchanged after the pilot study, you can usually include those pilot subjects in the final analysis.

Clinically relevant difference

Wow, that was exhausting! You're not done, though, until you can tell me what a clinically relevant difference would be for your outcome measure. This is a difference that is large enough to be considered important by a practicing clinician.

For just about every type of study, some differences are so small as to be clinically meaningless. From a theoretical viewpoint, perhaps, changes of any size might be interesting. But theory and practice are very different. If a six month diet program produces an average weight loss of three pounds, a fever medicine reduces average temperature by half a degree Fahrenheit, or a smoking cessation program helps an additional two percent to quit, who cares what the theoretical implications might be?

It's not easy but this is something that you have to do for yourself. The clinically relevant difference is determined by medical experts and not by statisticians. Hey, I'm still trying to understand the difference between good and bad cholesterol; I wouldn't even be able to start thinking about how much of a change in cholesterol is considered clinically relevant. You might start by asking yourself "How much of an improvement would I have to see before I would adopt a new treatment?" Also, try talking with some of your colleagues. And look at the size of improvements for other successful treatments.

Still, there are some general guidelines that might help. Try looking at the resolution of your measuring device, thinking in terms of relative changes, or specifying changes with respect to your standard deviation.

Average changes that are smaller than the resolution of your measuring instrument are probably not clinically relevant. For example, Apgar scores can take on any whole number between 0 and 10. Gestational age can only be measured accurately to within a week. In these contexts, it is clear that average changes should probably be greater than one unit in order to achieve relevance.

Still this is not a perfect rule. We can measure weights to within a gram, but changes in birth weight would have to be in the hundreds of grams or more to be meaningful. And while no family can have a fractional number of children, decreasing the average family size by 0.2 children can have a profound effect on society.

It also may help to think in terms of relative changes. If you can change something by 25 percent or 50 percent, that is considered relevant in most contexts. It becomes harder to argue clinical relevance for changes of less than 10 percent. Again, this is not a perfect rule.

Finally, you might find it easier to specify changes with respect to your standard deviation. This type of change is called an effect size. A common classification is that 0.2 standard deviations is considered a small effect size, 0.5 standard deviations is considered a medium effect size, and 0.8 standard deviations is considered a large effect size.

An effect size of 0.2 is small enough that there is no obvious visible separation between the two groups. The difference in average heights between 15 and 16 year old girls is 0.2 standard deviations. An effect size of 0.8 is clearly visible. The difference in average heights between 14 and 18 year old girls is 0.8 standard deviations.

It may be unrealistic to look for changes much smaller than 0.2 standard deviations because the sample sizes become prohibitively large. It may also be unrealistic to expect to see changes much larger than 0.8 standard deviations, since changes of this size do not seem to occur too often in the published literature.
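
To get a feel for why small effect sizes become expensive, here is a minimal sketch in Python. It assumes the usual normal-approximation formula for comparing the means of two independent groups with equal standard deviations, a two-sided alpha of .05, and a power of .80. The function name and the defaults are my own choices for illustration.

```python
import math
from statistics import NormalDist


def n_per_group_for_effect_size(d, alpha=0.05, power=0.80):
    """Approximate subjects per group needed to detect an effect size d.

    d is the difference between the two group means divided by the
    (common) standard deviation. Assumes a two-sided test and a normal
    approximation, so the answer is a planning estimate only.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(f"{label:>6} effect size ({d}): {n_per_group_for_effect_size(d)} per group")
```

Because the effect size is squared in the denominator, cutting the effect size in half roughly quadruples the sample size. That is why effect sizes much below 0.2 are rarely practical.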

Like the other two rules, this rule is also not perfect. In some animal experiments, for example, the similarity in the gene pool can often reduce variation to such an extent that changes of more than a full standard deviation are quite realistic. If you are trying to specify a clinically relevant difference, there is no substitute for a good understanding of the context of your research.

But I can't do it.

A lot of people tell me that they can't do this. They can't provide an estimate of variability or they can't determine what a clinically relevant difference is, even after I explain all of the above suggestions.

But you have to do it.

The CONSORT Guidelines require you to have an a priori justification of sample size for publication. If you don't do this now, you won't be able to publish the data in any journal that uses these guidelines. What's the point of doing the research if you can't publish it?

If your research requires an ethical review (e.g., through an IRB), they will require the same a priori justification. If the research involves animals, the appropriate animal care and use committee will require this justification.

The bottom line is that if you know so little about this avenue of research that you can't even come up with a preliminary estimate of the variability of your outcome variable, then you shouldn't be doing the research. You need instead to:

But do something, because your ability to perform the research and to publish your research depends on your justification of the sample size.

Example

In a study of two different skin barriers for burn patients, we are interested in three outcome measures: pain, healing time, and cost. We will randomly assign half of the patients to one skin barrier and half to the other.

For pediatric patients we usually measure pain with the Oucher, a five point scale that has been validated for children. A review of previous studies using the Oucher has shown that it has a standard deviation of about 1.5 units. We would be interested in seeing how large a sample size is needed to show a change of 1 unit, the smallest individual change attainable on the Oucher. We want to have a power of .80, or equivalently, a Type II error probability of .20.

The formulas for sample size vary from problem to problem. The sample size needed for a comparison of two independent groups is

$$ n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \, (\sigma_1^2 + \sigma_2^2)}{D^2} $$

We use the letter "z" to represent a standard normal distribution. Alpha represents the probability of a Type I error (usually .05). Beta represents the probability of a Type II error (we usually want this to be somewhere between .05 and .20). Sigma represents the standard deviation, and this formula allows for the possibility of different standard deviations in group 1 and group 2. Don't forget that the formula requires you to square these standard deviations. Finally, D is the clinically relevant difference. In our example,

$$ n = \frac{(1.96 + 0.84)^2 \, (1.5^2 + 1.5^2)}{1^2} = 35.28 $$

We round up. So in order to achieve 80% power for detecting a one unit difference in the Oucher score, which has a reported standard deviation of 1.5, we would need to sample 36 patients in each group.
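
If you would rather let the computer do the arithmetic, the calculation can be sketched in a few lines of Python. This is only a sketch: it assumes the usual normal-approximation form of the two-group formula with a two-sided alpha, and the function and argument names are hypothetical. It uses exact normal quantiles rather than the rounded values 1.96 and 0.84, so the unrounded answer differs slightly in the second decimal place.

```python
import math
from statistics import NormalDist


def sample_size_two_groups(sd1, sd2, difference, alpha=0.05, beta=0.20):
    """Unrounded per-group sample size for comparing two independent groups.

    sd1, sd2   -- standard deviations in group 1 and group 2
    difference -- the clinically relevant difference (D)
    alpha      -- probability of a Type I error (two-sided)
    beta       -- probability of a Type II error (1 - power)
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    return (z_alpha + z_beta) ** 2 * (sd1 ** 2 + sd2 ** 2) / difference ** 2


# Oucher example: standard deviation of 1.5 in both groups, difference of 1 unit.
raw_n = sample_size_two_groups(sd1=1.5, sd2=1.5, difference=1.0)
n_per_group = math.ceil(raw_n)  # always round up, never down
print(f"unrounded: {raw_n:.2f}, patients per group: {n_per_group}")
```

Rounding up with `math.ceil` rather than rounding to the nearest integer keeps the power from dipping below the target.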

Healing time is a more difficult endpoint to assess. Medical textbooks cite that the healing time for second degree burns has a range of 4 days (minimum 10, maximum 14). A study of healing times for a glove made from one of the skin barriers showed a healing time range of 6 (minimum 2 and maximum 8 days).

A rule of thumb is that the standard deviation is about one fourth to one sixth the size of the range. So we could have a standard deviation as small as 0.67 or as large as 1.5. An average change of one day in healing time would be considered clinically relevant.

If we use the largest possible estimate of standard deviation, we would get (coincidentally) the exact same sample size of 36 per group. If we used the smallest estimate of the standard deviation, we would need only 7 subjects per group.
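
The range rule of thumb is easy to script. The sketch below assumes the same two-group normal-approximation formula with equal standard deviations in both groups, a two-sided alpha of .05, and power of .80. The helper names are made up for illustration.

```python
import math
from statistics import NormalDist


def sd_bounds_from_range(minimum, maximum):
    """Rule of thumb: the SD is roughly one sixth to one fourth of the range."""
    value_range = maximum - minimum
    return value_range / 6, value_range / 4


def n_per_group(sd, difference, alpha=0.05, power=0.80):
    """Approximate per-group sample size, equal SDs, two-sided test."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil(2 * z_total ** 2 * sd ** 2 / difference ** 2)


# Healing time: textbook range of 10 to 14 days, glove study range of 2 to 8 days.
textbook_low, textbook_high = sd_bounds_from_range(10, 14)
glove_low, glove_high = sd_bounds_from_range(2, 8)

smallest_sd = min(textbook_low, glove_low)   # 4/6, kept unrounded
largest_sd = max(textbook_high, glove_high)  # 6/4

print(f"SD between {smallest_sd:.2f} and {largest_sd:.2f} days")
print("largest SD: ", n_per_group(largest_sd, difference=1.0), "per group")
print("smallest SD:", n_per_group(smallest_sd, difference=1.0), "per group")
```

Planning around the largest plausible standard deviation is the conservative choice, since an underestimate of variability leaves the study underpowered.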

For one type of skin barrier, a study of costs showed a range of $4.00 ($5.50 to $9.50). We would like to be able to detect a difference as small as $0.50 in costs.

Using the same rule of thumb, we get an estimate of the standard deviation of either 0.67 or 1.0. Using the smaller estimate of standard deviation, we would need 29 subjects per group. Using the larger estimate, we would need 63 subjects per group.

A sample size of 63 is untenable, so we decide that we can live with a study that could only detect a $1.00 change in costs. For this size difference, we would need 16 subjects per group using the larger standard deviation.
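
The trade-off here follows directly from the difference being squared in the denominator. As a worked sketch, assuming the two-group normal-approximation formula, rounded z values of 1.96 and 0.84, and the larger standard deviation of 1.0 in both groups:

```latex
n = \frac{(1.96 + 0.84)^2 \, (1.0^2 + 1.0^2)}{D^2} = \frac{15.68}{D^2}
\qquad\Longrightarrow\qquad
\begin{cases}
D = 0.50: & n = 15.68 / 0.25 = 62.72 \;\rightarrow\; 63 \\
D = 1.00: & n = 15.68 / 1.00 = 15.68 \;\rightarrow\; 16
\end{cases}
```

Doubling the difference you are willing to settle for cuts the required sample size to about one fourth.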

In summary, to achieve adequate power for all three endpoints, we would need 36 patients per group. This is larger than we need for the healing time endpoint. It is also larger than what we need for the cost endpoint, unless we wanted to detect a $0.50 change in costs. To detect such a small difference, we would need a sample size of 63 subjects per group.

Summary

Eager Edgar wants to know if forty subjects is enough to conduct a research study. Professor Mean explains that it is impossible to determine whether forty is an appropriate sample size without having these three things:

  1. a research hypothesis,
  2. a standard deviation for your outcome measure, and
  3. an estimate of the clinically relevant difference for this outcome measure.

Further reading

Jacob Cohen has an excellent discussion of effect sizes in Chapter 2 of his book, and the example of girls' heights comes directly from this book. Bernard Rosner incorporates a discussion of power and sample size issues into every section on statistical testing. Russ Lenth's PiFace software will provide more accurate power calculations than those presented here (or in Rosner's book), which is especially important when you are estimating power for small sample sizes. The range method for estimating standard deviations gives a more precise rule for converting a range into a standard deviation.

  1. Power and sample size page.
    Russell V. Lenth (Accessed on January 1, 2002).
    http://www.stat.uiowa.edu/~rlenth/Power/
  2. Range method for estimating standard deviation.
    (Accessed on October 2, 2000)
    http://www.uop.edu/cop/psychology/Statistics/range_method.html
  3. Statistical Power Analysis for the Behavioral Sciences, Revised Edition.
    Cohen J.
    New York NY: Academic Press (1977).
    ISBN: 0-12-179060-6.
  4. Fundamentals of Biostatistics, Third Edition.
    Rosner B.
    Belmont CA: Duxbury Press (1990).
    ISBN: 0-534-91973-1.

This page was written by Steve Simon and was last modified on 07/14/2008.


Statistical Evidence: Overview

This is a first draft of the overview for "Statistical Evidence."

"Still, it is an error to argue in front of your data. You find yourself insensibly twisting them around to fit your theories." Sherlock Holmes in The Adventure of Wisteria Lodge.

Reading medical research is hard work. I'm not talking about the medical terminology, though that is often quite bad (if I hear the word "emesis" one more time, I'm going to throw up!). The hard part is assessing the strength of the evidence. When you read a journal article, you have to decide if the authors present a case that is persuasive enough to get you to change your practice. This means assessing the strength of the evidence.

Some evidence is so strong that it stands on its own. Other evidence is weaker and requires support from other studies, from mechanistic arguments, and so forth. Still other evidence is so weak that you should not consider any changes in your practice until the study is replicated using a more rigorous approach. I hope to elaborate on the criteria that you should use when assessing the strength of the evidence.

0.1. What should you look for?

When you are assessing the quality of the evidence, it's not how the data are analyzed that's important. Far more important is HOW THE DATA ARE COLLECTED. Don't agonize over technical details about the statistical analysis. After all, if you collect the wrong data, it doesn't matter how fancy the analysis is.

This is good news, because you don't need a lot of statistical training or a lot of mathematical sophistication to assess how the data are collected.

In this book, I want to show you what to look for and why. I will also highlight real research articles and use them as examples. Although all of the examples represent good and valuable research, some of the examples represent a level of evidence that by itself is less persuasive. It is helpful to understand why these examples are less persuasive.

0.2. Schizophrenic Research

Unfortunately, there is a lot of less than persuasive research out there. You don't have to look very hard to find solid empirical evidence of this. One of my favorite examples is a study by Ben Thornley and Clive Adams that appeared in the British Medical Journal in 1998. You can find the full text of this article on the web at bmj.com/cgi/content/full/317/7167/1181 and it is well worth reading. Thornley and Adams looked at the quality of clinical trials for treating schizophrenia. Since they work for the Cochrane Collaboration Group, a group that provides systematic reviews of the results of medical trials, they are in a good position to write such an article.

Thornley and Adams actually identified over 2500 studies of schizophrenia, but decided to summarize only the first 2000 that they uncovered. Perhaps they reached the point of sheer exhaustion. I am very impressed at the amount of work this must have taken.

The research covered fifty years, from 1948 through 1997, and a variety of therapies: drug therapies, psychotherapy, policy or care packages, and physical interventions like electroconvulsive therapy.

What did Thornley and Adams find? It wasn't a pretty picture. First, researchers in schizophrenia studied the wrong patients. Most studies used institutionalized patients, who are easier to recruit and follow up with, but who do not provide a good representation of all patients with schizophrenia. Readers would probably be at least as interested in community based studies, but only 14% of the studies were community based.

Second, the researchers also did not study enough patients. Thornley and Adams estimated that a good study of schizophrenia should have at least 300 patients in each group. This would be based on rates of improvements that might be expected for an active drug compared to placebo effects. Even though the desired sample size was 300, it turns out that the average study had only 65. Only 3% of the studies had 300 or more patients.

Third, the researchers did not study the patients long enough. A good study of schizophrenia should last for six months or more; long term changes are more important than short term changes. Unfortunately, more than half of the studies lasted for six weeks or less.

Finally, the researchers did not measure these patients consistently. In the 2,000 studies, the researchers used 640 ways to measure the impact of the interventions. Granted, there are a lot of dimensions to schizophrenia and there were measures of symptoms, behavior, cognitive functioning, side effects, social functioning, and so forth. Still, there is no justification for using so many different measurements. Imagine how hard this makes it for anyone to summarize the results of this research. Failure to use and re-use a few standardized assessments has led to a very fragmentary (dare I say, schizophrenic) picture about schizophrenia treatments.

I don't wish to single out research in just this area. There are many reviews in other areas that also point out the flaws and shortcomings of research. Also keep in mind that research on schizophrenia is especially hard to do well. The take home message from Thornley and Adams is that just because the research is peer-reviewed does not mean that it is perfect. I hope to help you identify factors that limit the quality of peer-reviewed research.

0.3. Healthy Skepticism

Please don't panic. Research studies have many flaws, but usually those flaws do not make the research wholly uninterpretable. These limitations should make you skeptical, perhaps, but not cynical.

The cynical attitude would be "you can prove anything with statistics" and leads to a nihilistic view that all research is garbage. The cynical attitude would lead you to nitpick a research paper, finding a flaw here and a flaw there, and then use these flaws to disregard any research whose conclusions make you uncomfortable.

A skeptical attitude, on the other hand, would ask "how persuasive is this research" and would look at the strengths and the weaknesses of a research paper. It would place limits on how persuasive the research is. When the research was not sufficiently persuasive, a skeptical attitude would encourage you to think about what level of evidence would be enough to persuade you otherwise.

This webpage was written by Steve Simon on (unknown date), edited by Steve Simon and Linda Foland, and was last modified on 2008-07-08. This page needs minor revisions. Category: Statistical evidence


Apples or oranges?

1.0 Introduction

Almost all research involves comparison. Do women who take Tamoxifen have a lower rate of breast cancer recurrence than women who take a placebo? Do left handed people die at an earlier age than right handed people? Are men with severe vertex balding more likely to develop heart disease than men with no balding?

When you make such a comparison between an exposure/treatment group and a control group, you want a fair comparison. You want the control group to be identical to the exposure/treatment group in all respects, except for the exposure/treatment in question. You want an apples to apples comparison.

1.0.1 Covariate imbalance

Sometimes, however, you get an unfair comparison, an apples to oranges comparison. The control group differs on some important characteristics that might influence the outcome measure. This is known as covariate imbalance. Covariate imbalance is not an insurmountable problem, but it does make a study less authoritative.

Women who take oral contraceptives appear to have a higher risk of cervical cancer. But covariate imbalance might be producing an artificial rise in cancer rates for this group. Women who take oral contraceptives behave, as a group, differently than other women. For example, women who take oral contraceptives have a larger number of pap smears. This is probably because these women visit their doctors more regularly in order to get their prescriptions refilled and therefore have more opportunities to be offered a pap smear. This difference could lead to an increase in the number of detected cancer cases. Perhaps, though, the other women have just as much cancer, but it is more likely to remain undetected.

There are many other variables that influence the development of cervical cancer: age of first intercourse, number of sexual partners, use of condoms, and smoking habits. If women who take oral contraceptives differ in any of these lifestyle factors, then that might also produce a difference in cervical cancer rates. The possibility that oral contraceptives cause an increase in the risk of cervical cancer is quite complex; a good summary of all the issues involved appears on the web at www.jhuccp.org/pr/a9/a9chap5.shtml.

1.0.2 Case study: Vitamin C and Cancer

Paul Rosenbaum, in the first chapter of his book, Observational Studies, gives a fascinating example of an apples to oranges comparison. Ewan Cameron and Linus Pauling published an observational study of Vitamin C as a treatment for advanced cancer (Cameron 1976). For each patient, ten matched controls were selected with the same age, gender, cancer site, and histological tumor type. Patients receiving Vitamin C survived four times longer than the controls (p < 0.0001).

Cameron and Pauling minimize the lack of randomization. "Even though no formal process of randomization was carried out in the selection of our two groups, we believe that they come close to representing random subpopulations of the population of terminal cancer patients in the Vale of Leven Hospital."

Ten years later, the Mayo Clinic (Moertel 1985) conducted a randomized experiment which showed no statistically significant effect of Vitamin C. Why did the Cameron and Pauling study differ from the Mayo study?

The first limitation of the Cameron and Pauling study was that all of their patients received Vitamin C and were followed prospectively. The control group represented a retrospective chart review. You should be cautious about any comparison of prospective data to retrospective data.

But there was a more important issue. The treatment group represented patients newly diagnosed with terminal cancer. The control group was selected from death certificate records. So this was clearly an apples versus oranges comparison because the initial prognosis was worse in the control group than in the treatment group. As Paul Rosenbaum says so well: one can say with total confidence, without reservation or caveat, that the prognosis of the patient who is already dead is not good. (page 4)

When the treatment group is apples and the control group is oranges, you can't make a fair comparison.

1.0.3 Apples or oranges: What to look for.

To ensure that the researchers made an apples to apples comparison, ask the following questions:

Did the authors use randomization? In some studies, the researchers control who gets the new therapy and who gets the standard (control) therapy. When the researchers have this level of control, they almost always will randomize the choice. This type of study, a randomized study, is a very effective and very simple way to prevent covariate imbalance.

If randomization was not done, how were the patients selected? Several alternative approaches are available when the researchers have control of treatment assignment, but minimization is the only credible alternative. When researchers do not have control over treatment assignments, you have an observational study. The three major observational designs (cohort designs, case-control designs, and historical controls) all have weaknesses, but may represent the best available approach that is practical and ethical.

Did the authors use matching to prevent covariate imbalance? Matching is a method for selecting subjects that ensures a similar set of patients for the control group. A crossover design represents the ideal form of matching because each subject serves as his or her own control. Stratification ensures that broad demographic groups are equally represented in the treatment and control group.

Did the authors use statistical adjustments to control for covariate imbalance? Covariate adjustment uses statistical methods to try to correct for any existing imbalance. These methods work well, but only on variables that can be measured easily and accurately.

1.1 Did the authors use randomization?

Randomization is the assignment of treatment groups through the use of a random device, like the flip of a coin or the roll of a die, or numbers randomly generated by a computer.

Example: In a study of allergy shots (Adkinson 1997), 121 children with moderate-to-severe asthma were "randomly assigned to receive subcutaneous injections of either a mixture of seven aeroallergen extracts or a placebo."

Example: In a study of acupuncture (Bullock 1989) "80 severe recidivist alcoholics received acupuncture either at points specific for the treatment of substance abuse (treatment group) or at nonspecific points (control group)."

In both studies the researchers decided who got what. This is a hallmark of a randomized design and it only can occur when the patients and/or their doctors have no say in the assignment.

1.1.2 How does randomization help?

Randomization helps ensure that both measurable and unmeasurable factors are balanced out across both the standard and the new therapy, assuring a fair comparison. Used correctly, it also guarantees that no conscious or subconscious efforts were used to allocate subjects in a biased way.

There are situations where covariate imbalance can appear, even in a well randomized study (Roberts 1999). Just as you have no guarantee that a flip of 100 coins will yield exactly 50 heads and 50 tails, you have no guarantee that covariate imbalances cannot creep into a randomized study once in a while. This is not just a theoretical concern. One article (Mann 2002) argues that a difference in baseline stroke severity in a randomized trial of tPA produced an incorrect assertion of the effectiveness of this treatment.
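
The coin analogy is easy to check with a little arithmetic. The sketch below uses only the binomial distribution for fair coin flips; the function names are mine, and the 60/40 cutoff is just an illustrative choice of what counts as a lopsided split.

```python
from math import comb


def prob_exact_even_split(n_flips):
    """Probability that n_flips fair coin flips give exactly half heads."""
    return comb(n_flips, n_flips // 2) / 2 ** n_flips


def prob_split_at_least_this_uneven(n_flips, larger_side):
    """Probability that either heads or tails comes up larger_side times or more."""
    upper_tail = sum(comb(n_flips, k) for k in range(larger_side, n_flips + 1))
    return 2 * upper_tail / 2 ** n_flips  # two tails, symmetric for a fair coin


p_even = prob_exact_even_split(100)
p_lopsided = prob_split_at_least_this_uneven(100, 60)
print(f"exactly 50 heads in 100 flips: {p_even:.3f}")
print(f"a 60/40 split or worse:        {p_lopsided:.3f}")
```

A perfectly even split is the exception rather than the rule, and a noticeably lopsided split, while uncommon, is far from impossible. The same holds for a covariate in a randomized trial.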

Randomization relies on the law of large numbers. With small sample sizes, covariate imbalance may still creep in. A study examining the probability of covariate imbalance (Hsu 1989) showed that total sample sizes less than 10 could have a 50% chance or higher of having a categorical covariate with levels twice as large in one group as in the other. This study also showed that total sample sizes of 40 or greater would have very little chance of such a serious imbalance, and a total of 20-40 subjects would be acceptable if there were only one or two important covariates.

1.1.3 A fishy story about randomization

I was told this story but have no way of verifying its accuracy. A long, long, time ago, the U.S. Environmental Protection Agency wanted to examine a pollutant to find concentration levels that would kill fish. This research required that 100 fish be separated into five tanks, each of which would get a different level of the pollutant. The researchers caught the first twenty fish and put them in the first tank, then the next twenty fish and put them in a second tank and so forth. The last twenty fish went into the fifth tank. Each fish tank got a different concentration of the pollutant. When the research was done, the mortality was related not to the dosage, but to the order in which the tanks were filled, with the worst outcomes being in the first tank filled and the best outcomes in the last tank filled. What happened was that the slow-moving, easy-to-catch fish (the weakest and most sickly fish) were all allocated to the first tank. The fast-moving, hard-to-catch fish (the strongest and healthiest fish) ended up in the last tank.

Failure to randomize in this study ruined the entire effort. The huge imbalance caused by putting the sickest fish in the first tank and the healthiest fish in the last tank overwhelmed any differences in mortality caused by varying levels of the pollutant.

1.1.4 The mechanics of randomization

Random assignment means that the choice is left to some device that is inherently random and unpredictable. A flip of a coin is one approach, but usually a table of random numbers or a random number generator is more practical.

The simplest way to randomize is to lay out the treatment schedule in a systematic (non-random) fashion, generate a random number for each value in the schedule, and then sort the schedule by the random number. Sorting by a random number is effectively the same thing as putting the list in a random order.
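
The sort-by-a-random-number approach can be sketched in Python as follows. The treatment labels, group sizes, and seed are hypothetical values chosen only for illustration, and the function name is my own.

```python
import random


def randomize_schedule(treatments, n_per_treatment, seed=None):
    """Return a randomly ordered treatment schedule with balanced group sizes."""
    rng = random.Random(seed)  # a seed makes the list reproducible for auditing

    # Step 1: lay out the schedule in a systematic (non-random) fashion.
    schedule = [t for t in treatments for _ in range(n_per_treatment)]

    # Step 2: attach a random number to each entry in the schedule.
    keyed = [(rng.random(), treatment) for treatment in schedule]

    # Step 3: sort by the random number, which puts the list in random order.
    keyed.sort(key=lambda pair: pair[0])
    return [treatment for _, treatment in keyed]


assignments = randomize_schedule(["A", "B"], n_per_treatment=10, seed=2008)
for subject_number, treatment in enumerate(assignments, start=1):
    print(subject_number, treatment)
```

Because the schedule is laid out before it is shuffled, each treatment ends up with exactly the planned number of subjects, which a separate coin flip for each subject would not guarantee.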

1.1.5 Concealing the randomization list

Another important aspect of randomization is concealed allocation, which is the concealment of the randomization list from those involved with recruiti