Stats #02: Using SPSS to Describe Your Data

Content:  This three-hour training class will give you a general introduction to using SPSS software to manage your research data. This class is useful for anyone who needs to use SPSS to enter or analyze research data. Students should know how to use a mouse and how to open applications within Microsoft Windows. No statistical experience is necessary. This class will provide hands-on computer experience using SPSS software. You will use two SPSS data sets for practice exercises: bf.sav and housing.sav. If you have trouble downloading these files, try

Objectives:  In this class, you will learn how to:

Teaching strategies:  Didactic lectures and individual computer exercises.

IRB Education Credits:  This class does not qualify for IRB Education Credits (IRBECs).

Outline:


Welcome to this SPSS computer training class! Please be seated in front of any computer which has a monitor turned on. If the monitor is turned off, that means the computer is not working properly today.


Overview of the STATS web pages (January 21, 2000)

What are the STATS web pages?

The STATS pages are a collection of handouts that I use in my job as a statistical consultant. The web provides a nice home for these handouts, because as I update my material, the newest version is immediately available to anyone who is interested.

Where can I find STATS?

If you have a web browser, like Internet Explorer or Netscape Navigator, you can surf on over to my site,

http://www.childrensmercy.org/stats

which is also found at http://internet1/stats, if you are attached to the Children's Mercy Hospital network. There are two obsolete sites: http://www.cmh.edu/stats and http://simon/stats. Do not use either of these sites.

Some of the fun stuff you can find on the STATS web pages.

Ask Professor Mean.  For the tough Statistics questions that Dear Abby won't touch.

Planning Your Research Study.  Things you need to plan for before you start collecting your data.

Selecting An Appropriate Sample Size.  How much data do you really need?

Managing Your Research Data.  Everything you want to know before you step to the keyboard.

Steps In a Typical Data Analysis.  I have my data on the computer. Now what?

How to Read a Medical Journal Article.  Reading a journal is hard work. Here's some help.

Professor Mean's Library.  Good books and good web sites about Statistics.

... and even more good stuff!!!

This webpage was written by Steve Simon, edited by Linda Foland, and was last modified on 07/08/2008. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Website details


For CMH employees only: Statistical Consulting Services.

You can get free statistical consulting if you work for Children's Mercy Hospital. Steve Simon and Ashley Sherman provide a wide range of statistical consulting services to help you with your research projects. This help can start as early as the initial planning of your research. We also help with the analysis of your data, using SPSS or other statistical software. We can also provide assistance with the preparation of your presentations and publications.

Here are some examples of the services that we have provided:

Specific statistical advice has been outlined on a series of web pages which can be found at http://www.childrensmercy.org/stats/. The pages provide advice about planning your research, selecting an appropriate sample size, managing your research data, performing a variety of data analyses, presenting research data, and writing research papers.

How to get in touch with a statistician

If you would like to meet with Steve Simon or Ashley Sherman, you can set up an appointment by emailing or calling Judy Champion (jmchampion (at) cmh (dot) edu or 816-983-6784). If you have a very simple question, send an email directly to us (ssimon (at) cmh (dot) edu and aksherman (at) cmh (dot) edu).

This webpage was written by Steve Simon on 2003-04-30, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Directions to my new office (April 25, 2008).

I have moved to a new office. It is a modular building just north of Children's Mercy Hospital. It is between 23rd and 22nd street, just off of Kenwood Avenue (Kenwood is a small north/south street just west of Holmes). If you need to get from your office to mine, here are some directions written by my Administrative Assistant, Judy Champion.

This webpage was written by Steve Simon and was last modified on 2008-07-14. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Terminal Server (February 3, 2003)

Terminal server is a new and improved approach to using SPSS and SigmaPlot and other programs. You log on to a dedicated computer and run your programs on that computer rather than running SPSS or SigmaPlot through the network.

Terminal server offers several advantages:

Listed below are instructions on how to load terminal server on your computer. It is very easy, even for someone who is not a computer nerd. If you prefer to have someone else load terminal server for you, please ask your contact person in Information Systems for help.

We have done some work with the test version of terminal services. You don't need to use the test version, except for special and unusual situations.

Contents


Downloading and installing terminal server.

The software to load terminal server is located on an internal web site. Open Internet Explorer and type

http://10.1.20.59/ts/install.exe

in the address bar. You will get a FILE DOWNLOAD dialog box (see below) that will ask you what to do with the file.

It might look slightly different, depending on the version of Internet Explorer that you are using. Click on the OPEN button. If you don't see an OPEN button, click on the RUN button.

If you see a SECURITY dialog box and/or WINZIP dialog box, click on the YES button to continue.

Once the installation is complete, click on the START button and select Programs | Terminal Service Client | Client Connection Manager. Then right-click on the CMH TERMINAL icon. This brings up a popup menu (see below). Select PROPERTIES from the popup menu.

This will open up the PROPERTIES dialog box. Select the CONNECTION OPTIONS tab and click on the FULL SCREEN option button. This will ensure that terminal server will use your full screen rather than just part of your screen.

Click on the OK button to close this dialog box.

For a second time, right-click on the CMH TERMINAL icon to bring up the popup menu. Select CREATE SHORTCUT ON THE DESKTOP from the popup menu. If you do not see an option for "Create Shortcut on the Desktop", you can select "Send To" and then "Desktop (create shortcut)".


Can I load terminal server on my laptop? Your laptop needs to be connected to the network using a high speed internet connection or it needs to be attached directly to the hospital network. Follow the same steps described above. This will allow you to use SPSS on your laptop as long as you have a direct network connection or a connection via high speed internet access.

This webpage was written by Steve Simon on 2003-06-06, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page.


Logging on to terminal server.

Click on the TRMSERV icon and a TRMSERV - Terminal Services Client window will appear. In the Log On to Windows dialog box (see below), type the same user name and password that you use when you turn on your computer in the morning.

You will now see the desktop of terminal server (see below). This looks very similar to your own desktop, except that it has a different background color and it has the SPSS 11.5 for Windows icon.

You are now connected to terminal server. Double click to open the SPSS folder and then click on the SPSS icon to run SPSS.

How do I exit from terminal server?

At the bottom of the terminal screen is a start button that looks just like the START button on your regular computer. Click on START and select Shut Down from the menu. You will see either a DISCONNECT or LOG OFF option chosen (see below).

Either one works the same. Click on the OK button. Close the Connect to Terminal Server window.



Using files with terminal server.

You cannot use your floppy disk drive or your local hard drive directly from terminal server. Instead, you must save the file on a network drive. The best location is probably your user folder.

You have to tell terminal server the network name of your user folder. Do this once and your computer will remember from that moment forward.

Connect to terminal server and click on the MY COMPUTER icon. This will bring up a folder labeled My Computer (see below).

From the menu, select Tools | Map Network Drive. This will bring up the Map Network Drive dialog box (see below).

You need to assign a drive letter to the location of your network files. It would be best to set the drive letter to V:, but you can use a different letter if you like. Then type in the name of your folder. For me, it would be \\cmhsan08\users\ssimon.

After you have saved to the network, you can copy the file to a floppy disk or a local hard drive.

How do I open files in SPSS terminal server?

You can only open files located on the network. Before you connect to terminal server, copy the file from your floppy disk to a location on the network.

How do I use the training example data sets?

Training example data sets are stored in a folder on the desktop named SPSS Examples. Double click on this folder to open it. You can also find this folder on the D drive at D:\SPSS Examples.



Printing from terminal server.

You have to tell terminal server the network name of the printer that you normally use. Do this once and your computer will remember from that moment forward.

Open Internet Explorer and select File | Print from the menu.

Click on the ADD PRINTER icon and follow the instructions. You will get a series of dialog boxes labeled Add Printer Wizard. The instructions are mostly straightforward. After an introductory screen (not shown), you will get a dialog box asking if you are adding a local printer or a network printer. You must choose the Network printer option (see below), since terminal server does not work with local printers.

It helps if you know the exact name of your printer (one of the printers I use is named \\hpprint02\Medrsrch2).

If you know the exact name of your printer, you can type it in the above dialog box. If you are not 100% certain about the name of your printer, check the option anyway and leave the name blank. You will get a list of printers and print servers to browse through (see below). Do not use the Find a printer in the Directory option button, as that does not work well (at least not for me).

Once you have selected your printer, you should decide if this is the default printer, the one that SPSS terminal server will try to use as its first choice.

When you click on the Next button, you will get a dialog box summarizing your choices. If these choices appear reasonable, click on the Finish button. If something appears to be wrong, use the Back button to fix things.



Removing terminal server from your computer.

Click on Start | Settings | Control Panel | Add/Remove Programs. Find Terminal Services Client on the list of programs and click on the Change/Remove button.



Terminal server--What if I get an error message?

First of all, don't panic. Some of these error messages occur because efforts to protect against viruses, trojan horses, and other malicious software also interfere with the normal operations of SPSS and Terminal Server. Here are some of the messages that I have encountered already with a brief explanation of what causes the message and how to work around it.

If you encounter an error message other than the ones described here, please contact me.

Administrator access message. I have never seen this message on my computer, but your computer might pop up a dialog box when you are trying to install terminal server that says something along the lines of you don't have sufficient access or permission or administrator privileges to run SPSS. When IS set up your computer, they added a security layer that protects you against malicious viruses and other computer threats but which also disables your ability to install software on your own. You need to call the help desk and they will temporarily grant you "king/queen for a day" privileges that will enable you to install your own programs for a limited time.

"The client software could not initialize with SPSS Server at ." When loading SPSS, I would get a dialog box that says: "The client software could not initialize with SPSS Server at ." The folks at SPSS told me the solution. "This is the result of either a missing or corrupted file named 'registry.txt' in the SPSS program folder. This problem can be fixed by either reinstalling SPSS or obtaining a new copy of that file from our FTP site and replace it with the one in your SPSS directory. That file is located at ftp://ftp.spss.com/pub/spss/windows. Please locate the one that's specific to your SPSS version." -- SPSS Web Support, personal communication, September 25, 2002.

"This action has been cancelled due to restrictions put on this computer." You should not be getting this error message anymore, but I am keeping it here just in case. This message is actually a paper tiger. What happened a while back is that someone was running terminal server from home and thought it would be fun to download some games to run on terminal server. You know what happened next, of course. Virus attack on terminal server! So our IS folks decided that they had to add some major security restrictions to terminal server. The restrictions interfere with some of the minor bookkeeping activities with SPSS as it starts up. Apparently when SPSS checks for a proper license, it touches a part of terminal server that raises a security flag. But whatever happens on terminal server stays on terminal server. If you click OK on the dialog box, everything in SPSS works just fine.

"Windows cannot access the specified device, path, or file. You may not have the appropriate permissions to access." At CMH, we have created an SPSS group for security reasons. If you are not part of the SPSS group, you cannot access SPSS and you will get an error message along the lines of the above. Call the help desk (5-3454) and ask to be added to the SPSS group. You may need to reboot your computer afterwards.

"You do not have sufficient access to your machine to connect to the selected printer."  This message appears when you are trying to get Terminal Server to recognize and print to your networked printer. This occurs when IS has not installed the appropriate printer drivers on terminal server. Tell me the brand name of your printer, and we can fix it from our end.




This webpage was written by Steve Simon, edited by Linda Foland and Steve Simon, and was last modified on 07/08/08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page.


Pitch the pie! Ban the bar! (June 5, 2003).

This is an outline of a speech that I gave to Bluejacket Toastmasters on June 5, 2003.

I work a lot with numbers and I've found that there is usually a good way to display those numbers and a bad way. Here's an example.

It's a pie chart with bright bold colors and a deep 3-D effect. Is this a good way to display the data? WRONG! You should pitch the pie.

Here's another example.

It's a bar chart with big bold purple bars. Is this a good way to display the data? WRONG AGAIN! You should ban the bar.

These charts are useful once in a while, but most of the time all you need is the numbers themselves. You don't have to surround them in a cloak of fancy colors and graphic effects. The numbers by themselves are often all that you need.

But you can't just toss the numbers onto a sheet of paper and hope that it will work out well. You have to plan things. There are two things that can help:

  1. a little bit of rounding, and
  2. a little bit of re-ordering.

Costs of pet ownership example

Shown below is a table loosely adapted from a web page on pet care. I've taken a few liberties with some of the numbers to simplify this discussion, but the numbers are fairly close to the values on that web page.

                       Amphibians        Birds         Cats         Dogs
Initial Cost (1)           113.41       354.17       298.70       341.92
Food/Treats (2)             48.99       295.31        97.74       246.94
Vet Bills/Meds (2)          48.70       354.39       193.08       317.24
Misc. Costs (2)             41.11       116.06        64.19       211.57

                          Ferrets         Fish Hermit Crabs      Lizards
Initial Cost (1)            96.58       104.74        89.57       103.84
Food/Treats (2)            101.86        58.68        32.79       296.84
Vet Bills/Meds (2)         150.86        43.60        21.72       348.00
Misc. Costs (2)             60.10       103.28         7.97        92.78

                          Rodents       Snakes   Tarantulas
Initial Cost (1)            53.16        97.31       101.11
Food/Treats (2)             52.54       295.93        48.43
Vet Bills/Meds (2)          52.00       153.83        23.68
Misc. Costs (2)             61.56        70.06        43.32

(1) Includes items like cost of the pet, initial shots, litter box, collar, aquarium, etc.
(2) Yearly cost. This cost will vary based on the size of the pet.

The initial cost would include the cost of the pet, litter box for a cat, collar and leash for a dog, aquarium for fish, and so forth. These are also averages and would not apply to someone who gets diamond studded collars for their pets. Also the average food cost for a small Yorkie is not going to compare to the average food cost for a big Siberian Husky.

Look at this table and tell me what patterns you see. A few patterns might appear

But it takes a lot of squinting and staring to discover these patterns.

This table needs some work. The first thing is to do some rounding.

Rounding

Rounding is important because it reduces the strain on your brain. You don't have to work so hard to uncover patterns in the data.

When you look at a table of numbers, the first thing you often do is to make comparisons. These comparisons often involve an implicit subtraction.

For example, you might wonder to yourself "How much difference is there between the average vet bills for a dog and for a cat?"

The respective numbers are

317.24
193.08

Take some time to subtract here. This would tell you how much you would save on yearly vet bills if you got a cat instead of a dog.

Let's see, four minus eight is ummm, borrow the one, ow, ow, ow, my brain hurts.

You can simplify life by rounding the data to one or two significant figures. Here are the rounded costs

320
190

If I asked you to subtract those two numbers, you should be able to tell me the answer quickly and painlessly--130. My wife, an avid dog lover, would tell you that dogs are worth every penny!

When you round, you lose a little bit in precision. In this example, we're off by about six dollars or so. But the small loss in precision is more than made up for by the big gain in comprehension.
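If you want to check this trade-off yourself, here is a minimal Python sketch of rounding to two significant figures. The helper name round_sig is my own invention for this illustration and is not part of SPSS or of any particular library. It uses only the two vet bill numbers shown above.

```python
import math

def round_sig(value, digits=2):
    """Round value to the given number of significant figures.

    round_sig is a hypothetical helper written for this example.
    """
    if value == 0:
        return 0
    # Position of the leading digit: 2 for 317.24, 0 for 7.97.
    exponent = math.floor(math.log10(abs(value)))
    # Size of the last digit we keep: 10 for 317.24 at two figures.
    factor = 10 ** (exponent - digits + 1)
    return round(value / factor) * factor

dog, cat = 317.24, 193.08

exact_diff = dog - cat                          # the painful subtraction
rounded_diff = round_sig(dog) - round_sig(cat)  # 320 - 190

print(f"exact difference:   {exact_diff:.2f}")
print(f"rounded difference: {rounded_diff}")
print(f"precision lost:     {rounded_diff - exact_diff:.2f}")
```

The last line reports how far the easy answer strays from the exact one, which is the small price you pay for numbers that fit in your head.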

People I work with often don't like to round their numbers. It took a lot of effort to get that 317.24, by golly, and I don't want to throw any of that away.

Sometimes they will round their numbers but not enough. "Why can't I keep a third digit?" they ask. It turns out that the third digit will give you brain pain.

There's a reason for this. Inside your brain is a spot for short term memory storage. It can usually hold about four pieces of information without a problem. Anything more causes an overload and slows things down.

A pair of two-digit numbers will fit into short term memory very easily, but a pair of three-digit numbers will not.

In the vet costs example, rounding to three significant figures means rounding to the nearest dollar rather than to the nearest ten spot. This leads to the following subtraction.

317
193

Ow, ow, ow, my brain hurts again.

Re-ordering

When you arrange these numbers, try to anticipate the possible comparisons and then place the numbers close to one another. You have a choice here. You can orient the numbers horizontally,

320 190

by placing them within the same row. You could also orient the numbers vertically,

320
190

by placing them in the same column.

Which orientation is best for subtracting?

The vertical orientation appears far more natural for doing a subtraction. Also be sure to place the larger number above the smaller one. If you had the smaller one on top

190
320

it doesn't work as well.

Try to sort your numbers from high to low. If you have more than one column of numbers, use the first column, use the last column, or use the average of all the columns. It doesn't matter too much. A few of your numbers might not be in perfect order, but these deviations are actually interesting, as you will see in the example below.

Sorting by one of the columns will do a lot for your data, and is almost always better than the usual approach of alphabetizing by labels.

Have you ever seen a list of numbers for each of the fifty states? It's almost always alphabetical, but most of the time this places states next to one another that have almost nothing in common. Alaska is always between Alabama and Arizona. Wisconsin is always between West Virginia and Wyoming. There is nothing to recommend this approach.

Sure you can find your own state quickly, but then can you find other states that are similar to your state?

A better approach would be to sort the states by some criterion. List the states with the largest square miles at the top (Alaska, Texas, California) and put the states with the smallest square miles at the bottom (Connecticut, Delaware, Rhode Island).

Or list the states with the most people at the top (California, Texas, New York) and with the fewest people at the bottom (Alaska, Vermont, Wyoming).

Costs of pet ownership example, reworked

Here is the same table reworked. I rounded each value, and re-oriented the table so that the costs for each type of pet fell into the same column. I also sorted the numbers based on the initial cost.

                   Initial       Food/  Vet Bills/       Misc.
                  Cost (1)  Treats (2)    Meds (2)   Costs (2)
Birds                  350         300         350         120
Dogs                   340         250         320         210
Cats                   300         100         190          60
Amphibians             110          50          50          40
Fish                   100          60          40         100
Lizards                100         300         350          90
Tarantulas             100          50          20          40
Snakes                 100         300         150          70
Ferrets                100         100         150          60
Hermit Crabs            90          30          20          10
Rodents                 50          50          50          60

(1) Includes items like cost of the pet, initial shots, litter box, collar, aquarium, etc.
(2) Yearly cost. This cost will vary based on the size of the pet.
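The reworking above can be sketched in a few lines of Python. This is only an illustration of the two steps (round, then re-order), not the way the table was actually produced. The costs are copied from the first pet table, the helper name nearest_ten is made up for this example, and I am assuming that every value is rounded to the nearest ten dollars and that rows are sorted on the unrounded initial cost.

```python
# Costs copied from the first table: (initial, food/treats, vet/meds, misc).
costs = {
    "Amphibians":   (113.41,  48.99,  48.70,  41.11),
    "Birds":        (354.17, 295.31, 354.39, 116.06),
    "Cats":         (298.70,  97.74, 193.08,  64.19),
    "Dogs":         (341.92, 246.94, 317.24, 211.57),
    "Ferrets":      ( 96.58, 101.86, 150.86,  60.10),
    "Fish":         (104.74,  58.68,  43.60, 103.28),
    "Hermit Crabs": ( 89.57,  32.79,  21.72,   7.97),
    "Lizards":      (103.84, 296.84, 348.00,  92.78),
    "Rodents":      ( 53.16,  52.54,  52.00,  61.56),
    "Snakes":       ( 97.31, 295.93, 153.83,  70.06),
    "Tarantulas":   (101.11,  48.43,  23.68,  43.32),
}

def nearest_ten(x):
    """Round a dollar amount to the nearest ten dollars (hypothetical helper)."""
    return int(round(x / 10.0)) * 10

# Re-order: one row per pet, sorted from high to low on the initial cost.
ordered = sorted(costs.items(), key=lambda item: item[1][0], reverse=True)

# Round: apply the rounding only after sorting, so ties among the
# rounded values keep the order of the unrounded ones.
reworked = [(pet, tuple(nearest_ten(v) for v in row)) for pet, row in ordered]

for pet, row in reworked:
    print(f"{pet:<13}" + "".join(f"{v:>6}" for v in row))
```

Sorting before rounding is a design choice: five pets round to an initial cost of 100, and sorting first keeps them in the order of their unrounded costs.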

This table is a lot easier to look at. You might notice a few new patterns that weren't so obvious before.

You will probably notice other interesting patterns.

Summary

If you are displaying numbers, pitch the pie and ban the bar. Most of the time you are better off displaying the numbers themselves. Just be sure to do a little bit of rounding and re-ordering first.

References

All of the ideas described above were championed by A.S.C. Ehrenberg three decades ago. You can find more details in his book.

A Primer in Data Reduction. A.S.C. Ehrenberg (1982) New York: John Wiley & Sons.

The web site where I got the numbers from is

How Much Does it Cost to Own a Pet?. Steph Bairey. Accessed on 2003-06-04. "There is plenty of information out there about how to care for and train your pet. However, most leave out a very important factor: what it will cost. The estimates below are expressed in US Dollars and based on prices of food, accessories, and veterinary care in the Pacific Northwest, USA; your expenses may vary. However, they are excellent guidelines!" www.practical-pet-care.com/article_view.php?ver=22

The numbers on the web page were already rounded, so I had to "unround" them for this example by adding a small random amount to each value. I also replaced some of the zero values by a slightly larger number and made some other minor adjustments. The costs reflected in my tables, however, are very close to the ones on the web.

This webpage was written by Steve Simon on 2003-06-05, edited by Steve Simon and was last modified on 07/08/2008. Category: Graphical display


Categorical versus continuous variables

Many of the choices you will make in a descriptive data analysis depend on whether the variable is categorical or continuous. Here's a brief reminder about what these terms mean.

What is categorical data?

Data that consist of only a small number of values, each corresponding to a specific category value or label. Ask yourself whether you can state out loud all the possible values of your data without taking a breath. If you can, you have a pretty good indication that your data are categorical. In a recently published study of breast feeding in pre-term infants, there are a variety of categorical variables:

This webpage was written by Steve Simon on 2002-10-11, edited by Steve Simon, and was last modified on 2008-07-08. This page needs major revisions. Category: Definitions.

What is continuous data?

Data that consist of a large number of values, with no particular category label attached to any particular data value. Ask yourself if your data can conceptually take on any value inside some interval. If it can, you have a good indication that your data are continuous. In a recently published study of breast feeding in pre-term infants, there are a variety of continuous variables:



Stats >> Training >> Description of the breast feeding data set.

The file bf.sav contains data from a research study done at Children's Mercy Hospital and St. Luke's Medical Center. The data comes from a study of breast feeding in pre-term infants. Infants were randomized into either a treatment group (NG tube) or a control group (Bottle). Infants in the NG tube group were fed in the hospital via their nasogastric tube when the mother was not available for breast feeding. Infants in the bottle group received bottles when the mothers were not available. Both groups were monitored for six months after discharge from the hospital.

Variable list

  1. MomID Mother's Medical Record Number
  2. BabyID Baby's Medical Record Number
  3. FeedTyp Feeding type (Bottle or NG Tube)
  4. BfDisch Breastfeeding status at hospital discharge (Excl, Part, None)
  5. BfDay3 Breastfeeding status three days after discharge (Excl, Part, None)
  6. BfWk6 Breastfeeding status six weeks after discharge (Excl, Part, None)
  7. BfMo3 Breastfeeding status three months after discharge (Excl, Part, None)
  8. BfMo6 Breastfeeding status six months after discharge (Excl, Part, None)
  9. Sepsis Diagnosis of sepsis (Yes or No)
  10. DelType Type of delivery (Vag or C/S)
  11. MarStat Marital status of mother (Single or Married)
  12. Race Mother's race (White or Black)
  13. Smoker Smoking by mother during pregnancy (Yes or No)
  14. BfDurWk Breastfeeding duration in weeks
  15. AB Total number of apnea and bradycardia incidents
  16. AgeYrs Mother's age in years
  17. Grav Gravidity or number of pregnancies
  18. Para Parity or number of live births
  19. MiHosp Miles from the mother's home to the hospital
  20. DaysNG Number of days on the NG tube.
  21. TotBott Total number of bottles of formula given while in the hospital
  22. BirthWt Birthweight in kg
  23. GestAge Estimated gestational age in weeks
  24. Apgar1 Apgar score at one minute
  25. Apgar5 Apgar score at five minutes

Note: as I revise and improve this data set, I may add or remove variables from this list. So if the variables shown above don't match perfectly with the data set you have, don't panic.

Also note that I use different notation ("treatment" instead of "ng tube" and "control" instead of "bottle") in other parts of this website.

Source

Kliethermes PA; Cross ML; Lanese MG; Johnson KM; Simon SD [1999]. Transitioning preterm infants with nasogastric tube supplementation: increased likelihood of breastfeeding. J Obstet Gynecol Neonatal Nurs 28(3): 264-273



Stats >> Training >> Housing data

The file housing.sav (also available as a text file) is "a random sample of records of resales of homes from Feb 15 to Apr 30, 1993 from the files maintained by the Albuquerque Board of Realtors. This type of data is collected by multiple listing agencies in many cities and is used by realtors as an information base." There are 117 records in this database.

Variable Names:

The original data set had selling price in hundreds of dollars, but I found it useful to convert this to dollars. This data set also had a column for annual taxes, which I did not include in this data set.

Source:

http://lib.stat.cmu.edu/DASL/DataArchive.html The Data and Story Library. Link last checked on May 11, 2004. "DASL (pronounced "dazzle") is an online library of datafiles and stories that illustrate the use of basic statistics methods. We hope to provide data from a wide variety of topics so that statistics teachers can find real-world examples that will be interesting to their students. Use DASL's powerful search engine to locate the story or datafile of interest."



Stats >> Training >> Stats #02: Practice Exercises

These exercises refer to two data sets:

You should have both files on a floppy disk, which is attached to your handout.

1. For the breast feeding data, compute a frequency table for all the values (not just the first ten) of the mother's medical record number. Verify that no mother of triplets was included in this study.

2. For the breast feeding data, compute a frequency table for the infant's medical record number. Confirm that no infant appears twice in this study.

3. Open the file HOUSING.SAV. How many houses are in this sample?

4. An important portion of the breast feeding study is an examination of side effects of the treatment. Some of the important side effect variables are:

The first variable in this list is categorical and the second is continuous. Compute and interpret frequencies and ranges as appropriate for each of these variables.

5. Other important variables in this study are breast feeding status at discharge (BF0), three days after discharge (BF1), three months after discharge (BF3), and six months after discharge (BF4). All of these variables are categorical. Summarize these variables using frequency tables. Note: BF2 refers to breast feeding status six weeks after discharge, but because this variable was not evaluated prospectively, the researchers decided not to include it in any analysis.

6. In the housing data set, three important variables are the size of the house (SQFT), whether the house was custom built (CUST) and the sales price of the house (PRICE). Which of these variables are continuous and which are categorical? Summarize the continuous variables using frequencies and ranges as appropriate.

7. In the breast feeding study, examine the relationship between the treatment group (FEED_TYP) and all of the side effect variables discussed above.

8. In the breast feeding study, examine the relationship between breast feeding status at discharge (BF0) and the treatment group (FEED_TYP), mother's age (MOM_AGE), type of delivery (DEL_TYPE), birth weight (BW), gestational age (GEST_AGE), one and five minute Apgar scores (APGAR1, APGAR5), and age at discharge (DC_AGE).

9. In the housing study, examine the relationship between sales price (PRICE) and all other variables in the data set.

10. In the housing study, examine the relationship between whether a home was custom built (CUST) and whether it is more likely/less likely to be found on a corner lot (COR) or in the northeast region of the city (NEC).


Stats >> Model >> Steps in a descriptive model (October 11, 2001)

Every data analysis should start with a descriptive or exploratory analysis. If you have no research hypotheses, then you can stop with this. If you do have research hypotheses, the analysis will provide a solid foundation for any further statistical analysis.

Here are three steps that seem to work well for many descriptive analyses:

  1. Know your count.
  2. Compute ranges and frequencies.
  3. Examine relationships.

These steps may not be appropriate for every analysis, but they do serve as a general guideline. In this presentation, you will see these steps applied to data from a breast feeding study, using SPSS software.

Learning objectives

In this presentation, you will learn how to:

  1. Organize a plan for a descriptive data analysis.
  2. Produce and interpret statistics for a descriptive analysis.
  3. Examine relationships using tables and graphs.

Know your count

You need to get a feel for how much data you have. This includes the number of subjects in your study and the number of data values that are missing. When you have a count of the number of subjects in your study, keep that in mind when you examine any statistical procedures. If the total sample size in any of these procedures is less than your count, you may have problems with an undetected missing value.

This seems like a simple thing, but often there are subtle details that you can't ignore. For example, the following table lists the first 10 mothers in the study.

wpe22.gif (8044 bytes)

Notice that one mother appears twice. Further investigation shows that she is the mother of twins, both of whom were enrolled in the study. In this study, there were other twins, so the full data set includes 84 infants, but only 72 mothers. The presence of twins in the study greatly complicates the analysis, but we will not discuss those complications in this presentation.

Pay very special attention to counts when you are dealing with clusters or repeated measurements. An example of clusters would be when you randomly select families of subjects. For this type of study, you should note both the number of families in the study and the number of family members in the study. An example of repeated measurements would be when you examine a patient several times. For this type of study, note both the total number of patients and the total number of exams.
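Here is a minimal sketch of the counting step in Python rather than SPSS. The records, the field names (mom_id, infant_id, bf_weeks), and all of the values are made up for illustration and are not taken from bf.sav. The sketch counts infants, counts distinct mothers, lists any mother who appears more than once, and counts missing values.

```python
from collections import Counter

# Hypothetical records: one row per infant. None marks a missing value.
records = [
    {"mom_id": 101, "infant_id": 1, "bf_weeks": 12},
    {"mom_id": 102, "infant_id": 2, "bf_weeks": None},
    {"mom_id": 102, "infant_id": 3, "bf_weeks": 20},  # twin of infant 2
    {"mom_id": 103, "infant_id": 4, "bf_weeks": 8},
]

# Number of rows (infants) versus number of distinct clusters (mothers).
n_infants = len(records)
mom_counts = Counter(r["mom_id"] for r in records)
n_mothers = len(mom_counts)

# Mothers who appear more than once, with the number of infants for each.
repeated_moms = {mom: k for mom, k in mom_counts.items() if k > 1}

# Number of missing values for the outcome variable.
n_missing = sum(1 for r in records if r["bf_weeks"] is None)

print(n_infants, n_mothers, repeated_moms, n_missing)
```

Keeping both counts on hand (rows and clusters) makes it easier to spot a procedure whose sample size is smaller than you expected.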

Compute ranges and frequencies

You should know what the maximum and minimum values are for all the important variables in your data set. If any of these are surprising, you should investigate. You should also know how many observations fall into each level of any important categorical variables.

Our outcome measure, the age when breast feeding was stopped, is a continuous variable. Here is a table of statistics for this variable, including the minimum and maximum values.

wpe2C.gif (3934 bytes)

At first glance, the maximum value (34 weeks) seems a bit large (the study followed infants for only 24 weeks after discharge). But when I talked to the nurses involved, they explained that the length of breast feeding included the time the infants were in the hospital.

Also notice that the sample size for this table (82) is less than the total number of data points. This serves as a reminder that some of the data are missing for the age when breast feeding was stopped.

Other tables (not shown) tell us that the birth weights ranged from 1 kilogram to 2.4 kilograms and the gestational age from 26 to 36 weeks. These are reasonable values for a population of pre-term infants. The youngest and oldest mothers are 16 and 44 years old respectively, which is also quite reasonable.
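As a rough illustration of this kind of range check outside of SPSS, the following Python sketch reports the total count, the number of non-missing values, and the minimum and maximum. The values are hypothetical and are not the study data.

```python
# Hypothetical durations of breast feeding in weeks; None marks a missing value.
bf_weeks = [12, None, 20, 8, 34, 3, None, 16]

# Drop missing values before computing the range.
valid = [x for x in bf_weeks if x is not None]

summary = {
    "n_total": len(bf_weeks),
    "n_valid": len(valid),
    "minimum": min(valid),
    "maximum": max(valid),
}
print(summary)
```

Comparing n_valid to n_total is the same check described above: a sample size smaller than your count signals missing data.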

Race/ethnicity is a categorical variable. Here is a table of frequencies for this variable.

wpe2E.gif (3927 bytes)

This table shows that the patient population is almost exclusively white. Not only is this valuable for writing up the description of the patient population in your research paper, it also indicates that any attempt to account for race in later models is probably a waste of time.
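A frequency table like this one can be sketched in a few lines of Python. The category labels and counts below are invented for illustration and do not reproduce the study's table.

```python
from collections import Counter

# Hypothetical race/ethnicity values; the categories and counts are made up.
race = ["White"] * 17 + ["Black"] * 2 + ["Hispanic"] * 1

counts = Counter(race)
total = sum(counts.values())

# One row per category: (category, count, percent rounded to a whole number).
freq_table = [(cat, n, round(100 * n / total)) for cat, n in counts.most_common()]

for cat, n, pct in freq_table:
    print(f"{cat:<10}{n:>5}{pct:>5}%")
```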

Examine relationships

You should have a general idea of how one variable changes as another one changes. For two categorical variables, we can examine this using crosstabs. For two continuous variables, we can examine this using a scatterplot. For a relationship between a continuous and a categorical variable, we can use boxplots.

The following is a crosstabulation of feeding type versus delivery type. Notice that I have placed feeding type as the rows of the table.

wpe30.gif (3985 bytes)

Sometimes these tables are easier to interpret with percentages. I selected the row percentages option to get the following table.

We can see that there was a roughly 50-50 chance for a C-section birth to find itself in the treatment or control group. In the vaginal births, however, there was a slightly greater tendency to be found in the control group. This is an imbalance which might cause problems with interpretation of the results.

Does delivery type also influence duration of breast feeding? The following box plot shows that C-section births tend to have longer durations than vaginal births, a somewhat surprising finding. Because delivery type is related to both feeding type and duration of breast feeding, we should be sure to examine delivery type as a potential confounding variable in any analysis.

wpe35.gif (2744 bytes)

The mother's age is an important factor in any breast feeding study. Here is a boxplot comparing ages in the two feeding groups.

wpe38.gif (2660 bytes)

We see that the NG tube group has older mothers than the bottle group. Further statistical analysis shows that the average age is 29 in the NG tube group and 25 in the bottle group, a difference of 4 years.

We also should examine the relationship between mother's age and duration of breast feeding. The following scatterplot shows a slight tendency for older mothers to breast feed longer.

wpe3A.gif (3264 bytes)

As with delivery type, we should be careful to adjust for mother's age in any comparison of the two feeding groups.

This webpage was written by Steve Simon and was last modified on 07/08/2008.


What is a boxplot? (October 15, 2002)

The box plot is a graphical display of a five number summary. Sometimes the box plot is also known as a box and whiskers plot.

Here are the four steps you follow to draw a boxplot.

  1. Draw a box from the 25th to the 75th percentile.
  2. Split the box with a line at the median.
  3. Draw a thin line (whisker) from the 75th percentile up to the maximum value.
  4. Draw another thin line from the 25th percentile down to the minimum value.

The length of the box in a box plot, i.e., the distance between the 25th and 75th percentiles, is known as the interquartile range. You can use this box length to detect outliers. If any whisker is more than 1.5 times as long as the length of the box, then we have evidence of outliers. A common variation on the box plot is to draw the whisker to the value which is just shy of 1.5 box lengths away, and highlight each individual data point more than 1.5 box lengths away.
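The numbers behind a box plot can be sketched in Python as follows. This is only an illustration: the ages are hypothetical, and the percentiles use the "inclusive" method of statistics.quantiles, which is one of several common conventions and may not match the percentiles that SPSS reports.

```python
import statistics

def boxplot_summary(values):
    """Five number summary plus points more than 1.5 box lengths from the box."""
    data = sorted(values)
    # method="inclusive" is one of several common percentile conventions.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    box_length = q3 - q1  # the interquartile range
    low_fence = q1 - 1.5 * box_length
    high_fence = q3 + 1.5 * box_length
    # Points beyond the fences are highlighted individually.
    outliers = [x for x in data if x < low_fence or x > high_fence]
    # In the common variation, whiskers stop at the most extreme points inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1],
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Hypothetical ages in years.
ages = [16, 21, 23, 24, 25, 26, 27, 29, 31, 44]
print(boxplot_summary(ages))
```

With these made-up ages, the value 44 lies more than 1.5 box lengths above the 75th percentile, so the upper whisker stops at 31 and 44 is flagged as an outlier.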

This webpage was written by Steve Simon on 2005-08-18, edited by Steve Simon, and was last modified on 2008-07-08. This page needs minor revisions. Category: Definitions, Category: Graphical display.


How to set up tables.

It's not always clear how to best set up a crosstabs in SPSS. Here are some guidelines that might help.

Displaying tables of percentages (November 6, 2002) Category: Ask Professor Mean, Category: Writing research papers

Dear Professor Mean, My colleagues and I argue over the most appropriate way for displaying tables of percentages. Must the row or column always add to 100%? Also, in cases where it is difficult to know which variable is dependent, how does one decide the best way to present the results? -- Garrulous Gail

Dear Garrulous,

There are a variety of ways to display a two by two (or larger) table. No way is correct all the time, and some of the choices reflect subjective judgment. But here are some rules I use.

1. Never display more than one type of number in a table. Statistical software like SPSS can produce counts, row percents, column percents, cell percents, expected counts, residuals, and/or cell contribution to chi-squared values. At one time or another you might want to use each of these statistics, but never all at one time. Two or more numbers in a table cause confusion and make your tables harder to interpret.

Present a single summary statistic in the table if at all possible. If you need to display two summary statistics (for example, both counts and row percentages), then place the counts in one table and the row percentages in a different table. If you have to fit them in the same table, place the two numbers side by side with the less important number appearing second and in parentheses. For example, 54% (257).

2. Row percentages are usually best. Row percentages are the percentages you compute by dividing each count by the row total. Row percentages place the comparison between two numbers within a single column, so that one number is directly beneath the number you want to compare it to. This is usually better than column percents, where the numbers you want to compare are side by side. If you find that column percentages make more sense, consider swapping the rows and columns.

If you find that cell percentages make the most sense, consider creating composite categories that combine the row and column categories. Cell percentages are the percentages that you get when you divide each cell count by the overall total. When cell percents are interesting, it usually means that you are interested in the four distinct categories in your two by two table. For example, you are interested in seeing what fraction of job candidates are white males, rather than seeing how the probability of being male influences the probability of being white. For this type of data, treat it as a single categorical variable with four levels (white males, white females, black males, black females) rather than two categorical variables with each having two levels (black/white, male/female).

3. Place the treatment/exposure variable as rows and outcome variable as columns. This relates to the above item. You usually are interested in the probability of an outcome like death or disease, and you are interested in how this probability changes when the treatment or exposure changes. Arranging the table thusly and using row percents usually gets you the comparison you are interested in.

4. If one variable has a lot more levels than the other variable, place that variable in rows. A table that is tall and thin is usually easier to read than a table that is short and wide. It is easier to scroll up and down rather than left and right. For a really large number of levels, you might have to print your table on two or more pages. Usually it is a lot easier to align these pages if the table is tall and thin. A short wide table that is split on two or more pages is often a disaster.

5. Whenever you report percentages, always round. A change on the order of tenths of a percent is almost never interesting or important. Displaying that tenth of a percent makes it harder to manipulate the numbers to see the big picture.

6. Don't worry about whether your percentages add up to 99% or 101%. First of all, it can't happen with a two by two table unless you round incorrectly. For a larger table, it can happen, but your audience is sophisticated enough to understand why this is the case. No one, for example, is going to be upset when 33% plus 33% plus 33% adds up to less than 100%.

7. When in doubt, write out your table several different ways. Pick out the one that gives the clearest picture of what is really happening. Don't rely on the first draft of your table, just like you would never rely on the first draft of your writing.
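The rounding behavior described in rules 5 and 6 is easy to check with a small Python sketch. The counts here are invented purely to show totals of 99% and 101%.

```python
def rounded_percents(counts):
    """Convert counts to percentages rounded to whole numbers."""
    total = sum(counts)
    return [round(100 * c / total) for c in counts]

# Three equal categories: each is 33.3%, which rounds down.
thirds = rounded_percents([1, 1, 1])
print(thirds, sum(thirds))

# Counts of 2, 2 and 3 out of 7: every percentage rounds up.
sevenths = rounded_percents([2, 2, 3])
print(sevenths, sum(sevenths))
```

The first set of rounded percentages adds up to 99 and the second adds up to 101, even though nothing is wrong with either set of counts.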

Examples

A simple fictitious example will help illustrate these points.

We classify people by their income (rich/poor) and also by their attitude (happy/miserable). There are, for example, 30 rich happy people in our sample and 70 poor miserable people.

This figure shows column percentages. We compute this by dividing each number by the column total.

We see for example that only 25% of all happy people are rich. This is a conditional probability and is usually written as P[Rich | Happy]. Read the vertical bar as "given." So this probability is read as the probability of being rich given that you are happy.

This figure shows row percentages. We compute this by dividing each number by the row total.

We see, for example that 75% of rich people are happy. This is a different conditional probability, P[Happy | Rich]. Read this as the probability of being happy given that you are rich.

Notice the distinction between the two probabilities. Only a few happy people are rich, but most rich people are happy.

This figure shows cell percentages. We compute this by dividing each number by the grand total. Each percentage represents the probability of having two conditions. For example, there is a 15% chance of being rich and happy.

The table above shows a good format for combining two numbers in a single table.

This is an alternate way of displaying cell percentages.

If we had six categories for attitude rather than just two, we might arrange the table differently.

Notice that this table would not require any sideways scrolling.
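The three kinds of percentages in this example can be reproduced with a short Python sketch. Only two of the counts (30 rich happy people and 70 poor miserable people) are stated in the text. The other two counts (10 and 90) are inferred here from the quoted percentages of 75%, 25%, and 15%, so treat them as an assumption.

```python
# Counts for the rich/poor by happy/miserable example.
# 30 and 70 are given in the text; 10 and 90 are inferred from the
# quoted percentages (75%, 25% and 15%), so treat them as an assumption.
table = {
    "Rich": {"Happy": 30, "Miserable": 10},
    "Poor": {"Happy": 90, "Miserable": 70},
}

def percents(table, kind):
    """Return row, column, or cell percentages, rounded to whole numbers."""
    row_totals = {r: sum(cols.values()) for r, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for c, n in cols.items():
            col_totals[c] = col_totals.get(c, 0) + n
    grand_total = sum(row_totals.values())

    result = {}
    for r, cols in table.items():
        result[r] = {}
        for c, n in cols.items():
            if kind == "row":
                denom = row_totals[r]      # divide by the row total
            elif kind == "column":
                denom = col_totals[c]      # divide by the column total
            elif kind == "cell":
                denom = grand_total        # divide by the grand total
            else:
                raise ValueError(f"unknown kind: {kind}")
            result[r][c] = round(100 * n / denom)
    return result

for kind in ("row", "column", "cell"):
    print(kind, percents(table, kind))
```

For the rich and happy cell, the row percentage is 75 (P[Happy | Rich]), the column percentage is 25 (P[Rich | Happy]), and the cell percentage is 15, matching the values discussed above.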

Summary

  1. Never display more than one type of number in a table.
  2. Row percentages are usually best.
  3. Place the treatment/exposure variable as rows and outcome variable as columns.
  4. If one variable has a lot more levels than the other variable, place that variable in rows.
  5. Whenever you report percentages, always round.
  6. Don't worry about whether your percentages add up to 99% or 101%.
  7. When in doubt, write out your table several different ways.



Stats >> Model >> SPSS dialog boxes for a descriptive analysis (June 21, 2002)

This handout will show the SPSS dialog boxes that I used to create the examples in the descriptive data analysis handout. I will capitalize variable names, field names and menu picks for clarity.

Compute frequency counts

Select ANALYZE | DESCRIPTIVE STATISTICS | FREQUENCIES from the SPSS menu. You will see the following dialog box:

wpeB.gif (15118 bytes)

Click on RACE and then click on the right arrow button to add it to the VARIABLE(S) field.

Find minimum and maximum values.

Select ANALYZE | DESCRIPTIVE STATISTICS | DESCRIPTIVES from the SPSS menu. You will see the following dialog box.

wpe6.gif (12684 bytes)

Select your variable in the list on the left and click on the arrow button to add it to the VARIABLE(S) field. You can repeat this for additional variables if needed.

Compute cross tabulations

Select ANALYZE | DESCRIPTIVE STATISTICS | CROSSTABS from the SPSS menu. You will see the following dialog box.

wpe7.gif (22963 bytes)

Select variables from the list on the left. Add one to the ROW(S) field and another to the COLUMN(S) field. Click on the OK button to continue.

To produce row percents, select ANALYZE | DESCRIPTIVE STATISTICS | CROSSTABS again. Notice that SPSS remembered your previous choices. How nice! Now click on the CELLS button to get the following dialog box.

wpeA.gif (10025 bytes)

Check the ROW option. Now click on the CONTINUE button in this dialog box and the OK button in the previous dialog box.

Drawing boxplots

Select GRAPHS | BOXPLOT from the SPSS menu. You will see the following dialog box.

wpe8.gif (8568 bytes)

We will select the SIMPLE option and the SUMMARIES FOR GROUPS OF CASES option here. A good rule of thumb is to always try the default options first. You can always experiment with other options if needed, but the defaults in SPSS usually work well.

You would use the CLUSTERED option if you want to see separate box plots across the combination of two different categorical variables. You would select the SUMMARIES OF SEPARATE VARIABLES if you wanted box plots for several columns of data simultaneously.

When you click on the DEFINE button, you will see the following dialog box.

wpe9.gif (16064 bytes)

Select a continuous variable and add it to the VARIABLE field. Select a categorical variable and add it to the CATEGORY AXIS field. You can leave the LABEL CASES BY field blank if you like. The variable in this field provides labels for any outliers that might be found in the box plots. If the field is blank, SPSS labels outliers with the row number.

Draw a scatterplot.

Select GRAPHS | SCATTER from the SPSS menu. You will see the following dialog box.

wpeB.gif (6255 bytes)

We will select SIMPLE, the default option. You would select the OVERLAY option instead if you wanted to plot more than two columns of data simultaneously. You would select the 3-D option if you wanted to examine the relationship among three continuous variables simultaneously. These 3-D graphs look fancy, but they are often difficult to interpret. Another option which works for three (or even more) variables is the scatterplot matrix. This arranges graphs of all possible pairs of your data in a nice grid.

When you click on the DEFINE button, you will see the following dialog box:

wpeC.gif (22440 bytes)

Select continuous variables for the Y-AXIS field and the X-AXIS field. The remaining two fields are optional. If you place a categorical variable in the SET MARKERS BY field, SPSS will use different marks for each level of your categorical variable. If you place a variable in the LABEL CASES BY field, then values of that variable will appear as labels by each data point. With a graph like ours with 87 points, those labels would make our graph far too cluttered.

You may wish to modify or customize the graph that SPSS produces. To make changes, double click on the graph. You will get a chart editor window that looks like the following.

Chart1.bmp (308278 bytes)

For example, the points displayed in this graph are too small and the wrong shape. To modify this, select FORMAT | MARKER from the SPSS menu. You will see the following dialog box.

Chart2.bmp (51958 bytes)

Select the open circle marker and the MEDIUM size option. Then click on the APPLY ALL button. If you like this choice, click on the CLOSE button in the above dialog box and select FILE | CLOSE from the chart editor window. The modified graph will appear in the SPSS output window.


Page last modified on 09/24/2007. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page.


Please fill out an evaluation form. Your input is important. These evaluation forms also ensure that we can offer Continuing Medical Education credits for this class.