PPA 696 RESEARCH METHODS

DATA ENTRY ISSUES

Why Coding?
Steps in data management
Prepare the data collection instrument and collect the data;
Prepare the data dictionary or codebook;
Tips on Coding:
Prepare the data matrix worksheets;
Prepare instructions for data entry and data analysis.
 

Why Coding?

    All research collects data of some sort. To make sense of the data, the researcher must analyze them. Analysis begins with labeling the data as to their source, how they were collected, the information they contain, etc.

    Working with original data, however, can be very cumbersome, whether it is hundreds of mailed questionnaires, figures on yearly accident rates for the fifty states, or observations of classroom behavior of school children. For this reason, data are often coded.

    Coding allows the researcher to reduce large quantities of information into a form that can be more easily handled, especially by computer programs. Not all data need to be coded. For example, the accident rates for the fifty states would not be coded, but each state could be assigned a number (1 through 50) instead of using the state name. There are also content analysis computer programs that help researchers to code textual data for qualitative or quantitative analysis.
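
    As a minimal sketch of this idea in Python, state names can be replaced by short numeric codes. The three-state list and the names STATES, STATE_CODES, and code_state are illustrative assumptions; a real project would list all fifty states.

```python
# Hypothetical example: number the states in alphabetical order.
STATES = ["Alabama", "Alaska", "Arizona"]  # a real list would hold all fifty

# Codes start at 1, matching the "1 through 50" scheme described above.
STATE_CODES = {name: code for code, name in enumerate(sorted(STATES), start=1)}

def code_state(name):
    """Return the numeric code for a state name, ignoring case and outer spaces."""
    lookup = {key.lower(): value for key, value in STATE_CODES.items()}
    return lookup[name.strip().lower()]

print(code_state("alaska"))
```

    The accident rate itself stays as a number; only the label (the state name) is replaced by a code.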
 

Steps in data management

a) prepare the data collection instrument and collect the data;
b) prepare the data dictionary or codebook;
c) prepare the data matrix worksheets;
d) prepare instructions for data entry and data analysis.
 
 

A. Prepare the data collection instrument and collect the data. Example:  Quality of Work Life Questionnaire

1. Name of Division where you work: _____________________________

2. How long have you been an employee in this company? _______years

3. How many county-sponsored training sessions have you attended? _____

4. What is your job classification?
_____Management
_____Technical
_____Administrative
_____Clerical

5. Is your position
_____supervisory
_____non-supervisory

6. Sex
_____male
_____female

7. In what area would you like to receive additional training? ___________
 
 

B. Prepare the data dictionary or codebook.

    If data are to be entered into a computer program, whether a spreadsheet, data base, or statistical program, they must be entered in exactly the same way for each person, questionnaire, state, or other unit of analysis.

    Many computer programs have limits on the way data can be entered, stored, and retrieved. These limits should be reflected in the codebook. For example, the names of your variables often cannot exceed eight characters. Use short variable names, preferably all letters. You generally can use numbers as well as letters in variable names, but you cannot use spaces, punctuation, or other special characters.

    The variable names you assign to the data should reflect the nominal definitions of the variables themselves, such as "age," "jobclass," "training," and so forth. You may want to adopt a rule such as using only lower case letters for any alpha-numeric data that you enter, or only uppercase letters. This will make typing variable names easier later when you must tell the computer program which variables to analyze.

    Data can be stored in many ways. The most common form for variables is numeric data, consisting only of numbers. Usually this allows for fractions to be stored as decimals, for example, 2.3 or 0.888.

    Data can also be stored as letters, called alpha-numeric format. This allows the variable to be stored as either letters or numbers or a combination of the two. For example, you could store first names, such as "Amy," "Brad," "Caroline," etc. or combinations such as apartment numbers (102b), or license plate numbers (3XGJ429), etc.

    In neither case should data ever be entered with spaces, punctuation marks, or any special characters of any kind. Large numbers should not have any commas placed in them; names should not have any periods, dashes, quotation marks, etc.
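
    The naming and entry rules above can be checked mechanically. The following Python sketch assumes the eight-character limit mentioned earlier; the function names are hypothetical, and individual programs may impose different limits.

```python
import re

MAX_NAME_LENGTH = 8  # assumed limit; check the program you actually use

def is_valid_variable_name(name):
    """True if the name has 1 to MAX_NAME_LENGTH characters, all letters or digits."""
    return re.fullmatch(r"[A-Za-z0-9]+", name) is not None and len(name) <= MAX_NAME_LENGTH

def clean_numeric(raw):
    """Drop commas, spaces, and other symbols, keeping digits and the decimal point.
    Negative signs are not handled in this sketch."""
    return re.sub(r"[^0-9.]", "", raw)

def clean_alphanumeric(raw):
    """Drop spaces, punctuation, and special characters, keeping letters and digits."""
    return re.sub(r"[^A-Za-z0-9]", "", raw)

print(is_valid_variable_name("jobclass"))   # eight letters
print(is_valid_variable_name("job class"))  # contains a space
print(clean_numeric("1,250,000"))
print(clean_alphanumeric("3XGJ-429"))
```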

    The codebook tells the coder how each questionnaire will be coded for data entry. It specifies the question on the questionnaire from which the data are taken, the variable name, the operational definition of the variable, the coding options, the type of variable (numeric or alpha-numeric), and the number of columns the variable requires.

Example: Quality of Work Life Codebook

Q.    Variable   Operational Definition               Coding              Col.   Type
No.   Name
----  ---------  -----------------------------------  ------------------  -----  ----
      ID         Questionnaire number                 001-999             1-3    num

1     DIVISION   Name of Division where you work?     Planning=1          4      num
                                                      Traffic=2
                                                      Engineering=3
                                                      Enforcement=4
                                                      missing=9

2     LENGTH     How long have you been an employee   01-98               5-6    num
                 in this company?                     missing=99

3     TRAINING   How many county-sponsored training   00-98               7-8    num
                 sessions have you attended?          missing=99

4     JOBCLASS   What is your job classification?     Management=1        9      num
                 Management, Technical,               Technical=2
                 Administrative, Clerical             Administrative=3
                                                      Clerical=4
                                                      missing=9

5     SUPER      Is your position supervisory or      non-supervisory=0   10     num
                 non-supervisory?                     supervisory=1
                                                      missing=9

6     SEX        Sex: male, female                    male=0              11     num
                                                      female=1
                                                      missing=9

7     NEEDS      In what area would you like to       supervising=1       12     num
                 receive additional training?         budgeting=2
                                                      computers=3
                                                      personnel=4
                                                      other=5
                                                      missing=9
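
    A codebook like this one can also be kept in machine-readable form. The Python sketch below is one possible layout, not a standard format; the names CODEBOOK, record_width, and label_for are hypothetical. Column positions follow the Quality of Work Life data entry worksheet (DIVISION in column 4, JOBCLASS in column 9).

```python
# Each entry: (first column, last column, missing-data code, {code: label}).
# Columns are 1-based and inclusive.
CODEBOOK = {
    "ID":       (1, 3, None, {}),
    "DIVISION": (4, 4, "9", {"1": "Planning", "2": "Traffic",
                             "3": "Engineering", "4": "Enforcement"}),
    "LENGTH":   (5, 6, "99", {}),
    "TRAINING": (7, 8, "99", {}),
    "JOBCLASS": (9, 9, "9", {"1": "Management", "2": "Technical",
                             "3": "Administrative", "4": "Clerical"}),
    "SUPER":    (10, 10, "9", {"0": "non-supervisory", "1": "supervisory"}),
    "SEX":      (11, 11, "9", {"0": "male", "1": "female"}),
    "NEEDS":    (12, 12, "9", {"1": "supervising", "2": "budgeting",
                               "3": "computers", "4": "personnel", "5": "other"}),
}

def record_width(codebook):
    """Check that fields fill columns 1..N with no gaps or overlaps; return N."""
    next_col = 1
    for name, (start, end, _missing, _codes) in codebook.items():
        if start != next_col or end < start:
            raise ValueError(f"{name}: expected to start at column {next_col}")
        next_col = end + 1
    return next_col - 1

def label_for(variable, code):
    """Translate a stored code back into its response category."""
    _start, _end, missing, codes = CODEBOOK[variable]
    return "missing" if code == missing else codes.get(code, code)

print(record_width(CODEBOOK))      # total number of columns per record
print(label_for("JOBCLASS", "3"))
```

    Keeping the codes and column positions in one place means the same definitions can drive data entry, checking, and analysis.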
 
 

Tips on Coding:

1. Use numbers to represent response categories. For example,
 
on a scale of attitudes about work, 

5=Very satisfied 
4=Satisfied 
3=Neutral 
2=Dissatisfied 
1=Very Dissatisfied

on a survey of where city residents live, 

Central=1 
Eastside=2 
North=3 
Westside=4

on a survey of college majors, 

Business=1 
Education=2 
Engineering=3 
Health=4 
Liberal Arts=5 
Science=6

 

2. Use zero and one to code variables with binary response categories, such as:

Are you a supervisor? No=0 Yes=1
Sex: Male=0 Female=1
Are you at headquarters or in the field? Headquarters=0 Field=1

(Be sure to use the number zero, not the letter "O," and the number one, not the lowercase letter "l.")

3. The same data can be coded in more than one way. For example, the following data on what materials the library should acquire can be coded in two different ways:
 
data: 

-books on the middle ages 
-data bases 
-journals in criminal justice 
-videos & films 
-reference works 
-business reports 
-government documents 
-Internet contacts

Code for Subject Matter, e.g.: 

History 
Business 
Art 
Government

Code for type of material, e.g.: 

reference works 
electronic media 
books 
journals 
reports

 

4. One question on a questionnaire can yield more than one variable. For example: What type of training would you like to receive?

_____supervising _____budgeting _____computers _____personnel
 
 
This can be coded as one variable, 

TRAINING 
1=supervising 
2=budgeting 
3=computers 
4=personnel

Or as two variables, indicating first and second choices: 

TRAIN1 
1=supervising 
2=budgeting 
3=computers 
4=personnel 

TRAIN2 
1=supervising 
2=budgeting 
3=computers 
4=personnel

Or as four variables, indicating a yes/no preference for each type: 

TSUPER 
0=no  1=yes 

TBUDGET 
0=no  1=yes 

TCOMPUT 
0=no  1=yes 

TPERS 
0=no  1=yes
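
    The three coding schemes in tip 4 can be sketched in Python as follows. The respondent's answer, the function names, and the use of 9 as a missing-data code are assumptions made for illustration.

```python
OPTIONS = ["supervising", "budgeting", "computers", "personnel"]  # codes 1-4
MISSING = 9  # assumed missing-data code, following the codebook convention

def code_single(choices):
    """One variable (TRAINING): the code of the first choice listed."""
    return OPTIONS.index(choices[0]) + 1 if choices else MISSING

def code_ranked(choices):
    """Two variables (TRAIN1, TRAIN2): first and second choices."""
    codes = [OPTIONS.index(choice) + 1 for choice in choices[:2]]
    codes += [MISSING] * (2 - len(codes))
    return {"TRAIN1": codes[0], "TRAIN2": codes[1]}

def code_indicators(choices):
    """Four variables: 0=no, 1=yes for each type of training."""
    names = ["TSUPER", "TBUDGET", "TCOMPUT", "TPERS"]
    return {name: int(option in choices) for name, option in zip(names, OPTIONS)}

answer = ["computers", "budgeting"]  # hypothetical respondent, first choice first
print(code_single(answer))
print(code_ranked(answer))
print(code_indicators(answer))
```

    The single-variable scheme discards the second choice, the two-variable scheme keeps the order, and the four-variable scheme keeps every choice but not the order. A further convenience of 0/1 coding is that the mean of such a variable equals the proportion of "yes" answers.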

 

    The researcher has to try to anticipate how the data will look. A good idea of this can be gained from doing a pilot test of the instrument, and a dry run of the data collection process. It is important to be sure to leave enough columns to properly code the information for each variable, and to provide enough variables to capture all the richness, complexity, and variety of data that has been collected.

    If a sample of college students is asked about barriers they encounter in attempting to use the campus library, will students be asked to list the one main barrier, to rank order all the barriers, or to choose only the barriers relevant to them? And what if the students do not follow the instructions? Depending on what shape the data come in, the researcher will have to decide how to code this information, using one, two, or many variables.
 

C. Prepare the data matrix worksheets.

    When data are to be entered into a computer program for statistical analysis, usually this takes the form of a matrix. The variable names are entered at the tops of the columns which will contain the data for that variable, and the case records are entered across the rows.

Example:

Data Entry Worksheet: Quality of Work Life Codebook

Id    Division  Length  Training  Jobclass  Super  Sex  Needs
1-3   4         5-6     7-8       9         10     11   12
----  --------  ------  --------  --------  -----  ---  -----
001   3         22      15        4         0      1    4
002   1         1       3         2         1      0    1
003   2         9       99        3         0      0    3
 

    Each single numeral or character that is entered into a computer program takes up one column of space. Each datum can be found by knowing its location by column number in the matrix.

    Columns 1 through 3 taken together represent the questionnaire (ID) number assigned to the person.
Column 4 represents the division worked in.
Columns 5-6 represent the length of time employed.
Columns 7-8 represent the number of training classes taken (note that the information on number of classes taken is missing for person number 003).
Column 9 represents the person's job classification.
Column 10 indicates whether the person is a supervisor or not.
Column 11 indicates whether the person is male or female.
Column 12 indicates what type of training the person wants in the future.
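
    The column positions listed above are enough to pull each datum out of a record. A minimal Python sketch, assuming the record for person 001 is stored as one 12-character line with no separators (LAYOUT and parse_record are hypothetical names):

```python
# (first column, last column) for each variable, 1-based and inclusive.
LAYOUT = {
    "id": (1, 3), "division": (4, 4), "length": (5, 6), "training": (7, 8),
    "jobclass": (9, 9), "super": (10, 10), "sex": (11, 11), "needs": (12, 12),
}

def parse_record(line):
    """Slice one 12-column record into named fields (values stay as text)."""
    return {name: line[start - 1:end] for name, (start, end) in LAYOUT.items()}

# Person 001 from the worksheet: 001 3 22 15 4 0 1 4
print(parse_record("001322154014"))
```

    Because Python counts positions from zero, column 5 of the worksheet is position 4 in the string; the start - 1 in the slice makes that adjustment.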

    Each record, case, questionnaire, or other unit of analysis is represented by a single row of data across the matrix. For example, person 001 is found in row 1; person 002 in row 2; and person 003 in row 3.

    Each record must be entered in exactly the same way. When each variable always occupies the same fixed columns, this is referred to as fixed-field format. If data are missing for a record on any of the variables, something must still be entered into that field. Usually this is a number indicating that the data are missing. For a one-column field, use the number 9; for a two-column field, use 99; and so forth. Just make sure that "9" or "99" is not also a valid response. If it is, use some other number; some computer programs will allow you to use a period (".") as a placeholder that is also an indicator of missing data.
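
    The following Python sketch builds a fixed-field record and fills missing values with 9s. It assumes that short values are padded with leading zeros to the width given in the codebook (so 9 years is entered as 09); WIDTHS and format_record are hypothetical names.

```python
# Number of columns each variable occupies, in record order.
WIDTHS = {"id": 3, "division": 1, "length": 2, "training": 2,
          "jobclass": 1, "super": 1, "sex": 1, "needs": 1}

def format_record(values):
    """Build one fixed-field record from a dictionary of values."""
    parts = []
    for name, width in WIDTHS.items():
        value = values.get(name)
        # A missing value becomes 9, 99, 999, ... so the field is never left empty.
        text = "9" * width if value is None else str(value).zfill(width)
        if len(text) > width:
            raise ValueError(f"{name}: {text!r} does not fit in {width} column(s)")
        parts.append(text)
    return "".join(parts)

# Person 003 from the worksheet: the number of training sessions is missing.
person_003 = {"id": 3, "division": 2, "length": 9, "training": None,
              "jobclass": 3, "super": 0, "sex": 0, "needs": 3}
print(format_record(person_003))
```

    Every record produced this way is exactly 12 characters long, so each variable always lands in the same columns.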

    When you ask the computer, for example, to compute the average length of time employed of all the employees in your survey, the computer will look in columns 5-6 of each record. It will take whatever it finds there, and attempt to compute an average. It is important, therefore, that all length of employment data be in columns 5-6 for every record, and that no other type of data be in columns 5-6. The computer will disregard missing data codes (i.e., values of "99") in computing the average, provided those codes have been declared to the program as missing values; otherwise it will treat 99 as a real number of years.
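
    A short Python sketch of this calculation follows. The three records are the worksheet rows written in fixed-field form, with single-digit values padded with a leading zero (an assumption); mean_of_field is a hypothetical helper, not a function of any statistical package.

```python
# Persons 001-003 as 12-column fixed-field records.
records = ["001322154014", "002101032101", "003209993003"]

def mean_of_field(records, start, end, missing):
    """Average the values in columns start..end, skipping the missing-data code."""
    values = [int(record[start - 1:end]) for record in records]
    valid = [value for value in values if value != missing]
    return sum(valid) / len(valid) if valid else None

print(mean_of_field(records, 5, 6, missing=99))  # length of employment
print(mean_of_field(records, 7, 8, missing=99))  # training sessions attended
```

    For training sessions, only persons 001 and 002 contribute (15 and 3), giving an average of 9.0. If the code 99 were not excluded, the average would be inflated to 39.0.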

    Many computer programs have a limitation of a total of 80 columns of data per record. This is a holdover from when data were punched on cardboard cards that were fed into card readers, rather than entering data directly into the computer. If your data require more than 80 columns, you will have to construct additional data matrices to record the remainder of the information for each record.
 

D. Prepare instructions for data entry and data analysis.

    Data coding may be done directly on the data collection instrument (e.g., questionnaire) and then transferred to the data coding sheets, or entered directly into the computer. It is important to prepare detailed instructions for data coding and data entry, especially if these tasks are shared among or performed by several different people.

    There are a number of statistical, spreadsheet, and data base programs that can be used for data entry. Most programs will save the data and allow them to be output as a plain text or ASCII file, which is accepted by most statistical programs, such as SAS, SPSS, or Stata. Most of these programs are available in a desktop version, and many also come in cheaper student versions, such as Student Stata and Mystat.

    There are also a number of stand-alone products such as DataPerfect, which can be easily programmed to look just like the data collection instrument, making data entry quite easy and eliminating the need for a data entry matrix to be filled in. These programs also have built-in safeguards, so that, for example, alpha-numeric data cannot be entered into a variable that is for numeric data only; data are constrained to a limited number of columns so that four digits can't be entered into a three-digit variable; etc.
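
    Safeguards of this kind can be approximated in a few lines of Python. The sketch below is a generic illustration and does not describe how DataPerfect or any other product works; validate_entry and its parameters are hypothetical.

```python
def validate_entry(text, width, numeric=True, allowed=None):
    """Return a list of problems with one keyed-in value; an empty list means it passes."""
    problems = []
    if len(text) != width:
        problems.append(f"expected {width} characters, got {len(text)}")
    if numeric and not (text.isascii() and text.isdigit()):
        problems.append("non-numeric characters in a numeric field")
    if allowed is not None and text not in allowed:
        problems.append("value is not a valid code")
    return problems

print(validate_entry("4", 1, allowed={"1", "2", "3", "4", "9"}))  # valid JOBCLASS code
print(validate_entry("1O", 2))    # letter O typed instead of zero
print(validate_entry("1234", 3))  # four digits in a three-digit variable
```

    Running such checks at entry time catches errors while the original questionnaire is still at hand.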