Data Processing and Descriptive Analysis
I. Today’s class
--review procedures for data processing
--illustrate some of these issues, particularly missing data, with NELS 8th grade survey
--review of Teleform
II. What’s involved in data processing?
--coding data
--entering data
--managing data
--cleaning data
--recoding data
--handling missing data
--creating scales and indexes
--generating descriptive statistics
III. Coding Data
--coding scheme is a set of rules for creating usable data from questionnaire responses
--should be done as early as possible in design stage
--should reflect how data will be used in analysis (e.g., creation of dummy variables for multivariate analysis)
--should provide unique codes for various types of valid and invalid answers as well as non-responses
e.g., NELS88
BYP40 NO. OF TIMES 8TH GRADER CHANGED SCHOOLS
                                    Frequency  Percent  Valid Percent  Cumulative Percent
Valid    0  NONE                         9584     39.0           43.3                43.3
         1  ONCE                         5093     20.7           23.0                66.3
         2  TWICE                        2502     10.2           11.3                77.6
         3  THREE TIMES                  2312      9.4           10.4                88.0
         4  FOUR TIMES                   1328      5.4            6.0                94.0
         5  FIVE OR MORE TIMES           1323      5.4            6.0               100.0
         Total                          22142     90.0          100.0
Missing  96 {MULTIPLE RESPNSE}              3      0.0
         98 {MISSING}                     506      2.1
         System                          1948      7.9
         Total                           2457     10.0
Total                                   24599    100.0
--coding can be done in advance or after the data are entered into the computer (recoding)
--once finished, should create a codebook that describes variables and procedures for coding the data, as well as variable and value labels
(e.g., example from NELS—electronic codebook and SPSS)
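A minimal sketch of how a codebook entry with separate valid and missing codes can drive a frequency table like the BYP40 one above. The CODEBOOK structure and the frequency_table helper are hypothetical illustrations, not part of NELS or SPSS; the counts are the ones reported in the table, and None stands in for system-missing.

```python
from collections import Counter

# Hypothetical codebook entry; value labels follow the BYP40 table.
CODEBOOK = {
    "BYP40": {
        "label": "No. of times 8th grader changed schools",
        "valid": {0: "None", 1: "Once", 2: "Twice", 3: "Three times",
                  4: "Four times", 5: "Five or more times"},
        "missing": {96: "Multiple response", 98: "Missing"},
    }
}

def frequency_table(values, entry):
    """Return {code: (count, percent, valid_percent)}.

    Percent uses all cases; valid percent uses only cases with a valid code.
    Missing codes and system-missing (None) get valid_percent = None.
    """
    counts = Counter(values)
    n_total = len(values)
    n_valid = sum(counts[code] for code in entry["valid"])
    table = {}
    for code, count in counts.items():
        percent = round(100 * count / n_total, 1)
        if code in entry["valid"]:
            valid_percent = round(100 * count / n_valid, 1)
        else:
            valid_percent = None
        table[code] = (count, percent, valid_percent)
    return table

# Rebuild raw responses from the reported counts.
reported = {0: 9584, 1: 5093, 2: 2502, 3: 2312, 4: 1328, 5: 1323,
            96: 3, 98: 506, None: 1948}
responses = [code for code, n in reported.items() for _ in range(n)]

table = frequency_table(responses, CODEBOOK["BYP40"])
print(table[0])   # count, percent, valid percent for "None"
print(table[98])  # a reserved missing code has no valid percent
```

Keeping the missing codes in the codebook rather than hard-coding them in the analysis makes it harder to treat 96 or 98 as a real number of school changes by accident.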
IV. Data Entry Procedures
A. Methods of data entry
--Interviewer coding during the interview
--data entry person
--computer-assisted techniques
--computer-assisted telephone interviewing (CATI)
--computer-assisted personal interviewing (CAPI) or Data Entry Builder
--computer-assisted survey entry (e.g., Teleform)
B. Issues to address
--accuracy
--speed
--cost
V. Data Cleaning
--once data are entered into the computer, they should be verified
--data verification can be done a number of ways
--double-entry (SPSS Teleform allows this)
--visual checking
--need decision rules for handling invalid responses
--merging files (e.g., different surveys, different units of analysis, other data sources)
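The verification steps above can be sketched as follows: a double-entry comparison that lists every disagreement between two keying passes, plus a range check that flags values outside the coding scheme. The function names, case ids, and records are invented for illustration; this is not how Teleform performs verification.

```python
def compare_entries(first_pass, second_pass):
    """Return (case_id, variable, value1, value2) for each disagreement."""
    mismatches = []
    for case_id in sorted(first_pass.keys() | second_pass.keys()):
        rec1 = first_pass.get(case_id, {})
        rec2 = second_pass.get(case_id, {})
        for var in sorted(rec1.keys() | rec2.keys()):
            if rec1.get(var) != rec2.get(var):
                mismatches.append((case_id, var, rec1.get(var), rec2.get(var)))
    return mismatches

def out_of_range(records, allowed):
    """Return (case_id, variable, value) for values not in the coding scheme."""
    flagged = []
    for case_id in sorted(records):
        for var, value in records[case_id].items():
            if var in allowed and value not in allowed[var]:
                flagged.append((case_id, var, value))
    return flagged

# Two independent entries of the same (invented) questionnaires.
entry_a = {101: {"BYP40": 1}, 102: {"BYP40": 3}, 103: {"BYP40": 7}}
entry_b = {101: {"BYP40": 1}, 102: {"BYP40": 2}, 103: {"BYP40": 7}}

# Allowed codes: valid answers 0-5 plus reserved missing codes 96 and 98.
ALLOWED = {"BYP40": {0, 1, 2, 3, 4, 5, 96, 98}}

print(compare_entries(entry_a, entry_b))  # case 102 was keyed differently
print(out_of_range(entry_a, ALLOWED))     # case 103 agrees twice but is invalid
```

Case 103 shows why both checks are needed: double entry catches keying slips, but only a decision rule for invalid responses catches a value that was keyed consistently and is still outside the coding scheme.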
VI. Data Management
--different types of files
--data files: actual data
--system files: software conversion of data files
--important to understand how data files are structured
--rows represent cases (units of analysis) and columns represent variables
--e.g. school attendance data
--important to code a case id for reference and for merging different sources of data
--important to name files to identify source of file and type of data file
--e.g., data files (*.dat), system files (*.sav), syntax files (*.sps), output files (*.spo)
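To illustrate why a case id matters for merging, here is a small sketch that attaches school-level data to student-level records (two different units of analysis). All field names (stu_id, sch_id, absences, sch_type) and values are hypothetical.

```python
def merge_student_school(students, schools):
    """Attach school attributes to each student record via the school id.

    Returns (merged_records, unmatched_student_ids) so that cases without
    a matching school are reported rather than silently dropped.
    """
    merged, unmatched = [], []
    for student in students:
        school = schools.get(student["sch_id"])
        if school is None:
            unmatched.append(student["stu_id"])
            continue
        merged.append({**student, **school})
    return merged, unmatched

# Student file: one row per student (invented attendance data).
students = [
    {"stu_id": 1, "sch_id": 10, "absences": 4},
    {"stu_id": 2, "sch_id": 20, "absences": 0},
    {"stu_id": 3, "sch_id": 99, "absences": 7},  # no such school on file
]

# School file: one row per school, keyed by school id.
schools = {
    10: {"sch_type": "public"},
    20: {"sch_type": "private"},
}

merged, unmatched = merge_student_school(students, schools)
print(merged)
print(unmatched)
```

Returning the unmatched ids separately is a deliberate choice: a merge that quietly loses cases creates unit non-response that no one planned for.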
VII. Recoding Data
--often useful to recode original data into new variables
--different types and uses for recoded variables
1. recoding continuous variables into categorical ones (e.g., NELS SES)
2. collapsing existing categorical variable into smaller number of categories
3. converting verbatim responses to categories (e.g., occupations)
4. changing existing values to new ones to facilitate analysis (e.g., dummy variables)
--important to retain original as well as recoded variables in case you want to try new recoding procedures
--useful to distinguish between original and recoded variables in dataset (e.g., NELS: original variables labeled with question numbers, composites and other variables with other labels)
--e.g., BYP40 (original questionnaire item); BYSES (NCES-created composite)
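A sketch of three of the recoding types listed above, applied to invented records: a continuous variable cut into quartiles, a categorical variable collapsed into fewer categories, and a dummy variable. The new variable names (BYSES_Q, MOVES3, MOVED) and the data are hypothetical; the original variables are left in place next to the recoded ones.

```python
from bisect import bisect_left
from statistics import quantiles

MISSING_CODES = {96, 98}  # reserved codes from the BYP40 example

# Invented records; BYSES values are illustrative, not real NELS scores.
records = [
    {"BYSES": -1.2, "BYP40": 0}, {"BYSES": -0.5, "BYP40": 1},
    {"BYSES": -0.1, "BYP40": 5}, {"BYSES": 0.0, "BYP40": 98},
    {"BYSES": 0.3, "BYP40": 2}, {"BYSES": 0.8, "BYP40": 0},
    {"BYSES": 1.1, "BYP40": 3}, {"BYSES": 1.6, "BYP40": 1},
]

# 1. Continuous -> categorical: quartile cut points from the sample itself.
cuts = quantiles([r["BYSES"] for r in records], n=4)

def collapse_moves(code):
    """2. Collapse 0-5 into 0 = none, 1 = once, 2 = twice or more."""
    if code is None or code in MISSING_CODES:
        return None
    return min(code, 2)

def moved_dummy(code):
    """4. Dummy variable: 1 if the student ever changed schools, else 0."""
    if code is None or code in MISSING_CODES:
        return None
    return 1 if code >= 1 else 0

for rec in records:
    rec["BYSES_Q"] = 1 + bisect_left(cuts, rec["BYSES"])  # quartile 1..4
    rec["MOVES3"] = collapse_moves(rec["BYP40"])
    rec["MOVED"] = moved_dummy(rec["BYP40"])

print(records[2])  # BYP40 = 5 is kept; MOVES3 and MOVED are added beside it
```

Missing codes are mapped to None in the recoded variables so that 98 is never collapsed into the "twice or more" category.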
VIII. Dealing with Missing Data
--Q: Why is it important to adequately address the issue of missing data?
--A: non-response is often not random and can therefore introduce systematic bias
--distinction between ignorable non-response (not related to the dependent or criterion variable) and non-ignorable non-response (related to these critical variables)
--you can easily check this
--e.g., NELS Base Year Sample Design Report
(http://nces.ed.gov/pubs90/90463.pdf)
--p. 24 Unit non-response of students, parents, teachers, schools
--p. 37—School non-response bias estimates
--pp. 45-46 Item non-response: Student non-response by selected student characteristics
--if related, important to have systematic way of dealing with missing data
--two types of non-response:
1. unit non-response: missing cases
2. item non-response: missing values
--non-response can vary widely from variable to variable (e.g., NELS)
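One simple way to check whether item non-response looks random is to compare missing-data rates across groups, in the spirit of the student-characteristics tables cited above. The sketch below uses invented records and a hypothetical grouping variable (ses_quartile); the missing codes are those from the BYP40 example.

```python
from collections import defaultdict

MISSING_CODES = {96, 98}  # None stands in for system-missing

def is_missing(value):
    return value is None or value in MISSING_CODES

def item_nonresponse_by_group(records, item, group):
    """Return {group value: share of cases with a missing value on item}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        totals[rec[group]] += 1
        if is_missing(rec.get(item)):
            missing[rec[group]] += 1
    return {g: missing[g] / totals[g] for g in totals}

# Invented records for illustration only (not NELS data).
records = [
    {"ses_quartile": 1, "BYP40": 98},
    {"ses_quartile": 1, "BYP40": None},
    {"ses_quartile": 1, "BYP40": 2},
    {"ses_quartile": 1, "BYP40": 0},
    {"ses_quartile": 4, "BYP40": 1},
    {"ses_quartile": 4, "BYP40": 0},
    {"ses_quartile": 4, "BYP40": 98},
    {"ses_quartile": 4, "BYP40": 3},
]

rates = item_nonresponse_by_group(records, "BYP40", "ses_quartile")
print(rates)
```

If the rates differ noticeably across groups that also differ on the outcome of interest, the non-response is unlikely to be ignorable.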
A. Unit non-response
--can compare sample with population to see if there is any apparent bias
--example: Rumberger, R.W., Ghatak, R., Poulos, G., Ritter, P.L., & Dornbusch, S.M. (1990). Family influences on dropout behavior in one California high school. Sociology of Education, 63, 283-299 (http://www.jstor.org/view/00380407/di975478/97p0510f/7?currentResult=00380407%2bdi975478%2b97p0510f%2b0%2c01%2b19901000%2b9995%2b80098999&searchID=8dd55340.10533586740&sortOrder=SCORE&config=jstor&frame=noframe&userID=806f7d52@ucsb.edu/018dd553400050b9c745&dpi=3)
--can use population weighting to adjust for selection bias and unit non-response bias
--this is done with NELS (for overview of techniques in all NCES surveys, see: http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2003603)
--e.g., NELS Base Year Sample Design Report
--p. 32 sample weights for schools and students
--also need to adjust standard errors because of non-random design
--either by adjusting individual standard errors or adjusting the effective weight
--e.g., NELS Base Year Sample Design report
--p. 54: SE = DEFT * (Var/n)^(1/2)
--p. 51 Mean Design Effects (DEFFs)
--adjust weights: SWTADJ = PNLWT/[mean(PNLWT) * mean design effect]
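The two adjustments above can be sketched numerically. The code assumes the standard relationship DEFT = sqrt(DEFF); the values, weights, and design effects are invented, and the function names are hypothetical.

```python
import math
from statistics import mean, variance

def design_adjusted_se(values, deff):
    """SE = DEFT * sqrt(Var/n), with DEFT taken as sqrt(DEFF)."""
    srs_se = math.sqrt(variance(values) / len(values))
    return math.sqrt(deff) * srs_se

def adjusted_weights(weights, mean_deff):
    """SWTADJ = PNLWT / (mean(PNLWT) * mean design effect)."""
    scale = mean(weights) * mean_deff
    return [w / scale for w in weights]

# Invented example: four scores and a design effect of 4 (so DEFT = 2).
scores = [2, 4, 4, 6]
print(design_adjusted_se(scores, deff=4.0))

# Invented panel weights and a mean design effect of 2.
pnlwt = [100, 200, 300]
swtadj = adjusted_weights(pnlwt, mean_deff=2.0)
print(swtadj, sum(swtadj))
```

Note what the weight adjustment does: the adjusted weights sum to n divided by the mean design effect (here 3 / 2 = 1.5), so software that treats the sum of weights as the sample size computes standard errors from a smaller effective sample.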
B. Item non-response
--two ways of dealing with missing data:
1. exclude all cases with missing data (most common method)
2. impute values for missing cases, such as recoding them to the mean of the valid cases and perhaps making some other adjustments
--Q: What’s wrong with the first method?
--A: It is, in fact, an implicit method of data imputation: it assumes missing cases have the same distribution of values as the valid cases (often not the case)
--types of imputing (won’t go into all these in any detail)
1. mean substitution
--common method
--assumes respondents and nonrespondents similar
--reduces variance and therefore decreases S.E. and can give false positives [SE = (Var/n)^(1/2)]
2. single predictive imputation (e.g., regression model)
3. single predictive with stochastic term
4. multiple imputation
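A small worked example of why mean substitution understates the standard error, using SE = (Var/n)^(1/2). The data are invented, with None marking a missing value; the helper names are hypothetical.

```python
import math
from statistics import mean, variance

def standard_error(values):
    """SE of the mean: sqrt(sample variance / n)."""
    return math.sqrt(variance(values) / len(values))

# Invented item with two missing values.
observed = [2.0, 4.0, None, 6.0, None, 8.0]

# Method 1: exclude cases with missing data (listwise deletion).
valid = [v for v in observed if v is not None]

# Method 2: mean substitution, replacing each missing value by the valid mean.
valid_mean = mean(valid)
imputed = [valid_mean if v is None else v for v in observed]

print(mean(valid), standard_error(valid))      # n = 4
print(mean(imputed), standard_error(imputed))  # n = 6, same mean, smaller SE
```

The imputed cases sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing n. Both the variance and the standard error fall even though no new information was collected, which is the source of the false positives noted above.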
IX. Creating Scales and Indexes
--scales and indexes are composite measures created by combining a number of individual variables
--they provide a useful way of reducing data (improves degrees of freedom and interpretation of variables that measure similar attributes)
--terms are often used interchangeably, although an index is simply an additive composite measure, while scales are usually created by more elaborate means, such as factor analysis
--e.g., NELS SES composite
--can be done a priori, based on previous research, or ex post facto, based on inductive analysis of data (factor analysis)
--example: Sui-Chu, E.H. & Willms, J.D. (1996). Effects of parental involvement on eighth-grade achievement. Sociology of Education, 69, 126-141 (http://www.jstor.org/view/00380407/di975500/97p0125r/5?currentResult=00380407%2bdi975500%2b97p0125r%2b0%2c01%2b19960400%2b9995%2b80039599&searchID=cc99333c.10533616620&sortOrder=SCORE&config=jstor&frame=noframe&userID=806f7d52@ucsb.edu/018dd553400050b9c745&dpi=3).
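A minimal sketch of an additive index: each item is standardized and the z-scores are averaged per case. The item names and values are invented, and this is only an illustration of the general idea of a composite, not the NCES procedure for building the NELS SES variable.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize a list of values to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def additive_index(items):
    """items: {item name: list of values in the same case order}.

    Returns one composite score per case: the mean of that case's z-scores.
    """
    standardized = [zscores(values) for values in items.values()]
    return [mean(case_scores) for case_scores in zip(*standardized)]

# Invented items measured on very different scales (four cases).
items = {
    "parent_educ_years": [12, 16, 14, 18],
    "family_income_k": [20, 60, 40, 80],
    "occ_prestige": [30, 50, 40, 60],
}

index = additive_index(items)
print(index)
```

Standardizing first keeps an item with a large numeric range (income here) from dominating the sum. This sketch assumes complete data on every item; cases with item non-response would need one of the procedures from section VIII first.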
X. Generating Descriptive Statistics
--common to describe data used in study: means and standard deviations
--sometimes useful to show interesting and significant bivariate relationships between key dependent variables (rows) and independent variables (columns)
--example: Alexander, K.L., Entwisle, D.R., & Horsey, C. (1997). From first grade forward: Early foundations of high school dropout. Sociology of Education, 70, 87-107 (http://www.jstor.org/view/00380407/di020062/02p0009o/12?currentResult=00380407%2bdi020062%2b02p0009o%2b0%2c01%2b19970400%2b9995%2b80029599&searchID=cc99333c.10533613830&sortOrder=SCORE&config=jstor&frame=noframe&userID=806f7d52@ucsb.edu/018dd553400050b9c745&dpi=3).
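The two kinds of descriptive output mentioned above can be sketched with invented records: means and standard deviations for a variable, and a bivariate table with the dependent variable in rows and the independent variable in columns, reported as column percentages. The variable names (dropout, ses_group, gpa) and values are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

# Invented records for illustration only.
records = [
    {"dropout": 1, "ses_group": "low", "gpa": 2.0},
    {"dropout": 1, "ses_group": "low", "gpa": 2.5},
    {"dropout": 0, "ses_group": "low", "gpa": 3.0},
    {"dropout": 0, "ses_group": "low", "gpa": 2.5},
    {"dropout": 0, "ses_group": "high", "gpa": 3.0},
    {"dropout": 0, "ses_group": "high", "gpa": 3.5},
    {"dropout": 0, "ses_group": "high", "gpa": 4.0},
    {"dropout": 1, "ses_group": "high", "gpa": 3.5},
]

def describe(records, var):
    """n, mean, and standard deviation for one variable."""
    values = [r[var] for r in records]
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}

def crosstab_column_pct(records, row_var, col_var):
    """Column percentages keyed by (row value, column value)."""
    cells = Counter((r[row_var], r[col_var]) for r in records)
    col_totals = Counter(r[col_var] for r in records)
    return {(row, col): 100 * n / col_totals[col]
            for (row, col), n in cells.items()}

print(describe(records, "gpa"))
table = crosstab_column_pct(records, "dropout", "ses_group")
print(table[(1, "low")], table[(1, "high")])  # dropout rate within each SES group
```

Percentaging within columns (the independent variable) lets the reader compare the dependent variable across groups directly.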
XI. Illustration of these procedures with NELS
A. Unit non-response: weighting
B. Item non-response
C. Procedures for dealing with missing data
1. Excluding cases
2. Substituting mean
3. Estimating values