Sample Weights & Design Effects

The NLSY97 sampling weights, which are constructed in each survey year, provide the researcher with an estimate of how many individuals in the United States are represented by each NLSY97 respondent. Individual case weights are assigned to produce group population estimates when used in tabulations.

Important Information About Custom Weights

Custom weighting program online: Users who need longitudinal weights for multiple survey years, or weights for a specific set of respondent IDs, can create custom weights on the NLSY97 Custom Weighting page.

Introduction to Weighting

Weighting is a challenging subject. Researchers must first decide whether to weight the sample at all. If they decide to weight, they must then determine which weight variable to use. This can be a difficult decision because the NLSY97 dataset contains more than 30 different pre-created weight variables.

What does weighting do? Weights make the sample representative of the population of interest and ensure that other objectives are met. Weights are particularly important when over-sampling occurs, and all NLS data sets use over-sampling. Over-sampling is the selection of a large number of additional respondents who match certain criteria. In the NLSY97, over-sampling was done to allow researchers to measure changes in key populations, such as blacks and Hispanics, more precisely. Over-sampling affects population descriptors, such as means and medians, because the NLSY97 sample contains a larger share of black and Hispanic respondents than the U.S. population does. If the data are not adjusted, the greater number of black and Hispanic respondents skews population averages toward black and Hispanic averages. Weighting reduces the influence of each black and Hispanic respondent and removes that bias. If a user summarizes characteristics of the population but ignores the weights, the results are biased.
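The effect of over-sampling on an unweighted mean can be illustrated with a small sketch. The four records below are invented for illustration (they are not NLSY97 data), and the function names are hypothetical. The point is only that the over-sampled group pulls the unweighted mean toward its own average, while the weighted mean gives each group its population share.

```python
# Hypothetical toy sample: (group, outcome, weight). All values are invented.
# Three of four records come from an over-sampled group, but each of those
# records stands for far fewer people than the single "other" record.
sample = [
    ("oversampled", 10.0, 100_000),
    ("oversampled", 12.0, 100_000),
    ("oversampled", 11.0, 100_000),
    ("other", 20.0, 900_000),
]


def unweighted_mean(rows):
    """Plain average of the outcome, ignoring the weights."""
    return sum(y for _, y, _ in rows) / len(rows)


def weighted_mean(rows):
    """Average of the outcome, with each record counted in proportion to its weight."""
    total_weight = sum(w for _, _, w in rows)
    return sum(y * w for _, y, w in rows) / total_weight


print(unweighted_mean(sample))  # 13.25, pulled toward the over-sampled group
print(weighted_mean(sample))    # 17.75, reflects the groups' population shares
```

In this toy sample the over-sampled group supplies 75 percent of the records but only 25 percent of the total weight, which is why the two means differ.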

Weights for the NLSY97 range from a high of around 1.7 million to a low of 86,000. What do these numbers mean? All NLSY97 weights contain two implied decimal places. To interpret a weight, divide it by 100. For example, a respondent with a weight of 1.7 million represents 17,000 people, while a respondent with a weight of 86,000 represents 860 people. If an NLSY97 respondent has a weight of 0, he or she did not participate in that survey round.
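The conversion described above is a single division. A minimal sketch follows; the helper names people_represented and participated are hypothetical and are not part of any NLS tool.

```python
def people_represented(raw_weight: int) -> float:
    """Convert a raw weight (two implied decimal places) into a count of people."""
    return raw_weight / 100


def participated(raw_weight: int) -> bool:
    """A weight of 0 means the respondent did not participate in that round."""
    return raw_weight != 0


print(people_represented(1_700_000))  # 17000.0
print(people_represented(86_000))     # 860.0
print(participated(0))                # False
```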

User Note

If you are creating descriptive statistics such as means, medians, and standard deviations, we suggest that you weight your results. If you are running a more complex analysis, such as a regression, we suggest that you do not weight. For more details on when to use weights, see Practical Usage of Weights.

Types of Weights

Weights are found under the "Sample Design & Screening" Area of Interest in Investigator. Weights also can be found by searching for the word "Weight" in the Word in Title search option. There are two sampling weight variables available in each round:

SAMPLING_WEIGHT_CC. Provides a weight for everyone who participated in that particular round of surveying, using a special method of combining the cross-sectional and over-sample cases. This method makes the weight of an oversampled person invariant to which sample the person was drawn from. This reduces the variation in weights and hence improves the statistical efficiency of weighted estimators.

SAMPLING_PANEL_WEIGHT. This weights only people who are in every round from 1 to N. Those not in every round get a 0 weight. It is used when data are needed on individuals who participated in all rounds.
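The membership rule behind the panel weight (a non-zero weight only for respondents interviewed in every round from 1 to N) can be sketched as follows. The respondent IDs, the participation records, and the choice of N = 3 are invented for illustration. The sketch shows only who would receive a non-zero panel weight, not how the weight values themselves are calculated.

```python
# Hypothetical participation records: respondent ID -> rounds in which the
# respondent was interviewed. N is the latest round under consideration.
N = 3
rounds_completed = {
    "A": {1, 2, 3},
    "B": {1, 3},      # missed round 2
    "C": {1, 2, 3},
    "D": {2, 3},      # missed round 1
}

all_rounds = set(range(1, N + 1))

# A respondent belongs to the panel only if interviewed in every round 1..N.
in_panel = {rid: rounds >= all_rounds for rid, rounds in rounds_completed.items()}
panel_ids = sorted(rid for rid, ok in in_panel.items() if ok)

print(panel_ids)  # ['A', 'C']; B and D would carry a panel weight of 0
```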

Rounds 1 and 2 also include two additional sampling weight variables, SAMPLING_WEIGHT and CS_SAMPLING_WEIGHT. Starting in round 3, NLS survey staff adopted a new, more statistically efficient method of calculating survey weights called the "Cumulating Cases" strategy. Because the new method provided better information, weights were recalculated starting with round 1. However, because research had already been published using the original sampling weight variables, those variables were left in the database so that older work can be replicated.

For more details about how the weights were calculated, see Methodology for Calculating Weights.

User Note

Need a quick check to see the impact of weights for a complex statistical situation? Use R12361.01, which is the Round 1 Sampling Weight Cumulative Cases Method. This variable provides a weight for every NLSY97 respondent and adjusts for the over-sampling of blacks and Hispanics. It does not adjust for round-by-round non-response.

Practical Usage of Weights

Researchers should weight the observations using the weights provided if tabulating sample characteristics in order to describe the population represented (i.e., computing sample means, totals, or proportions). The use of weights may not be appropriate without other adjustments for the following applications:

Samples Generated by Dropping Observations with Item Nonresponses: Users often confine their analysis to subsamples of respondents who provided valid answers to certain questions. In this case, a weighted mean does not represent the entire population, but rather those persons in the population who would have given a valid response to the specified questions. Item nonresponse due to refusals, don't knows, or invalid skips is usually quite small, so the degree to which the weights are incorrect is probably quite small as well. When item nonresponse affects only a small proportion of cases for the variables under analysis, population estimates (i.e., weighted sample means, medians, and proportions) are reasonably accurate. However, population estimates based on data items that have relatively high nonresponse rates, such as family income, may not be representative of the underlying population of the cohort under analysis.

Data from Multiple Waves: Because the weights are specific to a single wave of the study, and because respondents occasionally miss an interview but are contacted in a subsequent wave, a problem similar to item nonresponse arises when the data are used longitudinally. In addition, the weights for a respondent in different years occasionally may be quite dissimilar, leaving the user uncertain as to which weight is appropriate. In principle, if a user wished to apply weights to multiple wave data, weights would have to be recomputed based upon the persons for whom complete data are available. In practice, if the sample is limited to respondents interviewed in a terminal or end point year, the weight for that year can be used.

Users may also create longitudinal weights for multiple survey years by using the NLSY97 Custom Weighting program.
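The terminal-year approach described above can be sketched as follows: keep only respondents interviewed in the end-point round (non-zero weight in that round) and apply that round's weight to the longitudinal outcome. The respondent IDs, weights, and outcome values are invented for illustration.

```python
# Hypothetical records: respondent ID -> (raw weight in the terminal round,
# outcome built from several waves). All values are invented.
records = {
    "A": (250_000, 4.0),
    "B": (0, 9.0),        # not interviewed in the terminal round, so weight is 0
    "C": (750_000, 8.0),
}

# Restrict the sample to respondents interviewed in the terminal round.
kept = {rid: (w, y) for rid, (w, y) in records.items() if w > 0}

# Weight the outcome by the terminal-round weight.
total_weight = sum(w for w, _ in kept.values())
terminal_weighted_mean = sum(w * y for w, y in kept.values()) / total_weight

print(sorted(kept))             # ['A', 'C']
print(terminal_weighted_mean)   # 7.0
```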

Regression Analysis: A common question is whether one should use the provided weights to perform weighted least squares when doing regression analysis. Such a course of action may lead to incorrect estimates. If particular groups follow significantly different regression specifications, the preferred method of analysis is to estimate a separate regression for each group or to use indicator variables to specify group membership; regression on a random sample of the population would be misspecified. Users uncertain about the appropriate method should consult an econometrician, statistician, or other person knowledgeable about the data before specifying the regression model.
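As an illustration of estimating a separate regression for each group, the sketch below fits an unweighted one-variable least-squares line within each of two groups. The data are invented so that the groups follow visibly different specifications, and fit_line is a hypothetical helper. This is a toy example under those assumptions, not a recommended modeling procedure.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (single regressor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: group label -> (x values, y values). Values are invented.
groups = {
    "group_1": ([1, 2, 3, 4], [2, 4, 6, 8]),   # follows y = 0 + 2x
    "group_2": ([1, 2, 3, 4], [5, 6, 7, 8]),   # follows y = 4 + 1x
}

# One unweighted regression per group, rather than one pooled weighted fit.
fits = {name: fit_line(xs, ys) for name, (xs, ys) in groups.items()}

for name, (intercept, slope) in fits.items():
    print(name, intercept, slope)
```

A single pooled line fitted to both groups would describe neither of them, which is the misspecification the paragraph above warns about.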

Analysis by race: For research that includes analysis by race, using the regular sampling weights rather than the cross-sectional weights will produce results with higher precision for black and Hispanic or Latino youths. For research that focuses only on non-black, non-Hispanic youths or that does not include any analysis by race/ethnicity, using the cross-sectional weights will save processing time.