Sample weights
In each survey year a set of sampling weights is constructed. These weights provide the researcher with an estimate of how many individuals in the United States each respondent's answers represent. Weighting decisions for the NLSY79 are guided by the following principles:
- individual case weights are assigned for each year in such a way as to produce group population estimates when used in tabulations
- the assignment of individual respondent weights involves at least three types of adjustment, with additional considerations necessary for weighting of NLSY79 Child data
The interested user should consult the NLSY79 Technical Sampling Report (Frankel, Williams, and Spencer 1983) for a step-by-step description of the adjustment process. A cursory review of the process follows.
- Adjustment One. The first weighting adjustment involves the reciprocal of the probability of selection at the first interview. Specifically, this probability of selection is a function of the probability of selection associated with the household in which the respondent was located, as well as the subsampling (if any) applied to individuals identified in screening.
- Adjustment Two. This process adjusts for differential response (cooperation) rates in both the screening phase and subsequent interviews. Differential cooperation rates are computed (and adjusted) on the basis of geographic location and group membership, as well as within-group subclassification.
- Adjustment Three. This weighting adjustment attempts to correct for certain types of random variation associated with sampling as well as sample "undercoverage." These ratio estimations are used to conform the sample to independently derived population totals.
Sampling Weight Readjustments. Sampling weights for the main survey are readjusted to account for noninterviews each survey year. The readjustments are necessitated by differential nonresponse and use base year sample parameters for their creation, employing a procedure similar to that described above. The only exception occurs in the final stage of post-stratification. Post-stratification weights in survey rounds two and above have been recomputed on the basis of completed cases in that year's sample rather than the completed cases in the base year sample.
Custom weights
Users looking for a simple method to correct a single year's worth of raw data for the effects of over-sampling, clustering, and differential base year participation should use the weights included with each round of the data release. Unfortunately, while each round's weights provide an accurate adjustment for any single year, none of the weights provide an accurate method of adjusting multiple years' worth of data. The NLS has a custom weighting program that creates a set of customized longitudinal weights. These weights improve a researcher's ability to accurately calculate summary statistics from multiple years of data.
The custom weighting program calculates its weights by first creating a new temporary list of individuals who meet all of a researcher's criteria. This list is then weighted as if the individuals had participated in a new survey round. The weights for this temporary list are the output of the custom weighting program.
There are two options for the custom weighting program on the Custom Weights for the NLSY79 page. The first option allows researchers to specify the particular rounds in which respondents participated; researchers can then select either "The respondents are in all of the selected years" or "The respondents are in any or all of the selected years." The second option allows users to input a list of respondent IDs to get the appropriate weights for just that list. For example, this second option allows researchers to weight only those people who ever reported smoking cigarettes in any survey, or only people who needed extra time to graduate from college.
Important information on the Custom Weighting Program
- If you select all available survey rounds and also pick "The respondents are in any or all of the selected years," the weights produced are identical to the round 1 survey weights. This result arises because the "any" selection combined with all survey rounds produces a list of every person who participated in the survey.
- The output of the custom weight program has 2 implied decimal places just like the weights found in the data release. Dividing each custom weight output value by 100 results in the number of individuals the respondent represents.
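The implied-decimal convention can be illustrated with a few lines of code. This is a sketch in Python; the raw weight values below are hypothetical, not output from the custom weighting program:

```python
# Custom-weight output has 2 implied decimal places, so dividing each
# value by 100 yields the number of individuals a respondent represents.
# The raw weight values below are hypothetical.
raw_custom_weights = [578312, 120455, 9981]

persons_represented = [w / 100 for w in raw_custom_weights]
print(persons_represented)  # [5783.12, 1204.55, 99.81]
```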
Practical usage of weights
The application of sampling weights varies depending on the type of analysis being performed. If tabulating sample characteristics for a single interview year in order to describe the population being represented (that is, compute sample means, totals, or proportions), researchers should weight the observations using the weights provided. For example, to estimate the average hours worked in 1987 by persons born in 1957 through 1964, simply use the weighted average of hours worked, where weight is the 1987 sample weight. These weights are approximately correct when used in this way, with item nonresponse possibly generating small errors. Other applications for which users may wish to apply weighting, but for which the application of weights may not correspond to the intended result include:
Samples Generated by Dropping Observations with Item Nonresponses. Often users confine their analysis to subsamples for which respondents provided valid answers to certain questions. In this case, a weighted mean will not represent the entire population, but rather those persons in the population who would have given a valid response to the specified questions. Item nonresponse because of refusals, don't knows, or invalid skips is usually quite small, so the degree to which the weights are incorrect is probably quite small. In the event that item nonresponse constitutes only a small proportion of the data for variables under analysis, population estimates (that is, weighted sample means, medians, and proportions) would be reasonably accurate. However, population estimates based on data items that have relatively high nonresponse rates, such as family income, may not necessarily be representative of the underlying population of the cohort under analysis. For more information on item nonresponse in the NLSY79, see the Item Nonresponse section of this guide.
Data from Multiple Waves. Because the weights are specific to a single wave of the study, and because respondents occasionally miss an interview but are contacted in a subsequent wave, a problem similar to item nonresponse arises when the data are used longitudinally. In addition, occasionally the weights for a respondent in different years may be quite dissimilar, leaving the user uncertain as to which weight is appropriate. In principle, if a user wished to apply weights to multiple wave data, weights would have to be recomputed based upon the persons for whom complete data are available. In practice, if the sample is limited to respondents interviewed in a terminal or end point year, the weight for that year can be used (for more information on weighting see the section on Sample Weights & Clustering Adjustments).
Regression Analysis. A common question is whether one should use the provided weights to perform weighted least squares when doing regression analysis. Such a course of action may not lead to correct estimates. If particular groups follow significantly different regression specifications, the preferred method of analysis is to estimate a separate regression for each group or to use dummy (or indicator) variables to specify group membership.
Users interested in calculating the population average effect of, for example, education upon earnings, should simply compute the weighted average of the regression coefficients obtained for each group, using the sum of the weights for the persons in each group as the weights to be applied to the coefficients. While least squares is an estimator that is linear in the dependent variable, it is nonlinear in explanatory variables, and so weighting the observations will generate different results than taking the weighted average of the regression coefficients for the groups. The process of stratifying the sample into groups thought to have different regression coefficients and then testing for equality of coefficients across groups using an F-test is described in most statistics texts.
Users uncertain about the appropriate grouping should consult a statistician or other person knowledgeable about the data set before specifying the regression model. Note that if subgroups have different regression coefficients, a regression on a random sample of the population would not be properly specified.
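The grouped approach described above can be sketched in a few lines of Python. Everything here is hypothetical and purely illustrative: fit a separate least-squares slope for each group, then combine the group slopes using each group's summed sample weights.

```python
# Fit a simple least-squares slope separately for two hypothetical
# groups, then combine the slopes into a population-average effect
# by weighting each group's slope with its summed sample weights.

def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data: x = years of education, y = earnings (thousands),
# w = sample weights (persons represented).
groups = {
    "A": {"x": [12, 14, 16, 18], "y": [20, 26, 31, 38], "w": [900, 800, 700, 600]},
    "B": {"x": [12, 14, 16, 18], "y": [18, 21, 25, 27], "w": [400, 300, 250, 200]},
}

slopes = {g: ols_slope(d["x"], d["y"]) for g, d in groups.items()}
weight_sums = {g: sum(d["w"]) for g, d in groups.items()}

# Population-average effect: weighted average of the group coefficients.
pop_avg_slope = sum(slopes[g] * weight_sums[g] for g in groups) / sum(weight_sums.values())
```

Here the two group slopes differ (2.95 versus 1.55), and the combined estimate is pulled toward group A's slope because group A's summed weights are larger.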
Clustering adjustments
Researchers use NLSY79 data to estimate a variety of statistics. Since NLSY79 data come from a sample rather than from every age-appropriate individual in the U.S., the statistics produced are only estimates of the "true" national values. When researchers use a computer package to compute a statistic such as a mean or a regression coefficient, the program automatically provides a second set of statistics, such as the standard error, standard deviation, or t-statistic, that tells researchers how precisely the mean or coefficient is measured.
Details. Instead of randomly selecting individuals located anywhere in the U.S. during 1978, the survey randomly selected a set of small geographic areas and drew respondents from within them. By concentrating interviews in a fixed number of small areas, interviewers reduced the time they spent traveling to each interview. In this way, costs were lowered and the survey was fielded faster, yielding data more quickly. Like all other national data sets that use clustering, the NLSY79 contains many groups or bunches of respondents who share similar characteristics because they lived in the same neighborhood during 1978. This makes survey results appear more homogeneous, or similar, than is actually the case in the U.S.
Researchers can use two different approaches to correct this problem. The first approach uses the tables found in the NLSY79 Technical Sampling Report. For each survey round there is a table that lists the "Design Effects," or DEFT factors. These DEFTs give users a simple method for determining approximately how much they should increase their standard errors when measuring the precision of their estimates. Using the DEFT factors is a simple way to adjust standard errors for clustering. However, the tables provide no guidance on adjusting the standard errors of regression coefficients, or of estimates based on specialized subsamples or on calculations that use only a small subset of NLSY79 variables.
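The DEFT shortcut itself is simple arithmetic. In this sketch the DEFT value is hypothetical, chosen so the result lines up roughly with the uncorrected and corrected standard errors shown for family income later in this section:

```python
# DEFT adjustment: multiply the standard error computed under simple
# random sampling by the design-effect factor from the Technical
# Sampling Report tables. The DEFT value here is hypothetical.
srs_std_error = 536.0   # uncorrected standard error (e.g. family income)
deft = 2.12             # hypothetical DEFT for this variable and round

corrected_std_error = srs_std_error * deft
print(round(corrected_std_error, 1))  # 1136.3
```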
The more general method is to correct for clustering by using a specialized software package. Two of the most widely used packages for adjusting surveys for clustering effects are Stata, sold by the Stata Corporation, and Sudaan, sold by RTI International. This section describes how to adjust for clustering using Sudaan, which is the package used to generate the DEFT factors found in the Technical Sampling Report.
Important information about clustering
If you do not have access to the Geocode data set, you cannot use Sudaan or Stata to adjust for clustering. The Geocode data set can only be accessed by individuals approved by BLS. See Geographic Residence and Neighborhood Composition for information on obtaining the Geocode data CD.
Table 1. Effect of correcting for clustering on the standard errors of selected sample means

| Variable | Mean Value | Uncorrected Std Error | Corrected Std Error |
|---|---|---|---|
| Net Worth | $128,068 | $3,403 | $5,826 |
| Family Income | $55,031 | $536 | $1,137 |
| BMI | 26.7 | 0.06 | 0.09 |
Table 2 shows how adjusting for clustering affects a simple regression. Using the same 1998 data, a simple unweighted least squares equation was run with both SAS and Sudaan using net worth as the dependent variable and six independent variables. Three of these independent variables (BMI, income and age) take a wide range of values, while the remaining three variables (black, Hispanic or Latino, and female) take the value of 1 if the respondent has the particular characteristic and 0 otherwise.
The table shows that adjusting for clustering changes many of the standard errors and associated t-values. The biggest effect is seen on the income line: the standard error increases from 0.06 (uncorrected) to 0.19 (corrected), so the t-value falls from 44.37 to 13.87. Smaller changes are seen for the other variables. The intercept, age, and female standard errors all increase in size, while the BMI, black, and Hispanic or Latino variables all end up with slightly smaller standard errors.
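The t-value changes follow directly from t = coefficient / standard error. A quick check on the income row (the small discrepancies from the published t-values reflect rounding of the coefficient to 2.63):

```python
# t value = coefficient / standard error, using the income row above.
coef = 2.63
t_uncorrected = coef / 0.06   # roughly 43.8 (published: 44.37)
t_corrected = coef / 0.19     # roughly 13.8 (published: 13.87)
```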
Overall, both examples show that adjusting for clustering effects is important. The next subsection shows what variables are needed to adjust for clustering. The section ends with the specific Sudaan commands used to create the tables in this chapter.
Key Variables Needed For Clustering Correction. Two variables are needed to adjust the data set for clustering. Both variables are found only on the Geocode data set and are placed there because researchers can use these variables to determine where each civilian respondent lived in 1978.
Table 2. Effect of correcting for clustering on a net worth regression

| Variable | Coefficient Estimate | Uncorrected Std Error | Uncorrected t Value | Corrected Std Error | Corrected t Value |
|---|---|---|---|---|---|
| Intercept | 186,808 | 43,534 | 4.29 | 52,166 | 3.58 |
| BMI | 1,091 | 466 | 2.34 | 457 | 2.39 |
| Income | 2.63 | 0.06 | 44.37 | 0.19 | 13.87 |
| Black | 40,394 | 5,938 | 6.80 | 4,259 | 9.48 |
| Hispanic | 41,382 | 6,617 | 6.25 | 4,554 | 9.09 |
| Age | 5,285 | 1,086 | 4.87 | 1,252 | 4.22 |
| Female | 2,814 | 4,891 | 0.58 | 5,064 | 0.56 |
As discussed above, the NLSY79 is a multi-stage clustered sample. The clusters were created by first dividing the entire U.S. into Primary Sampling Units, or PSUs. These PSUs were defined by NORC and were composed of Standard Metropolitan Statistical Areas (SMSAs), entire counties when the counties were small, parts of counties when the counties were large, and independent cities. NORC randomly selected two different sets of PSUs for inclusion in the study, each of which by itself randomly represents the U.S. This selection of two sets of PSUs means the NLSY79 is composed of two replicates or strata. Within each is a random selection of PSUs. The replicate or strata that a respondent belongs to is found in the Geocode data set only and is labeled variable R02191.46, entitled "Within Stratum Replicate Of Primary Sampling Unit." This variable takes either the value 1 or 2, for either the first or second replicate.
The variable containing the PSU is labeled R02191.45 and is entitled "Stratum Number For Primary Sampling Units." R02191.45 ranges in value from 1 to 120. Researchers who want to know which geographic areas correspond to particular values should consult the crosswalk table in Attachment 104 of the Geocode Codebook Supplement. Respondents with a PSU code of 52 to 70 are part of the military sample and do not have any known geographic location.
Important information: Clarification on variable labeling
The label for variable R02191.46 in the SAS and SPSS programs automatically produced by NLS Investigator is confusing. The label reads "PRIMARY SAMPLNG UNIT PSU SCRAMBLED 79", but this variable contains the scrambled replicate, or stratum number, not the PSU; PSU information is found in R02191.45. Users should be careful when adjusting geographic variables using the clustering corrections. The complete title for variable R02191.46 is "Within Stratum Replicate Of Primary Sampling Unit (PSU) - Scrambled." Because this variable is randomly scrambled, doing clustering corrections on some geographic variables produces incorrect results. Scrambling has no effect on variables that are not geographic, such as education, income, or training.
Using the Key Variables In Sudaan. The specific steps used to generate the tables above are covered in this section. While the tables were produced using the Windows Version 8.0 Standalone package, the steps and commands are similar for other versions of Sudaan. To adjust summary statistics such as means or regressions with Sudaan, the researcher needs to create three files: one containing the data, one telling Sudaan how to read the data, and one containing the specific commands. Any computer package can be used to create the data file. Data can even be written directly from NLS Investigator to a file. Figure 1 has the relevant portion of the SAS program used to create the data file used in Tables 1 and 2 above.
Data obesity;
/* (SAS commands that generate variables like Age, Income, and BMI are placed here) */
PSU = R0219145;
REPLICATE = R0219146;
run;
proc sort data=obesity; /* Sort the data since Sudaan cannot handle unsorted data */
by replicate psu;
run;
Data _null_; /* no output data set is needed; this step only writes the file */
Set obesity;
file 'C:\DesignEffects\ObesitySudaanAdjustment.dbs';
put ID 5.
PSU 3.
REPLICATE 2.
WGHT 7.
BLACK 2.
HISPANIC 2.
AGE 3.
SEX 2.
INCOME 9.
BMI 4.1
NETASSET 9.;
Run;
One of the key things to note is that the data are sorted by the PSU and replicate variables before being written to the file. For most operations, Sudaan requires the data to be in this order before processing.
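For users without SAS, the same two steps can be sketched in any language: sort by replicate and PSU, then write fixed-width fields. Here is a minimal Python sketch; only a subset of the fields is shown, and the record values are hypothetical.

```python
# Hedged sketch of what the SAS step above does: sort the records by
# REPLICATE then PSU (Sudaan requires this order), then write them as
# fixed-width fields. Record values are hypothetical.
records = [
    {"ID": 7, "PSU": 12, "REPLICATE": 2, "WGHT": 578312},
    {"ID": 3, "PSU": 4,  "REPLICATE": 1, "WGHT": 120455},
]
records.sort(key=lambda r: (r["REPLICATE"], r["PSU"]))

with open("ObesitySudaanAdjustment.dbs", "w") as f:
    for r in records:
        # Widths mirror the SAS put statement: ID 5, PSU 3, REPLICATE 2, WGHT 7.
        f.write(f'{r["ID"]:5d}{r["PSU"]:3d}{r["REPLICATE"]:2d}{r["WGHT"]:7d}\n')
```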
The second file is the "label" file. This file is used to read the data into Sudaan. The label file, called "ObesitySudaanAdjustment.lab," is shown in Figure 2. The label file has five parts. The first column on the left is the variable's name, followed by a letter which tells Sudaan if the variable contains numeric or character data. The third and fourth columns contain the number of bytes (characters) taken up by the variable and the number of decimal places in the number. The last column contains the label. Sudaan expects the label file to follow a precise format with columns starting and ending in very specific places.
ID       N  5  0  ID# (1-12686)
PSU      N  3  0  # OF PSU
REPLICAT N  2  0  REPLICATE SCRAMBLED
WGHT     N  7  0  SAMPLING WEIGHT
BLACK    N  2  0  T/F BLACK
HISPANIC N  2  0  T/F HISPANIC
AGE      N  3  0  AGE OF RESPONDENT
SEX      N  2  0  MALE 0 - FEMALE 1
TOTINC   N  9  0  TOTAL INCOME
BMI      N  4  1  BODY MASS
NETASS   N  9  0  TOTAL NET WORTH
The third file is the set of commands used to run Sudaan. Many versions of Sudaan allow commands to be typed directly into the program so researchers are not forced to create command files. Figures 3 and 4 provide the Sudaan commands that were used to create Tables 1 and 2 above. Figure 3 has three sections. The top section below the "Proc Descript" command tells Sudaan where to find the raw data and what variable contains the basic survey weights. The nest command defines which variables contain the replicate and PSU information. The middle section, beginning with "Var," tells Sudaan which variables will have descriptive statistics created. The final section, beginning with "Print," specifies the types of output that are shown.
The first section of Figure 4 is similar to commands seen above in Proc Descript. The large difference is that the "weight" command has the reserved name "_ONE_" after it instead of the NLSY79 weight, "wght." Putting the "wght" variable after the weight command would cause Sudaan to run weighted least squares. By using "_ONE_" instead, Sudaan weights all variables with the same 1.0 value, resulting in Sudaan running unweighted least squares. The second part of the command, which begins with "Model," shows the exact regression to run.
Proc Descript
Data="C:\DesignEffects\ObesitySudaanAdjustment.dbs"
filetype=ascii design=wr mean DEFT1 est_no=12686;
weight wght;
nest REPLICAT PSU / MISSUNIT;
Var NETASS BMI TOTINC BLACK HISPANIC AGE SEX;
Print nsum="Sample Size" WSUM="Population Size" Mean
semean="Std. Err." DEFFMEAN="Design Effect" / style=nchs
nsumfmt=f6.0 wsumfmt=f10.0 deffmeanfmt=f6.2 semeanfmt=f11.2;
Proc Regress
Data="C:\DesignEffects\ObesitySudaanAdjustment.dbs"
filetype=ascii design=wr DEFT1 est_no=12686;
weight _ONE_;
nest REPLICAT PSU / MISSUNIT;
Model NETASS = BMI TOTINC BLACK HISPANIC AGE SEX;
| Related Variables | The 1979 Geocode data also contain the State, county, and metropolitan statistical area where the respondent lived in 1979. |
|---|---|
| Documentation | Additional information can be found in the Standard Errors and Design Effects section of this User's Guide, in the NLSY79 Technical Sampling Report, and in Attachment 104 of the Geocode Codebook Supplement. |
| Data Files | Data on clustering can be found only in the NLSY79 Geocode files under the "GEOCODE" 1979 area of interest. |