Tutorial: Constructing Comparable Samples across the NLSY79 and NLSY97

Example: Constructing work status at age 20 for both samples

Objective: This tutorial walks you through the basic steps of constructing parallel samples for research projects that use both the NLSY79 and the NLSY97 cohorts. It then helps you create a similar variable--work status at age 20--for both cohorts.

The two NLSY cohorts are carefully designed to support a variety of cross-cohort research. Both survey samples are based on birth year, are drawn from nationally representative area probability samples, and have similar questionnaire designs for many topics, particularly employment. However, there are small differences in design that one needs to take into account when preparing data files for research projects.

Knowledge Assumed: This tutorial assumes that you already know how to use the NLS Investigator to create a tagset that saves your variables and to extract data. If you need assistance with the NLS Investigator before starting this tutorial, see "How to Use the NLS Investigator".

Background Reading:

Preview of Steps

  1. Select the samples for analysis.
  2. Determine the age and the interview years you will need for your analysis.
  3. Create a tagset of variables to define work status at age 20 for the NLSY79.
  4. Create a tagset of variables to define work status at age 20 for the NLSY97.
  5. Construct work status variable for both samples.

Step 1: Select the samples for analysis

First we'll need to decide whether to exclude certain supplemental samples in the two cohorts from our analysis. The table below shows the types of samples that make up the surveys.

NLSY79 Sample Types (R01736.) | NLSY97 Sample Types (R12358.)
Cross-sectional sample: Nationally representative sample of individuals born in 1957-1964 and living in the U.S. as of the first survey round | Cross-sectional sample: Nationally representative sample of individuals born in 1980-1984 and living in the U.S. as of the first survey round
Oversamples of black and Hispanic individuals, same birth years as above | Oversamples of black and Hispanic individuals, same birth years as above
Economically disadvantaged non-black, non-Hispanic oversample (discontinued after 1990), same birth years as above | (none)
Military sample (mostly discontinued after 1985): Sample of individuals born in 1957-1961 and serving in the military as of September 30, 1978 | (none)

 

  1. Let's start with the NLSY79. The variable R0173600, Sample Identification Code, lists the gender, race/ethnicity, and sample type of each NLSY79 respondent. The NLSY79 includes two extra sub-samples not available in the NLSY97: a military sample and an economically disadvantaged non-black, non-Hispanic oversample. To omit the military sample from your analysis, you will want to exclude cases in which R0173600 ranges from 15 to 20. To omit the oversample of economically disadvantaged non-black, non-Hispanic respondents, you'll want to exclude cases in which R0173600 equals 9 or 12.
  2. Now let's turn to the NLSY97. The variable R1235800, CV_Sample_Type, lists the sample type of each NLSY97 respondent (if you are using the new version of NLS Investigator, this variable is preselected).
  3. The NLSY97 includes a cross-sectional sample and oversamples of black and Hispanic individuals, as in the NLSY79. To compare the two samples, one can use the full NLSY97 sample together with the remaining NLSY79 cases (after excluding the military sample and the economically disadvantaged non-black, non-Hispanic oversample).
  4. Alternatively, you could restrict your analysis to the nationally representative cross-sectional samples in both surveys. In the NLSY79, this amounts to keeping only cases in which R0173600 equals 1 through 8, and in the NLSY97, only cases in which R1235800 equals 1. Doing so, however, will dramatically reduce your sample sizes and increase your standard errors.
  5. Note that we have not introduced the possibility of using sampling weights to compute population estimates. This is beyond the scope of this tutorial. However, see the section on Sample Weights, Design Effects & Clustering Adjustments in the NLSY79 User's Guide, and the section on Sample Weights and Design Effects in the NLSY97 User's Guide. In addition, if multiple waves of data are used, one can create appropriate weights using the custom weighting program offered for each survey. 
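
The sample restrictions described in this step can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the extract is assumed to have been read into plain integers, and the code values are exactly those quoted above (R0173600 = 1-8, 9 or 12, 15-20; R1235800 = 1).

```python
# Sample-type codes quoted in Step 1 of this tutorial.
NLSY79_CROSS_SECTION = set(range(1, 9))   # R0173600 = 1 through 8
NLSY79_DISADVANTAGED = {9, 12}            # economically disadvantaged non-black, non-Hispanic oversample
NLSY79_MILITARY = set(range(15, 21))      # R0173600 = 15 through 20
NLSY79_EXCLUDED = NLSY79_DISADVANTAGED | NLSY79_MILITARY
NLSY97_CROSS_SECTION = {1}                # R1235800 = 1


def keep_nlsy79(sample_id, cross_section_only=False):
    """Return True if an NLSY79 case (R0173600) stays in the analysis sample."""
    if cross_section_only:
        return sample_id in NLSY79_CROSS_SECTION
    return sample_id not in NLSY79_EXCLUDED


def keep_nlsy97(sample_type, cross_section_only=False):
    """Return True if an NLSY97 case (R1235800) stays in the analysis sample."""
    if cross_section_only:
        return sample_type in NLSY97_CROSS_SECTION
    return True  # the full NLSY97 sample is comparable to the trimmed NLSY79


# Hypothetical toy records, not real respondents.
rows79 = [
    {"R0000100": 1, "R0173600": 5},
    {"R0000100": 2, "R0173600": 9},
    {"R0000100": 3, "R0173600": 17},
]
kept = [r["R0000100"] for r in rows79 if keep_nlsy79(r["R0173600"])]
print(kept)  # [1]
```

Passing cross_section_only=True to both functions gives the narrower comparison described in item 4.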

Step 2: Determine the age and the interview years you will need for your analysis

Depending on your research topic, you may want to look at respondents in each cohort at a given age or age range. Below are some issues to consider.

  1. Which topics are available at different ages or years across the surveys. Some topics, like employment and marital status, are collected in an event-history format. By gathering dates of jobs or marital changes, for example, the surveys collect a complete history of the particular topic. Other information is collected for a point in time, only in certain years, at certain ages, or for select birth cohorts. The starred summary charts (Asterisk Tables) for the NLSY79 and NLSY97 interviews offer a convenient way to see what topics are collected, and in which years they were obtained. One would need to look at the actual questionnaires to see how similar the question wordings are across the topics for the two cohorts.
  2. In which interviews respondents are the age you need. To identify the interviews at which respondents are the right age for your research question, you could use the month and year of birth, interview date, and age at interview variables available in each data set. For example, to find the first interview after the respondent turned 20, one could either compare the date of birth to the interview date or look at the age at interview variable in each round.
  3. Awareness of a special dating method used in some key event-history data, which allows one to compare the timing of various life events. Information about program participation, marriage, and schooling is provided on a monthly basis using a continuous month timeline in the NLSY97, starting with January 1980. Although not in a continuous month scheme, created event-history data in the NLSY79 often show the measures by month and year. Employment histories are presented on a weekly basis using a continuous week timeline in both the NLSY79 (starting with the week of January 1, 1978) and the NLSY97 (beginning in the month after the respondent turns age 14).
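
As a rough illustration of the date comparison mentioned in item 2, the sketch below finds the first interview at or after a target age using only month and year of birth and month and year of interview. The function names and the (label, year, month) layout are hypothetical. Because the day of the month is ignored, an interview held in the birthday month is counted as falling after the birthday.

```python
def age_in_months(birth_year, birth_month, int_year, int_month):
    """Age at interview in whole months, ignoring the day of the month."""
    return (int_year - birth_year) * 12 + (int_month - birth_month)


def first_interview_at_age(birth_year, birth_month, interviews, age=20):
    """Return the label of the earliest interview held at or after `age`.

    `interviews` is an iterable of (label, year, month) tuples, one per round
    in which the respondent was interviewed. Returns None if no round qualifies.
    """
    eligible = [
        (year, month, label)
        for label, year, month in interviews
        if age_in_months(birth_year, birth_month, year, month) >= age * 12
    ]
    return min(eligible)[2] if eligible else None


# Hypothetical interview dates for one respondent born in June 1960.
rounds = [("1979", 1979, 3), ("1980", 1980, 4), ("1981", 1981, 2)]
print(first_interview_at_age(1960, 6, rounds))  # 1981
```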

In the next step, we'll show an example of using this continuous week timeline to determine work status during the week that includes October 1st for the year the respondent turns 20.

Step 3: Create a tagset of variables to define work status at age 20 for the NLSY79

Now we'll show an example of looking at respondents in each cohort at a given age--age 20. By the 2006 interview, all NLSY97 respondents are over 20 years old, and by 1985 all NLSY79 respondents are over 20 years old.

Let's define the following variable for both cohorts: work status during the week that includes October 1st for the year the respondent turns 20. We'll need year of birth and weekly labor force status variables from the work history arrays for that particular week for both cohorts.

  1. In the NLSY79, respondents were born in the years 1957 through 1964. That means the year the respondents turn 20 ranges from 1977 to 1984. Note that the work history arrays begin on January 1, 1978, so we'll exclude the 1957 birth year from our analysis.
  2. If we turn to Appendix 18 of the NLSY79 Codebook Supplement, we'll find a table that tells us the week numbers in the work history arrays that correspond to each date. Two numbers are given, the week of the year and the week of the array. For example, in 1979, the week of October 1st is week number 39 in 1979, but week number 92 in the work-history array (number of weeks since 1/1/78). Given the layout of the data, the latter number is what we need. (In the NLSY97 the opposite is true.) We want to find the week numbers that correspond to the week of October 1st for the years 1978-1984--the years our respondents turn age 20. Here's what we need: week 40 in 1978, 92 in 1979, 144 in 1980, 196 in 1981, 248 in 1982, 300 in 1983, and 353 in 1984.
  3. Searching on the "Work History-Weekly Labor Status" Area of Interest will give us the weekly labor force status arrays in the NLSY79. Tag the variables that correspond with the list above (W0065200, W0070400, W0110300, W0150200, W0190100, W0230000, W0270500).
  4. We'll also need variables for year of birth (R0000500), sample composition (R0173600), and, of course, respondent ID (R0000100). Note that respondent ID is preselected in Investigator.
  5. Now create an extract of your NLSY79 data set. The variables included are as follows:

Reference Number | Question Name | Question Title | Year
R00001.00 | CASEID | Identification Code | 1979
R00005.00 | S01Q01A | Date of Birth - Year | 1979
R01736.00 | S24Q01 | Sample Identification Code | 1979
W00652.00 | STATUS_WK_NUM0040 | Labor Force Status (1978) Week 40 | 1979
W00704.00 | STATUS_WK_NUM0092 | Labor Force Status (1978) Week 92 | 1979
W01103.00 | STATUS_WK_NUM0144 | Labor Force Status (1978) Week 144 | 1980
W01502.00 | STATUS_WK_NUM0196 | Labor Force Status (1978) Week 196 | 1981
W01901.00 | STATUS_WK_NUM0248 | Labor Force Status (1978) Week 248 | 1982
W02300.00 | STATUS_WK_NUM0300 | Labor Force Status (1978) Week 300 | 1983
W02705.00 | STATUS_WK_NUM0353 | Labor Force Status (1978) Week 353 | 1984
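
The array week numbers listed in this step can be cross-checked with a short calculation. The sketch below assumes that array week 1 is the seven-day block starting January 1, 1978 (a Sunday) and that each later week follows without gaps. That assumption reproduces the numbers quoted above, but Appendix 18 remains the authoritative source. The function names are hypothetical.

```python
from datetime import date

# The NLSY79 work history arrays begin with the week of January 1, 1978.
ARRAY_START = date(1978, 1, 1)


def nlsy79_array_week(d):
    """Week number in the continuous array, counting 7-day blocks from ARRAY_START."""
    return (d - ARRAY_START).days // 7 + 1


def oct1_week_at_20(birth_year):
    """Array week containing October 1st of the year the respondent turns 20."""
    return nlsy79_array_week(date(birth_year + 20, 10, 1))


# Birth years 1958-1964 (1957 is dropped because the arrays start in 1978).
weeks = {by: oct1_week_at_20(by) for by in range(1958, 1965)}
print(weeks)
# {1958: 40, 1959: 92, 1960: 144, 1961: 196, 1962: 248, 1963: 300, 1964: 353}
```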

 

Step 4: Create a tagset of variables to define work status at age 20 for the NLSY97

Now we need to create a similar tagset for the NLSY97.

  1. In the NLSY97, respondents were born in the years 1980 through 1984. That means the year the respondents turn 20 ranges from 2000 to 2004.
  2. If we turn to Appendix 7 of the NLSY97 Codebook Supplement, we'll find Table 1, an Excel spreadsheet, which tells us the week numbers in the work history arrays that correspond to each date. Two weekly numbers are given, the week of the year and the week of the array. For example, in 2000, the week that includes October 1st is week number 41 in 2000, but week number 1084 in the work-history array (number of weeks since 1/1/80). Given the layout of the data, the former number is what we need. Here's what we need: week 41 in 2000, 40 in 2001, 40 in 2002, 40 in 2003, and 40 in 2004.
  3. Searching on the Survey Year = "XRND", Word in Title (enter search term) contains "Status", and Word in Title (enter search term) contains "Employment" gives us variables that include the weekly labor force status arrays in the NLSY97. Tag the variables that correspond with the list above (R8812500, R8908000, R9043500, R9048700, R9179400). Note these variables have convenient question names in the data set: EMP_STATUS_year.week number.
  4. We'll also need variables for year of birth (R0536402), sample composition (R1235800), and, of course, respondent ID (R0000100). Note that all three variables are preselected in Investigator.
  5. Now create an extract of your NLSY97 data set. The variables included are as follows:

Reference Number | Question Name | Question Title | Year
R00001.00 | PUBID | PUBID, Youth Case Identification Code | 1997
R05364.02 | KEY!BDATE_Y | KEY!BDATE, Rs Birthdate Month/Year (Symbol) | 1997
R12358.00 | CV_SAMPLE_TYPE | Sample Type. Cross-Sectional or Oversample | 1997
R88125.00 | EMP_STATUS_2000.41 | 2000 Employment: Employment Status in Week 41 | XRND
R89080.00 | EMP_STATUS_2001.40 | 2001 Employment: Employment Status in Week 40 | XRND
R90435.00 | EMP_STATUS_2002.40 | 2002 Employment: Employment Status in Week 40 | XRND
R90487.00 | EMP_STATUS_2003.40 | 2003 Employment: Employment Status in Week 40 | XRND
R91794.00 | EMP_STATUS_2004.40 | 2004 Employment: Employment Status in Week 40 | XRND
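
The NLSY97 week numbers quoted in this step can be cross-checked in a similar way. The sketch below assumes weeks run Sunday through Saturday and that week 1 is the week containing January 1 of the calendar year (or January 1, 1980 for the continuous count). This convention is inferred because it reproduces the numbers in this tutorial. It is not taken from Appendix 7, which remains the authoritative source. The function names are hypothetical.

```python
from datetime import date, timedelta


def sunday_on_or_before(d):
    """Return the Sunday that starts the Sunday-Saturday week containing d."""
    # date.weekday() is Monday=0 ... Sunday=6
    return d - timedelta(days=(d.weekday() + 1) % 7)


def week_of_year(d):
    """Week of the calendar year, with week 1 containing January 1 (assumed)."""
    start = sunday_on_or_before(date(d.year, 1, 1))
    return (d - start).days // 7 + 1


def continuous_week(d, origin=date(1980, 1, 1)):
    """Week count since the week containing `origin` (assumed convention)."""
    start = sunday_on_or_before(origin)
    return (d - start).days // 7 + 1


# Birth years 1980-1984 turn 20 in 2000-2004.
oct1_weeks = {by + 20: week_of_year(date(by + 20, 10, 1)) for by in range(1980, 1985)}
print(oct1_weeks)                           # {2000: 41, 2001: 40, 2002: 40, 2003: 40, 2004: 40}
print(continuous_week(date(2000, 10, 1)))   # 1084
```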

 

Step 5: Construct work status variable for both samples

Now that we have our two data sets from Steps 3 and 4, we're ready to start programming our variables. The logic of this is as follows:

  1. We'll start by restricting our NLSY79 data to the cross-sectional sample and the oversamples of black and Hispanic individuals. These are the same sample types available in the NLSY97.
  2. Next, we'll want to look at the definitions of the employment status variables in both cohorts. We want to define an indicator variable equal to 1 if the respondent is working in a civilian or military job during the week that includes October 1st in the year he or she turns 20 and 0 if the respondent is not working.
  3. If we look at the codebook for the employment status variables in the NLSY79, we can see that a value of 100 or more means the respondent was working in a civilian job in that week. A value of 7 means they were in the military. The values 2, 4, and 5 correspond to not working in the particular week, and we will treat 0 and 3 as missing information.
  4. Similarly, the codebook for the employment status variables in the NLSY97 indicates that a value of 9701 or more means the respondent was working in a civilian job in that week. A value of 6 means they were in the military. The values 1, 2, 4, and 5 correspond to not working in the particular week, and again, we will treat 0 and 3 as missing information.

Sample programming code for these steps is available in SAS and Stata.
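
The recoding logic in the steps above can also be sketched in Python. This is an illustration, not the tutorial's SAS or Stata program. The function names are hypothetical, each respondent is assumed to be a dictionary keyed by reference number, and any status value not listed in the steps above (including negative values) is treated as missing. The sketch also assumes that the NLSY79 birth year (R0000500) may be stored as a two-digit year and converts it if so.

```python
# Birth year -> weekly status variable for the week of October 1st at age 20.
NLSY79_WEEK_VAR = {1958: "W0065200", 1959: "W0070400", 1960: "W0110300",
                   1961: "W0150200", 1962: "W0190100", 1963: "W0230000",
                   1964: "W0270500"}
NLSY97_WEEK_VAR = {1980: "R8812500", 1981: "R8908000", 1982: "R9043500",
                   1983: "R9048700", 1984: "R9179400"}


def recode_nlsy79(status):
    """1 = working (civilian job code >= 100, or military = 7), 0 = not working."""
    if status >= 100 or status == 7:
        return 1
    if status in (2, 4, 5):
        return 0
    return None  # 0, 3, and any unlisted value are treated as missing


def recode_nlsy97(status):
    """1 = working (civilian job code >= 9701, or military = 6), 0 = not working."""
    if status >= 9701 or status == 6:
        return 1
    if status in (1, 2, 4, 5):
        return 0
    return None  # 0, 3, and any unlisted value are treated as missing


def work_at_20(row, birth_var, week_vars, recode):
    """Pick the status variable that matches the birth year, then recode it."""
    birth_year = row[birth_var]
    if 0 <= birth_year < 100:        # assumption: two-digit years mean 19xx
        birth_year += 1900
    var = week_vars.get(birth_year)  # None for 1957, which has no array week
    return None if var is None else recode(row[var])


def mean_nonmissing(values):
    """Mean and N over non-missing indicator values."""
    valid = [v for v in values if v is not None]
    return (sum(valid) / len(valid) if valid else None), len(valid)


# Hypothetical toy rows, not real respondents.
r79 = {"R0000500": 60, "W0110300": 205}
r97 = {"R0536402": 1982, "R9043500": 4}
print(work_at_20(r79, "R0000500", NLSY79_WEEK_VAR, recode_nlsy79))  # 1
print(work_at_20(r97, "R0536402", NLSY97_WEEK_VAR, recode_nlsy97))  # 0
```

Apply the sample restrictions from Step 1 before computing means, so that the NLSY79 and NLSY97 indicators describe comparable samples.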

Additional Information

Final statistics from program:
work79_20 (mean = .604, N = 8603)
work97_20 (mean = .665, N = 8435)

Next Step: This tutorial focuses on forming comparable variables that measure work status at a given age in the cross-sectional samples and the oversamples of black and Hispanic individuals in the NLSY79 and NLSY97. One could construct additional comparable variables in the two surveys over many different domains across different ages and years, such as labor market experience and number of jobs held, marital status and transitions, and fertility.