Types of Variables: Raw, Symbols, Rosters & Created

Several types of variables are present in the NLSY97 data, including:

  • Direct (or raw) responses from a questionnaire or other survey instrument.
  • Symbols and roster items, which are used to guide the interview.
  • Created variables based on responses to more than one data item. These items are edited for consistency where necessary.
  • Created variables from data provided on a non-NLS data set.
  • Variables provided by NORC or an outside organization.

This section will help users understand variable descriptions or titles, symbols and rosters, and created variables.

User Note

Survey personnel do not, in general, impute missing values or perform internal consistency checks across waves. Exceptions will be noted.

Variable Descriptions or Variable Titles

Each variable within NLSY97 main file data sets has been assigned an 80-character summary title that serves as the descriptive representation of that variable throughout the hard copy and electronic documentation system. Variable titles are assigned to capture the core content of the variable and to incorporate universe identifiers that specify the subset of respondents for which each variable is relevant within the limitations described below. Some titles indicate the reference periods (e.g., survey year or calendar year) of the variables as well.

Universe Identifiers:  If two ostensibly identical variables differ only in their respondent universes, the variable title will include a reference to the applicable universe. The appropriate universe will either be appended in parentheses or identified before the variable title.

Example 1: R00029. "R Do Any Work for Pay Last Week? (R Does Not Own Bus/Farm)"
R00030. "R Do Any Work for Pay or Profit Last Week? (R Owns Bus/Farm)"

Example 2: R01075. "Compensation Received (Start <16) EMP 01"
R01803. "Compensation Received (Start 16+) EMP 01"

User Note

Do not presume that two variables with the same or similar titles necessarily share the same (1) universe of respondents, (2) coding categories, or (3) time reference period. Although the universe identifier conventions described above are applied, users are urged to consult the questionnaires for skip patterns and exact time periods for a given variable and to factor in the relevant fielding period(s) for the cohort. In addition, variables with similar content may have completely different titles, depending on the type of variable (raw versus created).

Symbols and Roster Items

There are two main types of survey variables not necessarily represented by a single item in the questionnaire: symbols and roster items. The CAPI system uses these items to organize, display, and store information collected during the interview; to determine which question paths the respondent should follow; and to fill in respondent-specific text in various questions. For example, rather than asking about a respondent's "current employer," the CAPI software fills in the actual employer name reported earlier in the interview. Many of these symbols and roster items are provided in the data set for user reference; researchers should be aware of the differences between the two types and the uses of each.

Symbols

Symbols are variables that are used by the NLSY97 CAPI software to determine the flow of the interview. Symbols may contain real-time information captured during the survey, or they may be created in advance of the interview by survey staff. For example, before the income section for rounds 1-5, the survey program created a symbol that states whether the respondent is independent (Y12!INDEPEN). This symbol is later used to determine whether the youth is asked certain income and asset questions. Similarly, before the survey round starts, survey staff create a symbol indicating the respondent's gender (SYMBOL!KEY!SEX) which is used throughout the interview to make sure that the respondent is asked appropriate questions about gender-specific topics such as pregnancy. 

All symbol variables have "Symbols" as their variable type. In general, question names for round 1 symbol variables begin with "KEY!"; symbols in subsequent rounds generally have "SYMBOL!" or "SYMBOL_" to start their question names.
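Because symbols share these question-name prefixes, they are easy to separate from substantive items in a variable list. The Python sketch below flags symbols by prefix; SYMBOL!KEY!SEX is taken from the discussion above, while the other names are illustrative placeholders rather than actual NLSY97 question names.

```python
# Minimal sketch: flag symbol variables by question-name prefix.
# Only SYMBOL!KEY!SEX is taken from the text above; the other names
# are illustrative placeholders, not actual NLSY97 question names.
def is_symbol(question_name: str) -> bool:
    """Follows the conventions noted above: "KEY!" in round 1,
    "SYMBOL!" or "SYMBOL_" in later rounds."""
    return question_name.startswith(("KEY!", "SYMBOL!", "SYMBOL_"))

names = ["KEY!SEX", "SYMBOL!KEY!SEX", "SYMBOL_PAYRATE", "CV_EXAMPLE", "YEMP_EXAMPLE.01"]
print([n for n in names if is_symbol(n)])
# ['KEY!SEX', 'SYMBOL!KEY!SEX', 'SYMBOL_PAYRATE']
```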

Rosters

The NLSY97 uses rosters in various sections in which information is collected on a number of persons, schools, or employers. Rosters are an important part of the NLSY97 data set. These grids of information help researchers to analyze data in an efficient and accurate way. However, the structure and use of rosters may be somewhat confusing, so it is vital that researchers understand how they are constructed.

User Note

In addition to the detailed roster discussion in the following paragraphs, another example of a roster can be found in Employment: An Introduction. Although that example pertains specifically to employers, the basic concepts apply to other NLSY97 rosters. Researchers using any roster data may find the example helpful. More information about using specific rosters is found in the various topical sections. Researchers may be particularly interested in Household Composition, Characteristics of Non-Residential Relatives, Education, Training & Achievement Scores: An Introduction, and Marital & Marriage-Like Relationships.

What is a roster? 

A roster may be thought of as a list--for example, a list of household members, a list of employers, or a list of children. A respondent with two children will have data on the first two lines of the child list, or child roster. A respondent with four employers will have information on the first four lines of the employer roster. In addition to the name of the person or item (which is not released to the public), the roster contains other basic information, such as the age, race, and labor force status of household members or the start date and stop date for each employer.
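For readers who think in terms of data structures, a roster can be pictured as a list of records, one per line. The Python sketch below illustrates the idea with an employer roster; the field names and values are invented for illustration and are not the actual NLSY97 variable names.

```python
# Illustrative sketch only: a roster as a list of records, one record
# per line (person, school, or employer). Field names are invented.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmployerRosterLine:
    line_number: int          # position on the roster (job 01, 02, ...)
    uid: int                  # unique employer ID, stable across rounds
    start_date: str           # "YYYY-MM"
    stop_date: Optional[str]  # None if the job is current at the interview date

employer_roster = [
    EmployerRosterLine(1, 199901, "1999-06", None),       # current job
    EmployerRosterLine(2, 199802, "1998-05", "1999-01"),  # earlier job
]
```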

In the paper-and-pencil interviews (PAPI) of older NLS cohorts, the questionnaires included a chart or grid listing this type of information, like the one shown in Figure 1. For example, in the household roster grid, each household member's name was entered in a separate row. The interviewer asked the respondent for each member's date of birth, enrollment status, employment status, etc., filling in the answers in the appropriate column. This completed household roster contained all the pertinent information about household residents, and researchers could easily use the variables based on this roster to examine characteristics of household members.

Figure 1. Sample PAPI Roster Grid

What are the names of all family members who are living in your home?
Name | What is __'s relationship to you? | How old is __ today? | Is __ enrolled in school? (Age 4 and older) | How many weeks did __ work in the last 12 months? (Age 16 and older) | How many hours did __ usually work per week? | What kind of work was __ doing in the past 12 months?
Susan | Mother | 45 | No | 50 | 25 | Graphic design
John | Father | 49 | No | 50 | 40 | Banking
Jimmy | Brother | 17 | Yes | 35 | 15 | Food service
Sally | Sister | 12 | Yes | (n/a) | (n/a) | (n/a)
Robert | Brother | 3 | (n/a) | (n/a) | (n/a) | (n/a)
Jane | Grandmother | 77 | No | 0 | (n/a) | (n/a)

When the NLS surveys changed to computer-assisted personal interviewing (CAPI), rosters became a very important way of organizing information during the interview. Instead of using an actual grid, however, CAPI questionnaires include a series of questions that gather the same types of information that would have been included in the grid in a paper-and-pencil interview. The computer then moves the answers to these questions into a grid, creating a roster from the information.

After the roster is created, it can be used to guide subsequent portions of the interview. For example, during the interview the NLSY97 questionnaire gathers the names, dates of attendance, and level of school (secondary school or college) for each of the respondent's schools and organizes them into a roster. The rest of the school section then asks questions about the first school on the roster, followed by questions about the second school, then the third, and so on. The information about the level of the school determines whether the respondent is asked questions that apply to high school or college.

The information from the roster is also presented in the data set as an organized list of data, so that these variables are easy for researchers to access. To the user, the school roster appears as a consolidated block of variables that contains key information such as dates of enrollment, an identification number for the school, and variables indicating the type (private or public) and level (junior high, high school, college) of the school. For example, the variables in the round 2 school roster are listed in Figure 2, along with their reference numbers. Thus, rosters are a way of organizing information both for researchers and for the actual interview so that questions are asked in a logical manner.

Figure 2. Example: Variables in CAPI-Generated School Roster for Round 2

Question Name | Variable Title | Reference Numbers (one for each school)
NEWSCHOOL_PERIODS.xx | Number of Times R Enrolled in School xx | R24605.-R24610.00
NEWSCHOOL_START1.xx | Month/Year R Start 1st Enrollment in School xx | R24611.00-R24616.01
NEWSCHOOL_START2.xx | Month/Year R Start 2nd Enrollment in School xx | R24617.00-R24620.01
NEWSCHOOL_START3.xx | Month/Year R Start 3rd Enrollment in School xx | R24621.00-R24621.01
NEWSCHOOL_STOP1.xx | Month/Year R End 1st Enrollment in School xx | R24622.00-R24627.01
NEWSCHOOL_STOP2.xx | Month/Year R End 2nd Enrollment in School xx | R24628.00-R24631.01
NEWSCHOOL_STOP3.xx | Month/Year R End 3rd Enrollment in School xx | R24632.00-R24632.01
NEWSCHOOL_SCHCODE.xx | School Code Elementary, Middle, High, College | R24633.-R24638.00
NEWSCHOOL_INTERVIEW.xx | Which Survey Round School xx Reported in | R24639.-R24644.00
NEWSCHOOL_TYPE.xx | Type of School xx R has Attended | R24645.-R24650.00
NEWSCHOOL_PUBID.xx | PUBID of School xx R has Attended | R24651.-R24656.00
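Because each roster line is delivered as a separate column (school 01, school 02, and so on), a common first step in analysis is to reshape these columns into one row per respondent-school. The Python sketch below assumes a small extract whose columns have been renamed to the Figure 2 question names with a two-digit line suffix; an actual NLS Investigator extract may instead use reference numbers or user-chosen names.

```python
# Sketch under the assumption that the extract's columns were renamed
# to the Figure 2 question names with a ".01"/".02" line suffix.
import pandas as pd

wide = pd.DataFrame({
    "PUBID": [1, 2],
    "NEWSCHOOL_PERIODS.01": [1, 2],       "NEWSCHOOL_PERIODS.02": [1, None],
    "NEWSCHOOL_PUBID.01":   [5001, 5003], "NEWSCHOOL_PUBID.02":   [5002, None],
})

long = (
    pd.wide_to_long(
        wide,
        stubnames=["NEWSCHOOL_PERIODS", "NEWSCHOOL_PUBID"],
        i="PUBID", j="school_line", sep=".", suffix=r"\d+",
    )
    .dropna(how="all")   # drop roster lines the respondent never used
    .reset_index()
)
print(long)  # one row per respondent-school
```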

How are rosters created during the interview? 

This section outlines the process used during the interview to create a roster. Rosters may include data from both previous interviews and the current interview.  After the roster is created and sorted, it can be used to guide the rest of the interview. Figure 3 provides a pictorial overview of the creation of a roster.

Figure 3. How Rosters Are Created

Data from previous interviews: As shown in the figure, creation of a roster for the current round often begins with information found in the roster from the previous round. The appropriate respondent-specific data are saved on the interviewer's laptop before he or she administers the survey. When the interview gets to a point where roster information is collected, the data from the previous round's roster are often used as the base for the current roster. The respondent verifies and updates the information. If no changes have occurred since the last interview--for example, if exactly the same people live in the respondent's household--then the current round's roster will be the same as the one from the previous round.

For example, the interviewer reads a list of all of the people on the household roster from the last interview. The respondent first states whether any of those people have moved out of the household and then reports any new household members. If any members remain from the previous year, their information--date of birth, gender, race/ethnicity, etc.--is carried over from the previous interview, and any missing data are collected. This method is more efficient than asking the respondent to report all household members every year.

Raw data collection: After the respondent and interviewer review and update the roster from the previous round, the survey collects current information. For example, new people might have moved into the household, so the interviewer asks the respondent about their characteristics. At this point, the respondent has answered all of the questions that will populate the data grid for that topic.

Roster creation and roster sort: Using the updated roster from the previous round and the new raw data just collected, the computer creates a new roster for the current round. For example, the employer roster contains the following information for each job: a unique identification number for the employer, employment dates, whether the job was current at the interview date, whether the job was in the military, and whether the job was an internship. If the respondent had held the job at the time of the previous interview, the start date and employer identification number are carried over from the old roster, and the other information is taken from the questions at the beginning of the employment section for the current year. Similarly, the household roster contains information from the previous interview about household members reported at that time and data from the current interview about new household members.

In some cases, the computer also sorts the roster and puts the items in order based on a specified variable. For example, in the round 1 household roster, all youths in the age range of the NLSY97 cohort were listed first, and then all other household members were listed from oldest to youngest. The employer roster is sorted by job end date so that the most recent jobs are listed first.
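The carry-forward-and-sort logic described above can be summarized in a few lines of Python. This is only an illustration of the idea, not the actual CAPI program; the field names and dates are invented.

```python
# Illustration of the roster build described above, not the CAPI code.
def build_employer_roster(previous_roster, new_jobs):
    """Carry jobs forward from the prior round, add newly reported jobs,
    then sort so current jobs come first, followed by ended jobs from
    most to least recent (dates are "YYYY-MM" strings)."""
    roster = list(previous_roster) + list(new_jobs)
    roster.sort(key=lambda job: job["end_date"] or "9999-12", reverse=True)
    return roster

previous_roster = [{"uid": 199901, "start_date": "1999-06", "end_date": None}]
new_jobs = [{"uid": 200001, "start_date": "2000-02", "end_date": "2000-08"}]
print(build_employer_roster(previous_roster, new_jobs))
```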

Roster use in the interview: Finally, the roster is used to determine the order in which the other questions about each topic are asked. In most cases, the survey collects far more information than is stored in the actual roster, and the answers to these questions remain outside the roster as raw data. So that the interview makes sense to the respondent, these additional questions are asked about the people or things on the roster in the order that the people or things are listed.

For example, the respondent first answers questions about industry, occupation, rate of pay, etc., for the first employer listed on the roster. The same questions are then asked about the second job, then the third job, and so on. Similarly, the first set of questions about household members refers to the first person listed on the roster. When all of those questions have been answered, the same questions are asked about the second person, the third person, etc.

How should researchers use the roster data in analysis? 

The data set is organized so that rosters can be easily found and used in research. Because rosters present key pieces of information in a structured format, they are the best place to obtain that information. Each roster has a unique name that serves as the beginning of the question name for all variables on the roster; the same name appears at the beginning of the variable title for each item on the roster. Users can also search under the term "Roster Item" in the Variable Type search criterion in Investigator. Different rosters have been used in different rounds, depending on the topics included in the interview and the type of information collected. The roster names and question names are shown in Figure 4.

Figure 4. Rosters Included Each Round

Roster | Question name | R1 | R2-4 | R5 | R6 and up
Household Information | HHI2 (rd. 1), HHI (rds. 2-9) | * | * | * | *
Nonresident Roster | NONHHI | * | * | * | *
Youth Information | YOUTH | * | | |
School Roster | NEWSCHOOL | | * | * | *
Employer Roster | YEMP | * | * | * | *
Freelance Jobs Roster | FREELANCE | * | * | * |
Training Roster | TRAINING | * | * | * | *
Biological Children Roster | BIOCHILD | * | * | |
Biological/Adopted Children Roster | BIOADOPTCHILD | | | * | *
Parent Household Information | PARHHI | * | | |
Parent Youth Information | PARYOUTH | * | | |
Partner/Spouse | PARTNERS | * | * | * | *
Other parents of respondent's children1 | OTHERPARENTS | | | | collapsed
Partner/spouse information1 | CUMPARTNERS | | | | collapsed
 
1These are collapsed rosters. The variables combine information across survey rounds. All respondents are represented in the roster, regardless of whether they were interviewed in the most recent round. These variables are listed as "XRND" rather than being associated with a survey year in the data.

Using rosters in single-round analyses

When looking at the data set, users will notice that many questions are repeated for each person or thing on the roster, and the titles for these repeated questions include a number. This number indicates the line on the roster that corresponds to the person or item being described in that variable. For example, the question "Self-Employed Business/Industry Job 02" indicates the industry of the second job listed on the respondent's self-employment roster. The researcher may then want to examine information such as the respondent's start and stop dates or rate of pay for that job. To find this information, he or she can then look at the data for those items contained in the roster for job #02, or the self-employment job that is on the second line of the roster. For all other questions asked after the roster was created in that same survey year, job #02 will refer to the same self-employment job.
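In an extract where variable names carry the roster line number as a suffix, pulling together everything known about one line is a simple column filter. The column names in the Python sketch below are illustrative placeholders, not actual NLSY97 question names.

```python
# Sketch: gather every column that describes roster line 02.
# Column names are illustrative placeholders.
import pandas as pd

df = pd.DataFrame({
    "PUBID": [1],
    "SE_INDUSTRY.01": [7860],    "SE_INDUSTRY.02": [6990],
    "SE_RATE_OF_PAY.01": [9.50], "SE_RATE_OF_PAY.02": [12.00],
})

job_02 = pd.concat([df[["PUBID"]], df.filter(regex=r"\.02$")], axis=1)
print(job_02)
```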

In some cases, the raw answers may be blanked out of the public use data set. If a reference number is not listed for a given question in the questionnaire, then that raw data item may only be represented in roster form. For example, answers to the raw data questions used to create the employer roster are blanked out and do not appear in the data.  In the printed questionnaire, these questions have no reference numbers. However, all of the data collected in these questions (except for confidential information like the name of the employer) appears in the employer roster.

For some variables, the roster information may be more accurate because some rosters are updated during the interview if the initial report was inaccurate.  When survey staff prepare the data for release, they clean up the rosters if necessary but do not necessarily clean the corresponding raw data. Finally, because many rosters are sorted in a particular order, the number of a person or item on the roster will not match the number in the questions that precede roster creation. For example, in the household screener (the SE questions), person #01 is the first household resident mentioned to the interviewer. In the household roster and all later interview questions, person #01 is the oldest person in the household who was eligible for the NLSY97.  Person #01 in the SE questions might be person #05 on the roster. It can be very difficult to determine to which person, school, or job a pre-sort question refers. For all of these reasons, roster data are always preferable to raw data in cases where both are available.

Using rosters from more than one round

Because the NLSY97 is a longitudinal survey, researchers often want to use data across survey rounds. However, household residents, jobs, and so on may move around on the roster in different interviews. That is, a father who was listed third on the roster in round 1 might move to position 2 or 4 in round 2. The unique identification numbers (UIDs) are the key to finding the same person or thing in different rounds. Most of the rosters contain variables assigning a unique number to each person or thing listed. This number never changes and can be used to link roster items across rounds. In some cases, it also makes it possible to link people between two different rosters in the same survey. For example, beginning in round 2 the unique ID listed for a child on the biological children roster is the same one assigned to that child on the household roster. Researchers can therefore examine data on both rosters about the same child.
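In practice, linking across rounds usually means merging on the respondent ID plus the roster item's unique ID rather than on the roster line number. The Python sketch below uses invented long-format data to show the idea.

```python
# Sketch with invented data: match the same household member across
# two rounds by unique ID, not by roster position.
import pandas as pd

round1 = pd.DataFrame({"PUBID": [1, 1], "uid": [101, 102],
                       "relationship": ["Mother", "Brother"]})
round2 = pd.DataFrame({"PUBID": [1, 1], "uid": [102, 103],
                       "age": [18, 2]})

linked = round1.merge(round2, on=["PUBID", "uid"], how="outer")
print(linked)  # uid 102 appears in both rounds and links cleanly
```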

An additional feature of most unique ID numbers is that they incorporate an indicator of the round in which the person or item was first reported. For example, IDs of roster items reported in round 1 may begin with "1" or "97," while those first reported in round 2 begin with "2" or "98." (Beginning with round 3, 4-digit years are used so that IDs begin with "1999" rather than just "99.")  UIDs for people on the household roster are constructed in a slightly different manner; researchers should refer to Household Composition for more information.

Created Variables

Created variables generally start with "CV_" or "CVC_" in the codebook. The "CV" variables are designated by survey year in the codebook, while the "CVC" variables are created as "cross-round" (XRND) variables, meaning the information used came from the respondent's latest interview, regardless of the survey round in which it was collected.

A few created variables have a prefix different from "CV" or "CVC." Sampling weight variables, for instance, have the variable names SAMPLING_WEIGHT and CS_SAMPLING_WEIGHT. Other exceptions to note include the validation variables for rounds 4 and 5, which have question name VALIDR_, and the timing variables (rounds 5 and up) with question names R5_TIM, R6_TIM, R7_TIM, and so forth. In addition, the family process variables constructed by Child Trends (see Appendix 9) have question names beginning with "FP_" in the codebook. In the Event History data, all variables are created (reference numbers for event history variables begin with the letter "E.")

Beginning in round 5 (2001), timing variables were created to measure the length of time a respondent took to complete the entire interview, along with a breakdown of the amount of time taken to complete each main questionnaire section. Each timing variable is tabulated in seconds, with one implied decimal place. Timing data can be found under the "Timing" Area of Interest in the NLS Investigator. In round 7, timing variables were expanded to show the length of time it took to complete subsections. Because of confidentiality concerns with the Welfare Knowledge section, round 7 timings are available only through the geocode release.
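If "one implied decimal place" is read as the stored integer representing tenths of a second, converting a timing value to seconds or minutes is a single division, as in this sketch (the raw value is hypothetical).

```python
# Assumes the implied decimal means the raw integer is stored in
# tenths of a second; the raw value here is hypothetical.
raw_timing = 12345           # as stored in the data
seconds = raw_timing / 10    # 1234.5 seconds
minutes = seconds / 60       # roughly 20.6 minutes
print(seconds, round(minutes, 1))
```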

In addition to the variables created by CHRR, Child Trends, Inc., an organization involved in the NLSY97 questionnaire design process, has created a number of scales and indexes from several groups of variables described in this section. These scales and indexes are intended to aid researchers in using the various data items relating to attitudes and behaviors. Descriptions of these variables appear in the Created Variables listings at the beginning of the relevant topical sections.

Although these Child Trends created variables are described only briefly in this guide, interested researchers may obtain a detailed discussion of the creation procedures in Appendix 9 of the NLSY97 Codebook Supplement. This document also summarizes statistical analyses of the scales and indexes, as well as related data items, performed by Child Trends researchers. These variables contain the prefix "FP_" in their question names (FP stands for Family Process Measures). 

New Variables Created by Researchers

Researchers sometimes use the NLS public datasets to generate a new variable for use in their research. In some cases, researchers want to make that new variable publicly available (through their own data repository) so that it can be easily accessed for follow-up studies. This is permissible as long as researchers are using public NLS data (rather than restricted data) and they make clear that they, rather than the NLS team, are the authors of the variable.