NLSW Documentation | National Longitudinal Surveys

All variables present on a main file data set (accessed through NLS Investigator) are documented via: (1) a cohort-specific codebook and (2) an accompanying codebook supplement. This section describes these components and discusses the important types of information found within each.

Codebook

The codebook is the principal element of the documentation system and contains information intended to be complete and self-explanatory for each variable in a data file. To view codebook information in NLS Investigator, select a list of variables and then click on a variable's reference number.

Every variable is presented within the documentation as a block of information called a "codeblock." Codeblock entries depict the following information: a reference number, variable title, coding information, frequency distribution, reference to the questionnaire item or source of the variable, and information on the derivation for created variables. The codeblocks of many variables include special notes containing additional information designed to assist in the accurate use of data from that variable.

Codebooks are arranged by reference number. Variables are first grouped according to survey year. Within each survey year, those variables related to the interview (e.g., interview method, interview date, reason for noninterview, and sampling weight) appear first, followed by variables picked up directly from the questionnaire and Information Sheet. In general, created and edited variables appear last, although the created environmental variables are grouped with variables related to the interview in the early survey years.

Important information: Codebooks and questionnaires

NLS codebooks are not a substitute for the questionnaires. Although these two pieces of documentation contain similar information, the questionnaires should be used to determine precise universe information.

Coding information

Each codeblock entry presents the set of legitimate codes that a variable may assume along with a text entry describing the codes. Users should note that coding information for a given variable in the NLS codeblock is not necessarily consistent with the codes found within the questionnaire or for the same variable across years. Use only the codebook coding information for analysis. The following types of code entries occur in NLS codeblocks:

Dichotomous variables

Dichotomous or yes/no variables are uniformly coded "Yes" = 1, "No" = 0. Other dichotomous variables have frequently been reformulated to permit this convention to be followed.

Discrete variables

Discrete (categorical) variables take one of a fixed set of codes, as in the case of the categories in 'Activity Most of Survey Week 93':

1 = Working
2 = With a job, not at work
3 = Looking for work
4 = Going to school
5 = Keeping house
6 = Unable to work
7 = Retired
8 = Other

Continuous variables

Continuous (quantitative) variables, such as 'Hourly Rate of Pay at Current or Last Job 83 *KEY*,' contain continuous data, but for ease of use the codebook presents a frequency distribution, as in the sample codeblocks above.

Combined quantitative-qualitative variables

Combined quantitative-qualitative variables are ostensibly quantitative but may also have nonquantitative (categorical) responses. They use integers equal to the actual values for the quantitative responses and 999 for the qualitative (categorical) response. For example, "YEAR STOPPED WORKING AT 1ST MOST RECENT JOB INTRVNG & LAST" is coded as follows:

60 thru 73 = actual year
999 = still working there
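
The coding above can be separated into its quantitative and qualitative parts before analysis. The following is a minimal sketch, not an official NLS routine: the function name is hypothetical, the reading of two-digit codes as years in the 1900s is an assumption, and negative missing-value codes are expected to be handled before the function is called.

```python
from typing import Optional, Tuple

STILL_WORKING = 999  # qualitative code taken from the codeblock above


def split_year_stopped(code: int) -> Tuple[Optional[int], bool]:
    """Return (calendar_year, still_working) for a non-missing code.

    Assumption: codes 60 through 73 stand for the years 1960 through 1973.
    """
    if code == STILL_WORKING:
        return None, True
    if 60 <= code <= 73:
        return 1900 + code, False
    # Anything else (including negative missing-value codes) is not handled here.
    raise ValueError(f"unexpected code: {code}")


print(split_year_stopped(72))   # quantitative response
print(split_year_stopped(999))  # qualitative response
```

Keeping the year and the "still working" flag in separate fields prevents 999 from being averaged in with real years.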

Multiple responses

In the early years of the surveys, response categories to multiple entry questions found in certain job search, child care, discrimination, or health questions were coded in a geometric progression. For example, more than one response to the question "Method of seeking employment to be used in next year" was possible. The response categories to that question were each assigned a value as follows:

1 = Checked with public employment agency
2 = Checked with private employment agency
4 = Checked with employer directly
8 = Checked with friends or relatives
16 = Placed or answered ads
32 = Other method

Multiple responses were then coded for each respondent by adding the individual codes, which yields a unique value for each combination. Such multiple entry variables were identified by an asterisk (*) next to the answer categories in the questionnaire. If a multiple entry has only a few unique combinations, the codebook will specify the exact combinations; those with many combinations need to be unpacked. See Appendix C: How to Unpack Multiple Entries to learn more about this process. After the 1989 (Mature Women) and 1991 (Young Women) surveys, this multiple entry practice was discontinued and all responses were coded as yes/no.
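
Because each response category is a distinct power of two, a summed code can be decomposed with bitwise tests. The sketch below illustrates the idea using the six categories listed above; it is not the procedure given in Appendix C, and the function name is hypothetical.

```python
from typing import List

# Category values and labels as listed for
# "Method of seeking employment to be used in next year".
METHODS = {
    1: "Checked with public employment agency",
    2: "Checked with private employment agency",
    4: "Checked with employer directly",
    8: "Checked with friends or relatives",
    16: "Placed or answered ads",
    32: "Other method",
}


def unpack_multiple_entry(code: int) -> List[str]:
    """Return the labels of every category contained in a summed code."""
    if code < 0 or code > sum(METHODS):
        # Negative codes are missing values; larger codes are out of range.
        raise ValueError(f"code out of range: {code}")
    return [label for value, label in METHODS.items() if code & value]


# 12 = 4 + 8: employer directly, plus friends or relatives
print(unpack_multiple_entry(12))
```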

Important information: Geometric progression discontinued

After the 1989 survey, the practice of coding multiple entry variables in a geometric progression was discontinued and all responses were coded as yes/no. In this system, the question above would have six corresponding variables in the codebook, one for each response category. Codes of 1 and 0 would indicate whether the respondent answered positively for each category. Respondents who do not know or refuse to answer the question receive the appropriate missing value for all the variables that correspond to that question. Respondents who do not know or refuse to respond to just one category receive the appropriate missing value for the corresponding variable.

The system for coding missing values in multiple response questions changed slightly in 1999. There are still separate variables for each response category, and respondents who do not know or refuse to respond to just one category are coded with the correct missing value for the corresponding variable. The difference is that, at the end of the series of variables, a new variable indicates that it is the final record for the series. In this variable, respondents who answered any or all of the category questions receive either a -8 or a 0 code, depending on the series, to indicate that they are done selecting response categories. In this variable, respondents who replied "don't know" to the entire series are coded as -2 and those who refused to answer the entire series are coded as -1. For some series, this final variable may have other options in addition to those described above.
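
The 1999 convention described above can be read for one respondent roughly as follows. This is a sketch under stated assumptions: the category names, the function name, and the status labels are hypothetical, and any final-record code other than -8, 0, -1, or -2 is simply reported as "other" because its meaning depends on the series.

```python
from typing import Dict, List, Tuple

DONE_CODES = {-8, 0}    # "done selecting"; which one applies depends on the series
REFUSED_SERIES = -1     # refused the entire series
DONT_KNOW_SERIES = -2   # "don't know" to the entire series


def summarize_series(categories: Dict[str, int], final_code: int) -> Tuple[str, List[str]]:
    """Return (status, selected categories) for one respondent's series."""
    if final_code == REFUSED_SERIES:
        return "refused", []
    if final_code == DONT_KNOW_SERIES:
        return "don't know", []
    status = "answered" if final_code in DONE_CODES else "other"
    # A category counts as selected only when coded 1; a 0 or a negative
    # missing-value code on a single category is not a selection.
    selected = [name for name, value in categories.items() if value == 1]
    return status, selected


# Hypothetical respondent: chose one category, answered "don't know" on another.
print(summarize_series({"public_agency": 1, "private_agency": 0, "ads": -2}, -8))
```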

Missing responses

Negative numbers are used to indicate that a respondent does not have a valid value for a particular variable. Different numbers indicate different reasons for nonresponse:

"Refusal" indicates that the respondent refused to answer a given question. These respondents are assigned a value of -1. This code is used for all interviews of this cohort.
"Don't know" indicates that the respondent did not know the answer to a given question. These respondents are assigned a value of -2. This code is used for all interviews of this cohort.
"Invalid skip" indicates that the respondent was not asked a question that she should have answered, usually due to programming or interviewer error. These respondents are assigned a value of -3. This code is only used consistently for CAPI interviews (1995-2003). CAPI is short for Computer-Assisted Personal Interviews.
"Valid skip" has slightly different meanings depending on survey year. In CAPI interviews (1995-2003), this code indicates that the respondent was skipped past the question intentionally, because she was not in the universe of respondents to whom that question applied. These respondents are assigned a value of -4. In paper and pencil interviews (PAPI), which were used from 1967-92, this code indicates either that the respondent is not in the applicable universe or there was some other error that resulted in a missing response (which generally would have resulted in an invalid skip code in a CAPI survey).
Finally, a "noninterview" value of -5 indicates that a respondent was not interviewed in that survey year. This code is used for all interviews of this cohort.

Important information: Missing values

The missing value codes described above are accurate for the 1999-2003 Mature Women and 1995-2003 Young Women data releases. In previous years, a more complicated system was used to indicate missing data in the PAPI interviews. Beginning in 1995, the missing values were reassigned using a standardized system that matches the Young Women's CAPI data as well as the other NLS cohorts. Beginning in 1999, the same process was applied to the Mature Women data. This standardization should make it easier to use the data in analysis. However, researchers using programs written for a previous release of the Mature and Young Women data may need to change the parts of their programming code related to missing values. Users who need more information about the codes previously used in order to make these adjustments should contact NLS User Services.

Three additional negative codes are used only with the Mature and Young Women's cohorts for particular types of nonresponse.

In questions dealing with usual hours per week worked, if the respondent reported that her hours varied, she was assigned a code of -6.
Women who had been widowed since the last survey were asked a series of questions regarding their husband's care and their financial situation since his death. A code of -7 was assigned to women whom the interviewer judged to be emotionally unable to answer these questions.
Some variables in multiple response question series include codes of -8, indicating that the respondent was done with the series.
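
The negative codes described in this section can be collected into a lookup table so that they are labeled or excluded before analysis. The sketch below assumes a variable with no legitimately negative values; the function names and the label wording are illustrative rather than taken from NLS software.

```python
from typing import Iterable, List, Optional

# Codes -1 through -5 are the standardized missing values; -6 through -8
# are the additional codes used with the Mature and Young Women cohorts.
MISSING_CODES = {
    -1: "refusal",
    -2: "don't know",
    -3: "invalid skip",
    -4: "valid skip",
    -5: "noninterview",
    -6: "hours varied",
    -7: "emotionally unable to answer",
    -8: "done with multiple response series",
}


def classify(value: int) -> Optional[str]:
    """Return the missing-value label for a code, or None for other values."""
    return MISSING_CODES.get(value)


def valid_values(values: Iterable[int]) -> List[int]:
    """Keep only non-negative values (assumes no valid negative data)."""
    return [v for v in values if v >= 0]


print(classify(-4))
print(valid_values([1, -4, 0, -5, 8]))
```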

Important information: Valid and invalid skips

In computer-assisted surveys, respondents are initially assigned a default code of -4 (valid skip) for all questions in the interview. Then the -4 codes are replaced by valid data. The -3 (invalid skip) codes must be inserted into the data as hand-edits when data archivists uncover skip pattern errors during the data cleaning process. Therefore, some respondents classified as valid skips may actually have skipped a question incorrectly. If researchers need to know the exact reason a question was not answered, they can examine the skip patterns and universes in the questionnaire to determine whether any additional respondents should have been identified as invalid skips.

Derivations

The decision rules employed in the creation of constructed variables have been included, whenever possible, in the codebook under the title "DERIVATIONS." This information is designed to enable researchers to determine whether available constructs are appropriate for their needs. In the 'Hourly Rate of Pay at Current or Last Job 83 *KEY*' example, the derivation describes in detail the items of the interview schedule used to create the variable. If the derivation is too lengthy to include in the codebook, the codeblock will instead refer users to the supplemental documentation item that contains variable creation information.

Frequency distribution

In the case of discrete (categorical) variables, frequency counts are normally shown in the first column to the left of the code categories. In the case of continuous (quantitative) variables, a distribution of the variable is presented using a convenient class interval. The format of these distributions varies.

Questionnaire item

"Questionnaire item" is a generic term identifying the source of data for a given variable. A questionnaire item may be a question, a check item, or an interviewer's reference item appearing within one of the survey instruments. Questionnaire item identifications are located in the extreme right hand column of the codebook. The question number, when available, is copied exactly from the questionnaire.

During PAPI interview years, all created variables have a question name of simply "CV." Created variables in CAPI survey years usually include the letters CV in the question name and usually have the word *KEY* in their title.

Valid values range

Depicted below the frequency distribution are the maximum and minimum fields, which define the range of valid values (the upper and lower limits) for a given question. "MINIMUM" indicates the smallest recorded value exclusive of nonresponse codes; "MAXIMUM" indicates the largest recorded value. In the case of the 'Hourly Rate of Pay' example, the maximum, or highest value recorded, is 9815 with two implied decimal places, or $98.15.
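
A stored value with implied decimal places can be converted to dollars by shifting the decimal point. The following sketch uses Python's decimal module to avoid floating-point rounding; the function name is hypothetical, and the number of implied places must be taken from the codebook for each variable.

```python
from decimal import Decimal


def apply_implied_decimals(raw: int, places: int = 2) -> Decimal:
    """Shift the decimal point of a stored integer left by `places` digits."""
    if raw < 0:
        # Negative values are missing-value codes, not amounts.
        raise ValueError(f"missing-value code, not an amount: {raw}")
    return Decimal(raw).scaleb(-places)


# The maximum from the 'Hourly Rate of Pay' example: 9815 is $98.15.
print(apply_implied_decimals(9815))
```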

Topcoding income and asset values

Confidentiality issues restrict release of all income and asset values. To ensure respondent confidentiality, income variables exceeding particular limits are truncated each survey year so that values exceeding the upper limits are converted to a set maximum value. These upper limits vary by year, as do the set maximum values. From 1968 through 1971, upper limit dollar amounts were set to 999999. From 1972 through 1980, upper limit variables were set to maximum values of 50000, and in 1982 and 1983 the set maximum value was 50001. Beginning in 1985, income amounts exceeding $100,000 were converted to a set maximum value of 100001.
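
The income limits above can be expressed as a lookup by survey year, which is useful for flagging values that may have been truncated. This is a sketch with hypothetical function names; it returns None for years the paragraph does not mention, and it cannot distinguish a truncated value from a genuine report equal to the set maximum (for example, exactly 50000 in 1972 through 1980).

```python
from typing import Optional


def income_topcode_value(year: int) -> Optional[int]:
    """Return the set maximum value for income variables in a survey year."""
    if 1968 <= year <= 1971:
        return 999999
    if 1972 <= year <= 1980:
        return 50000
    if year in (1982, 1983):
        return 50001
    if year >= 1985:
        return 100001
    return None  # year not covered by the description above


def may_be_topcoded(value: int, year: int) -> bool:
    """True when a value equals the set maximum for its survey year."""
    top = income_topcode_value(year)
    return top is not None and value == top


print(income_topcode_value(1975))
print(may_be_topcoded(100001, 1995))
```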

From the cohort's inception, asset variables exceeding upper limits were truncated to 999999. Beginning in 1983, assets exceeding one million were converted to a set maximum value of 999997. Starting in 1993, the Census Bureau also topcoded selected asset items if it considered that the release of the absolute value might aid in the identification of a respondent. This topcoding was conducted on a case-by-case basis with the mean of the top three values substituted for each respondent who reported such amounts.

Codebook supplements

Variable creation procedures and supplemental coding information are provided within each cohort's Codebook Supplement. There are separate codebook supplements for the Mature Women and Young Women cohorts.