Appendix 4: Geographic Variable Creation

Several variables in the main data set provide information about the respondent's area of residence. These variables permit researchers to identify key characteristics of the area without needing access to the geocode CD. In general, geographic variables are created using standard geocoding software. (Round 20 used Maptitude version 2023.) Because of this, programming code is usually not provided for these variables. An exception is the migration variable, for which programming code is linked below. This page also offers brief descriptions of the methods used to generate the geographic variables. For more information about the process of classifying a respondent's metropolitan area or about the geographic variables in general, refer to the introduction to the Geocode Codebook Supplement or contact NLS User Services.

Census Region of Residence at Survey Date

Variable Created: CV_CENSUS_REGION

This variable classifies respondents as residing in one of four regions defined by the U.S. Census Bureau. These regions are as follows:

Census Region: States
Northeast: Connecticut, Maine, Massachusetts, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, and Vermont
North Central: Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Missouri, Nebraska, North Dakota, Ohio, South Dakota, and Wisconsin
South: Alabama, Arkansas, Delaware, District of Columbia, Florida, Georgia, Kentucky, Louisiana, Maryland, Mississippi, North Carolina, Oklahoma, South Carolina, Tennessee, Texas, Virginia, and West Virginia
West: Alaska, Arizona, California, Colorado, Hawaii, Idaho, Montana, Nevada, New Mexico, Oregon, Utah, Washington, and Wyoming
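
The region assignment above amounts to a lookup from state to region. The following is a minimal Python sketch of that lookup, not the procedure used by survey staff; the names CENSUS_REGIONS, STATE_TO_REGION, and census_region are hypothetical, and the sketch returns the region name rather than the numeric code stored in CV_CENSUS_REGION.

```python
from typing import Optional

# Region membership exactly as listed in the table above.
CENSUS_REGIONS = {
    "Northeast": [
        "Connecticut", "Maine", "Massachusetts", "New Hampshire", "New Jersey",
        "New York", "Pennsylvania", "Rhode Island", "Vermont",
    ],
    "North Central": [
        "Illinois", "Indiana", "Iowa", "Kansas", "Michigan", "Minnesota",
        "Missouri", "Nebraska", "North Dakota", "Ohio", "South Dakota", "Wisconsin",
    ],
    "South": [
        "Alabama", "Arkansas", "Delaware", "District of Columbia", "Florida",
        "Georgia", "Kentucky", "Louisiana", "Maryland", "Mississippi",
        "North Carolina", "Oklahoma", "South Carolina", "Tennessee", "Texas",
        "Virginia", "West Virginia",
    ],
    "West": [
        "Alaska", "Arizona", "California", "Colorado", "Hawaii", "Idaho",
        "Montana", "Nevada", "New Mexico", "Oregon", "Utah", "Washington", "Wyoming",
    ],
}

# Invert the table so each state points at its region.
STATE_TO_REGION = {
    state: region for region, states in CENSUS_REGIONS.items() for state in states
}


def census_region(state: str) -> Optional[str]:
    """Return the region name for a state, or None if the state is not listed."""
    return STATE_TO_REGION.get(state)
```

For example, census_region("Ohio") returns "North Central", and a location outside the 50 states and the District of Columbia returns None.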

MSA Status at Survey Date

Variable Created: CV_MSA

In rounds 1-7, this variable provided users with information about whether the respondent lived in the central city of the MSA, in another part of the MSA, or outside of an MSA. As defined by the U.S. Census Bureau, a central city is the major city lying within a Metropolitan Statistical Area (MSA). Initially, a variable was created using the TIGER/Line files (a database developed by the Census Bureau) to determine whether the respondent lived in an MSA. A second variable was created based on "places" data in the Maptitude program that identified whether the respondent lived in the central city. The variables were then combined to produce a single MSA/central city variable. For rounds 1-7, respondents are coded as follows:

  1. not in MSA
  2. in MSA, not central city
  3. in MSA, central city
  4. in MSA, not known
  5. not in country
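
The combination of the two intermediate indicators into a single code can be sketched as follows. This is an illustrative Python sketch only; the function name msa_status and its arguments are hypothetical, and the sketch assumes that "not known" (code 4) means the respondent is in an MSA but central city status could not be determined.

```python
from typing import Optional


def msa_status(in_country: bool, in_msa: bool, in_central_city: Optional[bool]) -> int:
    """Combine the MSA and central city indicators into the rounds 1-7 codes.

    in_central_city is None when central city status could not be determined
    (an assumption about how code 4 arises).
    """
    if not in_country:
        return 5  # not in country
    if not in_msa:
        return 1  # not in MSA
    if in_central_city is None:
        return 4  # in MSA, not known
    return 3 if in_central_city else 2  # central city vs. rest of the MSA
```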

Beginning in round 8, new Census standards were used in the creation of this variable. Rather than MSAs, the Census Bureau now defines Core-Based Statistical Areas (CBSAs): statistical geographic entities consisting of the county or counties associated with at least one core (urbanized area or urban cluster) with a population of at least 10,000, plus adjacent counties having a high degree of social and economic integration with the core as measured through commuting ties with the counties containing the core (https://www.census.gov/programs-surveys/metro-micro.html).

Metropolitan and micropolitan statistical areas are the two categories of CBSAs. Metropolitan statistical areas have at least one urbanized area of 50,000 or more population, plus adjacent territory that has a high degree of social and economic integration with the core as measured by commuting ties. Micropolitan statistical areas are a newer set of statistical areas that have at least one urban cluster of at least 10,000 but less than 50,000 population, plus adjacent territory that has a high degree of social and economic integration with the core as measured by commuting ties. Metropolitan and micropolitan statistical areas are defined in terms of whole counties or county equivalents, including the six New England states. As of June 6, 2003, there were 362 metropolitan statistical areas and 560 micropolitan statistical areas in the United States.
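
The population thresholds that separate the two CBSA categories can be illustrated with a short Python sketch. It is a simplification: it looks only at the population of the largest core and ignores the commuting-tie criteria used to attach adjacent counties. The function name cbsa_category is hypothetical.

```python
def cbsa_category(largest_core_population: int) -> str:
    """Classify an area by the population of its largest core.

    Thresholds follow the text above: 50,000 or more for metropolitan,
    at least 10,000 but less than 50,000 for micropolitan.
    """
    if largest_core_population >= 50_000:
        return "metropolitan"
    if largest_core_population >= 10_000:
        return "micropolitan"
    return "not a CBSA"
```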

Round 8 of the NLSY97 used CBSA codes updated in January 2006; CBSA codes were updated again in December 2009, and the updated codes were implemented for the most recent rounds of the NLSY97.

The largest city in each metropolitan or micropolitan statistical area is designated a "principal city." Additional cities qualify if specified requirements are met concerning population size and employment. The term "principal city" replaces "central city," the term used in previous standards. In the NLSY97, information about whether the respondent lives in a CBSA and whether the respondent lives within a principal city of the CBSA is combined into a single variable. This variable, still called CV_MSA, is distinguished from the Rounds 1-7 MSA variables by a title stating that the variable uses "2000 Census standards." The current variable is coded as follows:

  1. not in CBSA
  2. in CBSA, not principal city
  3. in CBSA, principal city
  4. in CBSA, not known
  5. not in country

Rural vs. Urban

Variable Created: CV_URBAN-RURAL

The CV_URBAN-RURAL variable indicates whether the respondent lives in an urban or rural area. Areas are identified as urban or rural by the Census Bureau. According to the Census Bureau, about 25 percent of the U.S. population lives in rural areas. In rounds 1-7, using the 1990 Census standards, urban places were those in "urbanized areas" or "places" with a population of at least 2,500; all other areas were rural. Beginning in round 8, the definition of urban areas was changed to follow the new 2000 Census Bureau standards, which defined all territory within an urban cluster or urbanized area as "urban."

Users should note that this variable includes an "unknown" category, coded 2. This value is assigned to respondents whose zip code includes both urban and rural areas or whose residence cannot be identified precisely enough to classify it as urban or rural. Respondents without valid address data are assigned a value of -3, invalid skip. Respondents who live out of the country are assigned a value of -4, valid skip.
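
The coding rules in the preceding paragraph can be sketched in Python as shown below. Only the values 2, -3, and -4 are stated in the text; the values 0 for rural and 1 for urban are assumptions made for illustration, as are the function name urban_rural_code, its arguments, and the order in which the two skip conditions are checked.

```python
from typing import Optional

RURAL, URBAN = 0, 1   # assumed values; not stated in the text
UNKNOWN = 2           # stated in the text
INVALID_SKIP = -3     # no valid address data
VALID_SKIP = -4       # respondent lives out of the country


def urban_rural_code(has_valid_address: bool, in_country: bool,
                     area_type: Optional[str]) -> int:
    """Assign an urban/rural code.

    area_type is "urban", "rural", or None when the residence cannot be
    classified (for example, a zip code spanning both kinds of area).
    """
    if not has_valid_address:
        return INVALID_SKIP
    if not in_country:
        return VALID_SKIP
    if area_type == "urban":
        return URBAN
    if area_type == "rural":
        return RURAL
    return UNKNOWN
```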

Distance to Parents' Residence

Variables Created:

Distance from Parents

CV_DISTANCE_MOM_COL (COLLAPSED DISTANCE TO MOM)
CV_DISTANCE_DAD_COL (COLLAPSED DISTANCE TO DAD)

Quality of Distance to Parent variables

CV_DISTANCE_MOM_QUALITY (QUALITY OF DISTANCE TO MOM VAR)
CV_DISTANCE_DAD_QUALITY (QUALITY OF DISTANCE TO DAD VAR)

The distance variables are created based on the respondent's address and the addresses of their mother and father, as reported in the locator section of the questionnaire. The longitude and latitude of the respondent's and parents' residences are determined using the geocoding software. Distance is then calculated "as the crow flies" (that is, as a straight line between the residences) rather than according to actual travel routes.
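
A straight-line distance between two latitude/longitude points is commonly computed with the haversine (great-circle) formula. The Python sketch below shows that calculation; whether the geocoding software uses this exact formula is not stated here, so treat it as an assumption. The function name straight_line_miles and the Earth radius constant are choices made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def straight_line_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))
```

Because the calculation uses only the two coordinate pairs, the result ignores roads, terrain, and travel time.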

In the public-use dataset, the distance variables are presented using the following categories:

  1. Lives in the Same Household
  2. 1 to 5 miles
  3. 6 to 10 miles
  4. 11 to 30 miles
  5. 31 to 60 miles
  6. 61 to 100 miles
  7. 101 to 200 miles
  8. 201 to 400 miles
  9. 401 to 700 miles
  10. > 700 miles
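
The collapsing of an exact distance into the ten categories above can be sketched in Python as follows. The text does not say how fractional distances or distances under one mile are handled, so this sketch makes two labeled assumptions: distances are rounded to the nearest whole mile (halves round up), and a distance that rounds to less than 1 mile for parents in a different household falls into category 2. The names UPPER_BOUNDS and collapse_distance are hypothetical.

```python
import bisect

# Upper bound (in whole miles) of categories 2 through 9, from the list above.
UPPER_BOUNDS = [5, 10, 30, 60, 100, 200, 400, 700]


def collapse_distance(miles: float, same_household: bool = False) -> int:
    """Map an exact distance in miles to a collapsed category (1-10)."""
    if same_household:
        return 1
    whole_miles = int(miles + 0.5)  # assumption: round half up to whole miles
    # bisect_left returns how many upper bounds lie strictly below whole_miles,
    # so a value equal to a bound stays in that bound's category.
    return 2 + bisect.bisect_left(UPPER_BOUNDS, whole_miles)
```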

On the restricted-use Geocode CD, the exact distance is available.

Some reported addresses are incomplete, so the exact street address cannot be determined. These addresses are assigned the longitude and latitude of the centroid (essentially the midpoint) of the zip code in which they are located; such addresses are described as "zip centroided." The quality variables for the distance to parents' residence alert users to whether the respondent's or the parent's residence was zip centroided. Distances based on a centroid are less precise than those based on a street address. The quality variables are coded as follows:

  1. Neither respondent nor parent is zip centroided
  2. Respondent is zip centroided
  3. Parent is zip centroided
  4. Both respondent and parent are zip centroided
  5. Respondent and/or parent at a foreign location
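
The quality codes above can be derived from three flags, as in the Python sketch below. The function name distance_quality and its arguments are hypothetical, and the sketch assumes that a foreign location takes precedence over the centroid flags, which the text does not state.

```python
def distance_quality(respondent_centroided: bool, parent_centroided: bool,
                     any_foreign: bool = False) -> int:
    """Return the quality code (1-5) for a distance-to-parent value."""
    if any_foreign:
        return 5  # assumption: foreign location overrides the centroid flags
    if respondent_centroided and parent_centroided:
        return 4
    if parent_centroided:
        return 3
    if respondent_centroided:
        return 2
    return 1
```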

Migration History (Location)

Variable Created: CV_MIGRATE.xx

To provide information about the respondent's migration history, survey staff created this variable based on state and county codes, as assigned by the geocoding program. The variable released to public users classifies respondent moves into one of the following four categories: 

  1. Move within county
  2. Move within state; different county
  3. Move between states
  4. Move to or from a foreign country

Respondents who have not moved since the previous interview are assigned a value of -4, valid skip. If respondent address information is incomplete and survey staff are unable to determine locations of the respondent's residences, the respondent is assigned a value of -3, invalid skip. The input variables from the household are coded with a '0' when the respondent address is in a foreign country.

The program below creates the variable that reports migration history.  

Link to Migration History program file
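
Separately from the linked program file, the classification rules described in this section can be sketched in Python. This is not the survey staff's program. The function name classify_move, the Location type, and the treatment of a missing location as None are all hypothetical, and the sketch assumes a state code of 0 marks a foreign address, following the note on input variables above.

```python
from typing import Optional, Tuple

# (state code, county code); a state code of 0 is taken to mean a foreign address.
Location = Tuple[int, int]


def classify_move(moved: bool, previous: Optional[Location],
                  current: Optional[Location]) -> int:
    """Classify a move between the previous and current interview residences."""
    if not moved:
        return -4  # valid skip: no move since the previous interview
    if previous is None or current is None:
        return -3  # invalid skip: a location could not be determined
    prev_state, prev_county = previous
    curr_state, curr_county = current
    if prev_state == 0 or curr_state == 0:
        return 4  # move to or from a foreign country
    if prev_state != curr_state:
        return 3  # move between states
    if prev_county != curr_county:
        return 2  # move within state; different county
    return 1      # move within county
```

County codes are compared only after the state codes match, since the same county code can appear in more than one state.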

Residence at Age 12

Variables Created:

CV_CENSUS_REGION_AGE_12
CV_MSA_AGE_12
CV_URBAN-RURAL_AGE_12
CV_CENSUS_REGION_AGE_12_YCHR
CV_MSA_AGE_12_YCHR
CV_URBAN-RURAL_AGE_12_YCHR

A set of created variables provides information about the respondent's residence at age 12. The first three variables on the list were created using data from the round 1 parent interview, and therefore are only available for respondents with a parent interview. These variables provide Census region, MSA, and urban/rural status using the same creation procedures as described above for the variables for residence as of the interview date.

In rounds 6-9, the survey included a "Childhood Retrospective" section that retrieved some childhood information from respondents who did not have a round 1 parent interview. The final three variables on the list above were created using this information. Note that each of these variables combines information across rounds 6-9 into a single item and that they are categorized as 1997 variables in the Investigator.

A small number of respondents have data in both sets of variables. Researchers using these data should be aware of this overlap and decide which source to use for these respondents.
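
One way to handle the overlap is to pick a single value per respondent by a fixed rule. The Python sketch below prefers the parent-interview value and falls back to the Childhood Retrospective (YCHR) value. This preference is only one possible choice, not a recommendation from survey staff. The function name residence_at_age_12 is hypothetical, and the sketch assumes that negative values are missing-data codes.

```python
from typing import Optional


def residence_at_age_12(parent_value: Optional[int],
                        ychr_value: Optional[int]) -> Optional[int]:
    """Return one age-12 value, preferring the parent-interview report.

    Negative values are treated as missing (an assumption about the codes).
    """
    def is_valid(value: Optional[int]) -> bool:
        return value is not None and value >= 0

    if is_valid(parent_value):
        return parent_value
    if is_valid(ychr_value):
        return ychr_value
    return None  # no usable report from either source
```

A researcher who prefers the respondent's own report could swap the order of the two checks.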