Introduction to NLSY97 Geocode Data

National Longitudinal Survey of Youth - 1997 Cohort

The NLSY97 geocode file contains detailed information on the geographic residence of each NLSY97 respondent. These data permit researchers to analyze the influences that a respondent's area of residence or type of environment may have on outcomes such as education and employment. The geocode CD includes just the restricted geographic variables plus the respondent identification number, which can be used to link the geocode data with the public data available via NLS Investigator.

In addition to variables identifying the location of the respondent's residence, this supplemental data file contains selected variables drawn from other data sources and matched to the respondents' residences or schools. In rounds 1-5, a number of variables were taken from the County and City Data Book (CCDB); survey staff stopped creating these variables beginning in round 6, but users can still match NLS data to the CCDB if desired. Outside sources used in all rounds include the BLS publication Employment and Earnings and the Integrated Postsecondary Education Data System (IPEDS). This introduction describes how the variables on the NLSY97 geocode data file were created.


Important Information

Researchers interested in obtaining the geocode CD must apply to the Bureau of Labor Statistics. Information on the application process is available on the BLS website.

Geocode Data File Creation Procedures

Standard geocoding software is used to geocode respondent addresses and create the geocode file. This software program links respondent address data to standard geographic information such as the FIPS (Federal Information Processing Standards) codes for state and county. FIPS codes are listed in Attachment 100. Three graduated matching methods were applied, depending on the quality of the address data available.

  1. An automated match was done between the respondent's locating address data and the geocoding software database (round 15 used ESRI ArcGIS (Arc Map version 10.0)). Address records with matching street segments were assigned the latitude and longitude of the location. In some cases, addresses had to be cleaned before they could be matched by the geocoding software. Cleaning involves steps such as standardizing the address format, correcting obvious misspellings, identifying apartment numbers and locating them in the correct field, etc. It does not include any changes that might result in a change in the actual address location.
  2. For some addresses, the procedure outlined in Step #1 failed to produce a match between the respondent's address data and the geocoding software database. In these cases, geocode staff used the geocoding software to locate the correct street. If the street number could be located along this street, the latitude and longitude were assigned. However, some streets in the geocoding software database do not include information about street numbers and some respondents do not report street numbers when providing their address. If either is the case, the address is manually located in the center of the street. The street is then classified as either a short street or a long street. Long streets cross Census tract boundaries while short streets do not. As a result, the level of certainty about geographical information is much higher for short streets than for long streets.
  3. Addresses unmatched by either of the first two procedures were assigned latitude and longitude coordinates according to a 5-digit zip centroid. A centroid is essentially the midpoint of a zip code area. The geographic information is less certain for respondents located using the zip centroid method.

Researchers can identify the method used to locate the respondent's address by using the variable GEO06, which provides information about the quality of the geographic match. This variable differentiates between addresses located based on the actual address, in the center of a short street, in the center of a long street, or using the zip centroid method. This variable can be used to determine the level of certainty for the respondent's geographic data.
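
The graduated matching procedure and the resulting match-quality categories can be sketched in code. This is a minimal illustration, not the geocoding software's actual logic: the AddressEvidence structure, its field names, and the string labels are hypothetical, and the real GEO06 code values should be taken from the codebook.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels; the numeric GEO06 codes are listed in the codebook, not here.
ACTUAL_ADDRESS = "actual address"
SHORT_STREET = "center of short street"
LONG_STREET = "center of long street"
ZIP_CENTROID = "zip centroid"


@dataclass
class AddressEvidence:
    """What could be established for one address (hypothetical structure)."""
    street_number_located: bool   # street number placed on a street segment (steps 1-2)
    street_located: bool          # street found, but no usable street number (step 2)
    crosses_tract_boundary: bool  # True for a "long" street; ignored otherwise
    zip_known: bool               # a 5-digit zip code is available (step 3)


def match_quality(ev: AddressEvidence) -> Optional[str]:
    """Return the most precise placement the evidence supports, or None."""
    if ev.street_number_located:
        return ACTUAL_ADDRESS
    if ev.street_located:
        # Long streets cross Census tract boundaries; short streets do not.
        return LONG_STREET if ev.crosses_tract_boundary else SHORT_STREET
    if ev.zip_known:
        return ZIP_CENTROID
    return None  # address could not be geocoded
```

A researcher who needs tract-level certainty might, for example, keep only cases whose category is "actual address" or "center of short street".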

The respondent's residence is further described by metropolitan area. In rounds 1-7, the variables define the respondent's residence by Metropolitan Statistical Area (MSA), Consolidated Metropolitan Statistical Area (CMSA), or New England Consolidated Metropolitan Area (NECMA). The MSA, CMSA, and NECMA definitions in the 1994 County and City Data Book (CCDB) are used to create the MSA variable. In this method, survey staff compare the respondent's county and state of residence to the MSA definitions to determine if the respondent resides in an MSA. If so, GEO03 lists the respondent's MSA, CMSA, or NECMA of residence.  In rounds 1-5, GEO04 lists the Primary Metropolitan Statistical Area (PMSA) code for respondents residing in a CMSA (see Attachment 101 for further explanation). In rounds 1-5, an additional variable, GEO05, reports whether the code in GEO03 is an MSA, CMSA, or NECMA. Users should note that there are some slight differences between the 1994 codes and the Census Bureau's FIPS standards.

Beginning in round 8, the data follow the new Census Bureau standards for Core-Based Statistical Areas (CBSA). GEO03 presents the code for the respondent's CBSA rather than MSA.  More information about the new CBSA coding system is in Attachment 101.

Table 1. Residence Variables Available by Survey Round

Variable  Description                                               Rd 1  Rd 2  Rd 3  Rd 4  Rd 5  Rd 6  Rd 7  Rds 8 & up
GEO01     County of Residence                                       *     *     *     *     *     *     *     *
GEO02     State of Residence                                        *     *     *     *     *     *     *     *
GEO04     PMSA of Residence                                         *     *     *     *     *
GEO05     MSA/CMSA/NECMA of Residence Record Type                   *     *     *     *     *
GEO06     Quality of Match Flag for Geographic Residence Variables  *     *     *     *     *     *     *     *

Created Variables

Main File Created Variables. Three of the created geographic variables may be found on the main file NLSY97 data release in addition to their inclusion on the geocode CD.  First, CV_CENSUS_REGION provides researchers with information about whether the respondent lives in the Northeast, North Central, South, or West region of the country as defined by the Census Bureau. The states comprising each region are listed in the codebook for the region variable as well as the "Geographic Indicators" section of the NLSY97 User's Guide.

The second variable, CV_MSA, identifies the respondent's residence status related to a metropolitan area.  For rounds 1-7, this variable reports whether the respondent lives in the central city of an MSA, in an MSA but not in the central city, or outside of an MSA. The central city boundaries are defined by the 1992 Census TIGER/Line files and are included in the geocoding software; the MSA definitions used in this variable are the standard Census Bureau definitions rather than those drawn from the 1994 CCDB. This means that a few respondents may be listed as residing in an MSA in the status variable but do not have an MSA code in GEO03. Beginning in round 8, this variable reports whether the respondent lives in the principal city of a CBSA, in a CBSA but not in the principal city, or outside of a CBSA. The principal city boundaries are defined by the 2000 Census Bureau TIGER/Line files as included in the geocoding software.

Finally, the main file variable CV_URBAN-RURAL indicates whether the respondent lives in an urban or rural area. Areas are identified as urban or rural by the Census Bureau. According to the Census Bureau, about 25 percent of the U.S. population lives in rural areas. In rounds 1-7, using the 1990 Census standards, urban places were those in "urbanized areas" or "places" with a population of at least 2,500; all other areas were rural. Beginning in round 8, the definition of urban areas was changed to follow the new 2000 Census Bureau standards, which defined all territory within an urban cluster or urbanized area as "urban." 

If the respondent's residence was located using a street name match (method 2 above) or a zip centroid match (method 3), the MSA/CBSA status and urban/rural variables are further evaluated. The MSA/CBSA variable indicates whether the respondent lives in an MSA (or CBSA beginning in round 8). We first evaluate whether the street or zip code falls completely inside or outside of the boundaries of the MSA/CBSA and assign the appropriate status. Next, if the respondent is within the MSA/CBSA, we determine whether the street or zip code falls completely inside or outside the boundaries of the central city, and assign the appropriate status. In rounds 1-11, if we could locate the respondent within an MSA/CBSA but could not determine central city status, then the respondent was coded as "4 - MSA/CBSA, central city status unknown"; if we could not determine whether the respondent was in an MSA/CBSA, an invalid skip was assigned. In rounds 12-15, the code "4 - CBSA, central city status unknown" was used both for respondents whose street or zip code crossed the boundaries of the central city and for respondents whose street or zip code crossed the CBSA boundaries.
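
The evaluation described above amounts to a short decision rule, sketched here under stated assumptions. Only the code 4 and the invalid skip value -3 come from this documentation; the function name, its arguments, and the three string labels are hypothetical stand-ins for the codebook values. A value of None means the street or zip code crosses the relevant boundary, so the status cannot be determined.

```python
from typing import Optional, Union

UNKNOWN_CITY_STATUS = 4   # "4 - MSA/CBSA, central city status unknown" (from the text)
INVALID_SKIP = -3         # invalid skip code (see Missing Data)

# Hypothetical labels; the numeric codes for these statuses are in the codebook.
OUTSIDE = "outside MSA/CBSA"
IN_AREA_NOT_CITY = "in MSA/CBSA, not central city"
IN_CENTRAL_CITY = "in central city"


def msa_status(in_area: Optional[bool],
               in_city: Optional[bool],
               survey_round: int) -> Union[int, str]:
    """Assign MSA/CBSA status for a street-name or zip-centroid match.

    in_area / in_city: True or False when the street or zip code lies
    completely inside or outside the boundary; None when it crosses it.
    """
    if not 1 <= survey_round <= 15:
        raise ValueError("the text describes rounds 1-15 only")
    if in_area is None:
        # Rounds 1-11: invalid skip. Rounds 12-15: folded into code 4.
        return INVALID_SKIP if survey_round <= 11 else UNKNOWN_CITY_STATUS
    if not in_area:
        return OUTSIDE
    if in_city is None:
        return UNKNOWN_CITY_STATUS
    return IN_CENTRAL_CITY if in_city else IN_AREA_NOT_CITY
```

One consequence worth noting: in rounds 12-15 a value of 4 does not by itself guarantee that the respondent lives inside a CBSA.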

Similarly, respondents are only assigned to an urban or rural status if their entire street or zip code lies within an urban or rural area. If the street or zip code crosses an urban/rural boundary, the respondent is assigned to an unknown status.

Integrated Postsecondary Education Data System (IPEDS) Codes. During the interview, respondents report the name and location of each college they have attended. Information from the Integrated Postsecondary Education Data System (IPEDS) is then merged with the name and address of the youth's college to provide users with the code identifying the school (GEO69) and its location (GEO70). More information about college codes is provided in Attachment 102.

Unemployment Rate Variable Creation. The next supplemental created variable, GEO71, is the unemployment rate for the respondent's area of residence. Unlike the unemployment variables listed in Table 2 below, which are reported for the respondent's county and are only available through round 5, this variable provides the unemployment rate for the respondent's metropolitan area, if applicable, or the balance of the respondent's state. The sources of the state and metropolitan area labor force data used for the unemployment rate variable are as follows:

  Round 1 Data for March 1998, published in May 1998 Employment and Earnings
  Round 2 Data for March 1999, published in May 1999 Employment and Earnings
  Round 3 Data for March 2000, published in June 2000 Employment and Earnings
  Round 4 Data for March 2001, published in June 2001 Employment and Earnings
  Round 5 Data for March 2002, published in June 2002 Employment and Earnings
  Round 6 Data for March 2003, published in June 2003 Employment and Earnings
  Round 7 Data for March 2004, published in June 2004 Employment and Earnings
  Round 8 Data for March 2005, published in May 2006 Employment and Earnings
  Round 9 Data for March 2006, available online*
  Round 10 Data for March 2007, available online*
  Round 11 Data for March 2008, available online*
  Round 12 Data for March 2009, available online*
  Round 13 Data for March 2010, available online*
  Round 14 Data for March 2011, available online*
  Round 15 Data for March 2012, available online*
  * These data can be accessed through the BLS News Release Archive.

Employment and Earnings, published by the U.S. Department of Labor, Bureau of Labor Statistics, lists the civilian labor force and number of unemployed persons for every state and metropolitan area.

The respondent's metropolitan statistical area is assigned based on the latitude and longitude of current residence using the process described earlier in this introduction. If a respondent resides in a metropolitan area that is listed in Employment and Earnings, then the unemployment rate in the NLSY97 variable is the unemployment rate for that metropolitan area. This rate is calculated by dividing the number of unemployed persons by the number of people in the civilian labor force as reported by BLS.
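
As a small worked example of the calculation just described, the sketch below divides the number of unemployed persons by the civilian labor force. The function name and the figures in the comment are illustrative only; this documentation does not state whether GEO71 is stored as a fraction, a percentage, or a scaled integer, so check the codebook before comparing values.

```python
def unemployment_rate(unemployed: int, civilian_labor_force: int) -> float:
    """Unemployed persons divided by the civilian labor force, as a fraction."""
    if civilian_labor_force <= 0:
        raise ValueError("civilian labor force must be positive")
    return unemployed / civilian_labor_force


# Illustrative figures, not real data: 5,000 unemployed in a labor force of 100,000.
example_rate = unemployment_rate(5_000, 100_000)
```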

If the respondent does not reside in a metropolitan area, he or she is assigned a "balance of state" unemployment rate. For these cases, the figures provided for the state and its metropolitan areas are used to compute the unemployment rate for the portion of the state that is not represented in any metropolitan statistical area. In some rounds, because the Employment and Earnings numbers are based on an older set of MSA codes than the NLSY97 data, there are also some cases in which NLSY97 metropolitan areas do not match those used in Employment and Earnings. In these cases, respondents are assigned the balance of state unemployment rate even though they do reside in a metropolitan area. (Interested users can examine the Employment and Earnings MSA definitions in each year's May edition; NLSY97 codes are provided in Attachment 101.)
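
The balance-of-state calculation can be sketched as a subtraction of metropolitan totals from state totals. This is an assumption-laden illustration: the function names and input layout are hypothetical, the figures in the usage note are invented, and the sketch assumes the metropolitan figures passed in cover only the portion of each area that lies within the state. The documentation does not say how areas spanning state lines were handled.

```python
from typing import Dict, Iterable, Optional, Tuple

Counts = Tuple[int, int]  # (unemployed persons, civilian labor force)


def balance_of_state_rate(state: Counts, metro_areas: Iterable[Counts]) -> float:
    """Unemployment rate for the part of a state outside all metropolitan areas."""
    metros = list(metro_areas)
    unemployed = state[0] - sum(u for u, _ in metros)
    labor_force = state[1] - sum(lf for _, lf in metros)
    if labor_force <= 0:
        raise ValueError("no labor force remains outside the metropolitan areas")
    return unemployed / labor_force


def assigned_rate(metro_code: Optional[str],
                  metro_rates: Dict[str, float],
                  balance_rate: float) -> float:
    """Use the metropolitan rate when the code is listed; otherwise fall back.

    The fallback also covers respondents whose metropolitan code has no
    counterpart in the published list, as described above.
    """
    if metro_code is not None and metro_code in metro_rates:
        return metro_rates[metro_code]
    return balance_rate
```

With invented state totals of 50,000 unemployed and a labor force of 1,000,000, and two metropolitan areas of (20,000, 500,000) and (10,000, 300,000), the balance of state is 20,000 / 200,000, or 0.10.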

Randomized Code for Primary Sampling Unit (PSU). The geocode CD includes a variable identifying respondents who lived in the same sampling area in the initial survey year. Defined by NORC, primary sampling units are the areas used to draw the sample for the NLSY97 (see "Sample Design and Screening Process" for details). This variable, GEO72, presents a scrambled version of the PSU code so that researchers can identify respondents from the same area but cannot determine the exact PSU from which they were drawn. The randomized PSU data can be used in the estimation of design effects for the NLSY97 sample.

CCDB variables. In rounds 1-5, a number of additional created variables were provided in the NLSY97 geocode data file. Unless otherwise noted, these variables are based on the 1994 County and City Data Book (CCDB), prepared by the U.S. Bureau of the Census. The CCDB data file includes information from the 1990 Census of the Population and Housing, the Current Population Surveys, and other supplemental data derived from a variety of federal government and private agencies. Table 2 lists the CCDB variables included on the geocode CD for rounds 1-5. These variables are no longer created by NLS staff beginning in round 6, but interested researchers can use the state and county FIPS codes (from GEO01 and GEO02) to match NLSY97 information with the CCDB or other similar sources of geographic data.
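
Matching on state and county FIPS codes usually means building a five-digit key from the two-digit state code and the three-digit county code. The sketch below assumes GEO02 and GEO01 are available as integers and that negative values are missing-data codes; the record layout, the helper names, and the county table contents are hypothetical.

```python
from typing import Dict, List, Optional


def county_fips_key(state_code: int, county_code: int) -> Optional[str]:
    """Build a 5-digit state+county FIPS key, or None if either code is missing."""
    if state_code < 0 or county_code < 0:   # negative values are missing-data codes
        return None
    return f"{state_code:02d}{county_code:03d}"


def attach_county_data(respondents: List[dict],
                       county_table: Dict[str, dict]) -> List[dict]:
    """Left-join county-level fields onto respondent records by FIPS key."""
    merged = []
    for row in respondents:
        key = county_fips_key(row["GEO02"], row["GEO01"])
        extra = county_table.get(key, {}) if key is not None else {}
        merged.append({**row, **extra})
    return merged
```

For example, state code 39 and county code 49 combine to the key "39049". Respondents with missing codes keep their original fields and receive no county-level data.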

Table 2. County-level geocode variables

Variable  Description
GEO07  1990 land area in square miles
GEO08  1992 population
GEO09  1992 population, 1990 square miles
GEO10  1990 population by race, White
GEO11  1990 population by race, Black
GEO12  1990 population by race, American Indian, Eskimo, or Aleut
GEO13  1990 population by race, Asian or Pacific Islander
GEO14  1990 Hispanic origin population (of any race)
GEO15  1990 Hispanic origin population, % of total population
GEO16  1990 population by age, % under 5 years
GEO17  1990 population by age, % 5 to 17 years
GEO18  1990 population by age, % 18 to 20 years
GEO19  1990 population by age, % 21 to 24 years
GEO20  1990 population by age, % 25 to 34 years
GEO21  1990 population by age, % 35 to 44 years
GEO22  1990 population by age, % 45 to 54 years
GEO23  1990 population by age, % 55 to 64 years
GEO24  1990 population by age, % 65 to 74 years
GEO25  1990 population by age, % 75 years and older
GEO26  1990 population--base for GEO15 to GEO25
GEO27  1990 male population
GEO28  1990 % of persons 5 years and older living in different house in 1985
GEO29  1990 % of persons 5 years and older living in different house, same state in 1985
GEO30  1990 % of persons 5 years and older living in different house, different state in 1985
GEO31  1990 family households, percent with own children under 18 years
GEO32  1990 female householders (no spouse present), family households
GEO33  1990 female householders (no spouse present), family households, % with own child
GEO34  1988 number of births
GEO35  1988 births, % to mothers under 20 years
GEO36  1988 births per 1,000 population
GEO37  1988 population--base for GEO36 and GEO39
GEO38  1988 number of deaths
GEO39  1988 deaths per 1,000 population
GEO40  1988 deaths of infants under 1 year per 1,000 live births
GEO41  1988 active nonfederal physicians per 100,000 population [copyright]
GEO42  1991 community hospital beds per 100,000 population [copyright]
GEO43  1991 serious crimes per 100,000 population
GEO44  1990 persons 25 years and over, % high school graduate or higher
GEO45  1990 persons 25 years and over, % with bachelor's degree or higher
GEO46  1989 median family money income
GEO47  1989 per capita money income
GEO48  1989 % of families with income below poverty level
GEO49  1990 total families--base for GEO48
GEO50  1989 % of families with female householder (no spouse present) below poverty level
GEO51  1990 female householders (no spouse present), family households--base for GEO50
GEO52  1989 % of persons with income below poverty level
GEO53  1989 % of related children under 18 years below poverty level
GEO54  1990 workers 16 years and over, % working outside county of residence
GEO55  1991 civilian labor force
GEO56  1991 civilian labor force--number unemployed
GEO57  1991 civilian labor force--unemployment rate
GEO58  1990 civilian labor force
GEO59  1990 civilian labor force--% female
GEO60  1990 civilian labor force--% unemployed
GEO61  1990 civilian labor force--number employed
GEO62  1990 civilian labor force, % employed in agriculture, forestry, and fisheries
GEO63  1990 civilian labor force, % employed in manufacturing
GEO64  1990 civilian labor force, % employed in wholesale and retail trade
GEO65  1990 civilian labor force, % employed in finance, insurance, and real estate
GEO66  1990 civilian labor force, % employed in health services
GEO67  1990 civilian labor force, % employed in public administration
GEO68  1990 per capita personal income

Migration History Variables

In the household information (YHHI) section, the NLSY97 survey collects information about each residence of the respondent since the previous interview date. Respondents who had moved to a different city, county, or state were asked to report the date of the move and the new city, county, and state of residence. These data were recorded for each move. In each round (starting in round 2), these data were geocoded using the standard state and county FIPS codes. The codes are included on the geocode CD so that researchers can track respondents' migration patterns.

Geocoded migration variables can be located in the data set by searching for question names that start with "GEO_M." For each move, the respondent will have a state variable (e.g., GEO_M_ST.01) and a county variable (e.g., GEO_M_CO.01). The number at the end of the question name indicates which move the data apply to. For example, variables ending in ".01" refer to the first move after the last interview, those ending in ".02" refer to the second move, and so on. More information about migration history variables is provided in Attachment 103.
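
The naming convention above lends itself to simple pattern matching. The sketch below groups one respondent's state and county migration variables by move number. It assumes question names follow exactly the GEO_M_ST.NN and GEO_M_CO.NN forms shown in the examples; the record layout and function names are hypothetical, and real extracts may need the pattern adjusted.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches the two example forms given above, e.g. GEO_M_ST.01 and GEO_M_CO.01.
_MIGRATION_NAME = re.compile(r"^GEO_M_(ST|CO)\.(\d{2})$")


def parse_migration_name(name: str) -> Optional[Tuple[str, int]]:
    """Return ("state" or "county", move number), or None for other variables."""
    match = _MIGRATION_NAME.match(name)
    if match is None:
        return None
    kind = "state" if match.group(1) == "ST" else "county"
    return kind, int(match.group(2))


def moves_in_order(record: Dict[str, int]) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Collect (move number, state code, county code) tuples sorted by move."""
    moves: Dict[int, Dict[str, int]] = {}
    for name, value in record.items():
        parsed = parse_migration_name(name)
        if parsed is None:
            continue
        kind, number = parsed
        moves.setdefault(number, {})[kind] = value
    return [(n, moves[n].get("state"), moves[n].get("county")) for n in sorted(moves)]
```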

To support research on respondent mobility and to supplement the state/county migration variables, we created a series of variables for the distance between respondent addresses at each interview round. This pairwise matrix lets users consider the distance between residences and identify return migration to an area where the respondent has lived in the past. A similar set of variables reports the distance between the respondent's and parents' residence(s). The migration distance matrix and parent distance variables are described in detail in Attachment 103.
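
One way the pairwise distances might be used to flag return migration is sketched below. Everything about the data layout is an assumption: distances are supplied as a dictionary keyed by pairs of round numbers, negative values are treated as missing, and the threshold and its units are chosen by the researcher. Attachment 103 documents the actual variable names and units.

```python
from typing import Dict, List, Optional, Tuple


def return_moves(distances: Dict[Tuple[int, int], float],
                 rounds: List[int],
                 threshold: float) -> List[Tuple[int, int, int]]:
    """Find (earlier, away, back) round triples that suggest a return move.

    A return is flagged when the residence in round `back` lies within
    `threshold` of the residence in round `earlier`, and some round in
    between lies farther than `threshold` from it.
    """
    def dist(a: int, b: int) -> Optional[float]:
        value = distances.get((a, b), distances.get((b, a)))
        if value is None or value < 0:   # negative values treated as missing
            return None
        return value

    found = []
    for i, earlier in enumerate(rounds):
        for k in range(i + 2, len(rounds)):
            back = rounds[k]
            d_back = dist(earlier, back)
            if d_back is None or d_back > threshold:
                continue
            for away in rounds[i + 1:k]:
                d_away = dist(earlier, away)
                if d_away is not None and d_away > threshold:
                    found.append((earlier, away, back))
                    break
    return found
```

With invented distances of 300 between rounds 1 and 2, 2 between rounds 1 and 3, and 299 between rounds 2 and 3, a threshold of 25 flags the triple (1, 2, 3).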

Day of Birth Variables

Data indicating the day of birth of the respondent, his or her parents, children, and other household members are included on the geocode CD. Month and year of birth variables appear in the main public use data set. The reference numbers and question names for the day of birth variables correspond to those used in the main data set for month and year of birth. For example, variables KEY!BDATE_M and KEY!BDATE_Y (R05364.01 and R05364.02) in the main data set contain information about the respondent’s birth month and year. The corresponding variable in the geocode data is KEY!BDATE_D (R05364.00) and provides information about the respondent’s day of birth.

Important Information

Due to technical considerations, the day of birth variables in the geocode data do not have the word “day” in the title, nor do they show information about day of birth in the codebook. Instead, these variables have “month/year” in the title and list month and year information on the codebook page. However, the actual data do reflect day of birth information. Researchers can verify that they are using day of birth data by examining the question name, which contains a “D” for day variables.
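
Because the titles cannot be relied on, a quick programmatic check of the question name may help. The sketch below classifies a birth-date question name by its trailing component letter, following the KEY!BDATE_M, KEY!BDATE_Y, and KEY!BDATE_D examples above. The handling of a trailing ".NN" index is an assumption about how names for other household members might look, not something stated in this documentation.

```python
from typing import Optional

_COMPONENTS = {"M": "month", "Y": "year", "D": "day"}


def birth_component(question_name: str) -> Optional[str]:
    """Return "month", "year", or "day" based on the name's final letter code."""
    base = question_name.split(".", 1)[0]   # drop a trailing ".NN" index, if any (assumed)
    if "_" not in base:
        return None
    return _COMPONENTS.get(base.rsplit("_", 1)[1])
```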

Missing Data

Missing values for geocode variables are assigned using the same coding system as the main NLSY97 data file; see the NLSY97 User's Guide for more information. In general, the following codes have been used:

-1  Refusal. Applies mainly to respondent-reported data rather than created variables, but may also be used in a created variable if a respondent answered "refuse" to the relevant input variables.
-2  Don't know. Applies mainly to respondent-reported data rather than created variables, but may also be used in a created variable if a respondent answered "don't know" to the relevant input variables.
-3  Invalid skip. Address data cannot be geocoded. In these cases, the respondent lives in the United States but has provided incomplete or conflicting address information. The data file contains as much information as possible for these respondents; for example, if survey staff are confident that the state is correct but cannot identify the county, the state variable will have a valid code and the county variable will have a value of -3.
-4  Valid skip. The respondent has no information for this variable because it does not apply. For example, respondents living outside the United States have a -4 for all residence variables; respondents not attending college have a -4 for the college variables.
-5  Noninterview. In round 2 and beyond, respondents who did not participate in the survey are assigned a value of -5 for all data in that round.
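
When analyzing the data, these codes typically need to be separated from valid values before computing any statistics. The sketch below shows one way to do that. The values -3, -4, and -5 are stated above; -1 for refusal and -2 for don't know follow the main-file NLSY97 convention. The function names are hypothetical, and the sketch assumes all valid values are non-negative.

```python
from typing import Iterable, List

MISSING_CODES = {
    -1: "Refusal",
    -2: "Don't know",
    -3: "Invalid skip",
    -4: "Valid skip",
    -5: "Noninterview",
}


def describe(value: int) -> str:
    """Label a raw value as one of the missing-data codes or as valid."""
    if value in MISSING_CODES:
        return MISSING_CODES[value]
    if value < 0:
        return "Unrecognized negative code"
    return "Valid"


def valid_values(values: Iterable[int]) -> List[int]:
    """Keep only non-negative values, dropping every missing-data code."""
    return [v for v in values if v >= 0]
```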

Use of the File

Suggestions concerning the use of these NLSY97 geographic data files:

  1. The data file and the accompanying documentation should be used in conjunction with the printed version of the 1994 County and City Data Book and the IPEDS codes so that researchers have complete information regarding variable descriptions and coding idiosyncrasies.
  2. The data should not be used in any fashion that would endanger the confidentiality of any sample member. To use these data, users must sign a written licensing agreement consenting to protect respondent confidentiality and to other conditions; agree not to make, or allow to be made, unauthorized copies of the geocode file; and further agree to indemnify the Center for Human Resource Research for all claims arising from misuse of the file.