Appendix 10: Geocode Documentation

Documentation for the NLSY79 Geocode Data Files

Geocode data format

This document provides a discussion on the creation of the variables available on the NLSY79 geocode data files. The geocode data are now being released as comma-delimited ASCII data with support files. Because of the volume of variables, the data have been split into five sets of files based on content:

  • Location data for respondent
  • Data from survey responses and created variables
  • Computed distance measures between addresses at each survey point for each respondent
  • County and SMSA level data for 1979-1992 (County and City Data Book variables)
  • County and SMSA level data for 1993-2002 (County and City Data Book variables)

The geocode CD also includes text/ASCII data files as well as SPSS, SAS and STATA programs to read them. In addition, the geocode CD contains documentation files (user's guide and codebook supplement) for the NLSY79 in HTML format. The data files contain variables that are only available on the restricted geocode release, plus the identification number to allow merging with the main 1979-2020 public data.
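As a minimal sketch of the merge described above, the following Python reads comma-delimited text and joins geocode records to main-file records by respondent identification number. The column names (CASEID, STATE_1979, HGC_1979) and the sample values are hypothetical placeholders, not actual NLSY79 variable names; the real names are given in the support files.

```python
import csv
import io

def read_by_id(text, id_field):
    """Read comma-delimited text into a dict keyed by the ID column."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[id_field]: row for row in reader}

def merge_on_id(geocode_rows, main_rows):
    """Attach geocode columns to each main-file record with the same ID."""
    merged = []
    for case_id, main_row in main_rows.items():
        combined = dict(main_row)
        # Geocode columns are added only when a geocode record exists.
        combined.update(geocode_rows.get(case_id, {}))
        merged.append(combined)
    return merged

# Hypothetical sample data standing in for the released files.
GEOCODE_TEXT = "CASEID,STATE_1979\n1,39\n2,17\n"
MAIN_TEXT = "CASEID,HGC_1979\n1,12\n2,10\n3,11\n"

geocode = read_by_id(GEOCODE_TEXT, "CASEID")
main = read_by_id(MAIN_TEXT, "CASEID")
merged = merge_on_id(geocode, main)
```

Respondents without a geocode record keep only their main-file columns in this sketch; whether to keep or drop such cases is an analysis decision.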

Geocode data source files and content

County and City Data Books
For survey years 1979-2002, selected variables from the County and City Data Books from various years are provided along with geographic variables from the NLSY79 main data file. No variables from the County and City Data Books are included for survey years 2004-2020.

The county and state of residence for each NLSY79 respondent for each survey year between 1979 and 2002 were matched with the county and state variables in the specific County and City Data Book data files used for each year. Selected county-level or SMSA-level environmental variables were extracted from those files and included in the geocode data. The County and City Data Book data files were prepared by the U.S. Census Bureau. Related printed matter for each of these data files can be found in the County and City Data Book for the specified year. These books are also published by the Census Bureau.

The following is a brief description of the various NLSY79 geocode data for specific survey years and the County and City Data Book data files that were merged with the different years of NLSY79 data:

  1. The 1979-82 geocode data include county-level and SMSA-level variables from the County and City Data Book, 1972 data file, which provides data from the 1970 Census of the Population and Housing, the 1972 Economic Census, and the 1969 Census of Agriculture, and other data derived from a variety of federal government and private agencies.
  2. The 1979-82 geocode data include county-level and SMSA-level variables from the County and City Data Book, 1977 data file, which provides data from the 1970 Census of the Population and Housing, the 1972 Economic Census, and the 1974 Census of Agriculture, and other data derived from a variety of federal government and private agencies.
  3. The 1983-87 geocode data include county-level variables from the County and City Data Book, 1983 data file, which provides data from the 1980 Census of the Population and Housing, the 1977 Economic Census, the 1978 Census of Agriculture, and other data derived from a variety of federal government and private agencies.
  4. The 1988-96 geocode data include county-level variables from the County and City Data Book, 1988 data file, which provides data from the 1980 Census of the Population and Housing, the Current Population Surveys, and other data derived from a variety of federal government and private agencies.
  5. The 1998-2002 geocode data include county-level variables from the County and City Data Book, 1994 data file, which provides data from the 1990 Census of the Population and Housing, the Current Population Surveys, and other data derived from a variety of federal government and private agencies.

Variables from the 1988 County and City Data Book data file were selected with an eye toward comparability with the 1983 County and City Data Book variables. Similar considerations were made between the 1994 and 1988 variables. In the absence of updated information from the 1988 County and City Data Book data file, the 1983 County and City Data Book variables were retained. However, some differences do exist between similar variables selected from the various County and City Data Book data files.

The 1983 County and City Data Book data file variables for MSA/NECMA and CMSA have been combined into one 4-digit variable in the 1988 County and City Data Book data file. Therefore, the 1988 County and City Data Book geographic variables correspond to the 1983 County and City Data Book geographic variables in the following manner:

  1. The MSA/NECMA codes that existed in the 1983 data are identical in the 1988 data.
  2. Six MSAs were added, one MSA was expanded, and one CMSA was expanded in the 1988 data. The MSAs that were added have their own unique 4-digit code.
  3. The 1983 CMSA variable was recorded with a new unique 4-digit code for each CMSA in the 1988 combined variable. The 1983 PMSA variable was retained and is identical to the 1988 PMSA variable. Therefore, each CMSA and PMSA is still identifiable in the same manner they were with the separate 1983 CMSA variable.
  4. Two 1983 CMSAs were redefined as MSAs in the 1988 data. These are Kansas City and St. Louis. They have been recorded with their own unique 4-digit code.
  5. From 1979-1987, respondents living in New England were excluded from merging with the County and City Data Book, since the SMSA/MSA variable on the County and City Data Book, 1983 data file is the New England County Metropolitan Areas (NECMA) code. NECMA residents were not excluded when merging with the 1988 County and City Data Book data file. In the 1988 County and City Data Book data file, the MSA/NECMA and the CMSA variables found in the 1983 County and City Data Book data file were combined into one 4-digit variable. The addition of a "Record Type" variable in the County and City Data Book 1988 data makes it possible to distinguish separately between MSAs, NECMAs, and CMSAs. This "Record Type" variable classifies cases in the 1988 data combined MSA/NECMA/CMSA variable according to whether they provide information for the U.S., States, MSAs, NECMAs, CMSAs, or a Nonmetropolitan County. The use of this variable allows the user to exclude any of these groups from the analysis without having to conduct a county-by-county or state-by-state determination of NECMA/non-NECMA status. Respondents residing in NECMAs are found in the New England states of Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island and Vermont.
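The use of the "Record Type" variable to exclude a group can be sketched as follows. This is an illustration only: the field names and the text labels used for record types are hypothetical stand-ins for the coded values documented in the codebook, and the area codes are made up.

```python
# Hypothetical text labels standing in for the coded "Record Type" values.
NECMA_ONLY = frozenset({"NECMA"})

def exclude_record_types(records, excluded=NECMA_ONLY):
    """Keep only records whose record type is not in the excluded set."""
    return [rec for rec in records if rec["record_type"] not in excluded]

# Made-up sample rows; "area_code" stands for the combined 4-digit variable.
sample = [
    {"area_code": "0001", "record_type": "MSA"},
    {"area_code": "0002", "record_type": "NECMA"},
    {"area_code": "0003", "record_type": "CMSA"},
]

without_necma = exclude_record_types(sample)
```

Passing a different set, for example one containing both the NECMA and CMSA labels, excludes several groups in one pass.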

The population by age variables from the 1988 County and City Data Book data file are estimates made for the National Cancer Institute by the Census Bureau. These figures suppress data for counties in which the population is under 20,000. Users should keep this in mind during analysis.

City Reference Files
Another type of data file, the City Reference File (CRF) for various years, was also merged with the NLSY79 data in order to identify the SMSA/MSA for each respondent according to zip code. The City Reference File data files, prepared by the U.S. Census Bureau, contain the Federal Information Processing Standards (FIPS) county and state codes, zip codes, and SMSA/MSA codes.

The following is a list of the various City Reference Files that were merged with the different years of NLSY79 data to identify the SMSA/MSA for each respondent:

  1. The 1979-82 NLSY79 data was merged with the City Reference File, 1973, which contains the SMSA codes as defined by the Office of Management and Budget (OMB) as of August 15, 1973.
  2. The 1983 NLSY79 data was merged with the City Reference File, 1982, which contains the SMSA codes defined by OMB prior to June 30, 1983.
  3. The 1984-87 NLSY79 data was merged with the City Reference File, 1983, which contains the MSA codes as defined by OMB as of June 30, 1983.
  4. The 1988-92 NLSY79 data was merged with the City Reference File, 1987, which contains the MSA codes as defined by OMB as of June 30, 1987.
  5. The 1993-1998 NLSY79 data was merged with the City Reference File, 1993, which contains the MSA codes as defined by OMB as of July 31, 1993.
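For users who need to know which City Reference File vintage applies to a given survey year, the list above can be expressed as a small lookup. The function name and table layout are illustrative; the year ranges are taken directly from the list.

```python
# (first survey year, last survey year, City Reference File vintage)
CRF_BY_SURVEY_YEARS = [
    (1979, 1982, 1973),
    (1983, 1983, 1982),
    (1984, 1987, 1983),
    (1988, 1992, 1987),
    (1993, 1998, 1993),
]

def crf_vintage(survey_year):
    """Return the CRF vintage for a survey year, or None if none is listed."""
    for first, last, vintage in CRF_BY_SURVEY_YEARS:
        if first <= survey_year <= last:
            return vintage
    return None
```

Years outside the listed ranges return None, reflecting that no City Reference File is listed for them here.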

Local Exchange Routing Guides
Between 1989 and 1994, a third type of data file was used to verify the geocode information provided by NORC for each respondent. The Local Exchange Routing Guide (LERG) data file is constructed by Bell Communications Research (BELLCORE) and contains address information for the "switches" which regulate each telephone area code and exchange.

  1. The LERG data file used for the 1989 NLSY79 geocodes was updated through October 1989.
  2. The LERG file used for the 1990 and 1991 NLSY79 geocodes was updated through January 20, 1992.
  3. The LERG file used for the 1992 NLSY79 geocodes was updated through March 1, 1993.
  4. The LERG file used for the 1993-94 NLSY79 geocodes was updated through August 1, 1994.

Geostatistical Mapping Software
Beginning in 1996, geostatistical mapping software was employed in the geocoding process to assign latitude and longitude coordinates and other geographical codes. Basic standard geographic information such as latitude and longitude was linked to each respondent's address. This was accomplished by matching address data to information in the software database. Matching records were appended with the matching address, coordinates, Census information, and FIPS (Federal Information Processing Standards) codes for state, county, MCD (Minor Civil Division), and MSA (Metropolitan Statistical Area). The software packages used in specific survey years are listed below.

  1. Matchmaker for Windows, V2.5 was used in survey years 1996 and 1998.
  2. Maptitude V4.2 was used in survey year 2000.
  3. Maptitude V4.6 was used in survey years 2002 and 2004.
  4. ArcGIS (ArcMap) V9.2 was used in survey years 2006-2012.
  5. Maptitude 2020 was used in survey year 2020.

1979-88 Geocode Data Creation Procedures

The following briefly outlines the procedures used to create the 1979-88 NLSY79 geocode data.

  1. State, county, and zip codes are reported by each NLSY79 respondent. Missing information was hand-edited whenever possible. A discussion of the hand-editing process follows. The variable GEO10 was created to indicate the type of hand-editing that was done on each case. The creation of GEO10 is described in this document.
  2. The state, county, and zip codes were then matched with the CRF. For those cases where the NLSY79 state, county, and zip codes matched with a state, county, and zip code from the CRF, the SMSA/MSA from the CRF was added to each respondent's record.
  3. Between 1983 and 1987, for cases missing an SMSA/MSA because there was no match with the CRF, a match was made based on the NLSY79 county and state and the CRF county and state so that an SMSA/MSA could be provided.
  4. The NLSY79 data, with SMSA/MSA added when there was a match on all three residence variables, was then merged with the County and City Data Book data files. In the 1987 and 1988 NLSY79 data, if there was not an exact match on the state, county, and zip codes, two additional steps were taken. First, a match was attempted on the state and zip codes. If the state and zip codes matched, then the county and SMSA/MSA codes from the CRF were added to the respondent's record (assigned an edit code of 10 in GEO10). Second, if there was no match on state and zip codes, but there was a match on the zip code only, then the state, county, and SMSA/MSA codes from the CRF were added to the respondent's record (assigned an edit code of 11 in GEO10).
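The graduated matching in step 4 can be sketched as a cascade of lookups against the CRF. This is an illustrative reconstruction, not the original program: the record layout, field names, and sample codes are assumptions. The edit codes 10 and 11 are the GEO10 values named above. The zip-only fallback is made optional because later discussion in this document indicates that the 1988 procedure required both zip code and state to match.

```python
def match_to_crf(state, county, zip_code, crf_rows, allow_zip_only=True):
    """Return the matched geography and GEO10 edit code, or None if unmatched."""
    # Exact match on state, county, and zip code: only the SMSA/MSA is added.
    for row in crf_rows:
        if (row["state"], row["county"], row["zip"]) == (state, county, zip_code):
            return {"state": state, "county": county,
                    "smsa": row["smsa"], "edit_code": None}
    # State and zip code match: county and SMSA/MSA come from the CRF.
    for row in crf_rows:
        if (row["state"], row["zip"]) == (state, zip_code):
            return {"state": state, "county": row["county"],
                    "smsa": row["smsa"], "edit_code": 10}
    # Zip code alone matches: state, county, and SMSA/MSA come from the CRF.
    if allow_zip_only:
        for row in crf_rows:
            if row["zip"] == zip_code:
                return {"state": row["state"], "county": row["county"],
                        "smsa": row["smsa"], "edit_code": 11}
    return None

# Hypothetical CRF row used for illustration only.
CRF = [{"state": "39", "county": "049", "zip": "43210", "smsa": "1840"}]

exact = match_to_crf("39", "049", "43210", CRF)
state_zip = match_to_crf("39", "041", "43210", CRF)
zip_only = match_to_crf("18", "041", "43210", CRF)
```

A None result corresponds to a case that would have been left for hand-editing or left without an SMSA/MSA.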

Hand-Edits and Changes in Matching Procedures

1979-1982
More than 1,000 hand-edits for each survey year from 1979 through 1982 were performed to constrain respondents' reported state, county, and zip codes so that they conformed to legitimate state-county-zip combinations. In some cases, this involved making estimates for one or more of the above items. The state and county codes from the main NLSY79 data for each year that are included in the NLSY79 geocode data are the original, unedited values for the respondents. The hand-edited versions of the state and county codes were used to match with the CRF and with the County and City Data Book data files for these years.

In compiling the 1981 geocode information, a systematic review of hand-edited state, county, and zip codes was also undertaken. All cases that required a hand-edit in any of the three survey years were included in this inspection. The point of this review was: (1) to check for consistency in hand-editing decision rules over the three years, and (2) where possible, to use the respondent's reported geocodes in subsequent years to check on the accuracy of hand-edits performed in preceding years (this was possible for those cases that required hand-edits in early years and which showed no change of residence over the period).

The results of these consistency checks were very encouraging. Only 13 cases turned up that seemed to be in error. These cases had their geocodes revised accordingly. While doing this review, several dozen other cases with keypunch or coding errors in the hand-edit code variables were also uncovered. These errors were also corrected. In any case, this procedure provides substantial validation to the overall hand-editing process.

1983-1988
In 1983-1987, the majority of hand-edits involved the derivation and addition of a zip code. Changes in the matching strategy were made because the zip code was more accurate than either the county or state geocodes taken individually. Some mismatching did occur because the zip code, rather than the county or state code, was in error, but this error rate was smaller than that of another matching algorithm not requiring case-by-case hand-edits. It is probable that some mismatching also occurred because the county itself was in error. Nevertheless, we are confident that matching by zip code improved the quality of the match.

Users are cautioned that matching by state and zip code or by zip code only may result in a higher moving rate between 1987 and the previous interview year than might actually have occurred. We suspect that some NLSY79 county and state geocodes were not updated if the respondent reported an address change prior to the 1987 interview or the previous interview. If the geocodes were not updated in a previous interview, then there would have been an under-reporting of moving to a new county and/or state in that interview year that would now show up with the 1987 NLSY79 data because of the improved matching algorithm. In the 1987 NLSY79 geocode data, if the zip code and state did not match but the zip code alone matched, the state and county were added to the record.

Because the 1988 procedure required both the zip code and state to match, some cases in which the zip code alone matched, and which were possibly in error in 1987, may have been hand-edited in 1988. This may affect mobility rates between 1987 and 1988 to the extent that those inaccurate zip codes in 1987 may have been corrected in the 1988 NLSY79 data.

In 1988, more than 1,000 hand-edits were performed. Approximately 56.6% of these involved the derivation and addition of a zip code, while approximately 48.4% involved correction of the state of residence.

We believe that requiring a zip code and state correspondence further improved the accuracy of the resulting matches. In support of this assumption, only approximately 6% of the cases that were actually hand-edited had an invalid county. Because zip codes can extend across adjacent counties, this may even be an overestimate of the actual error rate.

1981 Changes in SMSA Designations

For those using these data to track the mobility of respondents over the 1979-81 survey years an additional caution applies. In June of 1981, the OMB announced the designation of 36 new SMSAs, the disqualification of one pre-existing SMSA, and the merger of two pre-existing SMSAs into one new area. The 1973 CRF file was updated by CHRR to reflect these changes, and the updates were applied beginning with the 1980 interview place of residence in the 1980 geocode data file. One consequence of these changes is that when attempting to match places of residence for respondents using data from the 1979 geocode data and separate 1980 updated geocode data, some respondents give the appearance of moving into (or out of) an SMSA between 1979 and 1980 when in fact they may not have moved at all. This faulty inference of mobility would be reached if one compared changes in SMSA designation between the separate 1979 and 1980 updated geocode data.
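One way to guard against this faulty inference is to flag cases whose SMSA code changed while the state and county of residence did not. The sketch below assumes hypothetical field names and made-up codes; it does not reproduce any procedure used in creating the geocode data.

```python
def smsa_change_without_move(rec_1979, rec_1980):
    """True when the SMSA code differs but state and county are unchanged."""
    same_place = (rec_1979["state"], rec_1979["county"]) == (
        rec_1980["state"], rec_1980["county"])
    smsa_changed = rec_1979["smsa"] != rec_1980["smsa"]
    return same_place and smsa_changed

# Made-up records: same county in both years, SMSA assigned only in 1980.
before = {"state": "39", "county": "049", "smsa": None}
after = {"state": "39", "county": "049", "smsa": "1234"}

flagged = smsa_change_without_move(before, after)
```

Flagged cases are candidates for a redesignation effect rather than a move. A county-level check cannot rule out a move within the same county.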

Users ordering a full complement of geocode data at any given point should not find this discrepancy in mobility. It applies only to those who ordered the 1979 geocode data and then updated those data with single-year files in subsequent years. As single-year files were no longer available after the 1979-89 release, recent purchasers of the geocode data would have received all available years of the geocode data, and should not detect the discrepancy resulting from the 1981 SMSA changes between the 1979 and 1980 separate data files. A variable representing the 1981 SMSA designation (if applicable) of place of residence at interview is currently present in the geocode data for all survey years, including 1979.

It is possible, however, that the created variable based upon SMSA of residence and found in the main NLSY79 data file named KEYVARS ("Is R's Current Residence in SMSA?"), would give a false impression of mobility in and out of an SMSA for respondents living in the same location for which the SMSA designation was changed between survey years. (See Appendix 6: Urban-Rural and SMSA-Central City Variables in the public-use file Codebook Supplement for further details on the creation of this variable.)

Note that all other SMSA environmental variables for those living in these new areas remain NA, since the County and City Data Book, 1972 and 1977 data files did not contain information for these SMSAs. 

Rewrite of 1979-82 Geocode Data

In 1989, work was undertaken to reduce the number of variables provided in the 1979-82 NLSY79 geocode data so that the number and type of variables included in these data more closely resembled the geographic data available for the 1983 and subsequent survey years. The previous 1979-82 NLSY79 geocode data file contained 2,245 variables. This number was reduced to 545 variables with county-level and SMSA-level data retained. In addition, four new variables were included in the 1979-82 NLSY79 geocode data. These variables provide data on the "Continuous Unemployment Rate for the Labor Market of Current Residence" for each survey year. This reduction in the number of variables made it possible to better document the geocode variables and to produce codebooks like the ones produced for the main NLSY79 data.

1989-1994 Geocode Data Creation Procedure

A new procedure was implemented in 1989 as an initial step in verifying the county and state of residence by using address information from the "switch" associated with each area code and exchange. In the hand-editing process for the 1988 geocode data, reported telephone information was found to be very accurate, even in cases for which some or all of the address information was in error. Thus the telephone information presented itself as a reliable, independent source of verification for the address information. The state and county generated from the phone number are compared to the state and county in the NORC address file for each respondent. Cases in which the telephone information would indicate a different state and/or county from that in the address file are identified through this process. This procedure helped identify respondents with incorrect or inconsistent records. Cases that produced such a non-match were checked for accuracy and hand-edited if necessary.

The following briefly outlines the procedures used to create the 1989-1994 geocode data.

  1. For 1989-1991, an initial data set was constructed containing state, county, and zip code information for the "switch" which regulates each area code and exchange (the "PHONE" data set). The procedures for the creation of the 1992-94 geocode data were streamlined, particularly in terms of the hand-editing required on individual cases. A new locator database was created containing the most recent information on each respondent's residence at the time of the survey. Wherever possible, a code was then assigned for each state, county, and country (if applicable). The information in the new locator database was then compared to locator information from the previous interview year. This included an electronic comparison of the character strings entered for street addresses of respondents. If the state, county, and zip code information matched that from the previous year, and the address strings matched in whole or in significant part (indicating probable typos or keypunch errors), the same state and county geocodes were assigned to a case as were assigned in the previous year. Cases for which a partial or full mismatch occurred, or for which information was missing from any field, were identified during this process and were hand-edited wherever necessary. These cases were then assigned a state and county code based upon the hand-edited data. The procedure of electronic matching of address strings has considerably reduced the number of cases requiring individual hand-editing. Data from each respondent's locator record was then matched by zip code to the state and county from the CRF data file (the "ZIP CODE" data) and by area code and extension to the state and zip from the Local Exchange Routing Guide (LERG) data file (the "PHONE" data).
  2. A second data set was constructed containing the state, county, zip code, and telephone information reported by the respondent ("ADDRESS" data) and the state and county information from the CRF for the respondent's reported zip code ("ZIP CODE" data).
  3. The state variables from each of these sources were then compared and a "quality of match" variable (GEO10 - see discussion in this document) was computed based upon the extent to which the "PHONE" state, the "ZIP CODE" state and the "ADDRESS" state match. The highest quality match exists if the "PHONE" state, the "ZIP CODE" state and the "ADDRESS" state all match. For 1989-1991, if a non-match occurred between these state variables, then the geocode information was represented by data matching the "PHONE" information. For 1992-1994, cases in which erroneous or missing zip code and/or phone information could not be assigned and, in turn, which prevented assignment of state and county geocodes from either the "ZIP CODE" or the "PHONE" data files, were hand-edited as necessary. The matching procedure was then repeated.
  4. In 1989, the state and county established through this matching and verification procedure were then compared to the state and county reported by NORC for each respondent. In 1990 and 1991, the comparison was made to the state and county reported in the previous survey year. Cases for which a non-match occurred between states and/or counties were examined individually. These cases were hand-edited if possible.
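The three-way state comparison in step 3 can be sketched as below for the 1989-1991 rule. The function name and the returned structure are illustrative assumptions; the actual GEO10 codes are not reproduced here. The sketch assumes all three state values are present, since cases with missing information were handled by hand-editing.

```python
def compare_states_1989_1991(phone_state, zip_state, address_state):
    """Summarize agreement among the PHONE, ZIP CODE, and ADDRESS states."""
    # Highest-quality match: all three sources give the same state.
    all_match = phone_state == zip_state == address_state
    # Sources other than PHONE that agree with the PHONE state.
    agrees_with_phone = [
        name
        for name, value in (("ZIP CODE", zip_state), ("ADDRESS", address_state))
        if value == phone_state
    ]
    # For 1989-1991, a non-match was resolved in favor of the PHONE data.
    return {
        "all_match": all_match,
        "agrees_with_phone": agrees_with_phone,
        "state_used": phone_state,
    }

full = compare_states_1989_1991("39", "39", "39")
partial = compare_states_1989_1991("39", "39", "18")
```

The list of agreeing sources gives a rough sense of match quality without assuming any particular GEO10 coding.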

From this point, the procedures closely follow those applied in constructing the geocode data files in prior survey years, with minor modifications. The CRF matching was based upon state and county only for the purposes of the final matching of information from the County and City Data Book data files. As metropolitan statistical area information is based upon county delineations (except in New England), matching on cleaned state and county data should not affect the assignment of respondent MSAs.

  1. The state and county were then matched with the CRF. For those cases where the NLSY79 state and county (and zip code in 1990/1991) matched with a state and county (and zip code in 1990/1991) from the CRF, the SMSA/MSA from the CRF was added to each respondent's record.
  2. The NLSY79 data, with SMSA/MSA added when there was a match on county and state of residence, was then merged with the County and City Data Book data files.

Hand-Edits and Changes in Matching Procedures

In creating the 1989-94 geocode data, the same logical procedures were applied in identifying cases requiring individual examination. However, the automation of the decision rules and procedures to check for and identify such cases resulted in a substantial reduction in the number of cases requiring hand-editing.

1989-1994
The effect of the 1989 phone verification procedures on the ability to detect errors in the NORC geocode data may also affect mobility rates between 1988 and 1989. Due to time and personnel constraints, it was not possible to examine every case that did not initially match on the state, county, and zip codes.

In the 1989 procedure the geocodes established by the phone number were compared to the geocodes received directly from NORC. By using the 1989 CHRR-edited versions of the geocodes for comparison, updates and corrections that were made to the geocodes during the 1989 hand-editing processes were incorporated. This reduced the number of mismatches between the geocode information based upon the current phone number and the respondent-reported geocode information and increased the amount of consistency observed between survey years. The number of cases requiring individual examination was thereby reduced.

From this point, the procedures closely follow those applied in constructing the 1988 geocode data, with minor modifications. For 1989, CRF matching was based upon state and county only for the purposes of the final matching of information from the County and City Data Book data. A match on state, county, and zip was also required to construct a variable reflecting a respondent's SMSA/non-SMSA residence status for inclusion in the NLSY79 main data file. This match, which was included in the geocode procedures prior to 1989, was done separately for the 1989 release when the new set of initial procedures was instituted. To streamline programming tasks, however, the zip information was reinserted in the CRF matching program for 1990. Therefore, the CRF matching for the 1990 geocode data was again based upon state, county, and zip code, as it had been prior to 1989.

In earlier survey years, residence information was usually collected by NORC interviewers only when there was a change in that information from the previous interview. In 1990, however, an effort was made to get current information for all respondents. Many of the cases in this current update information also included counties that have been inconclusive (even in case-by-case hand-editing) in previous years. These are generally cases in which a zip code spans more than one county, and for which valid county data is missing from the respondent's reported residence information. For such cases, the possibility existed in the 1989 (and prior) data that counties assigned based upon such multiple-county zip codes might be in error in a small number of cases. This would result in the assignment of a county adjacent to the county in which the respondent actually lived. To the extent that current update information for the county of residence in 1990 showed the assigned county in 1989 to be in error, mobility determinations may have been affected. In contrast, using the 1989 CHRR-edited versions of the geocodes for comparison with the current geocode information should have improved the accuracy of mobility ratings. This is a more dependable confirmation of past geocode information, eliminating the need to make individual determinations in many cases with multiple-county zip codes as discussed above.

1996-2002 Geocode Data Creation Procedure

The procedures for the creation of the 1996 and subsequent geocode data changed from those used in previous years. Software packages were used to create the data for 1996-2008. The following briefly outlines the process used in these survey years.

Although different software packages were used, the procedures for data creation were essentially the same across these years. Three graduated matching methods were applied, depending on the quality of the address data available.

  1. An automated match was done between the respondent's address data and the software database. Address records with matching street segments were appended with the matching address, coordinates (latitude and longitude values for a specific location), Census information (county, tract, and block group codes for an address), FIPS codes for state, county, MCD, and MSA. In some cases, addresses had to be cleaned before software matching could be done. Cleaning involves steps such as standardizing the address format, correcting obvious misspellings, identifying apartment numbers and locating them in the correct field, etc. It does not include any changes that might result in a change in the actual address location.
  2. For some addresses the procedure outlined in step #1 failed to produce a match between the respondent's address data and the software database. For these cases in 1996-1998, individual respondent addresses were temporarily corrected in order to match them to the software database. By correcting obvious errors and referring to lists of valid address components, a map display, and commercial maps, a temporary working address was constructed; this was used to assign geocodes to these cases. These temporary address corrections were made in a working file to test the improvements in matching to the software database. The original address data remained unchanged. Successful address corrections were matched by this method and geocodes assigned accordingly. In 2000-2008, for cases failing Step #1, geocode staff used software to locate the correct street. If the street number could be located along this street, the latitude and longitude were assigned. However, some streets in software databases do not include information about street numbers. If this is the case, the address is manually located in the center of the street. The street is then classified as either a short street or a long street. Long streets cross Census tract or block group boundaries while short streets do not. As a result, the level of certainty about geographical information is much higher for short streets than for long streets.
  3. Addresses unmatched by either of the first two procedures were assigned latitude and longitude coordinates and related Census data according to a 5-digit ZIP centroid. A centroid is essentially the mid-point of a ZIP Code area. Centroid matches were made only for addresses that could not be matched by any other means. Addresses with ZIP codes that were no longer current were appended with latitude and longitude coordinates only.

The procedures outlined in steps #2 and #3 approximate the hand-editing process described in previous survey years for records with different degrees of matched address data.
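The three graduated methods can be sketched as a fallback cascade. This simplified illustration follows the 2000-2008 variant, in which an address that fails the automated match is placed at the center of its street. The lookup tables, field names, and coordinates are hypothetical, and the manual steps performed by geocode staff are reduced to dictionary lookups.

```python
def locate(address, address_index, street_centers, zip_centroids):
    """Return (coordinates, method) using the first method that succeeds."""
    # Method 1: automated match on the full address.
    full_key = (address["number"], address["street"], address["zip"])
    if full_key in address_index:
        return address_index[full_key], "address"
    # Method 2 (simplified): place the address at the center of its street.
    street_key = (address["street"], address["zip"])
    if street_key in street_centers:
        return street_centers[street_key], "street center"
    # Method 3: fall back to the centroid of the 5-digit zip code.
    if address["zip"] in zip_centroids:
        return zip_centroids[address["zip"]], "zip centroid"
    return None, "unmatched"

# Hypothetical lookup tables with made-up coordinates.
ADDRESSES = {("12", "MAIN ST", "43210"): (40.001, -83.010)}
STREETS = {("MAIN ST", "43210"): (40.002, -83.012)}
ZIPS = {"43210": (40.005, -83.020)}

hit = locate({"number": "12", "street": "MAIN ST", "zip": "43210"},
             ADDRESSES, STREETS, ZIPS)
street = locate({"number": "99", "street": "MAIN ST", "zip": "43210"},
                ADDRESSES, STREETS, ZIPS)
centroid = locate({"number": "5", "street": "OAK AVE", "zip": "43210"},
                  ADDRESSES, STREETS, ZIPS)
```

Returning the method alongside the coordinates mirrors the idea that the quality of each match is recorded with the result.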

2004-2020 Geocode Data Creation Procedure

For survey years 2004-2012, ArcGIS software (ArcMap 9.2 in 2006-2012) was used in the creation of the geocode data, while Maptitude 2014 was used for survey year 2014. Maptitude software, updated annually, was used to create the geocode data for each survey year through the current round. The procedures for data creation were essentially the same as those used in 1996-2002.

Three graduated matching methods were applied, depending on the quality of the address data available.

  1. An automated match was done between the respondent's locating address data and the ArcGIS database. In some cases, addresses had to be cleaned before they could be matched by ArcGIS. Cleaning involves steps such as standardizing the address format, correcting obvious misspellings, identifying apartment numbers and locating them in the correct field, etc. It does not include any changes that might result in a change in the actual address location.
  2. For some addresses, the procedure outlined in Step #1 failed to produce a match between the respondent's address data and the ArcGIS database. In these cases, geocode staff used ArcGIS to locate the correct street. If the street number could be located along this street, the latitude and longitude were assigned. However, some streets in the ArcGIS database do not include information about street numbers. If this is the case, the address is manually located in the center of the street. The street is then classified as either a short street or a long street. Long streets cross Census tract or block group boundaries while short streets do not. As a result, the level of certainty about geographical information is much higher for short streets than for long streets.
  3. Addresses unmatched by either of the first two procedures were assigned latitude and longitude coordinates according to a 5-digit zip centroid. A centroid is essentially the midpoint of a zip code area. The geographic information is less certain for respondents located using the zip centroid method.
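The three graduated methods above amount to a fallback cascade: each address is passed to the most precise method first and drops to the next only when no match is found. The following is a minimal Python sketch of that control flow only. The matcher functions, their names, the method labels, and the sample coordinates are all hypothetical stand-ins for the GIS software steps, not the actual tooling or data.

```python
from typing import Callable, List, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)
Matcher = Callable[[str], Optional[Coordinates]]


def geocode_with_fallback(address: str, matchers: List[Tuple[str, Matcher]]):
    """Try each (label, matcher) pair in order and return the first hit.

    Returns (label, coordinates), or (None, None) if every method fails.
    """
    for label, matcher in matchers:
        coords = matcher(address)
        if coords is not None:
            return label, coords
    return None, None


# Hypothetical stand-ins for the three methods; all values are illustrative.
def exact_match(address: str) -> Optional[Coordinates]:
    known = {"100 MAIN ST, COLUMBUS OH 43215": (39.96, -83.00)}
    return known.get(address)


def street_match(address: str) -> Optional[Coordinates]:
    return None  # pretend the street could not be located


def zip_centroid(address: str) -> Optional[Coordinates]:
    centroids = {"43215": (39.97, -83.01)}
    return centroids.get(address[-5:])  # assumes the ZIP is the last 5 characters


METHODS = [
    ("address", exact_match),
    ("street", street_match),
    ("zip_centroid", zip_centroid),
]
```

Recording the label of the method that succeeded alongside the coordinates is what makes it possible to report match quality later.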

Researchers can identify the method used to locate the respondent's address by using the variable "GEO10" which provides information about the quality of the geographic match. This variable differentiates between addresses located based on the actual address, in the center of a short street, in the center of a long street, or using the zip centroid method. This variable can be used to determine the level of certainty for the respondent's geographic data.

Supplementary Created Variables

Urban-Rural and SMSA/CBSA-Central City residence variables
The procedures for creating the Urban-Rural and SMSA/CBSA-Central City residence variables (released in the KEYVARS area of interest) were modified for the 2000 public release. In 2000 and later survey years, these variables were created with the same software used to create the other geocode data.

For 2004-present, the Urban-Rural and CBSA-Principal City variables are further evaluated if the respondent's residence was located using a street name match (method 2 above) or a zip centroid match (method 3). For the CBSA-Principal City variable, if the street or zip code falls completely inside or outside the boundaries of the CBSA and the principal city, then the respondent is assigned to the appropriate status. If the street or zip code crosses the boundaries of the CBSA and the principal city, then the respondent is coded as CBSA, principal city status unknown. Similarly, respondents are only assigned to an urban or rural status if their entire street or zip code lies within an urban or rural area. If the street or zip code crosses an urban/rural boundary, the respondent is assigned to an unknown status.
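The boundary rule described here can be read as "assign a status only when the whole street or zip code area agrees." A minimal Python sketch of that rule follows. The function name, the status labels, and the idea of representing a street or zip code area as a list of the statuses it touches are assumptions made for illustration, not the procedure actually implemented in the GIS software.

```python
from typing import Iterable


def resolve_status(statuses_touched: Iterable[str], unknown_label: str = "unknown") -> str:
    """Return the single status shared by the whole area, else the unknown label.

    statuses_touched lists the status of every piece of the street or
    zip code area (hypothetical representation).
    """
    distinct = set(statuses_touched)
    if len(distinct) == 1:
        return next(iter(distinct))
    # The area crosses a boundary (or no information is available).
    return unknown_label


# A street lying entirely in an urban area keeps its status.
urban_case = resolve_status(["urban", "urban"])

# A zip code area crossing an urban/rural boundary becomes unknown.
mixed_case = resolve_status(["urban", "rural"])

# The same rule applied to principal city status within a CBSA.
city_case = resolve_status(
    ["principal city", "not principal city"],
    unknown_label="CBSA, principal city status unknown",
)
```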

For further discussion of these variables, see Appendix 6: Urban-Rural and SMSA/CBSA-Central City Variables in the main file NLSY79 Codebook Supplement.

Migration History variables
In NLSY79 survey years 2000-2020, respondents who had moved to a different county or state since the date of last interview were asked to report each address and the dates of each move. The FIPS codes for the state and county of each address are included in the 2020 geocode data. The address items collected are found in the geocode files titled "survey_and_created_variables" and in the questionnaire and codebook, with question names beginning with "MIGR_". Similar migration histories were collected in several early survey years of the NLSY79.

Distance Measures variables
A series of variables was added with the 2006 geocode release containing the collapsed distance between each pair of residential addresses reported by the respondent for all survey years and indicating whether the respondent changed zip code between each pair of addresses. These data have been updated for the 2020 release, with more information added for many address pairs in various survey years.
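This documentation does not state how the distances were calculated or how they were collapsed. As a rough illustration of the kind of pairwise comparison involved, the Python sketch below computes a great-circle (haversine) distance between two coordinate pairs and flags a zip code change. The choice of the haversine formula, the function names, and the dictionary keys are assumptions, not the method used to build the released variables.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def compare_addresses(first: dict, second: dict) -> dict:
    """Compare two addresses given as dicts with 'lat', 'lon', and 'zip' keys (hypothetical layout)."""
    return {
        "distance_miles": haversine_miles(first["lat"], first["lon"], second["lat"], second["lon"]),
        "zip_changed": first["zip"] != second["zip"],
    }
```

Distances derived from zip centroids inherit the uncertainty of the centroid itself, so users may want to interpret short distances between centroid-matched addresses with care.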

Editing/Quality of Match Variables
A variable named "GEO10" provides information about the quality of the respondent's address match and the method used to locate the address. In 1994 and prior years, GEO10 contains information on the degree of match between different address elements. Between 1996 and 2004, GEO10 identifies whether the county was assigned based on the respondent-provided address or the zip centroid method. In 2006-2020, this variable differentiates between addresses located based on the actual address, in the center of a short street, in the center of a long street, or using the zip centroid method. This variable can be used to determine the level of certainty for the respondent's geographic data.
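Because the meaning of GEO10 changes across survey years, analyses that pool years need to interpret it year by year. The Python sketch below encodes only the three periods described in this section. The function name and the wording of the returned labels are illustrative, and the actual code values for GEO10 should be taken from the codebook rather than from this sketch.

```python
def geo10_scheme(survey_year: int) -> str:
    """Return a description of what GEO10 measures in a given survey year."""
    if 1979 <= survey_year <= 1994:
        return "degree of match between address elements"
    if 1996 <= survey_year <= 2004:
        return "county assigned from respondent address vs. zip centroid"
    if 2006 <= survey_year <= 2020:
        return "actual address / short-street center / long-street center / zip centroid"
    # Years outside the ranges described in this section are not handled.
    raise ValueError(f"No GEO10 scheme described for survey year {survey_year}")
```

Raising an error for years outside the documented ranges, rather than guessing, keeps a pooled analysis from silently mixing incompatible definitions.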

National Death Index (NDI) Data

The round 29 (2020) data release contains information regarding cause, dates and location of death for deceased respondents for whom a matching death certificate was returned from an NDI search. Variables depicting the cause, year and region of death are restricted to the Geocode data release. Variables containing the month and state, territory or country of death are further restricted to the Zipcode data release. For further information on available variables and access requirements for restricted data, see the Health section for information on NDI-related variables. In addition, NLSY79 Attachment 8: Health Codes contains an explanation of the matching process and coding for underlying cause of death.

Missing Data

The missing data values for all items on the geocode data files are -3, -4 and -5. The -5 values indicate a noninterview for a given year. -3 codes in the data after 1996 indicate respondents whose latitude and longitude of current residence could not be determined. Respondents who have a -4 value in the data for any variables from the County and City Data Book or other residence indicators fall into the following categories:

  1. Respondents who were in the military or who had an APO address
  2. Respondents who were residing outside of the United States
  3. Respondents whose state or county codes could not be determined
  4. Respondents who reside in a county or SMSA/MSA for which the County and City Data Book is missing data for that geographic location for that specific item
  5. Respondents who do not reside in an SMSA for any survey year 1979-82 will be missing SMSA level environmental variables for that year
  6. Respondents whose state, county, and zip codes for any survey year 1979-82 do not lead to an unambiguous SMSA designation. This generally applies only to a small number of respondents living in New England.
  7. Respondents residing in the New England states of Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island and Vermont who did not match on county, state and zip code on the 1982 or 1983 CRF are coded -4 on all of the metropolitan statistical area variables with NECMA codes for any survey years 1983-1987 that they resided in those areas.
  8. In the 1988-2000 NLSY79 geocode data, respondents residing in the New England states of Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island and Vermont who did not match on county, state and zip code on the CRF are coded -4 on all of the 1983 metropolitan statistical area variables with NECMA codes for the survey years 1988-1998.
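When reading the data files, the three missing-data codes should be separated from valid values before any computation. The Python sketch below is one way a user might do this with the standard library. The function name and the short reason strings are illustrative paraphrases of the definitions above, and the sketch assumes the item being recoded cannot legitimately take the values -3, -4 or -5.

```python
MISSING_CODES = {
    -3: "latitude and longitude could not be determined",
    -4: "not available (see the numbered categories above)",
    -5: "noninterview",
}


def recode_missing(raw: str):
    """Return (value, reason); value is None when raw is a missing-data code."""
    number = float(raw)
    if number in MISSING_CODES:
        # Floats such as -5.0 match the integer keys because they compare equal.
        return None, MISSING_CODES[number]
    return number, None


# Example with illustrative raw values as they might appear in a delimited file.
raw_values = ["39", "-5", "-4", "49"]
cleaned = [recode_missing(v)[0] for v in raw_values]
```

Keeping the reason alongside the recoded value lets users distinguish noninterviews from cases that were interviewed but could not be geocoded.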

Use of the File

Finally, we have a few notes and suggestions concerning the use of these NLSY79 geographic data.

The NLSY79 geocode data should not be used in any fashion that would endanger the confidentiality of any sample member. Only those users who have signed a written licensing agreement consenting to protect respondent confidentiality and to other conditions, who agree not to make, or allow to be made, unauthorized copies of the geocode file, and who also agree to indemnify the Center for Human Resource Research for all claims arising from misuse of the file may use these data.

The data and the accompanying documentation should be used in conjunction with the printed versions of the 1972, 1977, 1983, 1988, and 1994 County and City Data Books that correspond to each variable desired in order to have complete information regarding variable descriptions and coding idiosyncrasies. No variables from the County and City Data Books have been included in the geocode data after 2002. Users wishing to attach specific individual items from that or other sources may do so by using the state, county and/or various MSA variables to merge data.
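Attaching items from another source amounts to a join on state and county FIPS codes. The following Python sketch illustrates that merge with the standard library. The column names `state_fips` and `county_fips`, the external field `example_rate`, and all values are hypothetical placeholders, not names or figures from the geocode files.

```python
from typing import Dict, Iterable, List, Tuple


def attach_county_data(
    respondent_rows: Iterable[dict],
    county_table: Dict[Tuple[str, str], dict],
    fields: List[str],
) -> List[dict]:
    """Left-join county-level fields onto respondent rows by (state, county) FIPS."""
    merged = []
    for row in respondent_rows:
        # Pad to the standard widths: 2-digit state code, 3-digit county code.
        key = (str(row["state_fips"]).zfill(2), str(row["county_fips"]).zfill(3))
        extra = county_table.get(key, {})
        # Unmatched respondents keep their row, with None for each added field.
        merged.append({**row, **{name: extra.get(name) for name in fields}})
    return merged


# Placeholder external table keyed by (state FIPS, county FIPS).
county_table = {("39", "049"): {"example_rate": 5.5}}

respondents = [
    {"id": 1, "state_fips": "39", "county_fips": "49"},
    {"id": 2, "state_fips": "06", "county_fips": "037"},
]

merged = attach_county_data(respondents, county_table, ["example_rate"])
```

A left join is used so that respondents without a match in the external source remain in the file. Rows carrying missing-data codes in the state or county fields will simply fail to match.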

Edited variables describing the location of each respondent's residence are created as a result of this matching process. The first two variables, question names "GEO1" and "GEO2", provide the FIPS codes for the respondent's county and state of residence. Two versions of the county and state of residence variables are included in the geocode data for most survey years from 1979-92. The state and county variables appearing at the beginning of each year's variable listing are the edited versions that incorporate all revisions deemed necessary in the hand-editing process for each year. These edited variables are used in the construction of the final geocode data. The state and county variables appearing near the end of the variable listing for most of those years are the unedited versions, as received directly from NORC. It is generally recommended that users employ the edited versions, as these contain corrected geocodes based upon the most current available information.

Researchers are encouraged to use caution during analyses because several modifications were made since 1987 in the programming procedures that create the geocode data files. Please refer to this document for discussion of specific modifications of note.

In years for which ZIP code centroids were assigned, users should note that there is a small possibility that a respondent's county may be misassigned using the centroid method in cases where more than one county is represented in a given ZIP code. In these cases, a respondent might live in one county while the center of the ZIP code area lies in another. However, since ZIP codes infrequently cross county lines and fewer than a quarter of respondents' counties were assigned using the ZIP centroid method, the number of counties incorrectly assigned should be quite small.