Attachment 103: Migration Distance Variables for Respondent Locations

Respondent Migration Variables

In addition to current address, each NLSY97 interview since round 2 has collected some information about moves between interviews. Respondents are asked to report any moves to a new city, state or county (but not local moves). A set of migration variables on the geocode CD reports the state and county of each of these between-interview (non-local) moves. These variables are:

  • GEO_M_ST.xx -- State respondent moved to
  • GEO_M_CO.xx -- County respondent moved to

If the respondent moved to a foreign address, both the state and county variables will be coded as -4 (valid skip). These respondents can be identified by examining the quality variable described below.

Two additional variables provide information about the quality of the data in the migration variables. These are:

  • GEO_M_QUALITY -- Indicates whether the raw data had a state/county match; a state and county mismatch; a state and county missing; a state missing; a county missing; or a foreign move
  • GEO_M_IMPUTE -- Indicates that neither the state nor county was imputed; the state was imputed; the county was imputed; both were imputed; or the respondent had a foreign address.

Users may want to note that these created migration variables are the building blocks for the CV_MIGRATE.xx variable in the main public data file.

Migration Distance Variables

To support research on respondent mobility, we created a series of variables measuring the distance between respondent addresses at each interview round. These supplement the data on state and county of residence in the geocode release. The distance between the respondent's addresses at the dates of interview was computed for every unique pair of survey years, with variable names of the form DISTANCE_DTyrA_yrB. (To avoid unnecessary duplication, each pair appears only once, with B<A.) The data described here do not provide a location for the respondent's residence; these variables provide only the distances between the various places the respondent has lived. This pairwise matrix of variables supports several types of migration research: users can examine the distance between residences and identify return migration to an area where the respondent has lived in the past.
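
As a rough illustration of the pairwise structure, the sketch below enumerates one variable name per unique pair of survey years. The naming pattern (two-digit later year, four-digit earlier year) is only inferred from the sample name DISTANCE_DT98_1997 and may not hold for every round; the function name and the list of years are hypothetical.

```python
from itertools import combinations

def distance_pair_names(years, prefix="DISTANCE_DT"):
    """Return one variable name per unique pair of survey years (B < A)."""
    names = []
    # combinations() over the sorted years yields each (earlier, later)
    # pair exactly once, so no pair is duplicated in reverse order.
    for year_b, year_a in combinations(sorted(years), 2):
        # Assumed pattern, inferred from DISTANCE_DT98_1997:
        # two-digit later year, then four-digit earlier year.
        names.append(f"{prefix}{year_a % 100:02d}_{year_b}")
    return names

names = distance_pair_names([1997, 1998, 1999, 2000])
print(len(names))  # n * (n - 1) / 2 names for n survey years
print(names)
```

With n survey years this yields n(n-1)/2 names per variable family, which is why only one ordering of each pair is stored.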

In addition to the set of distance variables, we created indicators of the quality of the geographic data. For a variety of reasons, we may not have an address for the respondent. In such cases we geocode the respondent to the centroid of the zipcode when we can determine the zipcode. To identify these cases, an indicator for the quality of each distance measure was created based on the quality of the address matches in both years. This variable, DISTANCE_DFyrA_yrB, takes a value of 1 if the addresses in both years were exact address matches, a value of 2 if one match was based on the zip centroid and the other was an exact address match, and a value of 3 if both were zip centroid matches.
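
The flag logic described above can be sketched as follows. The match-type labels "exact" and "zip_centroid" and the function name are hypothetical; the released data contain only the resulting 1/2/3 codes.

```python
def distance_quality_flag(match_a, match_b):
    """Combine the match types of two years into the 1/2/3 quality flag.

    match_a, match_b: "exact" or "zip_centroid" (hypothetical labels).
    """
    valid = {"exact", "zip_centroid"}
    if match_a not in valid or match_b not in valid:
        raise ValueError("match type must be 'exact' or 'zip_centroid'")
    # 0 centroids -> 1, 1 centroid -> 2, 2 centroids -> 3
    n_centroids = (match_a == "zip_centroid") + (match_b == "zip_centroid")
    return 1 + n_centroids

print(distance_quality_flag("exact", "zip_centroid"))
```

A user who wants only distances computed from two exact address matches would keep the pairs whose flag equals 1.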

Finally, DISTANCE_DZCyrA_yrB, an indicator for whether the respondent's zip code changed between the two survey years (1 = zip code changed, 0 = no change), was created for all pairs of years.

Sample codebook pages are below:

       Z90856.00    [DISTANCE_DT98_1997]                              Survey Year: XRND
       YEAR 1997 (COLLAPSED)
       Actual distance in miles from 1998 to prior residence, calculated from latitudes
       and longitudes of addresses.
       Based on edited address information that conforms to legitimate state-county-zip
       combinations, not original unedited values.  Latitudes and longitudes are 
       generated from exact address matches and zip centroids from inexact address
       matches.
         xxxx           0: 0 miles (non-mover)
          xxx           1: 0-999 feet
          xxx           2: 1000 feet - 1 mile
          xxx           3: 1-5 miles
          xxx           4: 5-20 miles
          xxx           5: 20-50 miles
          xxx           6: 50-100 miles
          xxx           7: 100-500 miles
          xxx           8: 500+ miles
       Refusal(-1)           xx
       Don't Know(-2)        xx
       Invalid Skip(-3)      xx
       TOTAL =========>    xxxx   VALID SKIP(-4)     xxx     NON-INTERVIEW(-5)       0
       Min:              0        Max:              8        Mean:                1.01
       Lead In: Z90855.00[Default]
       Default Next Question: Z90857.00
       Z90857.00    [DISTANCE_DF98_1997]                              Survey Year: XRND
       Flag for precision of addresses/zipcodes used to calculate distance
       Latitudes and longitudes are generated from exact address matches and zip 
       centroids from inexact address matches.
         xxxx       1 Distance calculated between 2 valid addresses
          xxx       2 Distance calculated between 1 zip centroid and 1 valid address
          xxx       3 Distance calculated between 2 zip centroids
       Refusal(-1)           xx
       Don't Know(-2)        xx
       Invalid Skip(-3)      xx
       TOTAL =========>    xxxx   VALID SKIP(-4)     xxx     NON-INTERVIEW(-5)       0
       Lead In: Z90856.00[Default]
       Default Next Question: Z90858.00
       Z90858.00    [DISTANCE_DZC98_1997]                             Survey Year: XRND
       Addresses in different zip code
       Latitudes and longitudes are generated from exact address matches and zip 
       centroids from inexact address matches.
         xxxx       1 Zip code changed between survey years
         xxxx       0 No zip code change between survey years
       Refusal(-1)           xx
       Don't Know(-2)        xx
       Invalid Skip(-3)      xx
       TOTAL =========>    xxxx   VALID SKIP(-4)     xxx     NON-INTERVIEW(-5)       0
       Lead In: Z90857.00[Default]
       Default Next Question: Z90859.00
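
To make the collapsed codes concrete, the sketch below computes a great-circle distance from two latitude/longitude pairs and maps it to the 0-8 categories shown in the sample codebook page. The haversine formula, the Earth radius constant, and the treatment of category boundaries (lower bound inclusive, upper bound exclusive) are assumptions of this sketch; the documentation does not state which distance formula or boundary rule was actually used.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch
FEET_PER_MILE = 5280

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))

def collapse_distance(miles):
    """Map a distance in miles to the collapsed codes 0-8."""
    if miles < 0:
        raise ValueError("distance cannot be negative")
    if miles == 0:
        return 0                      # 0 miles (non-mover)
    if miles * FEET_PER_MILE < 1000:
        return 1                      # 0-999 feet
    # Upper edges (in miles) of codes 2 through 7; boundary handling assumed
    upper_bounds = [1, 5, 20, 50, 100, 500]
    for code, upper in enumerate(upper_bounds, start=2):
        if miles < upper:
            return code
    return 8                          # 500+ miles

# Hypothetical coordinates, for illustration only
d = haversine_miles(40.0, -83.0, 41.0, -83.0)
print(round(d, 1), collapse_distance(d))
```

Negative codes in the released variables (-1 through -5) mark refusals, don't knows, skips, and non-interviews, so they should be set aside before any comparison of categories.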

Data Notes and Variable Construction

Respondent address data can be rather inconsistent, with some fields of an address either missing or incorrectly carried over from a previous round. The most reliable address components we have are the state and zipcode of residence at each interview round. Archivists put considerable effort into examining conflicting pieces of evidence to arrive at the most reasonable judgment as to where the respondent lived at the time of the interview. Users should bear in mind that some respondents live in temporary quarters with friends or relatives or in a shelter; are working at remote locations where the employer supplies a place to stay; are homeless or living out of a car; or choose not to say where they live. Some addresses are rural routes, descriptive statements ("at that trailer park just outside of town"), or PO Boxes, making it impossible to assign longitude and latitude to the address. Archivists consulted a variety of data resources in both the interview form and records from the field to make as accurate a determination as possible.

The major task was placing the respondent at a particular location in the county so that we could compute distances. A corollary is that distance measures involving moves from one county to another, and especially from one state to another, will likely contain errors that are a relatively small fraction of the total distance moved. However, the major motivation of this effort is to indicate when a respondent moved back to a location relatively near one of their former addresses, signifying a return to a place where the respondent may have an existing network of contacts.
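
One way a user might look for such return moves in the pairwise distance variables is sketched below. The input layout (a dictionary keyed by year pairs), the function name, and the thresholds for "near" (code 3 or lower) and "far" (code 5 or higher) are all hypothetical choices made for illustration, not definitions from this documentation.

```python
def find_return_moves(dist, years, near_code=3, far_code=5):
    """Flag (origin_year, return_year) pairs that look like return migration.

    dist: {(year_a, year_b): collapsed distance code}, with year_b < year_a.
    A return is flagged when the respondent is near the origin-year address
    again after being far from it in at least one intervening year.
    Missing pairs and negative (missing-data) codes never qualify.
    """
    years = sorted(years)
    returns = []
    for i, origin in enumerate(years):
        # Need at least one intervening year, so start two rounds later
        for k in range(i + 2, len(years)):
            later = years[k]
            back = dist.get((later, origin), -1)
            if not (0 <= back <= near_code):
                continue
            away = any(dist.get((mid, origin), -1) >= far_code
                       for mid in years[i + 1:k])
            if away:
                returns.append((origin, later))
    return returns

# Hypothetical respondent: far from the 1997 address in 1998, near it in 1999
example = {(1998, 1997): 7, (1999, 1997): 1, (1999, 1998): 7}
print(find_return_moves(example, [1997, 1998, 1999]))
```

Because the variables describe only interview-date locations, a check like this cannot see moves away and back that occurred entirely between two interviews.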

Researchers should be aware of several caveats in using these data.

  1. The distance between a respondent's addresses in a pair of years is an imperfect proxy for whether the respondent ever moved between those years. There is a fair amount of return migration, and respondents may have experienced multiple, high-frequency address changes and still be located at the same address at two consecutive interviews despite having moved between them. It is possible to detect some of these interim moves from migration histories collected during the interview, but these moves are not represented in these data, which deal only with location at the date of interview.
  2. There have been changes over time in the FIPS (Federal Information Processing Standards) codes for some counties and county equivalents. In each round of data the state and county codes reflect the contemporaneous FIPS codes. In creating the distance-based variables the changes in coding were reconciled. There are cases in which the FIPS codes are different across pairs of years but the change of address variable indicates that the respondent was located at the same address. That is, the distance moved variables we created will show no move whereas an examination of FIPS codes would suggest a move.
  3. For pairs of addresses within the same county, we reviewed addresses to determine whether what appeared to be different addresses were actually the same place but with differences in spelling or other minor errors that might lead to an address being put in a different location or treated as not translatable into latitude and longitude. For example, 126 Elm Street and 126 Alm Street may be the same place with a typographical error. We have made an effort to reconcile these differences, eliminating such false moves.
  4. Because software packages that assign latitude and longitude to addresses have evolved over time and differ slightly among vendors, the same address can generate slightly different latitude and longitude based on variation in the software system. We have enforced matching coordinates for matching addresses based upon the presumption that geocoding software has seen a secular improvement so that more recently generated coordinates are most likely more reliable. When the same address generates different coordinates, we accept the more recent coordinates.
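
The "accept the more recent coordinates" rule in item 4 can be sketched as follows. The record layout, the field names, and the simple whitespace and case normalization used to decide that two addresses match are hypothetical; the actual reconciliation involved manual review by archivists, which this sketch does not attempt to reproduce.

```python
def _address_key(address):
    # Hypothetical normalization: ignore case and extra whitespace only
    return " ".join(address.upper().split())

def reconcile_coordinates(records):
    """Give every record of a matching address the most recent coordinates.

    records: list of dicts with keys 'address', 'geocoded_year', 'lat', 'lon'.
    Returns new dicts; the input records are left unchanged.
    """
    latest = {}
    for rec in records:
        key = _address_key(rec["address"])
        best = latest.get(key)
        if best is None or rec["geocoded_year"] > best["geocoded_year"]:
            latest[key] = rec
    reconciled = []
    for rec in records:
        newest = latest[_address_key(rec["address"])]
        reconciled.append({**rec, "lat": newest["lat"], "lon": newest["lon"]})
    return reconciled

# Hypothetical records: the same address geocoded in two different years
records = [
    {"address": "126 Elm Street", "geocoded_year": 1998,
     "lat": 40.0010, "lon": -83.0010},
    {"address": "126 ELM STREET ", "geocoded_year": 2004,
     "lat": 40.0012, "lon": -83.0008},
]
fixed = reconcile_coordinates(records)
print(fixed[0]["lat"], fixed[0]["lon"])
```

Once matching addresses share identical coordinates, the computed distance between them is exactly zero, so small differences between geocoding runs do not register as moves.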

Parent Distance Variables

The geocode CD also includes variables reporting the distance between the respondent's address at the interview date and the addresses reported in the locator section for the respondent's mother and father. The parent distance variables were first created in 2003 for all respondents and continue until respondents reach age 25; 2009 is the last round in which these variables were created for any respondents. These variables are:


Collapsed versions of these variables are available in the public use dataset (CV_DISTANCE_MOM_COL and CV_DISTANCE_DAD_COL).

Corresponding variables reporting data quality information are available in the public release dataset (available at These variables are:


The quality variables tell the researcher whether the respondent's address is zip centroided, the parent's address is zip centroided, both are zip centroided, neither is zip centroided, or either the respondent or the parent lives in a foreign location, in which case a distance cannot be determined. If any of the addresses is a zip centroid, the distance is calculated by treating that centroid as the address.