Item Response Theory (IRT) item parameter estimates, scores and standard errors have been calculated for several attitudinal scales administered in multiple NLSY79 survey years as well as in other NLS cohorts.
Detailed descriptions of the creation procedures for these scales appear below.
CES-D IRT item parameter estimates, scores and standard errors with custom weighted Z-scores and percentile ranks
CES-D ITEM PARAMETER ESTIMATES. These four variables, shown below, represent item response theory (IRT) item parameter estimates for each Center for Epidemiologic Studies-Depression Scale (CES-D) item, including measures of discrimination (â) and severity (b̂1, b̂2, b̂3), which were calibrated using a graded response model in Multilog.
CES-D IRT SCORE and STANDARD ERROR. These two variables represent the CES-D IRT scores (θ̂ ) and their standard errors of measurement (SE), which are presented in standardized metric and were calculated using the CES-D IRT parameter estimates.
CES-D CUSTOM WEIGHTED Z-SCORE and PERCENTILE RANK. These two variables were calculated using the cross-sectional custom weights for each survey wave within each cohort, which correct the raw data for the effects of over-sampling, differential base year participation and differential wave and item non-response.
Data collected
The CES-D IRT item parameter estimates, scores and standard errors were calculated using data from the following cohorts and surveys:
- Mature Women
- 1995, 1997, 1999, 2001, 2003
- Young Women
- 1995, 1997, 1999, 2001, 2003
- NLSY79
- 1992, 1994, Age 40
- NLSY79 Child and Young Adult
- 1994, 1996, 1998, 2000, 2002, 2004, 2006, 2008
Mature Women and Young Women
Seven of the CES-D scale items, which appear in Table 1c, were presented to respondents in the 1995 (R33678.00 – R33684.00), 1997 (R41361.00 – R41367.00), 1999 (R50935.00 – R50941.00) and 2001 (R61869.00 – R61875.00) surveys. The 2003 surveys included the full CES-D scale (R72915.00 – R72934.00), as presented in Table 2c.
Table 1c. Subset of CES-D Scale items presented in the 1995, 1997, 1999, and 2001 Mature Women and Young Women surveys
- I felt that I could not shake off the blues even with help from my family or friends.
- I had trouble keeping my mind on what I was doing.
- I felt that everything I did was an effort.
- My sleep was restless.
- I felt lonely.
- I felt sad.
- I could not get “going.”
Table 2c. Full CES-D Scale presented in the 2003 Mature Women and Young Women surveys
- I was bothered by things that usually don’t bother me.
- I did not feel like eating; my appetite was poor.
- I felt that I could not shake off the blues even with help from my family or friends.
- I felt I was just as good as other people.
- I had trouble keeping my mind on what I was doing.
- I felt depressed.
- I felt that everything I did was an effort.
- I felt hopeful about the future.
- I thought my life had been a failure.
- I felt fearful.
- My sleep was restless.
- I was happy.
- I talked less than usual.
- I felt lonely.
- People were unfriendly.
- I enjoyed life.
- I had crying spells.
- I felt sad.
- I felt that people dislike me.
- I could not get “going.”
NLSY79 and NLSY79 Child and Young Adult
For the NLSY79 1992 survey, respondents were presented with the full CES-D scale (R38949.00 – R38968.00), as it appears in Table 2c. Respondents from the NLSY79 1994 (R49783.00 – R49789) survey and all of the Young Adult (1994: Y03360.00 - Y3366.00; 1996: Y06364.00 – Y06370.00; 1998: Y09309.00 – Y09315.00; 2000: Y11615.00 – Y11621.00; 2002: Y13966.00 – Y13972.00; 2004: Y16481.00 – Y16487.00; 2006: Y19199.00 – Y19205.00; 2008: Y22334.00 – Y22340.00) surveys were presented with seven of the CES-D scale items, which appear in Table 3c. When NLSY79 respondents were surveyed at age 40, they were presented with nine of the CES-D items (H00003.00 – H00011.00), which are displayed in Table 4c.
Table 3c. Subset of CES-D Scale items presented in the NLSY79 1994 survey and all of the Young Adult surveys
- I did not feel like eating; my appetite was poor.
- I had trouble keeping my mind on what I was doing.
- I felt depressed.
- I felt that everything I did was an effort.
- My sleep was restless.
- I felt sad.
- I could not get “going.”
Table 4c. Subset of CES-D Scale items presented to NLSY79 respondents at Age 40
- I did not feel like eating; my appetite was poor.
- I felt that I could not shake off the blues even with help from my family or friends.
- I had trouble keeping my mind on what I was doing.
- I felt depressed.
- I felt that everything I did was an effort.
- My sleep was restless.
- I felt lonely.
- I felt sad.
- I could not get “going.”
Depressive symptoms
The Center for Epidemiological Studies – Depression Scale (CES-D) measures an individual’s current level of depressive symptoms and is intended for use in general population surveys (Radloff, 1977). The 20-item scale (see Table 2c) comprises 16 negative symptoms and four positive symptoms, representing a single continuum from depression to happiness (Wood, Taylor, & Joseph, 2010). Symptom severity is measured by asking the frequency of occurrence for each item during the preceding week. Response options range from 0 (rarely or none of the time/1 day) to 3 (most or all of the time/5-7 days).
Why these variables may be helpful
Depression is a mental health condition with substantial prevalence and impact; some have conjectured that it is the single largest cause of lost time on the job worldwide. The original method of scoring these questions was to use a 0-3 scale for the responses (reverse coding where appropriate) and then sum. This is the approach of “Classical Test Theory” (CTT), which used to dominate the psychometrics of scaling. There are several problems with CTT that are relevant for users of the depression data. First, if the questions asked differ from year to year or from cohort to cohort, the scores are not comparable. Second, CTT imposes the very strong restriction that a “1” means the same thing for every question. Third, it assumes that moving from a “1” to a “2” is equally informative about depression for every question. Item Response Theory does not impose these restrictions.
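For reference, here is a minimal sketch of that classical sum scoring for the 20-item CES-D, assuming responses coded 0-3; the function name and 0-based item positions are illustrative and do not correspond to NLS variable names.

```python
# Minimal sketch of the classical (CTT) sum score for the 20-item CES-D,
# assuming responses coded 0-3; item positions and names are illustrative,
# not NLS variable names.

# 0-based positions of the four positively worded items in Table 2c
# (items 4, 8, 12 and 16: "just as good", "hopeful", "happy", "enjoyed life").
POSITIVE_ITEMS = {3, 7, 11, 15}

def cesd_sum_score(responses):
    """Return the classical CES-D sum score (0-60) for one respondent.

    `responses` is a sequence of 20 integers in 0..3; the positively worded
    items are reverse coded (0 <-> 3, 1 <-> 2) before summing.
    """
    if len(responses) != 20:
        raise ValueError("expected 20 CES-D responses")
    total = 0
    for i, r in enumerate(responses):
        total += (3 - r) if i in POSITIVE_ITEMS else r
    return total

# A respondent answering 1 to every item: the 16 negative items contribute 16
# and the 4 reverse-coded positive items contribute 8, for a sum of 24.
print(cesd_sum_score([1] * 20))  # -> 24
```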
Our scoring of the CES-D using IRT generates a θ value that is comparable across rounds, across cohorts and summarizes the information contained in all the responses to the CES-D questions for a particular round. We also provide an estimated standard error for θ to provide the user with guidance on the precision of the estimated value of θ.
Because θ measures depressivity with error (this being unavoidable given the data resources at hand), we suggest that when θ is being used as a right-hand side variable in a regression, users consider taking advantage of the repeated measures on θ available in the various rounds, by using these other measured values as “instruments” for the observation of θ being used as a regressor. The method of instrumental variables (IV) is due to Geary and Reiersol, dating back to 1945, and is discussed in many advanced texts on statistical methods for the social sciences. By using IV, the well-known problem of attenuation bias due to measurement error can be overcome. The substantial stability of CES-D scores over the life course makes the use of IV especially efficacious. Whether IV is appropriate depends on the model specification and the assumed error structure.
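For users who want a concrete starting point, the sketch below implements the textbook two-stage least squares (2SLS) estimator on simulated data, using a θ score from another round as the instrument; the variable names (y, theta_t, theta_s, controls) are hypothetical placeholders, and the example is not a recommendation for any particular model specification.

```python
# Illustrative two-stage least squares (2SLS) estimator on simulated data,
# using a theta score from another round as the instrument for the noisy
# theta regressor; y, theta_t, theta_s and controls are hypothetical names.
import numpy as np

def iv_2sls(y, theta_t, theta_s, controls):
    """2SLS estimate of [intercept, coef on theta_t, coefs on controls]."""
    n = len(y)
    ones = np.ones((n, 1))
    X = np.column_stack([ones, theta_t, controls])   # structural regressors
    Z = np.column_stack([ones, theta_s, controls])   # instruments
    # beta = (X' Pz X)^(-1) X' Pz y, where Pz projects onto the columns of Z.
    Pz_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    return np.linalg.solve(X.T @ Pz_X, Pz_X.T @ y)

# Simulation: the true effect of latent depressivity is 0.5, but both measured
# scores carry independent noise, so plain OLS on theta_t would be attenuated.
rng = np.random.default_rng(0)
n = 5000
latent = rng.normal(size=n)
theta_t = latent + rng.normal(scale=0.5, size=n)     # score used as regressor
theta_s = latent + rng.normal(scale=0.5, size=n)     # score from another round
controls = rng.normal(size=(n, 1))
y = 0.5 * latent + 0.3 * controls[:, 0] + rng.normal(size=n)
print(iv_2sls(y, theta_t, theta_s, controls))        # coef on theta_t near 0.5
```

In this setup OLS of y on theta_t would shrink the estimated effect toward zero, while the instrumented estimate stays near the true value.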
Moreover, because the subset of CES-D questions asked changes and IRT scoring accounts for such changes, using IRT simplifies comparisons over the life course and across cohorts, which are staples of longitudinal analysis.
Item Response Theory
Within the item response theory (IRT) framework, the latent construct (θ) being measured (i.e., depressive symptoms) is assumed to follow a standard normal distribution. Because the multiple response options for the CES-D items are meaningfully ordered with respect to θ, the IRT analyses were conducted using a graded response model (GRM; Samejima, 1969) in Multilog (Thissen, Chen, & Bock, 2003). For a GRM, the a parameter or slope estimate (â) represents item discrimination, which indicates how well an item differentiates between individuals with varying levels of θ. Items with low slopes (i.e., close to zero) are problematic because they do not distinguish between individuals with varying levels of depressive symptoms; therefore, items with higher â values are generally more desirable than those with lower â values. Each b parameter or severity estimate (b̂1, b̂2, b̂3) identifies the point along θ where one response category becomes more likely to be endorsed than any other option, given the respondent’s level of depressive symptoms. Items with b̂ values that are evenly distributed across the range of θ draw clear distinctions between individuals with varying levels of depressive symptoms, according to the response options that they choose. Items with b̂ values that are extreme (greater than 4.5 standard deviations in either direction) or too close together are less desirable because knowledge of the selected response option does not provide clear information about an individual’s level of θ. The unidimensionality of the CES-D scale (i.e., the items measure a single latent construct; Wood et al., 2010) ensures that any subset of CES-D items will also provide a unidimensional measure of depressive symptoms. The presentation of the CES-D IRT scores in standardized metric, with a mean of zero and a standard deviation of one, allows for easy and meaningful comparison of scores and standard errors with standardized scores from other scales.
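To make the GRM concrete, the sketch below computes the response-category probabilities for a single four-category item from its â and b̂ estimates, using the common logistic form without the 1.7 scaling constant (Multilog’s exact parameterization may differ); the illustration borrows the estimates for “I felt depressed” from Table 5c below.

```python
# Sketch of graded response model category probabilities for one item, using
# the logistic cumulative form P*(k) = 1 / (1 + exp(-a * (theta - b_k))).
import numpy as np

def grm_category_probs(theta, a, bs):
    """Probabilities of the len(bs) + 1 ordered response categories.

    theta : latent trait value(s); a : discrimination; bs : increasing
    severity parameters (b1, b2, b3 for a four-category item).
    """
    theta = np.asarray(theta, dtype=float)
    # Cumulative probabilities of responding in category k or higher.
    p_star = [np.ones_like(theta)]
    p_star += [1.0 / (1.0 + np.exp(-a * (theta - b))) for b in bs]
    p_star.append(np.zeros_like(theta))
    # Category probabilities are differences of adjacent cumulative curves.
    return np.array([p_star[k] - p_star[k + 1] for k in range(len(bs) + 1)])

# "I felt depressed" (a = 3.21, b = 0.66, 1.38, 1.90): at theta = 0 nearly all
# of the probability mass sits in the lowest ("rarely or none") category.
print(grm_category_probs(0.0, 3.21, [0.66, 1.38, 1.90]).round(3))
```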
The IRT item calibration of the CES-D scale items was conducted on a combined data set of responses from the four cohorts, across multiple years (see above; n = 20,225; Table 5c). Because previous research has identified the presence of differential item functioning (DIF) for some CES-D items across the life course (Cooksey, Eberwein, Gardecki, Ing, & Olsen, 2010), additional analyses were conducted to ensure the appropriateness of using a single set of item parameter estimates for calculating the CES-D IRT scores for respondents of different ages. Separate item calibrations were conducted for eight age groups, ranging from teens to those in their eighties, and DIF analyses were conducted on the eight sets of parameter estimates. Although the differences between parameter estimates across age groups were statistically significant in chi-square difference tests in IRTLRDIF (Thissen, 2001), they are not substantively significant. The practical significance of the group differences was evaluated by comparing two sets of IRT scores for each age group: one based on the group’s own item parameter estimates and one calculated from the combined-data parameter estimates. The Spearman rank-order correlation coefficients for the two sets of IRT scores indicate a high degree of similarity in the percentile rankings of the scores (r ≥ .97); a person in the top 5% for depressive symptoms will most likely be in the top 5% regardless of which set of parameter estimates is used to calculate the score. Given these results, the CES-D IRT item parameter estimates based on the combined data set were used to calculate all of the CES-D IRT scores and standard errors.
Custom weighted scores and percentile ranks
Every NLS data release contains a set of cross-sectional weights. Using these weights provides a simple method for users to correct the raw data for the effects of over-sampling, clustering and differential base year participation. The custom weighted z-scores and percentile ranks were calculated using the CES-D IRT scores and the cross-sectional weights for each survey wave within each NLS cohort.
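As a rough illustration of how weighted z-scores and percentile ranks can be formed, the sketch below standardizes a vector of IRT scores using survey weights; it does not reproduce the exact NLS procedure, ignores ties in the percentile calculation, and uses made-up numbers.

```python
# Sketch of a weighted z-score and weighted percentile rank for a vector of
# IRT scores; `scores` and `weights` are made-up numbers, ties are ignored,
# and the exact NLS weighting procedure is not reproduced.
import numpy as np

def weighted_z_scores(scores, weights):
    """Standardize scores using the weighted mean and weighted SD."""
    scores, weights = np.asarray(scores, float), np.asarray(weights, float)
    mean = np.average(scores, weights=weights)
    var = np.average((scores - mean) ** 2, weights=weights)
    return (scores - mean) / np.sqrt(var)

def weighted_percentile_ranks(scores, weights):
    """Percent of the weighted sample scoring at or below each score."""
    scores, weights = np.asarray(scores, float), np.asarray(weights, float)
    order = np.argsort(scores)
    cum = np.cumsum(weights[order]) / weights.sum()
    ranks = np.empty_like(cum)
    ranks[order] = 100.0 * cum
    return ranks

scores = np.array([-1.2, -0.3, 0.0, 0.4, 1.8])
weights = np.array([2.0, 1.0, 1.0, 1.0, 0.5])    # e.g., cross-sectional weights
print(weighted_z_scores(scores, weights).round(2))
print(weighted_percentile_ranks(scores, weights).round(1))
```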
Table 5c. CES-D IRT item parameter estimates

| Item | Parameter | Estimate | S.E. |
|---|---|---|---|
| 1. I was bothered by things that usually don’t bother me. | a | 1.56 | 0.03 |
| | b1 | 0.47 | 0.02 |
| | b2 | 1.57 | 0.03 |
| | b3 | 2.49 | 0.05 |
| 2. I did not feel like eating; my appetite was poor. | a | 1.37 | 0.03 |
| | b1 | 1.16 | 0.02 |
| | b2 | 2.04 | 0.04 |
| | b3 | 2.93 | 0.06 |
| 3. I felt that I could not shake off the blues even with help from my family or friends. | a | 2.76 | 0.06 |
| | b1 | 0.78 | 0.01 |
| | b2 | 1.49 | 0.02 |
| | b3 | 2.04 | 0.03 |
| 4. I felt I was just as good as other people. | a | 0.16 | 0.03 |
| | b1 | -4.70 | 0.65 |
| | b2 | -3.07 | 0.39 |
| | b3 | -0.98 | 0.19 |
| 5. I had trouble keeping my mind on what I was doing. | a | 1.52 | 0.03 |
| | b1 | 0.39 | 0.02 |
| | b2 | 1.45 | 0.02 |
| | b3 | 2.44 | 0.04 |
| 6. I felt depressed. | a | 3.21 | 0.05 |
| | b1 | 0.66 | 0.01 |
| | b2 | 1.38 | 0.01 |
| | b3 | 1.90 | 0.02 |
| 7. I felt that everything I did was an effort. | a | 0.95 | 0.02 |
| | b1 | -0.02 | 0.02 |
| | b2 | 1.11 | 0.03 |
| | b3 | 1.84 | 0.05 |
| 8. I felt hopeful about the future. | a | 0.10 | 0.03 |
| | b1 | -8.51 | 1.89 |
| | b2 | -3.40 | 0.90 |
| | b3 | 2.88 | 0.83 |
| 9. I thought my life had been a failure. | a | 1.97 | 0.05 |
| | b1 | 1.48 | 0.03 |
| | b2 | 2.16 | 0.04 |
| | b3 | 2.69 | 0.06 |
| 10. I felt fearful. | a | 1.88 | 0.05 |
| | b1 | 1.17 | 0.02 |
| | b2 | 2.08 | 0.04 |
| | b3 | 2.70 | 0.06 |
| 11. My sleep was restless. | a | 1.22 | 0.02 |
| | b1 | 0.23 | 0.02 |
| | b2 | 1.33 | 0.03 |
| | b3 | 2.18 | 0.04 |
| 12. I was happy. | a | 0.09 | 0.04 |
| | b1 | -9.27 | 2.10 |
| | b2 | -4.05 | 0.87 |
| | b3 | 3.03 | 0.74 |
| 13. I talked less than usual. | a | 1.29 | 0.03 |
| | b1 | 0.91 | 0.03 |
| | b2 | 1.86 | 0.04 |
| | b3 | 2.74 | 0.07 |
| 14. I felt lonely. | a | 2.06 | 0.04 |
| | b1 | 0.81 | 0.02 |
| | b2 | 1.59 | 0.03 |
| | b3 | 2.18 | 0.04 |
| 15. People were unfriendly. | a | 0.99 | 0.03 |
| | b1 | 1.62 | 0.05 |
| | b2 | 2.88 | 0.09 |
| | b3 | 3.82 | 0.13 |
| 16. I enjoyed life. | a | 0.10 | 0.02 |
| | b1 | -7.44 | 1.52 |
| | b2 | -4.14 | 1.08 |
| | b3 | 0.09 | 0.23 |
| 17. I had crying spells. | a | 2.20 | 0.06 |
| | b1 | 1.35 | 0.02 |
| | b2 | 2.02 | 0.03 |
| | b3 | 2.66 | 0.05 |
| 18. I felt sad. | a | 2.83 | 0.05 |
| | b1 | 0.59 | 0.01 |
| | b2 | 1.49 | 0.02 |
| | b3 | 2.15 | 0.03 |
| 19. I felt that people dislike me. | a | 1.56 | 0.05 |
| | b1 | 1.64 | 0.03 |
| | b2 | 2.58 | 0.06 |
| | b3 | 3.26 | 0.09 |
| 20. I could not get “going.” | a | 1.52 | 0.03 |
| | b1 | 0.57 | 0.02 |
| | b2 | 1.76 | 0.03 |
| | b3 | 2.64 | 0.05 |
References
Cooksey, E., Eberwein, C., Gardecki, R., Ing, P., & Olsen, R. J. (2010, September). Depressivity over the life course and across generations. Paper presented at the Society for Longitudinal and Life Course Studies Conference, Cambridge, UK.
Geary, R. (1949). Determination of linear relationships between systematic parts of variables with errors of observation the variances of which are unknown. Econometrica, 17.
Radloff, L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1(3), 385-401.
Reiersol, O. (1945). Confluence analysis by means of instrumental sets of variables. Arkiv för Matematik, Astronomi och Fysik, 32.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No. 17.
Thissen, D. (2001). IRTLRDIF (Version 2.0b) [Computer software]. Chapel Hill, North Carolina: L. L. Thurstone Psychometric Laboratory.
Thissen, D., Chen, W., & Bock, R. D. (2003). Multilog (Version 7.03) [Computer software]. Lincolnwood, Illinois: Scientific Software International.
Wood, A. M., Taylor, P. J., & Joseph, S. (2010). Does the CES-D measure a continuum from depression to happiness? Comparing substantive and artifactual models. Psychiatry Research, 177, 120-123.
Pearlin Mastery Scale IRT item parameter estimates, scores and standard errors with custom weighted Z-scores and percentile ranks
PM ITEM PARAMETER ESTIMATES. These four variables, shown below, represent item response theory (IRT) item parameter estimates for each Pearlin Mastery Scale (PM) item, including measures of discrimination (â) and severity (b̂1, b̂2, b̂3), which were calibrated using a graded response model in Multilog.
PM IRT SCORE and STANDARD ERROR. These two variables represent the PM IRT scores (θ̂ ) and their standard errors of measurement (SE), which are presented in standardized metric, and were calculated using the PM ITEM PARAMETER ESTIMATES.
PM CUSTOM WEIGHTED Z-SCORE and PERCENTILE RANK. These two variables were calculated using the cross-sectional custom weights for each survey wave within each cohort, which correct the raw data for the effects of over-sampling, differential base year participation and differential wave and item non-response.
Data collected
The PM IRT item parameter estimates, scores and standard errors were calculated using data from the following cohorts and surveys:
- NLSY79
- 1992
- NLSY79 Child and Young Adult
- 1994, 1996, 1998, 2000, 2002, 2004, 2006, 2008
NLSY79 and NLSY79 Child and Young Adult
For the NLSY79 1992 survey (R38942.00 – R38948.00) and the NLSY79 Child and Young Adult surveys (1994: Y03343.00 - Y3349.00; 1996: Y06347.00 – Y06353.00; 1998: Y09292.00 – Y09298.00; 2000: Y11592.00 – Y11598.00; 2002: Y13943.00 – Y13949.00; 2004: Y16458.00 – Y16464.00; 2006: Y19176.00 – Y19182.00; 2008: Y22328.00 – Y22334.00), respondents were presented with the 7-item PM scale, as it appears in Table 1p.
Table 1p. Pearlin Mastery Scale Items
- There is really no way I can solve some of the problems I have. (RC)
- Sometimes I feel that I’m being pushed around in life. (RC)
- I have little control over the things that happen to me. (RC)
- *I can do just about anything I really set my mind to.
- I often feel helpless in dealing with the problems of life. (RC)
- What happens to me in the future mostly depends on me.
- There is little I can do to change many of the important things in my life. (RC)
Note: RC represents item values that require reverse coding prior to scoring.
*Item 4 was removed from the scale prior to conducting IRT analyses, due to redundancy with Item 6.
Mastery
The Pearlin Mastery Scale (PM) measures an individual’s level of mastery, which is a psychological resource that has been defined as “the extent to which one regards one’s life-chances as being under one’s own control in contrast to being fatalistically ruled” (Pearlin & Schooler, 1978, p.5). The 7-item scale (see Table 1p) comprises five negatively worded items and two positively worded items, presented with the following response options: (1) Strongly Disagree (2) Disagree (3) Agree (4) Strongly Agree. The negatively worded items require reverse coding prior to scoring, resulting in a score range of 7 to 28, with higher scores indicating greater levels of mastery.
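As with the CES-D, a minimal sketch of this classical sum scoring (here on the 1-4 response scale, with the (RC) items reverse coded) is shown below; the function name and 0-based item positions are illustrative only.

```python
# Minimal sketch of the classical sum score for the 7-item Pearlin Mastery
# Scale, assuming responses coded 1 (Strongly Disagree) to 4 (Strongly Agree);
# the function name and 0-based item positions are illustrative only.

# Positions of the items marked (RC) in Table 1p (items 1, 2, 3, 5 and 7).
REVERSE_CODED = {0, 1, 2, 4, 6}

def pm_sum_score(responses):
    """Return the classical PM sum score (7-28) for one respondent."""
    if len(responses) != 7:
        raise ValueError("expected 7 PM responses")
    total = 0
    for i, r in enumerate(responses):
        total += (5 - r) if i in REVERSE_CODED else r   # reverse code 1<->4, 2<->3
    return total

# Strongly disagreeing with every negatively worded item and strongly agreeing
# with both positively worded items yields the maximum score of 28.
print(pm_sum_score([1, 1, 1, 4, 1, 4, 1]))  # -> 28
```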
Why these variables may be helpful
Mastery has been shown to provide a protective buffer for individuals’ mental and physical health and well-being when facing persistent life stresses, such as economic and occupational hardships (e.g., Pearlin & Schooler, 1978; Pudrovska, Schieman, Pearlin & Nguyen, 2005). The original method of scoring these questions was to use a 1-4 scale for the responses (reverse coding where appropriate) and then sum. This is the approach of “Classical Test Theory” (CTT), which used to dominate the psychometrics of scaling. CTT imposes the very strong restrictions that a “1” means the same thing for every question and that moving from a “1” to a “2” is equally informative about mastery for every question. Item Response Theory does not impose these restrictions.
Our scoring of the PM using IRT generates a θ value that is comparable across rounds, across cohorts and summarizes the information contained in all the responses to the PM questions for a particular round. We also provide an estimated standard error for θ to provide the user with guidance on the precision of the estimated value of θ.
Because θ measures mastery with error (this being unavoidable given the data resources at hand), we suggest that when θ is being used as a right-hand side variable in a regression, users consider taking advantage of the repeated measures on θ available in the various rounds, by using these other measured values as “instruments” for the observation of θ being used as a regressor. The method of instrumental variables (IV) is due to Geary and Reiersol, dating back to 1945, and is discussed in many advanced texts on statistical methods for the social sciences. By using IV, the well-known problem of attenuation bias due to measurement error can be overcome. The stability of PM scores over the life course makes the use of IV efficacious. Whether IV is appropriate depends on the model specification and the assumed error structure. Moreover, because IRT scoring can account for any changes made to the number of PM items being asked, using IRT simplifies comparisons over the life course and across cohorts, which are staples of longitudinal analysis.
Item Response Theory
Within the item response theory (IRT) framework, the latent construct (θ) being measured (i.e., mastery) is assumed to follow a standard normal distribution. Because the multiple response options for the PM items are meaningfully ordered with respect to θ, the IRT analyses were conducted using a graded response model (GRM; Samejima, 1969) in Multilog (Thissen, Chen, & Bock, 2003). For a GRM, the a parameter or slope estimate (â) represents item discrimination, which indicates how well an item differentiates between individuals with varying levels of θ. Items with low slopes (i.e., close to zero) are problematic because they do not distinguish between individuals with varying levels of mastery; therefore, items with higher â values are generally more desirable than those with lower â values. Each b parameter or severity estimate (b̂1, b̂2, b̂3) identifies the point along θ where one response category becomes more likely to be endorsed than any other option, given the respondent’s level of mastery. Items with b̂ values that are evenly distributed across the range of θ draw clear distinctions between individuals with varying levels of mastery, according to the response options that they choose. Items with b̂ values that are extreme (greater than 4.5 standard deviations in either direction) or too close together are less desirable because knowledge of the selected response option does not provide clear information about an individual’s level of θ. The unidimensionality of the PM scale (i.e., the items measure a single latent construct) ensures that any subset of PM items will also provide a unidimensional measure of mastery. The presentation of the PM IRT scores in standardized metric, with a mean of zero and a standard deviation of one, allows for easy and meaningful comparison of scores and standard errors with standardized scores from other scales.
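To illustrate how IRT scores (θ̂) and their standard errors can be obtained from item parameter estimates such as those in Table 2p, here is a minimal expected a posteriori (EAP) scoring sketch under a standard normal prior; it uses the same simplified logistic GRM as the CES-D sketch above, handles unadministered items by simply skipping them, and is not the Multilog scoring routine.

```python
# Sketch of expected a posteriori (EAP) scoring under a simplified logistic
# GRM with a standard normal prior on theta; not the Multilog routine.
import numpy as np

def grm_likelihood(theta, responses, params):
    """Likelihood of one response pattern at each grid value in `theta`.

    responses : 0-based category choices (None for an item not administered)
    params    : list of (a, [b1, b2, b3]) tuples, one per item
    """
    like = np.ones_like(theta)
    for resp, (a, bs) in zip(responses, params):
        if resp is None:
            continue
        cut = ([np.ones_like(theta)]
               + [1.0 / (1.0 + np.exp(-a * (theta - b))) for b in bs]
               + [np.zeros_like(theta)])
        like *= cut[resp] - cut[resp + 1]
    return like

def eap_score(responses, params, n_points=81):
    """EAP estimate of theta and its posterior SD for one respondent."""
    theta = np.linspace(-4.0, 4.0, n_points)      # quadrature grid
    prior = np.exp(-0.5 * theta ** 2)             # N(0, 1), unnormalized
    post = prior * grm_likelihood(theta, responses, params)
    post /= post.sum()
    mean = np.sum(theta * post)
    sd = np.sqrt(np.sum((theta - mean) ** 2 * post))
    return mean, sd

# Two items with the estimates reported for Items 1 and 3 in Table 2p.
params = [(1.93, [-2.36, -1.14, 0.63]), (2.26, [-2.59, -1.47, 0.51])]
print(eap_score([3, 3], params))   # top-category responses give a positive theta
```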
Results from a series of factor analyses (both exploratory and confirmatory, using independent data samples) provided evidence to support the unidimensionality of the PM scale. Prior to conducting the IRT item analyses, Item 4 (see Table 1p) was removed from the PM scale due to redundancy with Item 6 (i.e., correlated error variance), which violates the IRT assumption of independence of items after accounting for θ (i.e., mastery). The IRT item calibration of the six PM scale items was conducted on the combined NLSY79 1992 and NLSY79 Child and Young Adult 2008 data (n = 15,287; Table 2p). Given the potential influence of age and gender on level of mastery, differential item functioning (DIF) analyses were conducted to ensure the appropriateness of using a single set of item parameter estimates for calculating PM scores for respondents of different ages and genders. First, separate item calibrations were conducted for four age groups, ranging from 14-19 years to 30-38 years, and were followed up with DIF analyses of the four sets of parameter estimates. Next, item calibrations were conducted for gender groups (i.e., males and females), with DIF analyses conducted on the two sets of item parameter estimates. Although the differences between parameter estimates across age groups and gender were statistically significant in chi-square difference tests in IRTLRDIF (Thissen, 2001), they are not substantively significant. The practical significance of the group differences was evaluated by comparing two sets of IRT scores for each age and gender group: one based on the group’s own item parameter estimates and one calculated from the combined-data item parameter estimates. The Spearman rank-order correlation coefficients for the two sets of IRT scores indicate a high degree of similarity in the percentile rankings of the scores (r ≥ .99); a person in the top 5% for mastery will most likely be in the top 5% regardless of which set of parameter estimates is used to calculate the score. Given these results, the PM IRT item parameter estimates based on the combined data set were used to calculate all of the PM IRT scores and standard errors.
Custom weighted scores and percentile ranks
Every NLS data release contains a set of cross-sectional weights. Using these weights provides a simple method for users to correct the raw data for the effects of over-sampling, clustering and differential base year participation. The custom weighted z-scores and percentile ranks were calculated using the PM IRT scores and the cross-sectional weights for each survey wave within each NLS cohort.
Table 2p. Pearlin Mastery Scale IRT item parameter estimates

| Item | Parameter | Estimate | S.E. |
|---|---|---|---|
| 1. There is really no way I can solve some of the problems I have. | a | 1.93 | 0.03 |
| | b1 | -2.36 | 0.04 |
| | b2 | -1.14 | 0.02 |
| | b3 | 0.63 | 0.02 |
| 2. Sometimes I feel that I’m being pushed around in life. | a | 1.68 | 0.03 |
| | b1 | -2.66 | 0.05 |
| | b2 | -0.98 | 0.02 |
| | b3 | 0.89 | 0.02 |
| 3. I have little control over the things that happen to me. | a | 2.26 | 0.04 |
| | b1 | -2.59 | 0.05 |
| | b2 | -1.47 | 0.02 |
| | b3 | 0.51 | 0.02 |
| 5. I often feel helpless in dealing with the problems of life. | a | 2.19 | 0.03 |
| | b1 | -2.53 | 0.04 |
| | b2 | -1.20 | 0.02 |
| | b3 | 0.73 | 0.02 |
| 6. What happens to me in the future mostly depends on me. | a | 1.03 | 0.03 |
| | b1 | -4.60 | 0.15 |
| | b2 | -3.22 | 0.09 |
| | b3 | 0.34 | 0.03 |
| 7. There is little I can do to change many of the important things in my life. | a | 1.76 | 0.03 |
| | b1 | -2.82 | 0.06 |
| | b2 | -1.49 | 0.03 |
| | b3 | 0.70 | 0.02 |
References
Geary, R. (1949). Determination of linear relationships between systematic parts of variables with errors of observation the variances of which are unknown. Econometrica, 17.
Pearlin, L.I., & Schooler, C. (1978). The structure of coping. Journal of Health and Social Behavior, 19, 2-21.
Pudrovska, T., Schieman, S., Pearlin, L.I., & Nguyen, K. (2005). The sense of mastery as a mediator and moderator in the association between economic hardship and health in late life. Journal of Aging and Health, 17, 634-660.
Reiersol, O. (1945). Confluence analysis by means of instrumental sets of variables. Arkiv för Matematik, Astronomi och Fysik, 32.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No. 17.
Thissen, D. (2001). IRTLRDIF (Version 2.0b) [Computer software]. Chapel Hill, North Carolina: L. L. Thurstone Psychometric Laboratory.
Thissen, D., Chen, W., & Bock, R. D. (2003). Multilog (Version 7.03) [Computer software]. Lincolnwood, Illinois: Scientific Software International.
Rosenberg Self-Esteem Scale IRT item parameter estimates, scores and standard errors with custom weighted Z-scores and percentile ranks
RSE ITEM PARAMETER ESTIMATES. These four variables, shown below, represent item response theory (IRT) item parameter estimates for each Rosenberg Self-Esteem Scale (RSE) item, including measures of discrimination (â) and severity (b̂1, b̂2, b̂3), which were calibrated using a graded response model in Multilog.
RSE IRT SCORE and STANDARD ERROR. These two variables represent the RSE IRT scores (θ̂ ) and their standard errors of measurement (SE), which are presented in standardized metric, and were calculated using the RSE ITEM PARAMETER ESTIMATES.
RSE CUSTOM WEIGHTED Z-SCORE and PERCENTILE RANK. These two variables were calculated using the cross-sectional custom weights, for each survey wave within each cohort, which correct the raw data for the effects of over-sampling, differential base year participation and differential wave and item non-response.
Data collected
The RSE IRT item parameter estimates, scores and standard errors were calculated using data from the following cohorts and surveys:
- NLSY79
- 1980, 1987, 2006
- NLSY79 Child and Young Adult
- 1994, 1996, 1998, 2000, 2002, 2004, 2006, 2008
NLSY79 and NLSY79 Child and Young Adult
For the NLSY79 surveys (1980: R03035.00 – R03044.00; 1987: R23491.00 – R23500.00; 2006: T08998.00 – T08998.09) and the NLSY79 Child and Young Adult surveys (1994: Y03350.00 - Y3359.00; 1996: Y06354.00 – Y06363.00; 1998: Y09299.00 – Y09308.00; 2000: Y11599.00 – Y11608.00; 2002: Y13950.00 – Y13959.00; 2004: Y16465.00 – Y16474.00; 2006: Y19183.00 – Y19192.00; 2008: Y22335.00 – Y22344.00), respondents were presented with the 10-item RSE scale, as it appears in Table 1r.
Table 1r. Rosenberg Self-Esteem Scale Items
- *I feel that I’m a person of worth, at least on equal basis with others. (RC)
- I feel that I have a number of good qualities. (RC)
- All in all, I am inclined to feel that I am a failure.
- I am able to do things as well as most other people. (RC)
- I feel I do not have much to be proud of.
- I take a positive attitude toward myself. (RC)
- On the whole, I am satisfied with myself. (RC)
- I wish I could have more respect for myself.
- I certainly feel useless at times.
- At times I think I am no good at all.
Note: RC represents item values that require reverse coding prior to scoring.
*Item 1 was removed from the scale prior to conducting IRT analyses, due to redundancy with Item 2.
Self-Esteem
The Rosenberg Self-Esteem Scale (RSE; Rosenberg, 1965) provides a measure of global self-esteem, which has been defined as an individual’s general sense of personal worth (Rosenberg, 1979). The 10-item scale (see Table 1r) comprises five positively worded items and five negatively worded items, presented with the following response options: (1) Strongly Agree (2) Agree (3) Disagree (4) Strongly Disagree. The positively worded items require reverse coding prior to scoring, resulting in a score range of 10 to 40, with higher scores indicating greater levels of self-esteem.
Why these variables may be helpful
Self-esteem has been described as a core component of positive self-concept, which displays positive relationships with both job satisfaction and job performance (Judge & Bono, 2001). The original method of scoring these questions was to use a 1-4 scale for the responses (reverse coding where appropriate) and then sum. This is the approach of “Classical Test Theory” (CTT), which used to dominate the psychometrics of scaling. CTT imposes the very strong restrictions that a “1” means the same thing for every question and that moving from a “1” to a “2” is equally informative about self-esteem for every question. Item Response Theory does not impose these restrictions.
Our scoring of the RSE using IRT generates a θ value that is comparable across rounds, across cohorts and summarizes the information contained in all the responses to the RSE questions for a particular round. We also provide an estimated standard error for θ to provide the user with guidance on the precision of the estimated value of θ.
Because θ measures self-esteem with error (this being unavoidable given the data resources at hand), we suggest that when θ is being used as a right-hand side variable in a regression, users consider taking advantage of the repeated measures on θ available in the various rounds, by using these other measured values as “instruments” for the observation of θ being used as a regressor. The method of instrumental variables (IV) is due to Geary and Reiersol, dating back to 1945, and is discussed in many advanced texts on statistical methods for the social sciences. By using IV, the well-known problem of attenuation bias due to measurement error can be overcome. The stability of RSE scores over the life course makes the use of IV efficacious. Whether IV is appropriate depends on the model specification and the assumed error structure. Moreover, because IRT scoring can account for any changes made to the number of RSE items being asked, using IRT simplifies comparisons over the life course and across cohorts, which are staples of longitudinal analysis.
Item Response Theory
Within the item response theory (IRT) framework, the latent construct (θ) being measured (i.e., self-esteem) is assumed to follow a standard normal distribution. Because the multiple response options for the RSE items are meaningfully ordered with respect to θ, the IRT analyses were conducted using a graded response model (GRM; Samejima, 1969) in Multilog (Thissen, Chen, & Bock, 2003). For a GRM, the a parameter or slope estimate (â) represents item discrimination, which indicates how well an item differentiates between individuals with varying levels of θ. Items with low slopes (i.e., close to zero) are problematic because they do not distinguish between individuals with varying levels of self-esteem; therefore, items with higher â values are generally more desirable than those with lower â values. Each b parameter or severity estimate (b̂1, b̂2, b̂3) identifies the point along θ where one response category becomes more likely to be endorsed than any other option, given the respondent’s level of self-esteem. Items with b̂ values that are evenly distributed across the range of θ draw clear distinctions between individuals with varying levels of self-esteem, according to the response options that they choose. Items with b̂ values that are extreme (greater than 4.5 standard deviations in either direction) or too close together are less desirable because knowledge of the selected response option does not provide clear information about an individual’s level of θ. The unidimensionality of the RSE scale (i.e., the items measure a single latent construct) ensures that any subset of RSE items will also provide a unidimensional measure of self-esteem. The presentation of the RSE IRT scores in standardized metric, with a mean of zero and a standard deviation of one, allows for easy and meaningful comparison of scores and standard errors with standardized scores from other scales.
Results from a series of factor analyses (both exploratory and confirmatory, using independent data samples) provided evidence to support the unidimensionality of the RSE scale. Prior to conducting the IRT item analyses, Item 1 (see Table 1r) was removed from the RSE scale due to redundancy with Item 2 (i.e., correlated error variance), which violates the IRT assumption of independence of items after accounting for θ (i.e., self-esteem). The IRT item calibration of the nine RSE scale items was conducted on the combined NLSY79 2006 and NLSY79 Child and Young Adult 2008 data (n = 13,947; Table 2r). Given the potential influence of age and gender on level of self-esteem, differential item functioning (DIF) analyses were conducted to ensure the appropriateness of using a single set of item parameter estimates for calculating RSE scores for respondents of different ages and genders. First, separate item calibrations were conducted for five age groups, ranging from 14-19 years to 41-50 years, and were followed up with DIF analyses of the five sets of parameter estimates. Next, item calibrations were conducted for gender groups (i.e., males and females), with DIF analyses conducted on the two sets of item parameter estimates. Although the differences between parameter estimates across age groups and gender were statistically significant in chi-square difference tests in IRTLRDIF (Thissen, 2001), they are not substantively significant. The practical significance of the group differences was evaluated by comparing two sets of IRT scores for each age and gender group: one based on the group’s own item parameter estimates and one calculated from the combined-data item parameter estimates. The Spearman rank-order correlation coefficients for the two sets of IRT scores indicate a high degree of similarity in the percentile rankings of the scores (r ≥ .99); a person in the top 5% for self-esteem will most likely be in the top 5% regardless of which set of parameter estimates is used to calculate the score. Given these results, the RSE IRT item parameter estimates based on the combined data set were used to calculate all of the RSE IRT scores and standard errors.
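The practical-significance check described above can be sketched as follows: compare the percentile orderings of two sets of scores with a Spearman rank-order correlation and check agreement on top-5% membership. The score vectors in the sketch are simulated placeholders, not NLS data.

```python
# Sketch of the practical-significance check: compare the rank orderings of
# two sets of scores with a Spearman correlation and check top-5% agreement.
# The score vectors here are simulated placeholders, not NLS data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
scores_combined = rng.normal(size=1000)                      # combined-calibration scores
scores_group = scores_combined + rng.normal(scale=0.05, size=1000)  # group-specific scores

rho, _ = spearmanr(scores_combined, scores_group)
print(f"Spearman rho = {rho:.3f}")                           # close to 1: same rankings

top_combined = scores_combined >= np.quantile(scores_combined, 0.95)
top_group = scores_group >= np.quantile(scores_group, 0.95)
print(f"Top-5% agreement: {np.mean(top_combined == top_group):.3f}")
```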
Custom weighted scores and percentile ranks
Every NLS data release contains a set of cross-sectional weights. Using these weights provides a simple method for users to correct the raw data for the effects of over-sampling, clustering and differential base year participation. The custom weighted z-scores and percentile ranks were calculated using the RSE IRT scores and the cross-sectional weights for each survey wave, within each NLS cohort.
Table 2r. Rosenberg Self-Esteem Scale IRT item parameter estimates

| Item | Parameter | Estimate | S.E. |
|---|---|---|---|
| 2. I feel that I have a number of good qualities. | a | 2.21 | 0.04 |
| | b1 | -3.58 | 0.13 |
| | b2 | -2.86 | 0.07 |
| | b3 | -0.05 | 0.01 |
| 3. All in all, I am inclined to feel that I am a failure. | a | 2.81 | 0.05 |
| | b1 | -2.90 | 0.07 |
| | b2 | -2.05 | 0.03 |
| | b3 | -0.05 | 0.01 |
| 4. I am able to do things as well as most other people. | a | 2.08 | 0.04 |
| | b1 | -3.22 | 0.09 |
| | b2 | -2.19 | 0.04 |
| | b3 | 0.26 | 0.02 |
| 5. I feel I do not have much to be proud of. | a | 2.55 | 0.05 |
| | b1 | -2.62 | 0.06 |
| | b2 | -1.84 | 0.03 |
| | b3 | 0.05 | 0.01 |
| 6. I take a positive attitude toward myself. | a | 2.71 | 0.05 |
| | b1 | -2.92 | 0.08 |
| | b2 | -1.94 | 0.03 |
| | b3 | 0.27 | 0.01 |
| 7. On the whole, I am satisfied with myself. | a | 2.04 | 0.04 |
| | b1 | -3.06 | 0.08 |
| | b2 | -1.75 | 0.03 |
| | b3 | 0.62 | 0.02 |
| 8. I wish I could have more respect for myself. | a | 1.73 | 0.03 |
| | b1 | -2.74 | 0.07 |
| | b2 | -1.22 | 0.03 |
| | b3 | 0.72 | 0.02 |
| 9. I certainly feel useless at times. | a | 2.17 | 0.04 |
| | b1 | -2.72 | 0.06 |
| | b2 | -1.19 | 0.02 |
| | b3 | 0.60 | 0.02 |
| 10. At times I think I am no good at all. | a | 2.88 | 0.05 |
| | b1 | -2.71 | 0.06 |
| | b2 | -1.59 | 0.02 |
| | b3 | 0.20 | 0.01 |
References
Geary, R. (1949). Determination of linear relationships between systematic parts of variables with errors of observation the variances of which are unknown. Econometrica, 17.
Gray-Little, B., Williams, V.S.L., & Hancock, T.D. (1997). An item response theory analysis of the Rosenberg Self-Esteem Scale. Personality and Social Psychology Bulletin, 23, 443-451.
Judge, T.A., & Bono, J.E. (2001). Relationship of core self-evaluations traits – self-esteem, generalized self-efficacy, locus of control, and emotional stability – with job satisfaction and job performance: A meta-analysis. Journal of Applied Psychology, 86, 80-92.
Reiersol, O. (1945). Confluence analysis by means of instrumental sets of variables. Arkiv för Matematik, Astronomi och Fysik, 32.
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.
Rosenberg, M. (1979). Conceiving the self. New York: Basic Books.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No. 17.
Schmitt, D.P., & Allik, J. (2005). Simultaneous administration of the Rosenberg Self-Esteem Scale in 53 nations: Exploring the universal and culture-specific features of global self-esteem. Journal of Personality and Social Psychology, 89, 623-642.
Thissen, D. (2001). IRTLRDIF (Version 2.0b) [Computer software]. Chapel Hill, North Carolina: L. L. Thurstone Psychometric Laboratory.
Thissen, D., Chen, W., & Bock, R. D. (2003). Multilog (Version 7.03) [Computer software]. Lincolnwood, Illinois: Scientific Software International.