Region of measurement equivalence (Y. Zhang, Lai, & Palardy)
Impact on selection/diagnostic/classification accuracy (Lai & Y. Zhang)
blavaan
That the same construct is measured in the same way
Examples
Same distance, same number in different states
Blood pressure not systematically higher or lower
Instead of requiring the same score, it requires the same probability distribution
E.g.,
PsycINFO Keyword (2000 Jan 1 to 2020 Dec 31):
ti("measurement invariance" OR "measurement equivalence" OR "factorial invariance" OR "differential item functioning") OR ab("measurement invariance" OR "measurement equivalence" OR "factorial invariance" OR "differential item functioning")
For subsequent analyses
Use latent variable models with partial invariance
Or factor scores (e.g., McNeish & Wolf, 2020; Curran et al., 2009)
Nil hypothesis (Cohen, 1994)
"My work in power analysis led me to realize that the nil hypothesis is always false." (p. 1000)
Possible reasons
False positives
Inconsistent methods and cutoffs used to determine invariance
Gender is confounded with some other sample characteristics that differ across studies
Or, the invariance hypothesis is always rejected when sample size is large
Generally, each study gives binary results on invariance/noninvariance
Hard to summarize/synthesize the invariance literature
Practical implications?
Like repeating the early history of significance testing, when each study reported only a significant/non-significant result
$d_{\text{MACS},j} = \frac{1}{SD_{jp}}\sqrt{\int \left[E(Y_{jR} - Y_{jF} \mid \eta)\right]^2 f_F(\eta)\, d\eta}$ (F = focal group, R = reference group; $SD_{jp}$ = pooled SD of item $j$)
How much the noninvariance leads to group differences in the item scores, in standardized units
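As a minimal numerical sketch (in Python rather than the R tooling used elsewhere in these slides), the dMACS integral can be approximated by quadrature under a linear factor model with a normal focal-group latent distribution; all parameter values below are hypothetical.

```python
import numpy as np

def d_macs(nu_r, lam_r, nu_f, lam_f, sd_pooled, focal_mean=0.0, focal_sd=1.0):
    """dMACS for one item under a linear factor model:
    E(Y_R - Y_F | eta) = (nu_r - nu_f) + (lam_r - lam_f) * eta,
    integrated over the focal-group latent density f_F(eta)
    (assumed normal here -- an assumption of this sketch)."""
    eta = np.linspace(focal_mean - 6 * focal_sd, focal_mean + 6 * focal_sd, 2001)
    step = eta[1] - eta[0]
    f_eta = np.exp(-0.5 * ((eta - focal_mean) / focal_sd) ** 2) / (focal_sd * np.sqrt(2 * np.pi))
    diff_sq = ((nu_r - nu_f) + (lam_r - lam_f) * eta) ** 2
    return np.sqrt(np.sum(diff_sq * f_eta) * step) / sd_pooled

# Hypothetical values: a 0.1 intercept difference, equal loadings
print(round(d_macs(nu_r=0.3, lam_r=0.8, nu_f=0.2, lam_f=0.8, sd_pooled=1.0), 3))  # 0.1
```

With equal loadings, the expected item difference is constant at $\Delta\nu$, so dMACS reduces to $|\Delta\nu|/SD_{jp}$, which the quadrature reproduces.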
Only 31% provided sufficient information to compute dMACS
Information commonly missing: loadings, intercepts, item SDs
Invariance results are highly inconsistent across studies (at least for binary conclusions)
Little synthesis across invariance studies
Implication for future use of the test is unclear
Compute effect sizes in invariance studies
Report group-specific parameters, or sufficient statistics (e.g., means and covariance matrix)
Usually, test scores, not item scores, are used for
Operationalization of constructs in research
Comparing individuals on psychological constructs
Making selection/diagnostic/classification decisions
Noninvariance in items may cancel out
$P(T \mid \eta, G = g) = P(T \mid \eta)$ for all $g$, $\eta$
Educational Longitudinal Study (ELS: 2002; U.S. Department of Education, 2004)
5-Item Math-Specific Self-Efficacy
(1 = Almost Never to 4 = Almost Always)
dMACS = 0.097 to 0.11 for items 2, 4, 5
Test level: $d = \frac{1}{SD_{Tp}}\sqrt{\int \left[E(T_R - T_F \mid \eta)\right]^2 f_F(\eta)\, d\eta} = .026$
Limitation: the overall ES may be small, yet the test may still be unsuitable for a specific subpopulation
Null hypothesis significance testing does not allow accepting the null hypothesis (e.g., Yuan & Chan, 2016)
Accept $\delta = 0$ practically if $P(\delta \in \text{ROPE})$ is high
$\delta$: some measure of difference between groups
ROPE (Berger & Hsu, 1996; Kruschke, 2011): values inside $[\delta_{0L}, \delta_{0U}]$ considered practically equivalent to null
Reject H0: $|\delta| \geq \epsilon$ for some small $\epsilon$ when the interval estimate of $\delta$ falls entirely within $(-\epsilon, \epsilon)$
E.g., in the pharmaceutical industry, bioequivalence testing of whether two drugs are equivalent, so a drug can be approved as generic
$\delta = E(T_2 - T_1 \mid \eta) = \Bigl(\sum_j \Delta\nu_j\Bigr) + \Bigl(\sum_j \Delta\lambda_j\Bigr)\eta$
Determining ROME
Practically invariant if the HPDI (highest posterior density interval) of $E(T_2 - T_1 \mid \eta)$ is within the ROME for all practical levels of $\eta$
If not, can identify whether the test is practically invariant for some levels of η
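A sketch of this ROME check, assuming posterior draws of the summed intercept and loading differences are available (e.g., from an MCMC fit); the draws below are simulated placeholders, and a central credible interval stands in for the HPDI:

```python
import numpy as np

rng = np.random.default_rng(1)

def within_rome(dnu_draws, dlam_draws, eta_grid, rome=(-0.2, 0.2), cred=0.95):
    """For each latent level eta, check whether the central credible interval
    of delta(eta) = sum(dnu) + sum(dlam) * eta lies inside the ROME.
    (A central interval is used as a simple stand-in for the HPDI.)"""
    lo_q = (1 - cred) / 2
    delta = dnu_draws[:, None] + dlam_draws[:, None] * eta_grid[None, :]
    lo = np.quantile(delta, lo_q, axis=0)
    hi = np.quantile(delta, 1 - lo_q, axis=0)
    return (lo > rome[0]) & (hi < rome[1])

# Hypothetical posterior draws for the summed parameter differences
dnu = rng.normal(0.05, 0.02, 4000)   # stand-in for sum_j delta-nu_j
dlam = rng.normal(0.00, 0.02, 4000)  # stand-in for sum_j delta-lambda_j
eta = np.linspace(-2, 2, 5)          # "practical" range of the latent trait
print(within_rome(dnu, dlam, eta))
```

Returning the check per level of $\eta$ (rather than one overall verdict) is what lets one see whether the test is practically invariant only for some regions of the latent trait.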
Estimation with blavaan (Merkle et al.)
Black horizontal lines indicate ROME
Estimation with Mplus (ESTIMATOR = BAYES)
Tests are commonly used to make binary decisions
Adverse Impact (AI) ratio $= \dfrac{PS_F}{PS_R}$, where $PS_g$ is the proportion selected in group $g$
SR (the success ratio) is also called the positive predictive value
Millsap & Kwok (2004); Stark et al. (2004)
When selection based on multiple subtests
E.g., 20-item mini-IPIP (International Personality Item Pool)
Agreeableness, Conscientiousness, Extraversion, Neuroticism, Openness to Experience
Latent composite: ζ=wη
Observed composite: Z=cy
$\begin{pmatrix} Z_g \\ \zeta_g \end{pmatrix} \sim N\left(\begin{bmatrix} c\nu_g + c\Lambda_g\alpha_g \\ w\alpha_g \end{bmatrix},\; \begin{bmatrix} c\Lambda_g\Psi_g\Lambda_g'c' + c\Theta_g c' & c\Lambda_g\Psi_g w' \\ w\Psi_g\Lambda_g'c' & w\Psi_g w' \end{bmatrix}\right)$
Selection weights based on previous studies on criterion validity of the mini-IPIP
When using the mini-IPIP to select the top 10% of candidates . . .
AI ratio = 0.93: slight disadvantage for males due to noninvariance
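Under a normal model for the observed composites, the AI ratio can be computed directly from the group-specific distributions. A Python sketch with hypothetical means and SDs, selecting the top 10% of the pooled applicant pool (the mini-IPIP analysis itself used the full latent-observed model above):

```python
from statistics import NormalDist

def ai_ratio(mu_r, sd_r, mu_f, sd_f, prop_f=0.5, top=0.10):
    """Adverse impact ratio PS_F / PS_R when observed composites are normal
    in each group and the cutoff selects the top `top` proportion of the
    pooled (mixture) distribution. All numbers are hypothetical."""
    lo = min(mu_r, mu_f) - 6 * max(sd_r, sd_f)
    hi = max(mu_r, mu_f) + 6 * max(sd_r, sd_f)
    for _ in range(200):  # bisection for the pooled selection cutoff
        mid = (lo + hi) / 2
        p_above = ((1 - prop_f) * (1 - NormalDist(mu_r, sd_r).cdf(mid))
                   + prop_f * (1 - NormalDist(mu_f, sd_f).cdf(mid)))
        lo, hi = (mid, hi) if p_above > top else (lo, mid)
    cutoff = (lo + hi) / 2
    ps_r = 1 - NormalDist(mu_r, sd_r).cdf(cutoff)
    ps_f = 1 - NormalDist(mu_f, sd_f).cdf(cutoff)
    return ps_f / ps_r

print(round(ai_ratio(0, 1, 0, 1), 2))     # 1.0 -- identical composite distributions
print(round(ai_ratio(0, 1, -0.1, 1), 2))  # < 1: the shifted focal composite is selected less often
```

Even a small noninvariance-induced shift in the focal-group composite pushes the AI ratio below 1, mirroring the 0.93 result above.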
Despite tremendous growth in invariance research, the practical implications of empirical studies are unclear
In my opinion, reporting measures of practical significance is a step forward for understanding how noninvariance matters in practice
Measures of practical significance can be used to set thresholds for practical invariance
More research efforts needed to translate and synthesize invariance research
Synthesizing effect sizes from invariance research
ROME with alignment/Bayesian approximate invariance
Berger, R. L., & Hsu, J. C. (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science, 11(4), 283–319. https://doi.org/10.1214/ss/1032280304
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997
Curran, P. J., & Hussong, A. M. (2009). Integrative data analysis: The simultaneous analysis of multiple data sets. Psychological Methods, 14(2), 81–100. https://doi.org/10.1037/a0015914
Gunn, H. J., Grimm, K. J., & Edwards, M. C. (2020). Evaluation of six effect size measures of measurement non-invariance for continuous outcomes. Structural Equation Modeling: A Multidisciplinary Journal, 27(4), 503–514. https://doi.org/10.1080/10705511.2019.1689507
Kruschke, J. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6(3), 299–312. https://doi.org/10.1177/1745691611406925
McNeish, D., & Wolf, M. G. (2020). Thinking twice about sum scores. Behavior Research Methods, 52, 2287–2305. https://doi.org/10.3758/s13428-020-01398-0
Meade, A. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95(4), 728–743. https://doi.org/10.1037/a0018966
Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13(2), 127–143. https://doi.org/10.1016/0883-0355(89)90002-5
Millsap, R. E., & Kwok, O.-M. (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9(1), 93–115. https://doi.org/10.1037/1082-989X.9.1.93
Nye, C. D., Bradburn, J., Olenick, J., Bialko, C., & Drasgow, F. (2019). How big are my effects? Examining the magnitude of effect sizes in studies of measurement equivalence. Organizational Research Methods, 22(3), 678–709. https://doi.org/10.1177/1094428118761122
Nye, C. D., & Drasgow, F. (2011). Effect size indices for analyses of measurement equivalence: Understanding the practical importance of differences between groups. Journal of Applied Psychology, 96(5), 966–980. https://doi.org/10.1037/a0022955
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item (functioning and differential) test functioning on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89(3), 497–508. https://doi.org/10.1037/0021-9010.89.3.497
U.S. Department of Education, National Center for Education Statistics. (2004). Education longitudinal study of 2002: Base year data file user’s manual, by Steven J. Ingels, Daniel J. Pratt, James E. Rogers, Peter H. Siegel, and Ellen S. Stutts. Washington, DC.
Yuan, K.-H., & Chan, W. (2016). Measurement invariance via multigroup SEM: Issues and solutions with chi-square-difference tests. Psychological Methods, 21(3), 405–426. https://doi.org/10.1037/met0000080
Questions?
My email: hokchiol@usc.edu
Slides created via the R package xaringan.
The chakra comes from remark.js, knitr, and R Markdown.