What to Do If Measurement Invariance Does Not Hold? Let's Look at the Practical Significance

Hok Chio (Mark) Lai
University of Southern California
September 24, 2021

Overview

What is measurement invariance?

An attempt to synthesize invariance studies on a depression scale across genders (G. Zhang, Yue, & Lai)

Effect size for noninvariance

Practical invariance at the test score level

Region of measurement equivalence (Y. Zhang, Lai, & Palardy)
Impact on selection/diagnostic/classification accuracy (Lai & Y. Zhang)


Background
- Graph of measurement invariance
  - Numerical representation
  - Weighing scale on the moon vs. on the earth
- A lot of invariance studies
  - Problem: Are invariance results interpretable?
  - Problem: How do we make use of invariance results?
- Example: CES-D
  - All items were found noninvariant in at least one study
  - Studies generally do not reference earlier studies on the same/similar research questions
  - Very few consequences for subsequent practice
    - Most papers still use the same CES-D score for regression analyses
Adjusted estimations and inferences
- Using traditional SEM framework
  - Also cite alignment papers
  - CONS: A complex model for a simple research question
  - CONS: Interpretational confounds/propagation of errors
    - Use examples from Levy (2017)
  - CONS:
- Using two-step approaches
  - Integrative data analysis
    - Does not incorporate measurement error in factor scores
  - Factor score regression (not yet studied for invariance)
- Two-Stage Path Analysis
  - Step 1: factor score estimation, with reliability information
    - CFA, IRT, network quantities, as long as information on reliability is available
  - Step 2: path analysis with definition variables
    - Reliability adjustment (much earlier in the literature)
    - But, individual-specific reliability is allowed, using definition variables
    - Can be done in OpenMx and Mplus, and potentially in Bayesian engines (not sure in blavaan)
    - Requires a normality assumption on the conditional sampling/posterior distributions of the factor scores
      - May not hold for IRT models in small samples
  - Simulation results

Measurement Invariance

That the same construct is measured in the same way

Examples

Same distance, same number in different states
- cf. kilometer vs. mile
Blood pressure not systematically higher or lower
- cf. blood pressure machine in grocery vs. in hospital


Psychological Measurement


With Measurement Error


Instead of requiring the same score, it requires the same probability distribution

Formal Definition (Mellenbergh, 1989)

For $y_{j}$, the score of the $j$th item,

$P(y_{j} \mid η, G = g) = P(y_{j} \mid η)$ for all $g$, $η$

Violation of Measurement Invariance (aka Non-Invariance)

E.g.,

$η$ = true depression level
$y$ = scores on the Center for Epidemiologic Studies Depression Scale (CES-D)
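
As a minimal sketch of what a violation looks like, assume a linear factor model for a single item, $E(y \mid η, G = g) = ν_{g} + λ_{g} η$. The intercepts, loadings, and group labels below are hypothetical and do not come from any CES-D study. The sketch compares only expected item scores, which is a simplification of the full distributional definition: under invariance two people with the same $η$ have the same expected score regardless of group, and with an intercept difference they do not.

```python
# Hypothetical single-item linear factor model:
#   E(y | eta, g) = nu_g + lambda_g * eta
# All parameter values are made up for illustration.

def expected_item_score(eta, intercept, loading):
    """Expected item score given the latent level eta."""
    return intercept + loading * eta

# Invariant item: both groups share the same (intercept, loading).
invariant = {"group_1": (1.0, 0.8), "group_2": (1.0, 0.8)}
# Noninvariant item: group_2 has a higher intercept.
noninvariant = {"group_1": (1.0, 0.8), "group_2": (1.3, 0.8)}

def score_gap(params, eta):
    """Expected score in group_2 minus group_1 at the same eta."""
    nu1, lam1 = params["group_1"]
    nu2, lam2 = params["group_2"]
    return (expected_item_score(eta, nu2, lam2)
            - expected_item_score(eta, nu1, lam1))

for eta in (-1.0, 0.0, 1.0):
    print(eta,
          round(score_gap(invariant, eta), 3),
          round(score_gap(noninvariant, eta), 3))
```

With these made-up values the gap is zero for the invariant item and a constant intercept difference for the noninvariant item at every level of $η$.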


Invariance Research Is Popular

PsycINFO Keyword (2000 Jan 1 to 2020 Dec 31):

ti("measurement invariance" OR "measurement equivalence" OR 
"factorial invariance" OR "differential item functioning") OR 
ab("measurement invariance" OR "measurement equivalence" OR 
"factorial invariance" OR "differential item functioning")


What Should We Do When Invariance Does Not Hold?

For subsequent analyses

Use latent variable models with partial invariance
Or factor scores (e.g., McNeish & Wolf, 2020; Curran et al., 2009)
- Though most researchers still use composite scores


Can We Still Use the Test in the Future?

Discard the test?
Delete noninvariant items?
Should it depend on the size of noninvariance?
Does invariance generally hold in psychological measurement?
Or even in physical measurement?
- Could two thermostats of the same model show a systematic difference of 0.01 degree? 0.0001 degree?

Can the Invariance Hypothesis Be True in Practice?

Nil hypothesis (Cohen, 1994)

$H_{0}$: $ω = 0$
- I.e., absolutely zero difference in measurement parameters across groups

"My work in power analysis led me to realize that the nil hypothesis is always false." (p. 1000)


Case Study

CES-D Across Genders

G. Zhang, Yue, & Lai (2021 IMPS presentation)

32 articles conducting invariance tests on CES-D across genders


EVERY ONE of the 20 CES-D items was found noninvariant at least once

Possible reasons

False positives
Inconsistent methods and cutoffs used to determine invariance
Gender is confounded with some other sample characteristics that differ across studies
Or, the invariance hypothesis is always rejected when sample size is large
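
The last reason can be illustrated with a rough power calculation. The sketch below uses a generic two-sided, two-sample z-test on a standardized difference as a stand-in for an invariance test. That is an assumption made for illustration: actual invariance tests are usually likelihood-ratio or similar tests. The true difference of 0.02 SD is a hypothetical value chosen to be practically negligible.

```python
from statistics import NormalDist

def power_two_sample_z(delta, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    delta is the true standardized difference (in SD units);
    both groups are assumed to have n_per_group observations.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Noncentrality: delta divided by the standard error sqrt(2 / n).
    ncp = delta * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)

# Hypothetical, practically negligible difference of 0.02 SD.
for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(n, round(power_two_sample_z(0.02, n), 3))
```

Power stays near the nominal alpha level in small samples and approaches 1 as the per-group sample size grows, even though the difference itself never changes.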


Case Study: CES-D Across Genders

Generally, each study gives binary results on invariance/noninvariance

Hard to summarize/synthesize the invariance literature
Practical implications?
- Dropping the "crying spell" item?
- Different cutoffs for screening?


This is like repeating the history of early research, when each study was reported only as significant or non-significant

Effect Size

Meade (2010) "A taxonomy of effect size measures"

Nye & Drasgow (2011): more comparable to the popular Cohen's $d$

(F = focal group, R = reference group)

Extensions in Nye et al. (2019) and Gunn et al. (2020) on signed differences


How much does the noninvariance lead to group differences in the item scores, in standardized units?
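
A sketch of how such an effect size could be computed, assuming the form usually given for the Nye & Drasgow index for item $j$: $d_{MACS} = \frac{1}{SD_{jP}} \sqrt{\int (\hat{Y}_{jR} - \hat{Y}_{jF})^{2} f_{F}(η) \, dη}$, where $\hat{Y}_{jg} = ν_{jg} + λ_{jg} η$ is the expected item score in group $g$, $f_{F}$ is the latent density in the focal group, and $SD_{jP}$ is the pooled item SD. The code further assumes a normal latent distribution in the focal group, which gives the integral a closed form, and takes the pooled SD as an input. The function name and all numeric values are hypothetical.

```python
import math

def d_macs(nu_r, lam_r, nu_f, lam_f, mean_f, var_f, sd_pooled):
    """d_MACS-style effect size for one item under a linear factor model.

    Assumes eta ~ Normal(mean_f, var_f) in the focal group, so the
    integral of the squared difference in expected item scores
    has a closed form.
    """
    d_nu = nu_r - nu_f     # intercept difference (R minus F)
    d_lam = lam_r - lam_f  # loading difference (R minus F)
    # E[(d_nu + d_lam * eta)^2]
    #   = (d_nu + d_lam * mean_f)^2 + d_lam^2 * var_f
    mean_sq_diff = (d_nu + d_lam * mean_f) ** 2 + d_lam ** 2 * var_f
    return math.sqrt(mean_sq_diff) / sd_pooled

# Hypothetical item: intercept differs by 0.3, loadings are equal.
print(d_macs(nu_r=1.0, lam_r=0.8, nu_f=1.3, lam_f=0.8,
             mean_f=0.0, var_f=1.0, sd_pooled=1.0))
```

Because of the square root of a squared difference, this version is always non-negative; signed extensions would need a different formula.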

Effect Sizes in the CES-D Invariance Synthesis

Only 31% provided sufficient information to compute effect sizes
- For the ones that can be computed, effect sizes ranged from -.20 to .97
Information commonly missing: loadings, intercepts, item SDs


Summary

Invariance results are highly inconsistent across studies (at least for binary conclusions)
Little synthesis across invariance studies
Implication for future use of the test is unclear

Recommendations

Compute effect sizes in invariance studies
- Need software implementation
Report group-specific parameters, or sufficient statistics (e.g., means and covariance matrix)
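
As a small illustration of the second recommendation, the sketch below computes group-specific sample sizes, item means, and covariance matrices from raw item responses using only the Python standard library. The data, group labels, and function names are hypothetical.

```python
def mean_vector(rows):
    """Column means for a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator)."""
    n = len(rows)
    means = mean_vector(rows)
    p = len(means)
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(p)]
            for i in range(p)]

def summarize_by_group(data):
    """Sufficient statistics for each group: n, means, covariance matrix."""
    return {group: {"n": len(rows),
                    "means": mean_vector(rows),
                    "cov": covariance_matrix(rows)}
            for group, rows in data.items()}

# Hypothetical responses (rows = respondents, columns = items).
data = {
    "group_1": [[0, 1], [1, 2], [2, 3]],
    "group_2": [[1, 1], [2, 1], [3, 4]],
}
summary = summarize_by_group(data)
for group, stats in summary.items():
    print(group, stats)
```

Reporting these three quantities per group is enough for later researchers to refit linear factor models and compute effect sizes without access to the raw data.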
