Reliability of Composite Scores With Multilevel Data

Hok Chio (Mark) Lai

University of Southern California

2021/05/05

Outline

Single-Level Reliability

Multilevel Data

  • Types of composites
  • Reliability indices

Longitudinal Data?

Reliability

  • Psychological scales are not perfect

  • Certain level of reliability needed

    • Statistical analyses are not trustworthy when the numbers are not consistent

Image credit: Reliability by Nick Youngson CC BY-SA 3.0 Alpha Stock Images

As we know, psychological scales are not perfect. A scale measuring people's happiness usually has random errors, and sometimes systematic errors.

APA Journal Article Reporting Standards (JARS)

Estimate and report values of reliability coefficients for the scores analyzed (i.e., the research's sample) (p. 7)

Similar recommendations can be found in numerous journal and methodological guidelines

Reliability

Before talking about multilevel reliability, I wanted to give just a quick introduction on the foundational work on reliability that this research relies on.

Classical Test Theory

Lord & Novick (1968)

Observed score = True score + Error

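$$Y = T + E$$

$T$ and $E$ independent, so

$$\sigma^2_Y = \sigma^2_T + \sigma^2_E$$

Reliability $$\rho = \frac{\sigma^2_T}{\sigma^2_Y} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E} = [\operatorname{Corr}(Y, T)]^2$$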

The presentation today will be drawing on the classical test theory framework, as described by the classic work by Lord & Novick, which is the framework we use to define things like alpha and omega reliability.

The idea in classical test theory is that, as shown on the graph here, the measured score of a person Y, say happiness, is determined by the person's true score or true happiness level, T, and some random error E.

Notice that T and E are drawn as circles because they are latent variables: they are not observable and can only be inferred from the data.
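
The identity between the variance ratio and the squared correlation can be checked with a small simulation. This is only an illustrative sketch: the sample size, the seed, and the variances of 4 and 1 are assumptions chosen for the example, not values from the talk.

```r
# Simulate classical test theory: Y = T + E, with T and E independent.
set.seed(2021)                                   # arbitrary seed
n_person   <- 1e5                                # illustrative sample size
true_score <- rnorm(n_person, mean = 0, sd = 2)  # var(T) = 4 (assumed)
error      <- rnorm(n_person, mean = 0, sd = 1)  # var(E) = 1 (assumed)
observed   <- true_score + error

rho_population <- 4 / (4 + 1)                     # sigma_T^2 / sigma_Y^2
rho_var_ratio  <- var(true_score) / var(observed) # sample variance ratio
rho_sq_corr    <- cor(observed, true_score)^2     # squared Corr(Y, T)

round(c(population = rho_population,
        var_ratio  = rho_var_ratio,
        sq_corr    = rho_sq_corr), 3)
```

In real data T is not observed, so neither sample quantity can be computed directly; that is why reliability has to be estimated from a measurement model or from the item covariances.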

Composite Reliability

(Essential) Tau Equivalence

That is the case when we only have one item to measure happiness. When we have multiple items, say four items here, usually what we do is to assume that each item is an indicator of the true happiness construct. This is the idea of a factor model as represented in this figure.

The tau equivalence assumption here means that the items all have loadings of 1, meaning that the true score variance in each item is the same.
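
To make the link from the tau-equivalent model to the variance of a sum score explicit, the derivation below is a sketch under assumptions that the slide leaves implicit: $p$ items with unit loadings, factor variance $\psi$, unique factors $\epsilon_k$ with variances $\theta_k$ that are uncorrelated with each other and with $\eta$, and an unweighted sum $Z$.

```latex
\begin{aligned}
Y_k &= \nu_k + \eta + \epsilon_k, \qquad V(\eta) = \psi, \qquad V(\epsilon_k) = \theta_k \\
Z &= \sum_{k=1}^{p} Y_k = \sum_{k} \nu_k + p\,\eta + \sum_{k} \epsilon_k \\
V(Z) &= p^2 \psi + \sum_{k} \theta_k
\end{aligned}
```

The first term is the true-score part of the composite variance and the second is the error part.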

α Reliability

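Variance of unweighted composite: $V(Z) = p^2 \psi + \sum_k \theta_k$

Reliability: $$\alpha = \frac{p^2 \psi}{p^2 \psi + \sum_k \theta_k}$$

Alternative form: $$\alpha = \frac{p}{p - 1} \left(1 - \frac{\operatorname{tr}(\boldsymbol{\Sigma})}{\mathbf{1}^\top \boldsymbol{\Sigma} \mathbf{1}}\right)$$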

There are different ways to justify the derivation of α.

The idea is that we can simply use the covariance matrix of the items to compute reliability, without the need to do a factor analysis.
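
As a rough illustration of that point, the sketch below computes α from a covariance matrix alone and compares it with the model-based form. The function name `alpha_from_cov()` and the parameter values (ψ = 1 and the four unique variances) are hypothetical, chosen only so the two forms can be compared under tau equivalence.

```r
# Coefficient alpha from an item covariance matrix:
# alpha = p / (p - 1) * (1 - tr(Sigma) / (1' Sigma 1))
alpha_from_cov <- function(Sigma) {
  p <- ncol(Sigma)
  p / (p - 1) * (1 - sum(diag(Sigma)) / sum(Sigma))
}

# Hypothetical tau-equivalent model: unit loadings, psi = 1
psi   <- 1
theta <- c(0.5, 0.8, 1.0, 1.2)           # assumed unique variances
p     <- length(theta)
Sigma <- matrix(psi, p, p) + diag(theta) # model-implied covariance matrix

# Model-based form: p^2 psi / (p^2 psi + sum of theta_k)
alpha_model <- p^2 * psi / (p^2 * psi + sum(theta))

c(from_cov = alpha_from_cov(Sigma), from_model = alpha_model)
```

With a sample covariance matrix, for example `cov(items)`, the same function gives the usual sample estimate of α.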

Composite Reliability

Congeneric

ω Reliability

  • True Score Variance: $V(T) = \left(\sum_k \lambda_k\right)^2 \psi$

  • Error Variance: $V(E) = \sum_k \theta_k$

Reliability: $$\omega = \frac{V(T)}{V(T) + V(E)}$$

Note: Reliability is a property of observed test scores (Z), not the latent scores (η)
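
A minimal sketch of the ω computation for a congeneric model follows. The loadings, factor variance, and unique variances are made-up values, and `omega_from_model()` is a hypothetical helper rather than a function from any package. α is computed from the model-implied covariance matrix for comparison.

```r
# Omega for the unweighted composite under a one-factor (congeneric) model
omega_from_model <- function(lambda, psi, theta) {
  v_true  <- sum(lambda)^2 * psi  # V(T) = (sum of loadings)^2 * psi
  v_error <- sum(theta)           # V(E) = sum of unique variances
  v_true / (v_true + v_error)
}

lambda <- c(0.9, 0.8, 0.7, 0.5)   # assumed loadings
psi    <- 1                       # assumed factor variance
theta  <- c(0.5, 0.8, 1.0, 1.2)   # assumed unique variances

omega <- omega_from_model(lambda, psi, theta)

# Alpha from the model-implied covariance matrix of the same items
Sigma <- psi * tcrossprod(lambda) + diag(theta)
p     <- length(lambda)
alpha <- p / (p - 1) * (1 - sum(diag(Sigma)) / sum(Sigma))

round(c(omega = omega, alpha = alpha), 3)
```

With unequal loadings and uncorrelated unique factors, α is slightly lower than ω for these values, which is the usual motivation for reporting ω when tau equivalence is doubtful.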

Multilevel Reliability

Lai (2021)

Example

2007 Trends in International Mathematics and Science Study (TIMSS; Williams et al., 2009)

  • 7,896 students (4th grade) from 515 schools

Positive attitudes toward math (PATM)

Item       Wording
AS4MAMOR   Would like to do more math
AS4MAENJ   I enjoy learning mathematics
AS4MALIK   I like math
AS4MABOR   Math is boring (reverse-coded)

With this kind of multilevel data, the analysis needs to account for the clustering structure, because clustering violates the assumption of independent observations; otherwise the statistical inferences will be incorrect.

Multilevel Reliability Not Consistently Reported

Kim et al. (2016): Only 54% reported reliability, among 39 articles using multilevel confirmatory factor analysis (MCFA)

  • Usually only one reliability reported for one scale

However, discussion on multilevel reliability is not new

Reliability Options in the Literature

Ignore clustering

  • Omit uncertainty (SE too small/CI too narrow)

    • Can use multilevel factor analysis (Raykov, et al., 2005; Raykov, et al., 2006)
  • No level-specific information

Level-Specific Reliability (Geldhof, et al., 2014)

"Unconstrained" Multilevel Factor Model

j indexes cluster

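$$Y_{ij} = \nu + \underbrace{\lambda^b \eta^b_j + \epsilon^b_j}_{\text{between model}} + \underbrace{\lambda^w_j \eta^w_{ij} + \epsilon^w_{ij}}_{\text{within model}}$$

  • Between- and within-level "reliability" coefficients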

No cross-level invariance

ϵ is the uniqueness, separated into the within and the between level

For the TIMSS data

  • Est $\tilde{\omega}^w$ = .857, 95% CI [.849, .863]

  • Est $\tilde{\omega}^b$ = .977, 95% CI [.964, .987]

Not uncommon in the literature . . .

  • Positive and negative affects: $\tilde{\omega}^b$ = .94 to .97 (Rush, et al., 2014)

  • Instructional Skills Questionnaire: $\tilde{\alpha}^b$ from .90 to .99 (Knol, et al., 2016)

  1. Repeated measures within persons

  2. Multiple factors in ISQ, Team from Netherlands

What Went Wrong?

$Y_1^b$ and $Y_1^w$ are not observed scores!

Observed school mean ≠ True school mean (unless $n_j$ is large; Raudenbush, et al., 2002)

Lai, M. H. C. (2021). "Composite reliability of multilevel data: It’s about observed scores and construct meanings." In: Psychological Methods 26 (1). DOI: 10.1037/met0000287.
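
The gap between observed and true school means can be seen in a short simulation. It is a sketch with assumed values (5,000 schools, 15 students per school, between-school variance 0.2, within-school variance 1); none of these numbers come from the TIMSS data.

```r
# Observed cluster means contain sampling error unless clusters are large.
set.seed(2021)      # arbitrary seed
n_school  <- 5000   # many clusters so the sample quantities are stable
n_student <- 15     # students per school (assumed, balanced)
tau2      <- 0.2    # variance of true school means (assumed)
sigma2    <- 1.0    # variance of students within schools (assumed)

true_mean <- rnorm(n_school, sd = sqrt(tau2))
school_id <- rep(seq_len(n_school), each = n_student)
score     <- true_mean[school_id] +
  rnorm(n_school * n_student, sd = sqrt(sigma2))
obs_mean  <- as.vector(tapply(score, school_id, mean))

# Expected variance of observed means: tau2 + sigma2 / n
var_expected <- tau2 + sigma2 / n_student
rel_expected <- tau2 / var_expected          # reliability of observed means
rel_observed <- cor(obs_mean, true_mean)^2   # squared correlation with truth

round(c(var_observed = var(obs_mean), var_expected = var_expected,
        rel_observed = rel_observed, rel_expected = rel_expected), 3)
```

Even though nothing but sampling of students is added here, the observed means are far from perfectly reliable at this cluster size, and the gap shrinks only as `n_student` grows.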

What Are the Observed Scores With Multilevel Analysis?

  • Raw/overall composite ($Z_{ij}$)
  • School means of the composite (cluster means; $Z^b_j$)
  • Student deviations of the composite (cluster-mean centered; $Z^w_{ij} = Z_{ij} - Z^b_j$)

IDSCHOOL  AS4MAMOR  AS4MAENJ  AS4MALIK  AS4MABORr  Z  Zb      Zw
1         2         2         1         2          7  6.5000   0.5000
1         2         1         1         1          5  6.5000  -1.5000
1         2         1         1         1          5  6.5000  -1.5000
1         2         1         2         1          6  6.5000  -0.5000
1         1         1         1         1          4  6.5000  -2.5000
2         3         2         2         2          9  6.5625   2.4375
2         1         2         2         1          6  6.5625  -0.5625
2         1         1         1         1          4  6.5625  -2.5625
2         3         2         1         1          7  6.5625   0.4375
2         2         2         3         1          8  6.5625   1.4375
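
The three composites can be computed with a few lines of base R. The sketch below retypes only the ten rows displayed in the table, so its school means (5.4 and 6.8) differ from the `Zb` values shown there, which are presumably based on every student in each school. The data frame name `dat` is an assumption for the example.

```r
# The ten rows displayed in the table (a subset of each school's students)
dat <- data.frame(
  IDSCHOOL  = rep(1:2, each = 5),
  AS4MAMOR  = c(2, 2, 2, 2, 1, 3, 1, 1, 3, 2),
  AS4MAENJ  = c(2, 1, 1, 1, 1, 2, 2, 1, 2, 2),
  AS4MALIK  = c(1, 1, 1, 2, 1, 2, 2, 1, 1, 3),
  AS4MABORr = c(2, 1, 1, 1, 1, 2, 1, 1, 1, 1)
)
items <- c("AS4MAMOR", "AS4MAENJ", "AS4MALIK", "AS4MABORr")

dat$Z  <- rowSums(dat[items])        # raw composite Z_ij
dat$Zb <- ave(dat$Z, dat$IDSCHOOL)   # cluster mean; ave() defaults to mean
dat$Zw <- dat$Z - dat$Zb             # cluster-mean centered deviation

dat[c("IDSCHOOL", "Z", "Zb", "Zw")]
```

By construction the `Zw` values sum to zero within each school, so `Zb` and `Zw` carry the between and within parts of `Z` separately.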

Sampling Error of School Means

Multilevel α and ω Reliability Indices

Variance Components in Reliability Coefficients

Within: $$\frac{V(T^w)}{V(T^w) + V(E^w)}$$

Between: $$\frac{V(T^b)}{V(T^b) + V(E^b) + \underbrace{[V(T^w) + V(E^w)] / n}_{\text{sampling error}}}$$

Raw: $$\frac{V(T^w) + V(T^b)}{V(T^w) + V(E^w) + V(T^b) + V(E^b)}$$
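
The three ratios translate directly into a small function. This is a sketch with hypothetical variance components (not estimates from any data set) and a hypothetical function name, `composite_reliability()`; it shows how only the between-level coefficient depends on the cluster size `n`.

```r
# Reliability of within, between, and raw composites from variance components
composite_reliability <- function(v_tw, v_ew, v_tb, v_eb, n) {
  c(within  = v_tw / (v_tw + v_ew),
    # sampling error of the cluster mean enters the between denominator
    between = v_tb / (v_tb + v_eb + (v_tw + v_ew) / n),
    raw     = (v_tw + v_tb) / (v_tw + v_ew + v_tb + v_eb))
}

cluster_sizes <- c(5, 15, 50, 500)   # illustrative cluster sizes
rel <- sapply(cluster_sizes, function(n) {
  composite_reliability(v_tw = 4, v_ew = 1, v_tb = 0.5, v_eb = 0.05, n = n)
})
colnames(rel) <- paste0("n=", cluster_sizes)
round(rel, 3)
```

As `n` grows, the between-level value approaches `v_tb / (v_tb + v_eb)`, the quantity obtained when the sampling-error term is left out of the denominator.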

Extensions of α

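First, obtain estimated covariance matrices at the between ($\boldsymbol{\Sigma}^b$) and the within ($\boldsymbol{\Sigma}^w$) levels

$$\begin{aligned}
\alpha^{2l} &= \frac{p}{p - 1} \left(\frac{\sum_{k \neq k'} (\sigma^b_{kk'} + \sigma^w_{kk'})}{\mathbf{1}^\top \boldsymbol{\Sigma}^b \mathbf{1} + \mathbf{1}^\top \boldsymbol{\Sigma}^w \mathbf{1}}\right) \\
\alpha^b &= \frac{p}{p - 1} \left(\frac{\sum_{k \neq k'} \sigma^b_{kk'}}{\mathbf{1}^\top \boldsymbol{\Sigma}^b \mathbf{1} + \mathbf{1}^\top \boldsymbol{\Sigma}^w \mathbf{1} / \tilde{n}}\right) \\
\alpha^w &= \frac{p}{p - 1} \left(\frac{\sum_{k \neq k'} \sigma^w_{kk'}}{\mathbf{1}^\top \boldsymbol{\Sigma}^w \mathbf{1}}\right)
\end{aligned}$$

$\tilde{n}$ = harmonic mean of cluster sizes (e.g., number of students in each school in the sample)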
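
Given the two covariance matrices, the three α coefficients are simple functions of their diagonal and off-diagonal sums. The sketch below assumes the matrices have already been estimated (that step is not shown) and uses made-up matrices and cluster sizes; `multilevel_alpha()` and `harmonic_mean()` are hypothetical helpers, not the implementation behind any package function.

```r
# Multilevel alpha from between- and within-level covariance matrices
multilevel_alpha <- function(Sigma_b, Sigma_w, n_tilde) {
  p        <- ncol(Sigma_w)
  off_diag <- function(S) sum(S) - sum(diag(S))  # sum of covariances, k != k'
  k        <- p / (p - 1)
  c(alpha2l = k * (off_diag(Sigma_b) + off_diag(Sigma_w)) /
      (sum(Sigma_b) + sum(Sigma_w)),
    alphab  = k * off_diag(Sigma_b) /
      (sum(Sigma_b) + sum(Sigma_w) / n_tilde),
    alphaw  = k * off_diag(Sigma_w) / sum(Sigma_w))
}

harmonic_mean <- function(x) length(x) / sum(1 / x)

# Hypothetical inputs: four items, four clusters
cluster_sizes <- c(10, 15, 20, 30)
n_tilde <- harmonic_mean(cluster_sizes)        # 16 for these sizes
Sigma_w <- matrix(0.5, 4, 4) + diag(0.5, 4)    # variances 1.0, covariances 0.5
Sigma_b <- matrix(0.1, 4, 4) + diag(0.02, 4)   # variances 0.12, covariances 0.1

round(multilevel_alpha(Sigma_b, Sigma_w, n_tilde), 3)
```

Setting `n_tilde` to a very large number drops the sampling-error term, which is one way to see how much the between-level coefficient depends on cluster size.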

Extensions of ω

Depends on the model

  • Cross-level invariance (Jak, et al., 2014; Mehta, et al., 2005)

Other Model Considerations

  • Unidimensionality

    • Can be assessed using parallel analysis on the between- and within- correlation matrices (Y. Ji et al., in preparation)
  • "Shared" construct (Stapleton, et al., 2016; Stapleton, et al., 2019)

    • E.g., Use aggregated student ratings to measure teacher qualities
    • Between-level composite may measure both teacher effectiveness + acquiescence

Which One(s) to Report?

  • If a variable is partitioned in a multilevel model (most likely an individual construct), all three ($\omega^{2l}$, $\omega^b$, $\omega^w$) should be reported

    • Cluster means and cluster-mean centered predictors

    • Outcome variable

R Function multilevel_reliability()

multilevel_reliability(
  timss_usa[c("AS4MAMOR", "AS4MAENJ", "AS4MALIK", "AS4MABORr")],
  id = timss_usa$IDSCHOOL
)
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
## $alpha
##   alpha2l    alphab    alphaw
## 0.8675110 0.6527865 0.8592387
##
## $alpha_ci
##              2.5%     97.5%
## alpha2l 0.8604497 0.8741798
## alphab  0.5802825 0.7046562
## alphaw  0.8517564 0.8663573
##
## $omega
##   omega2l    omegab    omegaw
## 0.8674844 0.6233126 0.8605125
##
## $omega_ci
##              2.5%     97.5%
## omega2l 0.8602266 0.8739374
## omegab  0.5501061 0.6761630
## omegaw  0.8529576 0.8672256
##
## $ncomp
##  within between
##       1       1

Longitudinal Data

Lai, F. Ji, & Chen (in preparation). Suggestions greatly appreciated.

Midlife in the United States

Data from MIDUS 2: Daily Stress Project, 2004-2009 (Ryff, et al., 2009)

  • 2,022 participants, 8 days each

  • Target construct: Positive affect

Item     Wording
b2dc24   Did you feel attentive?
b2dc25   Did you feel proud?
b2dc26   Did you feel active?
b2dc27   Did you feel confident?

  • Type of scores: raw composite, person means, person-mean centered

From MCFA

Est $\mathrm{ICC}(\eta) = .778$

Composite   Est α   95% CI         Est ω   95% CI
Raw         .832    [.820, .843]   .829    [.817, .841]
Within      .646    [.628, .664]   .645    [.625, .662]
Between     .862    [.849, .873]   .860    [.817, .872]

However, the model assumes no time-specific variances and covariances

Incorporating Time

Cross-Classified CFA (Jeon, et al., 2012; Asparouhov, et al., 2016)

Assuming cross-level invariance for an individual construct, with decomposition $\eta_{ti} = \eta^{(P)}_i + \eta^{(T)}_t + \eta^{(W)}_{ti}$

W = P × T + Error

Bayesian Estimation in Stan

With the MIDUS daily diary data, there is essentially no time-level variance

  • E.g., no day of participation effect

Reliability coefficients remain the same when modeling the time-level variance

Many Questions Remain

  1. Discrete indicators?

  2. Should we treat within-person and between-person constructs as parts of one construct?

    • Local homogeneity assumption (Ellis, et al., 1993)
  3. Time-varying reliability?

    • Martin & Rast (2020, PsyArXiv)
  4. Are there "shared" constructs at the person level?

    • E.g., response bias → item scores more correlated at the between level

Summary

Computing and reporting reliability information is important for multilevel data

  • Both cross-sectional and longitudinal

Reliability information is needed for raw, cluster means, and cluster-mean centered scores

The previous approach to between-level reliability overestimates reliability when cluster sizes are small

The nature of the target construct should be considered, as it has implications for how reliability is computed

  • E.g., "shared" constructs

Acknowledgement

  • Collaborators: Feng Ji (UC Berkeley) and Shi Chen (Northern Arizona University)

  • Oi-man Kwok and Hio Wa Mak for suggestions on the paper

Thanks!

Slides created via the R package xaringan.

References

Appelbaum, M. et al. (2018). "Journal article reporting standards for quantitative research in psychology". In: American Psychologist 73.1, pp. 3-25. DOI: 10.1037/amp0000191.

Asparouhov, T. et al. (2016). "General random effect latent variable modeling: Random subjects, items, contexts, and parameter". In: Advances in multilevel modeling for educational research: Addressing practical issues found in real-world applications. Charlotte, NC: Information Age, pp. 163-192.

Ellis, J. L. et al. (1993). "Local homogeneity in latent trait models. A characterization of the homogeneous monotone irt model". In: Psychometrika 58.3, pp. 417-429. DOI: 10.1007/bf02294649.

Geldhof, G. J. et al. (2014). "Reliability estimation in a multilevel confirmatory factor analysis framework". In: Psychological Methods 19.1, pp. 72-91. DOI: 10.1037/a0032138.

Jak, S. et al. (2014). "Measurement bias in multilevel data". In: Structural Equation Modeling: A Multidisciplinary Journal 21.1, pp. 31-39. DOI: 10.1080/10705511.2014.856694.

Jeon, M. et al. (2012). "Profile-likelihood approach for estimating generalized linear mixed models with factor structures". In: Journal of Educational and Behavioral Statistics 37.4, pp. 518-542. DOI: 10.3102/1076998611417628.

Knol, M. H. et al. (2016). "Measuring the quality of university lectures: Development and validation of the Instructional Skills Questionnaire (ISQ)". In: PLOS ONE 11.2. Ed. by D. S. Courvoisier, p. e0149163. DOI: 10.1371/journal.pone.0149163.

Lai, M. H. C. (2021). "Composite reliability of multilevel data: It’s about observed scores and construct meanings." In: Psychological Methods 26 (1). DOI: 10.1037/met0000287.

Mehta, P. D. et al. (2005). "People are variables too: Multilevel structural equations modeling". In: Psychological Methods 10.3, pp. 259-284. DOI: 10.1037/1082-989X.10.3.259.

Raudenbush, S. W. et al. (2002). Hierarchical linear models: Applications and data analysis methods. 2nd ed. Thousand Oaks, CA: Sage. ISBN: 076191904X.

Raykov, T. et al. (2006). "On multilevel model reliability estimation from the perspective of structural equation modeling". In: Structural Equation Modeling: A Multidisciplinary Journal 13.1, pp. 130-141. DOI: 10.1207/s15328007sem1301_7.

Raykov, T. et al. (2005). "Estimation of reliability for multiple-component measuring instruments in hierarchical designs". In: Structural Equation Modeling: A Multidisciplinary Journal 12.4, pp. 536-550. DOI: 10.1207/s15328007sem1204_2.

Rush, J. et al. (2014). "Differences in within- and between-person factor structure of positive and negative affect: Analysis of two intensive measurement studies using multilevel structural equation modeling.". In: Psychological Assessment 26.2, pp. 462-473. DOI: 10.1037/a0035666.

Ryff, C. D. et al. (2009). Midlife in the United States (MIDUS 2): Daily Stress Project, 2004-2009: Version 2. type: dataset. DOI: 10.3886/ICPSR26841.V2.

Stapleton, L. M. et al. (2019). "Models to Examine the Validity of Cluster-Level Factor Structure Using Individual-Level Data". In: Advances in Methods and Practices in Psychological Science 2.3, pp. 312-329. DOI: 10.1177/2515245919855039.

Stapleton, L. M. et al. (2016). "Construct meaning in multilevel settings". In: Journal of Educational and Behavioral Statistics 41.5, pp. 481-520. DOI: 10.3102/1076998616646200.

Influence of ICC and Cluster Size

Assume cross-level invariance holds and $\omega^{2l} = .8$

$$\mathrm{ICC}(\eta) = \frac{\psi^b}{\psi^b + \psi^w}$$
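
One way to see how ICC and cluster size jointly drive between-level reliability is the worked case below. It is a sketch that adds assumptions beyond those stated on the slide: the total true-score variance of the composite is scaled to 1, all unique variance sits at the within level ($V(E^b) = 0$), and every cluster has the same size $n$. Under cross-level invariance the composite true-score variances are then proportional to $\psi^b$ and $\psi^w$.

```latex
\begin{aligned}
V(T^b) &= \mathrm{ICC}(\eta), \qquad V(T^w) = 1 - \mathrm{ICC}(\eta), \qquad V(E^b) = 0 \\
\omega^{2l} &= \frac{1}{1 + V(E^w)} = .8 \;\Rightarrow\; V(E^w) = .25 \\
\omega^b &= \frac{\mathrm{ICC}(\eta)}{\mathrm{ICC}(\eta) + [1.25 - \mathrm{ICC}(\eta)] / n} \\
\mathrm{ICC}(\eta) = .1,\; n = 10 &: \quad \omega^b = \frac{.1}{.1 + .115} \approx .47 \\
\mathrm{ICC}(\eta) = .1,\; n = 25 &: \quad \omega^b = \frac{.1}{.1 + .046} \approx .68
\end{aligned}
```

Under these assumptions a scale with raw-composite reliability of .8 can still give unreliable cluster means when the ICC is low and clusters are small.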

Relation to Generalizability Theory

Most meaningful with systematic time effects

Cranford, et al. (2006): generalizability coefficients for diary studies

Differences:

  • Fixed vs. Random item facet (in estimation)

  • Relax the essential parallel test assumption

    • Item-specific loadings and uniqueness
  • Test invariance assumptions

  • Flexible SEM modeling

Not the case for the MIDUS data, as everyone starts on a different day
