As we know, psychological scales are not perfect. A scale measuring people's happiness usually has random error, and sometimes systematic error.
Psychological scales are not perfect
Certain level of reliability needed
Image credit: Reliability by Nick Youngson CC BY-SA 3.0 Alpha Stock Images
"Estimate and report values of reliability coefficients for the scores analyzed (i.e., the researcher's sample)" (Appelbaum, et al., 2018, p. 7)
Similar recommendations can be found in numerous journal and methodological guidelines
Before talking about multilevel reliability, I wanted to give just a quick introduction on the foundational work on reliability that this research relies on.
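Lord & Novick (1968)
Observed score = True score + Error
$Y = T + E$
T and E independent, so
$\sigma^2_Y = \sigma^2_T + \sigma^2_E$
Reliability $\rho = \dfrac{\sigma^2_T}{\sigma^2_Y} = \dfrac{\sigma^2_T}{\sigma^2_T + \sigma^2_E} = [\mathrm{Corr}(Y, T)]^2$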
This presentation draws on the classical test theory framework described in the classic work of Lord & Novick, which is the framework used to define coefficients such as alpha and omega.
In classical test theory, as shown in the graph here, a person's measured score Y (say, happiness) is determined by the person's true score (true happiness level) T and some random error E.
Notice that T and E are drawn in circles because they are latent variables: they are not observable and can only be inferred from the data.
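As a rough illustration of the definition above, the following sketch simulates true scores and errors and compares the two expressions for reliability. The sample size, standard deviations, and seed are arbitrary assumptions chosen for illustration (population reliability 4 / (4 + 1) = .8); they do not come from the talk.

```python
import random


def variance(xs):
    """Sample variance with an n - 1 denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_ctt(n=20000, sd_true=2.0, sd_error=1.0, seed=1):
    """Draw independent T and E, and form Y = T + E (hypothetical values)."""
    rng = random.Random(seed)
    true = [rng.gauss(0, sd_true) for _ in range(n)]
    error = [rng.gauss(0, sd_error) for _ in range(n)]
    observed = [t + e for t, e in zip(true, error)]
    return true, observed


true, observed = simulate_ctt()
rel_ratio = variance(true) / variance(observed)   # sigma^2_T / sigma^2_Y
rel_corr2 = correlation(observed, true) ** 2      # [Corr(Y, T)]^2
print(round(rel_ratio, 3), round(rel_corr2, 3))
```

In a finite sample the two estimates differ slightly, because the sample covariance between T and E is not exactly zero; both should be close to .8.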
That is the case when we only have one item to measure happiness. When we have multiple items, say four items here, usually what we do is to assume that each item is an indicator of the true happiness construct. This is the idea of a factor model as represented in this figure.
The tau equivalence assumption here means that the items all have loadings of 1, meaning that the true score variance in each item is the same.
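Variance of unweighted composite: $V(Z) = p^2 \psi + \sum_k \theta_k$
Reliability: $\alpha = \dfrac{p^2 \psi}{p^2 \psi + \sum_k \theta_k}$
Alternative form: $\alpha = \dfrac{p}{p - 1}\left(1 - \dfrac{\mathrm{tr}(\Sigma)}{\mathbf{1}'\Sigma\mathbf{1}}\right)$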
There are different ways to justify the derivation of α
The idea is that we can simply use the covariance matrix of the items to compute reliability, without the need to do a factor analysis.
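A minimal sketch of that idea: α computed from an item covariance matrix alone, checked against the model-based expression p²ψ / (p²ψ + Σθ_k). The values ψ = 1 and θ_k = 1 for four items are assumptions made up for this example, not estimates from any data set.

```python
def cronbach_alpha(cov):
    """Alpha from a p x p covariance matrix given as a list of lists."""
    p = len(cov)
    trace = sum(cov[k][k] for k in range(p))   # tr(Sigma)
    total = sum(sum(row) for row in cov)       # 1' Sigma 1
    return p / (p - 1) * (1 - trace / total)


# Hypothetical tau-equivalent model: all loadings 1, factor variance psi,
# unique variances theta_k.
psi = 1.0
theta = [1.0, 1.0, 1.0, 1.0]
p = len(theta)

# Model-implied covariance matrix: psi everywhere, plus theta_k on the diagonal.
cov = [[psi + (theta[k] if k == m else 0.0) for m in range(p)]
       for k in range(p)]

alpha = cronbach_alpha(cov)
model_based = p ** 2 * psi / (p ** 2 * psi + sum(theta))
print(alpha, model_based)
```

Under tau equivalence the two expressions agree, which is why no factor analysis is needed in that case.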
True Score Variance: $V(T) = \left(\sum_k \lambda_k\right)^2 \psi$
Error Variance: $V(E) = \sum_k \theta_k$
Reliability: $\omega = \dfrac{V(T)}{V(T) + V(E)}$
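The ω formula can be sketched in a few lines. The loadings and unique variances below are hypothetical values for a congeneric (unequal-loading) model; α is computed on the model-implied covariance matrix for comparison.

```python
def omega(loadings, psi, uniquenesses):
    """Composite reliability for a one-factor model."""
    true_var = sum(loadings) ** 2 * psi   # V(T) = (sum of loadings)^2 * psi
    error_var = sum(uniquenesses)         # V(E) = sum of unique variances
    return true_var / (true_var + error_var)


def cronbach_alpha(cov):
    """Alpha from a p x p covariance matrix given as a list of lists."""
    p = len(cov)
    trace = sum(cov[k][k] for k in range(p))
    total = sum(sum(row) for row in cov)
    return p / (p - 1) * (1 - trace / total)


# Hypothetical congeneric model (values are assumptions for illustration).
loadings = [0.8, 0.7, 0.6, 0.5]
psi = 1.0
uniquenesses = [0.36, 0.51, 0.64, 0.75]
p = len(loadings)

# Model-implied covariance: lambda_k * lambda_m * psi, plus theta_k on the diagonal.
cov = [[loadings[k] * loadings[m] * psi + (uniquenesses[k] if k == m else 0.0)
        for m in range(p)] for k in range(p)]

omega_value = omega(loadings, psi, uniquenesses)
alpha_value = cronbach_alpha(cov)
print(round(omega_value, 3), round(alpha_value, 3))
```

With unequal loadings α falls slightly below ω; with equal loadings the two coincide.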
Lai (2021)
2007 Trends in International Mathematics and Science Study (TIMSS; Williams et al., 2009)
Positive attitudes toward math (PATM)
Item | Wording |
---|---|
AS4MAMOR | Would like to do more math |
AS4MAENJ | I enjoy learning mathematics |
AS4MALIK | I like math |
AS4MABOR | Math is boring (reverse-coded) |
With this kind of multilevel data, the analysis needs to account for the clustering structure, because clustering violates the assumption of independent observations; otherwise the statistical inferences will be incorrect.
Kim et al. (2016): Only 54% reported reliability, among 39 articles using multilevel confirmatory factor analysis (MCFA)
However, discussion on multilevel reliability is not new
Ignore clustering
Omit uncertainty (SE too small/CI too narrow)
No level-specific information
j indexes cluster
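$Y_{ij} = \nu + \underbrace{\lambda^b \eta^b_j + \epsilon^b_j}_{\text{between model}} + \underbrace{\lambda^w_j \eta^w_{ij} + \epsilon^w_{ij}}_{\text{within model}}$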
No cross-level invariance
$\epsilon$ is the uniqueness, separated into a within-level part and a between-level part
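For reference, level-specific ω coefficients are commonly defined by applying the single-level ω formula to each level's parameters separately (as in Geldhof et al., 2014, listed in the references). The sketch below assumes item-level symbols that do not appear on the slide: $\lambda^w_k$, $\lambda^b_k$ for the loadings of item $k$, $\psi^w$, $\psi^b$ for the factor variances, and $\theta^w_k$, $\theta^b_k$ for the variances of the uniquenesses.

```latex
\tilde\omega^w = \frac{\left(\sum_k \lambda^w_k\right)^2 \psi^w}
                      {\left(\sum_k \lambda^w_k\right)^2 \psi^w + \sum_k \theta^w_k},
\qquad
\tilde\omega^b = \frac{\left(\sum_k \lambda^b_k\right)^2 \psi^b}
                      {\left(\sum_k \lambda^b_k\right)^2 \psi^b + \sum_k \theta^b_k}
```

Note that $\tilde\omega^b$ written this way contains no term for the sampling error of the cluster means.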
For the TIMSS data
Est $\tilde\omega^w$ = .857, 95% CI [.849, .863]
Est $\tilde\omega^b$ = .977, 95% CI [.964, .987]
Not uncommon in the literature . . .
Positive and negative affect: $\tilde\omega^b$ = .94 to .97 (Rush, et al., 2014)
Instructional Skills Questionnaire: $\tilde\alpha^b$ between .90 and .99 (Knol, et al., 2016)
Repeated measures within persons
Multiple factors in ISQ, Team from Netherlands
$Y^b_1$ and $Y^w_1$ are not observed scores!
Observed school mean ≠ True school mean (unless $n_j$ is large; Raudenbush, et al., 2002)
What Are the Observed Scores With Multilevel Analysis?
IDSCHOOL | AS4MAMOR | AS4MAENJ | AS4MALIK | AS4MABORr | Z | Zb | Zw |
---|---|---|---|---|---|---|---|
1 | 2 | 2 | 1 | 2 | 7 | 6.5000 | 0.5000 |
1 | 2 | 1 | 1 | 1 | 5 | 6.5000 | -1.5000 |
1 | 2 | 1 | 1 | 1 | 5 | 6.5000 | -1.5000 |
1 | 2 | 1 | 2 | 1 | 6 | 6.5000 | -0.5000 |
1 | 1 | 1 | 1 | 1 | 4 | 6.5000 | -2.5000 |
2 | 3 | 2 | 2 | 2 | 9 | 6.5625 | 2.4375 |
2 | 1 | 2 | 2 | 1 | 6 | 6.5625 | -0.5625 |
2 | 1 | 1 | 1 | 1 | 4 | 6.5625 | -2.5625 |
2 | 3 | 2 | 1 | 1 | 7 | 6.5625 | 0.4375 |
2 | 2 | 2 | 3 | 1 | 8 | 6.5625 | 1.4375 |
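The three observed composites in the table can be sketched as follows: Z is the item sum, Zb is the cluster mean of Z, and Zw is the deviation of Z from its cluster mean. The example uses only the ten rows shown above, so its cluster means differ from the Zb column, which is presumably based on all students in each school.

```python
from collections import defaultdict


def decompose(cluster_ids, composites):
    """Split each composite score into its cluster mean (Zb) and deviation (Zw)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, z in zip(cluster_ids, composites):
        sums[c] += z
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    zb = [means[c] for c in cluster_ids]
    zw = [z - b for z, b in zip(composites, zb)]
    return zb, zw


# Item responses for the ten rows displayed in the table.
school = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
items = [
    (2, 2, 1, 2), (2, 1, 1, 1), (2, 1, 1, 1), (2, 1, 2, 1), (1, 1, 1, 1),
    (3, 2, 2, 2), (1, 2, 2, 1), (1, 1, 1, 1), (3, 2, 1, 1), (2, 2, 3, 1),
]

z = [sum(row) for row in items]   # raw composite Z
zb, zw = decompose(school, z)     # cluster means and cluster-mean centered scores

for s, total, b, w in zip(school, z, zb, zw):
    print(s, total, round(b, 4), round(w, 4))
```

Each of Z, Zb, and Zw is a score that can actually be computed from the data, which is why each has its own reliability.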
Within: $\dfrac{V(T^w)}{V(T^w) + V(E^w)}$
Between: $\dfrac{V(T^b)}{V(T^b) + V(E^b) + \underbrace{[V(T^w) + V(E^w)] / n}_{\text{sampling error}}}$
Raw: $\dfrac{V(T^w) + V(T^b)}{V(T^w) + V(E^w) + V(T^b) + V(E^b)}$
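To see how the sampling-error term affects the between-level coefficient, here is a small sketch of the three formulas. The variance components are hypothetical numbers chosen for illustration, not estimates from the TIMSS data.

```python
def composite_reliabilities(vt_w, ve_w, vt_b, ve_b, n):
    """Within, between, and raw composite reliability for cluster size n."""
    within = vt_w / (vt_w + ve_w)
    # The cluster mean also carries within-level variance divided by n.
    between = vt_b / (vt_b + ve_b + (vt_w + ve_w) / n)
    raw = (vt_w + vt_b) / (vt_w + ve_w + vt_b + ve_b)
    return {"within": within, "between": between, "raw": raw}


# Hypothetical variance components (assumptions for illustration).
components = dict(vt_w=16.0, ve_w=4.0, vt_b=4.0, ve_b=0.2)

small = composite_reliabilities(n=5, **components)
large = composite_reliabilities(n=50, **components)

# Between-level value when the sampling-error term is dropped entirely.
no_sampling_error = components["vt_b"] / (components["vt_b"] + components["ve_b"])

print(round(small["between"], 3), round(large["between"], 3),
      round(no_sampling_error, 3))
```

With these numbers the between-level reliability is much lower for clusters of 5 than for clusters of 50, and both are below the value obtained when the sampling-error term is ignored; the within and raw coefficients do not depend on n.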
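First, obtain estimated covariance matrices at the between ($\Sigma^b$) and the within ($\Sigma^w$) levels
$\alpha^{2l} = \dfrac{p}{p - 1}\left(\dfrac{\sum_{k \neq k'} (\sigma^b_{kk'} + \sigma^w_{kk'})}{\mathbf{1}'\Sigma^b\mathbf{1} + \mathbf{1}'\Sigma^w\mathbf{1}}\right)$
$\alpha^b = \dfrac{p}{p - 1}\left(\dfrac{\sum_{k \neq k'} \sigma^b_{kk'}}{\mathbf{1}'\Sigma^b\mathbf{1} + \mathbf{1}'\Sigma^w\mathbf{1} / \tilde n}\right)$
$\alpha^w = \dfrac{p}{p - 1}\left(\dfrac{\sum_{k \neq k'} \sigma^w_{kk'}}{\mathbf{1}'\Sigma^w\mathbf{1}}\right)$
$\tilde n$ = harmonic mean of cluster sizes (e.g., number of students in each school in the sample)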
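The three α formulas can be sketched directly from the two covariance matrices. This is an illustrative implementation under made-up inputs, not the code behind `multilevel_reliability()`; the covariance matrices and cluster sizes below are hypothetical.

```python
from statistics import harmonic_mean


def sum_all(cov):
    """1' Sigma 1: the sum of all elements."""
    return sum(sum(row) for row in cov)


def sum_offdiag(cov):
    """Sum of the off-diagonal elements (all k != k')."""
    return sum_all(cov) - sum(cov[k][k] for k in range(len(cov)))


def multilevel_alpha(cov_b, cov_w, cluster_sizes):
    """alpha2l, alphab, and alphaw from between and within covariance matrices."""
    p = len(cov_w)
    n_tilde = harmonic_mean(cluster_sizes)
    scale = p / (p - 1)
    alpha2l = scale * (sum_offdiag(cov_b) + sum_offdiag(cov_w)) / (
        sum_all(cov_b) + sum_all(cov_w))
    alphab = scale * sum_offdiag(cov_b) / (
        sum_all(cov_b) + sum_all(cov_w) / n_tilde)
    alphaw = scale * sum_offdiag(cov_w) / sum_all(cov_w)
    return {"alpha2l": alpha2l, "alphab": alphab, "alphaw": alphaw}


def compound_symmetric(p, diag, off):
    """Helper for the example: equal variances and equal covariances."""
    return [[diag if k == m else off for m in range(p)] for k in range(p)]


# Hypothetical inputs: four items, three clusters of unequal size.
cov_w = compound_symmetric(4, diag=2.0, off=1.0)
cov_b = compound_symmetric(4, diag=0.30, off=0.25)
sizes = [10, 20, 40]

result = multilevel_alpha(cov_b, cov_w, sizes)
print({name: round(value, 3) for name, value in result.items()})
```

In practice the two matrices would come from a multilevel model fitted to the data; here they are fixed by hand so the arithmetic is easy to follow.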
Depends on the model
Unidimensionality
"Shared" construct (Stapleton, et al., 2016; Stapleton, et al., 2019)
If a variable is partitioned in a multilevel model (most likely an individual construct), all three ($\omega^{2l}$, $\omega^b$, $\omega^w$) should be reported
Cluster means and cluster-mean centered predictors
Outcome variable
multilevel_reliability()
multilevel_reliability(
  timss_usa[c("AS4MAMOR", "AS4MAENJ", "AS4MALIK", "AS4MABORr")],
  id = timss_usa$IDSCHOOL
)
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
## $alpha
##   alpha2l    alphab    alphaw 
## 0.8675110 0.6527865 0.8592387 
## 
## $alpha_ci
##              2.5%     97.5%
## alpha2l 0.8604497 0.8741798
## alphab  0.5802825 0.7046562
## alphaw  0.8517564 0.8663573
## 
## $omega
##   omega2l    omegab    omegaw 
## 0.8674844 0.6233126 0.8605125 
## 
## $omega_ci
##              2.5%     97.5%
## omega2l 0.8602266 0.8739374
## omegab  0.5501061 0.6761630
## omegaw  0.8529576 0.8672256
## 
## $ncomp
##  within between 
##       1       1
Lai, F. Ji, & Chen (in preparation). Suggestions greatly appreciated.
Data from MIDUS 2: Daily Stress Project, 2004-2009 (Ryff, et al., 2009)
2,022 participants, 8 days each
Target construct: Positive affect
Item | Wording |
---|---|
b2dc24 | Did you feel attentive? |
b2dc25 | Did you feel proud? |
b2dc26 | Did you feel active? |
b2dc27 | Did you feel confident? |
Est ICC($\eta$) = .778
Composite | Est α | 95% CI | Est ω | 95% CI |
---|---|---|---|---|
Raw | .832 | [.820, .843] | .829 | [.817, .841] |
Within | .646 | [.628, .664] | .645 | [.625, .662] |
Between | .862 | [.849, .873] | .860 | [.817, .872] |
Cross-Classified CFA (Jeon, et al., 2012; Asparouhov, et al., 2016)
Assuming cross-level invariance for an individual construct, with decomposition $\eta_{ti} = \eta^{(P)}_i + \eta^{(T)}_t + \eta^{(W)}_{ti}$
W = P × T + Error
Bayesian Estimation in Stan
With the MIDUS daily diary data, there is essentially no time-level variance
Reliability coefficients remain the same when modeling the time-level variance
Discrete indicators?
Should we treat within-person and between-person constructs as parts of one construct?
Time-varying reliability?
Are there "shared" constructs at the person level?
Computing and reporting reliability information is important for multilevel data
Reliability information is needed for raw, cluster means, and cluster-mean centered scores
The previous approach to between-level reliability overestimates it when cluster sizes are small
The nature of the target construct should be considered, as it has implications for how reliability is computed
Collaborators: Feng Ji (UC Berkeley) and Shi Chen (Northern Arizona University)
Oi-man Kwok and Hio Wa Mak for suggestions on the paper
Appelbaum, M. et al. (2018). "Journal article reporting standards for quantitative research in psychology". In: American Psychologist 73.1, pp. 3-25. DOI: 10.1037/amp0000191.
Asparouhov, T. et al. (2016). "General random effect latent variable modeling: Random subjects, items, contexts, and parameter". In: Advances in multilevel modeling for educational research: Addressing practical issues found in real-world applications. Charlotte, NC: Information Age, pp. 163-192.
Ellis, J. L. et al. (1993). "Local homogeneity in latent trait models. A characterization of the homogeneous monotone IRT model". In: Psychometrika 58.3, pp. 417-429. DOI: 10.1007/bf02294649.
Geldhof, G. J. et al. (2014). "Reliability estimation in a multilevel confirmatory factor analysis framework". In: Psychological Methods 19.1, pp. 72-91. DOI: 10.1037/a0032138.
Jak, S. et al. (2014). "Measurement bias in multilevel data". In: Structural Equation Modeling: A Multidisciplinary Journal 21.1, pp. 31-39. DOI: 10.1080/10705511.2014.856694.
Jeon, M. et al. (2012). "Profile-likelihood approach for estimating generalized linear mixed models with factor structures". In: Journal of Educational and Behavioral Statistics 37.4, pp. 518-542. DOI: 10.3102/1076998611417628.
Knol, M. H. et al. (2016). "Measuring the quality of university lectures: Development and validation of the Instructional Skills Questionnaire (ISQ)". In: PLOS ONE 11.2. Ed. by D. S. Courvoisier, p. e0149163. DOI: 10.1371/journal.pone.0149163.
Lai, M. H. C. (2021). "Composite reliability of multilevel data: It’s about observed scores and construct meanings". In: Psychological Methods 26.1. DOI: 10.1037/met0000287.
Mehta, P. D. et al. (2005). "People are variables too: Multilevel structural equations modeling". In: Psychological Methods 10.3, pp. 259-284. DOI: 10.1037/1082-989X.10.3.259.
Raudenbush, S. W. et al. (2002). Hierarchical linear models: Applications and data analysis methods. 2nd ed. Thousand Oaks, CA: Sage. ISBN: 076191904X.
Raykov, T. et al. (2006). "On multilevel model reliability estimation from the perspective of structural equation modeling". In: Structural Equation Modeling: A Multidisciplinary Journal 13.1, pp. 130-141. DOI: 10.1207/s15328007sem1301_7.
Raykov, T. et al. (2005). "Estimation of reliability for multiple-component measuring instruments in hierarchical designs". In: Structural Equation Modeling: A Multidisciplinary Journal 12.4, pp. 536-550. DOI: 10.1207/s15328007sem1204_2.
Rush, J. et al. (2014). "Differences in within- and between-person factor structure of positive and negative affect: Analysis of two intensive measurement studies using multilevel structural equation modeling". In: Psychological Assessment 26.2, pp. 462-473. DOI: 10.1037/a0035666.
Ryff, C. D. et al. (2009). Midlife in the United States (MIDUS 2): Daily Stress Project, 2004-2009: Version 2. type: dataset. DOI: 10.3886/ICPSR26841.V2.
Stapleton, L. M. et al. (2019). "Models to Examine the Validity of Cluster-Level Factor Structure Using Individual-Level Data". In: Advances in Methods and Practices in Psychological Science 2.3, pp. 312-329. DOI: 10.1177/2515245919855039.
Stapleton, L. M. et al. (2016). "Construct meaning in multilevel settings". In: Journal of Educational and Behavioral Statistics 41.5, pp. 481-520. DOI: 10.3102/1076998616646200.
Assume cross-level invariance holds and $\omega^{2l} = .8$
$\mathrm{ICC}(\eta) = \dfrac{\psi^b}{\psi^b + \psi^w}$
Most meaningful with systematic time effects
Cranford, et al. (2006): generalizability coefficients for diary studies
Differences:
Fixed vs. Random item facet (in estimation)
Relax the essential parallel test assumption
Test invariance assumptions
Flexible SEM modeling
Not the case for the MIDUS data, as everyone starts on a different day