Model Selection for Multilevel Modeling

Statistics
Author

Mark Lai

Published

December 26, 2017

In the social sciences, we often use statistical methods to answer well-defined research questions derived from theory or previous research. For example, theory may suggest that interventions to improve students’ self-efficacy may benefit their academic performance, so we would like to test a mediation model of intervention –> self-efficacy –> academic performance. We may also learn from previous studies that the intervention may work differently for different genders, so we would like to include an intervention \(\times\) gender interaction.

However, innovations often arise from exploratory data analysis, where existing theory may provide only partial or little guidance for understanding our data. This is especially true for multilevel modeling (MLM), as theories that are truly multilevel are relatively rare. The additional model choices in MLM also contribute to this, as theories seldom tell us whether the relationship between two variables is the same or different at different levels, whether there is heterogeneity of the level-1 relationship across level-2 units, or whether there are specific cross-level interactions. One takeaway, of course, is that we need better theories in our disciplines. But for research with a more exploratory focus and in the absence of established theories, we want to fully explore our data while having some safeguards against over-interpreting the noise in a single data set.

Complexity of MLM

With a single-level analysis, if one has one predictor, call it \(X\), one only needs to estimate the coefficient of that predictor and perhaps evaluate whether there is statistical evidence that the predictor has predictive power (e.g., with hypothesis tests), assuming the assumptions of the statistical model are satisfied. However, with a two-level analysis, especially if the predictor is at level 1, things can already get complicated. For example, one can ask these questions:

  1. Is \(X\) related to the outcome overall?
  2. Does \(X\) have both a lv-1 effect and a lv-2 effect?
  3. If yes, are the effects at different levels the same or different?
  4. If \(X\) has a lv-1 effect, is there a random slope?

Just imagine the complexity with more predictors and the potential for different interactions and cross-level interaction effects. Such complexity has two major consequences:

  • If we were to do a statistical test for all of the fixed and random effects, we run the risk of huge Type I error inflation (just like post hoc testing in ANOVA). \(P\)-values are not trustworthy!
  • Perhaps more importantly, unless one has a very large sample size, the parameter estimates of a complex model are highly unstable and may not be reproducible at all.

Information Criteria

If statistical significance and \(p\)-values cannot do the job, what can? How about an effect size like the \(R^2\) we discussed before? Unfortunately, \(R^2\) is not designed for that purpose, and choosing the model with the largest \(R^2\) will generally also result in models that are unstable and not reproducible.

In statistics, there are indices specifically designed for comparing different models and evaluating their reproducibility, and by far the two most common ones are the Akaike Information Criterion (AIC) and the so-called Bayesian Information Criterion (BIC; also called the Schwarz Criterion).

There are different ways to understand these information criteria, and AIC and BIC were actually developed with very different motivations and backgrounds, despite how often they are mentioned together. Nevertheless, the most classic interpretation, at least for AIC, is that it is a measure of the fit of the model to a different and independent sample. In other words, if you come up with a model using the data you have now and then collect a new sample, how well can your model describe what is happening in the new sample? The smaller the AIC/BIC, the better the fit of the model to a new sample. All information criteria follow a general form:

\[\mathrm{IC} = \text{Deviance} + \text{Penalty}\]

The Penalty term is always a function of the complexity of the model, usually measured by the number of parameters estimated. Remember that

  1. The smaller the deviance, the better the model fit, and
  2. A more complex model almost always gives a smaller deviance.

Therefore, without the Penalty one would always select the most complex model, which may never be reproducible in an independent sample. Instead, AIC and BIC are usually formulated as \[\begin{align*} \mathrm{AIC} & = \text{Deviance} + 2 q \\ \mathrm{BIC} & = \text{Deviance} + q \log N, \end{align*}\] where \(q\) is the number of estimated parameters (both fixed and random). Note, however, that the computation of AIC and BIC may differ across software, especially for BIC, as some packages define \(N\) as the number of groups whereas others define \(N\) as the number of units at the lowest level. Nevertheless, regardless of the definition, they tend to work fine for the general purpose of comparing models, and are generally better than using the deviance or other significance tests. We will look at the models we previously fitted to the HSB data set, with SES and SECTOR predicting MATHACH.

Example

# Load packages and read in the High School and Beyond (HSB) data
library(tidyverse)
library(haven)
library(lme4)
hsb <- haven::read_sas('https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hsb.sas7bdat')

The following models are fitted:

  • M0: Random-intercept only
  • M1: Adding SES
  • M2: M1 + random slope of SES
  • M3: M2 + MEANSES
  • M4: M3 + SECTOR
  • M5: M4 + SECTOR \(\times\) SES interaction
  • M6: M5 + SECTOR \(\times\) MEANSES interaction

# It's generally recommended to use ML instead of REML to get ICs
m0 <- lmer(MATHACH ~ (1 | ID), data = hsb, REML = FALSE)
# You can use the `update` function to add terms
m1 <- update(m0, . ~ . + SES)
# To add random slope, replace the original random intercept term
m2 <- update(m1, . ~ . - (1 | ID) + (SES | ID))
m3 <- update(m2, . ~ . + MEANSES)
m4 <- update(m3, . ~ . + SECTOR)
m5 <- update(m4, . ~ . + SES:SECTOR)
m6 <- update(m5, . ~ . + MEANSES:SECTOR)
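
With these seven models fitted with ML, their information criteria can be compared directly. The short sketch below is my own addition; it also verifies the general form above by recomputing AIC for M1 as the deviance plus twice the number of estimated parameters.

# Compare the candidate models; smaller AIC/BIC means better expected fit
# to a new, independent sample
AIC(m0, m1, m2, m3, m4, m5, m6)
BIC(m0, m1, m2, m3, m4, m5, m6)
# Check the general form for m1: Deviance + 2q, with q = number of parameters
deviance(m1) + 2 * attr(logLik(m1), "df")
AIC(m1)  # same value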

Selecting Fixed Effects

As a starting point, one can usually first compare models with different fixed effects, especially those at level 1. For example, suppose one has three predictors, MINORITY, FEMALE, and SES, and wonders, without strong prior theory, whether they have main and interaction effects in predicting MATHACH. One can easily run all of the possible models without random slopes and level-2 effects by first defining the most complex model:

# Need ML
m_3wayinter <- lmer(MATHACH ~ MINORITY * FEMALE * SES + (1 | ID), data = hsb, 
                    REML = FALSE)

and then with the help of the MuMIn package in R:

library(MuMIn)
options(na.action = "na.fail")  # set the missing data handling method
dd <- dredge(m_3wayinter)
model.sel(dd, rank = AIC)
Global model call: lmer(formula = MATHACH ~ MINORITY * FEMALE * SES + (1 | ID), 
    data = hsb, REML = FALSE)
---
Model selection table 
    (Int)    FEM    MIN   SES FEM:MIN FEM:SES MIN:SES FEM:MIN:SES df    logLik
56  14.07 -1.211 -3.084 2.152          0.3740 -0.8009              8 -23184.72
64  14.12 -1.300 -3.272 2.119  0.3651  0.4258 -0.7901              9 -23184.20
40  14.06 -1.214 -3.075 2.343                 -0.7846              7 -23186.63
128 14.12 -1.300 -3.271 2.119  0.3642  0.4273 -0.7875   -0.004688 10 -23184.20
48  14.08 -1.255 -3.161 2.340  0.1675         -0.7785              8 -23186.51
24  14.13 -1.228 -2.967 1.909          0.3458                      7 -23191.78
32  14.18 -1.332 -3.191 1.874  0.4312  0.4074                      8 -23191.06
8   14.11 -1.230 -2.962 2.091                                      6 -23193.42
16  14.14 -1.289 -3.086 2.089  0.2410                              7 -23193.17
39  13.42        -3.058 2.395                 -0.8286              6 -23214.30
7   13.47        -2.938 2.129                                      5 -23221.82
22  13.29 -1.185        2.197          0.3079                      6 -23293.61
6   13.28 -1.188        2.358                                      5 -23294.87
5   12.66               2.391                                      4 -23320.50
4   14.38 -1.390 -3.702                                            5 -23377.51
12  14.43 -1.483 -3.900        0.3860                              6 -23376.92
3   13.65        -3.688                                            4 -23411.65
2   13.35 -1.359                                                   4 -23526.66
1   12.64                                                          3 -23557.90
        AIC  delta weight
56  46385.4   0.00  0.413
64  46386.4   0.96  0.256
40  46387.3   1.83  0.166
128 46388.4   2.96  0.094
48  46389.0   3.59  0.069
24  46397.6  12.13  0.001
32  46398.1  12.68  0.001
8   46398.8  13.40  0.001
16  46400.3  14.92  0.000
39  46440.6  55.17  0.000
7   46453.6  68.20  0.000
22  46599.2 213.79  0.000
6   46599.7 214.31  0.000
5   46649.0 263.57  0.000
4   46765.0 379.58  0.000
12  46765.8 380.40  0.000
3   46831.3 445.86  0.000
2   47061.3 675.89  0.000
1   47121.8 736.38  0.000
Models ranked by AIC(x) 
Random terms (all models): 
  1 | ID

The above commands rank all possible models by their AIC values in ascending order, as shown in the AIC column. As you can see, all of the best models have the main effects of FEMALE, MINORITY, and SES, and the best model also has the FEMALE \(\times\) SES and MINORITY \(\times\) SES interactions.
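
If you want to inspect the top-ranked model more closely, MuMIn can extract and refit it; the following is a minimal sketch assuming the dd object created above.

# Refit and summarize the model ranked first by AIC
best_mod <- get.models(dd, subset = 1)[[1]]
summary(best_mod)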

You can use BIC for the same purpose (results not shown):

model.sel(dd, rank = BIC)

However, the best model according to BIC does not contain the FEMALE \(\times\) SES interaction.

How to Choose Between AIC and BIC?

This is one of the most debated issues in the field. The (not so) short answer is that, although AIC and BIC may give different orderings of candidate models, the sets of models with the lowest AIC and BIC should be similar. Indeed, it is never good practice to select only one model out of all candidates, especially when two or more models have very similar AIC/BIC values. Ultimately, AIC and BIC should be used to suggest a few “best” models, and the researcher is responsible for selecting the one that they feel is most in line with theory and the literature, based on their subjective judgment.

More technically, AIC and BIC are based on different motivations: AIC is an index based on information theory, which focuses on predictive accuracy, whereas BIC is derived as an approximation to the Bayes factor, which is used to find the true model, if one exists. Practically, AIC tends to select a model that may be slightly more complex but has optimal predictive ability, whereas BIC tends to select a model that is more parsimonious but may sometimes be too simple. Therefore, if the goal is to have a model that predicts future samples well, AIC should be used; if the goal is to get a model that is as simple as possible, BIC should be used. For a more technical discussion, please read Vrieze (2012).
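
To see why the two criteria can disagree here, compare the per-parameter penalties (a small illustration I am adding): AIC charges 2 for each additional parameter, whereas BIC charges \(\log N\), which is much larger in this example.

# Penalty per additional parameter: AIC uses 2, BIC uses log(N)
# For lme4, N is the number of level-1 observations
log(nobs(m_3wayinter))  # roughly 8.9, so BIC penalizes complexity more heavily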

Including Lv-2 Predictors

One can also add the contextual or level-2 effects of the level-1 predictors. For example, adding MEANSES increases the number of possible models quite a bit. The following code will select a model with all main effects, the two-way interactions of SES with each of FEMALE, MINORITY, and MEANSES, the MEANSES \(\times\) MINORITY interaction, and the MEANSES \(\times\) MINORITY \(\times\) SES three-way interaction. BIC, on the other hand, would select a much simpler model with only the four main effects and the MINORITY \(\times\) SES interaction. You may verify the results yourself.

m_4wayinter <- lmer(MATHACH ~ MINORITY * FEMALE * SES * MEANSES + (1 | ID), 
                    data = hsb, REML = FALSE)
dd <- dredge(m_4wayinter)
model.sel(dd, rank = AIC)

Workflow

As recommended in Hox, Moerbeek, and Van de Schoot (2018), with exploratory multilevel modeling, one proceeds with the following workflow:

  1. Select level-1 predictors
  2. Select level-1 random slopes
  3. Select lv-2 effects of lv-1 predictors as well as level-2 predictors
  4. Select cross-level interactions

m_bic <- lmer(MATHACH ~ MINORITY + FEMALE + SES + MINORITY * SES + (1 | ID), 
              data = hsb, REML = FALSE)
m_bic_rs1 <- update(m_bic, . ~ . - (1 | ID) + (MINORITY | ID))
m_bic_rs2 <- update(m_bic, . ~ . - (1 | ID) + (FEMALE | ID))
m_bic_rs3 <- update(m_bic, . ~ . - (1 | ID) + (SES | ID))
# BIC suggests adding the random slope for MINORITY
BIC(m_bic, m_bic_rs1, m_bic_rs2, m_bic_rs3)
# Now add the MEANSES, SIZE, and SECTOR variable, as well as their interactions
# Note: because the scale of the SIZE variable is huge, I will divide the
#   values by 1000 so that it is measured in thousands of students
hsb <- mutate(hsb, SIZE1000 = SIZE / 1000)
m_bic_rs1_lv2 <- update(m_bic_rs1, . ~ . + MEANSES * SIZE1000 * SECTOR)
dd <- dredge(m_bic_rs1_lv2, 
             fixed = ~ MINORITY + FEMALE + SES + MINORITY * SES + 
               (MINORITY | ID))
model.sel(dd, rank = BIC)
# The best model will add MEANSES and SECTOR main effects
# Finally, let's add the possible two-way cross-level interactions:
m_bic_rs1_lv2_cross <- update(m_bic_rs1, . ~ . + MEANSES * SIZE1000 * SECTOR + 
                                MEANSES * (MINORITY + FEMALE + SES) + 
                                SIZE1000 * (MINORITY + FEMALE + SES) + 
                                SECTOR * (MINORITY + FEMALE + SES))
dd <- dredge(m_bic_rs1_lv2_cross, 
             fixed = ~ MINORITY + FEMALE + SES + MINORITY * SES + 
               (MINORITY | ID))
model.sel(dd, rank = BIC)
# Using BIC, none of the cross-level interactions should be included. A more 
# complex model will be selected using AIC
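
Putting these steps together, a sketch of the final model implied by the BIC results above (my own addition; the exact terms depend on your own run) refits the selected terms with REML for reporting:

# Final model per BIC: level-1 effects with the MINORITY x SES interaction,
# a random slope for MINORITY, and MEANSES and SECTOR at level 2
m_final <- update(m_bic_rs1, . ~ . + MEANSES + SECTOR, REML = TRUE)
summary(m_final)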

Regularization

In many areas of research, the goal of inference is more about getting good predictive performance and less about finding a “true” model. For example, instead of identifying whether gender and SES are true determinants of math achievement, a school administrator may be more interested in a predictive equation for identifying which students may perform best based on their background information. In such cases, finding the “true” model is not as important as finding a model with good predictive performance.

However, a more complex model does not guarantee good predictive performance. It may at first seem strange that adding more predictors can actually give worse predictions. The reason is that the performance of a predictive equation depends on the parameter estimates (e.g., fixed effects, random effects), and with a more complex model, one needs a much larger sample to obtain good parameter estimates. That is why, with a small sample size, a linear model with one or two predictors may actually give you the most stable parameter estimates and the best predictive performance, and may be much better than some of the more advanced models and analytic techniques.
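
As a concrete (if simplified) illustration of this point, one can hold out some schools, fit a simple and a complex model to the remaining schools, and compare how well each predicts the held-out students. The sketch below is my own addition and uses fixed effects only for prediction.

# Hold-out illustration: fit on a subset of schools, predict the rest
set.seed(123)
train_ids <- sample(unique(hsb$ID), size = 120)
hsb_train <- filter(hsb, ID %in% train_ids)
hsb_test <- filter(hsb, !ID %in% train_ids)
m_simple <- lmer(MATHACH ~ SES + (1 | ID), data = hsb_train, REML = FALSE)
m_complex <- lmer(MATHACH ~ MINORITY * FEMALE * SES * MEANSES + (1 | ID), 
                  data = hsb_train, REML = FALSE)
# Predict unseen schools using fixed effects only (re.form = NA)
rmse <- function(m) {
  sqrt(mean((hsb_test$MATHACH - predict(m, newdata = hsb_test, re.form = NA))^2))
}
c(simple = rmse(m_simple), complex = rmse(m_complex))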

A more advanced technique, generally called regularization, allows you to “simplify” a more complex model by shrinking some of the coefficients closer to zero, which is actually the same motivation as using multilevel models instead of including dummy group indicators. Two ways to do regularization in R are model averaging and Bayesian shrinkage priors. As this is somewhat beyond the scope of this course, I simply show two examples below:

# Start with a complex model without cross-level interactions
dd_complex <- dredge(m_bic_rs1_lv2, 
                     fixed = ~ MINORITY + FEMALE + SES + MINORITY * SES + 
                       (MINORITY | ID), beta = "sd")
# Model Averaging
model.avg(dd_complex)  # by default it uses AICc

Call:
model.avg(object = dd_complex)

Component models: 
'1+2+3+4+5+6+7+8+9'       '1+2+3+4+5+6+7+8+9+10'    '1+2+3+4+5+6+7+8+9+10+11' 
'1+2+3+4+5+6+9'           '1+2+3+4+5+6+7+9'         '1+2+3+4+5+6+8+9'         
'1+2+3+4+5+6+9+10'        '1+2+3+4+5+6+7+9+10'      '1+2+3+4+5+6+8+9+10'      
'1+2+3+4+5+7+9'           '1+2+3+4+5+9'             '1+3+4+5+6+9'             
'1+2+3+5+9'               '1+2+3+5+6+8+9'           '1+3+4+5+6+9+10'          
'1+2+3+5+6+9'             '1+3+4+5+9'               '1+3+5+9'                 
'1+3+5+6+9'              

Coefficients: 
       (Intercept)   MEANSES   SECTOR   SIZE1000      FEMALE   MINORITY
full             0 0.2340741 0.116595 0.03515988 -0.09056013 -0.2040835
subset           0 0.2340741 0.116595 0.03546700 -0.09056013 -0.2040835
            SES MEANSES:SECTOR MEANSES:SIZE1000 MINORITY:SES SECTOR:SIZE1000
full   0.247674    -0.06486368       -0.0757840  -0.06400748      0.02279661
subset 0.247674    -0.08518771       -0.1042316  -0.06400748      0.04641916
       MEANSES:SECTOR:SIZE1000
full               0.006670189
subset             0.046988987
# It is usually recommended to scale the variables to have SD = 1 when doing 
#   regularization. 
hsb_s <- mutate_at(hsb, 
                   vars(MATHACH, MINORITY, FEMALE, SES, MEANSES, SIZE, SECTOR), 
                   funs(. / sd(.)))
library(rstanarm)  # Bayesian multilevel modeling
options(mc.cores = 2L)
m_reg <- stan_lmer(MATHACH ~ MINORITY + FEMALE + SES + MINORITY * SES + 
                     MEANSES * SIZE * SECTOR + 
                     (MINORITY + FEMALE + SES | ID), 
                   data = hsb_s, prior = hs(global_scale = 0.05), 
                   prior_covariance = decov(3), iter = 800L, 
                   adapt_delta = 0.99)
print(m_reg, digits = 2)
stan_lmer
 family:       gaussian [identity]
 formula:      MATHACH ~ MINORITY + FEMALE + SES + MINORITY * SES + MEANSES * 
       SIZE * SECTOR + (MINORITY + FEMALE + SES | ID)
 observations: 7185
------
                    Median MAD_SD
(Intercept)          1.92   0.06 
MINORITY            -0.20   0.02 
FEMALE              -0.09   0.01 
SES                  0.24   0.02 
MEANSES              0.22   0.07 
SIZE                 0.01   0.02 
SECTOR               0.07   0.05 
MINORITY:SES        -0.04   0.01 
MEANSES:SIZE        -0.03   0.03 
MEANSES:SECTOR      -0.03   0.04 
SIZE:SECTOR          0.04   0.03 
MEANSES:SIZE:SECTOR  0.00   0.01 

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.87   0.01  

Error terms:
 Groups   Name        Std.Dev. Corr             
 ID       (Intercept) 0.2045                    
          MINORITY    0.0944   -0.07            
          FEMALE      0.0596   -0.68  0.06      
          SES         0.0528    0.06 -0.20 -0.06
 Residual             0.8654                    
Num. levels: ID 160 

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

You can see that the coefficients are quite similar across the two methods. If you were to run the same model as a regular MLM with lme4 or other software, you would see that the coefficients are farther from zero than the regularized ones reported here.
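
If you would like to check this yourself, the following sketch (my addition) fits the same fixed effects with lmer on the standardized data; note that the complex random-effect structure may produce convergence warnings.

# Non-regularized counterpart for comparison
m_noreg <- lmer(MATHACH ~ MINORITY + FEMALE + SES + MINORITY * SES + 
                  MEANSES * SIZE * SECTOR + 
                  (MINORITY + FEMALE + SES | ID), 
                data = hsb_s, REML = FALSE)
fixef(m_noreg)  # compare with the regularized estimates above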

Bibliography

Hox, Joop J., Mirjam Moerbeek, and Rens Van de Schoot. 2018. Multilevel analysis: Techniques and applications. 3rd ed. New York, NY: Routledge.
Vrieze, Scott I. 2012. “Model selection and psychological theory: A discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).” Psychological Methods 17 (2): 228–43. https://doi.org/10.1037/a0027127.