Estimating a 2-PL Model in Julia (Part 2)

Statistics
Programming
Author

Mark Lai

Published

May 28, 2022

The EM algorithm is usually used for estimation problems that involve some latent variables \(\bv Z\) and parameters \(\bv \omega\), where the conditional likelihood \(L(\bv \omega; \bv Y, \bv Z)\) is relatively easy to solve, but the marginal likelihood, which requires integrating \(\bv Z\) out, is intractable. In the 2-PL estimation problem, we consider the item parameters \(\bv a\) and \(\bv d\) as \(\bv \omega\), and the person parameters \(\bv \theta\) as \(\bv Z\).

In a general IRT model, the marginal likelihood function \(L(\bv \omega; \bv Y)\) requires integrating out the person parameters \(\bv \theta\), which is usually difficult. We can instead use the EM algorithm. First, we can define the complete-data likelihood

\[L(\bv \omega; \bv Y, \bv \theta) = P(\bv Y \mid \bv \theta, \bv \omega) P(\bv \theta \mid \bv \omega)\]

assuming that observations are independently and identically distributed given \(\bv \theta\), and local independence such that the item responses are independent when conditioning on \(\bv \theta\), we have

\[L(\bv \omega; \bv Y, \bv \theta) = \prod_i^N \prod_j L_{ij} P(\theta_i \mid \bv \omega_j)\]

where \(L_{ij} = P(y_{ij} \mid \theta_i, \bv \omega_j)\). Thus, the complete-data loglikelihood is

\[\ell(\bv \omega; \bv Y, \bv \theta) = \log L(\bv \omega; \bv Y, \bv \theta) = \sum_i^N \sum_j \left[\ell_{ij} + \log P(\theta_i \mid \bv \omega_j)\right],\]

where \(\ell_{ij} = \log L_{ij}\)

With the EM algorithm, we update our parameter estimates by iterating between two steps:

Note: the setup of the problem in this post deviates a bit from the one in the original Bock and Aitkin paper, which assumes a multinomial distribution of the sample counts of each response pattern. The resulting estimating equations should be the same.

E-Step

First consider the E-step for the 2-PL model, which has

\[ \begin{aligned} \ell_{ij} & = y_{ij} \log \eta_{ij} - \log[1 + \exp(\eta_{ij})] \\ \eta_{ij} & = a_j \theta_i + d_j \end{aligned}. \]

In addition, in IRT, because item responses are discrete, the \(N\) observations can usually be reduced to a smaller number of \(l\) response patterns. For example, with the LSAT data, which has 1,000 observations, it only has 30 response patterns of 5 binary items:

using RCall
lsat = rcopy(R"mirt::LSAT6")
30×6 DataFrame
 Row │ Item_1  Item_2  Item_3  Item_4  Item_5  Freq
     │ Int64   Int64   Int64   Int64   Int64   Int64
─────┼───────────────────────────────────────────────
   1 │      0       0       0       0       0      3
   2 │      0       0       0       0       1      6
   3 │      0       0       0       1       0      2
   4 │      0       0       0       1       1     11
   5 │      0       0       1       0       0      1
   6 │      0       0       1       0       1      1
   7 │      0       0       1       1       0      3
   8 │      0       0       1       1       1      4
  ⋮  │   ⋮       ⋮       ⋮       ⋮       ⋮       ⋮
  24 │      1       1       0       0       1     56
  25 │      1       1       0       1       0     21
  26 │      1       1       0       1       1    173
  27 │      1       1       1       0       0     11
  28 │      1       1       1       0       1     61
  29 │      1       1       1       1       0     28
  30 │      1       1       1       1       1    298
                                      15 rows omitted

So instead of computing the loglikelihood for 1,000 observations, we only need to do it for \(s = 30\) response patterns. So the complete-data loglikelihood can be written as

\[\ell(\bv \omega; \bv Y, \bv \theta) = \sum_{l = 1}^s \sum_j n_l \ell_{lj} + \sum_{i = 1}^n \sum_j \log P(\theta_i \mid \bv \omega_j),\]

with \(\ell_{lj} = \log L_{lj} = \log P(y_{lj} \mid \bv \omega_j, \theta)\) for a fixed \(\theta\) value.

For the E-step, we first need the conditional distribution \(P(\theta \mid \bv Y, \bv a^{(t)}, \bv d^{(t)})\) with some fixed values of \(\bv a^{(t)}\) and \(\bv d^{(t)}\), which can be obtained with Bayes’ theorem

\[P(\theta_i \mid \bv y_l, \bv a^{(t)}, \bv d^{(t)}) = \frac{P(\bv y_l \mid \theta_i, \bv a^{(t)}, \bv d^{(t)})P(\theta_i)}{\int P(\bv y_l \mid \theta, \bv a^{(t)} \bv d^{(t)})P(\theta) d\theta}\]

We need a prior distribution \(P(\theta)\), which is commonly chosen as a normal distribution (and standard normal for single group analysis). Using Gaussian-Hermite (GH) quadrature to replace \(P(\theta_k)\) by \(w_k\), where \(k\) being one of the \(q\) quadrature points, and replacing the intergral with the finite sum, we have:

\[ \begin{aligned} P(\theta_k \mid \bv y_l, \bv a^{(t)}, \bv d^{(t)}) & \approx \frac{P(\bv y_l \mid \theta_i, \bv a, \bv d) w_k}{\sum_{k = 1}^q P(\bv y_l \mid \theta, \bv a \bv d) w_k} \\ & = \frac{L^{(t)}_l(\theta_k)w_k}{\sum_k L^{(t)}_l(\theta_k) w_k} \\ & = \frac{L^{(t)}_l(\theta_k)w_k}{P_l} \end{aligned} \]

Note the subscript \((t)\) indicating that \(L^{(t)}_l(\cdot)\) is a function of \(\theta\) with fixed values of the item parameters. Now we can taking the expectation of the complete-data loglikelihood with respect to the conditional distribution we just saw:

\[ \begin{aligned} E_{\bv \theta \mid \bv Y, \bv \omega^{(t)}}[\ell(\bv \omega; \bv Y, \bv \theta)] & = E_{\bv \theta \mid \bv Y, \bv \omega^{(t)}}\left[\sum_l \sum_j n_l \ell_{lj} \right] \\ & \quad + E_{\bv \theta \mid \bv Y, \bv \omega^{(t)}} \left[\sum_l \sum_j \log P(\theta_i \mid \bv \omega^{(t)})\right], \end{aligned} \]

Note that the second expectation term is not a function of the item parameters (it is a function of \(\theta\) treating the item parameters as fixed), so we can write it as \(h(\bv \theta)\), which is a constant for the maximization step. So using G-H quadrature to approximate the integral when taking the expectation, we have

\[ \begin{aligned} E_{\bv \theta \mid \bv Y, \bv \omega^{(t)}}[\ell(\bv \omega; \bv Y, \bv \theta)] & = \int \left[\sum_l \sum_j n_l \ell_{lj} P(\theta \mid \bv y_l, \bv \omega^{(t)})\right] d \theta + h(\bv \theta) \\ & = \sum_k \sum_l \sum_j \frac{ n_l \ell_{lj} L^{(t)}_l(\theta_k)w_k}{P_l} + h(\bv \theta) \end{aligned} \]

M-Step

In the M-step, we find new values \(\bv \omega^{(t + 1)}\) that maximize the expected loglikelihood above. To find the maximizer, we take the derivative of the expected loglikelihood with respect to each parameter, and set the derivative to zero. This gives us a set of estimating equations. First, it can be verified that

\[ \begin{aligned} \frac{\partial \ell_{lj}}{\partial a_j} & = \left[y_{lj} - P_j(\theta_k)\right] \theta_k \\ \frac{\partial \ell_{lj}}{\partial d_j} & = y_{lj} - P_j(\theta_k), \end{aligned} \]

where \(P_j(\theta_k) = \eta_{kj} / (1 + \eta_{kj})\). So the derivative of the expected loglikelihood with respect to \(a_j\) is

\[ \begin{aligned} \frac{\partial}{\partial a_j} E_{\bv \theta \mid \bv Y, \bv \omega^{(t)}}[\ell(\bv \omega; \bv Y, \bv \theta)] & = \sum_k \sum_l \frac{n_l [y_{lj} - P_j(\theta_k)] \theta_k L^{(t)}_l(\theta_k)w_k}{P_l}. \end{aligned} \]

Following the previous literature, let

\[ \begin{aligned} \bar r_{jk} & = \sum_l \frac{n_l y_{lj} L^{(t)}_l(\theta_k)w_k}{P_l} \\ \bar n_k & = \sum_l \frac{n_l L^{(t)}_l(\theta_k)w_k}{P_l}, \end{aligned} \]

where \(n_k\) is the expected number of participants with ability level \(\theta_k\), and \(r_{jk}\) is the expected number of participants with ability level \(\theta_k\) endorsing item \(j\). Then the estimating equations are

\[ \begin{aligned} \sum_k [\bar r_{jk} - \bar n_k P_j(\theta_k)] = 0 \\ \sum_k [\bar r_{jk} - \bar n_k P_j(\theta_k)] \theta_k = 0 \end{aligned} \]

for \(j = 1, \ldots, p\) where \(p\) is the number of items. These can be solved using non-linear solvers.

Estimating a 2-PL Model with EM in Julia

Here is my attempt to implement the EM algorithm in Julia, following the steps laid out in Harwell (1988). First, load the packages

using LinearAlgebra, LogExpFunctions
using FastGaussQuadrature: gausshermite
using NLsolve
using BenchmarkTools

Find \(\bar r_{jk}\) and \(\bar n_k\)

# Helper for computing logits: ηᵢⱼ = aⱼθ + dⱼ
function compute_logits(θ, a, d)
    [θ[i] * a[j] + d[j]
     for i = eachindex(θ), j = eachindex(a)]
end
compute_logits (generic function with 1 method)
function eloglik_2pl_em(y, n, θ, w, parₜ)
    num_items = size(y, 2)
    aₜ = parₜ[1:num_items]
    dₜ = parₜ[num_items+1:end]
    ηₜ = compute_logits(θ, aₜ, dₜ)
    sum1pexpη = sum(log1pexp, ηₜ, dims=2)
    wpy_given_θ = Matrix{eltype(aₜ)}(undef, length(θ), length(n))
    for l in eachindex(n)
        wpy_given_θ[:, l] = w .* exp.(ηₜ * view(y, l, :) .- sum1pexpη)
    end
    pθ_given_y = wpy_given_θ ./ sum(wpy_given_θ, dims=1)
    (bar_nₖ=pθ_given_y * n,
        bar_rⱼₖ=pθ_given_y * (n .* y))
end
eloglik_2pl_em (generic function with 1 method)
# Test:
gh15 = gausshermite(15)  # 15 quadrature points
([-4.499990707309391, -3.669950373404453, -2.9671669279056054, -2.3257324861738606, -1.7199925751864926, -1.136115585210924, -0.5650695832555779, -3.552713678800501e-15, 0.5650695832555779, 1.136115585210924, 1.7199925751864926, 2.3257324861738606, 2.9671669279056054, 3.669950373404453, 4.499990707309391], [1.5224758042535368e-9, 1.0591155477110773e-6, 0.00010000444123250024, 0.0027780688429127607, 0.030780033872546228, 0.15848891579593563, 0.41202868749889865, 0.5641003087264175, 0.41202868749889865, 0.15848891579593563, 0.030780033872546228, 0.0027780688429127607, 0.00010000444123250024, 1.0591155477110773e-6, 1.5224758042535368e-9])
gh15_node = gh15[1] .*2
15-element Vector{Float64}:
 -6.363947888829838
 -5.190093591304782
 -4.196207711269019
 -3.289082424398771
 -2.4324368270097634
 -1.6067100690287344
 -0.7991290683245511
 -5.0242958677880805e-15
  0.7991290683245511
  1.6067100690287344
  2.4324368270097634
  3.289082424398771
  4.196207711269019
  5.190093591304782
  6.363947888829838
gh15_weight = gh15[2] ./ √π
15-element Vector{Float64}:
 8.589649899633383e-10
 5.975419597920666e-7
 5.642146405189039e-5
 0.0015673575035499477
 0.01736577449213769
 0.08941779539984435
 0.23246229360973225
 0.31825951825951826
 0.23246229360973225
 0.08941779539984435
 0.01736577449213769
 0.0015673575035499477
 5.642146405189039e-5
 5.975419597920666e-7
 8.589649899633383e-10
exp1 = eloglik_2pl_em(Matrix(lsat[:, 1:5]), lsat[:, 6],
    gh15_node, gh15_weight,
    [ones(5); zeros(5)])
(bar_nₖ = [2.7564518778742383e-8, 2.0114861555321952e-5, 0.002128243156556154, 0.07560094100777943, 1.3352261691433547, 14.249248990842812, 89.7255889231157, 278.1775106946097, 365.8604302697971, 199.39830527246278, 46.43045474007526, 4.57238810409275, 0.17124982332670236, 0.001845015100526724, 2.6708429185723037e-6], bar_rⱼₖ = [4.138800953383068e-10 4.341143688898378e-11 … 8.455317850390647e-11 2.4966992659649684e-10; 9.488948442862142e-7 1.0961092559330794e-7 … 2.0276924419430197e-7 5.791071220863187e-7; … ; 0.0018436478096575996 0.0018377274514118978 … 0.0018394578991136525 0.0018424609777163434; 2.670229820441103e-6 2.667573675000273e-6 … 2.6683500732256074e-6 2.669698207072498e-6])

The output is a tuple, with the first element bar_nₖ being a \(q\) \(\times\) 1 vector and bar_rⱼₖ being a \(q\) \(\times\) \(p\) matrix.

Solve estimating equations

function compute_probs(θ, a, d)
    [logistic(θ[i] * a[j] + d[j]) for i = eachindex(θ), j = eachindex(a)]
end
compute_probs (generic function with 1 method)
function esteq_2pl_em(par, bar_r, bar_n, θ)
    num_items = size(bar_r, 2)
    a = par[1:num_items]
    d = par[num_items+1:end]
    rmntpθ = bar_r .- bar_n .* compute_probs(θ, a, d)
    vec([sum(rmntpθ, dims=1) θ' * rmntpθ])
end
esteq_2pl_em (generic function with 1 method)
# Test:
root1 = nlsolve(x -> esteq_2pl_em(x, 
    exp1.bar_rⱼₖ, exp1.bar_nₖ, gh15_node),
    [ones(5); zeros(5)])
Results of Nonlinear Solver Algorithm
 * Algorithm: Trust-region with dogleg and autoscaling
 * Starting Point: [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 * Zero: [0.9441787368725837, 0.9412834339373086, 0.9702850411757558, 0.932179532744094, 0.9197528908890249, 2.166166737455263, 0.41020205101729856, -0.37879989990817486, 0.7262454452107591, 1.5321165384312954]
 * Inf-norm of residuals: 0.000000
 * Iterations: 6
 * Convergence: true
   * |x - x'| < 0.0e+00: false
   * |f(x)| < 1.0e-08: true
 * Function Calls (f): 7
 * Jacobian Calls (df/dx): 7

The solution is contained in the zero field. These will be passed back to the E-step.

Iterations

We can do two more iterations:

exp2 = eloglik_2pl_em(Matrix(lsat[:, 1:5]), lsat[:, 6],
    gh15_node, gh15_weight,
    root1.zero)
(bar_nₖ = [1.7202823621463788e-7, 0.00012478229640175463, 0.012917226708649957, 0.42950281900235543, 6.374685649928196, 47.54065170285358, 177.55073039082802, 324.1944490261614, 289.0821550106182, 125.99472138888558, 26.268899898287724, 2.460042585603299, 0.09015529832828216, 0.0009626597451594212, 1.3887248384670485e-6], bar_rⱼₖ = [5.372365387113749e-9 6.042186880320284e-10 … 1.1967481795709406e-9 3.692939772198374e-9; 1.1195428419388414e-5 1.486089377064305e-6 … 2.671355380687325e-6 7.62773320369415e-6; … ; 0.0009620030652981468 0.000959109431245618 … 0.0009598307020344003 0.0009612789437790556; 1.3884114755863066e-6 1.3870244275199353e-6 … 1.3873554006168715e-6 1.3880472692324014e-6])
root2 = nlsolve(x -> esteq_2pl_em(x, 
    exp2.bar_rⱼₖ, exp2.bar_nₖ, gh15_node),
    root1.zero)
Results of Nonlinear Solver Algorithm
 * Algorithm: Trust-region with dogleg and autoscaling
 * Starting Point: [0.9441787368725837, 0.9412834339373086, 0.9702850411757558, 0.932179532744094, 0.9197528908890249, 2.166166737455263, 0.41020205101729856, -0.37879989990817486, 0.7262454452107591, 1.5321165384312954]
 * Zero: [0.8851376144605885, 0.8780270229700008, 0.9305712408536134, 0.8617362606683743, 0.8410113215381403, 2.5424571803032903, 0.781691569809412, -0.0034125639910551494, 1.0943152659197337, 1.8930834808194317]
 * Inf-norm of residuals: 0.000000
 * Iterations: 4
 * Convergence: true
   * |x - x'| < 0.0e+00: false
   * |f(x)| < 1.0e-08: true
 * Function Calls (f): 5
 * Jacobian Calls (df/dx): 5
exp3 = eloglik_2pl_em(Matrix(lsat[:, 1:5]), lsat[:, 6],
    gh15_node, gh15_weight,
    root2.zero)
(bar_nₖ = [4.3023109522800455e-7, 0.0003040218519114245, 0.02987372125524563, 0.9055221912858598, 11.602389316066303, 71.03280737295191, 214.59437201659995, 325.67522637877084, 253.23782066569376, 101.01070654940031, 20.01802278052005, 1.826159972919833, 0.06609181187521956, 0.000701760614451534, 1.0099632324488185e-6], bar_rⱼₖ = [2.0531055047839344e-8 2.5014192320684056e-9 … 4.965693882437507e-9 1.5557900531690245e-8; 3.8265586589167865e-5 5.753192037170773e-6 … 1.0076467391887043e-5 2.8241538126234826e-5; … ; 0.0007012180412131716 0.0006987629848589311 … 0.0006992811247620892 0.0007004963056719968; 1.00968636885505e-6 1.0084204735673418e-6 … 1.008662595953656e-6 1.0092845394530493e-6])
root3 = nlsolve(x -> esteq_2pl_em(x, 
    exp3.bar_rⱼₖ, exp3.bar_nₖ, gh15_node),
    root2.zero)
Results of Nonlinear Solver Algorithm
 * Algorithm: Trust-region with dogleg and autoscaling
 * Starting Point: [0.8851376144605885, 0.8780270229700008, 0.9305712408536134, 0.8617362606683743, 0.8410113215381403, 2.5424571803032903, 0.781691569809412, -0.0034125639910551494, 1.0943152659197337, 1.8930834808194317]
 * Zero: [0.8479491477352421, 0.8323916555812338, 0.902644467313084, 0.8111833641429015, 0.7863596508859064, 2.6865158580988453, 0.9283333274643143, 0.1559502636058875, 1.2352944393730914, 2.0242412851515406]
 * Inf-norm of residuals: 0.000000
 * Iterations: 4
 * Convergence: true
   * |x - x'| < 0.0e+00: false
   * |f(x)| < 1.0e-08: true
 * Function Calls (f): 5
 * Jacobian Calls (df/dx): 5

Stopping criteria

One way to stop the iteration is when the absolute change in the parameter estimates is less than a certain threshold (e.g., 0.00001). For example, the following shows the maximum absolute change in the parameter estimates from the second and the third iteration:

maximum(abs.(root3.zero .- root2.zero))
0.15936282759694267

which is pretty large. So we should do more iterations.

Benchmarking

We can wrap the steps into a function

function estimate_2pl_em(y, n, init, 
    n_quadpts=101, par_tol=1e-5, rtol=1e-5, max_iter=1000)
    parₜ = init
    parₜ₊₁ = parₜ
    # Convert y to matrix
    y = Matrix(y)
    # Obtain quadrature nodes and weights
    ghq = gausshermite(n_quadpts)
    ghq_θ = ghq[1] .*2
    ghq_w = ghq[2] ./ √π
    i = 1
    while i < max_iter
        expₜ = eloglik_2pl_em(y, n,
    ghq_θ, ghq_w, parₜ)
        root = nlsolve(x -> esteq_2pl_em(x, 
    expₜ.bar_rⱼₖ, expₜ.bar_nₖ, ghq_θ),
            parₜ, autodiff=:forward)
        parₜ₊₁ = root.zero
        if maximum(abs.(parₜ₊₁ - parₜ)) < par_tol
            break
        else
            parₜ = parₜ₊₁
            i += 1
        end
    end
    (estimate=parₜ₊₁, num_iter=i)
end
estimate_2pl_em (generic function with 5 methods)
@btime est_em = estimate_2pl_em(lsat[:, 1:5], lsat[:, 6], [ones(5); zeros(5)])
  16.968 ms (16212 allocations: 29.76 MiB)

(estimate = [0.8256464036061023, 0.7227767138869371, 0.8907854481666115, 0.6883896452213724, 0.6568975184341934, 2.773226187628446, 0.9902095352382095, 0.24914112339506816, 1.284763785097769, 2.0532889359951128], num_iter = 66)

Compare to mirt

data("LSAT", package = "ltm")
library(mirt)
bench::mark(
    mirt = mirt(LSAT,
                verbose = FALSE, quadpts = 101,
                TOL = 1e-5)
)
# A tibble: 1 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 mirt          107ms    107ms      9.38    22.4MB     28.1

Remark

My implementation of the EM takes a shorter time to run than direct MML (see my post in Part 1), but it does not compute the standard errors. Also, it probably uses a different convergence criterion than direct MML using Optim.jl, so it’s hard to say which one is faster.