Bayesian Polychoric Correlation

Bayesian polychoric correlation with Stan, using multinomial likelihood

Author

Mark Lai

Published

March 18, 2025

This post shows something I came up with about estimating the polychoric correlation in Stan. I previously discussed that in maximum likelihood. While the polychoric correlation is commonly used in SEM, surprisingly there hasn’t been a lot of discussion of it, especially in the Stan community (e.g., https://discourse.mc-stan.org/t/estimating-polychoric-correlation-matrix-with-stan/18797/4). With some exploration, I found using the bivariate normal probability is more efficient and scalable for this purpose, as shown below.

Note

For more details of polychoric correlation, see the previous blog post.

library(cmdstanr)
register_knitr_engine()
options(mc.cores = 4)
library(lavaan)
library(mvtnorm, include.only = c("dmvnorm"))
library(ggplot2)
library(bayesplot)

Example data

We will use the first two variables.

# Holzinger and Swineford (1939) example
HS3 <- lavaan::HolzingerSwineford1939[, c("x1", "x2", "x3")]
# ordinal version, with three categories
HS3ord <- as.data.frame(lapply(HS3, function(v) {
    as.ordered(cut(v, breaks = 3, labels = FALSE))
}))
# Contingency table
table(HS3ord[1:2])
   x2
x1    1   2   3
  1   4  19   3
  2  12 162  41
  3   2  33  25
# Correlation with lavaan
lavCor(HS3ord[1:2])
      x1    x2
x1 1.000      
x2 0.317 1.000
# Get standard errors
# polychoric correlations, two-stage estimation
pcor_lavaan <- lavCor(HS3ord[1:2],
    se = "robust.sem", output = "fit"
)
subset(
    parameterEstimates(pcor_lavaan),
    op == "~~" # polychoric correlations and thresholds
)
  lhs op rhs   est   se     z pvalue ci.lower ci.upper
1  x1 ~~  x1 1.000 0.00    NA     NA     1.00    1.000
2  x2 ~~  x2 1.000 0.00    NA     NA     1.00    1.000
3  x1 ~~  x2 0.317 0.07 4.534      0     0.18    0.455

Bayesian model

Given the contingency table, our goal is to estimate the threshold vector τi for each item i, and the correlation ρ between the underlying latent variates, X, assumed to be bivariate normal with means 0 and SDs of 1. Let F(x)=P(Xx) be the cumulative distribution function (CDF) of the bivariate normal. Then, given the thresholds and ρ, the probability of observing the discrete data vector (X1 = j, X2 = k) is πjk, where

πjk=P(τ1j1X1τ1j,τ2k1X2τ2k)=F(τ1j,τ2k)F(τ1j1,τ2k)F(τ1j,τ2k1)+F(τ1j1,τ2k1),

with τ10 = τ20 = and τ1J = τ2K = , where J and K are the number of categories.

Then, the vector of observed counts (n11, n21, , n12, n22, , nJK) can be modelled as a multinomial distribution with probabilities (π11, π21, , π12, π22, , πJK), where the π’s sum to 1.

Implementation in Stan

The main challenge lies in computing the bivariate normal cdf, as it is currently not implemented in Stan. Fortunately, the Stan User’s Guide provides an implementation (https://mc-stan.org/docs/stan-users-guide/custom-probability.html#bivariate-normal-cumulative-distribution-function), which we adapted in the code below:

//
// This Stan program estimates the polychoric correlation between two
// ordinal variables. 
//
functions {
  // from https://mc-stan.org/docs/stan-users-guide/custom-probability.html
  real binormalcdf(real z1, real z2, real rho) {
    if (z1 == negative_infinity() || z2 == negative_infinity()) {
      return 0.0;
    }
    if (rho == 1) {
      return z1 > z2 ? Phi(z2) : Phi(z1);
    }
    if (rho == -1) {
      real b = Phi(z2) + Phi(z1) - 1;
      return b < 0 ? 0 : b;
    }
    if (z1 != 0 || z2 != 0) {
      real denom = abs(rho) < 1.0 ? sqrt((1 + rho) * (1 - rho))
                                   : not_a_number();
      real a1 = (z2 / z1 - rho) / denom;
      real a2 = (z1 / z2 - rho) / denom;
      real product = z1 * z2;
      real delta = product < 0 || (product == 0 && (z1 + z2) < 0);
      return 0.5 * (Phi(z1) + Phi(z2) - delta)
                   - owens_t(z1, a1) - owens_t(z2, a2);
    }
    return 0.25 + asin(rho) / (2 * pi());
  }
  matrix contingency_prob(vector b1, vector b2, real rho) {
    int k = size(b1) + 1;
    matrix[k, k] prob;
    matrix[k, k] cprob;
    for (j in 1:k) {
      for (i in 1:k) {
        if (i == k && j == k) {
          cprob[i, j] = 1.0;
        } else if (i == k && j < k) {
          cprob[i, j] = Phi(b2[j]);
        } else if (j == k && i < k) {
          cprob[i, j] = Phi(b1[i]);
        } else {
          cprob[i, j] = binormalcdf(b1[i], b2[j], rho);
        }
        if (j == 1 && i == 1) {
          prob[i, j] = cprob[i, j];
        } else if (j == 1 && i > 1) {
          prob[i, j] = cprob[i, j] - cprob[i - 1, j];
        } else if (j > 1 && i == 1) {
          prob[i, j] = cprob[i, j] - cprob[i, j - 1];
        } else {
          prob[i, j] = cprob[i, j] - cprob[i - 1, j] - cprob[i, j - 1] +
                         cprob[i - 1, j - 1];
        }
      }
    }
    return prob;
  }
}
data {
  int<lower=0> k;  // number of categories;
  array[k * k] int<lower=0> ns;
}
parameters {
  real<lower=0,upper=1> scaled_rho;
  array[2] ordered[k - 1] thres;
}
transformed parameters {
  matrix[k, k] theta;
  real rho = scaled_rho * 2 - 1;
  theta = contingency_prob(thres[1], thres[2], rho);
}
// The model to be estimated. 
model {
  scaled_rho ~ beta(2, 2);
  ns ~ multinomial(to_vector(theta));
}

The contingency_prob Stan function computes the bivariate normal probabilities for each cell of the contingency table.

pc <- cmdstan_model("marginal_ordinal.stan")
pc_fit <- pc$sample(
    data = list(k = 3, ns = as.integer(table(HS3ord[1:2]))),
    refresh = 1000
)
Running MCMC with 4 parallel chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[2] = -4.0503e-18, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[3] = -5.42075e-18, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[9] = -4.4886e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[6] = -2.86229e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Chain 2 Rejecting initial value:
Chain 2   Log probability evaluates to log(0), i.e. negative infinity.
Chain 2   Stan can't start sampling from this initial value.
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[8] = -1.11022e-16, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2 
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[3] = -4.51028e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2 
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[8] = -2.60209e-18, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2 
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[3] = -5.31259e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2 
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[1] = -2.21711e-291, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[1] = -7.08697e-32, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[7] = -4.16334e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[8] = -2.77556e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[1] = -3.2255e-18, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[1] = -1.94908e-269, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 4 
Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[5] = -1.35525e-20, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 4 
Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[7] = -1.21431e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 4 
Chain 1 finished in 0.1 seconds.
Chain 2 finished in 0.1 seconds.
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 finished in 0.2 seconds.
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 finished in 0.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.2 seconds.
Total execution time: 0.4 seconds.
pc_fit$print("rho")
 variable mean median   sd  mad   q5  q95 rhat ess_bulk ess_tail
      rho 0.31   0.31 0.07 0.07 0.18 0.43 1.00     3102     2624
pc_fit$draws("rho") |>
    mcmc_hist()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As shown above, the estimation is very quick. The method is likely scalable with the use of the multinomial distribution, with computing time increasing as a function of the number of categories, instead of the sample sizes.

In comparison, the alternative approach in Stan forum takes about 15 seconds, so the current approach is about 30 times faster in this example.