library(cmdstanr)
register_knitr_engine()
options(mc.cores = 4)
library(lavaan)
library(mvtnorm, include.only = c("dmvnorm"))
library(ggplot2)
library(bayesplot)
Bayesian polychoric correlation with Stan, using multinomial likelihood
This post shows something I came up with about estimating the polychoric correlation in Stan. I previously discussed that in maximum likelihood. While the polychoric correlation is commonly used in SEM, surprisingly there hasn’t been a lot of discussion of it, especially in the Stan community (e.g., https://discourse.mc-stan.org/t/estimating-polychoric-correlation-matrix-with-stan/18797/4). With some exploration, I found using the bivariate normal probability is more efficient and scalable for this purpose, as shown below.
For more details of polychoric correlation, see the previous blog post.
Example data
We will use the first two variables.
# Holzinger and Swineford (1939) example
<- lavaan::HolzingerSwineford1939[, c("x1", "x2", "x3")]
HS3 # ordinal version, with three categories
<- as.data.frame(lapply(HS3, function(v) {
HS3ord as.ordered(cut(v, breaks = 3, labels = FALSE))
}))# Contingency table
table(HS3ord[1:2])
x2
x1 1 2 3
1 4 19 3
2 12 162 41
3 2 33 25
# Correlation with lavaan
lavCor(HS3ord[1:2])
x1 x2
x1 1.000
x2 0.317 1.000
# Get standard errors
# polychoric correlations, two-stage estimation
<- lavCor(HS3ord[1:2],
pcor_lavaan se = "robust.sem", output = "fit"
)subset(
parameterEstimates(pcor_lavaan),
== "~~" # polychoric correlations and thresholds
op )
lhs op rhs est se z pvalue ci.lower ci.upper
1 x1 ~~ x1 1.000 0.00 NA NA 1.00 1.000
2 x2 ~~ x2 1.000 0.00 NA NA 1.00 1.000
3 x1 ~~ x2 0.317 0.07 4.534 0 0.18 0.455
Bayesian model
Given the contingency table, our goal is to estimate the threshold vector
with
Then, the vector of observed counts (
Implementation in Stan
The main challenge lies in computing the bivariate normal cdf, as it is currently not implemented in Stan. Fortunately, the Stan User’s Guide provides an implementation (https://mc-stan.org/docs/stan-users-guide/custom-probability.html#bivariate-normal-cumulative-distribution-function), which we adapted in the code below:
//
// This Stan program estimates the polychoric correlation between two
// ordinal variables.
//
functions {
// from https://mc-stan.org/docs/stan-users-guide/custom-probability.html
real binormalcdf(real z1, real z2, real rho) {
if (z1 == negative_infinity() || z2 == negative_infinity()) {
return 0.0;
}if (rho == 1) {
return z1 > z2 ? Phi(z2) : Phi(z1);
}if (rho == -1) {
real b = Phi(z2) + Phi(z1) - 1;
return b < 0 ? 0 : b;
}if (z1 != 0 || z2 != 0) {
real denom = abs(rho) < 1.0 ? sqrt((1 + rho) * (1 - rho))
: not_a_number();real a1 = (z2 / z1 - rho) / denom;
real a2 = (z1 / z2 - rho) / denom;
real product = z1 * z2;
real delta = product < 0 || (product == 0 && (z1 + z2) < 0);
return 0.5 * (Phi(z1) + Phi(z2) - delta)
- owens_t(z1, a1) - owens_t(z2, a2);
}return 0.25 + asin(rho) / (2 * pi());
}matrix contingency_prob(vector b1, vector b2, real rho) {
int k = size(b1) + 1;
matrix[k, k] prob;
matrix[k, k] cprob;
for (j in 1:k) {
for (i in 1:k) {
if (i == k && j == k) {
1.0;
cprob[i, j] = else if (i == k && j < k) {
}
cprob[i, j] = Phi(b2[j]);else if (j == k && i < k) {
}
cprob[i, j] = Phi(b1[i]);else {
}
cprob[i, j] = binormalcdf(b1[i], b2[j], rho);
}if (j == 1 && i == 1) {
prob[i, j] = cprob[i, j];else if (j == 1 && i > 1) {
} 1, j];
prob[i, j] = cprob[i, j] - cprob[i - else if (j > 1 && i == 1) {
} 1];
prob[i, j] = cprob[i, j] - cprob[i, j - else {
} 1, j] - cprob[i, j - 1] +
prob[i, j] = cprob[i, j] - cprob[i - 1, j - 1];
cprob[i -
}
}
}return prob;
}
}data {
int<lower=0> k; // number of categories;
array[k * k] int<lower=0> ns;
}parameters {
real<lower=0,upper=1> scaled_rho;
array[2] ordered[k - 1] thres;
}transformed parameters {
matrix[k, k] theta;
real rho = scaled_rho * 2 - 1;
1], thres[2], rho);
theta = contingency_prob(thres[
}// The model to be estimated.
model {
2, 2);
scaled_rho ~ beta(
ns ~ multinomial(to_vector(theta)); }
The contingency_prob
Stan function computes the bivariate normal probabilities for each cell of the contingency table.
<- cmdstan_model("marginal_ordinal.stan")
pc <- pc$sample(
pc_fit data = list(k = 3, ns = as.integer(table(HS3ord[1:2]))),
refresh = 1000
)
Running MCMC with 4 parallel chains...
Chain 1 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[2] = -4.0503e-18, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[3] = -5.42075e-18, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[9] = -4.4886e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[6] = -2.86229e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1
Chain 2 Rejecting initial value:
Chain 2 Log probability evaluates to log(0), i.e. negative infinity.
Chain 2 Stan can't start sampling from this initial value.
Chain 2 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 2 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 2 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[8] = -1.11022e-16, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[3] = -4.51028e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[8] = -2.60209e-18, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[3] = -5.31259e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2
Chain 3 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 3 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[1] = -2.21711e-291, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[1] = -7.08697e-32, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[7] = -4.16334e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[8] = -2.77556e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[1] = -3.2255e-18, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3
Chain 4 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 4 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[1] = -1.94908e-269, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 4
Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[5] = -1.35525e-20, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 4
Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: multinomial_lpmf: Probabilities parameter is not a valid simplex. Probabilities parameter[7] = -1.21431e-17, but should be greater than or equal to 0 (in '/tmp/RtmpNJIUGM/model-d705e16038741.stan', line 76, column 2 to column 37)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 4
Chain 1 finished in 0.1 seconds.
Chain 2 finished in 0.1 seconds.
Chain 3 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3 finished in 0.2 seconds.
Chain 4 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4 finished in 0.3 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.2 seconds.
Total execution time: 0.4 seconds.
$print("rho") pc_fit
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
rho 0.31 0.31 0.07 0.07 0.18 0.43 1.00 3102 2624
$draws("rho") |>
pc_fitmcmc_hist()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As shown above, the estimation is very quick. The method is likely scalable with the use of the multinomial distribution, with computing time increasing as a function of the number of categories, instead of the sample sizes.
In comparison, the alternative approach in Stan forum takes about 15 seconds, so the current approach is about 30 times faster in this example.