Theoretical Framework for Bayesian Hierarchical Two-Parameter Logistic Item Response Models

Abstract

This paper provides a theoretical formulation of the hierarchical Two-Parameter Logistic (2PL) Item Response Theory (IRT) model within a Bayesian framework. The model introduces hierarchical priors on respondent abilities \( \theta \), and informative priors for item discrimination \( a \) and difficulty \( b \), allowing for more flexible and realistic modeling of latent traits and item characteristics. To overcome the computational challenges posed by the high-dimensional posterior, we employ Automatic Differentiation Variational Inference (ADVI), which offers a scalable and efficient alternative to traditional Markov Chain Monte Carlo (MCMC) methods. We present a complete derivation of the model's probabilistic structure, including both prior and likelihood functions, and detail the variational inference process central to ADVI. The mathematical foundations are carefully developed, providing a clear and formal account of the inferential methodology. This work focuses on theoretical aspects, offering a comprehensive mathematical exposition without empirical applications or simulation studies.

Keywords: item response theory, two-parameter logistic model, Bayesian inference, hierarchical model, variational inference

1. Introduction

Item Response Theory (IRT) remains a cornerstone of psychometrics and educational assessment, modeling the interaction between latent traits—such as ability, knowledge, or proficiency—and observed item responses (van der Linden & Hambleton, 2013). This probabilistic approach is highly versatile and central to test development, adaptive testing, and psychometric research (Embretson & Reise, 2013).

The Rasch model, or the one-parameter logistic (1PL) model, is a seminal model in IRT that assumes all items discriminate between respondents with equal precision (Rasch, 1960). While the Rasch model's simplicity offers certain advantages, such as sufficiency of raw scores, it fails to capture the variability in item discrimination, which limits its flexibility (Lord, 1980).

The Two-Parameter Logistic (2PL) model enhances the Rasch model by introducing item-specific discrimination parameters, which allow it to account for differences in how sensitive items are to respondents' abilities. This improvement makes the 2PL model more adaptable for a wider range of item types and testing contexts (Birnbaum, 1968).

1.1 The Two-Parameter Logistic (2PL) Model

In the 2PL model, the probability of a correct response depends on both item difficulty and discrimination. The logistic function governing this probability ensures that each item’s sensitivity to ability changes is captured through the discrimination parameter, while the difficulty parameter represents the ability level where a 50% chance of success occurs.

The role of the parameters in this model can be summarized as:

  • The respondent's latent ability influences the likelihood of a correct response.
  • The discrimination parameter indicates how sharply the probability of success changes as a respondent's ability increases, with higher values indicating greater sensitivity.
  • The difficulty parameter defines the ability level needed for a 50% chance of answering correctly.

Higher discrimination values lead to steeper probability curves around an item’s difficulty level, while lower values result in flatter curves, indicating less sensitivity to differences in ability.

1.2 Limitations of the Standard 2PL Model

Although the 2PL model addresses some limitations of the Rasch model, it still assumes that respondents' abilities are independently and identically distributed (i.i.d.), often from a common prior distribution. This assumption neglects potential structure in the data, such as shared latent traits among certain respondent groups (e.g., based on demographics). Furthermore, the standard 2PL model assumes that item parameters (discrimination and difficulty) are independent, which may not hold when items measure related constructs (Mislevy & Verhelst, 1990).

Recent studies have highlighted the importance of hierarchical IRT models in addressing these limitations by introducing groupings of respondents or items, allowing more complex relationships between parameters (König et al., 2022; Fox & Glas, 2001). These models can account for shared characteristics among groups of respondents or items, such as socio-economic status or performance patterns, leading to more accurate estimation of latent traits.

1.3 The Hierarchical Bayesian 2PL Model

The hierarchical Bayesian 2PL model extends the standard 2PL framework by introducing hierarchical priors that capture heterogeneity among groups of respondents or items. This structure enables partial pooling of information, where group-specific estimates borrow strength from overall population trends while retaining their unique characteristics (Fox, 2010). This is particularly important when dealing with small sample sizes, where traditional maximum likelihood (ML) estimation methods often fail to provide reliable estimates due to data sparsity (Gelman et al., 2013).

Recent research has highlighted the significance of selecting appropriate priors in Bayesian IRT models. Zitzmann, Helm, and Hecht (2021) suggest that slightly informative priors can help stabilize estimates and reduce the Mean Squared Error (MSE) in small sample settings. Their study emphasizes that informative priors enhance parameter recovery, especially when working with sparse data or clustered designs.

However, using priors comes with potential drawbacks. Marcoulides (2018) demonstrated that priors—whether informative or non-informative—can introduce varying levels of bias in the estimation of item parameters in 2PL models. Depending on the specific design conditions, this bias can range from moderate to severe. Therefore, researchers must carefully balance the benefits of stabilized estimates against the risk of introducing bias, particularly when dealing with complex hierarchical models.

Zitzmann et al. (2021) further argue that when working with small samples, the variability of estimates may be more important than bias. While Bayesian approaches can introduce some level of bias, they often provide more accurate estimates (in terms of lower MSE) compared to ML methods, which may suffer from greater variability and inconsistency in small sample contexts. In fact, even when biased, Bayesian estimates can outperform ML by providing greater overall stability, especially when minimizing error is prioritized over eliminating bias entirely.

Smid et al. (2020) conducted a systematic review comparing Bayesian and frequentist estimation methods in small sample contexts, reinforcing the notion that default, non-informative priors in Bayesian models can lead to severe bias—often more so than frequentist methods. They recommend incorporating well-constructed priors based on previous knowledge to mitigate this bias. The study underscores the need for caution when applying Bayesian methods with default priors in small sample scenarios, encouraging researchers to make thoughtful decisions regarding prior construction to ensure stable and accurate estimates, particularly in multilevel or hierarchical models.

1.4 Variational Inference for Posterior Approximation

While hierarchical Bayesian models provide a powerful framework for IRT, their complexity often renders exact inference methods, like Markov Chain Monte Carlo (MCMC), computationally expensive. This is particularly true for large datasets. As a more efficient alternative, Automatic Differentiation Variational Inference (ADVI) approximates the true posterior using a simpler variational distribution, optimizing the Evidence Lower Bound (ELBO) to make the process scalable for large datasets (Blei et al., 2017; Kucukelbir et al., 2017).

1.5 Overview of the Article

This article offers an in-depth theoretical exploration of the hierarchical Bayesian 2PL model, focusing on the following aspects:

  • A formal definition of the likelihood function and the role of logistic functions in item response modeling.
  • An explanation of hierarchical priors for respondent abilities and item parameters, and how these enable more flexible and accurate modeling of complex data structures.
  • A discussion of variational inference techniques, with an emphasis on ADVI and its use in efficiently approximating the posterior distribution for large-scale problems.

By the conclusion of this article, readers will have a deep understanding of the hierarchical Bayesian 2PL model and its advantages over traditional IRT models in accommodating more complex relationships between test items and respondents.

2.1 Logistic Model for Item Response

The Two-Parameter Logistic (2PL) model is a foundational component of Item Response Theory (IRT) that models the probability that a respondent will answer an item correctly as a function of their latent ability and the item's characteristics (Birnbaum, 1968). Specifically, for each respondent \( i \) and item \( j \), the 2PL model expresses the probability of a correct response \( P_{ij} \) through a logistic function, which captures the interaction between the respondent's ability \( \theta_i \), the item's difficulty \( b_j \), and the item's discrimination \( a_j \). The mathematical form of this probability is:

\[ P_{ij}(\theta_i, a_j, b_j) = \frac{1}{1 + \exp(-D \cdot a_j \cdot (\theta_i - b_j))}, \]

where:

  • \( \theta_i \in \mathbb{R} \) is the latent ability of respondent \( i \) (Lord & Novick, 1968),
  • \( a_j \in \mathbb{R}^{+} \) is the discrimination parameter of item \( j \),
  • \( b_j \in \mathbb{R} \) is the difficulty parameter of item \( j \),
  • \( D = 1.702 \) is a scaling constant that aligns the logistic function more closely with the cumulative normal distribution for comparative purposes in psychometrics (Baker & Kim, 2004).
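As a concrete illustration, the following minimal Python sketch evaluates this probability with NumPy; the function name and example values are ours, chosen for illustration only.

    import numpy as np

    def p_correct(theta, a, b, D=1.702):
        """2PL probability of a correct response; accepts scalars or arrays."""
        return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

    # At theta = b the probability is exactly 0.5, regardless of a:
    print(p_correct(theta=0.0, a=1.5, b=0.0))   # 0.5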

2.1.1 Role of Parameters

The logistic model defines the probability \( P_{ij} \) as a function of the latent ability \( \theta_i \), item discrimination \( a_j \), and item difficulty \( b_j \). Each of these parameters plays a distinct role in shaping the probability of a correct response:

  • Latent Ability \( \theta_i \): The latent ability \( \theta_i \) represents the unobserved trait or ability level of respondent \( i \), which determines how likely they are to answer an item correctly. This ability is typically assumed to follow a normal distribution in many IRT models (Fox, 2010).
  • Item Discrimination \( a_j \): The item discrimination parameter \( a_j \) controls the slope of the logistic function, determining how sharply the probability of a correct response changes with respect to changes in the respondent's ability \( \theta_i \). Mathematically, the slope of the logistic function at the point where \( \theta_i = b_j \) is proportional to \( a_j \), meaning that:

\[ \frac{\partial P_{ij}}{\partial \theta_i} \bigg|_{\theta_i = b_j} \propto a_j. \]

  • Item Difficulty \( b_j \): The item difficulty parameter \( b_j \) shifts the logistic function along the ability axis. It represents the ability level at which a respondent has a 50% chance of answering the item correctly. In other words, when \( \theta_i = b_j \), the probability of a correct response is:

\[ P_{ij}(\theta_i = b_j, a_j, b_j) = \frac{1}{1 + \exp(0)} = 0.5. \]

The difficulty parameter \( b_j \) effectively serves as a threshold: the higher the value of \( b_j \), the more challenging the item, as it requires a higher ability \( \theta_i \) to achieve a 50% probability of a correct response (Baker, 2001).

2.1.2 Interpretation of the Logistic Function

The logistic function is commonly used in binary classification problems to map real-valued inputs (in this case, the difference \( \theta_i - b_j \)) to probabilities in the open interval \( (0, 1) \). For the 2PL model, the logistic function transforms the difference between the respondent's ability \( \theta_i \) and the item's difficulty \( b_j \), scaled by the item's discrimination \( a_j \), into the probability of a correct response (Embretson & Reise, 2000).

  • When \( \theta_i = b_j \), the probability of a correct response is exactly 0.5, regardless of the value of \( a_j \).
  • As \( \theta_i \) increases beyond \( b_j \), the probability of a correct response increases, and the rate of this increase depends on \( a_j \). The higher the discrimination parameter \( a_j \), the steeper the increase.
  • As \( \theta_i \) decreases below \( b_j \), the probability of a correct response approaches 0, and this decrease is sharper for larger values of \( a_j \).

2.1.3 Mathematical Properties

The logistic function used in the 2PL model possesses several desirable mathematical properties that make it suitable for modeling binary response data in psychometric applications:

  • Monotonicity: The logistic function is a strictly increasing function with respect to \( \theta_i \), meaning that as the respondent's ability increases, the probability of a correct response also increases.
  • Symmetry: The function is symmetric about \( \theta_i = b_j \), meaning that the probability of a correct response at ability \( \theta_i \) above \( b_j \) mirrors the probability of a wrong response at \( \theta_i \) below \( b_j \).
  • Asymptotic Behavior: As \( \theta_i \to \infty \), the probability \( P_{ij} \to 1 \), meaning that highly able respondents are almost certain to answer the item correctly. Similarly, as \( \theta_i \to -\infty \), the probability \( P_{ij} \to 0 \).
  • Derivatives and Information: The derivative of \( P_{ij} \) with respect to \( \theta_i \) is given by:
\[ \frac{\partial P_{ij}}{\partial \theta_i} = D \cdot a_j \cdot P_{ij}(1 - P_{ij}). \]

This derivative is useful in calculating item information, a measure of how much information an item provides about a respondent’s ability. The Fisher Information for item \( j \) at ability \( \theta_i \) is:

\[ I_{ij}(\theta_i) = D^2 \cdot a_j^2 \cdot P_{ij}(1 - P_{ij}). \]

This quantity tells us how much an item discriminates between respondents near a specific ability level. Items with higher information are more effective at distinguishing respondents' abilities (Lord, 1980).
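The information formula is straightforward to evaluate numerically. The short sketch below (function name ours) shows that information peaks at \( \theta_i = b_j \), where \( P_{ij} = 0.5 \), and decays on either side.

    import numpy as np

    def item_information(theta, a, b, D=1.702):
        """Fisher information I_j(theta) = D^2 * a^2 * P * (1 - P)."""
        p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
        return D**2 * a**2 * p * (1.0 - p)

    # Information is maximal at theta = b and decreases away from it:
    thetas = np.linspace(-4.0, 4.0, 9)
    print(item_information(thetas, a=1.2, b=0.5))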

2.2 Hierarchical Priors for Latent Abilities

In Bayesian Item Response Theory (IRT), the introduction of hierarchical priors allows for greater flexibility in modeling respondent abilities. By accounting for natural groupings within the data, hierarchical priors facilitate the sharing of information across respondents, which helps to stabilize estimates, especially in the presence of sparse data (Gelman et al., 2013; Fox, 2010). This section discusses the use of hierarchical priors for latent abilities \( \theta \) in the context of a two-parameter logistic (2PL) model.

2.2.1 Grouping Respondents by Raw Test Scores

To better capture the heterogeneity in respondent abilities, we impose a hierarchical structure on the latent ability parameters \( \theta_i \), where respondents are grouped based on observable characteristics—most commonly, their raw test scores (Lord, 1980). Let \( G \) denote the number of distinct groups, and let \( g \in \{1, 2, \dots, G\} \) index the groups. A group might correspond to respondents who share similar test scores or who belong to a common demographic cohort (Hambleton et al., 1991).

Within this framework, the mean ability of respondents in group \( g \), denoted by \( \theta_g \), is modeled as a random variable drawn from a population-level distribution. We assume that this group-level ability follows a normal distribution with parameters \( \mu_\theta \) and \( \sigma_\theta \), which capture the population-level mean and variance of abilities across all groups. Mathematically, the group-level ability is modeled as:

\[ \theta_g \sim \mathcal{N}(\mu_\theta, \sigma_\theta^2), \]

where:

  • \( \mu_\theta \in \mathbb{R} \) is the population-level mean of the abilities, representing the central tendency of the ability distribution across all groups (Fox, 2010),
  • \( \sigma_\theta \in \mathbb{R}^+ \) is the population-level standard deviation of the abilities, controlling the spread of abilities across different groups (Gelman et al., 2013).

2.2.2 Individual Abilities Within Groups

Once the group-level ability \( \theta_g \) is assigned, the ability \( \theta_i \) of an individual respondent \( i \) within group \( g \) is drawn from the group’s distribution. Specifically, we assume:

\[ \theta_i \sim \mathcal{N}(\theta_g, \tau_\theta^2), \]

where \( \tau_\theta^2 \) is the within-group variance of the abilities. This formulation allows for variability in abilities within each group while assuming that all respondents within the group share a common central tendency \( \theta_g \) (Carlin & Louis, 2000).

For many applications, \( \tau_\theta^2 \) is assumed to be small, meaning that respondents in the same group have similar abilities. In contrast, groups are allowed to differ more substantially, with the variability across groups governed by \( \sigma_\theta^2 \) (Gelman & Hill, 2006).

2.2.3 Priors for Hyperparameters

The hyperparameters \( \mu_\theta \) and \( \sigma_\theta \), which define the population-level distribution of group abilities, are also treated as random variables in the Bayesian framework. To account for uncertainty in these parameters, we assign them prior distributions, which reflect our prior beliefs about their possible values before observing the data (Gelman et al., 2013).

Prior for the Population-Level Mean \( \mu_\theta \):

We assume a normal prior for \( \mu_\theta \), centered at 0 with a standard deviation of 1:

\[ \mu_\theta \sim \mathcal{N}(0, 1). \]

This prior reflects the belief that, on average, the ability of respondents across all groups is centered around zero, which corresponds to a baseline ability level (Lee & Wagenmakers, 2014). A standard normal prior is a common weakly informative choice in Bayesian hierarchical models, allowing the data to strongly influence the posterior distribution (Gelman et al., 2017).

Prior for the Population-Level Standard Deviation \( \sigma_\theta \):

We assume a Half-Cauchy distribution for the population-level standard deviation \( \sigma_\theta \):

\[ \sigma_\theta \sim \text{HalfCauchy}(\beta=1). \]

The Half-Cauchy prior is commonly used for scale parameters in hierarchical models due to its heavy tails, which make it more robust to outliers and allow for greater uncertainty about the variability of abilities across groups (Polson & Scott, 2012). The scale parameter \( \beta=1 \) ensures that the prior density decreases slowly as \( \sigma_\theta \) increases, permitting the possibility of larger values for \( \sigma_\theta \) while still concentrating mass near smaller values.

2.2.4 Group Assignment Based on Raw Test Scores

In practice, respondents are grouped based on their raw test scores \( s_i \). For example, if a test includes 20 items, the raw test score \( s_i \) could range from 0 to 20, representing the number of items correctly answered by respondent \( i \) (Bock & Zimowski, 1997). These raw scores are then used to assign respondents to one of \( G \) groups. The group assignment is typically done by discretizing the score range or using a ranking scheme, and each respondent is assigned to a group based on their score (Lord & Novick, 1968).

Let \( g(s_i) \) denote the group to which respondent \( i \) is assigned based on their raw score \( s_i \). Then the ability \( \theta_i \) for respondent \( i \) is drawn from the group-level distribution for group \( g(s_i) \):

\[ \theta_i \sim \mathcal{N}(\theta_{g(s_i)}, \tau_\theta^2). \]

This hierarchical approach allows for partial pooling of information across respondents, as the group-level parameters \( \theta_g \) provide a common structure for abilities within each group, while still permitting individual variation through \( \tau_\theta^2 \) (Gelman et al., 2013).
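The generative structure above can be made concrete by forward simulation from the priors. The following NumPy sketch draws hyperparameters, group-level means, and individual abilities; the sample sizes, the within-group standard deviation \( \tau_\theta = 0.3 \), and the equal-width score bins are illustrative assumptions, not prescriptions.

    import numpy as np

    rng = np.random.default_rng(42)
    N, G = 200, 5            # respondents and score-based groups (illustrative)
    tau_theta = 0.3          # within-group SD, assumed small (Section 2.2.2)

    # Hyperparameter draws from their priors (Section 2.2.3)
    mu_theta = rng.normal(0.0, 1.0)              # mu_theta ~ N(0, 1)
    sigma_theta = abs(rng.standard_cauchy())     # sigma_theta ~ HalfCauchy(1)

    # Group-level mean abilities, then individual abilities within groups
    theta_g = rng.normal(mu_theta, sigma_theta, size=G)
    scores = rng.integers(0, 21, size=N)         # raw scores on a 20-item test
    group = np.digitize(scores, bins=np.linspace(0.0, 20.0, G + 1)[1:-1])
    theta_i = rng.normal(theta_g[group], tau_theta)   # theta_i ~ N(theta_g(s_i), tau^2)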

2.2.5 Implications of the Hierarchical Structure

The hierarchical prior structure has several important implications for the shrinkage of ability estimates. In hierarchical Bayesian models, estimates of individual abilities \( \theta_i \) are "shrunk" towards the group-level means \( \theta_g \), and group-level means are themselves shrunk towards the population-level mean \( \mu_\theta \) (Gelman et al., 2006). This shrinkage effect is particularly valuable in the following cases:

  • Small groups: If a group has few respondents, the hierarchical structure allows information from other groups (via the population-level parameters \( \mu_\theta \) and \( \sigma_\theta \)) to inform the estimates for that group, reducing the variance of the estimates (Carlin & Louis, 2000).
  • Sparse data: In situations where individual respondents have few responses (e.g., if they answer only a subset of items), the hierarchical model borrows strength from other respondents in the same group and across groups to produce more stable estimates (Gelman et al., 2017).

This hierarchical shrinkage improves the model's robustness, preventing overfitting and reducing the impact of noisy data. It also helps in producing more interpretable and generalizable estimates of respondent abilities (Lee & Wagenmakers, 2014).

2.3 Informative Priors for Item Parameters

In Bayesian Item Response Theory (IRT), the inclusion of informative priors for item parameters such as discrimination (\( a_j \)) and difficulty (\( b_j \)) plays a crucial role in regularizing the model and incorporating prior knowledge. Informative priors can stabilize estimation, especially when data are sparse or noisy, by constraining parameter estimates to plausible ranges (Gelman et al., 2017; Lynch, 2007). This section details the choice of informative priors for these item-specific parameters in the context of the hierarchical two-parameter logistic (2PL) model.

2.3.1 Discrimination Parameter \( a_j \)

The discrimination parameter \( a_j \) determines how well an item can differentiate between respondents of differing abilities. It controls the steepness of the item characteristic curve (ICC), with higher values of \( a_j \) indicating that the item is more sensitive to differences in respondent abilities near the difficulty threshold \( b_j \). Specifically, for a given item \( j \), a larger \( a_j \) results in a steeper ICC, meaning that small changes in the respondent’s ability \( \theta_i \) lead to large changes in the probability of a correct response. Conversely, a smaller \( a_j \) corresponds to a flatter ICC, indicating that the item is less discriminative (Baker & Kim, 2004).

In practice, the discrimination parameter must be strictly positive (\( a_j > 0 \)), since negative or zero discrimination values would be nonsensical in the context of IRT (Lord, 1980). To enforce positivity and to regularize the estimation, we impose a Gamma prior on \( a_j \). The Gamma distribution is a common choice for positive-valued parameters due to its flexibility and ability to encode a wide range of beliefs about the plausible values of \( a_j \) (Murphy, 2012).

We define the prior for \( a_j \) as:

\[ a_j \sim \Gamma(\alpha = 2, \beta = 1), \]

where:

  • \( \alpha = 2 \) is the shape parameter, which controls the skewness of the distribution. This value ensures moderate regularization, favoring small positive values but allowing larger values with some probability (Carlin & Louis, 2000).
  • \( \beta = 1 \) is the rate parameter, which inversely controls the scale of the distribution.

The Gamma(2, 1) prior concentrates its mass on moderate values of \( a_j \), reflecting the belief that most items will be positively but not extremely discriminative. This prior allows for flexibility in \( a_j \) while regularizing estimates to avoid overly large or implausible values (Gelman et al., 2013). The choice of \( \alpha = 2 \) and \( \beta = 1 \) provides a compromise between flexibility and regularization, though these hyperparameters could be tuned to reflect stronger prior beliefs or domain-specific knowledge (Kruschke, 2014).

Properties of the Gamma Prior on \( a_j \)

The Gamma distribution is defined for \( x > 0 \), with the probability density function (PDF) given by:

\[ p(a_j) = \frac{\beta^\alpha}{\Gamma(\alpha)} a_j^{\alpha-1} e^{-\beta a_j}, \]

where \( \Gamma(\alpha) \) is the gamma function.

For \( \alpha = 2 \) and \( \beta = 1 \), the mean and variance of the distribution are:

\[ \mathbb{E}[a_j] = \frac{\alpha}{\beta} = 2, \]
\[ \text{Var}(a_j) = \frac{\alpha}{\beta^2} = 2. \]

Thus, the prior reflects a belief that the expected discrimination is around 2, with a moderate variance that allows the data to strongly influence the posterior estimate. The long right tail of the Gamma distribution allows for larger values of \( a_j \) when supported by the data, ensuring that highly discriminative items can still be identified (Bishop, 2006).
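These moments are easy to verify numerically, for instance with SciPy:

    from scipy.stats import gamma

    prior_a = gamma(a=2.0, scale=1.0)       # shape alpha = 2, rate beta = 1
    print(prior_a.mean(), prior_a.var())    # 2.0 2.0, matching the formulas above
    print(prior_a.ppf(0.99))                # ~6.6: the right tail admits large a_j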

2.3.2 Difficulty Parameter \( b_j \)

The difficulty parameter \( b_j \) represents the ability level at which a respondent has a 50% probability of answering item \( j \) correctly. It shifts the item characteristic curve (ICC) along the ability axis, determining the point on the ability scale where the curve reaches its midpoint. Higher values of \( b_j \) correspond to more difficult items, requiring a higher ability \( \theta_i \) for a 50% chance of a correct response. Conversely, lower values of \( b_j \) correspond to easier items (Embretson & Reise, 2000).

The difficulty parameter \( b_j \) can take any real value, and its plausible range can vary widely depending on the nature of the test and the distribution of respondent abilities. In particular, \( b_j \) can occasionally take on extreme values due to noisy or outlier responses, making it important to use a prior that is robust to such variability (Hoff, 2009). For this reason, we impose a Student’s t-distribution prior on \( b_j \), which has heavier tails than a normal distribution, allowing for occasional extreme values while still concentrating most of the probability mass near the central location (Kruschke, 2014).

We define the prior for \( b_j \) as:

\[ b_j \sim \text{StudentT}(\nu = 3, \mu = 0, \sigma = 2), \]

where:

  • \( \nu = 3 \) is the degrees of freedom, which controls the thickness of the tails. A value of 3 provides sufficiently heavy tails to accommodate outliers without being too uninformative (Lange et al., 1989).
  • \( \mu = 0 \) is the location parameter, indicating that the difficulty values are centered around 0 on the latent ability scale.
  • \( \sigma = 2 \) is the scale parameter, reflecting the spread of the distribution and allowing for moderate variability in \( b_j \) across items (Gelman & Hill, 2006).

Properties of the Student's t-Distribution on \( b_j \)

The Student’s t-distribution with \( \nu = 3 \) degrees of freedom has heavier tails than a normal distribution, meaning that it assigns more probability to extreme values. This makes it a robust choice for modeling parameters that may occasionally exhibit outlier behavior, such as item difficulty in tests with heterogeneous populations (Carlin & Louis, 2000).

The probability density function (PDF) for the Student’s t-distribution is given by:

\[ p(b_j) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu \pi} \sigma \Gamma\left(\frac{\nu}{2}\right)} \left( 1 + \frac{(b_j - \mu)^2}{\nu \sigma^2} \right)^{-\frac{\nu + 1}{2}}. \]

For \( \nu = 3 \), \( \mu = 0 \), and \( \sigma = 2 \), the distribution has the following properties:

  • The mean is equal to 0 (when \( \nu > 1 \)),
  • The variance is given by \( \frac{\nu \sigma^2}{\nu - 2} \) (when \( \nu > 2 \)).

For \( \nu = 3 \) and \( \sigma = 2 \), the variance is:

\[ \text{Var}(b_j) = \frac{3 \cdot 2^2}{3 - 2} = 12. \]

The heavy tails of the Student’s t-distribution ensure that the prior can accommodate items with extreme difficulty values (e.g., items that are either extremely hard or extremely easy), but most of the mass is centered around \( \mu = 0 \), reflecting the belief that, on average, items will not be too extreme in difficulty (Lynch, 2007).
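Again, the stated variance and the tail behavior can be checked directly, for instance with SciPy:

    from scipy.stats import t as student_t, norm

    prior_b = student_t(df=3, loc=0.0, scale=2.0)
    print(prior_b.var())                     # 12.0 = nu * sigma^2 / (nu - 2)
    # Heavier tails than a normal with the same scale parameter:
    print(prior_b.sf(6.0))                   # ~0.029
    print(norm(loc=0.0, scale=2.0).sf(6.0))  # ~0.0013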

2.3.3 Prior Regularization and Sensitivity

The choice of priors for \( a_j \) and \( b_j \) has important implications for the stability of the model and its ability to generalize. Informative priors such as the Gamma prior for \( a_j \) and the Student's t-distribution for \( b_j \) provide regularization, preventing overfitting by constraining the parameter space to plausible values. At the same time, these priors are flexible enough to let the data dominate when appropriate (Gelman et al., 2017).

  • The Gamma prior on \( a_j \) ensures that the discrimination parameters remain positive and tend towards smaller values, which is consistent with the assumption that most items will not be highly discriminative. By allowing occasional larger values, this prior also accommodates items that are highly sensitive to differences in respondent ability (Carlin & Louis, 2000).
  • The Student's t-distribution prior on \( b_j \) allows the difficulty parameters to take on a wide range of values, accounting for potential outliers and ensuring robustness in the presence of noisy or extreme data (Lange et al., 1989).

These priors also introduce a degree of sensitivity to the model. In practice, hyperparameters such as \( \alpha \), \( \beta \), \( \nu \), \( \mu \), and \( \sigma \) can be adjusted to reflect domain-specific knowledge or to increase the regularization if overfitting is a concern (Murphy, 2012).

3.1 Likelihood Function

In the Bayesian two-parameter logistic (2PL) Item Response Theory (IRT) model, the likelihood function plays a central role in linking the observed data—namely, respondents' answers to test items—to the model parameters, which include the respondents' abilities \( \theta \), the item discriminations \( a \), and the item difficulties \( b \). This section provides a detailed breakdown of the likelihood function and its components, ensuring a clear understanding of how the probability of the observed responses is modeled (Albert, 1992; van der Linden & Hambleton, 2013).

3.1.1 Definition of the Likelihood Function

Let \( y_{ij} \in \{0, 1\} \) represent the observed response of respondent \( i \) on item \( j \), where:

  • \( y_{ij} = 1 \) indicates a correct response,
  • \( y_{ij} = 0 \) indicates an incorrect response.

Given the model structure, the probability that respondent \( i \) answers item \( j \) correctly is denoted \( P_{ij}(\theta_i, a_j, b_j) \), which is modeled by the logistic function (Baker & Kim, 2004):

\[ P_{ij}(\theta_i, a_j, b_j) = \frac{1}{1 + \exp(-D \cdot a_j \cdot (\theta_i - b_j))}, \]

where:

  • \( \theta_i \in \mathbb{R} \) is the latent ability of respondent \( i \),
  • \( a_j \in \mathbb{R}^{+} \) is the discrimination parameter of item \( j \),
  • \( b_j \in \mathbb{R} \) is the difficulty parameter of item \( j \),
  • \( D = 1.702 \) is a scaling constant.

The likelihood function describes the joint probability of observing the entire set of responses \( \mathbf{y} = \{y_{ij}\} \) for all respondents and items, conditioned on the latent abilities \( \theta \), the item discriminations \( a \), and the item difficulties \( b \). For each response \( y_{ij} \), the likelihood contribution depends on whether the response is correct (\( y_{ij} = 1 \)) or incorrect (\( y_{ij} = 0 \)) (Fox, 2010).

For each pair \( (i, j) \), the likelihood for the observed response \( y_{ij} \) is modeled as a Bernoulli trial, where the probability of success (correct response) is \( P_{ij} \), and the probability of failure (incorrect response) is \( 1 - P_{ij} \) (Rasch, 1960). Therefore, the likelihood for the individual response \( y_{ij} \) is:

\[ P(y_{ij} | \theta_i, a_j, b_j) = P_{ij}^{y_{ij}} \cdot (1 - P_{ij})^{1 - y_{ij}}. \]

The full likelihood function for all responses across \( N \) respondents and \( J \) items is the product of the likelihoods for each individual response:

\[ L(\mathbf{y} | \theta, a, b) = \prod_{i=1}^{N} \prod_{j=1}^{J} P_{ij}(\theta_i, a_j, b_j)^{y_{ij}} \cdot (1 - P_{ij}(\theta_i, a_j, b_j))^{1 - y_{ij}}, \]

where:

  • \( N \) is the total number of respondents,
  • \( J \) is the total number of items.

3.1.2 Intuition Behind the Likelihood

The structure of the likelihood function reflects the assumption that each response \( y_{ij} \) is an independent realization from a Bernoulli distribution, with a probability of success \( P_{ij} \). This means that the model assumes conditional independence of responses, given the latent abilities \( \theta_i \) and item parameters \( a_j \) and \( b_j \). In other words, once we know a respondent’s ability and the characteristics of an item, the model treats the response to that item as independent of the respondent’s responses to other items (Bock & Aitkin, 1981; Lord & Novick, 1968).

For a correct response (\( y_{ij} = 1 \)), the likelihood contribution is proportional to the probability \( P_{ij} \), while for an incorrect response (\( y_{ij} = 0 \)), the contribution is proportional to \( 1 - P_{ij} \). Thus, the overall likelihood favors parameter values that make the observed correct responses likely and the observed incorrect responses unlikely (Skrondal & Rabe-Hesketh, 2004).

3.1.3 Log-Likelihood Function

In practice, it is often more convenient to work with the log-likelihood function rather than the likelihood itself, due to numerical stability and the fact that products of probabilities can result in extremely small values. Taking the logarithm of the likelihood simplifies the product into a sum, as follows (Murphy, 2012):

\[ \log L(\mathbf{y} | \theta, a, b) = \sum_{i=1}^{N} \sum_{j=1}^{J} \left[ y_{ij} \log P_{ij}(\theta_i, a_j, b_j) + (1 - y_{ij}) \log (1 - P_{ij}(\theta_i, a_j, b_j)) \right]. \]

This expression is used in maximum likelihood estimation and Bayesian inference methods (such as variational inference), where optimization algorithms seek to maximize the log-likelihood with respect to the parameters \( \theta \), \( a \), and \( b \) (Bishop, 2006).
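A direct vectorized implementation of this log-likelihood is short; the sketch below (function name ours) clips probabilities away from 0 and 1 to avoid evaluating \( \log 0 \):

    import numpy as np

    def log_likelihood(theta, a, b, y, D=1.702):
        """Log-likelihood of an (N, J) response matrix y under the 2PL model.

        theta: (N,) abilities; a, b: (J,) item parameters; y entries in {0, 1}.
        """
        logits = D * a[None, :] * (theta[:, None] - b[None, :])
        p = 1.0 / (1.0 + np.exp(-logits))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)   # guard against log(0)
        return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))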

3.1.4 Incorporating Priors in Bayesian Inference

In a Bayesian framework, the likelihood function is combined with prior distributions on the parameters \( \theta_i \), \( a_j \), and \( b_j \) to form the posterior distribution. The posterior reflects the updated beliefs about the parameters after observing the data. According to Bayes’ theorem, the posterior distribution is proportional to the product of the likelihood and the priors (Gelman et al., 2013):

\[ p(\theta, a, b | \mathbf{y}) \propto L(\mathbf{y} | \theta, a, b) \cdot p(\theta) \cdot p(a) \cdot p(b), \]

where \( p(\theta) \), \( p(a) \), and \( p(b) \) represent the prior distributions on the abilities, discriminations, and difficulties, respectively. These priors introduce regularization and encode prior knowledge about the plausible values of the parameters, as discussed in previous sections (Kruschke, 2014).
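Combining the pieces, the unnormalized log-posterior is simply the log-likelihood plus the log-prior densities. The sketch below illustrates this with the item priors of Section 2.3; for brevity it substitutes a simple \( \mathcal{N}(0, 1) \) prior on each \( \theta_i \) for the full hierarchy of Section 2.2.

    import numpy as np
    from scipy.stats import norm, gamma
    from scipy.stats import t as student_t

    def log_posterior(theta, a, b, y, D=1.702):
        """Unnormalized log-posterior: log-likelihood plus log-priors (sketch)."""
        logits = D * a[None, :] * (theta[:, None] - b[None, :])
        p = np.clip(1.0 / (1.0 + np.exp(-logits)), 1e-12, 1.0 - 1e-12)
        log_lik = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
        log_prior = (norm.logpdf(theta, loc=0.0, scale=1.0).sum()       # simplified theta prior
                     + gamma.logpdf(a, 2.0, scale=1.0).sum()            # Gamma(2, 1)
                     + student_t.logpdf(b, 3.0, loc=0.0, scale=2.0).sum())  # t(3, 0, 2)
        return log_lik + log_prior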

3.1.5 Conditional Independence and Exchangeability

The form of the likelihood function is built on two key assumptions:

  • Conditional independence: Given the latent abilities \( \theta_i \), discrimination parameters \( a_j \), and difficulty parameters \( b_j \), the responses \( y_{ij} \) are assumed to be independent. This assumption simplifies the model and is typical in IRT models, allowing us to write the joint likelihood as a product of individual likelihoods for each response (Albert, 1992).
  • Exchangeability: The respondents are typically assumed to be exchangeable, meaning that the order in which they are considered does not affect the likelihood. This is reflected in the fact that the likelihood treats each respondent symmetrically (Gelman et al., 2013).

3.1.6 Interpretation in a Hierarchical Framework

In the hierarchical Bayesian 2PL model, the likelihood is augmented by the hierarchical structure introduced for the respondent abilities \( \theta \), as discussed earlier. The abilities \( \theta_i \) are modeled as random variables drawn from group-level distributions, and the item parameters \( a_j \) and \( b_j \) are governed by their own priors (Fox, 2010).

Therefore, the marginal likelihood in a fully Bayesian hierarchical model integrates over the latent abilities \( \theta_i \), as well as the item parameters \( a_j \) and \( b_j \). This leads to a more complex likelihood, where the contribution of each individual likelihood term depends on the hyperparameters governing the prior distributions of \( \theta \), \( a \), and \( b \) (Gelman & Hill, 2006).

3.2 Posterior Inference

In Bayesian Item Response Theory (IRT), the goal is to infer the posterior distribution of the latent variables \( Z = (\theta, a, b) \), which include the respondents' abilities \( \theta \), the item discriminations \( a \), and the item difficulties \( b \). The posterior distribution \( p(Z | \mathbf{y}) \) represents the probability distribution of these latent variables given the observed data \( \mathbf{y} \), i.e., the test responses. However, due to the complexity of the hierarchical Bayesian 2PL model, directly computing or sampling from the posterior is typically intractable (Blei et al., 2017; Gelman et al., 2013). Traditional methods such as Markov Chain Monte Carlo (MCMC), while accurate, can be computationally expensive, especially for large datasets with many respondents and items (Bishop, 2006; Betancourt, 2017).

To address this issue, we employ Automatic Differentiation Variational Inference (ADVI), a scalable method for approximate Bayesian inference that uses optimization techniques to approximate the posterior distribution (Kucukelbir et al., 2017).

3.2.1 Variational Inference

Variational inference (VI) is a method for approximating complex posterior distributions by selecting a family of simpler, tractable distributions and finding the member of that family that is closest to the true posterior. The closeness between the true posterior \( p(Z | \mathbf{y}) \) and the approximate distribution \( q(Z) \) is measured by the Kullback-Leibler (KL) divergence (Jordan et al., 1999). In VI, we aim to minimize the KL divergence between the true posterior and the variational approximation:

\[ \text{KL}(q(Z) || p(Z | \mathbf{y})) = \mathbb{E}_{q(Z)} \left[ \log \frac{q(Z)}{p(Z | \mathbf{y})} \right]. \]

The KL divergence quantifies the discrepancy between the variational distribution \( q(Z) \) and the true posterior. By minimizing this divergence, we ensure that \( q(Z) \) is as close as possible to \( p(Z | \mathbf{y}) \) (Blei et al., 2017).

3.2.2 The Variational Family

In Automatic Differentiation Variational Inference (ADVI), we assume that the variational distribution \( q(Z) \) factorizes over the components of the latent variables \( Z = (\theta, a, b) \). This is known as the mean-field approximation, where the joint variational distribution \( q(Z) \) is expressed as a product of independent distributions over each component (Bishop, 2006):

\[ q(Z) = q(\theta) q(a) q(b). \]

This factorization simplifies the optimization problem by assuming that the latent variables are mutually independent under the variational distribution, allowing us to focus on optimizing the variational parameters associated with each component individually (Jordan et al., 1999). While this assumption introduces some approximation error, it makes the inference process much more tractable (Kucukelbir et al., 2017).

For each component, we select a variational family from which the variational distributions are drawn. In ADVI, the most common choice is a multivariate Gaussian distribution, as it provides flexibility and is easily parameterized. Specifically, for each component of \( Z \), we define the variational distribution as:

\[ q(\theta_i) = \mathcal{N}(\mu_{\theta_i}, \sigma_{\theta_i}^2), \]
\[ q(a_j) = \mathcal{N}(\mu_{a_j}, \sigma_{a_j}^2), \]
\[ q(b_j) = \mathcal{N}(\mu_{b_j}, \sigma_{b_j}^2). \]

Here, \( \mu_{\theta_i}, \sigma_{\theta_i}^2 \) are the mean and variance of the variational distribution for the ability \( \theta_i \), and similarly for the item parameters \( a_j \) and \( b_j \) (Blei et al., 2017). For constrained parameters such as the positive discriminations \( a_j \), ADVI places the Gaussian on a transformed, unconstrained scale (e.g., \( \log a_j \)) and maps draws back through the inverse transformation (Kucukelbir et al., 2017). These variational parameters are optimized to best approximate the true posterior (Ranganath et al., 2014).

3.2.3 The Evidence Lower Bound (ELBO)

The optimization objective in variational inference is to maximize the Evidence Lower Bound (ELBO), which serves as a proxy for maximizing the log marginal likelihood of the observed data (also called the log-evidence). The ELBO is derived by rearranging the KL divergence minimization problem and is given by (Kucukelbir et al., 2017):

\[ \text{ELBO}(q) = \mathbb{E}_{q(Z)} \left[ \log p(\mathbf{y}, Z) - \log q(Z) \right], \]

where \( p(\mathbf{y}, Z) \) is the joint distribution of the data and latent variables, and \( q(Z) \) is the variational approximation (Bishop, 2006).

Maximizing the ELBO is equivalent to minimizing the KL divergence between \( q(Z) \) and the true posterior \( p(Z | \mathbf{y}) \). The ELBO can be interpreted as the sum of two terms (Jordan et al., 1999):

  • Expected log joint probability \( \mathbb{E}_{q(Z)} [\log p(\mathbf{y}, Z)] \): This encourages the variational distribution \( q(Z) \) to assign high probability to configurations of the latent variables that explain the observed data well.
  • Entropy of \( q(Z) \), \( -\mathbb{E}_{q(Z)} [\log q(Z)] \): This term promotes diversity in the variational distribution, preventing it from collapsing into a single point estimate.

3.2.4 Automatic Differentiation and Gradient-Based Optimization

To optimize the ELBO, ADVI employs gradient-based methods, where the gradients of the ELBO with respect to the variational parameters are computed using automatic differentiation. Automatic differentiation is a technique that allows for the efficient computation of gradients by exploiting the structure of the computational graph representing the ELBO (Kucukelbir et al., 2017). This is particularly useful for complex models such as hierarchical IRT models, where manually deriving gradients can be difficult and error-prone (Bishop, 2006).

The optimization proceeds iteratively, with each iteration updating the variational parameters using a gradient ascent method. The updates are typically based on stochastic optimization algorithms, such as stochastic gradient descent (SGD) or Adam, which efficiently handle large datasets by approximating the gradient using small minibatches of data (Kingma & Ba, 2015). The general update rule for the variational parameters \( \lambda \) is:

\[ \lambda^{(t+1)} = \lambda^{(t)} + \eta \nabla_{\lambda} \text{ELBO}(q_{\lambda}), \]

where:

  • \( \lambda^{(t)} \) represents the variational parameters at iteration \( t \),
  • \( \eta \) is the learning rate,
  • \( \nabla_{\lambda} \text{ELBO}(q_{\lambda}) \) is the gradient of the ELBO with respect to the variational parameters \( \lambda \) (Blei et al., 2017).

The optimization continues until convergence, typically when the change in the ELBO or the variational parameters between successive iterations falls below a predefined threshold (Kucukelbir et al., 2017).
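To make the workflow concrete, the sketch below expresses the full model in PyMC, whose ADVI implementation follows Kucukelbir et al. (2017). The synthetic data, the model name, the fixed within-group standard deviation of 0.3, and the iteration count are illustrative assumptions rather than recommendations.

    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(0)
    N, J, G = 100, 20, 4
    y = rng.integers(0, 2, size=(N, J))      # placeholder response matrix
    group = rng.integers(0, G, size=N)       # placeholder group assignment
    D = 1.702

    with pm.Model() as irt_2pl:
        # Hierarchical ability priors (Section 2.2)
        mu_theta = pm.Normal("mu_theta", mu=0.0, sigma=1.0)
        sigma_theta = pm.HalfCauchy("sigma_theta", beta=1.0)
        theta_g = pm.Normal("theta_g", mu=mu_theta, sigma=sigma_theta, shape=G)
        theta = pm.Normal("theta", mu=theta_g[group], sigma=0.3, shape=N)

        # Informative item priors (Section 2.3)
        a = pm.Gamma("a", alpha=2.0, beta=1.0, shape=J)
        b = pm.StudentT("b", nu=3, mu=0.0, sigma=2.0, shape=J)

        # 2PL likelihood (Section 3.1)
        p = pm.math.sigmoid(D * a * (theta[:, None] - b))
        pm.Bernoulli("y_obs", p=p, observed=y)

        # Mean-field ADVI: stochastic gradient ascent on the ELBO
        approx = pm.fit(n=20000, method="advi")

    # approx.hist records the optimization objective (negative ELBO) per step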

3.2.5 Benefits of ADVI

ADVI offers several advantages over traditional sampling methods such as MCMC, especially for large-scale models:

  • Scalability: ADVI scales well with large datasets, as it uses gradient-based optimization, which is typically faster than MCMC, especially for high-dimensional parameter spaces (Hoffman et al., 2013).
  • Efficiency: The use of automatic differentiation ensures that the gradients of the ELBO are computed efficiently, even for complex hierarchical models like the 2PL IRT model (Kucukelbir et al., 2017).
  • Approximation Speed: Unlike MCMC, which can require a large number of samples to accurately estimate the posterior, ADVI converges relatively quickly by directly optimizing the variational parameters (Ranganath et al., 2014).
  • Deterministic Convergence: ADVI provides a deterministic approximation to the posterior, whereas MCMC produces samples that require additional post-processing for convergence diagnostics (Kingma & Ba, 2015).

3.2.6 Limitations of ADVI

While ADVI is computationally efficient, it has some limitations:

  • Approximation Quality: The factorization assumption (mean-field approximation) may not capture the full dependencies between latent variables, potentially leading to an inaccurate posterior approximation (Blei et al., 2017).
  • Convergence to Local Optima: Like other gradient-based methods, ADVI may converge to local optima, especially if the optimization landscape is non-convex (Betancourt, 2017).

3.3 Sampling from the Posterior

Once the variational approximation \( q(Z) \) is optimized in Automatic Differentiation Variational Inference (ADVI), the next step involves sampling from the approximate posterior distribution \( q(Z) \) to generate posterior samples of the latent variables \( \theta \), \( a \), and \( b \). These samples serve as approximations to the true posterior distribution \( p(Z | \mathbf{y}) \) and can be used for various downstream tasks, including generating posterior predictive distributions, evaluating model fit, and making inferences about the respondents' abilities and item characteristics (Blei et al., 2017; Kucukelbir et al., 2017).

3.3.1 Posterior Sampling from \( q(Z) \)

After the optimization of the variational parameters, the variational distribution \( q(Z) \) represents a multivariate distribution over the latent variables \( Z = (\theta, a, b) \). Given that \( q(Z) \) is typically assumed to be a product of independent normal distributions in the mean-field approximation, sampling from \( q(Z) \) is straightforward (Murphy, 2012). Specifically, we can draw posterior samples by sampling each component independently from its respective normal distribution.

  • Ability parameters \( \theta_i \) are sampled from their approximate posterior distribution \( q(\theta_i) = \mathcal{N}(\mu_{\theta_i}, \sigma_{\theta_i}^2) \),
  • Discrimination parameters \( a_j \) are sampled from their approximate posterior distribution \( q(a_j) = \mathcal{N}(\mu_{a_j}, \sigma_{a_j}^2) \),
  • Difficulty parameters \( b_j \) are sampled from their approximate posterior distribution \( q(b_j) = \mathcal{N}(\mu_{b_j}, \sigma_{b_j}^2) \).

This sampling procedure is repeated \( S \) times to generate a set of posterior samples \( \{ \theta^{(s)}, a^{(s)}, b^{(s)} \}_{s=1}^{S} \), where each sample provides an approximation of the latent variables based on the variational distribution (Hoffman et al., 2013).
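A NumPy sketch of this mean-field sampling for a single respondent and item follows; the variational parameters are made-up values standing in for the optimized ones. Note that for the positive parameter \( a_j \), ADVI in fact fits a Gaussian to \( \log a_j \), so draws are mapped back through the exponential (Kucukelbir et al., 2017).

    import numpy as np

    rng = np.random.default_rng(1)
    S = 1000
    # Illustrative (not fitted) variational parameters:
    mu_theta_i, sd_theta_i = 0.4, 0.2
    mu_log_a_j, sd_log_a_j = 0.1, 0.15   # Gaussian on the log scale keeps a_j > 0
    mu_b_j, sd_b_j = -0.3, 0.25

    theta_s = rng.normal(mu_theta_i, sd_theta_i, size=S)
    a_s = np.exp(rng.normal(mu_log_a_j, sd_log_a_j, size=S))   # back-transform
    b_s = rng.normal(mu_b_j, sd_b_j, size=S)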

3.3.2 Posterior Predictive Distributions

Once we have obtained posterior samples, we can use them to compute posterior predictive distributions, which allow us to predict new responses or evaluate the model's fit to the observed data. The posterior predictive distribution is given by:

\[ p(\tilde{y}_{ij} | \mathbf{y}) = \int p(\tilde{y}_{ij} | \theta_i, a_j, b_j) p(\theta, a, b | \mathbf{y}) \, d\theta \, da \, db. \]

In practice, this integral is approximated using the posterior samples obtained from \( q(Z) \) (Gelman et al., 2013). For each posterior sample \( (\theta_i^{(s)}, a_j^{(s)}, b_j^{(s)}) \), the predictive probability of a correct response by respondent \( i \) to item \( j \) is given by the logistic function (Bishop, 2006):

\[ P_{ij}^{(s)} = \frac{1}{1 + \exp(-D \cdot a_j^{(s)} \cdot (\theta_i^{(s)} - b_j^{(s)}))}. \]

The posterior predictive distribution for a new response \( \tilde{y}_{ij} \) can then be approximated by averaging over the posterior samples:

\[ p(\tilde{y}_{ij} = 1 | \mathbf{y}) \approx \frac{1}{S} \sum_{s=1}^{S} P_{ij}^{(s)}. \]

This provides a point estimate for the probability that respondent \( i \) answers item \( j \) correctly, based on the posterior distribution of the latent parameters (Blei et al., 2017).
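The Monte Carlo average above can be computed in a few vectorized lines; the function name and array layout below are our conventions:

    import numpy as np

    def posterior_predictive_prob(theta_s, a_s, b_s, D=1.702):
        """Monte Carlo estimate of p(y_ij = 1 | y) from S posterior draws.

        theta_s: (S, N) ability draws; a_s, b_s: (S, J) item-parameter draws.
        Returns an (N, J) matrix of predictive probabilities.
        """
        logits = D * a_s[:, None, :] * (theta_s[:, :, None] - b_s[:, None, :])
        p = 1.0 / (1.0 + np.exp(-logits))   # per-draw probabilities, shape (S, N, J)
        return p.mean(axis=0)               # average over posterior draws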

3.3.3 Model Fit and Diagnostic Measures

The posterior samples can also be used to evaluate the fit of the model to the observed data and to diagnose any potential issues with the model. Common methods for assessing model fit in Bayesian IRT include:

  • Posterior Predictive Checks: Compare the observed responses \( \mathbf{y} \) with simulated responses generated from the posterior predictive distribution (Gelman & Hill, 2006). Discrepancies between the observed and simulated responses can indicate areas where the model may not be capturing the underlying data structure.
  • Goodness-of-Fit Metrics: Compute metrics such as the log-likelihood or deviance for each posterior sample and average these metrics to assess how well the model fits the data (Vehtari et al., 2017). The log-likelihood for each sample is given by:
    \[ \log L^{(s)} = \sum_{i=1}^{N} \sum_{j=1}^{J} \left[ y_{ij} \log P_{ij}^{(s)} + (1 - y_{ij}) \log (1 - P_{ij}^{(s)}) \right]. \]

    Averaging the log-likelihoods over the posterior samples gives an estimate of the model's overall fit.

  • Posterior Intervals: Compute credible intervals for the latent variables \( \theta \), \( a \), and \( b \) based on the posterior samples (McElreath, 2020). The 95% credible interval for a parameter \( \phi \) (which could be \( \theta_i \), \( a_j \), or \( b_j \)) is computed as the interval between the 2.5th and 97.5th percentiles of the posterior samples:
    \[ \text{CI}_{95\%}(\phi) = \left[ \phi^{(2.5\%)} , \phi^{(97.5\%)} \right]. \]
  • Posterior Predictive Discrepancies: Compute statistics such as the root mean squared error (RMSE) between the observed responses and the posterior predictive means (Gelman et al., 2013):
    \[ \text{RMSE} = \sqrt{ \frac{1}{N \times J} \sum_{i=1}^{N} \sum_{j=1}^{J} \left( y_{ij} - p(\tilde{y}_{ij} = 1 | \mathbf{y}) \right)^2 }. \]

    A lower RMSE suggests a better fit between the model's predictions and the observed data (Blei et al., 2017).

3.3.4 Inference on Latent Variables

By analyzing the posterior samples of the latent variables, we can make inferences about the respondents' abilities \( \theta_i \) and the characteristics of the items (discrimination \( a_j \) and difficulty \( b_j \)) (Bishop, 2006).

  • Ability Estimation: The posterior mean of the ability \( \theta_i \) for each respondent can be used as a point estimate:
    \[ \hat{\theta}_i = \frac{1}{S} \sum_{s=1}^{S} \theta_i^{(s)}. \]

    This estimate provides a summary of the respondent’s ability, based on the posterior distribution (McElreath, 2020).

  • Item Parameter Estimation: Similarly, the posterior means of the discrimination and difficulty parameters for each item can be used to summarize the item characteristics:
    \[ \hat{a}_j = \frac{1}{S} \sum_{s=1}^{S} a_j^{(s)}. \]
    \[ \hat{b}_j = \frac{1}{S} \sum_{s=1}^{S} b_j^{(s)}. \]

    These estimates give insight into how well each item discriminates between respondents and the difficulty level of each item (Fox, 2010).

  • Uncertainty Quantification: By examining the variability in the posterior samples, we can assess the uncertainty associated with each parameter estimate. Parameters with higher variability in the posterior samples will have wider credible intervals, indicating greater uncertainty in those estimates (Gelman et al., 2013). A short sketch after this list illustrates these posterior summaries.
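The posterior means and credible intervals above reduce to simple array operations; the draws below are placeholders standing in for samples from a fitted approximation.

    import numpy as np

    rng = np.random.default_rng(2)
    theta_s = rng.normal(0.0, 1.0, size=(1000, 50))   # placeholder (S, N) draws

    theta_hat = theta_s.mean(axis=0)                            # posterior means
    ci_lo, ci_hi = np.percentile(theta_s, [2.5, 97.5], axis=0)  # 95% credible intervals
    ci_width = ci_hi - ci_lo   # wider intervals signal greater posterior uncertainty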

4.1 Variational Approximation

Due to the intractability of the exact posterior distribution in the hierarchical Bayesian 2PL model, we resort to variational inference (VI) as an efficient alternative to traditional sampling-based methods like Markov Chain Monte Carlo (MCMC). Variational inference approximates the true posterior distribution \( p(\theta, a, b \mid \mathbf{y}) \), where \( \theta \) represents the latent abilities of respondents, and \( a \) and \( b \) are the item discrimination and difficulty parameters, respectively. In this approach, we replace the true posterior with a simpler, tractable variational distribution \( q(\theta, a, b; \lambda) \), which is parameterized by \( \lambda \) (Blei et al., 2017). The variational distribution is designed to be easier to work with while closely approximating the true posterior.

The primary goal of variational inference is to minimize the Kullback-Leibler (KL) divergence between the variational distribution \( q(\theta, a, b; \lambda) \) and the true posterior distribution \( p(\theta, a, b \mid \mathbf{y}) \). This divergence measures the difference between the variational approximation and the actual posterior:

\[ \text{KL}(q(\theta, a, b; \lambda) \, || \, p(\theta, a, b \mid \mathbf{y})) = \mathbb{E}_{q(\theta, a, b; \lambda)} \left[ \log \frac{q(\theta, a, b; \lambda)}{p(\theta, a, b \mid \mathbf{y})} \right]. \]

Minimizing the KL divergence ensures that the variational distribution \( q(\theta, a, b; \lambda) \) closely approximates the true posterior distribution (Jordan et al., 1999). However, directly minimizing the KL divergence is challenging due to the intractability of the true posterior. Instead, we optimize an alternative objective known as the Evidence Lower Bound (ELBO), whose maximization is equivalent to minimizing the KL divergence and which allows for efficient optimization (Kucukelbir et al., 2017).

4.1.1 The Evidence Lower Bound (ELBO)

The ELBO is the key objective in variational inference and serves as a proxy for maximizing the log marginal likelihood of the data (also known as the log evidence) (Murphy, 2012). The ELBO is derived by rewriting the marginal likelihood \( \log p(\mathbf{y}) \) and bounding it below using Jensen's inequality:

\[ \log p(\mathbf{y}) = \log \int p(\mathbf{y}, \theta, a, b) \, d\theta \, da \, db \geq \mathbb{E}_{q(\theta, a, b; \lambda)} \left[ \log p(\mathbf{y}, \theta, a, b) - \log q(\theta, a, b; \lambda) \right] = \text{ELBO}. \]
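This bound can also be seen from the exact decomposition of the log evidence:

\[ \log p(\mathbf{y}) = \text{ELBO}(q) + \text{KL}(q(\theta, a, b; \lambda) \,||\, p(\theta, a, b \mid \mathbf{y})). \]

Because \( \log p(\mathbf{y}) \) does not depend on \( \lambda \) and the KL divergence is non-negative, every increase in the ELBO corresponds to an equal decrease in the KL divergence.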

The ELBO consists of two terms:

  • The expected log joint probability \( \mathbb{E}_{q(\theta, a, b; \lambda)} [\log p(\mathbf{y}, \theta, a, b)] \), which encourages the variational distribution to assign high probability to configurations of \( \theta, a, b \) that explain the observed data well,
  • The entropy term \( -\mathbb{E}_{q(\theta, a, b; \lambda)} [\log q(\theta, a, b; \lambda)] \), which promotes a spread-out variational distribution, avoiding overconcentration on a single point (Blei et al., 2017).

The ELBO can also be written explicitly in terms of the likelihood and priors as:

\[ \text{ELBO} = \mathbb{E}_{q(\theta, a, b; \lambda)} \left[ \log L(\mathbf{y} \mid \theta, a, b) + \log p(\theta, a, b) - \log q(\theta, a, b; \lambda) \right]. \]

Here:

  • \( L(\mathbf{y} \mid \theta, a, b) \) is the likelihood function, representing the probability of observing the data given the latent abilities \( \theta \) and item parameters \( a \) and \( b \),
  • \( p(\theta, a, b) \) is the prior distribution over the latent variables, encoding our prior beliefs about the parameters before observing the data (Gelman et al., 2013),
  • \( q(\theta, a, b; \lambda) \) is the variational distribution, parameterized by \( \lambda \), which approximates the posterior distribution.

4.1.2 Optimization of the ELBO

The optimization of the ELBO is carried out using gradient-based methods. The parameters \( \lambda \), which define the variational distribution \( q(\theta, a, b; \lambda) \), are iteratively updated to maximize the ELBO (Kingma & Welling, 2014). Since the ELBO involves an expectation over the variational distribution, we employ stochastic optimization to approximate the gradients of the ELBO using Monte Carlo sampling (Ranganath et al., 2014).

For each iteration, the gradients of the ELBO with respect to the variational parameters \( \lambda \) are computed as:

\[ \nabla_\lambda \text{ELBO} = \nabla_\lambda \mathbb{E}_{q(\theta, a, b; \lambda)} \left[ \log p(\mathbf{y}, \theta, a, b) - \log q(\theta, a, b; \lambda) \right]. \]

Using automatic differentiation, the gradients are computed efficiently, allowing for fast updates to the variational parameters (Kucukelbir et al., 2017). The optimization procedure iteratively refines the variational distribution \( q(\theta, a, b; \lambda) \), bringing it closer to the true posterior distribution.

The general update rule for the variational parameters \( \lambda \) at iteration \( t \) is:

\[ \lambda^{(t+1)} = \lambda^{(t)} + \eta \nabla_\lambda \text{ELBO}(q(\theta, a, b; \lambda)), \]

where \( \eta \) is the learning rate that controls the step size of each update (Kingma & Welling, 2014). Over successive iterations, the ELBO is maximized, and the variational distribution \( q(\theta, a, b; \lambda) \) becomes a more accurate approximation of the true posterior.

4.1.3 Variational Family for \( q(\theta, a, b; \lambda) \)

In the context of mean-field variational inference, we assume that the variational distribution factorizes over the components of the latent variables \( Z = (\theta, a, b) \), meaning that:

\[ q(\theta, a, b; \lambda) = \prod_{i} q(\theta_i) \prod_{j} q(a_j) \prod_{j} q(b_j), \]

where each factor is typically modeled as a normal distribution, parameterized by its own mean and variance. This factorization simplifies the inference process, as it allows us to optimize each factor of \( q(\theta, a, b; \lambda) \) independently (Blei et al., 2017).

For each component:

  • The variational distribution for \( \theta \), the latent abilities, is modeled as:
    \[ q(\theta_i) = \mathcal{N}(\mu_{\theta_i}, \sigma_{\theta_i}^2), \]
    where \( \mu_{\theta_i} \) and \( \sigma_{\theta_i}^2 \) are the mean and variance of the variational distribution for respondent \( i \)'s ability (Murphy, 2012).
  • The variational distributions for the item discrimination and difficulty parameters \( a_j \) and \( b_j \) are similarly modeled as:
    \[ q(a_j) = \mathcal{N}(\mu_{a_j}, \sigma_{a_j}^2), \]
    \[ q(b_j) = \mathcal{N}(\mu_{b_j}, \sigma_{b_j}^2). \]

The variational parameters \( \lambda = (\mu_{\theta}, \sigma_{\theta}, \mu_a, \sigma_a, \mu_b, \sigma_b) \) are optimized to maximize the ELBO.
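A compact sketch of this mean-field family, under assumed problem sizes \( N \) (respondents) and \( J \) (items), might look as follows; all names are illustrative and match the parameter dictionary of the earlier ELBO sketch.

```python
# A minimal sketch of the mean-field family: one independent Gaussian factor
# per ability and per item parameter. N and J are assumed problem sizes.
import numpy as np

N, J = 500, 20  # respondents and items (illustrative)
lam = {
    "mu_theta": np.zeros(N), "sd_theta": np.ones(N),
    "mu_a": np.ones(J),      "sd_a": 0.5 * np.ones(J),
    "mu_b": np.zeros(J),     "sd_b": np.ones(J),
}

def sample_q(lam, rng):
    """Draw one joint sample (theta, a, b) from the factorized q."""
    theta = rng.normal(lam["mu_theta"], lam["sd_theta"])
    a = rng.normal(lam["mu_a"], lam["sd_a"])
    b = rng.normal(lam["mu_b"], lam["sd_b"])
    return theta, a, b
```

In practice the standard deviations are optimized on an unconstrained log scale; this reparameterization of constrained variables is exactly the kind of transformation that ADVI automates.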

4.1.4 Advantages of the Variational Approach

Variational inference offers several advantages over traditional methods like MCMC, especially in large-scale applications:

  • Scalability: Variational inference is generally much faster than MCMC, as it transforms the inference problem into an optimization problem, which can be efficiently solved using gradient-based methods (Bishop, 2006).
  • Deterministic Convergence: Unlike MCMC, which relies on sampling and may require long chains to mix, variational inference casts inference as optimization, so progress can be monitored directly through the ELBO and the procedure terminates once the objective stabilizes (Kingma & Welling, 2014).
  • Approximation Quality: Although variational inference may not always perfectly capture the full complexity of the true posterior (particularly in high-dimensional settings), it provides a reasonable approximation that is often sufficient for practical purposes, especially when computational efficiency is crucial (Blei et al., 2017).

4.2 Full-Rank ADVI

In the full-rank variant of Automatic Differentiation Variational Inference (ADVI), we employ a more flexible variational family by assuming that the variational distribution \( q(\theta, a, b) \) is a multivariate Gaussian with a full-rank covariance structure (Kucukelbir et al., 2017). This contrasts with the simpler mean-field approach, where each latent variable is assumed to be independent. The full-rank assumption allows the variational distribution to capture correlations between the latent variables \( \theta \) (abilities), \( a \) (discriminations), and \( b \) (difficulties), leading to a more expressive approximation of the true posterior (Blei et al., 2017).

4.2.1 Full-Rank Multivariate Gaussian

In full-rank ADVI, the variational distribution \( q(Z) \), where \( Z = (\theta, a, b) \), is modeled as a multivariate Gaussian distribution with a mean vector \( \mu \) and a covariance matrix \( \Sigma \):

\[ q(Z) = \mathcal{N}(\mu, \Sigma), \]

where:

  • \( \mu \in \mathbb{R}^d \) is the mean vector of the distribution, with \( d \) being the total number of latent variables (i.e., the combined dimensionality of \( \theta \), \( a \), and \( b \)),
  • \( \Sigma \in \mathbb{R}^{d \times d} \) is the covariance matrix, which captures the variances and covariances between the latent variables.

The full-rank covariance matrix \( \Sigma \) introduces dependencies between the variables, enabling the variational approximation to model the correlations between the abilities, item discriminations, and difficulties, which may be present in the true posterior. In contrast, the mean-field assumption would ignore these dependencies by assuming a diagonal covariance matrix (Murphy, 2012).

The covariance matrix \( \Sigma \) is symmetric and positive definite, and it can be decomposed as:

\[ \Sigma = LL^\top, \]

where \( L \) is the Cholesky factor of the covariance matrix, a lower triangular matrix with positive diagonal entries (Bishop, 2006). This decomposition guarantees that \( \Sigma \) remains positive definite, and it is computationally efficient to work with during optimization.
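As a quick numerical check, NumPy's Cholesky routine recovers a lower-triangular factor whose product reproduces the covariance; the \( 3 \times 3 \) matrix below is an arbitrary illustrative positive-definite covariance.

```python
# A quick numerical check of the decomposition Sigma = L L^T. The 3x3 matrix
# is an arbitrary illustrative positive-definite covariance.
import numpy as np

Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
L = np.linalg.cholesky(Sigma)       # lower triangular, positive diagonal
assert np.allclose(L @ L.T, Sigma)  # recovers Sigma (up to rounding)
```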

4.2.2 Optimizing the Variational Parameters

The goal of full-rank ADVI is to optimize the parameters \( \mu \) and \( \Sigma \) (or equivalently, the Cholesky factor \( L \)) by maximizing the Evidence Lower Bound (ELBO). The ELBO, as previously discussed, is an objective function that serves as a lower bound on the marginal likelihood of the data and is used to guide the optimization of the variational distribution. In the context of full-rank ADVI, the ELBO is a function of both the mean vector \( \mu \) and the covariance matrix \( \Sigma \) (Hoffman et al., 2013):

\[ \text{ELBO}(\mu, \Sigma) = \mathbb{E}_{q(Z)} \left[ \log p(\mathbf{y}, Z) - \log q(Z; \mu, \Sigma) \right]. \]

To maximize the ELBO, we employ stochastic gradient ascent. The gradients of the ELBO with respect to the variational parameters \( \mu \) and \( \Sigma \) (or \( L \)) are computed using automatic differentiation, which enables efficient gradient computation even for high-dimensional models like the hierarchical 2PL model (Ranganath et al., 2014).

The parameters are updated iteratively using gradient ascent or more advanced optimizers such as Adam or RMSProp, which are well-suited for stochastic optimization (Kingma & Ba, 2015). The update rules for the parameters at iteration \( t \) are as follows:

\[ \mu^{(t+1)} = \mu^{(t)} + \eta \, \nabla_\mu \text{ELBO}(\mu^{(t)}, L^{(t)}), \]
\[ L^{(t+1)} = L^{(t)} + \eta \, \nabla_L \text{ELBO}(\mu^{(t)}, L^{(t)}), \]

where:

  • \( \mu^{(t)} \) and \( L^{(t)} \) are the values of the variational parameters at iteration \( t \),
  • \( \eta \) is the learning rate that controls the step size,
  • \( \nabla_\mu \text{ELBO} \) and \( \nabla_L \text{ELBO} \) are the gradients of the ELBO with respect to \( \mu \) and \( L \), respectively.

4.2.3 Monte Carlo Estimation of the ELBO

Because the ELBO involves an expectation over the variational distribution \( q(Z) \), we typically use Monte Carlo sampling to estimate this expectation (Kucukelbir et al., 2017). Specifically, we sample from the variational distribution \( q(Z) = \mathcal{N}(\mu, \Sigma) \) to approximate the gradients of the ELBO. The reparameterization trick is commonly used to ensure low-variance gradient estimates (Kingma & Welling, 2014).

For each iteration, we generate a sample \( Z^{(s)} \) from the variational distribution using the reparameterization trick:

\[ Z^{(s)} = \mu + L \epsilon^{(s)}, \]

where \( \epsilon^{(s)} \sim \mathcal{N}(0, I) \) is a standard normal random vector, and \( L \) is the Cholesky factor of \( \Sigma \) (Blei et al., 2017). This transformation allows us to backpropagate gradients through the stochastic sampling process, ensuring that the gradient-based optimization remains efficient.

The Monte Carlo approximation of the ELBO is given by:

\[ \text{ELBO}(\mu, L) \approx \frac{1}{S} \sum_{s=1}^{S} \left[ \log p(\mathbf{y}, Z^{(s)}) - \log q(Z^{(s)}; \mu, L) \right], \]

where \( S \) is the number of Monte Carlo samples used to estimate the expectation (Jordan et al., 1999).
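The following sketch combines the reparameterized sampling step with the \( S \)-sample average above. The joint log-density \( \log p(\mathbf{y}, Z) \) is passed in as an assumed callable `log_joint`; in the 2PL model it would combine the Bernoulli likelihood with the priors, as in the earlier mean-field sketch.

```python
# A sketch of the S-sample Monte Carlo ELBO for the full-rank Gaussian family
# using the reparameterization Z = mu + L eps. `log_joint` is an assumed
# callable returning log p(y, Z) for a flat latent vector Z.
import numpy as np
from scipy.stats import multivariate_normal

def full_rank_elbo(log_joint, mu, L, S=10, rng=None):
    """Average log p(y, Z) - log q(Z; mu, L) over S reparameterized samples."""
    if rng is None:
        rng = np.random.default_rng()
    Sigma = L @ L.T
    total = 0.0
    for _ in range(S):
        eps = rng.standard_normal(mu.size)  # eps ~ N(0, I)
        Z = mu + L @ eps                    # a draw from q(Z) = N(mu, Sigma)
        total += log_joint(Z) - multivariate_normal.logpdf(Z, mean=mu, cov=Sigma)
    return total / S
```

Because \( Z \) is a deterministic function of \( (\mu, L) \) given \( \epsilon \), gradients of this estimate with respect to \( \mu \) and \( L \) can flow through the samples, which is what makes the reparameterization trick compatible with automatic differentiation.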

4.2.4 Advantages of Full-Rank ADVI

Full-rank ADVI offers several important advantages over simpler mean-field variational inference:

  • Capturing Correlations: By allowing a full-rank covariance matrix, full-rank ADVI can capture dependencies and correlations between the latent variables, which are often present in hierarchical models like the 2PL model. This leads to a more accurate approximation of the true posterior (Murphy, 2012).
  • Flexible Approximation: The multivariate Gaussian with a full-rank covariance structure is a more flexible variational family compared to the diagonal covariance assumption in mean-field inference. This flexibility allows full-rank ADVI to approximate more complex posterior distributions, especially when there are strong dependencies between the parameters (Hoffman et al., 2013).
  • Improved Inference Quality: Since full-rank ADVI is better at capturing the joint structure of the latent variables, it often leads to better predictive performance and more accurate uncertainty quantification compared to mean-field approaches (Blei et al., 2017).

4.2.5 Computational Considerations

While full-rank ADVI is more expressive than mean-field ADVI, it also comes with increased computational costs due to the need to store and manipulate a full covariance matrix \( \Sigma \). The number of entries in the covariance matrix grows quadratically with the number of latent variables: with 1,000 respondents and 50 items, for example, \( d = 1{,}000 + 50 + 50 = 1{,}100 \) and \( \Sigma \) contains \( 1{,}210{,}000 \) entries. This makes full-rank ADVI more computationally intensive for large-scale models (Murphy, 2012). However, the Cholesky decomposition of \( \Sigma \) mitigates some of these computational challenges, as it allows for more efficient matrix operations during both the optimization and sampling steps (Bishop, 2006).

Despite the higher computational cost, full-rank ADVI is still significantly faster than traditional MCMC methods and remains a viable option for large-scale Bayesian models where capturing correlations between latent variables is important (Kucukelbir et al., 2017).

5. Conclusion

This paper presents a comprehensive mathematical formulation of the hierarchical Bayesian Two-Parameter Logistic (2PL) model, utilizing variational inference to address the challenges of complex posterior estimation. By incorporating informative priors for item-specific parameters and a hierarchical structure for latent abilities, the model accounts for both individual and group-level variation, providing a more refined and theoretically robust approach to latent trait estimation. The hierarchical structure facilitates partial pooling, which improves the stability of parameter estimates, especially in the context of small or noisy datasets.

The use of Automatic Differentiation Variational Inference (ADVI), particularly in its full-rank form, significantly enhances the approximation of complex, high-dimensional posteriors. This method offers a computationally efficient alternative to traditional sampling techniques, such as Markov Chain Monte Carlo (MCMC), making it well-suited for large-scale applications while capturing correlations between latent variables. However, as this work is theoretical, the practical performance and computational gains of ADVI in hierarchical IRT models remain to be empirically validated in future studies.

This theoretical exploration highlights the potential of the hierarchical Bayesian 2PL model and ADVI for scalable and flexible inference in latent trait modeling. Future research could explore extensions to multidimensional IRT models, enabling the simultaneous estimation of multiple latent traits, as well as the integration of nonparametric priors for increased model flexibility. These advancements could further extend the applicability of this framework across diverse fields, including psychometrics, machine learning, and the behavioral sciences.

References

Albert, J. (1992). Bayesian Estimation of Normal Ogive Item Response Curves Using Gibbs Sampling. Journal of Educational Statistics, 17(3), 251–269. https://doi.org/10.2307/1165149

Baker, F. B. (2001). The basics of item response theory. ERIC Clearinghouse on Assessment and Evaluation. https://doi.org/10.1201/9781482276725

Baker, F. B., & Kim, S. H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). CRC Press.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley.

Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877. https://doi.org/10.1080/01621459.2017.1285773

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Carlin, B. P., & Louis, T. A. (2000). Bayes and empirical Bayes methods for data analysis (2nd ed.). Chapman & Hall/CRC.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Psychology Press. https://doi.org/10.4324/9781410605269

Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. Springer Science & Business Media. https://doi.org/10.1007/978-1-4419-0742-4

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b16018

Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press. https://doi.org/10.1017/CBO9780511790942

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage Publications.

Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(1), 1303–1347.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015). https://arxiv.org/abs/1412.6980

Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. https://arxiv.org/abs/1312.6114

König, C., Laubrock, J., Eppinger, B., & Fox, J.-P. (2022). Modeling individual differences in reading behavior with a hierarchical Bayesian item response theory model. Behavior Research Methods, 54(5), 2414–2430. https://doi.org/10.3758/s13428-022-02000-5

Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. Journal of Machine Learning Research, 18(1), 430–474.

Marcoulides, K. (2018). Careful with those priors: A note on Bayesian estimation in two-parameter logistic item response theory models. Measurement: Interdisciplinary Research and Perspectives, 16, 92–99. https://doi.org/10.1080/15366367.2018.1437305

Mislevy, R. J., & Verhelst, N. D. (1990). Modeling item responses when different subjects employ different solution strategies. Psychometrika, 55(2), 195–215. https://doi.org/10.1007/BF02295283

Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.

Polson, N. G., & Scott, J. G. (2012). On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4), 887–902. http://dx.doi.org/10.1214/12-BA730

Ranganath, R., Gerrish, S., & Blei, D. M. (2014). Black box variational inference. Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 814–822.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research.

Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. Chapman & Hall/CRC. https://doi.org/10.1201/9780203489437

Smid, S. C., McNeish, D., Miočević, M., & van de Schoot, R. (2020). Bayesian versus frequentist estimation for structural equation models in small sample contexts: A systematic review. Structural Equation Modeling, 27(1), 131–161. https://doi.org/10.1080/10705511.2019.1577140

van der Linden, W. J., & Hambleton, R. K. (2013). Handbook of modern item response theory. Springer Science & Business Media.

Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2017). Pareto smoothed importance sampling. arXiv preprint arXiv:1507.02646. https://arxiv.org/abs/1507.02646

Zitzmann, S., Helm, C., & Hecht, M. (2021). Prior Specification for More Stable Bayesian Estimation of Multilevel Latent Variable Models in Small Samples: A Comparative Investigation of Two Different Approaches. Frontiers in psychology, 11, 611267. https://doi.org/10.3389/fpsyg.2020.611267

Zitzmann, S., Lüdtke, O., Robitzsch, A., & Hecht, M. (2021). On the Performance of Bayesian Approaches in Small Samples: A Comment on Smid, McNeish, Miocevic, and van de Schoot (2020). Structural Equation Modeling: A Multidisciplinary Journal, 28(1), 40–50. https://doi.org/10.1080/10705511.2020.1752216

Author: Jouve, X.
Publication: 2024