Encoding algorithm#

Refresher on Bayesian statistics#

In Bayesian statistics, Bayes' theorem gives

\[p(\theta | y) = \frac{p(y | \theta)p(\theta)}{p(y)}\]

where \(p(\theta)\) is the prior distribution of the parameter \(\theta\), \(p(y|\theta)\) is the likelihood of \(y\) given \(\theta\), \(p(y)\) is the marginal likelihood of \(y\), and \(p(\theta|y)\) is the posterior distribution of \(\theta\) after observing \(y\). In particular, we will focus on conjugate Bayesian models, where the prior and posterior distributions of \(\theta\) belong to the same family.
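To make the formula concrete, here is a minimal sketch of Bayes' theorem on a discrete grid. The three candidate values of \(\theta\), the uniform prior, and the function name `posterior_after` are all assumptions chosen for illustration; they do not come from the encoder itself.

```python
# Hypothetical discrete example: theta can take only three values.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]  # p(theta), uniform by assumption


def posterior_after(y, thetas, prior):
    """Apply Bayes' theorem for a single binary observation y."""
    # p(y | theta): probability of observing y under each candidate theta
    likelihood = [t if y == 1 else 1 - t for t in thetas]
    # p(y): marginal likelihood, the sum over theta of p(y | theta) * p(theta)
    evidence = sum(l * p for l, p in zip(likelihood, prior))
    # p(theta | y) = p(y | theta) * p(theta) / p(y)
    return [l * p / evidence for l, p in zip(likelihood, prior)]


posterior = posterior_after(1, thetas, prior)
print(posterior)
```

After observing a single 1, the posterior shifts weight toward the larger values of \(\theta\) while still summing to one.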

Example#

Consider a situation where the target variable in our dataset is binary. We model \(y_{1}, ..., y_{n}\) as independent and identically distributed Bernoulli random variables, where \(\theta\), the probability of a 1, is unknown.

According to Fink’s Compendium of Conjugate Priors, the conjugate prior for the Bernoulli likelihood is a Beta distribution with hyperparameters \(\alpha\) and \(\beta\), i.e., \(\theta \sim Beta(\alpha, \beta)\).

Since we are using a conjugate Bayesian model, the posterior distribution \(p(\theta | y)\) is also a Beta distribution, \(Beta(\alpha^{\prime}, \beta^{\prime})\). Fink gives the updated hyperparameters as

\[\alpha^{\prime} = \alpha + \sum_{i = 1}^{n} y_{i}\]

and

\[\beta^{\prime} = \beta + n - \sum_{i = 1}^{n} y_{i}\]
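The two update rules translate directly into code. The following is a small sketch; the function name `beta_posterior`, the \(Beta(1, 1)\) prior, and the sample data are assumptions made for illustration.

```python
def beta_posterior(alpha, beta, y):
    """Return (alpha', beta') for a Beta prior and binary observations y."""
    n = len(y)
    successes = sum(y)  # sum of y_i, i.e. the number of 1s
    return alpha + successes, beta + n - successes


# Assumed prior Beta(1, 1) and five hypothetical observations with three 1s.
alpha_post, beta_post = beta_posterior(1.0, 1.0, [1, 0, 1, 1, 0])
print(alpha_post, beta_post)  # 4.0 3.0

# The mean of a Beta(a, b) distribution is a / (a + b).
posterior_mean = alpha_post / (alpha_post + beta_post)
```

The number of 1s is added to \(\alpha\) and the number of 0s to \(\beta\), so the hyperparameters can be read as pseudo-counts of prior successes and failures.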

Procedure#

Let’s lay out the procedure for Bayesian target encoding. Suppose you have \(n\) training observations, with \(Y = (y_{1}, ..., y_{n})\) representing the target and a categorical variable \(X_{1} = (x_{1}, ..., x_{n})\) with distinct values (levels) \(V = (v_{1}, ..., v_{l})\).

1. Choose a likelihood for the target variable (e.g. Bernoulli for binary classification).
2. Derive the conjugate prior for the likelihood (e.g. Beta).
3. Use the training data to initialize the hyperparameters of the prior distribution (e.g. \(\alpha\) and \(\beta\)) [1].
4. Derive the update rule that maps the prior hyperparameters to the posterior hyperparameters.
5. For each level \(v_{i} \in V\):
   1. Compute the posterior distribution from the target values of the \(m\) observations in that level, \(\{y_{j} : x_{j} = v_{i}\}\).
   2. Set the encoding value to a sample from the posterior distribution [2].
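The procedure can be sketched end to end for the Bernoulli/Beta case. This is an illustration under stated assumptions, not the library's implementation: the function name `fit_bayesian_target_encoding`, the `prior_weight` and `seed` parameters, and the choice to center the prior on the overall target mean are all hypothetical. Only the standard library is used.

```python
import random
from collections import defaultdict


def fit_bayesian_target_encoding(x, y, prior_weight=1.0, seed=0):
    """Map each level of x to one sample from its Beta posterior."""
    rng = random.Random(seed)

    # Initialize the prior from the training data (assumed scheme):
    # a Beta prior centered on the overall target mean.
    overall_mean = sum(y) / len(y)
    alpha = prior_weight * overall_mean
    beta = prior_weight * (1 - overall_mean)
    if alpha <= 0 or beta <= 0:
        raise ValueError("the target must contain both classes")

    # Group the target values by level: {y_j : x_j = v_i}.
    targets = defaultdict(list)
    for level, target in zip(x, y):
        targets[level].append(target)

    encoding = {}
    for level, ys in targets.items():
        # Conjugate update for this level.
        alpha_post = alpha + sum(ys)
        beta_post = beta + len(ys) - sum(ys)
        # The encoding value is a draw from the posterior.
        encoding[level] = rng.betavariate(alpha_post, beta_post)
    return encoding


x = ["a", "a", "b", "b", "b", "c"]
y = [1, 0, 1, 1, 1, 0]
encoding = fit_bayesian_target_encoding(x, y)
print(encoding)
```

Because the encoding is a random draw, levels with few observations produce more variable values than levels with many. Fixing `seed` makes the sketch reproducible.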