Sunday, 16 April 2023

The Kullback-Leibler Divergence: a bridge between Probability Theory and Information Theory

1. Introduction

Evaluating the difference between two probability distributions is an important task in many fields of statistics, machine learning, and data science. For example, we may want to compare the performance of different models, test hypotheses about the underlying data-generating process, or measure the divergence or similarity between two populations or groups.

There are different ways to evaluate the difference between two probability distributions, depending on the type and nature of the data, the assumptions we make about the distributions, and the goal of the analysis. Some common methods are:

  •  The Kolmogorov-Smirnov test: This is a non-parametric test that compares the empirical cumulative distribution functions (CDFs) of two samples and measures the maximum vertical distance between them. The test can be used to check if two samples come from the same distribution or if one distribution is stochastically larger than another.
  •  The Kullback-Leibler divergence: This is a measure of how much information is lost when one distribution is used to approximate another. It is also known as relative entropy or information divergence. The Kullback-Leibler divergence is not symmetric and does not satisfy the triangle inequality, so it is not a true distance metric. However, it can be used to compare the fit of different models or to quantify the uncertainty or surprise in a data set.
  •  The Wasserstein distance: This is a measure of how much mass needs to be moved to transform one distribution into another. It is also known as the earth mover's distance or the optimal transport distance. The Wasserstein distance is symmetric and satisfies the triangle inequality, so it is a true distance metric. It can be used to compare distributions that have different shapes or supports or to incorporate prior knowledge or constraints into the comparison.

 

In this blog post, we will look at the Kullback-Leibler divergence, a widely used concept in probability theory and information theory with many applications in data analysis, machine learning, and statistics.


1.1 What is Kullback-Leibler divergence?

 

The Kullback-Leibler divergence is a way of quantifying the difference between two probability distributions p and q over the same variable x. It is defined as the expected logarithmic difference between p and q, where p is the true distribution and q is the approximating distribution:

KLD(p ∥ q) = E_p[ log( p(x) / q(x) ) ] = Σ_x p(x) · log( p(x) / q(x) )


This formula means that we sum up the products of p(x) and the logarithm of the ratio p(x)/q(x) over all possible values of x. The logarithm can be taken with any base, but usually base 2 (giving the result in bits) or the natural base e (giving it in nats) is used.

More explicitly, the Kullback-Leibler divergence measures how much information is lost when we use q to approximate p, so it tells us how many extra bits we need to encode samples from p using a code based on q, instead of using a code based on p. The smaller the Kullback-Leibler divergence, the more similar the two distributions are.

For example, suppose we have two distributions p and q over the letters A, B, C, and D, as shown in the table below:


 


   
Letter    p        q
A         0.5      0.25
B         0.25     0.125
C         0.125    0.5
D         0.125    0.125

The Kullback-Leibler divergence from p to q, using base-2 logarithms, is:

 

KLD(p||q) = 0.5 log(0.5/0.25) + 0.25 log(0.25/0.125) + 0.125 log(0.125/0.5) + 0.125 log(0.125/0.125)

           = 0.5 + 0.25 - 0.25 + 0

           = 0.5

 

This result means that, on average, we need an extra half bit per letter to encode samples from p using a code based on q.

 

The Kullback-Leibler divergence from q to p is:

 

KLD(q||p) = 0.25 log(0.25/0.5) + 0.125 log(0.125/0.25) + 0.5 log(0.5/0.125) + 0.125 log(0.125/0.125)

           = -0.25 - 0.125 + 1 + 0

           = 0.625

 

So we must add 0.625 bits per letter, on average, when encoding samples from q using a code based on p.

Notice that the Kullback-Leibler divergence is not symmetric: KLD(p||q) is not equal to KLD(q||p). This is because it depends on which distribution we use as the reference and which one is the approximation.

 

Also, notice that the Kullback-Leibler divergence is always non-negative: KLD(p||q) >= 0 for any p and q, and it is equal to zero if and only if p and q are identical.
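To make the calculation above easy to reproduce, here is a minimal Python sketch (an illustration added for this post, using base-2 logarithms as in the worked example) that recovers the two values 0.5 and 0.625:

```python
import math

# The two distributions over the letters A, B, C, D from the table above
p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
q = {"A": 0.25, "B": 0.125, "C": 0.5, "D": 0.125}

def kld(p, q):
    """Kullback-Leibler divergence KLD(p || q) in bits (base-2 logarithms)."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p)

print(kld(p, q))  # 0.5   extra bits per letter when encoding p with a code built for q
print(kld(q, p))  # 0.625 extra bits per letter when encoding q with a code built for p
```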


1.2 Why is Kullback-Leibler divergence useful?


The Kullback-Leibler divergence has many applications in various fields of science and engineering, such as:


  •  Data compression: measuring how well a compression algorithm preserves the information in the original data.
  •  Model selection: comparing how well different models fit the observed data and choosing the best one according to some criterion.
  •  Information retrieval: measuring how relevant a document is to a query by comparing their probability distributions over words or topics.
  •  Machine learning: measuring how well a learning algorithm minimizes the error between its predictions and the true labels, or how well it maximizes the likelihood of the data.
  •  Statistical inference: quantifying how much evidence the data provide for one hypothesis over another by comparing the probability distributions over outcomes that the hypotheses imply.


1.3 How to compute Kullback-Leibler divergence?

 

The Kullback-Leibler divergence can be computed directly from the formula given above if we know the probability distributions p and q explicitly.

 

However, in many cases, we only have samples from p and q, and we need to estimate their distributions from these samples.

 

One way to do this is to use frequency counts. This means we count how many times each possible outcome x occurs in the samples from p and q and divide by the total number of samples to get the empirical probabilities pe(x) and qe(x). Then we plug these empirical probabilities into the formula for the Kullback-Leibler divergence and get an estimate of how different p and q are.

 

However, this method has some drawbacks. First, it requires that we know all the possible outcomes x in advance, which may not be feasible for complex or continuous distributions. Second, it may be inaccurate if the samples are small or sparse, as some outcomes may have zero or very low-frequency counts, leading to undefined or unreliable logarithms. Third, it may be biased if the samples are not representative of the true distributions, as they may overestimate or underestimate some probabilities.
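As a sketch of the frequency-count approach, the snippet below estimates the divergence from two lists of samples; the small additive (Laplace-style) pseudo-count is one common workaround for the zero-count problem, and its value here is an arbitrary illustrative choice rather than a recommendation:

```python
from collections import Counter
import math

def empirical_kld(samples_p, samples_q, alpha=1e-3):
    """Estimate KLD(p || q) in bits from samples, using smoothed frequency counts."""
    outcomes = set(samples_p) | set(samples_q)   # assumes a discrete, finite outcome space
    cp, cq = Counter(samples_p), Counter(samples_q)
    n_p, n_q = len(samples_p), len(samples_q)
    estimate = 0.0
    for x in outcomes:
        pe = (cp[x] + alpha) / (n_p + alpha * len(outcomes))  # smoothed empirical p(x)
        qe = (cq[x] + alpha) / (n_q + alpha * len(outcomes))  # smoothed empirical q(x)
        estimate += pe * math.log2(pe / qe)
    return estimate

# Tiny made-up example: two short strings of letters standing in for samples
print(empirical_kld(list("AABACABAAB"), list("CCBADCDCBC")))
```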


1.4 What are the benefits of Kullback-Leibler divergence normalization?

 

The Kullback-Leibler divergence has some drawbacks, such as being asymmetric, unbounded, and undefined (or infinite) when the approximating distribution assigns zero probability to an event that has positive probability under the reference distribution. To overcome these limitations, some researchers have proposed to normalize the Kullback-Leibler divergence by dividing it by a suitable constant or function.


First, normalization makes the Kullback-Leibler divergence symmetric, meaning that the divergence from p to q is equal to the divergence from q to p. This property is desirable for many applications where the direction of comparison is irrelevant or arbitrary, such as clustering, classification, or information retrieval. Symmetry is also a prerequisite for interpreting the normalized Kullback-Leibler divergence as a distance metric that satisfies the triangle inequality and the identity of indiscernibles.

 

Second, normalization makes the Kullback-Leibler divergence bounded in a finite range of values. This property is valuable for comparing different pairs of distributions or aggregating multiple divergences into a single score. Boundedness also enables us to interpret the normalized Kullback-Leibler divergence as a measure of similarity or dissimilarity that reflects how much information is lost or gained when approximating one distribution by another.

 

Third, normalization makes the Kullback-Leibler divergence defined for all pairs of distributions, even when they have disjoint supports or zero probabilities. This property is essential for dealing with sparse or incomplete data, where some events may not be observed or estimated. Normalization also allows us to handle cases where the support of one distribution is a proper subset of the support of another, which would otherwise result in an infinite or undefined divergence.

 

In conclusion, normalization of the Kullback-Leibler divergence has several advantages over the original Kullback-Leibler divergence, such as symmetry, boundedness, and definedness. These advantages make the normalized Kullback-Leibler divergence more suitable and robust for various tasks that involve comparing or contrasting probability distributions.
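As a concrete illustration of a symmetric and bounded quantity built from the Kullback-Leibler divergence, the sketch below computes the Jensen-Shannon divergence for the letter example from the introduction. This is offered as one representative construction, not as the specific normalization that any particular proposal uses:

```python
import math

def kld(p, q):
    """KLD(p || q) in bits for two aligned lists of probabilities (0·log 0 treated as 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric, bounded by 1 bit, and defined even when p and q have disjoint supports."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # mixture of the two distributions
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.125, 0.5, 0.125]
print(jensen_shannon(p, q), jensen_shannon(q, p))  # identical values, by symmetry
```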


2. An information-theoretic perspective

 

One way to quantify the uncertainty reduction in Bayesian inference is to use the Kullback-Leibler divergence, also known as the relative entropy, between two probability distributions. Under this perspective, the Kullback-Leibler divergence measures how much information is lost when one distribution is used to approximate another. Specifically, in Bayesian inference, the Kullback-Leibler divergence can be used as a measure of the information gain in moving from a prior distribution, p(θ), to a posterior distribution, p(θ | y), where y is the vector of observed data. This gain is quantified by the Kullback-Leibler divergence of the posterior from the prior:


KLD( p(θ | y) ∥ p(θ) ) = ∫ p(θ | y) · log( p(θ | y) / p(θ) ) dθ

This quantity can be interpreted as the expected logarithmic ratio of the posterior density to the prior density, where the expectation is taken with respect to the posterior distribution. A higher Kullback-Leibler divergence indicates a larger discrepancy between the prior and the posterior, or a greater amount of information gained from the data. A lower Kullback-Leibler divergence indicates a smaller discrepancy between the prior and the posterior, or a lesser amount of information gained from the data.

The Kullback-Leibler divergence has some desirable properties for Bayesian inference. For example, it is always non-negative and equals zero if and only if the prior and the posterior are identical. It is also invariant under one-to-one transformations of the parameter space. Moreover, its expectation over the marginal distribution of the data, p(y), can be read as the amount of information the experiment is expected to provide about the parameter. This leads to the concept of reference priors, which are priors that maximize this expected Kullback-Leibler divergence and thus maximize the information gain from the data. Reference priors are often considered objective or non-informative priors in Bayesian inference.
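To make the prior-to-posterior information gain concrete, here is a small numerical sketch under an assumed Beta-Bernoulli model (the prior, the data, and the grid-based integration are illustrative assumptions, not part of the discussion above):

```python
import numpy as np
from scipy.stats import beta

# Assumed example: a Beta(1, 1) prior on a coin's bias, then 7 heads in 10 tosses
heads, n = 7, 10
prior = beta(1, 1)
posterior = beta(1 + heads, 1 + n - heads)

# KLD( p(theta|y) || p(theta) ) = integral of p(theta|y) * log( p(theta|y) / p(theta) ),
# approximated here by a simple Riemann sum on a fine grid
theta = np.linspace(1e-6, 1 - 1e-6, 100_000)
post_pdf = posterior.pdf(theta)
prior_pdf = prior.pdf(theta)
kld = np.sum(post_pdf * np.log(post_pdf / prior_pdf)) * (theta[1] - theta[0])

print(f"Information gained from the data: {kld:.3f} nats")
```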

 

Another application of Kullback-Leibler divergence in Bayesian inference is to construct credible intervals for the parameters based on the lowest posterior loss principle. The idea is to find an interval that minimizes the expected loss under a given loss function. One possible loss function is based on the Kullback-Leibler divergence between a point estimate and the true parameter value. For example, if we use a point estimate θ̂ that maximizes the posterior density (the mode), then we can define a loss function as:

L(θ̂, θ) = KLD( δ_θ̂ ∥ p(θ | y) ) = − log( p(θ̂ | y) )

where δ_θ̂ is a Dirac delta function centered at θ̂. The expected loss under this loss function is given by:

E[ L(θ̂, θ) ] = − ∫ p(θ | y) · log( p(θ | y) ) dθ

This quantity can be minimized by finding an interval that contains most of the posterior mass around the mode. Such an interval is called a lowest posterior loss interval or a highest posterior density interval. It can be shown that this interval has a minimum length among all intervals with a given posterior probability.
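Continuing the assumed Beta-Bernoulli example above, the highest posterior density interval can be found numerically by keeping the highest-density region that covers a chosen fraction of the posterior mass (the 95% level and the grid approximation are illustrative choices):

```python
import numpy as np
from scipy.stats import beta

posterior = beta(8, 4)                       # posterior from the assumed example above
theta = np.linspace(0, 1, 10_001)
density = posterior.pdf(theta)
mass = density / density.sum()               # normalized grid approximation of the posterior

# Keep the highest-density grid points until 95% of the posterior mass is covered
order = np.argsort(density)[::-1]
covered = np.cumsum(mass[order])
kept = np.sort(order[: np.searchsorted(covered, 0.95) + 1])

print(f"Approximate 95% HPD interval: [{theta[kept[0]]:.3f}, {theta[kept[-1]]:.3f}]")
```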

Therefore, in terms of Bayesian inference, the Kullback-Leibler divergence can be used as a measure of the information gain in moving from a prior distribution, p(θ), to a posterior distribution, p(θ | y). As such, Kullback-Leibler divergence is the basis of reference priors and lowest posterior loss intervals. Reference priors are priors that maximize the expected Kullback-Leibler divergence between the prior and the posterior, thus ensuring that the posterior is as informative as possible. The lowest posterior loss intervals are intervals that minimize the expected Kullback-Leibler divergence between the posterior and a point estimate, thus providing optimal summaries of the posterior distribution. Both reference priors and lowest posterior loss intervals are useful tools for Bayesian inference and decision-making under uncertainty.


2.1 Relationship between Kullback-Leibler divergence and the theorem of Sanov

 

We have shown that Kullback-Leibler divergence is a measure of the difference between two probability distributions p(x) and q(x) over the same variable x. It quantifies the information loss when q(x) is used to approximate p(x).

The theorem of Sanov gives a bound on the probability of observing an atypical sequence of samples from a given probability distribution q(x). It states that if we draw n i.i.d. samples from q(x), represented by the vector Xⁿ = (X₁, X₂, …, Xₙ), then the probability that the empirical measure of the samples falls within a set A of probability distributions over x is bounded, up to a polynomial factor in n, by:

P_q( p̂(Xⁿ) ∈ A ) ≤ exp( − n · KLD( p*(x) ∥ q(x) ) )

where p*(x) is the information projection of q(x) onto A, that is, the distribution p in A that minimizes the Kullback-Leibler divergence KLD(p ∥ q).

The theorem of Sanov can be seen as a large deviation result for the empirical measure of a sequence of i.i.d. random variables. It identifies the rate function for large deviations as the Kullback-Leibler divergence from the atypical (empirical) distribution to the true one. It also shows that there is a dominant atypical distribution, given by the information projection, that determines the asymptotic behaviour of the probability.

 

The relationship between Kullback-Leibler divergence and Sanov's theorem can be summarized as follows: Kullback-Leibler divergence measures how different two distributions are, while Sanov's theorem bounds how likely it is to observe a sequence of samples that follows a different distribution from the true one.
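A small simulation can make this relationship tangible. In the sketch below the true distribution q is a fair coin and A is the set of distributions with mean at least 0.7, so the information projection is a Bernoulli(0.7) distribution; all of these choices are illustrative assumptions:

```python
import math
import random

def kld_bernoulli(a, b):
    """KLD( Bernoulli(a) || Bernoulli(b) ) in nats."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

q_bias, threshold, n, trials = 0.5, 0.7, 50, 100_000
random.seed(0)

# Empirical probability that the sample mean of n fair-coin flips is at least 0.7
hits = sum(
    sum(random.random() < q_bias for _ in range(n)) >= threshold * n
    for _ in range(trials)
)

rate = kld_bernoulli(threshold, q_bias)      # exponent given by the information projection
print(f"simulated probability: {hits / trials:.2e}")
print(f"exp(-n * KLD):         {math.exp(-n * rate):.2e}")
```

As n grows, the simulated probability decays at the exponential rate set by KLD(p* ∥ q), which is exactly the behaviour Sanov's theorem describes.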


2.2 Information projection

In the introduction section, the Kullback-Leibler divergence is presented as one way to measure the similarity between two probability distributions. However, sometimes we are interested in finding the closest distribution to a given one within a certain set of distributions. This problem can be formulated as the information projection, or I-projection, of a probability distribution q onto a set of distributions P. The I-projection of q onto P is defined as the distribution p* in P that minimizes the Kullback-Leibler divergence from p to q, i.e., p* = argmin_(p ∈ P) KLD(p ∥ q).

Thus, while the Kullback-Leibler divergence measures the amount of information lost when using one distribution to approximate another, the I-projection of q onto P is the distribution within P that minimizes this divergence and thus preserves as much information about q as possible. The I-projection has many applications in statistics, machine learning, and information theory. For example, it can be used to find the maximum entropy distribution that satisfies some constraints or to perform variational inference in Bayesian models.
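The I-projection can also be computed numerically. The sketch below projects a uniform reference distribution onto the convex set of distributions with a fixed mean over four support points; the constraint value and the use of a general-purpose optimizer are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

values = np.array([1.0, 2.0, 3.0, 4.0])
q = np.array([0.25, 0.25, 0.25, 0.25])        # reference distribution
target_mean = 3.0                             # P = { p : E_p[x] = 3 }, a convex set

def kld(p):
    """KLD(p || q) in nats for the discrete distribution p."""
    return float(np.sum(p * np.log(p / q)))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},                   # p sums to one
    {"type": "eq", "fun": lambda p: np.dot(p, values) - target_mean},   # mean constraint
]
result = minimize(kld, x0=q, bounds=[(1e-9, 1.0)] * len(q), constraints=constraints)
p_star = result.x

print(np.round(p_star, 4), round(kld(p_star), 4))
```

For a set defined by linear (moment) constraints like this one, the minimizer takes an exponential-family form, which is the connection to maximum entropy modelling mentioned above.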

Notably, KLD(p ∥ q) is never lower than KLD(p ∥ p*) + KLD(p* ∥ q) for any p in P, provided that the set P is convex. This is known as a Pythagorean inequality for the I-projection.

Let's remember that a set of probability distributions P is convex if it satisfies the following condition: for any two distributions p₁ and p₂ in P and any λ ∈ [0, 1], the mixture λ·p₁ + (1 − λ)·p₂ also belongs to P. This means that if we "draw a line" between any two distributions in P, every distribution along that line is also in P.

Also, remember that the support of a probability distribution can be loosely thought of as the closure of the set of possible values of a random variable having that distribution. It is the smallest closed set whose complement has probability zero.

In more detail, one of the advantages of working with a convex set P is that the Kullback-Leibler divergence is a convex function of its arguments, so minimizing KLD(p ∥ q) over a convex set is a convex optimization problem: there are no spurious local minima, and the I-projection p*, when it exists, is unique. Convexity of P is also exactly the condition under which the Pythagorean inequality above holds, which lets us decompose divergences in a way that is analogous to how squared Euclidean distances decompose around an orthogonal projection. Finally, convex families of distributions arise naturally in practice, for example as sets of distributions satisfying linear (moment) constraints, which is the setting of maximum entropy modelling and of Sanov's theorem discussed above.

When P is not convex, it is possible that KLD(p ∥ q) is lower than KLD(p ∥ p*) + KLD(p* ∥ q). This is because, without convexity, the optimization problem used to find p* may have multiple local minima, so the p* returned by an optimization algorithm may not be the global minimizer of KLD(p ∥ q) over P.


2.3 Kullback-Leibler divergence vs. cross-entropy

 

The Kullback-Leibler divergence and the cross-entropy are two closely related concepts in information theory that measure the difference between two probability distributions. We have interpreted the Kullback-Leibler divergence as the expected amount of extra information needed to encode data from distribution p using a code based on distribution q.

The cross-entropy is defined as:



H(p, q) = − Σ_x p(x) · log( q(x) )

where p and q are again two discrete probability distributions over the same set of events x. The cross-entropy can be explained as the expected number of bits needed to encode data from distribution p using a code based on distribution q. The cross-entropy is always greater than or equal to the entropy of p, which is defined as:

H(p) = − Σ_x p(x) · log( p(x) )

The entropy of p can be interpreted as the minimum number of bits needed to encode data from distribution p using an optimal code. The relationship between the Kullback-Leibler divergence and the cross-entropy can be seen by rearranging the terms in their definitions: 

 KLD (p ∥ q) = H (p, q) − H (p)

This equation shows that the Kullback-Leibler divergence is equal to the difference between the cross-entropy and the entropy of p. In other words, the Kullback-Leibler divergence measures how much more information is needed to encode data from distribution p using a code based on distribution q than using an optimal code based on distribution p itself.
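Continuing the letter example from the introduction, the short sketch below verifies this identity numerically (base-2 logarithms are assumed, as in that example):

```python
import math

p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.125, 0.5, 0.125]

cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))   # H(p, q)
entropy = -sum(pi * math.log2(pi) for pi in p)                     # H(p)
kld = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))         # KLD(p || q)

print(cross_entropy, entropy, cross_entropy - entropy, kld)
# 2.25  1.75  0.5  0.5  ->  KLD(p || q) = H(p, q) - H(p)
```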

 

The Kullback-Leibler divergence and the cross-entropy are often used in machine learning to measure the discrepancy between a target distribution p (such as the true labels of data points) and a model distribution q (such as the predictions of a classifier). Since the entropy of p does not depend on q, minimizing the Kullback-Leibler divergence with respect to q is equivalent to minimizing the cross-entropy, and both are equivalent to maximizing the likelihood of the data under the model distribution. Therefore, the two quantities can be used interchangeably as loss functions in machine learning.


2.4 Kullback-Leibler divergence vs. free-energy

The free energy principle is a mathematical framework that describes how adaptive systems, such as brains, minimize their surprise or uncertainty about their environment by changing their internal states (perception) or external states (action). The free energy principle states that the free energy of a system is an upper bound on its surprise and that minimizing free energy is equivalent to minimizing surprise.

 

One way to understand the relationship between the Kullback-Leibler divergence and the free energy principle is to consider how a system represents the causes of its sensory inputs. The system has a generative model that specifies the probability of sensory inputs given their causes, and a recognition model that specifies the probability of causes given sensory inputs. The recognition model is also known as the posterior distribution because it reflects the system's updated beliefs after observing the data. The Kullback-Leibler divergence between the recognition model and the true posterior distribution is a measure of how well the system infers the causes of its sensory inputs.

 

The free energy principle says that the system can reduce its free energy by changing its recognition model to make it closer to the true posterior distribution. This means that the system implicitly performs Bayesian inference, updating its beliefs in a way that maximizes their accuracy and minimizes their uncertainty. By doing so, the system also reduces its surprise, because it becomes more able to predict its sensory inputs. Therefore, minimizing free energy is equivalent to minimizing the Kullback-Leibler divergence between the recognition model and the true posterior distribution.

 

Another way to understand the relationship between the Kullback-Leibler divergence and the free energy principle is to consider how a system acts on its environment. The system has a prior model that specifies its preferences or goals, and an action model that specifies how its actions affect its sensory inputs. The action model is also known as the likelihood function because it reflects how likely the system is to observe certain outcomes given its actions. The Kullback-Leibler divergence between the prior model and the action model is a measure of how well the system achieves its goals by acting on its environment.

 

The free energy principle says that the system can reduce its free energy by changing its actions to make them more consistent with its prior model. This means that the system implicitly performs optimal control, choosing actions that maximize its expected reward or utility. By doing so, the system also reduces its surprise, because it gains greater control over its sensory inputs. Therefore, minimizing free energy is equivalent to minimizing the Kullback-Leibler divergence between the prior model and the action model.

 

Thus, the relationship between the Kullback-Leibler divergence and the free energy principle is that both quantify how well a system perceives and acts on its environment. The Kullback-Leibler divergence measures the difference between two probability distributions, while the free energy combines such a divergence with surprise (a scalar quantity) and provides an upper bound on it. Minimizing free energy therefore implies minimizing both types of Kullback-Leibler divergence: one for perception and one for action.


Summary

The Kullback-Leibler divergence is a measure of how different two probability distributions are. It is often used in machine learning, statistics, and information theory to compare models, test hypotheses, and quantify uncertainty. This blog post introduced the concept of the Kullback-Leibler divergence and its mathematical properties, such as its formula and normalization. We have seen that the divergence measures the difference between two probability distributions in terms of their relative entropy. We have also discussed how to normalize the divergence to obtain a value in a finite interval. The post then presented some applications and interpretations of the divergence in various domains of information theory. Under this perspective, we have discovered how the Kullback-Leibler divergence relates to Sanov's theorem, which gives a bound on the probability of observing a certain empirical distribution given a true distribution. We have also explained how the divergence can be used to define the I-projection, which is the closest distribution to a given one under some constraints. Moreover, we have examined how the divergence relates to the cross-entropy, which measures the expected number of bits needed to encode data from one distribution using a code based on another distribution. Finally, we have explored how the divergence can be linked to the free-energy principle, which states that any system that minimizes its free energy also minimizes its divergence from the true distribution of its environment. I hope that this blog post has given you a deeper understanding of the Kullback-Leibler divergence and its role in information theory.





