1. Introduction
Evaluating the difference between two probability distributions is an important task in many fields of statistics, machine learning, and data science. For example, we may want to compare the performance of different models, test hypotheses about the underlying data-generating process, or measure the divergence or similarity between two populations or groups.
There are
different ways to evaluate the difference between two probability
distributions, depending on the type and nature of the data, the assumptions we
make about the distributions, and the goal of the analysis. Some common methods
are:
- The Kolmogorov-Smirnov test: This is a non-parametric test that compares the empirical cumulative distribution functions (CDFs) of two samples and measures the maximum vertical distance between them. The test can be used to check whether two samples come from the same distribution or whether one distribution is stochastically larger than the other.
- The Kullback-Leibler divergence: This is a measure of how much information is lost when one distribution is used to approximate another. It is also known as relative entropy or information divergence. The Kullback-Leibler divergence is not symmetric and does not satisfy the triangle inequality, so it is not a true distance metric. However, it can be used to compare the fit of different models or to quantify the uncertainty or surprise in a data set.
- The Wasserstein distance: This is a measure of how much mass needs to be moved to transform one distribution into another. It is also known as the earth mover's distance or the optimal transport distance. The Wasserstein distance is symmetric and satisfies the triangle inequality, so it is a true distance metric. It can be used to compare distributions that have different shapes or supports, or to incorporate prior knowledge or constraints into the comparison.
In this blog post, we will look at the Kullback-Leibler divergence, a widely used concept in probability theory and information theory with many applications in data analysis, machine learning, and statistics.
1.1 What is Kullback-Leibler divergence?
The Kullback-Leibler divergence is a way of quantifying the difference between two probability distributions p and q over the same variable x. It is defined as the expected logarithmic difference between p and q, where p is the true distribution and q is the approximate distribution:

KLD(p||q) = E_p[log(p(x)/q(x))] = Σ_x p(x) log(p(x)/q(x))

This formula means that we sum up the products of p(x) and the logarithm of the ratio p(x)/q(x) over all possible values of x. The logarithm can be taken with any base, but base 2 (bits) or the natural base e (nats) is most common.
More explicitly, the Kullback-Leibler
divergence measures how much information is lost when we use q to approximate p, so it tells us how many extra bits we need to encode samples from p
using a code based on q, instead of using a code based on p. The smaller the Kullback-Leibler
divergence, the more similar the two distributions are.
For example, suppose we have two distributions p and q
over the letters A, B, C, and D, as shown in the table below:
| Letter | p | q |
|---|---|---|
| A | 0.5 | 0.25 |
| B | 0.25 | 0.125 |
| C | 0.125 | 0.5 |
| D | 0.125 | 0.125 |
The Kullback-Leibler divergence from p to q, using base-2 logarithms, is:

KLD(p||q) = 0.5 log(0.5/0.25) + 0.25 log(0.25/0.125) + 0.125 log(0.125/0.5) + 0.125 log(0.125/0.125)
= 0.5 + 0.25 - 0.25 + 0
= 0.5

This result means we need half an extra bit per letter to encode samples from p using a code based on q.
The Kullback-Leibler divergence from q to p is:

KLD(q||p) = 0.25 log(0.25/0.5) + 0.125 log(0.125/0.25) + 0.5 log(0.5/0.125) + 0.125 log(0.125/0.125)
= -0.25 - 0.125 + 1 + 0
= 0.625

So we need an extra 0.625 bits per letter to encode samples from q using a code based on p.
Notice that the Kullback-Leibler divergence is not
symmetric: KLD(p||q) is not equal to KLD(q||p). This is because it depends on
which distribution we use as the reference and which one is the approximation.
Also, notice that the Kullback-Leibler divergence is always non-negative: KLD(p||q) >= 0 for any p and q, and it is equal to zero if and only if p and q are identical.
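For readers who want to check the arithmetic, here is a minimal Python sketch (using base-2 logarithms, as in the calculation above) that reproduces both divergences for the table:

```python
from math import log2

# Distributions from the table above.
p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
q = {"A": 0.25, "B": 0.125, "C": 0.5, "D": 0.125}

def kld(p, q):
    """Kullback-Leibler divergence in bits: sum of p(x) * log2(p(x)/q(x))."""
    return sum(p[x] * log2(p[x] / q[x]) for x in p)

print("KLD(p||q) =", kld(p, q))  # 0.5 bits
print("KLD(q||p) =", kld(q, p))  # 0.625 bits
```

The asymmetry discussed above is visible directly in the two printed values.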
1.2 Why is Kullback-Leibler divergence useful?
The Kullback-Leibler divergence has many applications in various fields of science and engineering, such as:
- Data compression: The Kullback-Leibler divergence can be used to measure how well a compression algorithm preserves the information in the original data.
- Model selection: The Kullback-Leibler divergence can be used to compare how well different models fit the observed data and choose the best one according to some criterion.
- Information retrieval: The Kullback-Leibler divergence can be used to measure how relevant a document is to a query, by comparing their probability distributions over words or topics.
- Machine learning: The Kullback-Leibler divergence can be used to measure how well a learning algorithm minimizes the error between its predictions and the true labels, or how well it maximizes the likelihood of the data.
- Statistical inference: The Kullback-Leibler divergence can be used to measure how much evidence the data provide for one hypothesis over another, by comparing their probability distributions over outcomes.
1.3 How to compute Kullback-Leibler divergence?
The Kullback-Leibler divergence can be computed
directly from the formula given above if we know the probability distributions p
and q explicitly.
However, in many cases, we only have samples from p
and q, and we need to estimate their distributions from these samples.
One way to do this is to use frequency counts. This
means we count how many times each possible outcome x occurs in the
samples from p and q and divide by the total number of samples to get the
empirical probabilities pe(x) and qe(x). Then we plug
these probabilities into the formula for the Kullback-Leibler divergence and
get an estimate of how different p and q are.
However, this method has some drawbacks. First, it
requires that we know all the possible outcomes x in advance, which may not be
feasible for complex or continuous distributions. Second, it may be inaccurate
if the samples are small or sparse, as some outcomes may have zero or very low-frequency counts, leading to undefined or unreliable logarithms. Third, it may
be biased if the samples are not representative of the true distributions, as
they may overestimate or underestimate some probabilities.
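Here is a minimal sketch of this frequency-count approach. The samples, the outcome set, and the pseudocount below are illustrative assumptions of mine; the pseudocount is simply one common way to sidestep the zero-count problem just described.

```python
from collections import Counter
from math import log2

def empirical_dist(samples, outcomes, pseudocount=1e-6):
    """Estimate probabilities by frequency counts over a known outcome set.

    The small pseudocount keeps every probability strictly positive so that
    the logarithms in the divergence stay defined (an assumed workaround).
    """
    counts = Counter(samples)
    total = len(samples) + pseudocount * len(outcomes)
    return {x: (counts.get(x, 0) + pseudocount) / total for x in outcomes}

def kld(p, q):
    return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)

outcomes = ["A", "B", "C", "D"]
samples_p = ["A", "A", "B", "A", "C", "B", "D", "A"]   # toy samples (made up)
samples_q = ["C", "C", "A", "D", "C", "B", "C", "A"]
pe, qe = empirical_dist(samples_p, outcomes), empirical_dist(samples_q, outcomes)
print("estimated KLD(pe||qe) =", round(kld(pe, qe), 4), "bits")
```

With so few samples the estimate is noisy, which is exactly the small-sample drawback noted above.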
1.4 What are the benefits of Kullback-Leibler divergence normalization?
The Kullback-Leibler divergence has some drawbacks,
such as being asymmetric, unbounded, and undefined when one of the
distributions has zero probability for some events. To overcome these
limitations, some researchers have proposed to normalize the Kullback-Leibler
divergence by dividing it by a suitable constant or function.
First, normalization can make the Kullback-Leibler divergence symmetric, so that the result no longer depends on which distribution is taken as the reference and which as the approximation.
Second, normalization makes the Kullback-Leibler divergence bounded in a finite range of values. This property is valuable for
comparing different pairs of distributions or aggregating multiple divergences
into a single score. Boundedness also enables us to interpret the normalized Kullback-Leibler
divergence as a measure of similarity or dissimilarity that reflects how much
information is lost or gained when approximating one distribution by another.
Third, normalization makes the Kullback-Leibler
divergence defined for all pairs of distributions, even when they have disjoint
supports or zero probabilities. This property is essential for dealing
with sparse or incomplete data, where some events may not be observed or
estimated. Normalization also allows us to handle cases where one distribution
is a proper subset of another, which would otherwise result in an infinite or
undefined divergence.
In conclusion, normalization of the Kullback-Leibler
divergence has several advantages over the original Kullback-Leibler divergence,
such as symmetry, boundedness, and definedness. These advantages make the
normalized Kullback-Leibler divergence more suitable and robust for various
tasks that involve comparing or contrasting probability distributions.
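The text above does not pin down a specific normalization scheme, so the following is only an illustrative sketch: the smoothing constant and the x/(1+x) squashing are assumptions of mine, chosen to show how a normalized variant can be made symmetric, bounded, and defined even when one distribution assigns zero probability to some events.

```python
from math import log2

def kld(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def normalized_kld(p, q, eps=1e-9):
    """One illustrative normalization (not a standard named scheme):
    smooth, symmetrize, then squash into [0, 1) via x / (1 + x)."""
    smooth = lambda d: [(v + eps) / (1 + eps * len(d)) for v in d]
    ps, qs = smooth(p), smooth(q)
    sym = 0.5 * (kld(ps, qs) + kld(qs, ps))   # symmetric in p and q
    return sym / (1.0 + sym)                  # bounded in [0, 1)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.0, 0.125, 0.75, 0.125]                 # note the zero probability
print(round(normalized_kld(p, q), 4))         # defined, symmetric, bounded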
2. An information-theoretic perspective
One way to quantify the uncertainty reduction in
Bayesian inference is to use the Kullback-Leibler divergence, also known as the
relative entropy, between two probability distributions. Under this
perspective, the Kullback-Leibler divergence measures how much information is
lost when one distribution is used to approximate another. Specifically, in
Bayesian inference, the Kullback-Leibler divergence can be used as a measure of
the information gain in moving from a prior distribution, p(x), to a posterior distribution, p(x∣y):

KLD(p(x∣y) || p(x)) = E[log(p(x∣y)/p(x))] = Σ_x p(x∣y) log(p(x∣y)/p(x))
This quantity can be interpreted as the expected logarithmic ratio of the posterior density to the prior density, where the expectation is taken with respect to the posterior distribution. A higher Kullback-Leibler divergence indicates a larger discrepancy between the prior and the posterior, or a greater amount of information gained from the data. A lower Kullback-Leibler divergence indicates a smaller discrepancy between the prior and the posterior, or a lesser amount of information gained from the data.
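As a concrete illustration of this information gain, the sketch below puts a discrete uniform prior on the bias of a coin, updates it on some observed flips, and computes KLD(posterior || prior). The grid, the prior, and the data are made-up numbers, and numpy is an assumed dependency.

```python
import numpy as np

# Discrete grid of possible coin biases, with a uniform prior (toy example).
theta = np.linspace(0.05, 0.95, 19)
prior = np.full(theta.size, 1.0 / theta.size)

heads, tails = 8, 2                              # made-up observed data
likelihood = theta**heads * (1 - theta)**tails
posterior = prior * likelihood
posterior /= posterior.sum()                     # Bayes' rule on the grid

# Information gain: KLD(posterior || prior), expectation under the posterior.
info_gain = np.sum(posterior * np.log2(posterior / prior))
print("information gain =", round(float(info_gain), 3), "bits")
```

More decisive data would push the posterior further from the uniform prior and increase the printed information gain.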
Another application of Kullback-Leibler divergence in Bayesian inference is to construct credible intervals for a parameter θ based on the lowest posterior loss principle. The idea is to find an interval that minimizes the expected loss under a given loss function. One possible loss function is based on the Kullback-Leibler divergence between a point estimate and the true parameter value. For example, we can represent a point estimate θ̂ by the degenerate distribution δ_θ̂, where δ_θ̂ is a Dirac delta function centered at θ̂, and define the loss ℓ(θ̂, θ) through a Kullback-Leibler divergence between the distributions indexed by θ̂ and by the true value θ. The expected loss under this loss function is then the posterior expectation ∫ ℓ(θ̂, θ) p(θ∣y) dθ.
This quantity can be minimized by finding an interval that contains most of the posterior mass around the mode. Such an interval is called a lowest posterior loss interval or a highest posterior density interval. It can be shown that this interval has a minimum length among all intervals with a given posterior probability.
Therefore, in terms of Bayesian inference, the Kullback-Leibler divergence can be used as a measure of the information gain in moving from a prior distribution, p(x), to a posterior distribution, p(x∣y). As such, Kullback-Leibler divergence is the basis of reference priors and lowest posterior loss intervals. Reference priors are priors that maximize the expected Kullback-Leibler divergence between the prior and the posterior, thus ensuring that the posterior is as informative as possible. The lowest posterior loss intervals are intervals that minimize the expected Kullback-Leibler divergence between the posterior and a point estimate, thus providing optimal summaries of the posterior distribution. Both reference priors and lowest posterior loss intervals are useful tools for Bayesian inference and decision-making under uncertainty.
2.1 Relationship between Kullback-Leibler divergence and the theorem of Sanov
We have shown that Kullback-Leibler divergence is a
measure of the difference between two probability distributions p(x) and q(x)
over the same variable x. It quantifies the information loss when q(x) is used
to approximate p(x).
The theorem of Sanov gives a bound on the probability of observing an atypical sequence of samples from a given probability distribution q(x). It states that if we draw n i.i.d. samples from q(x), represented by the vector (x_1, x_2, ..., x_n), then the probability that the empirical measure of the samples falls within a set A of probability distributions over x is bounded by

P(empirical measure ∈ A) <= (n + 1)^|X| 2^(-n KLD(p*||q)),

where |X| is the number of possible values of x and p*(x) is the information projection of q(x) onto A, that is, the distribution in A that minimizes the Kullback-Leibler divergence KLD(p||q) over all p in A.
The theorem of Sanov can be seen as a large deviation
result for the empirical measure of a sequence of i.i.d. random variables. It
identifies the rate function for large deviations as the Kullback-Leibler
divergence from the atypical distribution to the true one. It also shows that
there is a dominant atypical distribution, given by the information projection,
that determines the asymptotic behaviour of the probability.
The relationship between Kullback-Leibler divergence
and Sanov's theorem can be summarized as follows: Kullback-Leibler divergence measures how different two
distributions are, while Sanov's theorem bounds how likely it is to observe a
sequence of samples that follows a different distribution from the true one.
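As a rough numerical illustration (the fair-coin source, the sample size, and the 0.7 threshold are illustrative assumptions): take q to be Bernoulli(0.5) and let A be the set of distributions with P(heads) >= 0.7; then p* is Bernoulli(0.7), and the Sanov-style bound can be compared with the exact binomial tail probability.

```python
from math import comb, log2

n = 200
q_heads, a_heads = 0.5, 0.7          # true source q and the threshold defining A

def kld_bern(a, b):
    """KLD between Bernoulli(a) and Bernoulli(b), in bits."""
    return a * log2(a / b) + (1 - a) * log2((1 - a) / (1 - b))

# Exact probability that the empirical fraction of heads is >= 0.7 under q.
k_min = round(a_heads * n)           # smallest head count with fraction >= 0.7
exact = sum(comb(n, k) for k in range(k_min, n + 1)) / 2**n

# Sanov-style bound: (n + 1)^|X| * 2^(-n * KLD(p*||q)), with |X| = 2, p* = Bernoulli(0.7).
bound = (n + 1)**2 * 2**(-n * kld_bern(a_heads, q_heads))

print(f"exact tail probability ~ {exact:.2e}")
print(f"Sanov-style bound      ~ {bound:.2e}")  # loose, but captures the 2^(-n*KLD) decay
```

The polynomial prefactor makes the bound loose at moderate n; its value lies in pinning down the exponential decay rate as the Kullback-Leibler divergence.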
2.2 Information projection
In the introduction section, the Kullback-Leibler divergence
is presented as one way to measure the similarity between two probability
distributions. However, sometimes we are interested in finding the closest
distribution to a given one within a certain set of distributions. This problem
can be formulated as the information projection, or I-projection, of a probability distribution q onto a set of distributions P. The I-projection of q onto P is defined as the distribution p* in P that minimizes the Kullback-Leibler divergence KLD(p||q) over all p in P, i.e., p* = argmin_{p ∈ P} KLD(p||q).
Thus, while the Kullback-Leibler divergence measures the amount of information lost when using one distribution to approximate another, the I-projection of q onto P is the distribution in P that minimizes this divergence and thus preserves the most information about q. The I-projection has many applications in statistics, machine learning, and information theory. For example, it can be used to find the maximum entropy distribution that satisfies some constraints or to perform variational inference in Bayesian models.
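As a sketch of how one might compute an I-projection numerically (numpy and scipy are assumed dependencies, and the set P, here distributions on {0, 1, 2, 3} with mean 2.5, and the uniform q are illustrative choices of mine):

```python
import numpy as np
from scipy.optimize import minimize

q = np.array([0.25, 0.25, 0.25, 0.25])   # reference distribution over x = 0, 1, 2, 3
xs = np.arange(4)

def kld(p, q):
    """KLD(p||q) in bits, treating 0 * log(0/q) as 0."""
    p = np.asarray(p)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Constraint set P: distributions over {0, 1, 2, 3} with mean exactly 2.5 (a convex set).
cons = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
    {"type": "eq", "fun": lambda p: np.dot(p, xs) - 2.5},
]
res = minimize(kld, x0=np.full(4, 0.25), args=(q,),
               bounds=[(1e-9, 1.0)] * 4, constraints=cons, method="SLSQP")
p_star = res.x
print("I-projection p*:", np.round(p_star, 4))
print("KLD(p*||q) =", round(kld(p_star, q), 4), "bits")
```

The resulting p* shifts probability mass toward the larger values of x, as the mean constraint requires, while staying as close to q as the divergence allows.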
Notably, KLD(p||q) is never lower than KLD(p||p*) + KLD(p*||q) when the set P is convex and p belongs to P; this is the Pythagorean inequality for the I-projection.
Let's remember that a set of probability distributions P is convex if it satisfies the following condition: for any two distributions p1 and p2 in P and any λ ∈ [0,1], the mixture λp1 + (1 − λ)p2 also belongs to P. This means that if we blend any two members of P, the resulting mixture distribution stays inside P.
Also, remember that the support of a probability distribution can be loosely thought of as the closure of the set of possible values of a random variable having that distribution. It is the smallest closed set whose complement has probability zero.
In more detail, one advantage of P being convex is that finding the I-projection becomes a convex optimization problem: the Kullback-Leibler divergence KLD(p||q) is a convex function of p, so minimizing it over a convex set has no spurious local minima, and the minimizer p* is unique whenever the minimum is attained. Convexity of P is also what guarantees the Pythagorean inequality above, which lets us decompose the divergence from any p in P to q into a part within the set, KLD(p||p*), and a part from the set to q, KLD(p*||q), in analogy with right triangles in Euclidean geometry. Finally, many constraint sets that arise in practice, such as sets of distributions with fixed means, moments, or marginals, are convex, which is why I-projections appear naturally in maximum entropy modelling and in variational approximations.
When P is not convex, the Pythagorean inequality is no longer guaranteed, and KLD(p||q) can be lower than KLD(p||p*) + KLD(p*||q). In addition, the minimization that defines p* may then have multiple local minima, so the optimization algorithm used to find p* may return a distribution that is not the global minimizer of the divergence to q over P.
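A quick numerical sanity check of the Pythagorean inequality (the Bernoulli family and the specific numbers are illustrative assumptions): take q = Bernoulli(0.5) and the convex set P = {Bernoulli(θ) : θ >= 0.8}. Since the divergence to q grows as θ moves away from 0.5, the I-projection here is p* = Bernoulli(0.8).

```python
from math import log2

def kld_bern(a, b):
    """KLD between Bernoulli(a) and Bernoulli(b), in bits."""
    return a * log2(a / b) + (1 - a) * log2((1 - a) / (1 - b))

q, p_star, p = 0.5, 0.8, 0.9    # q outside P, p* its I-projection onto P, p another member of P

lhs = kld_bern(p, q)
rhs = kld_bern(p, p_star) + kld_bern(p_star, q)
print(f"KLD(p||q)               = {lhs:.4f} bits")
print(f"KLD(p||p*) + KLD(p*||q) = {rhs:.4f} bits")
print("Pythagorean inequality holds:", lhs >= rhs)
```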
2.3 Kullback-Leibler divergence vs. cross-entropy
The
Kullback-Leibler divergence and the cross-entropy are two closely
related concepts in information theory that measure the difference between two
probability distributions. We have interpreted the Kullback-Leibler divergence
as the expected amount of extra information needed to encode data from
distribution p using a code based on distribution q.
The cross-entropy is defined as:

H(p, q) = -Σ_x p(x) log q(x)

where p and q are again two discrete probability distributions over the same set of events x. The cross-entropy can be interpreted as the expected number of bits needed to encode data from distribution p using a code based on distribution q. The cross-entropy is always greater than or equal to the entropy of p, which is defined as:

H(p) = -Σ_x p(x) log p(x)

The entropy of p can be interpreted as the minimum number of bits needed to encode data from distribution p using an optimal code. The relationship between the Kullback-Leibler divergence and the cross-entropy can be seen by rearranging the terms in their definitions:

KLD(p||q) = Σ_x p(x) log(p(x)/q(x)) = -Σ_x p(x) log q(x) + Σ_x p(x) log p(x) = H(p, q) - H(p)
This equation shows that the Kullback-Leibler
divergence is equal to the difference between the cross-entropy and the entropy
of p. In other words, the Kullback-Leibler divergence measures how much more
information is needed to encode data from distribution p using a code based on
distribution q than using an optimal code based on distribution p itself.
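The identity KLD(p||q) = H(p, q) - H(p) can be verified with the letter distributions from section 1.1 (base-2 logarithms):

```python
from math import log2

p = [0.5, 0.25, 0.125, 0.125]     # distribution p over A, B, C, D (from section 1.1)
q = [0.25, 0.125, 0.5, 0.125]     # distribution q over A, B, C, D

cross_entropy = -sum(pi * log2(qi) for pi, qi in zip(p, q))      # H(p, q)
entropy       = -sum(pi * log2(pi) for pi in p)                  # H(p)
kld           =  sum(pi * log2(pi / qi) for pi, qi in zip(p, q)) # KLD(p||q)

print("H(p, q) =", cross_entropy)   # 2.25 bits
print("H(p)    =", entropy)         # 1.75 bits
print("KLD     =", kld)             # 0.5 bits = H(p, q) - H(p)
```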
The Kullback-Leibler
divergence and the cross-entropy are often used in machine learning to measure
the discrepancy between a target distribution p (such as the true labels of
data points) and a model distribution q (such as the predictions of a
classifier). Minimizing
the Kullback-Leibler divergence or the cross-entropy is equivalent to
maximizing the likelihood of the data under the model distribution. Moreover, since the entropy of p
does not depend on q, minimizing the Kullback-Leibler divergence is equivalent
to minimizing the cross-entropy. Therefore, both metrics can be used interchangeably
as loss functions in machine learning.
2.4 Kullback-Leibler divergence vs. free-energy
The free energy
principle is a mathematical framework that describes how adaptive systems, such
as brains, minimize their surprise or uncertainty about their environment by
changing their internal states (perception) or external states (action). The
free energy principle states that the free energy of a system is an upper bound
on its surprise and that minimizing free energy is equivalent to minimizing
surprise.
One way to
understand the relationship between the Kullback-Leibler divergence and the
free energy principle is to consider how a system represents the causes of its
sensory inputs. The system has a generative model that specifies the
probability of sensory inputs given their causes, and a recognition model
that specifies the probability of causes given sensory inputs. The recognition
model is also known as the posterior distribution because it reflects the
system's updated beliefs after observing the data. The Kullback-Leibler divergence between the
recognition model and the true posterior distribution is a measure of how well
the system infers the causes of its sensory inputs.
The free energy
principle says that the system can reduce its free energy by changing its
recognition model to make it closer to the true posterior distribution. This
means that the system implicitly performs Bayesian inference, updating its
beliefs in a way that maximizes their accuracy and minimizes their uncertainty.
By doing so, the system also reduces its surprise, because it becomes more able
to predict its sensory inputs. Therefore, minimizing free energy is equivalent to minimizing the
Kullback-Leibler divergence between the recognition model and the true
posterior distribution.
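This equivalence can be illustrated with a tiny discrete model (the numbers below are made up, and numpy is an assumed dependency): the variational free energy F(q) = E_q[log q(z) - log p(x, z)] always equals KLD(q || p(z∣x)) plus the surprise -log p(x), so driving F down drives the recognition model q toward the true posterior.

```python
import numpy as np

# Toy discrete generative model: hidden cause z in {0, 1}, observation x in {0, 1}.
prior = np.array([0.7, 0.3])                    # p(z) - hypothetical numbers
lik = np.array([[0.9, 0.1],                     # p(x|z=0)
                [0.2, 0.8]])                    # p(x|z=1)
x = 1                                           # observed sensory input

joint = prior * lik[:, x]                       # p(x, z) for the observed x
evidence = joint.sum()                          # p(x); surprise = -log p(x)
posterior = joint / evidence                    # true posterior p(z|x)

def free_energy(q):
    """Variational free energy F(q) = E_q[log q(z) - log p(x, z)] (in nats)."""
    return float(np.sum(q * (np.log(q) - np.log(joint))))

def kld(a, b):
    return float(np.sum(a * np.log(a / b)))

q = np.array([0.5, 0.5])                        # an arbitrary recognition model
surprise = -np.log(evidence)
print("F(q)                      =", round(free_energy(q), 4))
print("KLD(q||posterior) + surpr =", round(kld(q, posterior) + surprise, 4))
print("F(posterior)              =", round(free_energy(posterior), 4),
      "= surprise =", round(surprise, 4))
```

The first two printed numbers coincide, and the free energy bottoms out at the surprise exactly when the recognition model equals the true posterior.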
Another way to
understand the relationship between the Kullback-Leibler divergence and the
free energy principle is to consider how a system acts on its environment. The
system has a prior model that specifies its preferences or goals, and an action
model that specifies how its actions affect its sensory inputs. The action
model is also known as the likelihood function because it reflects how likely
the system is to observe certain outcomes given its actions. The
Kullback-Leibler divergence between the prior model and the action model is a
measure of how well the system achieves its goals by acting on its environment.
The free energy
principle says that the system can reduce its free energy by changing its
actions to make them more consistent with its prior model. This means that the
system implicitly performs optimal control, choosing actions that maximize its
expected reward or utility. By doing so, the system also reduces its surprise,
because it gains greater control over its sensory inputs. Therefore, minimizing free energy is
equivalent to minimizing the Kullback-Leibler divergence between the prior
model and the action model.
Thus, the
relationship between the Kullback-Leibler divergence and the free energy
principle is that both quantify how well a system perceives and acts on its
environment. The Kullback-Leibler divergence measures the difference between two probability distributions, while the free energy is a scalar quantity that exceeds the surprise (another scalar) by exactly such a divergence. Minimizing free energy therefore implies minimizing both kinds of Kullback-Leibler divergence: one for perception and one for action.
Summary
The
Kullback-Leibler divergence is a measure of how different two probability
distributions are. It is often used in machine learning, statistics, and
information theory to compare models, test hypotheses, and quantify
uncertainty. This blog post introduced the concept of the Kullback-Leibler
divergence and its mathematical properties, such as its formula and
normalization. We have seen that the divergence measures the difference between
two probability distributions in terms of their relative entropy. We have also
discussed how to normalize the divergence to obtain a value in a finite
interval. We have also presented some applications and interpretations of the divergence in various domains of information theory. In particular, we have seen how the Kullback-Leibler divergence relates
to Sanov's theorem, which gives a bound on the probability of observing a
certain empirical distribution given a true distribution. We have also
explained how the divergence can be used to define the I-projection, which is
the closest distribution to a given one under some constraints. Moreover, we have examined how the divergence relates to the cross-entropy, which measures the expected number of bits needed to encode data from one distribution using a code based on another distribution, differing from it only by the entropy of the target distribution. Finally, we have
explored how the divergence can be linked to the free-energy principle, which
states that any system that minimizes its free energy also minimizes its
divergence from the true distribution of its environment. I hope that this
blog post has given you a deeper understanding of the Kullback-Leibler
divergence and its role in information theory.