Tuesday, 25 April 2023

The Relationship Between Weighted Geometric Mean and Shannon's Entropy

The geometric mean is a type of average that is useful for comparing quantities that have different units or vary over a wide range. It is defined as the n-th root of the product of n numbers:

$GM = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$

The geometric mean has some interesting properties that make it useful for certain applications. For instance, the geometric mean is always less than or equal to the arithmetic mean (the ordinary average), and it is always greater than or equal to the harmonic mean (another type of average that is useful for rates and ratios). The geometric mean also respects the ratios of the data values: multiplying or dividing all the data values by a constant multiplies or divides their geometric mean by the same constant, so the relative proportions are preserved. In addition, it tends to be closer to the smaller values in the data set, which makes it more robust to large outliers and skewed distributions.
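A minimal R sketch (my own illustration, with made-up numbers) of these properties:

x <- c(2, 8, 32)                     # made-up, widely spread values
am <- mean(x)                        # arithmetic mean: 14
gm <- prod(x)^(1 / length(x))        # geometric mean: 8
hm <- length(x) / sum(1 / x)         # harmonic mean: about 4.57
c(hm = hm, gm = gm, am = am)         # hm <= gm <= am
prod(10 * x)^(1 / length(x)) / gm    # rescaling the data rescales the geometric mean: 10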
However, the geometric mean has a limitation: it assumes that all the values in the data set have equal importance or probability. This may not be true in some cases, such as when the data represents different scenarios with different likelihoods of occurrence. In such situations, a better way to calculate the mean is to use a probability-weighted geometric mean (PWGM), which we can define as follows:

$Z = \prod_{i=1}^{n} x_i^{p_i}$


This quantity is simply the weighted geometric mean of the observations $x_1, \ldots, x_n$ with weights equal to their probabilities $p_1, \ldots, p_n$. This means that Z is a measure of the average value of the random variable that takes into account how likely each observation is to occur. Z will be closer to the observation that has the highest probability and farther from the observation that has the lowest probability.

Let T be the sum of the n observed values, $T = \sum_{i=1}^{n} x_i$, and take the weights to be the empirical probabilities $p_i = x_i / T$. Since for every i it holds that

$x_i^{p_i} = \exp(p_i \log(x_i))$

then we have that

$\log(Z) = \sum_{i=1}^{n} p_i \log(x_i) = \sum_{i=1}^{n} p_i \log\left(\frac{x_i}{p_i}\right) + \sum_{i=1}^{n} p_i \log(p_i)$

The second term in the last equality is the negative of Shannon's entropy, $\sum_{i=1}^{n} p_i \log(p_i) = -H(x)$, while for every i the ratio $x_i / p_i$ equals T and the probabilities sum to one, so the first term reduces to $\log(T)$. We can therefore rewrite the expression in a simplified form

$\log(Z) = \log(T) - H(x)$

whereby we finally get:

$Z = T \cdot \exp(-H(x))$

where the term T represents the scale of the observations, and exp(-H(x)) represents the degree of concentration or dispersion of the distribution. This finding shows that the weighted geometric mean of the observations is proportional to the total sum of the observations, but also inversely proportional to the exponential of Shannon's entropy of the distribution. This means that the more uniform the distribution is, the higher the entropy and the lower the weighted geometric mean. Conversely, the more concentrated the distribution is, the lower the entropy and the higher the weighted geometric mean.
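The identity can be checked numerically. A minimal R sketch (my own illustration, with made-up observations), assuming the weights are the empirical probabilities p_i = x_i / T:

x <- c(2, 5, 13, 30)                  # made-up observations
T_total <- sum(x)                     # T, the scale of the observations
p <- x / T_total                      # empirical probabilities used as weights
Z <- prod(x^p)                        # probability-weighted geometric mean
H <- -sum(p * log(p))                 # Shannon entropy of the weights
c(Z = Z, check = T_total * exp(-H))   # the two numbers coincide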

The bottom line is that the geometric mean and Shannon's entropy are two related concepts that can help us understand the characteristics of a random variable, and by using a simple formula we can see how they are linked through the scale (T) and the degree of concentration (entropy) of its distribution.

Sunday, 16 April 2023

R function Kullback_Leibler

The Kullback_Leibler() function calculates the Kullback-Leibler divergence between two probability distributions p and q. The function takes two arguments, p and q, which can be either discrete probability vectors or density values of continuous distributions evaluated on a common grid. The type of distribution is specified using the type argument.

The function first checks the type of the distributions and their lengths. If the distributions are discrete, it checks that they have non-negative values and sum to one. If the distributions are continuous, it checks that they have positive values and rescales them so that they sum to one, treating the density values on the grid as a discrete approximation.

The Kullback-Leibler divergence is then calculated using the formula sum(p * log(p / q)), with the convention that terms where p equals zero contribute nothing. The function also calculates a normalized version of the divergence by dividing it by the entropy of p. The asymmetry of the divergence is illustrated by repeating the calculation with p and q swapped.

Finally, the function returns a list of results that includes the Kullback-Leibler divergence between p and q, its normalized version, and the divergence between q and p together with its normalized version.



Kullback_Leibler <- function(p, q, type = c("discrete", "continuous")) {
  # p, q: two distributions of the same size to be compared
  # Check the type of distributions
  type <- match.arg(type)
  # Check the length of distributions
  if (length(p) != length(q)) {
    stop("The distributions must have the same length")
  }
  # Check the validity of distributions
  if (type == "discrete") {
    if (any(p < 0) || any(q < 0)) {
      stop("The discrete distributions must have non-negative values")
    }
    if (abs(sum(p) - 1) > 1e-6 || abs(sum(q) - 1) > 1e-6) {
      stop("The discrete distributions must sum to one")
    }
  } else {
    if (any(p <= 0) || any(q <= 0)) {
      stop("The continuous distributions must have positive values")
    }
    # Density values evaluated on an (assumed equally spaced) grid are rescaled
    # so that they sum to one and can be treated as a discrete approximation
    p <- p / sum(p)
    q <- q / sum(q)
  }
  # Calculate the Kullback-Leibler divergence (terms with zero probability contribute nothing)
  kl <- sum(ifelse(p == 0, 0, p * log(p / q)))
  # Normalize the divergence by dividing by the entropy of p
  kl_norm <- kl / (-sum(ifelse(p == 0, 0, p * log(p))))
  # Check the asymmetry of the divergence by swapping p and q
  kl_swap <- sum(ifelse(q == 0, 0, q * log(q / p)))
  kl_swap_norm <- kl_swap / (-sum(ifelse(q == 0, 0, q * log(q))))
  # Return a list of results
  return(list(kld.pq = kl, kld.pq.norm = kl_norm,
              kld.qp = kl_swap, kld.qp.norm = kl_swap_norm))
}


#EXAMPLE 1: discrete distributions
p <- c(0.1, 0.2, 0.3, 0.4)
q <- c(0.2, 0.3, 0.2, 0.3)
Kullback_Leibler(p, q, type = "discrete")

#EXAMPLE 2: continuous distributions
x <- seq(-5, 5, length.out = 100)
p <- dnorm(x, mean = -1, sd = 1)   # density values of N(-1, 1) on a grid
q <- dnorm(x, mean = 1, sd = 1)    # density values of N(1, 1) on the same grid
Kullback_Leibler(p, q, type = "continuous")



In EXAMPLE 1, we have two discrete distributions p and q with probabilities 0.1, 0.2, 0.3, 0.4 and 0.2, 0.3, 0.2, 0.3, respectively.

When we call the Kullback_Leibler() function with these distributions and the argument type="discrete", we get the following results:

$kld.pq
[1] 0.08630462

$kld.pq.norm
[1] 0.06743316

$kld.qp
[1] 0.09287133

$kld.qp.norm
[1] 0.06797989

The first result $kld.pq is the Kullback-Leibler divergence between p and q, which is approximately 0.0863. The second result $kld.pq.norm is its normalized version (the divergence divided by the entropy of p), approximately 0.0674.
The third result $kld.qp is the Kullback-Leibler divergence between q and p, approximately 0.0929. The fourth result $kld.qp.norm is its normalized version, approximately 0.0680.


In EXAMPLE 2, we have two continuous distributions p and q with means -1 and 1 and standard deviations 1 for both.

When we call the Kullback_Leibler() function with these density values and the argument type="continuous", the function rescales them so that they sum to one and returns the two directed divergences together with their normalized versions.

Because the two normal densities share the same standard deviation and differ only in location, the analytic Kullback-Leibler divergence is the same in both directions, (μ1 − μ2)² / (2σ²) = 4/2 = 2 nats, and the grid approximation returned by the function comes out very close to this value for both $kld.pq and $kld.qp. The two normalized versions are likewise equal to each other.

These findings from EXAMPLE 2 show that we get equal Kullback-Leibler divergences in the two directions when the distributions are mirror images of each other, as with two normal distributions that share the same standard deviation: swapping the order of the distributions in the call to Kullback_Leibler() returns the same results.



The Kullback-Leibler Divergence: a bridge between Probability Theory and Information Theory

1. Introduction

Evaluating the difference between two probability distributions is an important task in many fields of statistics, machine learning, and data science. For example, we may want to compare the performance of different models, test hypotheses about the underlying data-generating process, or measure the divergence or similarity between two populations or groups.

There are different ways to evaluate the difference between two probability distributions, depending on the type and nature of the data, the assumptions we make about the distributions, and the goal of the analysis. Some common methods are:

  • The Kolmogorov-Smirnov test: This is a non-parametric test that compares the empirical cumulative distribution functions (CDFs) of two samples and measures the maximum vertical distance between them. The test can be used to check if two samples come from the same distribution or if one distribution is stochastically larger than another.
  • The Kullback-Leibler divergence: This is a measure of how much information is lost when one distribution is used to approximate another. It is also known as relative entropy or information divergence. The Kullback-Leibler divergence is not symmetric and does not satisfy the triangle inequality, so it is not a true distance metric. However, it can be used to compare the fit of different models or to quantify the uncertainty or surprise in a data set.
  • The Wasserstein distance: This is a measure of how much mass needs to be moved to transform one distribution into another. It is also known as the earth mover's distance or the optimal transport distance. The Wasserstein distance is symmetric and satisfies the triangle inequality, so it is a true distance metric. It can be used to compare distributions that have different shapes or supports or to incorporate prior knowledge or constraints into the comparison.

 

In this blog post, we will look at the Kullback-Leibler divergence, which is a widely used concept in probability theory and information theory and has many applications in data analysis, machine learning, and statistics.


1.1 What is Kullback-Leibler divergence?

 

The Kullback-Leibler divergence is a way of quantifying the difference between two probability distributions p and q over the same variable x. It is defined as the expected logarithmic difference between p and q, where p is the true distribution and q is the approximate distribution:

$KLD(p \| q) = E_p[\log(p(x)/q(x))] = \sum_{x} p(x) \log(p(x)/q(x))$


This formula means that we sum up the products of p(x) and the logarithm of the ratio p(x)/q(x) for all possible values of x. The logarithm can be taken in any base, but usually base 2 (giving bits) or the natural base (giving nats) is used.

More explicitly, the Kullback-Leibler divergence measures how much information is lost when we use q to approximate p, so it tells us how many extra bits we need to encode samples from p using a code based on q, instead of using a code based on p. The smaller the Kullback-Leibler divergence, the more similar the two distributions are.

For example, suppose we have two distributions p and q over the letters A, B, C, and D, as shown in the table below:


 


   
Letter    p        q
A         0.5      0.25
B         0.25     0.125
C         0.125    0.5
D         0.125    0.125

The Kullback-Leibler divergence from p to q, using base-2 logarithms, is:

 

KLD(p||q) = 0.5 log(0.5/0.25) + 0.25 log(0.25/0.125) + 0.125 log(0.125/0.5) + 0.125 log(0.125/0.125)

           = 0.5 + 0.25 - 0.25 + 0

           = 0.5

 

This result tells us that we need an extra half bit per letter, on average, to encode samples from p using a code based on q.

 

The Kullback-Leibler divergence from q to p is:

 

KLD(q||p) = 0.25 log(0.25/0.5) + 0.125 log(0.125/0.25) + 0.5 log(0.5/0.125) + 0.125 log(0.125/0.125)

           = -0.25 - 0.125 + 1 + 0

           = 0.625

 

So we must add 0.625 extra bits per letter when encoding samples from q using a code based on p.

Notice that the Kullback-Leibler divergence is not symmetric: KLD(p||q) is not equal to KLD(q||p). This is because it depends on which distribution we use as the reference and which one is the approximation.

 

Also, notice that the Kullback-Leibler divergence is always non-negative: KLD(p||q) >= 0 for any p and q, and it is equal to zero if and only if p and q are identical.
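The worked example above can be reproduced in R (a small sketch of my own, using base-2 logarithms so that the results are in bits):

p <- c(A = 0.5, B = 0.25, C = 0.125, D = 0.125)
q <- c(A = 0.25, B = 0.125, C = 0.5, D = 0.125)
sum(p * log2(p / q))   # 0.5   extra bits per letter when q is used to encode p
sum(q * log2(q / p))   # 0.625 extra bits per letter when p is used to encode q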


1.2 Why is Kullback-Leibler divergence useful?


The Kullback-Leibler divergence has many applications in various fields of science and engineering, such as:


Data compression: The Kullback-Leibler divergence can be used to measure how well a compression algorithm preserves the information in the original data.

Model selection: The Kullback-Leibler divergence can be used to compare how well different models fit the observed data and choose the best one according to some criterion.

Information retrieval: The Kullback-Leibler divergence can be used to measure how relevant a document is to a query, by comparing their probability distributions over words or topics.

Machine learning: The Kullback-Leibler divergence can be used to measure how well a learning algorithm minimizes the error between its predictions and the true labels, or how well it maximizes the likelihood of the data.

Statistical inference: The Kullback-Leibler divergence can be used to measure how much evidence a hypothesis has against another hypothesis, by comparing their probability distributions over outcomes.


1.3 How to compute Kullback-Leibler divergence?

 

The Kullback-Leibler divergence can be computed directly from the formula given above if we know the probability distributions p and q explicitly.

 

However, in many cases, we only have samples from p and q, and we need to estimate their distributions from these samples.

 

One way to do this is to use frequency counts. This means we count how many times each possible outcome x occurs in the samples from p and q and divide by the total number of samples to get the empirical probabilities pe(x) and qe(x). Then we plug these probabilities into the formula for the Kullback-Leibler divergence and get an estimate of how different p and q are.

 

However, this method has some drawbacks. First, it requires that we know all the possible outcomes x in advance, which may not be feasible for complex or continuous distributions. Second, it may be inaccurate if the samples are small or sparse, as some outcomes may have zero or very low-frequency counts, leading to undefined or unreliable logarithms. Third, it may be biased if the samples are not representative of the true distributions, as they may overestimate or underestimate some probabilities.
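The frequency-count approach can be sketched in R as follows (my own illustration; the outcome set, the sample sizes, and the add-one pseudo-count used to sidestep zero frequencies are assumptions of the example):

set.seed(42)
outcomes <- c("A", "B", "C", "D")
sample_p <- sample(outcomes, 1000, replace = TRUE, prob = c(0.5, 0.25, 0.125, 0.125))
sample_q <- sample(outcomes, 1000, replace = TRUE, prob = c(0.25, 0.125, 0.5, 0.125))
# Empirical probabilities with a small pseudo-count, so that no outcome has a
# zero frequency and the logarithms stay finite
pe <- (table(factor(sample_p, levels = outcomes)) + 1) / (1000 + length(outcomes))
qe <- (table(factor(sample_q, levels = outcomes)) + 1) / (1000 + length(outcomes))
sum(pe * log(pe / qe))   # estimate of KLD(p || q) from the samples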


1.4 What are the benefits of Kullback-Leibler divergence normalization?

 

The Kullback-Leibler divergence has some drawbacks, such as being asymmetric, unbounded, and undefined when one of the distributions has zero probability for some events. To overcome these limitations, some researchers have proposed to normalize the Kullback-Leibler divergence by dividing it by a suitable constant or function.


First, normalization can make the Kullback-Leibler divergence symmetric, meaning that the divergence from p to q is equal to the divergence from q to p. This property is desirable for many applications where the direction of comparison is irrelevant or arbitrary, such as clustering, classification, or information retrieval. Symmetry also brings the normalized Kullback-Leibler divergence closer to a distance metric satisfying the triangle inequality and the identity of indiscernibles.

 

Second, normalization makes the Kullback-Leibler divergence bounded in a finite range of values. This property is valuable for comparing different pairs of distributions or aggregating multiple divergences into a single score. Boundedness also enables us to interpret the normalized Kullback-Leibler divergence as a measure of similarity or dissimilarity that reflects how much information is lost or gained when approximating one distribution by another.

 

Third, normalization makes the Kullback-Leibler divergence defined for all pairs of distributions, even when they have disjoint supports or zero probabilities. This property is essential for dealing with sparse or incomplete data, where some events may not be observed or estimated. Normalization also allows us to handle cases where the support of one distribution is a proper subset of the support of another, which would otherwise result in an infinite or undefined divergence.

 

In conclusion, normalization of the Kullback-Leibler divergence has several advantages over the original Kullback-Leibler divergence, such as symmetry, boundedness, and definedness. These advantages make the normalized Kullback-Leibler divergence more suitable and robust for various tasks that involve comparing or contrasting probability distributions.
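The post does not fix one particular normalization; as one concrete illustration (my own sketch, not the Kullback_Leibler() function above), the Jensen-Shannon divergence compares each distribution with their average and, after division by log(2), is symmetric, bounded in [0, 1], and finite even when p or q contain zeros:

jensen_shannon <- function(p, q) {
  m <- (p + q) / 2                                          # mixture of the two distributions
  kld <- function(a, b) sum(ifelse(a == 0, 0, a * log(a / b)))
  0.5 * kld(p, m) + 0.5 * kld(q, m)
}
p <- c(0.5, 0.5, 0.0)
q <- c(0.0, 0.5, 0.5)
jensen_shannon(p, q) / log(2)   # 0.5, and identical if p and q are swapped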


2. An information-theoretic perspective

 

One way to quantify the uncertainty reduction in Bayesian inference is to use the Kullback-Leibler divergence, also known as the relative entropy, between two probability distributions. Under this perspective, the Kullback-Leibler divergence measures how much information is lost when one distribution is used to approximate another. Specifically, in Bayesian inference, the Kullback-Leibler divergence can be used as a measure of the information gain in moving from a prior distribution, p(θ), to a posterior distribution, p(θ|y), where y is the vector of observed data. The Kullback-Leibler divergence from the prior to the posterior is given by:


$KLD(p(\theta|y) \,\|\, p(\theta)) = \int p(\theta|y) \cdot \log\left( p(\theta|y) / p(\theta) \right) d\theta$

This quantity can be interpreted as the expected logarithmic ratio of the posterior density to the prior density, where the expectation is taken with respect to the posterior distribution. A higher Kullback-Leibler divergence indicates a larger discrepancy between the prior and the posterior, or a greater amount of information gained from the data. A lower Kullback-Leibler divergence indicates a smaller discrepancy between the prior and the posterior, or a lesser amount of information gained from the data.
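As a small numerical illustration (my own sketch, not taken from the post), the integral can be approximated on a grid for a Beta-Binomial model; the Beta(1, 1) prior and the data, 7 successes out of 10 trials, are assumptions chosen for the example:

theta <- seq(0.001, 0.999, length.out = 1000)
prior <- dbeta(theta, 1, 1)                # flat prior
posterior <- dbeta(theta, 1 + 7, 1 + 3)    # posterior after 7 successes in 10 trials
dtheta <- theta[2] - theta[1]
# KLD(posterior || prior): information gained from the data (about 0.6 nats here)
sum(posterior * log(posterior / prior)) * dtheta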

The Kullback-Leibler divergence has some desirable properties for Bayesian inference. For example, it is always non-negative and equals zero if and only if the prior and the posterior are identical. It is also invariant under one-to-one transformations of the parameter space. Moreover, choosing the prior so as to maximize the expected Kullback-Leibler divergence from the prior to the posterior, averaged over data generated by the model, leads to the concept of reference priors: priors designed so that the data contribute as much information as possible. Reference priors are often considered objective or non-informative priors in Bayesian inference.

 

Another application of the Kullback-Leibler divergence in Bayesian inference is to construct credible intervals for the parameters based on the lowest posterior loss principle. The idea is to find an interval that minimizes the expected loss under a given loss function. One possible loss function is based on the Kullback-Leibler divergence between a point estimate and the true parameter value. For example, if we use a point estimate $\hat{\theta}$ that maximizes the posterior density (the mode), then we can define a loss function as:

$L(\hat{\theta}, \theta) = KLD(\delta_{\hat{\theta}} \,\|\, p(\theta|y)) = -\log(p(\hat{\theta}|y))$

where $\delta_{\hat{\theta}}$ is a Dirac delta function centered at $\hat{\theta}$. The expected loss under this loss function is given by:

$E[L(\hat{\theta}, \theta)] = -\int p(\theta|y) \log(p(\theta|y))\, d\theta$

This quantity can be minimized by finding an interval that contains most of the posterior mass around the mode. Such an interval is called a lowest posterior loss interval or a highest posterior density interval. It can be shown that this interval has a minimum length among all intervals with a given posterior probability.
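A grid-based sketch (my own illustration) of a highest posterior density interval, reusing the Beta(8, 4) posterior from the example above:

theta <- seq(0.001, 0.999, length.out = 5000)
post <- dbeta(theta, 8, 4)
post <- post / sum(post)                  # normalise the grid approximation
ord <- order(post, decreasing = TRUE)     # visit the highest-density points first
keep <- ord[cumsum(post[ord]) <= 0.95]    # smallest set holding about 95% of the mass
c(lower = min(theta[keep]), upper = max(theta[keep]))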

Therefore, in terms of Bayesian inference, the Kullback-Leibler divergence can be used as a measure of the information gain in moving from a prior distribution, p(θ), to a posterior distribution, p(θ|y). As such, the Kullback-Leibler divergence is the basis of reference priors and lowest posterior loss intervals. Reference priors are priors that maximize the expected Kullback-Leibler divergence between the prior and the posterior, thus ensuring that the data are as informative as possible. The lowest posterior loss intervals are intervals that minimize the expected Kullback-Leibler divergence between the posterior and a point estimate, thus providing optimal summaries of the posterior distribution. Both reference priors and lowest posterior loss intervals are useful tools for Bayesian inference and decision-making under uncertainty.


2.1 Relationship between Kullback-Leibler divergence and the theorem of Sanov

 

We have shown that Kullback-Leibler divergence is a measure of the difference between two probability distributions p(x) and q(x) over the same variable x. It quantifies the information loss when q(x) is used to approximate p(x).

The theorem of Sanov gives a bound on the probability of observing an atypical sequence of samples from a given probability distribution q(x). It states that if we draw n i.i.d. samples from q(x), represented by the vector

$X^n = (X_1, X_2, \ldots, X_n)$, then the probability that the empirical measure of the samples falls within a set A of probability distributions over x is bounded by:

$P_q\big(\hat{p}_{X^n} \in A\big) \le (n+1)^{|\mathcal{X}|} \cdot \exp\big(-n \cdot KLD(p^*(x) \,\|\, q(x))\big)$

where $|\mathcal{X}|$ is the number of possible values of x and p*(x) is the information projection of q(x) onto A, that is, the distribution in A that minimizes the Kullback-Leibler divergence KLD(p || q(x)) over p in A.

The theorem of Sanov can be seen as a large deviation result for the empirical measure of a sequence of i.i.d. random variables. It identifies the rate function for large deviations as the Kullback-Leibler divergence KLD(p* || q) of the dominant atypical distribution from the true one. It also shows that there is a dominant atypical distribution, given by the information projection, that determines the asymptotic behaviour of the probability.
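A Monte Carlo sketch (my own illustration, not from the post) of this exponential decay for a fair coin, q = (0.5, 0.5), and the atypical set A = {distributions with P(heads) >= 0.7}, whose information projection is p* = (0.7, 0.3):

set.seed(1)
p_star <- c(0.7, 0.3)
q <- c(0.5, 0.5)
kld_rate <- sum(p_star * log(p_star / q))        # about 0.082 nats
for (n in c(20, 40, 80)) {
  phat <- rbinom(1e5, size = n, prob = 0.5) / n  # simulated empirical frequencies of heads
  cat(sprintf("n = %2d  P(phat >= 0.7) = %.1e   exp(-n * KLD) = %.1e\n",
              n, mean(phat >= 0.7), exp(-n * kld_rate)))
}

Both columns shrink at an exponential rate governed by the divergence, which is the content of the bound above.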

 

The relationship between Kullback-Leibler divergence and Sanov's theorem can be summarized as follows: Kullback-Leibler divergence measures how different two distributions are, while Sanov's theorem bounds how likely it is to observe a sequence of samples that follows a different distribution from the true one.


2.2 Information projection

In the introduction section, the Kullback-Leibler divergence was presented as one way to measure the similarity between two probability distributions. However, sometimes we are interested in finding the closest distribution to a given one within a certain set of distributions. This problem can be formulated as the information projection, or I-projection, of a probability distribution q onto a set of distributions P. The I-projection of q onto P is defined as the distribution p* in P that minimizes the Kullback-Leibler divergence to q, i.e., $p^* = \operatorname{argmin}_{p \in P} KLD(p \,\|\, q)$.

Thus, while the Kullback-Leibler divergence measures the amount of information lost when using one distribution to approximate another, the I-projection of q onto P is the distribution in P that minimizes this divergence and thus preserves the most information about q. The I-projection has many applications in statistics, machine learning, and information theory. For example, it can be used to find the maximum entropy distribution that satisfies some constraints or to perform variational inference in Bayesian models.
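A small R sketch (my own illustration) of an I-projection under a mean constraint: projecting a fair die q onto the convex set P = {p : E_p[X] = 4.5}. For such a linear constraint the minimizer is an exponential tilt of q, with the tilt chosen to match the required mean:

x <- 1:6
q <- rep(1 / 6, 6)
tilted_mean <- function(t) {            # mean of the die after tilting by exp(t * x)
  w <- q * exp(t * x)
  sum(x * w / sum(w))
}
t_star <- uniroot(function(t) tilted_mean(t) - 4.5, c(-5, 5))$root
p_star <- q * exp(t_star * x)
p_star <- p_star / sum(p_star)          # the I-projection of q onto P
round(p_star, 3)
sum(p_star * log(p_star / q))           # the minimal divergence KLD(p* || q)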

Noticeably, when the set P is convex, KLD(p || q) is never lower than KLD(p || p*) + KLD(p* || q) for any p in P; this is the Pythagorean inequality for information projections.

Let's remember that a set of distributions P is convex if it satisfies the following condition: for any two distributions p1 and p2 in P and any λ in [0, 1], the mixture λp1 + (1 − λ)p2 also belongs to P. This means that if we draw a line between any two points of P, the whole segment stays inside P.

Also, remember that the support of a probability distribution can be loosely thought of as the closure of the set of possible values of a random variable having that distribution. It is the smallest closed set whose complement has probability zero.

In more detail, one of the advantages of working with a convex set P is that the I-projection is well behaved: because the Kullback-Leibler divergence is a convex function of its arguments, minimizing KLD(p || q) over a convex set is a convex optimization problem, so any local minimum is a global minimum and the projection p* is essentially unique. Convexity is also what makes Jensen's inequality applicable: for any convex function f, the expected value of f of a random variable is greater than or equal to f of its expected value, E[f(X)] ≥ f(E[X]). Jensen's inequality is the basic tool behind many information-theoretic bounds, including the non-negativity of the Kullback-Leibler divergence itself and the Pythagorean inequality quoted above.

When P is not convex, the Pythagorean inequality need not hold, and the minimization of KLD(p || q) over P may have multiple local minima, so the p* found by a numerical optimization algorithm may not be the global minimizer.


2.3 Kullback-Leibler divergence vs. cross-entropy

 

The Kullback-Leibler divergence and the cross-entropy are two closely related concepts in information theory that measure the difference between two probability distributions. We have interpreted the Kullback-Leibler divergence as the expected amount of extra information needed to encode data from distribution p using a code based on distribution q.

The cross-entropy is defined as:



$H(p, q) = -\sum_{x} p(x) \log(q(x))$

where p and q are again two discrete probability distributions over the same set of events x. The cross-entropy can be explained as the expected number of bits needed to encode data from distribution p using a code based on distribution q. The cross-entropy is always greater than or equal to the entropy of p, which is defined as:

$H(p) = -\sum_{x} p(x) \log(p(x))$

The entropy of p can be interpreted as the minimum number of bits needed to encode data from distribution p using an optimal code. The relationship between the Kullback-Leibler divergence and the cross-entropy can be seen by rearranging the terms in their definitions: 

 KLD (p ∥ q) = H (p, q) − H (p)

This equation shows that the Kullback-Leibler divergence is equal to the difference between the cross-entropy and the entropy of p. In other words, the Kullback-Leibler divergence measures how much more information is needed to encode data from distribution p using a code based on distribution q than using an optimal code based on distribution p itself.
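A quick numerical check of this identity in R (a sketch, reusing the discrete example from earlier in the post):

p <- c(0.1, 0.2, 0.3, 0.4)
q <- c(0.2, 0.3, 0.2, 0.3)
cross_entropy <- -sum(p * log(q))
entropy_p <- -sum(p * log(p))
kld <- sum(p * log(p / q))
all.equal(kld, cross_entropy - entropy_p)   # TRUE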

 

The Kullback-Leibler divergence and the cross-entropy are often used in machine learning to measure the discrepancy between a target distribution p (such as the true labels of data points) and a model distribution q (such as the predictions of a classifier). Minimizing the Kullback-Leibler divergence or the cross-entropy is equivalent to maximizing the likelihood of the data under the model distribution. However, since the entropy of p does not depend on q, minimizing the Kullback-Leibler divergence is equivalent to minimizing the cross-entropy. Therefore, both metrics can be used interchangeably as loss functions in machine learning.


2.4 Kullback-Leibler divergence vs. free-energy

The free energy principle is a mathematical framework that describes how adaptive systems, such as brains, minimize their surprise or uncertainty about their environment by changing their internal states (perception) or external states (action). The free energy principle states that the free energy of a system is an upper bound on its surprise and that minimizing free energy is equivalent to minimizing surprise.

 

One way to understand the relationship between the Kullback-Leibler divergence and the free energy principle is to consider how a system represents the causes of its sensory inputs. The system has a generative model that specifies the probability of sensory inputs given their causes, and a recognition model that specifies the probability of causes given sensory inputs. The recognition model is also known as the posterior distribution because it reflects the system's updated beliefs after observing the data. The Kullback-Leibler divergence between the recognition model and the true posterior distribution is a measure of how well the system infers the causes of its sensory inputs.

 

The free energy principle says that the system can reduce its free energy by changing its recognition model to make it closer to the true posterior distribution. This means that the system implicitly performs Bayesian inference, updating its beliefs in a way that maximizes their accuracy and minimizes their uncertainty. By doing so, the system also reduces its surprise, because it becomes more able to predict its sensory inputs. Therefore, minimizing free energy is equivalent to minimizing the Kullback-Leibler divergence between the recognition model and the true posterior distribution.

 

Another way to understand the relationship between the Kullback-Leibler divergence and the free energy principle is to consider how a system acts on its environment. The system has a prior model that specifies its preferences or goals, and an action model that specifies how its actions affect its sensory inputs. The action model is also known as the likelihood function because it reflects how likely the system is to observe certain outcomes given its actions. The Kullback-Leibler divergence between the prior model and the action model is a measure of how well the system achieves its goals by acting on its environment.

 

The free energy principle says that the system can reduce its free energy by changing its actions to make them more consistent with its prior model. This means that the system implicitly performs optimal control, choosing actions that maximize its expected reward or utility. By doing so, the system also reduces its surprise, because it can control its sensory inputs more. Therefore, minimizing free energy is equivalent to minimizing the Kullback-Leibler divergence between the prior model and the action model.

 

Thus, the relationship between the Kullback-Leibler divergence and the free energy principle is that both quantify how well a system perceives and acts on its environment. The Kullback-Leibler divergence measures the difference between two probability distributions, while the free energy is an upper bound on surprise whose gap from the surprise is exactly such a divergence. Minimizing free energy implies minimizing both types of Kullback-Leibler divergences: one for perception and one for action.


Summary

The Kullback-Leibler divergence is a measure of how different two probability distributions are. It is often used in machine learning, statistics, and information theory to compare models, test hypotheses, and quantify uncertainty. This blog post introduced the concept of the Kullback-Leibler divergence and its mathematical properties, such as its formula and normalization. We have seen that the divergence measures the difference between two probability distributions in terms of their relative entropy, and we have discussed how to normalize the divergence to obtain a value in a finite interval. The post also presented some applications and interpretations of the divergence in information theory. Under this perspective, we have seen how the Kullback-Leibler divergence relates to Sanov's theorem, which bounds the probability of observing a certain empirical distribution given a true distribution. We have also explained how the divergence can be used to define the I-projection, which is the closest distribution to a given one within a constrained set. Moreover, we have examined how the divergence relates to the cross-entropy, which measures the expected number of bits needed to encode data from one distribution using a code based on another distribution: the divergence is the excess of the cross-entropy over the entropy. Finally, we have explored how the divergence is linked to the free-energy principle, which states that a system that minimizes its free energy also minimizes its divergence from the true distribution of its environment. I hope that this blog post has given you a deeper understanding of the Kullback-Leibler divergence and its role in information theory.


Links

https://search.r-project.org/CRAN/refmans/LaplacesDemon/html/KLD.html

https://machinelearningmastery.com/divergence-between-probability-distributions/

https://rpubs.com/DeTwes/KL-Divergence

https://people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture8.pdf

https://link.springer.com/article/10.1007/s13370-019-00702-2

https://arxiv.org/pdf/math/0602091.pdf

https://www.statlect.com/glossary/support-of-a-random-variable

https://statistical.fandom.com/wiki/Support_(Probability_distribution)

https://en.wikipedia.org/wiki/Information_projection



