TG Thushan Ganegedara
Published on April 28, 2018 • ~10 min read

Light on Math Machine Learning: Intuitive Guide to Understanding KL Divergence

KL divergence measures how one probability distribution diverges from another. Think of it as an asymmetric distance between two explanations of the same phenomenon.

Cover image from the original article (credit: thushv.com).

Introduction

In machine learning we often approximate an unknown data-generating process $P$ with a model $Q_\theta$. To measure how well $Q$ approximates $P$, we can use the Kullback–Leibler divergence:

$D_{\mathrm{KL}}(P\,\Vert\,Q)=\sum_x P(x)\,\log\frac{P(x)}{Q(x)}$ (discrete) or $\displaystyle D_{\mathrm{KL}}(P\,\Vert\,Q)=\int P(x)\,\log\frac{P(x)}{Q(x)}\,dx$ (continuous).

KL is zero only when $P\equiv Q$ almost everywhere; otherwise it is positive. It is not symmetric: $D_{\mathrm{KL}}(P\Vert Q)\neq D_{\mathrm{KL}}(Q\Vert P)$ in general, so it is not a true distance metric.
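
As a minimal sketch of the discrete formula above (in Python with NumPy; the article does not fix a language, so this choice is an assumption):

import numpy as np

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats; p and q are arrays of probabilities summing to 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # 0 * log(0) terms contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
print(kl_divergence(p, q))            # positive
print(kl_divergence(p, p))            # 0.0 when the distributions match
print(kl_divergence(q, p))            # a different value: KL is not symmetric

Natural logs give the answer in nats; use np.log2 instead for bits.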

Concept Grounding

What is a Distribution?

A (probability) distribution assigns likelihoods to outcomes. Visualize the $x$-axis as outcomes and the $y$-axis as probabilities (or probability densities, for continuous variables). Continuous distributions form smooth curves; discrete ones form bars.

What is an Event?

An event is a set of outcomes. For discrete spaces an event could be $X=k$; in continuous spaces it’s a range, e.g. $a\le X \le b$.

Back to KL Divergence

Suppose we have a true distribution $P$ (derived from data) and a candidate model $Q$. KL answers: how many extra nats (or bits, with base-2 logarithms) do we need, on average, to encode samples from $P$ using a code optimized for $Q$?

Problem We’re Trying to Solve

Following the classic “space worms” story, we observe the number of teeth per worm. From counts we compute an empirical distribution $P$ over $x\in\{0,1,\dots\}$.

Let’s Tweak the Example

For concreteness, imagine 100 worms with probabilities peaking around 4–6 teeth. We’ll try to mimic this $P$ with simple parametric $Q$’s.
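
A small sketch of how such an empirical $P$ might be built. The tooth counts below are made up to match the description, not the article’s actual data:

import numpy as np

# Hypothetical counts of worms with x = 0..10 teeth (100 worms in total, peaking around 4-6)
counts = np.array([1, 2, 5, 12, 20, 23, 19, 11, 4, 2, 1])
P = counts / counts.sum()             # empirical distribution P(x)
print(P.sum())                        # 1.0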

First try: Uniform Model

If we assumed each tooth count was equally likely on $\{0,1,\dots,K\}$, $Q$ would be uniform with $Q(x)=1/(K+1)$. This may be too crude; KL will quantify the mismatch.
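
Continuing the sketch with the same hypothetical counts, the uniform model and its divergence from $P$ look like this:

import numpy as np

counts = np.array([1, 2, 5, 12, 20, 23, 19, 11, 4, 2, 1])   # hypothetical data from above
P = counts / counts.sum()

K = len(P) - 1                                              # support {0, ..., K}
Q_uniform = np.full(K + 1, 1.0 / (K + 1))                   # Q(x) = 1 / (K + 1)

kl_uniform = np.sum(P * np.log(P / Q_uniform))              # all P(x) > 0 here
print(kl_uniform)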

Second try: Binomial Model

A binomial (or Poisson) distribution is a natural model for count-like phenomena. We choose its parameters to match the mean and variance of $P$.

Breaking Down the Equation

$D_{\mathrm{KL}}(P\,\Vert\,Q)=\sum_x P(x)\,\bigl[\log P(x)-\log Q(x)\bigr]$

The first term, $\sum_x P(x)\log P(x)=-H(P)$, is fixed with respect to $Q$; minimizing KL over $Q$ is therefore equivalent to maximizing the expected log-likelihood under $Q$, i.e. maximum likelihood.
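
Spelled out, the decomposition is

$D_{\mathrm{KL}}(P\,\Vert\,Q)=\sum_x P(x)\log P(x)\;-\;\sum_x P(x)\log Q(x)=-H(P)+H(P,Q),$

where $H(P,Q)=-\sum_x P(x)\log Q(x)$ is the cross-entropy. Since $H(P)$ does not depend on $Q$, ranking models by KL and ranking them by cross-entropy (or expected log-likelihood) give the same answer.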

Mean & Variance

For binomial $\text{Bin}(n,p)$, $\mathbb E[X]=np$ and $\mathrm{Var}(X)=np(1-p)$. Matching these to the data gives reasonable starting values for $(n,p)$.
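
A sketch of this moment-matching step on the hypothetical counts from earlier (method of moments; the article does not prescribe this exact fitting procedure):

import numpy as np
from scipy.stats import binom

counts = np.array([1, 2, 5, 12, 20, 23, 19, 11, 4, 2, 1])   # hypothetical data from above
P = counts / counts.sum()
x = np.arange(len(P))

mean = np.sum(x * P)                       # E[X] under the empirical distribution
var = np.sum((x - mean) ** 2 * P)          # Var(X) under the empirical distribution

# Solve np = mean and np(1-p) = var for (n, p); requires var < mean, as for any binomial
p_hat = 1.0 - var / mean
n_hat = int(round(mean / p_hat))

Q_binom = binom.pmf(x, n_hat, p_hat)       # binomial model evaluated on the observed support
print(n_hat, p_hat)

An alternative, used in some versions of the space-worms example, is to fix $n$ to the maximum possible tooth count and set $p=\text{mean}/n$, which is the maximum-likelihood estimate for fixed $n$.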

Back to Modelling

After fitting, compute $D_{\mathrm{KL}}(P\Vert Q_{\text{binom}})$ and compare to the uniform model. The lower the KL, the more faithful $Q$ is to $P$.
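
Putting the pieces together on the same hypothetical counts (the printed values depend entirely on the made-up data):

import numpy as np
from scipy.stats import binom

counts = np.array([1, 2, 5, 12, 20, 23, 19, 11, 4, 2, 1])   # hypothetical data from above
P = counts / counts.sum()
x = np.arange(len(P))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

Q_uniform = np.full(len(P), 1.0 / len(P))

mean = np.sum(x * P)
var = np.sum((x - mean) ** 2 * P)
p_hat = 1.0 - var / mean
n_hat = int(round(mean / p_hat))
Q_binom = binom.pmf(x, n_hat, p_hat)

print("KL(P || uniform)  =", kl(P, Q_uniform))
print("KL(P || binomial) =", kl(P, Q_binom))    # lower here: the binomial tracks the peak in P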

Summary so far

We have an empirical distribution $P$ of tooth counts and two candidate models: a uniform $Q$ and a binomial $Q$ fitted by matching the data’s moments. KL divergence gives us a single number for how far each candidate is from $P$.

How to Pick the Best Model?

Compare KL across candidate models or, equivalently, compare (regularized) log‑likelihoods / cross‑entropy. Use hold‑out evaluation to avoid overfitting.
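
One way to do that hold-out comparison, sketched on synthetic data (the split, the fixed $n=10$, and the random seed are all assumptions for illustration):

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
data = rng.binomial(10, 0.5, size=100)      # hypothetical raw tooth counts, one per worm
train, test = data[:70], data[70:]

n = 10                                      # assume the maximum possible tooth count is known
p_hat = train.mean() / n                    # maximum-likelihood estimate of p for fixed n

# Held-out cross-entropy: average negative log-likelihood per test sample, in nats
nll_binom = -binom.logpmf(test, n, p_hat).mean()
nll_uniform = np.log(n + 1.0)               # the uniform model assigns 1/11 to every count
print(nll_binom, nll_uniform)               # the lower value indicates the better-generalizing model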

Intuitive Breakdown of KL

If $Q$ puts very small probability where $P$ places mass, the log-ratio $\log \tfrac{P}{Q}$ explodes, heavily penalizing mismatches in important regions. KL is sensitive where it matters.
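
A tiny numeric illustration of that sensitivity (the distributions are made up):

import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P  = np.array([0.80, 0.15, 0.05])
Q1 = np.array([0.70, 0.20, 0.10])   # roughly agrees with P where P is large
Q2 = np.array([0.05, 0.15, 0.80])   # almost no mass where P puts 0.80
print(kl(P, Q1))                    # about 0.03 nats
print(kl(P, Q2))                    # about 2.1 nats, dominated by 0.80 * log(0.80 / 0.05)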

Computing KL Divergence

For discrete $X$: sum over support. For continuous $X$: numeric integration or Monte‑Carlo:

$D_{\mathrm{KL}}(P\,\Vert\,Q)\approx\frac{1}{N}\sum_{i=1}^{N}\bigl[\log P(x_i)-\log Q(x_i)\bigr],\qquad x_i\sim P.$
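
A runnable version of that estimator for two Gaussians, chosen only because their KL has a closed form to check against (this pairing is an illustrative assumption):

import numpy as np
from scipy.stats import norm

P = norm(loc=0.0, scale=1.0)
Q = norm(loc=1.0, scale=1.5)

N = 200_000
x = P.rvs(size=N, random_state=0)                 # samples x_i ~ P
kl_mc = np.mean(P.logpdf(x) - Q.logpdf(x))        # Monte-Carlo estimate of D_KL(P || Q)

# Closed form for Gaussians: log(s_q/s_p) + (s_p^2 + (mu_p - mu_q)^2) / (2 s_q^2) - 1/2
kl_exact = np.log(1.5) + (1.0**2 + (0.0 - 1.0)**2) / (2 * 1.5**2) - 0.5
print(kl_mc, kl_exact)                            # the two values should agree closely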

Fun with KL

KL appears in variational inference, VAEs, policy optimization (RL), and information theory (relative entropy).

Conclusion

KL divergence is a workhorse tool for comparing probability models. With a clear intuition and a recipe for computing it, you can use KL to guide modelling choices.

References

  1. Count Bayesie — “Kullback–Leibler Divergence Explained”, 2017.
  2. Cover & Thomas — Elements of Information Theory.