KL divergence measures how one probability distribution diverges from another. Think of it as an asymmetric, distance-like score between two explanations of the same phenomenon.

Introduction
In machine learning we often approximate an unknown data-generating process $P$ with a model $Q_\\theta$. To quantify how well $Q_\\theta$ (written simply $Q$ below) matches $P$, we can use the Kullback–Leibler divergence:
$D_{\\mathrm{KL}}(P\\,\\Vert\\,Q)=\\sum_x P(x)\\,\\log\\frac{P(x)}{Q(x)}$ (discrete) or $\\displaystyle D_{\\mathrm{KL}}(P\\,\\Vert\\,Q)=\\int P(x)\\,\\log\\frac{P(x)}{Q(x)}\\,dx$ (continuous).
KL is zero only when $P\\equiv Q$ almost everywhere; otherwise it is positive. It is not symmetric.
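As a quick sanity check on the discrete formula, here is a minimal sketch in NumPy (the two example distributions are made up; the helper assumes $Q(x)>0$ wherever $P(x)>0$):

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # outcomes with P(x) = 0 contribute nothing to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([1/3, 1/3, 1/3])
print(kl_divergence(p, q))  # positive, since p and q differ
print(kl_divergence(p, p))  # 0.0, as expected
```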
Concept Grounding
What is a Distribution?
A (probability) distribution assigns likelihoods to outcomes. Visualize the $x$‑axis as outcomes and the $y$‑axis as probability (or probability density). Continuous distributions appear as smooth density curves; discrete ones as bars.
What is an Event?
An event is a set of outcomes. For discrete spaces an event could be $X=k$; in continuous spaces it’s a range, e.g. $a\\le X \\le b$.
Back to KL Divergence
Suppose we have a true distribution $P$ (derived from data) and a candidate model $Q$. KL answers the question: how many extra bits (or nats, if we use natural logarithms) do we need, on average, to encode samples from $P$ with a code optimized for $Q$?
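A made-up numeric example, in bits: take $P=(\\tfrac14,\\tfrac14,\\tfrac14,\\tfrac14)$ and $Q=(\\tfrac12,\\tfrac14,\\tfrac18,\\tfrac18)$. Then
$D_{\\mathrm{KL}}(P\\Vert Q)=\\tfrac14\\left(\\log_2\\tfrac{1/4}{1/2}+\\log_2\\tfrac{1/4}{1/4}+\\log_2\\tfrac{1/4}{1/8}+\\log_2\\tfrac{1/4}{1/8}\\right)=\\tfrac14(-1+0+1+1)=0.25\\ \\text{bits}.$
A code optimized for $Q$ spends $2.25$ bits per symbol on average, versus the $2$ bits ($=H(P)$) that an optimal code for $P$ would need; the $0.25$-bit gap is exactly the KL divergence.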
Problem We’re Trying to Solve
Following the classic “space worms” story, we observe the number of teeth per worm. From counts we compute an empirical distribution $P$ over $x\\in\\{0,1,\\dots\\}$.
Let’s Tweak the Example
For concreteness, imagine 100 worms with probabilities peaking around 4–6 teeth. We’ll try to mimic this $P$ with simple parametric $Q$’s.
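A minimal sketch of that setup; the counts below are invented purely for illustration (chosen to sum to 100 and peak around 4–6 teeth):

```python
import numpy as np

# Hypothetical tooth counts for 100 worms, indexed by number of teeth 0..10
counts = np.array([2, 3, 5, 10, 18, 22, 18, 10, 6, 4, 2])  # sums to 100
p_true = counts / counts.sum()  # empirical distribution P
print(p_true)
```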
First try: Uniform Model
If we assume every tooth count from $0$ to $K$ is equally likely, $Q(x)=1/(K+1)$ is uniform. This is probably too crude; KL will quantify the mismatch.
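Continuing the invented counts above, here is a uniform $Q$ on $\\{0,\\dots,10\\}$ (so $K=10$ is an assumption, not something the data dictates) and its divergence from $P$:

```python
import numpy as np

counts = np.array([2, 3, 5, 10, 18, 22, 18, 10, 6, 4, 2])  # hypothetical data
p_true = counts / counts.sum()

q_uniform = np.full(11, 1 / 11)  # Q(x) = 1/(K+1) for x in {0, ..., 10}
kl_uniform = np.sum(p_true * np.log(p_true / q_uniform))
print(f"D_KL(P || Q_uniform) = {kl_uniform:.4f} nats")
```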
Second try: Binomial Model
A binomial (or a Poisson) is a natural model for count data. We choose its parameters to match the mean and variance of $P$.
Breaking Down the Equation
$D_{\\mathrm{KL}}(P\\Vert Q)=\\sum_x P(x)\\,\\bigl[\\log P(x)-\\log Q(x)\\bigr]$
The first term depends only on $P$ (it is the negative entropy of $P$) and is therefore fixed; minimizing KL is equivalent to maximizing the expected log‑likelihood under $Q$, i.e. maximum likelihood estimation.
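Writing the two pieces out makes the equivalence explicit:
$D_{\\mathrm{KL}}(P\\Vert Q)=\\underbrace{\\sum_x P(x)\\log P(x)}_{-H(P),\\ \\text{independent of }Q}\\;-\\;\\underbrace{\\sum_x P(x)\\log Q(x)}_{\\text{expected log-likelihood of }Q\\text{ under }P}$
Only the second term changes as we change $Q$, so pushing KL down is the same as pushing the expected log-likelihood up.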
Mean & Variance
For binomial $\\text{Bin}(n,p)$, $\\mathbb E[X]=np$ and $\\mathrm{Var}(X)=np(1-p)$. Matching these to the data gives reasonable starting values for $(n,p)$.
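For example, if $n$ is fixed in advance (say at the largest possible tooth count, which is an assumption rather than something the data tells us), matching the sample mean $\\bar x$ gives $\\hat p=\\bar x/n$. A quick sanity check is to compare the implied variance $n\\hat p(1-\\hat p)$ with the sample variance $s^2$; a large gap suggests the binomial shape is too rigid for the data.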
Back to Modelling
After fitting, compute $D_{\\mathrm{KL}}(P\\Vert Q_{\\text{binom}})$ and compare to the uniform model. The lower the KL, the more faithful $Q$ is to $P$.
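A sketch of that comparison, reusing the invented worm counts and the moment-matched binomial (with $n=10$ assumed):

```python
import numpy as np
from scipy.stats import binom

counts = np.array([2, 3, 5, 10, 18, 22, 18, 10, 6, 4, 2])  # hypothetical data
x = np.arange(11)
p_true = counts / counts.sum()

def kl(p, q):
    """Discrete D_KL(P || Q) in nats."""
    return np.sum(p * np.log(p / q))

# Candidate 1: uniform over {0, ..., 10}
q_uniform = np.full(11, 1 / 11)

# Candidate 2: binomial via moment matching, n fixed at 10, p_hat = mean / n
p_hat = np.sum(x * p_true) / 10
q_binom = binom.pmf(x, n=10, p=p_hat)

print(f"D_KL(P || uniform)  = {kl(p_true, q_uniform):.4f} nats")
print(f"D_KL(P || binomial) = {kl(p_true, q_binom):.4f} nats")
```

Whichever candidate yields the smaller number is the closer fit to $P$ in the KL sense.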
Summary so far
- KL measures inefficiency of using $Q$ when the truth is $P$.
- Minimizing KL is equivalent to maximizing expected log‑likelihood.
- It’s asymmetric: $D_{\\mathrm{KL}}(P\\Vert Q) \\ne D_{\\mathrm{KL}}(Q\\Vert P)$.
How to Pick the Best Model?
Compare KL across candidate models or, equivalently, compare (regularized) log‑likelihoods / cross‑entropy. Use hold‑out evaluation to avoid overfitting.
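One way to act on that advice is to score each candidate by its average log-likelihood on data it was not fitted to. A rough sketch, with simulated counts standing in for real observations:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

# Stand-in data; in practice, use your actual observations
data = rng.binomial(n=10, p=0.5, size=200)
train, test = data[:150], data[150:]

# Fit the binomial on the training split only (n = 10 assumed known)
p_hat = train.mean() / 10

# Held-out average log-likelihood for each candidate model
ll_binom = binom.logpmf(test, n=10, p=p_hat).mean()
ll_uniform = np.log(1 / 11)  # uniform model gives every outcome the same probability
print(f"held-out avg log-lik: binomial {ll_binom:.3f}, uniform {ll_uniform:.3f}")
```

The model with the higher held-out average log-likelihood has the lower estimated cross-entropy, and hence the lower estimated KL to the data-generating distribution.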
Intuitive Breakdown of KL
If $Q$ puts very small probability where $P$ places mass, the log‑ratio $\\log \\tfrac{P}{Q}$ explodes, heavily penalizing mismatches in important regions. KL is sensitive where it matters.
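A tiny numeric illustration of both the penalty and the asymmetry (numbers made up):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.5])
q_ok = np.array([0.6, 0.4])       # roughly right
q_bad = np.array([0.999, 0.001])  # nearly ignores an outcome P cares about

print(kl(p, q_ok))                 # small (~0.02 nats)
print(kl(p, q_bad))                # large (~2.8 nats): log(0.5/0.001) dominates
print(kl(p, q_bad), kl(q_bad, p))  # the two orders give very different values
```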
Computing KL Divergence
For discrete $X$: sum over support. For continuous $X$: numeric integration or Monte‑Carlo:
With samples $x_1,\\dots,x_N\\sim P$:
$D_{\\mathrm{KL}}(P\\Vert Q)\\approx\\frac{1}{N}\\sum_{i=1}^{N}\\bigl[\\log P(x_i)-\\log Q(x_i)\\bigr]$
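A minimal sketch of that estimator for continuous distributions, using two Gaussians (chosen only because the closed-form KL is available as a cross-check):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# P = N(0, 1), Q = N(0.5, 1.2^2): sample from P and average the log-ratio
mu_p, sd_p = 0.0, 1.0
mu_q, sd_q = 0.5, 1.2

x = rng.normal(mu_p, sd_p, size=100_000)
kl_mc = np.mean(norm.logpdf(x, mu_p, sd_p) - norm.logpdf(x, mu_q, sd_q))

# Closed form for KL between two Gaussians, for comparison
kl_exact = np.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2) - 0.5

print(f"Monte-Carlo: {kl_mc:.4f}   exact: {kl_exact:.4f}")
```

The same recipe works whenever we can sample from $P$ and evaluate both log-densities; the estimate converges at the usual $1/\\sqrt{N}$ Monte-Carlo rate (assuming the log-ratio has finite variance).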
Fun with KL
KL appears in variational inference, VAEs, policy optimization (RL), and information theory (relative entropy).
Conclusion
KL divergence is a workhorse tool for comparing probability models (it is not a true metric, since it is asymmetric). With a clear intuition and a recipe for computing it, you can use KL to guide modelling choices.
References
- Count Bayesie — “Kullback–Leibler Divergence Explained”, 2017.
- Cover & Thomas — Elements of Information Theory.