Conditional entropy

Individual (H(X),H(Y)), joint (H(X,Y)), and conditional entropies for a pair of correlated subsystems X,Y with mutual information I(X; Y).

In information theory, the conditional entropy (or equivocation) quantifies the remaining entropy (i.e. uncertainty) of a random variable Y given that the value of another random variable X is known. It is referred to as the entropy of Y conditional on X, and is written H(Y | X). Like other entropies, the conditional entropy is measured in bits, nats, or bans.



More precisely, if H(Y | X = x) is the entropy of the variable Y conditional on the variable X taking a certain value x, then H(Y | X) is the result of averaging H(Y | X = x) over all possible values x that X may take.

Given discrete random variable X with support \mathcal X and Y with support \mathcal Y, the conditional entropy of Y given X is defined as:

H(Y|X)\ &\equiv \sum_{x\in\mathcal X}\,p(x)\,H(Y|X=x)\\
&{=}\sum_{x\in\mathcal X}p(x)\sum_{y\in\mathcal Y}\,p(y|x)\,\log\, \frac{1}{p(y|x)}\\
&=-\sum_{x\in\mathcal X}\sum_{y\in\mathcal Y}\,p(x,y)\,\log\,p(y|x)\\
&=-\sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log\,p(y|x)\\
&=-\sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log \frac {p(x,y)} {p(x)} \\
&= \sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log \frac {p(x)} {p(x,y)}.

Note: The supports of X and Y can be replaced by their domains if it is understood that 0log0 should be treated as being equal to zero.

Chain rule

From this definition and the definition of conditional probability, the chain rule for conditional entropy is

H(Y|X)\,=\,H(X,Y)-H(X) \, .

This is true because

H(Y|X)=&\sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log \frac {p(x)} {p(x,y)}\\
 =&-\sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log\,p(x,y) + \sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log\,p(x) \\
=& H(X,Y) + \sum_{x \in \mathcal X} p(x)\log\,p(x) \\
=& H(X,Y) - H(X).


Intuitively, the combined system contains H(X,Y) bits of information: we need H(X,Y) bits of information to reconstruct its exact state. If we learn the value of X, we have gained H(X) bits of information, and the system has H(Y | X) bits of uncertainty remaining.

H(Y | X) = 0 if and only if the value of Y is completely determined by the value of X. Conversely, H(Y | X) = H(Y) if and only if Y and X are independent random variables.

Generalization to quantum theory

In quantum information theory, the conditional entropy is generalized to the conditional quantum entropy.

Other properties

For any X and Y:

H(X|Y) \le H(X)

H(X,Y) = H(X | Y) + H(Y | X) + I(X;Y), where I(X;Y) is the mutual information between X and Y.

I(X;Y) \le H(X), where I(X;Y) is the mutual information between X and Y.

For independent X and Y:

H(Y | X) = H(Y) and H(X | Y) = H(X)

Although the specific-conditional entropy, H(X | Y = y), can be either lesser or greater than H(X | Y), H(X | Y = y) can never exceed H(X) when X is the uniform distribution.


  1. Theresa M. Korn; Korn, Granino Arthur. Mathematical Handbook for Scientists and Engineers: Definitions, Theorems, and Formulas for Reference and Review. New York: Dover Publications. pp. 613–614. ISBN 0-486-41147-8. 
  2. C. Arndt (2001). Information Measures: Information and its description in Science and Engineering. Berlin: Springer. pp. 370–373. ISBN 3-540-41633-1. 

See also

Wikimedia Foundation. 2010.

Look at other dictionaries:

  • Conditional quantum entropy — The conditional quantum entropy is an entropy measure used in quantum information theory. It is a generalization of the conditional entropy of classical information theory. The conditional entropy is written S(ρ | σ), or H(ρ | σ), depending on… …   Wikipedia

  • Entropy (information theory) — In information theory, entropy is a measure of the uncertainty associated with a random variable. The term by itself in this context usually refers to the Shannon entropy, which quantifies, in the sense of an expected value, the information… …   Wikipedia

  • Conditional mutual information — In probability theory, and in particular, information theory, the conditional mutual information is, in its most basic form, the expected value of the mutual information of two random variables given the value of a third. Contents 1 Definition 2… …   Wikipedia

  • Entropy of mixing — The entropy of mixing is the change in the configuration entropy, an extensive thermodynamic quantity, when two different chemical substances or components are mixed. This entropy change must be positive since there is more uncertainty about the… …   Wikipedia

  • Conditional random field — A conditional random field (CRF) is a statistical modelling method often applied in pattern recognition. More specifically it is a type of discriminative undirected probabilistic graphical model. It is used to encode known relationships between… …   Wikipedia

  • Joint quantum entropy — The joint quantum entropy generalizes the classical joint entropy to the context of quantum information theory. Intuitively, given two quantum states ho and sigma, represented as density operators that are subparts of a quantum system, the joint… …   Wikipedia

  • Joint entropy — The joint entropy is an entropy measure used in information theory. The joint entropy measures how much entropy is contained in a joint system of two random variables. If the random variables are X and Y, the joint entropy is written H(X,Y). Like …   Wikipedia

  • Cross entropy — In information theory, the cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q,… …   Wikipedia

  • Differential entropy — (also referred to as continuous entropy) is a concept in information theory that extends the idea of (Shannon) entropy, a measure of average surprisal of a random variable, to continuous probability distributions. Contents 1 Definition 2… …   Wikipedia

  • Maximum-entropy Markov model — MEMM redirects here. For the German Nordic combined skier, see Silvio Memm. In machine learning, a maximum entropy Markov model (MEMM), or conditional Markov model (CMM), is a graphical model for sequence labeling that combines features of hidden …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”

We are using cookies for the best presentation of our site. Continuing to use this site, you agree with this.