Chain Rule

What is a Chain Rule in Machine Learning?

The chain rule of probability, also called the general product rule, expresses any component of the joint distribution of a set of random variables using only conditional probabilities. It is the foundation for factorizing joint distributions in Bayesian networks. Backpropagation relies on the chain rule of calculus, a different result that shares the name.

For two events A and B, the chain rule is expressed as:

P(A,B) = P(B | A) P(A)
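
Applying the same identity repeatedly extends it from two events to a whole set of random variables. As a sketch, with X_1 through X_n as placeholder names that do not appear in the original text, the general form can be written as:

```latex
P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1})
```

For n = 2, with X_1 = A and X_2 = B, the product reduces to P(A) P(B | A), which matches the two-event formula above.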

How Does the Chain Rule Work?

A simple example is the probability of drawing a winning raffle ticket from one of two hats. Hat 1 contains 1 winning ticket and 2 losing tickets. Hat 2 contains 1 winning ticket and 3 losing tickets.

Event A is choosing hat 1 when you pick a hat at random. Each hat is equally likely, so P(A) = P(~A) = 1/2.

Event B is drawing a winning ticket. Given that you chose hat 1, the probability of drawing a winning ticket is P(B | A) = 1/3.

Event “A, B” is the intersection of choosing hat 1 and drawing a winning ticket. By the chain rule, the probability of both happening is P(A, B) = P(B | A) P(A) = 1/3 x 1/2 = 1/6, or about 16.7%.
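
The raffle calculation can be reproduced in a few lines of Python. This is a minimal sketch that uses exact fractions to avoid rounding. The variable names and the joint_probability helper are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction

# Probabilities taken from the raffle example in the text.
p_hat1 = Fraction(1, 2)            # P(A): hat 1 is chosen
p_win_given_hat1 = Fraction(1, 3)  # P(B | A): 1 winning ticket out of 3 in hat 1


def joint_probability(p_conditional, p_marginal):
    """Chain rule for two events: P(A, B) = P(B | A) * P(A)."""
    return p_conditional * p_marginal


p_hat1_and_win = joint_probability(p_win_given_hat1, p_hat1)

print(p_hat1_and_win)                    # 1/6
print(round(float(p_hat1_and_win), 4))   # 0.1667
```

Keeping the values as Fraction objects shows the exact result, 1/6, before converting it to a decimal.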