What is the Chain Rule in Machine Learning?
The chain rule, or general product rule, expresses any joint probability over a set of random variables using only conditional probabilities. This result from probability theory is the foundation for Bayesian networks. It is distinct from the chain rule of calculus, which is the one that underlies backpropagation.
For two events A and B, the rule is expressed as:
P(A,B) = P(B | A) P(A)
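The two-event form above extends to any number of variables by applying it repeatedly. As a sketch of the standard general statement, written with variable names X_1 through X_n that do not appear in the original text:

```latex
P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1})
```

With n = 2, X_1 = A and X_2 = B, this reduces to P(A,B) = P(B | A) P(A).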
How Does the Chain Rule Work?
A simple example is the probability of picking a winning raffle ticket out of one of two hats. Hat 1 has 1 winning ticket and 2 losing tickets inside. Hat 2 has 1 winning ticket and 3 losing tickets.
Event A is choosing hat 1 when you pick a hat at random. With two hats, the probability of choosing hat 1 equals the probability of not choosing it: P(A) = P(~A) = 1/2.
Event B is drawing a winning ticket. Given that you chose the first hat, the probability of drawing a winning ticket is P(B | A) = 1/3.
Event “A, B” is the intersection of picking hat 1 and drawing a winning ticket. By the chain rule, the probability of both happening is P(A,B) = P(B | A) P(A) = 1/3 x 1/2 = 1/6, or about 16.7%.
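The arithmetic in this example can be checked with a short Python sketch. It uses exact fractions to avoid rounding, and the function and variable names are illustrative only. It also assumes that hat 2 holds 1 winning and 3 losing tickets, and it adds the two joint probabilities to get the overall chance of winning, a step the example above does not take.

```python
from fractions import Fraction

# Event A: hat 1 is chosen. Each of the two hats is equally likely.
p_hat1 = Fraction(1, 2)
p_hat2 = 1 - p_hat1

# P(B | A): chance of a winning ticket given hat 1 (1 winner, 2 losers).
p_win_given_hat1 = Fraction(1, 3)
# Assumption: hat 2 holds 1 winner and 3 losers.
p_win_given_hat2 = Fraction(1, 4)


def joint(p_conditional, p_prior):
    """Chain rule for two events: P(A, B) = P(B | A) * P(A)."""
    return p_conditional * p_prior


# Probability of choosing a given hat AND drawing a winner from it.
p_hat1_and_win = joint(p_win_given_hat1, p_hat1)
p_hat2_and_win = joint(p_win_given_hat2, p_hat2)

# The two cases are mutually exclusive, so their probabilities add.
p_win = p_hat1_and_win + p_hat2_and_win

print(p_hat1_and_win)  # 1/6
print(p_win)           # 7/24
```

Under these assumptions, the joint probability for hat 1 is exactly 1/6, and the overall chance of drawing a winner from either hat is 7/24.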