This paper is an attempt to explain all the matrix calculus you need in order to understand the training of deep neural networks. We assume no math knowledge beyond what you learned in calculus 1, and provide links to help you refresh the necessary math where needed. Note that you do not need to understand this material before you start learning to train and use deep learning in practice; rather, this material is for those who are already familiar with the basics of neural networks, and wish to deepen their understanding of the underlying math. Don't worry if you get stuck at some point along the way---just go back and reread the previous section, and try writing down and working through some examples. And if you're still stuck, we're happy to answer your questions in the Theory category at forums.fast.ai. Note: There is a reference section at the end of the paper summarizing all the key matrix calculus rules and terminology discussed here.
Most of us last saw calculus in school, but derivatives are a critical part of machine learning, particularly deep neural networks, which are trained by optimizing a loss function. Pick up a machine learning paper or the documentation of a library such as PyTorch and calculus comes screeching back into your life like distant relatives around the holidays. And it’s not just any old scalar calculus that pops up—you need differential matrix calculus, the shotgun wedding of linear algebra and multivariate calculus.
Well… maybe need isn’t the right word; Jeremy’s courses show how to become a world-class deep learning practitioner with only a minimal level of scalar calculus, thanks to leveraging the automatic differentiation built into modern deep learning libraries. But if you want to really understand what’s going on under the hood of these libraries, and grok academic papers discussing the latest advances in model training techniques, you’ll need to understand certain bits of the field of matrix calculus.
For example, the activation of a single computation unit in a neural network is typically calculated using the dot product (from linear algebra) of an edge weight vector w with an input vector x plus a scalar bias (threshold): z(x) = w · x + b. Function z is called the unit’s affine function and is followed by a rectified linear unit, which clips negative values to zero: activation(x) = max(0, w · x + b). Such a computational unit is sometimes referred to as an “artificial neuron” and looks like:
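To make the affine-plus-ReLU unit concrete, here is a minimal plain-Python sketch; the helper names affine, relu, and neuron are ours for illustration, not a library API:

```python
def affine(w, x, b):
    # the unit's affine function: z(x) = w . x + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def relu(z):
    # rectified linear unit: clip negative values to zero
    return max(0.0, z)

def neuron(w, x, b):
    # activation(x) = max(0, w . x + b)
    return relu(affine(w, x, b))
```

For example, neuron([1.0, -2.0], [3.0, 0.5], 0.25) computes the affine value 3.0 - 1.0 + 0.25 = 2.25, which is positive and so passes through the ReLU unchanged.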
Neural networks consist of many of these units, organized into multiple collections of neurons called layers. The activation of one layer’s units becomes the input to the next layer’s units. The activation of the unit or units in the final layer is called the network output.
Training this neuron means choosing weights w and bias b so that we get the desired output for all inputs x. To do that, we minimize a loss function that compares the network’s final activation(x) with the target y (the desired output of x) for all input x vectors. To minimize the loss, we use some variation on gradient descent, such as plain stochastic gradient descent (SGD), SGD with momentum, or Adam. All of those require the partial derivative (the gradient) of the loss with respect to the model parameters w and b. Our goal is to gradually tweak w and b so that the overall loss function keeps getting smaller across all x inputs.
If we’re careful, we can derive the gradient by differentiating the scalar version of a common loss function, the mean squared error: the average over all inputs of (target(x) − activation(x))².
But this is just one neuron, and neural networks must train the weights and biases of all neurons in all layers simultaneously. Because there are multiple inputs and (potentially) multiple network outputs, we really need general rules for the derivative of a function with respect to a vector and even rules for the derivative of a vector-valued function with respect to a vector.
This article walks through the derivation of some important rules for computing partial derivatives with respect to vectors, particularly those useful for training neural networks. This field is known as matrix calculus, and the good news is, we only need a small subset of that field, which we introduce here. While there is a lot of online material on multivariate calculus and linear algebra, they are typically taught as two separate undergraduate courses, so most material treats them in isolation. The pages that do discuss matrix calculus are often really just lists of rules with minimal explanation, or just pieces of the story. They also tend to be quite obscure to all but a narrow audience of mathematicians, thanks to their use of dense notation and minimal discussion of foundational concepts. (See the annotated list of resources at the end.)
In contrast, we’re going to rederive and rediscover some key matrix calculus rules in an effort to explain them. It turns out that matrix calculus is really not that hard! There aren’t dozens of new rules to learn; just a couple of key concepts. Our hope is that this short paper will get you started quickly in the world of matrix calculus as it relates to training neural networks. We’re assuming you’re already familiar with the basics of neural network architecture and training. If you’re not, head over to Jeremy’s course and complete part 1 of that, then we’ll see you back here when you’re done. (Note that, unlike many more academic approaches, we strongly suggest first learning to train and use neural networks in practice and then study the underlying math. The math will be much more understandable with the context in place; besides, it’s not necessary to grok all this calculus to become an effective practitioner.)
A note on notation: Jeremy’s course exclusively uses code, instead of math notation, to explain concepts since unfamiliar functions in code are easy to search for and experiment with. In this paper, we do the opposite: there is a lot of math notation because one of the goals of this paper is to help you understand the notation that you’ll see in deep learning papers and books. At the end of the paper, you’ll find a brief table of the notation used, including a word or phrase you can use to search for more details.
Hopefully you remember some of these main scalar derivative rules. If your memory is a bit fuzzy on this, have a look at the Khan Academy video on scalar derivative rules.
|Rule||Scalar derivative notation with respect to x||Example|
|Multiplication by constant||d(cf(x))/dx = c df/dx||d(3x)/dx = 3|
|Chain rule||df(g(x))/dx = df(u)/du · du/dx, let u = g(x)||d sin(x²)/dx = 2x cos(x²)|
There are other rules for trigonometry, exponentials, etc., which you can find in Khan Academy’s differential calculus course.
When a function has a single parameter, x, you’ll often see f' and f'(x) used as shorthands for df(x)/dx. We recommend against this notation as it does not make clear the variable we’re taking the derivative with respect to.
You can think of d/dx as an operator that maps a function of one parameter to another function. That means that d/dx f(x) maps f(x) to its derivative with respect to x, which is the same thing as f'(x). Also, if y = f(x), then dy/dx = df(x)/dx = f'(x). Thinking of the derivative as an operator helps to simplify complicated derivatives because the operator is distributive and lets us pull out constants. For example, in the following equation, we can pull out the constant 9 and distribute the derivative operator across the elements within the parentheses: d/dx 9(x + x²) = 9 d/dx(x + x²) = 9(d/dx x + d/dx x²) = 9(1 + 2x) = 9 + 18x.
That procedure reduced the derivative of 9(x + x²) to a bit of arithmetic and the derivatives of x and x², which are much easier to solve than the original derivative.
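A finite-difference estimate is an easy way to validate hand-derived results like this one. The sketch below (our own helper, not part of the paper) compares the derivative of 9(x + x²) against the closed form 9(1 + 2x) = 9 + 18x:

```python
def numeric_derivative(f, x, h=1e-6):
    # central finite-difference approximation of df/dx
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return 9 * (x + x * x)

def df_dx(x):
    # pull the 9 out, distribute d/dx, then apply scalar derivative rules
    return 9 * (1 + 2 * x)
```

The two should agree at any sample point up to the finite-difference error.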
Neural network layers are not single functions of a single parameter, f(x). So, let’s move on to functions of multiple parameters such as f(x, y). For example, what is the derivative of xy (i.e., the multiplication of x and y)? In other words, how does the product xy change when we wiggle the variables? Well, it depends on whether we are changing x or y. We compute derivatives with respect to one variable (parameter) at a time, giving us two different partial derivatives for this two-parameter function (one for x and one for y). Instead of using operator d/dx, the partial derivative operator is ∂/∂x (a stylized d and not the Greek letter δ). So, ∂(xy)/∂x and ∂(xy)/∂y are the partial derivatives of xy; often, these are just called the partials. For functions of a single parameter, operator ∂/∂x is equivalent to d/dx (for sufficiently smooth functions). However, it’s better to use d/dx to make it clear you’re referring to a scalar derivative.
The partial derivative with respect to x is just the usual scalar derivative, simply treating any other variable in the equation as a constant. Consider function f(x, y) = 3x²y. The partial derivative with respect to x is written ∂(3x²y)/∂x. There are three constants from the perspective of x: 3, 2, and y. Therefore, ∂(3x²y)/∂x = 3y ∂(x²)/∂x = 3y · 2x = 6yx. The partial derivative with respect to y treats x like a constant: ∂(3x²y)/∂y = 3x². It’s a good idea to derive these yourself before continuing; otherwise the rest of the article won’t make sense. Here’s the Khan Academy video on partials if you need help.
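You can check partials numerically the same way as scalar derivatives: wiggle one variable while holding the other fixed. A small sketch for f(x, y) = 3x²y (the helper names are ours):

```python
def f(x, y):
    return 3 * x**2 * y

def partial_x(f, x, y, h=1e-6):
    # wiggle x, hold y constant
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    # wiggle y, hold x constant
    return (f(x, y + h) - f(x, y - h)) / (2 * h)
```

At (x, y) = (2, 3) the analytic partials are 6yx = 36 and 3x² = 12, which the numeric estimates should reproduce.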
To make it clear we are doing vector calculus and not just multivariate calculus, let’s consider what we do with the partial derivatives ∂f(x, y)/∂x and ∂f(x, y)/∂y (another way to say ∂(3x²y)/∂x and ∂(3x²y)/∂y) that we computed for f(x, y) = 3x²y. Instead of having them just floating around and not organized in any way, let’s organize them into a horizontal vector. We call this vector the gradient of f(x, y) and write it as: ∇f(x, y) = [∂f(x, y)/∂x, ∂f(x, y)/∂y] = [6yx, 3x²].
So the gradient of f(x, y) is simply a vector of its partials. Gradients are part of the vector calculus world, which deals with functions that map n scalar parameters to a single scalar. Now, let’s get crazy and consider derivatives of multiple functions simultaneously.
When we move from derivatives of one function to derivatives of many functions, we move from the world of vector calculus to matrix calculus. Let’s compute partial derivatives for two functions, both of which take two parameters. We can keep the same f(x, y) = 3x²y from the last section, but let’s also bring in g(x, y) = 2x + y⁸. The gradient for g has two entries, a partial derivative for each parameter: ∂g/∂x = 2 and ∂g/∂y = 8y⁷,
giving us gradient ∇g(x, y) = [2, 8y⁷].
Gradient vectors organize all of the partial derivatives for a specific scalar function. If we have two functions, we can also organize their gradients into a matrix by stacking the gradients. When we do so, we get the Jacobian matrix (or just the Jacobian) where the gradients are rows: J = [∇f(x, y); ∇g(x, y)] = [[∂f/∂x, ∂f/∂y], [∂g/∂x, ∂g/∂y]] = [[6yx, 3x²], [2, 8y⁷]].
Welcome to matrix calculus!
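Stacking numeric gradients as rows reproduces the Jacobian. This sketch (all helper names ours) checks f(x, y) = 3x²y and g(x, y) = 2x + y⁸ against the analytic rows [6yx, 3x²] and [2, 8y⁷] at a sample point:

```python
def grad(f, x, y, h=1e-6):
    # numeric gradient [df/dx, df/dy] of a scalar function of (x, y)
    return [(f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h)]

def jacobian(fs, x, y):
    # stack one gradient row per scalar function
    return [grad(fi, x, y) for fi in fs]

f = lambda x, y: 3 * x**2 * y
g = lambda x, y: 2 * x + y**8

# at (x, y) = (2, 1) the analytic Jacobian is [[12, 12], [2, 8]]
J = jacobian([f, g], 2.0, 1.0)
```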
Note that there are multiple ways to represent the Jacobian. We are using the so-called numerator layout, but many papers and software will use the denominator layout. This is just the transpose of the numerator-layout Jacobian (flip it around its diagonal).
So far, we’ve looked at a specific example of a Jacobian matrix. To define the Jacobian matrix more generally, let’s combine multiple parameters into a single vector argument: f(x₁, x₂, …, xₙ) = f(x). (You will sometimes see notation f(𝐱) for vectors in the literature as well.) Lowercase letters in bold font such as x are vectors and those in italics font like x are scalars. xᵢ is the i-th element of vector x and is in italics because a single vector element is a scalar. We also have to define an orientation for vector x. We’ll assume that all vectors are vertical by default, of size n × 1: x = [x₁, x₂, …, xₙ]ᵀ.
With multiple scalar-valued functions, we can combine them all into a vector just like we did with the parameters. Let y = f(x) be a vector of m scalar-valued functions that each take a vector x of length n = |x|, where |x| is the cardinality (count) of elements in x. Each function fᵢ within f returns a scalar just as in the previous section: yᵢ = fᵢ(x).
For instance, we’d represent f(x, y) and g(x, y) from the last section as y₁ = f₁(x) = 3x₁²x₂ (substituting x₁ for x and x₂ for y) and y₂ = f₂(x) = 2x₁ + x₂⁸.
It’s very often the case that m = n because we will have a scalar function result for each element of the x vector. For example, consider the identity function y = f(x) = x: yᵢ = fᵢ(x) = xᵢ.
So we have m = n functions and parameters, in this case. Generally speaking, though, the Jacobian matrix is the collection of all m × n possible partial derivatives (m rows and n columns), which is the stack of m gradients with respect to x: row i of ∂y/∂x is ∇fᵢ(x) = [∂fᵢ(x)/∂x₁, ∂fᵢ(x)/∂x₂, …, ∂fᵢ(x)/∂xₙ].
Each ∇fᵢ(x) is a horizontal n-vector because the partial derivative is with respect to a vector, x, whose length is n = |x|. The width of the Jacobian is n if we’re taking the partial derivative with respect to x because there are n parameters we can wiggle, each potentially changing the function’s value. Therefore, the Jacobian is always m rows for m equations. It helps to think about the possible Jacobian shapes visually: a scalar function of a scalar gives a 1 × 1 Jacobian, a scalar function of a vector gives a 1 × n gradient row, and a vector function of a vector gives a full m × n matrix.
The Jacobian of the identity function f(x) = x, with fᵢ(x) = xᵢ, has n functions and each function has n parameters held in a single vector x. The Jacobian is, therefore, a square matrix since m = n: every diagonal element is ∂xᵢ/∂xᵢ = 1 and every off-diagonal element is ∂xᵢ/∂xⱼ = 0 for i ≠ j, so ∂y/∂x = I, the identity matrix.
Make sure that you can derive each step above before moving on. If you get stuck, just consider each element of the matrix in isolation and apply the usual scalar derivative rules. That is a generally useful trick: Reduce vector expressions down to a set of scalar expressions and then take all of the partials, combining the results appropriately into vectors and matrices at the end.
Also be careful to track whether a matrix is vertical, x, or horizontal, xᵀ, where ᵀ means transpose. Also make sure you pay attention to whether something is a scalar-valued function, y = f(x), or a vector of functions (or a vector-valued function), y = f(x).
Element-wise binary operations on vectors, such as vector addition w + x, are important because we can express many common vector operations, such as the multiplication of a vector by a scalar, as element-wise binary operations. By “element-wise binary operations” we simply mean applying an operator to the first item of each vector to get the first item of the output, then to the second items of the inputs for the second item of the output, and so forth. This is how all the basic math operators are applied by default in numpy or tensorflow, for example. Examples that often crop up in deep learning are max(w, x) and w > x (which returns a vector of ones and zeros).
We can generalize the element-wise binary operations with notation y = f(w) ○ g(x), where m = n = |y| = |w| = |x|. (Reminder: |x| is the number of items in x.) The symbol ○ represents any element-wise operator (such as +) and not the function composition operator. Here’s what the equation y = f(w) ○ g(x) looks like when we zoom in to examine the scalar equations: yᵢ = fᵢ(w) ○ gᵢ(x),
where we write n (not m) equations vertically to emphasize the fact that the result of element-wise operators gives n-sized vector results.
Using the ideas from the last section, we can see that the general case for the Jacobian with respect to w is the square n × n matrix whose (i, j) entry is ∂(fᵢ(w) ○ gᵢ(x))/∂wⱼ,
and the Jacobian with respect to x is the square matrix whose (i, j) entry is ∂(fᵢ(w) ○ gᵢ(x))/∂xⱼ.
That’s quite a furball, but fortunately the Jacobian is very often a diagonal matrix, a matrix that is zero everywhere but the diagonal. Because this greatly simplifies the Jacobian, let’s examine in detail when the Jacobian reduces to a diagonal matrix for element-wise operations.
In a diagonal Jacobian, all elements off the diagonal are zero, ∂(fᵢ(w) ○ gᵢ(x))/∂wⱼ = 0 where i ≠ j. (Notice that we are taking the partial derivative with respect to wⱼ not wᵢ.) Under what conditions are those off-diagonal elements zero? Precisely when fᵢ and gᵢ are constants with respect to wⱼ: ∂fᵢ(w)/∂wⱼ = ∂gᵢ(x)/∂wⱼ = 0. Regardless of the operator, if those partial derivatives go to zero, the operation goes to zero, no matter what, and the partial derivative of a constant is zero.
Those partials go to zero when fᵢ and gᵢ are not functions of wⱼ. We know that element-wise operations imply that fᵢ is purely a function of wᵢ and gᵢ is purely a function of xᵢ. For example, w + x sums wᵢ + xᵢ. Consequently, fᵢ(w) ○ gᵢ(x) reduces to fᵢ(wᵢ) ○ gᵢ(xᵢ) and the goal becomes ∂(fᵢ(wᵢ) ○ gᵢ(xᵢ))/∂wⱼ. fᵢ(wᵢ) and gᵢ(xᵢ) look like constants to the partial differentiation operator with respect to wⱼ when i ≠ j, so the partials are zero off the diagonal. (Notation fᵢ(wᵢ) is technically an abuse of our notation because fᵢ and gᵢ are functions of vectors not individual elements. We should really write something like f̂ᵢ(wᵢ) = fᵢ(w), but that would muddy the equations further, and programmers are comfortable overloading functions, so we’ll proceed with the notation anyway.)
We’ll take advantage of this simplification later and refer to the constraint that fᵢ and gᵢ access at most wᵢ and xᵢ, respectively, as the element-wise diagonal condition.
Under this condition, the elements along the diagonal of the Jacobian are ∂(fᵢ(wᵢ) ○ gᵢ(xᵢ))/∂wᵢ:
(The large “0”s are a shorthand indicating that all of the off-diagonal entries are 0.)
More succinctly, we can write: ∂y/∂w = diag(∂(f₁(w₁) ○ g₁(x₁))/∂w₁, …, ∂(fₙ(wₙ) ○ gₙ(xₙ))/∂wₙ),
where diag(x) constructs a matrix whose diagonal elements are taken from vector x.
Because we do lots of simple vector arithmetic, the general function f(w) in the binary element-wise operation is often just the vector w. Any time the general function is a vector, we know that fᵢ(w) reduces to fᵢ(wᵢ) = wᵢ. For example, vector addition w + x fits our element-wise diagonal condition because f(w) + g(x) has scalar equations yᵢ = fᵢ(w) + gᵢ(x) that reduce to just yᵢ = wᵢ + xᵢ with partial derivatives: ∂(wᵢ + xᵢ)/∂wᵢ = 1 and ∂(wᵢ + xᵢ)/∂xᵢ = 1.
That gives us ∂(w + x)/∂w = I, the identity matrix, because every element along the diagonal is 1. I represents the square identity matrix of appropriate dimensions that is zero everywhere but the diagonal, which contains all ones.
Given the simplicity of this special case, fᵢ(w) reducing to fᵢ(wᵢ) = wᵢ, you should be able to derive the Jacobians for the common element-wise binary operations on vectors; for example, ∂(w + x)/∂w = I, ∂(w − x)/∂w = I, and ∂(w ⊗ x)/∂w = diag(x).
The ⊗ and ⊘ operators are element-wise multiplication and division; ⊗ is sometimes called the Hadamard product. There isn’t a standard notation for element-wise multiplication and division so we’re using an approach consistent with our general binary operation notation.
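A numeric Jacobian makes the diagonal structure visible. The sketch below (helper names ours) differentiates w ⊗ x and w + x with respect to w; element-wise multiplication should give diag(x), and addition should give the identity matrix:

```python
def numeric_jacobian(f, x, h=1e-6):
    # J[i][j] = d f_i / d x_j via central differences
    J = []
    for i in range(len(f(x))):
        row = []
        for j in range(len(x)):
            xp = list(x); xp[j] += h
            xm = list(x); xm[j] -= h
            row.append((f(xp)[i] - f(xm)[i]) / (2 * h))
        J.append(row)
    return J

w = [1.0, 2.0, 3.0]
x = [4.0, 5.0, 6.0]

# Jacobian of w (element-wise *) x with respect to w: expect diag(x)
J_mul = numeric_jacobian(lambda v: [vi * xi for vi, xi in zip(v, x)], w)
# Jacobian of w + x with respect to w: expect the identity matrix
J_add = numeric_jacobian(lambda v: [vi + xi for vi, xi in zip(v, x)], w)
```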
When we multiply or add scalars to vectors, we’re implicitly expanding the scalar to a vector and then performing an element-wise binary operation. For example, adding scalar z to vector x, y = x + z, is really y = f(x) + g(z) where f(x) = x and g(z) = 1z. (The notation 1 represents a vector of ones of appropriate length.) z is any scalar that doesn’t depend on x, which is useful because then ∂z/∂xᵢ = 0 for any xᵢ, and that will simplify our partial derivative computations. (It’s okay to think of variable z as a constant for our discussion here.) Similarly, multiplying by a scalar, y = xz, is really y = x ⊗ 1z where ⊗ is the element-wise multiplication (Hadamard product) of the two vectors.
The partial derivatives of vector-scalar addition and multiplication with respect to vector use our element-wise rule:
This follows because functions f(x) = x and g(z) = 1z clearly satisfy our element-wise diagonal condition for the Jacobian (fᵢ(x) refers at most to xᵢ and gᵢ(z) refers to the i-th value of the 1z vector).
Using the usual rules for scalar partial derivatives, we arrive at the following diagonal elements of the Jacobian for vector-scalar addition: ∂(xᵢ + z)/∂xᵢ = 1, so ∂(x + z)/∂x = I.
Computing the partial derivative with respect to the scalar parameter z, however, results in a vertical vector, not a diagonal matrix. The elements of the vector are: ∂(xᵢ + z)/∂z = 0 + 1 = 1, giving ∂(x + z)/∂z = 1, a vector of ones.
The diagonal elements of the Jacobian for vector-scalar multiplication involve the product rule for scalar derivatives: ∂(xᵢz)/∂xᵢ = xᵢ ∂z/∂xᵢ + z ∂xᵢ/∂xᵢ = 0 + z = z, so ∂(xz)/∂x = Iz.
The partial derivative with respect to scalar parameter z is a vertical vector whose elements are: ∂(xᵢz)/∂z = xᵢ ∂z/∂z + z ∂xᵢ/∂z = xᵢ + 0 = xᵢ.
This gives us ∂(xz)/∂z = x.
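The vector-scalar results can be checked numerically as well: wiggling z in x + z should give a vector of ones, and in xz it should give back x itself. A minimal sketch (helper names ours):

```python
def d_dz(f, x, z, h=1e-6):
    # derivative of each output element with respect to the scalar z
    up, down = f(x, z + h), f(x, z - h)
    return [(a - b) / (2 * h) for a, b in zip(up, down)]

x = [2.0, 5.0, 7.0]
add = lambda x, z: [xi + z for xi in x]   # expand z to 1z, then element-wise +
mul = lambda x, z: [xi * z for xi in x]   # expand z to 1z, then element-wise *

dadd = d_dz(add, x, 3.0)   # expect [1, 1, 1]
dmul = d_dz(mul, x, 3.0)   # expect x itself: [2, 5, 7]
```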
Summing up the elements of a vector is an important operation in deep learning, such as in the network loss function, but we can also use it as a way to simplify computing the derivative of the vector dot product and other operations that reduce vectors to scalars.
Let y = sum(f(x)) = Σᵢ fᵢ(x). Notice we were careful here to leave the parameter as a vector x because each function fᵢ could use all values in the vector, not just xᵢ. The sum is over the results of the function and not the parameter. The gradient (1 × n Jacobian) of vector summation is: ∇y = [∂y/∂x₁, ∂y/∂x₂, …, ∂y/∂xₙ] = [Σᵢ ∂fᵢ(x)/∂x₁, Σᵢ ∂fᵢ(x)/∂x₂, …, Σᵢ ∂fᵢ(x)/∂xₙ].
(The summation inside the gradient elements can be tricky so make sure to keep your notation consistent.)
Let’s look at the gradient of the simple y = sum(x). The function inside the summation is just fᵢ(x) = xᵢ and the gradient is then: ∇y = [Σᵢ ∂xᵢ/∂x₁, Σᵢ ∂xᵢ/∂x₂, …, Σᵢ ∂xᵢ/∂xₙ].
Because ∂xᵢ/∂xⱼ = 0 for j ≠ i, we can simplify to: ∇y = [∂x₁/∂x₁, ∂x₂/∂x₂, …, ∂xₙ/∂xₙ] = [1, 1, …, 1].
Notice that the result is a horizontal vector full of 1s, not a vertical vector, and so the gradient is 1ᵀ. (The exponent of ᵀ represents the transpose of the indicated vector. In this case, it flips a vertical vector to a horizontal vector.) It’s very important to keep the shape of all of your vectors and matrices in order; otherwise it’s impossible to compute the derivatives of complex functions.
As another example, let’s sum the result of multiplying a vector by a constant scalar. If y = sum(xz) then y = Σᵢ xᵢz. The gradient is: ∇y = [Σᵢ ∂(xᵢz)/∂x₁, …, Σᵢ ∂(xᵢz)/∂xₙ] = [z, z, …, z].
The derivative with respect to scalar variable z is 1 × 1: ∂y/∂z = ∂(Σᵢ xᵢz)/∂z = Σᵢ xᵢ = sum(x).
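Both sum-reduction gradients are easy to confirm numerically (a sketch; numeric_grad is our own helper):

```python
def numeric_grad(f, x, h=1e-6):
    # horizontal gradient [dy/dx_1, ..., dy/dx_n] of the scalar y = f(x)
    g = []
    for j in range(len(x)):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

x = [1.0, 4.0, 9.0]
z = 2.0

g_sum = numeric_grad(sum, x)                                    # expect [1, 1, 1]
g_sum_xz = numeric_grad(lambda v: sum(vi * z for vi in v), x)   # expect [z, z, z]
```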
We can’t compute partial derivatives of very complicated functions using just the basic matrix calculus rules we’ve seen so far. For example, we can’t take the derivative of nested expressions like sum(w + x) directly without reducing them to their scalar equivalents. We need to be able to combine our basic vector rules using what we can call the vector chain rule. Unfortunately, there are a number of rules for differentiation that fall under the name “chain rule” so we have to be careful which chain rule we’re talking about. Part of our goal here is to clearly define and name three different chain rules and indicate in which situation they are appropriate. To get warmed up, we’ll start with what we’ll call the single-variable chain rule, where we want the derivative of a scalar function with respect to a scalar. Then we’ll move on to an important concept called the total derivative and use it to define what we’ll pedantically call the single-variable total-derivative chain rule. Then, we’ll be ready for the vector chain rule in its full glory as needed for neural networks.
The chain rule is conceptually a divide and conquer strategy (like Quicksort) that breaks complicated expressions into subexpressions whose derivatives are easier to compute. Its power derives from the fact that we can process each simple subexpression in isolation yet still combine the intermediate results to get the correct overall result.
The chain rule comes into play when we need the derivative of an expression composed of nested subexpressions. For example, we need the chain rule when confronted with expressions like d/dx sin(x²). The outermost expression takes the sin of an intermediate result, a nested subexpression that squares x. Specifically, we need the single-variable chain rule, so let’s start by digging into that in more detail.
Let’s start with the solution to the derivative of our nested expression: d/dx sin(x²) = 2x cos(x²). It doesn’t take a mathematical genius to recognize components of the solution that smack of scalar differentiation rules, d/dx x² = 2x and d/du sin(u) = cos(u). It looks like the solution is to multiply the derivative of the outer expression by the derivative of the inner expression or “chain the pieces together,” which is exactly right. In this section, we’ll explore the general principle at work and provide a process that works for highly-nested expressions of a single variable.
Chain rules are typically defined in terms of nested functions, such as y = f(g(x)), for single-variable chain rules. (You will also see the chain rule defined using function composition (f ∘ g)(x), which is the same thing.) Some sources write the derivative using shorthand notation y' = f'(g(x))g'(x), but that hides the fact that we are introducing an intermediate variable: u = g(x), which we’ll see shortly. It’s better to define the single-variable chain rule of f(g(x)) explicitly so we never take the derivative with respect to the wrong variable. Here is the formulation of the single-variable chain rule we recommend: dy/dx = dy/du · du/dx.
To deploy the single-variable chain rule, follow these steps:
Introduce intermediate variables for nested subexpressions and subexpressions for both binary and unary operators; e.g., x + x² is binary, while sin(x) and other trigonometric functions are usually unary because there is a single operand. This step normalizes all equations to single operators or function applications.
Compute derivatives of the intermediate variables with respect to their parameters.
Combine all derivatives of intermediate variables by multiplying them together to get the overall result.
Substitute intermediate variables back in if any are referenced in the derivative equation.
The third step puts the “chain” in “chain rule” because it chains together intermediate results. Multiplying the intermediate derivatives together is the common theme among all variations of the chain rule.
Let’s try this process on y = sin(x²):
Introduce intermediate variables. Let u represent subexpression x² (shorthand for u(x) = x²). This gives us: u = x² and y = sin(u).
The order of these subexpressions does not affect the answer, but we recommend working in the reverse order of operations dictated by the nesting (innermost to outermost). That way, expressions and derivatives are always functions of previously-computed elements.
Notice how easy it is to compute the derivatives of the intermediate variables in isolation! The chain rule says it’s legal to do that and tells us how to combine the intermediate results to get dy/dx = dy/du · du/dx = cos(u) · 2x = 2x cos(x²).
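The four-step process for y = sin(x²), written out as code (a sketch; the comments mirror the steps above, and the names are ours):

```python
import math

u = lambda x: x * x                 # step 1: intermediate variable u = x^2
du_dx = lambda x: 2 * x             # step 2: derivative of u wrt its parameter
dy_du = lambda u_: math.cos(u_)     # step 2: derivative of sin(u) wrt u

def dy_dx(x):
    # steps 3-4: multiply intermediate derivatives, substitute u back in
    return dy_du(u(x)) * du_dx(x)   # = 2x cos(x^2)
```

At x = 0 the result is 0 (the factor 2x vanishes), and at x = 1 it is 2 cos(1) ≈ 1.0806.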
You can think of the combining step of the chain rule in terms of units canceling. If we let y be miles, x be the gallons in a gas tank, and u be gallons as well, we can interpret dy/dx = dy/du · du/dx as miles/gallon; the gallon denominator and numerator cancel.
Another way to think about the single-variable chain rule is to visualize the overall expression as a dataflow diagram or chain of operations (or abstract syntax tree for compiler people):
Changes to function parameter x bubble up through a squaring operation then through a sin operation to change result y. You can think of du/dx as “getting changes from x to u” and dy/du as “getting changes from u to y.” Getting from x to y requires an intermediate hop. The chain rule is, by convention, usually written from the output variable down to the parameter(s), dy/dx = dy/du · du/dx. But, the x-to-y perspective would be more clear if we reversed the flow and used the equivalent dy/dx = du/dx · dy/du.
Conditions under which the single-variable chain rule applies. Notice that there is a single dataflow path from x to the root y. Changes in x can influence output y in only one way. That is the condition under which we can apply the single-variable chain rule. An easier condition to remember, though one that’s a bit looser, is that none of the intermediate subexpression functions, u(x) and y(u), have more than one parameter. Consider y(x) = x + x², which would become y(x, u) = x + u after introducing intermediate variable u(x) = x². As we’ll see in the next section, y(x, u) has multiple paths from x to y. To handle that situation, we’ll deploy the single-variable total-derivative chain rule.
As an aside for those interested in automatic differentiation, papers and library documentation use the terminology forward differentiation and backward differentiation (for use in the back-propagation algorithm). From a dataflow perspective, we are computing a forward differentiation because it follows the normal data flow direction. Backward differentiation, naturally, goes the other direction, and we’re asking how a change in the output y would affect function parameter x. Because backward differentiation can determine changes in all function parameters at once, it turns out to be much more efficient for computing the derivative of functions with lots of parameters. Forward differentiation, on the other hand, must consider how a change in each parameter, in turn, affects the function output y. The following table emphasizes the order in which partial derivatives are computed for the two techniques.
|Forward differentiation from x to y||Backward differentiation from y to x|
|compute du/dx first, then dy/dx = (dy/du)(du/dx)||compute dy/du first, then dy/dx = (dy/du)(du/dx)|
Automatic differentiation is beyond the scope of this article, but we’re setting the stage for a future article.
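Forward differentiation can be sketched with “dual numbers”: each value carries its derivative alongside, and seeding dx/dx = 1 pushes changes forward along the dataflow. This is an illustrative toy of our own, not how PyTorch implements autograd (PyTorch’s back-propagation is reverse-mode):

```python
class Dual:
    """Toy forward-mode autodiff value: a (value, derivative) pair."""
    def __init__(self, v, d=0.0):
        self.v, self.d = v, d
    def _wrap(self, o):
        return o if isinstance(o, Dual) else Dual(float(o))
    def __add__(self, o):
        o = self._wrap(o)
        return Dual(self.v + o.v, self.d + o.d)                 # sum rule
    __radd__ = __add__
    def __mul__(self, o):
        o = self._wrap(o)
        return Dual(self.v * o.v, self.d * o.v + self.v * o.d)  # product rule
    __rmul__ = __mul__

def forward_derivative(f, x):
    # seed dx/dx = 1, then read the derivative off the output
    return f(Dual(x, 1.0)).d
```

For example, forward_derivative(lambda x: 9 * (x + x * x), 2.0) returns 45.0, matching 9 + 18x at x = 2.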
Many readers can solve d/dx sin(x²) in their heads, but our goal is a process that will work even for very complicated expressions. This process is also how automatic differentiation works in libraries like PyTorch. So, by solving derivatives manually in this way, you’re also learning how to define functions for custom neural networks in PyTorch.
With deeply nested expressions, it helps to think about deploying the chain rule the way a compiler unravels nested function calls like f₄(f₃(f₂(f₁(x)))) into a sequence (chain) of calls. The result of calling function fᵢ is saved to a temporary variable called a register, which is then passed as a parameter to fᵢ₊₁. Let’s see how that looks in practice by using our process on a highly-nested equation like y = f(x) = ln(sin(x³)²):
Introduce intermediate variables: u₁ = x³, u₂ = sin(u₁), u₃ = u₂², u₄ = ln(u₃), so that y = u₄.
Combine the four intermediate values: dy/dx = (du₄/du₃)(du₃/du₂)(du₂/du₁)(du₁/dx) = (1/u₃)(2u₂)(cos(u₁))(3x²) = 6x² cos(x³)/sin(x³).
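A sketch verifying the chained result against a finite-difference estimate, using y = ln(sin(x³)²) as the nested expression; the chained product simplifies to 6x² cos(x³)/sin(x³):

```python
import math

def f(x):
    return math.log(math.sin(x**3) ** 2)

def df_dx(x):
    u1 = x**3            # u1 = x^3
    u2 = math.sin(u1)    # u2 = sin(u1)
    u3 = u2 ** 2         # u3 = u2^2, and y = ln(u3)
    # chain the four intermediate derivatives together
    return (1.0 / u3) * (2 * u2) * math.cos(u1) * (3 * x**2)

def numeric(x, h=1e-6):
    # central finite-difference estimate of df/dx
    return (f(x + h) - f(x - h)) / (2 * h)
```

The analytic and numeric derivatives should agree at any point where sin(x³) is nonzero.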