I live in Umeå, near the Arctic Circle. Skiing is a basic survival skill here.

I taught myself to ski at the grand age of 35 using the theory of gradients.

So, what is gradient?

The gradient is information you can extract from a curve or surface. Simply put, it tells you which direction is steepest. In the image, the red arrow points in the direction opposite to the gradient, meaning it indicates the steepest downward direction.
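
To make this concrete, here is a minimal numerical sketch. The hill \(h(x, y)\) below is a made-up stand-in, not the surface in the image; a finite-difference estimate of its gradient points uphill, and its negation, like the red arrow, points in the steepest downward direction.

```python
import numpy as np

def h(x, y):
    # Hypothetical "mountain": a single smooth bump centered at the origin.
    return np.exp(-(x**2 + y**2))

def gradient(f, x, y, eps=1e-6):
    # Central finite differences for (dh/dx, dh/dy).
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return np.array([dfdx, dfdy])

g = gradient(h, 0.5, 0.3)   # the gradient points uphill (steepest ascent)
red_arrow = -g              # its opposite points downhill (steepest descent)
print(g, red_arrow)
```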

If you align your skis (blue arrow) with the red arrow, you descend quickly. When the skis are perpendicular to the red arrow, across the slope, you can stand still. In between, you can use the gradient as a reference to adjust the direction of your skis and thus control your speed.

Congratulations, you can ski now!

The gradient not only helps us reach the bottom of the mountain safely; it also guides us toward the optimal value of an objective function.

Here I have a smooth curve, with the gradient values marked at six points along it. Among them, \(w = -0.75\) is special: the function attains its minimum there, and the gradient is 0.

First, let’s examine the differences between the colors. In the red section, all gradient values are positive, and the purple point (the minimum) always lies to their left (the negative direction). In the blue section, the opposite holds: the gradients are negative and the minimum lies to the right. Therefore, the sign of the gradient tells us which direction to move in order to reach the lowest point.
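
Here is a small check of that claim with a stand-in curve. The article’s exact curve isn’t given, so I use the hypothetical \(L(w) = (w + 0.75)^2\), which, like the figure, has its minimum at \(w = -0.75\):

```python
def grad(w):
    # dL/dw for the stand-in curve L(w) = (w + 0.75)**2
    return 2 * (w + 0.75)

for w in [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5]:
    print(f"w = {w:5.2f}, gradient = {grad(w):5.2f}")

# Negative gradient (blue side): the minimum lies to the right, so move right.
# Positive gradient (red side):  the minimum lies to the left,  so move left.
```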

In other words, at each position, we have a “guide” that tells us which direction to go in order to reach the destination. Now, how far should we move at each position? Let’s continue and see.

Next, let’s compare points within the same color. In the blue section, the larger the absolute value of the gradient, the steeper the slope at that point, and the same holds in the red section. So the smaller the absolute value of the gradient, the closer you are to the target, and you should slow down as you approach it.
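
Continuing with the same stand-in curve, the magnitude of the gradient shrinks as \(w\) approaches \(-0.75\), so a step that is proportional to the gradient automatically slows down near the target (the factor 0.1 below is just an assumed step-size scale):

```python
# Stand-in curve L(w) = (w + 0.75)**2, so dL/dw = 2 * (w + 0.75).
lr = 0.1  # hypothetical step-size factor
for w in [2.0, 1.0, 0.0, -0.5, -0.7]:
    g = 2 * (w + 0.75)
    print(f"w = {w:5.2f}, |gradient| = {abs(g):4.2f}, step = {lr * abs(g):4.2f}")
```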

In other words, when we want to find the minimum of a loss function, \(L(w)\), we can start from any point and, using the guidance of the gradient, ultimately reach the destination. This is the basic idea of the gradient descent algorithm.
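
Putting the direction (the sign) and the step size (the magnitude) together gives a minimal gradient-descent sketch. It again assumes the stand-in loss \(L(w) = (w + 0.75)^2\), a hypothetical learning rate of 0.1, and an arbitrary starting point:

```python
def grad(w):
    # dL/dw of the stand-in loss L(w) = (w + 0.75)**2
    return 2 * (w + 0.75)

w = 2.0        # "start from any point"
lr = 0.1       # learning rate: how far to trust the guide at each step
for step in range(50):
    w = w - lr * grad(w)   # move against the gradient
print(w)       # ends up near -0.75, the minimum marked in the figure
```

Starting from \(w = 2.0\), the iterates settle at roughly \(-0.75\): each step moves in the direction the sign indicates, and the steps shrink on their own as the gradient’s magnitude shrinks near the minimum.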