Gradient Descent

Learning Objectives

  • Understand partial derivative and gradient

  • Understand how gradient descent may be used in optimization problems

  • Apply gradient descent in machine learning

Linear Approximation with Least Square Error

As mentioned in the previous section, supervised machinear learning can be formulated as a minimization problem: minimizing the error. This chapter starts with a problem that is probably farmiliar to many people already: linear approximation with least square error.

Consider a list of \(n\) points: \((x_1, \tilde{y_1})\), \((x_2, \tilde{y_2})\), …, \((x_n, \tilde{y_n})\). Please notice the convention: \(y = a x + b\) is the underlying equation and \(y\) is the “correct” value. It is generally not possible getting the correct value of \(y\) due to noise and limitation of meausrement instruments. Instead, we can get only the observed \(y\) with noise. To distinguish these two, we use \(\tilde{y}\) to express the observed value. It may be different from the true value of \(y\).

The problem is to find the values of \(a\) and \(b\) for a line \(y = a x + b\) such that

\(e(a, b)= \underset{i=1}{\overset{n}{\sum}} (y_i - (a x_i + b))^2\)

is as small as possible. This is the cumulative error. Let’s call it \(e(a, b)\) because it has two variables \(a\) and \(b\). Here we will solve this problem in two ways: analytically and numerically.

Analytics Method for Least Square Error

In Calculus, you have learned the concept of derivative. Suppose \(f(x)\) is a function of a single variable \(x\). The derivative of \(f(x)\) with respect to \(x\) is defined as

\(f'(x) = \frac{d}{dx} f(x) = \underset{h \rightarrow 0}{\text{lim}} \frac{f(x + h) - f(x)}{h}\)

The derivative calculates the ratio of change in \(f(x)\) and the change in \(x\). A geometric interpretation is the slope of \(f(x)\) at a specific point of \(x\).

Extend that concept to a multivariable function. Suppose \(f(x, y)\) is a function of two variables \(x\) and \(y\). The partial derivative of \(f(x,y)\) with respect to \(x\) at point \((x_0, y_0)\) is defined as

\(\frac{\partial f}{\partial x}| _{(x_0, y_0)} = \frac{d}{dx} f(x, y_0) | _{x = x_0} =\underset{h \rightarrow 0}{\text{lim}} \frac{f(x_0 + h, y_0) - f(x_0, y_0)}{h}\)

The derivative calculates the ratio of change in \(f(x, y)\) and the change in \(x\) while keeping the value of \(y\) unchanged.

Similarly, the partial derivative of \(f(x,y)\) with respect to \(y\) at point \((x_0, y_0)\) is defined as

\(\frac{\partial f}{\partial y}| _{(x_0, y_0)} = \frac{d}{dy} f(x_0, y) | _{y = y_0} =\underset{h \rightarrow 0}{\text{lim}} \frac{f(x_0, y_0 + h) - f(x_0, y_0)}{h}\)

To minimize the error function, we take the partial derivatives of \(a\) and \(b\) respectively:

\(\frac{\partial e}{\partial a} = 2 (a x_1 + b - y_1) x_1 + 2 (a x_2 + b - y_2) x_2 + ... + 2 (a x_n + b - y_n) x_n\)

\(\frac{\partial e}{\partial a} = 2 a (x_1^2 + x_2^2 + ... + x_n^2) + 2 b (x_1 + x_2 + ... + x_n) - 2 (x_1 y_1 + x_2 y_2 + ... + x_n y_n)\)

\(\frac{\partial e}{\partial a} = 2 (a \underset{i=1}{\overset{n}{\sum}} x_i^2 + b \underset{i=1}{\overset{n}{\sum}} x_i - \underset{i=1}{\overset{n}{\sum}} x_i y_i)\)

For \(b\):

\(\frac{\partial e}{\partial b} = 2 (a x_1 + b - y_1) + 2 (a x_2 + b - y_2) + ... + 2 (a \times x_n + b - y_n)\)

\(\frac{\partial e}{\partial b} = 2 a (x_1 + x_2 + ... + x_n) + 2 n b - 2 (y_1 + y_2 + ... + y_n)\)

\(\frac{\partial e}{\partial b} = 2 (a \underset{i=1}{\overset{n}{\sum}} x_i + b n - \underset{i=1}{\overset{n}{\sum}} y_i)\)

To find the minimum, set both to zero and obtain two linear equations of \(a\) and \(b\).

\(a \underset{i=1}{\overset{n}{\sum}} x_i^2 + b \underset{i=1}{\overset{n}{\sum}} x_i = \underset{i=1}{\overset{n}{\sum}} x_i y_i\)

\(a \underset{i=1}{\overset{n}{\sum}} x_i + b n = \underset{i=1}{\overset{n}{\sum}} y_i\)

The values of \(a\) and \(b\) can be expressed by

\(a =\frac{n \underset{i=1}{\overset{n}{\sum}} x_i y_i - \underset{i=1}{\overset{n}{\sum}} x_i \underset{i=1}{\overset{n}{\sum}} y_i}{n \underset{i=1}{\overset{n}{\sum}} x_i^2 - \underset{i=1}{\overset{n}{\sum}} x_i \underset{i=1}{\overset{n}{\sum}} x_i}\)

\(b =\frac{\underset{i=1}{\overset{n}{\sum}} y_i \underset{i=1}{\overset{n}{\sum}} x_i^2 - \underset{i=1}{\overset{n}{\sum}} x_i y_i \underset{i=1}{\overset{n}{\sum}} x_i}{n \underset{i=1}{\overset{n}{\sum}} x_i^2 - \underset{i=1}{\overset{n}{\sum}} x_i \underset{i=1}{\overset{n}{\sum}} x_i}\)

Let’s consider an example:



































































The pairs are plotted below:


The value of \(y\) is calculated by

\(y = 3 x - 5 + \epsilon\)

here \(\epsilon\) is the error (or noise) and it is set to a randeom number between -8 and 8.

The figure shows the line without noise:


Using the equations, \(a = 3.11806\) and \(b = -5.18776\).

This chapter starts with a review of multivariable calculus.

Next, we explain how to solve the problem using gradient descent.


The gradient of a two-variable function \(f(x, y)\) is at point \((x_0, y_0)\) is defined as

\(\nabla f|_{(x_0, y_0)} = \frac{\partial f}{\partial x} {\bf i} + \frac{\partial f}{\partial y} {\bf j}\)

Here, \({\bf i}\) and \({\bf j}\) are the unit vector in the \(x\) and \(y\) directions.

Suppose \({\bf v} = \alpha {\bf i} + \beta {\bf j}\) is a unit vector. Then, the amount of \(f\)’s changes in the direction of \({\bf v}\) is the inner product of \(\nabla f\) and \({\bf v}\). Apparently, the greatest change occurs at the direction when \({\bf v}\) is the unit vector of \(\nabla f\).

One way to understand graident is to think about speed bumps on roads.


The bump is modeled as a half cylinder. For simplicity, we assume that the bump is infinitely long in the \(x\) direction. A point on the surface of the bump can be expressed as

\(p = x {\bf i} + \alpha \cos(\theta) {\bf j} + \beta \sin(\theta) {\bf k}\)

The gradient is

\(\nabla p = \frac{\partial p}{\partial x} {\bf i} + \frac{\partial p}{\partial y} {\bf j} + \frac{\partial p}{\partial z} {\bf k} = {\bf i} + \alpha (- \sin(\theta)) {\bf j} + \beta \cos(\theta) {\bf k}.\)

This is the tangent of the point on the surface.

Next, consider another vector (such as the tire of your unicycle) goes through this bump. What is the rate of changes along this surface. If you ride the unicycle along this bump without getting onto the bump, then the vector of your movement can be expressed by

\(v = x {\bf i}\)

How is this affected by the slope of the bump? The calculation is the inner product of the two vectors:

\(\nabla p \cdot v = x {\bf i}\).

Notice that this inner product contains no \(\theta\). What does this mean? It means that your movement is not affected by \(\theta\). This is obvious because you are not riding onto the bump.

Next, consider that you ride straight to the bump. The vector will be

\(v = - y {\bf j}\)

The slop of the bump affects your actual movement, again by the inner product:

\(\nabla p \cdot v = y \alpha \sin(\theta) {\bf j}\).

How can we interpret this? The moment when your tire hits the bump, \(\theta\) is zero so your tire’s movement along the \({\bf j}\) direction is zero. This is understandable because you cannot penetrate into the bump. When the tire is at the top of the bump, \(\theta\) is \(\frac{\pi}{2}\) and the tire has the highest speed.

Based on this understanding, it is easier to answer the following question: “Which direction along the surface gives the greatest changes?” Because the actual change is the inner product of the direction and the gradient, the greatest change occurs along the direction of the gradient.

Gradient Descent

Gradient Descent is a method for solving minimization problems. The gradient of a function at a point is the rate of changes at that point. Let’s consider a simple example: use gradient descent to find a line that has the smallest sum of square error.

The gradient of a function is the direction of changes.

\(\nabla e = \frac{\partial e}{\partial a} {\bf i} + \frac{\partial e}{\partial b} {\bf j}\)

Suppose \(\Delta w = \alpha {\bf i} + \beta {\bf j}\) is a vector.

By definition, if \(\Delta w\) is small enough, then the change in \(\nabla e\) alogn the direction of \(\Delta w\) can be calculated as

\(\Delta e = \nabla e \cdot \Delta w\).

The goal is to reduce the error. Thus, \(\Delta w\) should be chosen to ensure that \(\Delta e\) is negative. If we make

\(\Delta w = - \eta \nabla e\),


\(\Delta e = \nabla e \cdot (- \eta \nabla e) = - \eta (\nabla e)^2\).

This ensures that the error \(e\) becomes smaller. The value \(\eta\) is called the learning rate. Its value is usually between 0.1 and 0.5. If \(\eta\) is too small, \(\Delta e\) changes very slowly, i.e., learning is slow. If \(\eta\) is too large, \(\Delta e = \nabla e \cdot \Delta w\) is not necessarily true.

Numerical Method for Least Square Error

It is possible to find an analytical solution for \(a\) and \(b\) because the function \(e\) is pretty simple. For many machine learning problems, the functions are highly complex and in many cases the functions are not even known in advance. For these problems, reducing the errors can be done numerically using data. This section is further divided into two different scenarios.

  • The first assumes that we know the function \(e\) but we do not have formulaes for \(a\) or \(b\).

  • The second assumes that we do not know the function \(e\) and certainly do not know the formulaes for \(a\) or \(b\). This is the common scenario.

For the first case,

\(\nabla e = \frac{\partial e}{\partial a} {\bf i} + \frac{\partial e}{\partial b} {\bf j}\)

\(\frac{\partial e}{\partial a} = 2 (a \underset{i=1}{\overset{n}{\sum}} x_i^2 + b \underset{i=1}{\overset{n}{\sum}} x_i - \underset{i=1}{\overset{n}{\sum}} x_i y_i)\)

\(\frac{\partial e}{\partial b} = 2 (a \underset{i=1}{\overset{n}{\sum}} x_i + b n - \underset{i=1}{\overset{n}{\sum}} y_i)\)

This is gradient1 below.

For the second case, the gradient can be estimated using the definition of partial derivative. This is shown in gradient2 below.

After finding the gradient using either method, the values of \(a\) and \(b\) change by \(- \eta \nabla e\).


import math
import random
import sys
import argparse

def readfile(file):
    # print(file)
    data = []
        fhd = open(file) # file handler
        return data
    for oneline in fhd:
    return data

def readfiles(f1, f2):
    x = readfile(f1)
    y = readfile(f2)
    return [x, y]

def sums(x, y):
    n = len(x)
    if (n != len(y)):
        print ("ERROR")
    sumx = 0
    sumx2 = 0
    sumy = 0
    sumxy = 0
    for ind in range(n):
        sumx = sumx + x[ind]
        sumy = sumy + y[ind]
        sumx2 = sumx2 + x[ind] * x[ind]
        sumxy = sumxy + x[ind] * y[ind]
    return [n, sumx, sumy, sumx2, sumxy]

def gradient1(a, b, n, sumx, sumy, sumx2, sumxy):
    pa = 2 * (a * sumx2 + b * sumx - sumxy)
    pb = 2 * (a * sumx + b * n - sumy)
    # convert to unit vector
    length = math.sqrt(pa * pa + pb * pb)
    ua = pa / length
    ub = pb / length
    return [ua, ub]

def gradient2(a, b, n, x, y):
    # use the definition
    # the partial derivative of function f respect to variable a is
    # (f(a + h) - f(a)) / h for small h
    h = 0.01
    error0 = 0
    errora = 0
    errorb = 0
    for ind in range(n):
        diff0 = y[ind] - (a * x[ind] + b)
        diffa = y[ind] - ((a + h) * x[ind] + b)
        diffb = y[ind] - (a * x[ind] + (b + h))
        error0 = error0 + diff0 * diff0
        errora = errora + diffa * diffa
        errorb = errorb + diffb * diffb
    pa = (errora - error0) / h
    pb = (errorb - error0) / h
    # convert to unit vector
    length = math.sqrt(pa * pa + pb * pb)
    ua = pa / length
    ub = pb / length
    return [ua, ub]

def findab(args):
    [x, y] = readfiles(args.file1, args.file2)
    method 1: know the formula for the gradient
    [n, sumx, sumy, sumx2, sumxy] = sums(x, y)
    # initial values for a and b
    a = -5
    b = 2
    eta = 0.1
    count =  1000
    while (count > 0):
        [pa, pb] = gradient1(a, b, n, sumx, sumy, sumx2, sumxy)
        a = a - eta * pa
        b = b - eta * pb
        count = count - 1
    print (a, b)
    method 2: do not know the formula
    a = -15
    b = 21
    count =  1000
    while (count > 0):
        [pa, pb] = gradient2(a, b, n, x, y)
        a = a - eta * pa
        b = b - eta * pb
        count = count - 1
    print (a, b)

def checkArgs(args = None):
  parser = argparse.ArgumentParser(description='parse arguments')
  parser.add_argument('-f1', '--file1', 
                      help = 'name of the first data file',
                      default = 'xval')
  parser.add_argument('-f2', '--file2', 
                      help = 'name of the second data file',
                      default = 'yval')
  pargs = parser.parse_args(args)
  return pargs

if __name__== "__main__":
  args = checkArgs(sys.argv[1:])

The two methods get similar results: The first method gets 3.153 and -5.187 for \(a\) and \(b\) respectively. The second method gets 3.027 and -5.192.