Gradient Descent
================

Learning Objectives

- Understand partial derivatives and gradients
- Understand how gradient descent can be used in optimization problems
- Apply gradient descent in machine learning

Linear Approximation with Least Square Error
--------------------------------------------

As mentioned in the previous section, supervised machine learning can be formulated as a minimization problem: minimizing the error. This chapter starts with a problem that is probably familiar to many people already: *linear approximation with least square error*.

Consider a list of :math:`n` points: :math:`(x_1, \tilde{y_1})`, :math:`(x_2, \tilde{y_2})`, ..., :math:`(x_n, \tilde{y_n})`. Please notice the convention: :math:`y = a x + b` is the underlying equation and :math:`y` is the "correct" value. It is generally not possible to get the correct value of :math:`y` due to noise and the limitations of measurement instruments. Instead, we can get only the observed :math:`y` with noise. To distinguish the two, we use :math:`\tilde{y}` to express the observed value. It may be different from the true value of :math:`y`. The problem is to find the values of :math:`a` and :math:`b` for a line :math:`y = a x + b` such that

:math:`e(a, b)= \underset{i=1}{\overset{n}{\sum}} (\tilde{y_i} - (a x_i + b))^2`

is as small as possible. This is the cumulative error. We call it :math:`e(a, b)` because it is a function of the two variables :math:`a` and :math:`b`. To keep the formulas below short, we write :math:`y_i` for the observed value :math:`\tilde{y_i}`.

Here we will solve this problem in two ways: analytically and numerically.

Analytical Method for Least Square Error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In Calculus, you have learned the concept of the derivative. Suppose :math:`f(x)` is a function of a single variable :math:`x`. The derivative of :math:`f(x)` with respect to :math:`x` is defined as

:math:`f'(x) = \frac{d}{dx} f(x) = \underset{h \rightarrow 0}{\text{lim}} \frac{f(x + h) - f(x)}{h}`

The derivative is the ratio of the change in :math:`f(x)` to the change in :math:`x`. A geometric interpretation is the slope of :math:`f(x)` at a specific point :math:`x`.

Extend that concept to a multivariable function. Suppose :math:`f(x, y)` is a function of two variables :math:`x` and :math:`y`. The partial derivative of :math:`f(x,y)` with respect to :math:`x` at point :math:`(x_0, y_0)` is defined as

:math:`\frac{\partial f}{\partial x}| _{(x_0, y_0)} = \frac{d}{dx} f(x, y_0) | _{x = x_0} =\underset{h \rightarrow 0}{\text{lim}} \frac{f(x_0 + h, y_0) - f(x_0, y_0)}{h}`

The partial derivative is the ratio of the change in :math:`f(x, y)` to the change in :math:`x` *while keeping the value of* :math:`y` *unchanged*. Similarly, the partial derivative of :math:`f(x,y)` with respect to :math:`y` at point :math:`(x_0, y_0)` is defined as

:math:`\frac{\partial f}{\partial y}| _{(x_0, y_0)} = \frac{d}{dy} f(x_0, y) | _{y = y_0} =\underset{h \rightarrow 0}{\text{lim}} \frac{f(x_0, y_0 + h) - f(x_0, y_0)}{h}`

To minimize the error function, we take the partial derivatives with respect to :math:`a` and :math:`b`:

:math:`\frac{\partial e}{\partial a} = 2 (a x_1 + b - y_1) x_1 + 2 (a x_2 + b - y_2) x_2 + ... + 2 (a x_n + b - y_n) x_n`

:math:`\frac{\partial e}{\partial a} = 2 a (x_1^2 + x_2^2 + ... + x_n^2) + 2 b (x_1 + x_2 + ... + x_n) - 2 (x_1 y_1 + x_2 y_2 + ... + x_n y_n)`
:math:`\frac{\partial e}{\partial a} = 2 (a \underset{i=1}{\overset{n}{\sum}} x_i^2 + b \underset{i=1}{\overset{n}{\sum}} x_i - \underset{i=1}{\overset{n}{\sum}} x_i y_i)`

For :math:`b`:

:math:`\frac{\partial e}{\partial b} = 2 (a x_1 + b - y_1) + 2 (a x_2 + b - y_2) + ... + 2 (a x_n + b - y_n)`

:math:`\frac{\partial e}{\partial b} = 2 a (x_1 + x_2 + ... + x_n) + 2 n b - 2 (y_1 + y_2 + ... + y_n)`

:math:`\frac{\partial e}{\partial b} = 2 (a \underset{i=1}{\overset{n}{\sum}} x_i + b n - \underset{i=1}{\overset{n}{\sum}} y_i)`

To find the minimum, set both to zero and obtain two linear equations of :math:`a` and :math:`b`:

:math:`a \underset{i=1}{\overset{n}{\sum}} x_i^2 + b \underset{i=1}{\overset{n}{\sum}} x_i = \underset{i=1}{\overset{n}{\sum}} x_i y_i`

:math:`a \underset{i=1}{\overset{n}{\sum}} x_i + b n = \underset{i=1}{\overset{n}{\sum}} y_i`

The values of :math:`a` and :math:`b` can be expressed as

:math:`a =\frac{n \underset{i=1}{\overset{n}{\sum}} x_i y_i - \underset{i=1}{\overset{n}{\sum}} x_i \underset{i=1}{\overset{n}{\sum}} y_i}{n \underset{i=1}{\overset{n}{\sum}} x_i^2 - (\underset{i=1}{\overset{n}{\sum}} x_i)^2}`

:math:`b =\frac{\underset{i=1}{\overset{n}{\sum}} y_i \underset{i=1}{\overset{n}{\sum}} x_i^2 - \underset{i=1}{\overset{n}{\sum}} x_i y_i \underset{i=1}{\overset{n}{\sum}} x_i}{n \underset{i=1}{\overset{n}{\sum}} x_i^2 - (\underset{i=1}{\overset{n}{\sum}} x_i)^2}`

Let's consider an example:

+--------------------+----------------------+
| x                  | y                    |
+====================+======================+
| 8.095698376        | 23.51683637          |
+--------------------+----------------------+
| 6.358470914        | 9.792790054          |
+--------------------+----------------------+
| -9.053869996       | -39.96474572         |
+--------------------+----------------------+
| -8.718226575       | -30.35310844         |
+--------------------+----------------------+
| 8.92002599         | 26.16662601          |
+--------------------+----------------------+
| -9.304226583       | -37.4223126          |
+--------------------+----------------------+
| -4.413344816       | -18.03547019         |
+--------------------+----------------------+
| 9.24473846         | 24.39367474          |
+--------------------+----------------------+
| 2.717746556        | -4.589498946         |
+--------------------+----------------------+
| 5.87537092         | 15.7037148           |
+--------------------+----------------------+
| -2.962047549       | -7.508042385         |
+--------------------+----------------------+
| -1.793005634       | -11.81506333         |
+--------------------+----------------------+
| -2.341379964       | -14.96321124         |
+--------------------+----------------------+
| 4.742625547        | 2.282082477          |
+--------------------+----------------------+
| -2.007598497       | -5.068305913         |
+--------------------+----------------------+
| 9.333353675        | 28.44940642          |
+--------------------+----------------------+
| 2.570708237        | 3.086379154          |
+--------------------+----------------------+
| -4.846225403       | -25.6409577          |
+--------------------+----------------------+
| 2.571789981        | 7.795844519          |
+--------------------+----------------------+
| -9.044770879       | -26.25061389         |
+--------------------+----------------------+
| 5.09385439         | 8.166196092          |
+--------------------+----------------------+
| -5.665252693       | -21.99241714         |
+--------------------+----------------------+
| 1.193065754        | 0.698347441          |
+--------------------+----------------------+
| -8.739601542       | -31.96384225         |
+--------------------+----------------------+
| -5.850434065       | -17.51926158         |
+--------------------+----------------------+
| 4.556308579        | 9.854628779          |
+--------------------+----------------------+
| -0.509866694       | -10.85684654         |
+--------------------+----------------------+
| -0.24261641        | -8.33876201          |
+--------------------+----------------------+
| 7.930407455        | 19.56805947          |
+--------------------+----------------------+
| 6.201498841        | 5.836888055          |
+--------------------+----------------------+
| -3.524341584       | -19.45328039         |
+--------------------+----------------------+
| 6.034477356        | 19.15245129          |
+--------------------+----------------------+

The pairs are plotted below:

.. figure:: gradient/figures/xy1.png

The value of :math:`y` is calculated by :math:`y = 3 x - 5 + \epsilon`; here :math:`\epsilon` is the error (or noise) and it is set to a random number between -8 and 8. The figure shows the line without noise:

.. figure:: gradient/figures/xy2.png

Using the equations, :math:`a = 3.11806` and :math:`b = -5.18776`.
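These closed-form formulas translate directly into code. Below is a minimal Python sketch (not the book's ``gradientdescent.py``); the lists ``xs`` and ``ys`` are assumed to hold the two columns of the table above.

.. code-block:: python

   def least_squares_line(xs, ys):
       """Closed-form least-squares fit of y = a*x + b."""
       n = len(xs)
       sum_x = sum(xs)
       sum_y = sum(ys)
       sum_xx = sum(x * x for x in xs)
       sum_xy = sum(x * y for x, y in zip(xs, ys))
       denom = n * sum_xx - sum_x * sum_x
       a = (n * sum_xy - sum_x * sum_y) / denom
       b = (sum_y * sum_xx - sum_xy * sum_x) / denom
       return a, b

   # xs and ys hold the data of the table above; with the full table,
   # this prints values close to a = 3.11806 and b = -5.18776.
   xs = [8.095698376, 6.358470914, -9.053869996]   # ... all 32 x values
   ys = [23.51683637, 9.792790054, -39.96474572]   # ... all 32 y values
   print(least_squares_line(xs, ys))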
Next, we explain how to solve the problem using *gradient descent*.

Gradient
--------

The *gradient* of a two-variable function :math:`f(x, y)` at point :math:`(x_0, y_0)` is defined as

:math:`\nabla f|_{(x_0, y_0)} = \frac{\partial f}{\partial x} {\bf i} + \frac{\partial f}{\partial y} {\bf j}`

Here, :math:`{\bf i}` and :math:`{\bf j}` are the unit vectors in the :math:`x` and :math:`y` directions. Suppose :math:`{\bf v} = \alpha {\bf i} + \beta {\bf j}` is a unit vector. Then the rate of change of :math:`f` in the direction of :math:`{\bf v}` is the inner product of :math:`\nabla f` and :math:`{\bf v}`. The greatest change occurs when :math:`{\bf v}` points in the direction of :math:`\nabla f`: since :math:`{\bf v}` is a unit vector, the inner product is smaller whenever its direction differs from that of :math:`\nabla f`.

One way to understand gradients is to think about speed bumps on roads.

.. figure:: gradient/figures/roadbump.png

The bump is modeled as a half cylinder. For simplicity, we assume that the bump is infinitely long in the :math:`x` direction. A point on the surface of the bump can be expressed as

:math:`p = x {\bf i} + \alpha \cos(\theta) {\bf j} + \beta \sin(\theta) {\bf k}`

Differentiating with respect to the two parameters :math:`x` and :math:`\theta` and adding the results gives

:math:`\nabla p = {\bf i} + \alpha (- \sin(\theta)) {\bf j} + \beta \cos(\theta) {\bf k}.`

This is a *tangent* vector at that point of the surface. Next, consider a vector describing movement over this bump, such as the tire of your unicycle. We consider a unicycle because it is simpler: we need to consider only one wheel. What is the rate of change along this surface? If you ride the unicycle alongside the bump without getting onto it, the vector of your movement can be expressed as

:math:`v = x {\bf i}`

How is this movement affected by the slope of the bump? The answer is given by the *inner product* of the two vectors: :math:`\nabla p \cdot v = x`. Notice that this inner product contains no :math:`\theta`. What does this mean? It means that your movement is not affected by :math:`\theta`. This is obvious because you are not riding onto the bump. Next, consider riding straight at the bump. The vector will be

:math:`v = - y {\bf j}`

The slope of the bump affects your actual movement, again through the inner product: :math:`\nabla p \cdot v = y \alpha \sin(\theta)`. How can we interpret this? The moment your tire hits the bump, :math:`\theta` is zero, so your tire's movement along the :math:`{\bf j}` direction is zero.
This is understandable because you cannot penetrate the bump. When the tire is at the top of the bump, :math:`\theta` is :math:`\frac{\pi}{2}` and the tire has the highest speed. Based on this understanding, it is easier to answer the following question: "Which direction along the surface gives the greatest change?" Because the actual change is the inner product of the direction and the gradient, the greatest change occurs along the direction of the gradient.

Gradient Descent
----------------

*Gradient descent* is a method for solving *minimization* problems. The gradient of a function at a point is the rate of change at that point. Let's consider a simple example: use gradient descent to find a line that has the smallest sum of square error. This is the same problem described above. The earlier solution uses formulas to find :math:`a` and :math:`b`. Here, we will not use the formulas.

The gradient of a function gives the direction of change:

:math:`\nabla e = \frac{\partial e}{\partial a} {\bf i} + \frac{\partial e}{\partial b} {\bf j}`

Suppose :math:`\Delta w = \alpha {\bf i} + \beta {\bf j}` is a vector. By definition, if :math:`\Delta w` is small enough, then the change in :math:`e` along the direction of :math:`\Delta w` can be calculated as :math:`\Delta e = \nabla e \cdot \Delta w`. The goal is to reduce the error. Thus, :math:`\Delta w` should be chosen to ensure that :math:`\Delta e` is negative. If we make :math:`\Delta w = - \eta \nabla e`, then :math:`\Delta e = \nabla e \cdot (- \eta \nabla e) = - \eta \| \nabla e \|^2`. This ensures that the error :math:`e` becomes smaller. The value :math:`\eta` is called the *learning rate*. Its value is usually between 0.1 and 0.5. If :math:`\eta` is too small, :math:`e` decreases very slowly, i.e., learning is slow. If :math:`\eta` is too large, the approximation :math:`\Delta e = \nabla e \cdot \Delta w` is no longer accurate and the error may even increase.

Numerical Method for Least Square Error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to find an analytical solution for :math:`a` and :math:`b` because the function :math:`e` is pretty simple. For many machine learning problems, the functions are highly complex and in many cases the functions are not even known in advance. For these problems, reducing the errors can be done numerically using data. This section is further divided into two different scenarios.

- The first assumes that we know the function :math:`e` but we do not have formulas for :math:`a` or :math:`b`.
- The second assumes that we do not know the function :math:`e` and certainly do not know the formulas for :math:`a` or :math:`b`. This is the common scenario.

For the first case,

:math:`\nabla e = \frac{\partial e}{\partial a} {\bf i} + \frac{\partial e}{\partial b} {\bf j}`

:math:`\frac{\partial e}{\partial a} = 2 (a \underset{i=1}{\overset{n}{\sum}} x_i^2 + b \underset{i=1}{\overset{n}{\sum}} x_i - \underset{i=1}{\overset{n}{\sum}} x_i y_i)`

:math:`\frac{\partial e}{\partial b} = 2 (a \underset{i=1}{\overset{n}{\sum}} x_i + b n - \underset{i=1}{\overset{n}{\sum}} y_i)`

This is ``gradient1`` below. For the second case, the gradient can be estimated using the definition of the partial derivative. This is shown in ``gradient2`` below. After finding the gradient using either method, the values of :math:`a` and :math:`b` change by :math:`- \eta \nabla e`.

.. literalinclude:: gradient/code/gradientdescent.py
   :language: python

The two methods get similar results: the first method gets 3.153 and -5.187 for :math:`a` and :math:`b` respectively. The second method gets 3.027 and -5.192.
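Since the file ``gradient/code/gradientdescent.py`` is not reproduced here, the sketch below shows one plausible shape of the two methods: ``gradient1`` uses the analytic partial derivatives and ``gradient2`` estimates them from the definition. The learning rate, step count, and starting point are illustrative assumptions; because :math:`e` sums over all 32 points, a step much smaller than the 0.1 to 0.5 range quoted above is needed here for stability.

.. code-block:: python

   import random

   def error(a, b, xs, ys):
       """Cumulative squared error e(a, b)."""
       return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

   def gradient1(a, b, xs, ys):
       """Analytic gradient of e(a, b) (first scenario)."""
       n = len(xs)
       da = 2 * (a * sum(x * x for x in xs) + b * sum(xs)
                 - sum(x * y for x, y in zip(xs, ys)))
       db = 2 * (a * sum(xs) + b * n - sum(ys))
       return da, db

   def gradient2(a, b, xs, ys, h=1e-6):
       """Numeric gradient from the definition of the partial derivative
       (second scenario)."""
       e0 = error(a, b, xs, ys)
       da = (error(a + h, b, xs, ys) - e0) / h
       db = (error(a, b + h, xs, ys) - e0) / h
       return da, db

   def descend(xs, ys, grad, eta=0.0001, steps=10000):
       a, b = random.random(), random.random()   # arbitrary starting point
       for _ in range(steps):
           da, db = grad(a, b, xs, ys)
           a -= eta * da                         # move against the gradient
           b -= eta * db
       return a, b

With the data above, both ``descend(xs, ys, gradient1)`` and ``descend(xs, ys, gradient2)`` should land near :math:`a = 3.1` and :math:`b = -5.2`.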
Numerical Method for Least Square Error of Quadratic Function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The examples above use a linear function :math:`y = a x + b`. Now, let's consider a more complex example: a quadratic function :math:`y = a x^2 + b x + c`. The error function has three parameters: :math:`a`, :math:`b`, and :math:`c`.

:math:`e(a, b, c)= \underset{i=1}{\overset{n}{\sum}} (y_i - (a x_i ^ 2 + b x_i + c))^2`

The value of :math:`y` is calculated by :math:`y = 3 x^2 - 5 x + 4 + \epsilon`; here :math:`\epsilon` is the error (or noise) and it is set to a random number between -8 and 8.

+--------------------+----------------------+
| x                  | y                    |
+====================+======================+
| -3.370103805       | 60.79185325          |
+--------------------+----------------------+
| 0.417414305        | -5.323619187         |
+--------------------+----------------------+
| -6.159102555       | 154.3634865          |
+--------------------+----------------------+
| -4.407403766       | 82.5948879           |
+--------------------+----------------------+
| 9.217171131        | 211.5745285          |
+--------------------+----------------------+
| -7.174813026       | 188.278774           |
+--------------------+----------------------+
| 6.361795959        | 88.5112383           |
+--------------------+----------------------+
| 1.79223335         | 11.99774387          |
+--------------------+----------------------+
| -1.926842791       | 26.55005374          |
+--------------------+----------------------+
| 3.11961916         | 12.0001045           |
+--------------------+----------------------+
| 3.407105822        | 25.23458989          |
+--------------------+----------------------+
| 3.516276644        | 27.86258459          |
+--------------------+----------------------+
| -6.446048868       | 153.4091251          |
+--------------------+----------------------+
| -1.611122918       | 16.29112594          |
+--------------------+----------------------+
| -9.423197392       | 320.9271831          |
+--------------------+----------------------+
| -0.163191843       | 1.413342642          |
+--------------------+----------------------+
| -7.984372477       | 237.3069972          |
+--------------------+----------------------+
| -1.083286108       | 20.77621844          |
+--------------------+----------------------+
| -1.721374317       | 28.35361787          |
+--------------------+----------------------+
| 9.307151676        | 211.0397578          |
+--------------------+----------------------+
| 2.847589125        | 15.49036578          |
+--------------------+----------------------+
| -9.704318719       | 333.9723539          |
+--------------------+----------------------+
| 6.652947501        | 106.6501811          |
+--------------------+----------------------+
| 1.286261333        | -2.972412887         |
+--------------------+----------------------+
| 6.730248976        | 104.7116952          |
+--------------------+----------------------+
| 4.198544935        | 29.78821666          |
+--------------------+----------------------+
| 8.730545018        | 183.571542           |
+--------------------+----------------------+
| -5.944582098       | 142.8709084          |
+--------------------+----------------------+
| 3.596351215        | 19.40151             |
+--------------------+----------------------+
| -5.606758376       | 132.9978749          |
+--------------------+----------------------+
| -7.680210557       | 215.0928846          |
+--------------------+----------------------+
| 1.852911853        | 2.423800574          |
+--------------------+----------------------+

The pairs are plotted below:

.. figure:: gradient/figures/quadratic.png

The program for calculating :math:`a`, :math:`b`, and :math:`c` needs only slight changes from the previous program:

.. literalinclude:: gradient/code/gradientdescent2.py
   :language: python
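The required change is small: the gradient now has three components, one per parameter. A minimal sketch of the numeric estimate (the analytic version changes analogously; ``error_q`` and ``gradient_q`` are illustrative names, not the book's code):

.. code-block:: python

   def error_q(a, b, c, xs, ys):
       """Cumulative squared error for y = a*x**2 + b*x + c."""
       return sum((y - (a * x * x + b * x + c)) ** 2 for x, y in zip(xs, ys))

   def gradient_q(a, b, c, xs, ys, h=1e-6):
       """Numeric gradient: one partial derivative per parameter."""
       e0 = error_q(a, b, c, xs, ys)
       da = (error_q(a + h, b, c, xs, ys) - e0) / h
       db = (error_q(a, b + h, c, xs, ys) - e0) / h
       dc = (error_q(a, b, c + h, xs, ys) - e0) / h
       return da, db, dc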
Running this program ten times yields the following results:

+--------------------+----------------------+----------------------+
| a                  | b                    | c                    |
+====================+======================+======================+
| 3.011255896382324  | -5.237109053403006   | 4.587766159787751    |
+--------------------+----------------------+----------------------+
| 2.9118463569721067 | -5.235739889402345   | 4.558316574435434    |
+--------------------+----------------------+----------------------+
| 2.913418168901663  | -5.236030395939285   | 4.611313910795599    |
+--------------------+----------------------+----------------------+
| 3.010289345246739  | -5.237423283136554   | 4.6514776776523155   |
+--------------------+----------------------+----------------------+
| 3.0116767376997156 | -5.2370171800113     | 4.568807336762989    |
+--------------------+----------------------+----------------------+
| 3.0116145023680554 | -5.237035419533481   | 4.572520167800626    |
+--------------------+----------------------+----------------------+
| 2.911538900320818  | -5.2359055455148615  | 4.591420135857997    |
+--------------------+----------------------+----------------------+
| 2.908676740232057  | -5.2363349865534214  | 4.682182248171528    |
+--------------------+----------------------+----------------------+
| 2.911471635130624  | -5.235840824838787   | 4.5789355660578455   |
+--------------------+----------------------+----------------------+
| 2.911385934228692  | -5.235874673892769   | 4.585754511975301    |
+--------------------+----------------------+----------------------+

The calculated values are pretty consistent.

Gradient Descent for Neural Networks
------------------------------------

Three-Layer Neural Networks
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Neural networks are the foundation of the recent impressive progress in machine learning. The problems described so far have pretty clear mathematical foundations (for example, linear approximation or quadratic approximation). More complex problems can be difficult to express mathematically. Let's consider an example from our everyday life: How do you describe a "chair"? "Simple", you may say. You define "A chair has four legs with a flat surface on which a person can sit, like this figure":

.. figure:: gradient/figures/chair1.jpg

If that is a chair, how about this one? "Well, that is also a chair," you may say.

.. figure:: gradient/figures/chair2.jpg

You change your answer to "A chair has two or four legs with a flat surface on which a person can sit." There are still problems: How about the following two? Are they chairs?

.. figure:: gradient/figures/chair3.png

.. figure:: gradient/figures/chair4.jpg

As you can see, it is not so easy to define a chair. The same problem occurs in many other cases. How do you define a car? You may say, "A car is a vehicle with four wheels that can transport people." Some cars have only three wheels. Some cars have six wheels. Some cars do not transport people. Instead of giving a definition, another approach is to give many examples and a function complex enough that it can learn the common characteristics of these examples. Neural networks are complex *non-linear* functions and they have the capability to learn from many examples. More details about neural networks will be covered in later chapters. This chapter gives only a few simple examples showing how gradient descent can be used to learn.
Neural networks use one particular architecture for computing. This architecture is inspired by animal brains, where billions of neurons are connected. This book does not intend to explain how neural networks resemble brains, nor the biological experiments discovering the mechanisms of brain functions. Instead, this section solves relatively simple problems (creating logic gates) using this particular architecture.

The following figure is an example of a neural network with three layers: an input layer, an output layer, and a hidden layer. A layer is hidden simply because it is neither the input layer nor the output layer.

.. figure:: gradient/figures/layers.png

   Neural network of three layers

In this example, the input layer has two neurons (neurons 0-1), the hidden layer has four neurons (neurons 0-3), and the output layer has two neurons (neurons 0-1). Prior studies show that a neural network with only two layers (input and output) and no hidden layer has limited capability. Thus, almost all neural networks have hidden layers. Some complex networks can have dozens or even hundreds of hidden layers. It has been shown that *deep* neural networks can be effective at learning complex tasks; here, deep means the networks have many hidden layers.

Information moves from the layers on the left toward the layers on the right. There is no feedback of information, nor do two neurons of the same layer exchange information. Different layers may have different numbers of neurons.

.. figure:: gradient/figures/forward.png

   Weights connect neurons of adjacent layers. These are the only connections between neurons.

Please notice that the numbers of neurons are 2, 4, and 2 in the three layers. The neurons in two adjacent layers are connected by weights, expressed as a three-dimensional array: :math:`weights[l][s][d]` connects a neuron in layer :math:`l` to a neuron in layer :math:`l + 1`; :math:`s` is the index of the source neuron and :math:`d` is the index of the destination neuron. It is also written as :math:`w_{l, s, d}`. Please notice that the indexes start from zero.

.. figure:: gradient/figures/weights.png

   Definition of the indexes for weights

In the example, the left weight :math:`weights[0][1][2]` has the first index 0 because it connects the input layer (layer 0) with the hidden layer (layer 1). From the top, the source is the second neuron of the input layer; thus, the second index is 1. The destination is the third neuron in the hidden layer; thus, the third index is 2. The right weight is :math:`weights[1][3][0]`. The first index is 1 because it is from the hidden layer (layer 1). The second index is 3 because the source is the fourth neuron from the top. The destination is the first neuron in the output layer; thus, the third index is 0.

Each neuron performs a relatively simple calculation, as shown below.

.. figure:: gradient/figures/neuron.png

   Computation performed by one neuron

A neuron in the input layer has only one input (only :math:`x_i`) and the weight is always 1. A neuron in the hidden or the output layer has multiple inputs. Each neuron performs three stages of computation. First, the products of the inputs and the weights are added: :math:`\underset{k = 0}{\overset{n-1}{\Sigma}} x_k w_k`. Next, a constant called the *bias* is added: :math:`b + \underset{k = 0}{\overset{n-1}{\Sigma}} x_k w_k`. The value is then passed to an *activation function* :math:`f`. This must be a non-linear function because linear functions are too limited in their ability to handle complex situations. A code sketch of these three stages follows.
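A minimal sketch of the three stages (``neuron_output`` is an illustrative name; the activation used here is the sigmoid function defined in the next paragraphs):

.. code-block:: python

   import math

   def neuron_output(inputs, weights, bias):
       """One neuron: weighted sum, add the bias, apply the activation."""
       s = bias + sum(x * w for x, w in zip(inputs, weights))   # stages 1 and 2
       return 1.0 / (1.0 + math.exp(-s))                        # stage 3: sigmoid

   # Matches the worked example below: prints approximately 0.7491342.
   print(neuron_output([0.99, 0.01], [0.8, 0.2], 0.3))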
.. note::
   The world is not linear. Imagine that you really love chocolate. If you eat one piece of chocolate, you feel pretty good. If you eat two pieces of chocolate, you feel even better. Does this mean you feel better and better as you eat more and more chocolate? No. After you eat a lot of chocolate, you may feel that it is enough and want to stop. The same situation applies to many other things. Imagine you get paid $20 an hour. If you work for two hours, you get $40. Can you make more and more money by working longer and longer? No. First, each day has only 24 hours. You can make at most $480 per day if you do not sleep. And if you do not sleep, you will soon fall ill and have to stop working.

Several activation functions are popular. This section uses the *sigmoid* function, expressed as :math:`\sigma(x)`, one of the most widely used:

:math:`\sigma(x) = \frac{1}{1 + e^{-x}}`

Because :math:`e^{-x} > 0`, :math:`\sigma(x)` is always between 0 and 1. The :math:`\sigma` function has a special property: the derivative of :math:`\sigma(x)` can be calculated easily:

.. math::
   :nowrap:

   \begin{eqnarray}
   & \frac{d \sigma(x)}{d x} \\
   & = \frac{d}{dx} (\frac{1}{1 + e^{-x}}) \\
   & = - \frac{1}{(1 + e^{-x})^2} \frac{d}{dx} (1 + e^{-x}) \\
   & = \frac{e^{-x}}{(1 + e^{-x})^2} \\
   & = \frac{1}{1 + e^{-x}} \frac{e^{-x}}{1 + e^{-x}} \\
   & = \frac{1}{1 + e^{-x}} \frac{1 + e^{-x} - 1}{1 + e^{-x}} \\
   & = \sigma(x) (1 - \sigma(x)).
   \end{eqnarray}

What does this mean? It means that we can calculate :math:`\frac{d \sigma(x)}{d x}` if we know :math:`\sigma(x)`, *without* knowing the value of :math:`x`. If the value of :math:`\sigma(x)` is :math:`v` for a particular value of :math:`x`, then the derivative is :math:`v (1-v)`.

Consider the following neuron. It has two inputs with values 0.99 and 0.01 respectively. The weights are 0.8 and 0.2. The bias is 0.3.

.. figure:: gradient/figures/neuralnet1.png

   Example to calculate a neuron's output

The value of :math:`s` is 0.3 + 0.99 :math:`\times` 0.8 + 0.01 :math:`\times` 0.2 = 1.094. After the *sigmoid* function, this neuron's output is :math:`\frac{1}{1+ e^{-1.094}} = 0.7491342`.

Next, consider a complete neural network with all weights and biases:

.. figure:: gradient/figures/neuralnet2.png

   Example to calculate a neural network's outputs

The following table shows how to calculate the outputs of the four neurons in the hidden layer:

+--------+--------+--------+------+--------+--------------------+
| neuron | input  | weight | bias | sum    | output             |
+========+========+========+======+========+====================+
| 0      | 0.99   | 0.7    | 0.1  | 0.798  | 0.689546499265968  |
|        +--------+--------+      |        |                    |
|        | 0.01   | 0.5    |      |        |                    |
+--------+--------+--------+------+--------+--------------------+
| 1      | 0.99   | 0.8    | 0.3  | 1.094  | 0.749134199078648  |
|        +--------+--------+      |        |                    |
|        | 0.01   | 0.2    |      |        |                    |
+--------+--------+--------+------+--------+--------------------+
| 2      | 0.99   | 0.8    | 0.5  | 1.2955 | 0.785076666300568  |
|        +--------+--------+      |        |                    |
|        | 0.01   | 0.35   |      |        |                    |
+--------+--------+--------+------+--------+--------------------+
| 3      | 0.99   | 0.6    | 0.4  | 0.9965 | 0.730369880613158  |
|        +--------+--------+      |        |                    |
|        | 0.01   | 0.25   |      |        |                    |
+--------+--------+--------+------+--------+--------------------+

How are these numbers calculated? This is the procedure for the first neuron:

:math:`s = 0.1 + 0.99 \times 0.7 + 0.01 \times 0.5 = 0.798`

:math:`\frac{1}{1 + e^{-0.798}} = 0.689546499265968`

The next table shows how the outputs of the output layer are calculated:

+--------+--------------------+--------+------+------------------+--------------------+
| neuron | input              | weight | bias | sum              | output             |
+========+====================+========+======+==================+====================+
| 0      | 0.689546499265968  | 0.33   | 0.27 | 1.5473332709399  | 0.824528239528756  |
|        +--------------------+--------+      |                  |                    |
|        | 0.749134199078648  | 0.16   |      |                  |                    |
|        +--------------------+--------+      |                  |                    |
|        | 0.785076666300568  | 0.31   |      |                  |                    |
|        +--------------------+--------+      |                  |                    |
|        | 0.730369880613158  | 0.94   |      |                  |                    |
+--------+--------------------+--------+------+------------------+--------------------+
| 1      | 0.689546499265968  | 0.28   | 0.45 | 1.75723688345954 | 0.852863261520065  |
|        +--------------------+--------+      |                  |                    |
|        | 0.749134199078648  | 0.76   |      |                  |                    |
|        +--------------------+--------+      |                  |                    |
|        | 0.785076666300568  | 0.48   |      |                  |                    |
|        +--------------------+--------+      |                  |                    |
|        | 0.730369880613158  | 0.23   |      |                  |                    |
+--------+--------------------+--------+------+------------------+--------------------+

The outputs of the two neurons are 0.824528239528756 and 0.852863261520065.
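The tables can be verified with a short script. The sketch below hard-codes the weights and biases read off the figure above, using the ``weights[l][s][d]`` convention; it is an illustration, not the book's code.

.. code-block:: python

   import math

   def sigmoid(s):
       return 1.0 / (1.0 + math.exp(-s))

   def forward(layer_inputs, weights, biases):
       """One layer: for each destination neuron, weighted sum + bias + sigmoid."""
       outputs = []
       for d in range(len(biases)):
           s = biases[d] + sum(layer_inputs[src] * weights[src][d]
                               for src in range(len(layer_inputs)))
           outputs.append(sigmoid(s))
       return outputs

   # weights[l][s][d]: layer l, source neuron s, destination neuron d
   weights = [
       [[0.7, 0.8, 0.8, 0.6],      # from input neuron 0
        [0.5, 0.2, 0.35, 0.25]],   # from input neuron 1
       [[0.33, 0.28],              # from hidden neuron 0
        [0.16, 0.76],              # from hidden neuron 1
        [0.31, 0.48],              # from hidden neuron 2
        [0.94, 0.23]],             # from hidden neuron 3
   ]
   biases = [[0.1, 0.3, 0.5, 0.4], [0.27, 0.45]]

   hidden = forward([0.99, 0.01], weights[0], biases[0])
   output = forward(hidden, weights[1], biases[1])
   print(output)   # approximately [0.8245282..., 0.8528632...]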
Gradient Descent in Neural Networks (Backpropagation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

How can the concept of gradient descent be used to make neural networks produce the desired outputs? The answer is to adjust the weights and the biases. More precisely, the adjustment starts from the outputs and moves toward the inputs; this is called *back propagation*. For the three-layer network shown above, the changes start from :math:`weights[1]` based on the differences between the desired and the actual outputs. After adjusting :math:`weights[1]`, the method adjusts :math:`weights[0]` between the hidden layer and the input layer.

When the actual outputs are different from the expected outputs, the difference is called the *error*, usually expressed as :math:`E`. Consider the following definition of error:

:math:`E = \underset{k=0}{\overset{t-1}{\sum}} \frac{1}{2} (exv_k - aco_k)^2`

Here :math:`exv_k` is the expected value and :math:`aco_k` is the actual output of the :math:`k^{th}` neuron (among :math:`t` neurons). This definition adds the squares of the errors from all output neurons. Do not worry about the constant :math:`\frac{1}{2}`; it is there for convenience and will be cancelled later.

*Gradient descent* can also be used to reduce the errors. The process is more complex than the previous examples because neural networks are more complex: we can change the weights, but their effects go through the non-linear activation function. It is not easy to express the relationships between the error and the weights. In other words, the effects of weights go through a composition of functions (additions first and then the non-linear function).

Chain Rule
""""""""""

Before explaining how to adjust the weights, let's review function composition. Consider two functions :math:`f(x)` and :math:`g(x)`. We can create a *composition of these two functions*. For example, suppose :math:`f(x) = 3 x + 8` and :math:`g(x) = (x + 1) ^ 2`. The composition of :math:`f(x)` and :math:`g(x)` can be written as

:math:`(f \circ g)(x) = f(g(x)) = f((x+1)^2) = 3 (x + 1) ^ 2 + 8 = 3 x^2 + 6 x + 11.`

In this case, :math:`g(x)` is applied first and then :math:`f(x)`. It is also possible to switch the order of the composition, written differently:

:math:`(g \circ f)(x) = g(f(x)) = g(3x + 8) = ((3x + 8) + 1) ^ 2 = 9 x^2 + 54 x + 81.`

In this case, :math:`f(x)` is applied first and then :math:`g(x)`. What happens if we want to calculate the effect of a change in :math:`x` on the composite function :math:`(f \circ g)(x)`? We need to apply the *chain rule*:

:math:`(f \circ g)'(x) = f'(g(x)) \cdot g'(x).`

Let's validate this using the example: :math:`(f \circ g)(x) = 3 x ^ 2 + 6 x + 11`; thus, :math:`(f \circ g)'(x) = 6 x + 6`. Next, we calculate :math:`f'(x) = 3` and :math:`g'(x) = 2 x + 2`. Then

:math:`(f \circ g)'(x) = f'(g(x)) \cdot g'(x) = 3 (2 x + 2) = 6x + 6.`
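The chain rule result is easy to check numerically by comparing the analytic derivative :math:`6x + 6` with a finite-difference estimate:

.. code-block:: python

   def f(x):
       return 3 * x + 8

   def g(x):
       return (x + 1) ** 2

   def composite(x):
       return f(g(x))

   h = 1e-6
   x = 2.0
   numeric = (composite(x + h) - composite(x)) / h
   analytic = 6 * x + 6       # from the chain rule derivation above
   print(numeric, analytic)   # both close to 18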
Error and Weights between Output and Hidden Layers
""""""""""""""""""""""""""""""""""""""""""""""""""

.. figure:: gradient/figures/threelayers.png

   Definitions of the symbols

Symbols and their meanings:

+-------------------+-----------------------------------------------------------------+
| symbol            | meaning                                                         |
+===================+=================================================================+
| :math:`r`         | number of neurons in the input layer                            |
+-------------------+-----------------------------------------------------------------+
| :math:`s`         | number of neurons in the hidden layer                           |
+-------------------+-----------------------------------------------------------------+
| :math:`t`         | number of neurons in the output layer                           |
+-------------------+-----------------------------------------------------------------+
| :math:`i`         | index of a neuron in the input layer, :math:`0 \le i < r`       |
+-------------------+-----------------------------------------------------------------+
| :math:`j`         | index of a neuron in the hidden layer, :math:`0 \le j < s`      |
+-------------------+-----------------------------------------------------------------+
| :math:`k`         | index of a neuron in the output layer, :math:`0 \le k < t`      |
+-------------------+-----------------------------------------------------------------+
| :math:`\beta_j`   | bias of the :math:`j^{th}` neuron in the hidden layer           |
+-------------------+-----------------------------------------------------------------+
| :math:`b_k`       | bias of the :math:`k^{th}` neuron in the output layer           |
+-------------------+-----------------------------------------------------------------+
| :math:`u_i`       | input to the :math:`i^{th}` neuron in the input layer           |
+-------------------+-----------------------------------------------------------------+
| :math:`x_j`       | input to the :math:`j^{th}` neuron in the hidden layer          |
+-------------------+-----------------------------------------------------------------+
| :math:`y_j`       | output of the :math:`j^{th}` neuron in the hidden layer         |
+-------------------+-----------------------------------------------------------------+
| :math:`z_k`       | input to the :math:`k^{th}` neuron in the output layer          |
+-------------------+-----------------------------------------------------------------+
| :math:`aco_k`     | output of the :math:`k^{th}` neuron in the output layer         |
+-------------------+-----------------------------------------------------------------+
| :math:`exv_k`     | expected value of the :math:`k^{th}` neuron in the output layer |
+-------------------+-----------------------------------------------------------------+
|:math:`\psi_{i,j}` | weight connecting the :math:`i^{th}` input neuron and the       |
|                   | :math:`j^{th}` hidden neuron                                    |
+-------------------+-----------------------------------------------------------------+
|:math:`w_{j, k}`   | weight connecting the :math:`j^{th}` hidden neuron and the      |
|                   | :math:`k^{th}` output neuron                                    |
+-------------------+-----------------------------------------------------------------+
| :math:`\eta`      | learning rate                                                   |
+-------------------+-----------------------------------------------------------------+
Here are the equations expressing the relationships among these terms:

Error: :math:`E = \underset{k=0}{\overset{t-1}{\sum}} \frac{1}{2} (aco_k - exv_k)^2`

Actual output of the :math:`k^{th}` output neuron: :math:`aco_k = \sigma(z_k)`

Input to the :math:`k^{th}` output neuron: :math:`z_k = b_k + \underset{j=0}{\overset{s-1}{\sum}} y_j w_{j,k}`

Output of the :math:`j^{th}` hidden neuron: :math:`y_j = \sigma(x_j)`

Input to the :math:`j^{th}` hidden neuron: :math:`x_j = \beta_j + \underset{i=0}{\overset{r-1}{\sum}} u_i \psi_{i,j}`

Now, we can apply the chain rule to calculate the relationship between the error and a weight or a bias. Consider the output of a particular neuron. The expected value is a constant and its derivative is zero. Thus, we can ignore :math:`\frac{\partial exv_k}{\partial w_{j,k}}`. Instead, we need to worry about only the relationship between the actual output :math:`aco_k` and a weight :math:`w_{j,k}`:

:math:`\frac{\partial E}{\partial w_{j,k}} = \frac{\partial E}{\partial aco_k} \frac{\partial aco_k}{\partial w_{j,k}}`

:math:`\frac{\partial E}{\partial aco_k} = (aco_k - exv_k)`, because :math:`aco_k` does not depend on any other neuron in the output layer. As you can see, the constant :math:`\frac{1}{2}` has been cancelled. Applying the chain rule again:

:math:`\frac{\partial aco_k}{\partial w_{j,k}} = \frac{\partial aco_k}{\partial z_k} \frac{\partial z_k}{\partial w_{j,k}}`

As explained earlier, :math:`\frac{\partial aco_k}{\partial z_k} = \sigma(z_k)(1 - \sigma(z_k)) = aco_k (1 - aco_k)`. Moreover, :math:`\frac{\partial z_k}{\partial w_{j,k}} = y_j`. Now, we can put everything together:

:math:`\frac{\partial E}{\partial w_{j,k}} = (aco_k - exv_k) aco_k (1 - aco_k) y_j`

Using gradient descent, we change the weight by

:math:`\Delta w_{j,k} = - \eta \frac{\partial E}{\partial w_{j,k}} = - \eta (aco_k - exv_k) aco_k (1 - aco_k) y_j`

Similarly, :math:`\frac{\partial E}{\partial b_k} = (aco_k - exv_k) aco_k (1 - aco_k)`. To change the bias:

:math:`\Delta b_k = - \eta \frac{\partial E}{\partial b_k} = - \eta (aco_k - exv_k) aco_k (1 - aco_k)`
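In code, these update rules are one line per weight and per bias. A minimal sketch, assuming ``y``, ``aco``, and ``exv`` come from a forward pass such as the one sketched earlier:

.. code-block:: python

   def update_output_layer(w, b, y, aco, exv, eta):
       """Apply dw = -eta * (aco_k - exv_k) * aco_k * (1 - aco_k) * y_j
       to every weight w[j][k] and bias b[k], in place."""
       t = len(b)            # number of output neurons
       s = len(y)            # number of hidden neurons
       for k in range(t):
           delta = (aco[k] - exv[k]) * aco[k] * (1 - aco[k])
           for j in range(s):
               w[j][k] -= eta * delta * y[j]
           b[k] -= eta * delta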
Error and Weights between Hidden and Input Layers
"""""""""""""""""""""""""""""""""""""""""""""""""

Next, we adjust the weights between the input and the hidden layers. The difficulty with the hidden layer is that we do not have expected values for its neurons. Instead, we must rely on the expected values at the output layer. This is called *back propagation*: propagating the expected values from the output layer to the hidden layer in order to adjust the weights.

Consider the weight :math:`\psi_{i,j}` between the :math:`i^{th}` neuron of the input layer and the :math:`j^{th}` neuron of the hidden layer. This weight may affect every neuron in the output layer. Continuing from the previous derivation, the output of the neuron :math:`y_j` depends on the values of the input layer, the weights between the input layer and the hidden layer, and the bias. To avoid confusion, here :math:`\psi`, instead of :math:`w`, is used to express a weight between the input and the hidden layers. How does the weight :math:`\psi_{i,j}` affect the error? We can apply the chain rule again:

:math:`\frac{\partial E}{\partial \psi_{i,j}} = \frac{\partial }{\partial \psi_{i,j}} \underset{k=0}{\overset{t-1}{\sum}} \frac{1}{2} (aco_k - exv_k)^2 = \underset{k=0}{\overset{t-1}{\sum}} (aco_k - exv_k) \frac{\partial aco_k}{\partial \psi_{i,j}}`

Our next step is to calculate :math:`\frac{\partial aco_k}{\partial \psi_{i,j}}`. We apply the chain rule again:

:math:`\frac{\partial aco_k}{\partial \psi_{i,j}} = \frac{\partial aco_k}{\partial z_k} \frac{\partial z_k}{\partial \psi_{i,j}} = aco_k (1 - aco_k) \frac{\partial z_k}{\partial \psi_{i,j}}`

:math:`\frac{\partial z_k}{\partial \psi_{i,j}} = \frac{\partial z_k}{\partial y_j} \frac{\partial y_j}{\partial \psi_{i,j}} = w_{j, k} \frac{\partial y_j}{\partial \psi_{i,j}}`

Applying the chain rule once more:

:math:`\frac{\partial y_j}{\partial \psi_{i,j}} = \frac{\partial y_j}{\partial x_j} \frac{\partial x_j}{\partial \psi_{i,j}} = y_j (1 - y_j) \frac{\partial x_j}{\partial \psi_{i,j}}`

Finally, :math:`\frac{\partial x_j}{\partial \psi_{i,j}} = u_i`. Putting everything together:

:math:`\frac{\partial E}{\partial \psi_{i,j}} = \underset{k=0}{\overset{t-1}{\sum}} (aco_k - exv_k) aco_k (1 - aco_k) w_{j, k} y_j (1 - y_j) u_i`

Using gradient descent, we change the weight by

:math:`\Delta \psi_{i,j} = -\eta \underset{k=0}{\overset{t-1}{\sum}} (aco_k - exv_k) aco_k (1 - aco_k) w_{j, k} y_j (1 - y_j) u_i`

Similarly,

:math:`\Delta \beta_j = -\eta \underset{k=0}{\overset{t-1}{\sum}} (aco_k - exv_k) aco_k (1 - aco_k) w_{j, k} y_j (1 - y_j)`

Neural Networks as Logic Gates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After understanding the operations of neural networks, it is time to use neural networks to do something useful. This example considers building a two-input logic gate. Each gate receives either false (also represented by 0) or true (also represented by 1). Commonly used gates are shown in the table below:

+---------+---------+-----+----+-----+
| input 0 | input 1 | AND | OR | XOR |
+=========+=========+=====+====+=====+
| 0       | 0       | 0   | 0  | 0   |
+---------+---------+-----+----+-----+
| 0       | 1       | 0   | 1  | 1   |
+---------+---------+-----+----+-----+
| 1       | 0       | 0   | 1  | 1   |
+---------+---------+-----+----+-----+
| 1       | 1       | 1   | 1  | 0   |
+---------+---------+-----+----+-----+

Prior studies have shown that it is not possible to make an XOR gate without any hidden layer. We won't talk about that here. Instead, we focus on how to use gradient descent to create logic gates.
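As a preview, the forward pass and the two update rules derived above are already enough to train a small network. The sketch below trains an illustrative 2-4-1 network on the XOR table; the hidden-layer size, learning rate, seed, and iteration count are arbitrary choices, not the book's code, and XOR training can occasionally stall in a local minimum (re-run with another seed if it does).

.. code-block:: python

   import math
   import random

   def sigmoid(s):
       return 1.0 / (1.0 + math.exp(-s))

   # 2 inputs, 4 hidden neurons, 1 output neuron; random initial values
   random.seed(1)
   psi = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # psi[i][j]
   beta = [random.uniform(-1, 1) for _ in range(4)]                     # hidden biases
   w = [[random.uniform(-1, 1)] for _ in range(4)]                      # w[j][k], k = 0
   b = [random.uniform(-1, 1)]                                          # output bias
   eta = 0.5

   data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]  # XOR

   for _ in range(20000):
       for u, exv in data:
           # forward pass
           y = [sigmoid(beta[j] + sum(u[i] * psi[i][j] for i in range(2)))
                for j in range(4)]
           aco = [sigmoid(b[k] + sum(y[j] * w[j][k] for j in range(4)))
                  for k in range(1)]
           # gradients, computed before any update
           d_out = [(aco[k] - exv[k]) * aco[k] * (1 - aco[k]) for k in range(1)]
           d_hid = [sum(d_out[k] * w[j][k] for k in range(1)) * y[j] * (1 - y[j])
                    for j in range(4)]
           # adjust weights[1] (hidden -> output) first, then weights[0]
           for k in range(1):
               for j in range(4):
                   w[j][k] -= eta * d_out[k] * y[j]
               b[k] -= eta * d_out[k]
           for j in range(4):
               for i in range(2):
                   psi[i][j] -= eta * d_hid[j] * u[i]
               beta[j] -= eta * d_hid[j]

   for u, exv in data:
       y = [sigmoid(beta[j] + sum(u[i] * psi[i][j] for i in range(2)))
            for j in range(4)]
       out = sigmoid(b[0] + sum(y[j] * w[j][0] for j in range(4)))
       print(u, round(out, 3))   # should approach 0, 1, 1, 0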