# Neural networks - concepts

### Why use neural nets?

Consider a classification problem where the decision boundary is non-linear as shown below:

We can represent non-linearity in a linear model by adding higher order features. However, when the original dataset already comes with a large number of features (say `100`), the number of engineered features grows as $O(n^{2})$, roughly $\frac{n^{2}}{2}$, if we want to include **quadratic features**. Thus, for an input data set with `100` features, the number of engineered features is on the order of `5000`. Fitting a model on such a data set is expensive, and the model is also likely to overfit. Furthermore, if we want to represent **cubic features**, the order increases to $O(n^{3})$.

#### Why not traditional ML?

**Image classification** is also a non-linear problem. This is because the algorithm sees images as matrices. In the graphic below, we build a training set that classifies cars from non-cars.

Each pixel in the image is now a feature. Thus a `50x50` grayscale image has `2500` features! Since the decision boundary is usually non-linear, the number of features required for a quadratic fit is about `3 million`. Trying to fit a logistic regression to this dataset is not feasible.
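The feature counts quoted above can be checked with a short calculation. This is a minimal sketch that assumes "quadratic features" means every distinct product $x_i x_j$ with $i \le j$ (and likewise for cubic terms); the function names are illustrative and not from the original notes.

```python
def quadratic_feature_count(n: int) -> int:
    """Number of distinct degree-2 terms x_i * x_j with i <= j."""
    return n * (n + 1) // 2


def cubic_feature_count(n: int) -> int:
    """Number of distinct degree-3 terms x_i * x_j * x_k with i <= j <= k."""
    return n * (n + 1) * (n + 2) // 6


# 100 input features: on the order of 5000 quadratic terms
print(quadratic_feature_count(100))
# 50x50 grayscale image (2500 pixels): roughly 3 million quadratic terms
print(quadratic_feature_count(2500))
# Cubic terms grow as O(n^3)
print(cubic_feature_count(100))
```

The counts grow as $n^{2}/2$ and $n^{3}/6$ respectively, which is why explicit polynomial feature expansion quickly becomes impractical for image-sized inputs.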

#### Why are neural nets powerful?

Neural nets mimic the biological neural nets found in animal brains. In brains, specific regions are responsible for specific functions. However, in experiments where scientists cut the signals from the ear to the sound processing region and rerouted the signals from the eyes to it, the sound processing region learned to process vision and functioned just as well as the original vision processing region. They were able to repeat this for touch as well. The animal brain is effective because each region is not a bunch of complex, special-purpose algorithms; instead, most regions are general purpose systems built to make inferences from data / signals.

Examples of this approach are use cases for differently abled people, shown below:

### Neural net representation

A physical neuron in the brain is shown below. It has a set of dendrites which act as inputs, a processing engine, and an axon which acts as the output.

ANNs model these 3 parts of the neuron as shown below. A set of inputs, multiplied by their weights, is fed to an activation function, which here is the logistic (sigmoid) function.
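As a rough illustration of a single artificial neuron, the sketch below computes a weighted sum of the inputs and passes it through the sigmoid. The names `sigmoid` and `neuron`, and the convention that `weights[0]` is the bias weight, are assumptions made for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic (sigmoid) activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(weights, inputs):
    """Output of one neuron.

    weights[0] is the bias weight (it multiplies an implicit input of 1);
    weights[1:] pair up with the entries of `inputs`.
    """
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(z)


# Hypothetical weights and inputs, chosen only for illustration
print(neuron([0.5, -1.0, 2.0], [1.0, 0.25]))
```

With all weights set to zero the weighted sum is zero, so the neuron outputs exactly `0.5`, the midpoint of the sigmoid.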

A group of neurons working together forms a neural net. The first layer is called the **input layer** and the last is called the **output layer**; any layers in between are **hidden layers**. Sometimes, the **bias** is represented as an explicit node.

**Weights in a neural net**: The graphic below shows how weights are applied in a neural net. The hypothesis function for each neuron takes the familiar $g(\theta^{T}X)$ form, where `g` is the sigmoid function and $\theta_{i,k}^{j}$ represents the weight for the `jth` layer, hidden node `i`, and input node `k`. There is always a **bias node**, which is represented with index `0`.

Thus, when you have `2` nodes in layer 1 (input) and `3` nodes in layer 2, the dimension of the weight matrix for layer 2 is `3 x (2+1)`; we add `+1` to include the bias node in the first layer. Since the weights form a matrix, we represent them with capital theta $\Theta$.

##### Vectorized implementation of forward propagation

The input features can be represented as a vector $x$:

$$ x = \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} $$

The activations of layer $j$ can be represented as $a^{(j)} = g(z^{(j)})$ where

$$ z^{(2)} = \begin{bmatrix} z_{1}^{(2)} \\ z_{2}^{(2)} \\ z_{3}^{(2)} \end{bmatrix} $$

Thus, $z^{(2)} = \Theta^{(1)}x$ and $a^{(2)} = g(z^{(2)})$. By extension, for the next layer (after prepending the bias unit $a_{0}^{(2)} = 1$ to $a^{(2)}$), $z^{(3)} = \Theta^{(2)}a^{(2)}$ and $h_{\Theta}(x) = a^{(3)} = g(z^{(3)})$.
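The vectorized equations above can be sketched in NumPy as follows. This is an illustrative example rather than the notes' own implementation: the function name `forward`, the layer sizes (3 inputs, 3 hidden units, 1 output) and the random weights are assumptions. Here `x` holds only the real inputs, and the bias entry $x_{0} = 1$ is prepended inside the function.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function g(z)."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, Theta1, Theta2):
    """Forward propagation through one hidden layer.

    x      : shape (3,)    raw inputs, without the bias entry
    Theta1 : shape (3, 4)  maps layer 1 (3 inputs + bias) to layer 2
    Theta2 : shape (1, 4)  maps layer 2 (3 units + bias) to layer 3
    """
    a1 = np.concatenate(([1.0], x))             # add bias x_0 = 1
    z2 = Theta1 @ a1                            # z^(2) = Theta^(1) x
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # a^(2) = g(z^(2)), plus bias
    z3 = Theta2 @ a2                            # z^(3) = Theta^(2) a^(2)
    return sigmoid(z3)                          # h_Theta(x) = a^(3)


rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))  # random placeholder weights, not trained
Theta2 = rng.normal(size=(1, 4))
h = forward(np.array([0.5, -1.2, 3.0]), Theta1, Theta2)
print(h.shape)
```

Each layer is a single matrix-vector product followed by an element-wise sigmoid, which is what makes the vectorized form convenient.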

##### Neural nets learn their own features

If you look at the second half of the simple neural net presented earlier, it is simply a logistic regression. Its inputs, however, are not inputs from the real world but the activations of the previous layer. Thus, a neural net can create its own input features. Because of this, it is capable of representing non-linear and higher order functions, even when the real-world input does not contain them.

### Logical operations with neurons

Neurons in neural nets build complex representations using simple condition checks. Below is an example of how the logical `AND` and `OR` operators are represented:

Then, by simply changing the weights, the same neuron can be switched to an `OR` operator:
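A minimal sketch of a single sigmoid neuron acting as `AND` and then as `OR` is given here. The weight values `(-30, 20, 20)` and `(-10, 20, 20)` are a common textbook choice and are an assumption for this example; they are not recovered from the original figures.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def neuron(theta, x1, x2):
    """theta = (bias weight, weight for x1, weight for x2)."""
    return sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)


# Assumed illustrative weights: only the bias weight differs
AND_WEIGHTS = (-30.0, 20.0, 20.0)  # z > 0 only when x1 = x2 = 1
OR_WEIGHTS = (-10.0, 20.0, 20.0)   # z > 0 when at least one input is 1

for x1 in (0, 1):
    for x2 in (0, 1):
        and_out = round(neuron(AND_WEIGHTS, x1, x2))
        or_out = round(neuron(OR_WEIGHTS, x1, x2))
        print(x1, x2, "AND:", and_out, "OR:", or_out)
```

Large weight magnitudes push the sigmoid close to `0` or `1`, so rounding the output recovers the truth table of each gate.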

Why are these useful? Many layers of such neurons can be combined to represent more complex decision boundaries such as `XOR` or `XNOR`, or even non-linear boundaries. Below is an example of how `2` layers of a NN are used to build an `XNOR` gate from `OR`, `AND`, and `NOR` gates. `XNOR` gives `1` if `x1` and `x2` are both `0` or both `1`.
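One way to sketch this two-layer construction is shown next: the hidden layer computes `AND` and `NOR` of the inputs, and the output neuron takes the `OR` of those two activations. The weight values and the helper name `unit` are assumptions chosen for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def unit(theta, inputs):
    """Sigmoid neuron; theta[0] is the bias weight."""
    return sigmoid(theta[0] + sum(t * x for t, x in zip(theta[1:], inputs)))


# Assumed illustrative weights for each gate
AND = (-30.0, 20.0, 20.0)
NOR = (10.0, -20.0, -20.0)   # fires only when both inputs are 0
OR = (-10.0, 20.0, 20.0)


def xnor(x1, x2):
    a1 = unit(AND, (x1, x2))   # hidden unit 1: x1 AND x2
    a2 = unit(NOR, (x1, x2))   # hidden unit 2: (NOT x1) AND (NOT x2)
    return unit(OR, (a1, a2))  # output: a1 OR a2


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```

No single neuron of this form can represent `XNOR`, because its decision boundary is a straight line; the hidden layer supplies the intermediate features that make the output linearly separable.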

### Multiclass classification with NN

Multiclass classification in NN is essentially a **one-vs-all** classification. The output layer has as many nodes as the number of classes. Further, the value of the output layer looks like a **one-hot** encoding.
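To make the one-hot idea concrete, the sketch below encodes a class label as a target vector and decodes a vector of output activations back into a class by picking the largest one. The function names and the sample activation values are hypothetical.

```python
def one_hot(label: int, num_classes: int) -> list:
    """Target vector with a 1 at the position of the true class."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


def predict_class(output_activations) -> int:
    """Index of the output node with the highest activation."""
    return max(range(len(output_activations)),
               key=lambda k: output_activations[k])


print(one_hot(2, 4))
# Made-up output-layer activations for a 4-class problem
print(predict_class([0.10, 0.05, 0.80, 0.20]))
```

The network's outputs are rarely exactly `0` or `1`, so taking the index of the largest activation is the usual way to turn them into a single predicted class.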

#### OCR on MNIST digits database using NN

The MNIST database has `70,000` images of handwritten digits. We work with a subset of `5000` images. Each image is `20x20` pixels. When unrolled into a single vector (which is how neural nets and logistic regression algorithms will read it), we get a `1x400` row vector. A sample of 100 images is below:

When classifying these digits, we work with `1` image at a time. This is unlike linear or logistic regression, where we would represent the whole training set as matrix `X`. Here, we treat each pixel as a feature. Thus our input layer has `400+1` nodes (1 added to represent the bias). The hidden layer from the pre-trained network has `25` nodes. The output layer should have `10` nodes to represent the `10` classes we predict.

Thus, the input layer is $x = a^{(1)}_{401x1}$. The weight matrices and activations have the following dimensions:

$$ a^{(1)} = x_{401x1} $$

$$ z^{(2)} = \Theta^{(1)}_{25x401} . a^{(1)} $$

$$ a^{(2)}_{25x1} = sigmoid(z^{(2)}) $$

We will add a bias to $a^{(2)}$ when computing the next layer, making it $a^{(2)}_{26x1}$

$$ z^{(3)} = \Theta^{(2)}_{10x26} . a^{(2)} $$

$$ a^{(3)}_{10x1} = sigmoid(z^{(3)}) $$

$$ h_{\Theta}(x) = a^{(3)}, \quad \text{predicted class} = \arg\max_{k} a^{(3)}_{k} $$

The implementation code can be seen here.
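Independently of the linked implementation, the dimensions above can be checked with a short NumPy sketch. It assumes randomly initialised weights and a random stand-in image, since the pre-trained weights and the dataset are not reproduced here, so the predicted digit is meaningless and only the shapes matter. The function name `predict_digit` is hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict_digit(x, Theta1, Theta2):
    """Forward pass for one unrolled 20x20 image.

    x      : shape (400,)    pixel values, without the bias entry
    Theta1 : shape (25, 401)
    Theta2 : shape (10, 26)
    """
    a1 = np.concatenate(([1.0], x))      # (401,) input plus bias
    a2 = sigmoid(Theta1 @ a1)            # (25,)
    a2 = np.concatenate(([1.0], a2))     # (26,) hidden layer plus bias
    a3 = sigmoid(Theta2 @ a2)            # (10,) one activation per class
    return int(np.argmax(a3)), a3


rng = np.random.default_rng(42)
Theta1 = rng.normal(scale=0.1, size=(25, 401))  # placeholder, not pre-trained
Theta2 = rng.normal(scale=0.1, size=(10, 26))   # placeholder, not pre-trained
x = rng.random(400)                             # stand-in for a real image

label, a3 = predict_digit(x, Theta1, Theta2)
print(a3.shape, label)
```

With real pre-trained weights loaded into `Theta1` and `Theta2`, the same function would return the index of the most likely class for each image; how that index maps to a digit depends on how the labels were encoded in the dataset.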