Despite learning about neural networks in a variety of contexts, it was never clear to me why they are good models. After a bit of digging I gained some important insights, which I wanted to preserve. What follows is a short explanation with some valuable links.

Imagine you have some continuous function f(x) with a funky shape. You could approximate it with a bunch of rectangles of appropriate widths and heights. It turns out that a small number of neurons with non-linear activations can generate a rectangular function, and combining many such groups of neurons in a single layer gives a function that is a sum of rectangles. A sum of rectangles is exactly what we said can approximate any continuous function!

A nice, concrete example: a single hidden layer of 4 neurons with ReLU activations can create a rectangle function f(x), where f(x) has a value of 1 (or any other height) over the interval [a, b] and is 0 elsewhere. This is shown here.
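As a minimal sketch of that construction (my own NumPy code, not from the post; the ramp width `eps` is a choice I made): two ReLU units form a sharp ramp up at a, and two more form a matching ramp down at b. Their difference is approximately a rectangle.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def rectangle(x, a, b, height=1.0, eps=1e-3):
    """Approximately `height` on [a, b], approximately 0 elsewhere.

    Built from 4 ReLU units: (relu(x-a) - relu(x-a-eps)) / eps ramps
    from 0 to 1 over [a, a+eps]; the second pair ramps back down at b.
    """
    up = (relu(x - a) - relu(x - (a + eps))) / eps
    down = (relu(x - b) - relu(x - (b + eps))) / eps
    return height * (up - down)

x = np.linspace(-1.0, 2.0, 1000)
y = rectangle(x, a=0.0, b=1.0)  # ~1 on [0, 1], ~0 elsewhere
```

Smaller `eps` makes the rectangle's edges sharper; the approximation is exact everywhere except the two narrow ramps.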

ReLU is one popular choice of (non-linear) activation, and its non-linearity is what lets us combine neurons into something like a rectangular function (which is non-linear). If we limited ourselves to neurons with linear activations, any combination of them would still be a linear function, and a linear function cannot look like a rectangle or be used to approximate arbitrary continuous functions.
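The "stacked linear layers collapse to a single linear map" point can be checked directly. This is a small demo of my own (random weights, arbitrary layer sizes): two layers with identity activation compose into one affine function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with linear (identity) activation: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 1)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def two_linear_layers(x):
    return W2 @ (W1 @ x + b1) + b2

# The composition collapses to a single affine map W @ x + b,
# so depth buys us nothing without a non-linearity in between.
W = W2 @ W1
b = W2 @ b1 + b2

x = rng.normal(size=1)
assert np.allclose(two_linear_layers(x), W @ x + b)
```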

So my takeaways are:

1. Continuous functions can be approximated by a bunch of rectangles. “Simply” place a rectangle of the right height in the right part of the domain, and repeat.
2. A neural network with a single hidden layer containing a small number of neurons with non-linear activations can create non-linear functions like a rectangle. To create a function with many rectangles we can widen the single hidden layer (add more neurons).
3. Given 1 and 2, we can see that a neural network with a single hidden layer is powerful enough to approximate any continuous function.
4. If we want a better approximation we can use more rectangles (more neurons).
5. The non-linear activation of the neuron is what enables us to create powerful building blocks. If instead we only had linear activations we would be limited to creating linear primitives.

#### References:

https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator