Why are Neural Networks Powerful?

Despite encountering neural networks in a variety of contexts, I was never clear on why they are good models. A bit of digging gave me some important insights, which I wanted to preserve. What follows is a short explanation with some valuable links.

Imagine you have some continuous function f(x) with a funky shape. You could approximate the continuous function by a bunch of rectangles of appropriate width and height. It turns out that a small group of neurons with non-linear activations can be used to generate a rectangular function. If we place many such groups side by side in a single hidden layer, we can create a function that is a sum of rectangles. And a sum of rectangles, as we said, can approximate any continuous function!

A nice and concrete example: a single hidden layer of 4 neurons with ReLU activations can be used to create a rectangle function f(x), where f(x) has a value of 1 (or any other height) over the interval [a,b] and is 0 elsewhere. A sketch of this construction appears below.

ReLU is one popular choice of (non-linear) activation, and its non-linearity is what enables us to combine activations into something like a rectangular function. If we limited ourselves to neurons with linear activations, any combination of them would still be a linear function, and a linear function cannot approximate arbitrary continuous functions the way a sum of rectangles can.
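
Here is a minimal Matlab sketch of that four-neuron construction (the interval [a,b], the ramp width delta, and the weights are my own choices for illustration). Each ReLU unit contributes a ramp, and their weighted sum is approximately 1 on [a,b] and 0 elsewhere:

relu = @(x) max(0, x);
a = 1; b = 3; delta = 0.05;          % rectangle over [a,b], ramps of width delta
f = @(x) (relu(x - a + delta) - relu(x - a) ...
        - relu(x - b) + relu(x - b - delta)) / delta;
x = linspace(0, 4, 1000);
plot(x, f(x));                       % approximately 1 on [a,b], 0 elsewhere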

So my takeaways are:

  1. Continuous functions can be approximated by a bunch of rectangles. “Simply” place a rectangle of the right height in the right part of the domain, and repeat.
  2. A neural network with a single hidden layer containing a small number of neurons with non-linear activations can be used to create non-linear functions like a rectangle. To create a function with many rectangles we can widen the single hidden layer (add more neurons).
  3. Given 1 and 2, we can see that a neural network with a single hidden layer is powerful enough to approximate any continuous function.
  4. If we want a better approximation we can use more rectangles (more neurons).
  5. The non-linear activation of the neuron is what enables us to create powerful building blocks. If instead we only had linear activations we would be limited to creating linear primitives.


References:

https://www.quora.com/Is-a-single-layered-ReLu-network-still-a-universal-approximator

http://neuralnetworksanddeeplearning.com/chap4.html

vim copy paste in shell

Say you are SSH'ed into a server and you want to copy text from one file to another. You can do the following with Vim.

  1. vim a.txt
  2. Select what you want to copy (e.g. Shift + V for visual line mode)
  3. "+y (yank into the + register, i.e. the system clipboard; requires Vim with clipboard support)
  4. :q (quit)
  5. vim b.txt
  6. "+p (paste from the + register)

hdf5 shuffle caffe

What is HDF5? HDF5 is a file format that is useful with Caffe because it allows you to have labels that are continuous valued and multidimensional. For example, you may be interested in a regression problem where for each person you want to predict their height in inches and their weight in pounds. This would mean that each training sample needs to be labeled with 2 continuous values. HDF5 lets you store this kind of label.

How are HDF5 files allocated? From what I have seen, people typically store their training set across multiple HDF5 files. In theory you could store your whole training set in a single HDF5 file, or you could store a single sample per HDF5 file. My guess is that storing all of your data in a single HDF5 file would result in a file on the order of gigabytes or even terabytes, which may be too costly to read in. Storing a single sample per HDF5 file would create a large number of files, and the overhead of reading each one adds up. For my own work, where I have 200,000 training samples, I divided them into 200 HDF5 files (1,000 samples per file).
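
As a rough illustration, here is a Matlab sketch of that splitting scheme (the dummy dimensions and file names are mine; the dataset names /data and /label follow the usual convention for Caffe's HDF5 data layer). One gotcha: Matlab stores arrays column-major, so an array written as W x H x C x N here appears reversed (N x C x H x W) to row-major readers like Caffe.

W = 32; H = 32; C = 1; N = 2000; per_file = 1000;   % toy sizes
data = rand(W, H, C, N, 'single');                  % dummy images
labels = rand(2, N, 'single');                      % 2 continuous labels per sample
for f = 1:(N / per_file)
    fname = sprintf('train_%03d.h5', f);
    idx = (f-1)*per_file + (1:per_file);
    h5create(fname, '/data',  [W H C per_file], 'Datatype', 'single');
    h5create(fname, '/label', [2 per_file],     'Datatype', 'single');
    h5write(fname, '/data',  data(:,:,:,idx));
    h5write(fname, '/label', labels(:,idx));
end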

What is this shuffling business? Well, when you have a particularly small dataset, the neural network will see the training data multiple times (multiple epochs). The concern is that seeing the training data in the same order each time may bias the network in an undesirable way. To address this, one should randomize the order in which the network sees the training samples.

From some Google searches I got the impression that shuffling cannot easily be done with HDF5 files in Caffe. However, from looking at this commit, it appears you can shuffle HDF5 files. To shuffle both the order in which the HDF5 files are consumed and the order of the samples within each HDF5 file, add shuffle: true to your hdf5_data_param. Note you only need to add this to the training phase. Currently testing to see if this works…
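
For reference, a sketch of what the training-phase layer definition might look like (the layer name, batch size, and the list file train_h5_list.txt, a text file enumerating the .h5 paths one per line, are my own placeholders):

layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  hdf5_data_param {
    source: "train_h5_list.txt"
    batch_size: 64
    shuffle: true
  }
}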

ffmpeg example

For my own convenience I wanted to blog an example of how to use ffmpeg. If you’ve landed here, I hope it’s useful to you too.

ffmpeg -framerate 25 -i name_input%d.png out.mp4

This will create an mp4 video (out.mp4) from the input frames name_input1.png, name_input2.png, … name_input500.png. The input framerate is specified as 25 frames per second, which, if I recall correctly, will also be the output framerate by default.


linear regression and its flavors

I have been trying to make sense of linear regression and its various forms. Here’s what I learned:

Linear Regression:

The most basic version of linear regression is where you have a single dependent variable, Y1, and a single independent variable (predictor) X1.

Given n samples, the linear regression problem will have the form Y = XB + E, where Y is (n x 1), X is (n x 2), B is (2 x 1), and E is (n x 1). X has two columns because a column of ones is included for the intercept; B holds the weights/coefficients and E is the error.

The least squares solution is B = (X'X)^-1 X'Y. Here B has dimension (2 x 1), which corresponds to a line (an intercept and a slope).
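
A minimal Matlab sketch with made-up data points:

x = (1:10)';                      % single predictor
y = 2 + 3*x + 0.1*randn(10,1);    % noisy samples of the line y = 2 + 3x
X = [ones(10,1) x];               % n x 2 design matrix: intercept column plus predictor
B = (X'*X) \ (X'*y);              % least squares solution, approximately [2; 3]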

Multiple Linear Regression: 

This is the case where you still have a single dependent variable Y1, but instead of a single independent variable (predictor) X1, you have multiple predictors, e.g. X1, X2, up to Xp.

Given n samples, the multiple linear regression problem will have the form Y = XB + E, where Y is (n x 1), X is (n x (p+1)), B is ((p+1) x 1), and E is (n x 1).

Again, the least squares solution is B = (X'X)^-1 X'Y. This time B has dimension ((p+1) x 1), where p is the number of predictors.

Multivariate Linear Regression:

In multivariate linear regression there are multiple dependent variables which we are trying to predict from one or more independent variables. I suppose that if you have the case of multiple independent variables you could use the term Multivariate Multiple Linear Regression.

Given n samples, multivariate linear regression will have the form Y = XB + E, where Y is (n x m), X is (n x (p+1)), B is ((p+1) x m), and E is (n x m). Here m is the number of dependent variables!

It turns out that once again the solution is the same: B = (X'X)^-1 X'Y. Here, however, B is a matrix, and the result is identical to running a separate linear regression for each individual dependent variable (each column of Y)!
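
A quick Matlab sketch to convince yourself of that last point (random data, purely illustrative):

n = 100; p = 3; m = 2;
X = [ones(n,1) randn(n,p)];       % n x (p+1) design matrix
Y = randn(n,m);                   % m dependent variables
B = (X'*X) \ (X'*Y);              % (p+1) x m multivariate solution
B1 = (X'*X) \ (X'*Y(:,1));        % regression on the first dependent variable alone
max(abs(B(:,1) - B1))             % ~0: each column of B is its own regression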


Reference: http://www.d.umn.edu/math/graduates/documents/CassieQuickFinalPaper.pdf

Which folders are taking up all your hard drive space?

My old MacBook Pro has a 250 GB hard drive, which nowadays is always close to running out of space! With limited space, I must constantly make room for new tools by deleting old and often unnecessary files.

Today I was examining my Downloads folder to figure out what to delete. Using the GUI you can sort the files by size, but this only shows you file sizes, not the sizes of all the folders inside Downloads.

So we make use of UNIX on OS X to give us more information.

The command du -sh ./*/ | gsort -hr saves the day. Let's break it down.

du displays disk usage. The -h flag says to display disk usage in a human-readable way, e.g. 3G for 3 GB. The -s flag says to display a single summary entry for each argument (equivalent to -d 0). The ./*/, from what I understand, matches all things that are directories (paths ending in /).

Finally, I would like to pipe the sizes of all these directories to sort, so that I can sort the directories by size. However, the sizes are shown in human-readable form (M, G), so the sort needs to take this into account. Unfortunately the default sort on OS X does not support this, but you can obtain the GNU version of sort through brew. Specifically, brew install coreutils gets you gsort, the GNU version of sort, which has the -h flag.

So now the output will be the folders sorted by human-readable size (-h), with -r reversing the sort so that the largest directories are shown first.

Sample output I get is:

36M ./Introduction_to_Linear_Algebra/
35M ./BlackBoard Files/
34M ./cups-1.6.1/
31M ./cups-1.5.3/
31M ./Django-1.2.3/
25M ./cups-1.6.4/
24M ./voc-release5/

Lastly, you can redirect that output to a text file to examine it at your leisure.

du -sh ./*/ | gsort -hr > blah.txt

Unix is pretty awesome once you jump through some hoops and actually remember which command to use.

Taming Subplots in Matlab with Subaxis

Every now and then I find myself trying to generate a figure in Matlab with many subplots… but as you may know, this can produce figures with lots of white space!

So at some point I found code called subaxis that helps alleviate this.

The call I now make instead of subplot is: subaxis(num_subplots,num_subplots,l,'Spacing',0.01,'Padding',0.010,'Margin',0.025); You can fiddle with the settings and read the documentation to customize your plots.

You can get subaxis here: http://www.mathworks.com/matlabcentral/fileexchange/3696-subaxis-subplot
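
For the record, a small sketch of how this looks in a loop (assuming subaxis.m from the link above is on your path; the grid size and plot contents are arbitrary):

rows = 4; cols = 4;
for k = 1:rows*cols
    subaxis(rows, cols, k, 'Spacing', 0.01, 'Padding', 0.010, 'Margin', 0.025);
    imagesc(rand(32));            % stand-in for your actual plot
    axis off;
end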

Example for Geometric Tools Library (Wild Magic 5)

It took me quite a while to get the Geometric Tools Library (www.geometrictools.com) compiled, and even longer to figure out how to use it. Keep in mind I am working on Linux (Ubuntu 14.04).

A working example I created is test.cpp, which tests whether two boxes intersect.

//test.cpp

#include <cstdio>
#include "Wm5IntrBox3Box3.h"

int main (int argc, char** argv)
{
    // Box centers: a at x = 0.5, b at x = 1.5
    Wm5::Vector3<float> centera(0.5f,0.5f,0.5f);
    Wm5::Vector3<float> centerb(1.5f,0.5f,0.5f);

    // Both boxes are axis-aligned: their axes are the standard basis vectors
    Wm5::Vector3<float> axesa[3];
    Wm5::Vector3<float> axesb[3];

    axesa[0] = Wm5::Vector3<float>(1.0f,0.0f,0.0f);
    axesa[1] = Wm5::Vector3<float>(0.0f,1.0f,0.0f);
    axesa[2] = Wm5::Vector3<float>(0.0f,0.0f,1.0f);

    axesb[0] = Wm5::Vector3<float>(1.0f,0.0f,0.0f);
    axesb[1] = Wm5::Vector3<float>(0.0f,1.0f,0.0f);
    axesb[2] = Wm5::Vector3<float>(0.0f,0.0f,1.0f);

    // Extents are half-widths along each axis: box a spans [0.25,0.75] in x
    // and box b spans [0.7,2.3] in x, so the boxes overlap
    float extenta[3] = {0.25f,0.75f,0.75f};
    float extentb[3] = {0.8f,0.75f,0.75f};

    const Wm5::Box3<float> a(centera,axesa,extenta);
    const Wm5::Box3<float> b(centerb,axesb,extentb);

    Wm5::IntrBox3Box3<float> test(a,b);

    // Prints "Boxes Intersect: true" for these values
    printf("Boxes Intersect: %s\n", test.Test() ? "true" : "false");

    return 0;
}

To get this to compile I used the following:

g++ -ggdb -Wall  -I./SDK/Include/ -L./SDK/Library/Release/ test.cpp -o test -lWm5GlxApplication -lWm5GlxGraphics -lWm5Imagics -lWm5Physics -lWm5Mathematics -lWm5Core -lX11 -lXext -lGL -lGLU -lpthread -lm

Best of luck to anyone who finds this useful.

A Handy way to deal with Varargin In Matlab

After writing a function, you may realize that in certain cases you would like it to take in more arguments. These arguments are optional! In Matlab, one way to do this is to append varargin to your argument list. So for example, if my function definition was originally draw(x,y), it would become draw(x,y,varargin). You could think of the base functionality of the draw function as just plotting a point, and varargin as a way to make the function do more elaborate things.

One super convenient way to encode these optional arguments is as pairs of strings and values, where the string describes what the argument represents and the value is the value of the variable. So for example, I could write a call that looks like draw(x,y,'Color',[1 0 0],'Thickness',15,'Type',3). Everything after the first two arguments (x,y) is placed in varargin as a cell array. So from the function's point of view, varargin{1} = 'Color', varargin{2} = [1 0 0], and so forth.

Now the fun part. We can convert the varargin cell array into variables named by the strings and holding the corresponding values. To do this we can use:

% Walk the name/value pairs, creating one variable per (lowercased) name
for a = 1:2:length(varargin)
    eval(sprintf('%s = varargin{a+1};', lower(varargin{a})));
end

This will produce the variable color with value [1 0 0], thickness with value 15, and type with value 3. Note: you don’t need to use lower in the eval statement.

Finally, you can set defaults for any arguments the caller omitted, e.g. if ~exist('variable_name','var'), variable_name = default_value; end. (Note that exist takes the variable's name as a string.)
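
Putting it all together, a minimal sketch of the pattern (the function body and default values are my own illustration):

function draw(x, y, varargin)
    % Convert 'Name', value pairs into local variables (color, thickness, ...)
    for a = 1:2:length(varargin)
        eval(sprintf('%s = varargin{a+1};', lower(varargin{a})));
    end
    % Defaults for anything the caller did not pass
    if ~exist('color', 'var'), color = [0 0 0]; end
    if ~exist('thickness', 'var'), thickness = 6; end
    plot(x, y, '.', 'Color', color, 'MarkerSize', thickness);
end

Called as draw(1,2) it uses the defaults; called as draw(1,2,'Color',[1 0 0],'Thickness',15) it does the more elaborate thing.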


2D Medial Axis Transform in Matlab

I was looking to experiment with the medial axis in Matlab. To get a medial axis (or what seems to be a good approximation) I used the bwmorph command with 'thin' and Inf. I then realized that this leaves you with the medial axis but no information on the radii needed to reconstruct the original contour from the medial axis. After some searching and thinking, I realized that I could simply compute a distance transform! The distance transform, applied to the negated binary image, tells you how far each point inside the connected component is from the closest boundary point. Lastly, intersecting this distance transform with the medial axis gives me the approximate radii that I wanted. This is the medial axis transform! I can't say anything about the accuracy of this method yet… but it's a start.
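
A minimal Matlab sketch of the whole recipe (the toy shape is mine):

bw = false(200, 200);
bw(60:140, 40:160) = true;           % a filled rectangle as a toy binary shape
skel = bwmorph(bw, 'thin', Inf);     % approximate medial axis (skeleton)
dist = bwdist(~bw);                  % distance from each inside pixel to the boundary
mat = dist .* skel;                  % medial axis transform: radius at each skeleton pixel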
