Autoencoders, Tied Weights and Dropout
==============================================


Training deep neural networks (i.e., networks with several hidden
layers) is challenging, because normal training easily gets stuck in
undesired local optima. This prevents the lower layers from learning
useful features. This problem can be partially circumvented by
pre-training the layers in an unsupervised fashion and thereby
initialising them in a region of the error function which is easier to
train (or fine-tune) using steepest descent techniques.

One of these unsupervised learning techniques are autoencoders. An autoencoder
is a feed forward neural network with one hidden layer which is trained to map
its input to itself via the representation formed by the hidden units. The optimisation
problem for input data :math:`\vec{x}_1,\dots,\vec{x}_N` is stated as:

.. math ::
	\min_{\theta} \frac 1 N \sum_{i=1}^N (\vec x_i - f_{\theta}(\vec x_i)^2 \enspace .

Of course, without any constraints this is a simple task as the model
will just try to learn the identity. It becomes a bit more challenging
when we restrict the size of the intermediate representation (i.e.,
the number of hidden units). An image with several hundred input
points can not be squeezed in a representation of a few hidden
neurons. Thus, it is assumed that this intermediate representation
learns something meaningful about the problem.  Of course, using this
simple technique only works if the number of hidden neurons is smaller than
the number of dimensions of the image. We need more advanced
regularisation techniques to work with overcomplete representations
(i.e., if the size of the intermediate representation is larger than
the input dimension). But especially for images it is obvious that a
good intermediate representation must be somehow more complex: the
number of objects that can be seen on an image is larger than the
number of its pixels.

Shark offers a wide range of possible training algorithms for autoencoders. This is the
basic tutorial which will show you:

* How to train autoencoders
* How to use the :doxy:`Autoencoder` and :doxy:`TiedAutoencoder` classes
* How to use the dropout technique

In tutorials building up on this one, we will show how to use these building blocks in
conjunction with training :doxy:`FFNet` in an unsupervised fashion.

As a dataset for this tutorial, we use a subset of the MNIST dataset which needs to
be unzipped first. It can be found in ``examples/Supervised/data/mnist_subset.zip``.

The following includes are needed for this tutorial::


	#include <shark/Data/Pgm.h> //for exporting the learned filters
	#include <shark/Data/SparseData.h>//for reading in the images as sparseData/Libsvm format
	#include <shark/Models/Autoencoder.h>//normal autoencoder model
	#include <shark/Models/TiedAutoencoder.h>//autoencoder with tied weights
	#include <shark/ObjectiveFunctions/ErrorFunction.h>
	#include <shark/Algorithms/GradientDescent/Rprop.h>// the RProp optimization algorithm
	#include <shark/ObjectiveFunctions/Loss/SquaredLoss.h> // squared loss used for regression
	#include <shark/ObjectiveFunctions/Regularizer.h> //L2 regulariziation
	

Training autoencoders
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Training an autoencoder is straight forward in shark. We just create an instance
of the :doxy:`Autoencoder` class, initialize its structure and perform a simple
regression task where we set the labels to be the same as the input. 

We will start by creating a simple function which creates and trains a given
type of autoencoder model. This will enable us during the evaluation to take a look at different
structures and model types::


	template<class AutoencoderModel>
	AutoencoderModel trainAutoencoderModel(
		UnlabeledData<RealVector> const& data,//the data to train with
		std::size_t numHidden,//number of features in the autoencoder
		std::size_t iterations, //number of iterations to optimize
		double regularisation//strength of the regularisation
	){
	

The parameters are the dataset we use as the inputs for the
autoencoder, the number of hidden units (i.e. the size of the
intermediate representation), the number of iterations to train and
the strength of the two-norm regularisation we want to use. The
template parameter is the type of autoencoder we want to use, this
will be for example ``Autoencoder<LogisticNeuron,LogisticNeuron>``.
The two template parameters define the type of the activation function
used in the hidden and output layer, respectively.

Next we create the model::


		//create the model
		std::size_t inputs = dataDimension(data);
		AutoencoderModel model;
		model.setStructure(inputs, numHidden);
		initRandomUniform(model,-0.1*std::sqrt(1.0/inputs),0.1*std::sqrt(1.0/inputs));
	

All autoencoders need only two parameters for 
``setStructure``: the number of input dimensions and the number of hidden units.
The number of outputs is implicitly given by the input dimensionality. 

Next, we set up the objective function. This should by now be looking
quite familiar.  We set up an :doxy:`ErrorFunction` with the model and
the squared loss. We create the :doxy:`LabeledData` object from the
input data by setting the labels to be the same as the inputs. Finally
we add  two-norm regularisation by creating an instance of the
:doxy:`TwoNormRegularizer` class::


		//create the objective function
		LabeledData<RealVector,RealVector> trainSet(data,data);//labels identical to inputs
		SquaredLoss<RealVector> loss;
		ErrorFunction error(trainSet, &model, &loss);
		TwoNormRegularizer regularizer(error.numberOfVariables());
		error.setRegularizer(regularisation,&regularizer);
	

Lastly, we optimize the objective using :doxy:`IRpropPlusFull`, a variant
of the Rprop algorithm which uses full weight backtracking which is more
stable on the complicated error functions formed by neural networks::


		IRpropPlusFull optimizer;
		optimizer.init(error);
		std::cout<<"Optimizing model: "+model.name()<<std::endl;
		for(std::size_t i = 0; i != iterations; ++i){
			optimizer.step(error);
			std::cout<<i<<" "<<optimizer.solution().value<<std::endl;
		}
	


Experimenting with different architectures
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We want to use the code above to train different architectures. We first start with the
standard autoencoder. It has the formula:

.. math ::
	f(x) = \sigma_2(\vec b_2+W_2\sigma_1(\vec b_1+W_1\vec x))
	
This is the normal equation for a feed forward neural network with a single hidden layer.
The input and output activation functions :math:`\sigma_1` and :math:`\sigma_2` can
be chosen among the same types as used for :doxy:`FFNet`. The problem of this
architecture is that, since the weight matrices :math:`W_1` and :math:`W_2` are
independent, the autoencoder can easily learn the identity given a big enough hidden
layer. A way around this is to use a :doxy:`TiedAutoencoder` which has the formula:

.. math ::
	f(x) = \sigma_2(\vec b_2+W^T\sigma_1(\vec b_1+W\vec x))
	
Here we set :math:`W_2=W_1^T` eliminating a lot of degrees of freedom. 

Additionally we can use a relatively new technique called
dropout [SrivastavaEtAl2014]_. This works completely on the level of the activation
functions by setting the neuron randomly to 0 with a probability of
0.5.  Dropout makes acts only on the hidden units. It makes it harder
for the single representations to specialise, and instead redundant
features need to be learned.

We will now use the 4 combinations of using tied weights and dropout and compare the features
that were generated on the MNIST dataset (we omit loading and preprocessing of the dataset for
brevity)::


		typedef Autoencoder<LogisticNeuron, LogisticNeuron> Autoencoder1;
		typedef TiedAutoencoder<LogisticNeuron, LogisticNeuron> Autoencoder2;
		typedef Autoencoder<DropoutNeuron<LogisticNeuron>, LogisticNeuron> Autoencoder3;
		typedef TiedAutoencoder<DropoutNeuron<LogisticNeuron>, LogisticNeuron> Autoencoder4;
		
		Autoencoder1 net1 = trainAutoencoderModel<Autoencoder1>(train.inputs(),numHidden,iterations,regularisation);
		Autoencoder2 net2 = trainAutoencoderModel<Autoencoder2>(train.inputs(),numHidden,iterations,regularisation);
		Autoencoder3 net3 = trainAutoencoderModel<Autoencoder3>(train.inputs(),numHidden,iterations,regularisation);
		Autoencoder4 net4 = trainAutoencoderModel<Autoencoder4>(train.inputs(),numHidden,iterations,regularisation);
	
		exportFiltersToPGMGrid("features1",net1.encoderMatrix(),28,28);
		exportFiltersToPGMGrid("features2",net2.encoderMatrix(),28,28);
		exportFiltersToPGMGrid("features3",net3.encoderMatrix(),28,28);
		exportFiltersToPGMGrid("features4",net4.encoderMatrix(),28,28);
		

Visualizing the autoencoder
^^^^^^^^^^^^^^^^^^^^^^^^^^^

After training the different architectures, we printed the feature maps (i.e., the input weights of the hidden neurons ordered according to the pixels they are connected to). Let's have a look.

Normal autoencoder:

.. figure:: ../images/featuresAutoencoder.png
  :alt: Plot of features learned by the normal autoencoders
  
Autoencoder with tied weights:

.. figure:: ../images/featuresTiedAutoencoder.png
  :alt: Plot of features learned by the tied autoencoders
  
Autoencoder with dropout:

.. figure:: ../images/featuresDropoutAutoencoder.png
  :alt: Plot of features learned by the normal autoencoders

Autoencoder with dropout and tied weights.

.. figure:: ../images/featuresDropoutTiedAutoencoder.png
  :alt: Plot of features learned by the tied autoencoders

Full example program
^^^^^^^^^^^^^^^^^^^^^^^

The full example program is  :doxy:`AutoEncoderTutorial.cpp`.

.. attention::
  The settings of the parameters of the program will reproduce the filters. However the program
  takes some time to run! This might be too long for weaker machines.

References
----------

.. [SrivastavaEtAl2014] N. Srivastava, G. E. Hinton, A. Krizhevsky,
   I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to
   Prevent Neural Networks from Overfitting. Journal of Machine
   Learning Research 15: 929-1958, 2014
