Pretraining of Deep Neural Networks
==============================================

.. attention::
  This is an advanced topic

Training deep neural networks is a challenge because normal training
easily gets stuck in undesired local optima which prevent the lower
layers from learning useful features. This problem can be partially
circumvented by pretraining the layers in an unsupervised fashion and
thus initialising them in a region of the error function which is
easier to train (or fine-tune) using steepest descent techniques.

In this tutorial we will implement the architecture presented in
"Deep Sparse Rectifier Neural Networks" [Glorot11]_. The authors propose a
multi-layered feed-forward network with rectified linear hidden neurons, which is
first pre-trained layer-wise using denoising autoencoders [Vincent08]_. Afterwards, the full
network is trained in a supervised fashion with an L1 regularisation to enforce additional sparsity.

Training denoising autoencoders is outlined in detail in :doc:`./denoising_autoencoders` and
supervised training of a feed forward neural network is explained in :doc:`./ffnet`. This tutorial provides
the glue to bring both together.
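
To make the two ingredients concrete, the following is a minimal sketch of the rectifier activation and of impulse-noise corruption as used by a denoising autoencoder. It is written in plain C++ and does not use Shark. The names ``rectifier`` and ``corruptWithImpulseNoise`` are hypothetical, and setting corrupted components to zero is an assumption of this sketch.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Rectified linear activation: passes positive values, clamps the rest to zero.
double rectifier(double a) {
    return a > 0.0 ? a : 0.0;
}

// Impulse noise (sketch): every component is independently replaced by zero
// with probability noiseStrength. A denoising autoencoder is trained to
// reconstruct the clean input from such a corrupted copy.
std::vector<double> corruptWithImpulseNoise(
    std::vector<double> const& input, double noiseStrength, std::mt19937& rng
) {
    assert(noiseStrength >= 0.0 && noiseStrength <= 1.0);
    std::bernoulli_distribution drop(noiseStrength);
    std::vector<double> corrupted = input;
    for (double& x : corrupted) {
        if (drop(rng)) x = 0.0;
    }
    return corrupted;
}
```

With a noise strength of 0.3, roughly 30% of the input components would be zeroed on average, while the reconstruction target stays the uncorrupted input.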

Due to the complexity of the task, a number of includes are needed::


	//denoising autoencoder model and deep network
	#include <shark/Models/FFNet.h>// neural network for supervised training
	#include <shark/Models/Autoencoder.h>// the autoencoder to train unsupervised
	#include <shark/Models/ImpulseNoiseModel.h>// model adding noise to the inputs
	#include <shark/Models/ConcatenatedModel.h>// to concatenate Autoencoder with noise adding model
	
	//training the  model
	#include <shark/ObjectiveFunctions/ErrorFunction.h>//the error function performing the regularisation of the hidden neurons
	#include <shark/ObjectiveFunctions/Loss/SquaredLoss.h> // squared loss used for unsupervised pre-training
	#include <shark/ObjectiveFunctions/Loss/CrossEntropy.h> // loss used for supervised training
	#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h> // loss used for evaluation of performance
	#include <shark/ObjectiveFunctions/Regularizer.h> //L1 and L2 regularisation
	#include <shark/Algorithms/GradientDescent/SteepestDescent.h> //optimizer: simple gradient descent.
	#include <shark/Algorithms/GradientDescent/Rprop.h> //optimizer for the autoencoders and the supervised fine-tuning
	

Deep Network Pre-training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We will use the code of the denoising autoencoder tutorial to pre-train a deep neural network, and we
will create a helper function which initialises a deep neural network
from the trained autoencoders. Afterwards, a supervised fine-tuning step is applied:
gradient-based optimisation of the supervised learning objective, using the pre-trained
network as the starting point. The types of networks we use are::


	typedef Autoencoder<RectifierNeuron,LinearNeuron> AutoencoderModel;//type of autoencoder
	typedef FFNet<RectifierNeuron,LinearNeuron> Network;//final supervised trained structure
	

First, we create a function to initialise the network. We start by training the
autoencoders for the two hidden layers. We first train an autoencoder on the
original dataset. Next, we take its encoder layer, that is,
the connection from the inputs to the hidden units, and compute the feature vectors for every
point in the dataset using ``evalLayer``, a method specific to autoencoders and feed-forward networks.
Finally, we train the autoencoder for the next layer on this feature dataset::


	Network unsupervisedPreTraining(
		UnlabeledData<RealVector> const& data,
		std::size_t numHidden1,std::size_t numHidden2, std::size_t numOutputs,
		double regularisation, double noiseStrength, std::size_t iterations
	){
		//train the first hidden layer
		std::cout<<"training first layer"<<std::endl;
		AutoencoderModel layer =  trainAutoencoderModel<AutoencoderModel>(
			data,numHidden1,
			regularisation, noiseStrength,
			iterations
		);
		//compute the mapping onto the features of the first hidden layer
		UnlabeledData<RealVector> intermediateData = layer.evalLayer(0,data);
		
		//train the next layer
		std::cout<<"training second layer"<<std::endl;
		AutoencoderModel layer2 =  trainAutoencoderModel<AutoencoderModel>(
			intermediateData,numHidden2,
			regularisation, noiseStrength,
			iterations
		);
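
Conceptually, mapping the data onto the features of the first hidden layer means applying the encoder weights, the hidden bias, and the rectifier to every point. The sketch below illustrates this in plain C++. ``EncoderLayer``, ``encode`` and ``encodeDataset`` are hypothetical names introduced for illustration and are not part of Shark, and the sketch makes no claim about how ``evalLayer`` is implemented internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>; // one row per hidden unit

// Hypothetical stand-in for the encoder half of a trained autoencoder.
struct EncoderLayer {
    Matrix weights; // numHidden x numInputs
    Vector bias;    // numHidden
};

// Map one input point onto rectified hidden features: h = max(0, W x + b).
Vector encode(EncoderLayer const& layer, Vector const& x) {
    assert(layer.weights.size() == layer.bias.size());
    Vector h(layer.bias.size());
    for (std::size_t i = 0; i != h.size(); ++i) {
        assert(layer.weights[i].size() == x.size());
        double a = layer.bias[i];
        for (std::size_t j = 0; j != x.size(); ++j)
            a += layer.weights[i][j] * x[j];
        h[i] = a > 0.0 ? a : 0.0;
    }
    return h;
}

// Apply the encoder to every point. The result serves as the training set
// for the autoencoder of the next layer.
std::vector<Vector> encodeDataset(
    EncoderLayer const& layer, std::vector<Vector> const& data
) {
    std::vector<Vector> features;
    features.reserve(data.size());
    for (Vector const& x : data) features.push_back(encode(layer, x));
    return features;
}
```

Because the second autoencoder only ever sees these features, its input dimension equals the number of hidden units of the first layer.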
	

We can now create the pre-trained network from the autoencoders: we create
a network with two hidden layers, initialise all weights randomly, and then set
the first and second hidden layers to the encoding layers of the autoencoders::


		//create the final network
		Network network;
		network.setStructure(dataDimension(data),numHidden1,numHidden2, numOutputs);
		initRandomNormal(network,0.1);
		network.setLayer(0,layer.encoderMatrix(),layer.hiddenBias());
		network.setLayer(1,layer2.encoderMatrix(),layer2.hiddenBias());
		
		return network;
	}
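
The stacking step can be pictured independently of Shark as follows: the two encoders become the hidden layers, and only the output layer keeps its random initialisation. This is a minimal sketch with hypothetical names (``Layer``, ``randomLayer``, ``stackPretrained``). Interpreting the value 0.1 as a standard deviation is an assumption of the sketch and not a statement about ``initRandomNormal``.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>; // one row per unit of the layer

// Hypothetical plain-data layer: a weight matrix and one bias per unit.
struct Layer {
    Matrix weights;
    Vector bias;
};

// Create a layer whose weights and biases are drawn from a zero-mean Gaussian.
// Using stddev as the standard deviation is a choice made for this sketch.
Layer randomLayer(
    std::size_t numUnits, std::size_t numInputs, double stddev, std::mt19937& rng
) {
    std::normal_distribution<double> gauss(0.0, stddev);
    Layer layer{Matrix(numUnits, Vector(numInputs)), Vector(numUnits)};
    for (Vector& row : layer.weights)
        for (double& w : row) w = gauss(rng);
    for (double& b : layer.bias) b = gauss(rng);
    return layer;
}

// Build a network with two pre-trained hidden layers and a random output layer.
std::vector<Layer> stackPretrained(
    Layer encoder1, Layer encoder2, std::size_t numOutputs, std::mt19937& rng
) {
    // The second encoder must consume exactly the features the first produces.
    assert(!encoder2.weights.empty());
    assert(encoder2.weights.front().size() == encoder1.bias.size());
    std::size_t numHidden2 = encoder2.bias.size();

    std::vector<Layer> network;
    network.push_back(std::move(encoder1));
    network.push_back(std::move(encoder2));
    network.push_back(randomLayer(numOutputs, numHidden2, 0.1, rng));
    return network;
}
```

The output layer has no unsupervised counterpart, so it is the only part of the network that starts the supervised phase from random weights.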
	


Supervised Training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The supervised training part is largely the same as in previous tutorials, so we only
show the code here. We use the :doxy:`CrossEntropy` loss for classification and the
:doxy:`OneNormRegularizer`, applied to the network parameters, to encourage sparsity. We again optimise
using :doxy:`IRpropPlusFull`::


		//model parameters
		std::size_t numHidden1 = 8;
		std::size_t numHidden2 = 8;
		//unsupervised hyper parameters
		double unsupRegularisation = 0.001;
		double noiseStrength = 0.3;
		std::size_t unsupIterations = 100;
		//supervised hyper parameters
		double regularisation = 0.0001;
		std::size_t iterations = 200;
		
		//load data and split into training and test
		LabeledData<RealVector,unsigned int> data = createProblem();
		data.shuffle();
		LabeledData<RealVector,unsigned int> test = splitAtElement(data,static_cast<std::size_t>(0.5*data.numberOfElements()));
		
		//unsupervised pre training
		Network network = unsupervisedPreTraining(
			data.inputs(),numHidden1, numHidden2,numberOfClasses(data),
			unsupRegularisation, noiseStrength, unsupIterations
		);
		
		//create the supervised problem. Cross Entropy loss with one norm regularisation
		CrossEntropy loss;
		ErrorFunction error(data, &network, &loss);
		OneNormRegularizer regularizer(error.numberOfVariables());
		error.setRegularizer(regularisation,&regularizer);
		
		//optimize the model
		std::cout<<"training supervised model"<<std::endl;
		IRpropPlusFull optimizer;
		optimizer.init(error);
		for(std::size_t i = 0; i != iterations; ++i){
			optimizer.step(error);
			std::cout<<i<<" "<<optimizer.solution().value<<std::endl;
		}
		network.setParameterVector(optimizer.solution().point);
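
The quantity being minimised here is a data term plus a weighted one-norm of the parameters. The sketch below spells this out in plain C++ under two assumptions: the data term is the mean cross entropy over the training set, and the cross entropy is computed from a softmax over the linear network outputs. The function names are hypothetical and not part of Shark.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cross entropy of a softmax over linear outputs: -log softmax(outputs)[label].
// Subtracting the maximum keeps the exponentials numerically stable.
double crossEntropy(std::vector<double> const& outputs, std::size_t label) {
    assert(label < outputs.size());
    double maxOut = *std::max_element(outputs.begin(), outputs.end());
    double sumExp = 0.0;
    for (double o : outputs) sumExp += std::exp(o - maxOut);
    return std::log(sumExp) + maxOut - outputs[label];
}

// One-norm of the parameter vector: the sum of absolute values.
double oneNorm(std::vector<double> const& parameters) {
    double sum = 0.0;
    for (double p : parameters) sum += std::abs(p);
    return sum;
}

// Regularised objective: mean loss plus lambda times the one-norm.
double regularisedError(
    double meanLoss, std::vector<double> const& parameters, double lambda
) {
    return meanLoss + lambda * oneNorm(parameters);
}
```

With ``regularisation = 0.0001`` the penalty is small compared to the loss, so it mainly pushes weights that contribute little towards zero.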
	

.. note::
  In the original paper, the networks are optimised using stochastic gradient descent instead of Rprop.
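
For readers unfamiliar with Rprop, the sign-based update with per-parameter step sizes can be sketched as below. This is the simpler iRprop- variant without weight backtracking, not Shark's :doxy:`IRpropPlusFull`. The names ``RpropState`` and ``rpropStep`` are hypothetical, and the default constants are the conventional textbook values rather than values taken from Shark.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct RpropState {
    std::vector<double> stepSize;         // one step size per parameter
    std::vector<double> previousGradient; // gradient seen in the previous step
};

// One iRprop- step: only the sign of each partial derivative is used.
void rpropStep(
    std::vector<double>& parameters, std::vector<double> gradient,
    RpropState& state,
    double increase = 1.2, double decrease = 0.5,
    double minStep = 1e-6, double maxStep = 50.0
) {
    assert(parameters.size() == gradient.size());
    assert(state.stepSize.size() == gradient.size());
    assert(state.previousGradient.size() == gradient.size());
    for (std::size_t i = 0; i != parameters.size(); ++i) {
        double agreement = gradient[i] * state.previousGradient[i];
        if (agreement > 0.0) {
            // Same sign as before: accelerate.
            state.stepSize[i] = std::min(state.stepSize[i] * increase, maxStep);
        } else if (agreement < 0.0) {
            // Sign flipped: we jumped over a minimum, so shrink the step
            // and skip the move for this parameter in this iteration.
            state.stepSize[i] = std::max(state.stepSize[i] * decrease, minStep);
            gradient[i] = 0.0;
        }
        if (gradient[i] > 0.0) parameters[i] -= state.stepSize[i];
        else if (gradient[i] < 0.0) parameters[i] += state.stepSize[i];
        state.previousGradient[i] = gradient[i];
    }
}
```

Since the gradient magnitude is ignored, this kind of update needs the full-batch gradient to be reliable, which is one practical difference from stochastic gradient descent.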

Full example program
^^^^^^^^^^^^^^^^^^^^^^^

The full example program is :doxy:`DeepNetworkTraining.cpp`.
As an alternative route, :doxy:`DeepNetworkTrainingRBM.cpp` shows how to do unsupervised pretraining
using the RBM module.

References
^^^^^^^^^^

.. [Glorot11] X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse
   Rectifier Neural Networks. Proceedings of the 14th International
   Conference on Artificial Intelligence and Statistics. JMLR W&CP
   (15), 2011.

.. [Vincent08] P. Vincent, H. Larochelle, Y. Bengio, and
   P. A. Manzagol. Extracting and Composing Robust Features with
   Denoising Autoencoders. Proceedings of the Twenty-fifth
   International Conference on Machine Learning (ICML '08), pages
   1096-1103, ACM, 2008.