============================
Linear Discriminant Analysis
============================

Background
----------

Linear Discriminant Analysis (LDA) is a basic classification method
from parametric statistics. It is based on a maximum a posteriori
estimate of the class membership under the assumption that the class
conditional densities are multi-variate Gaussians having different
means but a common covariance matrix. Despite its simplicity, LDA
gives surprisingly good results in practice, of course crucially
depending on the representation of the input patterns and the
underlying distribution. Details can be found, for instance, in
[HastieTibshiraniFriedman2008]_.


Linear Discriminant Analysis in Shark
----------------------------------------

We assume that a data file in csv format is provided as a command line argument.
As a first step, we inspect and split the data as described in the
:doc:`nearestNeighbor` tutorial: ::


	int main(int argc, char **argv) {
		if(argc < 2) {
			cerr << "usage: " << argv[0] << " (filename)" << endl;
			exit(EXIT_FAILURE);
		}
		// read data
		ClassificationDataset data;
		try {
			importCSV(data, argv[1], LAST_COLUMN, ' ');
		} 
		catch (...) {
			cerr << "unable to read data from file " <<  argv[1] << endl;
			exit(EXIT_FAILURE);
		}
	
		cout << "overall number of data points: " << data.numberOfElements() << " "
		     << "number of classes: " << numberOfClasses(data) << " "
		     << "input dimension: " << inputDimension(data) << endl;
	
		// split data into training and test set
		ClassificationDataset dataTest = splitAtElement(data, .5 * data.numberOfElements() );
		cout << "training data points: " << data.numberOfElements() << endl;
		cout << "test data points: " << dataTest.numberOfElements() << endl;
	




Model and learning algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We include the :doxy:`LDA` class ::


	#include <shark/Algorithms/Trainers/LDA.h>
	

and define the LDA trainer :: 


		LDA ldaTrainer;
	

and the linear classifier to be trained by the LDA trainer: ::


		LinearClassifier<> lda;
	

The model is trained by calling: ::


		ldaTrainer.train(lda, data);
	

Evaluating the model
^^^^^^^^^^^^^^^^^^^^

After training the model, we can evaluate it.  As a performance
measure, we consider the standard 0-1 loss.  The corresponding risk is
the probability of error, the empirical risk is the average
classification error.  When measured on set of sample patterns, it
simply computes the fraction of wrong predictions.
We define the loss for ``unsigned integer`` labels and
create a new data container for the predictions of our model: ::


		Data<unsigned int> prediction;
		ZeroOneLoss<unsigned int> loss;
	
	
Let's apply the classifier to the training and the test data: ::


		prediction = lda(data.inputs());
		cout << "LDA on training set accuracy: " << 1. - loss(data.labels(), prediction) << endl;
		prediction = lda(dataTest.inputs());
		cout << "LDA on test set accuracy:     " << 1. - loss(dataTest.labels(), prediction) << endl;
	

Of course, the accuracy is given by one minus the error.
In this example, the (parametric) LDA performs worse than the
(non-parametric) nearest neighbor classifier, but still much better
than random guessing.


Full example program
--------------------

The full example program is 
:doxy:`LDATutorial.cpp`.

Sample classification problem: Protein fold prediction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let us consider the same bioinformatics problem as in the
:doc:`nearestNeighbor` tutorial, namely the prediction of the
secondary structure of proteins. The goal is to assign a protein to
one out of 27 SCOP fold types [DingDubchak2001a]_.  We again consider
the descriptions of amino-acid sequences provided by
[DamoulasGirolami2008a]_.  The data :download:`C.csv <../../../../../examples/Supervised/data/C.csv>`
provide a description of the amino-acid compositions of 695 proteins
together with the corresponding fold type. Each row corresponds to a
protein.  After downloading the data :download:`C.csv <../../../../../examples/Supervised/data/C.csv>` we
can envoke the example program: ::

  ./LDATutorial C.csv


References
----------

.. [DamoulasGirolami2008a] T. Damoulas and M. Girolami.
   Probabilistic multi-class multi-kernel learning: on protein fold
   recognition and remote homology detection. Bioinformatics,
   24(10):1264-1270, 2008.

.. [DingDubchak2001a] C.H.Q. Ding and I. Dubchak.  Multi-class
   protein fold recognition using support vector machines and neural
   networks. Bioinformatics, 17(4):349-358, 2001.

.. [HastieTibshiraniFriedman2008] T. Hastie, R. Tibshirani and
   J. Friedman.  `The Elements of Statistical Learning
   <http://www-stat.stanford.edu/~tibs/ElemStatLearn>`_, section 4.3. Springer-Verlag,
   2008.
