==========================================
Classification And Regression Trees (CART)
==========================================

Background
----------

CART is a decision tree algorithm for both classification and
regression. It was first described by [Breiman1984]_. It is a recursive
algorithm, which partitions the training dataset by doing binary splits.
It is a conceptual simple decision tree algorithm, and performs 
OK in practice.

CART in Shark
----------------------------------------

For this tutorial the following includes are needed::


	#include <shark/Data/Csv.h>                            // importing CSV files
	#include <shark/Algorithms/Trainers/CARTTrainer.h>     // the CART trainer
	#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h> // 0/1 loss for evaluation
	

Sample classification problem: Protein fold prediction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let us consider the same bioinformatics problem as in the
:doc:`nearestNeighbor` tutorial, namely the prediction of the
secondary structure of proteins. The goal is to assign a protein to
one out of 27 SCOP fold types [DingDubchak2001a]_.  We again consider
the descriptions of amino-acid sequences provided by
[DamoulasGirolami2008a]_.  The data :download:`C.csv <../../../../../examples/Supervised/data/C.csv>`
provide a description of the amino-acid compositions of 695 proteins
together with the corresponding fold type. Each row corresponds to a
protein.  After downloading the data :download:`C.csv <../../../../../examples/Supervised/data/C.csv>` we
can read, inspect and split the data as described in the
:doc:`nearestNeighbor` tutorial: ::


		// read data
		ClassificationDataset dataTrain;
		importCSV(dataTrain, "data/C.csv", LAST_COLUMN, ' ');
	
	
		//Split the dataset into a training and a test dataset
		ClassificationDataset dataTest =splitAtElement(dataTrain,311);
		
		cout << "Training set - number of data points: " << dataTrain.numberOfElements()
			 << " number of classes: " << numberOfClasses(dataTrain)
			 << " input dimension: " << inputDimension(dataTrain) << endl;
	
		cout << "Test set - number of data points: " << dataTest.numberOfElements()
			 << " number of classes: " << numberOfClasses(dataTest)
			 << " input dimension: " << inputDimension(dataTest) << endl;
	
		

Model and learning algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Training the CART-Tree is very simple as it is just a call to the trainer which creates the tree::


		CARTTrainer trainer;
		CARTClassifier<RealVector> model;
		trainer.train(model, dataTrain);
		

Evaluating the model
^^^^^^^^^^^^^^^^^^^^

After training the model we can evaluate it.  As a performance
measure, we consider the standard 0-1 loss.  The corresponding risk is
the probability of error, the empirical risk is the average
classification error.  When measured on set of sample patterns, it
simply computes the fraction of wrong predictions::


		ZeroOneLoss<unsigned int, RealVector> loss;
		Data<RealVector> prediction = model(dataTrain.inputs());
		cout << "CART on training set accuracy: " << 1. - loss.eval(dataTrain.labels(), prediction) << endl;
	
		prediction = model(dataTest.inputs());
		cout << "CART on test set accuracy:     " << 1. - loss.eval(dataTest.labels(), prediction) << endl;
		

Of course, the accuracy is given by one minus the error.
In this example, CART performs slightly better than the LDA and much
better than random guessing.

Full example program
--------------------

The full example program is 
:doxy:`CARTTutorial.cpp`.



References
----------

.. [Breiman1984] L. Breiman, J. H. Friedman, R. A. Olshen, C. J.  Stone.
            Classification and Regression Trees.
            Wadsworth and Brooks,  
            1984
