====================================
Support Vector Machines: First Steps
====================================


Kernel-based learning algorithms such as support vector machine (SVM,
[CortesVapnik1995]_) classifiers mark the state-of-the art in pattern
recognition . They employ (Mercer) kernel functions to implicitly
define a metric feature space for processing the input data, that is,
the kernel defines the similarity between observations.  Kernel
methods are well understood theoretically and give excellent results
in practice. This tutorial explains how to train a standard
C-SVM.

Theoretical background
----------------------

The general supervised learning problem can be stated as follows.
Given sample data :math:`S=\{(x_i,y_i)\,|\,1 \leq i \leq \ell\}` drawn from an
unknown distribution :math:`p` over :math:`X \times Y`, the goal of binary
classification is to infer a hypothesis :math:`h:X \to Y` that minimizes the
expected risk

.. math::
  R_p(h)= \int\limits_{X \times Y} L_{0-1}(y,h(x)) \, \text{d}
  p(x,y) ,


where :math:`L_{0-1}(y,z)` is 0
if :math:`y=z` and 1 otherwise.

Support vector machines (SVMs, [CortesVapnik1995]_) transfer the input
data to a feature space and perform linear classification in that space.
For a positive semi-definite kernel function :math:`k:X \times X \to\mathbb{R}`, consider the feature space
:math:`\mathcal H_k = {\text{span} \{k(x, \cdot) \,|\, x \in X\}}` and
function class :math:`\mathcal H_k^b = \{f = g + b\,|\, g \in \mathcal H_k, b\in \mathbb{R}\}`. The decision boundary induced by the sign of a
function :math:`f \in \mathcal H_k^b` is an affine hyperplane in :math:`\mathcal H_k`.
1-Norm Soft Margin C-SVMs find a solution to

.. math::
       \underset{f \in\mathcal H_k^b}{\text{minimize}}  \frac{1}{2} \|f\|_2^2 + C \sum_{i=1}^\ell L_{\text{hinge}}(y_i, f(x_i))

with loss function
:math:`L_{\text{hinge}}(y,f(x))=\max\{0, 1-(2y-1)f(x)\}` for
:math:`Y=\{0,1\}`.
The parameter :math:`C >= 0`
controls the trade-off between reducing the empirical loss
:math:`L_{\text{hinge}}` and the complexity of the hypothesis :math:`\|.\|_2`
measured by its norm (neglecting the bias parameter :math:`b`).

Here we have adopted the Shark library convention of choosing
:math:`Y=\{0,1\}` instead of :math:`Y=\{-1,1\}`. The latter is the
common choice in the SVM literature. This explains the :math:`2y-1`
instead of a simple :math:`y` in the hinge loss definition.




Support Vector Machines in Shark
--------------------------------

For this tutorial the following include files are needed::


	#include <shark/Algorithms/Trainers/CSvmTrainer.h> // the C-SVM trainer
	#include <shark/Models/Kernels/GaussianRbfKernel.h> //the used kernel for the SVM
	#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h> //used for evaluation of the classifier
	#include <shark/Data/DataDistribution.h> //includes small toy distributions
	

Toy problem
^^^^^^^^^^^

In this tutorial, we consider an artificial binary benchmark classification
problem shipped with the Shark library::


		unsigned int ell = 500;     // number of training data point
		unsigned int tests = 10000; // number of test data points
		
		Chessboard problem; // artificial benchmark data
		ClassificationDataset training = problem.generateDataset(ell);
		ClassificationDataset test = problem.generateDataset(tests);
		



Model and learning algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To build an SVM, we need a :doxy:`KernelClassifier`  and an
:doxy:`CSvmTrainer`.

To define our model, we have to choose a kernel function.  Here we
consider the standard Gaussian/RBF kernel

.. math::

  k(x,z) = \exp(\gamma\|x-z\|^2)

by writing::


		double gamma = 0.5;         // kernel bandwidth parameter
		
		GaussianRbfKernel<> kernel(gamma); // Gaussian kernel
		

All kernels such as the :doxy:`GaussianRbfKernel` are derived from the
base class :doxy:`AbstractKernelFunction`.

Our model is thus a kernel classifier, which is
a linear classifier in feature space::


		KernelClassifier<RealVector> kc; // (affine) linear function in kernel-induced feature space
		

A `KernelClassifier` can be understood as an :doxy:`ArgMaxConverter` which converts the output
of a :doxy:`KernelExpansion` to a class label by choosing the index of the maximum output.
The `KernelExpansion` specifies a model from
:math:`\mathcal H_k = {\text{span} \{k(x, \cdot) \,|\, x \in X\}}` or
:math:`\mathcal H_k^b = \{f = g + b\,|\, g \in \mathcal H_k, b\in \mathbb{R}\}` 
depending on whether a bias is used or not.

Training the machine is done by::


		double C = 1000.0;          // regularization parameter
		bool bias = true;           // use bias/offset parameter
		
		CSvmTrainer<RealVector> trainer(&kernel, C, bias);
		

Here ``C`` denotes the regularization parameter (the SVM uses the 1-norm
penalty for target margin violations by default) and `bias` the inclusion of a bias term in the solver..
The Shark SVM training is inspired by [ChangLin2011]_
but uses unique features [GlasmachersIgel2006]_.

.. admonition:: Configuring the trainer further:

    The above lines construct a default SVM trainer with default
    settings for the underlying quadratic programming optimization
    task. In certain cases, the user may want more fine-grained
    control over the behaviour of the optimizer. For example,
    the memory cache size of the kernel matrix cache and the
    stopping criterion for the solver might be varied. Consider
    the following lines of code::

        CSvmTrainer<RealVector, double> trainer(&kernel, C);
        trainer.sparsify() = false;
        trainer.stoppingCondition().minAccuracy = 1e-6;
        trainer.setCacheSize( 0x1000000 );
        trainer.train(ke, training);
        std::cout << "Needed " << trainer.solutionProperties().seconds << " seconds to reach a dual of " << trainer.solutionProperties().value << std::endl;

    The first line uses one more template parameter in this alternative
    trainer declaration, requesting it to use ``double`` for the matrix
    cache internally (instead of the default ``float``). Note that this
    is only needed in very rare, mathematically sensitive cases.
    The second line sets the trainer to *not* discard non-support
    vectors from the solution kernel expansion after training
    (they are discarded by default). The third line sets the desired
    accuracy to a lower value (i.e., more strict value, implying longer
    optimization times) than the default of 1e-3. The fourth
    line reduces the cache size (counted in numbers of stored
    variables of the matrix cache type) from 512MB to 128MB (had we
    not passed the second template argument in the first line of this
    snippet, it would be a reduction from 256MB to 64MB). The fifth
    line is again identical to the above example. The last line
    illustrates the use of the :doxy:`solutionProperties()` method
    to access information about the optimization run after training.
    For more information on available options, see the documentation
    of :doxy:`AbstractSvmTrainer`, :doxy:`QpStoppingCondition`,
    and :doxy:`QpSolutionProperties` (as well as potentially of
    the particular SVM solver you are using, i.e., binary, multi-class,
    one-class, etc.).


Evaluating the model
^^^^^^^^^^^^^^^^^^^^

After training the model, we can evaluate it.  As a performance
measure, we consider the standard 0-1 loss :math:`L_{0-1}(y,z)`
which we apply to training and test data::


		ZeroOneLoss<unsigned int> loss; // 0-1 loss
		Data<unsigned int> output = kc(training.inputs()); // evaluate on training set
		double train_error = loss.eval(training.labels(), output);
		cout << "training error:\t" <<  train_error << endl;
		output = kc(test.inputs()); // evaluate on test set
		double test_error = loss.eval(test.labels(), output);
		cout << "test error:\t" << test_error << endl;
		


Full example program
--------------------

The full example program considered in this tutorial is :doxy:`CSvmTutorial.cpp`.
Another relevant example in the ``examples`` subdirectory is the SVM
model selection (see the next tutorial on :doc:`svmModelSelection`);

References
----------

.. [ChangLin2011] C.C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011.

.. [CortesVapnik1995] C. Cortes and V. Vapnik. Support-Vector
   Networks. Machine Learning, 20, 1995.

.. [GlasmachersIgel2006] T. Glasmachers and C. Igel. Maximum-Gain Working Set Selection for SVMs. Journal of Machine Learning Research 7, 1437-1466, 2006.
