Support Vector Machines – Classification

Support vectors for an artificial dataset (Fawcett & Provost's Data Science for Business).

Consider again the mortgage default dataset from the previous post. The decision boundary is not exact: one defaulter falls in the non-default region. A straight line that cleanly separates the classes is called a linear discriminant (or separating hyperplane in the case of multivariate data).

A separating hyperplane does not necessarily exist. The example above (the combination of the data and the suggested boundary) is nearly separable, in the sense that only one point is mis-labeled by the decision boundary; that is largely an artifact of this particular dataset. It is not hard to conceive of datasets for which the separation is violated far more severely (as in the spiral example below).

A dataset (with 2 classes) for which a separating hyperplane does not exist (Least Squares Support Vector Machine, Wikipedia). Where would you draw the line?

At any rate, when a separating hyperplane exists, it is not necessarily unique (see below for an example derived from a modified mortgage default dataset).

Non-uniqueness of separating hyperplanes (Fawcett & Provost's Data Science for Business).

Among all the separating hyperplanes, is it possible to find one which would be optimal? What might this look like?

The maximum-margin hyperplane is the “best” separating hyperplane in the sense that it is the one at the center of the largest strip which separates the data points cleanly.

Maximal-margin separating hyperplane for an artificial dataset (Fawcett & Provost's Data Science for Business).

The objective is fairly simple to state: maximize the margin to find the optimal hyperplane.

The support vectors are those observations which lie closest to the separating hyperplane, on either side. Usually the number of support vectors is small, or at the very least substantially smaller than the number of observations. In this example, the margin is hard (meaning that we do not allow for observations within the strip); we will see how to allow for a soft margin later in the series.

Support vectors for an artificial dataset (Fawcett & Provost's Data Science for Business).
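
This is also what off-the-shelf solvers report; the sketch below is a minimal illustration (using scikit-learn and a made-up toy dataset, neither of which appears in the original post) that fits a linear SVM with a very large cost parameter to mimic a hard margin and prints the resulting support vectors.

    import numpy as np
    from sklearn.svm import SVC

    # Made-up, linearly separable toy data (two features, two classes).
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
                  [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
    y = np.array([1, 1, 1, -1, -1, -1])

    # A very large C approximates a hard margin (no observations inside the strip).
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    print("support vectors:\n", clf.support_vectors_)  # usually a small subset
    print("weight vector:", clf.coef_[0])
    print("bias:", clf.intercept_[0])

The very large C is only a stand-in for a hard margin; the soft-margin variant discussed later in the series makes that trade-off explicit.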

Let \mathbf{x}_i (a vector) represent the explanatory features of the i-th observation, and let y_i=\pm 1 (a scalar) denote its known class, for each of the n observations. The equation of a hyperplane is given by f(\mathbf{x})=\beta_0+\beta^{\!\top}\mathbf{x}=0, where \beta is the weight vector and \beta_0 is the bias.

There are infinitely many ways of expressing that equation; we scale it so that |\beta_0+\beta^{\!\top}\mathbf{x}|=1 for any support vector \mathbf{x} in the available training set. This choice is known as the canonical hyperplane.
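
As a quick illustration (with made-up numbers), suppose the hyperplane 2x_1+2x_2-4=0 separates a training set whose closest point is \mathbf{x}=(1.5,1.5), so that |\beta_0+\beta^{\!\top}\mathbf{x}|=|-4+3+3|=2. Dividing the equation through by 2 gives the canonical form

    \[x_1+x_2-2=0,\qquad |\beta_0+\beta^{\!\top}\mathbf{x}|=|-2+1.5+1.5|=1.\]

Only the scale changes; the hyperplane itself, and hence the classification, is unaffected.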

From geometry, we know that the distance from any point \mathbf{z} to the canonical hyperplane is

    \[\frac{|\beta_0+\beta^{\!\top}\mathbf{z}|}{\|\beta\|}.\]

For a support vector \mathbf{x}, this distance simply becomes \frac{1}{\|\beta\|}. The margin M is twice that distance: M=\frac{2}{\|\beta\|}.
Finding the parameters (\beta_0,\beta) which maximize M is equivalent to solving the following quadratic optimization problem:

    \[\arg\min_{\beta_0,\beta} \left\{\frac{1}{2}\|\beta\|^2:\mbox{ subject to } y_i(\beta_0+\beta^{\!\top}\mathbf{x}_i)\geq 1\mbox{ for each observation }\mathbf{x}_i\right\}.\]

This constrained quadratic optimization problem (QP) can be solved with the use of Lagrange multipliers. While this primal formulation does (in theory) yield the required parameter vector \beta, there can be computational hazards when the dimension of the feature space is high. Furthermore, it is not obvious from the form of the QP how to identify the support vectors before real data is plugged in.
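
To make the primal formulation concrete, here is a minimal sketch (with a made-up toy dataset and scipy's general-purpose SLSQP solver, neither of which appears in the original post) that solves the hard-margin QP directly; a dedicated SVM implementation would work with the dual instead.

    import numpy as np
    from scipy.optimize import minimize

    # Made-up, linearly separable toy data (two features, two classes).
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
                  [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
    y = np.array([1, 1, 1, -1, -1, -1])

    # Decision variables packed as w = (beta_0, beta_1, beta_2).
    def objective(w):
        beta = w[1:]
        return 0.5 * beta @ beta            # (1/2) ||beta||^2

    def margin_constraints(w):              # y_i (beta_0 + beta^T x_i) - 1 >= 0
        beta0, beta = w[0], w[1:]
        return y * (beta0 + X @ beta) - 1.0

    res = minimize(objective, x0=np.zeros(3), method="SLSQP",
                   constraints=[{"type": "ineq", "fun": margin_constraints}])
    beta0, beta = res.x[0], res.x[1:]
    print("beta_0:", beta0, "beta:", beta, "margin:", 2 / np.linalg.norm(beta))

At the optimum, the support vectors are exactly the observations whose constraint holds with equality.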

According to the Representer Theorem, we can find a set of weights \alpha_i, i=1,\ldots,n for which

    \[\beta = \sum_{i=1}^n\alpha_iy_i\mathbf{x}_i\quad\mbox{and}\quad \sum_{i=1}^n\alpha_iy_i=0.\]

Technically speaking we do not need to invoke the Representer Theorem in this case (the linear kernel case) as the weights \alpha_i are simply the Lagrange multipliers of the QP and the relations can be derived directly (this does not apply in the general case, however).
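
For completeness, here is that direct derivation (a routine computation not spelled out in the post): the Lagrangian of the primal problem is

    \[L(\beta_0,\beta,\alpha)=\frac{1}{2}\|\beta\|^2-\sum_{i=1}^n\alpha_i\left[y_i(\beta_0+\beta^{\!\top}\mathbf{x}_i)-1\right],\qquad \alpha_i\geq 0,\]

and setting its partial derivatives with respect to \beta and \beta_0 to zero yields

    \[\beta-\sum_{i=1}^n\alpha_iy_i\mathbf{x}_i=0\quad\mbox{and}\quad -\sum_{i=1}^n\alpha_iy_i=0,\]

which are precisely the two relations above.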

Substituting the relations in the primal formulation yields the dual formulation, which can be shown to be equivalent to the primal formulation while by-passing the computational issues:

    \[\arg\min_{\alpha} \left\{\frac{1}{2}\sum_{i,j=1}^n\alpha_i\alpha_jy_iy_j\mathbf{x}_i^{\!\top}\mathbf{x}_j-\sum_{i=1}^n\alpha_i:\mbox{ subject to } \sum_{i=1}^n\alpha_iy_i=0\mbox{ and }\alpha_i\geq 0\mbox{ for each }i\right\}.\]

In the dual formulation, the K support vectors \mathbf{x}_{i_k} are exactly those with non-zero weights \alpha_{i_k}. The corresponding decision function is defined by

    \[T(\mathbf{x};\alpha)=\sum_{k=1}^K\alpha_{i_k}y_{i_k}\mathbf{x}_{i_k}^{\!\top}\mathbf{x}+\beta_0\]

for any observation \mathbf{x}, scaled so that T(\mathbf{x}_{i_k};\alpha)=y_{i_k}(=\pm 1) for each support vector \mathbf{x}_{i_k}. Class assignment for \mathbf{x} then proceeds as follows (a small numerical sketch follows the list):

  • if T(\mathbf{x};\alpha) > 0, then we assign \mathbf{x} to the class +1;
  • if T(\mathbf{x};\alpha) < 0, then we assign \mathbf{x} to the class -1.
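
Putting the dual pieces together, here is a minimal numerical sketch (again with made-up data and scipy's general-purpose solver rather than a dedicated SVM library): it solves the dual QP, identifies the support vectors as the observations with non-zero \alpha_i, and evaluates the decision function T(\mathbf{x};\alpha).

    import numpy as np
    from scipy.optimize import minimize

    # Same made-up, linearly separable toy data as before.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
                  [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
    y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
    n = len(y)

    # Dual objective: (1/2) sum_ij alpha_i alpha_j y_i y_j x_i'x_j - sum_i alpha_i
    Q = (y[:, None] * X) @ (y[:, None] * X).T
    dual = lambda a: 0.5 * a @ Q @ a - a.sum()

    res = minimize(dual, x0=np.zeros(n), method="SLSQP",
                   bounds=[(0, None)] * n,                    # alpha_i >= 0
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    alpha = res.x

    sv = alpha > 1e-6                                 # non-zero weights
    beta = ((alpha * y)[:, None] * X).sum(axis=0)     # beta = sum alpha_i y_i x_i
    beta0 = np.mean(y[sv] - X[sv] @ beta)             # so that T(x_ik) = y_ik

    T = lambda x: x @ beta + beta0                    # decision function
    print("support vectors:\n", X[sv])
    print("assigned classes:", np.sign(T(X)))         # +1 / -1

The weight vector recovered here should match the one obtained from the primal sketch above, up to numerical tolerance.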

This, of course, represents an idealized situation: linearly separable data with no tolerance for misclassification errors. A number of variants have been presented to handle more realistic scenarios.


Support Vector Machines – Posts
  1. Overview [previous]
  2. Classification [current]
  3. Kernel Transformations [next]
  4. Non-Separable Data
  5. Further Considerations
  6. Examples
  7. References