Consider again the mortgage default dataset from the previous post. The decision boundary is not exact: one defaulter falls in the non-default region. A straight line that cleanly separates the classes is called a linear discriminant (or separating hyperplane in the case of multivariate data).
A separating hyperplane does not necessarily exist. The example provided above (the combination of the data and the suggested boundary) is in some sense “nearly minimal” – only one point is mislabeled by the decision boundary, which is mostly an artifact of that particular dataset. It is not hard to conceive of examples for which the separation is “maximally” violated, so to speak (as in the spiral example below).
At any rate, when a separating hyperplane exists, it is not necessarily unique (see below for an example derived from a modified mortgage default dataset).
Among all the separating hyperplanes, is it possible to find one which would be optimal? What might this look like?
The maximum-margin hyperplane is the “best” separating hyperplane in the sense that it is the one at the center of the largest strip which separates the data points cleanly.
The objective is fairly simple to state: maximize the margin to find the optimal hyperplane.
The support vectors are those observations which are closest to the margin on either side. Usually the number of support vectors is small, or at the very least substantially smaller than the number of observations. In this example, the margin is hard (meaning that we do not allow for observations within the strip); we will see how to allow for a soft margin below.
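This behaviour is easy to observe with standard tooling. Below is a minimal sketch (not the mortgage default dataset – the two Gaussian clouds are an assumption for illustration) that fits a linear SVM with a very large cost parameter, which approximates a hard margin, and inspects the support vectors via scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian clouds: a linearly separable toy dataset.
X = np.vstack([rng.normal(loc=-2.0, scale=0.5, size=(20, 2)),
               rng.normal(loc=+2.0, scale=0.5, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# A very large C approximates the hard margin (no observations in the strip).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.n_support_)        # number of support vectors per class -- typically few
print(clf.support_vectors_)  # the observations closest to the margin
```

Note that the count of support vectors is much smaller than the 40 observations, as expected.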
Let $\mathbf{x}_i \in \mathbb{R}^p$ (vector) represent the explanatory features of each observation, with $y_i \in \{-1, +1\}$ (scalar) the known classification of each of the $N$ observations. The equation of a hyperplane is given by $\mathbf{w}^{\top}\mathbf{x} + b = 0$, where $\mathbf{w}$ is the weight vector and $b$ is the bias.
There are infinitely many ways of expressing that equation – we scale it so that $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) = 1$ for any eventual support vector $\mathbf{x}_i$ in the available training set. This choice is known as the canonical hyperplane.
From geometry, we know that the distance from any point $\mathbf{z}$ to the canonical hyperplane is
$$d(\mathbf{z}) = \frac{|\mathbf{w}^{\top}\mathbf{z} + b|}{\|\mathbf{w}\|}.$$
For a support vector $\mathbf{x}_i$, this distance simply becomes $\frac{1}{\|\mathbf{w}\|}$. The margin is twice that distance: $M = \frac{2}{\|\mathbf{w}\|}$.
Finding the parameters which maximize $M = \frac{2}{\|\mathbf{w}\|}$ is equivalent to solving the following quadratic optimization problem:
$$\min_{\mathbf{w},\, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, N.$$
This constrained quadratic optimization problem (QP) can be solved with the use of Lagrange multipliers. While this primal formulation does (in theory) yield the required parameter vector $\mathbf{w}$, there can be computational hazards when the dimension of the feature space is high. Furthermore, it is not obvious from the form of the QP how to identify the support vectors before real data is plugged in.
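To make the primal formulation concrete, here is a small numerical sketch that feeds the QP directly to a generic constrained solver (scipy's SLSQP); the four-point dataset is an illustrative assumption, not the mortgage data:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy dataset: two points per class, linearly separable.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# Primal objective: (1/2) ||w||^2, with params = (w_1, w_2, b).
def objective(params):
    w = params[:2]
    return 0.5 * w @ w

# One margin constraint per observation: y_i (w . x_i + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (p[:2] @ X[i] + p[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)
# Every margin constraint should hold, up to solver tolerance.
print(np.all(y * (X @ w + b) >= 1 - 1e-6))
```

This works on a tiny example, but it illustrates the point made above: with many features (and many constraints), the generic primal solve becomes cumbersome, and nothing in the output flags the support vectors directly.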
According to the Representer Theorem, we can find a set of weights $\alpha_1, \dots, \alpha_N$ for which
$$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i.$$
Technically speaking, we do not need to invoke the Representer Theorem in this case (the linear kernel case), as the weights $\alpha_i$ are simply the Lagrange multipliers of the QP and the relations can be derived directly (this does not apply in the general case, however).
Substituting these relations into the primal formulation yields the dual formulation, which can be shown to be equivalent to the primal formulation while bypassing the computational issues:
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^{\top}\mathbf{x}_j \quad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{N} \alpha_i y_i = 0.$$
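The dual can also be solved with a generic solver. In the sketch below (same assumed four-point dataset as a stand-in for real data), the matrix $G_{ij} = y_i y_j \, \mathbf{x}_i^{\top}\mathbf{x}_j$ is built explicitly, the negated dual objective is minimized under the dual constraints, and $\mathbf{w}$ is then recovered from the weights:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy dataset (illustrative only).
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# G_ij = y_i y_j x_i . x_j -- only inner products of observations appear.
G = (y[:, None] * X) @ (y[:, None] * X).T

# Minimize the negative of the dual objective.
def neg_dual(a):
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0, None)] * len(y),                      # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                 # recover w = sum_i alpha_i y_i x_i
sv = alpha > 1e-4                   # support vectors: non-zero weights
b = np.mean(y[sv] - X[sv] @ w)      # bias recovered from the support vectors
print("alpha =", np.round(alpha, 4))
print("w =", w, "b =", b)
```

Note that the data enter only through the inner products $\mathbf{x}_i^{\top}\mathbf{x}_j$, which is precisely what makes the dual attractive in high-dimensional feature spaces.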
In the dual formulation, the support vectors are exactly those observations $\mathbf{x}_i$ with non-zero weights $\alpha_i$. The corresponding decision function is defined by
$$T(\mathbf{z}) = \mathbf{w}^{\top}\mathbf{z} + b = \sum_{i=1}^{N} \alpha_i y_i \, \mathbf{x}_i^{\top}\mathbf{z} + b,$$
which is scaled so that $T(\mathbf{x}_i) = y_i$ for each support vector $\mathbf{x}_i$. Class assignment for any observation $\mathbf{z}$ proceeds as follows:
- if $T(\mathbf{z}) < 0$, then we assign $\mathbf{z}$ to the class $y = -1$;
- if $T(\mathbf{z}) > 0$, then we assign $\mathbf{z}$ to the class $y = +1$.
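The sign rule can be checked against a library implementation: in the sketch below (again with an assumed toy dataset), the sign of scikit-learn's `decision_function` reproduces the classifier's own `predict` output:

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy dataset; a very large C approximates the hard margin.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Two new observations, one near each cloud.
z = np.array([[0.5, 0.2], [3.5, 2.8]])
T = clf.decision_function(z)        # the values T(z)
assigned = np.where(T < 0, -1, 1)   # the sign rule from the text
print(T, assigned)
```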
This, of course, represents an idealized situation: linearly separable data with no tolerance for misclassification errors. A number of variants have been proposed to handle more realistic scenarios.