Support Vector Machines — Classification

Support vectors for an artificial dataset (Fawcett & Provost's Data Science for Business).

Consider again the mortgage default dataset from the previous post. The decision boundary is not exact: one defaulter falls in the non-default region. A straight line that cleanly separates the classes is called a linear discriminant (or separating hyperplane in the case of multivariate data). 

A separating hyperplane does not necessarily exist. The example provided above (the combination of the data and the suggested boundary) is in some sense “nearly minimal” – only one point is mislabeled by the decision boundary. That is mostly an artifact of that particular dataset. It is not hard to conceive of examples for which the separation is “maximally” violated, so to speak (as in the spiral example below). 

A dataset (with 2 classes) for which a separating hyperplane does not exist (Least Squares Support Vector Machine, Wikipedia). Where would you draw the line?

At any rate, when a separating hyperplane exists, it is not necessarily unique (see below for an example derived from a modified mortgage default dataset). 

Non-uniqueness of separating hyperplanes (Fawcett & Provost's Data Science for Business).

Among all the separating hyperplanes, is it possible to find one which would be optimal? What might this look like? 

The maximum-margin hyperplane is the “best” separating hyperplane in the sense that it is the one at the center of the widest strip which separates the data points cleanly.

Maximal-margin separating hyperplane for an artificial dataset (Fawcett & Provost's Data Science for Business).

The objective is fairly simple to state: maximize the margin to find the optimal hyperplane.

The support vectors are those observations which lie closest to the separating hyperplane, on either side of the strip. Usually the number of support vectors is small, or at the very least substantially smaller than the number of observations. In this example, the margin is hard (meaning that we do not allow for observations within the strip); we will see how to allow for a soft margin below. 

Support vectors for an artificial dataset (Fawcett & Provost's Data Science for Business).
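To make this concrete, here is a minimal sketch (not from the original post) that fits a linear SVM on a small, hypothetical, linearly separable dataset with scikit-learn; the data values are made up for illustration, and a very large cost parameter C is used to approximate the hard margin described above.

    import numpy as np
    from sklearn.svm import SVC

    # hypothetical, linearly separable toy data with labels +/- 1
    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                  [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    # a very large C approximates a hard margin (no observations inside the strip)
    clf = SVC(kernel="linear", C=1e10)
    clf.fit(X, y)

    print(clf.support_vectors_)               # the support vectors
    print(clf.coef_, clf.intercept_)          # beta and beta_0 of the hyperplane
    print(2.0 / np.linalg.norm(clf.coef_))    # the margin M = 2 / ||beta||

Only the observations returned by support_vectors_ determine the fitted hyperplane; the remaining points could be removed without changing the boundary.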

Let \mathbf{x}_i (vector) represent the explanatory features of each observation, with y_i=\pm 1 (scalar) the known classification of each of n observations. The equation of a hyperplane is given by f(\mathbf{x})=\beta_0+\beta^{\!\top}\mathbf{x}=0, where \beta is the weight vector and \beta_0 is the bias.

There are infinitely many ways of expressing that equation – we scale it so that |\beta_0+\beta^{\!\top}\mathbf{x}|=1 for any eventual support vector \mathbf{x} in the available training set. This choice is known as the canonical hyperplane.
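As a quick illustration (with made-up numbers), any candidate hyperplane can be brought to canonical form by dividing \beta_0 and \beta by the smallest value of |\beta_0+\beta^{\!\top}\mathbf{x}_i| over the training set, so that the closest points (the eventual support vectors) satisfy the scaling above.

    import numpy as np

    # hypothetical hyperplane and training points, for illustration only
    beta, beta0 = np.array([1.0, 1.0]), -7.0
    X = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 5.0], [7.0, 8.0]])

    # rescale so that min_i |beta0 + beta^T x_i| = 1 (the canonical hyperplane)
    scale = np.min(np.abs(beta0 + X @ beta))
    beta_c, beta0_c = beta / scale, beta0 / scale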

From geometry, we know that the distance from any point \mathbf{z} to the canonical hyperplane is 

    \[\frac{|\beta_0+\beta^{\!\top}\mathbf{z}|}{\|\beta\|}.\]

For a support vector \mathbf{x}, this distance simply becomes \frac{1}{\|\beta\|}. The margin M is twice that distance: M=\frac{2}{\|\beta\|}.
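In code, the point-to-hyperplane distance and the margin are one-liners (a small sketch in the same notation; the function names are purely illustrative):

    import numpy as np

    def distance(z, beta, beta0):
        # distance from a point z to the hyperplane beta_0 + beta^T x = 0
        return abs(beta0 + beta @ z) / np.linalg.norm(beta)

    def margin(beta):
        # margin of the canonical hyperplane: M = 2 / ||beta||
        return 2.0 / np.linalg.norm(beta)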
Finding the parameters (\beta_0,\beta) which maximize M is equivalent to solving the following quadratic optimization problem:

    \[\arg\min_{\beta_0,\beta} \left\{\frac{1}{2}\|\beta\|^2:\mbox{ subject to } y_i(\beta_0+\beta^{\!\top}\mathbf{x}_i)\geq 1\mbox{ for each observation }\mathbf{x}_i\right\}.\]

This constrained quadratic optimization problem (QP) can be solved with the use of Lagrange multipliers. While this primal formulation does (in theory) yield the required parameter vector \beta, there can be computational hazards when the dimension of the feature space is high. Furthermore, it is not obvious from the form of the QP how to identify the support vectors before real data is plugged in. 
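As a sketch of what this looks like numerically, the primal QP can be handed to a generic convex solver such as cvxpy (purely illustrative; the toy data are made up and any QP solver would do):

    import numpy as np
    import cvxpy as cp

    # hypothetical, linearly separable toy data with labels +/- 1
    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                  [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
    y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
    n, p = X.shape

    beta, beta0 = cp.Variable(p), cp.Variable()

    # primal: minimize (1/2)||beta||^2 subject to y_i (beta_0 + beta^T x_i) >= 1
    objective = cp.Minimize(0.5 * cp.sum_squares(beta))
    constraints = [cp.multiply(y, X @ beta + beta0) >= 1]
    cp.Problem(objective, constraints).solve()

    print(beta.value, beta0.value)            # the maximum-margin hyperplane
    print(2.0 / np.linalg.norm(beta.value))   # the margin M = 2 / ||beta||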

According to the Representer Theorem, we can find a set of weights \alpha_i, i=1,\ldots,n, for which 

    \[\beta = \sum_{i=1}^n\alpha_iy_i\mathbf{x}_i\quad\mbox{and}\quad \sum_{i=1}^n\alpha_iy_i=0.\]

Technically speaking, we do not need to invoke the Representer Theorem in this case (the linear kernel case), as the weights \alpha_i are simply the Lagrange multipliers of the QP and the relations can be derived directly (this does not apply in the general case, however).

Substituting these relations in the primal formulation yields the dual formulation, which can be shown to be equivalent to the primal formulation while bypassing the computational issues:

    \[\arg\min_{\alpha} \left\{\frac{1}{2}\sum_{i,j=1}^n\alpha_i\alpha_jy_iy_j\mathbf{x}_i^{\!\top}\mathbf{x}_j-\sum_{i=1}^n\alpha_i:\mbox{ subject to } \alpha_i\geq 0\mbox{ and }\sum_{i=1}^n\alpha_iy_i=0\right\}.\]
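A corresponding sketch of the dual (same made-up toy data, cvxpy again standing in for a generic QP solver) recovers the weights \alpha_i, identifies the support vectors as the observations with non-zero \alpha_i, and rebuilds \beta from the relations above:

    import numpy as np
    import cvxpy as cp

    # same hypothetical toy data as in the primal sketch
    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                  [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
    y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
    n = len(y)

    alpha = cp.Variable(n)
    Z = y[:, None] * X   # rows are y_i x_i, so ||Z^T alpha||^2 = sum_ij alpha_i alpha_j y_i y_j x_i^T x_j

    # dual: minimize (1/2) sum_ij alpha_i alpha_j y_i y_j x_i^T x_j - sum_i alpha_i
    #       subject to alpha_i >= 0 and sum_i alpha_i y_i = 0
    objective = cp.Minimize(0.5 * cp.sum_squares(Z.T @ alpha) - cp.sum(alpha))
    constraints = [alpha >= 0, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()

    sv = np.where(alpha.value > 1e-6)[0]   # support vectors have non-zero weights
    beta = Z.T @ alpha.value               # beta = sum_i alpha_i y_i x_i
    beta0 = y[sv[0]] - X[sv[0]] @ beta     # from y_k (beta_0 + beta^T x_k) = 1 at a support vector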

In the dual formulation, the K support vectors \mathbf{x}_{i_k} are exactly those with non-zero weights \alpha_{i_k}. The corresponding decision function is defined by 

    \[T(\mathbf{x};\alpha)=\sum_{k=1}^K\alpha_{i_k}y_{i_k}\mathbf{x}_{i_k}^{\!\top}\mathbf{x}+\beta_0\]

which is scaled so that T(\mathbf{x}_{i_k};\alpha)=y_{i_k}\,(=\pm 1) for each support vector \mathbf{x}_{i_k}. Class assignment for any observation \mathbf{x} then proceeds as follows (sketched in code after the list):

  • if T(\mathbf{x};\alpha) > 0, then we assign \mathbf{x} to the class +1;
  • if T(\mathbf{x};\alpha) < 0, then we assign \mathbf{x} to the class -1.
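Putting the pieces together, here is a minimal sketch of the decision function and the sign-based class assignment; the arguments correspond to the support-vector quantities computed in the dual sketch above, and the names are purely illustrative.

    import numpy as np

    def decision(x, X_sv, y_sv, alpha_sv, beta0):
        # T(x; alpha) = sum_k alpha_k y_k x_k^T x + beta_0, summed over the support vectors
        return np.sum(alpha_sv * y_sv * (X_sv @ x)) + beta0

    def classify(x, X_sv, y_sv, alpha_sv, beta0):
        # assign x to class +1 if T > 0, and to class -1 if T < 0
        return 1 if decision(x, X_sv, y_sv, alpha_sv, beta0) > 0 else -1

For instance, classify(np.array([5.0, 5.0]), X[sv], y[sv], alpha.value[sv], beta0) returns the predicted label for a new observation in the toy example.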

This, of course, represents an idealized situation: linearly separable data with no tolerance for misclassification errors. A number of variants have been presented to handle more realistic scenarios.


Support Vector Machines — Posts
  1. Overview [previous]
  2. Classification [current]
  3. Kernel Transformations [next]
  4. Non-Separable Data
  5. Further Considerations
  6. Examples
  7. References