Support Vector Machines – Overview

Non-linear SVM for artificial dataset (personal file).

Support Vector Machines (SVMs) are a family of supervised learning methods that determine the “best” curve or surface separating the data into disjoint classes.

In an ideal setting, the classes are cleanly separable (by an appropriate classifier). Of course, most data in the wild cannot be separated so cleanly: “best” then refers to minimizing the cost of classification errors in these “messy” cases. SVMs are especially well-suited to such problems, in part due to their flexibility and their ability to handle non-linear decision surfaces; this contributes to their current popularity.

In this article, we will briefly discuss the main ideas behind various SVM methods, as well as some of their applications and limitations. (As with all machine learning methods, SVMs do not exist in a vacuum: enough data must first be collected, cleaned, and processed; ethical issues must be considered; in all likelihood, dimension reduction and scaling will be required to get the most out of the algorithms; and the models must be trained, tested, and validated in order to make predictions that avoid over-fitting. These topics will be the focus of other articles.)


Overview

Consider an artificial dataset containing information about three features: the age of a customer, their savings balance, and whether or not they defaulted on a mortgage loan (this example is derived from Fawcett & Provost’s Data Science for Business). The mortgage default variable is categorical (dot for default, plus sign for no default); the explanatory variables are numerical.

Artificial dataset with 2 categories (Fawcett & Provost’s Data Science for Business).

A cursory look at the data suggests that (in this dataset, at least) younger borrowers with smaller savings tend to default on their mortgages, whereas older borrowers with larger savings tend not to, although there is some overlap between the two groups.

Let us forget about the unrealistic nature of this small dataset for the moment – these variables could be replaced by height, weight, and reported gender, for instance – and concentrate on the classification task at hand. Can we come up with a rule (or a set of rules) that would help us make predictions for new observations, assuming that the data at hand is somehow representative of the general situation?
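To make the discussion concrete, here is a minimal sketch of what such a dataset could look like in Python. The group means, spreads, and sample sizes below are invented for illustration; they are not the values underlying the figures.

```python
# Synthetic stand-in for the artificial dataset: age (years), savings
# balance (dollars), and a binary mortgage default indicator. All
# distribution parameters are assumptions made for this sketch.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 30  # observations per group (assumed)

# Defaulters skew younger with smaller balances; non-defaulters skew older
# with larger balances, with some overlap between the two groups.
defaulters = pd.DataFrame({
    "age": rng.normal(35, 8, n).round(),
    "balance": rng.normal(30_000, 15_000, n).round(-2),
    "default": 1,
})
non_defaulters = pd.DataFrame({
    "age": rng.normal(50, 8, n).round(),
    "balance": rng.normal(70_000, 15_000, n).round(-2),
    "default": 0,
})
data = pd.concat([defaulters, non_defaulters], ignore_index=True)
print(data.sample(5, random_state=0))
```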

One possible decision tree is shown below.

Classification of artificial dataset, using a decision tree (Fawcett & Provost’s Data Science for Business).

The decision rule derived from this tree is simple:

  • borrowers with a savings balance below $50,000 who were 50 years old or younger defaulted on their mortgage in 100% of the cases;
  • borrowers with a savings balance above $50,000 who were 45 years old or older defaulted on their mortgage in 0% of the cases;
  • borrowers with a savings balance below $50,000 who were older than 50 defaulted on their mortgage in 33% of the cases, and
  • borrowers with a savings balance above $50,000 who were younger than 45 defaulted on their mortgage in 57% of the cases.

Two of the leaves are pure (meaning that all instances in the corresponding quadrant belong to the same category), but the risk of misclassifying the data is somewhat high in the other two quadrants.
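For readers who want to experiment, a tree of this shape can be fit in a few lines with scikit-learn. The sketch below assumes the synthetic `data` frame from the earlier snippet; the learned thresholds depend on that synthetic data, so they will not match the 50,000/50/45 splits of the figure exactly.

```python
# Fit an axis-parallel decision tree analogous to the one in the figure.
from sklearn.tree import DecisionTreeClassifier, export_text

X = data[["balance", "age"]]
y = data["default"]

# max_depth=2 mirrors the figure's two-split structure: one split on
# balance, followed by one split on age in each branch.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["balance", "age"]))
```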

Can we do better? Without access to more features (variables), this is about as effective a classifier as a decision tree can get (other decision trees exist, but they are marginally more effective, at best). For this particular dataset, separating curves that are parallel to the axes are not ideal. To be sure, we could create an intricate decision tree with a large number of axis-parallel separating lines – possibly more than 2^m of them, where m is the number of features – but such complexity is undesirable in a well-fitted tree.

It is easy to draw a decision curve which improves on the effectiveness of the decision tree:

Classification of artificial dataset, using a linear decision boundary (Fawcett & Provost’s Data Science for Business).

The decision rule derived from this linear boundary is simpler than the decision tree rule:

  • borrowers for whom the pair (balance, age) falls below the decision boundary defaulted on their mortgage in 100% of the cases, while
  • borrowers for whom the pair (balance, age) falls above the decision boundary defaulted on their mortgage in 7% of the cases.

A single borrower is misclassified by this simpler decision rule; for this dataset, the decision boundary is a better classifier than the decision tree (using any reasonable metric as a measuring stick). We could easily improve the accuracy to 100% using non-linear curves, but aiming for perfect accuracy at the training stage can easily lead to serious over-fitting issues.
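A single linear boundary of this kind can be obtained with a linear support vector classifier, previewing the methods of the next post. The sketch below continues with the `X` and `y` objects defined above; any linear classifier (logistic regression, say) would illustrate the same idea.

```python
# Fit a single linear decision boundary to the same data.
from sklearn.svm import SVC

clf = SVC(kernel="linear").fit(X, y)

# For a linear kernel, the boundary is the line
#   w[0]*balance + w[1]*age + b = 0.
w, b = clf.coef_[0], clf.intercept_[0]
print(f"boundary: {w[0]:.5f}*balance + {w[1]:.5f}*age + {b:.2f} = 0")
print("training accuracy:", clf.score(X, y))
```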


Building on the decision boundary approach, SVMs provide a protocol for training classifiers, using non-linear hyper-surfaces as required. In contrast with decision trees and other “classical” classifiers, the fitted SVM depends only on a small subset of the available observations – the support vectors – which can prove useful when dealing with large datasets.
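Both points can be seen directly in code. Continuing with the objects defined above, this sketch fits a non-linear (RBF-kernel) SVM and reports how many observations actually pin down the fitted classifier.

```python
# A non-linear SVM: the RBF kernel produces a curved decision surface.
from sklearn.svm import SVC

nonlinear = SVC(kernel="rbf", gamma="scale").fit(X, y)

# Only the support vectors determine the decision surface; the remaining
# training points could be removed without changing the fitted classifier.
print("support vectors:", nonlinear.support_vectors_.shape[0],
      "out of", len(X), "training points")
```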

Among other problems, SVM methods have been applied to:

  • text categorization
  • image classification
  • handwriting recognition
  • smoothing and regression
  • outlier detection

Support Vector Machines – Posts
  1. Overview [current]
  2. Classification [next]
  3. Kernel Transformations
  4. Non-Separable Data
  5. Further Considerations
  6. Examples
  7. References