Support Vector Machines (SVMs) are a family of supervised learning methods that determine the “best” curve or surface separating the data into disjoint classes.
In an ideal setting, the classes are clearly separable (using an appropriate classifier). Of course, most data in the wild cannot be separated clearly: “best” in this case refers to minimizing the cost of making an error in “messy” cases. SVMs are especially well-suited to these problems, in part due to their flexibility and ability to handle non-linear decision surfaces; this contributes to their current popularity.
In this article, we will briefly discuss the main ideas behind various SVM methods, as well as some of their applications and limitations. (As with all machine learning methods, SVMs do not exist in a vacuum. Enough data has to first be collected, cleaned up, and processed. Ethical issues must be considered. In all likelihood, dimension reduction and scaling will be required to get the most out of the algorithms. The models must be trained, tested, and validated in order to make predictions that avoid over-fitting. These will be the focus of other articles.)
Consider an artificial dataset which contains information about three features: the age of a customer, their savings balance, and whether or not they defaulted on a mortgage loan (this example is derived from Fawcett & Provost’s Data Science for Business); the mortgage default variable is categorical (dot for default, plus sign for no default), the explanatory variables are numerical.
A cursory look at the data suggests that (in this dataset, at least) younger borrowers with smaller savings tend to default on their mortgages, whereas older borrowers with larger savings tend not to, although there is some overlap.
Let us forget about the unrealistic nature of this small dataset for the moment – these variables could be replaced by height, weight and reported gender, for instance – and concentrate on the classifying task at hand. Can we come up with a rule (or a set of rules) that would help us make a prediction for a set of new observations, assuming that the data at hand is somehow representative of the general situation?
One possible decision tree is shown below.
The decision rule derived from this tree is simple:
- borrowers with a savings balance below $50,000 who were 50 years old or younger defaulted on their mortgage in 100% of the cases;
- borrowers with a savings balance above $50,000 who were 45 years old or older defaulted on their mortgage in 0% of the cases;
- borrowers with a savings balance below $50,000 who were 50 years old or older defaulted on their mortgage in 33% of the cases; and
- borrowers with a savings balance above $50,000 who were 45 years old or younger defaulted on their mortgage in 57% of the cases.
Two of the leaves are pure (meaning that all instances within the corresponding quadrant belong to the same category), but the risk of misclassifying the data is fairly high in the other two quadrants.
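A rule set like the one above can be recovered by fitting a shallow decision tree. The sketch below uses a small synthetic (balance, age) dataset standing in for the one discussed in the text; the cluster boundaries, labels, and sample sizes are illustrative assumptions, not the actual data.

```python
# A minimal sketch: fitting a depth-2 decision tree to a made-up
# (savings balance, age) dataset with a default/no-default label.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# 1 = defaulted, 0 = did not default (all values are illustrative)
X = np.vstack([
    rng.uniform([10_000, 20], [45_000, 48], (20, 2)),   # low savings, younger
    rng.uniform([55_000, 45], [120_000, 70], (20, 2)),  # high savings, older
])
y = np.array([1] * 20 + [0] * 20)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.score(X, y))  # training accuracy
```

Because axis-parallel splits are all a decision tree can use, each internal node of the fitted tree corresponds to one of the balance or age thresholds in the rules above.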
Can we do better? Without access to more features (variables), this is about as effective as a decision tree classifier can get (other decision trees exist, but they are at best marginally more effective). For this particular dataset, separating curves which are parallel to the axes are not ideal. To be sure, we could create an intricate decision tree with a large number of axis-parallel separating lines, but that number would be greater than n, where n is the number of features, which is undesirable for a well-fitted tree.
It is easy to draw a decision curve which improves on the effectiveness of the decision tree:
The decision rule derived from this linear boundary is simpler than the decision tree rule:
- borrowers for whom the pair (balance, age) falls below the decision boundary defaulted on their mortgage in 100% of the cases, while
- borrowers for whom the pair (balance, age) falls above the decision boundary defaulted on their mortgage in 7% of the cases.
A single borrower is misclassified by the simpler decision rule; for this dataset, the decision boundary method is a better classifier than the decision tree (using any reasonable metric as a measuring stick). We could easily improve the accuracy to 100% using non-linear curves, but aiming for perfect accuracy at the training level can easily lead to serious over-fitting issues.
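A single oblique line like this can be found with a linear SVM. The sketch below uses the same kind of made-up (balance, age) data as before; note the feature scaling step, which matters here because balance and age live on very different scales (an assumption worth stating, since unscaled features distort SVM margins).

```python
# A sketch of fitting one linear decision boundary with a linear-kernel SVM
# on illustrative (savings balance, age) data; not the article's actual dataset.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([
    rng.uniform([10_000, 20], [45_000, 48], (20, 2)),   # defaulted
    rng.uniform([55_000, 45], [120_000, 70], (20, 2)),  # did not default
])
y = np.array([1] * 20 + [0] * 20)

# scale first, then fit: the SVM sees standardized features
clf = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X, y)
print(clf.score(X, y))  # training accuracy of the single linear boundary
```

Unlike the axis-parallel tree splits, the fitted boundary is free to tilt, which is exactly what lets one line do the work of several tree rules.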
Building on the decision boundary approach, SVMs provide a protocol to train classifiers, using non-linear hyper-surfaces as required. In contrast with decision trees and other “classical” classifiers, SVMs are trained on a small subset of the available observations, which can prove useful when dealing with large datasets.
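The claim that a trained SVM depends on only a small subset of the observations can be seen directly: after fitting, the model keeps just its support vectors. The sketch below uses scikit-learn's `make_moons` toy data and an RBF kernel as stand-ins; the dataset and parameters are illustrative choices, not anything from the article.

```python
# A sketch: a non-linear (RBF-kernel) SVM on a toy two-moons dataset,
# showing that the trained model retains only a subset of the points.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# the decision surface is determined by the support vectors alone
print(len(clf.support_vectors_), "support vectors out of", len(X), "points")
```

Predictions on new observations involve only these support vectors, which is what makes SVMs attractive for large datasets and for non-linear decision surfaces alike.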
Among other problems, SVM methods have been applied to:
- text categorization
- image classification
- handwriting recognition
- smoothing and regression
- outlier detection