Support Vector Machines – Kernel Transformations

A non-linear separator, with associated higher-dimensional transformation (J. Watson, Support Vector Machine Tutorial).

The best separating strip may not be linear, especially when the underlying structure of the data is complex.


The solution is “simple”: we devise a transformation \Phi from the initial feature space to a higher-dimensional space, and we fit a linear SVM to the transformed data \Phi(\mathbf{x}_i) (see Watson’s example above).
We thus solve

    \[\arg\min_{\alpha} \left\{\frac{1}{2}\sum_{i,j=1}^n\alpha_i\alpha_jy_iy_j\Phi(\mathbf{x}_i)^{\!\top}\Phi(\mathbf{x}_j)-\sum_{i=1}^n\alpha_i:\mbox{ subject to } \sum_{i=1}^n\alpha_iy_i=0,\ \alpha_i\geq 0\right\}\]

to get a set of support vectors \{\mathbf{x}_{i_k}\} with non-zero weights \{\alpha_{i_k}\} and a decision function

    \[T(\mathbf{x};\alpha)=\sum_{k=1}^M\alpha_{i_k}y_{i_k}\Phi(\mathbf{x}_{i_k})^{\!\top}\Phi(\mathbf{x})+\beta_0,\]

which is scaled as in the previous post. Class assignment for \mathbf{x} proceeds exactly as in the case of the simple linear kernel given by the identity transformation \Phi(\mathbf{x})=\mathbf{x}.
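
In code, evaluating this decision function amounts to the following (a minimal sketch of my own in Python/NumPy, not taken from the post; the function name and argument layout are illustrative):

    import numpy as np

    def decide(x, support_vectors, weights, labels, phi, beta0):
        # Evaluate T(x; alpha) = sum_k alpha_k y_k Phi(x_k)^T Phi(x) + beta_0
        # and return the predicted class in {-1, +1}.
        score = beta0
        for x_k, a_k, y_k in zip(support_vectors, weights, labels):
            score += a_k * y_k * np.dot(phi(x_k), phi(x))
        return np.sign(score)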

Not every transformation \Phi is allowable, unfortunately: in order for the associated QP to remain convex and solvable, the kernel defined by the inner product \Phi(\mathbf{x})^{\!\top}\Phi(\mathbf{w}) must be positive semi-definite (Mercer's condition). Standard families of transformations do satisfy this condition, however.
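
A quick numerical sanity check along these lines (a sketch of my own; testing one Gram matrix is only a necessary condition on that particular sample, not a proof that the kernel is admissible in general):

    import numpy as np

    def gram_is_psd(X, kernel, tol=1e-10):
        # Build the Gram matrix K_ij = kernel(x_i, x_j) on the sample X and
        # verify that no eigenvalue is (numerically) negative.
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
        return bool(np.all(np.linalg.eigvalsh(K) >= -tol))  # eigvalsh: K is symmetric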

How does one choose an appropriate \Phi? Prior knowledge of the data structure, including hitting on the right level of complexity, comes in handy. Would you have been able to recognize that

    \[\Phi(x_1,x_2)=(z_1,z_2,z_3):=(x_1^2,\sqrt{2}x_1x_2,x_2^2)\]

was the “right” transformation for the data presented in the image above? What would be an appropriate transformation for the dataset below?
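
Before turning to that second dataset, here is what the quadratic map above does to ring-shaped data (a small sketch of my own using NumPy and scikit-learn; the sampled rings are artificial, not the data from the image): the two classes cannot be separated by a line in the plane, but they can after applying \Phi, since z_1+z_3=x_1^2+x_2^2.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    # Two rings: class -1 inside the unit circle, class +1 well outside it.
    radius = np.concatenate([rng.uniform(0.0, 0.8, 100), rng.uniform(1.2, 2.0, 100)])
    angle = rng.uniform(0.0, 2.0 * np.pi, 200)
    X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
    y = np.concatenate([-np.ones(100), np.ones(100)])

    def phi(X):
        # Phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2)
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

    # A linear SVM struggles in the original plane ...
    print(LinearSVC(C=1.0, max_iter=20000).fit(X, y).score(X, y))
    # ... but separates the transformed data perfectly.
    print(LinearSVC(C=1.0, max_iter=20000).fit(phi(X), y).score(phi(X), y))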

Kernel transformation on artificial dataset (image from Wikipedia).

Obviously, selecting the perfect \Phi is not always easy, or even feasible. Thankfully, there are kernel functions K(\mathbf{x},\mathbf{w}) that generalize the notion of the inner product \Phi(\mathbf{x})^{\!\top}\Phi(\mathbf{w}) and that, in certain cases, remove the need to specify a function \Phi explicitly, at the cost of providing values for a few parameters.

The most common kernels, each sketched in code after the list, include:

  • simple linear: K(\mathbf{x},\mathbf{w})=\mathbf{x}^{\!\top}\mathbf{w};
  • polynomial: K(\mathbf{x},\mathbf{w})=(\mathbf{x}^{\!\top}\mathbf{w}+c)^d, where c\in \mathbb{R} (typically 1) and 1\leq d\in \mathbb{N};
  • Gaussian (or radial-basis function): K(\mathbf{x},\mathbf{w})=\exp\left(-\gamma \|\mathbf{x}-\mathbf{w}\|^2\right), where \gamma is positive, and
  • sigmoidal (or perceptron): K(\mathbf{x},\mathbf{w})=\tanh\left(\kappa \mathbf{x}^{\!\top}\mathbf{w}-\delta\right), for allowable combinations of \kappa and \delta.
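
In code, each of these is a one-liner (a sketch; the default parameter values are illustrative choices of mine, not values prescribed here):

    import numpy as np

    def linear_kernel(x, w):
        return np.dot(x, w)

    def polynomial_kernel(x, w, c=1.0, d=2):
        return (np.dot(x, w) + c) ** d

    def gaussian_kernel(x, w, gamma=0.5):
        return np.exp(-gamma * np.dot(x - w, x - w))

    def sigmoid_kernel(x, w, kappa=1.0, delta=0.0):
        return np.tanh(kappa * np.dot(x, w) - delta)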

In this case, we solve

    \[\arg\min_{\alpha} \left\{\frac{1}{2}\sum_{i,j=1}^n\alpha_i\alpha_jy_iy_jK(\mathbf{x}_i,\mathbf{x}_j)-\sum_{i=1}^n\alpha_i:\mbox{ subject to } \sum_{i=1}^n\alpha_iy_i=0,\ \alpha_i\geq 0\right\}\]

to get a set of support vectors \{\mathbf{x}_{i_k}\} with non-zero weights \{\alpha_{i_k}\} and a decision function

    \[T(\mathbf{x};\alpha)=\sum_{k=1}^M\alpha_{i_k}y_{i_k}K(\mathbf{x}_{i_k},\mathbf{x})+\beta_0,\]

which is scaled as in the previous post.
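
To tie the two formulas together, here is a sketch of my own that fits a Gaussian-kernel SVM with scikit-learn and rebuilds T(\mathbf{x};\alpha) from the fitted quantities (dual_coef_ stores the products \alpha_{i_k}y_{i_k}, intercept_ stores \beta_0):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular boundary

    gamma = 0.5
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

    def decision(x):
        # T(x; alpha) = sum_k (alpha_k y_k) K(x_k, x) + beta_0, Gaussian kernel
        k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
        return (clf.dual_coef_ @ k)[0] + clf.intercept_[0]

    x_new = np.array([0.2, 0.3])
    print(decision(x_new))
    print(clf.decision_function([x_new])[0])   # the two values agree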


Support Vector Machines – Posts
  1. Overview
  2. Classification [previous]
  3. Kernel Transformations [current]
  4. Non-Separable Data [next]
  5. Further Considerations
  6. Examples
  7. References