Abstract
Support-Vector-Machine (SVM) is a large-margin classifier that is considered a robust and effective solution for classification and estimation tasks. Training an SVM consists of solving a Quadratic Programming (QP) problem. This problem cannot be solved analytically if the training dataset exceeds a few hundred vectors. Standard numeric QP algorithms are also insufficient for big datasets, and special algorithms were developed to handle this problem. Sequential-Minimal-Optimization (SMO), introduced by J. Platt in 1998, is the most popular algorithm for this purpose. SMO can solve a problem of a few thousand vectors efficiently.
We introduce a parallel version of SMO. A single SVM problem is divided into separate sub-problems, each of which is solved on a different computer. The solutions are then merged to obtain the overall solution. This way we are able to speed up SMO and enable the solution of much bigger datasets.
Support-Vector-Machine (SVM) Classifier
The purpose of the classification problem is to find a hyperplane which separates the positive from the negative examples (a “separating hyperplane”). The points which lie on the hyperplane satisfy $w \cdot x + b = 0$, where $w$ is normal to the hyperplane. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin.
For training data $\{x_i, y_i\}$ with labels $y_i \in \{-1, +1\}$, all the examples must satisfy

$$y_i (w \cdot x_i + b) - 1 \geq 0 \quad \forall i .$$

Now consider the points for which this equality holds, i.e. $y_i (w \cdot x_i + b) = 1$. These points are called Support Vectors. The purpose of the classifier is to maximize the distance to the nearest points (the Support Vectors), which is equivalent to minimizing $\|w\|^2$ subject to the constraints above. Introducing the Lagrange multipliers $\alpha_i \geq 0$, the problem becomes the minimization of

$$L_P = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \, y_i (w \cdot x_i + b) + \sum_i \alpha_i .$$
The problem can be transformed to its dual form: maximize

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)$$

subject to $\alpha_i \geq 0$ and $\sum_i \alpha_i y_i = 0$, which is a standard-form quadratic programming (QP) problem.
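To make the dual concrete, here is a minimal NumPy sketch (illustrative only, not part of the original work; the toy data and function names are our own) that evaluates $L_D$ and checks the dual constraints for a candidate vector of multipliers:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """L_D = sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    G = (X @ X.T) * np.outer(y, y)      # label-weighted Gram matrix
    return alpha.sum() - 0.5 * alpha @ G @ alpha

def feasible(alpha, y, tol=1e-8):
    """Dual constraints: alpha_i >= 0 and sum_i alpha_i y_i = 0."""
    return bool(np.all(alpha >= -tol) and abs(alpha @ y) < tol)

# Toy 1-D dataset: two positive and two negative examples
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.5, 0.0, 0.5, 0.0])      # happens to be the optimal point here
print(dual_objective(alpha, X, y), feasible(alpha, y))   # 0.5 True
```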
We can generalize the problem to non-linear separation problems by using a kernel function $K(x_i, x_j)$ instead of the dot product above. It can also be shown that a small change in the constraints (introducing the error-price parameter $C$) generalizes the problem to the non-separable case. The resulting problem is: maximize

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)$$

subject to $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$.
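As an example of such a kernel, the sketch below shows the widely used Gaussian (RBF) kernel that can replace the dot product; the gamma value is an arbitrary illustration, not a parameter taken from this work:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """Gaussian (RBF) kernel: K(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

print(rbf_kernel([1.0, 0.0], [0.0, 1.0]))   # exp(-1) ~ 0.3679
```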
In order to classify a vector $x$, we use the formula:

$$f(x) = \operatorname{sgn}\!\left(\sum_i \alpha_i \, y_i \, K(x_i, x) + b\right).$$
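A minimal sketch of this decision rule (illustrative; the function and variable names are our own, and a linear kernel is used for the toy example):

```python
import numpy as np

def classify(x, support_vectors, support_labels, alphas, b, kernel):
    """Return sgn( sum_i alpha_i * y_i * K(x_i, x) + b ) as +1 or -1."""
    s = sum(a * yi * kernel(xi, x)
            for a, yi, xi in zip(alphas, support_labels, support_vectors))
    return 1 if s + b >= 0 else -1

def linear_kernel(x1, x2):
    return float(np.dot(x1, x2))

# Support vectors at x = +1 and x = -1 with alpha = 0.5, b = 0 (the toy problem above)
print(classify(np.array([3.0]),
               [np.array([1.0]), np.array([-1.0])],
               [1.0, -1.0], [0.5, 0.5], 0.0, linear_kernel))   # -> 1
```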
Sequential Minimal Optimization (SMO)
Sequential Minimal Optimization (SMO) solves the above QP problem by decomposing it into the smallest possible sub-problems (two Lagrange multipliers at a time) and solving these sub-problems analytically and sequentially. This is done until the minimum point is reached. There are necessary and sufficient conditions for the optimal point, called the Karush-Kuhn-Tucker (KKT) conditions. These conditions are checked after each step to determine whether the SMO should continue.
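The heart of each SMO step is the analytic optimization of one pair of multipliers under the box and equality constraints. The sketch below (a simplified illustration with a linear kernel; it omits the pair-selection heuristics, the bias update and the kernel cache of the full algorithm) shows that joint update:

```python
import numpy as np

def smo_pair_update(i, j, alpha, X, y, C, b):
    """Analytically optimize alpha[i], alpha[j] while keeping sum_k alpha_k*y_k fixed."""
    K = X @ X.T                                    # linear kernel matrix
    f = (alpha * y) @ K + b                        # current outputs f(x_k)
    E_i, E_j = f[i] - y[i], f[j] - y[j]            # prediction errors
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]        # curvature along the constraint line
    if eta <= 0:
        return alpha                               # skip degenerate pairs in this sketch
    # Bounds that keep both multipliers inside the box [0, C]
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    new = alpha.copy()
    new[j] = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    new[i] = alpha[i] + y[i] * y[j] * (alpha[j] - new[j])   # restore the equality constraint
    return new
```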
Parallel SMO
The speed-up idea suggested here is to use several computers to solve the SVM problem. The computers transfer data via one of the common protocols, such as TCP/IP. The basic idea is to divide the problem into two sub-problems such that each sub-problem contains at least one multiplier that violates the KKT conditions, to solve each sub-problem on a remote computer (called a “slave”), and to merge the results on one computer (the “master”). Then the examples are mixed and sent to the slaves again. Further parallelization can be accomplished by dividing each sub-problem again.
Two solution methods were tested. The non-optimal-parallel solution merges the two sub-problem solutions and uses the result as the overall solution. The optimal-parallel classifier uses the previous (merged) point as the initialization point for a new SVM problem.
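A rough sketch of the non-optimal-parallel scheme (illustrative only: it uses a local process pool as a stand-in for the TCP/IP master/slave communication, assumes a solve_smo routine that returns the multipliers of one sub-problem, and omits the check that every sub-problem contains a KKT violator):

```python
import numpy as np
from multiprocessing import Pool

def solve_smo(args):
    """Placeholder for a full SMO solver run on one slave (returns the multipliers)."""
    X_part, y_part = args
    return np.zeros(len(y_part))                  # stub: a real slave would run SMO here

def parallel_smo(X, y, n_slaves=2):
    """Master side: mix the examples, divide them among the slaves, solve, merge."""
    parts = np.array_split(np.random.permutation(len(y)), n_slaves)
    with Pool(n_slaves) as pool:                  # stand-in for the remote slave computers
        results = pool.map(solve_smo, [(X[p], y[p]) for p in parts])
    alpha = np.zeros(len(y))
    for p, a in zip(parts, results):              # merge the sub-solutions on the master
        alpha[p] = a
    return alpha
```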
Results
The non-optimal-parallel classifier, tested on an artificial dataset, gives the following classifier:

[Figure: classification regions on the artificial dataset]

The yellow area is the result of an optimal classifier, while the blue area is the result of the non-optimal-parallel classifier. On real-world datasets, the non-optimal classifier gives much better results. A timing comparison of the three algorithms:
[Table: timing comparison of the simple SMO, non-optimal-parallel, and optimal-parallel algorithms]
Conclusions
- The non-optimal-parallel algorithm gives very good results.
- The optimal-parallel algorithm is not as good: sometimes it gives good results (as in the examples above), but on other problems it results in a solution time that is slower than the simple SMO.
Acknowledgments
We would like to thank Shie Mannor for his great assistance during all stages of this work. We also thank Johanan Erez and the rest of the VISL Laboratory staff, and the Ollendorff Minerva Center Fund, which supported the project.
Links
- Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, John C. Platt, Microsoft Research Technical Report MSR-TR-98-14, 1998. http://www.research.microsoft.com/users/jplatt/smo.html
- A Tutorial on Support Vector Machines for Pattern Recognition, C.J.C. Burges, Data Mining and Knowledge Discovery, 2(2), 1998. http://www.kernel-machines.org/papers/Burges98.ps.gz

