A Support Vector Machine (SVM) is a supervised learning method that generalizes a large set of training samples into a smaller number of support vectors, which it uses to predict the class of unknown samples.

A SVM classifier is mathematically more complex than a distance-based classifier. However, a SVM classifier has better generalization capabilities than a distance-based classifier, and is faster when the sample set is large because it operates only on the support vectors.

When to use

Use a SVM classifier in the following types of applications:

  • The application has one class of good samples but an unknown number of classes for bad samples. An example of this type of application is defect detection. For this type of application, use a one-class SVM classifier to train samples of the known good class. Samples, such as defects, that cannot be classified as the known class are classified as unknown (see the sketch after this list).
  • The application requires a large number of training samples. During training, the SVM classifier identifies support vectors for the training samples. During classification, the SVM classifier operates only on the support vectors, which reduces the time required for classification.
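
The following is a minimal sketch of the one-class workflow using scikit-learn's OneClassSVM as a stand-in for the classifier described above; the class names, parameters, and feature values are scikit-learn conventions and illustrative data, not part of this software.

```python
# Hypothetical defect-detection sketch: train on known-good samples only,
# then flag anything the model cannot assign to that class as unknown.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
good = rng.normal(loc=0.5, scale=0.05, size=(200, 8))  # known-good feature vectors

clf = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
clf.fit(good)  # training uses only the good class

candidates = np.vstack([rng.normal(0.5, 0.05, (5, 8)),   # resembles the good class
                        rng.normal(0.9, 0.05, (5, 8))])  # defect-like outliers
print(clf.predict(candidates))  # +1 = known class, -1 = unknown (possible defect)
```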

In-Depth Discussion

The SVM algorithm builds a model to classify samples. The model represents the samples in a multi-dimensional space where the samples are separated by the maximum possible distance. For example, the following figure illustrates an application that involves two linearly separable classes represented in a two-dimensional space.

  1. Samples of Class A
  2. Samples of Class B
  3. Support Vectors
  4. Hyperplane
  5. Margin

The SVM algorithm uses a quadratic function to identify the support vectors for each class. A support vector is a sample in one class that is closest to another class. The SVM algorithm then identifies a hyperplane that separates the support vectors of each class. The distance between a support vector and the hyperplane is called the margin. The SVM algorithm selects the hyperplane that produces the largest possible margin for each support vector.
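
As an illustration of support vectors, the hyperplane, and the margin, the sketch below fits a linear SVM with scikit-learn (an assumed stand-in, not this software's own API) and reads those quantities back from the fitted model.

```python
# Two well-separated classes in two dimensions, as in the figure above.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 1.6],   # samples of Class A
              [4.0, 4.0], [4.2, 3.6], [3.8, 4.3]])  # samples of Class B
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin

w = clf.coef_[0]                    # normal vector of the hyperplane
print(clf.support_vectors_)         # the samples closest to the other class
print(1.0 / np.linalg.norm(w))      # support vector-to-hyperplane margin = 1 / ||w||
```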

Training

When you train the SVM classifier, the SVM algorithm uses an iterative process to optimize the support vector function. You can control the optimization by using the tolerance parameter in the software. Training is terminated when the gradient of the optimized function is less than or equal to tolerance. A tolerance value that is too high may cause the SVM algorithm to terminate training before the support vector function is adequately optimized. A tolerance value that is too low will cause the SVM algorithm to try to achieve a very high level of optimization, which may be too time-consuming and computationally expensive.
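
A sketch of the tolerance trade-off, assuming scikit-learn's SVC, whose tol argument plays the same stopping-criterion role described above (the dataset and values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

loose = SVC(tol=1e-1).fit(X, y)   # terminates early; may be under-optimized
tight = SVC(tol=1e-6).fit(X, y)   # optimizes further at extra computational cost
# n_iter_ (available in recent scikit-learn versions) shows the extra work
# a tighter tolerance requires.
print(loose.n_iter_, tight.n_iter_)
```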

Classification

When you use the SVM classifier, the SVM algorithm determines the class of an unknown sample by comparing it with the support vectors of the training samples. The SVM algorithm uses the following formula to classify an unknown sample x:

sgn(Σi=1...l yi αi K(xi, x) + b)
where:
  • yi is the class association (–1 or +1),
  • αi is the weight coefficient,
  • K is the kernel function,
  • xi is a support vector and l is the number of support vectors,
  • b is the distance of the hyperplane from the origin.

Classification speed depends on the number of support vectors and the selected kernel function. The weight coefficient αi, which is an output of the optimized support vector function, determines the number of support vectors. If the weight coefficient of a sample is not equal to 0, the sample is a support vector.
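
To make the formula concrete, the sketch below recomputes the decision rule sgn(Σ yi αi K(xi, x) + b) from a fitted scikit-learn SVC, whose dual_coef_ attribute stores the products yi × αi for each support vector (an illustrative stand-in, not this software's API):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

x_new = np.array([[2.0, 2.0]])
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.5)  # K(xi, x) per support vector
decision = clf.dual_coef_ @ K + clf.intercept_          # sum(yi*ai*K(xi, x)) + b
print(np.sign(decision), clf.predict(x_new))            # the sign gives the class
```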

Multi-Class SVM

SVM classification typically involves two classes. For applications that involve more than two classes, the SVM algorithm uses a one-versus-one approach. In a one-versus-one approach, the algorithm creates a binary classification model for every possible pair of classes, so that n classes produce n × (n – 1)/2 classification models. During classification, the algorithm uses a voting mechanism to identify the best class. If the voting mechanism identifies multiple classes, the algorithm selects the class that is closest to the sample.
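
The pairwise model count is easy to verify; the following sketch assumes scikit-learn's SVC, which also decomposes multi-class problems one-versus-one:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=4, random_state=0)  # n = 4 classes
clf = SVC(decision_function_shape="ovo").fit(X, y)

n = len(clf.classes_)
print(n * (n - 1) // 2)                    # 4 x 3 / 2 = 6 pairwise models
print(clf.decision_function(X[:1]).shape)  # (1, 6): one value per class pair
```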

Models

The following sections describe the models that the SVM algorithm uses to classify samples. Select a model based on the classes involved in your application. For applications that involve a single class, such as texture defect detection, select the one-class model. For applications that involve multiple classes, select the C-SVC or nu-SVC model; always start with the nu-SVC model.

C-SVC

The C-SVC model allows the SVM algorithm to separate classes that are divided by only a very narrow margin. Training involves minimizing the error function:

min(W, b, ξ)  (1/2)WTW + C Σi=1...l ξi

Subject to yi(WTK(xi) + b) ≥ 1 – ξi; ξi ≥ 0, i = 1 . . . l

where:
  • W is the normal vector of the hyperplane,
  • C is the cost parameter,
  • ξ is the slack variable.

If the SVM algorithm cannot define a clear margin, then it uses the cost parameter to allow some training errors and produce a soft margin. If the cost value is too high, it prohibits training errors, producing a narrow margin and rigid classification.
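
A sketch of the cost trade-off on overlapping classes, assuming scikit-learn's SVC (its C argument corresponds to the cost parameter described above):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Overlapping classes force a choice between training errors and margin width.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.8, random_state=0)

for C in (100.0, 1.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Lower C tolerates more training errors (softer margin), so more
    # samples typically fall inside the margin and become support vectors.
    print(C, clf.n_support_)
```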

Nu-SVC

In the Nu-SVC model, the nu parameter controls training errors and the number of support vectors. Training involves minimizing the error function:

min(W, b, ξ, ρ)  (1/2)WTW – νρ + (1/l) Σi=1...l ξi

Subject to yi(WTK(xi) + b) ≥ ρ – ξi; ξi ≥ 0, i = 1 . . . l; ρ ≥ 0

where:
  • W is the normal vector of the hyperplane,
  • ν is the nu parameter,
  • ρ is the margin offset optimized during training,
  • ξ is the slack variable.

The nu value specifies both the maximum ratio of training errors and the minimum number of support vectors relative to the number of samples. Nu must be greater than 0 and cannot exceed 1. A higher nu value increases tolerance for variation in the texture, but may also increase tolerance for texture defects. If nu is too high, training produces too many training errors to be useful.
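
A sketch of nu's two bounds, assuming scikit-learn's NuSVC (the dataset and values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, class_sep=0.5, random_state=0)

for nu in (0.05, 0.2, 0.5):
    clf = NuSVC(nu=nu).fit(X, y)
    sv_fraction = clf.n_support_.sum() / len(X)
    print(nu, sv_fraction)  # the support-vector fraction stays at or above nu
```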

One-Class SVM

In the one-class model, the SVM algorithm considers the spatial distribution information for each sample to determine whether the sample belongs to the known class. Training involves minimizing the error function:

min(W, ξ, ρ)  (1/2)WTW – ρ + (1/(νl)) Σi=1...l ξi

Subject to WTK(xi) ≥ ρ – ξi; ξi ≥ 0, i = 1 . . . l; ρ ≥ 0

where:
  • W is the normal vector of the hyperplane,
  • ν is the nu parameter,
  • ρ is the margin offset optimized during training,
  • ξ is the slack variable.
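
The role of nu in the one-class model can be sketched the same way; here scikit-learn's OneClassSVM stands in for the model described above:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))  # samples of the single known class

for nu in (0.01, 0.1, 0.3):
    clf = OneClassSVM(nu=nu, gamma="scale").fit(X)
    rejected = np.mean(clf.predict(X) == -1)  # training samples left outside
    print(nu, rejected)  # roughly bounded above by nu
```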

Kernels

A SVM classifier is a linear classifier. Typically, a SVM classifier uses a linear kernel, which is the dot product of the sample feature vector and a support vector. A SVM classifier can also use the following nonlinear kernels.

Polynomial
(Gamma × Kernel(xi, x) + Coefficient)^Degree

Radial Basis Function (RBF)
e^(–Gamma × ||xi – x||²)

Gaussian
e^(–||xi – x||² / (2 × Sigma²))
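
Written out with NumPy (an illustration; variable names follow the table above, and Kernel(xi, x) is the linear dot product), the three kernels are one-liners, and the Gaussian form reduces to the RBF form when Gamma = 1 / (2 × Sigma²):

```python
import numpy as np

def polynomial(xi, x, gamma, coeff, degree):
    return (gamma * np.dot(xi, x) + coeff) ** degree  # dot product = linear kernel

def rbf(xi, x, gamma):
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def gaussian(xi, x, sigma):
    return np.exp(-np.sum((xi - x) ** 2) / (2 * sigma ** 2))

xi, x = np.array([1.0, 2.0]), np.array([2.0, 0.5])
sigma = 1.5
print(np.isclose(rbf(xi, x, gamma=1 / (2 * sigma ** 2)),
                 gaussian(xi, x, sigma)))  # True: the two forms agree
```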

Use a nonlinear kernel to transform samples with nonlinear feature information to a dimension where the feature information is linearly separable, as illustrated in the following figures.

Figures A, B, and C

Figure A illustrates how a polynomial kernel separates nonlinear feature information. Figure B illustrates how a RBF kernel separates nonlinear feature information. Figure C illustrates the clearly separable feature information obtained after using a nonlinear kernel to transform the samples to a dimension where the feature information is linearly separable.

Choosing the Right Parameters

The following list provides information for selecting the right SVM parameters for your application.

  • Model—If your application involves only one class, use the one-class model. If your application involves more than one class, always start with the nu-SVC model.
  • Tolerance—Specifies the maximum gradient of the quadratic function used to compute support vectors. Training is terminated when the gradient of the optimized function is less than or equal to the tolerance value. The default value is 0.001. You typically do not need to change this value.
  • nu—Specifies both the maximum ratio of training errors and the minimum number of support vectors relative to the number of samples. Values must be greater than 0 and cannot exceed 1. The default value is 0.1. A higher nu value increases tolerance for variation in the texture, but may also increase tolerance for texture defects. If the texture classifier does not perform as expected because the trained texture samples do not represent every possible variation of the texture, try increasing the value of nu.
  • Cost—Specifies the penalty for training errors. If the cost value is too high, it prohibits training errors, producing a narrow margin and rigid classification. Decrease the cost value to allow more training errors and produce a softer margin between classes.
  • Kernel—Specifies the kernel that the classifier uses. RBF is the default value. In general, you do not need to modify this setting. If the number of sample features is high, try the linear kernel.
  • Degree—Specifies the degree of the polynomial kernel. In general, select a value less than 10.
  • Gamma—Specifies the gamma value for the polynomial and RBF kernels. A high value requires more support vectors to classify the sample. Use a high value for samples with regularly distributed feature information, and a low value for samples with irregularly distributed feature information. You may need to change this value to support the values selected for Cost or nu. For example, if you specify a high nu value, which raises the minimum number of support vectors, you may also need to increase the value of Gamma.
  • If you use a custom classifier, scale each feature vector value for the custom classifier to be greater than 0 but less than 1. Scaling the feature vector reduces overflow issues and improves the classification rate (see the sketch after this list).
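
The sketch below ties these recommendations together, assuming scikit-learn: features are scaled into the unit interval, and Cost (C) and Gamma are searched jointly; the parameter grid is illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# MinMaxScaler maps each feature into [0, 1], avoiding overflow issues.
pipe = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10, 100],
                           "svc__gamma": [0.001, 0.01, 0.1]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)  # the C/Gamma pair with the best cross-validation score
```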