Dimensionality Reduction for Classification Problems


Abstract
The main goal of this project is to examine a new algorithm for dimensionality reduction. Many problems in pattern recognition begin with the preprocessing of multidimensional signals, such as images of objects or faces. Often, the goal of preprocessing is some form of dimensionality reduction: to compress the signals in size and to discover a compact representation of their variability. This is especially important in applications involving automatic classification of the data. In such applications the goal of preprocessing is to ease the task of the classifier, because automatic classification of multidimensional data may be a very difficult task to implement.

Background
A popular form of dimensionality reduction is principal component analysis (PCA). It is an eigenvector method designed to model linear variability in high-dimensional data. In PCA, we compute the linear projections of greatest variance from the top eigenvectors of the data covariance matrix. Another popular method is Fisher's linear discriminant (fisherc), a classification method that projects high-dimensional data onto a line and performs classification in this one-dimensional space. It finds the linear discriminant function between the given classes by minimizing the errors in the least-squares sense. Locally linear embedding (LLE) is a newer eigenvector method for nonlinear dimensionality reduction, and it is the distinguished algorithm of this project.

LLE Algorithm description
The main principle of the LLE algorithm is to preserve the local neighborhood relations of the data in both the intrinsic (high-dimensional) space and the embedding (low-dimensional) space. Each sample in the observation space is a linearly weighted average of its neighbors. The algorithm is described as follows:

Given N real-valued vectors X_i, each of dimensionality D, the algorithm represents each data point (vector) by linear coefficients that reconstruct it from its neighbors. This is done by identifying the K nearest neighbors of each data point, as measured by Euclidean distance. The reconstruction error is measured by the cost function:

\varepsilon(W) = \sum_i \left| \vec{X}_i - \sum_j W_{ij}\,\vec{X}_j \right|^2

The weights W_ij summarize the contribution of the j-th data point to the reconstruction of the i-th one. To compute the weights, we minimize the cost function subject to two constraints: each data point is reconstructed only from its neighbors, and the rows of the weight matrix sum to one (\sum_j W_{ij} = 1). The optimal weights are found by solving a least-squares problem.

In the final step of the algorithm, each high-dimensional data point X_i is mapped to a low-dimensional vector Y_i. This is done by choosing d-dimensional coordinates Y_i that minimize the embedding cost function:

\Phi(Y) = \sum_i \left| \vec{Y}_i - \sum_j W_{ij}\,\vec{Y}_j \right|^2

Like the previous one, this cost function is based on locally linear reconstruction errors, but here we fix the weights W_ij while optimizing the coordinates Y_i. The cost function can be minimized by solving a sparse N x N eigenvector problem for the matrix M = (I - W)^T (I - W). The optimal solution is given by the smallest eigenvectors of M: we need the bottom d non-zero eigenvectors, so we compute the bottom (d + 1) eigenvectors of the matrix and discard the smallest one, which corresponds to the zero eigenvalue.

Note: the algorithm has only one free parameter, the number of neighbors per data point, K. Once the neighbors are chosen, the optimal weights and the coordinates in the low-dimensional space are computed by standard methods of linear algebra.
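To make the three steps concrete, here is a minimal MATLAB sketch of the algorithm (an illustration under stated assumptions, not the exact code used in the project: the data is assumed to be stored as a D x N matrix X, and the function name lle_sketch is ours):

function Y = lle_sketch(X, K, d)
% X: D x N data matrix, K: number of neighbors, d: target dimension
[D, N] = size(X);
% Step 1: K nearest neighbors of each point by Euclidean distance
s = sum(X.^2, 1);
dist2 = repmat(s', 1, N) + repmat(s, N, 1) - 2*(X'*X);
[~, idx] = sort(dist2, 1);
neighbors = idx(2:K+1, :);                              % skip the point itself
% Step 2: reconstruction weights whose rows sum to one
W = zeros(N, N);
tol = 1e-3;                                             % regularization (needed when K > D)
for i = 1:N
    Z = X(:, neighbors(:, i)) - repmat(X(:, i), 1, K);  % center the neighbors on X_i
    C = Z' * Z;                                         % local Gram matrix
    C = C + eye(K) * tol * trace(C);                    % regularize
    w = C \ ones(K, 1);                                 % solve C*w = 1
    W(i, neighbors(:, i)) = w' / sum(w);                % enforce the sum-to-one constraint
end
% Step 3: embedding from the bottom eigenvectors of M = (I - W)'*(I - W)
M = (eye(N) - W)' * (eye(N) - W);
[V, E] = eig(M);                          % full eig for clarity; a sparse solver suits large N
[~, order] = sort(diag(E));
Y = V(:, order(2:d+1))';                  % discard the smallest (constant) eigenvector
end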

Schematic diagram of the algorithm:
[Figure omitted]

PCA Algorithm description:

PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension.
The Algorithm:
1 – Subtract the mean
For PCA to work properly, you have to subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. This produces a data set whose mean is zero.

2 – Calculate the covariance matrix

3 – Calculate the eigenvectors and eigenvalues of the covariance matrix

4 – Choosing components and forming a feature vector
The eigenvector with the highest eigenvalue is the principal component of the dataset. In general, once the eigenvectors of the covariance matrix are found, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance. Now you can decide to ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don't lose much.
Now we need to form a feature vector, i.e. a matrix of vectors. This is constructed by taking the eigenvectors that you want to keep from the list of eigenvectors and forming a matrix with these eigenvectors in the columns:

FeatureVector = (\vec{e}_1 \;\; \vec{e}_2 \;\; \cdots \;\; \vec{e}_n)

5 – Deriving the new data set

Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of this feature vector and multiply it on the left of the mean-adjusted data set (DataAdjust from step 1), transposed:

FinalData = FeatureVector^T \times DataAdjust^T
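The five steps can be summarized in a short MATLAB sketch (a generic illustration rather than the project code; X is assumed to hold one sample per row and d is the number of components to keep):

function [FinalData, FeatureVector] = pca_sketch(X, d)
% Step 1: subtract the mean of each dimension
DataAdjust = X - repmat(mean(X, 1), size(X, 1), 1);
% Step 2: covariance matrix (D x D)
C = cov(DataAdjust);
% Step 3: eigenvectors and eigenvalues of the covariance matrix
[V, E] = eig(C);
% Step 4: order the components by eigenvalue and keep the d most significant
[~, order] = sort(diag(E), 'descend');
FeatureVector = V(:, order(1:d));
% Step 5: derive the new data set
FinalData = (FeatureVector' * DataAdjust')';
end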
Algorithm demonstration:
[Figure omitted]

Fisherc Algorithm description:

Fisher's linear discriminant is a classification method that projects high-dimensional data onto a line and performs classification in this one-dimensional space. The projection maximizes the distance between the means of the two classes while minimizing the variance within each class (in other words, it moves elements of the same class closer together while moving elements of different classes further apart). This defines the Fisher criterion, which is maximized over all linear projections w:

J(w) = \frac{|m_1 - m_2|^2}{s_1^2 + s_2^2}

where m represents a class mean, s^2 represents a class variance (both measured in the projected one-dimensional space), and the subscripts denote the two classes. In signal theory, this criterion is also known as the signal-to-interference ratio.
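As an illustration, the projection direction maximizing this criterion can be computed in a few lines of MATLAB (a generic two-class sketch, not the PRTools fisherc routine; X1 and X2 are assumed to hold the samples of the two classes, one per row):

function w = fisher_sketch(X1, X2)
m1 = mean(X1, 1)';   m2 = mean(X2, 1)';   % class means
S1 = cov(X1) * (size(X1, 1) - 1);         % within-class scatter of class 1
S2 = cov(X2) * (size(X2, 1) - 1);         % within-class scatter of class 2
Sw = S1 + S2;                             % total within-class scatter
w  = Sw \ (m1 - m2);                      % direction maximizing J(w)
w  = w / norm(w);
% a sample x (column vector) is projected onto the line by y = w' * x
end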

For example:

[Figure omitted]

Solution
As we said before, dimensionality reduction is often an intermediate stage in automatic classification problems. It was therefore decided to test the performance of the discussed algorithms by comparing the classification errors measured after applying the preprocessing and then the classification. To obtain a more reliable estimate, three classification algorithms were used: linear (LDC), quadratic (QDC) and k-nearest neighbors (KNNC).
Because fisherc combines dimensionality reduction to 1D with classification, it was applied only in the case where the final data dimension was one, in order to compare it with the performance of LLE and PCA when reducing the dimensionality to 1D (no additional classifier was applied when testing fisherc).
The data was taken from different fields of study and everyday life.
The first part of the tests was performed on a hyperspectral image with 242 spectral bands. A color mask for the image was painted by an expert, who marked the flora, water and roads appearing in the image in different colors. So in this case, in addition to the classification error value, we also obtained visual results of the final classification.
The second type of data consisted of several real datasets: the Hearts dataset, containing health attributes of people and the absence or presence of heart disease; the Letters dataset, containing morphological attributes of letters and the matching letter; and the DNA dataset, containing binary attributes and a classification into one of three classes.
As mentioned above, the LLE algorithm has only one free parameter, the number of neighbors per data point, K. So before testing the performance of LLE we had to decide which number of neighbors to use for each type of data. To do this, we measured the classification error as a function of the number of neighbors for each dataset, and the subsequent performance testing was done with the value of K that gave the best results for that dataset.
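A minimal sketch of this selection loop might look as follows (illustrative only: lle_sketch is the function outlined earlier, dataset, gendat, knnc and testc are PRTools routines whose exact calling conventions may differ between versions, and X, labels and d are placeholders for the data at hand):

Ks  = 4:2:20;                            % candidate numbers of neighbors
err = zeros(size(Ks));
for k = 1:length(Ks)
    Y = lle_sketch(X, Ks(k), d);         % embed the data with K = Ks(k)
    a = dataset(Y', labels);             % PRTools dataset in the embedding space
    [trn, tst] = gendat(a, 0.5);         % split into learning and test sets
    err(k) = testc(tst * knnc(trn));     % classification error for this K
end
[~, best] = min(err);
K = Ks(best);                            % the K giving the smallest error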
The K values used for each dataset:
[Table omitted]

Tools
All the tests were performed on a PC using MATLAB. For performing the classification we used the PRTools classification toolbox. The toolbox also supplies useful functionality such as generating datasets, splitting an existing dataset into a learning set and a test set, displaying data feature maps and scatter plots of the classified data, and more.
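For illustration, a single evaluation run with PRTools might look roughly like this (a hedged sketch: the routine names dataset, gendat, ldc, qdc, knnc and testc come from the PRTools documentation, the exact signatures may differ between toolbox versions, and reduced_data and labels are placeholders for the data after PCA or LLE preprocessing):

a = dataset(reduced_data, labels);       % labeled PRTools dataset
[trn, tst] = gendat(a, 0.5);             % split into a learning set and a test set
for clsf = {ldc, qdc, knnc}              % the three classifiers used in the project
    w   = trn * clsf{1};                 % train the (untrained) classifier mapping
    err = testc(tst * w);                % classification error on the test set
    disp(err);
end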

Conclusions
Let us start by listing some disadvantages of LLE. First, we quickly realized that LLE has numerical problems, so in order to make it work we always applied regularization, i.e. a small distortion of the original data. No such action was needed when applying PCA or fisherc. The second disadvantage lies in the final stage of the eigenvector computation. In that stage we have to find the eigenvectors of a matrix of dimensions N x N, where N is the number of elements in the dataset. Note that in the corresponding eigenvector computation in PCA (or fisherc) we have to find the eigenvectors of a matrix of dimensions D x D, where D is the dimension of the original data, and in most applications D << N. This is a strong advantage for PCA and fisherc. Finally, LLE is a non-linear transformation of the data from the original space to the embedding space. Therefore it cannot be applied as easily as PCA, i.e. by a matrix multiplication of the original data with the feature vector, to obtain the embedding-space representation of the whole dataset at once. In the case of LLE preprocessing, the embedding-space representation of each point in the test set was found individually (and, of course, this affects the runtime).
As an attempt to improve the ability of the LLE algorithm to work on big datasets and to achieve better results, one could implement it in a C/C++ environment, because in MATLAB it does not get enough resources. To improve the runtime, one could also extend the implementation to obtain the embedding-space representation of the whole test set at once, while taking care that one test point does not influence another.
The most significant measure, however, is the classification error. We can see that the fisherc algorithm always gives the smallest classification error for one dimension: it has a smaller classification error than PCA or LLE combined with any of the classifiers used. PCA usually works better with the QDC classifier, and LLE usually works better with the KNNC classifier. But the classification error after PCA is almost always lower than the one after LLE, for each of the classifiers used and for any value of the final dimension.
We also saw that for a dataset consisting of binary data, the results of both PCA and LLE are negatively affected, but in this case PCA gives significantly better results than LLE.
In summary, we would prefer to use the traditional methods, PCA or the fisherc algorithm depending on the application, rather than LLE.

Acknowledgment
We would like to thank our project supervisor Ran Kaftory for his help and guidance throughout this work.