Blind Source Separation

Abstract
In many fields today we encounter the Blind Source Separation problem, in which all that is known to us is a collection of linear combinations (mixtures) of unknown sources and, worse, the coefficients of the linear combinations are unknown as well.

The problem is therefore to estimate the matrix of combination coefficients (the mixing matrix) and to reconstruct the sources according to it.

As was recently found, we can dramatically improve the quality of separating the sources from the mixtures by exploiting the sparsity of the sources (sparsity means that most of the values of the sources are zeros) when the sources are properly represented with respect to some collection of signals (the sources are brought to such a representation by applying a transform such as the Wavelet transform).

There are two main approaches to the solution of the Blind Source Separation problem that we present in this article:

  • Clustering
  • ICA (Independent Component Analysis)

The programs in this project are implemented in MATLAB.

Introduction
The Blind Source Separation (BSS) problem, in which a vector of N mixtures X(ξ) is created from M unknown sources S(ξ) by an N×M mixing matrix A that is also unknown, can be formulated by the equation:
X(ξ) = AS(ξ) + N(ξ)
where N(ξ) is additive noise that may be present in the problem. From here on, and until stated otherwise, we assume that N(ξ) = 0 (no noise). The coordinate ξ can be a time coordinate (for audio signals) or a space coordinate (for images).
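
As a concrete illustration of this model, here is a minimal MATLAB sketch (our own synthetic example, not part of the project code; the matrix A and all the numbers are arbitrary choices) that generates sparse artificial sources and mixes them in the noiseless case:

```matlab
% Synthetic illustration of the mixing model X = A*S (noiseless case, N = M = 2).
M = 2;                                   % number of sources
K = 1000;                                % number of samples per source
S = randn(M, K) .* (rand(M, K) > 0.9);   % sparse synthetic sources (mostly zeros)
A = [0.8 0.3; 0.6 0.95];                 % "unknown" mixing matrix, chosen arbitrarily
A = A ./ sqrt(sum(A.^2, 1));             % normalize each column to unit norm
X = A * S;                               % the observed mixtures
```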

ICA:
One possible approach is to assume the independence of the sources, which leads us to Independent Component Analysis (ICA). For more details see chapter III.

Another assumption is the sparsity of the sources when they are properly represented according to a group of functions {φ_k(ξ)}, meaning:
s_m(ξ) = ∑_k c_mk φ_k(ξ)
The functions φ_k(ξ) do not have to be linearly independent and can form an overcomplete set of functions, such as a Wavelets family. Assuming sparsity gives even better performance than standard ICA and even allows more sources than mixtures.
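
As a small illustration of this point (our own sketch, assuming the MATLAB Wavelet Toolbox is available), the following fragment shows that a piecewise-smooth signal which is not sparse in its original domain has mostly near-zero wavelet coefficients:

```matlab
% A piecewise-smooth test signal is not sparse in time, but its wavelet
% coefficients are: most of them are close to zero.
t = linspace(0, 1, 1024);
s = sin(2*pi*5*t) + (t > 0.5);               % smooth oscillation plus a step
c = wavedec(s, 5, 'db4');                    % 5-level wavelet decomposition
fracSmallTime  = mean(abs(s) < 0.01*max(abs(s)));
fracSmallCoeff = mean(abs(c) < 0.01*max(abs(c)));
fprintf('near-zero fraction: time domain %.2f, wavelet domain %.2f\n', ...
        fracSmallTime, fracSmallCoeff);
```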

A further improvement in performance is achieved in many cases where we have several sets of coefficients c_mk: we choose only one or a few of these sets and separate the sources based only on the coefficients of those sets rather than on all the coefficients.

Two algorithms based on ICA with the sparsity assumption are introduced in chapter III.

Clustering:
Another approach to source separation, again under the sparsity assumption, is clustering: we attempt to separate the coefficient data, generated by applying a transform to the mixtures, into several clusters, a cluster being a group of densely concentrated points. The data we try to separate is a collection of points of the form (d_1k, d_2k, …, d_Nk), where N is the number of mixtures and k runs over the indices of the family of functions φ_k(ξ).

Let us see what the sparsity assumption contributes, for example, in the simple case of two sources and two mixtures, when the sources are already sparse in the original domain:

x_1(t) = a_11 s_1(t) + a_12 s_2(t)

x_2(t) = a_21 s_1(t) + a_22 s_2(t)

Since the sources are sparse, in most cases when s_1(t) is different from 0, s_2(t) is equal to 0, and therefore we get:

x_1(t) = a_11 s_1(t)

x_2(t) = a_21 s_1(t)

Plotting the values of x_1(t) versus the values of x_2(t) will give a straight line passing through the origin of the axes with a slope of a_11/a_21. Similarly, when s_2(t) is different from 0 and s_1(t) is 0 (or close enough to 0), we get:

x_1(t) = a_12 s_2(t)

x_2(t) = a_22 s_2(t)

which in the same way describes a straight line passing through the origin of the axes with a slope of a_12/a_22. After finding the two slopes, we can reconstruct the mixing matrix A as it appears in the equation:

X = AS

where

X = (x_1(t), x_2(t))^T

and

S = (s_1(t), s_2(t))^T

assuming the mixing matrix is normalized, for example so that each column has unit norm.

As it turns out, the sources are rarely sparse in their original domain, but their decomposition coefficients with respect to a wisely chosen set of functions usually are. That is why we plot not the mixtures themselves, but the coefficients produced by a suitable transform applied to them. It is also clear that the lines we get will not always be clean ones, and therefore an algorithm is needed to estimate their slopes. Such an algorithm is introduced in chapter II.
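
The following sketch (our own synthetic example, not the project code) visualizes this geometry for sources that are sparse already in the original domain: the scatter plot of the two mixtures concentrates along two lines through the origin with slopes a_11/a_21 and a_12/a_22:

```matlab
% Scatter plot of x1 versus x2 for two sparse sources mixed by a known A.
K = 2000;
S = randn(2, K) .* (rand(2, K) > 0.9);      % two sparse synthetic sources
A = [0.8 0.3; 0.6 0.95];
A = A ./ sqrt(sum(A.^2, 1));                % normalize columns to unit norm
X = A * S;                                  % the two mixtures

figure; hold on;
plot(X(2, :), X(1, :), '.');                % the coefficient points (x2, x1)
x2 = linspace(min(X(2, :)), max(X(2, :)), 100);
plot(x2, (A(1, 1)/A(2, 1)) * x2, 'r--');    % line with slope a11/a21
plot(x2, (A(1, 2)/A(2, 2)) * x2, 'g--');    % line with slope a12/a22
xlabel('x_2'); ylabel('x_1'); hold off;
```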

Clustering-based algorithm:
Before using this algorithm, as well as the following ones, we first apply the Discrete Wavelet Packet (DWP) transform to the mixtures received as input. As a result we get a tree of subsets of the transform coefficients.

The algorithm we present here passes over all the subsets of coefficients, including the set of all the coefficients together, and for each set the following is performed:

1. All the coefficients of the subset, for all the mixtures, are taken as points in the mixture space and cast onto the upper half of a sphere of radius 1 in this N-dimensional space (N is the number of mixtures).
2. The Fuzzy C-Means (FCM) clustering algorithm is executed on the resulting data, with the number of clusters changing from 1 to a predefined maximum (that maximum is determined according to the maximal allowed number of sources). The output of the FCM algorithm is the centers of the clusters the data was separated into. From these centers we calculate the criterion used to estimate the number of sources seen in the subset (the criterion is based on the Hartigan statistic presented below).
3. Based on the clusters' centers from the previous stage, after finding the optimal number of clusters we calculate the global distance criterion given by:

global distance = (sum over all points of the distance from the center of the point's cluster) / (number of points)

This criterion, together with the clusters' centers (for the estimated number of clusters only), is stored aside.

After those three stages the best set is chosen according to the minimum global distance (a smaller global distance means better separation between clusters and therefore a possibility of better separation between sources). Our assumption about the chosen set is that its coefficients are influenced by all the sources (i.e. all the sources are present in it). A sketch of the three per-subset stages is given below.
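
The following MATLAB sketch (our own outline, not the project code; it assumes the Fuzzy Logic Toolbox for fcm and the Statistics Toolbox for pdist2, and uses a hard nearest-center assignment as a stand-in for the fuzzy memberships) illustrates stages 1-3 for a single coefficient subset:

```matlab
% Stages 1-3 for one subset. C is N-by-K: row n holds this subset's DWP
% coefficients of mixture n (here the raw mixtures X serve as a stand-in).
C = X;

% Stage 1: cast the coefficient points onto the upper half of the unit sphere.
P = C ./ sqrt(sum(C.^2, 1));            % normalize each point (column) to norm 1
P(:, any(~isfinite(P), 1)) = [];        % drop points that were at the origin
neg = P(end, :) < 0;                    % reflect points with negative last coordinate
P(:, neg) = -P(:, neg);

% Stage 2: FCM clustering for a range of cluster numbers (fcm needs at least 2).
MAXC = 4;                               % maximal allowed number of sources (illustrative)
centers = cell(1, MAXC);
for nc = 2:MAXC
    centers{nc} = fcm(P', nc);          % fcm expects one data point per row
end
nsrc = 2;                               % stand-in for the Hartigan-based estimate below

% Stage 3: global distance criterion for the chosen number of clusters.
D = pdist2(P', centers{nsrc});          % distance of every point to every center
globalDist = mean(min(D, [], 2));       % average distance to the nearest center
```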

For that set we take the clusters' centers and build from them the estimated mixing matrix, with each center as one of its columns. Using that matrix we separate the sources from the mixtures. In order to describe how this is done, let us look at two cases:

1. The number of mixtures is equal to the number of sources (N = M), and let us assume we estimated the number of sources correctly, so the estimated mixing matrix A is square. As stated before, the mixtures X for the noiseless case are given by:

X = AS

where S is the vector of sources, and so

S = A^(-1)X = WX

W is called the unmixing matrix, and the reconstruction of the sources is done the same way the mixtures were constructed, but with the matrix W instead of A.

2. The number of mixtures is different from the number of sources (N ≠ M). Here, assuming we estimated the number of sources correctly, A is not square (it is of size N×M) and therefore has no inverse. The separation algorithm is as follows:
We take the chosen set and pass over all its clusters; for each cluster we plant only its coefficients at the node of the chosen set in a Discrete Wavelet Packet tree (the coefficients taken are of one of the mixtures only). The coefficients of the rest of the sets and of the other clusters are taken as zero. Finally we reconstruct the data according to the DWP tree created, in order to get one of the sources (usually we restore only the edges of the source, because we use partial data for its reconstruction).
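
A sketch of this reconstruction for a single cluster (our own, assuming the MATLAB Wavelet Toolbox; the node index and the cluster mask are stand-ins for values produced by the earlier stages):

```matlab
% Keep only one cluster's coefficients in the chosen node and reconstruct.
T = wpdec(X(1, :), 3, 'db4');               % DWP tree of one of the mixtures
chosenNode = 7;                             % stand-in index of the chosen set
cfs = read(T, 'cfs', chosenNode);           % coefficients of the chosen set
clusterMask = rand(size(cfs)) > 0.5;        % stand-in for the FCM cluster assignment
cfs(~clusterMask) = 0;                      % other clusters' coefficients set to zero
T = write(T, 'cfs', chosenNode, cfs);
s_est = wprcoef(T, chosenNode);             % contribution of this node alone, i.e. the
                                            % coefficients of all other sets are zero
```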

Description of the algorithm for estimating the number of clusters using the Hartigan statistic:
For a collection of n points we perform clustering into k clusters

C_1, C_2, …, C_k, where each cluster C_r contains n_r = |C_r| points (1 ≤ r ≤ k).

Now we define

D_r = ∑_{i,j ∈ C_r} d_ij

where d_ij is the distance between points i and j (we use the squared Euclidean distance ∑_k (x_ik − x_jk)^2, where k runs over all the dimensions of the points).

Based on the D_r parameter we define:

W(k) = ∑_{r=1..k} (1/(2n_r)) D_r

which is the sum, over all the clusters, of the average distances of the points from one another within each cluster.

The Hartigan statistic is then:

H(k) = [W(k)/W(k+1) − 1]·(n − k − 1)

for a number of clusters equal to k.

Now that we know the Hartigan statistic for k clusters, we calculate it in the algorithm for k = 1…MAX and choose the estimated number of clusters as the minimal k such that:

H(k) ≤ TR

TR is some threshold value, which Hartigan proposed to take as 10. The variant that corresponds better to our problem chooses TR as a function of the number of points n.
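
A minimal MATLAB sketch of this estimate (our own, not the project code; it assumes the Statistics Toolbox for pdist, and that clusterIdx{k} holds a hard assignment of every point for the k-cluster solution, e.g. to the nearest FCM center):

```matlab
% Hartigan-based estimate of the number of clusters.
% P          : one point per row.
% clusterIdx : cell array; clusterIdx{k} assigns every point to one of k clusters,
%              needed for k = 1..MAXK+1.
% TR         : threshold value (Hartigan proposed 10).
function khat = hartiganClusters(P, clusterIdx, MAXK, TR)
    n = size(P, 1);
    W = zeros(1, MAXK + 1);
    for k = 1:MAXK + 1
        for r = 1:k
            pts = P(clusterIdx{k} == r, :);
            nr  = size(pts, 1);
            if nr > 1
                Dr   = 2 * sum(pdist(pts, 'squaredeuclidean'));  % sum over ordered pairs
                W(k) = W(k) + Dr / (2 * nr);
            end
        end
    end
    khat = MAXK;                        % fall back to the maximum if never below TR
    for k = 1:MAXK
        H = (W(k) / W(k + 1) - 1) * (n - k - 1);
        if H <= TR
            khat = k;
            return;
        end
    end
end
```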

ICA-based algorithms:
In this chapter we present three algorithms: the first one integrates clustering and ICA, and the other two are based only on ICA. For the simplicity of the algorithms we work only with an equal number of mixtures and sources; that way the problem of estimating the number of sources is bypassed. However, the ICA approach also allows dealing with the case in which the number of mixtures is smaller than the number of sources.

Algorithm 1:
This algorithm performs in its first stage exactly the same process as the clustering-based algorithm of chapter II. After this stage is finished (and the set with the minimal global distance is chosen), instead of separating the sources according to the centers of the clusters that the FCM algorithm returns, we estimate the unmixing matrix directly according to the Maximum Likelihood criterion by finding the maximum of the function:

L_W(Y) = K·ln|det W| − ∑_{m=1..M} ∑_{k=1..K} v((WY)_mk)

This formula is obtained under the assumption of independence of the sources (the ICA assumption), together with modeling the probability density function of the coefficients c_mk, generated by representing the sources according to a certain family of functions, as:

ρ(c_mk) ∝ exp{−v(c_mk)}

where v(x) is a smoothed version of the absolute value function |x|. This model of the probability density is very common when modeling coefficient sparsity. The matrix Y appearing in the expression L_W(Y) is the data matrix, in which each row holds the coefficients of one of the mixtures after the DWP transform is applied to it. In addition, K is the number of coefficients for each mixture (called the number of features), and M = N is the number of sources and also the number of mixtures.

The maximization of L_W(Y), i.e. finding the W that gives the maximum, is done efficiently by the Natural Gradient algorithm (as it is implemented in MATLAB). Using that W we reconstruct the sources exactly as in the case of a square A in the clustering-based algorithm of chapter II.
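
A compact sketch of such an update (our own, not the project's implementation): with the smoothed absolute value v(x) = sqrt(x^2 + ε), the natural gradient ascent step for L_W(Y) is W ← W + μ(I − φ(WY)(WY)^T/K)W, where φ = v'. In MATLAB:

```matlab
% Natural gradient ascent on L_W(Y) with the smoothed-|x| prior (sketch).
function W = naturalGradientICA(Y, nIter, mu, epsSmooth)
    [M, K] = size(Y);                       % one row of coefficients per mixture
    W = eye(M);                             % start from the identity matrix
    for it = 1:nIter
        U   = W * Y;                        % current estimate of the source coefficients
        phi = U ./ sqrt(U.^2 + epsSmooth);  % v'(x) for v(x) = sqrt(x^2 + epsSmooth)
        W   = W + mu * (eye(M) - (phi * U') / K) * W;   % natural gradient step
    end
end
```

A call such as W = naturalGradientICA(Y, 200, 0.1, 1e-4) would return an estimate of the unmixing matrix; the number of iterations, step size and smoothing constant here are illustrative, not the project's settings.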

Algorithm 2:
This algorithm starts by building the mixtures' DWP coefficient tree exactly as in the previous two algorithms. After that, a pass over all the subsets (nodes) of the tree is performed (including the set of all the coefficients), and for each subset the following stages are performed:

1. Estimating the unmixing matrix W according to the ICA algorithm (as described in the first algorithm of this chapter), based on the coefficients of the current subset only.

2. After that we reconstruct the sources using the W matrix of the previous stage, and on those sources we apply the DWP transform in order to calculate their entropy in the Wavelet Transform domain. The entropy of a series of coefficients "coeff" is:

entropy = ∑_{i=1..length(coeff)} |coeff[i]|^0.5

computed after the series of coefficients has been normalized to have a square norm of 1:

(∑_{i=1..length(coeff)} |coeff[i]|^2)^0.5 = 1

The entropy of the sources is calculated as the sum of the entropies of each source separately.
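
A small MATLAB helper (our own sketch of the measure described above) that computes this quantity for a matrix holding one source's coefficients per row:

```matlab
% Sparsity measure ("entropy") of a set: for each source, normalize its
% coefficients to square norm 1 and sum the square roots of their absolute values.
function e = setEntropy(coeffMatrix)          % one row of coefficients per source
    e = 0;
    for m = 1:size(coeffMatrix, 1)
        c = coeffMatrix(m, :);
        c = c / sqrt(sum(abs(c).^2));         % normalize to square norm 1
        e = e + sum(sqrt(abs(c)));            % sum_i |c_i|^0.5
    end
end
```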

After those two stages we choose the "best" set as the set with the minimal reconstructed-sources entropy (the entropy above is a measure of a set's sparsity, and therefore we expect smaller entropy for a sparser set).

The reconstruction of the sources according to the chosen set is performed in the same way as in the previous algorithm.

Algorithm 3:
This algorithm is very similar to algorithm 2, with the difference that after the DWP coefficient tree is built we reconstruct the sources according to all the coefficients, instead of passing over all the sets and reconstructing them for each set. The resulting sources are transformed to the Wavelet Transform domain by building a coefficient tree for the sources, the same way we did for the mixtures at the beginning of the algorithm. Then we pass over all the sets in the tree and for each set calculate the entropy of its coefficients alone, over all the sources (the sum of the entropies of each source by itself).

Finally, the chosen set is the one with minimal entropy (again we assume it is the sparsest set), and based on this set we perform the final reconstruction of the sources (the same way as before).

The entropy used here is given by the formula in the description of algorithm 2.

MATLAB implementation, results and the results table can be seen in the project book.

[Figure 1. First row: the original unknown images; second row: the mixtures; last row: the separated final images.]

Summary and conclusions

  • As can be seen from the errors table, for the case of mixtures without noise the NSE and CTE measures are very close and are therefore practically the same measure of separation quality. For the noisy case there are larger differences between the two error measures, but they are still close; we consider the CTE (cross-talk error) as the error that shows the separation quality, since the NSE is strongly influenced by the noise energy.
  • For the case of two non-noisy images, none of the algorithms checked had a problem achieving visually good separation of the sources; according to the errors table the best algorithm was the 1st ICA-based algorithm with a CTE or NSE error of 2.5e-4% or 0.19%. The other algorithms, such as the clustering-based one and the 3rd ICA algorithm, also achieve quite good results (errors of 0.38% or 0.19% for the clustering algorithm and 0.14% or 0.20% for the 3rd ICA algorithm), while the 2nd ICA-based algorithm has much worse performance (errors of 11.6% or 1.18%).
  • For the cases of three non-noisy images checked, the best algorithm changed from the clustering algorithm for the first case of lena&woman&chess (error of 0.14-0.15%) to the 3rd ICA algorithm for the 2nd case of chess&woman&clown (error of 0.63-0.64%). In general the 1st and 3rd ICA algorithms have relatively good performance here, while the 2nd ICA algorithm reaches quite high errors (12% and 21-22%), and the clustering algorithm for the second case also reaches an error of about 10%.
  • For the case of two noisy images, the algorithm that achieves the lowest errors is undoubtedly the 3rd ICA algorithm with CTE of 0.25% and 0.001%. The other algorithms have much worse performance, with very high errors for the clustering algorithm and the 1st ICA algorithm in the case of lena&woman, and better performance in the 2nd case of clown&chess (still with the highest error for the clustering algorithm).
  • When three images were involved with noise, the best algorithm remains the 3rd ICA algorithm, with errors higher than in the two-image case: 0.64% and 3.1% CTE. The clustering algorithm fails to compete with the others and reaches errors of 1400-1500%. The 1st and 2nd ICA algorithms also return much worse errors than the 3rd ICA algorithm.

Bibliography

  • P. Kisilev, M. Zibulevsky, Y. Y. Zeevi, B. A. Pearlmutter. "Blind source separation via multinode sparse representation." Technion, Haifa, Israel.
  • C. Fraley, A. E. Raftery. "How many clusters? Which clustering method? Answers via model-based cluster analysis." Department of Statistics, University of Washington, USA.
  • P. Kisilev, M. Zibulevsky, Y. Y. Zeevi, B. A. Pearlmutter (2000). "Multiresolution framework for blind source separation."
  • R. Tibshirani, G. Walther, T. Hastie (2000). "Estimating the number of clusters."
  • Hartigan method: http://www-stat.stanford.edu/~tibs/ftp/gap.ps

Acknowledgments
We would like to thank our supervisor, Pavel Kissilev, for his support and guidance throughout this project.
We would also like to thank the Ollendorff Minerva Center Fund, which supported this project.