Sound Source Localization with CS Based Compressed Neural Network

Microphone arrays are widely employed to localize sound sources in numerous real-time applications such as speech processing in large rooms or acoustic echo cancellation. Signal sources may lie in the near field or far field with respect to the microphones. Current neural network (NN) based source localization approaches assume far-field narrowband sources. An important limitation of these NN-based approaches is the trade-off between computational complexity and network size: an architecture that is too large or too small degrades performance in terms of generalization or computational cost. Previous work has used saliency analysis to determine the most suitable structure; however, it is time-consuming and its performance is not robust. In this paper, a family of new algorithms for compressing NNs is presented based on Compressive Sampling (CS) theory. The proposed framework makes it possible to find a sparse structure for an NN, and the designed network is then compressed using CS. The key difference between our algorithm and state-of-the-art techniques is that the mapping is continuously performed using the most effective features; therefore, the proposed method converges quickly. The empirical work demonstrates that the proposed algorithm is an effective alternative to traditional methods in terms of accuracy and computational complexity.


Introduction
The location of a sound source is an important piece of information in speech signal processing applications. In sound source localization techniques, the location of the source has to be estimated automatically by calculating the direction of the received signal [1]. Most algorithms for these calculations are computationally intensive and difficult to implement in real time [2]. Neural network based techniques have been proposed to overcome the computational complexity problem by exploiting their massive parallelism [3,4]. These techniques usually assume a narrowband far-field source signal, which is not always applicable [2].
In this paper, we design a system that estimates the direction-of-arrival (DOA) of far-field and near-field wideband sources. The proposed system uses feature extraction followed by a neural network. Feature extraction is the process of selecting the data useful for estimating the DOA; this selection is performed using CS. The neural network, which performs the pattern recognition step, computes the DOA to locate the sound source. The key insight is the use of the instantaneous cross-power spectrum at each pair of sensors, i.e., the cross-power spectrum calculated without any averaging over realizations. This step first calculates the discrete Fourier transform (DFT) of the signals at all sensors. In the compressive sampling step, K coefficients of these DFTs are selected, and the DFT coefficients at the selected frequencies are multiplied by the complex conjugates of the corresponding coefficients at the neighboring sensors. In comparison to other cross-power spectrum estimation techniques (which multiply each pair of DFT coefficients and average the results), this reduces the computational complexity. After this step, we compress the neural network that was designed with these feature vectors. We propose a family of new algorithms based on CS to achieve this. The main advantage of this framework is that these algorithms are capable of iteratively building up the sparse topology while maintaining the training accuracy of the original larger architecture. Experimental and simulation results show that, by using NNs and CS, we can design a compressed neural network that locates the sound source with acceptable accuracy.
The remainder of the paper is organized as follows. The next section presents a review of techniques for sound source localization. Section III explains feature selection and discusses the training and testing procedures of our sound source localization technique. Section IV describes traditional pruning algorithms and compressive sampling theory, and Section V contains the details of the new network pruning approach, describing the link between pruning NNs and CS and introducing two definitions for different sparse matrices. Experimental results are presented in Section VI, and Section VII concludes the paper.

Sound Source Localization
Sound source localization is performed via the DOA. The far-field assumption holds as long as the distance between the source and the reference microphone is larger than $2D^2/\lambda_{\min}$ [2] (Fig. 1), where $\lambda_{\min}$ is the minimum wavelength of the source signal and $D$ is the microphone array length. Under this condition, incoming waves are approximately planar, so the time delay of the received signal between the reference microphone and the $n$-th microphone is $(n-1)t_0$, with [15]

$$t_0 = \frac{l \sin\Phi}{\upsilon} \qquad (1)$$

In (1), $l$ is the distance between two microphones, $\Phi$ is the DOA, and $\upsilon$ is the velocity of sound in air. Therefore, $t_0$ is the time the signal takes to traverse the distance between any two neighboring microphones; Figures 1 and 2 illustrate this. If the source is not far enough away, the planar approximation no longer holds, and the delay between the reference microphone and the $n$-th microphone is instead determined by the difference in path lengths from the source to the two microphones, following the near-field geometry of Figure 2, where $l$ is the distance between the source and the first (reference) microphone [15].
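The far-field condition and the delay in (1) can be sketched in a few lines of code. This is a minimal illustration of the geometry above, not code from the paper; the function names and the speed-of-sound default are our own assumptions.

```python
import numpy as np

def far_field_delay(l, phi_deg, v=343.0):
    """Inter-microphone time delay t0 = l*sin(phi)/v from equation (1).
    l       -- spacing between two neighboring microphones (m)
    phi_deg -- DOA in degrees
    v       -- speed of sound in air (m/s, assumed 343)
    """
    phi = np.deg2rad(phi_deg)
    return l * np.sin(phi) / v

def is_far_field(r, D, lam_min):
    """Far-field condition r > 2*D**2 / lam_min, where r is the
    source-to-reference-microphone distance, D the array length,
    and lam_min the minimum wavelength of the source signal."""
    return r > 2.0 * D**2 / lam_min
```

For example, a 10 cm spacing and a source at broadside (0 degrees) gives zero delay, while an endfire source (90 degrees) gives the maximum delay of l/v.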

Feature Selection
The aim of this section is to compute feature vectors from the array data and use the approximation property of the MLP (Multi-Layer Perceptron) to map the feature vectors to the corresponding DOA, as shown in Figure 3 [6]. The feature vector must: 1. be mappable to the desired output (DOA); 2. be independent of the phase, frequency, bandwidth, and amplitude of the source;
3. be computationally efficient to calculate. Assume that $S_n(t)$ is the signal received at the $n$-th microphone and that $n=1$ is the reference microphone. We can write the signal at the $n$-th microphone in terms of the signal at the first microphone as

$$S_n(t) = S_1\big(t - (n-1)t_0\big)$$

Then the cross-power spectrum between sensor $n$ and sensor $n+1$ is

$$C_n(\omega) = S_n(\omega)\,S_{n+1}^{*}(\omega) = |S_1(\omega)|^2\, e^{j\omega t_0}$$

and its normalized version is

$$\bar{C}_n(\omega) = \frac{C_n(\omega)}{|C_n(\omega)|} = e^{j\omega t_0}$$

The phase of $\bar{C}_n(\omega)$ is therefore directly related to $t_0$, and thus to the DOA. Our aim is to use an MLP neural network to approximate this mapping.
Our algorithm for computing a real-valued feature vector can be summarized as follows: 1. Compute the DFT of the signal at each sensor. 2. Select K of these DFT coefficients (the compressive sampling step) and multiply each selected coefficient by the complex conjugate of the corresponding coefficient at the neighboring sensor to obtain the cross-power spectrum. 3. Construct a feature vector that contains the real and imaginary parts of the cross-power spectrum coefficients and their corresponding FFT indices.
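The three steps above can be sketched as follows. This is an illustrative reconstruction under our own assumptions about the interface (array shapes, how the K bins are chosen); the paper does not specify these details.

```python
import numpy as np

def feature_vector(signals, K):
    """Sketch of the feature-extraction steps: DFT at every sensor,
    selection of K bins, instantaneous cross-power between
    neighboring sensors, then real/imag parts plus FFT indices.
    signals -- (n_mics, n_samples) array of sensor signals
    K       -- number of DFT coefficients kept (the CS step)
    """
    S = np.fft.rfft(signals, axis=1)        # step 1: DFT at each sensor
    # step 2: select K bins (here: strongest bins of the reference sensor)
    idx = np.argsort(np.abs(S[0]))[-K:]
    feats = []
    for n in range(signals.shape[0] - 1):
        # instantaneous cross-power spectrum: no averaging over realizations
        C = S[n, idx] * np.conj(S[n + 1, idx])
        C = C / np.abs(C)                   # normalization keeps only the phase
        feats.append(np.concatenate([C.real, C.imag]))
    feats.append(idx.astype(float))         # step 3: corresponding FFT indices
    return np.concatenate(feats)
```

For M microphones this yields a real-valued vector of length 2K(M-1) + K.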
We utilized a two-layer Perceptron neural network and trained it with a fast back-propagation training algorithm [7]. To train the network we used a simulated dataset of received signals, modeling each received signal as a sum of cosines with random frequencies and phases. The sampled signal received at sensor $n$ can be written as

$$s_n(k) = \sum_{i=1}^{N} \cos\big(\omega_i (kT - (n-1)t_0) + \varphi_i\big)$$

where $N$ is the number of cosines (we assumed $N = 10$, matching the ten dominant frequencies used in the experiments), $T$ is the sampling period, and $\omega_i$ and $\varphi_i$ are the random frequency and phase of the $i$-th cosine.
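A minimal sketch of this simulated dataset follows. The frequency range, sampling rate, and function name are our own assumptions for illustration; only the signal model (sum of N random cosines, delayed by (n-1)t0 at sensor n) comes from the text.

```python
import numpy as np

def simulate_array_signal(n_mics, n_samples, fs, t0, N=10, seed=0):
    """Simulated received signals: a sum of N cosines with random
    frequencies and phases, delayed by (n-1)*t0 at the n-th sensor."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(100.0, 0.4 * fs, size=N)   # random frequencies (Hz)
    phases = rng.uniform(0.0, 2 * np.pi, size=N)   # random phases
    k = np.arange(n_samples) / fs                  # sample times
    sig = np.zeros((n_mics, n_samples))
    for n in range(n_mics):
        delay = n * t0                             # (n-1)*t0 with 1-based n
        for f, p in zip(freqs, phases):
            sig[n] += np.cos(2 * np.pi * f * (k - delay) + p)
    return sig
```

With t0 equal to an integer number of samples, each sensor's signal is simply a shifted copy of the reference sensor's signal, which is a convenient sanity check.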

Traditional Pruning Algorithms and CS Theory
Generally speaking, network pruning is often cast as three sub-procedures: (i) define and quantify the saliency of each element in the network; (ii) eliminate the least significant elements; (iii) re-adjust the remaining topology. With this in mind, the following questions arise: 1) What is the best criterion for describing the saliency, or significance, of elements?
2) How to eliminate those unimportant elements with minimal increase in error?
A new theory known as Compressed Sensing (CS) has recently emerged that can also be categorized as a type of dimensionality reduction. Like manifold learning, CS is strongly model-based (relying on sparsity in particular).
This theory states that for a given degree of residual error ε , CS guarantees the success of recovering the given signal under some conditions from a small number of samples [14].
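The recovery guarantee above is typically realized in practice with greedy solvers. As a concrete illustration (our own sketch, not an algorithm from the paper), Orthogonal Matching Pursuit recovers an S-sparse vector x from measurements y = A x:

```python
import numpy as np

def omp(A, y, S, tol=1e-6):
    """Orthogonal Matching Pursuit: greedy recovery of an S-sparse x
    from y = A x (the single-measurement-vector case)."""
    m, n = A.shape
    residual = y.copy()
    support = []
    x = np.zeros(n)
    for _ in range(S):
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares refit on the current support
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    x[support] = coef
    return x
```

With a random Gaussian measurement matrix and enough measurements relative to the sparsity S, this recovers the signal exactly with overwhelming probability, which is exactly the kind of guarantee CS theory provides.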
According to the number of measurement vectors, the CS problem can be categorized into Single-Measurement Vector (SMV) and Multiple-Measurement Vector (MMV) problems.

Problem Formulation and Methodology
Before we formulate the problem of network pruning as a compressive sampling problem, we introduce some definitions [11,10]: 1. If the $\ell_0$-norm of every column of a matrix is smaller than $S$, the matrix is called an $S$-sparse-1 matrix. 2. If the number of rows of a matrix that contain nonzero elements is smaller than $S$, the matrix is called an $S$-sparse-2 matrix. We assume that the training input patterns are stored in a matrix $I$ and the desired output patterns in a matrix $O$; the mathematical model for training the neural network can then be written as

$$O = w_2 \, f(w_1 I)$$

where $w_1$ and $w_2$ are the weight matrices of the hidden and output layers and $f(\cdot)$ is the hidden-layer activation function.
This problem is equivalent to finding a $w_2$ most of whose rows are zero. With the definition of the $S$-sparse-2 matrix we can rewrite the problem accordingly; in matrix form, equations (9) and (10) take the standard CS form, in which $O_h^{*}$ is the input matrix of the hidden layer of the compressed neural network. Comparing these equations with (7), we conclude that these minimization problems can be written as CS problems.
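The row-sparse (MMV) formulation can be sketched with a simultaneous greedy selection: choose the few hidden neurons whose outputs jointly best explain the targets, then refit the output weights on that subset. This is our own illustrative reconstruction; the variable names and the simultaneous-OMP solver are assumptions, not the paper's exact equations (9)-(10).

```python
import numpy as np

def prune_hidden_neurons(O_h, O, S):
    """Greedy MMV sketch: select at most S hidden neurons (i.e. rows
    of w2 that stay nonzero) that jointly approximate the targets.
    O_h -- (n_patterns, n_hidden) hidden-layer outputs on training data
    O   -- (n_patterns, n_outputs) desired outputs
    Returns the kept neuron indices and the compressed output weights."""
    residual = O.copy()
    support = []
    for _ in range(S):
        # joint correlation of each hidden neuron with all output columns
        corr = np.linalg.norm(O_h.T @ residual, axis=1)
        corr[support] = -np.inf          # never reselect a kept neuron
        support.append(int(np.argmax(corr)))
        # least-squares refit of the output weights on the kept neurons
        W, *_ = np.linalg.lstsq(O_h[:, support], O, rcond=None)
        residual = O - O_h[:, support] @ W
    return support, W
```

Pruning then amounts to deleting the hidden neurons outside the returned support, which is precisely a $w_2$ with few nonzero rows.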

Results and Discussion
As mentioned before, assuming that the received speech signals are modeled with 10 dominant frequencies, we trained a two-layer Perceptron neural network with 128 neurons in the hidden layer, using feature vectors obtained with CS from the cross-power spectrum of the received microphone signals. After computing the network weights, we compressed the network with our algorithms.
In order to compare our results with previous algorithms, we used SNNS (a simulator for NNs, available at [19]). All of the traditional algorithms, such as Optimal Brain Damage (OBD) [16], Optimal Brain Surgeon (OBS) [17], Magnitude-based pruning (MAG) [18], Skeletonization (SKEL) [6], non-contributing units (NC) [7], and the Extended Fourier Amplitude Sensitivity Test (EFAST) [13], are available in SNNS. (CSS1 is the name of our algorithm that uses the SMV formulation for the sparse representation, and CSS2 is the variant that uses the MMV formulation.) Tables 1 and 2 present the simulation results: Table 1 compares the algorithms on a classification problem, and Table 2 on an approximation problem. For the classification problem we compare, under the same stopping rule for training, the sum of the hidden-neuron weights across algorithms, the classification error, and the training time per epoch. In Table 2 we compare the number of hidden neurons, the approximation error, and the training time per epoch, again under the same stopping rule. From these results we can infer that the CS-based algorithms are faster than the other algorithms and achieve a smaller error. Between the two, CSS1 is faster than CSS2 and has lower computational complexity; that is, the algorithm that uses a single measurement vector (SMV) is faster than the one that uses multiple measurement vectors (MMV), although its error is not smaller.

Conclusions
In this paper, compressive sampling is utilized to design NNs. In particular, using the pursuit and greedy methods of CS, a compression method for NNs has been presented.
The key difference between our algorithm and previous techniques is that we focus on the remaining elements of the neural network; as a result, our method converges quickly. The simulation results demonstrate that our algorithm is an effective alternative to traditional methods in terms of accuracy and computational complexity, and that the proposed algorithm can decrease the computational complexity while improving performance.