Speech enhancement based on hidden Markov model using sparse code shrinkage

This paper presents a new hidden Markov model-based (HMM-based) speech enhancement framework based on independent component analysis (ICA). We propose analytical procedures for training the clean speech and noise models using the Baum re-estimation algorithm, and present a maximum a posteriori (MAP) estimator based on the Laplace-Gaussian combination (for clean speech and noise, respectively) in the HMM framework, namely sparse code shrinkage-HMM (SCS-HMM). The proposed method is evaluated on the TIMIT database in the presence of three noise types at three SNR levels in terms of PESQ and SNR, and is compared with the autoregressive HMM (AR-HMM) and HMM-based speech enhancement with discrete cosine transform (DCT) coefficients using the Laplace and Gaussian distributions (LaGa-HMMDCT). The results confirm the superiority of the SCS-HMM method over LaGa-HMMDCT in the presence of non-stationary noises. The SCS-HMM method also performs better than AR-HMM in the presence of white noise based on the PESQ measure.


Introduction
Speech enhancement aims to improve speech quality using various algorithms. Enhancing speech degraded by noise, or noise reduction, is the most important field of speech enhancement, and is used in many applications such as mobile phones, VoIP, teleconferencing systems, speech recognition, and hearing aids. Among the different proposed solutions, the statistical approach to speech enhancement is often preferred due to the stochastic nature of speech signals [1]. Generally, statistical methods are divided into model-based [2,3] and non-model-based [4,5] techniques. In model-based procedures, the clean speech and noise models are first generated in a training phase, and then the clean speech is estimated based on this prior information in a test phase. Non-model-based procedures consist only of the test phase, and the required information is estimated from the noisy speech. Under non-stationary noisy conditions, model-based techniques have an advantage over non-model-based techniques through this prior information [3]. The hidden Markov model (HMM) is one of the powerful model-based methods applied to speech enhancement and has achieved high efficiency, especially under non-stationary noisy conditions [2]. One of the most important factors influencing the model precision of an HMM is the probability density function (pdf) of the clean speech, noise, and noisy speech. In HMM-based speech enhancement, the Gaussian pdf is commonly used to model clean speech and noise, while recent studies [4,6,7] have shown that the clean speech and noise pdfs are non-Gaussian. The multivariate Laplace distribution has been recommended for HMM modeling as a non-Gaussian distribution [8]. In this modeling, using the multivariate Laplace distribution leads to a non-closed-form formula. To solve this problem, it was assumed that the DCT coefficients were statistically independent, whereas the DCT only reduces the correlation between the coefficients, and they are not completely uncorrelated with each other. Even if we assume that the DCT coefficients are uncorrelated with each other and Laplace distributed, we cannot assume that they are statistically independent. If we instead use independent component analysis (ICA), whose coefficients are largely statistically independent of each other, we can perform more accurate statistical modeling. We modeled the clean speech signal using an HMM in the ICA domain with the Laplace distribution, while modeling the noise by assuming only a Gaussian pdf. In this work, we propose a novel MAP HMM-based speech enhancement algorithm that uses the ICA transformation. Our theoretical analysis shows that, under the assumption of Laplace-distributed clean speech and Gaussian noise, the proposed algorithm leads to a well-known enhancement technique, sparse code shrinkage. This paper is organized as follows. In Section 2, the HMM training methods are reviewed. In Section 3, the MAP estimator is derived based on the HMM in the ICA space. In Section 4, a summary of the proposed algorithm is given. In Section 5, we present the experimental evaluation and results, and in Section 6, the conclusions are given.

Signal model
Assume a time-domain noisy speech vector y_n at time n that is composed of a clean speech vector s_n and an additive noise vector d_n, as given in (1). Taking the independent component analysis (ICA) of y_n, we get (2). We assume that the noise is independent of the clean speech, and that the vectors have length L and zero mean. The AR features of P-th order, a_n = [1, a(1), …, a(P)], for s_n = [s(0), s(1), …, s(L-1)] can be derived by the linear predictive coding approach [9], and the AR coefficients of the other signals are obtained analogously.
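The linear predictive coding step referenced above can be illustrated with the autocorrelation method and the Levinson-Durbin recursion. The following is a minimal sketch under the paper's conventions (a = [1, a(1), …, a(P)]); the function name and the NumPy implementation are our own illustrative choices, not the paper's code:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate the prediction-error filter a = [1, a(1), ..., a(P)]
    of a frame via the autocorrelation method and the
    Levinson-Durbin recursion."""
    n = len(frame)
    # Biased autocorrelation up to lag `order`
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]  # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient from the normal equations
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err
        # Update the filter: a_new[j] = a[j] + k * a[i - j]
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a
```

For an AR(P) source the recursion recovers (up to estimation error) the negated generating coefficients, which is the property the AR feature extraction relies on.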
An HMM with M states and N mixtures is defined as λ = (π, A, c, θ), where π is the initial state distribution, A denotes the state transition probability distribution, c is the probability distribution for each mixture in each state, and θ (of dimension M × N × L) is the matrix of pdf parameters in each mixture. The parameters of λ are estimated by the Baum re-estimation formulas [10]. In order to estimate the clean speech from the noisy signal, it is necessary to construct the HMM models for the clean speech (λ_S) and the noise (λ_D) separately, and then combine them to create the noisy HMM (λ_Y).
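The Baum re-estimation formulas build on the forward-backward recursions. As an illustration, a minimal scaled forward pass, which computes the data log-likelihood under a given HMM, can be sketched as follows; the function name and array layout are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def forward_loglik(pi, A, B):
    """Scaled forward pass of an HMM.
    pi: (M,) initial state distribution.
    A:  (M, M) state transition matrix.
    B:  (T, M) per-frame observation likelihoods b_j(y_t).
    Returns ln p(y_1..y_T | model)."""
    T, M = B.shape
    alpha = pi * B[0]
    c = alpha.sum()          # scaling factor avoids numerical underflow
    alpha = alpha / c
    loglik = np.log(c)
    for t in range(1, T):
        alpha = (alpha @ A) * B[t]
        c = alpha.sum()
        alpha = alpha / c
        loglik += np.log(c)
    return loglik
```

The same alpha (and the symmetric beta) quantities yield the state occupancies used in the re-estimation formulas of the next subsections.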

Speech model
Based on the central limit theorem, we can assume that S_ICA has a multivariate Gaussian pdf with independent coefficients, according to (3) and (5). In these equations, the index k denotes the k-th ICA coefficient of an L-dimensional vector.
As shown in [6], the Laplace distribution function is closer to the speech signal distribution in different domains than the Gaussian distribution function, and thus we can consider the distribution of the vector S_ICA to be a multivariate Laplace pdf. We know that the ICA coefficients are independent; therefore, the multivariate Laplace pdf of S_ICA is given by (4) and (5), where b_k is the scale parameter of the k-th coefficient. In these equations, it is assumed that S_ICA has zero mean.
We used (4) and (5) for each mixture in each state, and estimated the model parameters of λ_S in closed form using Baum's auxiliary function. In fact, changing the Gaussian pdf to a Laplace pdf in each HMM mixture modifies the parameter-estimation equations of λ_S (the Laplace scale parameter is estimated in each mixture of each state). The estimate of the Laplace scale parameter can be derived by differentiating the auxiliary function in (6) with respect to the scale parameter, resulting in (7).
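For the univariate Laplace pdf p(x) = exp(-|x|/b)/(2b), the maximum-likelihood scale estimate is the mean absolute value, and the occupancy-weighted analogue is a natural sketch of the re-estimation in (7). The snippet below shows this per state only (mixtures would add one more weighting index, and the exact form of (7) may differ in detail); the function name and array layout are illustrative assumptions:

```python
import numpy as np

def laplace_scale_update(gamma, s_ica):
    """Occupancy-weighted re-estimate of Laplace scale parameters:
    b_{j,k} = sum_n gamma_n(j) |s_k(n)| / sum_n gamma_n(j).
    gamma: (T, M) state occupancies gamma_n(j).
    s_ica: (T, L) ICA coefficients of the clean speech frames.
    Returns a (M, L) matrix of scale parameters."""
    num = gamma.T @ np.abs(s_ica)        # weighted sum of |coefficients|
    den = gamma.sum(axis=0)[:, None]     # total occupancy per state
    return num / den
```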
In these equations, γ_n(j) is the probability of being in state j at time n, and ξ_n(i,j) is the probability of transitioning from state i at time n to state j at time n+1.

Noise model
In this work, we assumed that D_ICA has a multivariate Gaussian distribution. In other words, using (3) and (5) for each mixture in each state, we can estimate the model parameters of λ_D using Baum's auxiliary function. Therefore, the estimation of θ can be interpreted as the estimation of the diagonal covariance matrix Σ_D for each mixture in each state, whose main diagonal contains the variances of the independent dimensions.
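Analogously to the Laplace case, the diagonal noise variances can be re-estimated as occupancy-weighted second moments. A minimal sketch, assuming zero-mean ICA coefficients and showing the per-state case only (the function name and array layout are our own):

```python
import numpy as np

def diag_cov_update(gamma, d_ica):
    """Occupancy-weighted diagonal-covariance re-estimate for the
    Gaussian noise model (zero mean assumed):
    sigma2_{j,k} = sum_n gamma_n(j) d_k(n)^2 / sum_n gamma_n(j).
    gamma: (T, M) state occupancies; d_ica: (T, L) noise ICA coefficients.
    Returns the (M, L) diagonals of the covariance matrices."""
    num = gamma.T @ (d_ica ** 2)
    den = gamma.sum(axis=0)[:, None]
    return num / den
```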

MAP estimation
In this section, we present the MAP estimation based on the hidden Markov model in the ICA space. We assumed that the speech distribution was non-Gaussian and that the noise distribution was Gaussian. Studies have shown that the proposed framework, under the assumption of a non-Gaussian signal and Gaussian noise, leads to sparse code shrinkage [11]; we call the resulting method the SCS-HMM technique. Let s_t be an L-dimensional vector of the clean speech and d_t an L-dimensional vector of the noise, and assume that the noise is additive and statistically independent of the speech. The MAP estimate is obtained by expanding the posterior of the clean speech over the states q and mixtures u, p(s|y) = Σ_{q,u} p(q,u|y) p(s|q,u,y), as given in (9) and (10). On substituting (10) into (9), we obtained the formula in (11); for Gaussian-distributed noise, the likelihood term p(y|s,q,u) in (11) has a closed form. In order to estimate the clean signal in the ICA space, we used the ICA unmixing matrix w_s obtained in the training phase; the estimate of the signal s can be obtained by letting w_d = w_s. For clarity of presentation, we denote w_s by w. In this case, the MAP estimation rule from (11) can be expressed in the form of (12), where w(l,:) denotes the l-th row of the matrix w. In (12), the conditional probability p(q,u|ws, ·) is calculated by the forward-backward algorithm [12], and the second term of the equation is calculated as in (13). We can perform the estimation in the independent space first, and then transform the estimate obtained back into the original space. Denoting x = ws, the components of x can be calculated by equations (14) and (15), x̂(k) = argmax ( ln p_k(x(k)) + ln p(y(k)|x(k), q, u) ). This optimization is equivalent to solving equation (16). Although (16) may not have a closed-form solution, the estimation function can be approximated as in (17) [13].
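For a single coefficient with a zero-mean Laplace prior of scale b and additive Gaussian observation noise of variance σ², maximizing -|s|/b - (u - s)²/(2σ²) gives the classical soft-thresholding rule, the form that sparse code shrinkage takes under this parameterization [11]. A minimal per-coefficient sketch (the function name is our own):

```python
import numpy as np

def sparse_code_shrink(u, noise_var, b):
    """MAP estimate of a zero-mean Laplace(b) coefficient observed in
    Gaussian noise of variance noise_var: soft thresholding with
    threshold noise_var / b.  If |u| exceeds the threshold the estimate
    is u pulled toward zero by the threshold; otherwise it is zero."""
    thresh = noise_var / b
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)
```

In the full estimator this shrinkage is applied per ICA coefficient and weighted across states q and mixtures u by the posterior p(q,u|·).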

Summary of the proposed SCS-HMM algorithm
This section provides a summary of the steps involved in the proposed SCS-HMM enhancement algorithm, as described in Sections 2 and 3. In the final step (step 3), to perform the enhancement of the observed noisy signal y, we apply the estimation rule (11) to estimate the clean speech ŝ.

Experimental evaluation
The objective evaluation of AR-HMM, LaGa-HMMDCT, and the proposed algorithm was performed in terms of the SNR and PESQ measures. The experimental evaluation used speech signals selected from the TIMIT database, separately for each gender. The training set contained 100 sentences for each gender, and the testing set contained sentences from one female and one male speaker; there were no common sentences between the training and test sets. The noisy speech signals were created by adding white, babble, and machine-gun noise at 0 dB, 5 dB, and 10 dB. All of the signals were sampled at 8 kHz and split into frames of 64 samples using a rectangular window, with no inter-frame overlap in the training phases of the various models. The fast-ICA algorithm [14] was employed to estimate the ICA basis functions from the training data. The clean speech models were generated using 10 states and 30 mixtures. In AR-HMM and LaGa-HMMDCT, the noise models were constructed with 4 states and 4 mixtures; due to the use of MAP estimation in the SCS-HMM method, its noise model was generated using 1 state and 1 mixture. In AR-HMM, we used an AR order of 10 for both clean speech and noise. The performance of the proposed SCS-HMM algorithm was compared with AR-HMM [12] and LaGa-HMMDCT [8]. Tables 1 and 2, respectively, report the SNR and PESQ values achieved by AR-HMM, LaGa-HMMDCT, and SCS-HMM. The results confirm the superiority of the SCS-HMM method over LaGa-HMMDCT in the presence of non-stationary noises based on the SNR measure. The SCS-HMM method also performed better than AR-HMM in the presence of white noise based on the PESQ measure; this result is expected, because the MAP estimator is constructed under the assumption of stationary noises such as white noise. However, the corresponding improvement was not observed in the SNR results. Examples of clean, noisy, and enhanced speech spectrograms are depicted in Figure 1; the spectrograms show that the SCS-HMM result is close to the original clean signal.
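For reference, the frame splitting (64-sample rectangular windows, no overlap) and a global SNR measure of the kind used above can be sketched as follows. These are minimal illustrative versions; PESQ requires the ITU-T P.862 algorithm and is not reproduced here:

```python
import numpy as np

def split_frames(x, frame_len=64):
    """Split a signal into non-overlapping rectangular-window frames,
    discarding any trailing samples that do not fill a frame."""
    n = len(x) // frame_len
    return x[:n * frame_len].reshape(n, frame_len)

def snr_db(clean, processed):
    """Global SNR (dB) of a processed signal against the clean
    reference, treating the difference as residual noise."""
    noise = clean - processed
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))
```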

Conclusion
In this paper, we presented a new HMM-based speech enhancement framework based on independent component analysis (ICA). Furthermore, a MAP estimator was derived for the ICA coefficients of the clean speech. It was also shown that, under the assumption of a Laplace-distributed signal and Gaussian-distributed noise, the proposed framework leads to sparse code shrinkage, called the SCS-HMM technique. The evaluation results, in terms of SNR and PESQ, indicated the superiority of the SCS-HMM method over LaGa-HMMDCT in the presence of non-stationary noises. The SCS-HMM method also performed better than AR-HMM in the presence of white noise based on the PESQ measure. In the presence of the other noise types based on PESQ, and in the presence of all noises based on SNR, SCS-HMM showed slightly inferior performance.

In the notation of the preceding sections, b_{u|q} is the scale parameter of the u-th mixture in the q-th state. The estimation rule in (17) is known as the sparse code shrinkage estimation [11]. Given the probabilities p(q,u|·), the clean signal is estimated from its components by (11).

1. Using two sets of training data s̄ and d̄, which should have the same statistical properties as the noise d and the signal s, calculate the ICA transformation matrices w_d and w_s. This can be performed using any of the existing ICA algorithms.
2. Train the clean-speech and noise HMMs using the independent components, as described in Sections 2.1 and 2.2, respectively.
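Putting the pieces together, the enhancement step can be sketched for the degenerate one-state, one-mixture case, where the posterior weighting p(q,u|·) disappears and the estimator reduces to shrinkage in the ICA space. Matrix names and scalar parameters here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def enhance_frame(y, w, w_inv, noise_var, b):
    """One-frame enhancement sketch (single state, single mixture):
    project the noisy frame into the ICA space with the unmixing
    matrix w, apply sparse code shrinkage per coefficient, and map
    back with w_inv.  noise_var is the Gaussian noise variance and
    b the Laplace scale of the clean-speech coefficients."""
    u = w @ y
    thresh = noise_var / b
    s_ica = np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)
    return w_inv @ s_ica
```

The full SCS-HMM estimator replaces the single shrinkage with a posterior-weighted sum over states and mixtures, with the weights computed by the forward-backward algorithm.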

Figure 1. Spectrograms of female speech corrupted by white noise at SNR = 5 dB.