Incorporating Prior Information in
Nonnegative Matrix Factorization for
Audio Source Separation
by
Emad Mounir Grais Girgis
A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy
in the
Faculty of Engineering and Natural Sciences Electronics Engineering
APPROVED BY
Assist. Prof. Dr. Hakan Erdo˘gan ... (Thesis Supervisor)
Prof. Dr. Mustafa ¨Unel ...
Assoc. Prof. Dr. M¨ujdat C¸ etin ...
Assoc. Prof. Dr. Ali Taylan Cemgil ...
Assoc. Prof. Dr. Ilker Hamzao˘glu ...
I would like to express my deep and sincere gratitude to my thesis supervisor Assist. Prof. Hakan Erdo˘gan for his invaluable guidance, tolerance, positiveness, support, and encouragement throughout my thesis.
I am grateful to my committee members Prof. Mustafa ¨Unel, Assoc. Prof. Ilker Hamzao˘glu, Assoc. Prof. Ali Taylan Cemgil, and Assoc. Prof. M¨ujdat C¸ etin for taking the time to read and comment on my thesis.
I would like to thank Erasmus Mundus for providing the necessary financial support for the first three years of my PhD study. I also would like to thank Sabancı university and T¨urk-Telekom for supporting my research for the remaining period of my PhD study.
My deepest gratitude goes to my family for their unflagging love and support throughout my life.
I would like to thank all my colleges in VPA laboratory for their great help. Finally, I would like to thank the international office here at Sabancı university for their help during my stay in Turkey.
Emad Mounir Grais Girgis
EE, PhD Thesis, 2013
Thesis Supervisor: Hakan Erdo˘gan
Keywords: Single channel source separation, nonnegative matrix factorization, hidden Markov model, Gaussian mixture model, minimum mean squared error estimation, model adaptation, orthogonality constraints, discriminative training,
dictionary learning, Wiener filter, spectral masks.
Abstract
In this work, we propose solutions to the problem of audio source separation from a single recording. The audio source signals can be speech, music or any other audio signals. We assume training data for the individual source signals that are present in the mixed signal are available. The training data are used to build a representative model for each source. In most cases, these models are sets of basis vectors in magnitude or power spectral domain. The proposed algorithms basically depend on decomposing the spectrogram of the mixed signal with the trained basis models for all observed sources in the mixed signal. Nonnegative matrix factorization (NMF) is used to train the basis models for the source signals. NMF is then used to decompose the mixed signal spectrogram as a weighted linear combination of the trained basis vectors for each observed source in the mixed signal. After decomposing the mixed signal, spectral masks are built and used to reconstruct the source signals.
In this thesis, we improve the performance of NMF for source separation by incorpo-rating more constraints and prior information related to the source signals to the NMF decomposition results. The NMF decomposition weights are encouraged to satisfy some prior information that is related to the nature of the source signals. The priors are modeled using Gaussian mixture models or hidden Markov models. These priors basi-cally represent valid weight combination sequences that the basis vectors can receive for a certain type of source signal. The prior models are incorporated with the NMF cost function using either log-likelihood or minimum mean squared error estimation (MMSE).
introduce post enhancement using MMSE estimation to obtain better separation for the source signals.
In this thesis, we also improve the NMF training for the basis models. In cases when enough training data are not available, we introduce two different adaptation methods for the trained basis to better fit the sources in the mixed signal. We also improve the training procedures for the sources by learning more discriminative dictionaries for the source signals. In addition, to consider a larger context in the models, we con-catenate neighboring spectra together and train basis sets from them instead of a single frame which makes it possible to directly model the relation between consequent spectral frames.
Experimental results show that the proposed approaches improve the performance of using NMF in source separation applications.
EMAD MOUNIR GRAIS GIRGIS
EE, Doktora Tezi, 2013
Tez Danı¸smanı: Hakan Erdo˘gan
Anahtar Kelimeler: Tek Kanal Kaynak ayrımı, Negatif Olmayan Matris Ayrı¸stırma (NOMA), saklı Markov modeli, Gauss karı¸sım modeli, minimum ortalama karesel hata
kestirimi (MOKH), model uyarlama, dikgenlik kısıtları, ayırt edici e˘gitim, s¨ozl¨uk ¨
o˘grenme, Wiener filtresi, spektral maskeler.
¨ Ozet
Bu ¸calı¸smada tek bir kayıttan ses kaynaklarının ayrımı problemine ¸c¨oz¨um ¨onerilerinde bulunuyoruz. Ses kaynakları konu¸sma, m¨uzik veya ba¸ska ses sinyalleri olabilir. Karı¸smı¸s sinyal i¸cerisindeki ¨ozg¨un sinyal kaynaklarının e˘gitim verilerinin elimizde mevcut oldu˘gunu varsayıyoruz. E˘gitim verileri her kaynak i¸cin ¨ornek model kurmak amacıyla kullanılır. Genellikle bu modeller spektral uzayda b¨uy¨ukl¨uk veya g¨u¸c de˘gerlerini a¸cıklayan ta-ban vekt¨or k¨umeleridir. Temelde, ¨onerilen algoritma karı¸smı¸s sinyalin spektrogramının karı¸smı¸s sinyal i¸cinde bulunan b¨ut¨un kaynak sinyallerin taban e˘gitim modelleriyle ayrı¸stır-ılmasına dayanır. Kaynak sinyallerin taban modellerini e˘gitmek i¸cin Negatif Olmayan Matris Ayrı¸stırma (NOMA) metodu kullanılır. Daha sonra NOMA, karı¸smı¸s sinyal spek-trogramını, bu sinyal i¸cinde bulunan b¨ut¨un kaynak sinyallerin e˘gitilmi¸s taban vekt¨orlerinin a˘gırlıklı do˘grusal katı¸sımı olarak ayrı¸stırmakta kullanılır. Karı¸smı¸s sinyali ayrı¸stırdıktan sonra kaynak sinyali tekrar in¸sa etmek i¸cin spektral maskeler olu¸sturulur.
Bu tezde, NOMA ayrı¸stırma sonu¸clarına, kaynak sinyalleriyle ba˘glantılı daha ¸cok kısıt ve ¨onsel bilgi dahil ederek, kaynak ayrı¸stırmada NOMA’nın performansını arttırıyoruz. NOMA ayrı¸stırmasındaki a˘gırlıklar kaynak sinyallerin do˘gasına ba˘glı bazı ¨onsel kısıtları sa˘glamak i¸cin te¸svik edilmi¸stir. Kullandı˘gımız ¨onsel bilgi modelleri Gauss karı¸sımı ya da saklı Markov modelleridir. Temelde bu ¨onsel modeller her kayna˘gın tabanlarının sahip olacakları ge¸cerli a˘gırlık dizilerini ifade ederler. Bu ¨onsel modeller NOMA maliyet fonksiyonuna log-olabilirlik ya da minimum ortalama karesel hata (MOKH) kestirimi kullanılarak dahil edilmi¸stir.
MOKH kestirimi kullanarak ardıl iyile¸stirme metodu da tanıtılmı¸stır.
Bu tezde aynı zamanda taban modelleri i¸cin NOMA e˘gitimini de iyile¸stiriyoruz. Yeterli e˘gitim verisi mevcut olmayan durumlarda karı¸smı¸s sinyaldeki kaynaklara daha uygun tabanlar bulmak amacıyla iki farklı uyarlama metodu sunuyoruz. Di˘ger bir katkı olarak, kaynak sinyaller i¸cin daha ayırt edici modeller ¨o˘grenerek kaynak e˘gitim yordamlarını da geli¸stiriyoruz. Ba¸ska bir b¨ol¨umde, modellerimizin ¸cevresel etkileri daha iyi ¨o˘grenmesi i¸cin, kom¸su spektral verileri birle¸stirdikten sonra onlardan taban vekt¨orleri e˘gitiyor ve b¨oylece kom¸su ¸cer¸ceveler arasındaki bilgileri do˘grudan modellemi¸s oluyoruz.
Deneysel sonu¸clar ¨onerilen metotların kaynak ayrı¸stırma uygulamalarında NOMA’nın performansını arttırdı˘gını g¨ostermi¸stir.
Acknowledgements iv
Abstract v
¨
Ozet vii
List of Figures xii
List of Tables xiii
Abbreviations xv
1 Introduction 1
1.1 Approaches to single-channel audio source separation . . . 2
1.2 The contributions of this thesis . . . 6
1.3 Organization of this thesis . . . 8
2 Background 9 2.1 Formulation for single-channel source separation . . . 9
2.2 Non-negative matrix factorization . . . 11
2.3 NMF for single channel source separation . . . 16
2.3.1 Reconstruction of source signals and spectral masks . . . 19
2.4 Performance evaluation . . . 20
3 Regularized NMF using GMM priors 22 3.1 Motivations and overview . . . 22
3.2 The proposed regularized nonnegative matrix factorization approach . . . 25
3.3 Training the source models . . . 28
3.3.1 Sequential training . . . 28
3.3.2 Joint training . . . 29
3.3.3 Determining the hyper-parameters . . . 30
3.4 Signal separation . . . 31
3.4.1 Reconstruction of source signals and spectral masks . . . 33
3.4.2 Signal separation using IS-NMF . . . 33
3.5 Experiments and discussion . . . 34
3.5.1 Speech-music separation . . . 34
3.5.2 Speech-speech separation . . . 38
3.5.3 Comparison with the use of a conjugate prior . . . 40
3.6 Conclusion . . . 42
4 Regularized NMF using HMM priors 44 4.1 Motivations and overview . . . 44
4.2 The proposed regularized NMF using HMM . . . 46
4.3 Training the source models . . . 50
4.3.1 Initial training . . . 50
4.3.2 Joint training . . . 51
4.4 Signal separation . . . 52
4.5 Experiments and discussion . . . 54
4.5.1 Comparison with other priors . . . 55
4.6 Conclusion . . . 57
5 Regularized NMF using MMSE estimates under GMM priors with online learning for the uncertainties 58 5.1 Motivations and overview . . . 58
5.2 Regularized nonnegative matrix factorization using MMSE estimation . . 59
5.3 The proposed regularized NMF for source separation . . . 65
5.3.1 Signal separation . . . 65
5.3.2 Source signals reconstruction . . . 68
5.4 Experiments and discussion . . . 69
5.4.1 Comparison with other priors . . . 70
5.5 Conclusion . . . 72
6 Spectro-temporal post-smoothing 74 6.1 Motivations and overview . . . 74
6.2 Source signals reconstruction and smoothed masks . . . 75
6.3 Experiments and discussion . . . 77
6.3.1 Comparison with regularized NMF with continuity prior . . . 79
6.3.2 Comparison with regularized NMF with MMSE priors . . . 81
6.3.3 Combining MMSE estimation based regularized NMF with post-smoothing . . . 81
6.4 Conclusion . . . 82
7 Spectro-temporal post-enhancement using MMSE estimation 84 7.1 Motivations and overview . . . 84
7.2 MMSE estimation for post enhancement . . . 85
7.2.1 Training the source GMMs . . . 86
7.2.2 Learning the distortion . . . 87
7.2.3 Calculating MMSE estimates . . . 88
7.3 Experiments and discussion . . . 89
7.4 Conclusion . . . 91
8 Discriminative nonnegative dictionary learning using cross-coherence
8.1 Motivations and overview . . . 92
8.2 Dictionary learning . . . 93
8.3 Discriminative learning through cross-coherence penalties . . . 95
8.4 Signal separation . . . 98
8.5 Experiments and discussion . . . 98
8.6 Conclusion . . . 100
9 Adaptation of speaker-specific bases in non-negative matrix factoriza-tion 102 9.1 Motivations and overview . . . 102
9.2 Probabilistic perspective of NMF . . . 104
9.3 Basis vectors matrix prior p(B) . . . 105
9.4 Training the bases . . . 105
9.5 Speech model adaptation . . . 106
9.5.1 Bayesian adaptation of the speech bases . . . 106
9.5.2 Linear transformation adaptation of the speech bases. . . 107
9.5.3 Combined adaptation . . . 108
9.6 Signal separation and reconstruction . . . 108
9.7 Experiments and results . . . 109
9.8 Conclusion . . . 111
10 Nonnegative matrix factorization with sliding windows and spectral masks 112 10.1 Motivations and overview . . . 112
10.2 Training the bases . . . 113
10.3 Signal separation and masking . . . 113
10.3.1 Source signals reconstruction and masks. . . 114
10.4 Experiments and discussion . . . 115
10.4.1 Comparison with post-smoothing in Chapter 6 and CNMF . . . . 116
10.5 Conclusion . . . 118
11 Conclusions and future work 119 11.1 Future work . . . 121
12 Appendix A 122
13 Appendix B 130
1.1 Single channel source separation (SCSS) . . . 2
1.2 Nonnegative matrix factorization (NMF). . . 5
2.1 The nonnegative linear combinations for the given two basis vectors. . . 12
2.2 The NMF decomposition matrices for the sinusoidal signals. . . 15
2.3 The DTMF signals.. . . 16
2.4 The NMF decomposition matrices for the DTMF signals. . . 17
2.5 The NMF decomposition matrices for a clean speech signal. . . 18
3.1 The cluster structure for the nonnegative linear combinations of the basis vectors. 24 3.2 The effect of changing the number of GMM mixture K for speech-music separation using KL-NMF at SMR = −5 dB, λspeech= λmusic= 0.005, λtrain= 0.0001. . . 36
3.3 The SIR for the case of using no priors during training and separation stages, the case of using prior only during testing, and the case of using prior during training and separation stages. . . 39
4.1 The cluster and temporal structures for the nonnegative linear combinations of the basis vectors. . . 45
4.2 The graphical model representation of a HMM . . . 48
5.1 The flow chart of using regularized NMF with MMSE estimates under GMM priors for SCSS. The term NMF+MMSE means regularized NMF using MMSE estimates under GMM priors. . . 66
5.2 The effect of using different prior models on the gains matrix on the SNR values. . . 71
5.3 The effect of using different prior models on the gains matrix on the SIR values . . . 72
5.4 The effect of using different prior models on the gains matrix on the SDR values . . . 73
7.1 Columns construction and sliding windows with length L frames. . . 86
8.1 The simplified cross-coherence penalty. . . 99
8.2 SDR and SIR in dB for the estimated speech signal. . . 100
10.1 Columns construction and sliding windows with length L frames. . . 113
12.1 The graphical model of the observation model. . . 122
12.2 Graphical representation of the observation model for a set of N data points. . . 127
3.1 SNR in dB for the speech signal for speech-music separation using regu-larized KL-NMF with λtrain = 0 and different values of the regularization
parameters in testing λspeech and λmusic. . . 38
3.2 SNR in dB for the speech signal for speech-music separation using reg-ularized KL-NMF with different values of the regularization parameters λspeech, λmusic and λtrain= 0.0001 for last two columns. . . 38
3.3 SNR in dB for the speech signal for speech-music separation using reg-ularized IS-NMF with different values of the regularization parameters λspeech and λmusic. . . 38
3.4 SNR in dB for the male speech signal for speech-speech separation using regularized KL-NMF with different values of the regularization parame-ters λmale and λf emale. . . 39
3.5 SNR in dB for the male speech signal for speech-speech separation using regularized IS-NMF with different values of the regularization parameters λmale and λf emale. . . 40
3.6 SNR in dB for the speech signal for speech-music separation using conju-gate prior KL-NMF with different values of the prior parameters. . . 42
4.1 SNR in dB for the estimated speech signal for using different HMM . . . 55
4.2 SNR in dB for the estimated speech signal for using GMM prior models . . . 56
4.3 SNR in dB for the estimated speech signal for using different prior models . . . 57
5.1 SNR and SIR in dB for the estimated speech signal with regularization parameters λspeech= λmusic = 1 and different number of Gaussian mixture components K. . . 69
6.1 SNR in dB for the estimated speech signal using spectral mask without and with smoothing filter, with different filter types and different filter size a × b. . . 77
6.2 SNR in dB for the estimated speech signal using spectral mask after smoothing the matrix G in the mask, with different filter types and different filter size a × b.. . . 77
6.3 SNR in dB for the estimated speech signal with smoothing G without using mask with different filters with a = 1, b = 3. . . 79
6.4 SNR in dB for the estimated speech signal with smoothing the estimated magnitude spectrogram of speech signal with different filters with a = 1, b = 3.. . . 79
6.5 SNR in dB for the estimated speech signal using only NMF and with using regularized NMF in [1]. . . 80
6.6 SNR and SIR in dB for the estimated speech signal using spectral mask after smoothing the matrix G in the mask, with different filter types and different filter size a = 1 and different values for b. . . 81
6.7 SNR and SIR in dB for the estimated speech signal using MMSE estimates based regularized NMF and smoothed masks for different filter types and
different filter size a = 1, K = 16, λ = 1 and different values for b. . . 82
6.8 SNR and SIR in dB for the oracle experiment. . . 82
7.1 SDR and SIR in dB for the estimated speech signal. . . 90
8.1 SDR and SIR in dB for the estimated speech signal. . . 100
9.1 Signal to Noise Ratio (SNR) in dB for the separated speech signal for every experiment with 10 seconds samples. . . 110
9.2 Signal to Noise Ratio (SNR) in dB for the separated speech signal for every experiment with 15 seconds samples. . . 111
10.1 SNR in dB for the speech signal using NMF with sliding window and spectral mask with p = 3 for different numbers of bases. . . 115
10.2 SNR in dB for the speech signal in case of using NMF with sliding window and different masks, with Ns = Nm= 642. . . 116
10.3 SNR in dB for the speech signal in case of using NMF with different masks, without sliding window, with Ns= Nm = 128. . . 116
10.4 The percentage improvement for SNR and SIR in dB for the estimated speech signal for using post-smoothing and NMF with sliding windows. . 117
10.5 SNR and SIR in dB for the estimated speech signal in the case of using NMF with sliding window and CNMF with different L values and p = 2. . 118
SCSS Single Channel Source Separation NMF Nonnegative Matrix Factorization
GMM Gaussian Mixtuer Models
HMM Hidden Markov Models
FHMM Factorial Hidden Markov Models
MMSE Minimum Mean Squared Error
PDF Probability Density Function
MFCC Mel-Frequency Cepstral Coefficients
EM Expectation Maximization
SVM Support Vector Machines
SNR Signal to Noise Ratio SDR Signal to Distortion Ratio SIR Signal to Interference Ratio PSD Power Spectral Density
Introduction
Source separation refers to the problem of separating one or more desired signals from mixtures of multiple signals. This problem can be encountered in many different applica-tions such as medical [2,3,4], military [5,6], and multimedia [7,8]. To perform effective separation, this problem is usually approached by using multiple sensors each of which measures a different mixture of the source signals to obtain sufficient information about the incoming source signals. In most cases, the source signals are assumed to be statis-tically independent and no extra prior information about the source signals is assumed available. The problem is treated as blind source separation (BSS) [7, 9], which can be performed by techniques such as independent component analysis (ICA) [9, 10,11]. This approach performs well when the number of measuring sensors (channels) are at least as many as the number of signal sources in the mixed signal.
A more complicated problem is that of separating multiple source signals from a single measuring of the mixed signal. This problem is usually defined as the single channel source separation (SCSS) problem. The goal in single-channel source separation (SCSS) is to recover the original source signals from a single recording of their linear mixture as shown in Figure 1.1. Since the problem is underspecified, prior knowledge or training data for the source signals are assumed to be available.
In this thesis we consider the single channel source separation problem for audio signals. The audio signals can be speech, music, or noise. The single-channel audio source separation problem is encountered in many applications such as: separating instruments in music recordings [1,12,13,14], separating speech signals from multiple simultaneous
speakers recording [15, 16, 17, 18, 19, 20], separating speech signals from background music signals [21, 22, 23, 24, 25], speech denoising [26, 27], and improving automatic speech recognition systems by removing the background signals [28,29,30,31,32].
Figure 1.1: Single channel source separation (SCSS)
1.1
Approaches to single-channel audio source separation
There are many proposed approaches to estimate the audio source signals from the observed mixed signal. Most of these approaches rely on training data about the source signals that are in the mixture. In many approaches, the training and the mixed signals are usually processed in magnitude or power spectral domain [33,34,35,36,37,38]. In other approaches, the signals are processed in the log-spectral domain [39,40,41,42].
In [16,17,18], the training data for each source are modeled with a Gaussian mixture model (GMM) in the log-spectral domain. Given the trained GMMs, minimum mean squared error estimation (MMSE) is used to estimate the source signals from the ob-served mixed signal. This approach is usually used for separating speech signals from a mixture of multiple speaker signals. To better model the source signals, in [43,44,45,46], the training data for each source are modeled using a hidden Markov model (HMM) and the mixed signal is represented by a factorial hidden Markov model (FHMM) [47]. The best state sequence of each HMM for each source that can explain the sequence of the observed mixed signal is found. Given the state sequences, MMSE estimation is used to find the estimate of each source. The idea of the factorial hidden Markov model was first developed in [47]. It has been shown that FHMMs are better suited to model loosely coupled random processes [48]. In FHMM, every hidden state is factorized into multiple
state variables. The main limitation of using MMSE estimation to separate the source signals in the log-spectral domain is that, the data used in training and separation stages are assumed to have the same energy level. In practical cases, the sources are mixed with a different energy level. There are approaches to fix this limitation [49,50,51]. The basic idea is to express the probability density function (PDF) of the mixture in terms of the individual speakers PDFs and their corresponding gains. Then, those patterns and gains which maximize the mixture PDF are selected and used to recover the speech signals. In [44], a non-linear optimization technique was used to estimate the ratio be-tween the energy of the two sources. In [52], the expectation-maximization algorithm (EM) was used to estimate the gains and the other FHMM parameters. In general, these approaches are computationally complicated with slow performance.
Another approach for SCSS is to decompose the mixed signal spectral frames as a weighted linear combination of the training data spectral frames. In [53,54,55,56,57,
58,59], the mixed signal is decomposed as a linear combination of a number of exemplars from a large exemplar dictionary of training data for each source signal. In our early work [23], the magnitude spectrogram frames of the training data for each source were used as a model or dictionary; then matching pursuit was used to decompose the mixed signal magnitude spectrogram with the training data magnitude spectrogram frames; the decomposition results were used to build spectral masks; the spectral masks were used to estimate the contribution of each source in the mixed signal. These approaches usually give good results but they require large dictionaries for the source signals.
Instead of using the whole training data as a dictionary, in [14,22, 60, 61, 62] a set of representative vectors is used as a dictionary for each source training data. The mixed signal spectrum is represented as a linear combination of these dictionary entries. In [60], a non-negative sparse representation is employed, and the sources are reconstructed using the Wiener filter. In [14], sparse coding with a temporal continuity objective was used for separating musical instruments. In [63], the training data was modeled in power spectral density domain by a Gaussian mixture model (GMM) with zero means and diagonal covariance matrix for each source. Every model was then adapted to better represent the source signals in the mixed signal. Finally, the adaptive Wiener filter was used with the adapted models to estimate the source signals in [63]. In our early work [22], the training data was modeled by clustering the spectrogram of the training data using K-means algorithms. Coordinate descent was used in [22] to decompose the
spectrogram of the mixed signal with the K cluster centroids for each source training data. Sparsity and continuity priors were enforced during the decomposition and the sources’ STFT were reconstructed in [22] using the Wiener filter.
The most used approach for solving the SCSS problem is nonnegative matrix factoriza-tion (NMF) [64] to train a set of nonnegative basis vectors (dictionary) for the training data of each source. In the separation stage, NMF is used to decompose the mixed signal as a weighted linear combination of the trained basis vectors. The estimate of each source is found by summing its corresponding trained basis terms from the NMF decomposition during the separation stage [13, 26, 27, 65]. The NMF is used in this framework in magnitude spectral or power spectral domain where the nonnegativity constraint is necessary. The number of the trained basis vectors is usually less than the dimension of the spectral frames of the training data. Due to the efficient update rule solutions of NMF [64] and since every source is represented by a few number of basis vectors, this approach is considered to be fast and very simple which makes it the most used approach in SCSS. Another advantage of using NMF in SCSS is that there is no limitation on the energy level for the training and mixed signals. As we will show later, NMF can be extended to consider more properties for the processed signals.
There are many other methods of using NMF in SCSS. In [12], different NMF decompo-sitions were done for both training and testing data. The trained basis vectors were used to learn support vector machines (SVM) classifiers. The trained SVM classifiers were used to classify the basis vectors of the mixed signal and assign them to different source signals. An unsupervised NMF with clustering was used in [1] to separate the mixed signal. In [13], NMF was used to decompose the mixed data by fixed trained basis vec-tors for each source in one method, and in another method the NMF was used without trained basis vectors to decompose the mixed data, but it requires human interaction for clustering the resulting basis vectors into different sources.
As can be seen in Figure 1.2, NMF is a matrix factorization that decomposes any nonnegative matrix V into a multiplication of a nonnegative basis matrix/dictionary B and a nonnegative gains/weights/activations matrix G [64, 66]. The decomposition matrices are found by minimizing a predefined cost function. Like any optimization problem, the main goal is to minimize the cost function without considering the nature of the processed signals. To consider the prior information on the NMF solution, extra prior
Figure 1.2: Nonnegative matrix factorization (NMF)
information can be formed as an additive regularization term to the NMF cost function. To improve the performance of NMF, there have been many works that aim to encourage the NMF decomposition matrices to satisfy certain characteristics of the source signals to be estimated. In [1], continuity and sparsity priors were placed on the decomposition weights. In [67] and [68], harmonicity and smoothness were enforced in Bayesian NMF and applied to music transcription. In [27], a regularized NMF was used to impose statistical structure for each audio frame using a Gaussian model, which was also used in [26] in addition to modeling frame-to-frame temporal structure. In [69] and [3], spatial decorrelation and other priors were incorporated with NMF for different applications. In [70], regularized NMF with Itakura-Saito (IS-NMF) divergence was introduced with Markov chain prior models for smoothness within a Bayesian framework. The conjugate prior distributions on the NMF weights and basis matrices with the Poisson observation model within Bayesian framework was introduced in [71]. In [71], Gamma distribution was used as a prior for the basis matrix and the Gamma Markov chain [72] was used as a prior for the weights/gains matrix. In [73], a mixture of Gamma prior model was used as a prior for the basis matrix. In [65], sparse NMF was used with trained basis vectors to separate the mixture of two speech signals. In [30], inverse-Gamma distribution was used as a speech prior for speech-music separation. In [74, 75], discriminativity constraint was applied to the NMF solution. In [76], group sparsity was enforced on the NMF decomposition solutions.
Since NMF discards the temporal information, an extension of NMF for time series was introduced in [77, 78] which is capable of identifying components with temporal structure. This extension of NMF is called convolutive nonnegative matrix factorization (CNMF). The basis matrix in CNMF contains temporal-spectral bases, which means the basis matrix contains bases that extend in both dimensions of the input. In [79], a sparse
CNMF was presented. In [80], CNMF was used with exemplar-based robust speech recognition under noise. In [81], group sparsity is enforced on the CNMF decomposition. CNMF with basis adaptation was introduced in [38].
1.2
The contributions of this thesis
In this thesis, we improve single channel source separation using NMF by incorporating prior information about the source signals. The prior information is incorporated during the NMF decomposition or as a post processing step. We improve the NMF performance by incorporating more priors to the NMF matrices by using regularized NMF. The prior information is modeled using statistical models that can capture the characteristics of the source signals. Unlike [70,71] the parameters of the used prior models here are trained from training data for the source signals as opposed to choosing the parameters of the prior models during the separation/testing stage. We model the prior information about the NMF gains using rich models like Gaussian mixture models (GMM) and hidden Markov models (HMM). We also propose a novel approach for applying the prior on the NMF solutions which aims to evaluate how much the NMF solution needs to rely on the prior information. In addition, we incorporate prior information based on intuitive facts like discriminativity of the bases and the smoothness of the source spectra or gains matrices or spectral masks. Furthermore, we introduce model adaptation where the source signals are modeled by a nonnegative dictionary. Finally, we consider to train a set of basis vectors that capture the relations between the consequent spectral frames. We have disseminated some of our contributions in the following publications [82,83,84,85,86,87,88,89].
The contributions of this thesis can be itemized as follows:
• We use a Gaussian mixture model (GMM) to regularize the gains in NMF decom-position to improve the separation performance. We develop new update rules for the proposed regularized NMF. In addition, we introduce joint training to learn both the dictionary and the prior GMM together.
• To model the dynamic information in source signals, we introduce an HMM as a regularizer for the gains matrix columns which characterizes the sequential depen-dence of the temporal activations in a principled manner. We derive the update rules for the new regularized NMF.
• As a novel idea to improve regularization and to avoid dependence on the regu-larization parameters, we introduce a new reguregu-larization method based on MMSE estimates under a GMM prior with a distortion model where the distortion co-variance is estimated online. Based on the estimated coco-variance of the distortion, the proposed MMSE estimation based regularized NMF decides how much the solution relies on the GMM prior.
• We incorporate the smoothness prior information into the estimated source signals using post-processing. The NMF solutions during the separation stage are used to build spectral masks which are then smoothed by low pass filters. The smoothed spectral masks are used to estimate the source signals.
• Instead of incorporating prior information about the NMF matrices, we incorpo-rate prior information about the spectrogram frames of the source signals. The GMMs are used to model the priors about the log-spectra of the source signals. The MMSE estimation is used to enhance the separated signal spectrogram un-der the trained GMM priors. To consiun-der the temporal information between the consequent frames of the spectrogram, MMSE estimation is applied to enhance multiple stacked spectral frames together.
• We introduce a novel method to learn discriminative nonnegative dictionaries for the source signals. The dictionary for each source is learned to well represent its own source and penalized from representing the other source signals. We penal-ize each dictionary from representing the other source signals by minimizing the projection of each source dictionary into the other source dictionaries.
• We introduce new model adaptation techniques where the training data are mod-eled using nonnegative dictionaries. The adaptation is used here to overcome the lack of sufficient training data. A general dictionary is learned for speech signals and then the proposed adaptation methods are used to adapt the general model to better fit the speech signals that exist in the mixed signal. The Bayesian and
linear transformation adaptations are introduced and used to adapt the source dictionaries.
• The last contribution is to train basis matrices that can capture the relation be-tween consequent spectral frames. The main idea is to use sliding windows with NMF to decompose multiple stacked spectral frames together. The NMF decom-position results are used to build spectral masks to estimate the source signals.
1.3
Organization of this thesis
In Chapter 2, we introduce the mathematical formulation for single channel source sep-aration (SCSS) and nonnegative matrix factorization (NMF). We also show the conven-tional use of NMF in SCSS in Chapter2. In Chapters3to5, we describe our approaches for incorporating statistical priors to the NMF solution of the gains matrix. In Chapter
3, prior information about the NMF gains matrix is modeled by Gaussian mixture mod-els (GMM) and this information is incorporated by adding a GMM log-likelihood term to the NMF divergence cost function. In Chapter 4, prior information about the NMF gains matrix is modeled by a hidden Markov model (HMM). The NMF solution of the gains matrix is guided by the prior HMM. In Chapter5, we incorporate statistical priors which are modeled using GMM to the NMF solution of the gains matrix after evaluating the actual need to the prior information by using a novel regularization method based on MMSE estimation. Chapters 6and 7 focus on post-processing after NMF-based source separation. In Chapter 6, the smoothness prior is considered in the estimation of the sources using simple post processing. Another more complex post processing approach using MMSE estimation is introduced in Chapter 7 to enhance the separated signals. In Chapters 8-10, we improve the dictionaries used in source separation. In Chapter8, discriminative training for the NMF basis matrices is introduced. In Chapter 9, adap-tation of the basis matrix to a specific speaker is introduced. Finally, in Chapter 10, NMF with sliding windows approach is introduced to model the relation between the sequence of spectral frames.
Background
2.1
Formulation for single-channel source separation
Single-channel audio source separation (SCSS) aims to find estimates of the original audio source signals sz(t), ∀z ∈ {1, .., Z} given only a mixed signal y(t). The mixing
process is taken as a sum of the sources as follows:
y(t) =
Z
X
z=1
sz(t), (2.1)
where t denotes time and Z is the number of sources in the mixed signal.
This problem is usually solved in the short time Fourier transform (STFT) domain [1,22,28]. Let Y (n, f ) be the STFT of y(t), where n represents the frame index and f is the frequency index. Due to the linearity of the STFT, we have:
Y (n, f ) =
Z
X
z=1
Sz(n, f ), (2.2)
where Sz(n, f ) is the unknown STFT of source z in the mixed signal. To compute the
STFT of a given audio signal, the signal is divided into overlapping segments (frames). For each frame, the discrete Fourier transform (DFT) is calculated and a column in the STFT matrix is obtained. The STFT of a given signal is a matrix of complex numbers where each column represents a DFT of a segment or frame of the audio signal and the
rows represent frequency indices. We can rewrite (2.2) as follows: |Y (n, f )| ejφY(n,f )= Z X z=1 |Sz(n, f )| ejφSz(n,f ), (2.3)
where φY(n, f ) and φSz(n, f ) are the phase angles of the mixed and zth source signal
respectively.
To find estimates for the source signals, different approximations for Equation (2.3) are usually used to avoid dealing with complex numbers which have unknown phase angles and magnitudes; the phase is known to be less important in audio applications. In [1,
24,27,28,65,90], it is assumed that the sources have the same phase angle as the mixed signal, that is φSz(n, f ) = φY(n, f ), ∀z = 1, .., Z. Thus, the magnitude spectrogram of
the measured signal is approximated as the sum of source signal magnitude spectrograms’ entries as follows: |Y (n, f )| = Z X z=1 |Sz(n, f )| . (2.4)
In this approximation, the mixed signal and the source magnitude spectrograms can be written in matrix form as follows:
Y =
Z
X
z=1
Sz. (2.5)
In this first approximation, Y is the magnitude spectrogram of the mixed signal and Sz
represents the unknown magnitude spectrogram of the source signal z.
The second approximation for Equation (2.3) is assuming the sources to be independent [67,68,70,83,84]. In those works, the power spectral density (PSD) of the measured signal is approximated as the sum of source signal PSDs as follows:
σ2y(n, f ) = Z X z=1 σSz2 (n, f ), (2.6) where σ2
y(n, f ) = E(|Y (n, f )|2). In this approximation, the PSD frames can be written
in matrix form (spectrogram) as follows:
Y =
Z
X
z=1
In this approximation, Y is the spectrogram (power spectrogram) of the mixed signal and Sz represents the unknown spectrogram of the source signal z. The PSD for the
measured signal y(t) is calculated by taking the squared magnitude of its STFT.
In this thesis, we mainly use NMF for source separation. We give here an introduction about different types of NMF cost functions. The conventional approach of using NMF for source separation is introduced in the following section.
2.2
Non-negative matrix factorization
Non-negative matrix factorization [64] is an algorithm that is used to decompose any matrix V with nonnegative entries into a nonnegative basis/dictionary matrix B and a nonnegative weights/gains matrix G as follows:
V ≈ BG. (2.8)
So every column vector in the matrix V is approximated by a nonnegative weighted linear combination of the basis vectors in the columns of B, where B has fewer columns than V . The weights for basis vectors appear in the corresponding column of the matrix G as follows: vn= D X j=1 gjnbj, (2.9)
where vnis the column n in matrix V , bj is the column j in matrix B, gjnis its weight
in the gains matrix G, and D is the number of bases in B. The matrix B contains nonnegative basis vectors that are optimized to allow the data in V to be approximated as a nonnegative linear combination of its constituent vectors. Figure2.1shows a simple two dimensional example where the number of basis vectors is two. Any nonnegative linear combinations of the two basis vectors appears in the nonnegative cone between the two basis vectors as shown in Figure2.1.
The two matrices B and G in Equation (2.8) can be computed by solving the following minimization problem:
min B,G
C (V , BG) , (2.10)
0 0.5 1 1.5 2 2.5 3 3.5 4 0 0.5 1 1.5 2 2.5 3 3.5 4 b 2 b 1
Figure 2.1: The nonnegative linear combinations for the given two basis vectors.
Different cost functions C lead to different kinds of NMF. In [64], two different cost functions were analyzed. The first cost function is the Euclidean distance between V and BG, given by min B,G kV − BGk22, (2.11) where kV − BGk22=X i,j Vi,j− (BG)i,j 2 .
The NMF solution for Equation (2.11) can be computed by alternating updates of B and G as follows: B ← B ⊗ V G T BGGT, (2.12) G ← G ⊗ B TV BTBG, (2.13)
where the operations ⊗ and all divisions are element-wise multiplication and division respectively. The matrices B and G are initialized by positive random numbers and then updated iteratively using the update rules in (2.12,2.13).
The second cost function for NMF in [64] is the generalized Kullback-Leibler (KL-NMF) divergence cost function [64]
min B,G
where DKL(V || BG) = X i,j Vi,jlog Vi,j
(BG)i,j − Vi,j+ (BG)i,j !
.
The NMF solution for Equation (2.14) can be computed by alternating updates of B and G as follows: B ← B ⊗ V BGG T 1GT , (2.15) G ← G ⊗B T V BG BT1 , (2.16)
where 1 is a matrix of ones with the same size of V .
The third cost function is the Itakura-Saito (IS-NMF) divergence cost function [70]:
min B,G DIS(V || BG) , (2.17) where DIS(V || BG) = X i,j Vi,j (BG)i,j − log Vi,j (BG)i,j − 1 ! .
The IS-NMF solutions for Equation (2.17) can be computed by alternating multiplicative updates of B and G as shown in [70,91]:
B ← B ⊗ V (BG)2G T 1 BGGT , (2.18) G ← G ⊗ BT V (BG)2 BTBG1 , (2.19)
where (.)2 is also an element-wise operation.
The basis and gains matrices are usually initialized by random positive numbers. Within each iteration, the columns of the basis matrix are normalized using the Euclidean norm and the gains matrix is calculated accordingly. Since the NMF cost functions are non-convex with multiple local minima, the solution for the basis and gains matrices is not unique and any local minima is a candidate solution for the cost function. To find a better solution than the others, prior information can be incorporated to the cost function. A better solution means a solution that is more suited to the nature of the
processed data. The prior information can be incorporated as an additive regularization term to the NMF cost function.
The shown three NMF cost functions in this thesis are special cases of the β-divergence introduced in [92] as argued in [70, 93, 94]. The second and third divergence cost functions were found to work better for audio source separation, and they are good measurements for the perceptual differences between different audio signals [27,70,91].
In this thesis we will consider only the second and third divergence cost functions. In source separation applications, the KL-NMF is used with matrices of magnitude spectrograms with the approximation shown in Equations (2.4,2.5) as in [1,24,26,27,
88, 90]. IS-NMF is used with matrices of power spectral densities (spectrograms) with the approximation shown in Equations (2.6,2.7) as in [67,68,70,91].
To understand the idea of using NMF in signal processing, let us consider the spectro-gram shown in Figure 2.2. The figure shows the spectrogram of a signal composed of sinusoids of three frequencies at different time intervals. NMF is applied to decompose the spectrogram as a multiplication of a basis matrix with three basis columns and a weights matrix. The NMF decomposition result for the basis matrix B will appear as shown on the left hand side of Figure 2.2. The decomposition for the gains matrix G will appear as shown on the top of the same figure. We can see from the shown bases in the basis matrix B in Figure2.2that, they have energy only at the three frequencies that are present in the signal spectrogram. The gains matrix G shows the excitation intervals for each basis vector in the basis matrix B.
Another illustration example of using NMF in signal processing, is to decompose the Dual-Tone Multi-Frequency (DTMF) signals using NMF. DTMF signaling is used in communication for dialing the telephone numbers. Figures 2.3 and 2.4 show the fre-quencies, the generated signals for each dialed number in time domain, the spectrogram of the DTMF signals, and the NMF decomposition matrices for the DTMF spectrogram. We can see that, DTMF signals contain seven frequency components divided into two groups. The low frequencies group is (697 Hz, 770 Hz, 852 Hz, 941 Hz) and the high frequencies group is (1209 Hz, 1336 Hz, 1477Hz). For each dialed number there is one generated frequency from each group at the same time. The basis matrix B of the NMF decomposition captures the seven frequencies that are in the DTMF signals. The gain-s/activations matrix G of the NMF decomposition shows that, at each time (pressed
200 100 50 1 2 3 1 2 3 200 100 50 Freque ncy (DF T inde x) ) 100 200 300 400 500 600 700 800 900 200 300 400 500 600 700 800 900 100 Time (DFT frames) Spectrogram Columns of B Rows of G
Figure 2.2: The NMF decomposition matrices for the sinusoidal signals.
button in the telephone switchboard) there are two frequencies (two bases in matrix B) active at the same time. The NMF decomposition results for DTMF signals capture the frequencies and the activations for each dialed number in the telephone switchboard. For example, when the button for symbol 1 is pressed, the frequencies 697Hz and 1209Hz are generated. The bases number five and seven in Figure2.4(b) represent the frequencies 697Hz and 1209Hz respectively. We can see from the activation matrix in Figure2.4(c)
that, when the symbol 1 is pressed, the fifth and seventh rows are active. Figures2.2to
2.4 show very simple signals that contain few sinusoidal components. In Figures 2.2to
2.4the number of frequency components for each signal is assumed to be known, which leads to the correct choice for the suitable number of basis vectors for each signal. For natural audio data, the suitable number of bases can not be predetermined but it is always believed that the data lie on a lower-dimensional manifold. Figure2.5 shows an example of the spectrogram of an audio signal and its NMF decomposition results using 128 basis vectors.
(a) The used frequencies in the telephone switchboards. 0 10 20 −2 0 2 Symbol "1": [697,1209] Amplitude 0 10 20 −2 0 2 Symbol "2": [697,1336] Amplitude 0 10 20 −2 0 2 Symbol "3": [697,1477] Amplitude 0 10 20 −2 0 2 Symbol "4": [770,1209] Amplitude 0 10 20 −2 0 2 Symbol "5": [770,1336] Amplitude 0 10 20 −2 0 2 Symbol "6": [770,1477] Amplitude 0 10 20 −2 0 2 Symbol "7": [852,1209] Amplitude 0 10 20 −2 0 2 Symbol "8": [852,1336] Amplitude 0 10 20 −2 0 2 Symbol "9": [852,1477] Amplitude 0 10 20 −2 0 2 Symbol "*": [941,1209] Amplitude Time (ms) 0 10 20 −2 0 2 Symbol "0": [941,1336] Amplitude Time (ms) 0 10 20 −2 0 2 Symbol "#": [941,1477] Amplitude Time (ms)
Time response of each tone of the telephone pad
(b) Time response for each number of the telephone switchboards.
Figure 2.3: The DTMF signals.
2.3
NMF for single channel source separation
There are many approaches of applying NMF in single channel source separation (SCSS). The method that is mostly used and gives reasonable results [13,27,65] is divided into
Time (DFT frames) Frequency 1 200 400 600 800 1000 1200 1400 1600 1800 2000 2200 2400 0 500 1000 1500 2000 2500 3000 3500
(a) The spectrogram of the DTMF.
Columns of B Frequency 1 2 3 4 5 6 7 0 500 1000 1500 2000 2500 3000 3500
(b) The basis matrix for the DTMF.
Time (DFT frames) Rows of G 1 200 400 600 800 1000 1200 1400 1600 1800 2000 2200 2400 1 2 3 4 5 6 7
(c) The activation matrix for the DTMF.
Figure 2.4: The NMF decomposition matrices for the DTMF signals.
two main stages, the training stage and the separation stage. In the training stage, it is assumed that, there is enough training data for each observed source in the mixed signal. NMF uses these training data to train a set of basis vectors for each source. The spectrogram Strainz of the training data of each source z is computed first using STFT. Then NMF is used to decompose this spectrogram into a basis matrix Bz and a weights
matrix Gtrainz as follows:
Strainz ≈ BzGtrainz . (2.20)
The basis matrix (trained bases) Bz is used as a representative model for the training
data for each source z.
In the testing or separation stage, the trained basis matrices for all sources are concate-nated in one bases matrix. NMF decomposes the mixed signal spectrogram Y into a
Time (DFT frames) Frequency Speech Spectrogram 50 100 150 200 250 300 350 400 0 1000 2000 3000 4000 5000 6000 7000
(a) The spectrogram of a speech signal.
Basis matrix for speech signal
Basis number Frequency 20 40 60 80 100 120 0 1000 2000 3000 4000 5000 6000 7000
(b) The basis matrix.
Gains matix for speech
Time Rows of G 50 100 150 200 250 300 350 400 20 40 60 80 100 120
(c) The gains matrix.
Figure 2.5: The NMF decomposition matrices for a clean speech signal.
weighted linear combination of the trained bases matrices as follows:
Y ≈ [B1, .., Bz, .., BZ] G or Y ≈ [B1, .., Bz, .., BZ] G1 . Gz . GZ . (2.21)
In the testing stage, the bases matrix is kept fixed and only the weights matrix is updated. The estimated signal spectrogram for each source is found by multiplying each source
basis in the bases matrix with its corresponding weights in the weights matrix as follows:
e
S1 = B1G1, ..., Sez= BzGz, ..., SeZ = BZGZ. (2.22)
2.3.1 Reconstruction of source signals and spectral masks
In our earlier work of using NMF for source separation [24], instead of using the initial estimates eS1, ...., eSZ in Equation (2.22) as the final estimates for the source signals, we
used them to build spectral masks. Special cases of this idea appear in [28,73,90,95]. The sum of the estimated spectra ˜Sz, ∀z ∈ {1, .., Z} in (2.22) may not sum up to
the mixed magnitude-spectrogram Y . We usually obtain nonzero decomposition error. Thus, NMF gives us an approximation:
Y ≈ Z X z=1 ˜ Sz.
Assuming noise is negligible in the mixed signal, the component signals’ sum should be directly equal to the mixed magnitude spectrogram. To make the error zero, we used the initial estimated magnitude spectrograms ˜Sz, ∀z ∈ {1, .., Z} to build spectral masks as
follows: Hz = ˜ Sz p PZ j=1S˜j p, ∀z ∈ {1, .., Z} (2.23)
where p > 0 is a parameter, the operation (.)p, and the division are element-wise op-erations. Notice that, the elements of Hz ∈ [0, 1] and using different p values leads to
different kinds of masks. When p = 2 the mask H is a Wiener filter assuming ˜S2z are estimates of the PSD of the zth source signal and sources are independent. The value of p controls the saturation level of the ratio in (2.23). When p > 1, the larger source com-ponent will dominate more in the mixture. At p = ∞, we achieve a binary mask (hard mask) which will choose the larger source component as the only component. These masks will scale every frequency component in the observed mixed spectrogram Y with a ratio that explains how much each source contributes in the mixed signal such that:
ˆ
Sz = Hz⊗ Y , (2.24)
where ˆSz is the final estimate of the source z spectrogram, and ⊗ is an element-wise
make sure that the estimated signals will add up to the mixed signal. After finding the contribution of each source in the mixed signal, the estimate for each source signal ˆsz(t)
can be computed by inverse STFT to the estimated source spectrogram ˆSz with the
phase angle of the mixed signal. In [24], we applied this idea to separate a speech signal from a background music signal using the KL-NMF cost function. We tried different values for the number of trained basis vectors for each source basis matrix and different values for the spectral mask parameter p. We achieved reasonable performance when the number of basis was 128 basis vectors for each source and p = 3. We also evaluated this idea in our previous work [96] to improve the audio-visual speech recognition performance by removing the background music signal from the speech signal. In [96], we achieved better recognition performance compared to the case when the speech recognition was done without removing the background signals.
2.4
Performance evaluation
In this thesis, we use different metrics to evaluate our proposed ideas. The metrics are Source to Distortion Ratio (SDR) and Source to Interference Ratio (SIR) from [97]. We also use the regular Signal to Noise Ratio (SNR) metric. These metrics are defined as follows:
SDR = 10 log10 kstarget(t)k
2
keinterf(t) + eartif(t)k2
, (2.25)
SIR = 10 log10 kstarget(t)k
2
keinterf(t)k2, (2.26)
SN R = 10 log10 ks (t)k
2
ks (t) − ˆs (t)k2. (2.27)
The separated signal is a combination of different components as follows:
ˆ
s (t) = starget(t) + einterf(t) + eartif(t) , (2.28)
where starget(t) is the target signal which is defined as the projection of the predicted
signal onto the original desired signal, einterf(t) is the interference error due to the other
If sw(t) is the desired source signal, ˆsw(t) is its estimated signal, so
starget(t) =
< ˆsw(t) , sw(t) >
kswk2 sw(t) ,
if the sources are mutually orthogonal,
einterf(t) = Z X ´ w=1 < ˆsw(t) , sw´(t) > ksw´k2 sw´ ! −< ˆsw(t) , sw(t) > kswk2 sw(t) ,
where < ., . > is the dot product. If the sources are not orthogonal, one can use Gram Schmidt orthogonalization to find the orthogonal projection onto the subspace spanned by all the source signals [97],
eartif(t) = ˆsw(t) − starget(t) − einterf(t) ,
The higher the SDR, SIR, and SNR, the better performance we achieve.
In the literature there have been some studies where the improvements appear to be small. In [98], the SDR improvements were around 0.1 dB. In [1,73], the improvements in SNR were between 0.2-0.5 dB. The minimum improvement SIR was 1.5 dB in [60].
Regularized NMF using GMM
priors
3.1
Motivations and overview
In this chapter, we propose a new regularized NMF algorithm that incorporates the statistical characteristics of the source signals to steer the optimal solution of the NMF cost function during the separation process. We propose a new multi-objective cost function which includes the conventional divergence term for the NMF together with a prior likelihood term. The first term measures the divergence between the observed data and the multiplication of basis and gains matrices as shown in Equations (2.14,
2.17). The novel second term encourages the log-normalized gain vectors of the NMF solution to increase their likelihood under a Gaussian mixture model (GMM) prior which is used to encourage the gains to follow certain patterns. The normalization of the gains makes the prior models energy independent, which is an advantage as compared to earlier proposals [26,27] where a single Gaussian was used as a prior model. In addition, GMM is a much richer prior than the previously considered alternatives such as conjugate priors [71,99] which may not represent the distribution of the gains in the best possible way. We introduce novel update rules that solve the optimization problem efficiently for the new regularized NMF problem. This optimization is challenging due to using energy normalization and GMM for prior modeling, which makes the problem highly nonlinear and non-convex.
As shown in Section 2.3, the conventional use of NMF in supervised source separation is to decompose the magnitude or power spectra of the training data of each source into a trained basis matrix and a trained gains/weights matrix as in Equation (2.20). In previous works [24,65], the columns of the trained basis matrix are usually used as the only representative model for the training source signals and the trained gains matrices were usually ignored.
As a simple example to understand the model we introduce here, we can look at the toy example in Figure 2.4. In Figure 2.4(c), the columns of the gains matrix only appear in certain patterns in the DTMF signal. We can also see from Figure 2.4 that some combinations for the basis vectors in the basis matrix are not allowed. For example, any combination between the basis vectors number two, three, four, and five is not allowed because these basis vectors represent the lower band frequencies that can not be combined in DTMF data as shown in Figure2.3(a). Also any combination between the basis vectors number one, six, and seven can not be combined because they represent the higher band frequency components that can not be combined as shown in Figure2.3(a). Based on the basis matrix in Figure 2.4(b), there are many different combinations for the basis vectors in the basis matrix but just 12 of them are only valid combinations as we can see in Figures 2.3(a)and 2.4(c).
The columns of the trained gains matrix represent the valid weight combination patterns that the columns in the basis matrix can jointly receive for a specific type of source signal. A prior distribution can represent the statistical distribution of the gains vector in each column of the gains matrix and model the correlation between their entries. Since the trained basis matrix for each source is common in the training and separation stage, the prior model for the gains matrix for each source can guide the NMF solution to prefer valid gain patterns during the separation stage. We use a multivariate Gaussian mixture model (GMM) as a prior model for the gains vector for each frame of each source.
Figure3.1shows an example similar to Figure2.1but where certain linear combinations between the two basis vectors are allowed. The figure shows the cases where the clus-tering structure of the nonnegative linear combinations of the given two basis vectors can be seen. For example, for speech signals there are a variety of phonetic differences, which causes a sort of clustering structure for the data. Since the trained basis vectors are the same during the training and the separation stage, we believe these clustering
structures are inherited in the gains matrix. This clustering structure raises the need for using GMMs. The GMM is a rich model for capturing the statistics and the correlations
Figure 3.1: The cluster structure for the nonnegative linear combinations of the basis vectors.
of the valid gain combinations for a certain type of source signal. GMMs are used exten-sively in speech recognition and speaker verification to model the multi-modal nature in speech feature vectors due to phonetic differences, gender, speaking styles, accents [100] and we conjecture that the gains vector can be considered as a feature extracted from the audio signal in a frame so that it can be modeled well with a GMM. The columns of the trained gains matrix for each source are normalized by the `2 norm, and their logarithm is taken and used in the GMM prior. In the proposed method, the trained basis matrix and its corresponding gains GMM prior are jointly used as a representative model for the training data for each source.
The training can be performed either in two steps sequentially, or all the parameters can be learned using joint training. In sequential training, we first learn the basis and gains matrices using conventional NMF for each source from the corresponding training data and then fit a GMM to the log-normalized gains vectors obtained in the previous step. In joint training, we learn both the NMF matrices and the GMM parameters using coordinate descent (or alternating minimization) on the proposed regularized cost function directly. Jointly training the NMF and the prior models simultaneously is a novel idea introduced in this work. In joint training, the trained basis matrix is
also changed since the gains matrix is enforced to satisfy the NMF equation guided by the GMM prior, so that the trained models are more consistent with the GMM prior assumption. For this reason, we use sequential training for initialization of the model parameters, but eventually use joint training of the model parameters in this work.
In the separation stage after observing the mixed signal, the proposed regularized NMF is used to decompose the magnitude or power spectra of the observed mixed signal as a weighted linear combination of the columns of trained bases matrices for all source signals that appear in the mixed signal. The decomposition weights are encouraged to increase their log-likelihood with their corresponding trained prior GMMs using the regularized cost function.
In this chapter, we apply the proposed regularized NMF using the generalized Kullback-Leibler (KL-NMF) divergence cost function [64] and the Itakura-Saito (IS-NMF) diver-gence cost function [70] which are shown in Equations (2.14) and (2.17) respectively. As shown in Section (2.2), the KL-NMF is used with matrices of magnitude spectro-grams with the approximation shown in Equations (2.4,2.5), while IS-NMF is used with matrices of power spectral densities (spectrograms) with the approximation shown in Equations (2.6,2.7). We will show the proposed regularized NMF using KL-NMF first, then we will state the differences regarding the usage of IS-NMF.
3.2
The proposed regularized nonnegative matrix
factor-ization approach
The goal of regularized NMF is to incorporate prior information on the solutions of the matrices B and G. We enforce a statistical prior on the solution of the gains matrix G only. We need the solution of G in Equation (2.8) to minimize the KL-divergence cost function in Equation (2.14), and the log-normalized columns of the gains matrix G, namely logkggk
2, to maximize their log-likelihood under a trained GMM prior model.
Hence, the solution of G can be found by minimizing the following regularized KL-divergence cost function:
where L(G|θ) is the log-likelihood of the log-normalized columns of the gains matrix G under the trained prior gain GMM with parameters θ, and λ is a regularization parameter. The regularization parameter controls the trade-off between the NMF cost function and the prior log-likelihood. The multivariate Gaussian mixture model (GMM) with parameters θ = {wk, µk, Σk}Kk=1 for a random variable x is defined as:
p(x|θ) = K X k=1 wk (2π)d/2|Σ k|1/2 exp −1 2(x − µk) T Σ−1k (x − µk) , (3.2)
where K is the number of Gaussian mixture components, wkis the mixture weight, d is
the vector dimension, µk is the mean vector and Σk is the diagonal covariance matrix
of the kth Gaussian model. In this section, we assume GMM parameters θ are given. We will mention the training of θ in the next section. The normalization is done using the `2 norm by modeling logkggk
2.
The reason for using the logarithm is because GMM is usually a better fit to the loga-rithm of the values between 0 and 1 due to wider support as observed in tandem speech recognition research [101]. The reason for normalization is to make the prior models insensitive to the change of the energy level of the signals, which makes the same prior models applicable for a wide range of energy levels and avoids the need to train a different prior model for different energy levels.
The log-likelihood for the gains matrix G with N columns can be written as follows:
L(G|θ) = N X n=1 log K X k=1 ρk,n(θ) , (3.3) where ρk,n(θ) = wk (2π)(d/2)|Σ k|1/2 exp ( −1 2 log gn kgnk2 − µk T Σ−1k log gn kgnk2 − µk ) , (3.4) and gn is the column numbered n in the gains matrix G. The multiplicative update rule for the basis matrix B for the cost function in Equation (3.1) is the same as in Equation (2.15). To find the multiplicative update rule for G in Equation (3.1), we follow the same procedures as in [1] and [67]. We express the gradient with respect to G of the
cost function ∇GC as the difference of two positive terms ∇+GC and ∇−GC as:
∇GC = ∇+GC − ∇−GC. (3.5)
The cost function is shown to be nonincreasing under the following update rule [1,67]:
G ← G ⊗∇
− GC
∇+GC, (3.6)
where the operations ⊗ and division are element-wise as in Equation (2.16). We can write the gradients as:
∇GC = ∇GDKL− λ∇GL(G|θ), (3.7)
where ∇GL(G|θ) is a matrix with the same size of G. The gradient for the KL-cost
function and the prior log-likelihood can also be formed as differences between positive terms as follows:
∇GDKL = ∇+GDKL− ∇−GDKL, (3.8)
∇GL(G|θ) = ∇+GL(G|θ) − ∇−GL(G|θ). (3.9)
We can rewrite Equations (3.5,3.7) as:
∇GC = ∇+GDKL+ λ∇−GL(G|θ) − ∇−GDKL+ λ∇+GL(G|θ) . (3.10)
The final update rule in Equation (3.6) can be written as follows:
G ← G ⊗∇ − GDKL+ λ∇ + GL(G|θ) ∇+GDKL+ λ∇−GL(G|θ) , (3.11) where ∇GDKL = BT 1 − V (BG) , (3.12) ∇−GDKL= BT V (BG), (3.13) and ∇+GDKL = BT1. (3.14)
The row j and column n component of the gradient of the prior log-likelihood in Equation (3.3) can be found as follows:
(∇GL(G|θ))jn= ∇+GL(G|θ) jn− ∇ − GL(G|θ) jn, (3.15) where ∇−GL(G|θ)jn= PK k=1 −ρk,n Σkjj −1 µ kj gjn + gjn kgnk 2 2 log gjn kgnk2 PK k=1ρk,n , (3.16) ∇+GL(G|θ)jn= PK k=1 −ρk,n Σkjj −1 µ kjgjn kgnk 2 2 +g1 jn log gjn kgnk2 PK k=1ρk,n . (3.17)
Since the GMMs are trained by log-normalized columns, we know that the values of the mean vectors µ are always negative. The values of the vectors g are always positive, so the values from Equations (3.16) and (3.17) will be always positive. We can use Equations (3.13,3.14,3.16,3.17) to find the total gradients in Equation (3.10) and then to derive the update rules for G in Equation (3.11). The initialization of the matrix G is done by running one regular NMF iteration without any prior.
3.3
Training the source models
In the training stage, we aim to train a set of basis vectors for each source and a prior statistical GMM for the gain patterns that each set of basis vectors can receive for each source signal.
3.3.1 Sequential training
Given a set of training data for each source signal, the magnitude spectrogram Strainz for each source z is calculated. The NMF is used to decompose Strainz into basis matrix Bz and gains matrix Gtrainz . The gains matrix Gtrainz is then used to train the prior
basis and gains matrices as follows:
Strainz ≈ BzGtrainz , (3.18)
Bz, Gtrainz = arg min
B,G
DKL Strainz || BG .
After finding the basis and the gains matrices, the corresponding GMM parameters θz
are then learned as follows:
θz= arg max
θ L (Gz|θ) . (3.19)
We use multiplicative update rules in Equations (2.15) and (2.16) to find solutions for Bz and Gz in Equation (3.18). All the matrices B and Gtrain are initialized by positive
random noise. In each iteration, we normalize the columns of Bz using the `2 norm and
find Gtrainz accordingly. After finding matrices B and Gtrain for all sources, all the basis matrices B are used in mixed signal decomposition as it is shown in Section 3.4. We use the gains matrices Gtrain to build statistical prior models. For each matrix Gtrainz , we normalize its columns and the logarithm is then calculated. These log-normalized columns are used to train a gain prior GMM for each source in Equation (3.19) using the well-known expectation maximization (EM) algorithm [102].
3.3.2 Joint training
In Section3.3.1, the trained NMF basis and gains matrices for each source are computed using Equations (2.15, 2.16), and then the prior gain GMMs are trained using the logarithm of the normalized columns of the trained gains matrix. To match between the way the trained models are used during training with the way they are used during separation, we jointly train the basis vectors and the prior models simultaneously to minimize the regularized cost function:
Bz, Gtrainz , θz = arg min
B,G,θ
DKL Strainz || BG − λtrainL (G|θ) . (3.20)
We use the trained NMF and GMM models from Section 3.3.1 as initializations for the source models, and then we update the model parameters by running alternating update (coordinate descent) iterations on Bz, Gtrainz and θz parameters. At each NMF