Randomized Matrix Decompositions and ExemplarSelection in Large Dictionaries for PolyphonicPiano Transcription

(1)

Full Terms & Conditions of access and use can be found at

http://www.tandfonline.com/action/journalInformation?journalCode=nnmr20

Download by: [Bogazici University] Date: 12 February 2016, At: 07:20

ISSN: 0929-8215 (Print) 1744-5027 (Online) Journal homepage: http://www.tandfonline.com/loi/nnmr20

Randomized Matrix Decompositions and Exemplar Selection in Large Dictionaries for Polyphonic

Piano Transcription

İsmail Arı, Umut Şimşekli, Ali Taylan Cemgil & Lale Akarun

To cite this article: İsmail Arı, Umut Şimşekli, Ali Taylan Cemgil & Lale Akarun (2014) Randomized Matrix Decompositions and Exemplar Selection in Large Dictionaries for Polyphonic Piano Transcription, Journal of New Music Research, 43:3, 255-265, DOI:

10.1080/09298215.2014.891628

To link to this article: http://dx.doi.org/10.1080/09298215.2014.891628

Published online: 10 Sep 2014.

Submit your article to this journal

Article views: 119

View related articles

View Crossmark data

(2)

(Received 25 June 2013; accepted 29 January 2014)

Abstract

Non-negative matrix factorization has been shown to be powerful for modelling audio signals. Many useful applications based on NMF, including musical source separation and polyphonic transcription, have been presented in the field of music information retrieval. The multiplicative update rules for making inference on the NMF model are quite simple and practical; however, they do not scale up well with the increasing size of the dictionary matrices. In this study, we develop efficient approaches based on randomized matrix decompositions and exemplar selection that can easily handle very large dictionary matrices that can be encountered in real applications. We apply our methods on the transcription of polyphonic piano music. The results show that by only re- taining 1% of a large dictionary matrix, we still get high performance in terms of objective measures.

Keywords: machine learning, music analysis, software, representation

1. Introduction

Non-negative data often appear in various domains, such as stock prices in finance, pixel intensities in images, or magnitude spectra of audio signals. Modelling non-negative data by using factorization models has been shown to be successful in various domains, including signal processing, finance, bioinformatics, and natural language processing (Cichocki, Zdunek, Phan, & Amari,2009). Among many others, non- negative matrix factorization (NMF) (Lee & Seung, 2001) has been one of the most popular factorization models where it decomposes an observed data matrix X as follows:

X≈ DW, (1)

Correspondence: ˙Ismail Arı, Bogazici University, Computer Engineering, Istanbul, Turkey. E-mail: [email protected]

where X, D, and W are non-negative matrices of size F× T , F×N, and N ×T , respectively. Here D is called the dictionary matrix and W encapsulates the corresponding weights. Here, the aim is to estimate the matrices D and W and one of the popular approaches for doing this task is minimizing a cost function between X and DW:

(D, W) = argmin

D,W D(X||DW), (2)

whereD(·) is a suitable cost function which is usually selected as Euclidean, Kullback–Leibler (KL), or Itakura–Saito (IS) divergences.

In practice, a typical scenario of using NMF is training a low-rank dictionary matrix D on a large training corpus, and then fixing it during the testing procedure (Dessein, Cont, &

Lemaitre,2010). Even though this approach has yielded useful applications, it can be problematic since not all the linear combinations of the columns of D would be semantically meaningful and realistic; therefore further regularization on the weights matrix W is often needed (Févotte, Le Roux, &

Hershey,2013; ¸Sim¸sekli, Le Roux, & Hershey,2013).

Rather than training and fixing the dictionary matrix D, another possible option is using the training data as the dictionary itself (Smaragdis, 2011). In this manner, the linear combinations of the columns of D would be semantically more meaningful, as the dictionary consists of real samples in this case. However, using large and comprehensive training data as the dictionary would potentially lead to an impractical algorithm. A possible solution would be expressing the dictionary matrix using fewer dimensions in order to have lower computational and space requirements with a tolerable loss from the accuracy.

In this study, we investigate fast and efficient methods for expressing very large dictionary matrices using much fewer dimensions. The proposed methods are based on randomized matrix decompositions and exemplar selection that can easily

Downloaded by [Bogazici University] at 07:20 12 February 2016

(3)

handle very large training corpora encountered in real applications. We apply our methods on polyphonic piano transcription where the magnitude spectra is modelled by an NMF model.

Our experiments show that the proposed methods can reduce the size of the dictionary significantly without compromising the success rate of the full solution in which the whole training corpus is used.

1.1 NMF for polyphonic transcription

In this study, we focus on transcription of polyphonic piano music where the problem is defined as the process of deter- mining the notes and the time intervals in which they are active given polyphonic piano recordings¹. Conventionally, it is done by hand and requires both an educated ear and considerable amount of effort. Automation of this process is one of the fundamental problems studied in the field of audio processing (Klapuri & Davy,2006). The computational methods developed to solve this problem find application in various areas such as phonetics, speech processing and music information retrieval (Klapuri & Davy,2006).

It has been demonstrated that, when the NMF model is applied on music signals, the estimated matrices D and W tend to be semantically meaningful as they can be seen as spectral templates and a musical score (Smaragdis & Brown, 2003). In a polyphonic transcription setting, the columns of D would correspond to different notes and the rows of W would determine the loudness of individual notes across time.

One can obtain a transcription estimate simply by thresh- olding W. There have also been several extensions to the NMF model where the transcription performance is further improved by using additional constraints such as temporal continuity and structural hierarchy (¸Sim¸sekli et al., 2013), harmonicity (Bertin, Badeau, & Vincent, 2010), and multi- instrument hierarchy (Grindlay & Ellis,2011).

Experience suggests that human listeners become more successful with training in recognizing musical constructs.

Partially inspired by this idea,Smaragdis(2011) demonstrated that it is possible to perform polyphonic pitch tracking successfully by using an NMF model that tries to approximate observed polyphonic spectra as a superposition of previously recorded monophonic sounds. Instead of training the dictionary matrix on a training corpus, the training corpus itself is used as the dictionary in this approach. However, this approach tends to be impractical as the size of the training corpus increases.

In this study, we start from the model ofSmaragdis(2011) and we deal with the case where the size of the dictionary ma- trix D is very large. Note that, this model is the simplest NMF model which does not have any additional constraints. How- ever, it can be enriched by incorporating the aforementioned

1Note that automatic transcription can also involve tasks such as instrument identification, beat and tempo tracking, and key detection (Klapuri & Davy,2006). In this study, we only focus on the onset and the offset times of the notes.

musical constraints, which is beyond the scope of this study.

Here, the main purpose is to develop computationally efficient methods to handle large amounts of training data within the NMF framework.

Before discussing the methods, let us start by clarifying the problem definition and introduce the notation we will use. Let D_i, with elements Di( f, τi), denote the magnitude spectrogram²of monophonic music recordings belonging to I differ- ent notes. Here, i = 1, . . . , I is the note index, f = 1, . . . , F is the frequency index, andτi = 1, . . . , Ni is the time index where F is the number of frequency bins and Ni is the num- ber of columns in Di. We remove the samples below some loudness and we get rid of the effect of sound intensity by normalizing each vector such that its elements sum up to 1 since we are only interested in the pitch of the data (Smaragdis, 2011). We obtain the training data by concatenating all training vectors, D= [ D1D₂ . . . DI ], where there are a total number of N =_I

i=1N_itraining samples. Test data are composed of polyphonic recordings. We have considered piano samples in this paper, but the proposed methods can be extensible for multiple instruments. Let X, with values X( f, t), be the spectrogram of the test data where t = 1, . . . , T and T is the number of time frames. A test piece from Bach (BWV 850) is illustrated in Figure1. The waveform is plotted on the top, the corresponding ground truth which shows the active keys through time is given in the middle, and the corresponding spectrogram is given at the bottom.

1.2 Inference

In polyphonic transcription, the purpose is to reveal the active notes which are intuitively expected to have higher weights in the weight matrix W. Thus, we aim to find W which minimizes the cost functionD(·) defined in Equation2, where we choose the KL divergence as the cost function. Note that the model is identical to the NMF model whose update rule is well known (Lee & Seung,1999;Smaragdis & Brown,2003). We start with random initialization of W, and continue with the following step until convergence:

W← W

D^X_(DW) D1

. (3)

The symbol stands for the Hadamard product implying element-wise multiplication and D is the transpose of D.

The division operation is also done element-wise.1 is a ma- trix of all ones of the same size with X. Active notes are observed to have significantly higher weights than the inac- tive ones and they are selected by thresholding. In order to obtain a single weight for a particular note, the corresponding weights are summed up since there might be multiple weights corresponding to the same note. Additionally, each weight

2In this study we use the short-time Fourier transform to compute the spectra. It is also possible to utilize log-frequency representations such as the constant-Q transform (Fuentes, Badeau, & Richard, 2011).

(4)

Fig. 1. Waveform (top), ground truth (middle) and spectrogram (bottom) of a test piece.

value is smoothed by applying a median filter to remove sharp jumps.

The computational bottleneck of the algorithm is seen to be in the matrix-matrix multiplications involving D and D in Equation3. Hence, the running time is dominated by the matrix multiplication operation which is in order of O(F N T ).

Space complexity is O(F N) since we store all elements in D.

Note that Equation3can work in parallel on the columns of W since each column of W can be computed independently. So, the time complexity can be reduced to O(F N) in a parallel environment.

In this study, we focus on reducing the size of the dictionary matrix D. One of the basic approaches to reduce the dimen- sionality of D relies on Singular Value Decomposition (SVD) where D= AB (Golub & Van Loan,1996). A and B are the orthonormal matrices holding the left and right singular vectors, respectively, and the diagonal matrix includes the singular values in descending order. The k-rank reduced SVD approximates D by using only the first k singular vectors and singular values and it gives the best low-rank approximation of the matrix in terms of Frobenius (or l2) norm of the reconstruction error. Unfortunately, SVD cannot be computed by conventional techniques when the data matrix is very big. But we overcome this by utilizing the recently developed randomized techniques to solve a reduced SVD that approximates the data matrix well in practice within a tolerable error bound (Halko, Martinsson and Tropp,2011b;Mahoney,2011).

SVD is optimal in terms of reconstruction error, but the singular vectors may not represent physical reality (Mahoney

& Drineas,2009). In an alternative approach, we express the dictionary matrix as a collection of its individual columns and rows, which already represent physical identities. This decomposition, known as CUR decomposition in the liter- ature, attracted considerable attention in the last few years (Halko et al.,2011b; Lee & Choi, 2008; Mahoney,2011).

We express D ≈ CUR where C and R represent a set of columns and a set of rows of the matrix, respectively, and U is computed as a link matrix between them. We follow

Mahoney and Drineas(2009) CUR factorization to decide on the set of columns and rows to be used.

Finally, we shift our attention from minimizing the reconstruction error to a better interpretation of CUR. We enrich CUR with two widely used exemplar selection based clus- tering algorithms: k-medoids (Park & Jun,2009) and affinity propagation (Frey & Dueck,2007).

The main contribution of this study is the usage of randomized matrix decompositions and exemplar selection methods to employ very large dictionaries in polyphonic music transcription. Note that this paper is a significant extension ofArı,

¸Sim¸sekli, Cemgil and Akarun(2012) where the SVD and CUR based methods are first proposed. In this paper, we present exemplar-based methods in the same algorithmic framework and compare all methods on the same evaluation task.

The structure of the paper is as follows: we start with the initial model where all the training data is used. Then we improve this model via SVD and CUR factorizations. Af- terwards, we explain the column selection based and skeleton based approaches, and the exemplar based methods via k-medoids and affinity propagation. We explain the database used in our experiments with the related experimental setup and corresponding results. We finish with conclusions and discussion.

2. Methods

2.1 Improving efficiency via SVD

The proposed approach is expected to perform better with increasing and comprehensive training data. But this leads to a dictionary matrix with millions of columns and rows which is impractical to work on. It may even not fit in RAM in real applications. Our aim is to make the method efficient in terms of both time and space complexity. The key point is to decompose D and use its decomposition instead of working on the whole of it. A rank k approximation of a matrix can be obtained best by the reduced SVD (Golub & Van Loan,1996):

(5)

argmin

˜D, rank( ˜D)≤kD − ˜DF = AkkB_k (4) where k, Ak,k and Bk stand for the rank, the orthonormal matrix involving the left singular vectors, the diagonal matrix involving the k largest singular values in descending order, and the orthonormal matrix involving the right singular vectors, respectively.

For this problem, it is convenient to store kB_k as ˜B_k to avoid redundant computations. Using this decomposition, Equation3is reorganized as follows:

W← W

⎛

⎜⎜

⎝

˜Bk

A_k ^X A_k( ˜BkW)

˜B_k(A_k1)

⎞

⎟⎟

⎠. (5)

It is well known that SVD itself is an expensive operation.

The full solution of SVD is of order O(min{F N², F²N}) (Golub & Van Loan, 1996). The reduced solution is also expensive when conventional techniques are used. To tackle this problem, we use the randomized SVD technique ofHalko, Martinsson and Tropp (2011b) which serves as a good example of randomized matrix factorization algorithms (Halko et al.,2011b;Mahoney,2011). We first draw an N×k Gaussian random matrix and form the F ×k sample matrix Y = D.

The trick lies on finding an orthogonalization of the range of D such that D ≈ QQD. Thus, we compute an orthonor- mal matrix Q satisfying Y= QR. SVD is computed on the smaller matrix QD⇒ ˆAB. Finally, A is computed as Q ˆA.

Reduced SVD of a matrix of size F × N at rank-k takes O((F + N) k) time. Note that reduced SVD is only done once at the training phase to obtain Ak and ˜B_k and they are fixed at the test phase. The time complexity is O((F + N) kT ) when the operations are done as given in Equation5. Space complexity is O((F + N) k) which is the total number of elements in matrices Akand ˜B_k.

2.2 Improving efficiency via CUR

An alternative approach to SVD is using CUR decomposition to approximate the full matrix (Mahoney & Drineas,2009). In a CUR decomposition, we approximate the dictionary matrix as D ≈ CUR, where C represents actual columns, R rep- resents actual rows of the data matrix D, and U is a linking matrix, easily estimated via least squares.

Let kcoland krowbe the number of the selected columns and rows, respectively. Then, the dimensions for C, U and R are F× kcol, kcol× krowand krow× N, respectively.

To compute the selection probabilities of rows and columns, we use the CUR method proposed byMahoney and Drineas (2009). We first take the SVD and compute the probability of selecting the ith row as follows:

ρi = 1 k

k j=1

a_{i j}² , i = 1, . . . , F, (6)

where ai jis the(i, j)th entry of Ak, the matrix containing the left singular vectors. Similarly, the probability of selecting the jth column is computed as follows:

πj =1 k

k i=1

b²_ji, j= 1, . . . , N, (7) where bji is the( j, i)th entry of Bk, the matrix containing the right singular vectors. Here, k is taken to be the number of dimensions necessary to cover a large amount of the total variation in the data.

Afterwards, kcol columns and krow rows are randomly selected using a multinomial distribution with the computed probabilities. The relative error is ensured to be very small with a high probability as proved inMahoney and Drineas (2009) when sufficient number of dimensions are selected.

Since krow is relatively small compared to kcol in our case, let us define ˜C = CU and reorganize the update rule in Equation3as follows:

W← W

⎛

⎜⎜

⎝ R

˜C ^X

˜C( ˜RW)

R( ˜C1)

⎞

⎟⎟

⎠ (8)

The time complexity of the algorithm will be of order O((F + N) min{krow, kcol}T ). The minimum implies that one can merge U and R if krowis larger than kcol. The memory requirement will be of order O((F +N) min{krow, kcol}) since we store both matrices.

2.3 Improving efficiency via a skeleton in CUR

The dictionary matrix is a huge matrix possibly containing redundant information both in its columns and rows (Lee &

Choi,2008). So, instead of using low rank approximations of the full matrix, we extract only the important information.

Firstly, we observe that only a small subset of the spectral templates are satisfactory and thus, we use only a selected set of columns of the spectrogram. That is, we only use C:

W← W

C^X_(CW) C1

. (9)

We go further and observe that there is no necessity to hold all of the rows which correspond to the frequency bands since the notes can be detected without hearing the higher harmonics. With this information, we skeletonize the dictionary matrix and keep only the points at the intersection of the selected rows and columns. Let ˇD be the skeleton of D, and let ˇX involve the selected rows of the observed data. Then, the formula becomes:

W← W

⎛

⎜⎜

⎝ ˇD ^ˇX

ˇDW

ˇD1

⎞

⎟⎟

⎠ , (10)

where the sizes of W and1 are selected appropriately.

(6)

Fig. 2. All approaches illustrated: SVD factorizes D into matrices A_k,k, and B_k whose structure is not directly related to D. CUR factorizes D into matrices C, U, and R, where C involves a set of columns of D and R involves a set of columns of D. U is not directly related to D. The C-based approach uses only C and the skeleton-based approach uses only the skeleton acquired by the intersection of the selected columns and rows of D.

We pictorially summarize all the discussed approaches in Figure2. The left image shows how the data matrix D is taken into consideration in the corresponding approach. In the SVD based approach, D does not have a special consideration, so its cells are illustrated as random points as well as factorized matrices. But the latter approaches consider D factored as a set of its columns and a set of its rows. The cells given with two colours are in common of rows and columns. The computational gain of using a skeleton of the data matrix is obvious in the illustration. This picture is created with a very small data matrix D for illustrative purposes, but it would be huge in real applications and the gain would be much more.

The time complexities will be of order O(FkcolT) and O(kcolk_rowT) for C-based and skeleton approaches, respectively. Since we discard most of the data and keep only the important part, the computational gain is very high in practice.

The number of rows and columns that represent the variety in the data well enough may be hundreds of times less than the original dimensions of the full data. For a realistic case, if we have F = 1025, N ≈ 10⁵, and we choose krow = 400 and k_col = 4000, the algorithm becomes nearly 75 times faster and we use only less than 2% of the data. This method can

Table 1. Time and space complexities in Big-O notation.

Time Space

Full linear

model F N T F N

SVD-based (F + N) k T (F + N) k

CUR-based (F + N) min{krow, kcol}T (F + N) min{krow, kcol}

C-based F k_colT F k_col

Skeleton krowk_colT krowk_col

work much faster and handle large amounts of data which can not be handled by conventional methods. Time and space complexities for the discussed methods are summarized in Table1.

2.4 Improving efficiency via exemplar-based clustering Instead of focusing on low-rank approximations of the original data matrix, let us now shift our attention to select exemplars which are good candidates to represent the remaining data points.

2.4.1 Exemplar selection via k-medoids

The classical approach to this problem is the k-medoids method where a medoid of a cluster is defined to be the one whose average dissimilarity to all the objects in the cluster is minimal (Park & Jun,2009). We aim to find the medoid that each sample belongs to. That is, we aim to find Z minimiz- ing X − CZ, where Z ∈ {0, 1}^k^×N,

iZ_{i n} = 1 for all n ∈ {1, . . . , N}. We pre-compute the dissimilarities: Yi j =

D(:, i) − D(:, j)1. In a typical k-medoids algorithm, we start with random initialization of medoids. Let J be the set involving the indices of the medoids. At each iteration, we first assign clusters to the data points:

c_n⇐ argmin

i|i∈{1,...k}D_{n J}_i, n ∈ {1, . . . , N} (11) and recompute a medoid per cluster until convergence or a permitted maximum number of iterations:

Ji ⇐ argmin

n|cn=i

N j=1,cj=i

Dn j, i ∈ {1, . . . , k}. (12)

(7)

k-medoids is run several times each with different initial centres and the best solution is taken in order to avoid local optima.

2.4.2 Exemplar selection via affinity propagation

An alternative exemplar-based clustering technique is the affinity propagation (AP) of Frey and Dueck (2007). Suppose we are interested in the exemplars of the ith note. Let S(x, y) be the similarity between column x and column y, pj ≡ S( j, j) be the preference of the jth column to be selected as an exemplar point, and J be the set of exemplar indices.

The aim of affinity propagation is to maximize

Ni

n=1

S(n, cn) +

j| j∈J

pj, (13)

where cn is the index of the exemplar which the nth point belongs to. AP finds an approximation to the objective by exchanging real-valued messages between data points until a high-quality set of exemplars and corresponding clusters gradually emerges (Frey & Dueck,2007). There are two types of messages: responsibilities and availabilities. The responsi- bility sent from node x to node y reflects the accumulated evidence how appropriate it would be for node x to have node y as exemplar. The availability sent from node x to node y indicates the accumulated evidence that node y should choose node x as exemplar. We initialize the responsibilities R(x, y) sent from data point x to candidate exemplar point y to 0 for all x, y ∈ {1 . . . , N}. We also initialize the availabilities A(x, y) sent from candidate exemplar point y to data point x to 0 for all x, y ∈ {1 . . . , N}. Then until convergence, we compute the responsibilities:

R(x, y) ⇐ S(x, y) − max

y=y{A(x, y) + S(x, y)} (14) and the availabilities:

A(x, y) ⇐

⎧⎪

⎪⎪

⎪⎨

⎪⎪

⎩ min

0, R(y, y) +

x/∈{x,y} max

0, R(x, y)

if x= y,

x=y max{0, R(x, y)}

if x= y.

(15) The clusters are computed as follows after convergence:

cn⇐ argmax^N

x=1{A(n, x) + R(n, x)}. (16) The number of clusters is not given a priori in AP. It is determined by the preference values. If the preference for selecting a column is more than its similarity to the nearest possible exemplar, then it would be selected as a new exemplar. So, higher preference values lead to more clusters and it is possible to tune the number of clusters via modifying the preference values.

Applying exemplar-based algorithms on the merged data is not practical since it requires pairwise distances of all the columns, which is of order O(N²). Instead, we apply it for

Fig. 3. Histogram of polyphony order: the polyphony in the test data is expressed mainly by the first six orders.

Fig. 4. Contribution of the number of dimensions to total variation:

51 dimensions are enough for expressing 98% of total variance.

each note data Diseparately for i = 1, . . . , I which is of order O(_I

i=1N_i²) in total and find the exemplars representing each note. This complexity is for the training phase. For the test phase, the complexities will be the same as C-based and skeleton-based approaches discussed before since we follow the same steps after we spot the exemplar rows and columns.

3. Experiments and results

In order to evaluate our methods, we have conducted several experiments. We first compute the transcription accuracy of each proposed method on a test data and compare the results.

We also consider the elapsed time per method. We evaluate the transcription accuracy on a validation data for the CUR- based and skeleton-based approaches and figure out how many columns and rows are enough for a high transcription accuracy. Then we give the results per polyphony order. Finally, we analyse the effect of filtering on the results.

3.1 Data description and evaluation criteria

In our experiments, we have used the MAPS (MIDI Aligned Piano Sounds) data set³ (Emiya, Badeau, & David, 2010).

3For training, we have used the isolated note sounds corresponding to the following piano models: AkPnBcht, AkPnBsdf, and ENSTDkCl. Since we want to compare the proposed methods with the full solution, which takes a considerable amount of time, we have only used random consecutive sections of the following content for verification and testing. Verification files: MAPS_MUS-mz_331_3_ENSTDkCl’, MAPS_MUS- bach_847_AkPnBcht, MAPS_MUS-bach_846_SptkBGAm, MAPS_MUS-bach_847_AkPnStgb. Test files: MAPS_MUS- bach_846_AkPnBcht, MAPS_MUS-bach_850_AkPnBsdf,

MAPS_MUS-mz_331_2_ENSTDkCl, MAPS_MUS-

bach_847_SptkBGCl, and MAPS_MUS-alb_se2_ENSTDkCl.

(8)

Fig. 5. F-measures versus computation time obtained on the test set per approach: it is seen that the result is better preserved by using more dimensions in the SVD-based approach where using 200 dimensions is enough to get the original results. CUR seems to be an alternative to SVD. The C-based approach and skeletonizing the matrix (i.e. reducing both the #rows and the #columns) give promising results. Notice that there is a considerable computational gain acquired by C-based and skeleton-based approaches. The full solution takes around 248× real-time, whereas the skeleton-based approach takes only 5× real-time. The number of iterations are set to 200 and the configuration of the used system is 64-bit Windows 7, i7 3.07 GHz CPU, 8 GB RAM, and MATLAB 2011b.

Fig. 6. F-measures versus the number of columns (samples) and rows (frequency bins) in the CUR-based approach (left) and the skeleton approach (right): it is seen that only a few hundred frequency bins and a few thousand samples are necessary to keep the success ratio of the algorithm.

The training set is obtained using 440 monophonic piano sounds with 44.1 kHz sampling rate where we represent the audio by its magnitude spectrogram which is computed via DFT. The spectrogram is formed using overlapping Hanning windows of dimension 2048 with a window shift 512. While constructing Di( f, τi), we simply concatenated the spectra corresponding to the ith note where i= 1, . . . , 88. About one third of the training data is removed due to low loudness and the obtained final dictionary matrix is 1025× 115,600 and is around 860 MB (7.5 min).

A test set is formed by choosing random sections from five different polyphonic pieces. The histogram of the polyphony orders (the number of notes playing simultaneously at a time instance) in the test data are given in Figure3, where it can be observed that the test data are predominantly polyphonic.

In order to evaluate the performance of the methods we use precision (fraction of retrieved instances that are relevant), recall (fraction of relevant instances that are retrieved) and

f-measure= 2× precision × recall precision+ recall

metrics. The evaluation is performed framewise. We try dif- ferent threshold values on W matrices and report the best f-measure.

3.2 Testing randomized factorization-based methods We conducted the first set of experiments to analyse the performance of the methods based on randomized matrix factorizations. Our goal is to improve the efficiency of the

(9)

Fig. 7. Precision versus recall obtained on the test set per approach: the results are consistent with the f-measure values given in Figure5. CUR gives a very close result to the full solution.

Fig. 8. ROC curve for the skeleton approach. The curve shows that the method is robust to the threshold level as the area under the curve is close to 1.

approach without compromising the transcription accuracy.

As the base criteria, we start with the full solution given in Equation3and obtain an f-measure of 78.07%. It should be noted that the method is a basic machine learning method and does not employ any advanced field-related signal processing techniques.

As all our methods are based on SVD, as the next step, we decompose D via randomized SVD. We observe that 98%

of the total variance can be covered by using only the first 51 singular vectors (Figure4). That is, there seems to be a high correlation in the data and thus, SVD is suitable for the problem.

As the first factorized method, the SVD-based approach given in Equation5is applied with k = 50, 100, and 200 to see the effect of the used dimensions on the result. It is shown in Figure5that we get better results with more dimensions.

We get an f-measure value of 78.14% when we have 200 dimensions where the complexities are nearly 5 times better than the full solution.

Next, we compute the CUR decomposition of the dictio- nary. k = 51 dimensions are used in the computations of row selection probabilitiesρi and column selection probabilities πj. In this way, we aim to avoid selecting columns and rows leading to residual noise. We analyse the effect of the selected number of columns and rows on the results and give it in Figure6for CUR-based and skeleton-based approaches. As can be seen, the f-measure becomes stabler after a few hundreds of rows and a few thousands of columns. So, there is no need to keep all of the data since it holds lots of repetitive information. Note that the results are obtained on a separate validation test excluding the test samples. The fluctuations in the plots stem from randomization which is used for selection.

Using this information on the validation set, we select 400 rows (frequency bands) and 4000 columns (samples), and we apply the CUR-based approach given in Equation8. As shown in Figure5, the results seem to be similar to the SVD-based approach.

Finally, we select only the important parts of D with the computed indices of the CUR decomposition above. The results for the C-based and skeleton-based approaches are also given in Figure5. We only use less than 2% of the data after skeletonization, the speed is nearly 75 times faster, and we get an f-measure of 77.17% which is very close to the result of the full solution. In addition to the f-measure values, we give the precision recall values of the proposed methods in Figure7.

In order to see the effect of the threshold level, we investigate the receiver operating characteristic (ROC) curves for the proposed methods. Figure8visualizes the ROC curve for the skeleton-based approach. For the sake of clarity, we only present the ROC curve for the skeleton-based approach since

(10)

Fig. 9. The magnitude spectrum of the training data corresponding to note A4 with 10 exemplars selected by AP labelled as black vertical stripes.

Three exemplars are plotted on the right. Note that the figures show the log of the spectrum for illustrative purposes, we use the magnitude spectrogram D in the methods.

Fig. 10. Results obtained on the test set for exemplar based approaches: it is seen that using exemplars that represent the remaining data give close results to using all of the samples whereas they gain a tremendous advantage in time and space complexities and they enable us to work in nearly real-time.

Fig. 11. Precision, recall and f-measures for each polyphony order: the figure shows the results for the skeleton approach based on affinity propagation. F-measure seems to be more than 60% for each polyphony order.

the ROC curves for the other methods are almost identical. It can be observed that the methods are robust to the threshold level as the area under the ROC curves are close to 1.

3.3 Testing exemplar-based methods

In the second set of experiments, we modify the C-based and skeleton-based techniques by making use of k-medoids

and affinity propagation. We investigate here the difference between column selection of Mahoney and Drineas(2009), k-medoids (Park & Jun,2009), and AP (Frey & Dueck,2007).

Note that the results in the previous subsection are gathered by selecting 4000 columns from all the training data. Here, we select the columns on separate training data. That is, we decompose each note’s training data via SVD and we select 45 columns per note using Equation 7 to get around

(11)

Fig. 12. Post-processing the results with median filtering significantly increases the f-measure in all approaches.

4000 columns in total for the C-based approach and select an additional 400 rows using Equation 6 for the skeleton- based approach. We then choose 45 exemplars per note via k-medoids and merge these selected exemplars to form the training data involving around 4000 exemplars in total. We use l1-distance of the instances. Then, we additionally select 400 rows via k-medoids and use the skeleton of the selected columns and rows. Alternatively, we use affinity propagation to select exemplars representing each note. The similarity values between columns x and y are selected to be Si(x, y) =

−|Di(:, x) − Di(:, y)|1for the ith note and the preference values are selected to be 2.3 times the median of these similarity values. This scale factor is chosen such that it leads to nearly 45 exemplars to be selected per note. Recall that this value is not given a priori in affinity propagation, it is determined by the preference values intrinsically. We also apply affinity propagation on the rows and get 278 exemplars by using a scale factor of 1.2.

In order to demonstrate, we give the spectrogram corresponding to the note A4 (440 Hz) and the selected exemplars via affinity propagation in Figure9. The selected exemplars seem to represent different parts of the spectrogram of A4.

The C-based and skeleton-based approaches give f-measures of 78.22% and 77.51%, respectively which are higher than the ones found in the previous subsection, as shown in Figure10. We can conclude that considering training data separately per note improves the transcription accuracy.

C-k medoids gives the highest f-measure value among our tests which is 78.69%. In addition, choosing only 400 of the rows via k-medoids and skeletonizing the training data leads to an f-measure of 77.41%. The f-measure of the C-based approach using affinity propagation is nearly 78.54%. The f-measure for the Skeleton-AP method is 78.39%. That is, we use less than 1% of the training data and still maintain the success rate. In fact, there is a slight increase in the f-measure which supports that removing repetitive and non-relevant parts may lead to better results.

In addition to the overall results, we give the results for each polyphony order in Figure11for the skeleton-based approach using affinity propagation. We got similar results for the other approaches, so we do not provide separate plots for each. As can be seen, recall rate is perfect in the monophonic case but the precision is low. Using a higher threshold on the weights may lead to selection of fewer notes as active and this can improve precision with a trade-off in recall rate if one is more

interested in precision. The values used here are obtained by optimizing the threshold on a verification set which excludes the test data.

As post-processing, we apply a median filter on each row of W before thresholding, where we set the size of the kernel to 15. Filtering leads to an increase in f-measure with a mean around 2.5% and standard deviation around 0.1% for the discussed approaches as given in Figure 12. We also observe that normalizing the spectrum at each frame increases the performance at the note offsets due to the artificial increase of the volume.

4. Conclusions and discussion

In this study, we presented computationally efficient methods for handling very large dictionaries in the NMF model. We have addressed the problem of polyphonic piano transcription and discussed that the conventional inference methods are inefficient to handle big dictionaries. We show that even a standard matrix factorization model is prohibitive in real applications where a huge amount of training data is used. We made use of randomized matrix factorizations and exemplar- based techniques to form efficient update rules. We employed randomized SVD which enables us to work with large data in practice (Halko et al.,2011b). Note that the accuracy of the randomized SVD can be improved via using Krylov sub- space of D (Halko et al., 2011a). We also used CUR factorization which can fully represent the matrix by selecting columns/rows that span the whole column/row space of D as well as approximating it with fewer dimensions.Halko et al.

(2011b) andMahoney(2011) present an extensive summary of the current deterministic and randomized CUR decomposition techniques with relative and additive error bound analysis for the reconstruction. When compared to SVD, CUR leads to a more interpretable decomposition since its basis vectors are individual rows and columns (Mahoney,2011) with physical correspondence. Besides, reconstruction with reduced SVD might lead to negative entries even if the data is non-negative as in the case of spectrogram data. Furthermore, for the case of sparse data as in a spectrogram of polyphonic music samples, CUR maintains the sparsity whereas SVD often gives a dense decomposition. Interested readers can refer toGoreinov, Tyrtyshnikov and Zamarashkin(1997), andStewart(1999) for more discussions on CUR.

(12)

We also conducted model-independent experiments where the f-measure degraded around 6% for each method. For the sake of clarity, we do not present the detailed results corresponding to those experiments. These results may suggest that our methods are more suitable for model-dependent applications.

With abundance of data in different applications, randomized matrix decompositions as well as exemplar selection techniques are likely to get more attention in the future and we have illustrated a relevant application of this in the context of music transcription.

References

Arı, I., ¸Sim¸sekli, U., Cemgil, A.T., & Akarun, L. (2012).

Large scale polyphonic music transcription using randomized matrix decompositions. In Proceedings of 20th European Signal Processing Conference (EUSIPCO) (pp. 2020–2024).

Piscataway, NJ: IEEE.

Bertin, N., Badeau, R., & Vincent, E. (2010). Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription. IEEE Transactions on Audio, Speech & Language Processing, 18(3), 538–549.

Cichocki, A., Zdunek, R., Phan, A.H., & Amari, S. (2009).

Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. New York: Wiley Publishing.

Dessein, A., Cont, A., & Lemaitre, G. (2010). Real-time polyphonic music transcription with non-negative matrix factorization and beta-divergence. In ISMIR (pp. 489–494).

Canada: International Society for Music Information Retrieval.

Emiya, V., Badeau, R., & David, B. (2010). Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 1643–1654.

Févotte, C., Le Roux, J., & Hershey, J.R. (2013). Non-negative dynamical sytem with application to speech and audio. In ICASSP (pp. 3158–3163). Piscataway, NJ: IEEE.

Frey, B.J., & Dueck, D. (2007). Clustering by passing messages between data points. Science (New York, N.Y.), 315(5814), 972–976.

An algorithm for the principal component analysis of large data sets. SIAM Journal on Scientific Computing, 33(5), 2580–2594.

Halko, N., Martinsson, P.G., & Tropp, J.A. (2011b). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217–288.

Klapuri, A., & Davy, M. (2006). Signal processing methods for music transcription. Berlin: Springer.

Lee, D.D., & Seung, H.S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.

Lee, D.D., & Seung, H.S. (2001). Algorithms for non- negative matrix factorization. In NIPS (Vol. 13, pp. 556–562).

Cambridge, MA: MIT Press.

Lee, H., & Choi, S. (2008). CUR + NMF for learning spectral features from large data matrix. In IEEE International Joint Conference on Neural Networks (pp. 1592–1597). Piscataway, NJ: IEEE.

Mahoney, M.W. (2011). Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2), 123–

234.

Mahoney, M.W., & Drineas, P. (2009). CUR matrix decomposi- tions for improved data analysis. Proceedings of the National Academy of Science, 106(3), 697–702.

Park, H.S., & Jun, C.H. (2009). A simple and fast algorithm for K- medoids clustering. Expert Systems with Applications, 36(2), 3336–3341.

¸Sim¸sekli, U., Le Roux, J., & Hershey, J.R. (2013). Hierarchical and coupled non-negative dynamical systems with application to audio modeling. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 1–4).

Piscataway, NJ: IEEE.

Smaragdis, P. (2011). Polyphonic pitch tracking by example. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 125–128). Piscataway, NJ: IEEE.

Smaragdis, P., & Brown, J.C. (2003). Non-negative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 177–180). Piscataway, NJ: IEEE.

Stewart, G.W. (1999). Four algorithms for the efficient computation of truncated pivoted QR approximations to a sparse matrix. Numerische Mathematik, 83(2), 313–323.