
Where rank_q is the rank of the correct definition for the query q.

English Princeton WordNet Definition | Translated Definition
ancient Greek or Roman galley or warship having three tiers of oars on each side | with the ancient Greeks and Romans, a three-wheeled war ship was happy
the pure mathematics of points and lines and curves and surfaces | the scientific branch of mathematics, which deals with the spatial properties of the bodies and their interrelations
everything stated or assumed in a given discussion | circumstances, facts, things taken into account in a discussion or discussion
the time when the Moon is fully illuminated | Phase of the month in which it is fully illuminated
someone whose occupation is catching fish | A person who deals with fishing and sometimes with the conservation of fish fishing
make fit for, or change to suit a new purpose | I do something to go and match a certain purpose, make the necessary adjustments and adjustments, to respond to certain conditions and circumstances.
admit (to a wrongdoing) | I assure you, did someone present a thing to see, to know him or to judge him? I find something to be seen; show

Table 4.2: English Princeton WordNet definitions and the target wordnet definitions we want to match

We have studied the bag-of-words representation for document retrieval tasks, using tf-idf weights. Kusner et al. give a brilliant example of where relying on word overlap falls short:

"Obama speaks to the media in Illinois" and "The President greets the press in Chicago". After the stop words are removed, the resulting tokens {Obama, speaks, media, Illinois} and {President, greets, press, Chicago} have no overlap, so their vector representations would be orthogonal, indicating maximum dissimilarity. Yet the two sentences convey the same meaning. By calculating the distances between the individual words of the documents, it can be shown that the two are essentially the same sentence. Using this case as their motivation, Kusner et al. adapted optimal transport [89], an optimization technique like the one discussed in Chapter 3. Optimal transport deals with the cost of moving one arbitrary mass onto another, often explained with the analogy of transporting piles of dirt. Their idea starts by casting documents as probability distributions defined over the word embeddings of the words they contain. First, the words of a document are treated as the elements of that document's probability distribution. We now know that relying on exact word overlap is not feasible, so the words are instead represented by their d-dimensional word embeddings. The distance between two words can then be measured as the Euclidean distance between their embeddings, written as the distance metric c(·):

c(w_a, w_b) = \lVert w_a - w_b \rVert_2

Kusner et al. [22] call this the travel cost between the words w_a and w_b. Now that the distance, or cost, between words can be expressed, consider two documents d_i and d_j written using the words d_i = \{w^i_1, w^i_2, \dots, w^i_n\} and d_j = \{w^j_1, w^j_2, \dots, w^j_n\}. A flow matrix can be formulated from the distances between every w^i_x and w^j_y pair. This flow matrix is called T, and an element T_{i,j} denotes how much of word i travels to word j in order to move d_i to d_j. T is optimized with linear programming to find the minimum total cost between d_i and d_j. Specifically, the following constrained problem is solved to find the T with the least overall sum [22]:

\min_{T \geq 0} \sum_{i,j=1}^{n} T_{i,j} \, c(i, j)    (4.5)

subject to

\sum_{j=1}^{n} T_{i,j} = d_i \quad \forall i \in \{1, \dots, n\}    (4.6)

\sum_{i=1}^{n} T_{i,j} = d_j \quad \forall j \in \{1, \dots, n\}    (4.7)

This optimization is solved by choosing which word pairs to use when travelling from one document to another. When cast over a whole corpus, similar documents will have lower minimal costs in their T matrices, since their words can be moved onto each other cheaply.
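To make the formulation concrete, the following Python sketch computes the Word Mover's Distance between two tokenized documents by building the pairwise travel costs and solving the linear program of Equations 4.5-4.7 with an off-the-shelf solver. The embeddings lookup table and the uniform nBOW weights are illustrative assumptions of the sketch, not part of the original method.

# A minimal sketch of Word Mover's Distance following Eq. (4.5)-(4.7).
# `embeddings` is assumed to map each token to a d-dimensional numpy vector.
import numpy as np
from scipy.optimize import linprog

def word_movers_distance(tokens_a, tokens_b, embeddings):
    # Normalized bag-of-words weights d_i and d_j of the two documents.
    d_a = np.full(len(tokens_a), 1.0 / len(tokens_a))
    d_b = np.full(len(tokens_b), 1.0 / len(tokens_b))

    # Travel cost c(w_a, w_b) = ||w_a - w_b||_2 for every word pair.
    cost = np.array([[np.linalg.norm(embeddings[wa] - embeddings[wb])
                      for wb in tokens_b] for wa in tokens_a])

    n, m = cost.shape
    A_eq, b_eq = [], np.concatenate([d_a, d_b])
    for i in range(n):                       # rows of T must sum to d_a
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(m):                       # columns of T must sum to d_b
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel())

    result = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
                     bounds=(0, None), method="highs")
    return result.fun  # minimal cumulative travel cost between the documents

Dedicated optimal transport libraries provide much faster solvers for the same problem; the sketch only mirrors the linear program stated above.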

4.2.2. Cross Lingual Word Mover’s Distance

Balikas et al. [21] proposed using the Sinkhorn distance [90] in a cross-lingual document retrieval setting. Their approach can be thought of as a step forward from Word Mover's Distance. First, instead of the normalized bag-of-words representation of the documents, they used term frequency and tf-idf to weigh the document representations. Second, they used entropic regularization, which allowed them to use the Sinkhorn-Knopp algorithm [91] to solve the linear assignment between the word embeddings of the source and the target documents. For the regularization term λ and the entropy E(T) of the transport matrix T:

\min_{T \geq 0} \sum_{i,j=1}^{n} T_{i,j} \, c(i, j) - \frac{1}{\lambda} E(T)    (4.8)

subject to

\sum_{j=1}^{n} T_{i,j} = d_i \quad \forall i \in \{1, \dots, n\}    (4.9)

\sum_{i=1}^{n} T_{i,j} = d_j \quad \forall j \in \{1, \dots, n\}    (4.10)

We have modified their open source project2 to accommodate our task as a cross-lingual textual similarity measure.
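For illustration only, a minimal sketch of the Sinkhorn-Knopp scaling iterations used to solve the entropically regularized problem of Equations 4.8-4.10 is given below; the function name, the tf-idf weighted document weights d_src/d_tgt and the fixed iteration count are assumptions of the sketch, not the exact routine of the modified project.

# A sketch of the entropically regularized transport cost (Eq. 4.8-4.10)
# approximated with Sinkhorn-Knopp scaling iterations.
import numpy as np

def sinkhorn_distance(cost, d_src, d_tgt, lam=10.0, n_iter=200):
    # cost: pairwise distances between source and target word embeddings
    # d_src, d_tgt: normalized (tf-idf weighted) document weight vectors
    K = np.exp(-lam * cost)              # elementwise kernel exp(-lambda * c)
    u = np.ones_like(d_src)
    for _ in range(n_iter):              # alternate row/column rescaling
        v = d_tgt / (K.T @ u)
        u = d_src / (K @ v)
    T = u[:, None] * K * v[None, :]      # regularized transport plan
    return float(np.sum(T * cost))       # approximate minimal transport cost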

4.2.3. Evaluation

In order to evaluate the cross-lingual pseudo-document retrieval approaches, the mean reciprocal rank presented in Equation 4.4 is reported here as well. Mean reciprocal rank gives an overall insight into the performance of the retrieval system by penalizing cases where the correct definition is retrieved at a lower rank. However, in the context of dictionary alignment, the percentage score of precision at one is also reported, since in a real-life application retrieving the correct definition in the top spot is more important.
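As a small illustration, both reported scores can be computed from the rank of the correct definition for each query; the ranks list assumed below is a hypothetical input.

# Hypothetical evaluation helpers: `ranks` holds the 1-based rank of the
# correct definition retrieved for each query.
def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def precision_at_one(ranks):
    # Reported as a percentage of queries whose correct definition is ranked first.
    return 100.0 * sum(1 for r in ranks if r == 1) / len(ranks)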

2https://github.com/balikasg/WassersteinRetrieval

5. Supervised Alignment

The approaches we have presented so far work fully unsupervised. Given two collections of definitions, we have studied methods that retrieve the definition(s) representing the same meaning. In this chapter, given the moderately sized data at hand that we have accepted as "golden", we investigate the feasibility of training an encoder [92] whose objective is to learn whether a pair of definitions entails the same sense across languages.

5.1. Neural Network Model

Recurrent Neural Network (RNN) architectures improve upon the prototypical neural network model by introducing a memory for the connections in the network [93]. By updating the hidden unit over time using the output of the previous time step, the model can remember features of the input signal for later inputs. One particular RNN architecture, proposed by Hochreiter & Schmidhuber [94], is the long short-term memory (LSTM). LSTM models have been successful on language modelling tasks [92], handwriting recognition [95, 96] and machine translation with a focus on rare words [97].

The highlight of these results is that LSTMs have an advantage on tasks that require contextual information to persist over long periods of time [98]. Furthermore, LSTMs do not require fixed-length inputs, which is a necessity for us since our definition pairs do not have to be the same length.

5.1.1. Vanishing Gradient Problem

LSTM was born out of the need to address the vanishing gradient problem [94, 99]. In the original publication by Hochreiter & Schmidhuber [94], a crucial shortcoming of RNNs was identified as their slow rate of training, which may not converge at all. Independently, Bengio et al. [99] suggested that the problem stems from the trade-off between conserving the previous inputs and resisting the noise they accumulate. Figure 5.1 illustrates the problem, using shades to show the influence of the input over the neural network units. As the input signal traverses the units of an RNN, it either diminishes or blows up [98].

Figure 5.1: Graphical representation of the vanishing gradient problem, where the shades of the nodes represent the influence of the input signal [98] (inputs, hidden layer and outputs unrolled over time)

5.1.2. Long Short-Term Memory

LSTM is the solution highlighted by Graves [98] as a recurrent neural network model that can work over temporally distant input signals while preserving their influence or diminishing their noise. The centrepiece idea is to use a constant error carousel: special cells that enforce a constant error flow. This complex unit is named the memory cell [94].

Using an input gate, the cell is updated only if the current input is relevant; using an output gate, the cell does not update other cells if its current output is not relevant.

A simplified overview of the suggested model is presented in Figure 5.2.

Figure 5.2: Simplified long short-term memory cell architecture (input x and output y pass through multiplication nodes controlled by the input gate and the output gate of the memory cell)

Multiple arrows denote input from the current time frame together with recurrent connections, and the dot symbols denote multiplication. By weighing the input and output gates between 0 and 1, the impact of the current input and output can be adjusted. Overall, the input gate controls how much the cell will learn and the output gate controls how much the cell will propagate. Figure 5.3, adapted from Graves [98], illustrates how LSTMs operate against the vanishing gradient.

The cell structure with two gates was extended with a third, forget gate by Gers et al. [100]. The aim was to handle input sequences that are not segmented in a predictable manner. The error signals were being carried too far back in time, which deteriorated performance on tasks with continuous input streams.

Figure 5.3: Preserving the input signal through blocking (-) or allowing (O) the input signal, adapted from Figure 4.4 of Graves [98]

The proposed forget gate is implemented to reset the cell. When the cell state becomes irrelevant due to a change in the problem domain, the forget gate gradually resets the cell state, instead of leaving erroneous activations from the input gate to attempt this inefficiently.

Another extension came in the form of peephole connections by Gers et al. [101]. With this addition, the gates can use the cell state to increase their sensitivity to timing in the input data. By allowing the internal gates to inspect the cell state, they showed improvements on non-linear tasks.

The finalized model with input, forget and output gates as well as the internal peephole connections debuted in Graves & Schmidhuber [102]. In their expansive study comparing 8 LSTM variants over 15 years of CPU time, Greff et al. [103] named this model the "vanilla LSTM". This particular form of LSTM is commonly used [92].

The real-valued vectors are denoted alongside their update time step (·)_t such that t ∈ {1, ..., T}. Updates are performed on the input gate i_t, the memory cell c_t via its candidate state \tilde{c}_t, the forget gate f_t and the output gate o_t. We show the updates over the weight matrices W_i, W_f, W_c and W_o, the recurrent weights R_i, R_f, R_c and R_o, and the bias vectors b_i, b_f, b_c and b_o. Peephole connections are shown using p_· for the input, forget and output gates as p_i, p_f and p_o respectively. Finally, the input is a sequence of data in the form (x_1, x_2, ..., x_T) and the model outputs the real-valued vector y_t at time step t.

i_t = \sigma(W_i x_t + R_i y_{t-1} + p_i \odot c_{t-1} + b_i)    (5.1)

f_t = \sigma(W_f x_t + R_f y_{t-1} + p_f \odot c_{t-1} + b_f)    (5.2)

\tilde{c}_t = \tanh(W_c x_t + R_c y_{t-1} + b_c)    (5.3)

c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1}    (5.4)

o_t = \sigma(W_o x_t + R_o y_{t-1} + p_o \odot c_t + b_o)    (5.5)

y_t = o_t \odot \tanh(c_t)    (5.6)

Where σ(·) is the logistic sigmoid function σ(x) = 1 / (1 + e^{-x}) and pointwise vector multiplication is denoted using ⊙. The recurrence holds via the usage of signals from the previous time step (·)_{t-1}.
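To tie the equations together, the following numpy sketch performs a single vanilla LSTM update following Equations 5.1-5.6. The dictionaries W, R, p and b holding the weight matrices, recurrent weights, peephole vectors and biases are assumptions of the sketch; any deep learning framework's LSTM layer implements the same update internally.

# A minimal numpy sketch of one vanilla LSTM step (Eq. 5.1-5.6).
# W, R, p, b are assumed dicts of pre-initialized parameters keyed by "i", "f", "c", "o".
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, y_prev, c_prev, W, R, p, b):
    i_t = sigmoid(W["i"] @ x_t + R["i"] @ y_prev + p["i"] * c_prev + b["i"])  # (5.1) input gate
    f_t = sigmoid(W["f"] @ x_t + R["f"] @ y_prev + p["f"] * c_prev + b["f"])  # (5.2) forget gate
    c_tilde = np.tanh(W["c"] @ x_t + R["c"] @ y_prev + b["c"])                # (5.3) candidate cell state
    c_t = i_t * c_tilde + f_t * c_prev                                        # (5.4) memory cell update
    o_t = sigmoid(W["o"] @ x_t + R["o"] @ y_prev + p["o"] * c_t + b["o"])     # (5.5) output gate
    y_t = o_t * np.tanh(c_t)                                                  # (5.6) block output
    return y_t, c_t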
