Crosslingual document comparison

The Tower of Babel

The Tower of Babel by Bruegel

Introduction

How can you learn a map from a German language to an English language word vectorisation model, to enable cross-lingual document comparison?

The monolingual version of this task has a familiar solution: Given say a German language word vectoriser and a recipe for building document vectors from word vectors, one can measure the thematic similarity of two German documents by comparing their vectorisations, for example using cosine similarity. Let’s refine the above question in this more concrete context.

Question: Suppose now we have a second such model in English, together with a list of bilingual phrase pairs. How can we compare documents across the two languages?

Recently, I implemented an answer to this question described in the 2013 paper “Exploiting Similarities among Languages for Machine Translation” of Mikolov, Le and Sutskever. The main idea is simple: Learn a linear map A that approximately sends each German phrase vector to the vectorisation of the English counterpart recorded in the list of bilingual phrase pairs. Applying A to any German document will then position it near similarly-themed English documents.

Data & software acknowledgments

Bilingual word and phrase pairs were taken both from the dict.cc German-English dump, and from the (German-English) Europarl corpus, which is a bilingual transcription of the proceedings of the European Parliament, compiled by Koehn (see his paper “Europarl: A Parallel Corpus for Statistical Machine Translation”). The data from both these sources was used for research purposes only, and the code used to clean dict.cc prior to vectorisation is included below. The English and German word vectorisers were trained on English and German Wikipedia dumps, using the Facebook’s open source FastText software.

The learning task

To formalise the learning task, suppose we are given the English and the German word vectorisers vec_{\textnormal{en}} and vec_{\textnormal{de}}, a list of bilingual phrase pairs [(g_{1}, e_{1}), ..., (g_{n}, e_{n})] together with a recipe for building document vectors as linear combinations of their constituent word vectors (for example, define the document vector to be the sum of the word vectors). The same recipe must be used for both languages.

Mikolov, Le and Sutskever’s idea is to compute a linear regression from the German language vector space to English language vector space, mapping the German phrase vectors onto their English counterparts as nearly as possible. Mathematically, the task is to learn a matrix A minimising the sum \sum_{i=1}^{n} \lvert \lvert A x_i - y_i \lvert \lvert^{2}, where x_i := vec_{\textnormal{de}}(g_i) and y_i := vec_{\textnormal{en}}(e_i). Hence, a well-learned A will have the property that e.g. A.vec_{\textnormal{de}}(\textnormal{``Haus''}) and vec_{\textnormal{en}}(\textnormal{``House''}) are nearby with respect to the Euclidean metric. Applying the linear map A to the vectorisation x of a German document will thus reposition x close to vectorisations of similarly-themed English documents, since document vectors are linear combinations of the constituent word vectors.

Training workflow

To learn the matrix A, the workflow was as follows:

  1. Train word vectorisation models on German and on English Wikipedia.
  2. Download and clean the text file version of the online dictionary dict.cc.
  3. Vectorise a subset of the bilingual phrase pairs from the second step using the models from the first step, first removing all pairs in which one of the phrases contained at least one word not in the vocabulary of the corresponding model. These vector pairs will form our training set. We will sample a test set from the remaining vectors.
  4. Learn A by applying a linear regression mapping the German phrase vectors to the English phrase vectors.

We explain each step in more detail.

1. Creating the vectorisers

We trained two word vectorisers, one on German Wikipedia and one on English Wikipedia. Each Wikipedia corpus contained around 463k long articles, and each comprised around 3GB of data. Both models were 100-dimensional, trained using FastText in CBOW mode, with negative sampling. The other hyperparameters were:

t = 0.00001
lr = 0.05
ws = 10
dim = 100
minCount = 15
epoch = 3
neg = 10
loss = ns
maxn = 0
minn = 0

2. Obtaining the bilingual phrase pairs

We constructed a list of bilingual (German, English) phrase pairs by cleaning a downloadable text file version of the online dictionary dict.cc. The resulting list of pairs looked like:

[('gesamtkosten', 'overall cost'),
('ableitung', 'derivation'),
('ingenieurleistung', 'engineering achievement'),
('schinkensandwich', 'ham sandwich'),
('postmaterialistisch', 'post materialistic'),
('kleiner seychellen taggecko', 'seychelles day gecko'), ...]

I used the following python code to clean the raw file.


square_ex = re.compile(r'\[.*?\]')
angle_ex = re.compile(r'<.*?>')
curly_ex = re.compile(r'{.*?}')
round_ex = re.compile(r'\(.*?\)')
slash_ex = re.compile(r'/')
exs = [square_ex, angle_ex, curly_ex, round_ex, slash_ex]

def getlines(filepath):
	with open(filepath, 'r') as f:
		lines = [line.split('\t')[:2] for line in f.read().splitlines()[9:]]
		return [line for line in lines if len(line) == 2]

def cleantext(text):
	for ex in exs:
		pars = ex.findall(text)
		for par in pars:
			text = text.replace(par, '')
	words = [w for w in text.split(' ') if w != '']
	ctext = ' '.join(words)
	return ctext

def processlines(lines):
	plines = [map(cleantext, line) for line in lines]
	return [(de, en) for de, en in plines if de != '' and en != '']

filepath = '/home/stephen/crosslingual/dictionary_de_en.txt'
lines = getlines(filepath)
phrasepairs = processlines(lines)

3. Processing and vectorising the phrase pairs

We filtered the list phrase pairs of (German, English) word pairs described above, throwing out all pairs in which one phrase contained at least one word not in the vocabulary of the corresponding model. After applying the German vectoriser to the German phrases and the English vectoriser to the English phrases, the result was a pair (gdf, edf) of python dataframes, each of shape (644320, 100). The i^{\textnormal{th}} row of gdf (resp., edf) contained the vectorisation of the i^{\textnormal{th}} phrase in the German (resp, English) vocabulary. From this pair of dataframes, we selected 600,000 vector pairs lines on which to train the linear regression. The remainder were held out, to be used for evaluating A.

4. Computing A with linear regression

Using numpy, we computed a linear regression mapping the 600k training lines of gdf to edf. Here is the code:

import numpy as np

def getmapmatrix(gdf, edf):
    X, Y = np.array(gdf), np.array(edf)
    A, _, _, _ = np.linalg.lstsq(X, Y)
    return A

Note: This function returns a matrix A such that XA \simeq Y, whereas the objective function defined above implies AX \simeq Y. But XA \simeq Y implies A^{\textnormal{T}}X^{\textnormal{T}} \simeq Y^{\textnormal{T}}, so the function solves the transpose of the required problem, which is equivalent.

Evaluation

How can we evaluate the quality of the matrix A? By definition, if A has been well-learned, the image Av of any German document vector v will live in a neighbourhood of similarly-themed English documents (hence the same is true of words, since words are short documents). We check if this is so both by eye — inspecting neighbours of the anglicisations of some German word and document vectors — and by computing a metric on the test set of bilingual vectorisations.

Inspection by eye

Below are the ten nearest neighbours to three German words:”Himmel” (Engl. “sky” or “heaven”), “Katze” (Engl. “cat”) and “Bergbau” (Engl. “mining”):

1. Word = “Himmel”:

heaven 0.762261
sky 0.761401
heavens 0.715642
starry 0.706182
shining 0.698117
skies 0.697317
darkness 0.695424
heavenly 0.682477
moonlit 0.675498
sun 0.673441

2. Word = “Katze”:

cat 0.766059
fievel 0.706771
freckles 0.691149
giggle 0.684391
scruffy 0.682200
rabbit 0.679410
frizzy 0.675246
cuddly 0.674954
farts 0.673027
licked 0.669227

3. Word = “Bergbau”:

mining 0.768001
industrial 0.674482
coal 0.671264
coalfield 0.667098
mines 0.642443
quarrying 0.638779
coalfields 0.627454
agricultural 0.624660
economy 0.623161
industry 0.618026

We now inspect the neighbours to each of two query documents. Each of the ten entries in both outputs lists the first 200 characters of the article — lowercased and stripped of punctuation, and ignoring word boundaries — and is preceded by the cosine similarity. All German language entries are coloured orange, and followed by English translations from Google translate. In both cases the closest match is the article itself, as one might expect.

The first query document is the English Wikipedia article about the Dutch national holiday ‘Koningsdag’, celebrating the King’s birthday:

0: (1.000000) koningsdag koningsdag or king day is national holiday in the kingdom of the netherlands celebrated on april the if the falls on sunday the date marks the birth of king willem alexander from to the day

1: (0.962013)  koningsdag koningsdag bzw ist nationalfeiertag in den niederlanden inklusive der besonderen gemeinden in der karibik in curaçao in sint maarten und in aruba an diesem tag feiern die niederländer den

Koningsdag koningsdag or national holiday in the netherlands including the special parishes in the carribean curaçao in sint maarten and in aruba on this day celebrate the niederländer

2: (0.948780)  goldenes thronjubiläum von elisabeth ii das goldene thronjubiläum von elisabeth ii engl golden jubilee war im jahr das jubiläum der thronbesteigung der britischen königin elisabeth ii mit den interna

Golden thronjubiläum of elisabeth ii the golden thronjubiläum of elisabeth ii engl golden jubilee was in the year the jubilee of the throne of the british queen elisabeth ii with the

3: (0.947533)  silbernes thronjubiläum von elisabeth ii das silberne thronjubiläum von elisabeth ii engl silver jubilee war im jahr das jubiläum der thronbesteigung der britischen königin elisabeth ii aus diesem an

Silver thronjubiläum of elisabeth ii the silver thronjubiläum of elisabeth ii engl silver jubilee was the year of the jubilee of the throne of the british queen elisabeth ii from this

4: (0.942364) silver jubilee of elizabeth ii the silver jubilee of elizabeth ii marked the anniversary of queen elizabeth ii accession to the thrones of the united kingdom canada australia new zealand and other com

5: (0.941474)  beatrix niederlande beatrix januar in baarn als beatrix wilhelmina armgard prinzessin von oranien nassau prinzessin zur lippe biesterfeld war vom april bis zum april als sie das amt ihrem sohn willem

Beatrix netherlands beatrix january in baarn as beatrix wilhelmina armgard princess of oranien nassau princess to lippe biesterfeld was from april until april when she was the office her son willem

6: (0.940599)  diamantenes thronjubiläum von elisabeth ii das diamantene thronjubiläum von elisabeth ii engl diamond jubilee war im jahr das jubiläum der thronbesteigung der britischen königin elisabeth ii es fande

Diamond thronjubiläum of elisabeth ii the diamond thronjubiläum of elisabeth ii engl diamond jubilee was the year of the jubilee of the throne of the british queen elisabeth ii it found

7: (0.939123)  juliana niederlande juliana louise emma marie wilhelmina deutsch auch juliane april in den haag märz in soestdijk prinzessin von oranien nassau herzogin zu mecklenburg war vom september bis zum april

Juliana netherlands juliana louise emma marie wilhelmina german also juliane april in the hague march in soestdijk princess of oranien nassau duchess zu mecklenburg was from september to april

8: (0.938922) father day father day is celebration honoring fathers and celebrating fatherhood paternal bonds and the influence of fathers in society many countries celebrate it on the third sunday of june but it i

9: (0.938111) juliana of the netherlands juliana juliana louise emma marie wilhelmina april march was the queen regnant of the kingdom of the netherlands between and she was the only child of queen wilhelmina and p

The second query document is an English Wikipedia article on “roadkill cuisine”:

0: (1.000000) roadkill cuisine roadkill cuisine is preparing and eating roadkill animals hit by vehicles and found along roads it is practice engaged in by small subculture in the united states southern canada the

1: (0.937038) dog meat dog meat refers to the flesh and other edible parts derived from dogs human consumption of dog meat has been recorded in many parts of the world including ancient china ancient mexico and anc

2: (0.935527)  heimtier heimtiere sind tiere die vom menschen aus verschiedenen motiven meist in der wohnung oder in sonstigem engen kontakt mit ihm gehalten werden motive für die haltung von heimtieren können sein

Pet animals are animals that are kept by people from different motives usually in the home or in other close contact with him motives for the attitude of animals can be

3: (0.934084)  hundefleisch hundefleisch wird in einigen ländern als nahrung genutzt zum beispiel in korea vietnam und einigen südlichen und nördlichen regionen chinas wie guangdong und beijing allerdings hat sich

Dog meat is used in some countries as a food for example in korea vietnam and some southern and northern regions china like guangdong and beijing however has itself

4: (0.932840)  kängurufleisch als kängurufleisch in australien roo wird das fleisch aller australischen känguruarten bezeichnet es ist nicht nur in australien erhältlich sondern wird vor allem ins ausland exportier

Kangaroo as kangaroo in australia roo is the meat of all australian kangaroo species, it is not only available in australia but is mainly exported to abroad

5: (0.932549) tree squirrel tree squirrels include over hundred species that are found on all continents except antarctica and are the members of the squirrel family sciuridae most commonly referred to as squirrels

6: (0.929929)  hoden lebensmittel hoden verschiedener tierarten zählen zu den innereien und sind generell essbar in mitteleuropa ist ihre verwendung in der küche heute unüblich war früher jedoch relativ verbreitet

Foodstuffs of various animals are among the innards and are generally edible in Central Europe, their use in the kitchen is unusual today but was relatively common

7: (0.928278)  katzenfleisch katzenfleisch dient in jeweils mehr oder weniger geringem umfang unter anderem in südchina nordvietnam korea peru und großbritannien zu nahrungszwecken in notzeiten wurde auch katzenfle

Cat meat is mainly used in south china north korea korea peru and britain for food purposes in emergency times also katzenfle

8: (0.927188)  fugu fugu jap ist eine japanische spezialität die aus dem muskelfleisch von kugelfischen besteht in einer besonderen zubereitungstechnik werden die durch das darin enthaltene tetrodotoxin hochgiftige

Fugu fugu jap is a Japanese specialty consisting of the muskfish of kugelfischen in a special preparation technology, the tetrodotoxin is highly toxic

9: (0.926449)  winterfütterung unter winterfütterung versteht man die fütterung von tieren im winter bei landwirtschaftlichen nutztieren unterscheidet sich die winterfütterung heute oft nicht mehr von der fütterung

Winterfeedterung unter winterfütterung is the feeding of animals in the winter with agricultural use, the winter feeding is often no longer different from the feeding

Evaluating A with a metric

We define a metric that quantifies the extent to which the image of a German document vector under the matrix A will live in a neighbourhood of similarly-themed English documents. Clearly, the difficulty of this task depends on the size of the list.

Given a list L := [(x_{1}, y_{1}), ..., (x_{n}, y_{n})] of document vectorisations, where the x_{i} are German and the y_{i} English, we define the match rate of A with respect to L to be the frequency with which an English document vector y_{i} is the nearest neighbour to its anglicised German counterpart A x_{i}, from amongst all the [y_{1}, \ldots, y_{n}], as i ranges over all integers from 1 to n.

Informally, this metric answers the question “Given a German phrase and a list of English phrases containing its translation, how often will A pick out the correct one?”.

To evaluate A, we computed the match rate on three test sets of size n = 10,000, each sampled independently from the approximately 44,000 bilingual phrase pairs held out from the training sample. The results were:

match rate 1:   0.22810
match rate 2:   0.23730
match rate 3:   0.22780

Thus A was able to identify the English equivalent of a German phrase around 23% of the time, from a list of 10,000 candidates. Out of curiosity, we repeated the above experiment with the same word vectorisation models, but using the German-English Europarl corpus — the bilingual proceedings of the European parliament — instead of dict.cc. We report the match rates on the four combinations of training set S and test sets T, where S and T can be either of Eurparl or dict.cc. For comparability, the matrix A was learned in all cases from 600k bilingual phrases, and all evaluations are calculated on 10,000 test document pairs, randomly sampled from those held out from training.

1. For (training_corpus, test_corpus) = (EuroParl, EuroParl):

match rate 1:   0.57490
match rate 2:   0.57540
match rate 3:   0.57520

2. For (training_corpus, test_corpus) = (dict.cc, EuroParl):

match rate 1:   0.43240
match rate 2:   0.44030
match rate 3:   0.43880

3. For (training_corpus, test_corpus) = (EuroParl, dict.cc):

match rate 1:   0.20430
match rate 2:   0.21020
match rate 3:   0.20480

The last combination, for (training_corpus, test_corpus) = (dict.cc, dict.cc), was reported above.

Conclusion

The linear regression has clearly learned something, since the above nearest neighbour words and documents are visibly relevant to the query terms. The match rates are hard to interpret in isolation, but provide a baseline for future experiments. Moreover, they show that using EuroParl as a test corpus is a significantly easier task than using dict.cc, regardless of training corpus. We also note that testing on one of the two corpora and training on the other, as opposed to training on documents sampled from the same corpus as the testing corpus, reduces the match rate.

Addendum

My colleague Matthias Leimeister told me about the paper “Offline bilingual word vectors, orthogonal transformations and the inverted softmax” of Smith, Turban, Hamblin and Hammerla, in which the authors compute, in place of our A, a necessarily orthogonal map B, which yields a higher match rate. They show that B, which is defined to be the argmax of the objective function \sum_{i=1}^{n} x_i^{\textnormal{T}} B y_i — where x_i and y_i are German and English phrase vectorisations, as above — is in fact given by B = UV^{\textnormal{T}}, where U \Sigma V^{\textnormal{T}} is the singular value decomposition of the matrix product X^{\textnormal{T}}Y, whose factors collect the x_is and y_i s as column vectors. I have not yet tried this out.


The Author
I'm a data scientist at Lateral.
Comments