A fastText-based hybrid recommender

Introduction

Using Facebook Research’s new fastText library in supervised mode, I trained a hybrid recommender system to recommend articles to users, given as training data both the text of the articles and the user/article interaction matrix. The labels attached to a document were its id and the ids of all users who viewed it. I’ve not finished testing it, but early signs are that it learns quite well. This post is a short progress report, perhaps of interest to others building hybrid recommenders.

About fastText

What is it?

Recently, Facebook Research released the C++ source code for fastText, a software library for learning word embeddings using a shallow neural network. It can learn either in an unsupervised manner, training on unstructured text, or in a supervised manner, from a collection of labelled documents. The code is accompanied by two research preprints:

  1. Enriching Word Vectors with Subword Information (unsupervised)
  2. Bag of Tricks for Efficient Text Classification (supervised),

explaining fastText’s workings and its performance on benchmark datasets in each case. The authors state that its performance is comparable to that of much more complex “deep learning” algorithms, while its training time is significantly lower.

How does it work?

In unsupervised mode, fastText trains by sliding a window over the input text and either learning the centre word from the remaining context (“continuous bag of words”, or “CBOW”) or learning all the context words from the centre word (“skipgram”). Learning can be viewed as a series of updates to a neural network with two layers of weights and three tiers of neurons: the two outer tiers each have one neuron per word in the vocabulary, while the middle tier has as many neurons as there are dimensions in the embedding space. In this way, fastText is similar to word2vec (indeed, one of the fastText coauthors, Tomas Mikolov, is the creator of word2vec). Unlike word2vec, however, fastText can also learn vectors for sub-parts of words, so-called character n-grams, ensuring e.g. that the words “supervised”, “supervise” and “supervisor” all have similar vectorisations, even if they tend to appear in different contexts. According to the authors, this feature enhances learning on heavily inflected languages.
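
To make the windowing concrete, here is a minimal Python sketch (an illustration of the idea, not fastText’s actual C++ code) of how CBOW and skipgram training pairs might be generated from a tokenised line; the function name and the maximum window size of 2 are my own choices for illustration.

import random

def training_pairs(tokens, max_window=2, mode="cbow"):
    """Yield (input_words, target_word) pairs from one tokenised line."""
    for i, centre in enumerate(tokens):
        radius = random.randint(1, max_window)   # window radius is sampled uniformly
        context = [tokens[j]
                   for j in range(max(0, i - radius), min(len(tokens), i + radius + 1))
                   if j != i]
        if mode == "cbow":
            yield context, centre                # predict the centre word from its context
        else:                                    # skipgram
            for word in context:
                yield [centre], word             # predict each context word from the centre word

For example, list(training_pairs("novak djokovic won yet another match".split())) yields one (context, centre) pair per position in the sentence.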

In supervised mode, the first two neuron tiers of the network are the same, but the neurons in the output tier are in bijection with the labels, rather than the words in the vocabulary. The learning task on a given window is to predict the labels attached to the current document, given the words in the window.
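
As a rough numpy sketch of that supervised architecture (again an illustration of the idea, not the fastText implementation), the middle tier is the average of the input word vectors, and each label receives a score from its own output vector:

import numpy as np

def label_probabilities(word_vecs, label_vecs):
    """word_vecs: (n_words, dim) vectors for the words in the input;
    label_vecs: (n_labels, dim), one row per label."""
    hidden = word_vecs.mean(axis=0)       # middle tier: average of the input word vectors
    scores = label_vecs @ hidden          # output tier: one score per label
    exp = np.exp(scores - scores.max())   # softmax over the labels
    return exp / exp.sum()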

Training a hybrid recommender

Defining the model

Supervised fastText trains on a text file in which each line is a series of space-separated labels, followed by the text of a single document. Each label is prefixed with a marker, by default “__label__”, so that labels can be distinguished from words in the text.

In my case, each line of the training file contained one document, with no repetitions. The labels attached to a document were its document id, and the ids of all users who had interacted with it. Here’s a fictitious example:

__label__doc_id_21231 __label__user_id_aadsf89asdf __label__user_id_87sdfsdf __label__user_id_45jfkasf novak djokovic won yet another tennis match last friday ...
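
For concreteness, a training file of this form could be generated with something like the following sketch; the dictionaries texts and viewers are hypothetical names of my own (not fastText inputs), seeded here with the fictitious example above.

# Hypothetical example data: article texts and the users who viewed each article.
texts = {"21231": "novak djokovic won yet another tennis match last friday"}
viewers = {"21231": ["aadsf89asdf", "87sdfsdf", "45jfkasf"]}

with open("train.csv", "w") as f:
    for doc_id, text in texts.items():
        labels = ["__label__doc_id_%s" % doc_id]
        labels += ["__label__user_id_%s" % u for u in viewers.get(doc_id, [])]
        f.write(" ".join(labels) + " " + text.strip() + "\n")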

After training, the score associated to a user/document pair is the cosine similarity of their associated vectors. (Note that the vectors learned by fastText are not of unit length, so computing cosine similarity means normalising.) The document yielding the highest score is defined to be the recommendation.
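
In code, the scoring step might look like the following sketch, where user_vec, doc_vecs and doc_ids are assumed to hold the learned label vectors and document ids (these names are mine, not fastText output):

import numpy as np

def normalise(vecs):
    # fastText vectors are not of unit length, so normalise before taking dot products
    return vecs / np.linalg.norm(vecs, axis=-1, keepdims=True)

def recommend(user_vec, doc_vecs, doc_ids):
    """Return the id of the document whose vector is most cosine-similar to the user vector."""
    scores = normalise(doc_vecs) @ normalise(user_vec)
    return doc_ids[int(np.argmax(scores))]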

Data preparation and hyperparameters

After throwing out users and documents with too few interactions, my training set consisted of around 22k documents and 85k users, with 1 million interactions, and my test set consisted of 30,000 user/document interaction pairs. The training set was 56MB in size. Here are the hyperparameters I specified for the best-performing model (all others were default):

-dim 100
-loss ns
-thread 32
-minCount 5
-neg 10
-ws 20
-epoch 100000
-lr 0.075

The training took around 5 hours on a 32-core, 120 GB AWS instance. Varying the window size, dimension and minCount made little difference to model quality. Increasing “neg”, the number of negative samples, improved the quality of the model uniformly over the range I checked (values from 5 to 50 in intervals of 5); the downside is that more negative samples mean longer training. Varying the learning rate affected the model unpredictably: larger values seemed better when the number of epochs was small, but I can’t see a consistent pattern. Using a very large learning rate with very many epochs led to segfault errors, perhaps because of arithmetic overflow. Increasing the number of epochs steadily improved model quality. I stopped at 100,000 epochs because the improvements were becoming marginal, and my patience ran out.

A small “gotcha”

In the version of the code I downloaded in late August, the documentation stated the default value for the “loss” hyperparameter was “ns”, meaning negative sampling. However, this was not true at the time: reading the code revealed the true default value was “softmax”, which perhaps leads to a better model, but is much slower to train. This meant that my first attempts to train a model were unsuccessful, because fixing a large enough value of “epoch” to ensure the model learned something would’ve meant waiting days or weeks for the model to train. After a few days’ frustration, I read the code and noticed the error in the documentation. Happily, fastText coauthor Edouard Grave was extremely helpful, responding immediately to my bug report, and fixing the problem within a day. He was also kind enough to suggest I use hierarchical softmax instead of negative sampling for the “loss” parameter, since it is, apparently, faster still to train. I haven’t yet experimented with hierarchical softmax systematically. A big thank you to Edouard for his help!

Rationale: why should this work?

The first layer of weights in the network encodes the word vectors; the second layer the label vectors. During training, at any given text window, each word vector is moved closer to a randomly chosen label vector for the current document. Each document id label is seen only once per training epoch, but if we train over many epochs, then:

  1. We expect the label vector associated to a document id to be pushed closer to the average of the word vectors this document contains, and
  2. We expect the vector associated to a user id to be pushed closer to the average of the vectors of all words mentioned in each document seen by that user.

If both the above hold, then the label vectors of document ids and the label vectors of user ids will tend to be moved closer to one another, provided the given document has been seen by the given user. For the same reason, any two users with similar reading tastes, and any two documents attracting users with similar reading tastes, will have similar vectorisations.

Results

We measured the performance of the model as follows (a code sketch follows the list):

  1. For each user_id/doc_id pair (u, d) in the test set, compute the cosine similarity of vec(u) with vec(d’), where d’ varies across all documents in the test set; this is by definition the score s(u, d’) of (u, d’).
  2. Write these scores s(u, d’) in descending order in a list L, and record the position in which s(u, d) occurs.
  3. Now compute the “decimal percentile” of s(u, d) in L: if it is in first place, assign it 0.0; if it is in last place, assign it 1.0; half-way down the list corresponds to 0.5. Call this number the “percentile rank” pr(u, d) of the pair (u, d).
  4. We measure the performance of the model as the mean of pr(u, d) across all 30,000 pairs (u, d).
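
Here is a sketch of this computation, assuming l2-normalised numpy arrays user_vecs and doc_vecs for the test users and documents, and a list test_pairs of (user index, document index) pairs; these variable names are mine, for illustration only.

import numpy as np

def mean_percentile_rank(user_vecs, doc_vecs, test_pairs):
    """user_vecs, doc_vecs: l2-normalised arrays; test_pairs: (user index, doc index) pairs."""
    ranks = []
    for u, d in test_pairs:
        scores = doc_vecs @ user_vecs[u]            # cosine similarities s(u, d') for every d'
        position = int((scores > scores[d]).sum())  # how many documents outscore the true one
        ranks.append(position / (len(scores) - 1))  # 0.0 = first place, 1.0 = last place
    return float(np.mean(ranks))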

The best model trained in this way achieved a score of around 0.093. A model that understood nothing would achieve a score of 0.5, meaning that the score s(u, d) would on average sit half-way down the list of test scores, so this is significantly better than random. However, this result is not as good as that of our current state-of-the-art model (CSOA), which achieves a score of less than 0.01, though the CSOA is also trained on around 100 times more data. The next step is to train a fastText model on this larger data set. The experiment continues!

The Author
I'm a data scientist at Lateral.
Comments


  1. Thanks for sharing this great article. Could you describe the entire flow of your model? Let’s say once you have the training data in the required format, how did you use the fastText library to train the model and evaluate it on the test data, with a little bit of code…?

    1. Hi Prakash, glad you enjoyed the article 🙂 Once you’ve cloned and compiled fasttext as per their documentation, training is straightforward. Assuming the compiled fasttext binary is in the same directory as your training file train.csv (assumed to have the format described in the post), just type, e.g.:

      ./fasttext supervised -input train.csv -output model -dim 100 -loss ns -thread 32 -minCount 5 -neg 10 -ws 20 -epoch 100000 -lr 0.05

      The original source code prints out only the word vectors, not the label vectors. I modified the FastText::saveVectors() method in the fasttext.cc file, so that it also prints the label vectors to a separate file. Thus, when training is finished, I have two vector files: “model_words.vec” and “model_labels.vec”. The latter contains both document and user labels, with prefixes “__label__doc__id__” and “__label__user__id__”, respectively. To evaluate the model, I (i) load “model_labels.vec” into an ipython notebook, (ii) separate out the user and doc vecs, (iii) apply the MPR metric explained in the post. Note that to compute this metric as I did, you’ll first need to l2-normalise all the user and document vectors. The intuition behind the MPR is explained in this wikipedia article:

      https://en.wikipedia.org/wi

      This should complement the procedural description of MPR I wrote in the post. Good luck!
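
      For instance, loading and splitting the label vectors might look something like this sketch, assuming the modified “model_labels.vec” follows the usual .vec layout (a header line, then one label and its vector per line):

      import numpy as np

      user_vecs, doc_vecs = {}, {}
      with open("model_labels.vec") as f:
          next(f)  # skip the header line (vocabulary size and dimension)
          for line in f:
              label, *values = line.rstrip().split(" ")
              vec = np.array(values, dtype=float)
              vec /= np.linalg.norm(vec)  # l2-normalise, so dot products are cosine similarities
              if label.startswith("__label__user__id__"):
                  user_vecs[label] = vec
              elif label.startswith("__label__doc__id__"):
                  doc_vecs[label] = vec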

      1. Thanks Stephen for such a prompt and detailed reply. Let me apply this on my training data; if I face any problem I’ll bother you again.. :P.. :)

  2. Hi, nice write up!
    What is your CSOA model? How does fastText do in comparison to other models such as factorization machines?
    Can it be used for clustering and/or tagging purposes?
    Thanks!

  3. Thanks for sharing the great article. Could you describe how to get the label vectors, not the word/sentence vectors? I want to get the labels’ vectors. Can you share your experience with me? Thanks so much!

  4. Great article. My 2 cents: you can now use the “-label” parameter for the prefix, to avoid the (frustrating) default label “__label__”.

    1. Thank you for the helpful tip, Loreto! I haven’t kept up with all the latest developments in the FastText codebase, so I appreciate the update 🙂

  5. Hi Stephen,

    The article was really useful. I need a clarification. Is there any option in FastText to append new training sets to the existing model?

    1. Hi Vijay,

      Glad the article was helpful for you. Let me make sure I understand your question: Are you asking if there is a way to train a model on one training corpus, then to subsequently update the weights in this trained model by training on a second corpus? If so, then my understanding is that this is not possible out-of-the-box, at present.

      Depending on your requirements, however, it may be easier to train just once on your first corpus, then infer all the out-of-vocabulary words from your second corpus by feeding a “queries.txt” file into fasttext, as explained by the FastText authors under the heading “Obtaining word vectors for out-of-vocabulary words” on their github page:

      https://github.com/facebook

      Hope this helps.

  6. Hi Stephen,

    Could you tell me the concept used by fastText in supervised training? I am wondering whether fastText uses CBOW or skipgram or some other model in supervised mode.

    1. Hi Anbu,

      Thanks for your good question.

      As you suspect, FastText uses neither skipgram (SG) nor CBOW learning in supervised mode, but something a little different. To answer your question in more detail, I’ll outline all three learning tasks: CBOW, SG and supervised. First I’ll say what they have in common, then I’ll say how they differ. To simplify the explanation, I will ignore the fact that FastText trains on vectors for (subword) character ngrams, like e.g. “pre” in the word “prediction”.

      I. Commonalities

      All three methods work by first:

      (i) Reading in the current line of text in the training corpus (including labels, in supervised mode) into a C++ vector of word indices named “line”.

      The two unsupervised methods (SG and CBOW) then:

      (ii) Iterate over the word indices of “line” (the outer loop), forming at each step a window about the given word, of radius chosen uniformly between 1 and the window size given by the user at the command line.

      (iii) Iterate over the word indices in each window (the inner loop, denoted by “for (int32_t c = -boundary; c <= boundary; c++) {…}”).

      The supervised method does not use windows, so steps (ii) and (iii) do not apply in the supervised case. However, all three methods (CBOW, SG and supervised):

      (iv) Perform the weight updates that induce the model to “learn” either inside the inner loop, or just after it.

      In all three cases, the updates in (iv) are made by calling the Model::update(input, target, lr) method. This method updates an “input”, which is a C++ vector of word indices, against a “target”, which is a single word index (the float “lr” denotes the learning rate, which doesn’t concern us here). More precisely, it updates the weight vectors associated to each index in “input” against the weight vectors associated to “target”, in such a way that the _average_ of the weight vectors of the indices in “input” moves closer to the weight vector for “target”.
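
      As a toy numpy illustration only (not the actual C++ code, and ignoring negative sampling and the loss computation), the effect of such an update is roughly the following:

      import numpy as np

      def toy_update(input_vecs, target_vec, lr):
          """input_vecs: float array of shape (n, dim); target_vec: float array of shape (dim,).
          Nudge the average of the input vectors and the target vector towards each other."""
          hidden = input_vecs.mean(axis=0)            # the averaged input representation
          input_vecs += lr * (target_vec - hidden)    # the same correction for every input row,
                                                      # so their average moves towards the target
          target_vec += lr * (hidden - target_vec)    # the target moves towards the average
          return input_vecs, target_vec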

      II. Differences

      The three learning tasks differ only in how they build the variables “input” and “target”.

      (a) In the case of CBOW, Model::update is called just after the inner loop, in other words only once per window. Here, “input” is defined to be a C++ vector of word indices called “bow” (short for “bag of words”), which collects all word indices in a window, in order, except the central word index — call this collection of word indices the “context”. The central word index is defined to be the “target”. As a result (I refer here back to the explanation in the above paragraph):

      CBOW learning moves the average of the word vectors in the context and the target word vector closer together.

      (b) In the SG case, Model::update is called inside the inner loop, so once per iteration. Here, “input” is defined to be a C++ vector of (subword) character ngrams for the current word in the window, and “target” is defined to be the central word in the window. However, since I’m ignoring character ngrams for simplicity, think of “input” as a C++ vector containing exactly one word index. Hence:

      SG learning moves each word vector in the context and the target word vector closer together, iteratively.

      Note: This means that SG learning is more expensive than CBOW by a factor of the average window size.

      (c) In the supervised case, there is no windowing, as mentioned above. Here, “input” is defined to be the C++ vector containing all word indices in the whole “line”, and “target” is defined to be a randomly chosen label index from the labels in the current “line”. Thus:

      Supervised learning moves the average of the word vectors in the current line and a randomly chosen label vector from the current line closer together.

      Hope this helps!

      1. Hi Stephen,

        Thank you for your reply. Could you also kindly clarify one more question?

        Can we reduce the size of a model? Is there any possibility available in fastText?

        I have trained on an “input.txt” file of 18 GB in supervised mode, and the model produced by fastText is 1.5 GB, which is very complicated to handle.

        Here are the constraints I used while creating the model:

        “./fasttext supervised -input input.txt -output modelname -dim 50 -wordNgrams 2 -ws 10 -epoch 10 -lr 1.0 -thread 10”

  7. Hi, Thank you for the superb article 🙂

    I am just wondering how to tackle bigrams such as “new york”, “machine learning” in fasttext? Any thoughts on that?
