Text Segmentation using Word Embeddings

This post describes a simple principle for splitting documents into coherent segments using word embeddings, and then presents two implementations of it. Firstly, we describe a greedy algorithm with linear complexity, whose runtime is on the order of typical preprocessing steps (such as sentence splitting or count vectorising). Secondly, we present an algorithm that computes the optimal solution to the objective given by the principle, but has quadratic complexity in the document length.
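
To make the greedy idea concrete, here is a minimal sketch (not the post's exact objective): each sentence is represented by the sum of its word vectors, and a new segment is started whenever the next sentence no longer points in a similar direction to the running segment vector. The similarity measure and threshold are assumptions for illustration.

```python
import numpy as np

def greedy_segment(sentence_vectors, threshold=0.3):
    """Greedily split a sequence of sentence vectors into coherent segments.

    A new segment is opened whenever the next sentence's vector has low
    cosine similarity to the running sum of the current segment's vectors.
    Illustrative only; threshold and similarity measure are assumptions.
    """
    segments = []
    current = [0]
    segment_vec = np.asarray(sentence_vectors[0], dtype=float).copy()
    for i, vec in enumerate(sentence_vectors[1:], start=1):
        vec = np.asarray(vec, dtype=float)
        sim = segment_vec @ vec / (
            np.linalg.norm(segment_vec) * np.linalg.norm(vec) + 1e-12
        )
        if sim < threshold:
            segments.append(current)   # close the current segment
            current = [i]
            segment_vec = vec.copy()
        else:
            current.append(i)          # extend the current segment
            segment_vec += vec
    segments.append(current)
    return segments                    # lists of sentence indices per segment
```

Each sentence is visited once, which is where the linear runtime of the greedy variant comes from.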

Semantic trees for training word embeddings with hierarchical softmax

Hierarchical softmax is a more efficient way to train word embeddings than a regular softmax output layer. It has been shown that, for language modelling, the choice of tree can affect the outcome significantly. In this blog post we describe an experiment to construct semantic trees, and show how they can improve the quality of the learned embeddings on common word-analogy and similarity tasks.
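
For reference, this is what the hierarchical-softmax baseline looks like in gensim (version 4 or later assumed): the library builds a Huffman tree over the vocabulary, which is exactly the tree the post proposes to replace with a semantically constructed one. Building the semantic tree itself is not shown here; the corpus below is a toy stand-in.

```python
from gensim.models import Word2Vec

# Toy corpus; in practice this would be a large tokenised text collection.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
]

# hs=1, negative=0 switches training to hierarchical softmax, using the
# Huffman tree gensim constructs over the vocabulary by default.
model = Word2Vec(
    sentences,
    vector_size=100,
    window=5,
    min_count=1,
    hs=1,
    negative=0,
    epochs=10,
)

print(model.wv.most_similar("cat", topn=3))
```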

A fastText-based hybrid recommender

Facebook Research’s new fastText library can learn the meaning of metadata from the text it labels. By labelling documents with the users who read them, we used fastText to hack together a “hybrid recommender” system that recommends documents to users based both on collaborative information (“people who read this also liked that”) and on content, i.e. whether a document’s text is thematically similar to things the user has read previously. Early signs are that it performs quite well, so we’ll continue to experiment with it.
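
The trick is to phrase recommendation as supervised text classification: each training line contains a document's text prefixed with one `__label__` tag per user who read it, and fastText then learns user vectors alongside word vectors. The sketch below uses the official fastText Python bindings; the file name, label scheme and hyperparameters are assumptions for illustration.

```python
import fasttext

# reads.txt (hypothetical) holds one document per line, e.g.:
# __label__user_42 __label__user_7 text of the document ...
model = fasttext.train_supervised(
    input="reads.txt",
    dim=100,
    epoch=25,
    wordNgrams=2,
    loss="ova",   # independent one-vs-all losses, one per user label
)

# Recommend by predicting which users are likely to read a new document.
labels, scores = model.predict("a preprint on word embeddings", k=5)
print(list(zip(labels, scores)))
```

Because fastText composes document vectors from word (and subword) vectors, a brand-new document with no reading history still gets sensible user predictions, which is what makes the recommender “hybrid”.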

Clustering debates from UK politicians

What kind of language do British parliamentarians use? We scraped, parsed and vectorised a sample of recent debates from the House of Commons. We then applied a k-means clustering algorithm to these vectors, and created a word cloud for each cluster.
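
A rough sketch of the clustering step, assuming the debates have already been scraped into plain strings (tf-idf as the vectoriser and the cluster count are assumptions, and the toy transcripts below stand in for real House of Commons speeches):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from wordcloud import WordCloud

# Toy stand-ins for scraped debate transcripts.
debates = [
    "the national health service needs more funding for hospitals",
    "we must invest in schools teachers and education for our children",
    "hospital waiting lists and nhs staffing remain a serious concern",
    "school budgets and teacher pay are falling behind inflation",
]

# Vectorise the debates and cluster the resulting vectors.
vectoriser = TfidfVectorizer(stop_words="english")
X = vectoriser.fit_transform(debates)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Build one word cloud per cluster from the debates assigned to it.
for cluster_id in range(km.n_clusters):
    text = " ".join(d for d, c in zip(debates, km.labels_) if c == cluster_id)
    WordCloud(width=800, height=400).generate(text).to_file(
        f"cluster_{cluster_id}.png"
    )
```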

Teaching machines new languages

Previously we’ve written about how machines can learn meaning. One of the exciting opportunities of this approach is that it also means they can learn new languages very quickly. We have recently started working on supporting new languages, and thought we would share some initial impressions here.

The arXiv as Dataset

The arXiv is a repository of over 1 million preprints in physics, mathematics and computer science. It is truly open access, and the preprints are an excellent dataset for testing out all sorts of language modelling / machine learning prototypes.

How do machines learn meaning?

Computers consist of on/off switches and process meaningless symbols. So how is it that we can hope that computers might understand the meaning of words, products, actions and documents? If most of us consider machine learning to be magic, it is because we don’t yet have an answer to this question. Here, I’ll provide an answer in the context of machines learning the meaning of words. But as we’ll see, the approach is the same everywhere.

The Unknown Perils of Mining Wikipedia

If a machine is to learn about humans from Wikipedia, it must experience the corpus as a human sees it and ignore the overwhelming mass of robot-generated pages that no human ever reads. We provide a cleaned corpus, as well as a Wikipedia recommendation API derived from it.