This post describes a simple principle to split documents into coherent segments, using word embeddings. Then we present two implementations of it. Firstly, we describe a greedy algorithm, which has linear complexity and runtime in the order of typical preprocessing steps (like sentence splitting, count vectorising). Secondly, we present an algorithm that computes the optimal solution to the objective given by the principle, but is of quadratic complexity in the document lengths.
Hierarchical softmax is a more efficient way to train word embeddings compared to a regular softmax output layer. It has been shown that for language modeling the choice of tree affects the outcome significantly. In this blog post we describe an experiment to construct semantic trees and show how they can improve the quality of the learned embeddings in common word analogy and similarity tasks.
What kind of language do British parliamentarians use? We scraped, parsed and vectorised a sample of recent debates from the House of Commons. We then applied a k-means clustering algorithm to these vectors, and created a word cloud for each cluster.