Text Segmentation using Word Embeddings

This post describes a simple principle to split documents into coherent segments, using word embeddings. Then we present two implementations of it. Firstly, we describe a greedy algorithm, which has linear complexity  and runtime in the order of typical preprocessing steps (like sentence splitting, count vectorising). Secondly, we present an algorithm that computes the optimal solution to the objective given by the principle, but is of quadratic complexity in the document lengths.