Breaking documents into “chunks”, like sections and subsections, is easy for humans but surprisingly hard for computers. In this post we explain why that is and why the problem is worth solving, and we introduce our new solution.
Facebook Research’s new fastText library can learn the meaning of metadata from the text it labels. By labelling documents with the users who read them, we used fastText to hack together a “hybrid recommender” system. It recommends documents to users based on both collaborative information (“people who read this also liked that”) and whether the text in the documents is thematically similar to things they read previously. Early signs are that it performs quite well, so we’ll continue to experiment with it.
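As a rough illustration of what “labelling documents with the users who read them” could look like, here is a minimal sketch that writes documents in fastText’s supervised text format, where each line carries one or more labels prefixed with `__label__`, followed by the text. The function name `to_fasttext_lines`, the document ids, the user ids and the sample texts are all hypothetical; the post itself may prepare its data differently.

```python
def to_fasttext_lines(documents, readers_by_doc, label_prefix="__label__"):
    """Yield one fastText training line per document that has readers.

    documents:      dict of doc id -> text (hypothetical input shape)
    readers_by_doc: dict of doc id -> iterable of user ids, assumed to
                    contain no whitespace
    """
    for doc_id, text in documents.items():
        readers = sorted(readers_by_doc.get(doc_id, []))
        if not readers:
            continue  # a document nobody read gives no labelled example
        labels = " ".join(f"{label_prefix}{user}" for user in readers)
        # fastText reads one example per line, so collapse internal newlines.
        flat_text = " ".join(text.split())
        yield f"{labels} {flat_text}"


# Invented sample data, purely for illustration.
documents = {
    "doc1": "Quarterly budget forecast\nfor the finance team",
    "doc2": "Onboarding guide for new engineers",
    "doc3": "Draft nobody has opened yet",
}
readers_by_doc = {"doc1": ["user_b", "user_a"], "doc2": ["user_a"]}

lines = list(to_fasttext_lines(documents, readers_by_doc))
for line in lines:
    print(line)
```

Written to a file, lines like these are the kind of input that the fastText Python package’s `train_supervised` function accepts, with each user id treated as a label to predict from the text.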
What kind of language do British parliamentarians use? We scraped, parsed and vectorised a sample of recent debates from the House of Commons. We then applied a k-means clustering algorithm to these vectors, and created a word cloud for each cluster.
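The vectorise-then-cluster step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the excerpt does not say which vectoriser was used, so a length-normalised bag-of-words stands in for it here, the k-means is a bare-bones Lloyd’s algorithm, and the four sample “speeches” are invented. The names `vectorise` and `kmeans` are hypothetical, and scraping and word clouds are left out.

```python
import math
import random
from collections import Counter


def vectorise(texts):
    """Turn each text into a length-normalised bag-of-words vector.

    Assumption: a stand-in for whatever vectoriser the post actually used.
    """
    vocab = sorted({word for text in texts for word in text.lower().split()})
    vectors = []
    for text in texts:
        counts = Counter(text.lower().split())
        vec = [counts[word] for word in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns a cluster index for each vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Invented sample "speeches", purely for illustration.
speeches = [
    "tax budget tax",
    "budget tax spending",
    "hospital nurses health",
    "health hospital doctors",
]
vocab, vectors = vectorise(speeches)
labels = kmeans(vectors, k=2)
print(labels)
```

Grouping the words of each cluster’s speeches and counting them would give the raw frequencies that a word cloud is drawn from.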
For the last few months, we’ve been doing occasional work on an approximate nearest neighbours (ANN) vector search tool, written in Python. It’s still not finished and there are many rough edges, but it comes with a working DynamoDB adaptor and hence operates out-of-memory, one of our main requirements. On the downside, it isn’t as fast […]