Breaking documents into “chunks”, like sections and subsections, is easy for humans but surprisingly hard for computers. In this post we explain why that is and why it’s a valuable problem to solve, and we introduce our new solution.
During the past couple of years, we have been building tools to help teams working in expert fields find the information and expertise they need. These fields include law, engineering, consulting and medicine, and our aim was to craft an approach that was relevant for any profession where expert knowledge is needed to interpret information and make a decision. We wanted to make it possible for an expert to scale their expertise.
This post describes a simple principle for splitting documents into coherent segments using word embeddings. We then present two implementations of it. The first is a greedy algorithm, which has linear complexity and a runtime on the order of typical preprocessing steps (such as sentence splitting or count vectorising). The second computes the optimal solution to the objective given by the principle, but has quadratic complexity in the document length.
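To make the greedy idea concrete, here is a minimal sketch under stated assumptions. It scores a segment by the Euclidean norm of its summed word vectors, which is an assumed objective chosen for illustration and may differ from the one in the post. The names `seg_score` and `greedy_split` and the toy vectors are hypothetical. This naive version rescans every candidate position on each round, so it does not achieve the linear complexity mentioned above.

```python
import math


def seg_score(vectors):
    """Norm of the summed vectors: large when the vectors point the same way."""
    dims = len(vectors[0])
    total = [sum(v[d] for v in vectors) for d in range(dims)]
    return math.sqrt(sum(x * x for x in total))


def greedy_split(vectors, max_splits):
    """Repeatedly add the single split point that raises the total score most.

    A split at index i means a new segment starts at vectors[i].
    """
    splits = []
    for _ in range(max_splits):
        bounds = [0] + splits + [len(vectors)]
        best_gain, best_pos = 0.0, None
        for start, end in zip(bounds, bounds[1:]):
            whole = seg_score(vectors[start:end])
            for pos in range(start + 1, end):
                gain = (seg_score(vectors[start:pos])
                        + seg_score(vectors[pos:end])
                        - whole)
                if gain > best_gain:
                    best_gain, best_pos = gain, pos
        if best_pos is None:  # no split improves the score
            break
        splits = sorted(splits + [best_pos])
    return splits


# Toy 2-d "embeddings": three words on one topic, then two on another.
toy = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(greedy_split(toy, max_splits=1))
```

On the toy input the first split lands between the third and fourth vectors, where the direction of the embeddings changes.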
Hierarchical softmax is a more efficient way to train word embeddings than a regular softmax output layer. It has been shown that, for language modeling, the choice of tree affects the outcome significantly. In this blog post we describe an experiment to construct semantic trees and show how they can improve the quality of the learned embeddings in common word analogy and similarity tasks.
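As a rough illustration of why the tree matters, the sketch below computes word probabilities as a product of binary decisions along a path from the root to a leaf. The tree, the node vectors, and the names `PATHS`, `NODE_VECTORS` and `word_probability` are all made up for this example and are not taken from the experiment described in the post.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical toy tree over four words. Each path lists (inner_node, direction),
# with direction +1 for the left branch and -1 for the right branch.
PATHS = {
    "cat": [("root", +1), ("animals", +1)],
    "dog": [("root", +1), ("animals", -1)],
    "car": [("root", -1), ("vehicles", +1)],
    "bus": [("root", -1), ("vehicles", -1)],
}

# Hypothetical vectors for the inner nodes (in training these are learned).
NODE_VECTORS = {
    "root": [0.5, -0.2],
    "animals": [0.1, 0.3],
    "vehicles": [-0.4, 0.2],
}


def word_probability(word, hidden):
    """Multiply the branch probabilities along the word's path in the tree."""
    prob = 1.0
    for node, direction in PATHS[word]:
        dot = sum(h * w for h, w in zip(hidden, NODE_VECTORS[node]))
        prob *= sigmoid(direction * dot)
    return prob


hidden = [1.0, 2.0]  # stand-in for the context representation
probs = {word: word_probability(word, hidden) for word in PATHS}
print(probs)
```

Because sigmoid(x) + sigmoid(-x) = 1 at every inner node, the probabilities over all leaves sum to one, and each word needs only as many dot products as its path is long rather than one per vocabulary entry. Grouping related words under the same inner node, as in this toy tree, is the kind of choice a semantic tree makes.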
In this blog post we demonstrate how to generate a dataset for recommending Reddit posts based on semantic similarity. The Reddit API and the PRAW Python library are used to extract data from the AskScience subreddit. The posts are then analysed using LIP and built into a Chrome extension for searching similar content.
We have many different ways of delivering the Lateral API to clients who would like to install it in their own environment. One of those is as an Azure VHD for deployment to Azure VMs. In this post I will cover how to create a VHD that is fully compatible with Azure from an Ubuntu Cloud Image base.
Facebook Research’s new fastText library can learn the meaning of metadata from the text it labels. By labelling documents with the users who read them, we used fastText to hack together a “hybrid recommender” system, able to recommend documents to users using both collaborative information (“people who read this also liked that”) and whether the text in the documents is thematically similar to things they read previously. Early signs are it performs quite well, so we’ll continue to experiment with it.
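One way to picture the labelling step: fastText's supervised mode reads plain-text lines in which labels carry the `__label__` prefix by default. The sketch below formats a document and its readers into such a line. The helper name `to_fasttext_line` and the user IDs are hypothetical, and this is only an assumption about how the training data could be prepared, not the setup used in the post.

```python
def to_fasttext_line(text, user_ids, label_prefix="__label__"):
    """Format one document as a fastText training line, labelled by its readers."""
    labels = " ".join(f"{label_prefix}{uid}" for uid in user_ids)
    # fastText expects one example per line, so collapse all whitespace.
    clean = " ".join(text.split())
    return f"{labels} {clean}"


line = to_fasttext_line("Deep  learning\nfor search", ["u1", "u2"])
print(line)
```

Writing one such line per document gives a file that can be passed to fastText's supervised training, so that each user label is learned alongside the words of the documents that user read.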
One technique we use to visualise how Lateral recommendations would look and work on a website is a Chrome extension that inserts the recommendations at load time. In this blog post, I will create a Chrome extension that modifies this blog by setting a custom background and changing the HTML.