Semantic trees for training word embeddings with hierarchical softmax

Hierarchical softmax is a more efficient way to train word embeddings than a regular softmax output layer. It has been shown that for language modeling the choice of tree affects the outcome significantly. In this blog post we describe an experiment to construct semantic trees and show how they can improve the quality of the learned embeddings on common word analogy and similarity tasks.
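The core idea behind hierarchical softmax can be sketched in a few lines: each word is a leaf of a binary tree, and its probability is the product of sigmoids over the inner nodes on the root-to-leaf path. The toy node vectors and path below are invented for illustration, not trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hs_probability(context_vec, path):
    """Probability of reaching one leaf (word) in a hierarchical softmax.

    path: list of (inner_node_vector, direction) pairs along the
    root-to-leaf walk, where direction is +1 for a left turn and
    -1 for a right turn.
    """
    p = 1.0
    for node_vec, direction in path:
        # Each inner node contributes a binary left/right decision.
        p *= sigmoid(direction * dot(context_vec, node_vec))
    return p

# Toy example: a word whose leaf sits two levels deep.
context = [0.5, -0.2]
path = [([0.1, 0.3], +1), ([-0.4, 0.2], -1)]
print(hs_probability(context, path))
```

Because sigmoid(x) + sigmoid(-x) = 1, the left and right probabilities at every inner node sum to one, so leaf probabilities form a valid distribution without normalizing over the whole vocabulary. The choice of tree decides which words share inner nodes, which is why a semantic tree can matter.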

Reddit science discussions as a dataset

In this blog post we demonstrate how to generate a dataset for recommending Reddit posts based on semantic similarity. The Reddit API and the PRAW Python library are used to extract data from the AskScience subreddit. The posts are then analysed with NLP techniques, and the results are built into a Chrome extension for finding similar content.
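The similarity-search step can be sketched with plain bag-of-words cosine similarity (a simpler stand-in for whatever representation the full pipeline uses); the example posts and function names below are invented for illustration:

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors
    # represented as Counter objects.
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    denom = math.sqrt(sum(v * v for v in a.values())) * \
            math.sqrt(sum(v * v for v in b.values()))
    return num / denom if denom else 0.0

def most_similar(query, posts):
    # Return the post whose word overlap with the query is strongest.
    q = Counter(query.lower().split())
    return max(posts, key=lambda p: cosine(q, Counter(p.lower().split())))

posts = [
    "why is the sky blue during the day",
    "how do vaccines train the immune system",
    "what causes tides in the ocean",
]
print(most_similar("how does the immune system work", posts))
# → "how do vaccines train the immune system"
```

A real recommender would replace the word-count vectors with learned embeddings (such as those from the hierarchical-softmax post above), but the ranking logic stays the same: score every candidate against the query and return the best match.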