In this blog post we demonstrate how to generate a dataset for recommending Reddit posts based on semantic similarity. We use the Reddit API and the PRAW Python library to extract posts from the AskScience subreddit. The posts are then analysed using NLP, and the results are built into a Chrome extension for finding similar content.
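To give a rough idea of what similarity-based recommendation looks like, here is a minimal sketch that ranks posts by the cosine similarity of their word-count vectors. This is a simple stand-in, not the semantic model used in the post: the function names (`vectorise`, `cosine`, `most_similar`) are hypothetical and the post titles are made up for illustration.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Lowercase the text, split it into word tokens and count them."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, posts, top_n=3):
    """Return the top_n posts ranked by similarity to the query."""
    q = vectorise(query)
    scored = [(cosine(q, vectorise(p)), p) for p in posts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [post for _, post in scored[:top_n]]


# Made-up titles, for illustration only.
posts = [
    "Why is the sky blue during the day?",
    "How do vaccines train the immune system?",
    "Why does the sky turn red at sunset?",
]

print(most_similar("red sky at sunset", posts, top_n=1))
```

Word counts only capture shared vocabulary, so a real recommender would likely swap `vectorise` for something that captures meaning, such as TF-IDF weights or sentence embeddings.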
What kind of language do British parliamentarians use? We scraped, parsed and vectorised a sample of recent debates from the House of Commons. We then applied a k-means clustering algorithm to these vectors, and created a word cloud for each cluster.
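The clustering step can be sketched in a few lines of plain Python. This is a bare-bones version of k-means (Lloyd's algorithm), not the code from the post: the function name `kmeans`, the choice to start the centroids at the first `k` vectors, and the toy two-dimensional vectors are all assumptions made to keep the example small and deterministic.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(vectors, k, iterations=20):
    """Cluster vectors into k groups; returns (labels, centroids)."""
    # Assumption: seed the centroids with the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy vectors standing in for vectorised speeches (made up).
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels, centroids = kmeans(vectors, k=2)
print(labels)
```

In practice the vectors would have one dimension per vocabulary term, and the words with the largest weights in each centroid are natural candidates for that cluster's word cloud.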
The arXiv is a repository of over 1 million preprints in physics, mathematics and computer science. It is truly open access, and the preprints are an excellent dataset for testing out all sorts of language modelling / machine learning prototypes.