Facebook Research’s new fastText library can learn the meaning of metadata from the text it labels. By labelling documents with the users who read them, we used fastText to hack together a “hybrid recommender” system, able to recommend documents to users using both collaborative information (“people who read this also liked that”) and whether the text in the documents is thematically similar to what they have read previously. Early signs are that it performs quite well, so we’ll continue to experiment with it.
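To make the labelling idea concrete, here is a minimal sketch of how documents and their readers could be turned into training lines for fastText's supervised mode, which expects each label to carry the `__label__` prefix by default. The sample documents, the user ids and the helper name `to_fasttext_line` are hypothetical, chosen only for illustration; this is not necessarily how the original system prepared its data.

```python
def to_fasttext_line(text, reader_ids, label_prefix="__label__"):
    """Format one document as a fastText supervised training line.

    Each reader becomes a label, so the model sees the document text
    together with the users who read it. Assumes user ids contain no
    whitespace (hypothetical constraint for this sketch).
    """
    labels = " ".join(f"{label_prefix}{uid}" for uid in sorted(reader_ids))
    # fastText reads one example per line, so collapse newlines and extra spaces.
    clean_text = " ".join(text.split())
    return f"{labels} {clean_text}" if labels else clean_text


# Hypothetical sample data: document text plus the users who read it.
documents = [
    {"text": "Word embeddings for  recommendation\nsystems", "readers": {"bob", "alice"}},
    {"text": "Parliamentary debate transcripts", "readers": {"carol"}},
]

training_lines = [to_fasttext_line(d["text"], d["readers"]) for d in documents]

for line in training_lines:
    print(line)
```

Written to a file, lines like these could be used to train a supervised fastText model; recommending documents to a user then amounts to asking which texts the model most strongly associates with that user's label.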
What kind of language do British parliamentarians use? We scraped, parsed and vectorised a sample of recent debates from the House of Commons. We then applied a k-means clustering algorithm to these vectors, and created a word cloud for each cluster.
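The vectorise, cluster and summarise steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the teaser does not say how the debates were vectorised, so normalised bag-of-words counts are used here, and the sample sentences and the function names `vectorise`, `kmeans` and `top_words` are hypothetical. The heaviest words in each centroid are the ones a word cloud would draw largest.

```python
import math
import random
from collections import Counter


def vectorise(texts):
    """Turn texts into unit-length bag-of-words vectors (an assumed representation)."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means: assign each vector to its nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        for i, v in enumerate(vectors):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


def top_words(vocab, centroid, n=3):
    """Words with the largest centroid weight, i.e. candidates for the word cloud."""
    ranked = sorted(zip(centroid, vocab), reverse=True)
    return [word for _, word in ranked[:n]]


# Hypothetical stand-ins for scraped debate speeches.
texts = [
    "the honourable member asked about health funding",
    "the honourable member asked about health funding",
    "the minister spoke on rail and road transport",
    "rail transport and road transport need investment",
]

vocab, vectors = vectorise(texts)
assignments, centroids = kmeans(vectors, k=2)

for cluster_id, centroid in enumerate(centroids):
    print(cluster_id, top_words(vocab, centroid))
```

On a real corpus one would typically remove stop words and weight terms (for example with TF-IDF) before clustering, otherwise words like "the" dominate every cluster.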
Previously we’ve written about how machines can learn meaning. One exciting consequence of this approach is that they can also learn new languages very quickly. All you need is enough text data. Wikipedia offers a great starting point, and partnering with content providers enables us to quickly gather additional data. We […]
The arXiv is a repository of over 1 million preprints in physics, mathematics and computer science. It is truly open access, and the preprints are an excellent dataset for testing out all sorts of language modelling / machine learning prototypes.
Computers consist of on/off switches and process meaningless symbols. So how is it that we can hope that computers might understand the meaning of words, products, actions and documents? If most of us consider machine learning to be magic, it is because we don’t yet have an answer to this question. Here, I’ll provide an answer in the context of machines learning the meaning of words. But as we’ll see, the approach is the same everywhere.
If a machine is to learn about humans from Wikipedia, it must experience the corpus as a human sees it and ignore the overwhelming mass of robot-generated pages that no human ever reads. We provide a cleaned corpus, along with a Wikipedia recommendation API derived from it.
By ignoring citation graphs and keywords, you can discover papers and researchers you never knew existed. Check it out here (it covers arXiv papers in ML, CS, math and physics).
Stack Overflow is a programming Q&A website with over 4M users and 9M posts. It is one of many such sites, covering a variety of topics, run by Stack Exchange. Stack Overflow is highly successful at gamifying the answering of questions through a reputation system based on up-votes and bounties. Users can use the reputation points and badges they win to support job applications, and employers can use reputation to find the best employees. So answering many questions and earning tons of points is something that users take very seriously. But how can I find questions that I can answer?