The arXiv is a repository of over one million preprints in physics, mathematics, and computer science. It is truly open access, and the preprints make an excellent dataset for testing all sorts of language-modelling and machine-learning prototypes.
Computers consist of on/off switches and process meaningless symbols. How, then, can we hope that computers might understand the meaning of words, products, actions, and documents? If most of us consider machine learning to be magic, it is because we don’t yet have an answer to this question. Here I’ll provide an answer in the context of machines learning the meaning of words; as we’ll see, the same approach applies everywhere.
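To make this concrete, here is a minimal sketch of the distributional idea behind learning word meaning: a word is represented by the contexts it appears in, so words that occur in similar contexts end up with similar vectors. The toy corpus and the window size are assumptions chosen purely for illustration.

```python
from collections import Counter
from math import sqrt

# A tiny hypothetical corpus (an assumption for illustration).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

window = 2  # context window size (an assumption)
vectors = {}  # word -> Counter of co-occurring context words

for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        # Collect up to `window` words on each side of position i.
        ctx = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        vectors.setdefault(w, Counter()).update(ctx)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "cat" and "dog" appear in similar contexts, so their vectors are close.
print(cosine(vectors["cat"], vectors["dog"]))
```

Real systems learn dense embeddings from far larger corpora, but the principle is the same: meaning is inferred from patterns of co-occurrence, not from the symbols themselves.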
If a machine is to learn about humans from Wikipedia, it must experience the corpus as a human sees it, ignoring the overwhelming mass of robot-generated pages that no human ever reads. We provide a cleaned corpus, along with a Wikipedia recommendation API derived from it.
Stack Overflow is a programming Q&A website with over 4M users and 9M posts. It is one of many such sites on a variety of topics run by StackExchange. Stack Overflow is highly successful at gamifying the answering of questions through a reputation system based on up-votes and bounties. Users can cite the reputation points and badges they earn in job applications, and employers can use reputation to find the best candidates. So answering many questions and earning tons of points is something that users take very seriously. But how can I find questions that I can answer?
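One simple way to surface answerable questions is to score each open question by how well its tags overlap the tags of questions the user has already answered. The data below is entirely hypothetical; in practice the tags would come from the Stack Exchange data dump or API. This is a sketch of the idea, not a full recommender.

```python
from collections import Counter

# Tag sets from the user's past answers (hypothetical data).
past_answers = [
    {"python", "pandas"},
    {"python", "numpy"},
    {"python", "regex"},
]

# Count how often the user has answered under each tag.
answered_tags = Counter()
for tags in past_answers:
    answered_tags.update(tags)

def score(question_tags):
    """Sum the user's familiarity counts over the question's tags."""
    return sum(answered_tags[t] for t in question_tags)

# Open questions and their tags (hypothetical data).
open_questions = {
    "q1": {"python", "pandas", "dataframe"},
    "q2": {"java", "spring"},
    "q3": {"python", "regex"},
}

# Rank open questions by how familiar their tags are to this user.
ranked = sorted(open_questions, key=lambda q: score(open_questions[q]), reverse=True)
print(ranked)
```

A production system would weight tags by recency and answer quality, but even this crude familiarity score pushes the Java question to the bottom of the list for a Python-focused user.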