May 22, 2015

Leveraging machine learning to discover research

By ignoring citation graphs and keywords, you can discover papers and researchers you never knew existed. Check it out here (it covers arXiv papers in ML, CS, math and physics).

What's the arXiv?

For the technically inclined, the arXiv is a treasure trove of recent research in machine learning, computer science, mathematics and physics. It makes available the full text of over 1M preprints. I first met the arXiv as a mathematician, and it was (and still is) the first place to look for new maths research. Papers appear on the arXiv long before they are officially published, and are often revised there after publication.

Content discovery on the arXiv

The arXiv is very effective as a repository, but it desperately lacks any content discovery functionality (this is not a critique: as a repository, the arXiv shows how a community can simply bypass research paywalls). In any case, when I started applying machine learning to content recommendation, the arXiv was at the top of my hit list. So to show off what we could do here at Lateral, we built a document discovery solution for the arXiv.

Here is how it works: once you find one paper you are interested in, all the papers that are thematically similar are listed directly below it. You can search by title and author, or press the "paragraph button" next to the search box to search by pasting in a whole chunk of text, e.g. an abstract of an article that interests you. Star the articles you like, and keep cruising for new papers.

Thoughts / dog-fooding

I shamelessly lobbied my coworkers that we should build this tool together, and I did so because I wanted it for myself. Using it, I have found many papers (and researchers!) that I had never heard of, working on topics similar to my own. In all cases, the new paper or researcher had remained undiscovered for one of two reasons:

the researchers belonged to a parallel research clique that, for one reason or another, neither cited nor was cited by the researchers I was familiar with; and
the researchers used different language, e.g. they were physicists talking about representation theory as opposed to representation theorists talking about physics.

Our document model is purposefully blind to both social cliques and keywords -- it breaks through both these barriers. It sees only ideas!

Do-it-yourself

We place our content recommender at your disposal and you can do what you want with it. It understands any English text that you throw at it. You could build a similar tool yourself, for any corpus you want. Knock yourselves out.
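If you would rather roll your own first, here is a minimal sketch of the ranking step: turn every document into a vector, then list the nearest neighbours of the one you are reading. To be clear, this is not how our recommender works. It uses plain TF-IDF word weights as a stand-in, which is exactly the kind of keyword matching our model avoids. The function names (`tfidf_vectors`, `most_similar`) and the toy corpus are made up for illustration. The idea is that you swap the vectoriser for a learned document embedding and leave the ranking code alone.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document.

    Stand-in only: a learned document embedding would replace this step.
    """
    tokenized = [tokenize(d) for d in docs]
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, docs, top_k=3):
    """Return (index, score) pairs for the documents closest to the query."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_index]
    scored = [
        (cosine(query, vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:top_k]]


# Hypothetical toy corpus of paper titles.
corpus = [
    "Representation theory of Lie algebras and quantum groups",
    "Quantum groups and the representation theory of Lie algebras in physics",
    "Convolutional neural networks for image classification",
    "Deep neural networks for image recognition",
]
results = most_similar(0, corpus, top_k=2)
```

On this toy corpus the first title's nearest neighbour is the second one, because they share most of their words. A physicist and a representation theorist describing the same idea in different vocabulary would slip straight past this sketch, which is the gap a learned document model is meant to close.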

To do

The interface is (we hope) functional but stripped back. There are some problems rendering some of the LaTeX. It would be great to have the ability to restrict the search by date, or to link to citations. All these things can be done! Please get in touch with your other suggestions and ideas.


