Leveraging Machine Learning to Discover Research

By ignoring citation graphs and keywords, you can discover papers and researchers you never knew existed.  Check it out here (it covers arXiv papers: ML, CS, math & physics).

What’s the arXiv?

For the technically inclined, the arXiv is a treasure trove of recent research in machine learning, computer science, mathematics and physics.  It makes available the full text of over 1M preprints.  I first met the arXiv as a mathematician, and it was (and still is) the first place to look for new maths research.  Papers appear on the arXiv long before they are officially published, and are often revised there after publication.

Content discovery on the arXiv

The arXiv is very effective as a repository, but it desperately lacks any content discovery functionality (this is not a critique: as a repository, the arXiv shows how a community can simply bypass research paywalls).  In any case, when I started applying machine learning to content recommendation, the arXiv was at the top of my hit list.  So to show off what we could do here at Lateral, we built a document discovery solution for the arXiv.

Here is how it works: once you find a paper you are interested in, the papers that are thematically similar to it are listed directly below it.  You can search by title and author.  Alternatively, press the “paragraph button” next to the search box and paste in a whole chunk of text, e.g. the abstract of an article that interests you.  Star the articles you like, and keep cruising for new papers.
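
To make "thematically similar" a little more concrete, here is a minimal sketch of one common way to rank papers: represent each paper as a vector and sort by cosine similarity.  This is an illustration only, not a description of our document model; the vectors, the paper IDs and the function names `cosine` and `most_similar` are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_id, vectors, top_k=3):
    """Return the IDs of the top_k papers closest to query_id, best match first."""
    query = vectors[query_id]
    scored = [
        (cosine(query, vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != query_id  # never recommend the paper to itself
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy, hand-made vectors (assumption): a real system would get these from a
# trained document model, not from numbers typed in by hand.
vectors = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.8, 0.2, 0.1],
    "paper_c": [0.0, 0.1, 0.9],
}

print(most_similar("paper_a", vectors, top_k=2))  # ['paper_b', 'paper_c']
```

Nothing in the ranking looks at who cites whom; only the vectors matter.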

Thoughts / dog-fooding

I shamelessly lobbied my coworkers to build this tool together, and I did so because I wanted it for myself.  Using it, I have found many papers (and researchers!) that I had never heard of, working on things similar to my own work.  In all cases, the new paper or researcher had remained undiscovered for one of two reasons:

  • the researchers belonged to a parallel research clique that, for one reason or another, didn’t cite and weren’t cited by the researchers I was familiar with; and
  • the researchers used different language, e.g. they were physicists talking about representation theory as opposed to representation theorists talking about physics.

Our document model is purposefully blind to both social cliques and keywords, so it breaks through both of these barriers.  It sees only ideas!

Do-it-yourself

We place our content recommender at your disposal and you can do what you want with it.  It understands any English text that you throw at it.  You could build a similar tool yourself, for any corpus you want.  Knock yourselves out.
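
If you want a feel for what building such a tool over your own corpus involves, the sketch below shows the bare skeleton: embed every document once, then embed whatever text the user pastes in and return the nearest documents.  Everything here is hypothetical.  `CorpusIndex` is a name invented for this post, and `toy_embed` (hashed character trigrams) is only a placeholder so that the example runs; it is not our model, and unlike a real document model it is still tied to surface wording.

```python
import math
import zlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Placeholder embedding (assumption): hashed character trigrams, unit length."""
    vec = [0.0] * dim
    lowered = text.lower()
    for i in range(len(lowered) - 2):
        trigram = lowered[i:i + 3]
        # crc32 gives the same bucket on every run, unlike the built-in hash()
        bucket = zlib.crc32(trigram.encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


class CorpusIndex:
    """In-memory index: stores one vector per document, searches by free text."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self.embed = embed
        self.vectors: Dict[str, Vector] = {}

    def add(self, doc_id: str, text: str) -> None:
        self.vectors[doc_id] = self.embed(text)

    def search(self, text: str, top_k: int = 5) -> List[Tuple[str, float]]:
        query = self.embed(text)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(query, vec)), doc_id)
            for doc_id, vec in self.vectors.items()
        ]
        scored.sort(reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:top_k]]


index = CorpusIndex(toy_embed)
index.add("rep-theory", "Representation theory of Lie algebras and quantum groups")
index.add("vision", "Convolutional neural networks for image classification")
index.add("optimisation", "Stochastic gradient descent convergence rates")

results = index.search("Representation theory of Lie algebras and quantum groups", top_k=2)
```

To try this with a real model, pass a different function as `embed`; the rest of the index does not need to change.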

To do

The interface is (we hope) functional but stripped back.  There are some problems rendering some of the LaTeX.  It would be great to have the ability to restrict the search by date, or to link to citations.  All these things can be done!  Please get in touch with your other suggestions and ideas.

The Author
I am in charge of machine learning and data science at Lateral.