The arXiv as Dataset

The arXiv is a repository of over 1 million preprints in physics, mathematics and computer science. It is truly open access, and the preprints are an excellent dataset for testing out all sorts of language modelling / machine learning prototypes.

What’s available?

Full-text and article metadata are published in two different ways.

  • The preprint metadata (title, abstract, authors, categories) is published via the OAI protocol for metadata harvesting (OAI-PMH) and via the arXiv API.
  • The full text of all preprints is made available in a huge data dump on S3, either as PDFs or as TeX source files.

I’ve found the preprint metadata much easier to work with, being smaller in size, cleaner and more frequently updated. Using the OAI-PMH interface, we fetch new abstracts every day.
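To give a feel for what a daily fetch involves, here is a minimal sketch that builds an OAI-PMH `ListRecords` request. The verb and the `metadataPrefix`, `from`, `until` and `set` parameters come from the OAI-PMH protocol itself. The base URL and the `math` set name are assumptions on my part (the endpoint has moved over the years), so check the arXiv help pages before relying on them. The helper names `list_records_url` and `fetch` are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the arXiv OAI-PMH endpoint. Verify the current URL in the arXiv docs.
ARXIV_OAI_BASE = "http://export.arxiv.org/oai2"


def list_records_url(from_date=None, until_date=None, set_spec=None,
                     metadata_prefix="oai_dc", base_url=ARXIV_OAI_BASE):
    """Build an OAI-PMH ListRecords URL, optionally restricted by date and set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # YYYY-MM-DD, inclusive lower bound
    if until_date:
        params["until"] = until_date    # YYYY-MM-DD, inclusive upper bound
    if set_spec:
        params["set"] = set_spec        # e.g. "math" (assumed set name)
    return base_url + "?" + urlencode(params)


def fetch(url):
    """Download one page of results as text (performs a network request)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

Calling `list_records_url(from_date="2015-06-01", set_spec="math")` gives a URL you can pass to `fetch`. Large result sets come back in pages linked by a resumption token, which this sketch does not handle.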

The arXiv as a dataset

There are many tasks for which the arXiv is an ideal dataset. You could use the tags (MSC categories) to train a tagger, for instance, or test out your ideas for summarisation or keyword extraction. We feed the abstracts into our content recommender to provide a way to browse the arXiv conceptually: when you read an abstract, articles with conceptually related abstracts are surfaced automatically (see earlier post). The arXiv dataset often turns up in the language modelling literature as well, including in a recent paper co-authored by arXiv founder Paul Ginsparg and Alexander Alemi.


I had never heard of OAI-PMH before I wanted to work with arXiv data. It must have been popular at some stage, because there is a very long list of institutions that publish via OAI-PMH. However, to my knowledge most are too small to be interesting, being e.g. the ePrint server of such-and-such a university. The big three seem to be the arXiv, CERN and PubMed Central.

All OAI-PMH publishers must serve “Dublin Core”, an XML-based format, which looks like this:

<oai_dc:dc xmlns:oai_dc="" xmlns:dc="" xmlns:xsi="" xmlns="" xsi:schemaLocation="">
 <dc:title>The Szemeredi-Trotter Theorem in the Complex Plane</dc:title>
 <dc:creator>Toth, Csaba D.</dc:creator>
 <dc:subject>Mathematics - Combinatorics</dc:subject>
 <dc:subject>05B25, 11T99</dc:subject>
 <dc:description>  It is shown that $n$ points and $e$ lines in the complex Euclidean plane
${\mathbb C}^2$ determine $O(n^{2/3}e^{2/3}+n+e)$ point-line incidences. This
bound is the best possible, and it generalizes the celebrated theorem by
Szemer\'edi and Trotter about point-line incidences in the real Euclidean plane
${\mathbb R}^2$.
 </dc:description>
 <dc:description>Comment: 24 pages, 5 figures, to appear in Combinatorica</dc:description>
 <dc:identifier>Combinatorica 35 (1) (2015), 95-126</dc:identifier>
</oai_dc:dc>
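A record like the one above can be pulled apart with a few lines of Python. This is only a sketch, using the standard library's `xml.etree.ElementTree`. The namespace URIs are the standard OAI Dublin Core ones, written out in full here. Treating a description that starts with `Comment:` as the arXiv comment field rather than the abstract is an assumption based on this one sample, and `parse_record` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for OAI Dublin Core.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# The sample record from above, with namespaces filled in (raw string keeps the TeX intact).
SAMPLE = r"""<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
 xmlns:dc="http://purl.org/dc/elements/1.1/">
 <dc:title>The Szemeredi-Trotter Theorem in the Complex Plane</dc:title>
 <dc:creator>Toth, Csaba D.</dc:creator>
 <dc:subject>Mathematics - Combinatorics</dc:subject>
 <dc:subject>05B25, 11T99</dc:subject>
 <dc:description>  It is shown that $n$ points and $e$ lines in the complex Euclidean plane
${\mathbb C}^2$ determine $O(n^{2/3}e^{2/3}+n+e)$ point-line incidences. This
bound is the best possible, and it generalizes the celebrated theorem by
Szemer\'edi and Trotter about point-line incidences in the real Euclidean plane
${\mathbb R}^2$.
 </dc:description>
 <dc:description>Comment: 24 pages, 5 figures, to appear in Combinatorica</dc:description>
</oai_dc:dc>"""


def parse_record(xml_text):
    """Turn one oai_dc record into a plain dict with whitespace normalised."""
    root = ET.fromstring(xml_text)

    def texts(tag):
        # Collapse line breaks and runs of spaces inside each element.
        return [" ".join((el.text or "").split())
                for el in root.findall("dc:" + tag, NS)]

    descriptions = texts("description")
    # Assumption: "Comment:" marks the arXiv comment field, not the abstract.
    abstracts = [d for d in descriptions if not d.startswith("Comment:")]
    titles = texts("title")
    return {
        "title": titles[0] if titles else "",
        "authors": texts("creator"),
        "subjects": texts("subject"),
        "abstract": abstracts[0] if abstracts else "",
        "comments": [d for d in descriptions if d.startswith("Comment:")],
    }
```

The `subjects` list is what you would use as labels if you wanted to train a tagger on the abstracts.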


We use the Python package oai-harvest (by John Harrison at the University of Liverpool) for harvesting the OAI-PMH metadata. It comes with some neat command-line tools that allow the use of a date filter when harvesting, which is useful for update cycles. (One word of warning: it is best to start small, since there are over 1 million records on the arXiv and oai-harvest writes out a file for each one.) We then process the XML into a format we find more amenable using BeautifulSoup.
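As an illustration of that last processing step, here is a rough sketch that walks a directory of harvested records and writes them out as JSON Lines. It is not our actual pipeline: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` to stay dependency-free, and it assumes the harvested files end in `.xml`. The function names and the output layout are made up for this example.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

# Clark-notation prefix for the Dublin Core element namespace.
DC = "{http://purl.org/dc/elements/1.1/}"


def record_from_file(path):
    """Read one harvested XML file and return its Dublin Core fields as a dict."""
    root = ET.parse(str(path)).getroot()

    def texts(tag):
        # iter() searches the whole tree, so this works whether or not
        # the oai_dc element is wrapped in an outer envelope.
        return [" ".join((el.text or "").split()) for el in root.iter(DC + tag)]

    titles = texts("title")
    return {
        "source_file": Path(path).name,
        "title": titles[0] if titles else "",
        "authors": texts("creator"),
        "subjects": texts("subject"),
        "descriptions": texts("description"),
    }


def convert_directory(in_dir, out_path):
    """Write one JSON object per readable file; return (converted, skipped)."""
    converted, skipped = 0, 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Assumption: one record per file, with an .xml extension.
        for path in sorted(Path(in_dir).glob("*.xml")):
            try:
                record = record_from_file(path)
            except ET.ParseError:
                skipped += 1  # leave malformed files for manual inspection
                continue
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            converted += 1
    return converted, skipped
```

Writing a single JSON Lines file sidesteps the one-file-per-record problem for everything downstream, and the skipped count tells you how many files failed to parse.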

That’s it. I hope you found it useful. Please do check out the arXiv demo!

Extra links

pmigdal on DataTau brought to my attention this StackExchange post with more about OAI-PMH for the arXiv.

The Author
I am in charge of machine learning and data science at Lateral.