The arXiv as Dataset

What's available?

Full text and article metadata are published in two different ways.

The preprint metadata (title, abstract, authors, categories) are published via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and via the arXiv API.

The full text of all preprints is made available in a huge data dump on S3, either as PDFs or as TeX source files.

I've found the preprint metadata much easier to work with, being smaller in size, cleaner and more frequently updated. Using the OAI-PMH interface, we fetch new abstracts every day.

The arXiv as a dataset

There are many tasks for which the arXiv is an ideal dataset. You could use the tags (MSC categories) to train a tagger, for instance, or test out your ideas for summarisation or keyword extraction. We feed the abstracts into our content recommender to provide a way to browse the arXiv conceptually: when you read an abstract, articles with conceptually related abstracts are surfaced automatically (see earlier post). The arXiv dataset often turns up in the language modelling literature as well, including in a recent paper by Alexander Alemi and arXiv founder Paul Ginsparg.

OAI-PMH -- WTF?

I had never heard of OAI-PMH before I wanted to work with arXiv data. It must have been popular at some stage, because there is a very long list of institutions that publish via OAI-PMH. However, to my knowledge, most are too small to be interesting, e.g. the ePrint server of such-and-such a university. The big three seem to be the arXiv, CERN and PubMed Central.

All OAI-PMH publishers must serve "Dublin Core", an XML-based format, which looks like this:

<pre>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>The Szemeredi-Trotter Theorem in the Complex Plane</dc:title>
<dc:creator>Toth, Csaba D.</dc:creator>
<dc:subject>Mathematics - Combinatorics</dc:subject>
<dc:subject>05B25, 11T99</dc:subject>
<dc:description> It is shown that $n$ points and $e$ lines in the complex Euclidean plane
${\mathbb C}^2$ determine $O(n^{2/3}e^{2/3}+n+e)$ point-line incidences. This
bound is the best possible, and it generalizes the celebrated theorem by
Szemer\'edi and Trotter about point-line incidences in the real Euclidean plane
${\mathbb R}^2$.
</dc:description>
<dc:description>Comment: 24 pages, 5 figures, to appear in Combinatorica</dc:description>
<dc:date>2003-05-19</dc:date>
<dc:date>2014-05-16</dc:date>
<dc:type>text</dc:type>
<dc:identifier>http://arxiv.org/abs/math/0305283</dc:identifier>
<dc:identifier>Combinatorica 35 (1) (2015), 95-126</dc:identifier>
<dc:identifier>doi:10.1007/s00493-014-2686-2</dc:identifier>
</oai_dc:dc>
</pre>
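
To make the structure of such a record concrete, here is a minimal sketch of flattening one into a Python dictionary. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup, purely so that it runs without extra dependencies. The function name `parse_record`, the output field names, and the rule that a description starting with "Comment:" is a comment rather than the abstract are all assumptions based on the single record shown above, not a documented convention.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace, copied from the record above.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# A shortened copy of the record above (abstract trimmed to its first sentence).
SAMPLE = r"""<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>The Szemeredi-Trotter Theorem in the Complex Plane</dc:title>
<dc:creator>Toth, Csaba D.</dc:creator>
<dc:subject>Mathematics - Combinatorics</dc:subject>
<dc:subject>05B25, 11T99</dc:subject>
<dc:description> It is shown that $n$ points and $e$ lines in the complex Euclidean plane
${\mathbb C}^2$ determine $O(n^{2/3}e^{2/3}+n+e)$ point-line incidences.
</dc:description>
<dc:description>Comment: 24 pages, 5 figures, to appear in Combinatorica</dc:description>
<dc:date>2003-05-19</dc:date>
<dc:date>2014-05-16</dc:date>
<dc:identifier>http://arxiv.org/abs/math/0305283</dc:identifier>
<dc:identifier>doi:10.1007/s00493-014-2686-2</dc:identifier>
</oai_dc:dc>"""


def parse_record(xml_text):
    """Flatten one oai_dc record into a dict (hypothetical field names)."""
    root = ET.fromstring(xml_text)

    def values(tag):
        # Collect the text of every <dc:tag> child, collapsing whitespace
        # (abstracts are hard-wrapped across several lines).
        return [" ".join((el.text or "").split())
                for el in root.findall(f"dc:{tag}", NS)]

    descriptions = values("description")
    identifiers = values("identifier")
    return {
        "title": next(iter(values("title")), None),
        "authors": values("creator"),
        "subjects": values("subject"),
        # Assumption: the abstract is the description without a "Comment:" prefix.
        "abstract": next((d for d in descriptions
                          if not d.startswith("Comment:")), None),
        "comments": [d[len("Comment:"):].strip() for d in descriptions
                     if d.startswith("Comment:")],
        "dates": values("date"),
        "url": next((i for i in identifiers if i.startswith("http")), None),
        "doi": next((i[len("doi:"):] for i in identifiers
                     if i.startswith("doi:")), None),
    }


record = parse_record(SAMPLE)
print(record["title"], record["url"])
```

Note that repeated elements such as `dc:subject`, `dc:date` and `dc:identifier` carry different kinds of information in the same tag, so any real pipeline would need to decide how to tell them apart; the prefix checks here are only one possible heuristic.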

Harvesting

We use the Python package oai-harvest (by John Harrison at the University of Liverpool) for harvesting the OAI-PMH metadata. It comes with some neat command-line tools that allow the use of a date filter when harvesting, which is useful for update cycles. (One word of warning: it is best to start small, since there are 1M records on the arXiv and oai-harvest writes out a file for each one.) We then use BeautifulSoup to process the XML into a format we find more amenable.
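
For a sense of what a date-filtered harvest looks like at the protocol level, here is a rough sketch that only builds the request URLs and makes no network calls. This is not how oai-harvest is implemented; it simply uses the standard OAI-PMH request arguments (`verb`, `metadataPrefix`, `from`, `until`, `resumptionToken`). The base URL is an assumption and should be checked against arXiv's current documentation, and the helper names `list_records_url` and `resume_url` are hypothetical.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Assumption: arXiv's OAI-PMH endpoint; verify against arXiv's documentation.
BASE_URL = "http://export.arxiv.org/oai2"


def list_records_url(from_date, until_date=None,
                     metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build a ListRecords request for records changed since from_date."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date.isoformat(),  # YYYY-MM-DD granularity
    }
    if until_date is not None:
        params["until"] = until_date.isoformat()
    return f"{base_url}?{urlencode(params)}"


def resume_url(token, base_url=BASE_URL):
    """Build the follow-up request for the next page of a large result.

    In OAI-PMH, resumptionToken is an exclusive argument: it is sent
    with the verb only, without metadataPrefix or date filters.
    """
    params = {"verb": "ListRecords", "resumptionToken": token}
    return f"{base_url}?{urlencode(params)}"


# A daily update cycle would ask for everything changed since yesterday.
yesterday = date.today() - timedelta(days=1)
print(list_records_url(yesterday))
```

Starting with a narrow date window like this is also a cheap way to follow the "start small" advice before attempting a full harvest.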

That's it. I hope you found it useful. Please do check out the arXiv demo!
