The Unknown Perils of Mining Wikipedia

If a machine is to learn about humans from Wikipedia, it must experience the corpus as a human sees it and ignore the overwhelming mass of robot-generated pages that no human ever reads. We provide a cleaned corpus, along with a Wikipedia recommendation API derived from it.

Robots learning from robots

Because of the breadth and availability of its content, Wikipedia has been widely used as a reference dataset for research in machine learning and for tech demos. However, Wikipedia has some serious problems that are not apparent from our familiarity with it as a resource for human beings.

Wikipedia has good coverage of popular topics and very irregular coverage of unpopular topics. Humans are unaware of this, since it is precisely the popular pages that are consumed: the most popular 12% of Wikipedia accounts for 90% of all traffic. The irregularity of coverage is poisonous to many models. A topic model trained on all of Wikipedia, for example, will associate “river” with “Romania” and “village” with “Turkey”. Why? Because there are 10k pages on villages in Turkey, and not enough pages on villages in other places.

To make things worse, unpopular pages are very often robot-generated. For example, Rambot authored 98% of all the articles on US towns, and half of the Swedish Wikipedia is written by Lsjbot! Robot-generated pages are built by inserting data into sentence templates. The sheer mass of these pages means that a huge proportion of the language examples a model learns from are just the same template used over and over. Robots learning from robots.

Popularity filtering

The most useful trick is to exclude those Wikipedia pages that are not viewed frequently. This automatically excludes the mass of robot-generated pages, and retains those pages that are frequently viewed and therefore edited by humans. Wikipedia publishes page view statistics. Below, for example, are the most frequently viewed pages on the day we derived the dataset (some time ago).

                                   title  dailyviews
                               Main Page     8501031
Climatic Research Unit email controversy      175964
                        Sachin Tendulkar      151859
                                  Jaguar       96096
                            Andy Kaufman       85422
                               Ram-Leela       61326
                 Great Oxygenation Event       55590
                           United States       54713
         General Educational Development       53607
                      Financial services       53305
                                Facebook       46794

On the day that we downloaded the dump, there were 4.3M pages, 1.3M of which had not been viewed once on that day, and 3.8M (i.e. 88%) were looked at fewer than 20 times. The human experience of Wikipedia is restricted to a very small proportion of the pages. Moreover, we found that the performance of our test models improved considerably when we trained on only popular pages.
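
Figures like these could be computed along the following lines. This is only a rough sketch: it assumes the statistics arrive as space-separated lines of the form "project title count bytes", which you should check against the files you actually download. The function names and the sample lines are made up for illustration.

```python
from collections import Counter

def daily_views(lines, project="en"):
    """Sum view counts per title for one project.

    Assumes each line looks like 'project title count bytes';
    this layout is an assumption, not taken from the article.
    """
    views = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or parts[0] != project:
            continue  # skip malformed lines and other projects
        try:
            views[parts[1]] += int(parts[2])
        except ValueError:
            continue  # skip lines whose count is not a number
    return views

def share_below(views, all_titles, threshold=20):
    """Fraction of the pages in all_titles viewed fewer than `threshold` times."""
    titles = list(all_titles)
    low = sum(1 for t in titles if views.get(t, 0) < threshold)
    return low / len(titles) if titles else 0.0

# Made-up sample lines, including one from another project and one malformed line.
sample = [
    "en Main_Page 500 12345",
    "en Main_Page 300 12345",
    "en Jaguar 25 999",
    "en Some_Village 3 100",
    "de Jaguar 40 999",
    "garbage line",
]
views = daily_views(sample)

# Titles from the dump; pages missing from the statistics count as zero views.
titles = ["Main_Page", "Jaguar", "Some_Village", "Never_Viewed"]
print(views.most_common(2))
print(share_below(views, titles))
```

Pages that never appear in the statistics are treated as having zero views, which is how a page can be "not viewed once" on a given day.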

Roll-your-own

Wikipedia publishes regular dumps of its content in XML format (here). The dataset we provide below is from October 2013, but unless you care about the latest rap star, that shouldn’t bother you.  Just in case you’d like to create an updated version of the dataset, here is how it was done:

  1. Download the latest XML data dump
  2. Use the Wikipedia extractor, version 2.6, by Giuseppe Attardi and Antonio Fuschetto of the TANL project at the University of Pisa to produce nice small XML files for each page of the form:
    <doc id="" url="" title="">

    </doc>
    We used the single-threaded version 2.6 (the multi-threaded version caused us problems).
  3. Wrangle these into the format you want using an XML parser.
  4. Download some page view statistics, and remove all pages with fewer than (e.g.) 20 daily page views.
  5. Exclude non-content pages based on title, e.g. “Image: XXX”, “User: XXX”
  6. Drop disambiguation pages.
  7. Drop stubs.
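
Steps 3 to 7 might be sketched as below. This is not the code we ran, just an illustration under several assumptions: a regular expression stands in for the XML parser (a swap made for brevity), the view counts are keyed by the same title string that appears in the `<doc>` tag, disambiguation pages are recognised by a title suffix, and a "stub" is any page under a hypothetical word-count threshold. The names `parse_docs`, `keep_page`, `min_words` and the prefix list are all illustrative.

```python
import re

# Matches one <doc ...> ... </doc> block as produced in step 2.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<text>.*?)</doc>',
    re.DOTALL,
)

# Illustrative only; a real list of non-content prefixes would be longer.
NON_CONTENT_PREFIXES = ("Image:", "User:")

def parse_docs(raw):
    """Yield (url, title, text) for each <doc> block in the extractor output."""
    for m in DOC_RE.finditer(raw):
        yield m.group("url"), m.group("title"), m.group("text").strip()

def keep_page(title, text, views, min_views=20, min_words=50):
    """Apply steps 4 to 7; every threshold and heuristic here is an assumption."""
    if views.get(title, 0) < min_views:
        return False  # step 4: too few daily page views
    if title.startswith(NON_CONTENT_PREFIXES):
        return False  # step 5: non-content page
    if title.endswith("(disambiguation)"):
        return False  # step 6: crude disambiguation check
    if len(text.split()) < min_words:
        return False  # step 7: crude stub check by word count
    return True

def make_doc(i, title, body):
    """Build a fake extractor record for the demo below."""
    return f'<doc id="{i}" url="http://example.org/{i}" title="{title}">\n{title}\n\n{body}\n</doc>\n'

long_body = "word " * 60
raw = "".join([
    make_doc(1, "Jaguar", long_body),
    make_doc(2, "User: Bob", long_body),
    make_doc(3, "Mercury (disambiguation)", long_body),
    make_doc(4, "Tiny", "short"),
    make_doc(5, "Obscure", long_body),
])
views = {"Jaguar": 96096, "User: Bob": 100,
         "Mercury (disambiguation)": 100, "Tiny": 100, "Obscure": 3}

kept = [title for url, title, text in parse_docs(raw) if keep_page(title, text, views)]
print(kept)
```

In practice the titles in the page view statistics may be written differently from those in the dump (underscores instead of spaces, for instance), so some normalisation is likely to be needed before the lookup in `keep_page`.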

The dataset

We worked through this procedure using an XML data dump from October 2013, retaining only those pages with at least 20 page views. You can download it here as a UTF-encoded, two-column CSV, the first column being the URL of the page, the second column being the text of the page. Linefeeds in the text are escaped as ‘\n’. There are 463k pages. File size is 1.2GB compressed (gzip).
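
Reading the file could look something like this. It is a minimal sketch that assumes UTF-8 encoding, standard CSV quoting and no header row, none of which is guaranteed by the description above; `read_pages` and the file name `sample.csv.gz` are hypothetical, and the demo writes its own tiny file rather than touching the real download.

```python
import csv
import gzip
import os
import tempfile

# Full articles can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)

def read_pages(path):
    """Yield (url, text) pairs from a gzipped two-column CSV.

    Assumes UTF-8 and no header row; literal backslash-n sequences
    in the text column are turned back into real linefeeds.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f):
            if len(row) != 2:
                continue  # skip rows that do not have exactly two columns
            url, text = row
            yield url, text.replace("\\n", "\n")

# Demo on a tiny made-up file in the same assumed layout.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
        csv.writer(f).writerow(
            ["http://example.org/Jaguar", "Jaguar\\nThe jaguar is a cat."]
        )
    pages = list(read_pages(path))

print(pages)
```

Because `read_pages` is a generator, the 1.2GB file can be streamed page by page instead of being loaded into memory at once.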

The Author
I am in charge of machine learning and data science at Lateral.
Comments

3 thoughts on “The Unknown Perils of Mining Wikipedia”

  1. Great write-up, and it seems like a good approach to getting a nice, clean corpus. However, when I look at your downloadable .csv.gz, the articles themselves don’t include linebreaks encoded as ‘\n’ (AFAICT). This makes it look like the page title is repeated twice, e.g. “Ip Ching Ip Ching is a Grandmaster of the Chinese Martial Art Wing Chun”, which can’t be good for sequence statistics…

  2. If I want to use the Wikipedia CSV dataset for research, do you have any publication that can be cited as the source of the dataset?

  3. Could you share your page view statistics cleaning code as well? (I’ve run into various versions of the pageview stats, unsure which is which).
