Clustering debates from UK politicians

What kind of language do British parliamentarians use? We scraped, parsed and vectorised a sample of recent debates from the House of Commons. We then applied a k-means clustering algorithm to these vectors, and created a word cloud for each cluster. Click on the image below for an enlarged version.

wc30allbrightedit

The image contains 24 word clouds, representing the 24 categories into which a sample of roughly 130,000 statements from UK House of Commons parliamentarians, all made between 2006 and the present day, were partitioned by the clustering algorithm. Each cloud contains ten words; the larger the word, the more representative it is of the cluster. The colouring is also meaningful: red words have meanings more closely aligned with remarks by Labour politiciansblue words, with those of Conservatives; and yellow words, with the sentiments of Liberal Democrats (contributions from all other parties were disregarded in this analysis, for simplicity). The brightness of a colour reflects the degree to which its meaning aligns with statements of politicians from the party it signifies, and grey words are spoken roughly equally often by all parties.

The Hansard corpus

In such a debate, a politician poses a question, which is then answered by one or more members of an opposing, or sometimes the same party. Occasionally, The Speaker intervenes to police etiquette, curtail digressions, or indeed exchange pleasantries with members of parliament (the exchange below is from 6th November, 2014):

Bob Blackman: Yesterday I had the honour of captaining the House of Commons bridge team in our annual match against the other place. I am happy to report that we successfully retained the Jack Perry trophy with an outstanding victory. Sadly, I was the only sitting Member participating and I had to enlist a number of ex-MPs—former distinguished Members of this House—to join me. Even more sadly, UK Sport refuses to recognise bridge, chess and other mind sports as sports. May we have an early debate, in Government time, on ensuring that there is recognition of those mind sports, which are important for sporting purposes in schools and for older people, so that we can encourage participation in them?

Mr Speaker: I am sure the House is pleased to learn of the hon. Gentleman’s prowess and distinction at the bridge table. It is a prowess and distinction of which I was hitherto unaware, but I am now better informed.

The text for this analysis comes from the publication Hansard, which provides transcriptions of speeches and debates in both houses of parliament in the United Kingdom. These records are all free for public perusal on web, although I agree with the economist and data scientist Andrew Whitby, who in another Hansard-related blog post from 2013 lamented the lack of a workable bulk-download option for the data. (Generously, he includes in his blog post a link to a text file containing a large list of URLs to XML files containing historical Hansard records, which he was obliged to collate himself.)

Technical details

The steps, in brief

Our goal was to construct word clouds from a representative collection of extracts from the corpus, to give a quick sense of the main themes discussed. I will outline the steps here in brief, before explaining them below in more detail:

  1. Programmatically generate a list of  candidate URLs, and download the associated HTML pages.
  2. Wrangle the text and metadata contained in the HTML pages into a CSV file.
  3. Vectorise the text chunks from the CSV file, and “cluster” the vectors into distinct categories.
  4. Create the word clouds (displayed in the image above) by finding representative words for each category. The more representative the word is of the category, the larger the font. The colours of the words symbolise political parties: blue words are aligned with statements made by Conservatives, red with those of Labour Party members, and yellow with Liberal Democrats.

We’ve made available the Python code we used to complete for Steps 1, 2 and 4 on a public Github repo. We’ve not made the code for Step 3 available, since it uses the underlying document vectors, which we don’t expose via our API.

Generating the URLs

The first step to creating the word clouds was to identify and download a sample of the parliamentary debates. Transcripts of “Debates and Oral Answers” for the House of Commons can be searched by date starting from this webpage. Here is an example URL for the transcript from 15th June 2016:

 

The characters in red encode the date and the page number. We wrote a Python script to iterate over all such dates and attempt to download the associated webpage, starting at the beginning of 1988, the earliest date from which the records are searchable at the above link. The script:

  1. Generates candidate dates and page numbers, and generates a URL from these in the above format.
  2. Writes each of these URLs, one per line, to a text file called “urls.txt” in the current directory.
  3. Attempts to download the HTML from this URL using the Python “requests” module. If it encounters a “page not found ” (404) error, it simply moves on to the next candidate. Note that it still writes this URL to the “urls.txt” file.

Running the above script gave us around 3,500 HTML files, starting in around 2006. I wanted to download all transcripts since 1988, but this failed. It seems that the URL format above is only valid for pages starting in 2006.

Text wrangling

The next step was to create a parser to extract and format text, in suitable chunks, together with metadata. We rolled our own parser using Python’s BeautifulSoup module, specifying that each chunk should be a “Remark”, by which we mean an uninterrupted block of text spoken by one politician, terminating when another politician begins to speak. The parser also extracts the name of each speaker and their title, if present (e.g. Mr., Mrs., Sir, …), as well as either their constituency and party affiliation, or their position in cabinet. The parliamentarian’s political party is omitted if they have a cabinet position, perhaps because the editors feel it ought to be clear to a human reader which party is governing at any recent point in history. This makes our job harder, and means that we were unable to programmatically extract the political parties for the most senior politicians. The output of applying the parser to each of the 3,500 html files in the corpus was a three-column CSV file with around 130,000 lines, one for each Remark. The first column contained a unique identifier, the second the speaker metadata in json format, and the third the plain text. In particular, each Remark is labelled not only by a speaker, but also by a political party.

A warning to the would-be user: The parser does not perfectly extract the political party in all cases; sometimes it also includes extraneous bits of surrounding text. This means it is very likely our analysis failed to find the party for some MPs, thus distorting the calculations used to create the red/blue/yellow party affiliation colour coding shown in the word cloud image above. Use at your own risk!

Vectorisation and Clustering

The next step was to vectorise the Remark contained in each line of the above CSV file using our vectoriser, obtaining a CSV file of 130,000 vectors, one for each Remark. Those curious can read more about the mathematical and linguistic ideas underpinning the vectorisation process in this earlier Lateral blog post by Benjamin Wilson.  We then clustered these vectors into groups using an implementation of the k-means clustering algorithm from the open source Python scikit-learn library. We experimented with different values for the number of clusters and found that 30 — 50 gave results that looked reasonable, although it’s clear from the image at the beginning of the post that some clusters are duplicated. All initialisations of sklearn.cluster.KMeans objects used almost all the standard parameters listed on the documentation page, with the exception that we set n_init=20, instead of 10. This k-means cluster object computes, in particular, a centroid vector for each cluster.

Word cloud creation

Having clustered the above vectors, hence obtained the centroid vectors for each cluster, we wrote a Python script to create the word clouds seen in the image above. The word cloud creation itself uses a very slightly modified version of Andreas Müller’s lovely open source Python module word_cloud to create the word clouds. The input to the Python script was:

  1. A list W of all word vectors, one for each item in the vocabulary of our vectorisation model.
  2. A list D of all the document vectors, one for each Remark in our Hansard corpus.
  3. The number num_words = 10 of words to appear in the word cloud.
  4. Note also that we forced all words in the word cloud images to be written horizontally (rather than word clouds comprising a mixture of horizontal and vertical words), by setting the parameter prefer_horizontal = 1.0.

For each centroid vector c, we compute the words in the word clouds as follows. Note that in all cases the “nearness” of two vectors v and w is measured by the cosine similarity of their normalisations v_norm and w_norm.

  1. A list W_of the 10 word vectors from W nearest to c was computed. This list is recorded, together with the associated nearnesses.
  2. For each word vector w in W_c, a list D_w of the political parties associated to each of the 100 document vectors from D nearest to w was computed.
  3. From D_w, create a dictionary of the form P_w  = {‘Lab’: x, ‘Con’: y, ‘LD’: z}, where x, y and z represent the number of occurrences of “Labour”, “Conservative” and “Liberal Democrat” in the list D_w. The dictionary P_w is a crude measure of the partisanship of the word w, encoding the absolute frequency with which members of each of the above three parties made a Remark with a meaning strongly reflected by that of the word w.
  4. Now we adjust this measure of partisanship P_w for the fact that the political parties are of unequal sizes by dividing each entry by the total number of Remarks in the corpus uttered by the corresponding political party. This compensates for the fact, for example, that there are many fewer Liberal Democrats than there are members of either of the other two parties. According to our parser, the corpus contained 7,986 Remarks by Liberal Democrats, 47,971 by Labour politicians, and 39,475 by Conservatives. The political affiliations of the remaining Remarks were either not recognised, or corresponded to some other party.
  5. Now normalise P_w by each value by the total of the three values, so that the new values sum to unity. This yields a dictionary P_w which looks, for example, like: {‘Lab’: 0.3, ‘Con’: 0.5, ‘LD’: 0.2}.
  6. Now apply the following rule to P_w to compute the colour of the word w: choose the maximum value m of P_w (0.5, in the example). If this value is larger than the arbitrarily chosen cutoff value of 0.45, then colour the associated word: (i) blue, if the associated party is Conservative; (ii) red, if the party is Labour, an (iii) yellow, if the party is the Liberal Democrats. The word in the example would thus be coloured blue. The magnitude of this value determines how bright the colour is rendered. Value close to 0.0 are dark, those close to 1.0 are bright.
  7. If m is not larger than the cutoff (indicating that no party made Remarks with meanings strongly reflected by the words significantly more often than any of the other two parties), colour the word grey.
  8. The relative sizes of the words, their positioning, and the final word cloud image are now computed by Andreas Müller’s word_cloud code, using the list W_c and associated nearnesses computed in Step 1.

We’ve published the raw output the above word cloud computation, together with the texts of the five most important documents to each cluster, is given in this text file on the Github repo. We hope it’s of interest to mavens of data science and of politics!

The Author
I'm a data scientist at Lateral.
Comments