Using machine learning to segment documents

Introduction

Many firms store in-house knowledge in digital form, to exploit in future business. For this information to be valuable, both people and software must be able to access and reuse it. For example, an employee might read a report on a previous project to better understand a current one, or a machine might compute statistics about previous engagements and display the results on an analytics platform. In either case, a good response should be coherent and contain the right amount of information.

In this post, we discuss the problem of “segmentation”. This means algorithmically breaking documents into coherent pieces like chapters, sections and subsections. These pieces can then be used as responses to user queries. We describe why solving the segmentation problem is useful, and why it is easy for humans but hard for computers. We also outline our most recent solution, which works on documents in many languages. The underlying algorithm was developed by our ML team, and productionised by our design and development teams.

Please get in touch if you would like to try this out.

Screenshots

The screenshots below show the segmentation tool in action, used to identify clauses in both English and Polish legal documents.

[Screenshot: English legal document]
[Screenshot: Polish legal document]

In both cases, the algorithm recognises the correct headings and sections. Inside the document view, it highlights the headings in orange, and the subheadings in blue. Using this information, the algorithm also creates a “table of contents” (ToC) pane on the left-hand side. This allows the user to navigate freely over the document, even if no native ToC exists.
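
As a rough illustration of how a ToC pane could be derived from the segmenter's output, the sketch below assumes each line has already been labelled as `heading`, `subheading` or `body`. The label names, the `build_toc` helper and the sample lines are hypothetical, and are not the interface of the tool shown above.

```python
from typing import List, Tuple

# Hypothetical output of a segmenter: (label, text) pairs in reading order.
LabelledLine = Tuple[str, str]


def build_toc(lines: List[LabelledLine]) -> List[dict]:
    """Nest each subheading under the most recent heading."""
    toc: List[dict] = []
    for index, (label, text) in enumerate(lines):
        if label == "heading":
            toc.append({"title": text, "line": index, "children": []})
        elif label == "subheading":
            if not toc:
                # A subheading before any heading: promote it to the top level.
                toc.append({"title": text, "line": index, "children": []})
            else:
                toc[-1]["children"].append({"title": text, "line": index})
    return toc


sample = [
    ("heading", "1. Definitions"),
    ("body", "In this agreement, the following terms apply."),
    ("subheading", "1.1 Parties"),
    ("body", "The parties are the Supplier and the Customer."),
    ("heading", "2. Term"),
]

# Print an indented outline, as a ToC pane might display it.
for entry in build_toc(sample):
    print(entry["title"])
    for child in entry["children"]:
        print("    " + child["title"])
```

Storing the line index with each entry is what would let a user jump from the ToC to the matching place in the document.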

The value

Retrieving information from a digital knowledge base is a question-and-response process. The user has one or more queries, and wants good answers. The best possible answer to a question must be not merely coherent, but also “proportionate”. This means it contains exactly enough information, and no more. For example, a good answer to the question: 

“What are the main economic and technological challenges in the renewable energy sector?” 

could stretch anywhere from a few sentences to a few books, depending on the context. In contrast, the answer to: 

“What is the combined 2019 GDP of all EU states?” 

is a single number. This “proportionality principle” implies that a good knowledge base should return the information in coherent, appropriately-sized blocks. Further, if the digital knowledge base is to be used by both humans and computers, these blocks should be intelligible to both agents.

The problem

But this presents a problem: Computers like to read data in databases, which are often linear and flat, but humans like to read human language in human documents. Almost always, these documents — books, reports, blogs and so on — have a hierarchical structure, comprising chapters, sections and subsections. So:

Problem: How do you store and retrieve information so that both humans and computers can easily reuse it?

Let us dissect this. The agent creating the data may be either human or computer, ditto the agent consuming it. This gives four creator-consumer combinations: Human-Computer (H2C), C2H, C2C and H2H, hence four separate answers to the above question. Each answer corresponds to a software category:

  1. C2C: If the data is both consumed and created by a computer, we are in the traditional computing world, and this is a software engineering problem for backend developers. 
  2. C2H: If the data is created by computers but consumed by humans, we are in the world of analytics. One answer is an analytics platform that summarises complex information in a human-friendly fashion.
  3. H2C: If the data is created by humans but consumed by computers, we are in the world of AI, or designing machines to understand human information. If the data is text, this is called Natural Language Processing (NLP).
  4. H2H: If the data is created and consumed by humans, but lives on a digital platform, we are in the world of recommenders. Here, the question becomes: How do you get a computer to suggest relevant, digestible human content to a human?

The solution is chunks

In both of the last two cases, the data must be split into the right “chunks” to be properly understood by the consuming agent. This is because humans both think and write in chunks. In the last case (H2H), it may seem that no segmentation is required, since human-created data ought to be naturally understandable to humans. However, this is not true at scale: We humans are still bad at parsing lots of human information, so machines can play a useful role in helping us, by recommending information in parcels.

For example, a user writing a “methodology” section for a new engagement may want to query their company’s knowledge base for previous reports on similar projects. In this case, the results will be more useful if the “methodology” sections can be returned independently, rather than embedded in longer documents which the user is forced to comb through. To do this at scale, we need an algorithmic way to break up a human-created document into the right “chunks”: sections, subsections, lists and so on. We call this process “segmentation”.
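
To make the “methodology” example concrete, here is a minimal sketch of what retrieval could look like once documents have been segmented. The `Segment` record, the `find_sections` helper and the sample knowledge base are all assumptions made for illustration. A real system would likely rank results rather than match keywords.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    document: str   # name of the source document
    heading: str    # section heading found by segmentation
    text: str       # body text of that section


def find_sections(segments: List[Segment], keyword: str) -> List[Segment]:
    """Return segments whose heading mentions the keyword (case-insensitive)."""
    needle = keyword.casefold()
    return [s for s in segments if needle in s.heading.casefold()]


# Hypothetical knowledge base: two reports, already split into sections.
knowledge_base = [
    Segment("report_a", "1. Introduction", "Background to the engagement."),
    Segment("report_a", "2. Methodology", "We interviewed twelve stakeholders."),
    Segment("report_b", "Methodology and scope", "We analysed three years of sales data."),
    Segment("report_b", "Findings", "Sales grew in two of three regions."),
]

# Only the methodology sections come back, not the whole reports.
for hit in find_sections(knowledge_base, "methodology"):
    print(f"{hit.document}: {hit.heading}")
```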

Old-school segmentation

Segmentation is easy for people but hard for computers, because the typical indicators of document structure — the font choice, text size, enumerations, whitespace, layout and so on — vary greatly across documents. This variation makes it hard to handcraft an algorithmic solution, which is, if you are a computer scientist, the traditional approach to automating things. Such an algorithm quickly devolves into a long, messy list of rules — one for each special case you can remember seeing — for detecting section headings. 

For example, suppose you notice headings often begin with enumerators, like numbers or letters of the alphabet (“1. Introduction”; “B. Methodology”, etc). So, you write software to flag a line of text as a section boundary each time it sees an enumerator. But then you realise enumerators can also signify lists, rather than sections. How would you program the computer to make this distinction? 
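
A first attempt at this rule might look like the sketch below. The regular expression and the sample lines are our own illustration, not code from a real system. Note how the two list items are flagged alongside the genuine headings.

```python
import re

# Matches a leading enumerator such as "1.", "2)", "B." or "iv." followed by text.
ENUMERATOR = re.compile(r"^\s*(?:\d+|[A-Za-z]|[ivxlc]+)[.)]\s+\S")


def looks_like_heading(line: str) -> bool:
    """Naive rule: any line that starts with an enumerator is a heading."""
    return ENUMERATOR.match(line) is not None


lines = [
    "1. Introduction",
    "B. Methodology",
    "This report describes the project.",
    "1. Collect the documents",   # a list item, not a heading
    "2. Annotate the headings",   # a list item, not a heading
]

# The rule cannot tell headings from list items: both kinds are flagged.
flagged = [line for line in lines if looks_like_heading(line)]
print(flagged)
```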

Perhaps, you speculate, lists are always indented, but section headings never are. So, you code up a new rule: “Section headings begin with enumerators, but aren’t indented”. But, even setting aside the tricky question of how much whitespace equals an indent, you quickly find another document that breaks this rule, too. In response, you must either modify the existing rule, or else create a new one. In each case, this makes your program more complex and harder to debug. Over time, you see new documents with new formats and layouts, and so you repeat this process, incrementally spaghettifying your code. And still you find new exceptions! A new strategy is called for.
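
The sketch below shows how quickly the refined rule breaks. It assumes, arbitrarily, that two or more leading spaces count as an indent. The sample lines come from two imaginary documents: one that indents its lists, and one that centres its headings and sets its lists flush left.

```python
import re

ENUMERATOR = re.compile(r"^(?:\d+|[A-Za-z])[.)]\s+\S")
INDENT_WIDTH = 2  # assumption: two or more leading spaces count as an indent


def is_heading(line: str) -> bool:
    """Rule from the text: starts with an enumerator and is not indented."""
    indent = len(line) - len(line.lstrip(" "))
    if indent >= INDENT_WIDTH:
        return False
    return ENUMERATOR.match(line.lstrip(" ")) is not None


# (line, true label) pairs from two hypothetical documents.
examples = [
    ("1. Introduction", True),
    ("  1. Collect the documents", False),   # indented list item: rule works
    ("    2. Methodology", True),            # centred heading: rule misses it
    ("1. Annotate the headings", False),     # flush-left list item: false alarm
]

for line, truth in examples:
    verdict = "ok" if is_heading(line) == truth else "WRONG"
    print(f"{verdict:5} {line!r}")
```

Patching either failure means adding another condition, and each new condition risks breaking a case that used to work.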

Shiny new segmentation

After learning the brittleness of hand-coded segmentation algorithms the hard way, we changed our approach. Instead of brainstorming to anticipate section boundary patterns, we built a machine learning algorithm to do this for us. 

The algorithm has access to all the low-level features in the document that may help it decide where the section boundaries lie. For example, it sees layout information like whitespace and word bounding box coordinates. It also knows about features reflecting font size and value, and the presence or absence of enumerators. 
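
To give a feel for what such low-level features might look like, here is a minimal sketch that turns one line of a page into a dictionary of numbers. The `Line` record, the feature names and the choice of features (including the `bold` flag) are assumptions for illustration. The production feature set is not described in this post.

```python
import re
from dataclasses import dataclass
from statistics import median
from typing import Dict, List

# A leading enumerator such as "1.", "1.1", "2)" or "B." followed by text.
ENUMERATOR = re.compile(r"^(?:\d+(?:\.\d+)*[.)]?|[A-Za-z][.)])\s+\S")


@dataclass
class Line:
    text: str
    x0: float         # left edge of the bounding box
    y0: float         # top edge (assumption: y grows down the page)
    y1: float         # bottom edge
    font_size: float
    bold: bool


def extract_features(lines: List[Line], index: int) -> Dict[str, float]:
    """Describe one line relative to the rest of the page."""
    line = lines[index]
    body_size = median(l.font_size for l in lines)
    gap_above = line.y0 - lines[index - 1].y1 if index > 0 else 0.0
    return {
        "relative_font_size": line.font_size / body_size,
        "bold": float(line.bold),
        "has_enumerator": float(ENUMERATOR.match(line.text) is not None),
        "left_offset": line.x0 - min(l.x0 for l in lines),
        "gap_above": gap_above,
        "word_count": float(len(line.text.split())),
    }


page = [
    Line("1. Introduction", 72.0, 100.0, 114.0, 14.0, True),
    Line("Many firms store in-house knowledge in digital form.", 72.0, 120.0, 130.0, 10.0, False),
    Line("This knowledge is reused in later projects.", 72.0, 132.0, 142.0, 10.0, False),
]

print(extract_features(page, 0))
```

None of these numbers decides anything on its own. They are raw material for the learning step.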

The magic is that the algorithm decides for itself how best to combine these features into a meaningful signal, based on training data. These training data are documents with human-annotated section boundaries. This annotation is the “ground truth” for the machine. 
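
The idea of learning the combination from annotated examples can be sketched with a perceptron, one of the simplest learning algorithms. This is a stand-in chosen for brevity, not the model we use in production, and the four features and six training rows are invented for the example.

```python
from typing import List, Tuple

# Each row: [relative_font_size, has_enumerator, indented, bold]; label 1 = heading.
# Feature names and values are made up for illustration.
TRAINING_DATA: List[Tuple[List[float], int]] = [
    ([1.4, 1.0, 0.0, 1.0], 1),  # "1. Introduction"
    ([1.2, 0.0, 0.0, 1.0], 1),  # "Methodology" (no enumerator)
    ([1.0, 1.0, 1.0, 0.0], 0),  # indented list item
    ([1.0, 1.0, 0.0, 0.0], 0),  # flush-left list item
    ([1.0, 0.0, 0.0, 0.0], 0),  # body text
    ([1.3, 1.0, 1.0, 1.0], 1),  # indented heading
]


def predict(weights: List[float], bias: float, features: List[float]) -> int:
    """Weighted sum of the features; positive scores count as headings."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def train(data, max_epochs: int = 100):
    """Perceptron rule: nudge the weights whenever a prediction is wrong."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in data:
            error = label - predict(weights, bias, features)
            if error != 0:
                mistakes += 1
                weights = [w + error * x for w, x in zip(weights, features)]
                bias += error
        if mistakes == 0:
            break  # every annotated example is now classified correctly
    return weights, bias


weights, bias = train(TRAINING_DATA)
print("learned weights:", weights, "bias:", bias)
```

Nobody wrote a rule saying how much the enumerator or the indent should count. The weights come entirely from the annotated rows, so adding newly annotated rows and retraining changes the behaviour without touching the code.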

This approach has two big advantages. First, we avoid the spaghetti code scenario sketched above. A machine learning algorithm is now responsible for combining the features to identify section boundaries, not a hand-coded program. Second, it allows both Lateral and its end users to improve the segmentation performance by correcting mistakes. This means the software can learn to understand the structure even of documents with layouts it has never seen before. This is impossible for a hard-coded solution.

Try it!

This algorithm powers software that is easy to use, with no computer science degree required. Please get in touch if you would like to try it out.