Toward automating complex legislation updates

One of the projects we’re working on at the moment is for The National Archives (TNA) and relates to updating the system for consolidation of legislation – the Expert Participation Model (known as Participation). It’s a big, complex project using many cutting-edge technologies and techniques, and we thought we would start to share some of the workings.

One aspect of the editorial process has always been very manual: identifying the changes that new legislation makes to existing legislation. It is very common for new UK legislation to change existing legislation, often in complex and subtle ways, and processing these changes is very time-consuming. To improve on this, one of the two main strands of activity within Participation is to automate the extraction of those changes. New legislation is published every working day, so our objective is to process new documents as they are loaded.

Amendments to legislation break down into various types. The project has so far tackled several areas, with the most complete and complex area being so-called ‘textual amendments’. These are amendments that actually modify the text of other legislation. For instance, it might be something like ‘In section 3 for “vehicle” substitute “automobile”’.
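To make the idea concrete, here is a minimal Python sketch of how one very regular amendment of this shape could be recognised and applied. It is an illustration only and not how Participation does it: the regular expression, the function names and the sample provision text are all our own assumptions, and real amendments vary far more than a single pattern can cover.

```python
import re

# Hypothetical pattern for one very regular amendment shape:
#   In section N for "old" substitute "new"
# Both straight and curly quotation marks are accepted.
AMENDMENT = re.compile(
    r'In (?P<location>section \d+),? for '
    r'[“"](?P<old>[^”"]+)[”"] substitute [“"](?P<new>[^”"]+)[”"]'
)


def parse_substitution(sentence):
    """Return the location, old text and new text, or None if no match."""
    match = AMENDMENT.search(sentence)
    if match is None:
        return None
    return match.groupdict()


def apply_substitution(text, change):
    """Apply a parsed substitution to the text of the target provision."""
    return text.replace(change["old"], change["new"])


change = parse_substitution('In section 3 for “vehicle” substitute “automobile”')
# Made-up provision text, used only to show the substitution being applied.
amended = apply_substitution("Every vehicle must be licensed.", change)
```

Here `change` holds the location (`section 3`) together with the old and new wording, which is the kind of information the extraction has to recover before any change can be applied.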

Now, in an ideal world the format of these amendments would be very strictly controlled. Unfortunately that isn’t how it is. It is not completely random, however, and there is a considerable amount of uniformity across documents. We like to think of it as a semi-controlled language.

To tackle the problem we’ve used our Data Enrichment Service (DES) infrastructure – http://openup.tso.co.uk/des. The DES provides a platform to execute GATE pipelines. A GATE pipeline is a series of processing steps, each of which does something to the text, so that the end result is additional value extracted from that text. GATE itself is a framework for ‘natural language processing’ tasks and was created by the University of Sheffield (http://gate.ac.uk).
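The pipeline idea can be sketched in a few lines of Python. To be clear, this is not the GATE API (GATE is a Java framework) and these are not the steps Participation uses; the step names and the dictionary-based document are assumptions made purely to show how each step adds annotations for later steps to build on.

```python
import re

# Illustration only: plain Python, not the GATE API.
# A "document" here is just the text plus a growing list of annotations.


def tokenise(doc):
    """Hypothetical first step: annotate every word-like token."""
    doc["annotations"] += [
        {"type": "Token", "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\w+", doc["text"])
    ]
    return doc


def find_actions(doc):
    """Hypothetical later step: mark words that signal a change."""
    for m in re.finditer(r"\b(omit|substitute|insert)\b", doc["text"]):
        doc["annotations"].append(
            {"type": "Action", "start": m.start(), "end": m.end()}
        )
    return doc


def run_pipeline(text, steps):
    """Run each step in order, passing the enriched document along."""
    doc = {"text": text, "annotations": []}
    for step in steps:
        doc = step(doc)
    return doc


result = run_pipeline(
    "In Schedule 2 omit paragraph 70(c) and (d)", [tokenise, find_actions]
)
actions = [
    result["text"][a["start"]:a["end"]]
    for a in result["annotations"]
    if a["type"] == "Action"
]
```

Because every step receives and returns the same kind of document, steps can be added, removed or reordered without changing the others.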

The DES infrastructure can be used to run pretty much any GATE pipeline. It also includes the DES Starter, which provides a basic capability to identify things such as names, places, organisations, and events. Pipelines can make use of the DES Starter to shortcut the development process if required. Participation, however, is run as an independent process that simply makes use of the DES infrastructure.

The legislation.gov.uk website uses the standard DES API (Application Programming Interface) to send documents for processing and receive back an enhanced version.

The process also makes use of the ability of the DES to update itself. The Harvester facility of the OpenUp platform monitors the RSS feeds for new legislation and sends the information to the DES API. The DES then uses this information to update the GATE pipeline process. This ensures that the extraction process can recognise the latest items of legislation.
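As a rough illustration of the monitoring step, the Python sketch below picks out feed items that have not been seen before. It is not the Harvester's code: the feed layout (plain RSS 2.0 `item` elements with `title` and `link`), the sample feed and the example.org links are all made up, and the hand-off to the DES API is left out because we are not describing that interface here.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, assuming plain RSS 2.0 items with a title and a link.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>New legislation (made-up sample)</title>
  <item><title>Example Order 2011</title><link>http://example.org/id/uksi/2011/1</link></item>
  <item><title>Example Regulations 2011</title><link>http://example.org/id/uksi/2011/2</link></item>
</channel></rss>"""


def new_items(feed_xml, seen_links):
    """Return (title, link) pairs for items whose link has not been seen.

    The set of seen links is updated in place, so calling this again
    with the same feed returns nothing new.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link not in seen_links:
            fresh.append((item.findtext("title"), link))
            seen_links.add(link)
    return fresh


seen = {"http://example.org/id/uksi/2011/1"}
fresh = new_items(SAMPLE_FEED, seen)
again = new_items(SAMPLE_FEED, seen)
```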

The output from the DES is provided as an HTML demo as part of the development process to enable people working on the project to see easily what has been automatically produced. An example of a snippet of legislation is given below. You can see that the blue boxes describe the information that has been extracted.

The HTML output is actually generated from XML. It is the XML that is the output from the DES. The XML is generated by a custom parser that has been written specially to process the legislation amendments. The parser is written as a GATE plug-in and simply slots into the GATE pipeline. The ability to easily write and incorporate customised plug-ins into GATE is one of its best features. An example of the last part of the HTML in XML format is given below:

<InGroupChange terminal="no">
    <Location type="In" id="leg-item-156"
        sourceRef="http://www.legislation.gov.uk/id/uksi/2011/1503/article/19">In</Location>
    <LegRef id="leg-item-157" uriFrom="Annotations" uri="/schedule/2"
        minorType="schedule" type="Schedule"
        sourceRef="http://www.legislation.gov.uk/id/uksi/2011/1503/article/19">Schedule 2</LegRef>
    <InlineLocationLegislation terminal="no">
        <Location type="Legislation" id="leg-item-158"
            sourceRef="http://www.legislation.gov.uk/id/uksi/2011/1503/article/19">to</Location>
        <Legislation context="http://www.legislation.gov.uk/id/uksi/2003/3198"
            rule="SecondaryLegislation" type="uksi" id="leg-item-159"
            sourceRef="http://www.legislation.gov.uk/id/uksi/2011/1503/article/19">the
            Communications (Isle of Man) Order 2003</Legislation>
    </InlineLocationLegislation>
    <InlineActionDeleteSubRef terminal="no">
        <Action type="Delete" id="leg-item-160"
            sourceRef="http://www.legislation.gov.uk/id/uksi/2011/1503/article/19">omit</Action>
        <LegRef type="Paragraph" minorType="paragraph" uri="/70/c, /d" id="leg-item-161"
            sourceRef="http://www.legislation.gov.uk/id/uksi/2011/1503/article/19">paragraph 70(c) and (d)</LegRef>
    </InlineActionDeleteSubRef>
</InGroupChange>
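To show how a consumer might read XML like this, here is a small Python sketch using the standard library's ElementTree. It works on an abbreviated copy of the XML above (several attributes are dropped for brevity), and the `summarise_change` function and the shape of its result are our own assumptions for illustration, not part of the DES output.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the XML above; some attributes are dropped for brevity.
SAMPLE = """
<InGroupChange terminal="no">
    <Location type="In" id="leg-item-156">In</Location>
    <LegRef id="leg-item-157" uri="/schedule/2" type="Schedule">Schedule 2</LegRef>
    <InlineLocationLegislation terminal="no">
        <Location type="Legislation" id="leg-item-158">to</Location>
        <Legislation context="http://www.legislation.gov.uk/id/uksi/2003/3198"
            type="uksi" id="leg-item-159">the
            Communications (Isle of Man) Order 2003</Legislation>
    </InlineLocationLegislation>
    <InlineActionDeleteSubRef terminal="no">
        <Action type="Delete" id="leg-item-160">omit</Action>
        <LegRef type="Paragraph" uri="/70/c, /d" id="leg-item-161">paragraph 70(c) and (d)</LegRef>
    </InlineActionDeleteSubRef>
</InGroupChange>
"""


def summarise_change(xml_text):
    """Hypothetical helper: pull the key facts out of one InGroupChange."""
    root = ET.fromstring(xml_text.strip())
    legislation = root.find(".//Legislation")
    action = root.find(".//Action")
    return {
        "affected": legislation.get("context"),
        # Collapse the line break and indentation inside the element text.
        "title": " ".join(legislation.text.split()),
        "action": action.get("type"),
        "references": [ref.get("uri") for ref in root.iter("LegRef")],
    }


summary = summarise_change(SAMPLE)
```

The summary identifies the affected legislation by its URI, records the action as `Delete`, and lists the referenced provisions in document order.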

As a slightly more complex example of the kind of amendment that can occur, here is another HTML output from a different document:

The XML output for this is a little bit too long to put here!

The XML needs to be very precise because it is then further processed to create another flavour of XML, which in turn is used to generate spreadsheets, PDFs and RDF, the RDF being the data format that is used by the rest of the system. The RDF is held in our OpenUp platform RDF store.

In addition to returning the changes contained within each item of legislation, the extraction process also returns the original legislation XML with additional annotations. These annotations should then permit enhanced outputs, such as additional links on the legislation.gov.uk website.

It’s a pretty challenging problem in all and we have by no means solved all of the issues yet. But we are getting ever better results (we’ll have some publishable statistics soon) and the process is constantly improving.

Hopefully we have given you a feel for how the OpenUp platform can be used to process text and extract additional information from that text using a process that can be self-updating and easily accessed over the web.

Participation is perhaps the most sophisticated example to date of the use of multiple facilities within the OpenUp platform. It is most exciting because the output has clearly identifiable productivity benefits.

November 4, 2011, 5:21 pm