Recently, I’ve been involved in a project to ensure our consultations support RDFa markup, to make them indexable and reusable by third parties, including Directgov. Without duplicating the quite accessible and useful COI guidance, I thought I’d summarise here the process involved from the perspective of implementing the standard with minimal prior knowledge of the whys and wherefores.
As of Jan 1st 2010, marking up consultations this way is a mandatory requirement for government sites. More importantly, it’s a Jolly Good Idea to provide a low-maintenance way of enabling other systems and services to grab a list of consultations from your site, and identify the important metadata about them, including the closing date and how to respond. Short term, it will make services like TellThemWhatYouThink and Directgov more useful, but in terms of the bigger picture, it will expose the opportunity to get involved with policymaking to a wider audience, and reduce the hassle for those who are already part of our regular stakeholder group (by making possible new services such as auto email alerts, RSS feeds, cross-government updates and so on).
RDFa offers a simple way to add meaningful information to existing web pages, which can be extracted easily by software (as opposed to hit-and-miss ‘scraping’ of regular web pages). As a lay person, I’d say there are three key principles which I can articulate:
- Be unobtrusive and minimalistic: taking this approach lets you add extra items to pages which aren’t seen by regular browsing visitors, but which are accessible to software robots looking for them. It’s also not ‘an extra thing’ to maintain and serve like an RSS feed, so reduces risk, in theory.
- Offer clean data: by describing consultation data consistently, RDFa helps software extract very clean information about the consultation – for example, an unambiguous closing date, a response email address, an exact postcode, all in formats which can then be used in other ways (plotted on a map, listed on a calendar, turned into a mailform on a website, etc.).
- Extend existing conventions: the most complicated aspect of implementing this particular specification is that the authors have gone out of their way to find existing wheels rather than reinvent their own. So they use Dublin Core metadata to describe authors and organisations; vCard to describe response contact information; plus nods to DBpedia and FOAF (Friend Of A Friend) to support these major semantic web initiatives. Only for the gaps where specific consultation information needs to be marked up is there a new standard introduced, using the argot: namespace prefix.
In a nutshell, the process involves tweaking the template for your consultation pages, adding extra metadata elements and attributes. This is only as easy or hard as your CMS makes it. It’s important that it’s right though – even a few ‘broken bits’ could render the page useless to a software robot trying to extract data from it.
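To give a feel for what a software robot does with this markup, here's a minimal sketch in Python (my choice of language for illustration, not something the COI guidance prescribes). It reads `property` attributes out of a small, made-up XHTML fragment and shows how a single unclosed tag stops a strict parser from reading anything at all. The title, date and the use of `dc:valid` for the closing date are assumptions for the example, and a real RDFa parser does a good deal more (resolving prefixes to full URIs, working out subjects, and so on).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the title and date are made up, and dc:valid is only
# a guess at the closing-date term (check the COI guidance for the real one).
FRAGMENT = """
<div xmlns:dc="http://purl.org/dc/terms/" about="#consultation">
  <h2 property="dc:title">Example consultation on widgets</h2>
  <p>Closes on
    <span property="dc:valid" content="2010-03-31">31 March 2010</span>
  </p>
</div>
"""


def extract_properties(xhtml):
    """Return {property: value} for every element with a property attribute."""
    root = ET.fromstring(xhtml.strip())  # raises ET.ParseError on broken markup
    found = {}
    for element in root.iter():
        prop = element.get("property")
        if prop is None:
            continue
        # In RDFa a content attribute, when present, wins over the visible text.
        value = element.get("content")
        if value is None:
            value = "".join(element.itertext()).strip()
        found[prop] = value
    return found


def is_well_formed(xhtml):
    """True if the fragment parses as XML at all."""
    try:
        ET.fromstring(xhtml.strip())
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(extract_properties(FRAGMENT))
    # A span that is never closed: a strict parser gives up on the whole page.
    print(is_well_formed("<div><span property='dc:title'>Oops</div>"))
```

The visitor still sees "31 March 2010", while the robot gets the unambiguous `2010-03-31` from the `content` attribute.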
How to do it
Read the COI guidance (and give it to your developer), which is the most comprehensive guide, with useful illustrated examples. There’s also a worked-up HTML page showing how this works, and of course you’re welcome to look at ours (which I *think* are right, based on feedback from the gurus).
As an example (but again, you should read the official guidance) I found I needed to work through the following:
- ensure we have a single page per consultation
- amend the DOCTYPE, if you’re using something like the standard XHTML strict/transitional version. It needs to tell requesters of the page that it contains RDFa
- add some attributes to the <html> element, highlighting the namespaces (vocabularies) you’re referencing in the document
- add Dublin Core metadata elements/attributes to your page <head> element if they’re not there already
- ensure we have a wrapper <div> around the consultation information which again references the namespaces (vocabularies) you’re using. This also identifies the name of the organisation publishing the document
- add <span> elements carrying Dublin Core metadata attributes within this <div>, identifying it as a consultation
- add some Dublin Core attributes to key bits of the HTML, such as the consultation title, start date, closing date and description, marking these as such – and in the case of dates, ensuring there’s a machine-readable data format value in the attribute. Also add a unique identifier – a reference number – to each consultation (not something we’d done routinely before)
- ensure the contact details for responses are carefully structured using vCard format, with separate ‘Full Name’, ‘Street Address’, ‘Locality’ and ‘Post Code’ elements, suitably marked up with attributes. Since vCard doesn’t cover the specific case of a consultation with an email reply address, for example, these elements are marked up with attributes from the new argot: namespace
- add Dublin Core-based attributes describing the file attachments – the consultation document itself, and any related ones such as appendices or Impact Assessments
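To make a few of those steps concrete, here's a rough sketch of the sort of thing a page template might output, written as a small Python helper purely for illustration. The DOCTYPE and the Dublin Core, vCard, FOAF and XSD namespace URIs are the standard published ones; the `argot` URI is a placeholder I've made up, and `dc:valid` for the closing date is an assumption, so take the real values from the COI guidance.

```python
from datetime import date
from xml.sax.saxutils import escape, quoteattr

# The W3C's XHTML+RDFa 1.0 DOCTYPE, replacing XHTML strict/transitional.
DOCTYPE = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" '
           '"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">')

NAMESPACES = {
    "dc": "http://purl.org/dc/terms/",
    "v": "http://www.w3.org/2006/vcard/ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    # Placeholder only: use the consultation vocabulary URI from the guidance.
    "argot": "http://example.org/consultation-argot#",
}


def html_open_tag():
    """Build the opening <html> tag, declaring each vocabulary prefix."""
    declarations = " ".join(
        "xmlns:%s=%s" % (prefix, quoteattr(uri))
        for prefix, uri in sorted(NAMESPACES.items())
    )
    return ('<html xmlns="http://www.w3.org/1999/xhtml" '
            'version="XHTML+RDFa 1.0" xml:lang="en" %s>' % declarations)


def date_span(prop, day):
    """Wrap a human-readable date in a span carrying the ISO 8601 value."""
    human = "%d %s" % (day.day, day.strftime("%B %Y"))
    return '<span property=%s content=%s datatype="xsd:date">%s</span>' % (
        quoteattr(prop), quoteattr(day.isoformat()), escape(human))


if __name__ == "__main__":
    print(DOCTYPE)
    print(html_open_tag())
    # dc:valid as the closing-date term is an assumption for this example.
    print(date_span("dc:valid", date(2010, 3, 31)))
```

Generating the date span from a real date value, rather than typing it by hand, is one way to keep the visible text and the machine-readable attribute from drifting apart.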
UPDATE: in retrospect, it was foolish to attempt a blog post about code without some code examples. I’ve tried and failed to find a half-decent code syntax highlighter plugin for WordPress, but the following couple of screenshots hopefully illustrate the before and after situations for the contact information part of a consultation:
Before, plain HTML:
After, with RDFa added (and marked up more semantically as a list item within the consultation metadata)
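As a text alternative to the screenshots, here's a rough sketch of the same before-and-after idea. It isn't the actual markup from our site: the team name, address and email are invented, `argot:responseEmail` is a hypothetical property name standing in for whatever the consultation vocabulary really defines, and the `v:` and `argot:` prefixes are assumed to be declared on the wrapper <div> or <html> element. The small Python function simply lists what a robot could pick out of each version.

```python
import xml.etree.ElementTree as ET

# Before: readable by people, but just an undifferentiated blob to software.
BEFORE = """
<p>Consultation Team<br />
1 Example Street<br />
London<br />
AB1 2CD<br />
Email: <a href="mailto:consultation@example.gov.uk">consultation@example.gov.uk</a></p>
"""

# After: the same details as a list item, each part labelled with vCard terms.
# argot:responseEmail is a HYPOTHETICAL name; see the COI guidance for the real one.
AFTER = """
<li typeof="v:VCard">
  <span property="v:fn">Consultation Team</span>
  <span rel="v:adr">
    <span typeof="v:Address">
      <span property="v:street-address">1 Example Street</span>
      <span property="v:locality">London</span>
      <span property="v:postal-code">AB1 2CD</span>
    </span>
  </span>
  <a rel="argot:responseEmail" href="mailto:consultation@example.gov.uk">consultation@example.gov.uk</a>
</li>
"""


def labelled_values(xhtml):
    """Collect text labelled with property, and links labelled with rel."""
    root = ET.fromstring(xhtml.strip())
    values = {}
    for element in root.iter():
        if element.get("property"):
            values[element.get("property")] = "".join(element.itertext()).strip()
        elif element.get("rel") and element.get("href"):
            values[element.get("rel")] = element.get("href")
    return values


if __name__ == "__main__":
    print(labelled_values(BEFORE))  # nothing is labelled, so nothing is found
    print(labelled_values(AFTER))
```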
What help is available?
I worked from the examples given in the COI guidance and the pioneers in this at the Ministry of Justice. The COI Digigov team are your allies in helping to implement this, and should be able to answer queries and/or direct you to sources of further implementation advice and support.
P.S. If you Know About This Stuff and feel I’m giving duff advice here, please drop me a line in the comments or via the contact form and I’ll correct it. Thanks.