Archives Hub Blog: September 2009

The Semantic Web has always interested me, although some years have elapsed since I first came across it. It feels like it took a back seat for a while, but now it is back and starting to go places, particularly with the advent of Linked Data, which is a central concept behind the Semantic Web.

The first Linked Data Meetup was recently held in London, with presentations, case studies, panels and a free bar in the evening, courtesy of Talis and the winners of the Best-in-use-Track Paper award from the European Semantic Web conference, who generously put their winnings behind the bar. The venue may have been hidden away in Hammersmith, but the room was packed and the general atmosphere was one of expectation and enthusiasm.

I am still in the process of trying to grasp the issues surrounding the Semantic Web, and whilst some of the presentations at this event were a little over my head, there was certainly a great deal to inform and interest, with a good mix of people, including programmers, information professionals and others, although I was probably the only archivist!

One of the key messages that came across was the importance of HTTP URIs, without which Linked Data cannot work. URIs may commonly be URLs, but essentially they are also unique identifiers, and this is what matters about them. We heard about what the BBC are up to from Tom Scott. They are making great strides with Linked Data, creating identifiers for every programme, in order to make the programme into an entity. But there are identifiers for a great deal more than just programmes - natural history is a subject area they have been focussing on, and now they have identifiers for animals, for groups of animals, for species, for where they live, etc. By ensuring that all of these entities have URIs it is possible to think about linking them in imaginative ways. Furthermore, relationships between entities have URIs - this is where the idea of triples comes in, referring to the concept of a subject linked to an object through a relationship.

The three parts of each triple are called its subject, predicate, and object. A triple mirrors the basic structure of a simple sentence, such as: the Archives Hub is based at Mimas. The Hub is the subject, 'is based at' is the predicate, and Mimas is the object.

Whilst humans may read sentences such as this and understand the entities and the relationships, the Semantic Web vision is that machines can do the same - finding, sharing, analysing and combining information.
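To make the idea of a triple a little more concrete, here is a minimal sketch in Python of the 'Archives Hub is based at Mimas' example. It is only an illustration under some assumptions: the example.org URIs, the `Triple` type and the `match` helper are all invented for this sketch and are not real identifiers or part of any RDF library. A real project would more likely use an RDF toolkit and a triple store.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One statement: a subject linked to an object through a predicate."""
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs, invented for illustration only.
HUB = "http://example.org/id/archives-hub"
MIMAS = "http://example.org/id/mimas"
BASED_AT = "http://example.org/vocab/isBasedAt"

# "The Archives Hub is based at Mimas" expressed as a single triple.
triples = [Triple(HUB, BASED_AT, MIMAS)]


def match(
    store: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# A machine-readable version of the question "Where is the Hub based?"
answers = [t.obj for t in match(triples, subject=HUB, predicate=BASED_AT)]
print(answers)
```

Because the subject, the predicate and the object are all identified by URIs, the same pattern-matching question could in principle be asked across data published by many different people.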

Issues such as sustainability were raised, and the great need to make Linked Data easier to create and use. We heard about DataIncubator.org, a project that is creating and publishing Linked Data. The Talis Connected Commons scheme offers free access to the Talis platform for public domain data, which means you have access to an online triple store. Talis will host the data, although the end goal is for the original curator of the data to take it back and publish it themselves. But this does seem to be a great way to help get the momentum going on Linked Data. Talis are one of the leading suppliers of library software, but clearly they have decided to put their weight behind the Semantic Web, and they are keen to engage the community in this by providing help and support with dataset conversion, that is to say, conversion of data into RDF.

There was some talk of the need to encourage community norms, for example, with linking and attribution, something that is particularly important when taking someone else's data. People should be able to trace the path back to the original dataset. Another issue that came up was the need to work together, in particular to avoid different people converting the same dataset. It is important to make all of the code available and to benefit from shared expertise. It was very obvious that the people taking part in this event and showing us their projects were keen to collaborate and take a very open approach.

Leigh Dodds from Talis explained that dataincubator.org has already converted some major datasets, such as the NASA space flight dataset, which includes every space flight launch since 1950, and OpenLibrary, which already publishes RDF, although the modelling of the data was not great and so Talis have helped with this. The data that Leigh talked about is already in the public domain, so the essential task is to model it for output as RDF. Leigh gave us two datasets from his wish list for possible conversion: the Prelinger Archives, a collection of over 2,000 historic films (the content is in the Internet Archive), and Lego, which adds a fun element and would mean a meeting of similar minds, as people into Lego are generally as anal as those who are into the Semantic Web!

Whilst many of the participants at the Linked Data Meetup were enthusiastic programmers rather than business people or managers, there was still a sense of the importance of the business case and taking a more intelligent approach to promotion and marketing.

Archivists are always very interested in issues of privacy, rights, and the ownership of data, and these issues were also recognised and discussed, though not in any detail. There did seem to be a rather curious suggestion of changing copyright law to 'protect facts', and thus bring it more in line with what is happening in the online environment.

As well as examples of what is happening at the BBC, we heard about various other projects, such as Timetric, a project to enable people to find, store, share, track, publish and understand statistics. This is essentially about linking statistics and URIs and creating meaningful relationships between numbers. One of the interesting observations made here was that it is better to collect the data first and then decide how to sort and present it, rather than beforehand, because otherwise you may design something that does not fit in with what people want.

For me, the Government Data Panel was one of the highlights of the day. It gave me a good sense of what is happening at the moment with Linked Data and what the issues are. Tim Berners-Lee (inventor of the Web) and Nigel Shadbolt talked about the decision to prioritise UK government data within the Linked Data project. Clearly it is of great value for a whole host of reasons, and a critical mass of data can be achieved if the government are on board. We should also not forget that it is 'our data', so it should be opened up to us - public sector data touches all of us: businesses, institutions, individuals, groups, processes, etc.

The Linked Data project is not about changing the way government data is managed but about access, enabling the data to be used by all kinds of people for all kinds of things. It is not just about transparency, but about actually running things better - it may increase efficiencies if the data is opened up in this way. Tim Berners-Lee told us how government ministers tended to refer to 'the database' of information, as in the creation of one massive database, a misconception of what this Linked Data project is all about. Ministers have also raised worries about personal data, about whether this project will require more time and effort from them, and whether they will have to change their practices. But within government there are a few early adopters who 'get it', and it will be important to try to clone that understanding! There was brief mention, in passing, of the Ordnance Survey being required to make money to run its operations, which makes it a problem to get hold of this data. Similarly, when parts of the public sector were privatised, the franchises took the data with them (e.g. train timetables).

Location data was recognised as being of great importance. A huge percentage of data has location in it, and it can act as a hub to join disparate datasets. We need an RDF datastore of counties, authorities, constituencies, etc., and we should think about the importance of using the same identifier for a particular location so that we can use the location data in this way.
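As a rough illustration of location acting as a hub, the sketch below joins two unrelated datasets purely because they use the same identifier for the same place. Everything in it is an assumption made for the example: the example.org county URIs, the `clinics` and `schools` fields, the figures and the `join_on_location` helper are all made up, and nothing here comes from real government data.

```python
from typing import Dict, List

# Hypothetical location identifiers, invented for illustration only.
COUNTY_A = "http://example.org/id/county/a"
COUNTY_B = "http://example.org/id/county/b"
COUNTY_C = "http://example.org/id/county/c"

# Two made-up datasets that know nothing about each other,
# except that both refer to places by the same URIs.
health = [
    {"location": COUNTY_A, "clinics": 2},
    {"location": COUNTY_B, "clinics": 4},
]
education = [
    {"location": COUNTY_A, "schools": 5},
    {"location": COUNTY_C, "schools": 7},
]


def join_on_location(left: List[Dict], right: List[Dict]) -> List[Dict]:
    """Merge rows from both datasets that share a location identifier."""
    right_by_location = {row["location"]: row for row in right}
    joined = []
    for row in left:
        other = right_by_location.get(row["location"])
        if other is not None:
            # Same URI in both datasets, so the rows describe the same place.
            joined.append({**row, **other})
    return joined


joined = join_on_location(health, education)
print(joined)
```

Only the county that appears in both datasets under the same URI is joined. If each dataset had coined its own identifier for that county, the join would silently find nothing, which is why agreeing on identifiers matters so much.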

There was recognition that we have tended to conflate Linked Data and open data, but they are different. It is important to stress that open data may not be data that is linked up, and Linked Data may not be open; it may have restricted access. But if we can start to join up datasets, we can bring whole new value to them, for example, combining medical and educational data in different ways, maybe in ways we have not yet thought about. We want to shift the presumption that the data should be held close unless a reason is given to give it up (an FoI request!). If the data can be made available through FoI, then why not provide it as Linked Data?

One of the big challenges that was highlighted was with local government, where attitudes are not quite so promising as with central government. Unfortunately, as one panel member put it, we are not in a benevolent dictatorship so we cannot order people to open up the data! It is certainly a difficult issue, and although it was pointed out that there are some examples of local authorities who are really keen to open up their data, many are not, and Crown copyright does not apply to local authorities.

Tim encouraged us all to make RDF files, create tools, enable mash-ups, and so on, so that people can take data and do things with it. So, do go and visit http://data.gov.uk once it is up and running and show that you support the initiative.

Whilst other initiatives in e-government and standards do appear to have come and gone, it may be that we wouldn't have got to where we are now without them, so often these things are all part of the evolutionary process. The approach to the Linked Data Project is bottom-up, which is important for its sustainability. Whilst the support of the Prime Minister is important, in a way it is the support of the lower levels in government that is more important.

The Semantic Web could bring enormous benefits if it is realised. The closing presentation by Tom Heath, from Talis, gave a sense of this, as well as a realistic assessment of what lies ahead. The work that is going on demonstrated what might be achievable, but it also demonstrated that we are in the very early stages of this journey. There are huge challenges around the quality of the data and disambiguation. I find it exciting because it takes us along the road of computers as intelligent agents, opening up data and enabling it to be used in new and imaginative ways.

If any archivists out there are thinking of doing anything with Linked Data we would be very interested to hear from you!

Image from: http://linkeddata.org/home

Labels: linked data, rdf, semantic web

22 September 2009

A few thoughts on context and content

11 September 2009

Linked Data: towards the Semantic Web

01 September 2009

The Spanish Civil War
