03 March 2010

Linked Data: one thing leads to another

I attended another very successful Linked Data meetup in London on 24 February. This was the second meetup, and Linked Data has clearly been generating a great deal of interest, as around 200 people signed up to attend. All of the speakers were excellent, and struck that very helpful balance between being expert and informative whilst getting their points across clearly and in a way that non-technical people could understand.
Tom Heath (Talis) took us back to the basic principles of Linked Data. It is about taking statements and expressing them in a way that meshes more closely with the architecture of the Web. This is done by assigning identifiers to things. A statement such as 'Jane Stevenson works at Mimas' can be broken down and each part can be given a URI identifier. I have a URI (http://www.archiveshub.ac.uk/janefoaf.rdf) and Mimas has a URI (http://www.mimas.ac.uk/). The predicate that describes the relationship, 'worksAt', must also have a URI. Creating statements like this, we can start linking datasets together.
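As a rough illustration of the statement above, here is a minimal Python sketch that holds the three parts as URIs and writes them out as one line of N-Triples (a simple text format for RDF). The two URIs for Jane and Mimas are the ones given in this post; the 'worksAt' predicate URI is a made-up placeholder, since no real one was given.

```python
# A statement is a (subject, predicate, object) triple of URIs.
JANE = "http://www.archiveshub.ac.uk/janefoaf.rdf"  # URI given in the post
MIMAS = "http://www.mimas.ac.uk/"                   # URI given in the post
WORKS_AT = "http://example.org/vocab/worksAt"       # hypothetical predicate URI


def to_ntriples(subject, predicate, obj):
    """Serialise one triple of URIs as a single N-Triples line."""
    return "<{}> <{}> <{}> .".format(subject, predicate, obj)


statement = (JANE, WORKS_AT, MIMAS)
line = to_ntriples(*statement)
print(line)
```

Because each part is a URI rather than a plain string, another dataset that uses the same URI for Mimas is automatically talking about the same thing.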
Tom talked about how this Linked Data way of thinking challenges the existing metaphors that drive the Web, and this was emphasised throughout the sessions. The document metaphor is everywhere - we use it all the time when talking about the Web; we speak about the desktop, about our files, and about pages as documents. It is a bit like thinking about the Web as a library, but is this a useful metaphor? Maybe we should be moving towards the idea of an exploratory space, where we can reach out, touch and interact with things. Linked Data is not about looking for specific documents. If I take the Archives Hub as an example, Linked Data is not so much concerned with the fact that there is a page (document) about the Agatha Christie archive; what it is concerned with is the things/concepts within that page. Agatha Christie is one of the concepts, but there are many others - other people, places, subjects. You could say the description is about many things that are linked together in the text (in a way humans can understand), but it is presented as a page about Agatha Christie. This traditional way of thinking hides references within documents; they are not 'first class citizens of the web' in themselves. Of course, a researcher may want information about Agatha Christie archives, and then this description will be very relevant. But they may be looking for information about other concepts within the page. If 'Torquay' and 'novelist' and 'nursing' and 'Poirot' and all the other concepts were brought to the fore as things in their own right, then the data could really be enriched. With Linked Data you can link out to other data about the same concepts and bring it all together.
Tom spoke very eloquently about how you can describe any aspect you like of any thing you like by giving identifiers to things - it means you can interact with them directly. If a researcher wants to know about the entity of Agatha Christie, the Linked Data web would allow them to gather information about that topic from many different sources; if concepts relating to her are linked in a structured way, then the researcher can undertake a voyage of discovery around their topic, utilising the power that machines have to link structured data, rather than doing all the linking up manually. So, it is not a case of gathering 'documents' about a subject, but of gathering information about a subject. However, if you have information on the source of the data that you gather (the provenance), then you can go to the source as well. Linked Data does not mean documents are unimportant, but it means that they are one of the things on the Web along with everything else.
Having a well-known data provider such as the BBC involved in Linked Data provides a great example of what can be done with the sort of information that we all use and understand. The BBC Wildlife Finder is about concepts and entities in the natural world. People may want to know about specific BBC programmes, but they are more likely to want to know about lions, or tigers, or habitats, or breeding, or other specific topics covered in the programmes. The BBC are enabling people to explore the natural world by using Linked Data. What underlies this is the importance of having URIs for all concepts. If you have these, then you are free to combine them as you wish. All resources, therefore, have HTTP URIs. If you want to talk about the sounds that lions make, or just the programmes about lions, or just one aspect of a lion's behaviour, then you need to make sure each of these concepts has an identifier.
Wildlife Finder has almost no data itself; it comes from elsewhere. They pull stuff onto the pages, whether it is data from the BBC or from elsewhere. DBPedia (Wikipedia output in RDF) is particularly important to the BBC as a source of information. The BBC actually go to Wikipedia and edit the text there, something that benefits Wikipedia and other users of Wikipedia. There is no point replicating data that is already available. DBPedia provides a big controlled vocabulary - you can use the URI from Wikipedia to clarify what you are talking about, and it provides a way to link stuff together.
Tom Scott from the BBC told us that the BBC have only just released all the raw data as RDF. If you go to the URL it content negotiates to give you what you want (though he pointed out that it is not quite perfect yet). Tom showed us the RDF data for an Eastern Gorilla, providing all of the data about concepts that go with Eastern Gorillas in a structured form, including links to programmes and other sources of information.
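To give a feel for what content negotiation involves, here is a small Python sketch of the server-side decision: the client sends an Accept header listing the formats it prefers, and the server picks the best one it can offer for that URL. This is a simplified illustration and not how the BBC implement it; the function names, the list of available formats and the fallback to HTML are all assumptions, and partial wildcards such as 'text/*' are ignored.

```python
def parse_accept(header):
    """Return the media types in an Accept header, most preferred first."""
    prefs = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((-q, position, media_type))
    # Highest q first, ties broken by order of appearance; q=0 means "not wanted"
    return [m for neg_q, _, m in sorted(prefs) if neg_q < 0]


def negotiate(accept_header, available, default="text/html"):
    """Pick the representation to serve for one URL (simplified)."""
    for media_type in parse_accept(accept_header):
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return default
    return default  # assumption: fall back to the HTML page


# Formats assumed to be on offer at the same URL
AVAILABLE = {"text/html", "application/rdf+xml"}

print(negotiate("application/rdf+xml", AVAILABLE))
print(negotiate("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8", AVAILABLE))
```

A browser asking for HTML gets the web page, while a program asking for 'application/rdf+xml' gets the raw data, both from the same URL.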
Having two heavyweights such as the BBC and the UK Government involved in Linked Data certainly helps give it momentum. The Government appears to have understood that the potential for providing data as open Linked Data is tremendous, in terms of commercial exploitation, social capital and improving public service delivery. A number of times during the sessions the importance of doing things in a 'web-centric' way was emphasised. John Sheridan from The National Archives talked about data.gov.uk and the importance of having 'data you can click on'. Fundamentally, Linked Data standards enable the publication of data in a very distributed way. People can gather the data in ways that are useful to them. For example, with data about schools, what is most useful is likely to be a combination of data, but rather than trying to combine the data internally before publishing it, the Government want all the data providers to publish their data and then others can combine it to suit their own needs - you don't then have to second guess what those needs are.
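The idea of publishing separately and combining later can be sketched in a few lines of Python. Two providers each publish statements about the same school, and a consumer merges them simply because both use the same URI for it. Every URI and value below is invented for illustration; none of it comes from data.gov.uk.

```python
# Hypothetical identifiers, for illustration only
SCHOOL = "http://example.org/id/school/100001"
PUPILS = "http://example.org/def/numberOfPupils"
TOWN = "http://example.org/def/town"

# Each provider publishes its own (subject, predicate, value) statements
census_data = [(SCHOOL, PUPILS, "410")]
geography_data = [(SCHOOL, TOWN, "Exampleton")]


def combine(*datasets):
    """Merge statements from several sources, grouped by subject URI."""
    merged = {}
    for dataset in datasets:
        for subject, predicate, value in dataset:
            merged.setdefault(subject, {}).setdefault(predicate, []).append(value)
    return merged


combined = combine(census_data, geography_data)
for predicate, values in combined[SCHOOL].items():
    print(predicate, values)
```

Neither provider had to know what the other was publishing; the shared URI does the joining.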
Jeni Tennison, from data.gov.uk, talked about the necessity of working out core design patterns to allow Linked Data to be published fast and relatively cheaply. I felt that there was a very healthy emphasis on this need to be practical, to show benefits and to help people wanting to publish RDF. You can't expect people to just start working with RDF and SPARQL (the query language for RDF). You have to make sure it is easy to query and process, which means creating nice friendly APIs for them to use.
Jeni talked about laying tracks to start people off, helping people to publish their data in a way that can be consumed easily. She referred to 'patterns' for URIs for public sector things, definitions, classes and datasets, and to providing recommendations on how to make URIs persistent. The Government have initial URI sets for areas such as legislation, schools, geographies, etc. She also referred to the importance of versioning: with things having multiple sources and multiple versions over time, it is important to be able to relate back to previous states. They are looking at using named graphs in order to collect together information that has a particular source, which provides a way of getting time-sliced data. Finally, ensuring that provenance is recorded (where something originated, processing, validation, etc.) helps with building trust.
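A named graph can be thought of as a fourth element added to each statement, saying which collection (and so which source and date) the statement belongs to. The Python sketch below shows how that makes time-sliced data possible. It is only an illustration of the idea, not the data.gov.uk design; all the URIs, figures and dates are invented.

```python
from datetime import date

# Hypothetical identifiers, for illustration only
SCHOOL = "http://example.org/id/school/100001"
PUPILS = "http://example.org/def/numberOfPupils"
G2008 = "http://example.org/graph/census-2008"
G2009 = "http://example.org/graph/census-2009"

# Each statement carries the URI of the named graph it belongs to
quads = [
    (SCHOOL, PUPILS, "410", G2008),
    (SCHOOL, PUPILS, "435", G2009),
]

# Provenance is recorded once per graph rather than once per statement
graph_info = {
    G2008: {"source": "http://example.org/id/dept/education",
            "published": date(2008, 9, 1)},
    G2009: {"source": "http://example.org/id/dept/education",
            "published": date(2009, 9, 1)},
}


def as_of(quads, graph_info, when):
    """Return {(subject, predicate): value} as it stood on the date 'when'."""
    state = {}
    ordered = sorted(quads, key=lambda q: graph_info[q[3]]["published"])
    for subject, predicate, value, graph in ordered:
        if graph_info[graph]["published"] <= when:
            state[(subject, predicate)] = value  # later graphs overwrite earlier
    return state


print(as_of(quads, graph_info, date(2008, 12, 31)))
print(as_of(quads, graph_info, date(2010, 1, 1)))
```

The older figure is never thrown away, so it is always possible to go back and see what was stated at an earlier point and where it came from.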
There was some interesting discussion on responsibilities for minting URIs. Certain domains can be seen to have responsibilities for certain areas, for example, the government minting URIs for government departments and schools, the World Health Organisation for health related concepts. But should we trust DBPedia URIs? This is an area where we simply have to make our own judgements. The BBC reuse the DBPedia URI slugs (the string-part in a URL to identify, describe and access a resource) on their own URLs for wildlife, so their URLs have the 'bbc.co.uk' bit and the DBPedia bit for the resource. This helps to create some cohesion across the Web.
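The slug reuse described above might look something like this in Python. The DBPedia resource URI follows the usual 'http://dbpedia.org/resource/...' form, but the BBC base path is an assumption made purely for illustration, as the exact URL pattern was not given.

```python
from urllib.parse import urlsplit

# Assumed base path, for illustration only; not confirmed as the real BBC pattern
BBC_BASE = "http://www.bbc.co.uk/nature/species/"


def slug_of(uri):
    """Return the last path segment of a URI, i.e. its slug."""
    path = urlsplit(uri).path
    return path.rstrip("/").rsplit("/", 1)[-1]


def bbc_url_for(dbpedia_uri, base=BBC_BASE):
    """Build a BBC-style URL that reuses the DBPedia slug."""
    return base + slug_of(dbpedia_uri)


slug = slug_of("http://dbpedia.org/resource/Eastern_Gorilla")
url = bbc_url_for("http://dbpedia.org/resource/Eastern_Gorilla")
print(slug)
print(url)
```

Anyone who knows the DBPedia identifier for a species can then predict the matching BBC address, and the other way round.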
There was also discussion about the risks of costs and monopolies - can you rely on data sources long-term? Might they start to charge? Speakers were asked about the use of HTTP URIs - applications should not need to pick them apart in order to work out what they mean. They are opaque identifiers, but they are used by people, so it is useful for them to be usable by people, i.e. readable and understandable. As long as the information is made available in the metadata we can all use it. But we have got to be careful to avoid using URIs that are not persistent - if a page title in Wikipedia changes, the URI changes, and if the BBC are using the Wikipedia URI slug then that is a problem. Tom Scott made the point that it is worth choosing persistence over usability.
The development of applications is probably one of the main barriers to uptake of Linked Data. It is very different from, and more challenging than, building applications on top of a known database under your control. In Linked Data, applications need to access multiple datasets.
The session ended by again stressing the importance of thinking differently about data on the Web: to start with things that people care about, not with a document-centric way of thinking. This is what the BBC have done with the Wildlife Finder. People care about lions, about the savannah, about hunting, about life-span, not about specific documents. It is essential to identify the real world things within your website. It is the modelling that is one of the biggest challenges - thinking about what you are talking about and giving those things URIs. A modelled approach means you can start to let machines do the things they are best at, and leave people to do the things that they are best at.
Post by Jane Stevenson (jane.stevenson@manchester.ac.uk)
Image: Linked Data Meetup, February 2010, panel discussion.
