12 February 2010

It's all about YOU: Manchester as an Open Data city

There are plans afoot to declare Manchester as an Open Data city. At the Manchester Social Media Cafe last week I attended a presentation by Julian Tait, a founder of the Social Media Cafe, who talked to us about why this would be a good thing.

The Open Data initiative emerged from Future Everything 2009, a celebration of the digital future in art, music and ideas. But what is an Open Data city? It is based upon the principle that data is the lifeblood of a city: it allows cities to operate, function, develop and respond, to be dynamic and to evolve. Huge datasets are generally held in places that are inaccessible to much of the populace; they are largely hidden. If data is opened up, its applications can be hugely expanded and the possibilities are limitless.

There are currently moves by central government to open up datasets, to enable us to develop a greater awareness and understanding of society and of our environment. We now have data.gov.uk and we can go there and download data (currently around 2000 datasets) and use the data as we want to. But for data to have meaning to people within a city there has to be something at a city level; at a scale that feels more relevant to people in an everyday context.

Open data may be (should be?) seen as part of the democratic process. It brings transparency, and helps to hold government to account. There are examples of the move towards transparency - sites such as They Work For You, which allows us all to keep tabs on our MP, and MySociety. In the US, the District of Columbia has an initiative known as Apps for Democracy, providing prizes for innovative apps as a way to engage the community in 'digital democracy'.

The key here is that if data is thrown open it may be used for very surprising, unpredictable and valuable things: "The first edition of Apps for Democracy yielded 47 web, iPhone and Facebook apps in 30 days - a $2,300,000 value to the city at a cost of $50,000".

Mapumental is a very new initiative where you can investigate areas of the UK, looking at house price indexes, public transport data, etc. If we had truly open data, we could really build on this idea. We might be able to work out the best places to live if we want a quiet area with certain local amenities, and need to be at work by a certain time but have certain constraints on travel. Defra has a noise map of England, but it is not open information - we can't combine it with other information.

Julian felt that Open Data will only work if it benefits people in their everyday existence. This may be true on a city scale. On a national scale I think that people have to be more visionary. It may or may not have a discernible impact on everyday living, but it is very likely to facilitate research that will surely benefit us in the long term, be it medically, environmentally or economically.

The Open Data initiative is being sold on the idea of people becoming engaged, empowered and informed. But there are those who have reservations. What will happen if we open up everything? Will complex issues be simplified? Is there a danger that transparent information will encourage people to draw simplistic inferences, or come to the 'wrong' conclusions? Maybe we will lose the subtleties that can be found within datasets; maybe we will encourage misinformation. Maybe we will condemn areas of our cities to become ghettos? With so much information at our fingertips about where we should live, the 'better areas' might continue to benefit at the expense of other areas.

The key question is whether society is better off with the information or without the information. Certainly the UK Government is behind the initiative, and the recent 'Smarter Government' (PDF) document made a commitment to the opening up of datasets. The Government believes it can save money by opening up data, which, of course, is going to be a strong incentive.

For archivists the whole move towards numerous channels of information, open data, mashing up, recombining, reusing, keeping data fluid and dynamic is something of a nightmare from a professional point of view. In addition, if we start to see the benefits of providing everyone with access to all data, enabling them to do new and exciting things with it, then might we change our perspective on appraisal and selection? Does this make it more imperative that we keep everything?

Image: B of the Bang, Manchester


09 February 2010

Digital Preservation the Planets Way

As a representative of the UK Society of Archivists, which is a member of the Digital Preservation Coalition, I attended the first day of this 3-day event on a partial DPC scholarship. It gave an overview of digital preservation and of the Planets project. Planets is a 4-year, European Community-funded project with 16 partner organisations and a budget of 13.7 million euros, showing a high level of commitment from the EC. The programme is due to finish in May 2010. The Planets approach wraps preservation planning services, action services, characterisation services and a testbed within an interoperability framework. It seeks to respond to the OAIS reference model, and it became clear as the day went on that knowledge of OAIS terminology was useful in following the talks, which often referred to SIPs, AIPs and DIPs.

After a keynote by Sheila Anderson, Director of the Centre for E-Research at Kings College, which touched upon some of the important principles that underlie digital preservation and outlined some projects that the Centre is involved in, we got into a day that blended general information about digital preservation with quite detailed information about the Planets services and tools.

Ross King from the Austrian Institute of Technology gave a good overview, looking at the scale of the digital universe and the challenges and incentives to preserve. Between now and 2019, the volume of content that organisations need to hold will rise twenty-five fold, from an average of 20TB to over 500TB (from Are You Ready? Assessing Whether Organisations are Prepared for Digital Preservation - PDF). We would need about 1 trillion CD-ROMs to hold all of the digital information produced in 2009. Importantly, we have now reached a point at which information creation is exceeding storage capacity, so the question of what to preserve is becoming increasingly important. I found this point interesting, as at the last talk that I attended on digital preservation we heard the old cry of 'why not keep everything digital - storage is no problem'.

Digital preservation is about using standards, best practices and technologies to ensure access over time. With digital information, there are challenges around bit-stream preservation (bytes and hardware) and logical preservation (software and format). Expounding on the challenge of formats, King said that typically knowledge workers produce at least two thirds of their documents in proprietary formats. These formats have high preservation risks relating to limited long-term support and limited backwards-compatibility.

The preservation planning process is also vital, and Planets provides help and guidance on this. It is important to know what we want to preserve, profile the collections and identify the risks in order to mitigate them. Hans Hofman of the National Archives of the Netherlands gave an introduction to preservation planning. A preservation plan should define a series of preservation actions that need to be taken to address identified risks for a given set of digital objects or records. It is the translation of a preservation policy. He talked about the importance of looking at objects in context, and about how to prepare in order to create a preservation planning strategy: the need to understand the organisational context, the resources and skills that are available. Often small organisations simply do not have the resources, and so large organisations inevitably lead the way in this area. Hans went through the step-by-step process of defining requirements, evaluating alternatives, analysing results and ending up with recommendations with which to build a preservation plan, which then needs to be monitored over time.

Planets has developed a testbed to investigate how preservation services act on digital objects (now open to anyone, see https://testbed.planets-project.eu/testbed/). Edith Michaeler of the Austrian National Library explained that this provides a controlled environment for experimentation with your own data, as well as with structured test data that you can use (Corpora). It enables you to identify suitable tools and make informed decisions. The testbed is run centrally, so everything is made available to users and experiments can benefit the whole community. It is very much integrated within the whole Planets framework. Edith took us through the 6 steps to run an experiment: defining the properties, designing the experiment, running it, the results, the analysis and the evaluation. The testbed enables experiments in migration, load test migration and viewing in an emulator as well as characterisation and validation. So, you might use the testbed to answer a question such as which format to migrate a file to, or to see if a tool behaves in the way that you expected.

The incentives for digital preservation are many, and for businesses those around legislative compliance and clarification of rights may be important. But business decisions are generally short-term and based on a calculated return on investment. So maybe we need to place digital preservation in the area of risk management rather than investment. The risk needs to be quantified, which is not an easy task. How much is produced? What are the objects worth? How long do they retain their value? What does it cost to preserve? If we can estimate the financial risk, we can justify the preventative investment in digital preservation. (see MK Bergman, Untapped Assets - PDF).
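The risk argument above can be reduced to simple arithmetic: compare the expected annual loss against the cost of prevention. A minimal sketch follows; the collection value, loss probability and preservation cost are hypothetical placeholders, not figures from Planets or Bergman.

```python
# Toy expected-loss calculation for framing digital preservation as
# risk management. All figures are invented for illustration.

def expected_annual_loss(asset_value: float, annual_loss_probability: float) -> float:
    """Expected financial loss per year if no preservation action is taken."""
    return asset_value * annual_loss_probability

def preservation_justified(asset_value: float,
                           annual_loss_probability: float,
                           annual_preservation_cost: float) -> bool:
    """Preservation pays for itself when expected loss exceeds its cost."""
    return expected_annual_loss(asset_value, annual_loss_probability) > annual_preservation_cost

# A collection notionally valued at 500,000 with a 5% annual risk of loss
# gives an expected loss of 25,000 per year, so spending 10,000 per year
# on preservation would be justified on these (invented) numbers.
print(preservation_justified(500_000, 0.05, 10_000))
```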

During the panel discussion, the idea of 'selling' digital preservation on the basis of risk was discussed. Earlier in the day William Kilbride, director of the Digital Preservation Coalition, talked about digital preservation as sustaining opportunities over time, and for many delegates this was much more in tune with their sentiments. He outlined the work of the DPC, and emphasised the community-based and collaborative approach it takes to raising awareness of digital preservation.

Clive Billenness went through how Planets works with the whole life-cycle of digital preservation:

1. Identify the risks
2. Assess the risks (how severe they are, whether they are immediate or long-term)
3. Plan and evaluate
4. Implement the plan
5. Update and review

The cycle will be repeated if there is a new risk trigger, which might be anything that signifies a change in practice, whether a change in policy, in the business environment or in the technical environment. For the whole life-cycle, Planets has tools to help: the Plato Preservation Planning Tool, the Planets Characterisation Services, the Testbed and the Planets Core Registry, a file format registry based upon PRONOM that includes preservation action tools and file formats and takes a community-based approach to preservation.

Types of preservation action were explained by Sara van Bussel of the National Library of the Netherlands. She talked about logical preservation and accessing bit streams, and how interpretation may depend on obsolete operating systems, applications or formats. Sara summarised migration and emulation as preservation processes. Migration means changing the object over time to make it accessible in the current environment, whatever that may be. This risks introducing inconsistencies, functionality can be lost and quality assessment can be difficult. Migration can happen before something comes into the system or whilst it is in the system. It can also happen on access, so it is demand-led. Emulation means changing the environment over time, so no changes to the object are needed. But it is technically challenging and the user has to have knowledge about the original environment. An emulator emulates a hardware configuration. You need your original operating system and software, so they must be preserved. Emulation can be useful for viewing a website in a web archive, for opening old files, from Word Perfect files to databases, and for executing programs, such as games or scientific applications. It is also possible to use migration through emulation, which can get round the problem of a migration tool becoming obsolete.
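As a toy illustration of the migration idea - not a Planets tool, and with all names invented for the example - here is a sketch that 'migrates' a text file from an ageing Latin-1 encoding to UTF-8, followed by the kind of quality check Sara mentioned:

```python
# A toy "migration" preservation action: re-encode a text file from
# Latin-1 to UTF-8 so it remains readable in a current environment.
# Illustrative sketch only; real migrations involve far richer formats.
from pathlib import Path

def migrate_latin1_to_utf8(src: Path, dst: Path) -> None:
    """Read the original object as Latin-1 and write a migrated UTF-8 copy."""
    text = src.read_text(encoding="latin-1")
    dst.write_text(text, encoding="utf-8")

def migration_preserved_content(src: Path, dst: Path) -> bool:
    """Quality assessment: does the migrated object decode to the same text?"""
    return src.read_text(encoding="latin-1") == dst.read_text(encoding="utf-8")
```

Even this trivial case shows why quality assessment matters: the bytes of the two files differ, so equivalence has to be checked at the level of the interpreted content, not the bit-stream.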

We were told about the Planets Gap Analysis (PDF), which looked at existing file formats, and found 137 different formats in 76 institutions. The most archived file formats in archives, libraries and museums are tiff, jpg, pdf and xml, but archives hardly archive the mp3 format, while libraries and museums frequently do. Only 22% of the archived file formats were found in four or more institutions, and only two file formats, tiff and jpg, were found in over half of all institutions. So, most preservation action tools are for common file formats, which means that more obscure file formats may have a problem. However, Sara gave three examples where the environment is quite different. For DAISY, which is a format for audio books for the blind, there is a consortium of content providers who address issues arising with new versions of the format. For FITS, a format for astronomical data, digital preservation issues are often solved by the knowledgeable user-base. But with sheet music the community is quite fragmented and uncoordinated, so it is difficult to get a consensus to work together.

The Gap Analysis found that nine of the top ten file formats are covered by migration tools known or used by Planets partners. XML is not covered, but it is usually the target of a migration rather than the source, so maybe this is not surprising. Many tools are flexible, so they can address many types of format, but each organisation has specific demands that might not be fulfilled by available tools.

Manfred Thaller from the University of Cologne gave a detailed account of the Planets Characterisation Services. He drew attention to the most basic layer of digital information - the 1s and 0s that make up a bit-stream. He showed a very simple image and how it can be represented by 1s and 0s, with a '5, 6' to indicate the rows and columns (...or is that columns and rows? - the point being that information such as this is vital!). To act on a file you need to identify it, validate it, extract information and undertake comparison. If you do not know what kind of file you have - maybe you have a bit-stream but do not know what it represents - DROID can help to interpret the file, and it also assigns a permanent identifier to the file. DROID uses the PRONOM-based Planets File Format Registry.
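Thaller's example can be sketched in a few lines: the same run of 1s and 0s only becomes a picture once you know the rows and columns, which is exactly why such metadata is vital. The bit pattern below is invented for illustration:

```python
# The same bit-stream interpreted as an image only once its
# dimensions are known. The pattern here is purely illustrative.

def render_bitmap(bits: str, rows: int, cols: int) -> list[str]:
    """Slice a flat string of 1s and 0s into a rows x cols black/white grid."""
    assert len(bits) == rows * cols, "bit-stream length must match the stated dimensions"
    return [bits[r * cols:(r + 1) * cols] for r in range(rows)]

bits = "011110" "010010" "010010" "010010" "011110"  # 30 bits

# The same 30 bits yield two different pictures depending on the metadata:
for row in render_bitmap(bits, 5, 6):
    print(row)
print()
for row in render_bitmap(bits, 6, 5):
    print(row)
```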

Thaller emphasised that validation is often complex, and in real life we have to take files that are not necessarily valid. Planets has not built its own validation service, but it does provide tools like JHOVE. Extraction - the examination of what is really in a file - is easier to deal with. Many services extract some characteristics from a file. The traditional approach is to build a tool for each file format. The Planets Extensible Characterisation Language (XCL) approach is to have one tool which handles many kinds of files. It provides a file format description language as well as a general container format for file characterisation.
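As a sketch of the traditional one-tool-per-format style of extraction (not XCL itself), here is how one characteristic - image dimensions - can be read straight out of a PNG bit-stream, relying on the fixed layout of the PNG signature and IHDR chunk:

```python
# Per-format characteristic extraction: pull the width and height out
# of a PNG bit-stream. Sketch only; a real service extracts far more.
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def extract_png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) from a PNG bit-stream.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte 'IHDR' type,
    then width and height as big-endian 32-bit integers at offsets 16 and 20.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG bit-stream")
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

This is the approach XCL tries to generalise: instead of hard-coding each format's byte layout in a separate tool, describe it once in a format description language.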

Hannes Kulovitz from the Vienna University of Technology talked about how Plato, an interactive software tool provided by Planets, can help in preservation planning, and went through the process of defining requirements, evaluating alternatives, analysing results, making recommendations and building the preservation plan. In the ensuing discussion it became clear that the planning process is a major part of the preservation process, especially as each format requires its own plan. The plan should be seen as requiring a major investment of time and effort, and it will then facilitate more effective automation of the processes involved.

Ross King made a return to the podium to talk about integrating the components of digital preservation. There is no archive component being developed as part of Planets, so the assumption is that institutions already have this. His talk concentrated on workflows through Planets, with suggested templates for submission, migration and access. He then went on to give a case study of the British Library (tiff images of newspapers). The content is complex and it required the template to be changed to accommodate the requirements. He built up the workflow through the various stages and referred to options for using various Planets' tools along the way. I would have liked this case study to be enlarged and followed through more clearly, as giving an example helps to clarify the way that the numerous tools available as part of Planets may be used.

We ended with a glimpse of the future of Planets. Most outputs will be freely available, under an Apache 2 licence. But to get take-up there must be a sustainability plan to maintain and develop the software, ensure continued access to Planets services (currently based at the University of Glasgow), support partners who are committed to the use of Planets, grow the community of users and promote further research and development. With this in mind, a decision has been taken to form something along the lines of an Open Planets Foundation (OPF), a not-for-profit organisation, limited by guarantee under UK law but with global membership. There has already been a commitment to this and there is financial support, but Billenness was naturally reserved about being explicit here because the OPF will be a new body. There will be different classes of membership and the terms of membership are currently being finalised. But most of the software will remain free to download.

Image shows Planets Interoperability Framework.


01 February 2010

Charles Wesley (1707-88)

This month we highlight the new catalogue for the personal papers of Anglican minister, Methodist preacher and religious poet Charles Wesley (1707-88). There is an introduction by Dr. Gareth Lloyd, Methodist Archivist at the Methodist Archives and Research Centre, The University of Manchester, The John Rylands University Library.