Releasing data service software as free open source software
Linked data is a way of representing and joining information from a variety of sources to allow it to be accessed, browsed, searched and used as easily as one would browse the web. One of the principles of linked data is that URIs are used to name things, whether these be people, places, books, software, magazines, departments, machines, etc. As anyone can develop their own linked data sets, and propose their own URIs, many URIs may be created for the same thing. sameAs.org is a service offered by Seme4 Limited that allows users to find out which URIs refer to the same thing. sameAs Lite is a refactored, open source version of the software that powers sameAs.org. We are providing consultancy to Seme4 on how to improve sameAs Lite for deployers and developers and how to promote community engagement.
Many URIs represent the same thing
Consider these URIs:
- http://sws.geonames.org/2650225/ from the GeoNames geographical database which covers all countries and contains over 8,000,000 place names.
- http://data.ordnancesurvey.co.uk/doc/50kGazetteer/81482, from the Ordnance Survey gazetteer.
- http://data.nytimes.com/N45752581330994625741, from The New York Times news vocabulary of 10,000 subject headings.
- http://dbpedia.org/resource/Auld_Reekie, from DBpedia, a crowd-sourced community effort to extract and publish structured information from Wikipedia.
These URIs all refer to the city of Edinburgh. As they refer to the same thing, they are termed co-referent. Determining these co-references, that is, working out which URIs, produced by different authors for different purposes, refer to the same things, is one way in which these distributed data sets can be linked and explored as if they were one virtual data set.
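Co-reference behaves like an equivalence relation: if A refers to the same thing as B, and B to the same thing as C, then all three belong to one group. The following is a minimal Python sketch of that idea using a union-find structure. The class name `CoreferenceIndex` and its methods are hypothetical and chosen for illustration only; this is not how sameAs.org or sameAs Lite is implemented.

```python
class CoreferenceIndex:
    """Illustrative only: groups URIs into sets of co-referent URIs (union-find)."""

    def __init__(self):
        # Maps each known URI to its parent; a root URI is its own parent.
        self._parent = {}

    def _find(self, uri):
        # Register unseen URIs as their own singleton group.
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every URI on the path directly at the root.
        while self._parent[uri] != root:
            next_uri = self._parent[uri]
            self._parent[uri] = root
            uri = next_uri
        return root

    def assert_same(self, uri_a, uri_b):
        """Record that two URIs refer to the same thing, merging their groups."""
        root_a, root_b = self._find(uri_a), self._find(uri_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def coreferent(self, uri):
        """Return every known URI in the same group as `uri` (including itself)."""
        if uri not in self._parent:
            return {uri}
        root = self._find(uri)
        return {u for u in list(self._parent) if self._find(u) == root}


index = CoreferenceIndex()
index.assert_same("http://sws.geonames.org/2650225/",
                  "http://dbpedia.org/resource/Auld_Reekie")
index.assert_same("http://dbpedia.org/resource/Auld_Reekie",
                  "http://data.nytimes.com/N45752581330994625741")

# The New York Times URI was never paired with the GeoNames URI directly,
# but both end up in the same group via the DBpedia URI.
bundle = index.coreferent("http://data.nytimes.com/N45752581330994625741")
print(sorted(bundle))
```

Only two pairwise statements were recorded, yet a lookup on any of the three URIs returns all three, which is what lets separately authored data sets be explored as one.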
sameAs.org
sameAs.org is a search engine that, if given a URI, returns URIs that are co-referent, should any be known to it. The engine searches data harvested from a number of sources. Searches can be initiated via a web form or via an HTTP-based API (e.g. http://sameas.org/?uri=http://sws.geonames.org/2650225/). Four output formats are supported: RDF XML, RDF N3, JSON and plain text. The data is made available under a Creative Commons 0 universal public domain dedication licence. A browser widget and a plug-in for Java linked data applications have been developed by third parties.
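As a rough illustration of the HTTP-based API, the Python sketch below builds a lookup URL of the same shape as the example above. The function name `build_lookup_url` is hypothetical, and the sketch percent-encodes the `uri` parameter, which the example in the text does not do; it assumes the service accepts the encoded form. It does not send the request, and how an output format is selected is not covered because the text does not describe it.

```python
from urllib.parse import urlencode

# Base URL taken from the example in the text.
SAMEAS_ENDPOINT = "http://sameas.org/"


def build_lookup_url(uri, endpoint=SAMEAS_ENDPOINT):
    """Return a URL that asks the service for URIs co-referent with `uri`.

    urlencode percent-encodes the value so that characters such as ':' and '/'
    inside the looked-up URI cannot be confused with the structure of the
    request URL itself.
    """
    return endpoint + "?" + urlencode({"uri": uri})


lookup_url = build_lookup_url("http://sws.geonames.org/2650225/")
print(lookup_url)
# prints: http://sameas.org/?uri=http%3A%2F%2Fsws.geonames.org%2F2650225%2F
```

The resulting URL could then be fetched with any HTTP client.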
sameAs.org was originally developed by Hugh Glaser and Ian Millard at the University of Southampton and has been live since 2009. Hugh and Ian are now senior technical personnel in Seme4 Limited, which provides development, consultancy and education services around Semantic Web and linked data technologies. Seme4's founding partners are Dame Wendy Hall and Sir Nigel Shadbolt of the University of Southampton, internationally recognised researchers in the Semantic Web. Seme4 retains strong links with the university.
Apart from its own data stores, sameAs.org also hosts linked data stores for a number of organisations, including Freebase (which helps power Google Knowledge Graph), the British Library, other national libraries including those of Spain, France, Norway, Germany and Hungary, VIAF (Virtual International Authority File), and the Ordnance Survey.
sameAs Lite
Many of these organisations may want to run their own data stores and Seme4 would like to help them do this. To this end, Seme4 has produced sameAs Lite, a refactored, free, open source version of the software that powers sameAs.org. Though sameAs.org is based on an older version of the software, it may be migrated to use sameAs Lite in due course.
sameAs Lite is implemented as two PHP libraries: one implements the core storage management functionality and the other implements a REST web application. These depend upon a number of other libraries, and PHP Composer is used for dependency management. It is intended that the core library can be used within other applications, outwith the REST web application. Originally the core library could be used as-is with SQLite, but it currently supports only MySQL. There is a desire to make it database agnostic, but without degrading performance.
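One common way to approach database agnosticism is to have callers depend on a small storage interface and to keep backend-specific SQL inside each implementation. The Python sketch below illustrates that general pattern with an in-memory backend and an SQLite backend. All class, method, table and column names are hypothetical; it is not the sameAs Lite API or schema, it uses Python rather than PHP, and it says nothing about performance.

```python
import sqlite3
from abc import ABC, abstractmethod


class Store(ABC):
    """Hypothetical backend-neutral interface for co-reference storage."""

    @abstractmethod
    def add_pair(self, uri_a, uri_b):
        """Record that two URIs are co-referent."""

    @abstractmethod
    def query(self, uri):
        """Return a sorted list of URIs co-referent with `uri`, or [] if unknown."""


class MemoryStore(Store):
    def __init__(self):
        self._canon = {}  # uri -> canonical URI of its group

    def add_pair(self, uri_a, uri_b):
        canon_a = self._canon.get(uri_a, uri_a)
        canon_b = self._canon.get(uri_b, uri_b)
        # Merge uri_b's group into uri_a's group.
        for uri, canon in list(self._canon.items()):
            if canon == canon_b:
                self._canon[uri] = canon_a
        self._canon[uri_a] = canon_a
        self._canon[uri_b] = canon_a

    def query(self, uri):
        canon = self._canon.get(uri)
        if canon is None:
            return []
        return sorted(u for u, c in self._canon.items() if c == canon)


class SqliteStore(Store):
    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute("CREATE TABLE IF NOT EXISTS symbols "
                         "(uri TEXT PRIMARY KEY, canon TEXT NOT NULL)")
        self._db.execute("CREATE INDEX IF NOT EXISTS idx_canon ON symbols (canon)")

    def _canon_of(self, uri):
        row = self._db.execute("SELECT canon FROM symbols WHERE uri = ?",
                               (uri,)).fetchone()
        return row[0] if row else None

    def add_pair(self, uri_a, uri_b):
        canon_a = self._canon_of(uri_a) or uri_a
        canon_b = self._canon_of(uri_b) or uri_b
        with self._db:  # one transaction: commit on success, roll back on error
            self._db.execute("UPDATE symbols SET canon = ? WHERE canon = ?",
                             (canon_a, canon_b))
            for uri in (uri_a, uri_b):
                # INSERT OR IGNORE is SQLite-specific, so it stays in this class.
                self._db.execute("INSERT OR IGNORE INTO symbols (uri, canon) "
                                 "VALUES (?, ?)", (uri, canon_a))

    def query(self, uri):
        canon = self._canon_of(uri)
        if canon is None:
            return []
        rows = self._db.execute("SELECT uri FROM symbols WHERE canon = ? "
                                "ORDER BY uri", (canon,))
        return [row[0] for row in rows]


def load_and_query(store):
    """Caller code sees only the Store interface, never the backend."""
    store.add_pair("http://sws.geonames.org/2650225/",
                   "http://dbpedia.org/resource/Auld_Reekie")
    store.add_pair("http://data.ordnancesurvey.co.uk/doc/50kGazetteer/81482",
                   "http://dbpedia.org/resource/Auld_Reekie")
    return store.query("http://sws.geonames.org/2650225/")


results = [load_and_query(store) for store in (MemoryStore(), SqliteStore())]
print(results[0] == results[1])
```

A MySQL backend would be a third class behind the same interface. The trade-off is that each backend can keep queries tuned to its own engine, at the cost of maintaining one implementation per database.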
A key non-functional requirement of the core library is that performance must be very good. As a result, a significant amount of performance analysis has been done with the core library on a variety of machines. Sample data for which the performance of the core library is known are available.
sameAs Lite is held in a repository within Seme4's project on GitHub.
Objectives
We will provide recommendations on how openness and community engagement can be improved, based upon a review of the sameAs Lite open source project infrastructure. Once that review is complete, we will review sameAs Lite from two perspectives: that of a deployer who wishes to set up a local deployment of sameAs Lite, and that of a developer setting up a development, build and test environment for maintaining, extending or fixing bugs in sameAs Lite. We will also provide recommendations as to how the sameAs Lite core library support for MySQL and SQLite can be refactored to be database agnostic without degrading performance.
Benefits
Releasing sameAs Lite as open source has the potential to deliver wider exploitation of the important ideas embodied in the software, including helping a number of organisations manage and use linked data more effectively. It is necessary, then, to turn potential users into actual users and one way this can be achieved is by ensuring that sameAs Lite is straightforward to download and deploy. If this is too difficult or time consuming, potential users may become disillusioned and discard sameAs Lite.
Communities that find sameAs Lite of benefit might have members who are interested in migrating to the commercial sameAs product, which provides additional data management facilities not present in sameAs Lite. They may also wish to purchase Seme4's value-added expertise to help them to exploit sameAs data stores and to work with systems that are "sameAs aware", including those platform services offered by Seme4.
Organisations that deploy sameAs Lite may want to customise, fix, or extend it for their own requirements. Such organisations need suitable documentation, and supporting resources, to help them set up a development environment to do this. Supporting organisations in this way can not only provide a free source of effort for bug fixes and feature development, but can also help to keep an open source product alive and to evolve a community around it.
Both deployers and developers may run into bugs, identify new features, or have questions about deployment or development. There needs to be a systematic way in which these can be communicated, recorded, managed and addressed. Suitably chosen open source project infrastructure can help to ensure that deployers and developers are not shouting into the void and that Seme4 does not become overwhelmed with support requests. It can also provide an environment in which deployers and developers can, and are encouraged to, help each other. This can help to evolve a community around sameAs Lite.