To provide search functions, similarity measures and other functionality, CORE harvests both metadata and fulltext items from repositories. This raises questions about whether we are allowed to harvest metadata or fulltext items and, if so, what we are allowed to do with them once we have harvested them. In the first phase of CORE we relied on OAI-PMH to harvest metadata, and then used links from the harvested records to try to discover the related fulltext item.
This is the first in a series of blog posts looking at these issues, the problems we've encountered and the solutions we have put in place (so far). In this post I'm going to focus on the question of finding fulltext items from the metadata. This wasn't always straightforward. Not all repositories link to fulltext records from the metadata in the same way, and in many cases there is no direct link from the metadata to the fulltext record at all, only a link to the repository's web page for the record.
This (edited for brevity) example from the University of Cambridge (which uses the DSpace software) has a link in <dc:identifier>, which links to the html page describing the item. To get the fulltext, you then need to find the link to the pdf on that page and click through.
<oai_dc:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:title>Reading Lists in Cambridge: A Standard System?</dc:title>
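The click-through step described above (open the html page for the item, then find the link to the pdf) can be roughly automated. The following is only a minimal sketch: the function name `find_pdf_links`, the sample page and the `repository.example.org` URLs are all hypothetical, and the "path ends in .pdf" test is an assumed heuristic that will miss fulltext served from URLs without a file extension.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkFinder(HTMLParser):
    """Collects the targets of <a> tags that appear to point at pdf files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the URL of the item page.
        absolute = urljoin(self.base_url, href)
        # Assumed heuristic: a path ending in .pdf is probably the fulltext.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html, base_url):
    """Return absolute URLs of probable pdf links found in an item page."""
    finder = PdfLinkFinder(base_url)
    finder.feed(html)
    return finder.pdf_links


# Hypothetical item page, not copied from any real repository.
sample_html = """
<html><body>
  <a href="/about">About</a>
  <a href="/bitstream/1234/5678/1/thesis.pdf">View/Open</a>
</body></html>
"""
print(find_pdf_links(sample_html, "http://repository.example.org/handle/1234/5678"))
```

In practice the page would first have to be fetched from the URL given in <dc:identifier>, and a page may list several files, so the result is a list of candidates rather than a single answer.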
By contrast, this example from the University of Southampton (again edited) links directly to the pdf from <dc:identifier>, and links to the html page for the item using <dc:relation>:
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>A methodology for developing high damping materials with application to noise reduction of railway track</dc:title>
<dc:identifier>Ahmad, Nazirah (2009) A methodology for developing high damping materials with application to noise reduction of railway track. University of Southampton, Institute of Sound and Vibration Research, Doctoral Thesis, 250pp.</dc:identifier>
The lack of consistency here obviously raises some challenges for those wishing to harvest fulltext items.
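To illustrate what a harvester has to cope with, here is a minimal sketch that looks at both <dc:identifier> and <dc:relation> in an oai_dc record and sorts the URLs it finds into "probably fulltext" and "something else". The function name `candidate_links`, the sample record and its URLs are hypothetical, and the ".pdf" test is an assumed heuristic rather than anything defined by OAI-PMH or Dublin Core.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def candidate_links(record_xml):
    """Return (probable_fulltext, other_urls) from dc:identifier and dc:relation."""
    root = ET.fromstring(record_xml)
    fulltext, other = [], []
    for name in ("identifier", "relation"):
        for element in root.iter("{%s}%s" % (DC_NS, name)):
            value = (element.text or "").strip()
            # Skip citations and other values that are not web URLs.
            if not value.lower().startswith(("http://", "https://")):
                continue
            # Assumed heuristic: a URL ending in .pdf is probably the fulltext.
            if value.lower().split("?")[0].endswith(".pdf"):
                fulltext.append(value)
            else:
                other.append(value)
    return fulltext, other


# Hypothetical record; the URLs are placeholders, not real repository links.
sample_record = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
  <dc:identifier>http://repository.example.org/1234/1/thesis.pdf</dc:identifier>
  <dc:identifier>Author, A. (2009) Example title. Doctoral Thesis.</dc:identifier>
  <dc:relation>http://repository.example.org/1234/</dc:relation>
</oai_dc:dc>
"""
fulltext, other = candidate_links(sample_record)
print("probable fulltext:", fulltext)
print("other links:", other)
```

The "other" links are the ones that would still need a visit to the item page to discover the fulltext, which is exactly where the inconsistency between repositories bites.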
When I posted some questions around this topic to the ever-helpful code4lib mailing list, Godmar Back (http://people.cs.vt.edu/~gback/) pointed out that the OAI-PMH specification says "To facilitate access to the resource associated with harvested metadata, repositories should use an element in metadata records to establish a linkage between the record (and the identifier of its item) and the identifier (URL, URN, DOI, etc.) of the associated resource. The mandatory Dublin Core format provides the identifier element that should be used for this purpose." (from http://www.openarchives.org/OAI/openarchivesprotocol.html#UniqueIdentifier)
Note that this does not state what type of identifier should be used, and where a URL is used it does not say that the URL should resolve to the fulltext item in the browser (although it does suggest that it should identify the resource itself, not the description of the resource).
As part of the same discussion Raffaele Messuti (http://atomotic.com/) noted that in Italy records describing theses are required to do the following:
- Publish metadata as MPEG DIDL (see http://www.dlib.org/dlib/november03/bekaert/11bekaert.html)
- Populate dii:Identifier with a URL for the html web page (jump off page) describing the item
- Use didl:Component to represent each full text document composing the Item
From what I can see looking at an example (http://amsdottorato.cib.unibo.it/cgi/oai2?verb=GetRecord&metadataPrefix=...) the link to the actual resource is given in <didl:Resource> within <didl:Component>.
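Reading the fulltext links out of a record structured this way is fairly mechanical. The sketch below is based only on my reading of the structure described above: it assumes the DIDL namespace is `urn:mpeg:mpeg21:2002:02-DIDL-NS` and that each <didl:Resource> carries its link in a `ref` attribute alongside a `mimeType` attribute. The function name `resource_links` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for MPEG-21 DIDL; check it against the records you harvest.
DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"


def resource_links(didl_xml):
    """Return (url, mime_type) pairs for each Resource inside a Component."""
    root = ET.fromstring(didl_xml)
    links = []
    for component in root.iter("{%s}Component" % DIDL_NS):
        # Only direct children, so each Resource is reported once.
        for resource in component.findall("{%s}Resource" % DIDL_NS):
            ref = resource.get("ref")
            if ref:
                links.append((ref, resource.get("mimeType")))
    return links


# Hypothetical two-part item; the URLs are placeholders.
sample_didl = """
<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Component>
      <didl:Resource mimeType="application/pdf"
                     ref="http://repository.example.org/1/part1.pdf"/>
    </didl:Component>
    <didl:Component>
      <didl:Resource mimeType="application/pdf"
                     ref="http://repository.example.org/1/part2.pdf"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>
"""
for url, mime_type in resource_links(sample_didl):
    print(mime_type, url)
```

Because each file gets its own Component, an item made up of several files simply yields several links, with no guessing needed on the harvester's side.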
This approach feels useful not just because it introduces consistency, but also because it clearly answers the question of what to link to where the item described consists of multiple files/parts.
Creating a standard approach may prove successful for a small, well-defined community - and I think it would be useful for UK HE repository managers to work towards a standard approach, similar to the Italian etheses example. However, this would only solve the problem for CORE for a particular subset of repositories. CORE is already looking at harvesting repositories from outside the UK, and the wider we cast our net for repositories to harvest, the more likely we are to hit a variety of practices across communities.
So what will CORE do? I'm going to come back to this in a later post - in the next post in this short series I want to look at policies on metadata and fulltext harvesting, and how 'harvesting' differs from 'crawling' (the latter being the approach that a web search engine like Google might take).