How does CORE see my repository?

Since we first made CORE public, a number of repository managers have asked whether we are harvesting their particular repository's data. We have also sometimes come across issues when harvesting content that we want to feed back to the repository manager (or other relevant staff). To be able to answer such questions, and to be transparent about what CORE is doing (what we have harvested, where we have encountered problems, and so on), we've now released a 'Repository Analytics' dashboard.

This dashboard lists all the repositories we are currently harvesting, shows the number of metadata records and full-text PDFs we have successfully harvested from each, and makes the logs of our harvesting activity available for more detailed troubleshooting.

The dashboard is very much in 'beta' at the moment, and although we expect most figures on there to be accurate in terms of what CORE has done, there may sometimes be oddities, issues and inaccuracies that we need to work through. We would really like feedback on the dashboard in general, and of course we want to know about any inaccuracies in our data. Where there are problems we would also like to work with repository managers to resolve them.

To access the dashboard, please visit http://core.kmi.open.ac.uk/repository_analytics/. On the first screen you can see some overview information - ticks and crosses indicating the success or otherwise of our attempts to harvest metadata and content. If you mouse over an icon you'll get a record count. Because we don't always know how many records/documents there are to harvest, you can sometimes get a green tick when only a tiny fraction of your content has been harvested - so in general a tick means only that there were no obvious technical issues.

If you have feedback, please add it in the comments on this post (you can add comments at http://core-project.kmi.open.ac.uk/node/37).

Best of both worlds

In the two previous blog posts in this series (Finding fulltext and What does Google do?) I've described some of the challenges related to harvesting metadata and full text from institutional repositories. I've omitted some of the technical issues we've encountered (e.g. issues with OAI-PMH Resumption Tokens) as generally we've been able to work around these - although I may come back to them at some point in the future. Also worth a read is Nick Sheppard's post on the UKCORR blog touching on some of these issues.

Given the issues described in the previous posts, CORE is faced with the question of what to do about harvesting where permissions are unclear, inconsistent and not easy to apply purely through software (i.e. not machine readable), and where the location of full-text items (which CORE wants to harvest) is not necessarily given in the harvested metadata?

We would propose that for metadata the answer is simple: Harvest it anyway, until asked explicitly not to. This may seem a glib and self-serving answer, but this is not the intention. The arguments for going ahead with the metadata harvest no matter what the policy are as follows:

These factors mean that our assessment of the risk of there being a negative consequence (whether legal or reputational) to any party as a result of us harvesting and using the metadata is that the risk is negligible, and we can react to individual cases and 'takedown' requests as necessary.

The question of fulltext harvesting is more problematic. The copyright inherent in fulltext is unlikely to belong to the repository (or the hosting organisation), and the policies expressed by repositories show a varied view on what is permissible for a third party to do with the fulltext content (and for what purposes). The services offered by CORE only work where fulltext is available and can be harvested and parsed by our software, and in some cases we know this is permitted, but the problems of finding the relevant policies, and understanding the full intentions and implications of policies remain.

Despite all these issues, web search engines such as Google are able to harvest full-text content from repositories in most cases. Rather than rely on understanding published policies they rely instead on a simple control mechanism for all websites - the robots.txt file. This gives control to the publisher, and offers a simple way of ensuring content is not crawled when this is not desired for some reason.

In our original discussions about what CORE should do, we discussed the possibility of proposing a simple method for repositories to tell CORE whether it could harvest content or not. We also felt strongly that CORE (a non-profit service designed to improve the discoverability of open access content for the academic community) should not be at a disadvantage to Google et al when it came to building services on top of Open Repository content. At this point we realised that if we brought these desires together, we had a straightforward answer - CORE should harvest full-text content but respect any restrictions in robots.txt - and so we would be competing on a level playing field with Google, and offering a clear mechanism to repositories to 'opt-out' of the process if they wish to do so.

At one point in our investigations I began to wonder whether the whole use of OAI-PMH was actually worth the time and effort - after all, search engines don't rely on this mechanism, and if we think of repositories as simply web-based resources, then why do we need a sector-specific protocol to 'crawl' or 'harvest' the records and content? I still think this is an important question, but since we had already put significant effort into our harvesting software and processes, and there are certainly advantages to using OAI-PMH (such as easily knowing about changes and deletions without recrawling the whole repository web presence), we are still going to use this for the metadata.
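
To illustrate the 'changes and deletions' advantage, here is a minimal Python sketch of an incremental OAI-PMH harvest. It is not CORE's harvesting code: the base URL, the date and the sample response are made up for illustration, and it assumes the repository actually reports deletions (support for deleted records is optional in OAI-PMH).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, since=None, token=None):
    """Build a ListRecords request; a resumption token replaces all other arguments."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if since:
            params["from"] = since  # only records changed since this date
    return base_url + "?" + urlencode(params)


def split_changes(response_xml):
    """Return (changed ids, deleted ids, resumption token or None)."""
    root = ET.fromstring(response_xml)
    changed, deleted = [], []
    for header in root.iter(OAI + "header"):
        identifier = header.findtext(OAI + "identifier")
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            changed.append(identifier)
    token = (root.findtext(".//" + OAI + "resumptionToken") or "").strip()
    return changed, deleted, token or None


# Hypothetical response, trimmed to the parts used here.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repo.example.org:1</identifier>
        <datestamp>2012-03-01</datestamp>
      </header>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:repo.example.org:2</identifier>
        <datestamp>2012-03-02</datestamp>
      </header>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("http://repo.example.org/oai", since="2012-03-01")
changed, deleted, token = split_changes(SAMPLE)
next_url = list_records_url("http://repo.example.org/oai", token=token)
```

A crawler working only from web pages would have to revisit every page to learn the same thing, which is the cost the paragraph above refers to.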

However, once we have the metadata, with a URL which may point to a full-text resource, or may point to a record page, from which there are links to one or more full-text resources, we are going to use web crawler technology to try to retrieve the full-text PDFs related to the metadata record. It is likely we will use the Apache Nutch crawler to do this which brings with it two key advantages:

  • It should help us with the issue of the metadata record not linking directly to the full text - by crawling, say, two levels from the URL given in the metadata record, we can look for PDFs and link them to the metadata record
  • Nutch will adhere to the robots.txt directives automatically, bringing us in line with other web crawlers
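
The two points above can be sketched in a few lines of Python. This is only an illustration of the idea, not Nutch and not CORE's implementation: it uses the Python standard library, the user agent name and example URLs are invented, and pages are supplied by a `fetch` function (here an in-memory dictionary) so that the sketch runs without touching a real repository.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def find_pdfs(start_url, fetch, robots, agent="example-harvester", max_depth=2):
    """Breadth-first crawl from start_url, following same-site links up to max_depth."""
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pdfs = []
    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # respect robots.txt, as any well-behaved crawler would
        if urlparse(url).path.lower().endswith(".pdf"):
            pdfs.append(url)
            continue
        if depth == max_depth:
            continue  # do not follow links any further from the record
        collector = LinkCollector()
        collector.feed(fetch(url))
        for href in collector.links:
            link = urljoin(url, href)
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pdfs


# Invented pages standing in for a repository's record page and file listing.
PAGES = {
    "http://repo.example.org/handle/1": (
        '<a href="/bitstream/1/paper.pdf">PDF</a>'
        '<a href="/private/secret.pdf">Restricted</a>'
        '<a href="/handle/1/files">All files</a>'
    ),
    "http://repo.example.org/handle/1/files": (
        '<a href="/bitstream/1/appendix.pdf">Appendix</a>'
    ),
}

robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private"])

found = find_pdfs("http://repo.example.org/handle/1", lambda u: PAGES.get(u, ""), robots)
```

In this sketch the PDF under `/private` is skipped because of the `Disallow` rule, which is exactly the 'opt-out' control described above.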

We feel this approach gives us the best of both worlds - harvesting for the metadata then directed crawling for the full-text. This isn't a solution I've seen elsewhere (although please get in touch if you know of similar implementations - directly, or in the comments on this post). For the repository managers it means there is a straightforward mechanism to control access to full-text items, and a single place to limit access to specific items whether this is from Google, Bing, or CORE.

We are, of course, very interested in getting feedback on this proposed approach, especially from repository managers - so please get in touch and let us know what you think (directly, or in the comments on this post).

What does Google do?

This is the second post in a series about the issues CORE has encountered trying to harvest (and build services on) metadata and fulltext items from UK HE research repositories. The first post "Finding fulltext" looked at the problems of harvesting fulltext due to variations in how links are made (or not) from metadata records to fulltext content.

In this post I want to consider the question of what services like CORE are allowed or permitted to do with repository content. A third post will then describe some of the solutions to the various challenges we see.

It may seem obvious that repositories offering the ability to harvest metadata should expect external services to do exactly that, and then make some use of the metadata. However, most UK HE research repositories have some policies relating to what other services can do with both metadata and content made available by the repository. This differentiation between metadata and fulltext content is deliberate, and repositories will often be more permissive in what they state is allowed to be done with metadata compared to fulltext content.

My starting point in investigating repository policies was OpenDOAR, a directory of academic open access repositories (http://www.opendoar.org/). On OpenDOAR there are 155 'institutional' repositories in the UK (not necessarily limited to HE), of which 125 offer OAI-PMH. I should note here that the data in this post (whether from OpenDOAR or elsewhere) was generally collected and analysed using a combination of ScraperWiki and Google Refine. All figures are based on a snapshot of data taken between 18th February and 18th March 2012, and any errors are of course mine alone. If I've made any errors in my interpretation or recording of policies or data please let me know on twitter (http://twitter.com/ostephens) or by email owen@ostephens.com (although note the metadata policies and summaries I'm reporting from OpenDOAR are not mine to correct). A spreadsheet containing the data I refer to in this post is available at https://docs.google.com/spreadsheet/ccc?key=0ArKBuBr9wWc3dFM5Vi1QLWdOR0t...

Looking at the 'metadata' policy summaries that OpenDOAR has recorded for these 125 repositories, the largest group (57) say "Metadata re-use policy explicitly undefined", which sometimes seems to mean that OpenDOAR doesn't have a record of a metadata re-use policy, and sometimes that OpenDOAR knows that there is no explicit metadata re-use policy defined by the repository. Of the remaining repositories, for a large proportion (47) OpenDOAR records "Metadata re-use permitted for not-for-profit purposes", and for a further 18 "Commercial metadata re-use permitted".

However, although OpenDOAR has made a substantial effort to collect and accurately reflect institutional policies (and indeed, has been behind a push to get repositories to formulate and state policies clearly via its Policies Tool), perhaps inevitably there are both errors and omissions. For example, the OpenDOAR record for Aston University says "Metadata re-use policy explicitly undefined" and notes "Policy not found". However, if we refer directly to the repository's response to an OAI-PMH 'Identify' request (a way of getting some XML about the OAI-PMH service for a repository) we find that there is a link to a web page which defines the Aston University repository policies. This states:

The metadata may be re-used in any medium without prior permission for not-for-profit purposes and re-sold commercially provided: the OAI Identifier or a link to the original metadata record are given; Aston University Research Archive is mentioned

We can see that re-use of the metadata is permitted, even though this was not recorded on OpenDOAR. My suspicion is that one reason this has happened is that Aston University do not list the policy in their XML response to the OAI-PMH 'Identify' request, but just link to the policy on another web page. However, in common with all other repositories I have looked at to date, the policy (whether in the XML response, or on a separate page) is designed to be human readable, not machine readable. The guidance on how to handle both record and repository level rights statements in OAI-PMH, at http://www.openarchives.org/OAI/2.0/guidelines-rights.htm, has some examples where rights are linked to machine readable versions of common licenses - particularly the Creative Commons licenses. However, I have yet to come across a real-world example of this. Although I don't doubt that there are examples, I think this is the exception rather than the rule.
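
To make the 'listed in the XML' versus 'linked from the XML' distinction concrete, here is a small Python sketch that reads the metadata policy out of an 'Identify' response. It assumes the repository uses the optional 'eprints' description container from the OAI implementation guidelines; the sample response, repository name and URLs are invented and are not Aston University's actual output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EPRINTS = "{http://www.openarchives.org/OAI/1.1/eprints}"


def identify_url(base_url):
    """Build the OAI-PMH Identify request for a repository's base URL."""
    return base_url + "?" + urlencode({"verb": "Identify"})


def metadata_policy(identify_xml):
    """Return the policy text and policy URL, or None if no policy is declared."""
    root = ET.fromstring(identify_xml)
    policy = root.find(".//" + EPRINTS + "metadataPolicy")
    if policy is None:
        return None
    return {
        "text": (policy.findtext(EPRINTS + "text") or "").strip(),
        "url": (policy.findtext(EPRINTS + "URL") or "").strip(),
    }


# Invented response: the policy text is empty and only a link is given.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example Research Archive</repositoryName>
    <baseURL>http://repo.example.org/cgi/oai2</baseURL>
    <description>
      <eprints xmlns="http://www.openarchives.org/OAI/1.1/eprints">
        <metadataPolicy>
          <text></text>
          <URL>http://repo.example.org/policies.html</URL>
        </metadataPolicy>
      </eprints>
    </description>
  </Identify>
</OAI-PMH>"""

request = identify_url("http://repo.example.org/cgi/oai2")
policy = metadata_policy(SAMPLE_IDENTIFY)
```

Even when this works, what comes back is either free text or a link to a web page, so the software still has nothing it can reliably interpret.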

This raises significant challenges for services such as CORE which rely on automated harvesting and processing of records - for such services the policies being made available by repositories are hard to make use of, as they don't provide terms in a form that is easy for the software to extract and 'understand'.

If we move on to the harvesting of fulltext content, the situation is similar to the one I've described for metadata. Unfortunately, it is even less clear for fulltext content than it is for metadata. OpenDOAR lists 54 repositories with the policy summary "Full data item policies explicitly undefined", but after that the next most common policy summary (29 repositories, as recorded by OpenDOAR) is "Rights vary for the re-use of full data items" - more on this in a moment. OpenDOAR records "Re-use of full data items permitted for not-for-profit purposes" for a further 20 repositories, and then (of particular interest for CORE) 16 repositories as "Harvesting full data items by robots prohibited".

Once again, delving into individual repository policies can throw up conflicts with the policy OpenDOAR has recorded. For example, OpenDOAR lists the University of Southampton policy on full data re-use as "Harvesting full data items by robots prohibited", but querying the University of Southampton repository directly we see (amongst other statements):

Full items may be harvested by robots transiently. Where full items are harvested permanently permission must be sought from the University of Southampton.

As suggested by the presence of several policy summaries stating "Rights vary for the re-use of full data items", it is quite possible (and I think it would be quite exceptional for it to be otherwise) that rights will vary depending on the fulltext item in question. This, in theory, can be expressed in a repository record, and can be made available to those harvesting the records - for example in a <dc:rights> tag in the associated metadata record. However, this information seems to be absent more often than it is present, and even where the rights data has been entered in the repository, not all repositories output the rights metadata in the records accessible via OAI-PMH (it seems in particular that in EPrints the dc:rights field is not output via OAI-PMH by default).
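
As a rough sketch of what a harvester can do with record-level rights, the following Python reads any <dc:rights> values from an oai_dc record. Both sample records are invented for illustration; the second shows the more common case where no rights information is exposed at all.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

NAMESPACES = (
    'xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/"'
)


def rights_statements(record_xml):
    """Return every non-empty dc:rights value found in the record."""
    root = ET.fromstring(record_xml)
    return [
        element.text.strip()
        for element in root.iter(DC + "rights")
        if element.text and element.text.strip()
    ]


# Invented record that exposes a licence URL in dc:rights.
WITH_RIGHTS = f"""<oai_dc:dc {NAMESPACES}>
  <dc:title>An example paper</dc:title>
  <dc:rights>http://creativecommons.org/licenses/by/3.0/</dc:rights>
</oai_dc:dc>"""

# Invented record with no rights information at all.
WITHOUT_RIGHTS = f"""<oai_dc:dc {NAMESPACES}>
  <dc:title>Another example paper</dc:title>
</oai_dc:dc>"""

declared = rights_statements(WITH_RIGHTS)
missing = rights_statements(WITHOUT_RIGHTS)
```

An empty result tells the harvester nothing either way: the rights may never have been recorded, or they may simply not be exported via OAI-PMH.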

The problem of parsing rights information with software (whether this is repository level, or record level 'dc:rights' information) remains with fulltext content, and if anything policies relating to the use of fulltext content are more nuanced and subtle than with metadata content. For example the University of Bristol policy states (amongst other things) that:

Full items must not be harvested by robots except transiently for full-text indexing or citation analysis.

In this case permission is being granted only for very specific types of use. Additionally this policy (and many others) uses several specific terms without defining them in detail. I would argue that definitions of 'robot', 'transiently', 'full-text indexing' and 'citation analysis' would be necessary to fully understand the implications of this policy.

So we are faced with conflicting information, spread across multiple services, expressed in a way that our software (and sometimes a human) cannot easily understand!

This brings us to the title of this post "What does Google do?" (a play on the Jeff Jarvis book "What would Google do?"). Despite the complications I have described around policies relating to fulltext harvesting from repositories, Google (both http://google.com and http://scholar.google.com) and other search engines do include repository content, capturing metadata and fulltext, providing fulltext indexing and caching of pdf documents, and possibly other services or information based on machine based parsing of content. Some repositories mention indexing by Google or Google Scholar in their promotional material e.g. http://www3.imperial.ac.uk/library/find/spiral/faq and http://www.lib.cam.ac.uk/repository/usecases/.

Given the complexity of the situation, as described above, we thought it was worth looking at what Google did, and how repositories related to Google (and by implication other web search engines). Neither Google nor Google Scholar use OAI-PMH (or at least, not as far as they are public about what they do) to harvest metadata, but they rely on crawling repository web pages, just like any other website, although Google Scholar has some specific provisions to help repositories (and publishers) get accurate metadata into the Google Scholar index.

Starting with the same list of repositories in OpenDOAR as previously, I created a Google search per repository that looked for any PDFs indexed in Google from that site - for example http://www.google.com/#hl=en&q=site:aura.abdn.ac.uk+filetype%3Apdf finds pdfs indexed by Google in the Aberdeen University Research Archive repository. This showed that only 8 out of the 125 repositories had zero pdfs in Google's index, and while a few had trivial numbers (4 repositories had just a single pdf in the Google index), in general most repositories had some kind of presence in the Google index, with some having substantial numbers of pdfs indexed (e.g. the University of Southampton result for this search gave 13,700). It should be noted that the results Google gives for searches are not always accurate, but I think they give a flavour of the situation, and a zero is usually pretty conclusive.

If we look at the results for the ROSE repository at the University of Bristol from the search http://www.google.com/#hl=en&q=site:rose.bris.ac.uk+filetype%3Apdf, we can see there are "About 1,320" results. In each case it seems that the PDF document is cached for display in the results page, and in some cases there is a 'quickview' option which displays the cached version of the document in Google Docs.

Similarly looking at Google Scholar using searches of the form http://scholar.google.com/scholar?q=site:aura.abdn.ac.uk, we can see that only 8 repositories in the list give zero hits. Taking ROSE as an example again, the search http://scholar.google.com/scholar?q=site:rose.bris.ac.uk gives "about 242", but unlike with http://www.google.com, in this case the PDF is not obviously cached.

Both Google and Google Scholar offer advice on ensuring that they index your site effectively. Google offer a wealth of advice, and tools, to help webmasters get their sites indexed correctly - all available from https://www.google.com/webmasters/tools/home. Google Scholar also offers advice at http://scholar.google.com.au/intl/en/scholar/inclusion.html, which is specifically aimed at the type of content (e.g. research papers) covered by Google Scholar. Of particular note, the Google Scholar guidelines include information on how to include bibliographic metadata for Google Scholar to harvest.

For both Google and Google Scholar, it is made clear in the guidelines that the mechanism for marking any content the repository does not want to be indexed in the respective indexes is the 'robots.txt' file. This is a simple text file that can be placed on your web server, and gives simple instructions to web crawlers/spiders/robots as to what they should and should not index on the site, as well as some other aspects of site indexing behaviour. The robots.txt file is a long standing convention (see the robots.txt entry on Wikipedia for more information), but is not strictly enforceable - it is up to the robot software crawling the site to read robots.txt and obey - if an unscrupulous robot chooses to ignore the file, there is nothing inherent in robots.txt to prevent it doing what it wants (there are, of course, other methods of blocking particular pieces of software etc. accessing your site, but they are not related to robots.txt).

Looking at the robots.txt files across all the repositories in the list, we can see the majority are not doing anything specific to block Google etc. from indexing their sites. There are a few exceptions - which account for some of the 'zero' results sets described above. For example the "Bradford Scholars" portal at the University of Bradford blocks access in robots.txt (http://bradscholars.brad.ac.uk/robots.txt) as follows:

User-agent: *
Disallow: /
Disallow: /browse
# Uncomment the following line only if sitemaps.org or HTML sitemaps are used
#Disallow: /browse-title

Sites blocking access are the exception rather than the rule.

If we go back to the ROSE repository at Bristol, and look at a specific record such as https://rose.bris.ac.uk/handle/1983/286, we can see in the html the following:

<meta content="Champneys, AR; Kuznetsov, YA; Sandstede, B" name="citation_authors" />
<meta content="homoclinic orbit; numerical analysis; continuation; bifurcation; Preprint" name="citation_keywords" />
<meta content="http://rose.bris.ac.uk/handle/1983/286" name="citation_abstract_html_url" />
<meta content="2006-01-31T17:34:08Z" name="citation_date" />
<meta content="A numerical toolbox for homoclinic bifurcation analysis" name="citation_title" />
<meta content="en" name="citation_language" />

This is the markup for metadata used specifically by Google Scholar - so it seems that Bristol is making some effort to appear in Google Scholar results. We can also look at the ROSE robots.txt file:

====
The contents of this file are subject to the license and copyright
detailed in the LICENSE and NOTICE files at the root of the source
tree and available online at

http://www.dspace.org/license/
====
User-agent: *

# Uncomment the following line ONLY if sitemaps.org or HTML sitemaps are used
# and you have verified that your site is being indexed correctly.
# Disallow: /browse

# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content:
# Disallow: /advanced-search
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register
# Disallow: /search

This basically means that robots.txt is saying any web crawler can feel free to access any part of the site, although as this looks like a default DSpace configuration, it is possible that this is not a deliberate choice on the part of the repository.
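
To show how a crawler would read the two kinds of robots.txt file discussed above, here is a short sketch using Python's standard `urllib.robotparser`. The rule sets are simplified versions of the Bradford and ROSE files, and the user agent name and example URL are invented; other crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Parse robots.txt content held in a string rather than fetched from a site."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Simplified from the Bradford example: everything is disallowed.
BLOCKING_RULES = """User-agent: *
Disallow: /
Disallow: /browse
"""

# Simplified from the ROSE example: every Disallow line is commented out.
PERMISSIVE_RULES = """User-agent: *
# Disallow: /browse
# Disallow: /search
"""

url = "http://repo.example.org/handle/1/paper.pdf"
blocked = parse_rules(BLOCKING_RULES).can_fetch("example-harvester", url)
allowed = parse_rules(PERMISSIVE_RULES).can_fetch("example-harvester", url)
```

Here `blocked` is False and `allowed` is True: with every rule commented out, the file places no restriction on any crawler.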

What do we conclude from all this? Google, Google Scholar, and other web search engines do not rely on the repository specific mechanisms to index their content, and do not take any notice of repository policies (there are certainly examples where the fulltext policy listed on OpenDOAR explicitly says harvesting by robots is not allowed, but the robots.txt file is permissive although I have not yet tracked down a completely clear cut example of this when looking at policies hosted directly on a repository website).

By adopting a clear, unambiguous mechanism for allowing a content owner to say whether the search engine can crawl the site, with an 'opt-out' rather than 'opt-in' approach (if robots.txt is not present, web crawlers will assume permission to crawl the website), internet search engines have made it possible to build indexes of large amounts of content on the web. It clearly would not be possible for Google to operate in its current form if it asked permission to index content each time its rights to do so were unclear. In the main, web content owners, including repository owners, accept this as a quid pro quo of sorts in order to make their content more discoverable. Ignoring the question of how Google and others profit from this (and whether this counts as non-commercial activity), the point for the repositories is to make their content discoverable.

What does this mean for CORE (and any similar services)? I bet you can't wait for the next blog post in this series to find out!

Finding fulltext

In order to provide its search functions, similarity measures and other functionality, CORE harvests both metadata and fulltext items from repositories. This raises questions about whether we are allowed to harvest metadata or fulltext items, and if so what we are allowed to do with them once we have harvested them. In the first phase of CORE we relied on OAI-PMH to harvest metadata, and then used links from the harvested records to try to discover the related fulltext item.

This is the first in a series of blog posts looking at these issues, the problems we've encountered and the solutions we have put in place (so far). In this post I'm going to focus on the question of finding fulltext items from the metadata. This wasn't always straightforward. Not all repositories link to fulltext items from the metadata in the same way, and in many cases there is no direct link from the metadata to the fulltext item at all, only a link to the repository's web page for the record.

This (edited for brevity) example from the University of Cambridge (which uses the DSpace software) has a link in <dc:identifier>, which links to the html page describing the item. To get the fulltext, you then need to find the link to the pdf on that page and click through.

<record>
<header>
<identifier>oai:www.dspace.cam.ac.uk:1810/221924</identifier>
</header>
<metadata>
<oai_dc:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:title>Reading Lists in Cambridge: A Standard System?</dc:title>
<dc:creator>Jones, Huw</dc:creator>
<dc:identifier>http://www.dspace.cam.ac.uk/handle/1810/221924</dc:identifier>
<dc:relation>1/4</dc:relation>
</oai_dc:dc>
</metadata>
</record>

By contrast, this example from the University of Southampton (again edited) links directly to the pdf from <dc:identifier>, and links to the html page for the item using <dc:relation>:

<record>
<header>
<identifier>oai:eprints.soton.ac.uk:66183</identifier>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>A methodology for developing high damping materials with application to noise reduction of railway track</dc:title>
<dc:creator>Ahmad, Nazirah</dc:creator>
<dc:format>application/pdf</dc:format>
<dc:identifier>http://eprints.soton.ac.uk/66183/2451/P2503.pdf</dc:identifier>
<dc:identifier>Ahmad, Nazirah (2009) A methodology for developing high damping materials with application to noise reduction of railway track. University of Southampton, Institute of Sound and Vibration Research, Doctoral Thesis, 250pp.</dc:identifier>
<dc:relation>http://eprints.soton.ac.uk/66183/</dc:relation>
</oai_dc:dc>
</metadata>
</record>

The lack of consistency here obviously raises some challenges for those wishing to harvest fulltext items.
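
The difference between the two records can be seen in a short Python sketch that sorts the URLs in a record into direct PDF links and everything else. The records are cut-down versions of the two examples above (with a shortened citation string), and treating 'path ends in .pdf' as 'is the fulltext' is a simplifying assumption of mine rather than anything the protocol guarantees.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

DC = "{http://purl.org/dc/elements/1.1/}"

NAMESPACES = (
    'xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/"'
)


def candidate_links(record_xml):
    """Split the URLs in dc:identifier and dc:relation into (pdf links, other links)."""
    root = ET.fromstring(record_xml)
    urls = []
    for tag in ("identifier", "relation"):
        for element in root.iter(DC + tag):
            value = (element.text or "").strip()
            if urlparse(value).scheme in ("http", "https"):
                urls.append(value)  # ignore citations, '1/4' and other non-URLs
    pdfs = [u for u in urls if urlparse(u).path.lower().endswith(".pdf")]
    others = [u for u in urls if u not in pdfs]
    return pdfs, others


CAMBRIDGE_STYLE = f"""<oai_dc:dc {NAMESPACES}>
  <dc:identifier>http://www.dspace.cam.ac.uk/handle/1810/221924</dc:identifier>
  <dc:relation>1/4</dc:relation>
</oai_dc:dc>"""

SOUTHAMPTON_STYLE = f"""<oai_dc:dc {NAMESPACES}>
  <dc:identifier>http://eprints.soton.ac.uk/66183/2451/P2503.pdf</dc:identifier>
  <dc:identifier>Ahmad, Nazirah (2009) Doctoral Thesis, 250pp.</dc:identifier>
  <dc:relation>http://eprints.soton.ac.uk/66183/</dc:relation>
</oai_dc:dc>"""

cambridge = candidate_links(CAMBRIDGE_STYLE)
southampton = candidate_links(SOUTHAMPTON_STYLE)
```

For the Cambridge-style record the first list is empty, so the only way to reach the fulltext is to visit the record page and look for a link there.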

When I posted some questions around this topic to the ever-helpful code4lib mailing list, Godmar Back (http://people.cs.vt.edu/~gback/) pointed out that the OAI-PMH specification says "To facilitate access to the resource associated with harvested metadata, repositories should use an element in metadata records to establish a linkage between the record (and the identifier of its item) and the identifier (URL, URN, DOI, etc.) of the associated resource. The mandatory Dublin Core format provides the identifier element that should be used for this purpose." (from http://www.openarchives.org/OAI/openarchivesprotocol.html#UniqueIdentifier)

Note that this does not state what type of identifier should be used, and where a URL is used it isn't stated that this should resolve to the fulltext item in the browser (although it does suggest that it should identify the resource, not the description of the resource).

As part of the same discussion Raffaele Messuti (http://atomotic.com/) noted that in Italy records describing theses are required to do the following:

From what I can see looking at an example (http://amsdottorato.cib.unibo.it/cgi/oai2?verb=GetRecord&metadataPrefix=...) the link to the actual resource is given in <didl:Resource> within <didl:Component>.

This approach feels useful not just because it introduces consistency, but also because it clearly answers the question of what to link to where the item described consists of multiple files/parts.

Creating a standard approach may prove successful for a small, well defined, community - and I think it would be useful to UK HE repository managers to work towards a standard approach, similar to the Italian etheses example. However, this would only solve the problem for CORE for a particular subset of repositories. CORE is already looking at harvesting repositories from outside the UK, and the wider we cast our net for repositories to harvest, the more likely we are to hit a variety of practices across communities.

So what will CORE do? I'm going to come back to this in a later post - in the next post in this short series I want to look at policies on metadata and fulltext harvesting, and how 'harvesting' differs from 'crawling' (the latter being the approach that a web search engine like Google might take).
