CORE wins the Best Poster/Demo Award at TPDL 2011

The KMi submission authored by Petr Knoth, Vojtech Robotka and Zdenek Zdrahal, entitled "Connecting Repositories in the Open Access Domain using Text Mining and Semantic Data", won the Best Poster/Demo Award at the International Conference on Theory and Practice of Digital Libraries (TPDL 2011), which is taking place this week in Berlin, Germany.

The European Conference on Research and Advanced Technology for Digital Libraries (ECDL) has been the leading European scientific forum on digital libraries for 14 years. For its 15th year, the conference was renamed the International Conference on Theory and Practice of Digital Libraries (TPDL).

http://news.kmi.open.ac.uk/11/1276

CORE Plugin deployed in Open Research Online

The CORE Plugin has finally been approved by the OU Library and last week became part of the institutional repository Open Research Online. An example of the plugin can be seen at the bottom of this page.

Final post

What have we produced:

Software tools

Publication:

Knoth, P., Robotka, V. and Zdrahal, Z. (2011) Connecting Repositories in the Open Access Domain using Text Mining and Semantic Data, International Conference on Theory and Practice of Digital Libraries 2011 (TPDL 2011), Berlin, Germany

Poster presentation:

Knoth, P. and Zdrahal, Z. (2011) CORE: Connecting Repositories in the Open Access Domain, CERN workshop on Innovations in Scholarly Communication (OAI7), Geneva, Switzerland

YouTube video presentation:

http://www.youtube.com/watch?v=_YuOJnjCEAA&feature=player_embedded

Linked Data in Libraries event (London) presentation:

http://www.slideshare.net/petrknoth/core-presentation-8593721?from=ss_embed

Next steps:

  • Find ways to further develop CORE to enable the inclusion of larger amounts of content, i.e. the aggregation of content from more repositories.
  • Integrate CORE with currently emerging research data management and repository systems to allow the linking of publications with data.
  • Further disseminate the service to increase its adoption.

    Evidence of Reuse:

    • Data and services currently being reused by the Open Research Online Repository.
    • Positive feedback received from the participants of the OAI7 workshop, namely Astrid van Wesenbeeck (SPARC Europe).
    • Positive feedback about CORE received by email from Graham Steel in reaction to the upload of the CORE video to YouTube.
    • Our team has discovered a set of OAI-PMH base URLs that were not up to date in the OpenDOAR repository and provided this feedback to OpenDOAR. Bill Hubbard of OpenDOAR appreciated this collaboration.

    Skills:

    The project has helped us to further develop skills needed to technically handle large amounts of data. It also increased our understanding of the current state-of-the-art technologies for access and retrieval of Open Access content. These skills will help us to further develop CORE in the future.

    Most significant lessons:

    • Though OAI-PMH harvesting is considered messy by the digital library community, harvesting and processing full-text content is an order of magnitude more difficult.
    • Do not use Java tools for thumbnail generation, use ImageMagick instead.
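The ImageMagick lesson above can be sketched as follows. This is a minimal illustration, not the CORE implementation: it assumes ImageMagick's `convert` command is installed and is called as an external process, and the function names, the 200-pixel width and the 60-second timeout are hypothetical choices. Running the conversion in a separate process means that a problematic PDF can at worst fail that process, not the calling application.

```python
import subprocess
from pathlib import Path


def thumbnail_command(pdf_path, out_path, width=200):
    """Build an ImageMagick command that renders page 1 of a PDF as a thumbnail."""
    # "[0]" selects the first page; "-thumbnail 200x" scales the image to a
    # width of 200 pixels and keeps the aspect ratio.
    return ["convert", f"{pdf_path}[0]", "-thumbnail", f"{width}x", str(out_path)]


def make_thumbnail(pdf_path, out_path, width=200, timeout_s=60):
    """Run ImageMagick as an external process; return True only on success."""
    command = thumbnail_command(pdf_path, out_path, width)
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        # ImageMagick is not installed, or the conversion took too long.
        return False
    return result.returncode == 0 and Path(out_path).exists()
```

A caller would simply skip the thumbnail for any article where `make_thumbnail` returns `False`.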
    Cost/benefits of approach

    It would not have been possible to achieve the CORE results without the help of JISC, which allowed us to invest time in developing the CORE system and tools. In addition, the project benefitted from extra time spent on it by KMi staff and students. Therefore, I would like to thank all those involved in the development and dissemination of CORE. This includes Zdenek Zdrahal, Owen Stephens, Jakub Novotny, Gabriela Pavel, Magdalena Krygielova, Harriet Cornish, Vojtech Robotka, Petr Kremen, Michal Chloupek, Sophie Wise and Ian Tindle.

    The CORE project has developed an infrastructure and tools that can be used for a number of purposes, both by end users (people searching for Open Access papers) and by developers of institutional repositories. Though this is already very good, we believe that CORE is just the beginning of something bigger. In particular, we would like to extend CORE to become a large aggregation service for even more OA repositories. To do that, we will have to face a number of challenges mentioned in some of the previous blog posts.

    Overall, we believe that the benefits of CORE far exceed the investment. By talking to people in the OA community, to the developers of the Open Research Online repository and to others, we learned that CORE is a unique system which is extremely useful for promoting Open Access and for providing better services to researchers and students.

    CORE Video

    How could others follow in our footsteps?

    The CORE project produced a number of tools that can be reused or adapted to solve specific problems. In this blog post, we explain how we envisage this happening and describe how our team can assist. Some of the answers were developed during the last Advisory Board meeting, which took place on Monday 25th July.

    1) Development of subject based repositories as aggregations of content from a set of existing Open Access repositories - the CORE harvesting software can be easily set to perform metadata and content harvesting from any set of OAI-PMH compliant repositories. The fact that CORE provides access to the full-texts enables us to apply different text mining and classification methods to filter the content to be finally presented to the user.

    2) Providing mobile access to publications stored in Open Access repositories - the mobile client developed within the CORE project can be used for searching and accessing content stored in any set of Open Access repositories. We assume this might also be an interesting functionality for individual institutions, which could provide researchers and students with mobile access to all content stored within their institution. Doing so requires installing the CORE server system and adapting the mobile client.

    3) Integrating the CORE Plugin into institutional Open Access repositories - The CORE Plugin can be reused and easily integrated into institutional OAI-PMH compliant repositories. The CORE Plugin is platform independent and can be integrated into any repository by just adding a piece of JavaScript code to the web page. The design of the Plugin can be customised using an attached CSS style.

    4) Reusing the RDF triples exposed - Third party tools addressing various resource discovery problems can be developed and can take advantage of the CORE triple store. The repository can be programmatically accessed using the provided SPARQL endpoint.
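To give a flavour of the metadata harvesting mentioned in point 1, here is a minimal sketch of paging through an OAI-PMH `ListRecords` response. It is not the CORE harvester (which builds on the OCLC OAIHarvester2); the function names are hypothetical, and only the request parameters and the XML namespace come from the OAI-PMH 2.0 protocol itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    path = ".//oai:record/oai:header/oai:identifier"
    identifiers = [element.text for element in root.iterfind(path, OAI_NS)]
    token = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing resumptionToken means this was the last page.
    next_token = token.text if token is not None and token.text else None
    return identifiers, next_token
```

A harvester would fetch `list_records_url(base_url)`, pass the response to `parse_page`, and repeat with the returned token until it is `None`.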

    Our team is ready to provide any advice on these steps. We would suggest those interested in the development or in the reuse of the CORE results to contact us for details.

    The project team in KMi would also like to build on top of the existing solution:

    First, we would like to increase the number of repositories. In particular, we would like to incorporate more and more OA repositories, with the aim of eventually covering all of them. This will require us to further optimise the system to allow very efficient content download and also to improve the hardware currently available at the Open University.

    Second, we will be closely monitoring the current discussions about the management of research data in the Open Archives Initiative (OAI) community, so that research data can also be included and integrated into the CORE system in the future.

    Third, we would like to get involved in projects that take advantage of the developed CORE architecture and apply text mining techniques to extract or derive interesting information from the publications.

    Small wins and fails

    The development of the CORE system has been rapid and we were overcoming issues on a daily basis. Only now that the CORE system is fully functional can we evaluate the successes and comment on the issues we had to face.

    Let us start with the challenges we faced and explain how we addressed them:

    • Metadata harvesting - Our decision to reuse the OCLC OAIHarvester2 as a component in our system proved to be a good one. However, as the component was originally designed as a command line tool, it had to be slightly modified in order to use it reliably on Tomcat. This required us to update some exception handling etc. Overall, these fixes required relatively minor effort.
    • Getting the OAI-PMH base URL for British Open Access repositories - OpenDOAR and ROAR were used as authoritative lists of the OAI-PMH base URLs. We found that these URLs were not valid for a number of the British OA repositories (we will provide a list of these repositories in one of the subsequent blog posts). We were able to resolve this issue in a few cases by guessing the correct URL. We will provide detailed feedback on this to OpenDOAR.
    • Downloading content from OA repositories - We have implemented a set of Java classes to carry out the downloading of PDF files. Our pragmatic decision here was to download only content in the PDF format. There were two challenges we had to face: 1) The file download has to be fast enough. We addressed this problem by downloading the content to a set of Open University servers with a very fast network connection, by using appropriate BufferedStreams in order to fully exploit the connection potential, and by automatically cancelling the download when the remote server response was very slow (typically when the remote server did not send any data for two minutes). 2) The second issue was associated with the cost of data storage. Given that CORE needs to download data from many Open Access repositories, the system requires a large amount of disk space. At the moment we have downloaded and processed more than 50k files, which accounts for about 200GB of data. We estimate that nowadays approximately 5TB might be required to carry out the same work for all OA repositories worldwide. At the time of writing the proposal, we believed that disk space was one of the cheapest hardware components. However, we realised that fast SAS disks are required in order to carry out system backups, to allow a quick response of the system and to integrate CORE with the OU infrastructure. We have negotiated with the OU technical admin team to buy another TB of disk space for CORE, to be covered from the OU central budget at a cost of £3,000. This will enable the long-term sustainability of the CORE system for British repositories, but won't be sufficient for all Open Access repositories worldwide.
    • PDF to text extraction - This was one of the most challenging parts of the CORE system development. We have tested 3 systems for PDF to text extraction - iText, Apache Tika (PDFBox) and pdftotext. The issue with Apache Tika was that the extraction was very slow (about 30s to 1 minute per average PDF, which was prohibitive for the scale of the application), while the issue with iText and pdftotext was the quality of the extracted text. To summarise, Apache Tika produced good quality text, but the extraction was too slow, while the other tools were fast enough, but the quality of the resulting text was inferior. Eventually we managed to speed the extraction up by optimising the part of our system that communicates with Apache Tika, using BufferedStreams instead of plain Strings. At the moment we are able to extract text from about 500 PDFs per hour.
    • Thumbnail generation - In order to develop a nice search web interface that would enable access to the harvested and processed articles, we wanted to generate an image thumbnail for each article. Originally we used PDFBox for this task as well, but we discovered that about one in a thousand PDFs caused PDFBox to crash the Java Virtual Machine. This is something that in theory shouldn't happen and we have reported this issue to Apache (https://issues.apache.org/jira/browse/PDFBOX-1019). The bug is still being worked on, but it appears that the problem requires a fix in Oracle's Java implementation. Though the problem appeared rarely, the consequence for us was that we had to restart the Tomcat server on which our application was running. To avoid this problem completely, we have implemented a different solution which uses ImageMagick (http://www.imagemagick.org/script/index.php) instead. Since then the issue has never reoccurred.
    • Similarity calculation - Our team knew right from the start of the project that we would need a very well optimised version of our similarity calculation system to be able to discover relevant papers in a reasonable amount of time, because the number of document pairs to compare is very large. To make this task feasible, we have not only optimised the calculation, but also developed a new heuristic that cuts the number of combinations to be taken into account using a document frequency cut criterion. The result is that the time complexity of the similarity calculation is approximately linear with respect to the number of items in the index (in contrast to the theoretical quadratic complexity), which allows the CORE system to scale. During the project we also had to face two other issues regarding the calculation: 1) The calculation results were poor due to low text quality. This problem has been fully resolved by optimising the text extraction system. 2) The similarity calculation and the impact of the heuristic were affected by a number of strings in the index that did not carry any meaning. These strings were the result of text extraction from mathematical formulae, numbers and other types of noisy data. To address this issue, we have developed our own TextAnalyzer and TextFilter on top of the Lucene library, which filter out these tokens.
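The idea behind the document frequency cut and the token filtering described in the last item can be sketched as follows. This is only one possible reading of the heuristic, written in Python for brevity and not the Lucene-based code used in CORE: the token rule, the `max_df` threshold and all function names are assumptions made for illustration. Noisy tokens are dropped first; then only documents that share a term occurring in at most `max_df` documents become candidate pairs, so very common terms no longer force an all-against-all comparison.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed filter rule: keep alphabetic tokens of three or more letters, so
# numbers and fragments of formulae never reach the index.
WORD = re.compile(r"^[a-z]{3,}$")


def tokenize(text):
    return [token for token in re.split(r"\W+", text.lower()) if WORD.match(token)]


def candidate_pairs(docs, max_df):
    """Pair documents that share a term with document frequency <= max_df."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(tokenize(text)):
            postings[term].add(doc_id)
    pairs = set()
    for doc_ids in postings.values():
        if 2 <= len(doc_ids) <= max_df:  # the document frequency cut
            ordered = sorted(doc_ids)
            pairs.update((a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:])
    return pairs


def cosine(text_a, text_b):
    """Cosine similarity of two texts over raw term counts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

The similarity score is then computed only for the pairs returned by `candidate_pairs`. With a fixed `max_df`, each term contributes a bounded number of pairs, which is what keeps the amount of work from growing quadratically in this sketch.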

    Overall, we are glad to say that we were able to recover from all the major issues we encountered. We found it extremely useful to develop and test the system on a daily basis using agile development methodologies. Proof of the very active development and involvement of the CORE project team is that we already have 575 code revisions in our SVN repository since the start of the project.

    Main successes:

    • The project fulfilled and exceeded all the benchmarks mentioned in the project proposal. For example, our goal was to generate about 1,000,000 RDF triples, but the project has already exposed more than 3 million of them and the number is still growing. The CORE repository has been interlinked in the Linked Data Cloud with the OAI repository. The CORE repository is registered on CKAN (http://ckan.net/package/core).
    • The project originally did not envisage the development of a user interface, but has developed a Web Portal (http://core.kmi.open.ac.uk) that allows searching and navigating across the harvested content from British Open Access repositories. In addition we have developed a visualisation system for the article similarities.
    • The CORE project has developed a back-office tool, not originally envisaged in the Description of Work, that allows the system to be extended or reused for any set of OAI-PMH compliant repositories.
    • The CORE project has developed a mobile application client, not originally envisaged in the Description of Work, that allows users to search and download content from the CORE system and access it on their mobile device. The application is now available for free on the Android Market and supports Android phones and tablet devices, which account for about 50% of the mobile market. We have also developed the same tool for iPhones and iPads (which account for about 25% of the market). The Apple version is implemented, but its free distribution over iTunes is currently going through an internal OU approval process.
    • The CORE project results were submitted to and accepted at two major international conferences, the OAI7 Workshop (http://indico.cern.ch/conferenceDisplay.py?confId=103325) and the TPDL 2011 conference (http://www.tpdl2011.org/). The project has also been presented at the Linked Data and Libraries event in London (http://consulting.talis.com/event/linked-data-in-libraries/).

    Overall, we all believe that CORE has been a huge success and we are keen and committed to further develop and extend the system in the future.

    The results of the CORE project to be presented at TPDL 2011

    The project team has submitted a paper describing CORE to the International Conference on Theory and Practice of Digital Libraries (TPDL 2011) - http://www.tpdl2011.org/ - to be held in September in Berlin. This conference is the main scientific forum on digital libraries in Europe. The paper has been accepted; the acceptance rate for this year was 33%.

    Overview of the CORE project

    We provide an overview presentation of the CORE project.

    CORE repository in the Linked Data cloud

    The first version of the CORE dataset was released yesterday and registered in the Linked Data cloud (http://ckan.net/package/core). The CORE project exposes data about similarities between papers in the Open Access domain. We are providing links to the OAI repository. The similarities are calculated using Natural Language Processing techniques based on the full text. This distinguishes CORE from other systems, such as Mendeley or MarcXimiL. The similarities are provided only for research articles with an accessible and machine-readable full text.

    At the moment we expose more than 3 million RDF triples describing similarities calculated on a set of more than 50,000 full-text articles harvested from British Open Access repositories. In the future we want to harvest information and content from as many Open Access repositories as possible. At the moment there are more than 1,900 of them and we are processing content from only 143 British repositories. We aim to process all full-text articles available online and to make information about record similarities available in a machine-readable format. As a result, the number of RDF triples in our store is likely to grow significantly. Have a look at the data description at http://core-project.kmi.open.ac.uk/node/13#overlay=node/13 or check some example queries at http://core.kmi.open.ac.uk:8081/COREWeb/example-queries to see what is available.
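For readers who want to query such similarity triples programmatically, the sketch below shows how a SPARQL query can be encoded as an HTTP GET request, as defined by the SPARQL Protocol. The endpoint URL and the `similarTo` property are placeholders invented for this example; the actual endpoint and vocabulary are those given in the data description and the example queries linked above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and property, used only for illustration.
ENDPOINT = "http://example.org/sparql"
SIMILAR_TO = "http://example.org/vocab#similarTo"


def similar_papers_query(paper_uri, limit=10):
    """Build a SPARQL SELECT query listing resources linked from paper_uri."""
    return (
        "SELECT ?other WHERE { "
        f"<{paper_uri}> <{SIMILAR_TO}> ?other "
        f"}} LIMIT {int(limit)}"
    )


def request_url(query, endpoint=ENDPOINT):
    """Encode the query in the 'query' parameter of a SPARQL Protocol GET request."""
    return f"{endpoint}?{urlencode({'query': query})}"
```

The resulting URL can be fetched with any HTTP client; the response format depends on the `Accept` header the client sends and on what the endpoint supports.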
