At this stage in the project our main objective is to implement a “vertical” slice of the research reporting process: taking some source data, mapping it into CERIF, storing it in a CERIF-compliant database, and then indexing that data with Apache Solr for display and interaction via Blacklight, which will ultimately be used to generate reports on the research information. There are a number of challenges involved in this process:
- How to map the data sources such as HESA, SITS, HR and Publications data into CERIF. In some cases there will be clear mappings, in others some creativity may be required, and in yet others a mapping may not be possible.
- How to turn the complex relational schema that is CERIF into a flat, indexable, set of key/value pairs which can be used by Solr and make sense to the user of the reporting software
- How to configure Solr
- How to configure Blacklight
At the moment we have the following technical outputs from the project:
- A test CERIF dataset created using the Open Biblio project’s Medline dataset as the seed data
- A MySQL CERIF schema which was acquired from euroCRIS
- A theoretical mapping from the datasources to CERIF (not yet implemented)
- A set of Solr configuration files and data importers which relate the MySQL CERIF database to a set of flat key/value pairs which meet the requirements of the project’s exemplar report. No general configuration has been produced for CERIF yet, as we are focussed on this specific vertical.
- Some installation and configuration experience with Blacklight. We have done a number of demonstrations of Blacklight to investigate what the final interface will look like, but as yet no realistic data has been presented through it.
- A high-spec dedicated project server, with the capacity to store and process the large quantities of data that will be generated throughout the project. It has been installed and is ready to start working with the data.
Experiences with CERIF
Overall, mapping data to and from CERIF has not been too troublesome. It is a relational standard, which means that flattening it for Solr has been a bit tricky (more on that later). In addition, it does not always have clear ways of representing the data we want to represent, and it appears that the Semantic Layer is where most of the complexity will ultimately reside.
Experiences with Solr
Solr has been reliable (if complex to configure) throughout the process, and the project team is now comfortable and confident that it meets most if not all of the requirements that will be placed on it.
Experiences with Blacklight
Blacklight has so far been the weak link in the project. It is extremely difficult to install and configure, and no two installations go the same way, so a large amount of time has been sunk into trying to make it work at all. It is partly for this reason that the project is not yet displaying the data from Solr in Blacklight.
Flattening CERIF for Solr
As CERIF is a relational format, flattening it for indexing by Solr has required some care. We cannot represent all of the data in the CERIF database exactly as it appears in MySQL, since a Solr index is a collection of flat documents and has no equivalent of the joins between tables that a relational database provides.
Instead we have begun to construct Solr documents (effectively these are Object Classes) which are designed to meet the reporting requirements. That is, for our exemplar report (see linked presentation), which is focussed on individuals, we create Solr documents which have the person as the key entity, and we add to each document extensive information about the organisational units that the person is part of, their publications, and so on.
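As a minimal sketch of this person-centred flattening, the Python below uses an in-memory SQLite database in place of the project's MySQL instance. The table and column names are simplified stand-ins loosely modelled on CERIF entities and link tables, and the output field names (`id`, `name`, `org_unit`, `publication_title`) are hypothetical. None of them is taken from the project's actual schema or Solr configuration.

```python
import sqlite3


def build_demo_db():
    """Create a tiny relational dataset (illustrative schema, not real CERIF)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (pers_id TEXT PRIMARY KEY, family_names TEXT, first_names TEXT);
        CREATE TABLE org_unit (org_id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE publication (publ_id TEXT PRIMARY KEY, title TEXT);
        -- Link tables: one row per relationship, as in a relational model.
        CREATE TABLE person_org_unit (pers_id TEXT, org_id TEXT);
        CREATE TABLE person_publication (pers_id TEXT, publ_id TEXT);

        INSERT INTO person VALUES ('p1', 'Smith', 'Alice');
        INSERT INTO org_unit VALUES ('o1', 'Department of Chemistry');
        INSERT INTO org_unit VALUES ('o2', 'Faculty of Science');
        INSERT INTO publication VALUES ('r1', 'A Study of Things');
        INSERT INTO publication VALUES ('r2', 'Further Things');
        INSERT INTO person_org_unit VALUES ('p1', 'o1');
        INSERT INTO person_org_unit VALUES ('p1', 'o2');
        INSERT INTO person_publication VALUES ('p1', 'r1');
        INSERT INTO person_publication VALUES ('p1', 'r2');
    """)
    return conn


def flatten_person(conn, pers_id):
    """Collapse one person and their linked rows into a flat key/value document."""
    row = conn.execute(
        "SELECT family_names, first_names FROM person WHERE pers_id = ?",
        (pers_id,),
    ).fetchone()
    if row is None:
        return None
    family_names, first_names = row

    # Each one-to-many relationship becomes a multi-valued field.
    org_units = [r[0] for r in conn.execute(
        "SELECT o.name FROM org_unit o "
        "JOIN person_org_unit po ON po.org_id = o.org_id "
        "WHERE po.pers_id = ? ORDER BY o.org_id",
        (pers_id,),
    )]
    titles = [r[0] for r in conn.execute(
        "SELECT p.title FROM publication p "
        "JOIN person_publication pp ON pp.publ_id = p.publ_id "
        "WHERE pp.pers_id = ? ORDER BY p.publ_id",
        (pers_id,),
    )]

    return {
        "id": f"person:{pers_id}",
        "name": f"{first_names} {family_names}",
        "org_unit": org_units,
        "publication_title": titles,
    }


if __name__ == "__main__":
    print(flatten_person(build_demo_db(), "p1"))
```

The point of the sketch is the shape of the result: the joins are resolved once, at indexing time, and the relationships survive only as repeated values on the person document. That is a trade-off, since detail held on the link rows themselves is lost unless it is copied into further fields.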
Later we will construct documents which are designed to meet other reporting requirements, and may therefore be organisation or publication oriented. With a well designed Solr schema, all these different documents will co-exist comfortably side-by-side in the index, and we’ll be able to generate a variety of different kinds of report based on that data.
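One common way to let differently oriented documents share an index is to give each one a type field and a type-prefixed identifier, then restrict each report's query to the type it needs. The sketch below illustrates that idea under assumptions: the `doc_type` field and the helper names are hypothetical, while `q`, `fq` and `wt` are standard Solr query parameters. It only builds the query string and does not contact a Solr server.

```python
from urllib.parse import urlencode


def make_doc(doc_type, entity_id, **fields):
    """Build a flat document tagged with its type (the doc_type field is hypothetical)."""
    doc = {"id": f"{doc_type}:{entity_id}", "doc_type": doc_type}
    doc.update(fields)
    return doc


def select_params(doc_type, query="*:*"):
    """Encode Solr query parameters that restrict results to one document type."""
    return urlencode({"q": query, "fq": f"doc_type:{doc_type}", "wt": "json"})


# Person-oriented and organisation-oriented documents in the same collection.
docs = [
    make_doc("person", "p1", name="Alice Smith", org_unit=["Department of Chemistry"]),
    make_doc("organisation", "o1", name="Department of Chemistry", member=["Alice Smith"]),
]

if __name__ == "__main__":
    for doc in docs:
        print(doc)
    print(select_params("person"))
```

Prefixing the identifier with the type keeps a person and an organisation that happen to share a source identifier from overwriting each other in the index.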
Next Steps
- Finalise the datasource mappings to CERIF
- Harden the CERIF to Solr indexing process based on the final datasource mappings
- Get Blacklight to behave
- Generate reports from search results. The project is looking at Prawn, a Ruby PDF generation library that can be used from a Rails application, to produce PDFs of the results.