From the CERIF Model to the Solr Index

September 11, 2011

Part of the challenge of the BRUCE project is to take a highly relational model like CERIF and convert it into something which can be adequately indexed for searching and faceting.

Apache Solr, like many traditional search engines, works on the principle of key-value pairs. A key-value pair is simply an assertion that some value (on the right) is associated with some key (on the left). Examples of key-value pairs are:

name : Richard
project : bruce
organisation : Brunel University

Typically, the keys on the left come from a set of known terms, while the values on the right can vary arbitrarily. Therefore, when you search for documents belonging to “Richard”, you are asking which documents have the value “Richard” associated with the key “name”.

In addition, keys are often repeatable (although, depending on the search index schema, this might not always be the case), so you could have multiple “name” keys, with different values.
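To make this concrete, here is a minimal sketch of asking that question through the SolrPy library (the same library used later by SolrEyes); the URL and index contents are illustrative:

import solr

# connect to a local Solr instance (illustrative URL)
conn = solr.SolrConnection('http://localhost:8983/solr')

# which documents have the value "Richard" associated with the key "name"?
response = conn.query('name:Richard')
for hit in response.results:
    print(hit['name'])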

Approach

The objective, then, is for us to convert the graph-like structure of CERIF (that is, it has entities and relationships which do not follow a hierarchy) into the flat key-value structure of a search index. It should be clear from the outset, therefore, that data loss will necessarily result from this conversion; it is not possible to fully and adequately represent a graph as a set of key-value pairs.

The project aimed, instead, to extract the key information from the CERIF schema as seen from the point of view of one of the Base Entities.

There are 3 Base Entities in CERIF: Publications, People and Organisational Units. Since BRUCE is concerned with reporting principally on staff, we selected People as the Base Entity from which we would view the CERIF graph. By doing this we reduce the complexity of the challenge, since a graph viewed from the point of view of one of its nodes behaves like a hierarchy at least in the immediate vicinity (see the real analysis of this, below, for a clear example).

Our challenge is then simplified to representing a tree structure as a set of key-value pairs.

The second trick is to decide what kind of information we actually want to report on, and narrow our indexing to the fields in the CERIF schema which are relevant to those requirements. This allows us to index values which are actually closely related to each other as totally separate key-value pairs: as long as the index provides enough information for searching and faceting, it won’t matter that information about their relationship to each other is lost.

For example: suppose we want to index the publications associated with a person, and we want to be able to list those publications as well as providing an integer count of how many publications were published by that person in some time frame. Initially this might look quite difficult, as a “publication” is a collection of related pieces of information, such as the title, the other authors, the date of publication, and other administrative terms such as page counts and so on. To place this in a set of key-value pairs would require us to do something like:

title: My Publication
publication_date: 01-09-2008
pages: 10

This is fine if there is only one publication by the person, but if they have multiple publications it would not be possible to tell which publication_date was associated with which title.

Instead, we have to remember that this is an index and not a data store. If we wish to list publication titles and count publications within date ranges, then it is just necessary for us to index the titles and the dates separately and ensure that they are used separately within the index. So we may have:

title: My First Paper
title: My Second Paper
publication_date: 01-09-2008
publication_date: 23-05-2009

This configuration loses data by not maintaining the links between publication_date and title, but is completely adequate for the indexing and faceting requirements.

To meet our original requirement stated above, we can simply count the number of publication_date keys containing a date which lies within our desired time frame and return this integer count, while simultaneously listing the titles of the publications. The fact that these two pieces of information are not related in the index makes no difference to producing the desired outcome.
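To make the counting step concrete, here is a minimal client-side sketch in Python, using invented values shaped like the index fields above:

from datetime import date

def count_in_range(dates, start, end):
    # count the publication_date values which lie within the time frame
    return sum(1 for d in dates if start <= d <= end)

doc = {
    'title': ['My First Paper', 'My Second Paper'],
    'publication_date': [date(2008, 9, 1), date(2009, 5, 23)],
}

# publications in 2008 -> 1, while both titles remain listable
print(count_in_range(doc['publication_date'], date(2008, 1, 1), date(2008, 12, 31)))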

CERIF schema

The CERIF schema that we are working with is a limited sub-set of the full model, and has been presented in a previous post. The tables which describe the graph, and the columns within them that we are interested in, are:

CERIF table: columns
cfPers: cfPersId, cfGender
cfPers_Class: cfPersId, cfClassSchemeId, cfClassId
cfPersName: cfPersId, cfFirstNames, cfOtherNames, cfFamilyNames
cfPers_ResPubl: cfPersId, cfResPublId, cfClassSchemeId, cfClassId
cfPers_OrgUnit: cfPersId, cfOrgUnitId, cfClassSchemeId, cfClassId, cfFraction, cfEndDate
cfPers_Pers: cfPersId1, cfPersId2, cfClassSchemeId, cfClassId
cfPers_Fund: cfPersId, cfFundId
cfFund: cfFundId, cfCurrCode, cfAmount
cfOrgUnit: cfOrgUnitId, cfHeadcount
cfOrgUnitName: cfOrgUnitId, cfName
cfOrgUnit_OrgUnit: cfOrgUnitId1, cfOrgUnitId2, cfClassSchemeId, cfClassId
cfResPubl: cfResPublId, cfResPublDate
cfResPublTitle: cfResPublId, cfTitle
cfResPubl_Class: cfResPublId, cfClassSchemeId, cfClassId

Next, imagine that we pick up the graph by cfPers, using cfPersId as the identifier which relates the person to all the other entities; a rough hierarchy emerges:

cfPersId
    cfGender
    cfClassSchemeId
    cfClassId
    cfFirstNames
    cfOtherNames
    cfFamilyNames
    cfResPublId
        cfClassSchemeId
        cfClassId
        cfResPublDate
        cfTitle
    cfOrgUnitId
        cfClassSchemeId
        cfClassId
        cfFraction
        cfEndDate
        cfHeadcount
        cfName
        cfOrgUnitId2**
    cfFundId
        cfCurrCode
        cfAmount

With the exception of the Org Unit data (marked with **), the result is a straightforward enough hierarchy. We can avoid traversing the graph that emerges under the organisational unit data by ensuring that the cfPers_OrgUnit table contains all the relationships that are relevant during indexing, so that we never need to index the org unit graph itself when preparing an index from the perspective of the person.

Solr index

The Solr index allows us to specify a field name (the key, in the key-value pair), and whether that field is repeatable or not. Each set of key-value pairs is grouped together into a “document”, and that document will represent a single person in the CERIF dataset, along with all the relevant data associated with them. When we have fully built our index, there will be one document per person.
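To make this concrete, a single person document might look something like the following sketch, written here as a Python dict (the field names are those defined in the table below; the values are invented for illustration):

person_doc = {
    'entity': 'cfPers',                                # a person-oriented document
    'id': 'pers-0001',                                 # hypothetical cfPersId
    'gender': 'M',
    'name': 'Burnham, S W',
    'primary_department': 'Astronomy',
    'publication_date': ['1880-03-01', '1880-09-12'],  # multi-valued
    'publication_id': ['publ-0042', 'publ-0043'],      # multi-valued
}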

The Solr index which then meets our requirements is constructed from the above CERIF data as follows:

Field (single/multi): CERIF source. Notes.

entity (single): the literal value “cfPers”. Indicates that this is a person-oriented document. This allows us to extend the index to other kinds of entities as well, all represented within one schema.
id (unique): cfPersId. A unique id representing the entity. When other entities are included in the index, this could also be their ids (e.g. cfResPublId).
gender (single): cfGender.
name (single): a combination of cfFirstNames, cfOtherNames and cfFamilyNames. This is the first person name encountered in the database; it is used for sorting and presented as the author’s actual name. There is another field for name variants.
name_variants (multi): a combination of cfFirstNames, cfOtherNames and cfFamilyNames. This allows us to hold multiple names for the author for the purposes of searching, although they will not be used for sorting or presented to the end user.
contract_end (single): cfOrgUnit/cfEndDate. Taken from the cfEndDate field in the cfPers_OrgUnit table where the relationship is tagged by cfClassId as Employee.
funding_code (multi): cfFundId.
org_unit_name (multi): cfOrgUnit/cfName.
org_unit_id (multi): cfOrgUnit/cfOrgUnitId.
primary_department (single): cfOrgUnit/cfName. This differs from org_unit_name in that it is the department the person should be considered most closely affiliated with: for example, their department or research group. It is used specifically for display and sorting, which is why it may only be single valued.
primary_department_id (single): cfOrgUnit/cfOrgUnitId. The id for the department contained in primary_department.
primary_position (single): cfOrgUnit/cfClassId. The position that the person holds in their primary department (e.g. “Lecturer”).
fte (single): cfOrgUnit/cfFraction. The fraction of the time that the person works for the organisational unit whose relationship is tagged with a cfClassId of Employee.
supervising (multi): cfPers_Pers/cfPersId2. The ids of the people that the person is supervising. These can be identified where the cfPers_Pers relationship has a cfClassId of Supervising.
publication_date (multi): cfResPubl/cfResPublDate. The dates upon which the person published any result publications. This is a catch-all for all types of publication; individual publication types are broken down in the following index fields.
publication_id (multi): cfResPubl/cfResPublId. The ids of all the publications of any kind which the person published.
journal_date (multi): cfResPubl/cfResPublDate. The dates of publication of all publications which have a cfClassId of “Journal Article”.
journal_id (multi): cfResPubl/cfResPublId. The ids of publications which have a cfClassId of “Journal Article”.
book_date (multi): cfResPubl/cfResPublDate. The dates of publication of all publications which have a cfClassId of “Book”.
book_id (multi): cfResPubl/cfResPublId. The ids of publications which have a cfClassId of “Book”.
chapter_date (multi): cfResPubl/cfResPublDate. The dates of publication of all publications which have a cfClassId of “Inbook”.
chapter_id (multi): cfResPubl/cfResPublId. The ids of publications which have a cfClassId of “Inbook”.
conference_date (multi): cfResPubl/cfResPublDate. The dates of publication of all publications which have a cfClassId of “Conference Proceedings Article”.
conference_id (multi): cfResPubl/cfResPublId. The ids of publications which have a cfClassId of “Conference Proceedings Article”.

These terms are encoded in a formal schema for Solr which can be found here.

Data Import

Apache Solr provides what it calls “Data Import Handlers” which allow you to import data from different kinds of sources into the index. Once we have configured the index as per the previous section we can construct a Data Import Handler which will import from the CERIF MySQL database.

This is effectively a set of SQL queries which are used to populate the index fields in the ways described in the previous section. Representative examples of the kinds of queries involved include:

SELECT cfPers.cfPersId, cfPers.cfGender, 'cfPers' AS entity
FROM cfPers 
    INNER JOIN cfPers_Class 
        ON cfPers.cfPersId = cfPers_Class.cfPersId 
WHERE cfPers_Class.cfClassSchemeId = 'BRUCE' 
    AND cfPers_Class.cfClassId = 'Main';

This query is at the root of the Data Import Handler, and selects our cfPersId which will be the central identifier that we will use to retrieve all other information, as well as any information which we can quickly and easily obtain by performing a JOIN operation across the cfPers* tables.

SELECT concat(cfFamilyNames, ', ', cfFirstNames, ' ', cfOtherNames) AS cfName 
FROM cfPersName 
WHERE cfPersId = '${person.cfPersId}'
LIMIT 1;

This query selects the first of the person’s name records and performs the appropriate concatenation to turn the three name parts cfFamilyNames, cfFirstNames and cfOtherNames into a single usable string.

SELECT cfEndDate 
FROM cfPers_OrgUnit
WHERE cfPersId = '${person.cfPersId}'
    AND cfClassId = 'Employee' 
    AND cfClassSchemeId = 'cfCERIFSemantics_2008-1.2';

This query selects the person’s contract end date by looking for the organisational unit to which the person’s relationship (cfPers_OrgUnit) is annotated with the cfClassId ‘Employee’.

SELECT cfResPubl.cfResPublId, cfResPubl.cfResPublDate 
FROM cfResPubl 
    INNER JOIN cfPers_ResPubl 
        ON cfPers_ResPubl.cfResPublId = cfResPubl.cfResPublId
    INNER JOIN cfResPubl_Class
        ON cfResPubl.cfResPublId = cfResPubl_Class.cfResPublId
WHERE cfPers_ResPubl.cfPersId = '${person.cfPersId}'
    AND cfResPubl_Class.cfClassSchemeId = 'cfCERIFSemantics_2008-1.2'
    AND cfResPubl_Class.cfClassId = 'Journal Article';

This query selects the ids and dates of publications by the selected person which have a class of ‘Journal Article’.
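To see how the root query and the nested per-person queries fit together, the following Python sketch mirrors the traversal that the Data Import Handler performs (the real wiring is an XML configuration file; the connection settings here are placeholders):

import MySQLdb

# placeholder connection settings for the CERIF MySQL database
conn = MySQLdb.connect(host='localhost', user='bruce', passwd='secret', db='cerif')
root = conn.cursor()

# the root query: one row per person document
root.execute(
    "SELECT cfPers.cfPersId, cfPers.cfGender, 'cfPers' AS entity "
    "FROM cfPers INNER JOIN cfPers_Class "
    "ON cfPers.cfPersId = cfPers_Class.cfPersId "
    "WHERE cfPers_Class.cfClassSchemeId = 'BRUCE' "
    "AND cfPers_Class.cfClassId = 'Main'")

for person_id, gender, entity in root.fetchall():
    # each nested query is parameterised by the person id, just as
    # '${person.cfPersId}' is in the Data Import Handler configuration
    sub = conn.cursor()
    sub.execute(
        "SELECT concat(cfFamilyNames, ', ', cfFirstNames, ' ', cfOtherNames) "
        "AS cfName FROM cfPersName WHERE cfPersId = %s LIMIT 1", (person_id,))
    row = sub.fetchone()
    # ... run the remaining nested queries in the same way, then assemble
    # the key-value pairs into a document and post it to the Solr index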

We will not go into any further detail here; instead, the code which provides the Data Import functionality can be obtained here.

It is probably worth noting, though, that these queries are quite long and involve JOINing across multiple database tables, which makes reporting on the data hard work if done directly from source. The BRUCE approach compresses all of this into a single Data Import Handler, and leaves the interesting work to much simpler search engine queries.

Use of the index

Once we have produced the index, we feed it into SolrEyes (discussed in more detail here) which is configured to produce the following functionality based on the indexed values:

Field: usage

entity: facet
id: unused (required for the index only)
gender: facet, result display
name: sort, result display
name_variants: currently unused
contract_end: facet, sort, result display
funding_code: result display
org_unit_name: currently unused
org_unit_id: currently unused
primary_department: sort, result display
primary_department_id: currently unused
primary_position: facet, result display
fte: facet, sort, result display
supervising: result display (a function presents the number of people being supervised by the person)
publication_date: facet, result display (a function counts the number of publications within the date range specified by the facet)
publication_id: currently unused
journal_date: result display (a function counts the number of journal articles within the date range specified by the publication_date facet)
journal_id: currently unused
book_date: result display (a function counts the number of books within the date range specified by the publication_date facet)
book_id: currently unused
chapter_date: result display (a function counts the number of book chapters within the date range specified by the publication_date facet)
chapter_id: currently unused
conference_date: result display (a function counts the number of conference papers within the date range specified by the publication_date facet)
conference_id: currently unused

Key:

facet
used to create the faceted browse navigation
result display
used when presenting a “document” to the user. Sometimes the value is a function of the actual indexed content.
sort
used for sorting the result set

Note that a more thorough treatment of the Solr index would split the fields up into multiple indexed fields which are customised for their purposes, but we have not done this in the prototype. For example, fields used for sorting would go through normalising functions to ensure consistent sorting across all values, while displayable values would be stored unprocessed.
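As an illustration, a sort-normalising step could be as simple as the following sketch (illustrative only; the prototype does not do this):

def sort_value(value):
    # normalise for sorting: trim whitespace and lowercase, so that
    # 'smith' and 'Smith' sort together
    return value.strip().lower()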

We can now produce a user interface like that shown in the screen shot below.

The approach used here could be extended to cover more features of the person Base Entity. Equally, other Base Entities (and, indeed, any entity in the CERIF model) could be placed at the centre of the report and their resulting hierarchies of properties mapped into sets of key-value pairs, and all could co-exist comfortably in the same search index.


SolrEyes

August 19, 2011

SolrEyes is a basic but effective wrapper around Apache Solr which has been developed by the BRUCE project as a replacement for Blacklight.

As described in our previous post, we had significant problems stabilising Blacklight, so a brief exploratory exercise was carried out to replicate the functionality that the project actually needed (not all of Blacklight’s functionality was necessary for our purposes). Having been successful, we went on to introduce support for ranged facets, allowing us to limit by date or any other rangeable field.

Technology

SolrEyes uses the SolrPy Python library to communicate with Apache Solr and the Mako templating language to provide the user interface. It presents the results of search requests to the user with facet counts and current search parameters alongside the search results themselves. It allows facets to be added and removed, supports sorting and sub-sorting, and fully supports flexible paging over result sets.
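As an illustration of the kind of request SolrEyes makes through SolrPy (the field names are those from the BRUCE index; SolrPy translates keyword arguments such as facet_field into the corresponding facet.field Solr parameters):

import solr

conn = solr.SolrConnection('http://localhost:8983/solr')

# search everything, facet on gender and primary position, sort by name,
# and page through the results 25 at a time
response = conn.query('*:*', sort='name', rows=25, start=0,
                      facet='true', facet_field=['gender', 'primary_position'])
print(response.numFound)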


Note that the data presented in these screen shots is artificial, and should not be considered indicative of anything.

It is strongly inspired by Blacklight, and provides all of the basic search and facet functionality of that system. In addition, configuration is done via a JSON document, which makes it easy to separate from the application and to modify and extend.

Reporting features

A key difference between SolrEyes and Blacklight is that SolrEyes comes pre-prepared for some reporting features. Every search constraint and every search result field is passed through a processing pipeline which converts it from the value in the Solr index into the display value, and that pipeline is created in configuration by the user.

In the simplest case, this allows us to switch the indexed value M in the gender field for the word Male before it is displayed. This is done by specifying that the facet values for the gender field should be passed to a function called value_map which maps M to Male and F to Female:

"facet_value_functions" : {
    "gender" : {"value_map" : {"M" : "Male", "F" : "Female"}}
}


This shows a configuration option which exchanges facet values for display values in the gender field.
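A sketch of what such a value_map function might look like internally (the actual SolrEyes implementation may differ):

def value_map(value, mapping):
    # return the configured display value for an indexed value, falling
    # back to the indexed value itself if no mapping is provided for it
    return mapping.get(value, value)

print(value_map('M', {'M': 'Male', 'F': 'Female'}))  # -> Male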

This approach could also be used, though, to substitute date ranges for descriptions, such as “RAE 2008”, or other useful terms.

At the more powerful end of the spectrum, though, this feature can be used to process result values themselves to present information as functions of those values. An example of the way this is used in BRUCE is as follows:

We wish to present counts of the number of publications that researchers have published in the reporting period. The reporting period can be set by choosing the appropriate date range from the navigation (this constrains the publication_date field to contain values from that range). This means that we cannot index this data in advance, as it is dependent on the exact date range that the user selects, which could be absolutely anything. Instead, we pipe the selected date range and a result field containing the dates on which the researcher published to a function which compares those publication dates with the constraint range and returns a count of the publications which fall within it. In order to achieve this effect, the documents in our index contain a list of the dates upon which the author published.

"dynamic_fields" : {
    "period_publications_count" : {
        "date_range_count" : {
            "bounding_field" : "publication_date",
            "results_field" : "publication_date"
        }
    }
}


This shows a configuration of a “dynamic field” which presents the count of values in the index field publication_date which fall within the constraining facet publication_date.
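A sketch of what the date_range_count function might do internally (the signature here is hypothetical; the real SolrEyes internals may differ):

def date_range_count(args, constraints, document):
    # 'args' is the configuration shown above; 'constraints' maps field
    # names to the (start, end) range currently selected by the user;
    # 'document' is a single Solr result
    start, end = constraints[args['bounding_field']]
    dates = document.get(args['results_field'], [])
    return sum(1 for d in dates if start <= d <= end)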


This screenshot shows a single record which has been constrained to all publications from a single year (see the green box which displays the constraint). The final 6 result columns contain values which are dynamically generated by comparing the publication dates of the different publication types with that constraint. So, here, S W Burnham is seen to have published 2 items in 1880: 1 Book and 1 Conference Paper.

External Take Up

SolrEyes has proved sufficiently simple to operate and configure, while providing useful functionality, that it has also had some take-up outside of the project.

The functionality was deliberately designed to be flexible enough for other use cases (although the reporting use cases were the ones the project team focussed upon), and as such it has also found use as a front-end for a bibliographic data index.

The Open Bibliography project (which provided the MedLine data that the BRUCE project built the CERIF test data from), the OKF and Cottage Labs are also involved in the development of the BibJSON standard and related BibServer software which powers the under-development BibSoup service. This service is using SolrEyes to operate the search and faceted browse features, and so the software is already getting feedback and enhancements from external developers.

We hope that SolrEyes fills a niche for a simple but powerful interface to Apache Solr. Its advantages over Blacklight and VuFind are in the simplicity of the environment and a generic approach to presenting the contents of a search index (both Blacklight and VuFind are more geared towards providing catalogue interfaces).

Using SolrEyes

The SolrEyes software can be downloaded here.

SolrEyes requires the most recent development version of SolrPy (we found a bug in 0.9.4 and submitted a patch, which was accepted but has not yet been packaged in a formal release). You can install the latest version with (all on one line):

sudo pip install -e hg+https://solrpy.googlecode.com/hg/#egg=solrpy

You will also need to install web.py and mako which you can do with easy_install:

sudo easy_install web.py
sudo easy_install mako

Next go into the directory where you downloaded SolrEyes and modify the config.json file with your custom configuration (documentation is inline).

Finally, you can start SolrEyes by executing:

python solreyesui.py 8080

This will start SolrEyes on port 8080 on localhost.

We are very interested in taking the development of SolrEyes forward so please contact us if you have any questions, feedback or suggestions.


Blacklight, Solr, Ruby and Rails on Ubuntu

April 27, 2011

The BRUCE project is using Apache Solr and Project Blacklight as its core technologies, and at least some of the development is taking place on Ubuntu. The Blacklight Ubuntu install has a couple of gotchas which aren’t covered by the standard Blacklight documentation and aren’t so well served by the various Rails on Ubuntu blog posts that we read. So here is a very short write-up of an approach to Blacklight, Solr, Ruby and Rails on Ubuntu:

First, Some Useful Links

The Blacklight README

The Blacklight Pre-Requisites

How-To for Ruby 1.8 on Ubuntu

Install Ruby

The following commands install everything you need to run ruby (it’s more than is required just for Blacklight):

sudo aptitude install build-essential zlib1g zlib1g-dev libxml2 libxml2-dev libxslt-dev

sudo aptitude install sqlite3 libsqlite3-dev locate git-core libmagick9-dev

sudo aptitude install curl wget

sudo aptitude install ruby1.8-dev ruby1.8 ri1.8 rdoc1.8 irb1.8 libreadline-ruby1.8

sudo aptitude install libruby1.8 libopenssl-ruby rubygems1.8

Ruby does not put everything on the system PATH, so set the following symlinks:

sudo ln -s /usr/bin/rdoc1.8 /usr/bin/rdoc

sudo ln -s /usr/bin/irb1.8 /usr/bin/irb

sudo ln -s /usr/bin/ri1.8 /usr/bin/ri

sudo ln -s /usr/bin/gem1.8 /usr/local/bin/gem

Install Java

Solr depends on Java, so if you don’t have a JDK installed already, get one:

sudo aptitude install openjdk-6-jdk

Add some new Ruby Gems sources

We need to add a couple of sites where gem will get the ruby gems from. These are the most useful ones, which will cover us for this install.

gem sources -a http://gems.github.com

gem sources -a http://gems.rubyonrails.org

gem sources -a http://gemcutter.org

Install the required Gems (including Rails)

Now we use gem to install the rest of the ruby dependencies for Blacklight. NOTE that we DO NOT install Rails using aptitude. This would be a disaster and nightmare rolled into one.

sudo gem install rubygems-update

sudo gem install rake nokogiri hpricot builder cheat daemons

sudo gem install json uuid rmagick sqlite3 fastthread rack

Finally install the specific version of rails recommended by the Blacklight documentation. NOTE that this command can take a while to initialise, but it will get there in the end, just be patient.

sudo gem install -v=2.3.11 rails

Put relevant Gems on the PATH

On Ubuntu, the gem installer does not place your executables on the PATH. This is the key step that all the Rails on Ubuntu install documents miss out. The executables are all located in:

/var/lib/gems/1.8/bin/

You can execute them directly from there by using the full path to the binary, but this is no good for automated tasks such as the Blacklight installer, which expect the executables to be available. Also, you can’t just add this directory to your user’s PATH variable, as this will not propagate up into the root user’s PATH when sudo is used; instead, you must symlink the binaries you want to use into an existing PATH location:

sudo ln -s /var/lib/gems/1.8/bin/rails /usr/bin/rails

sudo ln -s /var/lib/gems/1.8/bin/rake /usr/bin/rake

Install Blacklight

Now we can follow the install parts of the Blacklight documentation. This basically comes down to:

rails ./blacklight-app -m http://projectblacklight.org/templates/latest.rb

The exact command depends on which version of Blacklight you are trying to install; see their README for more detail.

Use the following answers to the questions posed by the installer (or don’t – this is just what we did):

  1. yes
  2. no
  3. yes
  4. yes
  5. no
  6. yes

Some install options may fail. Since Blacklight installs Solr for you but doesn’t start it, the installer will fail to load the default MARC data. If your PATH isn’t set up properly (as described above) it will also fail to run the rake commands; this can be fixed by sorting out the PATH and then running:

sudo rake gems:install

rake db:migrate RAILS_ENV=development

Initialising Solr

Start the Solr web service from the jetty directory:

./blacklight-app/jetty$ java -jar start.jar

To confirm that it is up, visit the URL:

http://localhost:8983/solr/

If you want to index the demo MARC data, run the following in the blacklight directory:

rake solr:marc:index_test_data

Starting Blacklight

Start the ruby server:

./blacklight-app$ ./script/server

To confirm that it is up, visit the URL:

http://localhost:3000/

And that’s it. Blacklight and Solr should now be up and running. Have fun with it!


What is BRUCE?

March 14, 2011

The BRUCE project aims to develop a prototype tool, based on CERIF, that will facilitate the analysis and reporting of research information from data sources that are already in use at the majority of HEIs. 

What does that actually mean?  HEIs already collect lots of information about the research that they do but that information tends to be stored in separate silos, e.g. the HR database, the institutional repository, student records, etc.  The idea of BRUCE is to pull that data out of those silos, index it using CERIF and then create a new tool that can analyse and report on that data.


The BRUCE Project at Brunel University

March 14, 2011

Welcome to the BRUCE project blog!

The BRUCE (Brunel Research Under a CERIF Environment) project is led by Brunel University in collaboration with St George’s, University of London, Cottage Labs and Symplectic Ltd.

The project is funded by JISC under the Research Information Management strand of the Infrastructure for Education and Research Programme (JISC Grant Funding 15/10) and you can read the full project proposal on the JISC website.

We will use this blog to keep you up to date on our progress with the project which will run from February to July 2011.