CERIF Test Data

August 17, 2011

Due to the privacy and data protection status of the real research information at Brunel – which includes data such as pay scales – it is not possible for the project to demonstrate its tools to people who are not (at the very least) Brunel employees. Furthermore, that data cannot even be taken off-site or placed onto computers which are not under the direct control of the university. Combine this with the project's need for two parallel development tracks – one mapping source data (such as HR and publications) into CERIF, the other indexing and reporting on that CERIF data – and there is a compelling need for a test dataset.

A test CERIF dataset could be used in any demonstrations of the project outputs, and could be put in place for the CERIF indexing side of the project so that it is not critically dependent on the outputs of the data mapping side.

Initially we had hoped that such a dataset already existed, but there was nothing available on the euroCRIS website (the CERIF guardian organisation) and extensive searching turned up nothing of value. There are other JISC projects which may ultimately have yielded some useful data (such as CERIFy), but they are also running in parallel to BRUCE.

The project therefore developed a piece of software which can be used to generate test data, and has made it available as open source here (in the cerifdata folder at that link).

The approach to developing the test data and the software was as follows:

1. Identify a seed dataset

We were lucky that, at exactly the time we were seeking a seed dataset, the Open Bibliography project – also JISC funded – had succeeded in liberating the MedLine dataset, consisting of around 20 million publication records.

This was an ideal source of the most difficult data to generate artificially: author names and publication titles. By using this dataset as our seed we would be able to generate artificial research data based on open access bibliographic data, giving us the freedom to do whatever we needed with the dataset while keeping it suitably realistic.

2. Define the model we are populating

Although it actually took several iterations, the model we worked towards was as presented in a previous post.

This meant generating data about Staff, Organisational Units and Publications. We have only written code to generate the data required for our example model, but we have endeavoured to write the software itself in a way which allows it to be extended throughout the project and into the future.

3. Develop a flexible production mechanism

The test data is generated by the following process:

First, source data is obtained from the MedLine data file. This source data is then passed through a set of CERIF data “aspect generators” which produce CERIF entities and relationships (such as staff records and their relationships to organisational units and publications). These are then written to CSVs which reflect the database table structure in the CERIF SQL schema. The CSVs are finally converted into a single large SQL file suitable for import into a database.

The architecture of the software is designed to be flexible so that new aspects can easily be added and existing aspects can easily be modified.
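
To make this concrete, here is a minimal sketch of how such an aspect-generator pipeline might be structured. The class and function names are illustrative only, and do not reflect the actual cerifdata API:

    # Illustrative sketch only: these names are hypothetical,
    # not the real cerifdata API.
    import csv
    import uuid

    class PersonAspect:
        """Generate cfPers and cfPersName rows from MedLine author data."""
        def generate(self, records):
            for forename, initials, surname in records:
                person_id = str(uuid.uuid4())
                yield "cfPers", (person_id,)
                yield "cfPersName", (person_id, forename, initials, surname)

    def run_aspects(aspects, source_records):
        # Pass the source data through every registered aspect generator,
        # collecting the output rows keyed by target CERIF table.
        tables = {}
        for aspect in aspects:
            for table, row in aspect.generate(source_records):
                tables.setdefault(table, []).append(row)
        return tables

    def write_csvs(tables):
        # Write one CSV per CERIF table, mirroring the SQL schema.
        for table, rows in tables.items():
            with open(table + ".csv", "w", newline="") as f:
                csv.writer(f).writerows(rows)

Under this shape, adding a new aspect is just a matter of writing another class with a generate method.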

4. Produce the test data

We simply provide one of the MedLine source data files to the program and it generates our test data for us:

$ python data.py /path/to/medline.xml

This produces the CSVs:

$ ls *.csv
cfFund.csv
cfOrgUnit_OrgUnit.csv
cfPers_Class.csv
cfPers_Pers.csv
cfResPublTitle.csv
cfOrgUnit.csv
cfPers.csv
cfPers_Fund.csv
cfPers_ResPubl.csv
cfResPubl_Class.csv
cfOrgUnitName.csv
cfPersName.csv
cfPers_OrgUnit.csv
cfResPubl.csv

For example, the following data are all related through a single person (cfPers):

cfPers.csv
f0b2517b-4b65-4fa5-b562-ff931cd213f2, F

cfPersName.csv
f0b2517b-4b65-4fa5-b562-ff931cd213f2, Teresa, J, Krassa

cfPers_Fund.csv
f0b2517b-4b65-4fa5-b562-ff931cd213f2, MM122

cfPers_OrgUnit.csv
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 1, cfCERIFSemantics_2008-1.2, Employee, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 3, cfCERIFSemantics_2008-1.2, PhD, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 1, cfCERIFSemantics_2008-1.2, Member, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 2, cfCERIFSemantics_2008-1.2, Member, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 3, cfCERIFSemantics_2008-1.2, Member, 1.0, 2019-10-05

cfPers_ResPubl.csv
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 1936bdc4-aadd-4028-bb9e-b9eec2561c00, cfCERIFSemantics_2008-1.2, Author

This shows us a person with ID f0b2517b-4b65-4fa5-b562-ff931cd213f2 who is Female (from cfPers.csv) who has the name Teresa,J,Krassa (from cfPersName.csv) who has funding from funding code MM122 (from cfPers_Fund.csv), who is an Employee of Organisational Unit 1, is a PhD Student in Organisational Unit 3 and is a Member of Organisational Units 1, 2 and 3 (from cfPers_OrgUnit.csv). It also shows that this person is the Author of a Result Publication with ID 1936bdc4-aadd-4028-bb9e-b9eec2561c00 (from cfPers_ResPubl.csv).

This repeats with variations on the data across the entire seed dataset, giving us a rich spread of people, publications, organisational units and relationships between them upon which to carry out our development, testing and demonstrations.

These CSVs are then converted into a single SQL file, which can be imported into our MySQL database and used.
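
The conversion step itself is conceptually simple. A minimal sketch – assuming, purely for illustration, that each CSV file name matches its CERIF table name (as in the listing above), and quoting every value as a string – might look like this:

    # Hypothetical sketch of the CSV-to-SQL step; not the project's
    # actual converter. Assumes CSV file names match CERIF table names.
    import csv
    import glob
    import os

    def csvs_to_sql(directory, outfile="cerif_testdata.sql"):
        # Turn every table CSV into a stream of INSERT statements.
        # For simplicity this sketch quotes all values as strings.
        with open(outfile, "w") as out:
            for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
                table = os.path.splitext(os.path.basename(path))[0]
                with open(path, newline="") as f:
                    for row in csv.reader(f):
                        values = ", ".join(
                            "'%s'" % v.strip().replace("'", "''") for v in row)
                        out.write("INSERT INTO %s VALUES (%s);\n" % (table, values))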

If you wish to use the software yourself, you can download it from version control, but unfortunately at the time of writing the MedLine data in the format required by the program is not publicly available. It is available as n-quads on CKAN, and the project is discussing with Open Bibliography the possibility of also publishing the data in its original format. In the meantime, please feel free to contact us and we will help you obtain the data in the relevant format.


Switching off the Blacklight

August 10, 2011

At the outset of the project we had planned to use Blacklight as the user interface to Apache Solr through which we would present our reporting interface. This post describes the reasons that we subsequently abandoned this approach and developed an alternative which met our requirements.

The principal issue that we had with using Blacklight was the instability of the install process. Although Blacklight is a Ruby on Rails application, and should therefore be highly portable, the technical team had significant problems getting it to work across all the relevant platforms. Much of the development work for the project took place on Linux (Ubuntu) and Mac OS X, but the primary deployment environment was to be Windows; as such, portability is very important.

Installation on Ubuntu was difficult, although not impossible, and we blogged a How-To guide which patched some holes in existing online guides. Results on Windows were variable, with issues of dependency version resolution being the primary difficulty (although this was not the only issue, and was also not limited to the Windows install). Installation on Mac OS X proved too error prone to complete at all.

While we anticipate that these installation problems would ultimately be resolvable, they reduced our confidence in Ruby on Rails as a workable environment and also held up progress on the interesting parts of the project!

Another limitation of Blacklight was that ranged faceting was not supported in the default install; instead, an experimental add-on was available which would have offered this feature. Ranged faceting is a key component for the project, as reporting needs to be limited by date (for example, per academic year or RAE/REF period). Ultimately we decided that – given the difficulties getting started with Blacklight – adopting an experimental add-on would raise the risk of project failure to an unacceptable level (given only 6 months for the whole project).

For these reasons we embarked on a short experiment to explore the difficulty of providing a basic reporting UI from scratch which would meet the project requirements. We found that it took very little time to develop the basic facet viewing features, and so we continued by introducing ranged searching and a more appropriate report generating interface. Having found that we could provide a more stable application (written in Python) with the desired functionality, the project decided to abandon Blacklight and dedicate some development time to our own interface.

It is worth noting that the important features of the reporting approach actually lie in Apache Solr – this does all the hard work in indexing, searching and faceting the content. The User Interface exists purely as a presentation layer, so we do not lose anything by switching from Blacklight to a custom development.
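
For the curious, the kind of request involved looks something like the sketch below. The Solr core URL and the field name are assumptions for illustration – not our actual schema – but facet.range is the standard Solr mechanism for the date-limited reporting described above:

    # Sketch of a Solr range-facet request of the kind the reporting
    # UI issues. URL and field names are illustrative assumptions.
    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "q": "*:*",
        "rows": 0,                       # we only want the facet counts back
        "facet": "true",
        "facet.range": "publication_date",
        "facet.range.start": "2008-01-01T00:00:00Z",
        "facet.range.end": "2011-01-01T00:00:00Z",
        "facet.range.gap": "+1YEAR",     # one bucket per year
        "wt": "json",
    })
    url = "http://localhost:8983/solr/select?" + params
    data = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
    facets = data["facet_counts"]["facet_ranges"]["publication_date"]
    print(facets["counts"])              # alternating bucket start / count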

A future post will provide more details about the custom development.


3rd BRUCE Steering Group Meeting

July 22, 2011

The Project Steering Group, chaired by Professor Geoff Rodgers the Pro Vice Chancellor (Research) at Brunel, met for the third time on Thursday 21st July.  The minutes of the second meeting (Minutes_BPSG2) were agreed and, as agreed by the Group, are now being made public here.


Reporting Data Model for CERIF

July 14, 2011

The picture shows a sketch of the data model we have built so far using CERIF. The model employs two Base Entities: cfPers and cfOrgUnit; one Result Entity: cfResPubl; and one Second Level Entity: cfFund. These cover all of the features we need to store the data for the example reports.


Important features of the model

PhD Supervision

PhD supervisions are modelled as Person-to-Person relationships (cfPers_Pers), annotated with the CERIF 2008 Semantic term “Supervisor”. This is non-contentious, and it is easy to determine the number of people that a particular staff member is supervising.

Payroll Numbers

We have two identifiers for each person: a HESA ID and an Employee (Payroll) ID. We have adopted the HESA ID as the primary identifier for a staff member, but this leaves us with no clear place to store the Payroll ID. The options considered were the Person Keywords (cfPersKeyw) and a Person-to-Person relationship; in the end we decided to model this by creating a new Person record with the Payroll ID as the cfPersId, and to assert a relationship between the HESA-identified person and the Payroll-identified person (in the cfPers_Pers table) with a relationship type of “Payroll” in our own “BRUCE” semantic namespace.
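
For illustration – with made-up identifiers and a simplified column layout – the resulting rows look something like this:

cfPers.csv
0012345678901, F        (HESA-identified record)
P0098765, F             (Payroll-identified record)

cfPers_Pers.csv
0012345678901, P0098765, BRUCE, Payroll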

This “solves” the problem but introduces significant extra complexity into the model. First, from an indexing point of view, it is difficult to obtain a list of unique staff members from the cfPers table, as multiple records refer to the same actual person. For this reason we have also had to introduce a Class for each person record, so that we can identify the “Main” person record.  We have therefore added entries into the cfPers_Class table under the “BRUCE” namespace for “HR” person records and “Main” person records (which are the HESA identified ones). With this in place it is easy again for us to select from the cfPers table a list of the “Main” person records.
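
In indexing code, selecting the deduplicated people then becomes a simple filter. A sketch (the column positions are an assumption about the CSV layout, not the project's actual code):

    # Sketch: collect the IDs of "Main" person records only, skipping
    # the duplicate "HR" (payroll) records. Column order is assumed.
    import csv

    def main_person_ids(pers_class_csv="cfPers_Class.csv"):
        ids = set()
        with open(pers_class_csv, newline="") as f:
            for row in csv.reader(f):
                if len(row) < 3:
                    continue
                person_id, scheme, term = (v.strip() for v in row[:3])
                if scheme == "BRUCE" and term == "Main":
                    ids.add(person_id)
        return ids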

Person relationship to org unit

The Org Unit and Person relationships can be realised in a number of ways, and anecdotally no two organisations do this the same way, which makes any one solution quite brittle. In order to fully relate a person to their organisational units, and to make the data indexable in a way which is useful for reporting, we have constructed the person's relationship to their org units as follows:

  1. The person is an “Employee” of the parent organisation
  2. The person has a Position in the lowest level unit (e.g. department or research group)
  3. The person is a “Member” of all of the organisation units in the hierarchy

This allows us to provide a good indexing solution, and is a reasonable approximation of the real relationships that the person has with their organisation, but it is highly inflexible and includes assumptions about the structure of the organisation. There is no obvious way around this that the project team can determine.
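
As a sketch of rule 3, generating the “Member” relationships amounts to walking up the hierarchy from the person's home unit. The in-memory parent map here is an assumption about how cfOrgUnit_OrgUnit might be loaded, not the project's actual code:

    def member_relations(person_id, home_unit, parent_of):
        # Yield one (person, org unit, "Member") row for the home unit
        # and every ancestor up to the institutional root.
        unit = home_unit
        while unit is not None:
            yield (person_id, unit, "Member")
            unit = parent_of.get(unit)

For the person in the earlier test-data example, whose home unit is 3 inside a 3 → 2 → 1 hierarchy, this yields Member rows for units 3, 2 and 1, matching the cfPers_OrgUnit rows shown above.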

Org unit to org unit relations

We have modelled the org unit to org unit relations as a strict parent/child relationship. The CERIF semantic layer includes a “Part” term, but this term does not clearly indicate direction, so we have adopted the convention that “Part” means ‘Has Part’. So:

Brunel University ‘Has Part’ Natural Sciences
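
In the data this appears as a row in cfOrgUnit_OrgUnit.csv along the lines of the following (identifiers and column layout are illustrative only):

cfOrgUnit_OrgUnit.csv
1, 2, cfCERIFSemantics_2008-1.2, Part

which reads as: org unit 1 (Brunel University) has part org unit 2 (Natural Sciences).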

Person class (for deduplication)

As discussed in the section on Payroll numbers, we have had to annotate each person record with a Class which identifies whether this is the primary record for that person, so that they can be deduplicated during indexing.

The result of this is that, instead of requiring 2 database table rows per person, we now need 6, which is a significant increase in storage space and complexity, and suggests a flaw in the CERIF model.

Conclusions

The data model that we have produced so far provides us with the coverage we need to explore the reporting aspects of the project, and accommodates all the source data that we will be mapping into CERIF. It is worth noting that this is only a small part of the total CERIF model.

The key thing that the project team had to understand about CERIF is that it is about modelling the real world as accurately as possible inside the database, so all entities have to be represented in the entity tables, rather than being added as metadata elements to other entities. An example is that a journal within which an article has been published must be modelled as a Result Publication itself, and linked to the article’s entry in cfResPubl using the cfResPubl_ResPubl table. This makes CERIF a very complex model and a very rapidly growing dataset, with a high entry barrier to getting content in.

The team has spent a long time just understanding the schema and working with the data to get it into a format where it can be put into a CERIF database, and it is felt that this ought to be easier.


OAI7: Advocacy

June 24, 2011

Really interesting session on advocacy at the OAI7 workshop in Geneva yesterday with presentations by Monica Hammes, William Nixon and Heather Joseph.  A key message coming out was that if OA repositories (e.g. BURA) are really going to take off they have to be embedded in the workflows of the institution and researchers – as Cameron Neylon said this morning, we need to be moving from providing scaffolding to becoming part of the infrastructure.  I wonder if BRUCE can be part of this by making it a little bit easier to embed repositories into the institutional RIM infrastructure?



On the Road with BRUCE

June 22, 2011

The BRUCE project team are at the CERN workshop on Innovations in Scholarly Communication (OAI7) this week.

Lorna’s infamous sense of direction kicked in and so she arrived slightly late, following a quick trip around Geneva on the bus from the airport (she forgot to get off so had to go all the way round again!)


BRUCE Project Steering Group Meeting

June 20, 2011

The Project Steering Group, chaired by Professor Geoff Rodgers the Pro Vice Chancellor (Research) at Brunel, met for the second time on Thursday 16th June.  The minutes of the first meeting (Minutes_BPSG1) were agreed and, as agreed by the Group, are now being made public here.