JISC RIM Projects Final Event

September 21, 2011

Really useful dissemination event in Manchester yesterday – thanks to Mahendra and the rest of the team at UKOLN for all of the organisation.

If you weren’t able to come along to the event or if you went along to the IRIOS workshop instead of to the BRUCE session (we all make mistakes!) have a look at the presentation from the workshop.  Let us know if you have any comments or questions – feedback is always welcome.

 P.S. Only joking Simon!


BRUCE to CERIF Data Mapping

September 19, 2011

Data Sources

Snapshots of data were taken from BRAD, HR, and HESA and dumped into csv & spread sheet format.  I have read access to SITS data at table level (this made it extremely easy to get all the data required for the CERIF schema (SITS was used to extract data on Research students to enable reporting on supervision)

We used a sample report to determine which tables to use in the CERIF schema (time constraints determined that we wouldn’t be able to do a full mapping) see below the diagram of the tables used in our project.

I then modelled this data by creating a spread sheet model of the CERIF schema.  This was the quickest way to get the data into the CERIF Schema format due to time restrictions on the project.  If this were to be put into production then automatic data extraction from the data sources would be explored.

Lots of issues with Publications when translating into CERIF Schema

          Different field sizes
          Date formats needed tweaking
          Null values from the sample data needed to have dummy defaults set

The data was then imported into a MySql database (installed on the BRUCE server)

SolrEyes was installed from BitBucket (instructions detailed here)

Download and install Apache Solr 3.1. and initiate on the server

Initiated SolrEyes on the Server

It took about 10-15 minutes to install the software and get it up and running on the server.

BRUCE Data Model – modelled from CERIF schema – project time limits restricted how much of CERIF we could sensibly prototype

 

 

CERIF model

 

The Interface (displaying live data from Brunel) – facets on left hand side – used to refine the data displayed – sorting & paging also implemented

SolrEyes

The SolrEyes interface

Solr Indexer – middle layer pulls data from MySQL – this is where the work is done – entities defined and the data modelled etc….

Report generated from BRUCE data (modelled on the CERIF schema)  using Datavision (Open Source Report Generator) – can be exported in csv, xml, excel, pdf, word formats.

BRUCE CERIF Data Model modelled in Excel as spread sheets (a sheet per table)

Spreadsheet One – Staff

cfPers

cfPersName

cf_Pers_Pers

cfPers_Class

cfPers_OrgUnit

cfPers_Fund

cfPers_ResPubl

Spreadsheet Two – Research Students

cfPers

cfPersName

cf_Pers_Pers

cfPers_Class

cfPers_OrgUnit

Spreadsheet Three – Organisational Levels

cfOrgUnit

cfOrgUnitName

cfOrgUnit_OrgUnit

Spreadsheet Four – Funding

cfFund

cfFundDesc

Spreadsheet Five – Publications

cfResPubl

cfResPublTitle

cfResPubl_Class

How I found working with CERIF

CERIF is predominantly relational (but not pure – the semantics took a while to understand) the link tables & class tables was a bit Object-Orientated – and for me it didn’t quite make sense.  But it incorporated flexibility into the schema and there is the scope to use in many ways.  Because our time was restricted we didn’t spend too much time analysing how to use it.  Richard & I came up with the BRUCE model in a couple of hours and went with it.  Once the entities were defined in Solr – we found that we had to populate some of the link/class tables with unnecessary duplication of data just to get the interface to work – although this goes against my understanding of how relational data works the indexer was really fast – especially when we used the ‘test data’ – the Brunel data snapshot is quite small so wouldn’t determine speed efficiency.

If this prototype were to be developed into production, we would need to analyse the data structures & mappings in more depth.  We have all the data on-line and automation from the various systems could be developed so the data amalgamation would be seamless.  Time determined that reporting was just touched upon – I used an open-source tool ‘DataVision’ to connect directly to the database and a very simple report was produced. 

Reporting would be a separate component using data extracted from the queries in the interface – pushed into a temp table – would need more development and thought to achieve this.


Research Information Management – the Sample Reports from the BRUCE project

September 15, 2011

The work to create a CERIF mapping that could then be indexed via Solr and presented via SolrEyes was based on two Sample Reports, a Summary Staff Report and a Simple Publications Report, created by Dr Rosa Scoble at Brunel.

The idea behind the sample reports was to identify key reporting requirements that would be useful to institutions across the sector. In doing so, we hoped to encourage other institutions to have a play with the tools produced by the BRUCE project in order to generate copies of the reports using their own data.

Feedback on draft reports was sought via the Project Advisory Group. That feedback was then incorporated into the reports that are now being made available here.


JISC RIM Projects (including BRUCE) dissemination event

September 13, 2011

You are invited to join us at a major conference showcasing the work of the second round of RIM Projects (BRUCE, CERIFY, IRIOS and MICE) – this is a great opportunity to come along, meet the team and find out more about what the BRUCE project has actually achieved.

The workshop takes place at the Manchester Conference Centre on Tuesday 20th September, 2011 and is absolutely free to attend (lunch is also provided).

Further information and the workshop programme are available on the UKOLN website.

Hope to see you all in Manchester!


From the CERIF Model to the Solr Index

September 11, 2011

Part of the challenge of the BRUCE project is to take a highly relational model like CERIF and convert it into something which can be adequately indexed for searching and faceting.

Apache Solr, like many traditional search engines, works on the principle of key-value pairs. A key-value pair is simply an assertion that some value (on the right) is associated with some key (on the left). Examples of key-value pairs are:

name : Richard
project : bruce
organisation : Brunel University

Typically, the keys on the left come from a set of known terms, while the values on the right can vary arbitrarily. Therefore, when you search for documents belonging to “Richard”, you are asking which documents have the value “Richard” associated with the key “name”.

In addition, keys are often repeatable (although depending on the search index schema this might not be always the case), so you could have multiple “name” keys, with different values.

Approach

The objective, then, is for us to convert the graph-like structure of CERIF (that is, it has entities and relationships which do not follow a hierarchy) into the flat key-value structure of a search index. It should be clear from the outset, therefore, that data-loss will necessarily result from this conversion; it is not possible to fully and adequately represent a graph as a set of key-value pairs.

The project aimed, instead, to extract the key information from the CERIF schema from the point of view of one of the Base Entities.

There are 3 Base Entities in CERIF: Publications, People and Organisational Units. Since BRUCE is concerned with reporting principally on staff, we selected People as the Base Entity from which we would view the CERIF graph. By doing this we reduce the complexity of the challenge, since a graph viewed from the point of view of one of its nodes behaves like a hierarchy at least in the immediate vicinity (see the real analysis of this, below, for a clear example).

Our challenge is then simplified to representing a tree structure as a set of key-value pairs.

The second trick we need to use is to decide what kind of information we want to actually report on, and narrow our indexing to fields in the CERIF schema which are relevant to those requirements. This allows us to index values which are actually closely related to eachother as totally separate key-value pairs: as long as the index provides enough information for searching and faceting, it won’t matter that information about their relationship to eachother is lost.

For example: suppose we want to index the publications associated with a person, and we want to be able to list those publications as well as providing an integer count of how many publications were published by that person in some time frame. Initially this might look quite difficult, as a “publication” is a collection of related pieces of information, such as the title, the other authors, the date of publication, and other administrative terms such as page counts and so on. To place this in a set of key-value pairs would require us to do something like:

title: My Publication
publication_date: 01-09-2008
pages: 10

This is fine if there is only one publication by the person, but if they have multiple publications it would not be possible to tell which publication_date was associated with which title.

Instead, we have to remember that this is an index and not a data store. If we wish to list publication titles and count publications within date ranges, then it is just necessary for us to index the titles and the dates separately and ensure that they are used separately within the index. So we may have:

title: My First Paper
title: My Second Paper
publication_date: 01-09-2008
publication_date: 23-05-2009

This configuration loses data by not maintaining the links between publication_date and title, but is completely adequate for the indexing and faceting requirements.

To meet our original requirement stated above we can just count the number of publication_date keys which contain a date which lies within our desired time frame and return this integer count, while simultaneously listing the titles of the publication. The fact that these two pieces of information are not related in the index makes no difference in producing the desired outcome.

CERIF schema

The CERIF schema that we are working with is a limited sub-set of the project, and has been presented in a previous post. The set of tables which describe the graph contain the following fields that we are interested in are:

CERIF Table Columns
cfPers cfPersId, cfGender
cfPers_Class cfPersId, cfClassSchemeId, cfClassId
cfPersName cfPersId, cfFirstNames, cfOtherNames, cfFamilyNames
cfPers_ResPubl cfPersId, cfResPublId, cfClassSchemeId, cfClassId
cfPers_OrgUnit cfPersId, cfOrgUnitId, cfClassSchemeId, cfClassId, cfFraction, cfEndDate
cfPers_Pers cfPersId1, cfPersId2, cfClassSchemeId, cfClassId
cfPers_Fund cfPersId, cfFundId
cfFund cfFundId, cfCurrCode, cfAmount
cfOrgUnit cfOrgUnitId, cfHeadcount
cfOrgUnitName cfOrgUnitId, cfName
cfOrgUnit_OrgUnit cfOrgUnitId1, cfOrgUnitId2, cfClassSchemeId, cfClassId
cfResPubl cfResPublId, cfResPublDate
cfResPublTitle cfResPublId, cfTitle
cfResPubl_Class cfResPublId, cfClassSchemeId, cfClassId

Next, imagine that we pick up the graph by cfPers using cfPersId as the identifier which relates the person to all the other entities, and we can see that a rough hierarchy emerges:

cfPersId
    cfGender
    cfClassSchemeId
    cfClassId
    cfFirstNames
    cfOtherNames
    cfFamilyNames
    cfResPublId
        cfClassSchemeId
        cfClassId
        cfResPublDate
        cfTitle
    cfOrgUnitId
        cfClassSchemeId
        cfClassId
        cfFraction
        cfEndDate
        cfHeadcount
        cfName
        cfOrgUnitId2**
    cfFundId
        cfCurrCode
        cfAmount

With the exception of the Org Unit data (marked with **), the result is a straightforward enough hierarchy. We can avoid considering the graph that emerges under the organisation unit data by ensuring that the cfPers_OrgUnit table contains all the relevant relationships that we want to consider during indexing, so that we don’t have to attempt to index the org unit graph when preparing an index from the perspective of the person.

Solr index

The Solr index allows us to specify a field name (the key, in the key-value pair), and whether that field is repeatable or not. Each set of key-value pairs is grouped together into a “document”, and that document will represent a single person in the CERIF dataset, along with all the relevant data associated with them. When we have fully built our index, there will be one document per person.

The Solr index which then meets our requirements is constructed from the above CERIF data as follows:

Field Single/Multi Value Notes
entity single “cfPers” Indicates that this is a person oriented document. This allows us to extend the index to view other kinds of entities as well, all represented within one schema.
id unique cfPersId A unique id representing the entity. When other entities are included in the index, this could also be their ids (e.g. cfResPublId)
gender single cfGender
name single a combination of cfFirstNames, cfOtherNames and cfFamilyNames This is the first person name that is encountered in the database, and is used for sorting and presented as the authors actual name. There is another field for name variants
name_variants multi a combination of cfFirstNames, cfOtherNames and cfFamilyNames This allows us to have multiple names for the author for the purposes of searching, although they will not be used for sorting or presented to the end user
contract_end single cfOrgUnit/cfEndDate Taken from the cfEndDate field in the cfPers_OrgUnit table which is tagged by cfClassId as Employee
funding_code multi cfFundId
org_unit_name multi cfOrgUnit/cfName
org_unit_id multi cfOrgUnit/cfOrgUnitId
primary_department single cfOrgUnit/cfName This differs from org_unit_name in that it is the department that the person should be considered most closely affiliated with. This would be, for example, their department or research group. It is used specifically for display and sorting, which is why it may only be single valued.
primary_department_id single cfOrgUnit/cfOrgUnitId The id for the department contained in primary_department
primary_position single cfOrgUnit/cfClassId The position that the person holds in their primary department (e.g. “Lecturer”)
fte single cfOrgUnit/cfFraction The fraction of the time that the person works for their organisational unit which is tagged with cfClassId of Employee.
supervising multi cfPers_Pers/cfPersId2 This lists the ids of the people that the person is supervising. These can be identified as the cfPers_Pers relationship has a cfClassId of Supervising
publication_date multi cfResPubl/cfResPublDate This lists the dates upon which the person published any result publications. This is a catch-all for all types of publication. Individual publication types are broken down in the following index fields
publication_id multi cfResPubl/cfResPublId This lists the ids of all the publications of any kind which the person published.
journal_date multi cfResPubl/cfResPublDate This is the list of dates of publication of all publications which have a cfClassId of “Journal Article”.
journal_id multi cfResPubl/cfResPublDate This is the list of ids publications which have a cfClassId of “Journal Article”.
book_date multi cfResPubl/cfResPublDate This is the list of dates of publication of all publications which have a cfClassId of “Book”.
book_id multi cfResPubl/cfResPublDate This is the list of ids publications which have a cfClassId of “Book”.
chapter_date multi cfResPubl/cfResPublDate This is the list of dates of publication of all publications which have a cfClassId of “Inbook”.
chapter_id multi cfResPubl/cfResPublDate This is the list of ids publications which have a cfClassId of “Inbook”.
conference_date multi cfResPubl/cfResPublDate This is the list of dates of publication of all publications which have a cfClassId of “Conference Proceedings Article”.
conference_id multi cfResPubl/cfResPublDate This is the list of ids publications which have a cfClassId of “Conference Proceedings Article”.

These terms are encoded in a formal schema for Solr which can be found here.

Data Import

Apache Solr provides what it calls “Data Import Handlers” which allow you to import data from different kinds of sources into the index. Once we have configured the index as per the previous section we can construct a Data Import Handler which will import from the CERIF MySQL database.

This is effectively a set of SQL queries which are used to populate the index fields in the ways described in the previous section. A representitive example of the kinds of query include:

SELECT cfPers.cfPersId, cfPers.cfGender, 'cfPers' AS entity
FROM cfPers 
    INNER JOIN cfPers_Class 
        ON cfPers.cfPersId = cfPers_Class.cfPersId 
WHERE cfPers_Class.cfClassSchemeId = 'BRUCE' 
    AND cfPers_Class.cfClassId = 'Main';

This query is at the root of the Data Import Handler, and selects our cfPersId which will be the central identifier that we will use to retrieve all other information, as well as any information which we can quickly and easily obtain by performing a JOIN operation across the cfPers* tables.

SELECT concat(cfFamilyNames, ', ', cfFirstNames, ' ', cfOtherNames) AS cfName 
FROM cfPersName 
WHERE cfPersId = '${person.cfPersId}'
LIMIT 1;

This query selects the first person’s name and performs the appropriate concatenation to turn the three name parts cfFamilyNames, cfFirstNames and cfOtherNames into a usable single string.

SELECT cfEndDate 
FROM cfPers_OrgUnit
WHERE cfPersId = '${person.cfPersId}'
    AND cfClassId = 'Employee' 
    AND cfClassSchemeId = 'cfCERIFSemantics_2008-1.2';

This query selects the person’s contract end date by looking for the organisational unit to which the person’s relationship (cfPers_OrgUnit) is annotated with the cfClassId ‘Employee’.

SELECT cfResPubl.cfResPublId, cfResPubl.cfResPublDate 
FROM cfResPubl 
    INNER JOIN cfPers_ResPubl 
        ON cfPers_ResPubl.cfResPublId = cfResPubl.cfResPublId
    INNER JOIN cfResPubl_Class
        ON cfResPubl.cfResPublId = cfResPubl_Class.cfResPublId
WHERE cfPers_ResPubl.cfPersId = '${person.cfPersId}'
    AND cfResPubl_Class.cfClassSchemeId = 'cfCERIFSemantics_2008-1.2'
    AND cfResPubl_Class.cfClassId = 'Journal Article';

This query selects the ids and dates of publications by the selected person which have a class of ‘Journal Article’.

Here we will not go into this at any further length; instead the code which provides the Data Import functionality can be obtained here.

It is probably worth noting, though, that these queries are quite long and involve JOINing across multiple database tables, which makes reporting on the data hard work if done directly from source. The BRUCE approach means that this is all compressed into one single Data Import Handler, and leaves all the exciting stuff to the much simpler search engine query.

Use of the index

Once we have produced the index, we feed it into SolrEyes (discussed in more detail here) which is configured to produce the following functionality based on the indexed values:

Field Usage
entity facet
id unused, required for index only
gender facet, result display
name sort, result display
name_variants currently unused
contract_end facet, sort, result display
funding_code result display
org_unit_name currently unused
org_unit_id currently unused
primary_department sort, result display
primary_department_id currently unused
primary_position facet, result display
fte facet, sort, result display
supervising result display (a function presents the number of people being supervised by the person)
publication_date facet, result display (a function counts the number of publications in between the date ranges specified by the facet)
publication_id currently unused
journal_date result display (a function counts the number of journal articles in between the date ranges specified by the publication_date facet)
journal_id currently unused
book_date result display (a function counts the number of journal articles in between the date ranges specified by the publication_date facet)
book_id currently unused
chapter_date result display (a function counts the number of journal articles in between the date ranges specified by the publication_date facet)
chapter_id currently unused
conference_date result display (a function counts the number of journal articles in between the date ranges specified by the publication_date facet)
conference_id currently unused

Key:

facet
used to create the faceted browse navigation
result display
used when presenting a “document” to the user. Sometimes the value is a function of the actual indexed content.
sort
used for sorting the result set

Note that a more thorough treatment of the Solr index would split the fields up into multiple indexed fields which are customised for their purposes, but that we have not done this in the prototype. For example, fields used for sorting will go through normalising functions to ensure consistent sorting across all values, while displayable values will be stored unprocessed.

We can now produce a user interface like that shown in the screen shot below.

The approach used here could be used to extend to more features of the person Base Entity, but also other Base Entities (and, indeed, any entity in the CERIF model) could be placed at the centre of the report, and its resulting hierarchy of properties mapped into a set of key-value pairs, and all could co-exist comfortably in the same search index.


CERIF Test Data

August 17, 2011

Due to the privacy and data protection status of the real research information at Brunel – which includes data such as pay scales and so forth – it is not possible for the project to demonstrate its tools to people who are not Brunel employees (at the very least). Furthermore, that data cannot even be taken off-site or placed onto computers which are not under direct control of the university. Combine this with the need within the project for two parallel development tracks: one mapping source data (such as HR and publications) into CERIF and the other indexing and reporting on that CERIF data, and there is a compelling need for a test dataset.

A test CERIF dataset could be used in any demonstrations of the project outputs, and could be put in-place for the CERIF indexing side of the project so that it is not critically dependent on the outputs of the data mapping side.

Initially we had hoped that such a dataset already existed, but there was nothing available on the euroCRIS website (the CERIF guardian organisation) and extensive searching turned up nothing of value. There are other JISC projects which may ultimately have yielded some useful data (such as CERIFy), but they are also running in parallel to BRUCE.

The project therefore developed a piece of software which can be used to generate test data, and has made it available open source here (in the cerifdata folder at that link).

The approach to developing the test data and the software were as follows:

1. Identify a seed dataset

We were lucky that at exactly the time that we were seeking for a seed dataset, the Open Bibliography project – also JISC funded – had succeeded in liberating the MedLine dataset consisting of around 20 million publication records.

This was an ideal source of the most difficult data to artificially generate: author names and publication titles. By using this dataset as our seed we would be able to generate artificial research data based on open access bibliographic data, which would give us the freedom necessary to do as we needed with the dataset at the same time as making it look suitably realistic.

2. Define the model we are populating

Although actually done in several iterations, the model we worked towards was as presented in a previous post.

This meant generating data about Staff, Organisational Units and Publications. We have only written code to generate the data required for our example model, but we have endeavoured to write the software itself in a way which allows it to be extended throughout the project and into the future.

3. Develop a flexible production mechanism

The test data is generated by the following process:

First, source data is obtained from the MedLine data file. This source data is then passed through a set of CERIF data “aspect generators” which produce CERIF entites and relationships (such as staff records and their relationships to organisational units and publications). These are then written to CSVs which reflect the database table structure in the CERIF SQL schema. The CSVs are finally converted into a single large SQL file suitable for import into a database.

The architecture of the software is designed to be flexible so that new aspects can easily be added and existing aspects can easily be modified.

4. Produce the test data

We simply provide one of the MedLine source data files to the program and it will generate our test data in SQL format for us:

python data.py /path/to/medline.xml

Which produces the CSVs:

$ ls *.csv
cfFund.csv
cfOrgUnit_OrgUnit.csv
cfPers_Class.csv
cfPers_Pers.csv
cfResPublTitle.csv
cfOrgUnit.csv
cfPers.csv
cfPers_Fund.csv
cfPers_ResPubl.csv
cfResPubl_Class.csv
cfOrgUnitName.csv
cfPersName.csv
cfPers_OrgUnit.csv
cfResPubl.csv

For example, the following data are all related through a single person (cfPers):

cfPers.csv
f0b2517b-4b65-4fa5-b562-ff931cd213f2, F

cfPersName.csv
f0b2517b-4b65-4fa5-b562-ff931cd213f2, Teresa, J, Krassa

cfPers_Fund.csv
f0b2517b-4b65-4fa5-b562-ff931cd213f2, MM122

cfPers_OrgUnit.csv
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 1, cfCERIFSemantics_2008-1.2, Employee, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 3, cfCERIFSemantics_2008-1.2, PhD, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 1, cfCERIFSemantics_2008-1.2, Member, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 2, cfCERIFSemantics_2008-1.2, Member, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 3, cfCERIFSemantics_2008-1.2, Member, 1.0, 2019-10-05

cfPers_ResPubl.csv
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 1936bdc4-aadd-4028-bb9e-b9eec2561c00, cfCERIFSemantics_2008-1.2, Author

This shows us a person with ID f0b2517b-4b65-4fa5-b562-ff931cd213f2 who is Female (from cfPers.csv) who has the name Teresa,J,Krassa (from cfPersName.csv) who has funding from funding code MM122 (from cfPers_Fund.csv), who is an Employee of Organisational Unit 1, is a PhD Student in Organisational Unit 3 and is a Member of Organisational Units 1, 2 and 3 (from cfPers_OrgUnit.csv). It also shows that this person is the Author of a Result Publication with ID 1936bdc4-aadd-4028-bb9e-b9eec2561c00 (from cfPers_ResPubl.csv).

This repeats with variations on the data across the entire seed dataset, giving us a rich spread of people, publications, organisational units and relationships between them upon which to carry out our development, testing and demonstrations.

These CSVs are then converted into a single SQL file, which can then be imported into our MySQL database and used.

If you wish to use the software yourself, you can download it from the version control but unfortunately at time of writing the MedLine data in the format required by the program is not publicly available. It is available as n-quads on CKAN and the project is discussing with Open Bibliography the possibilities of publishing the data in its original format also. In the mean time, please feel free to contact us and we will be able to help you obtain the data in the relevant format.


Reporting Data Model for CERIF

July 14, 2011

The picture shows a sketch of our data model that we have so far built using CERIF. The model employs two Base Entities: cfPers and cfOrgUnit, one Result Entity: cfResPubl and one Second Level Entity: cfFund. These cover all of the features that we need to store the data for the example reports.

 

Important features of the model

PhD Supervision

PhD supervisions are modelled as Person-to-Person relationships (cfPers_Pers), annotated with the CERIF 2008 Semantic term “Supervisor”. This is non-contentious, and it is easy to determine the number of people that a particular staff member is supervising.

Payroll Numbers

We have 2 identifiers for each person: a HESA ID and an Employee or Payroll ID. We have adopted the HESA ID as the primary identifier for a staff member, but this leaves us with no clear place to store the Payroll ID. Options considered were the Person Keywords (cfPersKeyw) and a Person-to-Person relationship; in the end we decided to model this by creating a new Person record with the Payroll ID as the cfPersId, and to assert a relationship between the HESA identified person and the Payroll identified person (in the cfPers_Pers table) with a relationship type of “Payroll” in our own “BRUCE” semantic namespace.

This “solves” the problem but introduces significant extra complexity into the model. First, from an indexing point of view, it is difficult to obtain a list of unique staff members from the cfPers table, as multiple records refer to the same actual person. For this reason we have also had to introduce a Class for each person record, so that we can identify the “Main” person record.  We have therefore added entries into the cfPers_Class table under the “BRUCE” namespace for “HR” person records and “Main” person records (which are the HESA identified ones). With this in place it is easy again for us to select from the cfPers table a list of the “Main” person records.

Person relationship to org unit

The Org Unit and Person relationships can be realised in a number of way, and anecdotally not two organisations do this the same way, which makes our solution quite brittle. In order to fully relate a person to their organisational units and also make the data indexable in a way which is useful for reporting we have constructed the person’s relationship to their org units as follows:

  1. The person is an “Employee” of the parent organisation
  2. The person has a Position in the lowest level unit (e.g. department or research group)
  3. The person is a “Member” of all of the organisation units in the heirarchy

This allows us to provide a good indexing solution, and is a reasonable approximation of the real relationships that the person has with its organisation, but it is highly inflexible and includes assumptions about the structure of the organisation. There is no obvious way around this that the project team can determine.

Org unit to org unit relations

We have modelled the org unit to org unit relations as a strict parent/child relationship. The CERIF semantic layer includes a “Part” term, but this term does not clearly indicate direction, so we have adopted the convention that “Part” means ‘Has Part’. So:

Brunel University ‘Has Part’ Natural Sciences

Person class (for deduplication)

As discussed in the section on Payroll numbers, we have had to annotate each person record with a Class which identifies whether this is the primary record for that person, so that they can be deduplicated during indexing.

The result of this is that instead of requiring 2 database table rows per person we now need 6, which is is a significant increase in storage space and complexity, and suggests a flaw in the CERIF model.

Conclusions

The data model that we have produced so far provides us with the coverage we need to explore the reporting aspects of the project, and accommodates all the source data that we will be mapping into CERIF. It is worth noting that this is only a small part of the total CERIF model.

The key thing that the project team had to understand about CERIF is that it is about modelling the real world as accurately as possible inside the database, so all entities have to be represented in the entity tables, rather than being added as metadata elements to other entities. An example is that a journal within which an article has been published must be modelled as a Result Publication itself, and linked to the article’s entry in cfResPubl using the cfResPubl_ResPubl table. This makes CERIF a very complex model and a very rapidly growing dataset, with a high entry barrier to getting content in.

The team has spent a long time just understanding the schema and working with the data to get it into a format where it can be put into a CERIF database, and it is felt that this ought to be easier.