Reporting Data Model for CERIF

The picture shows a sketch of the data model that we have built so far using CERIF. The model employs two Base Entities (cfPers and cfOrgUnit), one Result Entity (cfResPubl) and one Second Level Entity (cfFund). Together these cover all of the features that we need to store the data for the example reports.

 

Important features of the model

PhD Supervision

PhD supervisions are modelled as Person-to-Person relationships (cfPers_Pers), annotated with the CERIF 2008 Semantic term “Supervisor”. This is non-contentious, and it is easy to determine the number of people that a particular staff member is supervising.
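As a minimal sketch of how that count could be obtained, the example below uses an in-memory SQLite table. The column layout is simplified, the identifiers are invented, and the direction of the link (supervisor in cfPersId1, student in cfPersId2) is an assumption made for illustration rather than something fixed by our model.

```python
import sqlite3

# Simplified, assumed layout of the cfPers_Pers link table: the real CERIF
# table also carries a classification scheme id and start/end dates.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cfPers_Pers (cfPersId1 TEXT, cfPersId2 TEXT, cfClassId TEXT)"
)

# Assumed direction: cfPersId1 is the supervisor, cfPersId2 is the student.
# All identifiers are hypothetical.
conn.executemany(
    "INSERT INTO cfPers_Pers VALUES (?, ?, ?)",
    [
        ("staff-1", "student-1", "Supervisor"),
        ("staff-1", "student-2", "Supervisor"),
        ("staff-2", "student-3", "Supervisor"),
    ],
)


def count_supervisees(conn, staff_id):
    """Return how many distinct people the given staff member supervises."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT cfPersId2) FROM cfPers_Pers "
        "WHERE cfPersId1 = ? AND cfClassId = 'Supervisor'",
        (staff_id,),
    ).fetchone()
    return row[0]
```

Calling `count_supervisees(conn, "staff-1")` on this sample data counts the two link rows in which staff-1 appears as supervisor.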

Payroll Numbers

We have two identifiers for each person: a HESA ID and an Employee or Payroll ID. We have adopted the HESA ID as the primary identifier for a staff member, but this leaves us with no clear place to store the Payroll ID. Options considered were the Person Keywords (cfPersKeyw) and a Person-to-Person relationship. In the end we decided to create a new Person record with the Payroll ID as the cfPersId, and to assert a relationship between the HESA identified person and the Payroll identified person (in the cfPers_Pers table) with a relationship type of “Payroll” in our own “BRUCE” semantic namespace.

This “solves” the problem but introduces significant extra complexity into the model. First, from an indexing point of view, it is difficult to obtain a list of unique staff members from the cfPers table, as multiple records refer to the same actual person. For this reason we have also had to introduce a Class for each person record, so that we can identify the “Main” person record. We have therefore added entries into the cfPers_Class table under the “BRUCE” namespace for “HR” person records and “Main” person records (which are the HESA identified ones). With this in place it is once again easy for us to select a list of the “Main” person records from the cfPers table.
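The arrangement described above can be sketched with three simplified tables. This is an illustration only: the column layouts are cut down, plain text labels stand in for the class and scheme identifiers, and the helper names (`add_staff_member`, `main_person_ids`, `payroll_id_for`) and sample IDs are hypothetical.

```python
import sqlite3

# Simplified, assumed table layouts: real CERIF tables carry more columns
# (for example start and end dates on every link row).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cfPers (cfPersId TEXT PRIMARY KEY);
    CREATE TABLE cfPers_Pers (
        cfPersId1 TEXT, cfPersId2 TEXT, cfClassId TEXT, cfClassSchemeId TEXT
    );
    CREATE TABLE cfPers_Class (
        cfPersId TEXT, cfClassId TEXT, cfClassSchemeId TEXT
    );
    """
)


def add_staff_member(conn, hesa_id, payroll_id):
    """Store one real person as two cfPers records joined by a 'Payroll' link."""
    conn.executemany(
        "INSERT INTO cfPers VALUES (?)", [(hesa_id,), (payroll_id,)]
    )
    # Class each record so the HESA identified one can be picked out later.
    conn.executemany(
        "INSERT INTO cfPers_Class VALUES (?, ?, 'BRUCE')",
        [(hesa_id, "Main"), (payroll_id, "HR")],
    )
    conn.execute(
        "INSERT INTO cfPers_Pers VALUES (?, ?, 'Payroll', 'BRUCE')",
        (hesa_id, payroll_id),
    )


def main_person_ids(conn):
    """Return the deduplicated staff list: only records classed as 'Main'."""
    rows = conn.execute(
        "SELECT p.cfPersId FROM cfPers p "
        "JOIN cfPers_Class c ON c.cfPersId = p.cfPersId "
        "WHERE c.cfClassSchemeId = 'BRUCE' AND c.cfClassId = 'Main' "
        "ORDER BY p.cfPersId"
    ).fetchall()
    return [row[0] for row in rows]


def payroll_id_for(conn, hesa_id):
    """Follow the 'Payroll' link from a Main record to its Payroll ID."""
    row = conn.execute(
        "SELECT cfPersId2 FROM cfPers_Pers "
        "WHERE cfPersId1 = ? AND cfClassId = 'Payroll' "
        "AND cfClassSchemeId = 'BRUCE'",
        (hesa_id,),
    ).fetchone()
    return row[0] if row else None


# Hypothetical sample data.
add_staff_member(conn, "hesa-001", "pay-9001")
add_staff_member(conn, "hesa-002", "pay-9002")
```

With this sample data cfPers holds four records for two real people, and `main_person_ids(conn)` filters that back down to the two HESA identified records.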

Person relationship to org unit

The Org Unit and Person relationships can be realised in a number of ways, and anecdotally no two organisations do this the same way, which makes our solution quite brittle. In order to fully relate a person to their organisational units, and also make the data indexable in a way which is useful for reporting, we have constructed the person’s relationship to their org units as follows:

  1. The person is an “Employee” of the parent organisation
  2. The person has a Position in the lowest level unit (e.g. department or research group)
  3. The person is a “Member” of all of the organisation units in the hierarchy

This allows us to provide a good indexing solution, and is a reasonable approximation of the real relationships that the person has with their organisation, but it is highly inflexible and includes assumptions about the structure of the organisation. There is no obvious way around this that the project team can determine.
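The three rules in the list above can be sketched as a small function that generates the link rows for one person. This is a hedged illustration: the function name, the tuple layout and the use of a plain “Position” label for the second rule are assumptions, and the unit identifiers are made up.

```python
def person_org_unit_links(pers_id, unit_path):
    """Build (person, org unit, role) link rows for one person.

    unit_path lists the person's units from the parent organisation down to
    the lowest level unit, e.g. ["org-root", "org-school", "org-dept"].
    """
    if not unit_path:
        raise ValueError("unit_path must contain at least one org unit")

    links = [
        # Rule 1: "Employee" of the parent organisation (top of the path).
        (pers_id, unit_path[0], "Employee"),
        # Rule 2: a Position in the lowest level unit (bottom of the path).
        # The literal label "Position" is a placeholder for illustration.
        (pers_id, unit_path[-1], "Position"),
    ]
    # Rule 3: "Member" of every unit in the hierarchy.
    links.extend((pers_id, unit, "Member") for unit in unit_path)
    return links


# Hypothetical identifiers for a three-level hierarchy.
example_links = person_org_unit_links(
    "hesa-001", ["org-root", "org-school", "org-dept"]
)
```

The sketch makes the structural assumption visible: every person is expected to sit on a single path from the parent organisation down to one lowest level unit, which is the inflexibility noted above.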

Org unit to org unit relations

We have modelled the org unit to org unit relations as a strict parent/child relationship. The CERIF semantic layer includes a “Part” term, but this term does not clearly indicate direction, so we have adopted the convention that “Part” means ‘Has Part’. So:

Brunel University ‘Has Part’ Natural Sciences
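To illustrate the ‘Has Part’ convention, the sketch below stores each link as a (parent, child, term) tuple and walks from a unit up to the top of the hierarchy. The tuple layout and the function name are assumptions, and “Example Research Group” is an invented unit added only to give the hierarchy a third level.

```python
# Each tuple reads as: parent 'Has Part' child.
ORG_UNIT_LINKS = [
    ("Brunel University", "Natural Sciences", "Part"),
    ("Natural Sciences", "Example Research Group", "Part"),  # hypothetical unit
]


def path_from_root(unit, links):
    """Return the chain of units from the top-level organisation down to unit.

    Relies on the strict parent/child convention: each unit has at most one
    parent, so the child-to-parent lookup is unambiguous.
    """
    parent_of = {child: parent for parent, child, term in links if term == "Part"}
    path = [unit]
    seen = {unit}
    while path[-1] in parent_of:
        parent = parent_of[path[-1]]
        if parent in seen:
            raise ValueError("cycle detected in org unit hierarchy")
        seen.add(parent)
        path.append(parent)
    path.reverse()
    return path
```

For example, `path_from_root("Natural Sciences", ORG_UNIT_LINKS)` follows the single ‘Has Part’ link upwards and returns the university followed by Natural Sciences.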

Person class (for deduplication)

As discussed in the section on Payroll numbers, we have had to annotate each person record with a Class which identifies whether this is the primary record for that person, so that they can be deduplicated during indexing.

The result of this is that instead of requiring 2 database table rows per person we now need 6, which is a significant increase in storage space and complexity, and suggests a flaw in the CERIF model.

Conclusions

The data model that we have produced so far provides us with the coverage we need to explore the reporting aspects of the project, and accommodates all the source data that we will be mapping into CERIF. It is worth noting that this is only a small part of the total CERIF model.

The key thing that the project team had to understand about CERIF is that it is about modelling the real world as accurately as possible inside the database, so all entities have to be represented in the entity tables, rather than being added as metadata elements to other entities. An example is that a journal within which an article has been published must be modelled as a Result Publication itself, and linked to the article’s entry in cfResPubl using the cfResPubl_ResPubl table. This makes CERIF a very complex model and a very rapidly growing dataset, with a high entry barrier to getting content in.

The team has spent a long time just understanding the schema and working with the data to get it into a format where it can be put into a CERIF database, and it is felt that this ought to be easier.
