Due to the privacy and data protection status of the real research information at Brunel – which includes data such as pay scales and so forth – it is not possible for the project to demonstrate its tools to people who are not Brunel employees (at the very least). Furthermore, that data cannot even be taken off-site or placed onto computers which are not under direct control of the university. Combine this with the need within the project for two parallel development tracks: one mapping source data (such as HR and publications) into CERIF and the other indexing and reporting on that CERIF data, and there is a compelling need for a test dataset.
A test CERIF dataset could be used in any demonstration of the project outputs, and could be put in place for the CERIF indexing side of the project so that it is not critically dependent on the outputs of the data mapping side.
Initially we had hoped that such a dataset already existed, but there was nothing available on the euroCRIS website (the CERIF guardian organisation) and extensive searching turned up nothing of value. There are other JISC projects which may ultimately yield some useful data (such as CERIFy), but they are running in parallel to BRUCE.
The project therefore developed a piece of software which can be used to generate test data, and has made it available open source here (in the cerifdata folder at that link).
The approach to developing the test data and the software was as follows:
1. Identify a seed dataset
We were lucky that at exactly the time that we were seeking a seed dataset, the Open Bibliography project – also JISC funded – had succeeded in liberating the MedLine dataset consisting of around 20 million publication records.
This was an ideal source of the most difficult data to artificially generate: author names and publication titles. By using this dataset as our seed we would be able to generate artificial research data based on open access bibliographic data, which would give us the freedom necessary to do as we needed with the dataset at the same time as making it look suitably realistic.
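As a rough illustration of what using the seed data involves, the sketch below pulls author names and titles out of a small hand-written sample. The element names (MedlineCitation, ArticleTitle, AuthorList and so on) are an assumption based on the usual MedLine citation XML layout, and the function name extract_seed_records is hypothetical; this is not the project's actual parsing code.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written sample in the style of MedLine citation XML (not real data).
SAMPLE = """<MedlineCitationSet>
  <MedlineCitation>
    <Article>
      <ArticleTitle>An example article title.</ArticleTitle>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>Jane</ForeName><Initials>J</Initials></Author>
        <Author><LastName>Jones</LastName><ForeName>Alan</ForeName><Initials>A</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>"""


def extract_seed_records(xml_text):
    """Return a list of (title, [(forename, initials, lastname), ...]) tuples."""
    root = ET.fromstring(xml_text)
    records = []
    for citation in root.iter("MedlineCitation"):
        title = citation.findtext("Article/ArticleTitle", default="")
        authors = []
        for author in citation.iterfind("Article/AuthorList/Author"):
            authors.append((
                author.findtext("ForeName", default=""),
                author.findtext("Initials", default=""),
                author.findtext("LastName", default=""),
            ))
        records.append((title, authors))
    return records


seed_records = extract_seed_records(SAMPLE)
for title, authors in seed_records:
    print(title, authors)
```

For a full MedLine file, a streaming parser such as ET.iterparse would probably be a better fit than loading the whole document into memory.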
2. Define the model we are populating
Although actually done in several iterations, the model we worked towards was as presented in a previous post.
This meant generating data about Staff, Organisational Units and Publications. We have only written code to generate the data required for our example model, but we have endeavoured to write the software itself in a way which allows it to be extended throughout the project and into the future.
3. Develop a flexible production mechanism
The test data is generated by the following process:
First, source data is obtained from the MedLine data file. This source data is then passed through a set of CERIF data “aspect generators” which produce CERIF entities and relationships (such as staff records and their relationships to organisational units and publications). These are then written to CSVs which reflect the database table structure in the CERIF SQL schema. The CSVs are finally converted into a single large SQL file suitable for import into a database.
The architecture of the software is designed to be flexible so that new aspects can easily be added and existing aspects can easily be modified.
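One way to picture the "aspect generator" idea is as a list of small pluggable objects, each turning a source record into rows for one CERIF table. The sketch below is only an illustration under that assumption: the class names, the generate method, the shared context dictionary and run_pipeline are all hypothetical and are not taken from the project's code.

```python
import uuid


class PersonAspect:
    """Hypothetical aspect: one cfPersName-style row per author."""
    table = "cfPersName"

    def generate(self, record, context):
        rows = []
        for forename, initials, lastname in record["authors"]:
            person_id = str(uuid.uuid4())
            # Share the generated IDs so later aspects can link to them.
            context.setdefault("person_ids", []).append(person_id)
            rows.append((person_id, forename, initials, lastname))
        return rows


class AuthorshipAspect:
    """Hypothetical aspect: links every person in the record to one publication."""
    table = "cfPers_ResPubl"

    def generate(self, record, context):
        publication_id = str(uuid.uuid4())
        return [(person_id, publication_id, "Author")
                for person_id in context.get("person_ids", [])]


def run_pipeline(records, aspects):
    """Pass each source record through every aspect, collecting rows per table."""
    tables = {}
    for record in records:
        context = {}  # fresh per record, shared between aspects
        for aspect in aspects:
            tables.setdefault(aspect.table, []).extend(
                aspect.generate(record, context))
    return tables


records = [{"title": "An example title",
            "authors": [("Jane", "J", "Smith"), ("Alan", "A", "Jones")]}]
tables = run_pipeline(records, [PersonAspect(), AuthorshipAspect()])
```

Under this design, adding a new aspect means writing one more class with a table name and a generate method and appending it to the list; the pipeline itself does not change.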
4. Produce the test data
We simply provide one of the MedLine source data files to the program and it will generate our test data in SQL format for us:
python data.py /path/to/medline.xml
This produces the CSVs:
$ ls *.csv
For example, the following data are all related through a single person (cfPers):
f0b2517b-4b65-4fa5-b562-ff931cd213f2, Teresa, J, Krassa
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 1, cfCERIFSemantics_2008-1.2, Employee, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 3, cfCERIFSemantics_2008-1.2, PhD, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 1, cfCERIFSemantics_2008-1.2, Member, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 2, cfCERIFSemantics_2008-1.2, Member, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 3, cfCERIFSemantics_2008-1.2, Member, 1.0, 2019-10-05
f0b2517b-4b65-4fa5-b562-ff931cd213f2, 1936bdc4-aadd-4028-bb9e-b9eec2561c00, cfCERIFSemantics_2008-1.2, Author
This shows us a person with ID f0b2517b-4b65-4fa5-b562-ff931cd213f2 who is Female (from cfPers.csv), who has the name Teresa,J,Krassa (from cfPersName.csv), who has funding from funding code MM122 (from cfPers_Fund.csv), who is an Employee of Organisational Unit 1, is a PhD Student in Organisational Unit 3, and is a Member of Organisational Units 1, 2 and 3 (from cfPers_OrgUnit.csv). It also shows that this person is the Author of a Result Publication with ID 1936bdc4-aadd-4028-bb9e-b9eec2561c00 (from cfPers_ResPubl.csv).
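To show how rows in different CSVs tie together through the person ID, here is a minimal sketch that groups rows by their first column. The sample rows are copied from the example above (a subset, for brevity); which file each row belongs to follows the description in the text, and the function name rows_by_person is hypothetical.

```python
import csv
import io
from collections import defaultdict

PID = "f0b2517b-4b65-4fa5-b562-ff931cd213f2"

# In-memory stand-ins for the generated files, using rows from the example.
CSV_FILES = {
    "cfPersName.csv": f"{PID}, Teresa, J, Krassa\n",
    "cfPers_OrgUnit.csv": (
        f"{PID}, 1, cfCERIFSemantics_2008-1.2, Employee, 1.0, 2019-10-05\n"
        f"{PID}, 3, cfCERIFSemantics_2008-1.2, PhD, 1.0, 2019-10-05\n"
    ),
    "cfPers_ResPubl.csv": (
        f"{PID}, 1936bdc4-aadd-4028-bb9e-b9eec2561c00, "
        "cfCERIFSemantics_2008-1.2, Author\n"
    ),
}


def rows_by_person(files):
    """Group every row from every file under the person ID in its first column."""
    grouped = defaultdict(list)
    for filename, text in files.items():
        # skipinitialspace drops the space that follows each comma
        for row in csv.reader(io.StringIO(text), skipinitialspace=True):
            if row:
                grouped[row[0]].append((filename, row[1:]))
    return dict(grouped)


grouped = rows_by_person(CSV_FILES)
for filename, fields in grouped[PID]:
    print(filename, fields)
```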
This repeats with variations on the data across the entire seed dataset, giving us a rich spread of people, publications, organisational units and relationships between them upon which to carry out our development, testing and demonstrations.
These CSVs are then converted into a single SQL file, which can then be imported into our MySQL database and used.
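The CSV-to-SQL step could look something like the sketch below, which turns each CSV row into an INSERT statement. It assumes that each CSV is named after its table and that the column order matches the schema, so no column list is emitted; the function names are hypothetical, and the quoting is deliberately simple (it doubles single quotes only, whereas a real MySQL import would also need to consider backslashes, NULLs and numeric types).

```python
import csv
import io


def sql_literal(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def csv_to_inserts(table, csv_text):
    """Return one INSERT statement per non-empty CSV row."""
    statements = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue
        values = ", ".join(sql_literal(value) for value in row)
        statements.append(f"INSERT INTO {table} VALUES ({values});")
    return statements


sample = "f0b2517b-4b65-4fa5-b562-ff931cd213f2, Teresa, J, Krassa\n"
statements = csv_to_inserts("cfPersName", sample)
print("\n".join(statements))
```

Concatenating the statements for every table, in an order that respects any foreign keys, would give a single file of the kind described above.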
If you wish to use the software yourself, you can download it from version control, but unfortunately, at the time of writing, the MedLine data in the format required by the program is not publicly available. It is available as n-quads on CKAN, and the project is discussing with Open Bibliography the possibility of also publishing the data in its original format. In the meantime, please feel free to contact us and we will be able to help you obtain the data in the relevant format.