Enriching our bib records using the BL Linked Data Service

As members of the Bloomsbury Library Management System Group (comprising the Bloomsbury Colleges and Senate House Libraries), the IOE library has been participating in an interesting project to ensure that our MARC21 bibliographic data is in good shape and fit for purpose in the 21st century. This means that any member who migrates to Kuali OLE will be able to do so knowing that their expensively created MARC data is working hard for them, and those who are at this stage only looking to add a new discovery front-end will gain similar benefits. Our experiments with next-generation discovery systems such as VuFind have also shown that these systems can expose shortcomings in your MARC data: if your language code is missing, for example, then filtering on language will give a misleading result.

We have been using the British Library's Linked Open Data service to try to enrich our bibliographic records. The first question is: why would you do this? There are a number of reasons, ranging from ensuring that the number of identifiers in the records is maximised (the print and electronic ISBNs, for example) to adding the Dewey classification number or Library of Congress Subject Headings, both to assist retrieval and to provide a platform from which common points of access might be derived across a consortium of library catalogues such as the BLMS members.

In terms of methodology, we used the BNB SPARQL endpoint service and created some PHP scripts which fired every BLMS record that had an ISBN at it. Where exactly one match was found, the target BNB record was retrieved and its ISBN, DDC (082) and LCSH (650) fields harvested. These were then compared with the original source record from the BLMS member. Where a value differed (a potential enrichment), it was recorded in a database for further scrutiny.
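
For anyone curious about the shape of those scripts, here is a minimal sketch of the core lookup. It is illustrative only: the endpoint URL, the choice of predicates (bibo ISBN properties and dct:subject standing in for the DDC and LCSH properties) and the function name are assumptions rather than the exact query we ran.

```php
<?php
// Hedged sketch: query the BNB SPARQL endpoint for one ISBN and return the
// candidate enrichment values only where exactly one work matches.
// The endpoint URL and predicates are assumptions, not the production query.
function bnb_lookup($isbn) {
    $endpoint = 'http://bnb.data.bl.uk/sparql';   // assumed endpoint URL
    $query = "
        PREFIX bibo: <http://purl.org/ontology/bibo/>
        PREFIX dct:  <http://purl.org/dc/terms/>
        SELECT ?book ?isbn10 ?isbn13 ?subject WHERE {
          { ?book bibo:isbn10 \"$isbn\" } UNION { ?book bibo:isbn13 \"$isbn\" }
          OPTIONAL { ?book bibo:isbn10 ?isbn10 }
          OPTIONAL { ?book bibo:isbn13 ?isbn13 }
          OPTIONAL { ?book dct:subject ?subject }
        }";

    $ch = curl_init($endpoint . '?query=' . urlencode($query));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: application/sparql-results+json'));
    $response = curl_exec($ch);
    curl_close($ch);

    $results  = json_decode($response, true);
    $bindings = $results['results']['bindings'];

    // Count the distinct matched works; anything other than exactly one
    // match is treated as unsafe and discarded.
    $works = array_unique(array_map(function ($b) { return $b['book']['value']; }, $bindings));
    if (count($works) != 1) {
        return null;
    }
    return $bindings;   // values to compare against the source BLMS record
}
```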

At this stage the procedure is fairly simplistic in certain ways: it only looks at the first ISBN in the source record, and it discards any result set where more than one match is returned as potentially unsafe. These are all things that could be added as enhancements without too much difficulty. The main problem was that, for some unknown reason, the script kept pausing and not resuming. The answer was to kill it and restart it from the point it had reached. Rather messy, but it did the job!

The results were as follows:

1.3 million records had an ISBN (43% of the dataset). Within these, 299,149 (23%) matching BNB records were found and harvested via our fairly risk-averse process. From them, 473,757 discrete proposed data enrichments were identified, spanning 198,205 records. The breakdown of enrichment types is:

ISBN10    43,800
ISBN13    15,127
DDC    97,712
LCSH    317,118

The next stage is to work out how we might validate these enrichments and incorporate them back into our host Library Management Systems. Alternatively, those institutions migrating to Kuali could use this as a pre-migration staging post to optimise their MARC data.

This has been a very useful example of a practical application of linked data and one which we will continue to explore. The next question is whether this can be done using title where there is no ISBN to use as a hook. That can turn into something of a nest of vipers, to be left perhaps for another day!

Posted in MARC, OEM-UK

Search engines not playing ball :(

One of the things I’ve been trying to do since the project itself ended is to get the site represented on search engines such as Google. Naively, I thought this was going to be the easy part of the story, but it turns out not to be. Earlier, I had decided to include only our library catalogue records and to exclude the archives and two repositories, the reason being that everything except the library catalogue was already being crawled, whereas the library catalogue had no web presence.

During the project itself, I had installed the Drupal SEO Checklist module, which provides a very useful “to do” list to ensure you have done what you need to optimise your search ranking. Some of the items were very simple, such as ensuring clean URLs were being used and that each page had a unique and meaningful title. Others (such as the addition of structured metadata) were more complex and I hadn’t time to do them.

I signed up to Google Webmaster Tools and initially things looked quite promising after release in mid-November 2012. Within 10 days we had climbed to a whopping 58,000 indexed items (out of about 240,000). This plateaued and lasted until the end of the year. Then, in early January 2013, the count dropped within one week to just 12,000 items and has never really recovered beyond about 18,000 since. No changes had been made to the site during those weeks, so it was not possible to isolate the cause.

Reading various resources, it seemed possible that a lack of exposed structured metadata was holding things back; perhaps Google had changed its indexing policy to coincide with the new year. My dilemma was that, although the structured data was present in the database, the Drupal Panels module I was using to provide a more user-friendly interface as a springboard to the catalogue record was hiding that structured data from the search engine.

Eventually I reckoned I could work with the Panels module to expose the required structured data, particularly as Google had by then introduced a structured data testing tool with which I could see the results of my handiwork. A couple of hours allowed me to shoehorn this together, and the testing tool seemed happy with the schema.org metadata I was pumping out. I then resubmitted the sitemap and waited. Here we are, more than a month later, and whilst there was a spike from 18,000 to 24,000, it has now reverted to 18,000 again, so it seems there has been no gain.

At which point, I have run out of ideas. Help! Does anyone have any more suggestions?
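
In case it helps anyone diagnose the problem, the kind of markup now being exposed looks roughly like the sketch below. The field names and the rendering function are invented for illustration rather than taken from the actual module code.

```php
<?php
// Hedged sketch: render one catalogue record as schema.org Book microdata so
// that the structured data is visible in the crawled page. Field names and
// markup are illustrative only.
function render_book_microdata($record) {
    $html  = '<div itemscope itemtype="http://schema.org/Book">';
    $html .= '<h2 itemprop="name">' . htmlspecialchars($record['title']) . '</h2>';
    $html .= '<span itemprop="author">' . htmlspecialchars($record['author']) . '</span>';
    $html .= '<span itemprop="isbn">' . htmlspecialchars($record['isbn']) . '</span>';
    $html .= '<span itemprop="datePublished">' . htmlspecialchars($record['year']) . '</span>';
    $html .= '</div>';
    return $html;
}

// Example with a dummy record.
echo render_book_microdata(array(
    'title'  => 'An Example Textbook',
    'author' => 'A. N. Author',
    'isbn'   => '9780000000000',
    'year'   => '1948',
));
```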

 

Posted in Uncategorized

Final Post

As we are now at the end of the formal part of the project, here is a final post, although this blog will remain in place for new information regarding the project, so in fact this may not be the final post at all!

Outputs:

  • Adaptation of the COMET project’s IPR analysis workflow and its embedding in an automated record selection process carried out before the IPR-cleared data is published as Open Data.
  • Cataloguing of 8,178 pre-1950 Examination Papers and related historical textbooks in the subjects of History and Science/technology.
  • Use and proof of concept of a low-barrier cataloguing interface which ramps up the volume of retrospective cataloguing that can be achieved whilst retaining an acceptable standard of data quality.
  • Release of 439,000 records as downloadable files from the Library and Archives catalogue systems under the Public Domain Dedication and License v1.0 and for use by all. The breakdown of numbers is:
  1. Library catalogue: 214,000 (bibs); 198,000 (authority)
  2. Archive catalogue: 26,800
  • Working connectors between the four contributing systems, with the data indexed as RDF and available via a SPARQL endpoint interface.
  • A working incremental record update workflow which synchronizes the data derived from the sources.
  • Release to Google of records which were previously not indexed.

Lessons learnt:

  • Connecting library systems to Drupal in a way which is scalable to large complex datasets is difficult using the currently available modules.
  • Synchronization (daily incremental updating) is possible but relies on scripting and is probably not too sustainable.
  • Once your data is properly ingested into Drupal, adding RDF mappings at a granular level is a trivial task which non-technical staff could do (a code sketch follows this list).
  • It is relatively easy to embed an IPR filtering mechanism into the workflow which suits the institution’s risk appetite.
  • A low-barrier cataloguing mechanism is an economical way to surface previously hidden (print) resources with semantically rich records, without requiring specialist cataloguing expertise. We experimented with various methods of creating subject index headings, and the use of scripts was far more economical than solutions requiring even minimal professional staff input at the record level. We also enriched records after they had been created by asking researchers to provide indexing terms for the OEM-UK books they consulted. Using a low-barrier cataloguing mechanism enabled us to focus limited resources on semantically enriching every OEM-UK record, a process that continues even after the records are created.
  • There are simple technical methods by which basic records can be downloaded into Drupal from well-known third-party catalogue sources, but the IPR position and the rights and responsibilities in this area remain unclear.
  • Building the RDF index in Drupal (with a large dataset) was challenging, but once figured out, simple to reschedule.
  • Releasing records to search engines requires some thought about the sitemap and META tag content in order to ensure maximum visibility through the crawler process.
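
To illustrate the point about granular RDF mappings made in the list above: in Drupal 7 such a mapping can be declared either through the admin interface or in code. The sketch below is a hedged example using core’s hook_rdf_mapping(); the module, bundle and field names are invented, and this is not the mapping we actually used.

```php
<?php
// Hedged sketch of a Drupal 7 RDF mapping for a catalogue-record content
// type. Module, bundle and field machine names are invented for illustration.
function oemuk_rdf_mapping() {
  return array(
    array(
      'type'    => 'node',
      'bundle'  => 'catalogue_record',
      'mapping' => array(
        'rdftype'      => array('bibo:Book'),
        'title'        => array('predicates' => array('dc:title')),
        'field_isbn'   => array('predicates' => array('bibo:isbn13')),
        'field_author' => array(
          'predicates' => array('dc:creator'),
          'type'       => 'rel',   // the value is a resource rather than a literal
        ),
      ),
    ),
  );
}
```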

Opportunities and Possibilities:

  • We intend to examine how far it is possible to update our catalogue records using other linked data sources such as VIAF.
  • We will monitor traffic coming from Google via Google Analytics to demonstrate tangible use increases due to having opened the data.
  • Better advice/training for systems librarians on SEO would be both welcome and useful in opening data to a wider audience.
  • We will continue to use the low barrier cataloguing methodology to increase our retrospective cataloguing output.


Posted in Uncategorized

Exporting from Drupal and Importing to SirsiDynix Symphony LMS

When we had finished cataloguing in Drupal, we exported the records to a CSV file. I did this by setting up a special Drupal view which exported the fields I wanted in the order I wanted. I then wrote a script which converted the CSV into Sirsi flat ASCII format. The essence of this was very simple (map each CSV field to a MARC field number and write it to the output file); the detail turned out to be rather more difficult.

The first issue was repeatable fields. Drupal has a concept of repeated fields but outputs them all into one comma-separated field in the CSV. Using an array and iterating over each comma-separated item, it was relatively easy to turn them into repeatable MARC fields. The exception was the AACR2 rule concerning the 100 and 700 fields and how their values interact with the statement of responsibility in subfield $c of the 245. After some jiggling, we were able to make this work satisfactorily. The script is not pretty, but it is fast (on a thousand records) and does the job; I will release it shortly. We loaded 1,083 history textbook records via the Symphony import report in one minute. Later, we repeated the exercise with 786 science and technical textbooks.
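
Ahead of releasing the script, here is a stripped-down sketch of the core loop: map each CSV column to a MARC tag, split Drupal’s comma-separated multi-value cells into repeated fields, and write one flat record per row. The column names, tag map and flat-file line syntax shown are simplified assumptions rather than the released code.

```php
<?php
// Hedged sketch of the CSV-to-flat-file conversion. Column names, the tag
// map and the exact flat-file syntax are simplified for illustration.
$map = array(
    'title'    => '.245. 10|a',
    'author'   => '.100. 1 |a',
    'subjects' => '.650.  0|a',   // repeatable field
);

$in     = fopen('export.csv', 'r');
$out    = fopen('import.flat', 'w');
$header = fgetcsv($in);           // first row holds the column names

while (($row = fgetcsv($in)) !== false) {
    $record = array_combine($header, $row);
    fwrite($out, "*** DOCUMENT BOUNDARY ***\nFORM=MARC\n");
    foreach ($map as $column => $prefix) {
        if ($record[$column] === '') {
            continue;
        }
        // Drupal writes repeated values into one comma-separated cell,
        // so each item becomes its own repeated MARC field.
        foreach (explode(',', $record[$column]) as $value) {
            fwrite($out, $prefix . trim($value) . "\n");
        }
    }
}
fclose($in);
fclose($out);
```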

The difference with the latter was that we mined subject-indexing terms from the title and series fields using a pre-defined mapping to our London Education Thesaurus (LET): 1,718 subject terms were created across the 786 records. This was possible because we were working with a dataset with a fairly tightly defined subject area, in which the title often (by the very nature of the material) was itself describing a subject. Our next challenge is to see what we can do with the exam papers when we import those into Symphony, and how far the automatic subject allocation can be taken there.
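
The mining step was essentially a keyword lookup against that pre-defined mapping. The sketch below gives the flavour of it; the keyword/term pairs are invented examples rather than actual LET headings.

```php
<?php
// Hedged sketch of mining subject terms from a title: scan for known
// keywords and emit the corresponding thesaurus terms. The keyword/term
// pairs below are invented examples, not real LET entries.
$let_map = array(
    'arithmetic' => 'Arithmetic - Study and teaching',
    'chemistry'  => 'Chemistry - Textbooks',
    'mechanics'  => 'Mechanics - Textbooks',
);

function mine_subjects($title, $map) {
    $terms = array();
    foreach ($map as $keyword => $term) {
        if (stripos($title, $keyword) !== false) {
            $terms[] = $term;   // becomes a 650 field on export
        }
    }
    return array_unique($terms);
}

print_r(mine_subjects('A First Course in Practical Chemistry', $let_map));
// Array ( [0] => Chemistry - Textbooks )
```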

Posted in MARC, OEM-UK, Subject indexing, Textbooks