Are you facing the problem of incomplete data in ArchivesSpace? Is there data missing from your records that you would like to fix someday, if only you could ever find the time to do all of that work? Is the missing data threatening to cause much larger problems down the road for your organization? In 2017, the George Washington University Special Collections Research Center faced just this problem when we upgraded our instance of ArchivesSpace to version 1.5. The biggest change introduced by this upgrade was the creation of top-level containers. Representative of the ubiquitous archival box, top-level containers allow new ways to generate and use data, but they also introduce difficulties for migrating old data.
The ArchivesSpace migration tool pulled from four sources of data within each collection to create each top-level container: box number (or indicator), container type, location, and barcode. Some of our collections had records without barcodes, and some of those records had box numbers that were duplicated among multiple containers in the collection. While the physical containers did have barcodes, that information was maintained in a separate database.
We needed a way to disambiguate all of the containers. A plug-in that could recognize each container and populate the barcode field with a faux barcode solved this problem by providing a unique identifier for each one. That gave the migrator the data it needed to differentiate between multiple containers sharing the same box number. While our immediate problem was fixed, we wanted to complete the process by ensuring that all top-level containers had real barcodes.
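To illustrate the idea (a minimal sketch, not the plug-in's actual logic), a faux barcode only needs to be unique within the ArchivesSpace instance, which can be achieved by combining identifiers that are already unique in combination. The function name and scheme below are hypothetical:

```python
# Hypothetical sketch of faux-barcode generation, not the actual
# plug-in's logic. The string only needs to be unique within the
# ArchivesSpace instance so duplicate box numbers can be told apart.
def make_faux_barcode(repo_id: int, resource_id: int, container_id: int) -> str:
    """Combine identifiers that are already unique in combination."""
    return f"faux-{repo_id}-{resource_id}-{container_id}"

print(make_faux_barcode(2, 151, 4021))  # prints: faux-2-151-4021
```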
This would turn into a project involving the work and expertise of multiple library staff. We used Python scripts to retrieve and post data through the ArchivesSpace API, which gave us both the data we needed to identify every extant container and the ability to enter thousands of barcodes in seconds. Two tools were necessary to complete the project: OpenRefine and Jupyter Notebook. The former was used to parse the data pulled from ArchivesSpace, and the latter provided the framework for running the Python scripts.
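To give a sense of what those scripts do (a minimal sketch, not our production code, which the follow-up post will link to), the workflow boils down to authenticating against the ArchivesSpace backend, pulling top container records, and posting them back with the barcode field filled in. The URL, credentials, repository ID, and barcode mapping below are placeholders:

```python
# A minimal sketch of the kind of ArchivesSpace REST API calls involved.
# The base URL, credentials, repository ID, and barcodes are placeholders.
import requests

BASE = "https://aspace.example.edu/api"  # hypothetical backend URL
REPO = 2                                 # hypothetical repository ID

# Authenticate and capture a session token for subsequent requests.
login = requests.post(f"{BASE}/users/admin/login", data={"password": "secret"})
headers = {"X-ArchivesSpace-Session": login.json()["session"]}

# Retrieve the IDs of every top container in the repository
# (all_ids is one of the API's standard pagination options).
ids = requests.get(
    f"{BASE}/repositories/{REPO}/top_containers",
    params={"all_ids": "true"},
    headers=headers,
).json()

# Hypothetical mapping of container ID to real barcode, e.g. assembled
# in OpenRefine from the exported container data.
real_barcodes = {4021: "32882019334567"}

for container_id, barcode in real_barcodes.items():
    # Fetch each container record, set its barcode, and post it back.
    uri = f"{BASE}/repositories/{REPO}/top_containers/{container_id}"
    record = requests.get(uri, headers=headers).json()
    record["barcode"] = barcode
    requests.post(uri, headers=headers, json=record)
```

Because each update is a single POST, a loop like this can push thousands of barcodes in the time it would take to edit a handful of records by hand.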
While it eases the workload in the long run, writing effective code is a difficult process that requires a lot of work up front. With one staff person coordinating the project, we were extremely fortunate to have multiple people in the department and in the library with the skills needed to complete it. Chief among those skills was the ability to edit Python code. With their assistance, the scripts needed to complete the project were finished in early 2019.
This blog post is intended to spread awareness of this solution for those who may be facing a similar problem. A follow-up post next month will provide the details of how we completed the project, including links to the scripts we used and a step-by-step guide through the process. We hope that others with the same need will find value in our sharing of this project.