Our instance of ArchivesSpace includes numerous collections where some or all of the archival objects are attached to containers that do not have barcodes. This is a step-by-step guide for batch posting barcodes to ArchivesSpace through the API.
This document will give archivists a tool they can use to ensure all collections have barcoded containers. For anyone unfamiliar with batch editing records through the API, it will also serve as an introduction to the process and a possible model for future batch editing projects.
There are two applications that you will need for this project.
Anaconda. This can be downloaded here: https://www.anaconda.com/distribution/. Jupyter Notebook is the only tool in the Anaconda suite you will use, but the suite also includes Python libraries you will need. Make sure you download Python version 3.X; the scripts I am sharing are written for Python 3 and will not work in Python 2.
Just above the download link, there are tab options to download the appropriate version for your operating system. Make sure you select the correct system for your computer.
OpenRefine. This can be downloaded here: http://openrefine.org/. I use OpenRefine version 2.7.
Tutorials for both can be found online. This document will provide as much detail as possible for each step of the way, but it may omit basic details about how each application runs.
There are four individual actions that must be completed. They are numbered below. Each action is followed by an itemized list of its constituent steps.
Before running the script, you will need to make sure you have all of the Python libraries you will need.
Open the Anaconda Navigator.
Click on the Environments tab.
Making sure you are viewing the ‘installed’ libraries, scroll down and confirm that the following are present:
jsonschema
requests
pandas
If any of these are missing you will need to install them.
In your start menu, search for Anaconda Prompt.
It will open up a command line interface.
In the interface, type pip install [missing library], replacing the bracketed text with the name of whichever library you are missing, e.g. pip install jsonschema.
Here is a screenshot of what your command line interface should look like.
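If you prefer to check from inside a notebook, the short sketch below (hypothetical, not part of the project scripts) prints which of the three libraries, if any, are still missing:

```python
import importlib

# Check each required library and report anything that is not importable.
for name in ('jsonschema', 'requests', 'pandas'):
    try:
        importlib.import_module(name)
        print(name, 'is installed')
    except ImportError:
        print(name, 'is missing - install it with: pip install ' + name)
```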
In Jupyter Notebook, create a new Python 3 notebook.
Copy this Python script into the notebook. You will notice that the script is divided into segments, called cells. It is important that you recreate the cell structure when copying: copy all of the text within each cell and paste it into a single cell in Jupyter Notebook. Do not paste multiple cells from Github into one cell in Jupyter Notebook, and do not alter the order of the cells as they appear on Github. The cells must be run in sequence.
In Jupyter Notebook you will notice a + sign in the upper left hand corner of the screen. This is how you add cells.
In cell 2, you will need to add the login information you use to access ArchivesSpace: the username and password you would normally use. Add the ArchivesSpace URL inside the single quotation marks after Host =. You may need to append the local port to the end of the URL; the Host URL with port will look like the example below.
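For example (placeholder values only, and the variable names are my assumptions based on the description above), the login cell might look something like this sketch:

```python
import requests

# Placeholder credentials and address - substitute your own.
host = 'https://aspace.example.edu:8089'   # ArchivesSpace URL with the local port appended
username = 'your_username'
password = 'your_password'

# Log in to the ArchivesSpace API; the session token returned here must
# accompany every later request.
auth = requests.post(host + '/users/' + username + '/login',
                     params={'password': password}).json()
headers = {'X-ArchivesSpace-Session': auth['session']}
```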
The default for this script will be to create a list of all containers in ArchivesSpace. You will need to edit some of the text so that it only returns a list of containers for the collection you are interested in.
In cell 5 you will see red text. Some of that red text can stay as is, but the red XXXXXs in the line that reads if "XXXXX" in collection will need to be replaced with your search term, and the red filename in fh = open('XXX_containers.jsonl', 'a') will need to be edited to a filename that will be useful for you.
The easiest way to target the collection you are working with is to use the collection name as the search term. Using collection identifiers worked with previous versions of ArchivesSpace, but I have found that this option no longer works in version 2.5.2. If your first search returns an empty file, try different search terms. Remember that letter case may be important.
The dataset that will be downloaded to your computer will get its title from this line. Edit the portion that reads XXX_containers.jsonl to a file title that works for you.
Pay attention to the folder location the script is saved in. The datafile you pull from ArchivesSpace will be saved into the same folder location as the script.
This particular script is a GET request. This means it will not alter the data in ArchivesSpace, so you don’t have to worry about testing the script before running it against the production server. All it will do is pull data from ArchivesSpace.
Do not be concerned about limiting your search to only those containers that are missing barcodes. If half of the containers in a collection are missing barcodes you should still work with a datafile that includes all containers in that collection.
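Putting the pieces above together, here is a rough sketch of what the container-listing cell might amount to. It is not the linked script itself: the repository id, the top_containers endpoint usage, and the collection field handling are my assumptions, and it relies on the host and headers variables from the login cell.

```python
import json
import requests

repo_id = 2   # assumed repository id - check your own instance

# Ask the API for the ids of every top container in the repository.
ids = requests.get(host + '/repositories/' + str(repo_id) + '/top_containers?all_ids=true',
                   headers=headers).json()

fh = open('XXX_containers.jsonl', 'a')   # edit this filename, as described above

for container_id in ids:
    record = requests.get(host + '/repositories/' + str(repo_id) + '/top_containers/' + str(container_id),
                          headers=headers).json()
    # Flatten the collection information into a string so a simple substring
    # test against your search term works.
    collection = json.dumps(record.get('collection', ''))
    if "XXXXX" in collection:   # replace XXXXX with your collection name
        fh.write(json.dumps(record) + '\n')

fh.close()
```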
At the top of the Jupyter Notebook interface, you will see a button that says >| Run. When your cursor is inside of a cell, click this button to activate the cell. Remember that the cells must be run in sequence from top to bottom.
After you run the entire script, a JSON file will appear in the folder. This file will be unusable for this project in its native format.
Create a new project in OpenRefine.
Select the JSON file you just pulled from ArchivesSpace. Here is a picture of what OpenRefine’s create project screen looks like.
Click Next.
In the options, select JSON files. Hover your cursor over the top left-hand corner of the dataset; a yellow box should encompass the entire dataset, as shown here. Leave the default settings in place and click to select it.
You do not need to manipulate or alter the data in OpenRefine, but check that containers are neither inappropriately split nor inappropriately merged: for example, make sure an indicator is not repeated within a collection when it should not be, and that a shared box has not been separated by Aspace into two containers. If you find this problem, you may need to edit the records manually.
Once you are certain that the container list is correct, export the data through the custom tabular exporter into a CSV file, as seen here. This allows you to select only the data fields you need for this project; in this case, select only URI and indicator as the fields to be exported, as seen here. Your CSV file should then appear in your downloads folder. For the purposes of this exercise, we will title this export file_1.csv.
Pull a list of barcodes from your native box tracking database. The list will need to be in CSV file format. If your list is in another format, such as .XLSX, you can use OpenRefine to convert it into a CSV file. To complete this step, it is important that each container has a unique container identifier, e.g. box number.
The container identifier field will be matched up with the indicator field in file_1.csv so make sure your container identifier field holds the same data as the indicator field. The data must be written exactly the same. If the indicator field says 10, the container identifier field must say 10, not box 10.
If your list of barcodes has thousands of entries and they all include container identifiers that say Box ##, OpenRefine allows you to easily split that column in two. Create a new project and identify the column of container identifiers. Clicking the arrow at the top of that column opens a menu; select ‘Edit column’ and then ‘Split into several columns.’
An options menu will appear asking how you wish to split the column. Select ‘by separator.’ The separator will likely be a comma by default; delete the comma, type a single space in its place, and click ‘OK.’ This ensures the split happens on the space between the word Box and the number.
Export your project to a new CSV file.
Save the CSV file with barcodes and container identifiers into the same folder as file_1.csv. For this exercise, we will call this second CSV file file_2.csv.
It is not recommended to open the new CSV file in Excel, because Excel will sometimes transform long strings of numbers such as barcodes. If you want to check that the CSV file has the appropriate data, open it in Notepad or a similar program instead, or preview it with a few lines of Python as shown below.
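For instance, this small hypothetical snippet (not part of the project scripts; the filename is the one used in this exercise) prints the first few rows without reformatting anything:

```python
import csv

# Print the first five rows of the barcode list so the barcodes can be
# inspected without a spreadsheet program altering them.
with open('file_2.csv', newline='') as fh:
    for i, row in enumerate(csv.reader(fh)):
        print(row)
        if i >= 4:
            break
```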
Open a new Python 3 notebook and copy this script into it.
As before, keep cells separate where appropriate.
The text in red will need to be edited. Change the red text in the first two lines to reflect the file titles you used when creating your versions of what we titled file_1.csv and file_2.csv.
Change common_element to the header for the box number column. The header will probably be indicator, but if you used any other text (e.g. indicators) then make sure you edit the text to fit your documents.
Before running the script, make sure that file_1.csv and file_2.csv are located in the same folder as the script you are about to run.
Change the red text that reads output.csv to whatever file title you prefer. For the purposes of this exercise, we will use the file title output.csv.
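For orientation, here is a minimal sketch of what this merge amounts to, assuming the file titles and headers used in this exercise (file_1.csv, file_2.csv, a shared indicator column, and a barcode column headed real); the linked script may differ in its details.

```python
import pandas as pd

# Read everything as text so long barcodes are not reformatted.
file_1 = pd.read_csv('file_1.csv', dtype=str)   # URI and indicator exported from OpenRefine
file_2 = pd.read_csv('file_2.csv', dtype=str)   # indicator and barcode ('real') from your box tracking database

common_element = 'indicator'   # the column header shared by both files

# Join the two lists on the shared column; the result holds the URI, the
# indicator, and the barcode for every container that matched.
merged = pd.merge(file_1, file_2, on=common_element)
merged.to_csv('output.csv', index=False)
```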
Here is an image of what output.csv should look like if you open it in Notepad.
Create a new Python 3 notebook and copy this script into it.
As before, copy and paste each cell individually.
Make sure that output.csv is in the same folder location as the script.
READ BEFORE RUNNING THE SCRIPT. The first script we ran only pulled information from ArchivesSpace. This time, we are altering data in ArchivesSpace. It is highly recommended that you run this script against a test instance before running it in production. As long as your test instance is designed to perfectly mirror the production environment, running the script there first will allow you to ensure that data is not altered in unintended ways.
As before, you should ensure that no one else is working in ArchivesSpace before executing the script. This applies to the test instance as well, since someone may be doing work with it.
You will have to edit the information in cell 3 to include your login information. The username and password should be your ArchivesSpace login credentials. The host and port information will be the same as in step 1.c.ii, unless you are running it against the test instance.
You must also define the baseURL in cell 3. This is the home URL for either the test or production server, depending on which one you are working with, and it will probably be the same as Host but without the port appended at the end.
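As a hedged illustration only (placeholder addresses, not your actual servers), the two values might look like this:

```python
# Placeholder examples - substitute your own server addresses.
host = 'https://aspace-test.example.edu:8089'   # API address with the port appended
baseURL = 'https://aspace-test.example.edu'     # the same server without the port
```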
In cell 5, you will have to edit the red text as appropriate.
The red text in the line that reads reader = csv.DictReader(open('output.csv')) should be edited to reflect the title of the csv file you created in step 3.
The line that reads output['barcode'] = row['real'] requires that the column header for the list of barcodes was edited to say real, as mentioned in step 2c. If you did not edit the header as specified, then you will need to edit the red text in this line of the script to reflect whatever the header in your CSV file is titled. A hedged sketch of this update loop follows.
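The sketch below assumes output.csv holds the container URI under a uri header and the barcode under real, and that baseURL and headers were defined in cell 3 as described above; your copy of the script may name things differently.

```python
import csv
import json
import requests

reader = csv.DictReader(open('output.csv'))

for row in reader:
    uri = row['uri']   # assumed header for the container URI column
    # Fetch the existing top container record, add the barcode, and post the
    # updated record back to the same URI. Only the barcode field changes.
    output = requests.get(baseURL + uri, headers=headers).json()
    output['barcode'] = row['real']
    post = requests.post(baseURL + uri, headers=headers, data=json.dumps(output))
    print(post.status_code, uri)
```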
Go to ArchivesSpace and confirm that the barcodes have been populated, as you did before.