Information for Community Archives: Electronic Records


Electronic Records / Tom Brown

PRELIMINARY DRAFT

Today, any organization producing records worth preserving, or any individual creating papers worth preserving, is using a computer. Born-digital materials form an important part of the documentation of a person's life or an organization's activities, and they need to be preserved. Preserving materials digitally has significant advantages over analog formats. Digital materials require little space, are easily shared, and offer the possibility of enhanced access. When a digital file is copied, the original and the copy are identical. This is different from electrostatic or Xerox reproductions of paper materials, where each “generation” loses some clarity, definition, legibility, etc.

 

This section offers some practical suggestions to queer repositories on how to approach these materials. It assumes that the archive relies mainly on a volunteer staff with minimal financial resources. It also assumes that the archive has a desktop computer or at least access to a computer. However, if the archive is part of a larger operation that has a local area network with a system administrator, the archivist should consult with the system administrator to determine how best to implement these recommendations.

The general principle in this guide is to move a copy of the digital materials worth preserving from the originator's computer to the archive computer. This should be done when the originator has finished using the electronic records, so that the archive is sure to have the final version.

The first step in moving the materials to the archive computer is to make a copy on a transfer medium. The recommended transfer medium is the CD; it is ubiquitous, inexpensive, and transportable.

After the material has been copied from the CD to the archive computer, it should then be moved to magnetic tape and placed in an off-site archival storage area if possible. While some advocate on-line storage with tape back-up of the archival materials, this guide opts for off-line storage, assuming that maintenance of such servers would be beyond the resources of most queer archives and that archival materials by their nature are seldom used.

There is a rationale for storing on tape as opposed to CDs. A CD is not an appropriate preservation medium. At present, there is no test for whether a CD is deteriorating. And as soon as one part of the CD fails, the entire CD is unreadable without resorting to highly technical (and expensive) recovery procedures. In other words, one could read a CD one day and find it totally unreadable the next. At present, the recommended tape is either 4 mm or 8 mm DAT tape. DAT (digital audio tape) is a digital magnetic tape format originally developed for audio recording and now used primarily for computer backup. The latest DAT storage format is DDS (digital data storage). If a tape drive is unavailable, duplicate copies of the “gold” CDs should be made, with one stored on-site and the duplicate off-site.

Whenever a file is copied, the archivist should ensure that the number of bytes in the original equals the number of bytes in the copy, even if the comparison has to be done manually.
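
For an archive with a volunteer who is comfortable with a little scripting, the byte-count comparison could be automated along the following lines. This is only a minimal sketch in Python, assuming the original and the copy are both reachable as ordinary files; the function names are hypothetical and not part of any standard tool.

```python
import os


def same_byte_count(original_path, copy_path):
    """Return True when both files contain the same number of bytes."""
    return os.path.getsize(original_path) == os.path.getsize(copy_path)


def report_copy(original_path, copy_path):
    """Build a one-line report that could be pasted into the archive's log."""
    original_size = os.path.getsize(original_path)
    copy_size = os.path.getsize(copy_path)
    status = "OK" if original_size == copy_size else "MISMATCH"
    return (f"{status}: {original_path} ({original_size} bytes) -> "
            f"{copy_path} ({copy_size} bytes)")
```

An equal byte count shows only that nothing was truncated or added during the copy; it does not prove that every byte is the same.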

According to archival theory, records have content, context, structure, and appearance. This guide recommends the preservation first and foremost of the content. And where possible, it offers recommendations to preserve the context and structure. Some archivists have discussed the need to preserve the “rendition” or the “look and feel” of a document. But to do so can be a very expensive undertaking. The guiding principle is that something is better than nothing.

This guide will examine different computer applications commonly used by organizations or individuals and make recommendations on what to preserve and how to preserve it. For all these recommendations, apply the same guidelines from the appraisal and acquisitions sections that you would use for paper documents to determine whether the materials have archival value.

WORD PROCESSING

This is probably the easiest form of electronic records that an archivist confronts. Save the final version of word processing documents. If the document contains very limited formatting, saving the file as text (file extension .txt) is preferable to the original word processing format. This eliminates the need to refresh (or migrate) the file every couple of years when a newer version of the software is produced. Depending upon your resources, you may also want to consider printing out the document on acid-free paper as the archival copy.

E-MAIL RECORDS

If word processing is the easiest form of electronic records for an archive to address, then e-mail may be one of the most challenging. Not only may the word processing documents attached to e-mails have archival value, but other types of attachments and the e-mails themselves may be historically valuable records as well. Archives face the challenge of preserving the digitally encoded attachments and maintaining the context for them and the e-mails in the midst of an exponential increase in the use of e-mail applications and the automated delete function built into these e-mail systems.

The automatic deletion of both incoming and outgoing e-mail after a set period of time may be a partial solution to the archival conundrum of e-mail. The deletion process will reduce the volume of e-mail that an archive receives by removing insignificant correspondence. However, significant material can be deleted as well. Organizations and individuals interested in establishing an electronic archive, and archive staff working with contributors who are actively creating archival material, need to make sure that every staff member and individual understands the automatic destruction schedule of their e-mail system. Knowing the schedule will prompt them to save the important e-mail along with any needed documentation. One common way of saving e-mail is to print it out along with any attachments. At that point, it can be managed within the organization or by the individual as paper material is managed. Most e-mail systems, however, offer the option of saving the e-mail electronically within the system in a separate directory or file. In this situation, the archivist should copy the saved e-mail and the attachments to a CD for transport to the archive. The text of the e-mail should be copied in a text format (.txt); the attachments should be copied in their native format. The text of the e-mail will preserve the context of the attachments. The archive should copy the e-mail within any directory and subdirectory structure the originator had created. Preserving the relationship of the e-mails within their directories will preserve the context of the e-mails. While acquiring only the e-mail that the originator saved is not a perfect solution, it conforms to this guide's approach that something is better than nothing.
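
Where the originator's e-mail program can save individual messages as standard message files, the split into a text copy plus native-format attachments might be scripted as follows. This is a hedged sketch in Python, not a recommended tool: it assumes each message has been saved as a .eml file, and the function name and the output file name message.txt are hypothetical. Many e-mail systems use other storage formats that this sketch does not read.

```python
import email
from email import policy
from pathlib import Path


def save_message(eml_path, out_dir):
    """Write the message text to message.txt and each attachment as-is."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    raw = Path(eml_path).read_bytes()
    msg = email.message_from_bytes(raw, policy=policy.default)

    # Keep the headers that tie the text to its sender, date, and subject.
    lines = [f"{name}: {msg.get(name, '')}"
             for name in ("From", "To", "Date", "Subject")]

    # Prefer the plain-text body; some messages may not have one.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""

    saved = []
    for part in msg.iter_attachments():
        # Use only the final path component of the attachment's name.
        name = Path(part.get_filename() or "attachment.bin").name
        (out_dir / name).write_bytes(part.get_payload(decode=True) or b"")
        saved.append(name)

    # Listing the attachments in the text copy preserves their context.
    lines.append("Attachments: " + ", ".join(saved))
    (out_dir / "message.txt").write_text(
        "\n".join(lines) + "\n\n" + text, encoding="utf-8")
    return saved
```

Running the function once per message, with an output directory that mirrors the originator's folders, would keep the directory structure intact.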

For organizations, e-mail is generally organized within the system by the individual who is sending and receiving the e-mail. This organization ensures that duplicates will appear throughout the system. Thus, to limit the sheer number of e-mail messages, and especially duplicates, flowing into the archive, due diligence must be exercised in selecting the individuals whose e-mail will be preserved.

Processing the e-mail at the archive may be tedious. One way to preserve the text of the e-mail is to convert the e-mail text to a “stable analog format,” i.e., print to paper. This would also apply to word processing attachments. If the decision is made to preserve the e-mail digitally, the full directory structure should be printed to paper. The e-mail should be copied to DAT tape one directory or subdirectory at a time. With the exception of word processing documents, each attachment should be handled as discussed for its specific format. Word processing attachments can be copied in their native format. Viewer technology should be available for the long term to allow users to read the text of the document intermingled with the extraneous control characters. Doing such research will not be easy, but it will be doable.
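
A printable listing of the directory structure can be produced with a short script. The following Python sketch is one possible approach, assuming the saved e-mail has already been copied to a folder on the archive computer; the function name is hypothetical.

```python
import os


def directory_listing(root):
    """Return the directory tree under root as indented lines of text."""
    lines = []
    root = os.path.abspath(root)
    for current, dirs, files in os.walk(root):
        dirs.sort()  # visit subdirectories in a predictable order
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        lines.append("    " * depth + os.path.basename(current) + "/")
        for filename in sorted(files):
            size = os.path.getsize(os.path.join(current, filename))
            lines.append("    " * (depth + 1) + f"{filename}  ({size} bytes)")
    return lines
```

Writing the returned lines to a text file gives a document that can be printed and filed with the accession record.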

DATABASE MANAGEMENT APPLICATIONS

While not as prolific as word processing and e-mail, databases are maintained by many organizations and individuals in a variety of proprietary applications. For example, organizations may maintain their membership records in a database. Capturing the membership database at specified intervals will provide information on the growth or decline of the organization over time. The membership information should be kept electronically by the archive so that analysis of the members' zip codes will show how the membership changed geographically over time. If the membership information includes additional information on the members, it becomes an even more robust research tool for analyzing the makeup of an organization. Other databases may serve as indices to paper records. Whatever the purpose of the database, if it is worth preserving in an archive, it is worth preserving electronically. This will preserve its functionality or manipulability. Databases should not be preserved in a native format that depends on the database management software that created the database. Such formats will become technologically obsolete within five to ten years. Overcoming this obsolescence will be an extremely expensive undertaking.
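
To illustrate the kind of analysis that electronic preservation makes possible, the following Python sketch compares the zip codes in two membership snapshots. It assumes, purely for illustration, that each snapshot is available as comma-separated text with a column named Zip; the column name, the sample data, and the function names are all hypothetical.

```python
import csv
import io
from collections import Counter


def members_by_zip(csv_text, zip_field="Zip"):
    """Count the members in each zip code for one snapshot."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[zip_field] for row in reader if row.get(zip_field))


def change_between(earlier, later):
    """Return the gain or loss per zip code, omitting zip codes with no change."""
    all_zips = sorted(set(earlier) | set(later))
    return {z: later[z] - earlier[z] for z in all_zips if later[z] != earlier[z]}


# Two invented snapshots, taken at different times.
snapshot_a = "Name,Zip\nA,60614\nB,60614\nC,94612\n"
snapshot_b = "Name,Zip\nA,60614\nC,94612\nD,94612\nE,10011\n"
changes = change_between(members_by_zip(snapshot_a), members_by_zip(snapshot_b))
```

Zip codes are read as text rather than as numbers, so leading zeros are not lost.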

All database management systems have an “export” feature. One can go to the Help file and search for instructions to “export” the data. Using the computer of the creator, the archivist should follow the export instructions and copy the database to a CD. During the export process, the user will be prompted to specify the “type” or “format” of the output. The user should specify “.csv” or “comma separated value.” This is a file format used as a portable representation of a database. Each line is one entry or record, and the fields in a record are separated by commas. Commas may be followed by arbitrary space and/or tab characters, which are ignored. If a field includes a comma, the whole field must be surrounded with double quotes.
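
The quoting rule can be seen in a small example. The sketch below uses Python's standard csv module to write and re-read three invented records; the data are made up for illustration, and a database's own export feature may differ in details such as line endings.

```python
import csv
import io

# The first row holds the field headings; the second contains a comma in a field.
rows = [
    ["Name", "City", "Zip"],
    ["Rivera, Alex", "Chicago", "60614"],
    ["Lee", "Oakland", "94612"],
]

# Write the rows as comma-separated text held in memory.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
text = buffer.getvalue()

# Read the text back; the quoted field comes back as a single value.
parsed = list(csv.reader(io.StringIO(text)))
```

Only the field that contains a comma is wrapped in double quotes, and reading the text back yields the original three fields per record.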

Some archivists have argued against using .csv formats because of problems caused by the presence of a comma as part of the information within a field. This has prompted the recommendation to use another character as the delimiter, such as a pipe (i.e., |). While this avoids one problem, it creates another: many applications have routines to import .csv files, and thus .csv is easier for secondary use. Since .csv is something of a de facto standard, this guide recommends its use.

Technical documentation must accompany the .csv data file. This includes three items. First, a .csv file will have one record for each member of a given universe of persons, events, or things, and the documentation must explicitly define what items are in the universe. Second, the first record in a .csv file is a list of the field headings; each name is separated or “delimited” by a comma, as are the data fields in each record. While the meaning of some field names will be known outside of the organization (e.g., zip code), other names might be known only to individuals in the organization. These might be such names as “Status,” “Rank,” “Date,” “Amount,” or “Active?” with fields populated with either a “Y” or “N.” Thus the documentation must include an explicit definition of each of the field names. Third, fields in a database sometimes use codes. In the last example above, “Active?”, Y is a code that probably means “Yes” and N is a code that probably means “No.” If the Status field in the records is populated with A, B, C, D, and E, and if the Rank field in the records is populated with 1, 2, 3, and 4, the documentation must explain the meanings of these codes in a code table.
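
One way to confirm that the documentation is complete is to compare it against the data file itself. The following Python sketch reports field names that have no definition and coded values that are missing from the code tables. The function name and the way the documentation is represented (two dictionaries) are assumptions made for illustration only.

```python
import csv
import io


def documentation_gaps(csv_text, field_definitions, code_tables):
    """List field names and coded values that the documentation does not explain."""
    reader = csv.DictReader(io.StringIO(csv_text))
    gaps = [f"undefined field: {name}"
            for name in (reader.fieldnames or [])
            if name not in field_definitions]
    seen = set()
    for record in reader:
        for field, codes in code_tables.items():
            value = record.get(field)
            # Report each undocumented code once, however often it occurs.
            if value and value not in codes and (field, value) not in seen:
                seen.add((field, value))
                gaps.append(f"undocumented code in {field}: {value}")
    return gaps
```

An empty result means every field heading has a definition and every coded value found in the checked fields appears in its code table; it does not show that the definitions themselves are accurate.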

Database records can be an attachment to an e-mail. In these cases, the saved text of the e-mail and the database need to be linked through the documentation for the database.

SPREADSHEETS

The archival management of spreadsheets mirrors that of databases. The one exception is that spreadsheets may not use an “export” feature. For example, in Excel, the File menu has a “Save As” option. When this option is selected, the bottom of the dialog box shows “Save as type” with a pull-down list of different formats. That list includes .csv. This .csv file is then processed the same way as a .csv file from a database.

Technical documentation must accompany the .csv data file from a spreadsheet. This must include an explicit statement of the universe, an explicit definition of each column or field, and an explicit definition of any coded values.

Spreadsheets can be attachments to an e-mail. In these cases, the saved text of the e-mail and the spreadsheet need to be linked through the documentation for the spreadsheet.

WEB CONTENT RECORDS

Most organizations and a few individuals have Web sites. An archive may wish to acquire the content of each major design of the Web site. The purpose is not to capture the information on the Web site, because this information will be found in other records within the organization's records or the individual's personal papers. For example, an organizational Web site may list all the press releases. But within the organization, the press releases probably exist as a separate series. The archival purpose of capturing the Web content of each major redesign is to document the major changes in the Web site and how the organization or individual wished to present itself to the general public.

Transferring Web sites to an archive can be technically challenging. The U.S. National Archives and Records Administration (NARA) has issued transfer instructions to Federal agencies on this topic. See http://www.archive.gov/records-mgmt/initiatives/web-content-records.html (as of December 21, 2005). These somewhat technical guidelines answer four major questions:

(1) What type of web content records can be transferred?

(2) How should the web content records be prepared for transfer?

(3) What documentation should accompany the web content records?

(4) How should the records be transferred?

For an archive with limited technical knowledge, one possible strategy would be to provide a copy of these guidelines to the organization's webmaster and ask whether the webmaster can comply with the guidelines' answers to the first three questions. Regarding the fourth question, the guidelines offer two options appropriate to a small archive. The archive could acquire an open source “web harvester” or “web crawler.” Using the Web site's URL, the program will copy the basic content of the Web site onto the computer system in the archive. The other option is to have the webmaster manually copy the content records directly to a CD for transfer to the archive. The guidelines discuss the pros and cons of each method and which method is appropriate for which content format in the Web site. The use of a harvester or crawler will capture the basic portion of the Web site and will document how the organization presented itself to the public over time. For a manual transfer, the webmaster has to perform several additional steps and provide additional technical documentation to obtain the same content format as the harvester or crawler.
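
To give a sense of what a harvester does, the Python sketch below follows the links on a starting page and collects the HTML text of every page it finds on the same site. It is a simplified illustration under stated assumptions, not a substitute for an established open source harvester: it ignores images, style sheets, scripts, and robots.txt, and the names harvest, fetch_page, and LinkCollector are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_page(url):
    """Download one page and return its text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def harvest(start_url, fetch=fetch_page, limit=50):
    """Return {url: html} for pages on the same site, up to limit pages."""
    site = urlparse(start_url).netloc
    pages, queue = {}, [start_url]
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in pages:
            continue
        pages[url] = fetch(url)
        collector = LinkCollector()
        collector.feed(pages[url])
        for href in collector.links:
            # Resolve relative links and drop "#fragment" suffixes.
            target = urldefrag(urljoin(url, href)).url
            if urlparse(target).netloc == site and target not in pages:
                queue.append(target)
    return pages
```

The limit parameter keeps the sketch from running indefinitely on a large site; a real harvester would also save each page to disk and record when it was captured.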

PORTABLE DOCUMENT FORMAT (PDF)

Portable Document Format (PDF) was initially developed as a distribution format in which multiple copies of electronic documents could be widely disseminated and still have the same appearance on any computer. The Federal court system began accepting briefs and, in some instances, requiring them in PDF format so that the electronic submissions would retain pagination and paragraph numbering. And so PDF became a transfer format. For a PDF publication in a queer archive where the format was used as a distribution format, the recommendation is to print it to acid-free paper. While this loses the “search” capability of PDF software, it preserves the content and structure. Another approach is to copy the PDF file as it exists on the creator's computer and move it through the archive computer onto magnetic tape. In the fall of 2005, a standard for PDF/A was adopted, with the /A meaning “archival.” This is essentially a stripped-down version of PDF that preserves the essential components of PDF documents in a simple and fairly vanilla format. It is anticipated that commercial programs that convert PDF documents to PDF/A will become available. When this happens, all accessioned PDF documents should be converted to PDF/A.

DIGITAL PHOTOGRAPHS

Another electronic format that is becoming increasingly common among queer individuals and organizations is the digital photograph. Transferring digital photographs to an archive can be technically challenging. NARA has issued transfer instructions to Federal agencies on transferring digital photographs. See http://www.archive.gov/records-mgmt/initiatives/digital-photo-records.html (as of January 4, 2006). These technical guidelines suggest that the photographs be in either Tagged Image File Format (TIFF), in 'II' format, or JPEG File Interchange Format (JFIF, JPEG).

The acceptable TIFF formats include versions 4.0 (April 1987), 5.0 (October 1988), and 6.0 (June 1992). The default file extensions include .TIFF and .TIF. For JPEG, the default file extensions include .JPEG, .JFIF, and .JPG. These NARA guidelines offer significantly more technical detail than can be explained within this guide.
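
Because file extensions can be wrong, it may help to check the first few bytes of each photograph before accepting it. The Python sketch below recognizes the 'II' and 'MM' byte orders of TIFF and the JFIF form of JPEG from their file signatures. The function name and the wording of the labels are hypothetical, and the check cannot tell which TIFF version (4.0, 5.0, or 6.0) a file follows, since these header bytes are the same for all of them.

```python
def identify_image(path):
    """Guess the image format from the first bytes of the file."""
    with open(path, "rb") as handle:
        header = handle.read(12)
    if header[:4] == b"II*\x00":
        return "TIFF (II byte order)"
    if header[:4] == b"MM\x00*":
        return "TIFF (MM byte order)"
    if header[:3] == b"\xff\xd8\xff":
        # JFIF files carry the identifier "JFIF" followed by a zero byte at offset 6.
        if header[6:11] == b"JFIF\x00":
            return "JPEG (JFIF)"
        return "JPEG (not JFIF)"
    return "unknown"
```

A file reported as "unknown" or as TIFF in 'MM' byte order would need a closer look before it is accepted under the guidelines described above.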

POWERPOINT PRESENTATIONS

Trying to preserve the functionality and bells and whistles of a PowerPoint presentation over time is a major technical challenge and probably beyond the capabilities of most queer archives. In line with the general approach of this guide, the recommendation is to print it to acid-free paper. Fortunately, PowerPoint is widely available and has been fairly stable. Consequently, finding the application software for conversion to paper should be a fairly easy task. If the PowerPoint presentation was an attachment to an e-mail, the printed copy should have a cover sheet on archival quality paper that links it to the e-mail. The e-mail will indicate whether the printed PowerPoint presentation is a draft or a final version and provide other contextual information.

PRESERVATION PROGRAM

Possibly the best preservation for paper materials is to let them repose undisturbed on the shelves of an archive. This is a recipe for disaster if applied to electronic materials. Electronic records require a pro-active preservation program. The information in this section relies heavily on Maggie Jones and Neil Beagrie's Preservation Management of Digital Materials: A Handbook (London: British Library, 2002). The online version, available at http://www.dpconline.org/graphics/handbook/ (December 2005), is constantly being updated with new information. This is an invaluable resource for any small repository that is starting a preservation program for electronic records.

As indicated above, the electronic materials should be stored on 4 mm or 8 mm DAT tape. Studies have indicated that data encoded on DAT tapes that have been used two or three times are more stable than data recorded on new tapes. Consequently, the recommendation is to use for archival storage those tapes that have been used two or three times for back-ups of the archive computer system. Studies have also indicated that DAT tape is subject to rapid deterioration as a result of changes in temperature and humidity. It is relatively stable in an ambient office environment with air conditioning and heat. The materials should be stored in a stable, controlled environment and away from direct sunlight. The temperature range for long-term storage of DAT tape is 39°F to 89°F; the relative humidity range for long-term storage is 20% to 60%. While these ranges are fairly wide, the important aspect is a stable environment. Consequently, the archive should monitor changes in temperature and humidity. While a hygrothermograph is ideal, the monitoring can also be done manually on a daily basis. Since paper dust can adversely affect magnetic tape, the tapes should be segregated from the archival paper materials. This does not mean a separate room. The tapes can be stored in airtight plastic containers designed for tape storage and placed on shelves next to paper documents.
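
If the daily readings are typed into a computer rather than kept on paper, a short script can flag readings outside the ranges given above and show how much the readings swing from one day to the next. This Python sketch is illustrative only; the function names are hypothetical, and the guide does not specify how large a day-to-day change is acceptable.

```python
# Long-term storage ranges quoted in this guide for DAT tape.
TEMP_RANGE_F = (39.0, 89.0)
HUMIDITY_RANGE_PCT = (20.0, 60.0)


def out_of_range(temp_f, humidity_pct):
    """Return a list of problems for one reading; an empty list means in range."""
    problems = []
    if not TEMP_RANGE_F[0] <= temp_f <= TEMP_RANGE_F[1]:
        problems.append(f"temperature {temp_f} F is outside "
                        f"{TEMP_RANGE_F[0]} F to {TEMP_RANGE_F[1]} F")
    if not HUMIDITY_RANGE_PCT[0] <= humidity_pct <= HUMIDITY_RANGE_PCT[1]:
        problems.append(f"relative humidity {humidity_pct}% is outside "
                        f"{HUMIDITY_RANGE_PCT[0]}% to {HUMIDITY_RANGE_PCT[1]}%")
    return problems


def largest_daily_change(readings):
    """Return the largest change between consecutive readings in a series."""
    pairs = zip(readings, readings[1:])
    return max((abs(later - earlier) for earlier, later in pairs), default=0)
```

Since stability matters more than the exact values, the largest daily change is probably the more useful of the two figures to watch.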

The archive should recopy each tape onto newer media according to a regular refreshment cycle. This should take place within the minimum time specified by the supplier, generally every five years. Because of the danger that all digital formats will become obsolescent, files should also be recopied every three to five years in the latest version of the software (e.g., Microsoft Word) or in updated image formats (TIFF or JPEG). One or two back-up copies should be created on 4 mm or 8 mm tape and stored separately from the master copy. Ideally, the alternate storage site(s) would be at a different location; if this is not possible, they should at least be in different rooms and, if possible, on different floors. The archive should read a random sample of its tapes each year. For a collection of 1,800 or fewer tapes, the NARA standard is to read a 20% sample or 50 tapes, whichever is larger. This seems reasonable for a small archive. If the annual sample reveals read “errors,” the archive should recopy the tapes that were copied at approximately the same time as the tapes with the read errors, as well as tapes in the same batch from the supplier. The archive should use comparable tapes purchased from different suppliers to guard against faults introduced by the media's suppliers into their products or into batches of their products. If possible, the archive should employ quality control procedures, such as bit/byte or other checksum comparisons of copies with their originals. Finally, maintain a log of all actions taken regarding copying and during the annual sample.
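
The sampling rule and a checksum comparison can both be expressed in a few lines of Python. The sketch below is a set of assumptions rather than a NARA procedure: the function names are hypothetical, SHA-256 is simply one widely available checksum, and reading every tape when the archive holds fewer than 50 is an interpretation of the rule, not something the rule states.

```python
import hashlib
import random


def sample_size(total_tapes):
    """Apply the rule for 1,800 or fewer tapes: 20% or 50, whichever is larger."""
    if total_tapes > 1800:
        raise ValueError("this sketch covers only collections of 1,800 or fewer tapes")
    twenty_percent = (total_tapes + 4) // 5  # 20%, rounded up
    # Assumption: an archive with fewer than 50 tapes reads all of them.
    return min(total_tapes, max(50, twenty_percent))


def choose_sample(tape_ids, seed=None):
    """Pick a random sample of tape identifiers for the annual read test."""
    tape_ids = list(tape_ids)
    return sorted(random.Random(seed).sample(tape_ids, sample_size(len(tape_ids))))


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a checksum of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same checksum can be treated as identical copies; recording the checksum in the log at the time of copying allows the same comparison to be repeated at each refreshment cycle.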

RESOURCES NEEDED

Any computer operation needs hardware, software, and people-ware. From the foregoing, a queer archive needs a standard desktop computer with a printer, a CD-readable drive, and a 4 mm or 8 mm tape drive. It also needs an external (and thus removable) CD-writeable drive. This external drive would be used to copy materials from a creator's computer that lacks a CD-writeable drive.

The software need not be extensive. Operating systems have utilities to report on the number of bytes in a file and to copy files from one medium to the hard drive and then to an output medium. The first special piece of software that the archive would need according to the above scenarios is software to convert PDF files to PDF/A files, once that becomes available (anticipated in 2007). The second piece is photo editing software, like Photoshop, to migrate digital photos from one version of TIFF or JPEG to the latest version.

For staff, many young people today are very computer literate. In accepting volunteers for the archive, one should assess the level of computer literacy. Particular attention should be paid to identifying individuals who have experience with operating systems rather than application programs and those who have hardware experience in installing peripheral devices, such as CD drives or tape drives.
