Monday, October 10, 2011

Repurposing embedded image metadata for DSpace batch loading (XMP to CSV to DSpace Dublin Core)

John Glenn climbs into Friendship 7
We are currently working on adding a new community to the Knowledge Bank - the Ohio Congressional Archives. One of the community's collections I am preparing for batch loading is the John H. Glenn Archives Photo Gallery. The gallery contains 537 images dating from 1917 to 2009 - a sampling of the more than five thousand photographs held in the John H. Glenn Archives. The digital images that we are archiving in the Knowledge Bank (KB) all contain embedded descriptive metadata added by the Archivist for the Ohio Congressional Archives.

Our routine process for batch loading involves creating a spreadsheet (.csv) containing the metadata and filename for each item. A stand-alone Java tool transforms the metadata contained in the spreadsheet into dublin_core.xml files and builds the simple archive format directory (metadata + content files) required for the DSpace item importer.
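For illustration, here is a rough Python sketch of that transformation (the real workhorse is the Java tool; the column names below are assumptions for the example, not our actual spreadsheet layout):

```python
import csv
import os
from xml.sax.saxutils import escape

def build_archive(csv_path, out_dir):
    """Build a DSpace simple archive format directory from a metadata CSV.

    Each CSV row becomes one item directory holding a dublin_core.xml file
    and a "contents" file naming the item's content file. Column names
    ("filename", "dc.title", "dc.date.issued") are illustrative; a real
    spreadsheet follows the collection's QDC mapping.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            item_dir = os.path.join(out_dir, f"item_{i:03d}")
            os.makedirs(item_dir, exist_ok=True)

            # One <dcvalue> element per non-empty dc.* column.
            dcvalues = []
            for column, value in row.items():
                if not column.startswith("dc.") or not value:
                    continue
                _, element, *qual = column.split(".")
                qualifier = qual[0] if qual else "none"
                dcvalues.append(
                    f'  <dcvalue element="{element}" qualifier="{qualifier}">'
                    f"{escape(value)}</dcvalue>"
                )
            with open(os.path.join(item_dir, "dublin_core.xml"), "w",
                      encoding="utf-8") as out:
                out.write("<dublin_core>\n" + "\n".join(dcvalues)
                          + "\n</dublin_core>\n")

            # The contents file lists the files to add as bitstreams.
            with open(os.path.join(item_dir, "contents"), "w",
                      encoding="utf-8") as out:
                out.write(row["filename"] + "\n")
```

The resulting directory of item subdirectories is what the DSpace item importer consumes.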

Working with the Archivist, I designed the Qualified Dublin Core (QDC) metadata for the KB image collection. My initial mock-ups of the collection incorporated much of the descriptive metadata the Archivist had added to the digital images using Adobe Photoshop. Although there was not a straight one-to-one relationship between the embedded metadata and the KB metadata, I certainly wanted to reuse the embedded metadata when building the batch loading spreadsheet. One possibility would be to have a staff member or student assistant manually copy and paste the image metadata into the KB spreadsheet, following a mapping of the Photoshop fields to the spreadsheet columns (QDC fields). That approach, however, would be very time-consuming and inefficient. I chose instead to investigate automated reuse of the embedded descriptive metadata - something we had not done before for the Knowledge Bank, and an opportunity to develop a new workflow for future projects.

I considered several options for extracting the XMP (and other embedded) metadata from the images and creating QDC metadata for batch loading into the KB. One option would be to extract the metadata as XML and write XSLT to transform it to DSpace Dublin Core. Going this route would bypass the spreadsheet step, and a short Perl script, rather than the standalone Java tool, would build the archive directory. Another option would be to export the metadata into a .csv file for use with the Java tool. Given that a few of the KB fields required metadata values not contained in the embedded image metadata, and those fields were not constant data, I opted to look for a way to export the embedded metadata into a .csv file. Using a metadata spreadsheet, rather than DSpace dublin_core.xml files, would allow students or staff to assist with enhancing the extracted metadata without having to work with the XML directly.

I looked at a few tools for exporting embedded metadata: ExifTool, a platform-independent Perl library plus a command-line application for reading, writing, and editing meta information in a wide variety of files; ImageMagick; and FITS, a Java toolset that wraps JHOVE, ExifTool, and several other format analysis tools.

ExifTool worked perfectly for what I needed: the descriptive embedded metadata for all of the collection's images exported to a .csv file. Running exiftool from the command line, I exported the embedded metadata for all of the images in one pass:

exiftool -csv -r t/images > out.csv

The -csv option pre-extracts information from all input files, produces a sorted list of available tag names as the column headers, and organizes the information under each tag. A "SourceFile" column is also generated. The features of the -csv option make it great for extracting all information from multiple images. The -r option recurses through all images in a hierarchy of directories.
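To give a sense of the output's shape (the file names and tag values below are made up for illustration), the export begins with a header row of tag names, followed by one row per image:

```
SourceFile,Creator,Description,Subject,Title
t/images/glenn_001.jpg,NASA,John Glenn climbs into Friendship 7,"Project Mercury, astronauts",Friendship 7
t/images/glenn_002.jpg,NASA,Launch of Friendship 7,"Project Mercury, launches",Friendship 7 Launch
```

Tags absent from a given image simply leave that cell empty in its row.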

Once I had the .csv output, I deleted the unwanted columns (data we would not be using) and renamed the column headers for the remaining data based on the QDC mapping for the collection. All that remained was to enhance three exported fields, change the character used for delimiting multiple values within a field, and add three fields not available in the embedded metadata. Success!
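Much of that cleanup can be scripted. Here is a rough Python sketch; the ExifTool column names, the QDC mapping, and the delimiter characters are illustrative assumptions, not the collection's actual mapping:

```python
import csv

# Illustrative mapping from ExifTool column headers to KB QDC columns;
# the real exported tag names and QDC mapping are assumptions here.
COLUMN_MAP = {
    "Title": "dc.title",
    "Description": "dc.description",
    "Subject": "dc.subject",
}

def enhance(in_path, out_path, old_delim=", ", new_delim="||"):
    """Keep only the mapped columns, rename the headers per the QDC
    mapping, and swap the character delimiting multiple values."""
    with open(in_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(COLUMN_MAP.values()))
        writer.writeheader()
        for row in rows:
            writer.writerow({
                new: row.get(old, "").replace(old_delim, new_delim)
                for old, new in COLUMN_MAP.items()
            })
```

Unmapped columns (SourceFile, camera settings, and so on) are dropped automatically, and the remaining manual work is limited to the handful of fields needing genuine human enhancement.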

Note: rather than deleting unwanted data, you can also do a 'targeted' CSV export:

exiftool -csv -title -rights -subject t/images > out.csv

Wednesday, June 29, 2011

A DSpace batch load for one item

The Curation System for DSpace (as of release 1.7) includes a Bitstream Format Profiler curation task by default. The task can be performed on any DSpace object (item, collection, or community). Running the profiler on an item examines all of the item's bitstreams and produces a table (profile) displayed in the Admin UI. For each named format, the result shows the count of bitstreams in the left column, followed in parentheses by a letter abbreviating the repository-assigned support level for the format (U = Unsupported, K = Known, S = Supported).

Given the number of files involved, I thought it would be fun to run this task on an item that we had batch loaded into the Knowledge Bank:

The item profiled above, the Índice crítico del teatro uruguayo (1808-1980) [Critical Index of Uruguayan Theater (1808-1980)], contains 2,895 bitstreams. However, as the item is archived as a Web site, only one bitstream (the index.html file) is displayed as a file via the public UI.

The Critical Index of Uruguayan Theater collects the archive produced between 1976 and 1980 by Graciela Míguez (1949-2000) and Abril Trigo. It consists of three interconnected parts: an inventory of authors and playwrights, an index of the theatrical plays attributed to them, and a set of critical-analytical reviews of an extensive and representative selection of plays. For nearly 30 years, Abril Trigo preserved the archive of typewritten records containing this unique cultural resource.

In 2008, The Ohio State University Libraries digitized and indexed the records for presentation on the Web. The Critical Index of Uruguayan Theater was archived in the Knowledge Bank in 2009. Due to the number of files, we batch loaded the item. The archive directory for the batch load contained just one item directory with the dublin_core.xml metadata file, the contents file listing the files to be added as bitstreams to the item, and the 1,448 content files (PDF, HTML, PNG, JPG, and CSS).  The total count of bitstreams profiled above includes 1,443 extracted text files (the 1,443 Plain Text) and 4 thumbnails (4 of the 7 JPEG files) generated post-load by the media filters.
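For a Web site item like this, the contents file is simply a plain-text list of every file to attach, one per line. An illustrative excerpt (aside from index.html, the file names are made up):

```
index.html
styles.css
pages/authors.html
images/title_page.png
```

The item importer attaches each listed file as a bitstream, so a single item directory can carry an entire site's worth of files.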

Monday, June 20, 2011

Crowdsourcing captions... image metadata from Antarctica

The Byrd Antarctic Expedition Photo Albums contain more than 3,500 images in 5 albums covering Richard E. Byrd's first and second expeditions to Antarctica in 1928-1930 and 1933-1935, respectively. Over 3,000 digitized images from the albums have been archived in the Knowledge Bank in the Byrd Antarctic Expedition Photo Albums collections.

Although all but a handful of the individual item records have generic keyword metadata, relatively few of the album images in the Knowledge Bank (less than 6%) have more specific captions as part of their metadata. A researcher in Antarctica recently sent our Polar Curator captions for an additional 122 Knowledge Bank images. I will be using the DSpace batch metadata editing feature to add the new captions to the existing item metadata. I am also looking at usage of the images relative to the quality of the metadata and to metadata enhancements over time.
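Batch metadata editing works over CSV: export the collection's item metadata, edit the cells, and re-import, and DSpace applies the changed values to the items. An illustrative fragment (the item id, handle, and choice of caption field are assumptions for this example; multiple values within a field are separated by ||):

```
id,collection,dc.description[en]
101,1811/XXXX,"Freddie Crockett holding one of the camp's favorite pets, Belle, after she was bitten in a fight."
```

Because only changed cells are applied on import, the existing keyword metadata is left untouched while the new captions are added.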

Existing keywords: expedition members (crew and personnel); books; recreation; equipment and supplies; furniture

New caption: "The Aviation Pilots in Conference in the Library". Left to right: Dean C. Smith, Alton N. Parker, Richard E. Byrd, Bernt Balchen, and Harold June.

Existing keywords: expedition members (crew and personnel); dogs; animals; equipment and supplies

New caption: Freddie Crockett holding one of the camp's favorite pets, Belle, after she was bitten in a fight.

Wednesday, June 15, 2011

The Ohio State University official student yearbook - the Makio

Congratulations to our new Ohio State alumni!

9,700 students graduated from The Ohio State University this past Sunday, June 12, 2011. As they look forward, we have an opportunity to look back:

We are currently adding digitized copies of the official student yearbook of The Ohio State University, the Makio, to the Knowledge Bank.
We started archiving the yearbooks with the first Makio published in 1880.

Left: page image from the Makio, Volume XI, 1891

"The Library contains over 10,000 volumes of valuable material, selected with special reference to the wants of the University" -- Makio, 1891

Below: page images from the Makio, Volume I, 1880