Monday, October 10, 2011

Repurposing embedded image metadata for DSpace batch loading (XMP to CSV to DSpace Dublin Core)

John Glenn climbs into Friendship 7
http://hdl.handle.net/1811/50452
We are currently working on adding a new community to the Knowledge Bank - the Ohio Congressional Archives. One of the community's collections I am preparing for batch loading is the John H. Glenn Archives Photo Gallery. The gallery contains 537 images dating from 1917 to 2009 - a sampling of the more than five thousand photographs held in the John H. Glenn Archives. The digital images that we are archiving in the Knowledge Bank (KB) all contain embedded descriptive metadata added by the Archivist for the Ohio Congressional Archives.

Our routine process for batch loading involves creating a spreadsheet (.csv) containing the metadata and filename for each item. A stand-alone Java tool transforms the metadata contained in the spreadsheet into dublin_core.xml files and builds the simple archive format directory (metadata + content files) required for the DSpace item importer.
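
For readers unfamiliar with the simple archive format: the DSpace item importer expects one directory per item, each containing a dublin_core.xml file, a contents file listing the item's content files, and the content files themselves. A minimal sketch (the directory and file names here are illustrative):

archive_directory/
    item_000/
        dublin_core.xml
        contents
        image_000.jpg

The dublin_core.xml file holds the item's metadata as one dcvalue element per field, for example:

<dublin_core>
    <dcvalue element="title" qualifier="none">John Glenn climbs into Friendship 7</dcvalue>
    <dcvalue element="date" qualifier="issued">1962-02-20</dcvalue>
</dublin_core>

and the contents file lists each content file (here, image_000.jpg) on its own line.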

Working with the Archivist, I designed the Qualified Dublin Core (QDC) metadata for the KB image collection. My initial mock-ups of the collection incorporated much of the descriptive metadata the Archivist had added to the digital images using Adobe Photoshop. Although there was not a straight one-to-one relationship between the embedded metadata and the KB metadata, I certainly wanted to re-use the embedded metadata when building the batch loading spreadsheet. One possibility would have been to have a staff member or student assistant manually copy and paste the image metadata into the KB spreadsheet, following a mapping of the Photoshop fields to the spreadsheet columns (QDC fields). That approach, however, would be very time-consuming and inefficient. I chose instead to investigate automated re-use of the embedded descriptive metadata - something we had not done before for the Knowledge Bank, and a chance to develop a new workflow that could be used for future projects.

I considered several options for extracting the XMP (and other embedded) metadata from the images and creating QDC metadata for batch loading into the KB. One option would be to extract the metadata as XML and write XSLT to transform it to DSpace Dublin Core. Going this route would bypass the spreadsheet step, and rather than the standalone Java tool, a short Perl script would be used to build the archive directory. Another option would be to export the metadata into a .csv file for use with the Java tool. Given that a few of the KB fields required metadata values not contained in the embedded image metadata, and that those values varied from item to item (so they could not be supplied as constant data), I opted to look for a way to export the embedded metadata into a .csv file. Using a metadata spreadsheet, rather than DSpace dublin_core.xml files, would allow students or staff to assist with enhancing the extracted metadata for the KB without having to work with the XML directly.
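
As a quick sketch of that first option: ExifTool can dump the embedded metadata as RDF/XML with its -X option (the filename below is just a placeholder), and the resulting XML could then be run through an XSLT stylesheet to produce the dublin_core.xml files.

exiftool -X image.jpg > image.xml

I mention it here only for completeness; the .csv route is the one I pursued.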

I looked at a few tools for exporting/extracting embedded metadata, including ExifTool (a platform-independent Perl library plus a command-line application for reading, writing, and editing meta information in a wide variety of files), ImageMagick, and FITS (a Java toolset that wraps JHOVE, ExifTool, and several other format analysis tools).

ExifTool worked perfectly for what I needed. Running exiftool.exe from the command line, I was able to export the descriptive embedded metadata for all of the collection's images to a single .csv file:


exiftool -csv -r t/images > out.csv


The -csv option pre-extracts information from all input files, produces a sorted list of available tag names as the column headers, and organizes the information under each tag. A "SourceFile" column is also generated. These features make the -csv option well suited to extracting all of the information from multiple images at once. The -r option recursively processes the images in a hierarchy of directories.
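
To give a sense of the output, the first rows of out.csv look something like this - the columns and values below are purely illustrative, since the actual tag columns depend on what is embedded in your images:

SourceFile,Description,Rights,Subject,Title
t/images/image_000.jpg,"John Glenn climbs into Friendship 7","(rights statement)","Project Mercury, Friendship 7","John Glenn climbs into Friendship 7"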

Once I had the .csv output, I deleted the unwanted columns (data we would not be using) and renamed the column headers for the remaining data based on the QDC mapping for the collection. All that remained was enhancing three of the exported fields, changing the character used to delimit multiple values within a field, and adding three fields not available in the embedded metadata. Success!
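
Two pieces of that cleanup can also be automated. ExifTool's -sep option sets the string used to join multiple values of a list-type tag, so the delimiter change can happen during the export itself - here "||" is just an example; use whatever delimiter your loading tool expects:

exiftool -csv -r -sep "||" t/images > out.csv

The column deletion and header renaming can likewise be scripted. The following Perl sketch, using the Text::CSV module, keeps only the columns that have a QDC mapping and writes a new spreadsheet with renamed headers; the %map entries are examples, not the actual mapping for this collection:

#!/usr/bin/perl
# Sketch: keep only the mapped ExifTool columns and rename the
# headers to their QDC column names. The %map entries are examples.
use strict;
use warnings;
use Text::CSV;

my %map = (
    SourceFile  => 'filename',
    Title       => 'dc.title',
    Description => 'dc.description',
    Subject     => 'dc.subject',
    Rights      => 'dc.rights',
);

my $csv = Text::CSV->new({ binary => 1, eol => "\n" }) or die "Text::CSV init failed";

open my $in,  '<', 'out.csv'      or die "out.csv: $!";
open my $out, '>', 'kb_batch.csv' or die "kb_batch.csv: $!";

# Read the header row and note the positions of the columns we keep.
my $header = $csv->getline($in);
my @keep = grep { exists $map{ $header->[$_] } } 0 .. $#$header;
$csv->print($out, [ map { $map{ $header->[$_] } } @keep ]);

# Copy each data row, keeping only the mapped columns.
while (my $row = $csv->getline($in)) {
    $csv->print($out, [ @{$row}[@keep] ]);
}

close $in;
close $out;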

Note: rather than deleting unwanted data, you can also do a 'targeted' CSV export:


exiftool -csv -title -rights -subject t/images > out.csv
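
The targeted export combines with the other options, so a recursive, targeted export with a custom multi-value delimiter can be done in one pass (again, "||" is only an example delimiter):

exiftool -csv -r -sep "||" -title -rights -subject t/images > out.csv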