This page last changed on Jul 28, 2010 by bbranan.
Description
The intent of this document is to detail the process for verifying the integrity of content that is copied onto hard drives to be shipped to a storage provider for bulk ingest.
At a high level, the basic approach is to generate a manifest of md5/fileId pairs while the content is still on the source network and again after the content has been copied to the drive.
These two manifests are then compared to ensure no corruption or file loss occurred in the copy process.
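To make the idea of md5/fileId pairs concrete, here is a minimal sketch of how such a manifest could be produced by hand. It is not how the BagIt software works internally; the function name make_manifest is hypothetical, and the sketch assumes GNU md5sum, find, sort, and xargs are available.

```shell
# Hypothetical helper (not part of BagIt): print one "md5  relative/path"
# line for every regular file under the directory given as $1.
# Assumes GNU coreutils/findutils (md5sum, sort -z, xargs -0 -r).
make_manifest() {
    (
        cd "$1" || exit 1
        # -L follows symlinks, so symlinked content directories are included.
        find -L . -type f -print0 | sort -z | xargs -0 -r md5sum |
            sed 's|  \./|  |'   # drop the leading "./" from each path
    )
}
```

Running make_manifest once against the source content and once against the copy on the drive yields two listings that should be identical if the copy is intact.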
The California Digital Library's BagIt standard and accompanying software will be used to generate the md5 manifests.
The BagIt software can create "bags" in two ways:
- The user points to a list of directories containing the content to be bagged, and BagIt copies all of it to a user-specified destination directory (operation:create)
- The user points to content that is already in the BagIt-defined directory structure, and BagIt creates the metadata and md5 manifests 'in-place' (operation:baginplace)
For operation:baginplace, the content must already be laid out as follows:
bag/
bag/data/
bag/data/[content]
where [content] is any number of user content files or nested directories of user content
After the operation completes, the bag looks like this:
bag/
bag/manifest-md5.txt
bag/bagit.txt
bag/tagmanifest-md5.txt
bag/bag-info.txt
bag/data/
bag/data/[content]
where the four txt files are generated by the BagIt software
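As a quick sanity check after baginplace has run, the layout listed above can be tested with a few lines of shell. This is only a sketch; check_bag_layout is a hypothetical helper, not a BagIt command, and it checks only that the expected entries exist, not that their contents are valid.

```shell
# Hypothetical helper (not part of BagIt): report any missing piece of the
# bag layout for the bag directory given as $1. Returns 0 if all are present.
check_bag_layout() {
    bag="$1"
    status=0
    [ -d "$bag/data" ] || { echo "missing directory: data/"; status=1; }
    for f in bagit.txt bag-info.txt manifest-md5.txt tagmanifest-md5.txt; do
        [ -f "$bag/$f" ] || { echo "missing file: $f"; status=1; }
    done
    return "$status"
}
```

The function prints one line per missing entry, so an empty output together with a zero exit status indicates that the bag has the expected shape.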
Approach
Assuming bulk ingest deals with large amounts of data (1 TB or more), the operation:create approach is probably not reasonable, because it copies all of the content into a new destination directory.
Given the simple directory structure that BagIt defines, DuraCloud instead proposes the operation:baginplace approach, using symlinks to place the content under the data/ directory.
Details
- User creates a bag directory and its data/ directory on a host with network access to the content
$mkdir -p {some-path}/source-bag/data
- User creates symlinks from within data/ directory to content directories
$cd {some-path}/source-bag/data
$ln -s {path-to-content-dir0}
$ln -s {path-to-content-dir1}
$ln -s {path-to-content-dirN}
- User runs BagIt operation:baginplace over the virtual bag to generate metadata and manifests
$./bag baginplace {some-path}/source-bag/
- User attaches the shipment drive and formats it as ext3 (or ext2)
$mkfs -t ext3 /dev/{sdb1} [or whatever steps are appropriate for your environment]
$fsck -f -y /dev/{sdb1} [or whatever...]
- User mounts formatted drive
$mkdir /mnt/dura
$mount /dev/{sdb1} /mnt/dura
- User copies contents referenced under the data/ directory onto the shipment drive
$cp -r {path-to-content-dir0} /mnt/dura
$cp -r {path-to-content-dir1} /mnt/dura
$cp -r {path-to-content-dirN} /mnt/dura
- User creates a bag and data directory at the top of the shipment drive
$mkdir -p /mnt/dura/duracloud-bag/data
- User creates symlinks from within data/ directory to the copied content directories on the drive, where {content-dirN} is the final path component of {path-to-content-dirN}
$cd /mnt/dura/duracloud-bag/data
$ln -s /mnt/dura/{content-dir0}
$ln -s /mnt/dura/{content-dir1}
$ln -s /mnt/dura/{content-dirN}
- User runs BagIt operation:baginplace over the virtual bag to generate metadata and manifests
$./bag baginplace /mnt/dura/duracloud-bag/
- User verifies content by manually comparing the two manifest-md5.txt files, or by running the helper utility manifest-verifier.jar (attached to the Release 0.1 page)
$java -jar manifest-verifier.jar {some-path}/source-bag/manifest-md5.txt /mnt/dura/duracloud-bag/manifest-md5.txt
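For the manual comparison option, a minimal sketch in shell follows. It assumes both manifests use the "checksum, whitespace, path" line format and that the paths match on both sides (as they do when the symlink names under data/ are the same). The function name compare_manifests is hypothetical and is not related to manifest-verifier.jar.

```shell
# Hypothetical helper: compare two md5 manifests regardless of line order.
# $1 and $2 are paths to manifest-md5.txt files.
# Returns 0 if they list the same checksum/path pairs, non-zero otherwise.
compare_manifests() {
    a=$(mktemp)
    b=$(mktemp)
    # Sort by path (field 2 onward) so the line order does not matter.
    sort -k2 "$1" > "$a"
    sort -k2 "$2" > "$b"
    rc=0
    diff "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}
```

Any line that diff prints points at a file whose checksum differs or that is present in only one manifest; no output and a zero exit status mean the two manifests agree.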