
Purpose

The Duplicate on Upload service can be used to copy a space (including all content and metadata) from the primary storage provider (currently Amazon S3) to another storage provider.

Options

The following options are available (a sketch of how they might be represented follows the list):

  • Space to duplicate - a space which should be copied in its entirety to a secondary storage provider
  • Destination Store - the storage provider to which the content will be copied
  • Space Name - the name of the replicated space
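
These options are collected as user input when the service is deployed. As a minimal sketch (the key names below are hypothetical stand-ins, not the service's actual parameter names), they might be carried as simple name/value pairs:

    import java.util.HashMap;
    import java.util.Map;

    public class DuplicationOptionsExample {

        // Hypothetical key names, used only to illustrate the three options;
        // the real service defines its own parameter names.
        public static Map<String, String> buildOptions() {
            Map<String, String> options = new HashMap<String, String>();
            options.put("sourceSpaceId", "my-space");      // Space to duplicate
            options.put("destStoreId", "1");               // Destination Store (storage provider ID)
            options.put("destSpaceId", "my-space-copy");   // Space Name for the replicated space
            return options;
        }
    }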

Notes

  • After duplication, the original and destination spaces should contain identical content item lists (a simple check of this invariant is sketched below)
  • Duplicated content items will include the full set of metadata available on the original content items
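
The first note describes an invariant that can be checked once the service completes. The sketch below assumes only that each store can list the content IDs in a space; the SpaceLister interface is a stand-in defined for the example, not the real StoreClient API:

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    public class DuplicationCheckExample {

        // Stand-in for the space-listing capability of a DuraCloud ContentStore;
        // defined here only so the example is self-contained.
        interface SpaceLister {
            Iterator<String> listContentIds(String spaceId) throws Exception;
        }

        // True when the source and destination spaces expose the same content IDs.
        static boolean sameContentIds(SpaceLister source, SpaceLister dest,
                                      String srcSpace, String destSpace) throws Exception {
            return toSet(source.listContentIds(srcSpace))
                    .equals(toSet(dest.listContentIds(destSpace)));
        }

        private static Set<String> toSet(Iterator<String> ids) {
            Set<String> set = new HashSet<String>();
            while (ids.hasNext()) {
                set.add(ids.next());
            }
            return set;
        }
    }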

Flow

  1. Service collects user input
    • space to replicate
    • store to replicate to
    • work space (for logs)
    • output space (for results file)
  2. Service starts up an Amazon Elastic MapReduce job via RunHadoopJobTaskRunner
    • Service passes the following info into the job:
      • set of input buckets
      • storeId of store to replicate to
      • work bucket
      • output bucket
      • duracloud account host/port/context
      • duracloud account username/password
    • Hadoop pulls files from S3 and splits them across mapper tasks
    • Each mapper processes its file by (see the sketch following this list):
      • picking up the file provided by Hadoop
      • using StoreClient to create connections to both the source and destination ContentStores
      • ensuring that the destination space exists
      • checking whether the destination space already contains the content item in question
        • if so, comparing checksums and copying the file only if they don't match
      • adding the S3 file to the destination space (if necessary)
      • retrieving the content metadata from the source content item and adding it to the destination content item
    • Reducer runs to collect the final output, which is stored in the output space
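
A rough sketch of the per-item mapper logic described above follows. The Store interface is a simplified stand-in for the DuraCloud ContentStore; its method names and signatures are assumptions made for illustration, not the real client API:

    import java.io.InputStream;
    import java.util.Map;

    public class DuplicationMapperSketch {

        // Simplified stand-in for the DuraCloud ContentStore; method names
        // and signatures here are illustrative only.
        interface Store {
            boolean spaceExists(String spaceId) throws Exception;
            void createSpace(String spaceId) throws Exception;
            boolean contentExists(String spaceId, String contentId) throws Exception;
            String getChecksum(String spaceId, String contentId) throws Exception;
            InputStream getContent(String spaceId, String contentId) throws Exception;
            void addContent(String spaceId, String contentId, InputStream content) throws Exception;
            Map<String, String> getContentMetadata(String spaceId, String contentId) throws Exception;
            void setContentMetadata(String spaceId, String contentId, Map<String, String> metadata) throws Exception;
        }

        // Work performed by each mapper for a single content item, following the flow above.
        static void duplicateItem(Store source, Store dest, String srcSpace,
                                  String destSpace, String contentId) throws Exception {
            // Ensure the destination space exists before writing into it
            if (!dest.spaceExists(destSpace)) {
                dest.createSpace(destSpace);
            }

            // Copy the file only if it is missing or its checksum differs
            boolean needsCopy = true;
            if (dest.contentExists(destSpace, contentId)) {
                String srcChecksum = source.getChecksum(srcSpace, contentId);
                String destChecksum = dest.getChecksum(destSpace, contentId);
                needsCopy = !srcChecksum.equals(destChecksum);
            }
            if (needsCopy) {
                InputStream content = source.getContent(srcSpace, contentId);
                try {
                    dest.addContent(destSpace, contentId, content);
                } finally {
                    content.close();
                }
            }

            // Carry the source item's metadata over to the destination item
            Map<String, String> metadata = source.getContentMetadata(srcSpace, contentId);
            dest.setContentMetadata(destSpace, contentId, metadata);
        }
    }

Comparing checksums before copying means a re-run of the service transfers only content that is missing or has changed in the destination space.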