Purpose
The Duplicate on Upload service can be used to copy a space (including all content and metadata) from the primary storage provider (currently Amazon S3) to another storage provider.
Options
The following options are available:
- Space to duplicate - The space which should be copied in its entirety to a secondary storage provider
- Destination Store - The storage provider to which the content will be copied
- Space Name - The name to give the duplicated space in the destination store
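As a rough illustration, these options amount to three name/value pairs handed to the service at deployment time. The key names in the sketch below are assumptions made for readability, not the service's actual configuration keys:

```java
import java.util.HashMap;
import java.util.Map;

public class DuplicationServiceConfig {

    // Key names are illustrative only; the deployed service defines its own.
    public static Map<String, String> exampleOptions() {
        Map<String, String> options = new HashMap<String, String>();
        options.put("spaceToDuplicate", "my-source-space");
        options.put("destinationStoreId", "1"); // storeId of the secondary provider
        options.put("destinationSpaceName", "my-source-space-copy");
        return options;
    }
}
```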
Notes
- After duplication, the original and destination spaces should contain identical content item lists
- Duplicated content items will include the full set of metadata available on the original content items
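Since the listings should match exactly, a simple set comparison is enough to spot-check a completed duplication. A minimal sketch, assuming the content ID listings for each space have already been retrieved (for example with the DuraCloud store client):

```java
import java.util.HashSet;
import java.util.Set;

public class SpaceListingCheck {

    /**
     * Returns true when both spaces contain exactly the same content IDs.
     * Retrieving the listings is left to the caller; this sketch only
     * performs the comparison.
     */
    public static boolean listingsMatch(Set<String> sourceIds, Set<String> destIds) {
        Set<String> missingFromDest = new HashSet<String>(sourceIds);
        missingFromDest.removeAll(destIds);

        Set<String> unexpectedInDest = new HashSet<String>(destIds);
        unexpectedInDest.removeAll(sourceIds);

        return missingFromDest.isEmpty() && unexpectedInDest.isEmpty();
    }
}
```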
Flow
- Service collects user input
  - space to replicate
  - store to replicate to
  - work space (for logs)
  - output space (for results file)
- Service starts up an Amazon Elastic MapReduce job via the RunHadoopJobTaskRunner
- Service passes the following info into the job (see the parameter-map sketch after this list):
  - set of input buckets
  - storeId of the store to replicate to
  - work bucket
  - output bucket
  - DuraCloud account host/port/context
  - DuraCloud account username/password
- Hadoop pulls files from S3 and splits them among mapper functions
- Each mapper runs by (the per-item logic is sketched after this list):
  - picking up the file provided by Hadoop
  - using StoreClient to create connections to both the source and destination ContentStores
  - ensuring that the destination space exists
  - checking to see if the destination space already contains the content item in question
    - if so: compare checksums and copy the file only if they don't match
  - adding the S3 file to the destination bucket (if necessary)
  - retrieving the content metadata from the source content item and adding it to the destination content item
- Reducer runs to collect final output, which is stored in output space
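For concreteness, the hand-off to the Hadoop job can be pictured as building a flat parameter map. Every key name below is an assumption used for illustration; the actual keys the service passes to the job may differ:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicationJobParams {

    // Builds the parameter set handed to the Hadoop job.
    // All key names here are illustrative assumptions.
    public static Map<String, String> build(List<String> inputBuckets,
                                            String destStoreId,
                                            String workBucket,
                                            String outputBucket,
                                            String dcHost,
                                            String dcPort,
                                            String dcContext,
                                            String dcUsername,
                                            String dcPassword) {
        Map<String, String> params = new HashMap<String, String>();
        params.put("inputBuckets", String.join(",", inputBuckets));
        params.put("destStoreId", destStoreId);
        params.put("workBucket", workBucket);
        params.put("outputBucket", outputBucket);
        params.put("dcHost", dcHost);
        params.put("dcPort", dcPort);
        params.put("dcContext", dcContext);
        params.put("dcUsername", dcUsername);
        params.put("dcPassword", dcPassword);
        return params;
    }
}
```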
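The per-item work each mapper performs can be sketched in plain Java. The ContentStore interface declared below is a simplified stand-in for the DuraCloud store client (its method names and signatures are assumptions), but the control flow mirrors the steps above: ensure the destination space exists, skip items whose checksums already match, copy the bytes when needed, and carry the source metadata across.

```java
import java.util.Map;

public class DuplicationMapperSketch {

    // Simplified stand-in for the DuraCloud store client; not its actual API.
    interface ContentStore {
        boolean spaceExists(String spaceId);
        void createSpace(String spaceId);
        boolean contentExists(String spaceId, String contentId);
        String getContentChecksum(String spaceId, String contentId);
        byte[] getContent(String spaceId, String contentId);
        Map<String, String> getContentMetadata(String spaceId, String contentId);
        void addContent(String spaceId, String contentId, byte[] content,
                        Map<String, String> metadata);
        void setContentMetadata(String spaceId, String contentId,
                                Map<String, String> metadata);
    }

    /** Duplicates a single content item, returning a one-line status. */
    public String duplicate(ContentStore source, ContentStore dest,
                            String sourceSpace, String destSpace, String contentId) {
        // Ensure the destination space exists before writing to it
        if (!dest.spaceExists(destSpace)) {
            dest.createSpace(destSpace);
        }

        // If the destination already holds the item, copy only when checksums differ
        if (dest.contentExists(destSpace, contentId)) {
            String sourceChecksum = source.getContentChecksum(sourceSpace, contentId);
            String destChecksum = dest.getContentChecksum(destSpace, contentId);
            if (sourceChecksum != null && sourceChecksum.equals(destChecksum)) {
                // Content matches; still carry the source metadata across
                dest.setContentMetadata(destSpace, contentId,
                        source.getContentMetadata(sourceSpace, contentId));
                return contentId + ": skipped (checksums match)";
            }
        }

        // Copy the content along with the full set of metadata from the source item
        dest.addContent(destSpace, contentId,
                source.getContent(sourceSpace, contentId),
                source.getContentMetadata(sourceSpace, contentId));
        return contentId + ": copied";
    }
}
```

Each returned status line stands in for the per-item results that the reducer would then gather into the results file written to the output space.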