
Introduction

Because DuraCloud users can store large numbers of files in the DuraCloud system, it is necessary to be able to run processing jobs over large data sets. This can be done in a variety of ways, but the primary distinction is between services which take advantage of the capabilities of underlying cloud provider offerings and services which require a DuraCloud-provided strategy for managing a distributed processing environment.

Cloud Provider offerings

Amazon

Amazon's Elastic Map Reduce offering makes use of the Hadoop project available from Apache. Amazon's service manages the server cluster on which Hadoop is run, while users provide the code that performs the actual computations using the Map/Reduce algorithm.
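
The user-provided code is written against the Hadoop API. As a rough illustration only (this is not the actual base-processor or image-conversion-processor code; the class name, record format, and processFile helper are hypothetical), a map task of the kind supplied to Elastic Map Reduce might look like this:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /*
     * Minimal sketch of a user-supplied map task. Each input record is
     * assumed to be a line naming a file to process; the map step emits
     * the file name along with a placeholder processing result.
     */
    public class FileProcessingMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String filePath = value.toString().trim();
            if (filePath.isEmpty()) {
                return; // skip blank input lines
            }
            // processFile() stands in for the real per-file work
            // (image conversion, checksum calculation, etc.)
            String result = processFile(filePath);
            context.write(new Text(filePath), new Text(result));
        }

        private String processFile(String filePath) {
            // Placeholder: a real processor would fetch the file from S3,
            // transform or verify it, and return a status or checksum.
            return "processed";
        }
    }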

DuraCloud makes use of this capability in the Bulk Image Transformer, the Duplicate on Demand service, and the Bulk Bit Integrity Checker.

A DuraCloud service which uses Amazon's Elastic Map Reduce capability, such as the Bulk Image Transformer, is made up of three parts:

  1. The Processor Jar. This is a project which extends the base-processor project (the image-conversion-processor project does this for the Bulk Image Transformer). This code extends the Hadoop framework to perform processing as part of the Map/Reduce algorithm. This Jar file is included as a resource within the DuraCloud service, then moved to S3 as part of service deployment to make it available to Amazon Elastic Map Reduce.
  2. The DuraCloud Service, such as the bulkimageconversionservice project, is what is stored in the DuraCloud service registry and deployed into the DuraCloud service OSGi container. This code handles:
    1. Taking in all of the necessary parameters from the user
    2. Moving the processor jar and any bootstrap scripts to S3 storage
    3. Providing feedback to the user to indicate how the job is progressing
  3. The DuraCloud Tasks. There are three tasks which communicate directly with the Elastic Map Reduce system to actually run the Hadoop jobs (a rough sketch of these calls appears after this list). This code is contained in the S3StorageProvider project, under the tasks package. New services making use of Hadoop may or may not need to update this code, depending on whether there are job-specific parameter values that need to be processed. The tasks are:
    1. Run Hadoop Job Task - Actually starts the Hadoop job in Elastic Map Reduce with the parameters provided by the user and the processing jar and bootstrap script locations
    2. Describe Hadoop Job Task - Gets information about a Hadoop job in process, in order to provide this information back to the user
    3. Stop Hadoop Job Task - Stops a Hadoop job that is in process. If this call is made to a Hadoop job that is already complete, it is disregarded.
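
For illustration, the following sketch shows roughly how these three tasks could map onto calls to Elastic Map Reduce using the AWS SDK for Java. It is not the actual task code from the S3StorageProvider project; the class name, credentials, bucket paths, and instance settings are placeholders, and the real tasks also handle job-specific parameters.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.model.DescribeJobFlowsRequest;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowDetail;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;
    import com.amazonaws.services.elasticmapreduce.model.TerminateJobFlowsRequest;

    /*
     * Sketch of the Elastic Map Reduce calls behind the three tasks.
     * All names and settings below are placeholders.
     */
    public class HadoopJobTaskSketch {

        private final AmazonElasticMapReduceClient emrClient =
            new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // Run Hadoop Job Task: start a job flow using the processor jar in S3
        public String runJob(String processorJarS3Path,
                             String inputPath,
                             String outputPath) {
            HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar(processorJarS3Path)
                .withArgs(inputPath, outputPath);

            StepConfig step = new StepConfig()
                .withName("duracloud-processing-step")
                .withHadoopJarStep(jarStep)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

            JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withInstanceCount(4)
                .withMasterInstanceType("m1.small")
                .withSlaveInstanceType("m1.small");

            RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("duracloud-hadoop-job")
                .withSteps(step)
                .withInstances(instances)
                .withLogUri("s3://example-bucket/logs/");

            RunJobFlowResult result = emrClient.runJobFlow(request);
            return result.getJobFlowId();
        }

        // Describe Hadoop Job Task: report the current state of a job
        public String describeJob(String jobFlowId) {
            DescribeJobFlowsRequest request =
                new DescribeJobFlowsRequest().withJobFlowIds(jobFlowId);
            JobFlowDetail detail =
                emrClient.describeJobFlows(request).getJobFlows().get(0);
            return detail.getExecutionStatusDetail().getState();
        }

        // Stop Hadoop Job Task: terminate a job flow that is still in process
        public void stopJob(String jobFlowId) {
            emrClient.terminateJobFlows(
                new TerminateJobFlowsRequest().withJobFlowIds(jobFlowId));
        }
    }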

Microsoft Azure

Microsoft's Dryad project, which is still in the research phase, is described as a very generic graph generation and processing engine with the potential to be very powerful. Dryad should be able to handle the Map/Reduce algorithm as well as many other types of processing algorithms. Of course, with this added flexibility comes added complexity.

As noted here, Dryad is not yet available on Azure. So far there do not appear to be any published tests using Dryad.

A good comparison of Hadoop and Dryad

According to this article, Microsoft may be considering the use of Hadoop within Azure.

DuraCloud Distributed Processing

More to come...
