Batch processing on AWS allows on-demand provisioning of a multi-part job processing architecture. The architecture can be deployed immediately or on a delayed schedule as a heterogeneous, scalable "grid" of worker nodes that quickly works through large batch-oriented applications. Existing workloads that can use this style of on-demand processing include claims processing, large-scale transformation, media transcoding, and multi-part data processing.

Batch processing architectures often have highly variable usage patterns, with significant usage peaks (e.g., month-end processing) followed by long periods of underutilization.

There are numerous approaches to building a batch processing architecture. This document outlines a basic batch processing architecture that supports job scheduling, job status inspection, uploading raw data, outputting job results, grid management, and reporting job performance data.


| # | Description |
|---|-------------|
| 1 | Users interact with the Job Manager application, which is deployed on an Amazon Elastic Compute Cloud (EC2) instance. This component controls the process of accepting, scheduling, starting, managing, and completing batch jobs. It also provides access to the final results, job and worker statistics, and job progress information. |
| 2 | Raw job data is uploaded to Amazon Simple Storage Service (S3), a highly available and persistent data store. |
| 3 | Individual job tasks are inserted by the Job Manager into an Amazon Simple Queue Service (SQS) input queue on the user's behalf. |
| 4 | Worker nodes are Amazon EC2 instances deployed in an Auto Scaling group. This group is a container that ensures the health and scalability of worker nodes. Worker nodes pick up job parts from the input queue automatically and perform single tasks that are part of the list of batch processing steps. |
| 5 | Interim results from worker nodes are stored in Amazon S3. |
| 6 | Progress information and statistics are stored in the analytics store. This component can be either an Amazon SimpleDB domain or a relational database such as an Amazon Relational Database Service (RDS) instance. |
| 7 | Optionally, completed tasks can be inserted into an Amazon SQS queue for chaining to a second processing stage. |
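
The flow through the numbered steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: it makes no AWS calls, and in-memory objects stand in for Amazon S3, the SQS queues, and the analytics store. All names (`BatchGrid`, `submit_job`, `run_worker`, `job_progress`), the key layout, and the message fields are hypothetical.

```python
import json
import queue
from dataclasses import dataclass, field


@dataclass
class BatchGrid:
    """In-memory stand-ins (assumptions) for the managed services."""
    object_store: dict = field(default_factory=dict)                # stands in for Amazon S3
    input_queue: queue.Queue = field(default_factory=queue.Queue)   # stands in for the SQS input queue
    output_queue: queue.Queue = field(default_factory=queue.Queue)  # stands in for the optional chaining queue
    analytics: list = field(default_factory=list)                   # stands in for SimpleDB or RDS


def submit_job(grid, job_id, chunks):
    """Steps 2-3: store the raw data, then enqueue one task message per chunk."""
    for index, chunk in enumerate(chunks):
        raw_key = f"{job_id}/raw/{index}"  # hypothetical key layout
        grid.object_store[raw_key] = chunk
        message = {"job_id": job_id, "task": index, "raw_key": raw_key}
        grid.input_queue.put(json.dumps(message))
    return len(chunks)


def run_worker(grid, worker_id, process, chain=False):
    """Steps 4-7: take tasks from the input queue until it is empty."""
    completed = 0
    while True:
        try:
            body = grid.input_queue.get_nowait()
        except queue.Empty:
            return completed
        task = json.loads(body)
        # Step 5: store the interim result next to the raw data.
        result_key = f"{task['job_id']}/results/{task['task']}"
        grid.object_store[result_key] = process(grid.object_store[task["raw_key"]])
        # Step 6: record progress information for the Job Manager.
        grid.analytics.append(
            {"worker": worker_id, "job_id": task["job_id"], "task": task["task"]}
        )
        # Step 7 (optional): hand the completed task to a second stage.
        if chain:
            grid.output_queue.put(
                json.dumps({"job_id": task["job_id"], "result_key": result_key})
            )
        completed += 1


def job_progress(grid, job_id, total_tasks):
    """Step 1: fraction of a job's tasks recorded as complete."""
    done = sum(1 for row in grid.analytics if row["job_id"] == job_id)
    return done / total_tasks


if __name__ == "__main__":
    grid = BatchGrid()
    total = submit_job(grid, "job-1", ["alpha", "beta", "gamma"])
    run_worker(grid, "worker-a", process=str.upper, chain=True)
    print(job_progress(grid, "job-1", total))  # prints 1.0
```

In a real deployment each worker would run this loop on its own EC2 instance, and the queue service, not the worker, would be responsible for redelivering tasks that a failed worker never finished. The sketch leaves that behaviour out.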