Robust Batch Processing with PHP (part 1/2)

I submitted a proposal for php|tek 2008 entitled "Robust Batch Processing with PHP." Granted, the schedule has not been posted yet, so I don't know if my talk has even been accepted, but I wanted to formulate some thoughts around the topic for a long-overdue blog post. So, first things first: what is batch processing?

Let's look at some Wikipedia definitions on the topic:

"Batch processing is execution of a series of programs ("jobs") on a computer without human interaction."

...and...

"Batch jobs are set up so they can be run to completion without human interaction, so all input data is preselected through scripts or commandline parameters. This is in contrast to interactive programs which prompt the user for such input."

So, no human interaction, which means they're generally running out of a scheduler, such as crond.

In this post, I'm going to talk about batch processing in relation to web-based applications. Given this, what are some examples of common batch processing used in web-based applications? Here are a few:

  • Sending of emails
  • Video transcoding
  • Generating image thumbnails
  • Communication with third-party services
  • Processing post-authorization of credit/debit card and online check transactions

...just to name a few. Typically, tasks such as these would be done with PHP scripts run from the command line, either interactively or from a scheduler such as cron or at. In some cases, you may even go so far as to set aside a dedicated machine (or machines) to perform these operations (this is my preferred method).

"Why would I want to do any of these things in batch?" you ask? Here are some reasons:

  • To keep front end webservers doing what they do best: serving requests!
  • To allow for graceful handling of failure in the form of retrying the operation against a third-party vendor. For example, if your credit card processing vendor is down for maintenance, but you want to post-authorize credit card payments, you want to wait a little bit and try again. This is easiest done in a batch process. The alternative would be to, say, charge the user during their actual HTTP request and retry over and over until the post-authorization request completed. This is a lousy user experience.
  • Sending emails from front end webservers is just silly and wasteful. Why make SMTP connections from these webservers? Send emails in batch on the back end so you can handle failure cases, hard and soft bounces of messages and so on.

Batch processing is tricky because it's non-interactive. Jobs such as the examples above will run at least once a day, and more often than not, they'll run every few minutes or hours. Maybe you only post-authorize credit cards every six hours, but you would definitely transcode videos around or send emails all throughout a day in order to keep your site "living."

What are some of the challenges with batch processing?

  • Developers need to be made aware of problems
  • Processing needs to be retried if any sort of failure occurred
  • Detailed logs of job executions must be kept so developers can investigate failures and successes; you should leave a full audit trail to anyone can track down the lifecycle of processing
  • These batch jobs should be easy for developers to develop. Imagine duplicating logging code across all of your batch processes -- you don't want to repeat yourself!

Point here is that if any of your processes is failing, your developers should be made aware of it immediately, or at least sometime shortly after the failure. How do we handle these requirements?

Define error levels

First, what are the different types of errors that we have? Well, in my experience, they're similar to the Syslog priority levels. These are made available in PHP for use with trigger_error() using some pre-defined constants.

Out of these pre-defined constants, you have these main levels:

  • debug
  • info
  • notice
  • warning
  • error
  • fatal

What do we do with errors of these levels? Let's say that developers should only be emailed for anything warning or above. Anything else should just be written to the log.

Making developers aware of failures

When you think of the best way to notify developers of problems during processing, what comes to mind first? ... ... ... what was that? Email? Yes, email. So, those warning-level error messages just just spoke about...all of those should be emailed to the developer at the completion of the process.

Now, it's not the only option, but it may be the most obvious. If your transaction to post-authorize a credit card fails, your developers should be made aware of it right away so someone can contact the vendor, or, say, identify firewall issues in your environment. Similarly, if your video transcoding server(s) is/are down, videos can't be transcoded -- someone needs to be made aware of that! Let's email them.

Let's take this rough example code:

$config = array('foo' => 'bar', 'baz' => 'bop'); // Config options
$batch = Batch::getInstance($config);
$vendor = Some_Billing_Processor::factory();
$accounts = Foo::getAccountsForPostAuth();

foreach ($accounts as $account) {
    $accountId = $account->getId();
    $amountToBill = Foo::getPreAuthorizationAmount($account);

    try {
        if ($vendor->postAuthorize($account)) {
            $batch->info("[$accountId] billed $amountToBill");
        } else {
            $batch->warning("[$accountId] failed to bill $amountToBill");
            // Record failure so processing can be retried
        }
    } catch (Vendor_Exception $e) {
        $batch->error(
            "[$accountId] caught exception during vendor communication");
        // Record failure so processing can be retried
    }
}

In this case, we've raised a warning for the failed post-auth, and we raised an error if an exception was caught (i.e. inability to connect to the vendor's service). The info() call won't result in an email to the developer, though, but that message will be logged. Bottom line...if we can't take the customer's money, a developer needs to address the situation soon.

In the cases of the warning and error, these log entries will be emailed to the error email recipient(s) upon completion of the script.

Another alternative here would be to never email warnings when they occur, but write a separate script that parses logs for warnings, rolls them up into one message body, and emails the developers every few hours. This keeps the email traffic down, and ultimately keeps your developers from thinking that their back end scripts "cry wolf" by being too chatty.

What about those logs you speak of?

Everyone's been in a situation where they have a single directory full of log files. These can be either foo.log, foo.log.1, or even foo.log.20071114 ...or any number of naming conventions. Even worse is a single log file for a process that just grows and grows. Log rotation is an easy fix for these scenarios.

Personally, I feel that this is bad practice. I tend to prefer date-based directory names for storing log files. In my opinion, planning for this from the start of your project is far better than having to react in a knee-jerk fashion later on once you've filled a directory or reached some sort of maximum file size limit on your filesystem. Consider this directory and the files in it:

/var/log/cc_auth
    pre_auth.log.20071114
    post_auth.log.20071114
    pre_auth.log.20071115
    post_auth.log.20071115
    pre_auth.log.20071116
    post_auth.log.20071116
    pre_auth.log.20071117
    post_auth.log.20071117
    pre_auth.log.20071118
    post_auth.log.20071118

Messy, right? In this example, you end up with a few downsides:

  • A lot of files in each directory
  • Potential to hit Unix max files per directory limit (on ext2 and some older/other filesystems)
  • Date-based filenames are cumbersome to type (or even auto-complete in your Unix shell)

Personally, I prefer a structure using date-based directories like so:

/var/log/cc_auth/2007/11/14
    pre_auth.log
    post_auth.log
/var/log/cc_auth/2007/11/15
    pre_auth.log
    post_auth.log
/var/log/cc_auth/2007/11/16
    pre_auth.log
    post_auth.log
/var/log/cc_auth/2007/11/17
    pre_auth.log
    post_auth.log
/var/log/cc_auth/2007/11/18
    pre_auth.log
    post_auth.log

In this situation, you've got a clean structure laid out in directories on disk. Now, you could make an argument that you use more inodes, but that's a weak argument. Point here being...nice and pretty, right?

What are some other useful things to log during batch job execution?

The more useful data you can log, the better (within reason, of course). Here are some handy examples:

  • PID
  • Start time of job
  • End time of job
  • Elapsed time
  • Number of notices, warnings, errors, etc.

To illustrate, here's a log entry for a script running on a batch job class that I built at work:

(5527) ------------------------------------
(5527)   Hostname: articuno (batch)
(5527)     Script: /data/baz/deploy/batch/Foo/Bar/some_script.php
(5527)   Log File: /data/baz/log/Foo/Bar/2007/11/13/some_script.log
(5527)      Start: 2007-11-13 02:39:02 GMT
(5527) ------------------------------------
(5527) [2007-11-13 02:39:02] [info] locked 1 items for copyright scanning
(5527) [2007-11-13 02:39:02] [info] [3EC7539BF9F0C72EE040050AEE042902] performing copyright scan; entity type id = 3; name = High sound TR TONE.mp3; scanning file = /foo/bar.mp3; mime type = audio/mpeg
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] entity is not copyrighted
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] removing entity from pending state
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] copied files in temporary storage to public storage
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] deleted all files in temporary storage
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] set permissions on entity's public storage directory
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] set copyright scan outcome
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] removed entity from upload queue
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] queued cdn purge of entity's urls
(5527) [2007-11-13 02:39:18] [info] released lock for process articuno (batch):5527
(5527) [2007-11-13 02:39:18] [info] found 0 copyrighted entities
(5527) ------------------------------------
(5527)       End: 2007-11-13 02:39:18 GMT
(5527)   Elapsed: 16.472630s
(5527) ------------------------------------

At first glance, there are a bunch of things that are really clear from this log entry:

  • The process ID is 5527
  • The entire execution took about 16.5 seconds
  • We see what is being processed, and an entry for every action taken along with its success (or failure)

Now, this is a pretty useful, but successful, log entry. Let's take a look at a failure case, shall we?

(906) ------------------------------------
(906)   Hostname: sentret (video01)
(906)     Script: /data/baz/deploy/batch/Foo/Bar/transcode_videos.php
(906)   Log File: /data/baz/log/Foo/Bar/2007/11/07/transcode_videos.log
(906)      Start: 2007-11-07 15:58:02 GMT
(906) ------------------------------------
(906) [2007-11-07 15:58:02] [info] locked 2 items of type 4
(906) [2007-11-07 15:58:02] [info] [3D3A7D11E19B8906E040050AEE04323B] Starting flv transcode for 3gp and thumbnails
(906) [2007-11-07 15:59:04] [info] [3D3A7D11E19B8906E040050AEE04323B] Finished
(906) [2007-11-07 15:59:04] [info] [3D3A7D11E19B8906E040050AEE04323B] Starting preview flv transcode for entity page
(906) [2007-11-07 15:59:31] [notice] [3D3A7D11E19B8906E040050AEE04323B] error transcoding video to flash; skipping video; path = /foo/Croud.flv; message = Encoding process encountered an error
(906) [2007-11-07 15:59:31] [info] [3E48BC06162F4A8CE040050AEE042BCC] Starting flv transcode for 3gp and thumbnails
(906) [2007-11-07 15:59:56] [notice] [3E48BC06162F4A8CE040050AEE042BCC] error transcoding video to flash; skipping video; path = /foo/DBA_27108.gif; message = Could not create the flix handle - flixd unreachable (not running); flix result = -9
(906) [2007-11-07 15:59:56] [error] [3E48BC06162F4A8CE040050AEE042BCC] error connecting to the flix engine; skipping video; path = /foo/DBA_27108.gif; message = Could not create the flix handle - flixd unreachable (not running); flix result = -9
(906) [2007-11-07 15:59:57] [info] released lock for process sentret (video01):906
(906) ------------------------------------
(906)       End: 2007-11-07 15:59:57 GMT
(906)   Elapsed: 114.864283s
(906) ------------------------------------

In this case, we see that an error occurred. The developers would receive an email reading:


(906) ------------------------------------
(906)   Hostname: sentret (video01)
(906)     Script: /data/baz/deploy/batch/Foo/Bar/transcode_videos.php
(906)   Log File: /data/baz/log/Foo/Bar/2007/11/07/transcode_videos.log
(906)      Start: 2007-11-07 15:58:02 GMT
(906) ------------------------------------
(906) [2007-11-07 15:59:56] [error] [3E48BC06162F4A8CE040050AEE042BCC] error connecting to the flix engine; skipping video; path = /foo/DBA_27108.gif; message = Could not create the flix handle - flixd unreachable (not running); flix result = -9
(906) ------------------------------------
(906)       End: 2007-11-07 15:59:57 GMT
(906)   Elapsed: 114.864283s
(906) ------------------------------------

We maintain a setting for "minimum email log level," which defaults to warnings. So, that's what allows us to email anything at warning-level or higher to the developers that can address the situation. Alternatively, we could set that level to email developers on anything at notice-level or above. It's all configurable in the batch framework.

Similarly, we define a default exception handler and an error handler to trap uncaught exceptions and errors from PHP. Having an exception handler, for example, allows us to catch all exceptions, log their being uncaught, and email the developers to let them know of the problem. Likewise, PHP notices or warnings are logged and emailed if applicable, too.

We've definitely achieved our goal of making developers aware of problems!

So, this all looks great, Brian, but how can I get my hands on it?

Well, at this time, I'm not at liberty to release any of this code. Perhaps it's worth submitting a Zend Framework proposal to keep it in Userland, or even a PEAR2 module.

Even still, let's assume that we'll want the following:

  • Parsing of command line options (short and long)
  • Lock file support
  • Email recipient(s) on errors (you could even, say, send SMS messages!)
  • Flexible logging in date-based directories or files, or any arbitrary structure
  • Ability to define levels at which emails are generated
  • Easy way to use batch functionality in any batch script

On the database side of things, let's consider these requirements:

  • Ability to delay retry of processing for a specified amount of time
  • Ability to retry up to X times, then cease retries

I've had this post brewing for a long time now, so I'm going to deem this one "part one of two" and address some of the points above in a second post on the topic. The database portion alone is pretty lengthy. I also haven't heard back on php|tek acceptance at this point, but if I get accepted, I'll definitely be bringing some more cohesion to this topic.

If you have any questions or comments, just ask! I'm also going to send a PEAR2 proposal post-Thanksgiving, so heads up!