Category: Scalability

Robust Batch Processing with PHP (part 1/2)

I submitted a proposal for php|tek 2008 entitled “Robust Batch Processing with PHP.” Granted, the schedule has not been posted yet, so I don’t know if my talk has even been accepted, but I wanted to formulate some thoughts around the topic for a long-overdue blog post.

So, first things first: what is batch processing?

Let’s look at some Wikipedia definitions on the topic:

“Batch processing is execution of a series of programs (“jobs”) on a computer without human interaction.”

…and…

“Batch jobs are set up so they can be run to completion without human interaction, so all input data is preselected through scripts or commandline parameters. This is in contrast to interactive programs which prompt the user for such input.”

So, no human interaction, which means they’re generally running out of a scheduler, such as crond.

In this post, I’m going to talk about batch processing in relation to web-based applications. Given this, what are some examples of common batch processing used in web-based applications? Here are a few:

  • Sending of emails
  • Video transcoding
  • Generating image thumbnails
  • Communication with third-party services
  • Processing post-authorization of credit/debit card and online check transactions

…just to name a few. Typically, tasks such as these would be done with PHP scripts run from the command line, either interactively or from a scheduler such as cron or at. In some cases, you may even go so far as to set aside a dedicated machine (or machines) to perform these operations (this is my preferred method).

“Why would I want to do any of these things in batch?” you ask? Here are some reasons:

  • To keep front end webservers doing what they do best: serving requests!
  • To allow for graceful handling of failure in the form of retrying the operation against a third-party vendor. For example, if your credit card processing vendor is down for maintenance, but you want to post-authorize credit card payments, you want to wait a little bit and try again. This is easiest done in a batch process. The alternative would be to, say, charge the user during their actual HTTP request and retry over and over until the post-authorization request completed. This is a lousy user experience.
  • Sending emails from front end webservers is just silly and wasteful. Why make SMTP connections from these webservers? Send emails in batch on the back end so you can handle failure cases, hard and soft bounces of messages and so on.

Batch processing is tricky because it’s non-interactive. Jobs such as the examples above will run at least once a day, and more often than not, they’ll run every few minutes or hours. Maybe you only post-authorize credit cards every six hours, but you would definitely transcode videos or send emails throughout the day in order to keep your site “living.”

What are some of the challenges with batch processing?

  • Developers need to be made aware of problems
  • Processing needs to be retried if any sort of failure occurred
  • Detailed logs of job executions must be kept so developers can investigate failures and successes; you should leave a full audit trail so anyone can track down the lifecycle of processing
  • These batch jobs should be easy for developers to develop. Imagine duplicating logging code across all of your batch processes — you don’t want to repeat yourself!

Point here is that if any of your processes is failing, your developers should be made aware of it immediately, or at least sometime shortly after the failure. How do we handle these requirements?

Define error levels

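First, what are the different types of errors that we have? Well, in my experience, they’re similar to the Syslog priority levels. These are made available in PHP for use with syslog() as pre-defined constants (LOG_DEBUG, LOG_INFO, LOG_NOTICE and so on); trigger_error() has its own, smaller set (E_USER_NOTICE, E_USER_WARNING, E_USER_ERROR).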

Out of these pre-defined constants, you have these main levels:

  • debug
  • info
  • notice
  • warning
  • error
  • fatal

What do we do with errors of these levels? Let’s say that developers should only be emailed for anything warning or above. Anything else should just be written to the log.
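As a minimal sketch of that rule, the check below decides whether a message gets emailed or only logged. The function names and the numeric ranks are assumptions for illustration; they are not taken from the batch framework discussed in this post.

```php
<?php
/**
 * Returns a numeric rank for a level name; higher means more severe.
 * The ordering follows the list above. The ranks themselves are arbitrary.
 */
function batchLevelRank($level)
{
    $ranks = array(
        'debug'   => 0,
        'info'    => 1,
        'notice'  => 2,
        'warning' => 3,
        'error'   => 4,
        'fatal'   => 5,
    );

    if (!isset($ranks[$level])) {
        throw new InvalidArgumentException("unknown level: $level");
    }

    return $ranks[$level];
}

/**
 * True when a message at $level should be emailed, given the minimum
 * level that triggers email. Everything is still written to the log.
 */
function shouldEmail($level, $minEmailLevel = 'warning')
{
    return batchLevelRank($level) >= batchLevelRank($minEmailLevel);
}
```

With the default threshold, shouldEmail('info') is false and shouldEmail('error') is true; passing 'notice' as the second argument lowers the bar.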

Making developers aware of failures

When you think of the best way to notify developers of problems during processing, what comes to mind first? … … … what was that? Email? Yes, email. So, those warning-level error messages we just spoke about…all of those should be emailed to the developer at the completion of the process.

Now, it’s not the only option, but it may be the most obvious. If your transaction to post-authorize a credit card fails, your developers should be made aware of it right away so someone can contact the vendor, or, say, identify firewall issues in your environment. Similarly, if your video transcoding server(s) is/are down, videos can’t be transcoded — someone needs to be made aware of that! Let’s email them.

Let’s take this rough example code:

$config = array('foo' => 'bar', 'baz' => 'bop'); // Config options
$batch = Batch::getInstance($config);
$vendor = Some_Billing_Processor::factory();
$accounts = Foo::getAccountsForPostAuth();

foreach ($accounts as $account) {
    $accountId = $account->getId();
    $amountToBill = Foo::getPreAuthorizationAmount($account);

    try {
        if ($vendor->postAuthorize($account)) {
            $batch->info("[$accountId] billed $amountToBill");
        } else {
            $batch->warning("[$accountId] failed to bill $amountToBill");
            // Record failure so processing can be retried
        }
    } catch (Vendor_Exception $e) {
        $batch->error(
            "[$accountId] caught exception during vendor communication");
        // Record failure so processing can be retried
    }
}

In this case, we’ve raised a warning for the failed post-auth, and we raised an error if an exception was caught (e.g. inability to connect to the vendor’s service). The info() call won’t result in an email to the developer, but that message will still be logged. Bottom line…if we can’t take the customer’s money, a developer needs to address the situation soon.

In the cases of the warning and error, these log entries will be emailed to the error email recipient(s) upon completion of the script.

Another alternative here would be to never email warnings when they occur, but write a separate script that parses logs for warnings, rolls them up into one message body, and emails the developers every few hours. This keeps the email traffic down, and ultimately keeps your developers from thinking that their back end scripts “cry wolf” by being too chatty.
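Here is a rough sketch of what such a roll-up script could do. It assumes each log line looks like "(PID) [timestamp] [level] message"; the function names are made up for this example, and actually sending the mail is left out.

```php
<?php
/**
 * Returns only the log lines whose level is listed in $levels.
 * Lines that do not match the assumed "(PID) [timestamp] [level] ..."
 * shape (headers, separators) are skipped.
 */
function collectDigestLines(array $logLines, array $levels)
{
    $matches = array();

    foreach ($logLines as $line) {
        if (preg_match('/^\(\d+\) \[[^\]]+\] \[(\w+)\] /', $line, $m)
            && in_array($m[1], $levels, true)) {
            $matches[] = $line;
        }
    }

    return $matches;
}

/**
 * Rolls the collected lines up into one message body.
 */
function buildDigestBody(array $lines)
{
    return count($lines) . " message(s) need attention:\n\n"
        . implode("\n", $lines) . "\n";
}
```

A cron job could read the day's log files with file(), pass the lines through collectDigestLines() with array('warning', 'error', 'fatal'), and mail the body only when at least one line matched.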

What about those logs you speak of?

Everyone’s been in a situation where they have a single directory full of log files. These can be either foo.log, foo.log.1, or even foo.log.20071114 …or any number of naming conventions. Even worse is a single log file for a process that just grows and grows. Log rotation is an easy fix for these scenarios.

Personally, I feel that piling every log file into a single directory is bad practice. I tend to prefer date-based directory names for storing log files. In my opinion, planning for this from the start of your project is far better than having to react in a knee-jerk fashion later on once you’ve filled a directory or reached some sort of maximum file size limit on your filesystem. Consider this directory and the files in it:

/var/log/cc_auth
    pre_auth.log.20071114
    post_auth.log.20071114
    pre_auth.log.20071115
    post_auth.log.20071115
    pre_auth.log.20071116
    post_auth.log.20071116
    pre_auth.log.20071117
    post_auth.log.20071117
    pre_auth.log.20071118
    post_auth.log.20071118

Messy, right? In this example, you end up with a few downsides:

  • A lot of files in each directory
  • Potential to hit Unix max files per directory limit (on ext2 and some older/other filesystems)
  • Date-based filenames are cumbersome to type (or even auto-complete in your Unix shell)

Personally, I prefer a structure using date-based directories like so:

/var/log/cc_auth/2007/11/14
    pre_auth.log
    post_auth.log
/var/log/cc_auth/2007/11/15
    pre_auth.log
    post_auth.log
/var/log/cc_auth/2007/11/16
    pre_auth.log
    post_auth.log
/var/log/cc_auth/2007/11/17
    pre_auth.log
    post_auth.log
/var/log/cc_auth/2007/11/18
    pre_auth.log
    post_auth.log

In this situation, you’ve got a clean structure laid out in directories on disk. Now, you could make an argument that you use more inodes, but that’s a weak argument. Point here being…nice and pretty, right?
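Building that layout takes very little code. The sketch below is one way to do it; buildLogPath() and openLogFile() are hypothetical helper names, and the 0755 directory mode is just an assumption you would adjust for your environment.

```php
<?php
/**
 * Builds a date-based log path such as
 * /var/log/cc_auth/2007/11/14/post_auth.log (dates in GMT).
 */
function buildLogPath($baseDir, $jobName, $timestamp)
{
    return rtrim($baseDir, '/') . '/'
        . gmdate('Y/m/d', $timestamp) . '/'
        . $jobName . '.log';
}

/**
 * Creates the date directory if needed and opens the log for appending.
 */
function openLogFile($path)
{
    $dir = dirname($path);

    // The second is_dir() check covers another process creating the
    // directory between our check and our mkdir() call.
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        throw new RuntimeException("could not create log directory $dir");
    }

    $handle = fopen($path, 'a');
    if ($handle === false) {
        throw new RuntimeException("could not open log file $path");
    }

    return $handle;
}
```

Because the path is derived from the timestamp on every run, a job that runs every few minutes rolls over to the next day's directory on its own, with no rotation step.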

What are some other useful things to log during batch job execution?

The more useful data you can log, the better (within reason, of course). Here are some handy examples:

  • PID
  • Start time of job
  • End time of job
  • Elapsed time
  • Number of notices, warnings, errors, etc.
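A small helper can produce those fields at the end of a run. This is only a sketch: formatRunSummary() is a hypothetical name, and the "(PID)" prefix and GMT timestamps are formatting assumptions.

```php
<?php
/**
 * Returns summary lines for a finished job: start, end, elapsed time,
 * and a count of messages per level. Times are Unix timestamps;
 * fractional seconds (as from microtime(true)) are allowed.
 */
function formatRunSummary($pid, $startTime, $endTime, array $levelCounts)
{
    $lines = array(
        sprintf('(%d)     Start: %s GMT', $pid, gmdate('Y-m-d H:i:s', (int) $startTime)),
        sprintf('(%d)       End: %s GMT', $pid, gmdate('Y-m-d H:i:s', (int) $endTime)),
        sprintf('(%d)   Elapsed: %.6fs', $pid, $endTime - $startTime),
    );

    foreach ($levelCounts as $level => $count) {
        $lines[] = sprintf('(%d)   %s: %d', $pid, $level, $count);
    }

    return $lines;
}
```

In a real job you would capture microtime(true) at startup, pass getmypid() as the PID, and increment the per-level counters each time a message is logged.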

To illustrate, here’s a log entry for a script running on a batch job class that I built at work:

(5527) ------------------------------------
(5527)   Hostname: articuno (batch)
(5527)     Script: /data/baz/deploy/batch/Foo/Bar/some_script.php
(5527)   Log File: /data/baz/log/Foo/Bar/2007/11/13/some_script.log
(5527)      Start: 2007-11-13 02:39:02 GMT
(5527) ------------------------------------
(5527) [2007-11-13 02:39:02] [info] locked 1 items for copyright scanning
(5527) [2007-11-13 02:39:02] [info] [3EC7539BF9F0C72EE040050AEE042902] performing copyright scan; entity type id = 3; name = High sound TR TONE.mp3; scanning file = /foo/bar.mp3; mime type = audio/mpeg
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] entity is not copyrighted
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] removing entity from pending state
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] copied files in temporary storage to public storage
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] deleted all files in temporary storage
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] set permissions on entity's public storage directory
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] set copyright scan outcome
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] removed entity from upload queue
(5527) [2007-11-13 02:39:18] [info] [3EC7539BF9F0C72EE040050AEE042902] queued cdn purge of entity's urls
(5527) [2007-11-13 02:39:18] [info] released lock for process articuno (batch):5527
(5527) [2007-11-13 02:39:18] [info] found 0 copyrighted entities
(5527) ------------------------------------
(5527)       End: 2007-11-13 02:39:18 GMT
(5527)   Elapsed: 16.472630s
(5527) ------------------------------------

At first glance, there are a bunch of things that are really clear from this log entry:

  • The process ID is 5527
  • The entire execution took about 16.5 seconds
  • We see what is being processed, and an entry for every action taken along with its success (or failure)

Now, this is a pretty useful, but successful, log entry. Let’s take a look at a failure case, shall we?

(906) ------------------------------------
(906)   Hostname: sentret (video01)
(906)     Script: /data/baz/deploy/batch/Foo/Bar/transcode_videos.php
(906)   Log File: /data/baz/log/Foo/Bar/2007/11/07/transcode_videos.log
(906)      Start: 2007-11-07 15:58:02 GMT
(906) ------------------------------------
(906) [2007-11-07 15:58:02] [info] locked 2 items of type 4
(906) [2007-11-07 15:58:02] [info] [3D3A7D11E19B8906E040050AEE04323B] Starting flv transcode for 3gp and thumbnails
(906) [2007-11-07 15:59:04] [info] [3D3A7D11E19B8906E040050AEE04323B] Finished
(906) [2007-11-07 15:59:04] [info] [3D3A7D11E19B8906E040050AEE04323B] Starting preview flv transcode for entity page
(906) [2007-11-07 15:59:31] [notice] [3D3A7D11E19B8906E040050AEE04323B] error transcoding video to flash; skipping video; path = /foo/Croud.flv; message = Encoding process encountered an error
(906) [2007-11-07 15:59:31] [info] [3E48BC06162F4A8CE040050AEE042BCC] Starting flv transcode for 3gp and thumbnails
(906) [2007-11-07 15:59:56] [notice] [3E48BC06162F4A8CE040050AEE042BCC] error transcoding video to flash; skipping video; path = /foo/DBA_27108.gif; message = Could not create the flix handle - flixd unreachable (not running); flix result = -9
(906) [2007-11-07 15:59:56] [error] [3E48BC06162F4A8CE040050AEE042BCC] error connecting to the flix engine; skipping video; path = /foo/DBA_27108.gif; message = Could not create the flix handle - flixd unreachable (not running); flix result = -9
(906) [2007-11-07 15:59:57] [info] released lock for process sentret (video01):906
(906) ------------------------------------
(906)       End: 2007-11-07 15:59:57 GMT
(906)   Elapsed: 114.864283s
(906) ------------------------------------

In this case, we see that an error occurred. The developers would receive an email reading:


(906) ------------------------------------
(906)   Hostname: sentret (video01)
(906)     Script: /data/baz/deploy/batch/Foo/Bar/transcode_videos.php
(906)   Log File: /data/baz/log/Foo/Bar/2007/11/07/transcode_videos.log
(906)      Start: 2007-11-07 15:58:02 GMT
(906) ------------------------------------
(906) [2007-11-07 15:59:56] [error] [3E48BC06162F4A8CE040050AEE042BCC] error connecting to the flix engine; skipping video; path = /foo/DBA_27108.gif; message = Could not create the flix handle - flixd unreachable (not running); flix result = -9
(906) ------------------------------------
(906)       End: 2007-11-07 15:59:57 GMT
(906)   Elapsed: 114.864283s
(906) ------------------------------------

We maintain a setting for “minimum email log level,” which defaults to warnings. So, that’s what allows us to email anything at warning-level or higher to the developers that can address the situation. Alternatively, we could set that level to email developers on anything at notice-level or above. It’s all configurable in the batch framework.

Similarly, we define a default exception handler and an error handler to trap uncaught exceptions and errors from PHP. Having an exception handler, for example, allows us to catch all exceptions, log their being uncaught, and email the developers to let them know of the problem. Likewise, PHP notices or warnings are logged and emailed if applicable, too.
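To give a feel for it, here is a stripped-down sketch using PHP's set_error_handler() and set_exception_handler(). The BatchMessages class is a stand-in I made up to collect messages in memory; the mapping from PHP error types to log levels is also an assumption, and the real framework's handlers are not shown in this post.

```php
<?php
/**
 * Hypothetical in-memory sink standing in for the batch logger.
 */
class BatchMessages
{
    public static $entries = array();

    public static function record($level, $message)
    {
        self::$entries[] = array('level' => $level, 'message' => $message);
    }
}

/**
 * Error handler: maps PHP error types onto batch log levels.
 */
function batchErrorHandler($errno, $errstr, $errfile, $errline)
{
    $map = array(
        E_WARNING      => 'warning',
        E_USER_WARNING => 'warning',
        E_NOTICE       => 'notice',
        E_USER_NOTICE  => 'notice',
        E_USER_ERROR   => 'error',
    );
    $level = isset($map[$errno]) ? $map[$errno] : 'warning';

    BatchMessages::record($level, "$errstr in $errfile:$errline");

    return true; // handled here; skip PHP's built-in handler
}

/**
 * Exception handler: anything uncaught is logged at the highest level.
 */
function batchExceptionHandler($e)
{
    BatchMessages::record(
        'fatal',
        'uncaught ' . get_class($e) . ': ' . $e->getMessage()
    );
}

set_error_handler('batchErrorHandler');
set_exception_handler('batchExceptionHandler');
```

From here, the same "minimum email log level" check decides which of the recorded messages end up in the email at the end of the run.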

We’ve definitely achieved our goal of making developers aware of problems!

So, this all looks great, Brian, but how can I get my hands on it?

Well, at this time, I’m not at liberty to release any of this code. Perhaps it’s worth submitting a Zend Framework proposal to keep it in Userland, or even a PEAR2 module.

Even still, let’s assume that we’ll want the following:

  • Parsing of command line options (short and long)
  • Lock file support
  • Email recipient(s) on errors (you could even, say, send SMS messages!)
  • Flexible logging in date-based directories or files, or any arbitrary structure
  • Ability to define levels at which emails are generated
  • Easy way to use batch functionality in any batch script

On the database side of things, let’s consider these requirements:

  • Ability to delay retry of processing for a specified amount of time
  • Ability to retry up to X times, then cease retries

I’ve had this post brewing for a long time now, so I’m going to deem this one “part one of two” and address some of the points above in a second post on the topic. The database portion alone is pretty lengthy. I also haven’t heard back on php|tek acceptance at this point, but if I get accepted, I’ll definitely be bringing some more cohesion to this topic.

If you have any questions or comments, just ask! I’m also going to send a PEAR2 proposal post-Thanksgiving, so heads up!

ZendCon 07: “Mobilizing and Sharing: How Zend Framework Builds Community for Nokia MOSH”

On October 9, 2007, Ben Ramsey and I spoke on how we used Zend Framework in building Nokia MOSH at Schematic. I also touched on some of the architectural details.

The talk went great! A link to a PDF of the slides is below:

Grab a PDF of the slides

Example: Who’s Online with PHP and Memcached

I figured it best to give an example to back up my last post entitled “Who’s Online with PHP and Memcached.”

First, let’s look at the WhosOnline class itself. This class is meant to be a Singleton, so you have to access it with WhosOnline::getInstance().

Also, DISCLAIMER: I wrote this code in about 20-30 minutes. There may be little odds and ends-type problems with it, but please post comments if you’ve got feedback!

/**
 * Class for accessing Who's Online data via Memcached.
 *
 * @author Brian DeShong
 */
class WhosOnline
{
    const RECORDING_DELAY_SECONDS = 120;
    private static $_instances = array();
    private $_mc;

    /**
     * Protected constructor to force use as a singleton.
     *
     * @param Memcache $mc Memcache object.
     */
    protected function __construct(Memcache $mc)
    {
        $this->_mc = $mc;
    }

    /**
     * Classic Singleton getInstance() method.  Allows for multiple
     * WhosOnline instances, though.  For example, maybe you want to use one
     * Memcached pool for users online in your forums, and another for users
     * online in your online dating application.  Coupling a different
     * Memcache object with a different $uniqueId allows this.
     *
     * @param Memcache $mc Memcache object.
     * @param string $uniqueId Unique ID of the object; optional.
     * @return WhosOnline
     */
    public static function getInstance(Memcache $mc, $uniqueId = 'default')
    {
        if (!isset(self::$_instances[$uniqueId])) {
            self::$_instances[$uniqueId] = new self($mc);
        }

        return self::$_instances[$uniqueId];
    }

    /**
     * Determines if current user's online status needs to be recorded or
     * updated.
     *
     * @return bool
     * @todo This method shouldn't reach out to $_SESSION.
     */
    public function needToRecordOnline()
    {
        return
            !isset($_SESSION['lastOnlineRecorded']) ||
            (isset($_SESSION['lastOnlineRecorded']) &&
             $_SESSION['lastOnlineRecorded'] <
                 time() - self::RECORDING_DELAY_SECONDS);
    }

    /**
     * Records given user ID as being online and records last activity
     * timestamp.
     *
     * @param int $userId User ID.
     * @return bool
     */
    public function recordOnline($userId)
    {
        if (!$this->setUserOnline($userId)) {
            return false;
        }

        $_SESSION['lastOnlineRecorded'] = time();
        return true;
    }

    /**
     * Gets array of all users online.  Array is keyed by user ID with activity
     * timestamp as the value.
     *
     * @return array
     */
    public function getUsersOnline()
    {
        $usersOnline = $this->_mc->get('usersOnline');

        return ($usersOnline !== false ? $usersOnline : array());
    }

    /**
     * Sets an array of user IDs with their activity timestamps.
     *
     * @param array $usersOnline Array of user IDs online.
     * @return bool
     */
    public function setUsersOnline(array $usersOnline)
    {
        return
            $this->_mc->set('usersOnline', $usersOnline) &&
            $this->_mc->set('numUsersOnline', count($usersOnline));
    }

    /**
     * Sets given user ID as being online.
     *
     * @param int $userId User ID.
     * @return bool
     */
    protected function setUserOnline($userId)
    {
        $usersOnline = $this->getUsersOnline();
        $usersOnline[$userId] = time();
        return $this->setUsersOnline($usersOnline);
    }
}

Note the primary methods:

  • WhosOnline::getInstance()
  • WhosOnline::needToRecordOnline()
  • WhosOnline::recordOnline()
  • WhosOnline::getUsersOnline()
  • WhosOnline::setUsersOnline()

The main reason we leave setUsersOnline() public is so that it can be accessed via a back end script to clean up the entire array of user IDs online.

Next, our example file using this class:

wol_test.php

<?php
require_once './wol.php';

// Start up the session and assign a user ID.  Typically you would do this at
// authentication time.
session_start();

if (!isset($_SESSION['user_id'])) {
    $_SESSION['user_id'] = uniqid();
}

// Connect to Memcached and grab the Who's Online object.
$mc = new Memcache();
$mc->connect('localhost', 11211);
$who = WhosOnline::getInstance($mc);

// If user needs to be recorded as online, do so.
if ($who->needToRecordOnline()) {
    $who->recordOnline($_SESSION['user_id']);
}

// Grab users online to display; typically you would never do this on the
// front end, though.
$usersOnline = $who->getUsersOnline();
?>
Your session data:
<pre>
<?php echo print_r($_SESSION, true); ?>
</pre>

Users online: <?php echo count($usersOnline); ?>
<pre>
<?php echo print_r($usersOnline, true); ?>
</pre>

I placed the example wol_test.php file in my DocumentRoot and ran it through ApacheBench a few times, like so:

ab -c 10 -n 1000 http://localhost/wol_test.php

This causes the wol_test.php page to be requested 1,000 times at a level of 10 concurrent requests. I did this a few times and ended up with over 3,000 users in my array of users online. Based on a manual get from Memcached like so:

get usersOnline
VALUE usersOnline 1 126727

…we see that with over 3,000 users online, it only takes up 126,727 bytes in Memcached. Remember, the PECL extension for Memcache serializes any non-scalar values before storing them, so you have a cost associated with the serializing and unserializing of the array. Doing the math here, a 1MB serialized array will hold 30,838 users online. You’ll be able to squeeze more out of it if you have integer user IDs; I’m using uniqid() here just for example purposes.
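You can reproduce that byte count without Memcached at all, since it is just the length of the serialized array. The sketch below assumes 3,727 entries (a count inferred from the byte size, not stated above), 13-character uniqid()-style keys, and 10-digit timestamps; the keys and the timestamp are synthetic. If compression is enabled in the extension, the stored size will differ.

```php
<?php
// Build an array shaped like the users-online data: 13-character string
// keys (like uniqid() output) mapped to 10-digit Unix timestamps.
$numUsers = 3727; // assumed count, inferred from the 126,727-byte value
$usersOnline = array();

for ($i = 0; $i < $numUsers; $i++) {
    // The fixed prefix contains letters, so PHP never converts the key
    // to an integer.
    $usersOnline[sprintf('46f549b7%05x', $i)] = 1190480000;
}

$bytes = strlen(serialize($usersOnline));

// Subtract the array wrapper "a:3727:{}" to get the cost per entry.
$perUser = ($bytes - strlen("a:$numUsers:{}")) / $numUsers;

// Rough capacity of a 1MB value, ignoring the small wrapper.
$capacity = (int) floor((1024 * 1024) / $perUser);

printf("%d bytes total, %d bytes per user, ~%d users per 1MB\n",
    $bytes, $perUser, $capacity);
```

Each entry costs 34 bytes here: 21 for the string key and 13 for the integer timestamp. Swapping in integer user IDs shrinks the key part, which is where the extra headroom comes from.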

But is this a good idea? Retrieving 1MB, or even 127k from Memcached every so often isn’t cheap. Remember, you are:

  1. Retrieving string with serialized array of users online from Memcached
  2. Unserializing the string
  3. Adding user or updating their activity timestamp
  4. Serializing the array again
  5. Storing string back to Memcached

…this isn’t cheap. This is probably going to be more sluggish than you’re willing to accept, and I doubt it’d scale well as you crept up into thousands of users online. I’m here with over 4,000 users in my array, and it performs well, but it’s also on a page with nothing else; once you tack on database queries and all sorts of other junk to render a page, you may be looking at a page that renders in over .5 seconds.

In a situation like this, you could consider splitting Who’s Online data up into multiple values in Memcached. Basically, you can write your application code to use, say, 10 “buckets” of users online. You would randomly select one of the 10 buckets to add/modify the user. The key in a situation like this is to have a back end process that merges all of the arrays together, iterates over them removing stale users, and evenly distributes them back into Memcached.

I’ve started coding an example of this, but don’t really have the will to finish it right now. :) Maybe later.
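For what it's worth, the merge step of the bucket idea can be sketched on plain arrays, leaving the Memcached reads and writes aside. rebalanceBuckets() is a hypothetical function, not finished or tested production code. Since a user can land in more than one bucket when buckets are picked at random, the sketch keeps the newest timestamp per user.

```php
<?php
/**
 * Merges bucket arrays (user ID => last activity timestamp), drops stale
 * users, and spreads the survivors evenly back across the same number
 * of buckets.
 */
function rebalanceBuckets(array $buckets, $now, $maxAgeSeconds)
{
    $numBuckets = count($buckets);
    if ($numBuckets === 0) {
        return array();
    }

    // Merge, keeping the most recent timestamp for users that appear in
    // more than one bucket.
    $merged = array();
    foreach ($buckets as $bucket) {
        foreach ($bucket as $userId => $timestamp) {
            if (!isset($merged[$userId]) || $timestamp > $merged[$userId]) {
                $merged[$userId] = $timestamp;
            }
        }
    }

    // Redistribute round-robin, skipping anyone who has gone stale.
    $result = array_fill(0, $numBuckets, array());
    $i = 0;
    foreach ($merged as $userId => $timestamp) {
        if ($timestamp < $now - $maxAgeSeconds) {
            continue;
        }
        $result[$i % $numBuckets][$userId] = $timestamp;
        $i++;
    }

    return $result;
}
```

The back end process would fetch each bucket key, run the arrays through this function, and store each resulting bucket back under its key. Front end writes that happen between the fetch and the store would be lost, so this has the same race as the single-array version.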

Lastly, let’s look at the back end batch process that keeps the array of users online tidy; typically this script would run as a cronjob:

whos_online_cleanup.php

require_once './wol.php';

$now = time();
$mc = new Memcache();
$mc->connect('localhost', 11211);
$who = WhosOnline::getInstance($mc);

$usersOnline = $who->getUsersOnline();

if (empty($usersOnline)) {
    print "no users online; exiting\n\n";
    exit();
}

print "num users online: " . count($usersOnline) . "\n\n";
print "processing users...\n";

$numUsersRemoved = 0;

foreach ($usersOnline as $userId => $timestamp) {
    if ($timestamp < $now - 300) {
        print "removing $userId; last seen " .
            ($now - $timestamp) . " seconds ago\n";
        unset($usersOnline[$userId]);
        $numUsersRemoved++;
    }
}

print "num users removed: $numUsersRemoved\n";
print "current num users online: " . count($usersOnline) . "\n";
print "saving users online...";
print ($who->setUsersOnline($usersOnline) ? 'done!' : '** FAILED **');
exit();

Here’s some example output from it:

brian@henery [/web/pages]$ php ./whos_online_cleanup.php
num users online: 56

processing users...
removing 46f549b786d68; last seen 496 seconds ago
removing 46f549b786e7c; last seen 480 seconds ago
removing 46f549b786ef2; last seen 476 seconds ago
removing 46f549b7871b5; last seen 445 seconds ago
removing 46f549b787213; last seen 480 seconds ago
removing 46f549b78a105; last seen 482 seconds ago
removing 46f549b789071; last seen 389 seconds ago
removing 46f549b7927eb; last seen 467 seconds ago
removing 46f549b79372a; last seen 437 seconds ago
removing 46f549b798931; last seen 487 seconds ago
removing 46f549b79b50b; last seen 423 seconds ago
removing 46f549b79bc0a; last seen 381 seconds ago
removing 46f549b79d000; last seen 379 seconds ago
removing 46f549b79dfa6; last seen 398 seconds ago
removing 46f549b79fb99; last seen 472 seconds ago
removing 46f549b7a3975; last seen 502 seconds ago
removing 46f549b7a8278; last seen 373 seconds ago
removing 46f549b7ab905; last seen 407 seconds ago
removing 46f549b7d635e; last seen 389 seconds ago
removing 46f549b7d63b0; last seen 500 seconds ago
removing 46f549b7d63d8; last seen 453 seconds ago
removing 46f549b7da797; last seen 309 seconds ago
removing 46f549b7dbb9a; last seen 491 seconds ago
removing 46f549b7dce8a; last seen 396 seconds ago
removing 46f549b7dddb0; last seen 361 seconds ago
removing 46f549b7e3da6; last seen 353 seconds ago
num users removed: 26
current num users online: 30
saving users online...done!

So, you can just cron this like so:

* * * * * /usr/local/bin/php /some/path/to/whos_online_cleanup.php > /dev/null 2>&1

…and feel free to redirect STDOUT to a log file if you’d like.

Some quick stats. With over 1,000 users in the array, running the cleanup script takes under .2 seconds:

brian@henery [/web/pages]$ time php ./whos_online_cleanup.php
num users online: 1221

...[snip]...

num users removed: 470
current num users online: 751
saving users online...done!
real    0m0.195s
user    0m0.040s
sys     0m0.040s

…so it’s pretty speedy. It’s worth noting that all of this is being done on my Mac Mini Core Solo with 2 GB RAM running PHP 5.2.4 and Apache 2.2.x on OS X 10.4.10. Oh, and with just over 5,000 users in the array, the script runs in .58 seconds.

So…pretty straightforward, right? What do you think? Surely there’s room for improvement…

Who’s Online with PHP and Memcached

Whenever you Google around for things like “Who’s Online php”, you’ll find that a lot of the solutions are centered around using a database. However, is this really necessary? For a site with, say, 50,000 concurrent users making, say, one page request every eight seconds, this could be a lot of database traffic if you’re recording the user’s activity on every request.

One goal here: get Who’s Online functionality off of the database. We’ll explore a possible solution with Memcached that I’ve personally implemented, and thus far, it’s been working great.

The first thing to consider: how real-time does something like “Who’s Online” need to be? Is having it be accurate to, say, users that have been online within the last two minutes acceptable? Next, do we really need to know when a user made any action on the site, or can we consider them online every so often? Not recording activity on each page request significantly reduces the amount of recording going on.

Next, we have to keep in mind that Memcached values can be up to one megabyte in size. If we’re going to have hundreds of thousands of users online, it’s possible to exceed the 1M limit. Let’s ignore this for now; we’ll address it later. Let’s assume that your site is relatively small and won’t have more than a few hundred or thousand users online at any given time.

For this type of scenario, you can store a single array in Memcache. A decent structure is like so:

array(
    '12345' => [unix timestamp],
    '12346' => [unix timestamp],
    '[user id]' => [unix timestamp],
    ...,
    ...);

Your user ID value can be whatever you’d like, as long as it’s unique. For example, your most common IDs will (should!) be numeric, but a unique username or GUID-based ID will work fine.

Next, you store the timestamp of the user’s last activity. A key point here is…how accurate does this data need to be? Is it sufficient to know who was online in, say, the last two minutes? If so, let’s define “being online:”

Online: an authenticated user who viewed a page on the website within a given period of time.

In our case, let’s say that the period of time is two minutes. You can code your application as follows:

If user is logged in

  1. Was user’s online state recorded within the last 2 minutes (a timestamp for this can be recorded in a session or cookie value)?
    • YES: do nothing
    • NO:
      1. Retrieve array of online user data
      2. Update timestamp of user ID’s last activity
      3. Save array of online user data back to Memcache
      4. Store online recording time (the current timestamp) to user session or cookie

Now, you need a backend process (or some sort of process) to clean up this array of online user data. For example, users that have not performed an activity in the past five minutes should be removed from this array. If a user has not been back within a given period of time, they should no longer be considered as online. This process could run out of cron, say, once a minute or every few minutes.

Your process for this back end script would be like so:

  1. Retrieve array of online user IDs
  2. Iterate over all user IDs, checking their last activity timestamp
    1. If user’s last activity is more than X seconds old (say, 5 minutes), remove them from the array
    2. If user’s last activity is within the past X seconds, they can remain in the array
  3. Store array of user IDs back to Memcached
  4. For convenience, you may also consider storing the number of users in the array in a separate Memcached value; this makes displaying a Who’s Online counter nice and cheap

I’ve implemented this exact process on a website of a decent size, and it’s been working great for a few months now. In our case, we’ve seen peaks of up to 140 users online at a time.

Maybe there are some holes in it, though? I’m not an expert on all of the inner-workings of Memcached, so maybe there’s some sort of a race condition here.

This is, however, a simple way to implement Who’s Online functionality without taxing a database. Remember, just because a Memcached value can be 1MB, doesn’t mean that it should be. If you find that the size of your array is large enough that the retrieval from Memcached takes a while, consider splitting it up over a few cache keys to keep the retrievals cheap. Pulling 1MB down the wire every so often ain’t cheap!

UPDATE: See a follow-up post with example code here!

Thanks, Atlanta PHP!

This evening, I spoke at Atlanta PHP on “Designing for Scalability” (slides coming soon!).

It was a great crowd with some great questions, and most notably, was my first public speaking in the PHP community. I’ll be aiming to do more of these things in the coming months, so you may see me out on the conference scene soon.

Thanks again to the group! Please don’t hesitate to email me at any time.
