Who’s Online with PHP and Memcached

Whenever you Google around for things like “Who’s Online php”, you’ll find that a lot of the solutions are centered around using a database. However, is this really necessary? For a site with, say, 50,000 concurrent users making, say, one page request every eight seconds, that works out to roughly 6,250 requests per second, which could be a lot of database traffic if you’re recording the user’s activity on every request.

One goal here: get Who’s Online functionality off of the database. We’ll explore a possible solution with Memcached that I’ve personally implemented, and thus far, it’s been working great.

The first thing to consider: how real-time does something like “Who’s Online” need to be? Is it acceptable for it to reflect, say, users who have been online within the last two minutes? Next, do we really need to know about every action a user takes on the site, or can we mark them as online only every so often? Not recording activity on each page request significantly reduces the number of writes.

Next, we have to keep in mind that Memcached values can be up to one megabyte in size by default. If we’re going to have hundreds of thousands of users online, it’s possible to exceed the 1MB limit. Let’s ignore this for now; we’ll address it later. Let’s assume that your site is relatively small and won’t have more than a few hundred or a few thousand users online at any given time.

For this type of scenario, you can store a single array in Memcached. A decent structure looks like this:

array(
    '12345' => [unix timestamp],
    '12346' => [unix timestamp],
    '[user id]' => [unix timestamp],
    ...
);

Your user ID value can be whatever you’d like, as long as it’s unique. For example, your most common IDs will (should!) be numeric, but a unique username or GUID-based ID will work fine.
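
To get a feel for how close an array like this comes to the 1MB ceiling, here is a rough sketch that measures its serialized size. It assumes values are stored using PHP's native serialize() format; a client configured with compression or a different serializer will give different numbers. The function names and the made-up user IDs are purely illustrative.

```php
<?php
// Rough size check. Assumes values are stored with PHP's native
// serialize(); compression or another serializer changes the result.

/** Returns the serialized size, in bytes, of an online-users array. */
function onlineArraySize(array $online): int
{
    return strlen(serialize($online));
}

/** Builds a made-up array of $n users with sequential five-digit IDs. */
function fakeOnlineUsers(int $n, int $now): array
{
    $online = [];
    for ($id = 10000; $id < 10000 + $n; $id++) {
        $online[$id] = $now; // user ID => unix timestamp of last activity
    }
    return $online;
}

$bytes = onlineArraySize(fakeOnlineUsers(1000, time()));
printf("1000 users: about %d bytes\n", $bytes);
```

With five-digit integer IDs and ten-digit timestamps, each entry serializes to about 21 bytes, so under these assumptions the 1MB mark arrives somewhere around 50,000 users. String IDs such as usernames or GUIDs will cost more per entry.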

Next, you store the timestamp of the user’s last activity. A key point here is: how accurate does this data need to be? Is it sufficient to know who was online in, say, the last two minutes? If so, let’s define “being online”:

Online: an authenticated user who viewed a page on the website within a given period of time.

In our case, let’s say that the period of time is two minutes. You can code your application as follows:

If user is logged in

  1. Was user’s online state recorded within the last 2 minutes (a timestamp for this can be recorded in a session or cookie value)?
    • YES: do nothing
    • NO:
      1. Retrieve array of online user data
      2. Update timestamp of user ID’s last activity
      3. Save array of online user data back to Memcached
      4. Store online recording time (the current timestamp) to user session or cookie
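
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the site described here: it assumes $cache is a connected Memcached instance (php-memcached extension), and the key name, the session field, and the function name are all hypothetical.

```php
<?php
// Sketch only: $cache is assumed to be a connected Memcached instance;
// 'online_users' and 'online_recorded_at' are hypothetical names.

const ONLINE_KEY = 'online_users';
const RECORD_INTERVAL = 120; // seconds between recordings per user

/**
 * Records the user as online at most once per RECORD_INTERVAL.
 * $session is taken by reference; pass $_SESSION in a real request.
 * Returns true if the cache was written, false if the write was skipped.
 */
function recordOnline($cache, array &$session, string $userId, int $now): bool
{
    $last = $session['online_recorded_at'] ?? 0;
    if ($now - $last < RECORD_INTERVAL) {
        return false; // recorded recently: do nothing
    }

    // 1. Retrieve the array of online user data.
    $online = $cache->get(ONLINE_KEY);
    if (!is_array($online)) {
        $online = []; // cache miss (get() returns false) or unexpected value
    }

    // 2. Update this user's last-activity timestamp.
    $online[$userId] = $now;

    // 3. Save the array back, then 4. remember when we recorded it.
    $cache->set(ONLINE_KEY, $online);
    $session['online_recorded_at'] = $now;
    return true;
}
```

In a real request this would be called after session_start(), along the lines of recordOnline($memcached, $_SESSION, $userId, time()). Note that the get/set pair is not atomic, so two requests landing at the same moment can overwrite each other's update.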

Now, you need a backend process of some sort to clean up this array of online user data. For example, users who have not performed any activity in the past five minutes should be removed from the array: if a user has not been back within a given period of time, they should no longer be considered online. This process could run out of cron, say, once a minute or every few minutes.

The process for this backend script would be like so:

  1. Retrieve array of online user IDs
  2. Iterate over all user IDs, checking their last activity timestamp
    1. If user’s last activity is more than X seconds old (say, 5 minutes), remove them from the array
    2. If user’s last activity is within the past X seconds, they can remain in the array
  3. Store array of user IDs back to Memcached
  4. For convenience, you may also consider storing the number of users in the array in a separate Memcached value; this makes displaying a Who’s Online counter nice and cheap
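
A cleanup script along these lines might look like the sketch below. Again, this is only an illustration under assumptions: $cache is taken to be a connected Memcached instance, and the function name and both key names are made up for the example.

```php
<?php
// Sketch of the cron-driven cleanup. $cache is assumed to be a connected
// Memcached instance; both key names are hypothetical.

/**
 * Drops users idle for more than $maxIdle seconds, stores the pruned
 * array and a separate counter, and returns the online count.
 */
function pruneOnline(
    $cache,
    int $now,
    int $maxIdle = 300,
    string $listKey = 'online_users',
    string $countKey = 'online_count'
): int {
    // 1. Retrieve the array of online user IDs.
    $online = $cache->get($listKey);
    if (!is_array($online)) {
        $online = []; // nothing cached yet
    }

    // 2. Keep only users whose last activity is recent enough.
    //    array_filter() preserves the user-ID keys.
    $online = array_filter($online, function ($lastSeen) use ($now, $maxIdle) {
        return ($now - $lastSeen) <= $maxIdle;
    });

    // 3. Store the pruned array, and 4. a cheap-to-read counter.
    $cache->set($listKey, $online);
    $cache->set($countKey, count($online));
    return count($online);
}
```

A page that only shows "N users online" can then read the small counter key and never touch the full array.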

I’ve implemented this exact process on a website of a decent size, and it’s been working great for a few months now. In our case, we’ve seen peaks of up to 140 users online at a time.

Maybe there are some holes in it, though? I’m not an expert on all of the inner workings of Memcached, so maybe there’s some sort of a race condition here.

This is, however, a simple way to implement Who’s Online functionality without taxing a database. Remember, just because a Memcached value can be 1MB doesn’t mean that it should be. If you find that your array is large enough that retrieving it from Memcached takes a while, consider splitting it up over a few cache keys to keep the retrievals cheap. Pulling 1MB down the wire every so often ain’t cheap!
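
One way to split the array over a few cache keys is to hash the user ID into a fixed number of buckets, so each request only reads and writes one small array. The sketch below is just one possible approach under assumptions: $cache is taken to be a connected Memcached instance, and the shard count, key prefix, and function names are all hypothetical.

```php
<?php
// Sketch: spread online users over several smaller cache entries.
// $cache is assumed to be a connected Memcached instance; SHARDS and
// the 'online_users_' prefix are made-up values for illustration.

const SHARDS = 4;

/** Maps a user ID to one of the shard keys, always the same one. */
function shardKey(string $userId, int $shards = SHARDS): string
{
    // abs() guards against negative crc32() results on 32-bit builds.
    return 'online_users_' . (abs(crc32($userId)) % $shards);
}

/** Records activity by touching only the user's own shard. */
function recordOnlineSharded($cache, string $userId, int $now): void
{
    $key = shardKey($userId);
    $online = $cache->get($key);
    if (!is_array($online)) {
        $online = [];
    }
    $online[$userId] = $now;
    $cache->set($key, $online);
}

/** Gathers every shard into one array (for cleanup or a full listing). */
function allOnline($cache, int $shards = SHARDS): array
{
    $all = [];
    for ($i = 0; $i < $shards; $i++) {
        $part = $cache->get('online_users_' . $i);
        if (is_array($part)) {
            $all += $part; // array union keeps the user-ID keys intact
        }
    }
    return $all;
}
```

The cleanup script would then loop over the shards one at a time instead of loading a single large value, and the shard count can be tuned so that each entry stays comfortably small.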

UPDATE: See a follow-up post with example code here!

  1. PHPDeveloper.org - trackback on September 21, 2007 at 3:15 pm
  2. Some example please :)

  3. There is a pretty significant and obvious race condition here, in that the read-modify-write cycle of the online users array may be occurring from many sessions at once.

    You may have 5 reads, then 5 modifies, then 5 writes. So only 1 user will be added to the array.

    If it’s okay to have this be “lossy”, then that may not be a tragedy. But it’s definitely an unsafe operation unless you implement some locking. Which is of course more expensive.

    There’s also the issue that if you’re using a cluster of memcache servers and you experience an outage on the one to which your cache key hashes, you will lose the online user list because the memcache client library will fail over to the next available server. When the failed server comes back online, you will lose the online user list again, perhaps picking up the very old stale copy (if the outage was just a network partition). Memcache restarts will lose the data, as will memory pressure from other cache entries. There’s no way to tell memcached to treat an entry as permanent.

    In general, it’s a bad idea to store data only in memcached. It’s mem*cache*d, not mem*store*d.

    And not doing locking is one of the reasons why memcached is faster than an RDBMS :-)

  4. Robert, absolutely! Thing here is that it’s okay for the process to be “lossy.” Great feedback…some follow-up here.

    Bottom line here is that it’s not a super-critical operation; it’s never going to be 100% accurate, but, in my opinion, that’s the nature of “Who’s Online” functionality. Accuracy is more important when something like, say, web-based instant messaging comes into play. If you IM a person that’s supposedly online, but they’re not, you’ll never get a response back, so it’s misleading.

    In my opinion, it’s far better to store this data in Memcached rather than a database. That said, in our case, we are placing this data into the database in our back end batch processing script, but that’s just to support a certain accessory page.

    Remember the goal here…keep those operations off of the database. For our needs, this works really, really well because it doesn’t have to be 100% accurate. That said, it’s done a damn good job so far. The lossy nature of it may be more apparent with thousands of users online, but we’re not there yet.

  5. Sorry, but I don’t like your solution :) . First you rejected the database solution because of performance (many SQL queries). But you gave the fix for this problem yourself: do not query the database on each request, but only from time to time ;) (2 minutes sounds reasonable to me).

    So, why do I think that your approach is bad? There is a ‘race condition’ problem. If there are many concurrent requests at one time, some information may not be saved in memcache. Another problem comes with many users online. Assume that your array of user activity information is 500KB. Your PHP script must retrieve 500KB from memcache and put it back. At many requests/s this could be a bottleneck.

  6. And some calculations:

    1000 users online at one time, each making 1 request/s, so you have 1000 requests/s (that is a very large value, believe me :) ).

    If you query the database only once per 2 minutes you have only: 2000 (1 select, 1 update) queries per 2 minutes = 16.67 queries/s. You can set the time to 5 min and have only 6.67 queries/s. Use a simple table without transactional support (MyISAM in MySQL) and performance will be OK :) .

  7. I know this is a bit outdated of a post at this point. But I’d like to add, with some experience with memcached, that this approach is a simple way to get memcached running but does not provide (as others have mentioned) a secure and safe environment for users and sessions in a large server environment.

    In addition and in response to Radarek, memcached is needed in very high traffic websites, no database or amount of servers can keep up with the major social networking websites, which is why this technique now exists and is getting more and more popular. Even without non transactional support, databases crumble under even only hundreds of hits per second (which typically can mean tens of thousands of queries per second).

    Anyway, this is a nice quick tutorial I came across awhile back and was in my bookmarks and thought I’d comment on. Cheers
