BisManOnline - Architecture Overview - Part 1 - Hardware & Software


BisManOnline launched in 1999 as a tiny, unknown website. In the nearly 12 years since, the overall software and hardware architecture has grown from a single server to the architecture we see today. And tomorrow's architecture will change again as new hardware and software become available and our overall traffic grows.

Even as of 2006, BisManOnline still ran on a single machine, serving 100% of its requests in real time and mostly un-cached. Every hit to the home page in 2006 ran a full SQL SELECT against the database just to deliver the ad counts.

Over time we began to spread the load across more server instances. We started by splitting off the ad server, then the database server for the application, the database server for the ad server, memory caching systems, and most recently all static image and content serving. On top of this we have external systems for real-time traffic monitoring, as well as our data warehouse for reporting, statistics, and CRM.

Today's architecture looks a bit like this:

[Architecture diagram]

As you can see, our primary architecture is a fairly common LAMP stack, and our real-time traffic monitoring and data warehouse systems are run on Microsoft software.

The overall design is nothing spectacular, and I will discuss some of our current issues / future plans below.

Based on some traffic estimates (figuring averages for the percentage of content served from users' browser caches, and adding up our page views, ad server calls, image calls, and real-time AJAX requests from the front end), this entire architecture serves roughly 440 million requests per month. This is probably a low estimate: we assumed roughly 80% of image and static content requests are delivered from the browser cache, and the real percentage is probably lower.

Scaling that number further, and noting that roughly 90% of our traffic arrives between 5am and 11pm, we average roughly 13,000 requests per minute against this architecture during those hours, which works out to roughly 215 requests per second.
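For reference, here is the rough back-of-the-envelope math behind those figures, using the round numbers above:

    440 million requests/month × 90%          ≈ 396 million requests in the 5am-11pm window
    396 million ÷ ~30 days                    ≈ 13.2 million requests per day in that window
    13.2 million ÷ (18 hours × 60 minutes)    ≈ 12,000-13,000 requests per minute
    13,000 ÷ 60 seconds                       ≈ 215 requests per second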

Hardware

All of the servers are Dell multi-core machines (various models) with fat amounts of RAM. The primary group is hosted at BTINet's data center (see fancy picture here) in Bismarck, ND. Our configuration is primarily Dell M610s with dual 4- or 8-core Intel Xeon processors, anywhere from 6GB to 24GB of RAM depending on the box, and dual 10K SAS drives in RAID 1.

In addition (not pictured), everything resides behind a set of Juniper Firewalls for security.

Software

The primary setup all runs on various configurations of the LAMP stack (Linux, Apache, MySQL, PHP). In addition to the basic LAMP stack, we add APC (the Alternative PHP Cache, which keeps compiled PHP opcodes and user data in memory) and Memcached (an extremely fast and versatile pure-memory caching system).

Our image / static content server uses the very lightweight and super-fast Lighttpd instead of Apache.

We also run smaller pieces of software for handling things like image manipulation and conversion (ImageMagick), plus various other utilities and programs.

Lessons Learned

In future articles in this series we will cover how all of this works together, but here are a few of the biggest lessons we have learned over time.

Relational Databases are Slow
Relational databases have their place, and, of course, we use them. But the nature of their design makes them slow...very slow. You can keep throwing hardware and RAM at them, but you eventually reach a point where their sheer size, and the number of requests you make against them, cannot be handled by a basic setup. Some of our lessons here are:
  • Reduce their workload - Cache everything you can in memory, including query results (see the sketch after this list)
  • Index everything that is read, and perform updates offline
  • Never use joins - SQL joins are the worst-performing queries you can run
  • Avoid table locking by using InnoDB (row-level locking) instead of MyISAM
  • Select only the columns you need from the DB to reduce network load
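To make the first point concrete, here is a minimal read-through cache sketch in PHP. It is illustrative rather than our production code: the Memcached and MySQL addresses, the ads table, the key name, and the getAdCountForCategory() helper are all assumptions.

    <?php
    // Minimal read-through cache sketch: check Memcached first, fall back to MySQL.
    // Hostnames, credentials, table/column names, and TTLs are placeholders.

    $cache = new Memcached();
    $cache->addServer('127.0.0.1', 11211);

    $db = new PDO('mysql:host=127.0.0.1;dbname=example', 'user', 'pass');

    function getAdCountForCategory(Memcached $cache, PDO $db, int $categoryId): int
    {
        $key = "ad_count:category:" . $categoryId;

        // 1. Try memory first - no disk, no database.
        $count = $cache->get($key);
        if ($count !== false) {
            return (int) $count;
        }

        // 2. Cache miss: ask MySQL, selecting only the single value we need.
        $stmt = $db->prepare('SELECT COUNT(*) FROM ads WHERE category_id = ? AND active = 1');
        $stmt->execute([$categoryId]);
        $count = (int) $stmt->fetchColumn();

        // 3. Store the result so the next few minutes of traffic never touch the DB.
        $cache->set($key, $count, 300); // 300-second TTL

        return $count;
    }

    echo getAdCountForCategory($cache, $db, 42);

The same pattern works for any read-heavy query; the main decisions are the key naming scheme and how long a result can safely stay stale.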

Hard Disks are Slow

Hard disks are one of the biggest bottlenecks in any system. They are extremely slow, but with the right design you can keep them out of most of your requests.
  • Make sure your static content servers have enough RAM to deliver the majority of requests directly from memory
  • Use the APC opcode cache with the OS stat calls turned off, so PHP files are not re-checked on disk for every request
  • Generate full-page caches, store them in RAM, and deliver them directly from memory with Memcached (see the sketch after this list)
  • Cache SQL results in memory and retrieve them from there (use Memcached)
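The "stat calls" setting is a one-line php.ini change (apc.stat = 0). And here is a rough sketch of the full-page cache idea, again illustrative rather than our exact code; the key scheme, TTL, and the render_page() stand-in are assumptions.

    <?php
    // Illustrative full-page cache: serve whole rendered pages from RAM when possible.

    $cache = new Memcached();
    $cache->addServer('127.0.0.1', 11211);

    $key = 'page:' . md5($_SERVER['REQUEST_URI']);

    // Serve straight from memory if this page was rendered recently.
    $page = $cache->get($key);
    if ($page !== false) {
        echo $page;
        exit;
    }

    // Otherwise render the page normally, capturing the output as it is sent.
    ob_start();
    render_page();                 // placeholder for the normal page-building code
    $page = ob_get_contents();
    ob_end_flush();                // still send the page to this visitor

    // Store the finished HTML for the visitors that follow (60-second TTL here).
    $cache->set($key, $page, 60);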

Turn off Logging

You simply don't need Apache's per-request access logging anymore. Log your errors and that's it. Use front-end services like Google Analytics to monitor your traffic. Apache access logging is a waste of CPU and disk I/O.
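In Apache terms, that just means dropping the CustomLog directives and keeping only error logging; something along these lines (paths and log level are examples):

    # Keep error logging only.
    ErrorLog /var/log/httpd/error_log
    LogLevel warn

    # CustomLog /var/log/httpd/access_log combined
    # (disabled - Google Analytics covers traffic reporting)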

Issues with our Current Design

The number one issue with our current design is single points of failure. We currently rely on the uptime of each component in the system to keep the entire application functioning. Our future plans include load balancing the front end and ad server, as well as replicating the database. For now this is simply a cost-benefit issue: with the excellent resources at BTI we can normally recover from a down server in a very short amount of time. Once the cost of that downtime begins to increase, we will deploy redundancy and load balancing.
 

Conclusion

No, we haven't written anything earth-shattering here, but we wanted to show our basic setup. In future articles we will break down some of these systems and discuss in more detail how they work.