Saturday, March 19, 2011

Scale out with services - Scale the services separately

Whatever I am writing about here has been taught to me by this brilliant write up that I quote from the DBSlayer project page.

"The typical LAMP strategy for scaling up data-driven applications is to replicate slave databases to every web server, but this approach can hit scaling limitations for high-volume websites, where processes can overwhelm their given backend DB's connection limits. Quite frankly, we wanted to scale the front-end webservers and backend database servers separately without having to coordinate them. We also needed a way to flexibly reconfigure where our backend databases were located and which applications used them without resorting to tricks of DNS or other such "load-balancing" hacks."

What this means is that it is better (from a scaling perspective) to split your application into services rather than have them run as one monolothic application on some very powerful machine.

Why is it better to do so? That's because each "service" may have different performance requirements and characteristics which can be better addressed in isolation if the service is running as a separate entity.

For example, consider a typical web-stack which consists of:
  1. Load balancer (nginx)
  2. Web server/script execution engine (apache/php)
  3. Data Store (mysql/pgsql)
  4. Caching service (redis/memcached)
  5. Asynchronous processing infrastructure (RabbitMQ)

For a medium to high load web site (a few million hits/day => about 20-30 req/sec), you could make do with just a single instance of nginx, 2 machines running apache/php, 2 machines running MySQL and one machine running RabbitMQ. Even for this deployment, you can see that each of the components have different requirements as far as the machine (and hardware usage) characteristics are concerned. For example,
  1. nginx is network I/O heavy. You could deploy it on a machine with a modest (1GB) amount of RAM, no hard disk space, and not a very fast CPU
  2. The Apache/php servers on the other hand would need more RAM as well as more CPU, but no disk space
  3. The MySQL nodes would need a lot of RAM, CPU as well as fast disks
  4. The node running RabbitMQ (a message queue) could again comfortably run on a machine with a configuration similar to nginx (assuming that data is stored on MySQL)

Thus, we have sliced our stack into components we can club together not just based on function, but also based on the charasteristics of the hardware that they would best be able to run on. Aside: This reminds me of Principal Component Analysis.

No comments: