Database Scalability, Part 5: The Process

So far we’ve looked at the three core areas involved in scaling databases and how to achieve greater scale in each area. To conclude this series on database scalability, we will look at some general guidelines and basic principles for the process of scaling databases.

The process can be summed up by YouTube’s free and open source code for scalability:

while (true)
{
    identify_and_fix_bottlenecks();
    drink();
    sleep();
    notice_new_bottleneck();
}

This “recipe for handling rapid growth” conveys the reality that, by nature, there will always be bottlenecks. That is, some specific aspect of any database system will always fail to perform at the same level of scale as the others. Identifying and fixing bottlenecks involves looking at the three core areas of database scalability we have discussed, which should be approached in order:

  1. Hardware
  2. Resource distribution
    1. Replication (Single node)
    2. Replication (Multiple nodes)
    3. Sharding
    4. Geo-replication
  3. Architecture
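
As a rough illustration, the ordering above can be written down as a simple escalation list. This is only a sketch: the step identifiers and the `next_scaling_step` helper are hypothetical names invented for this example, and deciding when a step is "exhausted" is left to the engineer.

```python
from typing import Optional, Set

# Escalation order taken from the list above; the identifiers are illustrative.
SCALING_STEPS = [
    "hardware",
    "replication_single_node",
    "replication_multiple_nodes",
    "sharding",
    "geo_replication",
    "architecture",
]


def next_scaling_step(exhausted: Set[str]) -> Optional[str]:
    """Return the first step, in order, that has not been exhausted yet."""
    for step in SCALING_STEPS:
        if step not in exhausted:
            return step
    return None  # every step has been tried


# Hardware upgrades no longer help, so replication on a single node is next.
print(next_scaling_step({"hardware"}))
```

The point of keeping the steps in one ordered list is that a cheaper step is always considered before a more complex one.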

The quickest, easiest, and most cost-efficient step is to find and eliminate bottlenecks in hardware. The most effective, however, is the next step: resource distribution. In principle, if a database cannot be distributed, it cannot scale. The ability to distribute resources is where the NoSQL movement claims an advantage over SQL, but in reality resource distribution demands a high level of maintenance and skill, even if you are using MongoDB’s built-in tools. For this very reason, many services are moving to cloud-based systems such as Amazon. The following graph shows the tradeoffs between cost and complexity versus availability:

This graph clarifies why putting money towards hardware first is the most cost-efficient choice. Cost and complexity increase linearly, perhaps even exponentially, as resource distribution moves from basic replication to sharding and geo-replication.

Last is the question of the database architecture itself. Choosing the right database architecture should come first, but in practice it never does, for one reason: it takes nearly impossible foresight to gather enough information to make an informed, confident decision about which architecture to choose. Databases on the internet change too rapidly to know in advance how the data will be used.

Worrying about which architecture to choose is a non-issue: if the database architecture itself becomes the bottleneck, the database engineers on board should be competent enough to deal with it, either by working around it or by migrating to a different architecture. This brings up the final point: the scalability of a database correlates directly with the skill of the people managing it. As soon as a website crosses the threshold where a single computer can no longer handle the database load, there should be a database engineer on the team who can run the database scalability loop of identifying and fixing bottlenecks.
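
To make the scalability loop a little more concrete, here is a minimal sketch of one way it could look. Everything in it is an assumption for illustration: the utilization numbers are made up, the 0.8 threshold is arbitrary, and `find_bottleneck`, `scalability_loop`, and `halve_load` are hypothetical helpers, not part of any real monitoring tool.

```python
from typing import Callable, Dict, List


def find_bottleneck(utilization: Dict[str, float]) -> str:
    """Return the resource with the highest utilization (0.0 to 1.0)."""
    return max(utilization, key=utilization.get)


def scalability_loop(
    utilization: Dict[str, float],
    fix: Callable[[Dict[str, float], str], None],
    threshold: float = 0.8,  # assumed cutoff for "this is a bottleneck"
    max_rounds: int = 10,    # safety limit so the sketch always terminates
) -> List[str]:
    """Repeatedly find and fix the worst bottleneck; return what was fixed."""
    fixed = []
    for _ in range(max_rounds):
        bottleneck = find_bottleneck(utilization)
        if utilization[bottleneck] < threshold:
            break  # nothing is saturated right now; time to drink() and sleep()
        fix(utilization, bottleneck)
        fixed.append(bottleneck)
    return fixed


def halve_load(utilization: Dict[str, float], resource: str) -> None:
    """Stand-in for a real fix (new hardware, a replica, a shard, ...)."""
    utilization[resource] /= 2


# Made-up utilization figures for a single database server.
metrics = {"cpu": 0.95, "disk_io": 0.85, "memory": 0.60}
print(scalability_loop(metrics, halve_load))  # prints ['cpu', 'disk_io']
```

In practice the loop never really ends: once the current bottlenecks are fixed, growth eventually pushes a different resource past the threshold.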

See also, Part 1, Part 2, Part 3, and Part 4.


By Joe Purcell

Joe Purcell is a technology virtuoso, cyberspace frontiersman, and connoisseur of Linux, Mac, and Windows alike.
