When we look to scaling databases there are a number of resources involved. Every cell in every table in every database is a resource, as well as the physical infrastructure of hard drives, processors, and memory (as previously discussed). The ability to handle resource distribution directly ties into the abilities of the database architecture, which some architectures support more than others. Here we will look at two major ways to distribute database resources: sharding and replication.
Let’s start with the topic that everyone wants to get to: sharding. First, we need to clear up some terminology. Sharding is the exact same thing as partitioning. And, practically speaking, we can think of sharding, partitioning, and federation as the same thing when it comes to databases. One article explains the terminology in more detail, which is supported elsewhere, and my position remains that the term “sharding” will suffice. Yet, I can’t pass the opportunity to quote a rather humorous take on the subject:
So, let’s say federation is like Star Trek. The Vulcans, Klingons, and Humans live in very separate policy domains, but they each pledge to work together to make sure Captain Kirk always gets the girl. And sharding is like AJAX, a great marketing term for stuff that may have already existed, but has taken on a new useful life on its own.
Second, it should be noted that any major database can be sharded, because I know that question is going through the minds of readers familiar with the topic of scalability. Alright, that said, what exactly does sharding look like?
Arguably, sharding is the bottom line to database scalability. If you cannot shard your database you cannot scale your database. However, this is more like saying if your car cannot burn high octane fuel you cannot win the Datyona 500. Most websites don’t have to perform on a NASCAR level. This, perhaps, isn’t the best analogy, but the point is this: only shard databases when the number of rows in a given table hinders performance or the number of replica sets is impractical. Replication, which we will come to later, avoids many of the complications involved in sharding.
Sharding in essence involves putting data on multiple machines. This ought to achieve linear scalability, which is why sharding is so effective. The principle is simple: if you have one computer that can handle 1,000 Qps, then if you had two computers you could handle 2,000 Qps. Yet, this poses some basic issues. How does your application know which database to use? How do you run join queries? Well, the issues presented are unique to which type of sharding you are doing.
There are two basic ways to shard: vertical sharding and horizontal sharding. Think of vertical sharding as splitting up the database by column and horizontal sharding as splitting up the database by row. In either case, a field is chosen as the criteria for sharding. For horizontal sharding the value will be a column and for vertical sharding the value will be the column name. For example, if sharding a user’s table horizontally into two shards one could do so by the first letter in the person’s last name. Thus, one server will have A-M and another N-Z. Or, for example, if sharding the user’s table vertically into two shards one could put the user’s contact information on one server and the user’s meta data related to the application on another. Yes, vertical sharding will involve the act of normalization. Horizontal sharding is typically automatic whereas vertical sharding is often done manually.
Depending on which way you shard the database there are various types of sharding available. There are four kinds of sharding criteria: range, list, hash, and a lookup service. Fortunately, their meaning is relates to their name. Range partitioning involves splitting by a range of values. The example of horizontally sharding the user’s table by the user’s last name, A-M and N-Z, is a range criteria. It is worth noting that dates are also a common way to shard data, and very effective in some cases. For example, sharding a table of blog posts by date allows for the older shards to be moved to lower performance servers. Next, list criteria splits up data based on whether the value is in a list. So, for example, we can split the user’s table up based on their country of origin. One server can be designated for the USA, Canada, or UK and another for China, Japan, and India.
Next, hash sharding involves processing the value through a hash algorithm to choose which server. Thus, each server will have an associated hash value and the computed hash value on our criteria will end up as one of those hash values. So, we could divide up the user’s table based on the user id and our hash function be the modulus of the id with the number 4. This will give a number between 0-3 so we will have four servers. Last, a lookup service will choose which server a piece of data should be on based on an algorithm. This algorithm, or lookup service, could select which server based on the server’s load.
Ok, so now we understand what sharding looks like, how is it implemented? There are three ways sharding is accomplished: by the application, by a proxy, or by the database architecture. Application-level sharding is arguably the easiest to implement overall compared to the configuration and knowledge necessary to implement it using a proxy or the database architecture. For example, if the application is using a MVC framework once could simply have the user’s model connect to one server and blog posts another server. Or, if horizontal sharding is needed, one could implement the sharding criteria in the application to choose which database to query from. It doesn’t get simpler than that, but the catch here is that the application has to always know the shard criteria of the data it is looking for and it may be that the models cannot be changed after launch, such as a mobile device or desktop application. Thus, application-level sharding is clearly not optimal.
Proxy-based sharding has pros and cons. A database proxy intercepts queries, sends them to the appropriate servers, and returns the result. The benefit is that the application does not have to be modified, scaling can happen more dynamically, and automatic failover. However, the configuration and knowledge needed to implement a proxy can be too big of a setback to be realistic. Finally, the moment we’ve all been waiting for: sharding using database architecture. This is the flagship of databases like MongoDB and Cassandra which have tools built into the database architecture to handle sharding, and the grounds for attacking SQL databases like MySQL. Do not be deceived. Not only are there pros and cons between database architectures, no database architecture is as simple as plug-and-play. Many databases, like MySQL, rely on third party tools like dbShards and DRBD. The climate of database architectures is rapidly changing, even MySQL is producing solutions like MySQL Cluster, to adapt to the essential need for seamless sharding at the layer of database architecture. Will will touch more on this in the next part, but suffice it to say that seamless sharding can be accomplished through the database architecture and/or third party tools.
Replication simply means to duplicate data and serves two purposes: fault tolerance and scalability. So, there can be database systems that involve both replication and sharding, and in fact most, if not all, sharded systems use replication for the purpose of fault tolerance. People who are new to scaling databases first look to sharding because of how much attention it is getting among all the hype surrounding SQL and NoSQL, MySQL and MongoDB. When looking at the progression of scaling a database system, replication is the first step in resource distribution.
Replication is easier to implement than sharding because it does not involve splitting the database. The whole database is replicated onto multiple servers. Typically, a basic setup is a replica set containing a master and two slaves. The master handles all writes from the application and the two slaves handle reads, as seen in the image below from MySQL’s documentation section on replication scale-out:
Because web applications are typically read heavy, this solution will likely achieve greater scale for the database. When the database needs scaled, additional read slaves or write masters can be added. In the event of a failover, a slave will take the place of the master.
One could host their application on one of IBM’s supercomputers that operates at petaFLOP speed with petabytes of memory and have a lot of scalability problems solved. However, the question then comes of whether or not the database architecture itself can handle the resources. This is where we will go to next. In the mean time, because there is so much to cover I will include some additional resources for further reading.
- Why you don’t want to shard.
- Building Scalable Databases: Pros and Cons of Various Database Sharding Schemes
- Database Sharding Whitepaper
- Top Five Scale-Out Pitfalls to Avoid
- MySQL On Demand Web Seminars (Select “MySQL Cluster and High Availability”)
- Thoughts on MySQL Replication
- Chapter 14. High Availability and Scalability (MySQL)
- Scaling a PHP MySQL Web Application, Part 1
- Read replication with MySQL
- A sharding model