Tuesday, November 25, 2008

Scaling MySQL for dummies

Alvinator's LOG: November 26, 2008 4:30am

I know it seems unhealthy to stay up awake at these times. I just finished my work but I can't manage to sleep anymore so I thought about writing a short blog.

MySQL has been my favorite RDBMS since I started working with databases. It's free, fast and easy to use although I know a lot of people arguing about its lack for features vs Oracle and postgresql specially on scaling. Well there's this mysql-cluster that promises High-Availabity and Performance, but I still find it pretty complex to setup and maintain. What you can do is put something on top of MySQL and you can do that with your own elbow-grease. There's a lot of techniques in doing this and these three are those that I've used so far:

1. Replication
You can replicate MySQL database (n) times, then configure your database applications to issue the SELECT commands on the replicas and INSERT/UPDATE/DELETE on the master database. This kind of setup is good for websites where most of the database operations is SELECT.

Advantages: Fast Reads
Disadvantages: Only one Master Database for writes (one point of failure)
Problems that may arise: Slave data might lag due to connectivity problems with the master rendering your SELECTS outdated at some point in time.

2. Partitioning
A table that has multi-million rows will drag your queries. One solution is to partition your table into multiple tables on a specific category like dates, year, location, etc. You can do this programatically yourself or you can make use of MySQL 5.x's built-in partitioning support. I've used it when parsing logs into database.

Advantages: Fast Reads
Disadvantages: queries into multiple tables is quite hairy
Problems that may arise: You're still using one MySQL server instance and that makes it one point of failure.

3. Sharding
I first heard about this when I was watching the youtube guys in google tech talk giving some information about their experiences scaling youtube. Sharding in a nutshell is where you partition your database across different MySQL servers and across different machines. A data can be stored in a shard (a MySQL database instance) using a specific foreign key like name, date, or a programmatic hash. I've personally implemented this kind of setup and I'm quite satisfied with it. Just one note: This is NOT the holy grail of MySQL Scalability, your specific requirements may not be met on this kind of setup.

Advantages: Fast Read/Writes, Better HA since you have multiple MySQL instances
Disadvantages: You have to implement it yourself, your database applications will need to connect to multiple databases.
Problem that may arise: Resharding is a nightmare, your system admins will hate you because manual queries will be harder, a lookup that is not using the foreign key will require you to query across all shards.

In sharding you have to decide how you can spread your data across shards and know how to easily obtain them. You use for example the first letter in username ('A' goes to shard1, 'B' goes to shard2, ...) which is human understandable although there's probably more A's than X's, Y's, and Z's right? A better solution is by programmatic hash. Using a programmatic hash ensures that your data is equally scattered among your shards. In python you can do it like this:

Python 2.5.1 (r251:54863, May 2 2007, 16:56:35)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> def hash_me(username, numshards):
... return hash(username) % numshards
>>> hash_me("alvinator", 4)

That makes "alvinator" assigned to shard number 3 if you have 4 shards.

No comments: