
How to configure a low-cost load-balanced LAMP cluster

By Keith Winston on April 24, 2006 (8:00:00 AM)


The ubiquitous Linux, Apache, MySQL, and PHP/Perl/Python (LAMP) combination powers many interactive Web sites and projects. It's not at all unusual for demand to exceed the capacity of a single LAMP-powered server over time. You can take load off by moving your database to a second server, but when demand exceeds a two-server solution, it's time to think cluster.

A LAMP cluster is not the Beowulf kind of cluster that uses specialized message-passing software to tackle a computation-intensive task, and it does not provide high-availability features such as automatic failover. Rather, it is a load-sharing cluster that distributes Web requests among multiple Web and database servers while appearing to be a single server.

All the software required to implement a LAMP cluster ships with most Linux distributions, so it's easy to implement. We'll construct a cluster using seven computers for a fictitious company, foo.com. Two servers will run DNS, primary and backup, to distribute Web requests among three Web servers that read and write data from two MySQL database servers. You could build any number of different designs, with more or fewer of each kind of server, but this model will serve as a good illustration of what can be done.

Load balancing

The first part of the cluster handles load balancing by using the round robin feature of the popular DNS software Berkeley Internet Name Daemon (BIND). Round robin DNS is a load balancing method of serving requests for a single hostname, such as www.foo.com, from multiple servers.

To use round robin, each Web server must have its own public IP address. Many organizations use network address translation and port forwarding at the firewall to assign each Web server a public IP address while internally using a private address. In my DNS example, I show private IP addresses, but public IPs are required for the Web servers so DNS can work its magic.
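
For illustration, a single port-forwarding rule of that kind might look like the following (a sketch using iptables; 203.0.113.11 is a stand-in for a public address assigned to web1, and your firewall, addresses, and rule set will differ):

# Forward Web traffic arriving on web1's public address to its private address
iptables -t nat -A PREROUTING -d 203.0.113.11 -p tcp --dport 80 \
         -j DNAT --to-destination 10.1.1.11:80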

This snippet from the DNS zone definition for foo.com assigns the same name to each of the three Web servers, but uses different IP addresses:

;
; Domain database for foo.com
;
foo.com.                IN      SOA     ns1.foo.com. hostmaster.foo.com. (
                        2006032801 ; serial
                        10800 ; refresh
                        3600 ; retry
                        86400 ; expire
                        86400 ; default_ttl
                        )
;
; Name servers
;
foo.com.                IN      NS      ns1.foo.com.
foo.com.                IN      NS      ns2.foo.com.
;
; Web servers
; (private IPs are shown for illustration, but public IPs are required)
;
www                     IN  A  10.1.1.11
www                     IN  A  10.1.1.12
www                     IN  A  10.1.1.13

When the DNS server gets requests to resolve the name www.foo.com, it will return one IP address the first time, then a different address for the next request, and so on. Theoretically, each Web server will get one-third of the Web traffic. Due to DNS caching and because some requests may use more resources than others, the load will not be shared equally, but over time it should be close enough.
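
You can watch the rotation by querying the name server directly a few times (sample output for illustration; the order returned will vary):

dig @ns1.foo.com www.foo.com +short
10.1.1.11
10.1.1.12
10.1.1.13

dig @ns1.foo.com www.foo.com +short
10.1.1.12
10.1.1.13
10.1.1.11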

If round robin DNS is too crude, and you have some money to throw at the problem, a number of companies sell hardware load balancing equipment that offers better performance. Some even take into account the actual load on each Web server to maximize cluster performance instead of just delegating incoming requests evenly.

Web servers

Configuring the Web servers for use in a cluster is largely the same as configuring a single Apache Web server, with one exception. Content on all the Web servers has to be identical, in order to maintain the illusion that visitors are using one Web site and not three. That requires some mechanism to keep the content synchronized.

My tool of choice for this task is rsync. To keep things in sync with rsync, designate one server, web1 for example, as the primary Web server, and the other two as secondaries. Make content changes only on the primary Web server, and let rsync and cron update the others every minute -- or whatever interval you think is best, depending on how often content on the server is updated. Thanks to the advanced algorithms in rsync, content updates happen quickly.

I recommend creating a special user account on each Web server, called "syncer" or something similar. The syncer account needs write permissions to the Web content directory on each server. Then, generate a Secure Shell (SSH) key pair for the syncer account with ssh-keygen on the primary Web server and append the public key to the syncer account's ~/.ssh/authorized_keys file on the other two Web servers. This allows you to use rsync over SSH without needing a password for authentication, so the content can be kept up-to-date at regular intervals.
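
The key setup might look something like this (a sketch, run as the syncer user on the primary Web server; ssh-copy-id ships with OpenSSH, or you can append the public key to ~/.ssh/authorized_keys by hand):

# On web1, as the syncer user: generate an SSH key pair with no passphrase
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Install the public key for the syncer account on the other Web servers
ssh-copy-id syncer@web2
ssh-copy-id syncer@web3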

This short shell script uses rsync to update the Web content:

#!/bin/bash
# Push the primary document root to the secondary Web servers as the syncer user.
# --delete removes files on the target that no longer exist on the primary.
rsync -r -a -v -e "ssh -l syncer" --delete /var/www/ web2:/var/www/
rsync -r -a -v -e "ssh -l syncer" --delete /var/www/ web3:/var/www/

Set up the script in cron to run regularly and push updates out to web2 and web3.
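
For example, assuming the script is saved as /home/syncer/sync-web.sh (a name chosen here for illustration) and is executable, a crontab entry like this for the syncer user on web1 runs it every minute:

* * * * * /home/syncer/sync-web.sh > /dev/null 2>&1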

The cookie conundrum and application design

Sessions and cookies can be a tricky issue when LAMP applications use this kind of cluster. Cookies live on the client, but the session data they refer to lives on the server: by default, PHP stores its session files in the /tmp directory of the server where it is running. If a visitor starts a session on one Web server, but subsequent HTTP requests are handled by a different Web server in the cluster, the session data won't be there and things won't work as expected.

Because the IP address of a Web server is cached locally by the client, this doesn't happen often, but it is something that must be accounted for, and it may require some application programming changes. One solution is to use a shared session directory for all Web servers. Be particularly aware of this issue when using pre-built LAMP applications.
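
If the application uses PHP's default file-based sessions, one way to share them (a sketch, assuming a directory such as an NFS export mounted at /mnt/sessions on every Web server) is to point session.save_path at the shared location in php.ini:

; php.ini on each Web server
; store session files in a directory visible to all Web servers
session.save_path = /mnt/sessions

Keep in mind that a shared session directory on NFS introduces its own locking and single-point-of-failure considerations.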

Aside from the session issue, the only other requirement for an application is that all database writes are sent to the database master, while reads should be distributed between the master and slave(s). In our example cluster, I would configure the master Web server to read from the master database server, while the other two Web servers would read from the slave database server. All Web servers write to the master database server.
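
At the code level, that usually means opening two database connections and routing queries accordingly. Here is a minimal PHP sketch (the hostnames, credentials, and table are placeholders for illustration, not part of the article's setup):

<?php
// Writes always go to the master database server (db1); reads go to
// whichever database server this particular Web server is assigned.
$write_db = mysqli_connect('db1.foo.com', 'appuser', 'apppass', 'appdb');
$read_db  = mysqli_connect('db2.foo.com', 'appuser', 'apppass', 'appdb');

mysqli_query($write_db, "INSERT INTO hits (page) VALUES ('/index.php')");
$result = mysqli_query($read_db, "SELECT COUNT(*) FROM hits");
?>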

Database servers

MySQL has a replication feature to keep databases on different servers synchronized. It uses what is known as log replay, meaning that a transaction log is created on the master server which is then read by a slave server and applied to the database. As with the Web servers, we designate one database server as the master -- call it db1 to match the naming convention we used earlier -- and the other one, db2, is the slave.

To set up the master, first create a replication account -- a user ID defined in MySQL, not a system account, that is used by the slaves to authenticate to the master in order to read the logs. For simplicity, I'll create a MySQL user called "copy" with a password of "copypass." You will need a better password for a production system. This MySQL command creates the copy user and gives it the necessary privileges:

GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO copy@"10.1.0.0/255.255.0.0" IDENTIFIED BY 'copypass';

Next, edit the MySQL configuration file, /etc/my.cnf, and add these entries in the [mysqld] section:

# Replication Master Server (default)
# binary logging is required for replication
log-bin

# required unique id
server-id = 1

The log-bin entry enables the binary log file required for replication, and the server-id of 1 identifies this server as the master. After editing the file, restart MySQL. You should see the new binary log file in the MySQL directory with the default name of $HOSTNAME-bin.001. MySQL will create new log files as needed.

To set up the slave, edit its /etc/my.cnf file and add these entries in the [mysqld] section:

# required unique id
server-id = 2
#
# The replication master for this slave - required
# (replace with the actual IP of the master database server)
master-host =   10.1.1.21
#
# The username the slave will use for authentication when connecting
# to the master - required
master-user     =   copy

# The password the slave will authenticate with when connecting to
# the master - required
master-password =   copypass

# How often to retry lost connections to the master
master-connect-retry = 15

# binary logging - not required for slaves, but recommended
log-bin

While it's not required, it is good planning to create the MySQL replication user (copy in our example) on each slave in case it needs to take over from the master in an emergency.

Restart MySQL on the slave and it will attempt to connect to the master and begin replicating transactions. When replication is started for the first time (even unsuccessfully), the slave will create a master.info file with all the replication settings in the default database directory, usually /var/lib/mysql.
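
If the master already holds data before the slave is set up, copy that data to the slave first, and then point the slave at the master's current binary log coordinates from the MySQL monitor. A sketch (substitute the file name and position that SHOW MASTER STATUS reports on your master, as shown in the next section):

CHANGE MASTER TO
    MASTER_HOST='10.1.1.21',
    MASTER_USER='copy',
    MASTER_PASSWORD='copypass',
    MASTER_LOG_FILE='master-bin.006',
    MASTER_LOG_POS=73;
START SLAVE;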

To recap the database configuration steps:

  1. Create a MySQL replication user on the master and, optionally, on the slave.
  2. Grant privileges to the replication user.
  3. Edit /etc/my.cnf on master and restart MySQL.
  4. Edit /etc/my.cnf on the slave(s) and restart MySQL.

How to tell if replication is working

On the master, log in to the MySQL monitor and run show master status:

mysql> show master status\G
*************************** 1. row ***************************
            File: master-bin.006
        Position: 73
    Binlog_do_db:
Binlog_ignore_db:
1 row in set (0.00 sec)

On the slave, log in to the MySQL monitor and run show slave status:

mysql> show slave status\G
*************************** 1. row ***************************
         Master_Host: 10.1.1.21
         Master_User: copy
         Master_Port: 3306
       Connect_Retry: 15
     Master_Log_File: master-bin.006
               [snip]
    Slave_IO_Running: Yes
   Slave_SQL_Running: Yes

The most important fields are Slave_IO_Running and Slave_SQL_Running; both should have the value Yes. Of course, the real test is to execute a write query against a database on the master and see whether the results appear on the slave. When replication is working, slave updates usually appear within milliseconds.
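
A quick end-to-end check (a sketch using a throwaway table in the test database, assuming no replication filters exclude it):

-- On the master:
CREATE TABLE test.repl_check (id INT);
INSERT INTO test.repl_check VALUES (1);

-- On the slave, a moment later; the row should be there:
SELECT * FROM test.repl_check;

-- Clean up on the master (the drop replicates to the slave as well):
DROP TABLE test.repl_check;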

Recovering from a database error

If the slave database server loses power or the network connection, it will no longer be able to stay synchronized with the master. If the outage is short, replication should pick up where it left off. However, if a serious error occurs on the slave, the safest way to get replication working again is to:

  1. Stop MySQL on the master and slave.
  2. Dump the master database.
  3. Reload the database on the slave.
  4. Start MySQL on the master.
  5. Start MySQL on the slave.

Depending on the nature of the problem, a full reload on the slave may not be necessary, but this procedure should always work.
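
Here is a sketch of those steps using a file-level copy of the data directory, assuming both servers run the same MySQL version and use the default data directory (the init script name varies by distribution; a logical dump with mysqldump, taken while the master is still running, is an alternative):

# 1. Stop MySQL on the master (db1) and the slave (db2)
/etc/init.d/mysql stop

# 2-3. On db1, copy the master's data directory to the slave
rsync -a --delete /var/lib/mysql/ db2:/var/lib/mysql/

# 4-5. Start MySQL on the master, then on the slave
/etc/init.d/mysql start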

If the problem is with the master database server and it will be down for a while, you can reconfigure the slave as the master by updating its IP address and /etc/my.cnf file. All Web servers then must be changed to read from the new master. When the old master is repaired, it can be brought up as a slave server and the Web servers changed to read from the slave again.
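
The my.cnf change on the promoted slave is small (a sketch): remove or comment out the master-* entries, keep binary logging enabled, then restart MySQL and point the Web servers' database connections at db2.

# /etc/my.cnf on db2 when it takes over as master
server-id = 2
log-bin
#master-host     = 10.1.1.21
#master-user     = copy
#master-password = copypass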

MySQL also offers a storage engine designed for distributed databases, NDB (the basis of MySQL Cluster), which provides another option. For more in-depth information on MySQL clustering, see the MySQL Web site or High Performance MySQL by Jeremy Zawodny and Derek Balling.

Going large

Clusters make it possible to scale a Web application to handle a tremendous number of requests. As traffic builds, network bandwidth also becomes an issue. Top-tier hosting providers can supply the redundancy and bandwidth required for scaling.


Comments on How to configure a low-cost load-balanced LAMP cluster


Not a true cluster

Posted by: Anonymous Coward on April 24, 2006 10:39 PM
This article had the chance to talk about true clustering, but unfortunately blew it big time:

* Round-robin DNS is a very crude way of trying to send requests to two or more servers - DNS caching by upstream servers (regardless of TTL values) and no accounting for the load of each machine make it too coarse a method for "load balancing".

* There should be no concept of a "master server" (or if there is, it should dynamically move automatically to another box if the current box assigned to be a master dies).

* You need to share the filestore across all servers in the pool for both read and write access (along with locking and re-synchronisation if a server goes down and comes back up again). One-way rsync'ing just doesn't cut it.

To balance Web serving and share filestore, I used this software:

* Spread
* Wackamole
* 2-way rsyncing (with my own custom 2-way locking and filestore change detection)

It was working OK until I realised that Wackamole logs on the original Web host that was requested *and* on the redirected Web host (if that other host was more lightly loaded), doubling the Web logging up a lot of the time. I ended up skewing the load balancing to favour one server over another unless the load was high which mostly avoided the double-logging.

I'd like to see a comprehensive article on LAMP clustering that covers properly sharing filestore (without using one-way rsync or NFS, both of which have single points of failure), Web load-balancing and, of course, database clustering. I'd also ideally like to see Linux distros start to include such software to make it easier to set up a LAMP cluster in the first place.


Re:Not a true cluster

Posted by: Anonymous Coward on April 25, 2006 01:44 AM
Maybe you should approach the editors at Linux.com and do an article on this. Sounds like it could be interesting.


Re:Not a true cluster

Posted by: Administrator on April 25, 2006 02:19 AM
I was gonna say the same thing. Clustering's a fun topic, and one of the more interesting things one can do with Linux, so I'd like to hear more about it.


Re:Not a true cluster

Posted by: Anonymous Coward on April 25, 2006 05:45 AM
Why not forward the request with some addition to the URL? Then it won't be counted twice in the statistics.


Not cookies

Posted by: Anonymous Coward on April 25, 2006 03:52 AM

"Apache stores its cookies in the /tmp directory on the server"

I think you mean file-based sessions here, not cookies (which are stored on the client, not the server).


native mysql cluster

Posted by: Anonymous Coward on April 25, 2006 06:26 AM
A paper on a somewhat more robust MySQL cluster implementation can be found at http://www.lod.com/whitepapers/mysql-cluster-howto.html.


Round robin

Posted by: Anonymous Coward on April 25, 2006 07:08 AM
Round Robin DNS isn't as bad as the first poster makes it out to be. It distributes load well enough for most Web hosting applications. I use it to serve out 130+ blogs on two servers doing almost a million hits a day. If you need to fail over, just have the other member assume both IP addresses.

rsync though is pretty corny, I'll agree. NFS works well.

When scaling out, one must determine the requirements -- is it simply load sharing or redundancy? There's nothing wrong with just going for load sharing, if that meets your requirements.

All that said, you get a lot of bang for your buck by simply tuning your servers properly.


What about clustered writes?

Posted by: Anonymous Coward on April 25, 2006 11:19 AM
How do we handle the real challenge: when one server cannot cope with all the database writes?


Horses for courses..

Posted by: Anonymous Coward on April 25, 2006 05:38 PM
This article is brief, and the setup in my opinion does not advocate the best way of allocating resources. (Mistakes such as the cookie location don't help.)

Firstly, you should always look at your system and analyse where the bottleneck is. Sometimes just replicating the system n times, although it may work, is actually very cost-ineffective. Instead of having the same setup five times, it may be better to split the tasks onto separate machines.

Further, having worked on many MySQL systems, one of the fastest ways to use MySQL is to have one write (master) database and many read slaves. All it takes is coders writing code to use two different DBs. Remember that for most LAMP sites, reads (selects) are used significantly more than writes. I dread it when I sometimes look at the MySQL queries constructed. Of course I understand that this is not always an option for some people, as they use off-the-web applications (many of which are pretty good) with little leeway for modification.

If you have enough traffic that you need to scale to more than a few machines (in my experience, three or more), a load balancer will save you time and aggro in the long term, if you employ a good one of course (the old WSD were really bad when state tables filled up). Cookie issues become irrelevant, as load balancers have sticky cookies ensuring the user will always go back to the same machine. A good one (I won't advertise any) also allows you many load balancing options, such as observed load balancing. Remember that even if you get even traffic across your machines, not everyone will use the same amount of resources, so some machines will be more loaded and the customer experience will worsen.

I have worked on systems exceeding 8 million hits a day, and day one of the job is always convincing the owner that buying another machine is not always the best answer. Often analysis of the existing system results in me moving resources around and setting up MySQL more efficiently (not much can be done about the poor DB schemas people employ, even when better indexes are added), and many times no new hardware is needed and the service runs faster and smoother.

There are so many resources on the Web to help with system analysis that you don't have to be a pro to work out where bottlenecks are. This is how I started, but when you have done it loads of times it becomes easier and quicker to spot, even with the changes in software and hardware.

Be flexible and open-minded. I am always surprised when I revisit old systems I have set up and find both things I could have done better and things that I did that I had forgotten were good ideas!


How to configure a low-cost load-balanced LAMP

Posted by: Anonymous Coward on April 25, 2006 08:32 PM
This is an interesting example from Wikimedia:

http://meta.wikimedia.org/wiki/Image:Wikimedia-servers-2005-04-12.png


Some misleading statements.

Posted by: Anonymous Coward on April 26, 2006 05:18 AM
I appreciate the hard work of the author of this article and the article itself; however, I find some statements misleading or inaccurate.

This statement I find very accurate: "Due to DNS caching and because some requests may use more resources than others, the load will not be shared equally," however the second part of that sentence is subjective: "but over time it should be close enough".

A following paragraph uses a suggestive statement: "Some even take into account the actual load on each Web server to maximize cluster performance instead of just delegating incoming requests evenly." It mistakenly suggests that the DNS round-robin model used in the example delegates incoming requests evenly.

On the contrary, this setup delegates DNS responses evenly, which is very different from delegating HTTP requests evenly. In order to delegate HTTP requests evenly, a router must be used instead that alternates IP requests to ports 80 and 443 among the three separate servers.


Some more thoughts of what you can use...

Posted by: Administrator on April 25, 2006 01:44 PM
There are several technologies I would use at several levels here, so let's start with Linux (L):

  • Use a clustered filesystem like the Red Hat Global File System (http://www.redhat.com/software/rha/gfs/) to share common read-only files from a single source
  • Use Linux-HA services (http://www.linux-ha.org/) with virtual IP addresses (used in DNS) to ensure that all my DNS IP nodes are always alive
  • Use something like NIS/NIS+ (http://www.linux-nis.org/) or OpenLDAP (http://www.openldap.org/) to keep common files in sync and possibly provide single sign-on and related services
  • Perhaps you may also need a good TCP load balancing solution, and I can highly recommend Pen (http://siag.nu/pen/)

In general you want your Linux servers in sync with each other, with the ability to "fail" nodes without losing connectivity from the Internet.

Now for a look at Apache (A): here I would do nothing really spectacular, except perhaps deploy mod_proxy (http://httpd.apache.org/docs/1.3/mod/mod_proxy.html) to serve cached static pages from the clustered file system.

Then, for MySQL (M), there are the new clustered services in version 5.x:

  • A HOWTO article (http://www.howtoforge.com/loadbalanced_mysql_cluster_debian) taking you step by step through the set-up of MySQL Cluster
  • You could also look at MySQL Replication (http://dev.mysql.com/doc/refman/5.1/en/replication.html) in cases where you do not need MySQL Cluster, like in e-commerce site back-ends.

Finally the P... I like Perl (http://perl.org/), and therefore I point out some cluster-oriented modules that could help:

  • With Apache, you could use mod_perl (http://perl.apache.org/), which includes several recipes for database-based session management
  • Perhaps look at something like DBIx::DBCluster (http://search.cpan.org/~arak/DBIx-DBCluster-0.01/DBCluster.pm) if you do not use something like MySQL Cluster.

Well, in a nutshell, those are the technologies I would consider when clustering LAMP. Have a nice day :)


MySQL Clustering Resources

Posted by: Administrator on April 28, 2006 05:42 PM
In addition to the resources mentioned at the end of the article, another source for in-depth information on MySQL clustering is the MySQL Press book MySQL Clustering by Alex Davies and Harrison Fisk. More info: http://www.amazon.com/gp/product/0672328550

