A content management system (CMS) is a great way to build a website. Being able to assemble the application you need in days, without writing a line of code, is a big leap in the software development world. People and organizations that just need a web presence no longer have to hire a developer and a graphic designer. They can buy affordable web hosting, install a CMS (with or without an expert's help), buy a ready-made template, write some content, and that's it! It might be a little harder for an eCommerce or large site, but in general this approach is pretty easy to implement. Moreover, business users get role-based access to the back end, CMS and plugin vendors publish updates regularly, hosting providers handle backups and a certain level of security, and a lot of other things happen almost without your involvement.
Unfortunately, in the world of medium to large enterprises things are not as easy as described above, mainly because shared hosting is not an option for large organizations or for organizations that expect a huge load on their servers. Security, scalability and performance concerns force these organizations to use dedicated servers. In addition, these companies usually need proxy servers, load balancers, integration with other systems, etc.
In terms of hardware and data center options these organizations are usually fine, as long as the budget is available. But how to scale your CMS might be an issue. Joomla, which is what we are talking about from here on, is open source, so there is no vendor to help you much. You have to hire someone with this experience. If you don't have anyone, I hope this post will give you some useful information and help you overcome this problem.
We will discuss a situation where we have two dedicated web servers (also referred to as "nodes") and one load balancer. If needed, you can easily scale this solution to a larger setup.
Each node has both the files and the database. It is really important to synchronize all nodes properly, as all N servers (2 in this example) must act as one organism. We will discuss filesystem and database synchronization a little later.
Alright, as you can see in the image on the left, any request coming from users is first handled by the load balancer. We will use cookie-based persistent sessions, which means that session information (which server the user is assigned to) is written to a cookie that is attached to the HTTP requests sent to the server. The load balancer reads this information and directs you to the server you are assigned to. This is required to keep a user on the same server with each request. Otherwise, you might have problems with authentication and some other parts of your system. After that, one of the servers processes the request and returns a result.
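To make the cookie-based persistence described above more concrete, here is a minimal Python sketch of the routing decision. It is only a toy model, not how any particular load balancer is implemented: the node names, the SERVERID cookie name and the round-robin choice for first-time visitors are all assumptions made for illustration.

```python
from itertools import cycle

NODES = ["node1", "node2"]   # hypothetical node names
COOKIE_NAME = "SERVERID"     # hypothetical name of the persistence cookie


class StickyBalancer:
    """Toy model of cookie-based session persistence."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = cycle(self.nodes)  # round-robin for first-time visitors

    def route(self, cookies):
        """Return (node, cookies_to_set) for a single request."""
        node = cookies.get(COOKIE_NAME)
        if node in self.nodes:
            # Known visitor: keep them on the node named in the cookie.
            return node, {}
        # New visitor (or unknown cookie value): pick a node and remember it.
        node = next(self._next)
        return node, {COOKIE_NAME: node}


balancer = StickyBalancer(NODES)
first_node, set_cookies = balancer.route({})      # first visit, no cookie yet
second_node, _ = balancer.route(set_cookies)      # follow-up request with cookie
print(first_node, second_node)
```

Both requests in the example end up on the same node, which is the behavior we rely on for authentication to work.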
For database synchronization we need to configure database replication. Usually, both master-master and master-slave approaches are fine. In this example, let's stick to master-master. In this case, any change made in either database is promptly replicated to the other one.
For file synchronization we will use the rsync utility. It copies only the files that have changed since the last run, which is exactly what we need. We will then schedule rsync to run every N minutes, where N depends on how often the files change. If you are running a large, active portal, it can be set to 15 minutes; if the content doesn't change much and you use multiple servers just for availability, once a day should be sufficient.
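As a rough illustration of the file synchronization step, the Python sketch below only builds the rsync command line for pushing the site files from the source node to one destination node. The paths and the host name are made up, and the choice of options (-a, -z, --delete, SSH transport) is an assumption about a typical setup, not a requirement.

```python
import shlex


def build_rsync_command(source_dir, dest_host, dest_dir, delete=True):
    """Build an rsync command that mirrors source_dir to dest_host:dest_dir."""
    cmd = ["rsync", "-az"]        # -a: archive mode, -z: compress during transfer
    if delete:
        cmd.append("--delete")    # remove destination files missing on the source
    cmd += ["-e", "ssh"]          # use SSH as the transport
    # Trailing slash on the source: copy the directory's contents,
    # not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", f"{dest_host}:{dest_dir}"]
    return cmd


# Hypothetical paths and host name, for illustration only.
command = build_rsync_command(
    "/var/www/joomla", "node2.example.com", "/var/www/joomla/"
)
print(shlex.join(command))
# rsync -az --delete -e ssh /var/www/joomla/ node2.example.com:/var/www/joomla/
```

The resulting list can be passed to subprocess.run(command, check=True), or the printed command can be put into a cron job; a schedule of */15 * * * * would correspond to the 15-minute interval mentioned above.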
Administration interface access
As files are copied from a single source node to all other (destination) nodes, it's important to make changes only on the source node. Otherwise, changes made on destination nodes will be overwritten and lost forever during the next rsync run. Also, if rsync is not configured to delete files that don't exist on the source, you will end up with files that are present on a destination node but not on the source node.
So, how do we avoid a mess with the files? The easiest way is to restrict access to the source server only. I'll explain how to do that for the administration interface (yourdomain.com/administrator), but you can use this idea for front-end editing too.
The idea is very simple. As we know the IPs of all our nodes, we can check whether we are on the right server. To accomplish this, we need to add a small script to the administration login page (/administrator/modules/mod_login/tmpl/default.php). This script should do the following:
- Detect which node we are on. We can use the $_SERVER['SERVER_ADDR'] variable for this.
- (Optional) Skip further script execution for some IP addresses (e.g. development or local servers).
- If the detected server is not the source one, disable the login and password fields so that content editors and admins cannot log in (you can do that with CSS or JS).
- Render a link that will redirect the user to the source node. Preferably add icons, so users can see at a glance that they are in the wrong place.
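The decision logic behind the checks listed above can be sketched as follows. The real script would live in the PHP login template and read $_SERVER['SERVER_ADDR']; this Python version only models the logic, and the IP addresses and the function name are hypothetical.

```python
SOURCE_NODE_IP = "10.0.0.1"      # hypothetical IP of the source node
SKIP_CHECK_IPS = {"127.0.0.1"}   # hypothetical development/local addresses


def login_form_state(server_addr):
    """Decide what the admin login page should show on this node.

    server_addr plays the role of $_SERVER['SERVER_ADDR'] in the PHP template.
    """
    if server_addr in SKIP_CHECK_IPS:
        # Development or local server: leave the login form untouched.
        return {"login_enabled": True, "show_switch_link": False}
    on_source = server_addr == SOURCE_NODE_IP
    return {
        "login_enabled": on_source,          # disable the fields on other nodes
        "show_switch_link": not on_source,   # offer a way to reach the source node
    }


print(login_form_state("10.0.0.2"))
# {'login_enabled': False, 'show_switch_link': True}
```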
The last point is not always easy, as usually you cannot access nodes by IP from the Web, only via the load balancer. This means you cannot provide a link that does the redirection. What you can do is play a trick with the cookies. As I wrote earlier, the load balancer uses information from the cookies to decide which node a user needs to be sent to. So why don't we just remove that cookie and refresh the page to get it regenerated? In the most primitive implementation, the user just keeps clicking the "change server" link until the correct server is selected. The good news is that users won't need to do that every time. When a user eventually hits the right server, the cookie is set and the load balancer keeps sending him/her to the same place, until the cookie expires or is regenerated, of course.
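One possible way to remove the cookie is to send a response header that expires it. The Python sketch below builds such a header with the standard http.cookies module. The cookie name SERVERID is an assumption, since the real name depends on how your load balancer is configured, and the path (and domain, if one is set) must match the values the balancer used when it created the cookie.

```python
from http.cookies import SimpleCookie


def expire_cookie_header(name, path="/"):
    """Build a Set-Cookie header that tells the browser to drop a cookie."""
    cookie = SimpleCookie()
    cookie[name] = ""              # empty value
    cookie[name]["path"] = path    # must match the path of the original cookie
    cookie[name]["max-age"] = 0    # expire immediately
    return cookie.output()         # "Set-Cookie: ..." header line


# "SERVERID" is a hypothetical name for the load balancer's cookie.
header = expire_cookie_header("SERVERID")
print(header)
```

After the browser processes this header and the page is refreshed, the next request arrives without the cookie, so the load balancer assigns a server again.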
The image below is an example of an admin page with the redirection mechanism implemented:
The only real drawback of this approach is that load balancing does not happen properly. When people return to our website, they already have the cookie set and are already assigned to one of the servers. So returning users keep accessing the servers they used before, while new users are distributed 50/50. This may lead to a situation where new users hit a server together with the users who are already assigned to it, so the load is not equally distributed among the nodes. If you have 2 servers and a moderate number of users, this shouldn't be a problem. But imagine you have 8-10 servers and suddenly one or two of them are hit by the majority of the users! In this case you have to stick to regular load balancing and exclude the content authoring server. This server will be accessed directly, not through the load balancer, and only its data will be copied to all other servers. In this case it makes sense to block the content authoring server from being accessed from the Web. Let your content editors access it from the corporate network or via VPN only.
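A quick back-of-the-envelope calculation shows how the imbalance builds up. The Python sketch below assumes returning users stay pinned to their node while new users are split evenly; the user counts are made-up numbers for illustration.

```python
def projected_load(returning_per_node, new_users):
    """Returning users stay pinned; new users are split evenly across nodes."""
    nodes = len(returning_per_node)
    share, remainder = divmod(new_users, nodes)
    return [
        pinned + share + (1 if i < remainder else 0)
        for i, pinned in enumerate(returning_per_node)
    ]


# Made-up numbers: 900 returning users pinned to node 1, 100 pinned to node 2,
# plus 200 new users distributed 50/50.
print(projected_load([900, 100], 200))  # [1000, 200]
```

Even though the new users are split evenly, the first node ends up serving five times as many users as the second one.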
In this post we are mostly targeting people who cannot afford a dedicated specialist or, for some reason, are not able to find one. For this reason, we focus on one of the easiest approaches available. The main idea is to have a load balancer that distributes the load and several servers that work as a single system. To be a single system, those servers have to keep their files and databases synchronized. For these tasks we use the rsync utility and database replication. To make sure files are synchronized properly, we restrict content editors' access to the server the files are copied from. And don't forget about the load balancing problem described above.