In my latest role at Chegg, I’ve been tasked with making sure all of the backend services and main website are able to handle the number of customers that come to us during our peak seasons (just before Fall and Spring semesters).
There have been countless mishaps along the way but certainly many successes. Some are simple tweaks that anyone can implement for free or quite cheaply, while others require coding and a good understanding of the overall architecture.
So here’s a list of website performance tricks and gotchas that I’ve encountered.
1. Data Storage
There are just about a billion options out there for storing data but they mostly fall into a few categories.
- Relational database
- NoSQL
- Cache
- Flat files
Each type has its place in the ecosystem and its own positives and negatives, but mostly I’ve learned that you shouldn’t try to force any of the options to do something that it wasn’t designed for.
For example, flat files are great for logging data and for config files that are read once during server load and cached, but horrible for storing user data, as opening and reading files on every request is an expensive (I/O-intensive) action and is really slow.
Another example that I’ve encountered recently is the whole NoSQL craze. A slew of these options have popped up over the last few years, and each has some major flaw or another that will come back to bite you as your user base grows. Take MongoDB: it’s fast and easy to use, and most languages have bindings for it. But did you know that it doesn’t have a “hot” master-to-master ability? If the data becomes corrupted or one of your data centers goes offline, you can’t just switch to your backup. If this happened during one of Chegg’s peak periods, it would cost us millions of dollars!
2. Redundancy
In one sentence: make sure that there is a redundant system for EVERYTHING! It helps to have a good idea of all of the different pieces that could fail along the way, as a few key pieces are often overlooked.
When interviewing new engineering candidates, I regularly ask them what is involved in performance testing for a Super Bowl ad. One of the facets that I’m looking for is redundancy for all systems throughout the pipeline. It’s amazing how many points of failure there really are!
Here’s a short list of oft-ignored systems that can and will break at some point and therefore require redundancy (in no particular order):
- your datacenter (yes, even Amazon EC2 can go down)
- your DNS
- your database and/or database server
- your web server or any part of your web application
- your memcache service
- the router that your web server is connected to
- your load balancer
- your queue server and/or queue service
- the network card(s) on any of your servers
- your operations team (aka you if you’re a small company)
You need a redundant system for each and every one of these types of failures to make sure your website is up most of the time. Several of these problems can be off-loaded if you pay for a managed service (Rackspace Cloud offers such a service), which for a small business is a lot cheaper than paying for all of these services and people separately.
3. HTTP Requests
It always astounds me when I load a webpage, view the network data in Firebug and see the number of uncompressed images, CSS and JavaScript files. My web application SiteLab.co helps customers identify many of these issues.
Your pages will load considerably faster if you take the following steps:
- minify CSS and JavaScript files
- combine multiple CSS files into one and multiple JavaScript files into one
- for tiny images, combine many of them together into a CSS sprite
- for larger images, reduce the resolution and set the dimensions to an exact fit
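The first two steps can be sketched in a few lines of Python. This is only a rough illustration, not a production tool: the function names are made up for this example, and the minifier is deliberately naive (a real minifier handles edge cases such as whitespace inside strings). The last function measures how large the result would be once gzip-compressed.

```python
import gzip
import re


def combine_css(sources):
    """Concatenate several CSS sources into one string (one HTTP request instead of many)."""
    return "\n".join(sources)


def naive_minify_css(css):
    """Very rough minifier: drops comments and collapses whitespace.

    Not safe for every stylesheet; it is only meant to show the idea.
    """
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # strip /* comments */
    css = re.sub(r"\s+", " ", css)                         # collapse whitespace runs
    css = re.sub(r"\s*([{};,])\s*", r"\1", css)            # trim space around { } ; ,
    return css.strip()


def gzipped_size(text):
    """Size in bytes of the text after gzip compression."""
    return len(gzip.compress(text.encode("utf-8")))
```

Comparing `len(combined)` with `len(naive_minify_css(combined))` and `gzipped_size(...)` on your own stylesheets gives a quick feel for how much there is to gain. Very small files can actually grow slightly under gzip, so measure on realistic inputs.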
4. Use a Content Delivery Network (CDN)
You can save a ton of money on servers by simply serving your static content through a CDN, which also gives you the benefit of speeding up webpage load time.
A CDN copies your static data and propagates it around the world so when your end-user from India loads your webpage that’s served from a datacenter in Dallas, the static data will actually come from a server that’s geographically closer and thus load faster.
There are two types of CDNs: those where you manually upload your static files to a repository, and those that automatically pull the content from your server when requested. A couple of examples of the former are Amazon S3 and Rackspace Cloudfiles; examples of the latter are Akamai, Limelight and CDN77.com.
5. Use the Cloud, Luke!
The cloud makes redundancy simple to achieve at a cheap price. I prefer Rackspace Cloud, but there are several other options, including Amazon EC2 and plug-and-play models like Heroku and Google App Engine.
They all have limitations and quirks but the overall value is great for small, medium and some large companies. At Chegg, we use the Amazon cloud because our company is so cyclical. During the few months a year that we’re really busy, we rent more servers and then turn them off the rest of the year. For a small to medium company, it’s just too expensive to keep a team of operations guys on the payroll to manage the physical servers around the clock.
6. Clear Out Old Log Files
Log files can quickly take up a lot of hard drive space, which is a big deal for cloud servers with limited disk space. Create a cleanup script and set it up to run nightly via a cron job that archives and/or deletes log files older than, say, one month.
There’s nothing worse than having your cloud server die because the hard drive has filled up with files that you have no use for!
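Here is a minimal sketch of such a cleanup script in Python. The function name, the `.log` suffix and the 30-day default are assumptions for illustration. It only deletes; if you need to archive first, add that step before the removal.

```python
import os
import time


def delete_old_logs(log_dir, max_age_days=30, suffix=".log", now=None):
    """Delete files in log_dir that end with suffix and are older than max_age_days.

    Returns the list of deleted paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 60 * 60
    deleted = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        # Skip anything that is not a regular file with the expected suffix.
        if not (name.endswith(suffix) and os.path.isfile(path)):
            continue
        # Age is judged by the file's last modification time.
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            deleted.append(path)
    return deleted
```

Saved as a script that calls `delete_old_logs` on your log directory, this could be scheduled from cron to run once a night. Test it on a copy of your logs before pointing it at a production server.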
7. Move Sessions to Memcache
Actually, you can use any form of caching for this; it doesn’t have to be memcache. The point is that if you’re using multiple servers, file-based sessions are useless unless they are stored in a shared directory (which is a pain to set up), because your users are not always going to reach the same server.
More than likely, when a user loads your website in a browser, some requests are going to be served by server-A and others by server-B. If the sessions are not shared in some way, your user will run into some strange behavior, such as appearing logged in on one request and logged out on the next.
To set this up, start memcached on each of your servers and configure your application on all of your servers to use the same list of memcache servers, so that every web server reads and writes sessions in the same place. Most web frameworks make this a trivial task, so don’t be alarmed.
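If your framework doesn’t do this for you, the idea looks roughly like the sketch below. Everything here is hypothetical: `SessionStore` is a made-up name, and `InMemoryCache` is a stand-in so the example runs without a memcached server. In practice you would swap it for a thin wrapper around a real memcache client that offers the same `get` and `set` methods.

```python
import json
import secrets


class InMemoryCache:
    """Stand-in for a shared cache client so the sketch runs without a server."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, expire=0):
        self._data[key] = value  # expiry is ignored in this stand-in

    def get(self, key):
        return self._data.get(key)


class SessionStore:
    """Stores session dicts in a shared cache, keyed by a random session id."""

    def __init__(self, cache, ttl_seconds=3600):
        self.cache = cache
        self.ttl_seconds = ttl_seconds

    def create(self, data):
        session_id = secrets.token_hex(16)  # unguessable id to put in the cookie
        self.save(session_id, data)
        return session_id

    def save(self, session_id, data):
        self.cache.set("session:" + session_id, json.dumps(data), expire=self.ttl_seconds)

    def load(self, session_id):
        raw = self.cache.get("session:" + session_id)
        return json.loads(raw) if raw is not None else None
```

Because two web servers that each build a `SessionStore` around the same cache see the same data, a session created while handling a request on server-A can be loaded on server-B.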
8. Email Production Errors to Your Team
Simply put, if you don’t know that your website is failing, you’re losing money! You need to know the instant that your website either stops responding or a section of your website is unavailable due to a coding issue (bug).
Most frameworks will log errors into a log file that you’ll need to parse regularly, while others create a separate file for each error. The latter is easier to deal with (scan the “error” directory regularly for new files and email them), but there are tools available that can help you parse error logs on a regular basis.
For my personal projects, I use Web2py, a Python web framework that creates a separate file for each error, and someone shared a script that sends an email for each new file found. At Chegg, our ops team built a slew of monitoring tools that keep an eye on errors.
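The “one file per error” approach can be sketched as follows. This is not the Web2py script mentioned above, just an illustration with made-up function names: one function finds error files it hasn’t seen before, and the other turns a file into an email message using Python’s standard library.

```python
import os
from email.message import EmailMessage


def find_new_error_files(error_dir, seen):
    """Return paths in error_dir that are not yet in seen, and remember them."""
    new_paths = []
    for name in sorted(os.listdir(error_dir)):
        path = os.path.join(error_dir, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            new_paths.append(path)
    return new_paths


def build_error_email(path, sender, recipient):
    """Build an email whose body is the content of one error file."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        body = handle.read()
    message = EmailMessage()
    message["Subject"] = "Production error: " + os.path.basename(path)
    message["From"] = sender
    message["To"] = recipient
    message.set_content(body)
    return message
```

Each message could then be handed to `smtplib.SMTP(...).send_message(message)` against whatever mail server you use. In a real script you would also persist the `seen` set between runs (for example in a small file) so that a restart doesn’t re-send old errors.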
9. Send Emails Via a Queue
You should try to avoid sending emails immediately during an HTTP request because if your email system is temporarily down, your user’s browser will continue to wait for a response.
If you send email requests to a queue to be processed in the near future, your user’s browser will get a response returned immediately. Nobody likes to wait for the internet!
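The pattern can be illustrated with Python’s built-in `queue` and `threading` modules. Treat this as a sketch only: an in-process queue disappears if the server restarts, so in production you would more likely use a dedicated queue server. `handle_signup` and the `send_email` callable are hypothetical names for this example.

```python
import queue
import threading


def start_email_worker(email_queue, send_email):
    """Start a daemon thread that sends queued emails one at a time."""

    def worker():
        while True:
            job = email_queue.get()  # blocks until a job is available
            try:
                send_email(job)  # a slow or failing send never blocks a web request
            except Exception as error:
                print("email failed, job dropped:", error)  # a real system would retry
            finally:
                email_queue.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def handle_signup(email_queue, address):
    """Hypothetical request handler: enqueue the email and return right away."""
    email_queue.put({"to": address, "subject": "Welcome!"})
    return "200 OK"
```

The handler returns as soon as the job is on the queue, and the worker thread talks to the mail system in the background.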
10. Cache Database Query Responses
Most of the time, your web application gathers data from your database that doesn’t change often, so it’s better to save the results of the expensive (I/O read) query in memory for short periods of time.
Again, memcache is a good option here. Most frameworks have utilities that make this trivial to implement, so there’s really no excuse not to use it. If yours doesn’t, it’s pretty easy to code this up yourself. But be aware that if you’re not sharing the cache amongst your servers, each server will have to store the query data locally, which is not horrible but not optimal either.
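Coding it up yourself might look something like this minimal sketch of a local, per-server cache with a time-to-live. `QueryCache` and its method names are made up for this example, and the 60-second default is an arbitrary assumption.

```python
import time


class QueryCache:
    """Caches query results in memory for a short time-to-live (TTL)."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, run_query):
        entry = self._entries.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the database
        value = run_query()  # cache miss or expired: run the expensive query
        self._entries[key] = (now + self.ttl_seconds, value)
        return value
```

You would call it with a key that identifies the query (for example the SQL text plus its parameters) and a function that actually runs it. The first call hits the database, and later calls within the TTL return the stored result.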
*Bonus Tip*
11. Many of Your Users Use Mobile Devices
This is often overlooked by small businesses that have enough trouble just trying to get their websites performing on a PC. You have to keep in mind that mobile users often use the internet via 3G networks which are pretty slow compared to the broadband that you use at home. Also, mobile devices have slower processors and less memory to load large web pages.
A common technique to appease mobile users is to create a separate “mobile” version of a website. I know it’s a pain to create another website, but if your paying customers want to use your website from a mobile device, you don’t have much of a choice.
So there you have it!
I hope you found some useful nuggets within this article. If so, please share this with your friends. You can also leave feedback if I’m totally off-base or leave a kind message if you’re happy.
Enjoy!
Jim Kellas
