In my latest role at Chegg, I’ve been tasked with making sure all of the backend services and main website are able to handle the number of customers that come to us during our peak seasons (just before Fall and Spring semesters).

There have been countless mishaps along the way but certainly many successes.  Some are simple tweaks that anyone can implement for free or quite cheaply, while others require coding and a good understanding of the overall architecture.

So here’s a list of website performance tricks and gotchas that I’ve encountered.

1. Data Storage

There are just about a billion options out there for storing data but they mostly fall into a few categories.

  1. Relational database
  2. NoSQL
  3. Cache
  4. Flat files

Each type has a place in the ecosystem and has its own positives and negatives, but mostly I’ve learned that you shouldn’t try to force any of the options to do something it wasn’t designed for.

For example, flat files are great for logging data and for config files that are read once during server load and cached, but horrible for storing user data, as the I/O of opening files is quite expensive (CPU intensive) and really slow.

Another example I’ve encountered recently is the whole NoSQL craze.  A slew of these options have popped up over the last few years, and they all have some major flaw or another that’ll come back to bite you as your user base grows.  Take MongoDB: it’s fast and easy to use, and most languages have bindings for it.  But did you know that it doesn’t have a “hot” master-to-master ability?  If the data becomes corrupted or one of your data centers goes offline, you can’t just switch to your backup.  If this happened during one of Chegg’s peak periods, it would cost us millions of dollars!

2. Redundancy

To put it in one sentence: make sure that there is a redundant system for EVERYTHING!  It helps to have a good idea of all of the different pieces that could fail along the way, as a few key pieces are often overlooked.

When interviewing new engineering candidates, I regularly ask them what is involved in performance testing for a Super Bowl ad.  One of the facets I’m looking for is redundancy for all systems throughout the pipeline.  It’s amazing how many points of failure there really are!

Here’s a short list of oft-ignored systems that can and will break at some point and therefore require redundancy (in no particular order):

  • your datacenter (yes, even Amazon EC2 can go down)
  • your DNS
  • your database and/or database server
  • your web server or any part of your web application
  • your memcache service
  • the router that your web server is connected to
  • your load balancer
  • your queue server and/or queue service
  • the network card(s) on any of your servers
  • your operations team (aka you if you’re a small company)

You need a redundant system for each and every one of these types of failures to make sure your website stays up.  Several of these problems can be off-loaded if you pay for a managed service (Rackspace Cloud offers one), which for a small business is a lot cheaper than paying for all of these services and people separately.

3. HTTP Requests

It always astounds me when I load a webpage, view the network data in Firebug, and see the number of uncompressed images, CSS and JavaScript files.  My web application SiteLab.co helps customers identify many of these issues.

You’ll see considerable page-load performance benefits by doing the following:

  • minify CSS and JavaScript files
  • combine multiple CSS files and multiple JavaScript files
  • for tiny images, combine many of them into a single CSS sprite
  • for larger images, lower the DPI and set the dimensions to an exact fit
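
To illustrate the first two bullets, here’s a deliberately naive sketch of combining and stripping down CSS/JS sources.  This is only for illustration; a real minifier (the kind of tool SiteLab-style audits recommend) also shortens identifiers and handles comment-like text inside strings, so use a proper tool in production.

```python
import re

def combine_and_minify(sources):
    """Concatenate CSS/JS sources, then strip block comments and blank lines."""
    combined = "\n".join(sources)
    # remove /* ... */ block comments (naive: ignores comment markers in strings)
    combined = re.sub(r"/\*.*?\*/", "", combined, flags=re.DOTALL)
    # trim each line and drop the empties
    lines = [line.strip() for line in combined.splitlines()]
    return "\n".join(line for line in lines if line)

css_files = [
    "/* header styles */\nh1 { color: #333; }\n",
    "/* footer styles */\n.footer { font-size: 12px; }\n",
]
print(combine_and_minify(css_files))
```

Serving the combined result as one file means one HTTP request instead of many, which is where most of the win comes from.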

4. Use a Content Delivery Network (CDN)

You can save a ton of money on servers by simply serving your static content through a CDN, which also gives you the benefit of speeding up webpage load time.

A CDN copies your static data and propagates it around the world so when your end-user from India loads your webpage that’s served from a datacenter in Dallas, the static data will actually come from a server that’s geographically closer and thus load faster.

There are two types of CDNs: those where you manually upload your static files to a repository, and those that automatically pull the content from your server when requested.  A couple of examples of the former are Amazon S3 and Rackspace Cloud Files; examples of the latter are Akamai, Limelight and CDN77.com.

5. Use the Cloud, Luke!

The cloud makes redundancy simple to achieve at a low price.  I prefer Rackspace Cloud, but there are several other options, including Amazon EC2 and plug-’n-play models like Heroku and Google App Engine.

They all have limitations and quirks but the overall value is great for small, medium and some large companies.  At Chegg, we use the Amazon cloud because our company is so cyclical.  During the few months a year that we’re really busy, we rent more servers and then turn them off the rest of the year.  For a small to medium company, it’s just too expensive to keep a team of operations guys on the payroll to manage the physical servers around the clock.

6. Clear Out Old Log Files

Log files can quickly eat up a bunch of hard-drive space, which is a big deal for cloud servers with limited disk space.  You need to create a cleanup script, set up to run nightly via a cron job, that archives and/or deletes log files older than, say, one month.

There’s nothing worse than having your cloud server die because the hard drive filled up with a bunch of files you have no use for!
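
A minimal sketch of such a cleanup script (the directory path and age threshold are just examples; in real use you’d probably gzip-archive before deleting):

```python
import os
import time

def clean_old_logs(log_dir, max_age_days=30):
    """Delete log files older than max_age_days; meant to run nightly from cron."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        # compare the file's last-modified time against the cutoff
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)  # or archive it first with gzip/shutil
            removed.append(name)
    return removed
```

Then a crontab entry along the lines of `0 3 * * * python /path/to/clean_logs.py` (hypothetical path) runs it every night at 3 AM.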

7. Move Sessions to Memcache

Actually, you can use any form of caching for this; it doesn’t have to be memcache.  The point is that if you’re using multiple servers, file-based sessions are useless unless they are stored in a shared directory (which is a pain to set up), because your users are not always going to come to the same server.

More than likely, when a user loads your website in a browser, some requests are going to come from server-A and others from server-B.  If the sessions are not shared in some way, your user will run into some strange behavior.

To set this up, start memcached on each of your servers and configure your application on all of your servers to duplicate the session in each of the memcache servers.  Most web frameworks make this a trivial task so don’t be alarmed.
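
The duplication pattern looks roughly like this.  In this sketch, plain dicts stand in for the memcached instances so the idea is runnable on its own; in production you’d swap in real memcache clients (e.g. python-memcached) pointed at each server.

```python
class ReplicatedSessionStore:
    """Writes every session to every cache node so any web server can read it."""

    def __init__(self, nodes):
        self.nodes = nodes

    def save(self, session_id, data):
        for node in self.nodes:        # duplicate the session on each node
            node[session_id] = data

    def load(self, session_id):
        for node in self.nodes:        # read from the first node that has it
            if session_id in node:
                return node[session_id]
        return None

# two dicts standing in for memcached running on server-A and server-B
server_a_cache, server_b_cache = {}, {}
store = ReplicatedSessionStore([server_a_cache, server_b_cache])
store.save("sess-123", {"user_id": 42})
```

Whichever server handles the next request, the session is already there.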

8. Email Production Errors to Your Team

Simply put, if you don’t know that your website is failing, you’re losing money!  You need to know the instant that your website either stops responding or a section of your website is unavailable due to a coding issue (bug).

Most frameworks will log errors into a log file that you’ll need to parse for errors regularly, while others create a separate file for each error.  The latter is easier to deal with – scan the “error” directory regularly for new files and email them – but there are tools available that can help you parse error logs on a regular basis.

For my personal projects, I use Web2py, a Python web framework that creates separate files for each error, and someone shared a script that sends an email for each new file found.  At Chegg, our Ops team built a slew of monitoring tools that keep an eye on errors.
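
The scan-and-email loop is simple enough to sketch.  This is not the Web2py script mentioned above, just an illustration of the pattern; `notify` is a placeholder you’d implement with smtplib, and the `seen` set would need to be persisted between cron runs.

```python
import os

def scan_for_new_errors(error_dir, seen, notify):
    """Email (via notify) each error file that hasn't been reported yet."""
    for name in sorted(os.listdir(error_dir)):
        if name not in seen:
            path = os.path.join(error_dir, name)
            with open(path) as f:
                # in production, notify() would send mail via smtplib
                notify(subject="Production error: %s" % name, body=f.read())
            seen.add(name)
```

Run it every few minutes from cron and your team hears about breakage within minutes instead of days.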

9. Send Emails Via a Queue

You should try to avoid sending emails immediately during an HTTP request because if your email system is temporarily down, your user’s browser will continue to wait for a response.

If you send email requests to a queue to be processed in the near future, your user’s browser will get a response returned immediately.  Nobody likes to wait for the internet!
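
A bare-bones sketch of the idea using Python’s standard-library queue and a worker thread (a real deployment would more likely use a dedicated queue server like the one in the redundancy list above; `send_email` here is a stand-in for smtplib):

```python
import queue
import threading

sent = []

def send_email(to, subject):
    # stand-in for smtplib; in production this talks to your mail server
    sent.append((to, subject))

email_queue = queue.Queue()

def email_worker():
    """Background worker: pulls jobs off the queue and sends them."""
    while True:
        job = email_queue.get()
        if job is None:          # sentinel to shut the worker down cleanly
            break
        send_email(**job)        # the slow part happens outside the request
        email_queue.task_done()

def handle_signup_request(address):
    """The HTTP handler only enqueues; the browser gets its response immediately."""
    email_queue.put({"to": address, "subject": "Welcome!"})
    return "202 Accepted"

threading.Thread(target=email_worker, daemon=True).start()
```

The request handler returns as soon as the job is enqueued; if the mail server is down, the job just waits in the queue instead of hanging a user’s browser.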

10. Cache Database Query Responses

A majority of the time, your web application will gather data from your database that will not change often, so it’s better to save the expensive (I/O read) query data in memory for short periods of time.

Again, memcache is a good option here.  Most frameworks have utilities that make this task trivial to implement, so there’s really no excuse not to use it.  If not, it’s pretty easy to code this up yourself.  But be aware that if you’re not sharing cache amongst your servers, each server will have to store the query data locally, which is not horrible but not optimal either.
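
If you did code it up yourself, the heart of it is a time-limited memoizer like this sketch (the in-process dict is the “local, not shared” variant mentioned above; swapping the dict for a memcache client gives you the shared version):

```python
import time
import functools

def cache_for(seconds):
    """Memoize a function's result for `seconds` before re-querying."""
    def decorator(fn):
        cached = {}  # maps args -> (expiry_time, value)
        @functools.wraps(fn)
        def wrapper(*args):
            now = time.time()
            if args in cached and cached[args][0] > now:
                return cached[args][1]            # still fresh: skip the DB
            value = fn(*args)                     # expired or missing: re-query
            cached[args] = (now + seconds, value)
            return value
        return wrapper
    return decorator

query_count = [0]

@cache_for(60)
def get_book_title(book_id):
    # stand-in for an expensive database query
    query_count[0] += 1
    return "Title #%d" % book_id
```

Repeated calls within the 60-second window never touch the database.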

*Bonus Tip*

11. Many of Your Users Use Mobile Devices

This is often overlooked by small businesses that have enough trouble just trying to get their websites performing on a PC.  You have to keep in mind that mobile users often use the internet via 3G networks which are pretty slow compared to the broadband that you use at home.  Also, mobile devices have slower processors and less memory to load large web pages.

A common technique to appease mobile users is to create a separate “mobile” version of a website.  I know it’s a pain to create another website, but if your paying customers want to use your website from a mobile device, you don’t have much of a choice.

So there you have it!

I hope you found some useful nuggets within this article.  If so, please share this with your friends.  You can also leave feedback if I’m totally off-base or leave a kind message if you’re happy.

Enjoy!

Jim Kellas

I’m quite excited to announce the release of my latest creation, SiteLab.co!  SiteLab helps small ecommerce websites load quickly and stay available at all hours of the day.

It does this by analyzing a website for speed issues and continuously loads the website to make sure it’s available for the general public.

Lastly, I added a feature that allows you to take screenshots of your website in any modern browser/OS combination imaginable — yes, iPhone, iPad and Android devices included!

You can check out the application at http://sitelab.co!

Cheers!

Jim Kellas


I’ll be presenting at the inaugural South Bay Automation Meetup @ Chegg headquarters at 6:00 PM. The address is 2350 Mission College Blvd Suite 1400, Santa Clara, CA (across the street from Intel).

The South Bay Automation Meetup is an effort to cross-pollinate automation strategies between different software companies in the south bay.  It’s amazing how many interesting ideas exist out there!  In a Meetup-type setting, not only do you get to learn about the subject but you can always ask questions and add your own ideas to the theme.

During my presentation, I’ll be discussing an alternate way of automating a website without the constant maintenance needed to keep up with minor design changes (hint: bypass the web browser).

Hope to see you there!

btw, did I mention it’s FREE to attend and participate???

Here’s a link to the Meetup:  http://www.meetup.com/Bay-Area-Software-Quality-Group/events/17529888/

There are times when writing code where nothing wants to work and/or you just can’t get past a hurdle.  My advice to you: Step Away from the Computer!

More than likely, you’re doing something really lame but you’re not going to figure it out until you take a breather.  If you’re like me, you’ll probably run the problem through your head a hundred times after you step away and that’s OK.  Just make sure to physically step away and do something else – like sleep – before trying to tackle the problem again.  Another option is to run the problem past one of your colleagues who may or may not be able to help you but just speaking out the problem may help you to solve it on your own.

I ran into this today and followed the latter advice.  While explaining my problem to a PM sitting next to me (which of course required me to explain the problem in more detail), I noticed a clue that I had missed that eventually led me to the answer.  I don’t want to bore you with the details of the problem or the clue, but understand that “- 8” does not equal “+ 8”.  :)

In short, my problem turned out to be a PEBCAK (look it up) situation.  <sigh>

I joined the software QA ranks at PayPal in 2000.  It was near the end of the big bubble and there was still a major glut of tech jobs vs. people to fill the jobs so some VERY unqualified people were hired to fill the roles, myself included.  The role was pretty simple at the time and didn’t require a lot of technical knowledge.  Luckily, I had a really good – albeit narcissistic – manager to mentor my way into the world of blackbox QA.

Well, things have changed over the past 11+ years.  It’s rare to find QA positions that don’t require some level of programming ability.  Software QA Engineers today (often called Test Engineers) work side-by-side with developers and help build testable products.  In other words, before a product is built, unit tests and test stubs are created that allow QA to thoroughly test it early and often.  Of course, the test engineer needs to understand the product from top to bottom in order to request (or build himself) the unit tests and necessary stubs.  This puts QA smack dab in the middle of the initial meetings when designing a new product.  Looking back to my blackbox days, I was always told about a new product after it had been built and expected to test it.  SERIOUSLY?!?

QA engineers today test features from every angle: web server, DB schema, CDN, S3, file system, DRM, cloud servers, co-location implications, server hardening, OS and browser compatibility, A/B tests, load, stress and performance tests, acceptance, fail over, usability, integration, etc.  We’re also responsible not just for reporting a bug but for digging into the code and finding out what happened (or at least narrowing down the search).  We need to know why a failure occurred so that we can better prioritize the bug.

QA Engineers need to understand that a bug becomes more and more expensive the longer it exists within the code base.  A bug that makes it to production costs at least 10x more to fix than if it had been found early, which is why we need to identify potential weaknesses early in the cycle.  Never has the term “time is money” rung so true.

I often wonder where I’d be if I hadn’t taken a few programming courses in 2004 (C++, Python, PHP).  “Would you like fries with that?” comes to mind.

There is a place for Selenium and test tools of its ilk, but not to the extent that I see many companies using it.  It can be made relatively maintenance-free with a bit of finessing and with product and development teams that think about testing continuously.  Of course, this is usually not the case, and release after release, I see more and more time being devoted to just getting the test cases to work.  You see, with an agile environment comes A/B testing, rapid iteration and many changes to the GUI.

Test case maintenance should never take so much time that new test case creation suffers.  This is an overarching problem that I find with web-based functional testing.  Since this is the area that’s constantly changing – visually, not functionally – teams should automate at this layer sparingly.  I’m a big fan of back-end automation, and what little front-end automation I endorse usually eliminates the browser altogether.

Essentially, you get the best bang for your front-end functional automation buck by simply POSTing/GETing forms via an xUnit test framework, without worrying about your test cases breaking due to GUI changes.  There’s no reason why a functional test case should fail because a button moved on a page or an image is taking too long to load – not that the latter is unimportant, but a different test case verifying that behavior should fail instead.

I demonstrated this kind of testing at past companies and recently at Chegg.  Even though my script hadn’t been used in weeks, the only maintenance needed was a change to a POST url and the addition of another parameter (both of which would have also failed with Selenium, btw).  The script is not part of the regular regression suite simply because it’s used as a helper script to speed up tedium and thus was not used for a while.

To be clear, it’s a slow process to get started automating in this way, but the extra time spent building the test cases is FAR outweighed by the time saved in the long run.  To illustrate this: the above script was created because the Selenium automation engineer refused to automate the functionality; those web pages constantly changed and would have required maintenance every week.  Besides, once you start building up a library of HTML parsers and POST/GET URLs, creating new test cases gets faster and faster.

Here’s an example of how simple a test case can become after the necessary libraries are built (example in Python, of course):

# open a new liquidation cart
url = cfg.get('urls', 'openCart') % wimsBaseUrl
elements = {'universal_input': cartName}
result = CL.postHtml(url, **elements)
myTag = CL.getTagData(result, 'input', **myArgs)

The “url” is the url to POST the parameters to, “elements” is a Dictionary of the parameters being POSTed, “result” is the HTML response after the POST and “myTag” is an example of a function used to parse the HTML (check out BeautifulSoup for Python).
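
For anyone curious what such helpers might look like inside, here’s a hypothetical sketch built on Python 3’s standard library.  The names mirror the snippet above, but this is my own reconstruction, not the actual `CL` library (which, as noted, used BeautifulSoup-style parsing):

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

class TagFinder(HTMLParser):
    """Collects the attribute dict of every <tag> matching the given filters."""
    def __init__(self, tag, **filters):
        super().__init__()
        self.tag, self.filters, self.matches = tag, filters, []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.tag and all(attrs.get(k) == v
                                   for k, v in self.filters.items()):
            self.matches.append(attrs)

def get_tag_data(html, tag, **filters):
    """Return attribute dicts for all matching tags in the HTML."""
    finder = TagFinder(tag, **filters)
    finder.feed(html)
    return finder.matches

def post_html(url, **params):
    """POST form parameters to url and return the HTML response."""
    return urlopen(url, urlencode(params).encode()).read().decode()

# parsing works on any HTML string, no browser required
sample = '<form><input type="hidden" name="cart_id" value="42"></form>'
tags = get_tag_data(sample, "input", type="hidden")
```

A test case built on these helpers asserts against the parsed form data, so a button moving around the page can’t break it.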

In conclusion, slow and steady wins the race.
