201: HTML: Caching Generated Output For Speed.

Now that you can generate HTML, why would you ever want to go back to the old way of unchanging HTML?

It takes some amount of work to generate HTML. I’m not talking about the work to program this. Make sure to listen to the previous episode for more information. I’m talking about the work that the computer itself needs to do in order to generate HTML. You might think your computer is busy when it’s starting up an application or when uploading or downloading files, but this is called input/output or just IO for short. The computer might be busy but the processor itself will usually be just sitting around waiting for the application to be read from your hard drive or for your files to be sent over the communication lines. IO can slow down what you’re trying to do but it still leaves the processor with time to cool down.

For a normal request, the CMS will generate HTML and send it to the visitor. The web server forgets all about the HTML that was just sent. If another visitor lands on the same page, then the same work needs to be done all over again. Anytime you find yourself writing or using code that does anything over and over again, that’s a good opportunity to cache the results. And setting up a reasonable cache system in your CMS will help your web server survive a massive increase in visitors. That extra work needed to generate HTML pages may not be much but if your site suddenly gets popular and you get millions of visitors, then your web server just won’t be able to keep up with the total extra work and your whole website will go down. Caching HTML pages for regular visitors will save a lot of work when they would have all gotten the same HTML files anyway.

Listen to the full episode for more details including advice on when and how to invalidate the cache so that visitors will be able to get updated web pages. You can also read the full transcript below.

Transcript

We don’t feel this work because we’re not the ones running the code. The computer is doing all the work and other than sometimes noticing when the fan needs to turn on, we normally don’t think about how much stress we place on a computer.

Maybe that’s because normally, we don’t require much from our personal computers. Just writing email or browsing the web leaves your computer idle most of the time. You can be the world’s fastest typist and the computer will still be sitting around waiting for each key to be pressed.

Watching a DVD will put your computer under more pressure but it can still easily handle the work. Now, if you want to see your computer struggle, then try creating your own video. There’s a lot of work involved in rendering a video.

You can also try compiling a large computer program. It all comes down to this. Your computer will be at its busiest whenever it has to create something that needs a lot of calculations.

You might think your computer is busy when it’s starting up an application or when uploading or downloading files, but this is called input/output or just IO for short. The computer might be busy but the processor itself will usually be just sitting around waiting for the application to be read from your hard drive or for your files to be sent over the communication lines. IO can slow down what you’re trying to do but it still leaves the processor with time to cool down.

Generating HTML has the potential to be heavily IO bound but it does need some amount of work from the processor. It shouldn’t require too much work. And the time the processor spends waiting for database calls to complete, so it has the basic information it needs to put the HTML files together, is more than enough to let the processor cool down.
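To make the difference between waiting and working a little more concrete, here is a minimal sketch in Python. It assumes a made-up function, simulated_database_call, that just sleeps to stand in for a real database or disk wait. It compares the time that passes on the clock with the time the processor actually spends working.

```python
import time

def simulated_database_call(seconds):
    # Hypothetical stand-in for a database or disk wait.
    # The processor is idle while the program sleeps.
    time.sleep(seconds)

def measure(work):
    # perf_counter tracks elapsed (wall clock) time.
    # process_time tracks processor time only and does not count sleeping.
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

wall, cpu = measure(lambda: simulated_database_call(0.2))
print(f"elapsed: {wall:.3f}s, processor time: {cpu:.3f}s")
```

The elapsed time should come out close to the length of the wait while the processor time stays near zero, which is what IO bound work looks like.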

Your website visitors certainly won’t notice a few extra milliseconds as your web server figures out who the visitor is and what content should be returned as HTML.

It seems like a perfect solution. You can put the text of your website articles and pages in a database along with information about how you want your website to look such as what each menu item will do. And you can also store information about the users so your site will know which visitors should get which content.

Once all this is set up, you can use your website itself to add more pages and change things around. You can even change how the site looks with themes. All the coding to support this is already done for you if you’re using something like WordPress.

This whole system is known as a content management system or CMS for short. That’s because you can build up your website by focusing on the content and how it should behave instead of creating all the HTML files yourself. You can let the CMS software running on your web server do all the work. You’ll probably not have to create a single HTML file yourself.

There is some amount of work that needs to be done for each request. Each time a visitor lands on one of your web pages, the server computer needs to run code to generate the HTML. You’re trading some small amount of processing time for a huge benefit in easy website creation and maintenance.

Is there any reason why you’d want to go back to the old way of just sending out HTML pages that don’t change?

Well, before you start creating your own HTML files, I’d say that you should at least consider using a content management system. It could be better for you even if you only have a few pages. It’s a little extra work on your part to learn how to install and configure the system but well worth it.

Now once you have WordPress, or Drupal, or Joomla, or another CMS installed, you might still want to have the system itself send out HTML files that don’t change. The key here isn’t that the HTML never changes, but that it probably doesn’t need to change so often that the HTML needs to be generated each time.

For a normal request, the HTML will be generated and sent to the visitor. The web server forgets all about the HTML that was just sent. If another visitor lands on the same page, then the same work needs to be done all over again.

Anytime you find yourself writing or using code that does anything over and over again, that’s a good opportunity to cache the results. We do this all the time in real life. If you work in a store where you keep getting the same question from customers, you might want to make a sign with the answer instead. Opening and closing hours are a good example. And most customers will expect a sign on the front door with the hours clearly posted. This works really well, until you need to close early one day.

Exceptions to the normal reply are harder to handle with a sign and the same thing applies to caching HTML files. Actually, caching anything will have this problem. A cache is just a place to put things where they can be reused later. How much to cache is a balance, and you’ll need to test to see what your specific needs are.

On one side of the balance, there’s no caching at all. This never has any problems with special cases because the results are always calculated each time. It’s like taking down the sign at the front of the store with the hours and making customers ask each time. You’ll be able to reply with any special holiday closing hours and customers will always get the most up-to-date information. It means more work for you though. And if your site always generates the HTML, then visitors will always get the most current and up-to-date pages even right after you save changes. It means more work for your web server though.

On the other extreme of the balance, there’s full caching of everything so that it never expires. Once a particular HTML page is generated, instead of just returning it to the visitor and forgetting about it, the content management system will save the HTML so it never has to do that work again. This would be like posting your store hours chiseled in the stone wall of your store itself. It’ll last a really long time but the hours can never be updated.
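The two extremes can be sketched in a few lines of Python. This is only an illustration and not how any particular CMS does it. The names generate_html, serve_without_cache, and serve_with_forever_cache are made up for the example, and a plain dictionary stands in for both the database and the cache.

```python
# Hypothetical content store standing in for the CMS database.
content = {"hours": "Open 9 to 5"}
render_count = 0

def generate_html(page):
    # Stand-in for the real work a CMS does to build a page.
    global render_count
    render_count += 1
    return f"<html><body><p>{content[page]}</p></body></html>"

def serve_without_cache(page):
    # One extreme: always generate, so the result is always current.
    return generate_html(page)

forever_cache = {}

def serve_with_forever_cache(page):
    # Other extreme: generate once and never look at the content again.
    if page not in forever_cache:
        forever_cache[page] = generate_html(page)
    return forever_cache[page]

first = serve_with_forever_cache("hours")
content["hours"] = "Closing early at 3 today"
stale = serve_with_forever_cache("hours")  # still the old hours
fresh = serve_without_cache("hours")       # the new hours
print(stale)
print(fresh)
```

The cached version saves the work of generating the page a second time but keeps showing the old hours, which is the sign chiseled in stone.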

Of course there’s lots of room in between these two extremes for a reasonable solution. But it’s not a simple thing. This is a general problem known as cache invalidation. I should probably mention that cache is spelled c a c h e and is not the same word as c a s h. We’re not talking about money. Cache invalidation is the problem of deciding when something that’s been placed in a cache should be removed so it can be updated. If you check too often, then you’re right back at the one extreme of not having a cache at all. And if you check too little, then you can end up returning something that’s out of date.
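One common middle ground is to let each cached item expire after a set amount of time. Here is a minimal sketch of that idea, assuming a made-up ExpiringCache class. The clock is passed in so the example can pretend that time has moved forward without actually waiting.

```python
import time

class ExpiringCache:
    """Hypothetical cache where each entry expires after max_age_seconds."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.entries = {}  # key -> (time stored, value)

    def put(self, key, value):
        self.entries[key] = (self.clock(), value)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.max_age_seconds:
            # Too old: remove it so the caller generates a fresh copy.
            del self.entries[key]
            return None
        return value

# A fake clock, in seconds, so the example runs instantly.
now = [0.0]
cache = ExpiringCache(max_age_seconds=3600, clock=lambda: now[0])
cache.put("/about", "<html>About</html>")

now[0] = 1800
after_half_hour = cache.get("/about")  # still in the cache

now[0] = 3600
after_one_hour = cache.get("/about")   # expired, so None
```

A shorter maximum age moves you toward the no cache extreme, and a longer one moves you toward the never expires extreme.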

One solution for websites that usually works well is to disable the cache for any users actually logged into your website. Let the processor spend the extra time for each page visited to make sure that the most current content is always returned. In other words, don’t even look in the cache to see if an HTML page has already been generated and always create HTML from the latest information in the database each time.

Then for any other visitors to your website that don’t have an account of their own, check if an HTML file already exists in the cache and return that one instead of creating a new one. Then as a separate process unrelated to any visitor, go through the entire cache and remove all the generated HTML files every few hours. This will make sure that any page returned to a visitor will be at worst a few hours out of date. And if you rarely change your pages, then even a few hours won’t matter because the content hasn’t changed anyway. You can always manually decide to clear the cache if you make any major changes to your website.
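The approach just described might look something like this in Python. It is a sketch under some assumptions. The functions handle_request, generate_html, and purge_cache are hypothetical, the cache is just a dictionary in memory, and in a real setup something like a scheduled job would call purge_cache every few hours.

```python
render_count = 0
page_cache = {}  # path -> generated HTML for visitors without an account

def generate_html(path, username=None):
    # Stand-in for the real work of building a page from the database.
    global render_count
    render_count += 1
    greeting = f"Hello, {username}" if username else "Welcome"
    return f"<html><body><p>{greeting}</p><p>Page: {path}</p></body></html>"

def handle_request(path, username=None):
    if username is not None:
        # Logged in: don't even look in the cache.
        return generate_html(path, username)
    if path not in page_cache:
        page_cache[path] = generate_html(path)
    return page_cache[path]

def purge_cache():
    # Run separately from any visitor, or by hand after major changes.
    page_cache.clear()

handle_request("/news")            # generated and cached
handle_request("/news")            # served from the cache
renders_after_two_visitors = render_count

handle_request("/news", "alice")   # logged in, always generated
renders_after_login = render_count

purge_cache()
handle_request("/news")            # cache was emptied, generated again
renders_after_purge = render_count
```

Only the pages for visitors without an account ever go into the cache, so a logged in user never gets somebody else's page or an out of date one.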

Setting up a reasonable cache system like this will help your web server survive a massive increase in visitors. That extra work needed to generate HTML pages may not be much but if your site suddenly gets popular and you get millions of visitors, then your web server just won’t be able to keep up with the total extra work and your whole website will go down. Caching HTML pages for regular visitors will save a lot of work when they would have all gotten the same HTML files anyway.
