Consistency, index.html and www

Published in Search Engines on Tuesday, August 17th, 2004

Keeping your URLs consistent with a couple of simple rewrite rules.

Updated on 19/08/2004

The following are two (of the many) little things that I've learnt over the years trolling the bounty of information that is WebmasterWorld. They're useful, and I've noticed that not everyone seems to apply them...

Keep it clean and consistent

Search engines can be fooled into treating the 'www' and 'non-www' versions of a website as two separate sites when they find links pointing to both.

Many people find, for example, that Google gives them a PageRank of 4 for http://mydomain.com and 5 for http://www.mydomain.com. The same type of confusion can occur with http://www.mydomain.com/index.html vs. http://www.mydomain.com/.

What to do? Well, luckily the answer is quite simple.

Rewrite rules, baby

The general consensus on a solution to these problems is to issue a 301 (Moved Permanently) response and send the user to the URL that you want to use. The following code, on an Apache server, will do the trick:

RewriteEngine on

# =============================================
# This sends all to www. Remove the 'www' to
# send to mydomain.com
# ---------------------------------------------

RewriteCond %{HTTP_HOST} !^www\.mydomain\.com
RewriteRule ^(.*)$ http://www.mydomain.com/$1 [R=301,L]

# ==================================================
# This sends requests for index.html to the root.
# --------------------------------------------------

RewriteRule ^index\.html$ / [R=301,L]

The result is that search engines have only one place to go to find your home page, and any links pointing to the other URLs are credited to the one which you select to use. Clean, simple and consistent.
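The comment in the first rule mentions sending everything to mydomain.com instead. A minimal sketch of that variant, assuming the rules live in the .htaccess file at the document root and that mydomain.com (without 'www') is the host name you have chosen:

```apache
RewriteEngine on

# Assumption: mydomain.com (no www) is the preferred host name.
# Any request whose Host header is not mydomain.com (compared
# case-insensitively) is redirected there, keeping the requested path.
RewriteCond %{HTTP_HOST} !^mydomain\.com$ [NC]
RewriteRule ^(.*)$ http://mydomain.com/$1 [R=301,L]
```

Use one direction or the other, not both; two opposite host rules on the same site would bounce visitors back and forth.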

An important point

RewriteEngine on

RewriteCond %{HTTP_HOST} ^mydomain\.com
RewriteRule ^(.*)$ http://www.mydomain.com/$1 [R=301,L]

Comments and Feedback

Finally, someone took the lead on this and wrote it up. It applies to more things, though: creating directories for weblog posts with an 'index.php' inside (and linking to both, depending on whether you start from the main page or the Atom feed), or WordPress weblogs whose posts are retrievable both the directory way, /post/, and the file way, /post. Et cetera.

However, this is a great first step. I prefer no-www myself, but consistency is key.
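The index.php case mentioned above can be handled with the same kind of rule. A minimal sketch, assuming the rules live in the .htaccess file at the document root, that index.html and index.php are the directory index files, and that www.mydomain.com is the preferred host (the host name is carried over from the article; the rest are assumptions for illustration):

```apache
RewriteEngine on

# Assumption: index.html and index.php are the directory index files.
# The condition looks at the client's original request line, so only a
# URL the visitor actually asked for is redirected.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(html|php)(\?[^\ ]*)?\ HTTP/
# $1 is the directory part of the path (empty for the site root).
RewriteRule ^(.*/)?index\.(html|php)$ http://www.mydomain.com/$1 [R=301,L]
```

The THE_REQUEST condition is there because, in some setups, Apache's own internal lookup of the index file for a directory request can match the rule again and cause a redirect loop.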

Good information. I always wondered how to get around the www or index.html issue.

What would the referring log show up as for it?

Hey Blake, here's the result of a quick test on my local laptop server:

127.0.0.3 ... "GET /index.html HTTP/1.1" 301 325
127.0.0.3 ... "GET / HTTP/1.1" 200 2397

Glad you both found this useful/relevant!

This is so easy yet so many big, big websites have issues between "www." and no "www.", especially ".co.uk" domains. I see no need for the "www." prefix now that http:// is synonymous with web documents (as opposed to FTP traffic, etc.). Anyway, nice to see this documented at a high profile site.

Just one question, though: what's the purpose of the line RewriteCond %{HTTP_HOST} . (with a lone dot as the pattern)?

Unless I'm missing something, it just tests whether HTTP_HOST is made up of any single character?

Thanks Patrick, all fixed. Funny how the eye misses these things...

Hmmmm, I wonder who wrote it first?

;-)

Ha, I do believe that I e-mailed you about this before you used it on your sites ;-P. Maybe you missed it 'cause I never did hear back from you on that one...

Anyway, I didn't realize that someone had to take the lead in this thing and write it up, or I would have published it sooner!

You emailed me? *cough*Must've missed it*cough* You did guide me a bit, but I had the drive and determination to study for weeks before finding the correct way of doing it and writing the most concisely thorough article on the subject the world has ever seen. So I demand that you delete this entry along with all comments so that I get my just due.

...

To everyone who doesn't know me, yes I am joking. Mike and I be cool.

All good... I read your article and liked it better; a different tone, one that comes with age and experience ;-]

Another option for enforcing www consistency, particularly if you are already running multiple sites, is to use a separate virtual host. For instance:

<VirtualHost *>

ServerName example.tld

Redirect permanent / http://www.example.tld/

</VirtualHost>

(Sorry for the double-line spacing, but your comment form insists that pre is an illegal tag... despite the fact that an existing comment uses it! And since it won't take br either, even in XHTML form, I had to make do with paragraphs.)

The same technique can be used to remove the www.

This has the advantage that it doesn't require mod_rewrite. On the other hand, it can't take care of extra index.html links.
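For the www-removing direction mentioned above, a minimal sketch using the same placeholder domain, and assuming a separate virtual host already serves the real site at example.tld:

```apache
# Assumption: example.tld (no www) is the preferred host name and is
# served by another virtual host. This one only catches the www name.
<VirtualHost *>
    ServerName www.example.tld
    # mod_alias: send every path under / to the same path on example.tld
    Redirect permanent / http://example.tld/
</VirtualHost>
```

Redirect permanent issues a 301 and appends the rest of the requested path to the target, so deep links keep working.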

One last note: on new sites, you can prevent duplicate index.html entries by making sure you never use index.html in a link. Absolute links are easy, and for relative links within a directory, you can use <a href="./">...</a>. If search engines (and visitors making their own links) never see a link straight to the file, and it hasn't been manually submitted, they'll never even look at the index.html location.

Great advice Kelson, and sorry about the pre bit. I'll admit I cheated and added it right into the db. I was going to 'allow' pre and some other things today but haven't gotten around to it.

That last bit is most certainly true. Don't let'em know it exists and they can't post to it...
