Forget keyword density, the real world is far more complicated…

Filed under: Updates — Ian Lavelle @ 3:54 pm

As my esteemed colleague Amye Saunders covers in her blog post Mayday, Mayday!, the search engines have been rolling out some changes that look to have affected webmasters across verticals. I had an idea for a blog post on the topic of “keyword density”, and thought this an opportune time to tie it back to some real-world events… (Disclaimer: this is just my theory, and thus by no means gospel!) :-)

Clients often ask us how many times they should use their keywords in their page copy, or “what is the optimal keyword density?” The answer is that there is no magic number: the optimal keyword density is different for every search term. I’ll explain why.

Search engines are more concerned with a value known as TF-IDF (term frequency-inverse document frequency) than with the percentage of your page content made up of your target keywords. I won’t go into the intricacies of the TF-IDF formula right now, but if you wish to delve a bit more into it, see good old Wikipedia.

Keyword density says nothing about your page’s relevance to a search query in relation to the rest of the web. For this reason, we need to consider how many pages across the internet contain your search term, and also the total number of documents in the search engine’s index.

Let’s consider an example web page about ‘car insurance’ with 100 words, where the word ‘insurance’ appears 9 times. The term frequency (TF) in this case is 9/100, or 0.09. Now let’s assume that the internet is made up of just 1000 web pages in total, and that the word ‘insurance’ appears in 80 of them. The IDF value (inverse document frequency) is the natural log of the total number of pages divided by the number of pages containing the term: ln(1000/80) = 2.53. The TF-IDF score is the product of these two numbers, i.e. 0.09*2.53 = 0.2277.
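The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the textbook formula used in this post (term frequency multiplied by the natural log of total pages over pages containing the term), not a description of how any search engine actually scores pages. The function name `tf_idf` and its parameter names are made up for this example.

```python
import math

def tf_idf(term_count, page_words, total_pages, pages_with_term):
    """Textbook TF-IDF: (term_count / page_words) * ln(total_pages / pages_with_term)."""
    tf = term_count / page_words                   # share of the page taken up by the term
    idf = math.log(total_pages / pages_with_term)  # natural log, as in the example above
    return tf * idf

# 'insurance': 9 occurrences on a 100-word page; 80 of the 1000 pages contain it
score = tf_idf(9, 100, 1000, 80)
print(round(score, 4))  # 0.2273
```

The result differs slightly from the 0.2277 above because the worked example rounds the IDF to 2.53 before multiplying, whereas the code keeps full precision.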

Let’s now consider the same example with a more long-tail search term: ‘comprehensive car insurance’, which, I’ll assume, appears only twice on my example web page. The term frequency is 2/100, or 0.02. Across the entire collection of 1000 web pages, ‘comprehensive car insurance’ appears in, let’s say, 15 of them. In this long-tail case, the IDF value is ln(1000/15) = 4.2, so the TF-IDF score is 0.02*4.2 = 0.084 (less than the 0.2277 for ‘insurance’ in the example above). The higher TF-IDF score tells us that ‘insurance’ carries more weight on this page, measured against our imaginary internet, than ‘comprehensive car insurance’ does.

Now let’s take a final example, with an expanded internet that contains 1500 web pages, one and a half times the size of our initial index. In the illustration above, ‘comprehensive car insurance’ appeared in 1.5% of all web pages (15 out of 1000). Assuming that the new pages are just as likely to mention ‘comprehensive car insurance’ (i.e. a 1.5% chance), we would expect 22.5 pages to contain our term, which rounds up to 23. The IDF is now ln(1500/23) = 4.18, and our TF-IDF score works out to be 0.0836 (0.02*4.18), slightly lower than the 0.084 we had before. Strictly speaking, this small dip comes from rounding 22.5 up to 23: if the term’s share of pages stayed at exactly 1.5%, the IDF would not move at all. The score falls further when the new pages mention the term more often than the existing ones do. So as the index grows and more pages mention ‘comprehensive car insurance’, our TF-IDF score for that term falls. A high TF-IDF score suggests that your page is likely to be highly relevant, so your chances of ranking should improve in relation to your TF-IDF score. And as illustrated above, this score can drop as the internet expands, even though you haven’t changed anything on your site!
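To see how the index size and the number of pages mentioning the term interact, here is a rough Python sketch that recomputes the long-tail score under a few scenarios. It uses the same textbook formula as the examples in this post. The helper name `tf_idf` is made up for illustration, and the last scenario (60 pages containing the term) is a purely hypothetical figure chosen to show a larger drop.

```python
import math

def tf_idf(tf, total_pages, pages_with_term):
    """Textbook TF-IDF: tf * ln(total_pages / pages_with_term)."""
    return tf * math.log(total_pages / pages_with_term)

tf = 2 / 100  # 'comprehensive car insurance' appears twice on a 100-word page

original = tf_idf(tf, 1000, 15)      # the 1000-page index
same_share = tf_idf(tf, 1500, 22.5)  # exactly 1.5% of 1500 pages
rounded = tf_idf(tf, 1500, 23)       # 22.5 rounded up to 23 whole pages
crowded = tf_idf(tf, 1500, 60)       # hypothetical: many of the new pages use the term

for label, value in [("original", original), ("same share", same_share),
                     ("rounded", rounded), ("crowded", crowded)]:
    print(f"{label}: {value:.4f}")
```

The ‘same share’ score matches the original, the rounded one is a touch lower, and the hypothetical ‘crowded’ one is lower still: what moves the score is the proportion of pages containing the term, not the raw size of the index.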
