Embryo’s Word Count Study 2021

This is the story of how Embryo embarked on the largest-ever study of how much content it takes to rank in the top 10 of Google.

Before we get into the hows and whys, here is the data that most of you will be here for. This graph shows the average number of (meaningful) words on pages in the top 10 of Google results for over 20,000 keywords (24,774 to be exact).

Graph: the average word count for pages in the top 10 of Google’s rankings.

And here are the raw numbers:

Position    Average Word Count
1           2855
2           2923
3           2683
4           2479
5           2422
6           2363
7           2267
8           2244
9           2230
10          2242

Why did we undertake this content study?

The last meaningful study of page content length/word counts was conducted by a company called serpIQ (now defunct). It was done way back in 2012, which meant there had been a whole nine years in which things may have changed. During those nine years, hundreds of companies – including ourselves – used the serpIQ data to extol the benefits of long-form content. So we felt that someone should commit to another study to see how different things were, if at all.

So, that’s what we did!

Graph: data from serpIQ’s 2012 content length study.

How did we do it?

Well, it wasn’t easy, and despite initially thinking that we would do it annually, we will probably do it once every 2-3 years, such was the effort required.

Over the course of six months, we did the following:

  • Collected just over 20,000 keywords
    • we used ‘everyday’ keywords as much as possible – things that typical web users search for, rather than solely ‘premium’, single-word keywords
    • there was a mixture of top-tier keywords, question-based keywords, and short, mid, and long tail variations
    • we largely used the excellent SISTRIX for this.
  • Created our own crawling software
    • This allowed us to query SISTRIX’s API for the top 10 sites for each keyword (a simplified sketch of this step appears after this list)
      • Without SISTRIX kindly donating us lots of free API credits, this effort would have cost us a fortune, so a BIG thanks goes to the SISTRIX team
    • Crawl rate limits meant that we had to temper our excitement to get through all of the keywords as quickly as possible
  • Created our own word count software
    • This was by far the most complicated part of the whole study! We created perhaps eleven versions of it before we were satisfied with its output (a simplified sketch also appears below)
    • Ignoring non-meaningful content (e.g. menu text items) is one of the most difficult things we have ever undertaken, code-wise
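
To give a flavour of the crawling step, here is a minimal sketch in Python. The endpoint, parameters, and response shape below are hypothetical stand-ins for illustration only – they are not SISTRIX’s actual API.

```python
import time

import requests

API_KEY = "your-api-key"  # hypothetical credential
# Hypothetical endpoint and response shape -- illustrative stand-ins,
# not SISTRIX's real API.
ENDPOINT = "https://api.example.com/serp"

def top10_urls(keyword: str) -> list[str]:
    """Fetch the top 10 organic result URLs for one keyword."""
    resp = requests.get(
        ENDPOINT,
        params={"api_key": API_KEY, "kw": keyword, "num": 10},
        timeout=30,
    )
    resp.raise_for_status()
    return [item["url"] for item in resp.json()["results"]]

def crawl(keywords: list[str], delay: float = 1.0) -> dict[str, list[str]]:
    """Work through the keyword list slowly, respecting rate limits."""
    results = {}
    for kw in keywords:
        results[kw] = top10_urls(kw)
        time.sleep(delay)  # throttle between requests ("temper our excitement")
    return results
```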
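
And a simplified illustration of the word-counting idea – strip obviously non-meaningful elements before counting anything. This is a minimal sketch using BeautifulSoup; the real tool went through around eleven iterations and is far more involved.

```python
import re

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Tags whose text is usually navigational boilerplate, not page content.
BOILERPLATE_TAGS = ["nav", "header", "footer", "aside", "script", "style", "form"]

def meaningful_word_count(html: str) -> int:
    """Count words on a page after dropping obvious boilerplate elements."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(BOILERPLATE_TAGS):
        tag.decompose()  # remove the element and all of its text
    text = soup.get_text(separator=" ")
    return len(re.findall(r"\b\w+\b", text))
```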

So, with all of those things in place, it took us around three months to collect all of the data once we had the keywords, crawler, and code in a place that we were happy with. Writing the software and code took around one month, so overall around four months were spent delivering the output that you see on this page. Even though it was tough, we think it was worth it.

2012 to 2021 – what has changed?

Taking a look at the differences between what we discovered and what serpIQ reported, it seems that websites have evolved somewhat: we recorded a greater amount of content on top 10-ranked pages than appeared in the past.

Pos.    serpIQ 2012    Embryo 2021    Difference (words)
1       2455           2855           +400
2       2480           2923           +443
3       2420           2683           +263
4       2380           2479           +99
5       2320           2422           +102
6       2310           2363           +53
7       2210           2267           +57
8       2145           2244           +99
9       2080           2230           +150
10      2050           2242           +192
Avg.    2285           2471           +186

It must be stated, for clarity, that the numbers for serpIQ entries in the table above are estimated from the graph that appears higher up on this page.

You can see a general trend of more words appearing across pages that are in the top 10 of Google. However, we strongly believe that our way of discarding non-useful text on a page means the true increase is probably even greater than it appears here. This is something that we cannot prove, as we don’t have access to both sets of data – and, for all we know, the serpIQ word count software was as good as what we have created (but we doubt it!).

Our gut instinct is that the average difference in word count compared to 2012 is closer to 400 words, rather than the ~190 that the data records.

Questions versus non-question queries

While the average word count across all 20,000+ keywords is interesting in itself, we found that as we drilled down into the data, even more interesting and useful stats came to light.

For example, the graph below breaks down each keyword into two types – question and non-question – which we determined simply by checking for the presence of words such as ‘why’ and ‘how’ in each query.
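
As a rough, illustrative sketch (not our exact production check), the classification could be as simple as:

```python
# Words that usually signal a question-based query; an illustrative
# list, not the exact one used in the study.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

def is_question_query(query: str) -> bool:
    """Return True if the query contains an obvious question word."""
    return any(word in QUESTION_WORDS for word in query.lower().split())

print(is_question_query("how to fix a leaking tap"))      # True
print(is_question_query("emergency plumber manchester"))  # False
```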

As you can see, for the top three positions, ranking for a question-based keyword requires considerably less content than ranking for a non-question query. One could assume that a page built around a specific question needs less surrounding context to answer it. A high-ranking page targeting a non-question keyword phrase is likely to carry much more contextual content on topics surrounding the main keyword, helping Google to see that the people behind the website know much more than just the solution to one specific problem.

Graph: word counts for question versus non-question queries.

Word counts by number of words in query

To some, this data will be of interest, but probably not to most. Essentially, you could argue that for 3 to 6-word search queries the content doesn’t have to be as long as for queries of other lengths, as Google can ascertain with some certainty what people using those queries want to see. Single-word queries presumably attract the largest amount of content per page, as a single keyword can carry several meanings and intents depending on the user. A page with more content therefore has a higher chance of containing what the user is searching for than a page with less.

However, for searches using 7-10 keywords, the trend is reversed, which we found particularly strange. Perhaps this means that, as mighty as Google is, it isn’t yet able to fully understand queries of this length to the same degree that it understands the more common 3 to 6-word queries. Maybe you have your own thoughts on this?

Graph: word count by query length.
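
For anyone wanting to reproduce this kind of breakdown, here is a minimal sketch assuming the results sit in a pandas DataFrame with ‘query’ and ‘word_count’ columns (the column names and rows are illustrative):

```python
import pandas as pd

# Illustrative rows only -- the real dataset covered 24,774 keywords.
df = pd.DataFrame({
    "query": ["plumber", "how to fix a leaking tap", "best running shoes"],
    "word_count": [3100, 1900, 2400],
})

# Bucket each query by its number of words, then average word counts.
df["query_length"] = df["query"].str.split().str.len()
buckets = pd.cut(df["query_length"], bins=[0, 2, 6, 10],
                 labels=["1-2 words", "3-6 words", "7-10 words"])
print(df.groupby(buckets, observed=True)["word_count"].mean())
```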

SERP Features

As we recorded the data from the SISTRIX API, we were able to see which top 10 positions were taken up by various Google SERP features such as ‘People also ask’, inline videos, news items, jobs, inline Twitter items, and inline images.

We found that ‘People also ask’ (denoted as Question on the pie chart) appears much more often than any other type of SERP feature, accounting for 68% of the occasions when a feature did occur.
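
The share itself is simple to compute once features are recorded per SERP. A minimal sketch, assuming one recorded feature name per occurrence (the feature names here are illustrative):

```python
from collections import Counter

# One entry per SERP where a feature occurred; illustrative values.
features = ["question", "video", "question", "images", "question", "news"]

counts = Counter(features)
total = sum(counts.values())
for feature, n in counts.most_common():
    print(f"{feature}: {n / total:.0%} of feature occurrences")
```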


‘People also ask’ – where does this feature appear?

Another piece of information that we feel is really useful to the SEO community is the placement of ‘People also ask’ features in Google SERPs. We haven’t previously found any useful data showing the SERP positions at which these features are likely to occur – especially not over a large dataset of 20,000+ keywords.

The graph below gives a clear indication that Google is much more likely to show a ‘People also ask’ feature (denoted as question boxes in the graph) in the first three organic positions, with organic position 2 by far the most prevalent. We will let you decide why this is the case…

Graph: the SERP positions at which the ‘People also ask’ box appears.

Analysing users’ search intent using SISTRIX data

One of the many useful parts of the SISTRIX toolbox is the ability to ascertain what kind of intent is behind a search, expressed as a confidence score of 0-100. This gives an SEO and/or content team extra ideas and ways to build content around keywords so as to attract the type of intent they wish users to have when visiting the site. This is just one of the ways that SISTRIX’s intent data can be used.

You can find more about SISTRIX search intent here.

The graph below is a violin plot of the ‘visit’ intent, which SISTRIX explains as follows:

“These queries are very location focused and might include ‘near me’ or ‘closest’ in the full search query. These searches often trigger the Google Maps feature in the SERP. Some search queries have implicit location requirements, such as ‘pizza’, which is likely to require some location-based answers. In some cases Google may deliver website and location-based results, and there are always cases where it’s almost impossible for Google to know what the requirement is. Consider ‘Apple store’, for example.”

What the graph shows is that, across the 20,000+ keywords we processed, SISTRIX reported in the vast majority of cases that it was less than 10% confident that the search was made with the intent of finding a physical store. You can also see two other small clusters, at around 50% and 100%, where SISTRIX was either ~50% or ~100% sure that these searches were looking for something that could be physically visited.

Graph: violin plot of the ‘visit’ intent from SISTRIX.
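
For anyone wanting to produce a similar plot, here is a minimal sketch with matplotlib, assuming the per-keyword confidence scores are held in a plain list (the values below are illustrative):

```python
import matplotlib.pyplot as plt

# Illustrative 'visit' intent confidence scores (0-100); the real
# study had one score per keyword across 24,774 keywords.
visit_confidence = [2, 5, 8, 3, 7, 4, 6, 9, 48, 52, 97, 100]

fig, ax = plt.subplots()
ax.violinplot(visit_confidence, showmedians=True)
ax.set_ylabel("SISTRIX 'visit' intent confidence (%)")
ax.set_xticks([])  # single distribution, no category axis needed
plt.show()
```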

Summary

We did produce some more graphs from various data, but they cover really small edge cases, and probably won’t be that interesting to the majority of readers. At least we’re being honest. 🙂

So, after checking 24,774 keywords, we can reasonably conclude that there has been an increase of roughly 8% in the amount of content on pages in the top 10 of Google compared to 2012. However, as we stated earlier on this page, we believe (but cannot prove) the true figure to be closer to 400 extra words on each page that holds a top 10 ranking. We think this because advancements in our text-stripping tools allowed us to remove much of the non-important text on the vast majority of the websites we checked.

Until Google understands the contents of an uploaded video to a much, much higher degree than it does today, we expect the word counts of pages across the web to continue to rise. At what rate, we don’t know.

We must say a big thank you to Steve Paine of SISTRIX for his excellent help, allowing us to publish this data in a few months, rather than a few years!

Coding and tool design: James Welch, Zahed Kamal

Data Visualisations: Danny Waites