We connect a digital society.

Made possible through technology, made personal through data, made powerful through content. Bringing people, communities and brands together like never before.

Operating a problem-led, outcome-focused bespoke approach, we come to together to provide meaningful connections between brands and audiences through strategy & planning, campaigns & activation and consultancy & transformation.

Connect with us today for more insightful, innovative and imaginative ways to connect with your audience.

Talk to us about strategic solutions across analytics, search, design & content.

Introducing Builtvisible’s Indexation Checker

February 7, 2020 | 6 min read

As technical SEO specialists, understanding how Google and other search engines crawl, render and index our websites stand among our top priorities

However, getting the necessary data to analyse each of these areas is often not as easy as you might think.

While we do have a detailed guide on how to perform log file analysis to understand Googlebot’s crawling behaviour, our last post about gathering indexation data was back in 2016. And although some of the information given still stands, I’m going to build on the ideas by sharing some of the new techniques that we have developed internally to fetch Google indexation status data, particularly for large scale sites.

Why knowing your website’s indexation status is important

As an SEO specialist or website owner, you want to attract potential users/customers to your site through Google Search. If your website (or part of your website) is not indexed, then you won’t appear in the search results and you’ll forfeit any potential organic traffic, conversions and revenue.

You can also have the opposite problem. If your website creates URLs exponentially (a common issue on ecommerce sites) or allows uncontrolled user-generated content, then Google might be crawling and indexing more than it should. This can quickly lead to enormous inefficiencies that are to the detriment of your core architecture.

Common challenges when gathering Google indexation data

Right now, you can either use Google Search Console or a few third-party solutions to gather indexation data. However, both options come with their own set of drawbacks when it comes to checking indexation data at scale. These generally fall into two groups: accessibility of the data and accuracy of the results.

Google Search Console limitations

Google Search Console (GSC) is an incredibly accurate source when it comes to indexation status data. After all, it has the benefit of a connection to Google’s Indexing System (Caffeine). The new version of GSC brings three super-useful reports that provide indexation status data: the URL Inspector Tool, the Coverage Report and the Sitemaps Report.

However, none of these reports are suitable for large-scale websites because GSC limits the number of URLs you can check per day. I know this because Hamlet Batista built an amazing tool to automate the URL Inspector Tool and I found out about it the hard way!

In theory, the Coverage Report and the Sitemaps Report could do the trick, but unfortunately Google Search Console limits the export report to 1,000 rows of data and there is currently no API access to extract more.

The only way around it would be to divide your whole architecture into XML sitemaps of 1,000 URLs max. Therefore, if you have 100,000 (known/important) URLs, you would need to create 100 XML sitemaps. This would be very hard to manage and is therefore not really an option.

Additionally, this will not give you the indexation data you need about uncontrolled URLs generated via faceted navigation or user-generated content.

URL Profiler limitations

As mentioned in Richard’s post, URL Profiler was a valid option to gather indexation data in some cases. Although we love this tool for other tasks, we realised that it had a lot of issues getting accurate data for “non-clean” URLs.

Some examples include parameterised URLs, URLs with encoded characters (i.e. non-ASCII characters) and symbols, URLs with varied letter casing and URLs with unsafe characters.

I’m yet to find a tool that solves all of these issues – so we built our own!

The solution: Builtvisible’s Indexation Checker

Builtvisible’s Indexation Checker is a script developed internally by our Senior Web Developer Alvaro Fernandez, it was built to deal specifically with the limitations that I’m sure we’re not alone in experiencing.

Our script is able to verify an unlimited number of URLs with any kind of problematic characters: parameters, encoding, reserved characters, unsafe characters, different alphabets – if Google has indexed it, our script will find it.

For detailed steps on how to set up the script please see this blog post of mine.

Final thoughts

The web is filled with weird URLs and, if you’re reading this, the chances are you’re dealing with a mammoth website that has their fair share. Knowing which ones Google has indexed can be a massive pain.

Using our script, you’ll be able to check the indexation status for any type of URL and the size of the list will no longer be a problem – we hope you enjoy it as much as we do!

A few additional concepts

Before I go, I’d like to clarify a few things that I think are necessary.

Firstly, it’s important to make a distinction between Indexing and Serving. We, as users, do not have direct access to the actual Google Index (outside of the data we get from GSC). The second-best option is to therefore understand what Google serves through Google Search.

What this means is that we can be sure that a URL is in Google’s Index if we see it on Google Search, but we cannot be 100% sure that a URL is not in the Index if Google doesn’t serve it through Google Search.

However, in practical terms, if a URL is not served in the SERPs, users cannot see it. Therefore, to us, it would be like it’s not in the Index and we might want to find out why this is the case.

Secondly, having your URL indexed doesn’t mean that the content in your page is indexed. For example, if your site is using a JavaScript framework that renders client-side (CSR), it could be that Google has indexed an empty page. Hence the need to understand if and how Google has rendered your site.

Lastly, a small disclosure: the only way to gather indexation data outside of Google Search Console is by getting information out of Google SERPs. Google does not allow automated queries according to their Terms of Service. So, if you use our script, please use it responsibly.

I hope you’ve enjoyed this article and if you have any thoughts, please consider sharing them!