As we rapidly approach the end of 2009 and the opening of 2010, we’ve got a much-anticipated index update ready to roll out, gang. Say it with me: “twenty-ten”. Oh yeah, I’m so gonna get a flying car and a cyberpunk android 🙂 …Ahem. I thought this would be a great time to take a look back at the year and ask, “where did all those pages go?” Being a data-driven kind of guy, I want to look at some numbers about churn, freshness, and what they mean for the size of the web and web indexes over the last year, and the hundreds of billions, indeed trillion-plus, URLs we’ve gotten our hands on.
This index update has a lot going on, so I’ve broken things out section by section:
An Analysis of the Web’s Churn Rate
Not too long ago, at SMX East, I heard Joachim Kupke (senior software engineer on Google’s indexing team) say that “a majority of the web is duplicate content”. I made great use of that point at a Jane and Robot meet up shortly after. Now, I’d like to add my own corollary to that statement: “most of the web is short-lived”.
After just a single month, a full 25% of the URLs we’ve seen become what we call “unverifiable”. By that I mean the content was either duplicate, included session parameters, or for some reason could not be retrieved (verified) again (404s, 500s, etc.). Six months out, 75% of the tens of billions of URLs we’ve seen are “unverifiable”, and a year later only 20% still qualify for “verified” status. As Rand noted earlier this week, Google’s doing a lot of verifying themselves.
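To make the shape of that decay concrete, here’s a back-of-the-envelope sketch in Python. It assumes nothing beyond the three data points above (and treats them as describing a single cohort of URLs), so take the exact rates with a grain of salt:

```python
# Back-of-the-envelope: implied per-month survival rates for "verified"
# URLs, using only the data points above and assuming a single cohort.
verified = {0: 1.00, 1: 0.75, 6: 0.25, 12: 0.20}  # month -> verified fraction

checkpoints = sorted(verified)
for start, end in zip(checkpoints, checkpoints[1:]):
    months = end - start
    ratio = verified[end] / verified[start]
    # Assume a constant monthly survival rate r over the interval,
    # so that r ** months == ratio.
    r = ratio ** (1.0 / months)
    print(f"months {start}-{end}: ~{r:.1%} of verified URLs survive each month")
```

The implied monthly survival rate climbs from roughly 75–80% in the early months to about 96% after month six: most churn happens early, and pages that survive a while tend to keep surviving.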
To visualize this dramatic churn, imagine the web as it was six months ago. Combining Joachim’s point with what we’ve observed, only about a quarter of that six-month-old content is still verifiable today, and a good chunk of what remains is duplicate.
What this means for you as a marketer is that some of the links you build and the content you share across the web are not permanent. If you engage heavily with high-churn portions of the web, the statistics you monitor over time can vary pretty wildly. It’s important to understand the difference between getting links (and republishing content) in places that will make a splash now but fade away, versus engaging in ways that last. Of course, both are important (high-churn areas may drive traffic that turns into more permanent value), but the distinction shouldn’t be overlooked.
Canonicalization, De-Duping & Choosing Which Pages to Keep
In Linkscape’s indices, we work to capture both sides of this churn:
- We’ve got an up-to-date crawl including fresh content that’s making waves right now. Blogscape helps power this, monitoring 10 million+ feeds and feeding what they publish back to Linkscape for inclusion in our crawl.
- We include the lasting content that will continue to support your SEO efforts, by analyzing which sites and pages are “unverifiable” and removing them from each new index. This is why our index growth isn’t cumulative: we re-crawl the web each cycle to make sure that the links + data you’re seeing are fresh and verifiable. (A minimal sketch of this verification step follows this list.)
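To be concrete about what “verification” means operationally, here’s a minimal Python sketch of the fetch-and-check half of it. This is an illustration of the idea, not Linkscape’s actual pipeline (duplicate and session-parameter detection, the other half of “unverifiable”, is sketched further down):

```python
# A minimal sketch of per-cycle URL verification, assuming a simple
# "fetch and check" model -- not Linkscape's actual pipeline.
import urllib.error
import urllib.request

def is_verified(url: str, timeout: float = 10.0) -> bool:
    """A URL is 'verified' if it can still be retrieved successfully."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False  # 404s, 500s, DNS failures, timeouts: unverifiable

index = ["http://example.com/", "http://example.com/gone-page"]
fresh_index = [url for url in index if is_verified(url)]
```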
To put it another way, consider the quality of most of the pages on the web, as measured, for instance, by mozRank: the distribution is skewed dramatically toward the bottom. The vast majority of pages have very little “importance” as defined by a measure of link juice. So it doesn’t surprise me (now, at least) that most of these junk pages disappear before too long. Of course, there are still plenty of really important pages that do stick around.
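mozRank is our own metric, but for intuition about why link-based importance piles up on so few pages, here’s a toy PageRank-style power iteration in Python. To be clear, this is a generic illustration of the technique, not mozRank’s actual algorithm, and the little link graph is made up:

```python
# A minimal PageRank-style power iteration, for intuition only --
# NOT mozRank's actual algorithm.
DAMPING = 0.85
ITERATIONS = 50

# Hypothetical toy link graph: page -> pages it links to
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],  # "d" has no inlinks, so it keeps only the base rank
}

pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(ITERATIONS):
    # Everyone gets a small base share; the rest flows along links.
    new_rank = {p: (1.0 - DAMPING) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = DAMPING * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

# A few pages accumulate most of the rank; the rest stay near the floor.
for page, r in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {r:.3f}")
```

Even in this four-page toy graph, rank concentrates on the pages that collect links; scale that up to tens of billions of URLs and you get the long, thin tail described above.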
But what does this say about the pages we’re keeping? First off, let’s set aside any discussion of the pages we saw over a year ago (as we’ve seen above, likely less than a fifth of them remain on the web). In just the past 12 months, we’ve seen between 500 billion and well over 1 trillion pages, depending on how you count it (via Danny at Search Engine Land).
So in just a year we’ve provided 500 billion unique URLs through Linkscape and the Linkscape-powered tools (Competitive Link Finder, Visualization, Backlink Analysis, etc.). And what’s more, this represents less than half of the URLs we’ve seen in total, as the “scrubbing” we do for each index cuts approximately 50% of the “junk” (including canonicalization, de-duping, and straight tossing for spam and other reasons). There are likely many trillions of URLs out there, but the engines (and Linkscape) certainly don’t want anything close to all of these in an index.
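For the curious, here’s a minimal sketch of what that kind of scrubbing can look like: URL canonicalization plus content-hash de-duplication. The session-parameter list and the rules here are assumptions for illustration, not our actual pipeline:

```python
# Illustrative "scrubbing": canonicalize URLs, then de-dup by content.
# The session-parameter names and rules are assumptions, not Linkscape's
# actual pipeline.
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}  # assumed

def canonicalize(url: str) -> str:
    """Lowercase the host, drop fragments and session parameters."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))

def scrub(pages: dict) -> dict:
    """Keep one canonical URL per distinct page body."""
    seen_bodies = set()
    kept = {}
    for url, body in pages.items():
        canonical = canonicalize(url)
        digest = hashlib.sha1(body.encode()).hexdigest()
        if canonical in kept or digest in seen_bodies:
            continue  # duplicate URL or duplicate content: toss it
        seen_bodies.add(digest)
        kept[canonical] = body
    return kept

pages = {
    "http://Example.com/a?sessionid=123": "<html>same body</html>",
    "http://example.com/a": "<html>same body</html>",
    "http://example.com/b": "<html>different body</html>",
}
print(list(scrub(pages)))  # ['http://example.com/a', 'http://example.com/b']
```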
Linkscape’s December Index Update
From this latest index (compiled over approx. the last 30 days) we’ve included:
- 47,652,586,788 unique URLs (47.6 billion)
- 223,007,523 subdomains (223 million)
- 58,587,013 root domains (58.6 million)
- 547,465,598,586 links (547 billion)
We’ve checked that all of these URLs and links existed within the last month or so. And I call out this notion of “verified” because we believe that’s what matters: the links and data you act on should still exist on the web today.
I hope you’ll agree. Or, at least, share your thoughts 🙂
New Updates to the Free & Paid Versions of our API
I also want to give a shout out to Sarah, who’s been hard at work repackaging our site intelligence API suite. She’s got all kinds of great stuff planned for early in the coming year, including tons of data in our free APIs. Plus, she’s dropped the prices on our paid suite by nearly 90%.
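If you haven’t tried the API, calling it amounts to a signed HTTP GET. The endpoint path, parameter names, and HMAC-SHA1 signing scheme below are illustrative assumptions on my part, so check the API documentation for the actual contract before building against it:

```python
# A sketch of calling a signed Linkscape API endpoint from Python.
# NOTE: endpoint path, parameter names, and the HMAC-SHA1 signing scheme
# are illustrative assumptions; consult the API docs for the real contract.
import base64
import hashlib
import hmac
import time
from urllib.parse import quote
from urllib.request import urlopen

ACCESS_ID = "your-access-id"    # placeholder credentials
SECRET_KEY = b"your-secret-key"

def signed_url(target: str) -> str:
    expires = int(time.time()) + 300  # signature valid for five minutes
    message = f"{ACCESS_ID}\n{expires}".encode()
    signature = base64.b64encode(
        hmac.new(SECRET_KEY, message, hashlib.sha1).digest()
    ).decode()
    return ("http://lsapi.seomoz.com/linkscape/url-metrics/"
            f"{quote(target, safe='')}"
            f"?AccessID={ACCESS_ID}&Expires={expires}&Signature={quote(signature)}")

# With real credentials, this prints the metrics response for the URL:
print(urlopen(signed_url("www.seomoz.org")).read())
```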
Both of these changes (more free data and lower prices) are great news for our many partners.
Thanks to these partners, we’ve doubled the traffic to our APIs to over 4 million hits per day, more than half of which comes from external partners! We’re really excited to be working with so many of you.