We all have it. The cruft. The low-quality, or even duplicate-content pages on our sites that we just haven’t had time to find and clean up. It may seem harmless, but that cruft might just be harming your entire site’s ranking potential. In today’s Whiteboard Friday, Rand gives you a bit of momentum, showing you how you can go about finding and taking care of the cruft on your site.
Video transcription
Howdy, Moz fans, and welcome to another edition of Whiteboard Friday. This week we're chatting about cleaning out the cruft from your website. By cruft, what I mean is low quality, thin, or duplicate content types of pages that can cause issues even if they don't seem to be causing a problem today.
What is cruft?
If you were to, for example, launch a large number of low quality pages, pages that Google thought were of poor quality, that users didn’t interact with, you could find yourself in a seriously bad situation, and that’s for a number of reasons. So Google, yes, certainly they’re going to look at content on a page by page basis, but they’re also considering things domain wide.
So they might look at a domain and see lots of these green pages, high quality, high performing pages with unique content, exactly what you want. But then they’re going to see like these pink and orange blobs of content in there, thin content pages with low engagement metrics that don’t seem to perform well, duplicate content pages that don’t have proper canonicalization on them yet. This is really what I’m calling cruft, kind of these two things, and many variations of them can fit inside those.
But one issue with cruft, for sure, is that it can cause Panda problems. So Google's Panda algorithm is designed to look at a site and say, "You know what? You're tipping over the balance of what a high quality site looks like to us. We see too many low quality pages on the site, and therefore we're not just going to hurt the ranking ability of the low quality pages, we're going to hurt the whole site." Very problematic, really challenging, and many folks who've encountered Panda issues over time have seen this.
There are also other, probably not directly Panda-related things, like site-wide algorithmic analysis of engagement and quality. So, for example, there was a recent analysis of the Phantom II update that Google did, which hasn't really been formalized very much and which Google hasn't said anything about. But one of the things that analysis looked at was the engagement of pages on the sites that got hurt versus the engagement of pages on the sites that benefited, and you saw a clear pattern: engagement on sites that benefited tended to be higher, and on those that were hurt it tended to be lower. So again, it could be not just Panda but other things that will hurt you here.
It can waste crawl bandwidth, which sucks. Especially if you have a large or complex site, if the engine has to go crawl a bunch of pages that are cruft, that potentially means less crawl bandwidth and less frequent crawling of your good pages.
It can also hurt from a user perspective. User happiness may be lowered, and that could mean a hit to your brand perception. It could also drive down better converting pages. It’s not always the case that Google is perfect about this. They could see some of these duplicate content, some of these thin content pages, poorly performing pages and still rank them ahead of the page you wish ranked there, the high quality one that has good conversion, good engagement, and that sucks just for your conversion funnel.
So all sorts of problems here, which is why we want to try and proactively clean out the cruft. This is part of the SEO auditing process. If you look at a site audit document, if you look at site auditing software, or step-by-step how-to’s, like the one from Annie that we use here at Moz, you will see this problem addressed.
How do I identify what’s cruft on my site(s)?
So let’s talk about some ways to proactively identify cruft and then some tips for what we should do afterwards.
Filter that cruft away!
One of those ways, for sure, that a lot of folks use is Google Analytics or Omniture or Webtrends, whatever your analytics system is. What you're trying to design there is a cruft filter. So I've got my little filter. I keep all my good pages inside, and I filter out the low quality ones.
What I can use is one of two things. First, a threshold for bounce rate, time on site, or pages per visit; any kind of engagement metric that I like can work as a potential filter. I could also use some sort of percentage, meaning in scenario one I basically say, "Hey, the threshold is anything with a bounce rate higher than 90%. I want my cruft filter to show me what's going on there." I'd create that filter inside GA or inside Omniture. I'd look at all the pages that match the criteria, and then I'd try and see what was wrong with them and fix those up.
The second one is basically I say, "Hey, here's the average time on site, here's the median time on site, here's the average bounce rate, median bounce rate, average pages per visit, median, great. Now take me 50% below that or one standard deviation below that. Now show me all that stuff; filter that out."
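If your analytics tool lets you export a page-level report, you can build the same kind of filter outside the UI. Here's a minimal sketch in Python, assuming a hypothetical CSV export with page, visits, bounce_rate, and avg_time_on_page columns; the file name, column names, and thresholds are all placeholders to adjust for your own data, and it skips pages with very few visits for reasons I'll come back to in the tips section.

```python
import csv
import statistics

# All names and thresholds here are placeholders -- match them to your own export.
MIN_VISITS = 100          # skip pages with too small a sample to judge
BOUNCE_THRESHOLD = 0.90   # scenario one: fixed bounce rate threshold

pages = []
with open("pages_export.csv", newline="") as f:   # hypothetical analytics export
    for row in csv.DictReader(f):
        pages.append({
            "page": row["page"],
            "visits": int(row["visits"]),
            "bounce_rate": float(row["bounce_rate"]),
            "avg_time_on_page": float(row["avg_time_on_page"]),
        })

sampled = [p for p in pages if p["visits"] >= MIN_VISITS]

# Scenario one: flag anything bouncing above the fixed threshold.
high_bounce = [p for p in sampled if p["bounce_rate"] > BOUNCE_THRESHOLD]

# Scenario two: flag anything more than one standard deviation
# below the site's mean time on page.
times = [p["avg_time_on_page"] for p in sampled]
cutoff = statistics.mean(times) - statistics.stdev(times)
low_engagement = [p for p in sampled if p["avg_time_on_page"] < cutoff]

for p in high_bounce + low_engagement:
    print(p["page"], p["bounce_rate"], round(p["avg_time_on_page"], 1))
```

Either way, the output is just a candidate list to investigate by hand, not a list of pages to delete automatically.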
This process is going to capture thin and low quality pages, the ones I've been showing you in pink. It's not going to catch the orange ones. Duplicate content pages are likely to perform very similarly to the thing that they are a duplicate of. So this process is helpful for one of those, not so helpful for the other.
Sort that cruft!
For that process, you might want to use something like Screaming Frog or OnPage.org, which is a great tool, or Moz Analytics, which comes from some company I've heard of.
Basically, in this case, you've got a cruft sorter that is essentially doing filtration: items you can identify by things like the URL string, title elements that match, content that matches, those kinds of things. So you might use a duplicate content filter. Most of these pieces of software already have a default setting. In some of them you can change that; I think OnPage.org and Screaming Frog both let you change the duplicate content threshold. Moz Analytics, not so much, and the same goes for Google Webmaster Tools, now Search Console, which I'll talk about in a sec.
So I might say like, “Hey, identify anything that’s more than 80% duplicate content.” Or if I know that I have a site with a lot of pages that have only a few images and a little bit of text, but a lot of navigation and HTML on them, well, maybe I’d turn that up to 90% or even 95% depending.
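Each of those crawlers calculates its duplicate percentage in its own way, so treat the following as a back-of-the-envelope version of the same idea rather than a reproduction of any tool's metric: a short Python sketch that compares pages by word shingles and flags pairs above a similarity threshold. It assumes you've already extracted the main text of each page by whatever means you prefer.

```python
import re
from itertools import combinations

def shingles(text, k=5):
    """Sets of k-word windows; a rough near-duplicate fingerprint."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(text_a, text_b):
    """Jaccard overlap of two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def find_dupes(pages, threshold=0.80):
    """pages: dict of URL -> extracted page text. Returns flagged pairs."""
    flagged = []
    for (url_a, text_a), (url_b, text_b) in combinations(pages.items(), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            flagged.append((url_a, url_b, round(score, 2)))
    return flagged
```

Raising the threshold toward 90% or 95% here is the same move as turning up the duplicate filter in the crawler tools: fewer false positives on template-heavy pages, at the cost of missing some near-duplicates.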
I can also use some rules to identify known duplicate content violators. So, for example, maybe I've identified that everything with a query string like "?ref=bounce" or "?ref=partner" is a duplicate. Well, okay, now I just need to filter for that particular URL string. Or I could look for titles. So if I know that, for example, one of my pages, or a certain type of page, has been heavily duplicated throughout the site, I can look for all the titles containing those and then filter out the dupes.
I can also do this for content length. Many folks will look at content length and say, “Hey, if there’s a page with fewer than 50 unique words on it in my blog, show that to me. I want to figure out why that is, and then I might want to do some work on those pages.”
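Both of those rule types, known URL or title patterns and content length, are easy to express in code once you know your own offenders. Here's a small sketch; the URL pattern, title fragment, and 50-word cutoff are hypothetical examples to swap for whatever you've actually identified on your site.

```python
import re

# Hypothetical examples -- swap in the patterns you've identified on your own site.
DUPE_URL_PATTERN = re.compile(r"\?ref=(bounce|partner)", re.IGNORECASE)
DUPE_TITLE_FRAGMENT = "printer-friendly"
MIN_UNIQUE_WORDS = 50

def cruft_reasons(url, title, body_text):
    """Return the list of reasons a page looks like cruft (empty if none)."""
    reasons = []
    if DUPE_URL_PATTERN.search(url):
        reasons.append("known duplicate URL pattern")
    if DUPE_TITLE_FRAGMENT in title.lower():
        reasons.append("title matches a known duplicated page type")
    unique_words = set(re.findall(r"\w+", body_text.lower()))
    if len(unique_words) < MIN_UNIQUE_WORDS:
        reasons.append(f"only {len(unique_words)} unique words")
    return reasons
```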
Ask the SERP providers (cautiously)
Then the last one that we can do for this identification process is Google and Bing Webmaster Tools/Search Console. They have existing filters and features that aren't very malleable. We can't do a whole lot with them, but they will show you potential site crawl issues, broken pages, and sometimes dupe content. They're not going to catch everything though. Part of this process is to proactively find things before Google and Bing find them and start considering them a problem on our site. So we may want to do some of this work before we go, "Oh, let's just shove an XML sitemap at Google and let them crawl everything, and then they'll tell us what's broken." A little risky.
Additional tips, tricks, and robots
A couple additional tips: analytics stats, like the ones from GA or Omniture or Webtrends, can totally mislead you, especially for pages with very few visits, where you just don't have enough of a sample set to know how they're performing, or for pages the engines haven't indexed yet. So if something hasn't been indexed or it just isn't getting search traffic, it might show you misleading metrics about how users are engaging with it, and that could bias you in ways you don't want to be biased. So be aware of that. You can control for it generally by looking at other stats or by using these other methods.
When you're doing this, the first thing you should do any time you identify cruft is remove it from your XML sitemaps. That's just good hygiene, good practice. Oftentimes that alone is enough of a preventative measure to keep you from getting hurt here.
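If your sitemap is generated by a script or maintained by hand, this is easy to automate. Here's a minimal sketch, assuming a standard sitemaps.org-format sitemap.xml; the file names and the example.com URL in the usage comment are placeholders.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def strip_cruft_from_sitemap(sitemap_path, cruft_urls, out_path):
    """Drop flagged URLs from an XML sitemap and write the cleaned file."""
    ET.register_namespace("", NS["sm"])  # keep the default namespace on output
    tree = ET.parse(sitemap_path)
    root = tree.getroot()
    for url_el in list(root.findall("sm:url", NS)):
        loc = url_el.findtext("sm:loc", default="", namespaces=NS)
        if loc.strip() in cruft_urls:
            root.remove(url_el)
    tree.write(out_path, encoding="utf-8", xml_declaration=True)

# Hypothetical usage:
# strip_cruft_from_sitemap("sitemap.xml",
#                          {"https://example.com/old-cruft-page"},
#                          "sitemap.clean.xml")
```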
However, beyond "don't include it in your XML sitemap," there's no one-size-fits-all methodology. If it's a duplicate, you want to canonicalize it. I don't necessarily want to delete all these pages. Maybe I want to delete some of them, but I need to be deliberate about that. Maybe they're printer-friendly pages. Maybe they're pages that have a specific format, like a PDF version instead of an HTML version. Whatever it is, you want to identify those and probably canonicalize them.
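If you want to verify that the canonical tags actually made it onto those duplicate pages, here's a rough spot-check sketch in Python. The URLs in the usage comment are hypothetical, and a production check should parse the HTML properly and also look at the Link HTTP header rather than relying on a regex.

```python
import re
import urllib.request

def check_canonical(dupe_url, expected_canonical):
    """Spot-check that a duplicate page points rel=canonical at the primary URL."""
    with urllib.request.urlopen(dupe_url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    tag = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]*>', html, re.IGNORECASE)
    if not tag:
        return False
    href = re.search(r'href=["\']([^"\']+)["\']', tag.group(0), re.IGNORECASE)
    return bool(href) and href.group(1).rstrip("/") == expected_canonical.rstrip("/")

# Hypothetical usage:
# check_canonical("https://example.com/page?print=1", "https://example.com/page")
```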
Is it useful to no one? Like literally, absolutely no one. You don't want engines visiting it. You don't want people visiting it. There's no channel you care about sending traffic to that page. Well, you have two options. 301 it: if it's already ranking for something or it's on the topic of something, send it to the page you wish that traffic was going to, the one that will perform well. Or you can completely 404 it. Of course, if you're having serious trouble or you need to remove it entirely from the engines ASAP, you can use the 410 (permanently deleted) status. Just be careful with that.
Is it useful to some visitors, but not search engines? Like you don't want searchers to find it in the engines, but if somebody goes and is paging through a bunch of pages and that kind of thing, okay, great, I can use noindex, follow for that in the meta robots tag of the page.
If there's no reason bots should access it at all, like you don't care about them following the links on it (this is a very rare use case, but there can be certain types of internal content, like a huge internal file system that particular kinds of your visitors might want access to but nobody else), you can use the robots.txt file to block crawlers from visiting it. Just be aware it can still get into the engines if it's blocked in robots.txt; it just won't show any description. They'll say, "We are not showing a site description for this page because it's blocked by robots."
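Whichever of these treatments you apply, canonical, 301/404/410, noindex, or a robots.txt block, it's worth spot-checking that each cruft URL actually got the treatment you intended. Here's a rough Python sketch using only the standard library; it reports the status code (and redirect target, if any), whether a noindex meta robots tag is present, and whether robots.txt blocks crawling. It deliberately leaves out retries, rate limiting, and JavaScript rendering, and the example URL is a placeholder.

```python
import re
import urllib.error
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Don't follow redirects; we want to see the 301 itself."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)

def audit_cruft_url(url):
    """Report which treatment a cruft URL currently gets. Rough sketch only."""
    report = {"url": url}

    # robots.txt: is crawling blocked for this URL?
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()
    report["blocked_by_robots_txt"] = not rp.can_fetch("*", url)

    # Status code and, if the page resolves, its meta robots tag.
    try:
        with opener.open(url, timeout=10) as resp:
            report["status"] = resp.status
            html = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as e:
        report["status"] = e.code                     # 301, 404, 410, etc.
        report["redirect_target"] = e.headers.get("Location")
        html = ""
    report["noindex"] = bool(re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+noindex', html, re.IGNORECASE))
    return report

# Hypothetical usage:
# print(audit_cruft_url("https://example.com/some-cruft-page"))
```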
If the page is almost good, like it's on the borderline between pink and green here, well, just make it good. Fix it up. Make that page a winner, get it back in the engines, make sure it's performing well. Find all the pages that have those problems, fix them up, or consider recreating them and then 301'ing them over if you want to do that.
With this process, hopefully you can prevent yourself from getting hit by the potential penalties, or being algorithmically filtered, or just being identified as not that great a website. You want Google to consider your site as high quality as they possibly can. You want the same for your visitors, and this process can really help you do that.
Looking forward to the comments, and we’ll see you again next week for another edition of Whiteboard Friday. Take care.