Ever since I first saw Matt Cutts, Google’s head of search quality, “investigating” sites through his super-secret application (during an SES conference in NYC), calling out spammers and identifying crawl and ranking issues for curious site owners, I’ve wondered about the contents of his tool collection. What secrets can Googlers pull up on command? How much do they really know about you, your sites and the web? In order to answer these questions, I’ve put together my own speculative guide to the features, operations and data points available to Mr. Cutts and his team, along with my guess about the relative likelihood that each element is actually available on demand (high, moderate or low probability).
All the things that could be in a Googler’s toolbox (in Rand’s opinion):
- General Site Data
  - How many pages Google estimates you have, plus how many they’ve crawled and indexed | HIGH
  - Crawl rates for your site and how often they think you update your content in various sections | HIGH
  - Registration date and first-crawled date | HIGH
  - Server codes your pages return (or have returned in the past) – 200s, 404s, 301s, 302s, 500s, etc. – and any server errors or crawl problems Google’s encountered | HIGH
  - Anything you’ve submitted to Google’s Webmaster Central (aka Sitemaps) | HIGH
  - Any bans, penalties or “red flags” Google has given the site | HIGH
  - Domains/pages that have been 301/302’d to your site | HIGH
  - An estimate of total visitor traffic to your site (I’d think it would be easy for the search giant to combine Analytics data with their own search data and toolbar information to come up with a good formula for virtually any site – they almost certainly beat the pants off Alexa) | MODERATE
  - How well you rank for specific terms | MODERATE
  - An estimate of how much traffic Google sends to your site | LOW
  - Any advertising you buy/sell through Google (AdWords & AdSense accounts) | LOW
  - The top search terms that bring your site traffic | LOW
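The crawl-health signals above (status codes, server errors) lend themselves to a simple aggregation over crawl records. Here's a hypothetical sketch in Python – the function name, record format and thresholds are mine, not anything Google has published:

```python
from collections import Counter

def summarize_crawl(records):
    """Aggregate HTTP status codes from (url, status) crawl records.

    A hypothetical toy, not Google's internal format: counts each
    status code and reports what fraction of crawled URLs returned
    an error (4xx/5xx).
    """
    counts = Counter(status for _, status in records)
    total = len(records)
    errors = sum(n for code, n in counts.items() if code >= 400)
    return {
        "status_counts": dict(counts),
        "error_rate": errors / total if total else 0.0,
    }

# Example crawl log: mostly healthy, with one 404 and one 500
log = [
    ("/", 200), ("/about", 200), ("/old-page", 404),
    ("/blog", 200), ("/api", 500), ("/contact", 200),
]
report = summarize_crawl(log)
print(report["status_counts"])         # {200: 4, 404: 1, 500: 1}
print(round(report["error_rate"], 2))  # 0.33
```

A tool like Matt's could presumably run this kind of rollup over historical crawls too, surfacing when a site's error rate spiked.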
- Site Owner Information
  - How many sites you (or entities sharing your name, phone number, address or other registration data) own, and how many of these are active | HIGH
  - Length of domain registration | HIGH
  - Where your domains are hosted | MODERATE
  - AdSense/AdWords accounts you run or have access to (via the Google universal login) | MODERATE
  - Analytics, Toolbar, Sitemaps, Web Accelerator or other Google accounts you might have/use | LOW
  - IP addresses from which you’ve logged into a Google service | LOW
- Link Information
  - A complete list of the sites and pages on the web that link to your site | HIGH
  - PageRank data, probably with additional components that weren’t in the original formula | HIGH
  - Percentage of your links that come from suspected manipulative sources (paid links, link farms, comment spam, ad networks, etc.) | HIGH
  - A complete list of links to other sites/pages from your domain | HIGH
  - Quality, trust, authority and relevance of the sites you link out to | HIGH
  - Where the majority of links point (home page, internal pages, a few particularly popular pages, etc.) | MODERATE
  - Temporal data on links – the rate at which new links are coming to your site now and what those rates looked like in the past | MODERATE
  - Variation of anchor text in links that point to your site | MODERATE
  - Percentage of your links that come from blogs, wikis, forums, guestbooks or other potentially self-created sources | LOW
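For context on the PageRank item above, the published formula is just iterative link-vote counting. Here's a minimal power-iteration sketch (toy graph and names are mine; Google's production version has long since added components beyond this):

```python
def pagerank(links, damping=0.85, iters=50):
    """Minimal PageRank via power iteration.

    `links` maps each page to the list of pages it links out to.
    This is a sketch of the originally published formula only --
    the real system layers many more signals on top of it.
    """
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                share = rank[page] / len(outs)
                for target in outs:
                    new[target] += damping * share
            else:
                # Dangling page: spread its rank evenly everywhere
                for target in pages:
                    new[target] += damping * rank[page] / n
        rank = new
    return rank

# Toy three-page site: "home" earns the most inbound links
graph = {"home": ["about", "blog"], "about": ["home"], "blog": ["home", "about"]}
ranks = pagerank(graph)
```

On this toy graph the ranks sum to 1.0 and "home", with two inbound links, scores highest – which is exactly the shape of data you'd expect a Googler's dashboard to expose per page.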
- Historical Data
  - Historical PageRank data | HIGH
  - Site ownership/registration changes in the past | HIGH
  - If/when any of Google’s algorithmic updates affected your site’s rankings/traffic/crawl rate (this would be a great signal to help identify why your site might have been penalized, as the various updates all had particular foci) | HIGH
  - How well you’ve ranked in Google over time | MODERATE
  - Hosting/IP changes your site has made over time | MODERATE
  - Historical traffic levels to your site | LOW
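The algorithm-update signal above amounts to comparing a site's metrics before and after a known update date. A hypothetical sketch of that check (the function, data and update index are all invented for illustration):

```python
def update_impact(daily_traffic, update_day):
    """Relative change in mean traffic before vs. after an update.

    `daily_traffic` is an ordered list of daily visit counts and
    `update_day` the index where the algorithm update landed.
    Purely illustrative -- not a real Google metric.
    """
    before = daily_traffic[:update_day]
    after = daily_traffic[update_day:]
    if not before or not after:
        return 0.0
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    if mean_before == 0:
        return 0.0
    return (mean_after - mean_before) / mean_before

# A site that lost roughly half its traffic at day 4
traffic = [100, 110, 95, 105, 50, 48, 52, 49]
print(round(update_impact(traffic, 4), 2))  # -0.51
```

A sharp negative number coinciding with a dated update would be a strong hint about which change (and therefore which ranking focus) hit the site.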
What do you think? Any other signals that Matt might be accessing on his laptop during site reviews?