One frustrating aspect of link building is not knowing the value of a link. Although experience, and some data, can make you better at link valuation, it is impossible to know to what degree a link may be helping you. It’s hard to know if a link is even helping at all. Search engines do not count all links, reduce the value of many that they do count, and use factors related to your links to further suppress the value that’s left over. This is all done to improve relevancy and spam detection.
Knowing the basics of link-based spam detection can sharpen your sense of link valuation and show you how search engines approach the problem, which can lead to better link building practices.
I’d like to talk about a few interesting link spam analysis concepts that search engines may use to evaluate your backlink profile.
I don’t work at a search engine, so I can make no concrete claims about how search engines evaluate links. Engines may use some, or none, of the techniques in this post. They also certainly use more (and more sophisticated) techniques than I can cover in this post. However, I spend a lot of time reading through papers and patents, so I thought it’d be worth sharing some of the interesting techniques.
#1 Truncated PageRank
#2 Owned / Accessible Contributions
Links can be bucketed into three general buckets.
- Links from owned content – Links from pages for which search engines have determined some level of ownership (well-connected co-citation, IP, whois, etc.)
- Links from accessible content – Links from non-owned content where it is easy to add links (blogs, forums, article directories, guest books, etc.)
- Links from inaccessible content – Links from independent sources.
A link from any one of these sources is neither good nor bad. Links from owned content, via networks and relationships, are perfectly natural. Likewise, a link from inaccessible content could be a paid link, so falling into that bucket doesn’t make a link inherently good. Still, knowing the bucket a link falls into can change the valuation.
This type of analysis on two sites can show a distinct difference in a link profile, all other factors being equal. The first site is primarily supported on links from content it directly controls or can gain access to. However, the second site has earned links from a substantially larger percentage of unique, independent sources. All things being equal, the second site is less likely to be spam.
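To make the comparison concrete, here is a minimal sketch of how bucket shares could be computed once each link has been labeled. The labels, the `bucket_shares` helper, and the two sample profiles are all hypothetical; how an engine actually assigns a link to a bucket is not something I can show.

```python
from collections import Counter

# Hypothetical labels for the three buckets described above.
BUCKETS = ("owned", "accessible", "inaccessible")

def bucket_shares(link_buckets):
    """Return the fraction of a profile's links that falls into each bucket."""
    counts = Counter(link_buckets)
    total = sum(counts[b] for b in BUCKETS)
    if total == 0:
        return {b: 0.0 for b in BUCKETS}
    return {b: counts[b] / total for b in BUCKETS}

# Made-up profiles: site A leans on content it controls or can access,
# site B is mostly linked from independent sources.
site_a = ["owned"] * 5 + ["accessible"] * 4 + ["inaccessible"] * 1
site_b = ["owned"] * 1 + ["accessible"] * 2 + ["inaccessible"] * 7

shares_a = bucket_shares(site_a)
shares_b = bucket_shares(site_b)
print(shares_a)
print(shares_b)
```

With these made-up numbers, site B has a much larger share of links from inaccessible content, which is the pattern described above for the second site.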
#3 Relative Mass
Relative Mass accounts for the percent distribution of a profile for certain types of links. The example of the pie charts above demonstrates the concept of relative mass.
This type of analysis can be applied to tactics as well, such as distribution of links from comments, directories, articles, hijacked sources, owned pages, paid links, etc. The algorithm may provide a certain degree of “forgiveness” before its relative mass contribution exceeds an acceptable level.
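The idea of "forgiveness" can be sketched as a per-tactic allowance on the share of a profile. Everything here is an assumption for illustration: the tactic names, the 25% limits, and the `excess_mass` helper are mine, not values any search engine has published.

```python
def excess_mass(tactic_counts, allowed_share):
    """Return how far each tactic's share of the profile exceeds its allowance.

    tactic_counts: dict of tactic name -> number of links.
    allowed_share: dict of tactic name -> tolerated fraction (hypothetical);
                   tactics without an entry are never penalized.
    """
    total = sum(tactic_counts.values())
    excess = {}
    for tactic, count in tactic_counts.items():
        share = count / total if total else 0.0
        limit = allowed_share.get(tactic, 1.0)
        excess[tactic] = max(0.0, share - limit)
    return excess

# Made-up profile: 60% of links come from comments.
profile = {"comments": 60, "directories": 10, "editorial": 30}
limits = {"comments": 0.25, "directories": 0.25}

over = excess_mass(profile, limits)
print(over)
```

In this example only the comment links exceed their allowance, so only that portion of the profile would be a candidate for devaluation.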
#4 Counting Supporters / Speeds to Nodes
Another method of valuing links is by counting supporters and the speed of discovery of those nodes (and the point at which this discovery peaks).
A histogram distribution of supporting nodes by hops can demonstrate the differences between spam and high quality sites.
Well-connected sites will grow in supporters more rapidly than spam sites, and spam sites are likely to peak earlier. Spam sites grow rapidly at first and then decay quickly as you move away from the target node. This distribution can help signify that a site is using spammy link building practices. Because spam networks have higher degrees of clustering, domains repeat from hop to hop, which makes spam profiles bottleneck faster than non-spam profiles.
Protip: I think this is one reason that domain diversity and unique linking root domains are well correlated with rankings. I don’t think the relationship is as naïve as counting linking domains, but an analysis like supporter counting, as well as Truncated PageRank, would make receiving links from a larger set of diverse domains better correlated with rankings.
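Here is a rough sketch of supporter counting as a breadth-first walk backwards over inbound links. The graph format, the `supporters_by_hop` name, and both toy graphs are assumptions for illustration; real implementations work on web-scale graphs and typically use approximate counting.

```python
def supporters_by_hop(inlinks, target, max_hops):
    """Count newly discovered supporters at each hop distance from target.

    inlinks maps a node to the list of nodes that link to it.
    A node is only counted the first time it is reached.
    """
    seen = {target}
    frontier = [target]
    counts = []
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for source in inlinks.get(node, []):
                if source not in seen:
                    seen.add(source)
                    next_frontier.append(source)
        counts.append(len(next_frontier))
        frontier = next_frontier
    return counts

# Toy "natural" graph: each hop reaches new, distinct nodes.
natural = {
    "t": ["a", "b"],
    "a": ["c", "d"], "b": ["e", "f"],
    "c": ["g", "h"], "d": ["i"], "e": ["j"],
}

# Toy "spam" cluster: the same few nodes link to each other.
spam = {
    "t": ["s1", "s2", "s3"],
    "s1": ["s2", "s3"], "s2": ["s1", "s3"], "s3": ["s1", "s2"],
}

natural_counts = supporters_by_hop(natural, "t", 3)
spam_counts = supporters_by_hop(spam, "t", 3)
print(natural_counts, spam_counts)
```

In the toy spam cluster the same nodes repeat on every hop, so no new supporters are found after the first hop. That is the bottleneck effect in miniature.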
#5 TrustRank, Anti-TrustRank, SpamRank, etc.
I won’t go into much more detail than that, because you can read about it in previous posts, but it comes down to four simple rules.
- Get links from trusted content.
- Don’t get links from spam content.
- Link to trusted content.
- Don’t link to spam content.
#6 Anchor Text vs. Time
Monitoring anchor text over time can give interesting insights that could detect potential manipulation. Let’s look at an example of how a preowned domain that was purchased for link value (and spam) might appear with this type of analysis.
This domain has a historical record of acquiring anchor text including both branded and non-branded targeted terms. Then that rate suddenly drops, and after some time a new influx of anchor text, never seen before, starts to come in. This type of anchor text analysis, in combination with orthogonal spam detection approaches, can help detect the point at which ownership changed. Links prior to this point can then be evaluated differently.
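One simple way to picture this is to track, per time period, what share of incoming anchor text has never been seen before. This is only a sketch: the `novel_anchor_share` helper and the sample anchors are invented, and a real system would combine this with other signals, as noted above.

```python
def novel_anchor_share(periods):
    """For each period, return the fraction of anchors not seen in earlier periods.

    periods: list of lists of anchor text strings, oldest first.
    The first non-empty period is always fully novel, so ignore it when
    looking for anomalies.
    """
    seen = set()
    shares = []
    for anchors in periods:
        if anchors:
            novel = sum(1 for a in anchors if a not in seen)
            shares.append(novel / len(anchors))
        else:
            shares.append(0.0)
        seen.update(anchors)
    return shares

# Made-up history: steady branded anchors, a quiet period, then a burst
# of anchors that never appeared before.
history = [
    ["acme", "acme widgets"],
    ["acme", "acme widgets", "acme"],
    [],
    ["cheap pills", "buy pills online", "cheap pills"],
]

shares = novel_anchor_share(history)
print(shares)
```

The jump back to fully novel anchor text after a quiet period is the kind of break that could mark a change of ownership.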
#7 Link Growth Thresholds
Sites with rapid link growth could have the impact dampened by applying a threshold on the value that can be gained within a unit of time. Corroborating signals can help determine if a spike is from a real event or viral content, as opposed to link manipulation.
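A threshold like this could be as simple as a cap per period, with anything above the cap credited at a reduced rate. The cap of 20, the 25% rate for the excess, and the `credited_value` helper are assumptions picked to keep the arithmetic easy to follow.

```python
def credited_value(new_value_per_period, cap, excess_rate):
    """Credit link value per period, discounting anything above the cap.

    new_value_per_period: raw link value gained in each period.
    cap: value that is credited in full per period (hypothetical).
    excess_rate: fraction of the value above the cap that still counts.
    """
    credited = []
    for value in new_value_per_period:
        excess = max(0, value - cap)
        credited.append(min(value, cap) + excess_rate * excess)
    return credited

# Made-up weekly link value with one large spike in week three.
weekly = [10, 12, 100, 11]
credited = credited_value(weekly, cap=20, excess_rate=0.25)
print(credited)
```

A corroborating signal, such as evidence of a real news event, could be modeled by raising `cap` or `excess_rate` for that period.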
#8 Robust PageRank
Robust PageRank works by calculating PageRank without the highest contributing nodes.
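As a very simplified illustration, assume you already have an estimate of how much PageRank each linking node contributes to a page. Dropping the top contributors and summing the rest shows how much of the score survives. The `robust_score` helper and the contribution values are made up, and this is a stand-in for the idea rather than the published algorithm.

```python
def robust_score(contributions, drop_top=1):
    """Sum PageRank contributions after removing the largest ones.

    contributions: dict of linking node -> contribution (assumed precomputed).
    drop_top: how many of the highest contributors to ignore.
    """
    ranked = sorted(contributions.values(), reverse=True)
    return sum(ranked[drop_top:])

# Made-up profile that depends heavily on a single strong link.
contributions = {"big-link": 0.50, "small-1": 0.02, "small-2": 0.03}

naive = sum(contributions.values())
robust = robust_score(contributions, drop_top=1)
print(naive, robust)
```

A page whose score collapses once its strongest contributor is removed depends on very few links, which is the situation this kind of analysis is meant to expose.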
#9 PageRank Variance
The uniformity of PageRank contribution to a node can be used to evaluate spam. Natural link profiles are likely to have a stronger variance in PageRank contribution. Spam profiles tend to be more uniform.
So if you use a tool, marketplace, or service to order 15 PR 4 links for a specific anchor text, those links will have a low variance in PR. This is an easy way to detect these sorts of practices.
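The 15 PR 4 links example is easy to check with the standard library. The "natural" list below is invented purely for contrast.

```python
from statistics import pvariance

# 15 purchased links, all PR 4, as in the example above.
purchased = [4] * 15

# Made-up natural profile with a mix of PR values.
natural = [0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]

purchased_variance = pvariance(purchased)
natural_variance = pvariance(natural)
print(purchased_variance, natural_variance)
```

The purchased set has zero variance, while the mixed profile does not. A variance near zero across a batch of links with the same anchor text would stand out.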
#10 Diminishing Returns
One way to minimize the value of a tactic is to create diminishing marginal returns on specific types of links. This is easiest to see in sitewide links, such as blogroll links or paid footer links. At one time, link popularity, in volume, was a strong factor, which led to sitewides carrying a disproportionate amount of value.
The first link from a domain carries the first vote, and getting additional links from that domain will continue to increase the total value from the domain, but only to a point. Each additional inbound link from the same domain adds less than the one before it. Going from 1 link to 3 links from a domain will have more of an effect than going from 101 links to 103 links.
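A logarithmic curve is one simple way to model this. The choice of `log1p` is my assumption for illustration; the actual shape of any curve a search engine uses is unknown.

```python
import math

def domain_value(link_count):
    """Total value credited to one linking domain (hypothetical log curve)."""
    return math.log1p(link_count)

# Compare the marginal gain of two extra links at different starting points.
gain_1_to_3 = domain_value(3) - domain_value(1)
gain_101_to_103 = domain_value(103) - domain_value(101)
print(round(gain_1_to_3, 3), round(gain_101_to_103, 3))
```

Under this curve, the same two extra links are worth far more to a domain's contribution at the start than they are after a hundred links.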
Link Spam Algorithms
All spam analysis algorithms have some percentage of accuracy and some level of false positives. Through the combination of these detection methods, search engines can maximize the accuracy and minimize the false positives.
Web spam analysis allows for more false positives than email spam detection, because there are often multiple alternatives to replace a pushed down result. It is not like email spam detection, which is binary in nature (inbox or spam box). In addition to this, search engines don’t have to create binary labels of “spam” or “not spam” to effectively improve search results. By using analysis, such as some of those discussed in this post, search engines can simply dampen rankings and minimize effects.
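Dampening instead of labeling can be pictured as scaling a relevance score by a spam likelihood. The scores, the likelihoods, the example URLs, and the `dampen` helper are all hypothetical.

```python
def dampen(results):
    """Scale each result's relevance by (1 - spam likelihood) and re-rank.

    results: list of (url, relevance, spam_likelihood) tuples, with
             spam_likelihood in [0, 1] (assumed to come from other analysis).
    """
    adjusted = [(url, relevance * (1.0 - spam)) for url, relevance, spam in results]
    return sorted(adjusted, key=lambda item: item[1], reverse=True)

# Made-up result set: the top result by raw relevance looks spammy.
results = [
    ("a.example", 0.90, 0.8),
    ("b.example", 0.70, 0.1),
    ("c.example", 0.60, 0.0),
]

ranking = [url for url, _ in dampen(results)]
print(ranking)
```

Nothing is removed from the results here. The suspicious page is pushed down, and the alternatives take its place.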
These analysis techniques are also designed to decrease the ROI of specific tactics, which makes spamming harder and more expensive. The goal of this post is not to make you stress about which links work and which don’t, because that is hard to know. The goal is to demonstrate some of the problem-solving tactics used by search engines and how they affect your own tactics.