Dear +John Mueller
I've been digging around a little into the effects of internal links pointing to 404s. I found your rather useful post from 2013 about 404s (https://plus.google.com/+JohnMueller/posts/RMjFPCSs5fm) but suspect the reference for that was crawl errors derived from malformed URIs etc. I also found an old product forum post from +Gary Illyes where he mentioned "PageRank can only exist for live URLs that have a representation in Google's index and thus URLs that were dropped because they became an error, can't pass signals." (https://productforums.google.com/d/msg/webmasters/q-LqEpOfT1w/eLIeZXnmqooJ)
But I cannot find anything regarding the links that point to 404s.
So my question for you today, should you decide to answer: what happens to PageRank when a link points to a non-existent URL on the same domain? Does the PR that link passes sink, or does Google ignore that link and remove it from the PageRank equation?
Thanks in advance if you can help with my research, John :)
This post was first made on the Richard Hearne Google+ profile.
Without being John, here's my take on how PageRank is passed around for different status codes: https://docs.google.com/drawings/d/1aL8JLqLGrkFpDO9UYdAnOjiOTi2kF2hHQPfsDssTI2Y/edit?usp=sharing
Hope this helps with your research
Comment by Gary Illyes — February 19, 2015 @ 6:06 pm
Thanks Gary. That's interesting, but I still don't know whether the PR that would normally pass along a given link is sunk because that link terminates at a 404 or not. The diagram (useful, thanks) only shows that C can neither receive nor pass PR, but doesn't say how the link on A to the non-existent URI is handled.
BTW one other scenario that might be nice to add there is links to URIs that are inaccessible due to robots blocks. I always found that to be an interesting issue.
FWIW I'm looking for justification to get a large site to clean up non-trivial volumes of 404s, but cannot play the UX card. It has to be SEO driven.
Comment by Richard Hearne — February 19, 2015 @ 6:14 pm
Without being John… or Gary, my guess is that PR flowing to 404 pages is treated the same way as PR flowing to nofollowed links: Google assigns an amount of PR to that reference, and then it is either passed on or goes into the oblivion of unicorns and rainbow cats!
Comment by Pedro Dias — February 19, 2015 @ 6:39 pm
I added a couple more scenarios. Basically if indexable A links to B which is also indexable, and B links to C which is not indexable, B will get PageRank and C will not.
As for the roboted (disallowed by robots.txt) pages, you have two cases:
1. The page became roboted after it was indexed, in which case we may decide to keep the page in our index with all its accumulated signals, including PageRank. Since we can't access the page any more, it will not pass PageRank.
2. If we find a roboted page and it's not already indexed, we will not index it and thus it can't get or pass PageRank.
Not quite sure how to put this on the diagram though.
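A hedged sketch of the chain scenario above (the page names A, B, C and the `receives_pagerank` helper are purely illustrative, not Google's actual logic): an indexable page can accumulate PageRank from indexable pages that link to it, while a non-indexable page gets none.

```python
def receives_pagerank(page, indexable, links):
    """A page can accumulate PageRank only if it is itself indexable
    and at least one indexable page links to it (simplified model)."""
    if not indexable[page]:
        return False
    return any(indexable[src] for src, dst in links if dst == page)

# A links to B, B links to C; C is not indexable.
indexable = {"A": True, "B": True, "C": False}
links = [("A", "B"), ("B", "C")]

assert receives_pagerank("B", indexable, links) is True   # B gets PageRank
assert receives_pagerank("C", indexable, links) is False  # C does not
```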
Does this help?
Comment by Gary Illyes — February 19, 2015 @ 6:44 pm
That helps. I suspect a direct answer on the sink might not be easy to give.
On the robots.txt – put a nice big iron gate across the path
I've learnt something new also – I hadn't realised case 2, and always felt that if enough links pointed at something behind robots.txt it could get indexed. I had always assumed the login pages of some of the bigger sites I've worked on got indexed that way, but obviously they must have been indexed previously.
Thanks for the input both +Gary Illyes and +Pedro Dias (I'll dream of unicorns shortly sigh)
Comment by Richard Hearne — February 19, 2015 @ 6:53 pm
Nice graph :).
Comment by John Mueller — February 19, 2015 @ 7:33 pm
Yeah, I think I found a new career opportunity.
Comment by Gary Illyes — February 19, 2015 @ 7:53 pm
I remember +John Mueller also liked to do graphs for mobile redirections http://en.pedrodias.net/ux/how-to-handle-desktop-and-mobile-users
Comment by Pedro Dias — February 19, 2015 @ 7:58 pm
If A is indexed and has 2 dofollow links, one to B and one to C, where B resolves 200 (exists) and C resolves 404 (does not exist), then the link to B passes half the PR of A, and the link to C is just wasted as no PR is passed. Removing the link to C would result in more PR from A being passed to B. Isn't that enough of an "SEO" reason besides the UX card?
Comment by Trey Collier — February 20, 2015 @ 1:05 am
+Trey Collier have you seen Google saying this anywhere?
Comment by Richard Hearne — February 20, 2015 @ 2:23 am
I think you have the answers you need on that one, in pieces.
Google have confirmed in the past that they changed how they calculate the PageRank for outlinks with regard to PR sculpting. The last word on this is that they calculate how much PR should flow through all outlinks, and for any links determined not to flow PR (rel=nofollow, for example), the PR goes the way of the unicorn.
Google have previously stated and Gary reconfirmed above that error documents can't maintain PR in Google's systems.
In your example above, my expectation is that by removing the links pointing to 404 pages – with fewer outlinks per page, each link that does receive and flow PR will get slightly more.
Comment by Alistair Lattimore — February 20, 2015 @ 4:00 am
I commented on your post Alistair (I think it was a reply here), and my expectation has always been analogous to yours. But when I started digging it quickly became apparent that this theory was more myth than fact, and I don't think Google has ever officially stated how they handle PR for internal (or external) 404 links.
The two obvious choices are to sink the PR passed over a 404, or to remove that link from the PR distribution and therefore redistribute the PR over the remaining links from the document.
I've been unable to find anything official on this, and perhaps it's coming close to the secret sauce I mentioned on your post.
Comment by Richard Hearne — February 20, 2015 @ 6:44 am
I think you are missing a simple point +Richard Hearne A page that doesn't exist, cannot receive any Link Juice! (aka PR)
+Alistair Lattimore and I have alluded to the idea that PR from a page is divided equally among all the links that are followable… and that getting rid of 404 links will give the remaining links a boost of PR link juice. Google's patent on this (PageRank) explains it quite well.
Google will not deny or confirm how they treat links to 404 pages. But their Patent does….and a page that doesn't exist can NOT hold or receive any PR. Am I missing something here?
Comment by Trey Collier — February 20, 2015 @ 9:39 am
Nope, I'm not missing that at all. My question has never been about the 404 receiving PR. It is about what happens to the link on a page which points to the non-existent page.
Apart from revealing changes to how internal NOFOLLOW links are handled (which wasn't part of the original patent), Google has never publicly explained how they handle links to dead pages in terms of PR distribution AFAIK.
If you can point me to any references from the original patent covering this, I'd like to give it a gander again.
Comment by Richard Hearne — February 20, 2015 @ 9:56 am
+Gary Illyes, +John Mueller, if you don't mind me asking:
Can we please have a direct answer on the fate of the PR associated with the link from A to the 404ed D in Gary's pretty graph?
Does it go to the unicorns (as nofollowed link juice does in my understanding; so far I've always assumed this), or is the link discarded before redistributing PR, or what?
Of course, a "classified" answer is understandable, but if no secret sauce is involved, a direct answer to the point would help us take better decisions in some scenarios.
Comment by Federico Sasso — February 21, 2015 @ 10:20 am
To analyze the situation, I would strictly separate indexing from WhateverRank¹ distribution. Let's just talk about how things should be distributed, putting aside separate topics like indexing, crawling, or resource HTTP statuses.
First, we should all remember that the first version of PageRank was a (very) rough way to estimate the probability of each page on the web to be reached by a web surfer. The PageRank formula was designed taking into account how many links each page contained because knowing the quantity of those links was necessary to estimate the probability of each link to be clicked by the user.
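For reference, the commonly cited form of that original formula, where d is the damping factor, N the total number of pages, and L(q) the number of outlinks on page q:

```latex
PR(p) = \frac{1-d}{N} + d \sum_{q \to p} \frac{PR(q)}{L(q)}
```

Each inlinking page q contributes its own PageRank divided by its outlink count, which is why the number of links on a page matters for this estimate.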
Now, reality tells us that as long as a user can see and click a link, that link's existence on the page will decrease the probability of the other same-page links being clicked.
When Google introduced the rel=nofollow attribute, WhateverRank was calculated so that the total amount of WhateverRank sent by a page was divided between all the do-follow links in the page. This was a big mathematical mistake, because doing things this way created an artificial WhateverRank boost for the resources linked by the do-follow links. It was artificial because the presence of a rel=nofollow attribute on a link doesn't change at all the actual probability that users will click the remaining links on the page.
To fix their mistake, Google changed the formula so that the WhateverRank sent by a page was divided between all the links it contained, regardless of their follow/nofollow status. This update made the WhateverRank calculations provide values that better matched what users can actually do while navigating web pages.
The lesson that Google learnt from this mistake was that a formula like that should always take into consideration what happens in real life.
Now, how should a formula like that behave when one of the two links in a page points to a resource that sends a 404/410/403/4xx status code? In my opinion +Trey Collier has the correct answer: if the Google engineers have learnt a good general lesson from the nofollow mistake, we should expect that the link that points to the reachable resource would get about half of the WhateverRank owned by the page. The user sees two links, so the probability of each being clicked is roughly 50%.
(By the way: that 50% figure assumes that WhateverRank is divided evenly among all the links of a page, which is probably not what the search engine actually does, but that's a different topic)
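As a hedged illustration of the two treatments described above (the function names and the even-split assumption are mine, not Google's): under the post-change model each link's share is computed over all links, so a nofollowed link's share simply evaporates, while the pre-change model divided PR over followed links only.

```python
def share_per_followed_link(total_links, followed_links):
    """Post-change model: a page's PageRank is divided over ALL links,
    so each followed link carries 1/total_links of it; the shares of
    nofollowed links evaporate (followed_links is irrelevant here)."""
    return 1.0 / total_links

def share_old_model(total_links, followed_links):
    """Pre-change model: PageRank divided only over followed links,
    giving them an artificial boost when nofollow is added."""
    return 1.0 / followed_links

# A page with 2 links, one of them nofollowed:
assert share_per_followed_link(2, 1) == 0.5  # current: followed link gets 50%
assert share_old_model(2, 1) == 1.0          # old: followed link got 100%
```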
Now we have a good answer to the question "what happens to the WhateverRank transferred by the links that point to the reachable resources". It's a good answer because:
1) This kind of approach/formula better models what actually happens in reality: as long as a user can see and click a link, that link affects the probability of the other links being clicked
2) It's the correct mathematical way to approach the situation
3) Google has already made the mistake of artificially boosting PageRank in the past. From that specific mistake, they should have learnt a general lesson about WhateverRank distribution
So, we don't have an official answer from Google, but all the signals point in the same direction: all the links in a page should be taken into consideration in order to better calculate a distribution of click/visit probability.
If you are interested in WhateverRank distribution, I would suggest focusing more on how much each link is visible to the user, because we should assume that dividing WhateverRank _evenly_ between all the links on a page, regardless of their actual probability of being seen and clicked by users, is not a very good way to design a probability-estimation formula.
¹ I've called it "WhateverRank" just because I think it's highly unlikely that Google hasn't developed something more sophisticated than that old goat in the last 17 years.
Comment by Enrico Altavilla — February 21, 2015 @ 5:20 pm
Comment by Richard Hearne — February 21, 2015 @ 5:32 pm
I think you guys are overthinking it. If a page doesn't exist, it can't get PageRank. If another page links to such a page, it will not pass PageRank as there's nowhere to pass that PageRank to. So, if A links to B, C, D, then B, C, D would, in most cases, get a fraction of A's PageRank. If D doesn't exist, well, it won't get any PageRank; it doesn't go to unicorns, it doesn't go to Wonderland, it doesn't even attempt to go anywhere.
Comment by Gary Illyes — February 21, 2015 @ 7:54 pm
Comment by Pedro Dias — February 21, 2015 @ 8:13 pm
+Gary Illyes : actually, there is a very good reason why people want to know what happens to the PageRank that is not assigned to D.
While the answer seems obvious to you, it is not so obvious to people who, just a few years ago, observed Google re-distributing the non-assigned PageRank between the do-follow links in a page.
If the distributing algorithm is:

A:
1) Ignore all the links that point to 4xx resources
2) Divide the page PageRank between the (remaining) links

…you get a very different result from:

B:
1) Divide the page PageRank between the links
2) Don't assign any PageRank to the 404 resources
We know that the correct way to do it is "B", but you shouldn't pretend that there is only one way of managing the distribution of PageRank and, more importantly, since Google has changed their minds about how to handle follow/nofollow links, asking the same question about 404 resources is perfectly justified.
So, yeah, it's "B", but don't forget that in the past Google temporarily applied the "A" approach to nofollowed links.
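A minimal sketch of the two candidate algorithms, A and B, described above (the page labels and helper names are illustrative only; nothing here is confirmed Google behaviour):

```python
def distribute_A(pagerank, links, dead):
    """Algorithm A: drop dead (4xx) links first, then divide the page's
    PageRank among the remaining links."""
    live = [link for link in links if link not in dead]
    return {link: pagerank / len(live) for link in live}

def distribute_B(pagerank, links, dead):
    """Algorithm B: divide the page's PageRank among ALL links, then
    simply discard the shares belonging to dead links."""
    share = pagerank / len(links)
    return {link: share for link in links if link not in dead}

# A page with PR 1.0 linking to B (200), C (200), and D (404):
links, dead = ["B", "C", "D"], {"D"}
assert distribute_A(1.0, links, dead) == {"B": 0.5, "C": 0.5}
assert abs(distribute_B(1.0, links, dead)["B"] - 1/3) < 1e-9
# Under A, removing the broken link to D changes nothing; under B it
# would raise B's and C's shares from 1/3 to 1/2.
```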
P.S. A small suggestion about your chart: remove the "because it can't be indexed" phrase, because the fact that a resource can't be indexed has nothing to do with the assignment or passing of PageRank. For example, you can create a NOINDEX resource and assign PageRank to it. That resource, by default, will also pass PageRank to other linked resources. So in my opinion you shouldn't create this connection between PageRank flow and indexing, because the two things are not really related.
Comment by Enrico Altavilla — February 21, 2015 @ 9:25 pm
+Enrico Altavilla a page that is NOINDEX is still in the Google index, it just won't be displayed to users.
The use of NOINDEX as the value to indicate to search engines that they should stop a page from being returned in search results always seemed strange to me; something like NODISPLAY might have been clearer.
Comment by Alistair Lattimore — February 22, 2015 @ 12:16 am
+Alistair Lattimore : it's not really so strange because, actually, Google is the only search engine that interprets the NOINDEX directive as a NODISPLAY directive. Other search engines apply the real and original meaning of that directive, that is: "do not create in the inverted index any association between words/phrases and this resource".
Even Google, in the past, complied with the real meaning of that directive and when they unilaterally changed the way the directive was interpreted, they kept its original name in their documentation (probably because the NOINDEX directive is a well known name that the webmasters were already accustomed to).
This change was a bit confusing to webmasters, but it's not entirely Google's fault. Technology has changed a lot since the first web search engines were created, and those old directives were named after the internal structures used at the time. There was an inverted index (a list of associations between words and documents) and an archive (a repository of the raw contents of a resource), so the instructions "NOINDEX" and "NOARCHIVE" made sense. Maybe, since then, the Google technology has changed so much that an inverted index and an archive do not exist anymore and it's no more possible to follow the original meaning of those directives…
…or maybe the internal Google definition of "index" is still "a list of associations between words and documents" and the change of interpretation of the NOINDEX directive was just a convenient way to keep more documents in the inverted index. The following video by Matt Cutts seems to confirm that when they talk about "the index" they are still referring to the industry-wide common meaning, an inverted index: https://www.youtube.com/watch?v=KyCYyoGusqs
So, every time I read about "the index" I assume that they are talking about an inverted index.
My reply to Gary was about a different aspect, though: I was saying that anything index-related (the index itself, the indexing process, and the "differently interpreted" NOINDEX directive) has nothing to do with how PageRank flows. So it's not correct to state that PageRank will not flow because a resource is not indexed. It doesn't flow because the resource is not in the link graph.
Comment by Enrico Altavilla — February 22, 2015 @ 1:44 am
I'm still none the wiser whether removing the broken link to D has any impact on the overall outcome +Gary Illyes. The question is whether the group of links to B and C is treated identically whether or not the broken link to D appears on page A.
I need to provide an SEO justification for cleaning up broken internal links on a site, and TBH I can't find anything I can refer to.
Comment by Richard Hearne — February 22, 2015 @ 5:33 am
+Richard Hearne : I'm sure that everyone who has commented on this thread will agree that fixing or removing broken links that users could click will produce a positive effect, either directly or indirectly. So yes, you should always try to fix broken things.
Also, if a website owner needs an SEO justification to fix a serious usability issue, then in my opinion they are not doing a good job of making the site usable or of optimizing it for search engines. Their approach to SEO could improve hugely if they started to observe things from the point of view of the user.
Comment by Enrico Altavilla — February 22, 2015 @ 6:18 am
Let's keep the focus on my original question Enrico. UX is off-topic entirely in this discussion.
[FWIW I agree with you wholeheartedly however]
Comment by Richard Hearne — February 22, 2015 @ 6:28 am
I applaud your efforts +Richard Hearne on getting a definitive answer here from a Google source. JM was your best chance… and you got a "Nice graph" quip.
You also got some great feedback from some of the best SEO studs in the business IMHO. +Enrico Altavilla did an awesome, nearly overboard, but certainly very insightful job of explaining PR distribution then and now.
+Alistair Lattimore, too, has more SEO knowledge in his left pinky than I have in my entire body… as do many others that I follow and trust.
The bottom line I see here is that you aren't going to get the answer you want from big "G".
You did get a pretty definitive consensus from a great group of superb international SEOs – opinions gathered from their own trials and tribulations, testing and such. Just curious: if "G" won't give you the definitive answer you crave, what will your next step be?
Comment by Trey Collier — February 22, 2015 @ 9:27 am
+Trey Collier just FYI +Gary Illyes is a colleague of +John Mueller so Gary's views are coming from an insider.
I suspect that Google prefers to remain ambiguous on how this works in case that information can be abused (although I can't figure out how).
Perhaps the truth will slip out some time, probably by accident.
Thanks to everyone who contributed here.
Comment by Richard Hearne — February 23, 2015 @ 1:54 am