User:GreenC/software/linkrot

How to fix link rot - on Wikipedia and elsewhere.

The 3 results
Given a URL, there are only 3 basic results:


 * 1) Convert to an archive URL: http://example.com --> https://web.archive.org/web/20240601/http://example.com
 * 2) Move to a new URL: http://example.com --> http://example-new.com
 * 3) Do nothing, leave it alone: http://example.com
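
As a small illustration of result 1, an archive URL of the form shown above can be built by prefixing the original URL with the archive host and a timestamp. This is a minimal sketch; the function name `to_archive_url` is hypothetical, and the default timestamp is simply the one used in the example above.

```python
def to_archive_url(url: str, timestamp: str = "20240601") -> str:
    """Build an archive URL in the same form as result 1 above."""
    return f"https://web.archive.org/web/{timestamp}/{url}"


print(to_archive_url("http://example.com"))
# prints: https://web.archive.org/web/20240601/http://example.com
```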

The 3 factors
When deciding which of the 3 results, there are 3 factors:


 * Redirects - A redirect is when a request for one URL is automatically forwarded to a different URL.
 * Soft-404s - A soft-404 is any page that loads successfully but contains content different from the desired content. Typically the URL redirects to a home page.
 * Soft-redirects - A soft-redirect is when the page is live at a different URL, but there is no active redirect to the new URL.
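
The first two factors can be read from an HTTP response, as the sketch below suggests. A soft-redirect cannot, because by definition nothing in the response points to the new URL, so it has to come from a rule written in advance. The function name `classify`, its labels, and the `soft404_targets` parameter are assumptions made for illustration only.

```python
def classify(status, location, soft404_targets):
    """Label one HTTP response using the factors above.

    status:          status code of the first response (200, 301, 404, ...)
    location:        redirect target, or None if there was no redirect
    soft404_targets: set of URLs already known to be soft-404 landing pages
    """
    if 300 <= status < 400 and location:
        if location in soft404_targets:
            return "soft-404"   # redirect lands on a known catch-all page
        return "redirect"       # redirect to some other URL
    if status == 200:
        return "live"
    return "dead"               # 404 or any other non-200 status


known = {"https://example.com"}
print(classify(301, "https://example.com", known))            # soft-404
print(classify(301, "https://example.com/page3.htm", known))  # redirect
```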

To properly determine which of the 3 results to choose, the 3 factors need to be known ahead of time. This foreknowledge may come from other editors who inform you that a URL has moved. Or it may come through discovery, by looking at logs to see where URLs redirect to, and interpreting that information. It's a process to learn the information, codify it, and upload the results.

Process
Process to decide the 3 results


 * 1) Codify any known soft-redirects as hard-coded rules based on foreknowledge. For example, transform http://example.com to http://example-new.com. We'll call the result the "newurl".
 * 2) Check newurl for redirects. We'll call the redirect target the "newloc" URL, i.e. the "new location" URL.
 * 3) Make a two-column table: newurl newloc
 * 4) Analyze the table looking for repeating instances of the same newloc in the second column. These indicate probable soft-404s.
 * 5) Add new rules (code) to account for the soft-404s.
 * 6) Re-process the links with the soft-404 rules in place.
 * 7) Check every URL and redirect URL for status 200 or 404:
    * If 404, add an archive URL, i.e. result #1.
    * If 200, return the newurl, i.e. result #2 or result #3, depending on whether newurl differs from the original URL.
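
The decision in Step 7 can be sketched as a small function. This is only an illustration under stated assumptions: the name `decide` and the returned (result number, URL) pair are hypothetical, and any non-200 status is treated like a 404.

```python
def decide(origurl, newurl, status, timestamp="20240601"):
    """Return (result_number, url) for Step 7."""
    if status == 200:
        if newurl != origurl:
            return 2, newurl    # result 2: move to a new URL
        return 3, origurl       # result 3: do nothing
    # Assumption: every non-200 status is handled like a 404 (result 1).
    return 1, f"https://web.archive.org/web/{timestamp}/{origurl}"


print(decide("http://example.com", "http://example-new.com", 200))
# prints: (2, 'http://example-new.com')
```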

Example code
The following pseudo-code demonstrates the steps:

origurl = "http://example.com"
newurl = sub("example.com", "example-new.com", origurl)  # Step 1 - codify known soft-redirects

(status, newloc) = networkcheck(newurl)  # Step 2 - check newurl for redirects
                                         # status = 200, 404, etc.
                                         # newloc = redirect URL

if newloc then
    print newurl "\t" newloc > table.txt  # Step 3 - make a two-column table
endif
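
The pseudo-code leaves networkcheck undefined. One possible Python version, using only the standard library, is sketched below. It sends a single HEAD request and does not follow redirects, so the caller sees the first status and the redirect target. The choice of HEAD, the timeout parameter, and the handling of relative Location headers are assumptions of this sketch; some servers answer HEAD differently from GET.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def networkcheck(url, timeout=10):
    """Return (status, newloc) for one request, without following redirects.

    newloc is the absolute redirect target, or None if there is none.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        location = resp.getheader("Location")
        # A Location header may be relative; resolve it against the request URL.
        newloc = urljoin(url, location) if location else None
        return resp.status, newloc
    finally:
        conn.close()
```

A call such as `status, newloc = networkcheck(newurl)` then matches Step 2, and rows with a non-empty newloc can be appended to the table.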

At this point we follow Step #4 and look at the table, which might look something like this:

http://example.com/page1.htm	https://example.com
http://example.com/page2.htm	https://example.com
http://example.com/page3.htm	https://example.com/page3.htm
http://example.com/page4.htm	https://example.com
http://example.com/page5.htm	https://example.com/page5.htm

Here we see page1, page2, and page4 redirect to the home page. The others redirect to "https". So we have learned two new rules:
 * All URLs in this domain have a soft-redirect to https
 * Any URL that redirects to https://example.com is a soft-404.
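
Spotting repeated newloc values (Step 4) is easy to do by eye on five rows but can be partly automated for larger tables. The sketch below counts how often each newloc appears and reports those seen more than once. The function name `repeated_newlocs` and the `min_count` threshold are assumptions, and a repeated newloc is only a probable soft-404, so the output still needs a human look.

```python
from collections import Counter

# The two-column table from above as (newurl, newloc) pairs.
rows = [
    ("http://example.com/page1.htm", "https://example.com"),
    ("http://example.com/page2.htm", "https://example.com"),
    ("http://example.com/page3.htm", "https://example.com/page3.htm"),
    ("http://example.com/page4.htm", "https://example.com"),
    ("http://example.com/page5.htm", "https://example.com/page5.htm"),
]


def repeated_newlocs(rows, min_count=2):
    """Return (newloc, count) pairs for newlocs seen at least min_count times."""
    counts = Counter(newloc for _newurl, newloc in rows)
    return [(loc, n) for loc, n in counts.most_common() if n >= min_count]


print(repeated_newlocs(rows))
# prints: [('https://example.com', 3)]
```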

So we modify the code as follows:

origurl = "http://example.com"
newurl = sub("example.com", "example-new.com", origurl)  # Step 1 - known soft-redirect
newurl = sub("http:", "https:", newurl)                  # Step 1 - known soft-redirect

(status, newloc) = networkcheck(newurl)  # Step 2 - check newurl for redirects
                                         # status = 200, 404, etc.
                                         # newloc = redirect URL

if newloc then
    if newloc == "https://example.com" then  # Step 5 - soft-404
        return "404"
    endif
    newurl = newloc
    (status, newloc) = networkcheck(newurl)
endif

if status == 200 then
    return newurl
else
    return "404"
endif

Thus the above code will return:

http://example.com/page1.htm --> https://web.archive.org/web/20240601/http://example.com/page1.htm
http://example.com/page2.htm --> https://web.archive.org/web/20240601/http://example.com/page2.htm
http://example.com/page3.htm --> https://example-new.com/page3.htm
http://example.com/page4.htm --> https://web.archive.org/web/20240601/http://example.com/page4.htm
http://example.com/page5.htm --> https://example-new.com/page5.htm
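
For readers who want something runnable, here is a rough Python rendering of the modified pseudo-code. It is a sketch, not the tool's actual code. The network is replaced by a hypothetical lookup table (`FAKE_NETWORK`) whose entries are invented so that the five example pages behave as described. Unlike the pseudo-code, which returns "404" and leaves the archive step to the caller, this version returns the archive URL directly.

```python
ARCHIVE_PREFIX = "https://web.archive.org/web/20240601/"
SOFT404_TARGETS = {"https://example.com"}  # rule learned in Step 4

# Hypothetical stand-in for the network: URL -> (status, newloc).
FAKE_NETWORK = {
    "https://example-new.com/page1.htm": (301, "https://example.com"),
    "https://example-new.com/page2.htm": (301, "https://example.com"),
    "https://example-new.com/page3.htm": (200, None),
    "https://example-new.com/page4.htm": (301, "https://example.com"),
    "https://example-new.com/page5.htm": (200, None),
}


def networkcheck(url):
    """Look the URL up in the fake network; unknown URLs count as 404."""
    return FAKE_NETWORK.get(url, (404, None))


def resolve(origurl):
    # Step 1 - known soft-redirects
    newurl = origurl.replace("example.com", "example-new.com", 1)
    newurl = newurl.replace("http:", "https:", 1)

    status, newloc = networkcheck(newurl)  # Step 2 - check for redirects
    if newloc:
        if newloc in SOFT404_TARGETS:      # Step 5 - soft-404
            return ARCHIVE_PREFIX + origurl
        newurl = newloc
        status, newloc = networkcheck(newurl)

    if status == 200:                      # Step 7
        return newurl
    return ARCHIVE_PREFIX + origurl


for n in range(1, 6):
    url = f"http://example.com/page{n}.htm"
    print(url, "-->", resolve(url))
```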