Wikipedia:Reference desk/Archives/Computing/2018 October 2

= October 2 =

Spider traps and DOS
Spider trap says that these traps may "cause a web crawler or search bot to make an infinite number of requests". Why would you do such a thing intentionally? Sure, it stops the problematic crawler/bot/etc., but wouldn't the effect be a denial-of-service attack, since you're receiving an infinite number of requests? I see that there are other options (e.g. serving massive text files or refusing entry to anything lacking the right cookie) that might simply crash the spider, but causing it to make an infinite number of requests seems counterproductive unless I'm misunderstanding the whole concept. Nyttend backup (talk) 18:52, 2 October 2018 (UTC)
 * These pages suggest that such traps are normally unintentional: https://www.portent.com/blog/seo/field-guide-to-spider-traps-an-seo-companion.htm https://yoast.com/spider-trap/ . One of those pages has a comment that one purpose is to stop email-address harvesting. Ways to deliberately DoS the spider would be to pause before returning content, and to dynamically generate page text without using disk storage. Graeme Bartlett (talk) 02:19, 3 October 2018 (UTC)
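The "dynamically generate page text without using disk storage" idea can be sketched as follows (a minimal illustration in Python; the function name and link scheme are invented for this example): each trap page is derived on the fly from its own URL and links only to deeper pages that likewise exist nowhere on disk, so a crawler that follows them never runs out of URLs while the server stores nothing.

```python
import hashlib

def trap_page(path: str, links_per_page: int = 2) -> str:
    """Generate a trap page for `path` entirely in memory. Every page
    links to deterministic but effectively endless child pages, so a
    crawler following the links never exhausts the trap; no disk
    storage is used at any point."""
    anchors = []
    for i in range(links_per_page):
        # Derive a stable pseudo-random child path from the current one.
        child = hashlib.sha256(f"{path}/{i}".encode()).hexdigest()[:12]
        anchors.append(f'<a href="{path}/{child}">more</a>')
    return "<html><body>" + " ".join(anchors) + "</body></html>"
```

A request handler would simply call `trap_page(request_path)` for any path under the trap prefix; because the child paths are hashed from the parent, the same URL always yields the same page, which makes the trap look like a real (if bottomless) site.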

 * A request-loop trap does not seem very intelligent to me either, as these things go. However, your server will escape DoS if its ability to answer requests is higher than the spider's ability to make them, and that does not necessarily mean you need more processing power or network capacity. Denial-of-service_attack contains some inspiration: for instance, the spider might have fallen into traps on many different sites (so the spider is the one being DDoS'd); the cost of making a request might be higher than that of answering it (some DoS attacks work because a small query packet can request a large, bandwidth-guzzling reply); or you might intentionally answer the spider's queries very slowly (forcing it to maintain a low-throughput connection); etc.
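The slow-answer tactic above (often called a "tarpit") can be sketched like this; the helper name and delay are my invention, not any particular server's API. The body is streamed one byte at a time with a pause between bytes, so the spider must keep a nearly idle connection open while the server does almost no work:

```python
import time
from typing import Iterator

def tarpit_body(text: str, delay: float = 1.0) -> Iterator[bytes]:
    """Stream a response body one byte at a time, sleeping between
    bytes. A crawler reading this must hold the connection open for
    roughly len(text) * delay seconds to receive the full page."""
    for ch in text:
        time.sleep(delay)  # the spider waits; the server mostly idles
        yield ch.encode()
```

This would be wired into any server framework that accepts an iterable response body (for example, a WSGI application returning `tarpit_body(page)` instead of the page itself).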
 * In case you also wonder why you would set out traps for spiders... Well, I have one on my website, but you might as well read the original from whom I copied the idea rather than my summary. My own spambot trap is a page generating many mailto: links to non-existent domains; the page is linked via zero-width elements all over the website but is excluded in robots.txt. So the only ones who should read it, really, are non-robots.txt-compliant spiders, 99+% of which are malicious and 100% of which deserve at least what they get here. Tigraan Click here to contact me 15:35, 3 October 2018 (UTC)
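A trap of that kind can be sketched roughly as follows (an illustration, not Tigraan's actual page; the function name and the use of `.invalid` domains are my choices): the page is stuffed with mailto: links to addresses that can never resolve, and a robots.txt rule such as `Disallow: /trap.html` keeps compliant crawlers away, so only harvesters that ignore robots.txt collect the junk.

```python
import random
import string

def spambot_trap_page(n_links: int = 100, seed: int = 0) -> str:
    """Build a page of mailto: links to invented addresses. The
    .invalid TLD is reserved (RFC 2606) and can never resolve, so
    harvesters that scrape this page poison their own lists."""
    rng = random.Random(seed)
    links = []
    for _ in range(n_links):
        user = "".join(rng.choices(string.ascii_lowercase, k=8))
        domain = "".join(rng.choices(string.ascii_lowercase, k=10))
        addr = f"{user}@{domain}.invalid"
        links.append(f'<a href="mailto:{addr}">{addr}</a>')
    return "<html><body>" + "<br>".join(links) + "</body></html>"
```

Seeding the generator makes the page stable across requests, which looks more like a genuine (if useless) address list to a harvester than a page that changes on every visit would.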