User:Tagishsimon/Junk4

Scraping Legislation.gov.uk
I want to improve my scraping of legislation.gov.uk.

Example: https://www.legislation.gov.uk/uksi/2020, which contains a couple of tables.

Right now I'm using a bit of Ruby (Nokogiri) that I found on SourceForge to parse values from the tables into a CSV. However, it only captures the anchor text of each link, not the hyperlink (the href) itself.

The sections below show:
 * the code I'm using
 * a sample of the relevant table from https://www.legislation.gov.uk/uksi/2020
 * the current code output
 * the desired code output

The ask

Could you either amend the current Ruby so that it does what I'm after, or supply some other Perl/Python/??? code that produces the desired output?

The documents I parse will contain tens to hundreds of tables: I'm fetching lots of pages from legislation.gov.uk with wget, concatenating them into a single file, and running the current Ruby across that. Downstream workflow: I put the CSV into a larger spreadsheet, which generates the QuickStatements required to add new items.
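One consequence of concatenating pages is that the original page URL is no longer available, so any site-relative href needs a fixed base before it is useful in a spreadsheet. The following is a rough sketch of that step in Python, assuming the rows have already been extracted as lists of (text, href) pairs. The function name rows_to_csv, the BASE_URL constant, and the column layout (each cell's text followed by its URL) are assumptions for illustration, not the actual desired output.

```python
import csv
import io
from urllib.parse import urljoin

# Assumed base for resolving site-relative links such as "/uksi/2020/1".
BASE_URL = "https://www.legislation.gov.uk/"


def rows_to_csv(rows, base_url=BASE_URL):
    """Render rows of (text, href) cells as CSV text.

    Each cell becomes two columns: the anchor text, then the absolute URL
    (left empty when the cell had no link).
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        record = []
        for text, href in row:
            record.append(text)
            record.append(urljoin(base_url, href) if href else "")
        writer.writerow(record)
    return out.getvalue()


if __name__ == "__main__":
    # Invented example row, not real data from the site.
    example = [[("Example Regulations 2020", "/uksi/2020/1"), ("No link", None)]]
    print(rows_to_csv(example), end="")
```

The csv module takes care of quoting titles that contain commas or quotation marks, which matters when the file is later pasted into a spreadsheet.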