Wikipedia:Reference desk/Archives/Computing/2019 October 13

= October 13 =

set difference in SQL
Again this is part of my "learn SQL" mission so I'm looking for the "best" (scaleable etc.) answer rather than just a usable one. I'm trying to get news updates by reading a news site once an hour or so, extracting the links to news articles, and checking which links weren't there before (news stories tend to stay on the front page for many days). In sqlite, "N+1 queries are not a problem", so it's ok to just store the old urls in an indexed column, and then when doing an hourly scan, do a separate lookup for each link in the page. But I'm wondering how to avoid this. There can be up to a few hundred links on the page, e.g. on www.reuters.com or www.sfgate.com. My idea is:
 * Have a permanent table A of old urls, indexed on url
 * on scanning a page:
 * create an unindexed temporary table B in memory and insert all the just-read urls into it
 * create an index on the temporary table
 * CREATE TEMPORARY TABLE C AS (SELECT url FROM B WHERE url NOT IN A);
 * INSERT INTO A FROM C;
 * The idea above is that the query planner should be able to do the right thing with the above queries to create table C by using the indices on A and B to scan both in order, making a single linear pass, then do the table merge efficiently as well.

Is the above approach ok? What is the best way to create a temporary table with 100s of entries supplied by the application? Does it vary much by db? I'm using Python/sqlite3 atm but want to become reasonably clueful about postgresql and mysql as well. (Don't care about Oracle or MS SQL). Thanks! 67.164.113.165 (talk) 17:20, 13 October 2019 (UTC)


 * Assuming that these tables are just for your use, then that sounds like a good way to do it. On the other hand, if others would need to access those tables, then it might be better to break that down to smaller operations to avoid table locking. That is, you could check each new URL link against the table of old URL links, and only insert it if it's not found. Probably less efficient overall, but might handle the table locking better. Now this wouldn't matter for hundreds of URLs, but you asked about scaleability, and if you went to millions or URLs, then it might well matter. Also note that the index should probably be recomputed each hour, after inserts are done.


 * Also, unrelated to DBs, your method of finding if an update has occurred seems likely to miss some updates, where content was changed, but the URLs in the links were not. SinisterLefty (talk) 02:11, 14 October 2019 (UTC)


 * Thanks, yeah, sometimes a story might get updated but it's ok, I can always look on the live site. Yes the actual program is just for my own use, but I'm still looking for how a real db dev would do it.  I thought looking up urls one by one was an N+1 query which is what I'm trying to avoid.  Like if I were doing this in a Python program instead of sql, I'd use two sets (basically hash tables) so I could quickly do in-memory lookups without any db traffic.  The corresponding sql thing would seem to be push all the new urls to a temporary table (maybe not all at once, ok) and then do some bulk operation on the tables. Do the high end db's like postgres lock entire tables when doing updates?  I had hoped not, but I guess it's something to consider.  I bought a book about db optimization so I'll see if I can find any suggestions about this.  So far I've only skimmed it to get the general ideas, which mostly make sense.  67.164.113.165 (talk) 04:28, 14 October 2019 (UTC)


 * If we consider that there were two users, and one was doing a search for a URL, while the other was adding that URL to the table, then yes, it should properly lock the table while doing the update so that the search gets the correct results. However, there might be a way to specify that an occasional "wrong answer" is OK and that therefore it shouldn't lock the table when doing updates. There could also be issues with updating the index. If the index isn't updated, it would then take far longer to search for a URL. So, again it might make sense to just not allow searches until the index has been updated, although this would only matter if thousands of URLs had just been added.


 * I believe the way some massive DBs handle this is that they hold all updates until the nightly maintenance cycle, and update the index then. This allows for searches without delay during the day, with the understanding that they won't reflect the latest info, which "isn't in the system yet". SinisterLefty (talk) 04:59, 14 October 2019 (UTC)


 * Oh, and one other comment on your original plan, I wouldn't create the index for the temporary table. As you said, the DBMS can figure out the best way to do it, and that includes creating an index, if that would help (but I don't think it would). The permanent table, on the other hand, needs the index, as you don't want it recreating a temporary index every time. SinisterLefty (talk) 05:08, 14 October 2019 (UTC)


 * Ah, good point, with btree indexes it's enough that only the permanent table has it since the temp table entries will be looked up one by one at the db end and it will still be O(N log N) operations. I had imagined a sequential scan of both indexes side by side (like looking for common elements in two sorted files) but don't know if it would really work that way.  If I get really bored I might try doing some timings.  Thanks. 67.164.113.165 (talk) 05:24, 14 October 2019 (UTC)


 * Yes, bench-marking is certainly the way to find the best option for that particular DB, table implementation, data size, etc. However, if anything changes, even a minor update of the DB system, the results could change dramatically. Also note that you would want to compare the time without an index against the time with an index plus the time to create that index. SinisterLefty (talk) 05:32, 14 October 2019 (UTC)


 * Another general comment: This isn't a good way to determine if a web page has changed, unless you want the list of URLs for some other purpose, like a search-engine spider. If not, something like a diff would make more sense. SinisterLefty (talk) 05:56, 14 October 2019 (UTC)
 * They move the links around on the page but the links generally point to articles so I think the set difference approach is ok. It's not a general purpose scraper, it's just for a few specific sites, currently reuters.com and sfgate.com.  Both of them currently stick to a fairly rigid template where there are sections that each contain a bunch of links with headlines.  There are some other sites like nytimes.com where the main page contains story text and where diffs might work better.  I hadn't really thought about that, hmm. Actually what am I doing looking at the HTML at all?  I should pull the RSS feeds instead.  I wrote an HTML de-junker anticipating the California power outages (so I could check the news with limited mobile bandwidth) and it got a bit out of control.  But I should rethink it.  Thanks for the ideas. 67.164.113.165 (talk) 17:24, 14 October 2019 (UTC)  (Added: Actually I can use RSS to list articles instead of dejunking the front page, but I want have to dejunk the articles themselves, since they are full of videos and crap that drive my browser berserk even during normal times.  So I'll keep using the dejunking code, which strips out a lot of scripts and images etc. 67.164.113.165 (talk) 20:39, 14 October 2019 (UTC))


 * You're welcome. And can I get a copy of that dejunking code ? Sounds useful ! SinisterLefty (talk) 17:28, 14 October 2019 (UTC)


 * Sure, lemme clean it up and figure out how to upload it. It basically just loads the page into BeautifulSoup (html parser, "pip install bs4") and then deletes the JS scripts, images, and some other crap found by hand-examining the page. You get a very fast loading page that has the same general shape as the original.  I'll be away for the rest of this week but will be back after that so will try to upload it then.  I also like lite.cnn.io which is pretty junk-free to begin with.  Also: text.npr.org .  I did reuters because lite.cnn.io was offline for a while, and then sfgate because it had the most current updates about the CA power outages.  It's fairly simple to do other sites though as long as they don't rely too much on Javascript or AJAX. 67.164.113.165 (talk) 04:41, 15 October 2019 (UTC)


 * Thanks, those sites are nice, although I don't mind pics so much, just the ads, and animation or video that plays on it's own. SinisterLefty (talk) 05:49, 15 October 2019 (UTC)