User:The Anome/SQL

This page exists to save useful SQL queries, just in case the Quarry project ever goes away. Where this differs from the Quarry query, the Quarry version should be regarded as authoritative.

See also https://sql-optimizer.toolforge.org/ for a useful SQL analysis tool.

Articles not assigned to WikiProjects
https://quarry.wmcloud.org/query/72412

SELECT article.page_title, article.page_id, talk.page_id AS talk_id FROM page AS article LEFT JOIN page_props ON pp_page = article.page_id AND pp_propname = 'disambiguation' LEFT JOIN page AS talk ON talk.page_title = article.page_title AND talk.page_namespace = 1 AND talk.page_is_redirect = 0 WHERE article.page_namespace = 0 AND pp_page IS NULL AND article.page_is_redirect = 0 AND NOT (article.page_title LIKE "List_of_%") AND NOT (article.page_title LIKE "Lists_of_%") AND NOT EXISTS (   SELECT      1    FROM      categorylinks    WHERE      article.page_id = cl_from      AND ( cl_to IN (         "Temporary_maintenance_holdings",          "Candidates_for_speedy_deletion",          "Requested_RD1_redactions"        ) OR cl_to LIKE "%_names" OR cl_to LIKE "%_surnames" OR cl_to LIKE "Lists_of_%" ) )  AND NOT EXISTS (    SELECT      1    FROM      templatelinks      JOIN linktarget ON lt_id = tl_target_id    WHERE      tl_from = article.page_id      AND lt_namespace = 10      AND ( lt_title IN (         "Wi",          "Surname",          "Given_name",          "Animal_common_name",          "Plant_common_name",          "Nickname",          "Dmbox"        ) OR lt_title LIKE "%index%" OR lt_title LIKE "%redirect%" ) )  AND (    (talk.page_id IS NULL)    OR ( (talk.page_id IS NOT NULL) AND NOT EXISTS (       SELECT          1        FROM          categorylinks        WHERE          talk.page_id = cl_from          AND ( cl_to LIKE "%WikiProject_%" OR cl_to LIKE "%-Class_%_articles" OR cl_to LIKE "%-importance_%_articles" OR cl_to LIKE "%-priority_%_articles" OR cl_to LIKE "Unassessed_%_articles" )     )      AND NOT EXISTS (        SELECT          1        FROM          templatelinks          JOIN linktarget ON lt_id = tl_target_id        WHERE          tl_from = talk.page_id          AND lt_namespace = 10          AND (lt_title LIKE "%WikiProject_%")      ) ) ) ORDER BY  article.page_title

Biographical articles without WP Biography tag
https://quarry.wmcloud.org/query/72401

SELECT CONCAT("* ",article.page_title, "") --, talk.page_id AS talk_page_id FROM page AS article LEFT JOIN page AS talk ON talk.page_title = article.page_title AND talk.page_namespace = 1 AND talk.page_is_redirect = 0 WHERE -- article.page_id % 10 = 2 article.page_namespace = 0 AND article.page_is_redirect = 0 AND NOT article.page_title RLIKE "^(Lists?_of|[0-9]{4})_.*$" -- no dated events or lists AND EXISTS (SELECT 1 FROM categorylinks WHERE cl_from = article.page_id and              (cl_to = "Living_people" OR cl_to RLIKE "^(Date|Year)_of_(birth|death)_(missing|unknown).*$" OR cl_to RLIKE "^[0-9][^_]+_(births|deaths)$")) -- make sure we don't catch things like 'protest-related deaths' AND (   (talk.page_id IS NULL)     OR ((talk.page_id IS NOT NULL) AND NOT EXISTS (SELECT 1 FROM categorylinks WHERE cl_from = talk.page_id                  and (cl_to LIKE "%Biography_articles" OR cl_to LIKE "%biography_articles" OR cl_to LIKE "%WikiProject_Biography%"))))

ORDER BY article.page_title

WikiProject to page category cross-reference
https://quarry.wmcloud.org/query/72090

Part of a plan to assign articles to WikiProjects using Naive Bayes.

Still being worked on. Limited to just a few pages, for testing purposes. Sorting by all three result columns to make output more compressible later.

Now using modulo arithmetic to sub-sample the page table more uniformly. 1/1000 of the table ~= 6700 articles

page_id % 1000 = 77: Executed in 251.52 seconds as of Tue, 07 Mar 2023 08:21:01 UTC. Resultset (296041 rows) page_id % 1000 = 78: Executed in 265.67 seconds as of Tue, 07 Mar 2023 08:28:22 UTC. Resultset (316077 rows) page_id % 1000 = 392: Executed in 256.58 seconds as of Tue, 07 Mar 2023 09:22:56 UTC. Resultset (310016 rows) page_id % 1000 = 816: Executed in 269.80 seconds as of Tue, 07 Mar 2023 09:31:37 UTC. Resultset (295323 rows)

With 1/1000 scan, jsonl file is about 30 Mbytes in size, but bz2's to about 1.3 Mbytes.

Running this query repeatedly to cover all articles will take ~3 days., but is necessary to generate ground truth for later categorization.

Next step: Toolforge.

SELECT pagecats.cl_to, talkcats.cl_to, -- REGEXP_EXTRACT(talkcats.cl_to, '(?WikiProject_|.*_importance_|.*_priority_|Unassessed)(.*)'), -- does not work COUNT(*) AS my_count FROM page AS article INNER JOIN page AS talk ON talk.page_title = article.page_title INNER JOIN categorylinks AS pagecats ON pagecats.cl_from = article.page_id INNER JOIN categorylinks AS talkcats ON talkcats.cl_from = talk.page_id WHERE article.page_id % 1000 = 816 AND article.page_namespace = 0 AND talk.page_namespace = 1 AND article.page_is_redirect = 0 AND talk.page_is_redirect = 0 AND (       talkcats.cl_to LIKE "%WikiProject_%"        OR talkcats.cl_to LIKE "%-Class_%_articles"        OR talkcats.cl_to LIKE "%-importance_%_articles"        OR talkcats.cl_to LIKE "%-priority_%_articles"        OR talkcats.cl_to LIKE "Unassessed_%_articles"      ) AND NOT (           talkcats.cl_to LIKE "%vital%"            OR talkcats.cl_to LIKE "%Version%"  ) AND NOT (pagecats.cl_to LIKE "%Disambig%" OR pagecats.cl_to LIKE "%disambig%" OR pagecats.cl_to LIKE "%set_index%" OR pagecats.cl_to LIKE "Set_index%") AND NOT (talkcats.cl_to LIKE "%Disambig%" OR talkcats.cl_to LIKE "%disambig%") AND NOT (pagecats.cl_to LIKE "Short_description%"           OR pagecats.cl_to LIKE "%_errors%"           OR pagecats.cl_to LIKE "CS1_%"                      OR pagecats.cl_to LIKE "%short_description%"           OR pagecats.cl_to LIKE "%articles%"           OR pagecats.cl_to LIKE "Articles%"           OR pagecats.cl_to LIKE "%pages%"           OR pagecats.cl_to LIKE "%disputes%"           OR pagecats.cl_to LIKE "Pages%"           OR pagecats.cl_to LIKE "Use_dmy_date%"           OR pagecats.cl_to LIKE "%Wikipedia%"           OR pagecats.cl_to LIKE "%articles%"           OR pagecats.cl_to LIKE "%Articles%"           OR pagecats.cl_to LIKE "%Wikidata%"           OR pagecats.cl_to LIKE "Webarchive%") GROUP BY pagecats.cl_to, talkcats.cl_to ORDER BY my_count DESC, pagecats.cl_to, talkcats.cl_to -- LIMIT 10000