User:Enterprisey/AIV analysis/Appendix

This page contains random details about the AIV analysis so that you can more thoroughly check my work.

Trimming the overlap
The September 2023 analysis generated two files, https://apersonbot.toolforge.org/aiv-analysis/2022-09-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json and https://apersonbot.toolforge.org/aiv-analysis/2023-02-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json. These overlapped by about a month because I started the second job on February 1 to catch the change made to IPvandal. I removed the overlapping cases and uploaded the resulting file to TODO TODO. Here's the Python session where I did the filtering:

aiv-analysis $ python
Python 3.10.10 (main, Mar 5 2023, 22:26:53) [GCC 12.2.1 20230201] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> a=json.load(open('2022-09-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json'))
>>> len(a)
17826
>>> b=json.load(open('2023-02-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json'))
>>> len(b)
22559
>>> next(case['report']['aiv_removal_revid'] for case in a)
1107803846
>>> next(case['report']['aiv_removal_revid'] for case in b)
1136759073
>>> a[-1]['report']['aiv_removal_revid']
1142517023
>>> b_revids = set(case['report']['aiv_removal_revid'] for case in b)
>>> a2=[case for case in a if case['report']['aiv_removal_revid'] not in b_revids]
>>> len(a2)
14728
>>> a2[-1]['report']['aiv_removal_revid']
1136750374
>>> json.dump(a2, open('2022-09-01T00:00:00Z--2023-02-01T00:00:00Z--cases.0.json', 'w'))

As you can see, the task was straightforward: I built a set of the AIV removal revids in b, then dropped every case in a whose revid was in that set to make a2, which I wrote to the new file.
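If you'd rather rerun the filtering as a script than retype the session, here is a minimal sketch of the same step. The function names (removal_revid, trim_overlap) are made up for illustration, and the cases at the bottom are toy data with invented revids; only the case['report']['aiv_removal_revid'] field comes from the real files.

```python
def removal_revid(case):
    """Return the revid of the edit that removed this report from AIV."""
    return case["report"]["aiv_removal_revid"]


def trim_overlap(earlier, later):
    """Keep only the cases in `earlier` whose removal revid is absent from `later`."""
    later_revids = {removal_revid(case) for case in later}
    return [case for case in earlier if removal_revid(case) not in later_revids]


def make_case(revid):
    """Build a toy case carrying only the field this filter looks at."""
    return {"report": {"aiv_removal_revid": revid}}


# Toy stand-ins for a and b; the revids 101..105 are invented.
earlier = [make_case(r) for r in (101, 102, 103, 104)]
later = [make_case(r) for r in (103, 104, 105)]

trimmed = trim_overlap(earlier, later)
print([removal_revid(case) for case in trimmed])
```

On the real data, earlier and later would be the lists loaded with json.load from the two files, and trimmed would be written back out with json.dump, as in the session.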

Note that there is no gap between the two resulting files. To verify this, start at the last revid that I printed for a2, Special:Diff/1136750374, and step forward to the next diff that removes text, Special:Diff/1136759073. As expected, that is the first revid that I printed for b.
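A partial version of this check can also be done in code. The sketch below assumes, as the session suggests, that larger revids are later edits; the function name boundary_report is made up, and the revids in the example are toy values. It only confirms that the two files share no revids and that the second starts after the first ends. Whether a removal edit was missed between those two revids still has to be checked by stepping through the diffs by hand.

```python
def boundary_report(first, second):
    """Summarize how two case lists meet: shared revids and the boundary revids."""
    first_revids = {case["report"]["aiv_removal_revid"] for case in first}
    second_revids = {case["report"]["aiv_removal_revid"] for case in second}
    return {
        # Should be 0 once the overlap has been trimmed.
        "shared": len(first_revids & second_revids),
        # Assumes larger revid means later edit.
        "last_of_first": max(first_revids),
        "first_of_second": min(second_revids),
    }


# Toy data with invented revids, standing in for a2 and b.
first = [{"report": {"aiv_removal_revid": r}} for r in (10, 20)]
second = [{"report": {"aiv_removal_revid": r}} for r in (30, 40)]

report = boundary_report(first, second)
print(report)
```

For the real files, last_of_first and first_of_second are the two revids to compare against the page history.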

The resulting file is https://apersonbot.toolforge.org/aiv-analysis/2022-09-01T00:00:00Z--2023-02-01T00:00:00Z--cases.0.json.