Talk:Branching order of bacterial phyla (Genome Taxonomy Database, 2018)

GTDB version
Maybe we should update this page to use a newer GTDB release whenever one is available? Their releases aren't hard to parse: just go to https://data.gtdb.ecogenomic.org/releases/latest/ and get the two  files for bacteria and archaea. Trim it down to the phylum  level by discarding other nodes and that's the tree you want. Maybe add some names to the beyond-phylum-level clades we have here. Artoria2e5 🌉 02:43, 11 May 2022 (UTC)

Alright, here's a script to do that.

But! Before you think  is enough, it isn't.
 * The GTDB files have a lot of semicolons that the original library doesn't like. Replace all occurrences of  with   before you start. With sed, maybe, but I was experimenting so I used a graphical text editor. Try.
 * The GTDB files use colons in a way that the library cannot understand at all. To overcome that I "dumbified" the newick.py library by turning off colon handling in the following places:
 * Remove the colon from  at the top.
 * In, remove the   part completely.

Yeah. I was too tired to find another library. Anyways I have a tree now with, so that's a start... --Artoria2e5 🌉 03:23, 11 May 2022 (UTC)
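
For anyone who would rather not patch the library, here is a rough, library-free sketch of the "trim to phylum level" step. It is not the script referred to above. It assumes that internal node labels in the GTDB tree are single-quoted strings such as '100.0:p__Foo; c__Bar' (so the awkward colons and semicolons sit inside quotes) and that every phylum of interest is labelled on some node; both assumptions should be checked against the actual release files. The function names (parse, prune, write) are made up for this illustration.

```python
import re

SEPARATORS = {"(", ")", ",", ";"}

def parse(newick):
    """Parse a Newick string into nested (label, children) tuples.

    Quoted labels are kept as single tokens, so colons and semicolons
    inside them do not confuse the parser.
    """
    tokens = re.findall(r"'[^']*'|[(),;]|[^(),;']+", newick.strip())
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1
            children.append(node())
            while tokens[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # skip the closing ")"
        label = ""
        # Everything up to the next separator is this node's label
        # (quoted name, support value, branch length, all glued together).
        while pos < len(tokens) and tokens[pos] not in SEPARATORS:
            label += tokens[pos]
            pos += 1
        return label, children

    return node()

def prune(node):
    """Collapse every node whose label names a phylum into a leaf."""
    label, children = node
    match = re.search(r"p__[A-Za-z0-9_\-]+", label)
    if match:
        return (match.group(0), [])
    kept = [p for p in (prune(c) for c in children) if p is not None]
    if not kept:
        return None      # genome leaf or clade without any phylum label
    if len(kept) == 1:
        return kept[0]   # drop nodes left with a single child
    return ("", kept)

def write(node):
    """Serialise nested tuples back to Newick (without the final ';')."""
    label, children = node
    if children:
        return "(" + ",".join(write(c) for c in children) + ")" + label
    return label
```

Usage would be something like write(prune(parse(text))) + ";" on the contents of a tree file. The functions are recursive, so a very deep tree may need a higher limit via sys.setrecursionlimit.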

Alright, the trees may still be a bit big. What to do?
 * Kill em numbers: remove  (keep the braces!) and , sed is unhappy with me sorry
 * Kill extra monotypic names: remove, although this can cause some confusion with  GTDB auto-generated names based on lower taxa
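
The number-killing step could be sketched in Python instead of sed, roughly as below. This assumes that "numbers" means branch lengths written after a colon (":0.0123") and bare support values written straight after a closing parenthesis (")100.0"), and that labels are unquoted by this point. The function name strip_numbers is invented for the example, and the patterns are a guess at the general shape rather than the exact ones used above.

```python
import re

def strip_numbers(newick):
    """Remove branch lengths and bare support values from a Newick string."""
    # Branch lengths: a colon followed by a number, e.g. ":0.0123" or ":1e-05".
    newick = re.sub(r":-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?", "", newick)
    # Support values: a number directly after ")", e.g. ")0.95" or ")100.0".
    # The parenthesis itself is kept.
    newick = re.sub(r"\)\d+(?:\.\d+)?", ")", newick)
    return newick
```

Names are left untouched because neither pattern matches a colon or parenthesis that is followed by a letter.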

The end result is something like this for v207:

Okay start I'd say. --Artoria2e5 🌉 03:42, 11 May 2022 (UTC)

Further processing
HUGE thanks to for getting the job done! It's beautiful. I can't even imagine how much work that takes.

Still, we should maybe... get a way to automate the updating and specifically the relabelling from all the p__Blah names to proper stuff with quotation marks, links, and explanation about what it includes if the grouping is novel. Ideally we get: --Artoria2e5 🌉 15:07, 14 September 2022 (UTC)
 * a script that takes an old article cladogram and the corresponding "crude" newick (as above) and generates a look-up table for how to rename all the nodes
 * another script that takes a "crude" newick from a newer version and applies the table to it
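
A minimal sketch of what the second script might look like, assuming the look-up table is a plain mapping from p__ names to article labels and that the crude newick still carries the p__ prefixes. The names apply_table and unmapped, and the sample table entries, are only illustrative.

```python
import re

# A GTDB phylum name: the p__ prefix plus letters, digits, "_" or "-".
PHYLUM = re.compile(r"p__[A-Za-z0-9_\-]+")

def apply_table(newick, table):
    """Replace every p__ name that has an entry in the look-up table."""
    return PHYLUM.sub(lambda m: table.get(m.group(0), m.group(0)), newick)

def unmapped(newick, table):
    """List p__ names with no table entry, i.e. what needs manual curation."""
    return sorted({name for name in PHYLUM.findall(newick) if name not in table})

# Illustrative table only; a real one would come from the old cladogram.
table = {"p__Firmicutes": "Bacillota", "p__Proteobacteria": "Pseudomonadota"}
crude = "((p__Firmicutes,p__Firmicutes_A),p__Proteobacteria);"
renamed = apply_table(crude, table)
todo = unmapped(crude, table)
```

Names missing from the table are left as they are, so a newer release with novel groupings still produces a usable tree, and the list from unmapped shows what a human has to look at.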


 * Hi Artoria2e5,
 * I do not possess the skill to create a bot that updates the GTDB tree whenever there is a newer version. If you can do this, it would be highly appreciated and commendable. If and when one is created, I can manually curate the new tree to remove errors if you would like.
 * Sincerely, Videsh Ramsahai (talk) 16:50, 14 September 2022 (UTC)
 * I guess I will do that sometime then. I think the first script can be skipped if we just kept comments describing what each node is in the article source. That would provide traceability even for human editors, although the source code size is... gonna get bigger and arguably the parenthesized stuff (e.g. "(Bacillota C)") will be duplicated unnecessarily. Artoria2e5 🌉 11:48, 15 September 2022 (UTC)

Oh no, Spirochaetota and Lindowbacteria are clearly not in the right place. And Undinarchaeota isn't branching early enough either. I don't want to check everything... --Artoria2e5 🌉 04:01, 27 January 2023 (UTC)