Wikipedia talk:WikiProject Molecular Biology/Molecular and Cell Biology/Human protein-coding genes

Centralized talk page
This page serves to centralize discussions about the human protein-coding genes list articles. Please add new posts or edit requests for those pages below.  Seppi  333  (Insert 2¢) 02:58, 24 November 2019 (UTC)
 * AARD has been disambiguated. Please correct the link to AARD (gene). Thanks. BD2412  T 05:32, 27 November 2019 (UTC)
 * I'm going to retarget all of the pages in that list, except 1 or 2 of the enzyme links, to SYMBOL (gene), so you won't need to worry about dablinks in the tables when creating disambiguation pages.  I expect to add those piped links within the next 24 hours. Thanks for creating AARD by the way.  I really appreciate any assistance I can get with the work I'm doing, since it takes a lot of both programming and editing time to find and fix problematic links when working on a wikilinked list of this size.  Seppi  333  (Insert 2¢) 06:08, 27 November 2019 (UTC)
 * All 4 pages have been updated with piped links to the relevant parenthetically disambiguated gene symbols; i.e., everything in the list at WT:MCB, minus the 10 enzyme gene symbols that didn't need disambiguation.  Seppi  333  (Insert 2¢) 14:53, 27 November 2019 (UTC)
 * Good work, I'll look into creating the missing disambiguation pages shortly. BD2412  T 22:58, 27 November 2019 (UTC)
 * Thanks.  Seppi  333  (Insert 2¢) 05:27, 28 November 2019 (UTC)

Sync page links with Template:Infobox gene
Create redirects from the redlinked gene symbols and parenthetically disambiguated variants to the corresponding proteins in the WikiProject Molecular Biology/Genetics/Gene Wiki list:

There are ~11500 blue links in the tables and ~12400 articles with Infobox gene. I can find the protein pages that aren't linked to in these tables simply by converting the pages listed in the templatetransclusioncheck tool results for infobox gene into a python list, then generating another python list of the current link targets in the tables the next time I run User:Seppi333/GeneListNLP, and finally returning a list of pages that are in the transclusion list but not the table list. I can then identify the corresponding gene symbol for a given protein using pyUniProt, pyHGNC, and pyGtoP: pyUniProt to obtain the corresponding HGNC ID for a matched protein name/alias and pyHGNC to find the corresponding HGNC-approved gene symbol with the HGNC ID; pyGtoP to obtain the HGNC-approved gene symbol for a matched receptor, transporter, or enzyme name/alias. I would have to manually identify the corresponding HGNC-approved gene symbol for any unmatched links that remain.
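The comparison described above amounts to a set difference between two lists of page titles. A minimal Python sketch, assuming both lists are already available as plain lists of strings; the function name and sample titles are hypothetical, and the pyUniProt/pyHGNC/pyGtoP lookups are not shown:

```python
def unlinked_pages(infobox_pages, table_targets):
    """Return pages that transclude Infobox gene but are not linked from the tables."""
    linked = set(table_targets)
    seen = set()
    missing = []
    # Keep the order of the transclusion list and skip duplicate titles.
    for title in infobox_pages:
        if title not in linked and title not in seen:
            seen.add(title)
            missing.append(title)
    return missing


# Hypothetical sample data, for illustration only.
infobox_pages = ["Asporin", "Vitronectin", "Calreticulin"]
table_targets = ["Asporin", "Calreticulin"]
print(unlinked_pages(infobox_pages, table_targets))
```

Titles would need to be normalized the same way in both lists (underscores versus spaces, first-letter case) before comparing, otherwise identical pages may be reported as missing.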

Would then need to get a bot approved to add the missing redirects from any redlinked gene symbol and all redlinked parenthetically disambiguated gene symbols to the corresponding protein articles.  Seppi  333  (Insert 2¢) 05:26, 28 November 2019 (UTC)

Still problems with these articles
Most of the tables are simply unnecessary or don't belong on Wikipedia. Per WP:ELLIST, we aren't here to be a directory of links to other websites. Then there are these incredibly redundant values added to the table as well. What's the point in saying that each individual entry is "approved", when they all are? This is hardly a list that needs to be spread across three pages and more than 1 million bytes. Onetwothreeip (talk) 23:46, 30 November 2019 (UTC)
 * I asked about that at WT:MCB but unfortunately no one answered. =/ The status isn’t the redundant column though; see the end of the last table. The redundant one is the locus group.  In any event, this isn’t a directory of links; it’s a list of genes.  Seppi  333  (Insert 2¢) 00:40, 1 December 2019 (UTC)
 * If others think it’s worth cutting the locus group per your argument, I’m open to the idea, but as I mentioned at the bot policy page, it’s probably worth replacing that with the gene location anyway.  Seppi  333  (Insert 2¢) 00:42, 1 December 2019 (UTC)
 * serious question: why do you think this list is any different than the example table in WP:ELLIST? Trying to understand your perspective.  Seppi  333  (Insert 2¢) 00:58, 1 December 2019 (UTC)
 * If every value in a column is the same, it's certainly a redundant column. These articles are unfortunately a directory of external links and they most certainly shouldn't be, whether or not that was intentional.
 * I'm not sure what you mean by me thinking that this list is different than any example. It is similar to the negative examples and dissimilar to the positive example. In the positive example, there is an external link that appears at the end of the list without individual references for each entry. In the negative example, each entry links to an external website, which is what these articles do. Onetwothreeip (talk) 03:42, 1 December 2019 (UTC)
 * Oh. I think I see what the issue is. I'm guessing it's not readily apparent why HGNC and UniProt links are listed for the entries.  These aren't arbitrary gene/protein databases in the event you thought this.
 * HGNC is the Human Genome Organization Gene Nomenclature Committee; it assigns the official name and official symbol of a human gene. That's evident from any NCBI gene page, which literally lists those as "Official name"/"Official symbol", listing HGNC as the data provider. Worth pointing out that the Human Genome Organization is an international entity, whereas NCBI gene is run by the US government, so there's no direct organizational/governmental/regulatory relationship.  HGNC is simply the sole authority on gene nomenclature.  The analogous case is true for UniProt and the encoded protein; genes and proteins often have similar names but they're very seldom the same. There's also no single authoritative database that I can cite for both the gene and the encoded protein. Hence the need to provide a link on the gene and a link on the protein (these are protein-coding genes after all).
 * So, these do actually constitute official links for the gene and protein corresponding to each entry in the list. If it still bothers you, I could put them both in ref tags, but that will do two things you might like even less. For one, it's going to increase the page size by circa 50k due to the number of characters that 5000 pairs of ref tags use. Doing that would also greatly increase the overall vertical scroll size of the page, since a reflist section will appear with 5000 references per page instead of these citations appearing in the same row as the list entry.
 * On a different note, the YES/NO table illustrates that embedded lists should not contain externally linked list entries. I.e., the first 5 entries in the first list would appear as shown below if that were applicable to this page. In my previous comment, I was referring to the example table beneath the YES/NO table which has columns for Candidate, Political party, Official website, and Votes.  Seppi  333  (Insert 2¢) 10:49, 2 December 2019 (UTC)
 * The individual entries aren't notable, although political parties are. There is clearly no reason for these external links to exist in this table. Onetwothreeip (talk) 09:19, 9 December 2019 (UTC)
 * Would you suggest putting them in ref tags then?  Seppi  333  (Insert 2¢) 07:18, 10 December 2019 (UTC)


 * 1) A1BG – other stuff – more stuff – etc.
 * 2) A1CF – other stuff – more stuff – etc.
 * 3) A2M – other stuff – more stuff – etc.
 * 4) A2ML1 – other stuff – more stuff – etc.
 * 5) A3GALT2 – other stuff – more stuff – etc.


 * They should be removed because they serve no real purpose here. Onetwothreeip (talk) 09:21, 11 December 2019 (UTC)
 * I disagree. The red links obviously need to be cited.  Seppi  333  (Insert 2¢) 10:53, 11 December 2019 (UTC)

Page size
List of human protein-coding genes 3 currently has 443,692 bytes of markup, with pages 1 & 2 not far behind. They are far too big. What's the best way to divide them up? Andy Mabbett ( Pigsonthewing ); Talk to Andy; Andy's edits 09:44, 8 December 2019 (UTC)
 * Well, the most direct solutions IMO would be to either split the full set of entries up into 5 lists of 4000 each or just cut out the locus group column from the existing lists. Which sounds like a better approach?  Seppi  333  (Insert 2¢) 07:10, 10 December 2019 (UTC)
 * I figure it’d be simpler to just cut the column, so I’ll go ahead and do that in the next day or so. Will follow up once it’s done. I think it should reduce the page size by around 100k.  Seppi  333  (Insert 2¢) 10:21, 10 December 2019 (UTC)
 * It's done. The first 3 pages have all dropped in size by 105,058 bytes.  That address your concern?  Seppi  333  (Insert 2¢) 11:15, 11 December 2019 (UTC)
 * Thank you. It's a good start, but we should be reducing the size of pages to under 100k, and preferably under 50k. Andy Mabbett ( Pigsonthewing ); Talk to Andy; Andy's edits 15:06, 11 December 2019 (UTC)
 * No problem. I'm not really opposed to doing that, but it would require splitting this list into about 10 pages in order to reduce the size of each page down to 100k-ish. I can manually update the current 4 without it taking up too much time, but 10+ would be a bit difficult for me to do without a bot (I'd have to manually copy and paste 10+ text files into the respective sources and then manually delete the files each time I perform an update); right now, it's not looking very likely that I'm going to get approval for that. =/  Seppi  333  (Insert 2¢) 15:45, 11 December 2019 (UTC)
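The split discussed in this thread can be sketched as a greedy grouping of table rows under a byte budget. This is only an illustration in Python, not the bot's actual code; the function name is hypothetical, and the fixed overhead of each page (lead text, table header and footer) is ignored:

```python
def split_rows(rows, max_bytes=100_000):
    """Greedily group table rows into pages whose UTF-8 size stays within max_bytes."""
    pages, current, size = [], [], 0
    for row in rows:
        row_size = len(row.encode("utf-8")) + 1  # +1 for the newline that joins rows
        # Start a new page when adding this row would exceed the budget.
        if current and size + row_size > max_bytes:
            pages.append(current)
            current, size = [], 0
        current.append(row)
        size += row_size
    if current:
        pages.append(current)
    return pages


# Hypothetical rows of 9 bytes each, with a 100-byte budget per page.
rows = ["x" * 9] * 25
print([len(page) for page in split_rows(rows, max_bytes=100)])
```

A row that is larger than the budget on its own still gets a page to itself, so the function never drops content.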

Requested edit
Posting here in light of the large flashing sign at the top of the article - PIN4 is a DAB page; the link should be piped to Peptidyl-prolyl cis-trans isomerase NIMA-interacting 4. Narky Blert (talk) 09:02, 4 November 2020 (UTC)
 * Even better to pipe it to PIN4 (gene); this currently simply redirects to Peptidyl-prolyl cis-trans isomerase NIMA-interacting 4, but maybe someday there'll be an article about the gene itself.
 * In addition to PIN4, RS1 is also a DAB page; should be piped to RS1 (gene). Lennart97 (talk) 11:52, 9 November 2020 (UTC)


 * Page updated

LTB is now a dab page - the link on this page should be changed to LTB (gene). Tevildo (talk) 10:51, 31 May 2021 (UTC)
 * Thanks for the feedback. I'll update the algorithm sometime in the near future and run it again to update these pages. Just very busy off-wiki lately.  Seppi  333  (Insert 2¢) 16:55, 12 September 2021 (UTC)

✅ - I updated the algorithm per your requests and ran the newer version on PAWS. Let me know if you see any other dablinks or issues. Also, thanks for notifying me about these. I unfortunately don't have any spare time to check for DABlinks in these lists myself - or even edit Wikipedia for that matter - anymore because I'm perpetually slammed with work.
 * Algorithm revised per above

These are dablinks that weren't reported here, but were added to the lists at some point in the interim and reported in the dablinks tool following my bot's first set of page revisions (i.e., the initial run after I updated the algorithm per above):  Seppi  333  (Insert 2¢) 07:44, 25 November 2021 (UTC)
 * CALR
 * CA9
 * EZR
 * NPY
 * RETN
 * RPS5
 * RRAD

Requested edit 13 Dec 2021
Similar to the previous section: on line 1155, ASPN is a disambiguation page; the link should be changed to Asporin (or, in full, ASPN). --R'n'B (call me Russ) 01:30, 14 December 2021 (UTC)
 * ✅ - thanks for the notification.  Seppi  333  (Insert 2¢) 09:58, 20 March 2022 (UTC)

Requested edit: CTSL → CTSL1
Seppi333, CTSL is now a disambiguation page, your bot should probably pipe it to CTSL1. Also please pipe ASPN to ASPN (gene) as R'n'B requested above. Streded (talk) 01:10, 1 February 2022 (UTC)
 * ✅ - thanks for the notification.  Seppi  333  (Insert 2¢) 09:59, 20 March 2022 (UTC)

BCAM
I have turned BCAM into a disambiguation page, ideally the bot could pipe it to basal cell adhesion molecule. Thanks! CapitalSasha ~ talk 10:38, 30 June 2022 (UTC)
 * How is a bot supposed to know which BCAMs mean basal cell adhesion molecule, which ones mean Basque Center for Applied Mathematics, and which ones mean Breast Cancer Awareness Month? --R'n'B (call me Russ) 12:03, 30 June 2022 (UTC)
 * My understanding is that this bot is in charge of updating the page List of human protein-coding genes 1 which currently links to BCAM which is the link that should be piped. I'm not asking for the link to be changed on any other page. CapitalSasha ~ talk 12:41, 1 July 2022 (UTC)
 * ✅ Added to DAB link list in the algorithm.  Seppi  333  (Insert 2¢) 19:44, 17 September 2022 (UTC)
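A "DAB link list" of this kind can be pictured as a lookup table from gene symbol to link target, consulted whenever a table row is generated. The Python sketch below is an assumption about the general shape, not the bot's source code; `DAB_OVERRIDES` and `gene_link` are hypothetical names, and the three entries are the targets requested on this page:

```python
# Hypothetical override table: symbols whose bare title is a disambiguation page.
DAB_OVERRIDES = {
    "BCAM": "Basal cell adhesion molecule",
    "PIN4": "PIN4 (gene)",
    "LTB": "LTB (gene)",
}


def gene_link(symbol, overrides=DAB_OVERRIDES):
    """Return wikitext for a gene symbol, piping it when an override exists."""
    target = overrides.get(symbol)
    if target is None or target == symbol:
        return f"[[{symbol}]]"
    # Piped link: the reader still sees the symbol, but the link avoids the DAB page.
    return f"[[{target}|{symbol}]]"


print(gene_link("BCAM"))
print(gene_link("A1BG"))
```

Because the override applies only while these list pages are generated, other pages that link to BCAM are unaffected.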

Dablinks fixed since last run
Since I last ran the bot script ~3 months ago, 2 new dablinks were fixed in the bot's source code:
 * GH2 (gene)
 * COQ5 (gene)

GH2 wasn't reported and COQ5 was an existing dablink, so please be sure to report new dablinks; otherwise, I usually have to go back and rerun the bot code again after fixing the dablinks it adds back to these pages.  Seppi  333  (Insert 2¢) 19:13, 12 December 2022 (UTC)

No new dablinks since last run
Updated today.  Seppi  333  (Insert 2¢) 19:38, 9 January 2023 (UTC)
 * Re above.  Seppi  333  (Insert 2¢) 00:36, 7 February 2023 (UTC)

Requested edit: FBP2 March 5, 2023
Please wikilink FBP2 to Fructose-bisphosphatase 2. Note: the FBP2 acronym appears to be used for "far upstream element binding protein 2" as well.

Also, should the HGNC ID and UniProt ID be added to Fructose-bisphosphatase 2 WikiData? I am unsure how to do that.

Should EPHX3 be linked to Epoxide hydrolase 3? Thank you! Adakiko (talk) 22:04, 5 March 2023 (UTC)


 * User:Seppi333 has already added the requested FBP2 and EPHX3 redirects. Thanks Seppi. The infobox in Fructose-bisphosphatase 2 also lists the HGNC and UniProt IDs.  So it appears that these do not need to be added, since they are already included in Wikidata.  Finally, "far upstream element binding protein 2" is linked to KHSRP. Generally only the official HUGO gene symbol is linked. Boghog (talk) 04:16, 21 June 2023 (UTC)

Adding new columns
Is the plan to add new columns with the actual proteins that each piece of genetic code corresponds to, as well as where to find them in the body? Did anything come from the discussion on the list of human proteins (Draft:List of proteins in the human body), or was it just deleted and forgotten? Is there anywhere one can see how the deletion discussion on the article ended? And if none of the ideas will be implemented here, why won't they be? Claes Lindhardt (talk) 17:49, 14 August 2023 (UTC)


 * Here is the link for Articles for deletion/List of proteins in the human body. The conclusion of that discussion was to Move to Draft.
 * For context, the merge proposal was to add protein columns to the tables in List of human genes. Seppi333 did a fantastic job in creating a Wikidata query (click the blue arrow at the bottom left hand side of the page to execute) that returns all verified human genes and selected information about the proteins encoded by these genes (24,161 results returned in 11,098 ms).  This table contains:


{| class="wikitable"
! Column name !! Explanation
|-
| HGNCsymbol || approved HUGO gene symbol
|-
| proteinLabel || recommended UniProt name
|-
| wd_gene_item_article_link || Wikipedia article name if linked to gene
|-
| wd_protein_item_article_link || Wikipedia article name if linked to protein
|}


 * The reason that we have two sets of links stored in Wikidata, one for genes and one for proteins, is historical. Where there were separate gene and protein articles, almost all of these have been merged into a single gene/protein article. Wikidata has not been updated to reflect these mergers. Please note that this dichotomy only exists in Wikidata, not in Wikipedia. The easiest way to handle this is to merge wd_gene_item_article_link and wd_protein_item_article_link into a single column.
 * At the risk of beating a dead horse into the ground, please note that with very few exceptions, we have a single set of articles describing both the gene and the protein encoded by that gene. Therefore, for all practical purposes, the Gene Wiki is identical to the Protein Wiki.
 * Finally, please note that the infobox gene already has a field for "RNA expression pattern" which lists the tissues where the mRNA for a particular protein is expressed. It would be very messy to include such a column in an enormous list. It gets even messier to include mRNA expression patterns for all splice variants. What is the use case for including this in a list of gene/proteins? Boghog (talk) 19:07, 14 August 2023 (UTC)
 * I apologize for the delay in my response. Felt a bit ill last week. I'm more or less ready to start tackling this by reopening my bot's first RFBA. First thing, though, is to plan out how I'm going to program this (specifically, where I'd be getting data for new columns from), and I think it may be useful to get feedback from the relevant Wikiprojects (WP:MCB, MOLBIO, Genetics, Gene wiki, etc.). They weren't particularly active when I first proposed the list, but it's worth getting feedback nonetheless.  Seppi  333  (Insert 2¢) 01:46, 22 August 2023 (UTC)
 * In terms of what databases are useful, is the list of relevant databases in the draft any help? (If not, could it be if I updated it to have all columns and rows filled out?) Is it better if I try to update List of biological databases and/or Protein structure database, or do you have a different way of finding/using new databases anyway? Claes Lindhardt (talk) 06:33, 24 August 2023 (UTC)
 * Is the relationship between protein and gene always 1:1, or is there sometimes a gene that codes for multiple different proteins, or a protein that can arise from multiple different genes? As I understood it, 'A single gene can produce mRNA (one molecule) that may be “edited” to produce many different proteins, via a process called RNA splicing'. Does the "RNA expression pattern" refer to the RefSeq RNA ID? It seems there are often multiple; are we sure that all the variations have been merged into the relevant articles, or are we at a point where there are at least one or two per article? Is there a way to find out how many possible proteins there could be for each gene? Claes Lindhardt (talk) 06:41, 24 August 2023 (UTC)
 * All the relevant data is already included in Wikidata. Wikidata in turn draws its data from GenBank, HUGO Gene Nomenclature Committee, UniProt, and a few other databases. It would be useful to include links to these databases.  One gene can generate more than one protein through alternative splicing, and these are listed in the RefSeq links in Infobox gene. Two genes can produce an identical protein (see for example HBA1 and HBA2).  These arise in relatively recent gene duplication events.  But these are relatively rare and, in my opinion, not worth worrying about. Boghog (talk) 18:20, 24 August 2023 (UTC)
 * All right, thank you! :) A bit of a side-quest question (but I cannot find a better place to ask): do you know if there is already a Wikidata template for articles about human cells or human tissue that one could link these to as well, or use for https://en.wikipedia.org/wiki/List_of_distinct_cell_types_in_the_adult_human_body to also make a query? (Please just delete this if this is not the right place to ask.) Claes Lindhardt (talk) 21:18, 24 August 2023 (UTC)
 * Yes, see Infobox cell. It looks like there is more than one system for naming tissue and cell types, so what you call any one cell type could be complicated. Human Cell Atlas which in turn is based on Human Cell Atlas Ontology looks promising but apparently not in Wikidata. Boghog (talk) 04:28, 25 August 2023 (UTC)
 * Thank you kindly <3 Claes Lindhardt (talk) 07:32, 26 August 2023 (UTC)
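The column merge suggested earlier in this thread (collapsing wd_gene_item_article_link and wd_protein_item_article_link into one column) could look roughly like the Python sketch below. It assumes each query result is available as a dict keyed by the column names in the table above; the function name, the output key `article_link`, the preference for the gene link when both are present, and the sample rows are all hypothetical:

```python
GENE_COL = "wd_gene_item_article_link"
PROTEIN_COL = "wd_protein_item_article_link"


def merge_article_links(row):
    """Return a copy of one result row with the two article-link columns merged."""
    gene = (row.get(GENE_COL) or "").strip()
    protein = (row.get(PROTEIN_COL) or "").strip()
    merged = {k: v for k, v in row.items() if k not in (GENE_COL, PROTEIN_COL)}
    # Assumption: prefer the gene-linked article, fall back to the protein-linked one.
    merged["article_link"] = gene or protein or None
    return merged


# Hypothetical rows with placeholder article titles.
rows = [
    {"HGNCsymbol": "GENE1", GENE_COL: "Example article 1", PROTEIN_COL: ""},
    {"HGNCsymbol": "GENE2", GENE_COL: None, PROTEIN_COL: "Example article 2"},
    {"HGNCsymbol": "GENE3"},
]
for row in rows:
    print(merge_article_links(row))
```

Rows where neither column is filled end up with `None`, which corresponds to a red link (no article yet) in the list.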

Proteins in Humans that humans cannot produce
Are we sure that there are no proteins used in the body that we cannot make ourselves? Or any proteins which can be very effective in the body (maybe in a pathogenic way) that we cannot make ourselves? These would also be relevant to the proteins of the human body, but they would bypass this list since they do not have a gene.

It seems drugs like insulin, Sargramostim (leukine) or (rGM-CSF), β-glucocerebrosidase for Gaucher's disease, Dornase alfa for Cystic Fibrosis, Interferons for autoimmune disorders and viral infections, Granulocyte macrophage colony-stimulating factor for immunostimulation, Granulocyte colony-stimulating factor also for immunostimulation, Factor VIII for hemophilia, Tissue plasminogen activator for strokes, and GM-CSF are examples of proteins being introduced without being produced in the human body. (However, these are examples of proteins being introduced as a drug when the body is incapable of producing the protein naturally. So usually these can also be created by the body and thus are in the genetic code.)

Is there a mechanism or human biological phenomenon which ensures that we only respond to/use proteins that we can also produce ourselves and that we have in our genetic code? Claes Lindhardt (talk) 06:48, 24 August 2023 (UTC)


 * Exogenous proteins that are used as drugs or produced by pathogenic organisms or viruses are clearly outside the scope of human proteins and should not be included in any list of human proteins. There are many non-human proteins used as drugs and by definition they must have effects on the patient or they would not be used as drugs. Pathogenic proteins have been evolutionarily selected to have effects on the host organism. I am not aware of any mechanism by which an organism would respond only to proteins that are encoded by its own genome other than evolutionary optimization of protein–protein interactions.  All proteins, whether native or foreign, are eventually degraded by the proteasome or in lysosomes. Foreign proteins tend to be degraded faster. Finally foreign proteins may be recognized and removed by the immune system. Boghog (talk) 03:54, 25 August 2023 (UTC)
 * Is there a separate list for these somewhere, or a template for these? Claes Lindhardt (talk) 07:31, 26 August 2023 (UTC)
 * I am not aware of any comprehensive list. Concerning exogenously produced protein drugs, there is a table in Biopharmaceutical which is far from complete. There is also a List of therapeutic monoclonal antibodies, but this list only contains monoclonal antibody drugs (which are the most common class of biopharmaceuticals) and not other classes of protein drugs. There is also Category:Recombinant proteins, most of which are drugs. As far as templates, there is Infobox drug with mab, mab_type, source, type parameters, but this is not linked to Wikidata. As far as exogenous proteins produced for example by human pathogens, there is infobox nonhuman protein (see for example Diphtheria toxin), also not linked to Wikidata. Viruses that infect humans are constantly mutating, producing an enormous number of new protein variants every day while older variants disappear.  It is simply not possible to track all of these. Boghog (talk) 09:34, 26 August 2023 (UTC)

Requested edit 24 March 2024
Please disambiguate VTN to Vitronectin. — Shelf Skewed  Talk  03:28, 24 March 2024 (UTC)
 * Done. Disambiguated both MPZ and VTN in this run. The dablinks tool seems permanently broken, which is what I've been using to check for dablinks on these pages. I don't know of any useful alternative for scanning 20000 links on a small number of pages like this, but if anyone knows of a suitable replacement tool, please link it here so that I see it next time. Thanks!  Seppi  333  (Insert 2¢) 22:07, 12 April 2024 (UTC)
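One possible stand-in for the dablinks tool is to ask the MediaWiki API which link targets carry the `disambiguation` page property. The Python sketch below is a suggestion under stated assumptions, not a tested replacement: it assumes the link targets have already been extracted into a list, it does not follow redirects, the helper names and the User-Agent string are hypothetical, and only `find_dablinks` touches the network:

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 50  # the API accepts at most 50 titles per request for ordinary clients


def batches(titles, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


def query_params(titles):
    """Build parameters for a pageprops query restricted to the disambiguation flag."""
    return {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "disambiguation",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": "2",
    }


def dab_titles(response):
    """Pick out the titles flagged as disambiguation pages in one API response."""
    pages = response.get("query", {}).get("pages", [])
    return [p["title"] for p in pages if "disambiguation" in p.get("pageprops", {})]


def find_dablinks(titles, user_agent="dablink-check-sketch/0.1 (hypothetical)"):
    """Query the live API in batches and return every title that is a DAB page."""
    found = []
    for batch in batches(list(titles)):
        url = API_URL + "?" + urllib.parse.urlencode(query_params(batch))
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request) as reply:
            found.extend(dab_titles(json.load(reply)))
    return found
```

At 50 titles per request, 20000 links would take about 400 requests, so a real run should pause between batches and identify itself with a proper User-Agent.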

PIM3 wrong redirect
There is seemingly no PIM3 page in English (only Ukrainian) and PIM3 in this list redirects to the Modula-2 programming language. I've removed the hyperlink for now. 188.2.81.96 (talk) 07:53, 20 June 2024 (UTC)