Wikipedia talk:Biographical metadata

Inspiration
After following discussions about biographical articles for several years, and noting how inefficiently biographical metadata was dealt with, I finally started this summary following a discussion about "death by age". See here for one summary of my views. I'm now going to start a subpage in my userspace to list the 126 links for "death by age" using "what links here". Carcharoth (talk) 00:09, 9 October 2008 (UTC)
 * Just a suggestion, but you could make that list a subpage of this one, unless you're frowning on others editing it? - jc37 00:42, 9 October 2008 (UTC)
 * Others can edit it. It is at User:Carcharoth/People who died aged XX. But I want it to be tidy first. Well, and I also want to keep a record in my userspace. Maybe I'll copy it over. Carcharoth (talk) 00:52, 9 October 2008 (UTC)

Gender
Please see comment I made here about obtaining and recording gender (male/female) metadata. Opinions would be welcomed. Thanks. Carcharoth (talk) 02:05, 11 October 2008 (UTC)
 * Are you interested in ... the question I posed about gender metadata here. I was surprised to see that Persondata doesn't include gender. Do the microformats include it? [...] Carcharoth (talk) 02:19, 11 October 2008 (UTC)
 * [The above copied from my talk page, in order that conversation may be centralised here]
 * The hCard microformat does not include a gender attribute, though I proposed one some time ago. I have also proposed the addition of gender to the vCard specification on which hCard is based, and it seems likely that that will be adopted, though such big wheels grind with terminal slowness. If a gender property is added to vCard, it is almost certain that hCard will follow. In the interim, one microformat parser, Cognition, has adopted a gender property on a trial basis, and there is no technical reason why that could not be included as such in infoboxes (or elsewhere on Wikipedia), where it would be parsed by Cognition (and other parsers which might follow suit) and simply ignored by other microformat parsers, ensuring backwards compatibility. Whether editors would want to see "gender" in infoboxes is another matter.
 * Bear in mind that there is a statistically-small edge-case, for people who are trans-gender, hermaphrodite, or some such. A third, "other", classification may suffice for these. Andy Mabbett (User:Pigsonthewing); Andy's talk; Andy's edits 10:12, 11 October 2008 (UTC)
 * Update: Gender is now in the vCard specification: http://tools.ietf.org/html/rfc6350#section-6.2.7 Andy Mabbett ( Pigsonthewing ); Talk to Andy; Andy's edits 20:46, 5 February 2012 (UTC)
 * How would you add gender attributes to Wikipedia articles so they are picked up by microformat outputs? Carcharoth (talk) 21:15, 5 February 2012 (UTC)

The same way we emit other microformat properties; for example, in an infobox:

with the individual instance of the infobox in an article having Male or whatever, and assuming consensus to display that property, of course. Andy Mabbett ( Pigsonthewing ); Talk to Andy; Andy's edits 21:45, 5 February 2012 (UTC)
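As a rough illustration of what emitting such a property could look like in the rendered HTML: this is a minimal sketch, not the infobox markup referred to above. The `vcard` and `fn` class names come from hCard; the class name `gender` is an assumption, modelled on the trial property mentioned earlier in this thread, and `gender_fragment` is a hypothetical helper.

```python
from html import escape


def gender_fragment(name: str, gender: str) -> str:
    """Render a small hCard-style HTML fragment carrying a gender value.

    ASSUMPTION: "gender" as a class name is illustrative only; parsers
    that do not recognise it would simply ignore that span.
    """
    return (
        '<div class="vcard">'
        f'<span class="fn">{escape(name)}</span> '
        f'<span class="gender">{escape(gender)}</span>'
        "</div>"
    )


if __name__ == "__main__":
    print(gender_fragment("Jane Austen", "Female"))
```

Escaping the values matters here because infobox parameters are free text supplied by editors.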

The value of numbers
You seem to be wanting a quantity, not a category?

(Statistics, without the actual need for grouping those which are part of the statistics.)

Is there a way to do this besides using the whatlinkshere workaround, or categories? - jc37 02:46, 11 October 2008 (UTC)


 * Sort of. I also want to tidy things up where they have got messy. Not separating the music group articles and "double" or "group" biographies from the "single" biographies was a big mistake. Would be good if there was a way to go back and tidy up 500,000 articles and label them as "male" or "female" and "single biography" and "group biography" (even separating out the "double biography" as that is relatively common). I think some of this needs the semantic what's-its-name thingy. In general, it seems that when people want large stats, they either go to a database dump, an API query, run a bot (or request one), or scrabble around with categories and whatlinks here (I think I listed those in decreasing order of technical know-how). I may have missed a few options. There are instructions somewhere on how to extract Persondata at Persondata, but I've never worked that out. Instructions here on how to extract biographical data would be good. There are also ways to count numbers of articles in various places using AWB as well, I think. See the stats on my talk page (here): "there are currently 561,596 pages categorized by birth or death; 522,416 categorized by birth; 236,557 categorized by death date, excluding those categorized as living people; 307,892 categorized as living people". You will note the numbers don't quite add up: 236,557 plus 307,892 is 544,449 (living and dead), against 522,416 (birth categories) and 561,596 (birth and death categories). This does depend on exactly how that analysis was done and how the "unknown" and "missing" categories were treated. Compare also with the 543,994 total here. There are various old discussions as well that have some data. It is almost certain that there are thousands of inconsistencies here that could be cleaned up.
See the lists that User:Dsp13 produced at User:Dsp13/Living people needing categorization by year of birth, User:Dsp13/People needing categorization by year of birth, and User:Dsp13/People needing categorization as living or by year of death. Carcharoth (talk) 03:26, 11 October 2008 (UTC)
 * In the stats above, I'm pretty sure unknown and missing births (as subcats of people by birth) were included in people by birth, and the same for people by death. The quantitative discrepancy doesn't arise from this, but from (1) people whose birth year is somehow categorized (including as unknown/missing) but there is absolutely no category relating to their death year or status, and (2) contrariwise, people whose date of death or status as living is categorized, but no category relating to their birth year.
 * I agree that it would help to know automatically whether a bio page is of an individual or of multiple people. WP:Biography mixes the two. There is a category Category:multiple people but it (taken together with its subcats) doesn't include every page which should be there. Also, at the moment some pages in (subcats of) the multiple people category are categorized by birth or death, and often more than one birth or death date is then given. I think there should be guidelines as to whether this is desirable. Dsp13 (talk) 13:27, 11 October 2008 (UTC)
 * Thanks for explaining the stats. Would you also be able to explain methods of obtaining such data? I mentioned above: "database dump, an API query, run a bot (or request one), or scrabble around with categories and whatlinks here" and AWB. Would you be able to expand on this at all (noting that categories say how many articles they contain, but whatlinks here doesn't - though I don't doubt there are tools to do that with). Thanks also for pointing out Category:Multiple people - that is very useful. I see that the "double saints" biographies are not included there. Excuse me a moment! :-) Carcharoth (talk) 13:46, 11 October 2008 (UTC)
 * I wrote a Python script to query Special:Export. I'd thought of writing a script to find all the pages which were categorized with more than one birth year, or more than one death year. Dsp13 (talk) 19:50, 11 October 2008 (UTC)
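The check Dsp13 mentions (pages categorized with more than one birth year, or more than one death year) could be sketched roughly as follows. This is not the script referred to above; it assumes the page wikitext has already been fetched (for example via Special:Export) and that year categories are named like "1775 births" and "1817 deaths". The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches e.g. [[Category:1775 births]] or [[Category:1817 deaths|Austen, Jane]].
# ASSUMPTION: plain "<year> births/deaths" names only; "BC", decade and
# century categories are deliberately not handled in this sketch.
YEAR_CAT = re.compile(
    r"\[\[\s*Category\s*:\s*(\d{1,4})\s+(births|deaths)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def year_categories(wikitext):
    """Collect the distinct birth and death years a page is categorized under."""
    found = defaultdict(set)
    for year, kind in YEAR_CAT.findall(wikitext):
        found[kind.lower()].add(int(year))
    return dict(found)


def conflicting_years(wikitext):
    """Return only the kinds ('births' or 'deaths') with more than one year."""
    return {
        kind: sorted(years)
        for kind, years in year_categories(wikitext).items()
        if len(years) > 1
    }


if __name__ == "__main__":
    sample = "[[Category:1775 births]]\n[[Category:1776 births]]\n[[Category:1817 deaths]]"
    print(conflicting_years(sample))
```

Pages in Category:Multiple people would show up as hits here too, so the output is a list of candidates to review rather than a list of errors.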

That Polbot stuff
I think this definitely qualifies as biographical metadata stuff. In no particular order:
 * User:Carcharoth/Polbot3 trial run
 * User talk:Carcharoth/Polbot3 trial run
 * Bots/Requests for approval/Polbot 3
 * User:Polbot/ideas/defaultsort
I think that's most of it. Carcharoth (talk) 03:31, 11 October 2008 (UTC)

Random other discussions
Did I ever say there was a lot of this sort of stuff out there? ;-) Please add more as found or needed. Carcharoth (talk) 03:31, 11 October 2008 (UTC)
 * Template talk:Persondata/Removing data
 * Archives of Template talk:WPBiography

Freebase
I also came across Freebase (eg. here) and the tables of information presented there look quite well done. Do you recognise the fields they are using there? Carcharoth (talk) 02:19, 11 October 2008 (UTC)


 * [The above copied from my talk page, in order that conversation may be centralised here]


 * I don't recognise those fields; they may be a set unique to Freebase, though several schemas exist, some specific to particular use cases: for example, FOAF, OpenID Attribute Exchange and other schemas listed here. Andy Mabbett (User:Pigsonthewing); Andy's talk; Andy's edits 10:19, 11 October 2008 (UTC)


 * Thanks. Don't have anything more to add here for the moment (maybe someone else will run with this), but that is useful. Carcharoth (talk) 13:06, 11 October 2008 (UTC)

WorldCat / Library of Congress name authority file
I've used the template Template:Worldcat id to add links to WorldCat Identities: one reason for doing this is to add links to library holdings by/about an individual; another reason for doing so is that this is built on the Library of Congress name authority file & provides a link to it. So, for example, the WorldCat Identities page for Jane Austen has a link to the LC name authority file record for Jane Austen. This for many individuals has dates of birth/death, and also has variant name forms. If anyone is interested, I've matched about 70,000 wikibios to their LC name authority ID. Dsp13 (talk) 13:46, 11 October 2008 (UTC)
 * Goodness! Any way to run a check to see whether there are discrepancies between the data? Also, should the LoC data be used as a reference or not? I can see cases where people would want to refer to other references instead, and of course sources may disagree, but this is a great resource! Carcharoth (talk) 14:59, 11 October 2008 (UTC)

The matches I've made are (mostly) automatic, and in these cases I've only matched where name & dates agree. The LoC dates are mostly right, so are useful where wikipedia lacks any birthyear. (See Template:cite LAF) But where Wikipedia and LoC disagree, wikipedia's quite as likely to be right. In the LoC file, not all dates are given - they're often only provided where needed to disambiguate a name. My feeling is that in the longer term people will want to integrate wikipedia with library resources more systematically (as has happened with German wikipedia): Open Library may prove a wikipedia-friendly way to do this. Dsp13 (talk) 19:45, 11 October 2008 (UTC)
 * Update: See Authority control. Andy Mabbett ( Pigsonthewing ); Talk to Andy; Andy's edits 17:19, 5 February 2012 (UTC)
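The rule Dsp13 describes above (match automatically only where name and dates agree) could be sketched like this. It is a minimal illustration, not the procedure actually used: the `Person` record, the crude name normalisation and the identifiers in the usage example are all assumptions.

```python
from typing import List, NamedTuple, Optional, Tuple


class Person(NamedTuple):
    ident: str           # article title or authority ID (hypothetical values)
    name: str
    born: Optional[int]
    died: Optional[int]


def normalise(name: str) -> str:
    """Crude normalisation so "Austen, Jane" and "Jane Austen" compare equal."""
    tokens = name.lower().replace(",", " ").split()
    return " ".join(sorted(tokens))


def match_records(wiki: List[Person], authority: List[Person]) -> List[Tuple[str, str]]:
    """Pair records whose normalised name, birth year and death year all agree."""
    index = {}
    for rec in authority:
        index.setdefault((normalise(rec.name), rec.born, rec.died), []).append(rec)
    pairs = []
    for bio in wiki:
        if bio.born is None:
            continue  # nothing to compare against, so leave for manual review
        candidates = index.get((normalise(bio.name), bio.born, bio.died), [])
        if len(candidates) == 1:  # skip ambiguous matches
            pairs.append((bio.ident, candidates[0].ident))
    return pairs


if __name__ == "__main__":
    wiki = [Person("Jane Austen", "Jane Austen", 1775, 1817)]
    auth = [Person("lc-0001", "Austen, Jane", 1775, 1817)]
    print(match_records(wiki, auth))
```

Requiring an exact, unambiguous match trades coverage for safety, which fits the point above that the authority file often omits dates unless they are needed to disambiguate a name.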

IMDb
From what I've heard, IMDb should be used with care. Do we really want to list it here? Carcharoth (talk) 14:59, 11 October 2008 (UTC)

Pages with inconsistent metadata
Another external database which might be worth mentioning is that of the company I work for, the internet search company True Knowledge. As a byproduct of their data extraction from wikipedia, they've compiled a list of over 7,500 biographical pages with more than one incompatible date of birth. In these pages different elements of the page - persondata, a category, an infobox or the first sentence of the bio - differ from each other. Any help sorting these out welcome! It also seems an interesting example of how a structured database working with metadata can potentially provide useful feedback to wikipedia. Dsp13 (talk) 01:12, 3 December 2008 (UTC)
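The kind of cross-check described above can be illustrated with a short sketch. It assumes the birth year has already been extracted from each element of the page (persondata, category, infobox, first sentence); the extraction itself is not shown, and the function names and sample data are hypothetical rather than True Knowledge's method.

```python
def has_conflicting_birth_years(sources):
    """True if the sources that give a birth year do not all agree.

    `sources` maps a source name (e.g. "persondata", "category",
    "infobox", "first_sentence") to a year, or None where absent.
    """
    years = {year for year in sources.values() if year is not None}
    return len(years) > 1


def pages_with_conflicts(pages):
    """Return the titles of pages whose sources disagree, sorted by title."""
    return sorted(
        title for title, sources in pages.items()
        if has_conflicting_birth_years(sources)
    )


if __name__ == "__main__":
    pages = {
        "Example A": {"persondata": 1900, "category": 1901, "infobox": None},
        "Example B": {"persondata": 1850, "category": 1850},
    }
    print(pages_with_conflicts(pages))
```

A missing value is treated as "no information" rather than as a disagreement, so only pages with two or more incompatible years are reported.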

DefaultsortBot
Hi everyone,

As requested by Carcharoth, DefaultsortBot has been written, approved, and gone into operation. This bot will traverse Category:Biography articles with listas parameter, and where the corresponding article does not have a DEFAULTSORT tag (or something equivalent to it, such as Lifetime), it will take the listas parameter from the talk page and turn it into a DEFAULTSORT tag for the article. If you have any feedback, please leave it on my talk page. Thanks and enjoy! Matt (talk) 20:48, 24 May 2009 (UTC)

Oh, I also have ListasBot (which you may have noticed, as it's been running for some time now) going through talk pages in Category:Biography articles without listas parameter and assigning listas values to the WPBiography templates whenever possible. Matt (talk) 21:30, 24 May 2009 (UTC)
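The DefaultsortBot step described above (copy listas from the talk page into a DEFAULTSORT tag when the article has neither DEFAULTSORT nor Lifetime) might look roughly like this. It is a sketch under assumptions, not the bot's actual code: the regular expressions are simplified, and placing the tag before the first category link is a choice made here for illustration.

```python
import re

# ASSUMPTION: simplified patterns; real template syntax has more variants.
LISTAS = re.compile(r"\|\s*listas\s*=[ \t]*([^|}\n]+)", re.IGNORECASE)
HAS_SORT = re.compile(r"\{\{\s*(?:DEFAULTSORT\s*:|Lifetime\s*\|)", re.IGNORECASE)
FIRST_CATEGORY = re.compile(r"\[\[\s*Category\s*:", re.IGNORECASE)


def add_defaultsort(article: str, talk: str) -> str:
    """Return the article text with a DEFAULTSORT tag derived from listas.

    The text is returned unchanged if a sort key already exists or the
    talk page has no usable listas value.
    """
    if HAS_SORT.search(article):
        return article
    match = LISTAS.search(talk)
    if not match or not match.group(1).strip():
        return article
    tag = "{{DEFAULTSORT:" + match.group(1).strip() + "}}\n"
    category = FIRST_CATEGORY.search(article)
    if category:
        # Put the tag immediately before the first category link.
        return article[:category.start()] + tag + article[category.start():]
    return article.rstrip("\n") + "\n" + tag


if __name__ == "__main__":
    talk = "{{WPBiography|living=no|listas=Austen, Jane}}"
    article = "'''Jane Austen''' was a novelist.\n[[Category:1775 births]]\n"
    print(add_defaultsort(article, talk))
```

Running the function twice leaves the text unchanged the second time, since the first pass adds the tag that the initial check looks for.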

WikiProject Unique Identifiers
I have created WikiProject Unique Identifiers for discussion and coordination of all UID-related matters. Please join! Andy Mabbett ( Pigsonthewing ); Talk to Andy; Andy's edits 13:03, 14 February 2012 (UTC)