Wikipedia talk:Authority control integration proposal/Archive 1

Discussion at the village pump
I and lots of other people provided comments about this proposal at the Village Pump. Those comments are here - Village_pump_(proposals).  Blue Rasberry   (talk)   15:14, 19 June 2012 (UTC)

Interested
I have only scan read the proposal. However I believe the benefits of closer integration with the Library community are immense. At the moment I am just supportive. I have had discussions with Cilip, OCLC here and in the US. I would need time to have this explained to me or to trust others who were working in an agreed way. I'm not sure if this is possible but I think a demonstration of the functionality - even if it only involved .0001 % of the potential would be very useful. Is this partially possible? Victuallers (talk) 18:49, 18 June 2012 (UTC)


 * Hi Roger. Feel free to give me a call if you'd like me to run through it offline! Andrew Gray (talk) 12:28, 20 June 2012 (UTC)

Authority control on Commons
Maximiliankleinoclc, I was quite active last year adding Authority control templates to commons:category:Creator templates and subcategories of commons:category:People by name. Last year I copied over 30k Authority control templates from German Wikipedia (where they have ~180k of them). Lately we transplanted Commons:Help:Gadget-VIAFDataImporter from wikisource and added it to the list of available gadgets, and now there is a lot of activity adding Authority control templates to a lot of pages. With the help of the gadget it can be done with a few clicks. It would be great if it were possible to add some more by a bot. You can use such a run as a test before running it on Wikipedia. What info do you use to match records? Name and dates of death/birth, or other info? --Jarekt (talk) 19:12, 18 June 2012 (UTC)
 * There is Authority control on Commons (45k templates) as well as on the English (4k) and German (180k) Wikipedias. This just goes to show the extent of the spaghetti. This proposal has Wikidata - the untangler of the spaghetti - in mind from the start. I think when we import to Wikidata, having the three separate data sources and checking them against each other will provide a good level of accuracy. That is, when we go to put authority control on Wikidata, if three, or two out of three, of en, de, and commons agree on the authority control data then we can be confident. Maximiliankleinoclc (talk) 17:47, 26 June 2012 (UTC)
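The two-out-of-three rule described above can be sketched in a few lines. This is only an illustration, assuming each project contributes at most one VIAF number per subject; the function name, the `min_agree` parameter and the source labels are hypothetical and not part of the proposal.

```python
from collections import Counter
from typing import Dict, Optional


def consensus_viaf(candidates: Dict[str, Optional[str]], min_agree: int = 2) -> Optional[str]:
    """Return the identifier that at least `min_agree` sources share, else None.

    `candidates` maps a source label (e.g. "en", "de", "commons") to the
    VIAF number found there, or None when that source has no template.
    """
    votes = Counter(value for value in candidates.values() if value)
    if not votes:
        return None
    viaf_id, count = votes.most_common(1)[0]
    return viaf_id if count >= min_agree else None


# Two of three sources agree, so the shared identifier is accepted.
accepted = consensus_viaf({"en": "104023256", "de": "104023256", "commons": None})
```

Subjects for which the function returns None would go to a manual review list rather than being imported automatically.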

Useful
I've always supported anything in the direction of linking to outside authoritative databases, and also supported content organization that would be useful in converting WP into a fully semantic wiki. My impression is that people at some of the other WPs are far ahead of the enWP in this respect. Certainly the VIAF system is very unfamiliar in the US--in fact, even the LC authority file, whether from LC or via OCLC, is still unfamiliar to every non-librarian here, though I have been trying to link to it at relevant AfDs and to use it for sourcing years of birth. Raising awareness is good, but the initial steps should come slowly and not as a surprise. The suggestion at the Village Pump that the bot is not yet ready & that this should be added manually is relevant.  DGG ( talk ) 19:15, 18 June 2012 (UTC)
 * The way the bot would work is that it takes a list of preexisting articles that are positive matches to VIAF at the moment. That could help with saving articles at AfD, through automation, and also with awareness. Maximiliankleinoclc (talk) 17:52, 26 June 2012 (UTC)

Support for visible links
I strongly support the use of authority control in articles but it must be visible - hidden metadata falls out of step with visible article content, and errors go uncorrected. Furthermore, we should look to embedding these links, where possible, in infoboxes, so that they can be used as UIDs in the metadata emitted by them. I look forward to meeting you at Wikimania! Andy Mabbett ( Pigsonthewing ); Talk to Andy; Andy's edits 20:26, 18 June 2012 (UTC)

Using de.wikipedia's handpicked data
I think the very first step to integrate more authority data to en.wikipedia should be to take the data available via "de:" interwiki links wherever de:Template:Normdaten exists in the German article, and copy that to en.wikipedia. At de.wikipedia we not only have a lot of German GND data, but 116.000 LCCN entries and 142.815 VIAF entries. Since these are (or at least should be) manually checked, they are probably much better than any of VIAF's algorithm-driven entries in matching persons. --AndreasPraefcke (talk) 10:21, 19 June 2012 (UTC)
 * I was having trouble understanding this and I talked to User:Maximiliankleinoclc and user:Andrew Gray about this. They explained to me what you are saying and possibilities for what can be done with German data. For others who are also confused, what this means is that there are two databases, a German one and a United States one, each of which would be good enough to use. Each would, though, have a certain number of errors. If these databases were cross-checked for differences, then different data for the same person would probably indicate a problem. If the two databases are cross-checked first, the problems are identified and fixed, and the resulting fixed data list is the one used for this project, then this project will be using a data set of higher quality than either individual set. Right?  Blue Rasberry    (talk)   21:07, 2 July 2012 (UTC)
 * Well, since the results of the cross-check might also go directly back into the individual databases, the difference might not be as big, since the improvement would be everywhere. Actually, we have two mappings: one performed intellectually (or semi-automated) in de:WP, which can be extracted from the Normdaten-templates (~ 140.000 VIAF numbers as noted above, inserted in the last couple of years), and a more recent one from VIAF to en:WP, which was performed automatically by OCLC and can be extracted from their data dumps at, namely the viaf-yymmdddd-links.txt files (~304.000 as of 2012-05-24). These two can be set into correspondence either by exploiting interwiki links to en:WP contained in de:WP articles or vice versa (there will be differences, since the interwiki links are fairly good but not perfect). As a first exploit we identified several thousand articles in de:WP without a Normdaten-template but with an interwiki link to an article in en:WP which also is a target of OCLC's mapping (actually restricted to those VIAF clusters which also bear a GND identifier, since these are our primary focus at de:WP).
 * As a second exploit I just processed the articles with Normdaten-template in de:WP (and interwiki to en:WP) and compared their VIAF number (if any) with the one provided by OCLC's mapping (if any): of 702.000 interwiki links from de:WP to en:WP, 84.000 were backed by OCLC's mapping, 51.100 had a VIAF number in de:WP's Normdaten-template, and 9.400 (=18%) of these differed from the one provided by OCLC's mapping!
 * This could be explained as follows: Some editors in de:WP note the VIAF number corresponding to the GND record. Due to some peculiarities of the GND it is quite typical to have several GND records: One suitable for wikipedia usage (a personalized one) and one (an "undifferentiated" one) which unfortunately tends to enter the main VIAF cluster for that person. Other editors even deliberately prefer VIAF numbers not already implicit because of the already noted GND or LCCN number. And one may speculate that OCLC's mapping mostly succeeded for those clusters containing the LoC-NA record. Thus the huge discrepancy is not a sign of faults but can be interpreted as a successful "external" effort yielding > 9.000 intellectual identifications within VIAF in cases where the automatic match & merge had not succeeded! Of course there are also several minor sources of real errors: interwiki issues (obviously sometimes disambiguation pages are involved which never should be) and the fact that some VIAF numbers in de:WP are outdated (i.e. VIAF would redirect the number given to a current one). -- Gymel (talk) 00:02, 3 July 2012 (UTC)
 * Thanks for explaining how each database can identify information not present in other databases and thanks also for explaining how this project could result in improvement of the original databases on their next updates. All of this is difficult for me to understand because I have never thought of such things before and I do not yet know enough to be aware of what background for understanding that I might be lacking.
 * May I ask you to identify yourself on your userpage? You know a lot about this for a new user and I am curious about how you are involved in this project or similar projects. I really appreciate your input on this; although I support the project in principle because I recognize the benefits, I feel less qualified to endorse technical aspects of it simply because I do not understand such things. I am glad to see you here joining the discussion of this.  Blue Rasberry    (talk)   13:01, 3 July 2012 (UTC)
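The cross-check described in this thread can be sketched roughly as follows. This is a sketch only, assuming three plain lookup tables (interwiki links, VIAF numbers from the Normdaten templates, and OCLC's mapping) have already been extracted; the function name and all titles and numbers in the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def compare_mappings(
    interwiki: Dict[str, str],   # de:WP title -> en:WP title
    de_viaf: Dict[str, str],     # de:WP title -> VIAF number from Normdaten
    oclc_viaf: Dict[str, str],   # en:WP title -> VIAF number from OCLC's mapping
) -> Tuple[int, List[Tuple[str, str, str, str]]]:
    """Return how many article pairs carry a number on both sides,
    plus the pairs whose two numbers differ."""
    compared = 0
    conflicts = []
    for de_title, en_title in interwiki.items():
        ours = de_viaf.get(de_title)
        theirs = oclc_viaf.get(en_title)
        if ours is None or theirs is None:
            continue  # nothing to compare for this pair
        compared += 1
        if ours != theirs:
            conflicts.append((de_title, en_title, ours, theirs))
    return compared, conflicts


# Toy data, purely illustrative.
compared, conflicts = compare_mappings(
    {"A (de)": "A (en)", "B (de)": "B (en)", "C (de)": "C (en)"},
    {"A (de)": "1", "B (de)": "2"},
    {"A (en)": "1", "B (en)": "9", "C (en)": "3"},
)
```

A conflict in this list is a candidate for review rather than proof of an error, as explained above. The quoted share works out as 9.400 / 51.100 ≈ 18.4%, if the 51.100 are read as the pairs where both numbers exist.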

VIAF crowdsourcing interface
What is really missing in VIAF, in my opinion, is an easy mechanism to let users merge and split data clusters. It could be restricted to experienced users on application (let's say people working on authority files professionally, or the handful of people doing a lot of authority data work in de.wikipedia and other projects), or, if it doesn't change data "live", at least it could be used as a tool to become aware of clusters that require further looking. --AndreasPraefcke (talk) 10:21, 19 June 2012 (UTC)
 * You are right, it shouldn't be as liberal as anonymous editing. And it isn't: you can submit corrections to VIAF through the suggestions box on the VIAF website. This topic is a bit outside the scope of the proposal, though. Maximiliankleinoclc (talk) 17:57, 26 June 2012 (UTC)
 * I know. I just jumped at the opportunity... (you don't get to talk to someone concerned with VIAF every day :-) --AndreasPraefcke (talk) 17:59, 28 June 2012 (UTC)

More ways that VIAF could profit from Wikipedia

 * Using the de:Wikipedia:BEACON files to automatically link to more good relevant websites (now they are available only for GND, but they could easily be produced for VIAF numbers, LCCN, etc.) --AndreasPraefcke (talk) 10:21, 19 June 2012 (UTC)


 * Using IMDB links on en.wikipedia and de.wikipedia to help include IMDb as another authority file. It is sorely missing in VIAF. --AndreasPraefcke (talk) 10:21, 19 June 2012 (UTC)


 * Does IMDB licensing allow for this? Maximiliankleinoclc (talk) 17:58, 26 June 2012 (UTC)


 * Probably - we do it already! Note that most people/films with IMDB pages already link to them, so this would mainly be shuffling the location of the data around a bit. Andrew Gray (talk) 22:11, 27 June 2012 (UTC)

Well, maybe OCLC would have to ask the powers behind IMDb; I don't know how the law is in this respect. I cannot imagine IMDb would be averse to being publicly recognized as professional (which IMHO this database is: it is much better than any of the librarian auth. files as far as actors and directors go), or to getting even more traffic from professional environments like libraries and universities. --AndreasPraefcke (talk) 15:15, 28 June 2012 (UTC)

VIAF feature request: use Wikipedia authority control as stabilizer

 * When combining data, VIAF should prefer the identifier already in use at (de./en.) wikipedia. Alternatively, VIAF could create some bot to automatically change VIAF entries in Wikipedia whenever a VIAF set is combined and redirected, or at least provide a list of such redirects for Wikipedians to do the bot work. --AndreasPraefcke (talk) 10:44, 19 June 2012 (UTC)
 * This is a really good point, and I will include it in the maintenance section of the proposal, to update in time with the VIAF updates. Maximiliankleinoclc (talk) 17:59, 26 June 2012 (UTC)
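The maintenance step suggested here, updating stored numbers from a list of VIAF redirects, might look like the sketch below. It assumes VIAF (or someone else) supplies the redirects as simple old-number to new-number pairs; no such list format is confirmed in this discussion, and the function names and numbers are hypothetical.

```python
from typing import Dict, Tuple


def resolve_viaf(viaf_id: str, redirects: Dict[str, str]) -> str:
    """Follow a chain of redirects until a current identifier is reached."""
    seen = set()
    current = viaf_id
    while current in redirects:
        if current in seen:
            raise ValueError(f"redirect loop involving {current}")
        seen.add(current)
        current = redirects[current]
    return current


def stale_entries(article_ids: Dict[str, str], redirects: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Map each article whose stored number is outdated to (old, current)."""
    updates = {}
    for title, old in article_ids.items():
        new = resolve_viaf(old, redirects)
        if new != old:
            updates[title] = (old, new)
    return updates


# Hypothetical numbers: 111 was merged into 222, which was later merged into 333.
updates = stale_entries({"Some article": "111", "Other article": "444"},
                        {"111": "222", "222": "333"})
```

A bot could then work through `updates` after each VIAF release; articles that already hold a current number are left untouched.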

Substantive feedback
I'm replying here rather than at Village_pump_(proposals), because this is getting long and reasonably technical. As a librarian, wikipedian and maintainer of an authority control system not involved in the current proposal, I strongly support this work, but not in the current technical implementation. In particular: I hope this helps Stuartyeates (talk) 09:40, 24 June 2012 (UTC)
 * 1) The documentation on Authority control is entirely inadequate, both because there are many points it doesn't cover (most of which appear to have been raised in this discussion so far) and because it's librarian-centric rather than wikipedian-centric. We also need much more detail on things like romanisation / character mapping differences between the wikipedia approach and that in use for each scheme.
 * 2) There are a large number of caveats in the documentation for Authority control which could be checked for automatically. They're not. I'm thinking of "(deprecated, please use GND)" and "As of 2012-06-20 for LCCN's starting with "n" and followed by 8 digits this template appears to require the syntax n/99/999999 where 9 is any digit from 0 to 9." and so forth. Data validation is your friend, early data validation is your best friend. See Citation for good examples of this in practice.
 * 3) Currently there is a single Authority control template which does all the work. This makes it impossible for people to add non-trivial new schemes (i.e. schemes with data validation and proper documentation), because non-trivial new schemes take development work and development can't be done on templates with 3 million uses except by gurus. At the most, Authority control should be an interface to a collection of templates, each of which knows everything about a single scheme. That way new schemes could be tested, debugged and added without disturbing 3 million pages. See Citation for good examples of this in practice.
 * 4) The current approach is to use VIAF for people and only people. This is completely wrong. The approach should be to use VIAF initially for people and expand it as is feasible. The difference between the approaches in the near term is about planning and coding the templates / bots extensibly.


 * Thanks Stuart. I'm drawing up a general RFC on the proposal at the moment, and I think you're right that we need to step back and look at how the whole AC system on Wikipedia is organised as part of this. Bringing it all into one place and updating the documentation would be a great help!
 * Regarding your specific points, I agree with you entirely on #4 - I think that biographies are the most efficient use of the template, but there's no specific reason to restrict it to them.
 * 3 is interesting; are you thinking of, say, authority control as a wrapper which pulls in VIAF and LCCN and so on, similar to the way that JSTOR and DOI and so on are handled in the citation templates? Andrew Gray (talk) 22:22, 24 June 2012 (UTC)
 * Yep, that's pretty much what I was thinking.Stuartyeates (talk) 20:18, 25 June 2012 (UTC)
 * Great. I've explicitly revised the proposal to include reworking the template and the documentation as part of the plan - the template is just about within the bounds of what I'm comfortable doing, technically, so I might enlist some additional support for this part! Let me know what you think of the revised proposal. Andrew Gray (talk) 21:34, 25 June 2012 (UTC)
 * The good thing about breaking the templates down is that it makes it much easier to experiment with them and push our collective limits. Also, sane questions at Help_talk:Template seem to get answered. I'll make some (explanatory) tweaks to the proposal. Stuartyeates (talk) 23:18, 25 June 2012 (UTC)

Very nice idea about abstraction layers. Will implement. Maximiliankleinoclc (talk) 18:08, 26 June 2012 (UTC)


 * Authority data may not be trivial, but it's not rocket science. We have long experience with that template (de:Template:Normdaten) on de.wikipedia, and it's pretty easy to use and hasn't caused many problems at any time. That said, I don't get your point #1: of course, it's librarian-centric, because that's what authority files have been so far. We may imagine (or start to develop) our own, better authority files, but to link with the _existing_ ones (invented and maintained by librarians), we need to adopt their thinking. What do you mean by romanisation approaches? We don't include names or any difficult characters in the template, but only identifiers. Well, the LCCN is kind of complicated (that is because the ******s (fill in foul language of your liking) at LOC can't make up their minds which scheme of their LCCN to use themselves). But even the LCCN only has one or two English letters (n or no) and a couple of numbers. GND has numbers and an occasional X or "-". VIAF is even numbers only. Even the Japanese NDL is a simple number with some leading zeros. --AndreasPraefcke (talk) 18:34, 25 June 2012 (UTC)
 * it's librarian-centric, because that's what authority files have been so far — Yes, but the documentation here is for wikipedians not librarians and so long as it uses librarianship terms of art without explanation, it will remain obscure to them. Stuartyeates (talk) 20:18, 25 June 2012 (UTC)
 * What do you mean by romanisation approaches? We don't include names or any difficult characters in the template, but only identifiers. — Yes, we include only the authority control link, but wikipedians expect to be able to check that what's at the end of the link matches what they expect, and any discrepancies need to be explained. There are English-language wikipedia processes such as Good articles which involve every link being manually checked by an uninvolved editor, and the documentation needs to support them in that. Stuartyeates (talk) 20:18, 25 June 2012 (UTC)
 * The examples you give of different schemes and their range of identifiers reassure me that error-checking in the template is possible. Stuartyeates (talk) 20:18, 25 June 2012 (UTC)
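To make the idea of one checker per scheme behind a single interface concrete, here is a rough sketch. The patterns are loose readings of the identifier shapes described in this thread (VIAF digits only, GND digits with an occasional X or "-", and the n/99/999999 caveat for LCCN), not official specifications, and every function name is hypothetical.

```python
import re
from typing import Callable, Dict, List


def check_viaf(value: str) -> List[str]:
    # Assumption from the discussion: VIAF numbers are digits only.
    return [] if re.fullmatch(r"\d+", value) else ["VIAF: expected digits only"]


def check_gnd(value: str) -> List[str]:
    # Assumption from the discussion: digits with an occasional X or "-".
    return [] if re.fullmatch(r"[0-9X-]+", value) else ["GND: unexpected characters"]


def check_lccn(value: str) -> List[str]:
    # Only the documented caveat is checked: "n" plus 8 digits must be
    # written as n/99/999999. Other LCCN shapes are not judged here.
    if re.fullmatch(r"n\d{8}", value):
        return [f"LCCN: write as n/{value[1:3]}/{value[3:]}"]
    return []


# The wrapper knows nothing about any scheme beyond this registry,
# so a new scheme can be added and tested on its own.
SCHEME_CHECKS: Dict[str, Callable[[str], List[str]]] = {
    "VIAF": check_viaf,
    "GND": check_gnd,
    "LCCN": check_lccn,
}


def check_authority_control(params: Dict[str, str]) -> Dict[str, List[str]]:
    """Return the problems found per scheme; schemes without problems are omitted."""
    report = {}
    for scheme, value in params.items():
        check = SCHEME_CHECKS.get(scheme)
        problems = check(value.strip()) if check else [f"{scheme}: no checker registered"]
        if problems:
            report[scheme] = problems
    return report


# The LCCN value below is a made-up number in the compact form.
report = check_authority_control({"VIAF": "104023256", "GND": "118527053", "LCCN": "n12345678"})
```

In template terms the registry plays the role of the wrapper and each function the role of a per-scheme subtemplate; the same checks could equally run in a bot before it saves an edit.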


 * Thank you for the clarification. Whether or not to actually show the authority data within Wikipedia articles will always be in debate, I guess. Since they are metadata, I'd prefer some visualisation like our interwiki links, at a place that makes it clear that this is "professional" stuff and not really necessary for enjoying a good article. --AndreasPraefcke (talk) 08:18, 26 June 2012 (UTC)
 * In my experience you are likely to experience considerable editor resistance on the English-language wikipedia to any effort framed in terms of machine-readable data. It needs to be framed either in terms of improving the English-language wikipedia pages as experienced by new editors (i.e. those not logged in) or alternatively improving inter-wiki links to minority languages (i.e. outside the top 10 or 20 wikis). A number of efforts to improve machine-readability have foundered because editors of the English-language wikipedia overwhelmingly favour improving human editability over machine-readability. Stuartyeates (talk) 08:27, 26 June 2012 (UTC)
 * That's valuable input. If this turns out to be a deal-breaker we can resort to hidden metadata, and then it can be viewed in a more visually pleasing way once the Wikidata integration has occurred. Maximiliankleinoclc (talk) 18:11, 26 June 2012 (UTC)
 * Alas hidden metadata isn't a solution to this problem, since it's still there when a user hits the edit button, contributing to the complexity encountered by a newbie editor (assuming that you mean hidden metadata in the persondata sense). The solution is to base the RFC on arguments like disambiguation and assistance in interwiki links, while downplaying the linked data aspects. Stuartyeates (talk) 23:49, 26 June 2012 (UTC)


 * I understand your concern, but this battle is long lost, I think. The article on the U. S. begins with  {{Infobox country |conventional_long_name = United States of America . Not really "wiki wiki" as it used to be in 2003. But even then, the source code was already a mess as far as usability goes. --AndreasPraefcke (talk) 11:12, 28 June 2012 (UTC)

Use of Albert Einstein as an example
In a number of places Albert Einstein is used as an authority control example. While he is an example of an author who is well known and everywhere, I suggest picking harder examples, to prove the versatility of authority control: Yasunari Kawabata, someone from Category:Burmese writers, Category:Thai writers or similar. These are the hard examples, involving radically different cultures and scripts, proving that our schemes have what it takes. Alternatively, if some of the schemes can't handle these people and their real names, we need to document that. Stuartyeates (talk) 20:25, 26 June 2012 (UTC)


 * Mohandas Karamchand Gandhi seems a good example. VIAF 71391324 has primary entries in three of the four scripts VIAF has records in (Latin, Arabic & Hebrew - the missing one is Cyrillic, contributed by Israel and Moscow), and the 400s alternate section includes a wide variety of romanisations, plus alternates using Japanese, Devanagari and Gujarati scripts.
 * It's a bit hit-and-miss what does and doesn't turn up in the 400s; some seem to have a lot of scripts, some very few. Fyodor Dostoyevsky is an interesting example - VIAF 104023256 - as it has all four languages in the 100s, and as alternate headings, Japanese, Greek and Chinese plus an enormous variety of different romanisations. Andrew Gray (talk) 22:49, 26 June 2012 (UTC)
 * I've changed to Dostoyevsky. Feel free to switch it for someone with even more variety... Andrew Gray (talk) 22:54, 26 June 2012 (UTC)
 * Yes, a much better example. Stuartyeates (talk) 23:05, 26 June 2012 (UTC)

Convincing use-cases for authority control
I've been looking at some of the previous attempts to do automatic interlinking of wikipedia with external things and I've come to the conclusion that we're going to need some clear, detailed use cases of how the authority control can benefit wikipedia in and of itself. I suspect we can come up with a couple of "with this information we can write a toolserver tool to do X" scenarios. Stuartyeates (talk) 04:56, 27 June 2012 (UTC)
 * We do not need to demonstrate benefit to Wikipedia "in and of itself". Benefiting Wikipedia's readers and/or data re-users will be more than adequate justification. Andy Mabbett ( Pigsonthewing ); Talk to Andy; Andy's edits 10:17, 27 June 2012 (UTC)
 * Various things come from the embedded identifiers:
 * Reliable linking from external services - we can build lookup services. de.wp has an elegant PND tool - http://toolserver.org/~apper/pd/person/pnd-redirect/de/118768581 - which takes you to the article represented by that PND (and, optionally, allows you to select a preferred language - http://toolserver.org/~apper/pd/person/pnd-redirect/en/118768581 does the same thing pointing to en.wp, using interwikis). Being able to have this work for a large number of records rather than a few thousand makes it a practical tool, and allows people to automatically generate these links to Wikipedia, use the API to pull out leads from articles for reuse in other sites, etc.
 * Extend the scope for checking metadata - we already have methods, such as the Death anomalies project, for comparing the metadata between Wikipedia language editions and spotting inconsistencies. Including identifiers which tie into external services, with reliable APIs, mean that we have a lot of reliable data to compare articles to.
 * Return metadata to the outside world - working backwards from this, once we have embedded identifiers, the curators of this metadata will find it a lot easier to incorporate information from Wikipedia, taking advantage of our fairly fast update cycle for things like death dates.
 * Identifying alternate names - particularly for non-standard transliterations, the alternate headings in authority files give us an extensive and curated collection of variants of names. The linkage will help the creation of redirects.
 * Content creation support - the presence of the identifiers makes it possible for people to, eg, develop scripts to generate author's bibliographies for articles and so forth. I'm not recommending we do this now, but it's the sort of practical innovation we're allowing for.
 * ...plus the various benefits from the links in the template itself, such as being able to jump directly to WorldCat and thus to relevant material. Andrew Gray (talk) 12:44, 27 June 2012 (UTC)
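The first use case in the list above, a lookup service that goes from an identifier to the matching article in a preferred language, could be sketched as below. This is not how the toolserver tool is implemented; it only assumes that (language, title, scheme, identifier) rows have been extracted from the templates, and the function names are hypothetical. The sample rows reuse the Dostoyevsky identifier and titles mentioned in this discussion.

```python
from typing import Dict, Iterable, Optional, Tuple
from urllib.parse import quote

Row = Tuple[str, str, str, str]  # (language, article title, scheme, identifier)


def build_lookup(rows: Iterable[Row]) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Index article titles by (scheme, identifier) and then by language."""
    index: Dict[Tuple[str, str], Dict[str, str]] = {}
    for lang, title, scheme, ident in rows:
        index.setdefault((scheme, ident), {})[lang] = title
    return index


def article_url(index, scheme: str, ident: str, lang: str = "de") -> Optional[str]:
    """Return the article URL for an identifier in the requested language, if known."""
    title = index.get((scheme, ident), {}).get(lang)
    if title is None:
        return None
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


index = build_lookup([
    ("de", "Fjodor Michailowitsch Dostojewski", "GND", "118527053"),
    ("en", "Fyodor Dostoyevsky", "GND", "118527053"),
])
url = article_url(index, "GND", "118527053", lang="en")
```

An external catalogue holding only the identifier could then link to the article without knowing its title in any language.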


 * Let me add one example for 1: The German National Library uses our data (a list of GND and Wikipedia articles, provided daily at ) to link back from its catalog to Wikipedia. See http://d-nb.info/gnd/118527053 (to the right: Zugehöriger Artikel in Wikipedia).
 * However, the best system for in- and outward links to/from Wikipedia using authority identifiers is our own BEACON format. See de:Wikipedia:BEACON or meta:BEACON. It allows any publisher of an online database to provide its own simple file with identifiers that will provide a search result in its database (library catalogue, biographical dictionary and so on). Even if such content providers don't use authority data, their websites can be tagged by third parties (as long as the records have some persistent URL that we can use). Look at the Dostoevsky results of such a search using all available BEACON files: http://beacon.findbuch.de/seealso/pnd-aks?format=sources&id=118527053 Our own "Wikipedia Personensuche" that is linked in all authority control templates of biographies in de.wikipedia also uses these BEACON files (not all of them yet, though), see http://toolserver.org/~apper/pd/person/Fjodor_Michailowitsch_Dostojewski –-AndreasPraefcke (talk) 11:04, 28 June 2012 (UTC)

Multiple authority control templates in a single article
I'm struggling to understand how this is going to work in terms of semantics. I think it's best to start with 1:1 relations, because anything else risks making the Authority control framework incompatible with some semantic frameworks / approaches. The restriction would only apply to article space, naturally, and not to pages such as Project_Gutenberg_author_list/Page_01, which is a list of authors for whom pages are yet to be created (there are probably more such pages under WikiProject Missing encyclopedic articles). Stuartyeates (talk) 20:06, 28 June 2012 (UTC)


 * I spent today mulling over this, and on the whole I think I agree. Multiple cases should be logged in a separate report (I expect it will mostly cover sibling and co-author pairs), pending a clear decision on how we handle these cases - it's something which a lot of our metadata has trouble with. If we have corporate identifiers for the multiple-individual entity, on the other hand, no problem including those directly. Andrew Gray (talk) 20:45, 28 June 2012 (UTC)

I don't know en.wikipedia well enough, but I can say what we do at de.wikipedia. We have not really solved this problem, but we use a workaround. For co-author pairs that have a single article on de.wikipedia but are two people, we have redirects for each name that get categories like a normal biography, and have the authority template (which apparently can then be machine-read from the dump like the others). The data is not really useful for Wikipedia readers there, but at least it makes automatic linking from other sites possible. If the co-author pair (like de:Clegg & Guttmann) or a group pseudonym (like de:Nicci French) or a band or comedy act consisting of two people (whose biographies are covered in the band article only) have a separate authority record, this one is additionally put in the article itself. --AndreasPraefcke (talk) 21:13, 28 June 2012 (UTC)


 * Clever! How does your persondata work for this - do you add persondata to the redirects? I know we do something similar with categorised redirects sometimes, but it's not a systematic thing. (yet?) Andrew Gray (talk) 21:25, 28 June 2012 (UTC)


 * Yes, "Personendaten" is also added to the redirect. That said, with artists' duos, co-authors and the likes, I'd personally prefer not to use redirects at all, but have stubs that have all the necessary data and a short definition, plus some prominently placed link to where the real information is. Usually, people won't really search for those single people, but giving their whole biographies in the "band article" is often enough just a bit too much anyway, and these stubs could be linked from the "band article" anyway. There are other instances of double authority data for a single article, though, where redirects may indeed be the best choice. For example, churches are part of the GND as "geographic names" (being a building), but the corresponding parish is usually some sort of "organization" with a separate GND number. For complicated things like that (about which the wikipedia reader will never really care for), we'd either need to get away from the 1:1 relation of article and authority record, or we may just as well use such redirects (e. g. "Christ Church (Xtown)", redirected from "Parish of Christ Church (Xtown)"). --AndreasPraefcke (talk) 23:08, 28 June 2012 (UTC)


 * It's almost like our project doesn't have a very clear structured taxonomy, for some reason ;-). Andrew Gray (talk) 10:31, 29 June 2012 (UTC)
 * Thanks for explaining this, AndreasPraefcke.  Blue Rasberry    (talk)   21:01, 2 July 2012 (UTC)

French project
Just noticed this discussion on fr.wp: fr:Wikipédia:Le Bistro/3 juillet 2012. Per fr:Discussion modèle:Autorité they seem to be primarily pulling from de.wp data; there might be some opportunity for pooling effort here. Andrew Gray (talk) 12:46, 3 July 2012 (UTC)

How to use it
There are occasions when I have found two articles about the same person, when the persons are known by more than one name. This might be the case for persons from history, for persons from countries with non-Latin alphabets and varied transliterations into English, and for authors versus pen-names. One example is Prince Henry, Duke of Cornwall which was merged to Henry, Duke of Cornwall. How, exactly, would one look up the "Authority control" number for the subject of a new biography, to see if there is an existing bio for that person? Could a bot create a list of articles about the same entity under different names or spellings, so that a merger could be considered, just as the two aforementioned articles were merged? (Some conspiracy theorists will have a field day with the notion that Wikipedia is coming under "Deep Authority Control Integration.") Edison (talk) 15:41, 4 July 2012 (UTC)


 * Generating such lists by bot will be no problem at all, I think. A database of all "persondata" and "authority control" data can be maintained on the toolserver. We have been doing that for de.wikipedia data for years now (it is the database behind http://toolserver.org/~apper/pd/ ), and the whole thing is being re-programmed to serve even more purposes right now; so I think there might be a chance that the same thing can easily be adapted to en.wikipedia needs. Well, not "we" do it; actually User:APPER did the whole work. But I think it shouldn't be any problem to get lists of double entries of authority data from such a database on a regular basis. --AndreasPraefcke (talk) 21:18, 4 July 2012 (UTC)
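A list of "double entries" of this kind could be produced along the following lines. The sketch assumes a simple table of (article title, identifier) pairs is available, for example from a dump; the function name is hypothetical and the identifiers in the example are placeholders, not real VIAF numbers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_shared_identifiers(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group article titles by identifier and keep identifiers used more than once."""
    by_id: Dict[str, List[str]] = defaultdict(list)
    for title, ident in pairs:
        by_id[ident].append(title)
    return {ident: sorted(titles) for ident, titles in by_id.items() if len(titles) > 1}


# Placeholder identifiers; the two titles are the merged pair mentioned above.
merge_candidates = find_shared_identifiers([
    ("Prince Henry, Duke of Cornwall", "PLACEHOLDER-1"),
    ("Henry, Duke of Cornwall", "PLACEHOLDER-1"),
    ("Fyodor Dostoyevsky", "PLACEHOLDER-2"),
])
```

Each entry in the result is only a candidate for a merge discussion, since two articles may legitimately share an identifier (for instance an article and a redirect target that were both tagged).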

Generating biographical databases
Not strictly related to this proposal, but I was wondering if this will help make it possible to generate a database of all biographical articles on Wikipedia? What I have in mind is the fact that there will be some (many) biographical articles on Wikipedia for which there are no VIAF identifiers. These may be obscure living or recent people who are borderline notable, or very obscure historical figures. Is it possible to estimate how many of the nearly 1,000,000 biographical articles on Wikipedia are likely to be matched up with a VIAF identifier and how many may not be, or is that something that we will only know once the data-crunching begins? I presume that once much of the matching up has been done, it will be possible to generate an alphabetical listing of all biography articles with VIAF identifiers? If that is so, can I ask if the following will be possible:
 * (1) If VIAF identifiers for other objects get added, will it still be possible to filter out the biography ones and just generate that database (i.e. exclude the non-people objects and retain that level of filtering)?
 * (2) The issue of biographical articles on two (or more) people is one that Wikipedia failed (at the start) to get a proper handle on. It would have been best to include something identifying such articles (something in metadata, other than the categories that may apply). Is it too late to include something at this stage identifying such articles as they are found during the roll-out of VIAF (if the proposal goes ahead, as it seems it will)?
 * (3) Does VIAF use any gender identifier for people? I've not yet found a system (including Wikipedia) that identifies gender (even a simple male, female, unknown, other choice). So I've never yet been able to answer the question of how many articles on women Wikipedia has, compared to number of articles on men.
May have some other questions later, but those are probably enough for now. Carcharoth (talk) 20:49, 15 July 2012 (UTC)
 * (1) Yes http://toolserver.org/~enwp10/bin/list2.fcgi?run=yes&projecta=Biography (2) I strongly suggest tying VIAF to Persondata, which should already be present in most single-person biographies and absent from similar things that aren't (e.g. Bonnie and Clyde, Trial of Xiao Zhen, etc.) (3) If you have (or are looking for) a gender notation, I suggest you start by talking to the folks at WikiProject LGBT studies, without whom any such scheme is unlikely to fly. Stuartyeates (talk) 01:21, 16 July 2012 (UTC)
 * (1) Thanks for that link. It is not really what I was looking for, though. I'm aware of WikiProject Biography, but what I want is a way to search and sort through the single-person biographies on Wikipedia (having band articles and multiple person biographies mixed up with the others means the dataset provided by the WikiProject Biography talk page tags is not clean enough). Currently this depends on being able to detect articles with Persondata in them, or being able to bundle together all the birth year categories (there is no single category containing all the articles in the birth/death years category tree). This requires a level of technical expertise beyond most people (an example is here). Have a look at Biographical metadata (a page I started a while ago) and look at the Personensuche tool mentioned in the 'Tools' section. Having something like that available for en-Wikipedia is sort of what I'm looking for. (2) I agree about making persondata more useful, but it was discouraging when it was nominated for deletion at the start of this year - is the implementation of VIAF and the (unnecessary) mention of infoboxes part of a wider push to see Persondata converted or phased out as part of a longer-term agenda? I hope not. There is a sizeable chunk of the metadata 'community' (for want of a better word) that want to force infoboxes onto all biographical articles because that will make data handling easier (it will cause conflict in the long run with those who write biographies, rather than collate biographical data, and recognise that for some biographies and articles, infoboxes are not needed - this needs to be resolved sooner rather than later, preferably by having a (rare) 'non-visible infobox' option to reduce conflict in such cases). 
(3) WikiProject LGBT studies will almost certainly have useful input on how to handle articles on transgendered people, but for the vast majority of articles it will be a simple male or female notation that has nothing to do with LGBT issues. You will note that the German Wikipedia 'Personensuche' tool has the option to search by gender. I suspect that was done without consulting the German Wikipedia equivalent of WikiProject LGBT studies. So to sum up, will this VIAF system help us end up with a tool for en-Wikipedia like the Personensuche one for de-wikipedia? Is VIAF going to reinforce Persondata or be part of a push to replace it? And how would we even start trying to roll out gender notation across nearly 1,000,000 articles - can VIAF help by pulling in data to Wikipedia from other databases? It could be done quickly, if crowdsourced and partially automated, but is there any desire or motivation to do that and can any VIAF roll-out help with that? Carcharoth (talk) 20:40, 16 July 2012 (UTC)
 * It seems to me unlikely that gender notation would be rolled out across 1,000,000 articles. It's a tar baby I'd be completely unwilling to touch and I would advise against its use as a motivating example for the use of VIAF, for fear of contamination. I also believe that you seriously over-estimate the power of the "collate biographical data" camp in relation to the "write biographies" camp (and the "make articles very simple to edit by removing everything but paragraph text" camp). Maybe the relative influences are different on de-wikipedia. Stuartyeates (talk) 21:16, 16 July 2012 (UTC)


 * 2) - There is absolutely no intention of replacing or supplanting persondata with identifier links, and I hope that the identifiers (as in de.wp) will coexist happily alongside persondata and infoboxes, as three separate approaches to structured data with their own strengths and weaknesses. All three may, in the long run, change as a result of the Wikidata system - but that's not something we're aiming to address here.
 * 3) - Note that the German persondata supports gender alongside a pre-existing standard practice of categorising biographies by gender - de:Kategorie:Frau, etc. The English Wikipedia doesn't seem keen on systematically recording gender as metadata, and I suspect this is why en.wp's persondata doesn't currently support it.
 * In the long run, the use of authority identifiers could be used to help pull in information such as gender from other databases (though I can't immediately think of a major one which supports it and systematically records gender), but a major expansion of our metadata like this should probably go through a separate RFC. Andrew Gray (talk) 21:42, 16 July 2012 (UTC)
 * Thank you both for those answers. I particularly liked the reference to the "make articles very simple to edit by removing everything but paragraph text" tranche of opinion! :-) I had forgotten that de-Wikipedia categorises by gender. But I'm not sure why attempts to start adding gender metadata on en-Wikipedia would be a tar baby. Have there been bad experiences in the past? And can anyone think of a way to estimate the number of articles about women on Wikipedia? And to get back on topic, when I clicked on one of the VIAF examples, I got a bit lost. Is there a page explaining what the different parts of the VIAF landing page mean? Carcharoth (talk) 22:04, 16 July 2012 (UTC)
 * My recollection is that this was argued over way back in the mists of time, though I'm having some trouble finding where. It might be that an RFC on whether or not to have a set of gender-notation "tag" categories is worthwhile to settle it clearly one way or the other! (I'm not wedded to the English vs. the German approach; I can see benefits either way.)
 * German is about 85% male; Czech 85%; Swedish 81%. These are the only three large or medium sized projects I could find which have systematised the categorisation; based on this, I would gamble English is likely to be more or less the same, perhaps somewhere in the 80-85% range.
 * I will have a look for a "reading VIAF" FAQ; it's all relatively technical, though. The data presented is mainly to validate to the user that they have the correct individual, and what the standardised form of the name is. Clicking one of the names takes you to that authority record's entry for the name, which may then link to catalogues or other resources. Andrew Gray (talk) 22:29, 16 July 2012 (UTC)


 * (edit conflict) The trite answer to why gender is a tar-baby is that it's a battle-ground in the American culture wars. More interestingly, post-colonialism delegates gender identification of non-Europeans to their respective non-European cultures, leading to things like Kathoey and Fa'afafine being recognised as first class genders. How does the de-wikipedia deal with this? I have no idea how to measure how many women are in en-wikipedia, except maybe by cross-matching with de-wikipedia and then statistically adjusting the counts based on WikiProjects for people who aren't in both. Stuartyeates (talk) 22:33, 16 July 2012 (UTC)
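
One possible reading of the cross-matching idea above can be sketched as follows. This is only an illustration under stated assumptions: it supposes that, for each WikiProject, we know how many en-wikipedia biographies were matched to a de-wikipedia article categorised as female or male, and how many had no match. The function name, field names and numbers are hypothetical, and applying the matched share to the unmatched articles is just one simple adjustment among several that could be used.

```python
def estimate_female_count(groups):
    """Estimate the number of biographies about women across all groups.

    Each group is a dict with counts of matched female, matched male and
    unmatched biographies. The unmatched articles in a group are assumed
    to have the same female share as the matched ones in that group.
    Groups with no matched articles are skipped, since no share can be
    computed for them.
    """
    total = 0.0
    for group in groups:
        matched = group["matched_female"] + group["matched_male"]
        if matched == 0:
            continue
        share = group["matched_female"] / matched
        total += group["matched_female"] + share * group["unmatched"]
    return total


# Hypothetical counts per WikiProject; not real data.
groups = [
    {"name": "Project A", "matched_female": 20, "matched_male": 80, "unmatched": 50},
    {"name": "Project B", "matched_female": 30, "matched_male": 30, "unmatched": 10},
]

estimate = estimate_female_count(groups)
print(round(estimate))
```

The estimate is only as good as the assumption that matched and unmatched biographies within a project have a similar gender balance, which may well not hold.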

-- Wandering off-topic but to address some of the issues raised above...

The problem with persondata (and the suggested "hidden infobox") is that it is hidden. Infoboxes make the same data available to machines, by marking up the visible data using microformats - if the visible data is changed, the emitted metadata changes at the same time. As for "recognise that for some biographies and articles, infoboxes are not needed", that's an expression of a debatable opinion, not a recognisable fact. Andy Mabbett ( Pigsonthewing ); Talk to Andy; Andy's edits 14:56, 17 July 2012 (UTC)


 * "that's an expression of a debatable opinion, not a recognisable fact" - actually, according to Manual_of_Style/Infoboxes, it is determined through discussion and consensus among the editors at each individual article. Note that bots and scripts do not participate in discussion and consensus, and their operators must determine the consensus before using them. Stuartyeates (talk) 21:22, 17 July 2012 (UTC)

Time to close
Would it be a good idea to hatnote this project and talk page as "closed", and direct future comments to, say, Template talk:Authority control or Wikipedia talk:Authority control? Andy Mabbett ( Pigsonthewing ); Talk to Andy; Andy's edits 20:06, 16 November 2012 (UTC)

Add no VIAF template if based on unreliable sources

 * No VIAF template should be added if its sources include IMDB, NNDB, or another user-generated database. Just skip it, or add "VIAF pending" or some such.
 * We should not be surrendering any "authority" of identity verification to an external entity which:
 * Does not apply at least our requirements for source reliability
 * Is unaccountable. There's no change history to see who added a flawed source.
 * Has no issue creation/tracking
 * Does not reply to correction queries


 * Further, all such VIAF database entries should be purged at VIAF, with the sole exception of "Verified" IMDb entries, which are signed by legal entities assuming responsibility for the data. --Lexein (talk) 11:28, 21 November 2012 (UTC)


 * Wikipedia is not "surrendering any authority"; the VIAF template is simply a link between our page for an entity and an external service's identity for that entity. No editorial control or input is given to data from VIAF! In the general case, the way cataloguing rules work means that a lot of linked identities will have unusual spellings or orthographies compared to Wikipedia's "common English" name, particularly with Cyrillic transliterations; there's absolutely no intention for us to switch to these forms, and indeed it's more likely that they'll pick up ours.
 * This seems to stem from your comments about Norm MacDonald; I suspect in this case that a cataloguer has seen two variants and picked one. I don't think it's being used to advance a particular POV; it just hasn't caught up with an error which (as the article talkpage notes) was fairly widespread. The identity of the subject is fairly clear in both cases, so from an identity perspective it seems a good match.
 * The use of IMDB in the LoC entry is a good example of how records come into existence. The general system is that an authority entry is created the first time a cataloguer needs to cover that person. They draw information from the item in hand - in this case, Billy Madison - and often it simply gets left there. However, it's encouraged to include some disambiguation information, which is usually a birth year... hence why IMDB was consulted. (I suspect Macdonald was not in many print reference works in 1998, when the record was formed).
 * What is particularly interesting is that they've weighted the two sources here and chosen to go with the spelling from the "real" one - the IMDB cite lists "found: Internet movie database, Nov. 24, 1998 (Norm Macdonald, b. Oct. 17, 1963; actor)". Unfortunately, in this case IMDB was right, and the misspelling was present in the original, but still - it's illustrative that IMDB is being used as supplementary information and not as sole authority. Andrew Gray (talk) 12:35, 21 November 2012 (UTC)
 * Interesting. Thanks for the extended response. I hope the VIAF folks pay attention to "verified" vs "non-verified" IMDb content in future, and deprecate unverified content. I also hope this non-proliferation of orthography (from VIAF to WP) is asserted as the plan, even in template docs, to address concerns such as mine. The discussion over the capitalization of Macdonald's last name was particularly long and arduous, with many long-term, but inconsistent, sources such as TV Guide and even the New York Times making things worse. --Lexein (talk) 13:40, 21 November 2012 (UTC)
 * Orthography is a real issue in LoC data with non-English material; they love tie characters, which no-one else ever uses. I used to tear my hair out when dealing with, for example, "Solzhenit︠s︡yn, Aleksandr Isaevich". There's also no single VIAF standard; each cataloguing authority uses its own forms, so the record will often show three or four variants (or more!). The project FAQ addresses this and notes that the content is intended to be distinct. Andrew Gray (talk) 13:48, 21 November 2012 (UTC)