Wikipedia:GLAM/Museum of New Zealand Te Papa Tongarewa/What We've Done/Myosotis Pilot



Te Papa's Myosotis pilot project wanted to find out how we could effectively and sustainably contribute to Wiki projects using our collection images, metadata, and curatorial knowledge. We used OpenRefine to load 355 images of Myosotis specimens native to Aotearoa New Zealand, creating a reusable process that involves adding well-described content, improving and creating articles, and connecting with structured metadata.

The project:
 * Loaded 355 images of Myosotis specimens
 * Added the images to articles created by one of our Botany Curators, Stitchbird2
 * Added and updated Wikidata items for many species and people related to the set
 * Created a new Commons template, Template:TePapaColl
 * Created processes to support the selection, export, transformation, and upload for our images and data

If you've got any questions, suggestions, or just want to talk about the project, get in touch with Avocadobabygirl.

This page describes the project goals, how to publish a set like this to Wikimedia Commons, and specifics of how we made it happen.

What’s Wikimedia Commons? How do we use it?
Te Papa wants the collections and knowledge we hold to be accessible and impactful for anyone who wants them. We’re building up an ongoing programme of digital outreach that works out the best (most effective, sustainable, enriching…) platforms to publish to, and then makes it happen.

By loading images and metadata to Wikimedia Commons, as well as including the pictures in Wikipedia articles and connecting into Wikidata, we put valuable and up-to-date material right where people go looking for it. Contributing to a scientifically sound article on native forget-me-nots makes Wikipedia itself more complete, but also helps people who see that information when it goes elsewhere on the internet, like iNaturalist or Google search results.

We load not just high-quality and high-resolution images, but also detailed metadata and a link back to the record on our Collections Online site. This helps the image travel with its context: detailed and useful information that makes the images easier to find, use, and interpret in all sorts of ways.

On the front end, we display descriptive metadata that makes it really clear what you’re looking at, as well as extra info that’s useful to Wikipedians and researchers. Behind the scenes we also hook in several structured data statements using Wikidata properties and items, making it easier to computationally interpret the image.

Loading all this material to Wikimedia Commons is done with OpenRefine. We use it to prepare our data, hook into Wikidata, and upload the images in bulk.

Read on to see how we select material, process images and data, and load it to the platform.

Selection criteria
Setting selection criteria and making your actual selections helps keep the size of the following work down.

Establish your criteria, using the following as a basis.

Preferably, images will also have a use case ready to go, like inclusion on specific Wikipedia articles.

It will also be easier to prepare and upload material if the records are all the same type (eg Specimen vs Object), but this isn’t required.

Image selection
Because we wanted to restrict our set to a small number of relevant and high quality images, we did a review of all images attached to the records we’d chosen.

Preparing the data for OpenRefine
Create a general list of the kinds of images you want to include. It’s good to do this as a spreadsheet including columns like:
 * record numbers
 * titles
 * species
 * locations.

Make sure that there is one row in your spreadsheet for each image.
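
If your export has one row per record with several images listed in a single cell, it needs expanding first. Here is a minimal sketch of that step, assuming hypothetical column names (`record_number`, `species`, `images`), made-up record numbers, and a semicolon separator. None of these come from our actual export, so adjust them to match yours.

```python
def one_row_per_image(records, image_column="images", separator=";"):
    """Expand rows holding several image filenames into one row per image."""
    rows = []
    for record in records:
        filenames = [name.strip() for name in record[image_column].split(separator)]
        for filename in filenames:
            if not filename:
                continue  # skip empty entries left by trailing separators
            # Copy every other column so each image keeps its record's context
            row = {key: value for key, value in record.items() if key != image_column}
            row["image"] = filename
            rows.append(row)
    return rows


# Hypothetical sample data, for illustration only
records = [
    {"record_number": "SP000001", "species": "Myosotis pansa", "images": "a.jpg; b.jpg"},
    {"record_number": "SP000002", "species": "Myosotis antarctica", "images": "c.jpg"},
]
rows = one_row_per_image(records)
```

The two records above become three rows, one for each image, each still carrying its record number and species.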

You can now open it in OpenRefine as a new project.

Filtering and faceting in OpenRefine
Use OpenRefine’s faceting and filtering tools to remove records you don’t want to include. Each record should relate to a single image. Some useful methods are:
 * Facet by species
 * Facet by specimen or catalogue record. Only keep those with multiple images.
 * Facet on empty fields
 * Facet on image metadata. For example: minimum longest edge, file type (tif, jpg), file size, creation date, filename (the filename may point to the type of image it is – specimen sheet, field image etc)
 * Facet on image creator

When you have filtered the records you don’t want to include, you can flag them using the All dropdown menu on the first column, then Edit rows, then Flag rows. When you’re done, you can then remove all flagged records from your project by selecting Remove all matching rows – it’s better to do this at the end, in case you change your mind.
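
If you want to try the same facet-then-flag idea outside OpenRefine, it can be roughly sketched in Python. This is only an illustration: the column names (`species`, `longest_edge`, `file_type`) and the 3000-pixel minimum are assumptions, not values from our project.

```python
from collections import Counter


def facet(rows, column):
    """Count how many rows share each value in a column, like a text facet."""
    return Counter(row.get(column, "") for row in rows)


def flag_rows(rows, min_longest_edge=3000, allowed_types=("tif", "jpg")):
    """Mark rows for removal rather than deleting them straight away."""
    for row in rows:
        too_small = int(row["longest_edge"]) < min_longest_edge
        wrong_type = row["file_type"].lower() not in allowed_types
        row["flagged"] = too_small or wrong_type
    return rows


# Hypothetical sample data, for illustration only
rows = [
    {"species": "Myosotis pansa", "longest_edge": "6000", "file_type": "tif"},
    {"species": "Myosotis pansa", "longest_edge": "1200", "file_type": "jpg"},
    {"species": "Myosotis antarctica", "longest_edge": "5000", "file_type": "png"},
]
species_counts = facet(rows, "species")
# Remove flagged rows only at the end, in case you change your mind
kept = [row for row in flag_rows(rows) if not row["flagged"]]
```

As in OpenRefine, flagging and removal are separate steps, so a flag can be reviewed or undone before anything is dropped.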

Review your data
After narrowing down to a subset of records, it’s a good time to review your data.

Look out for things like:
 * Values showing in the correct fields
 * Consistency – dates, spelling of names, formatting
 * Missing or additional data that should be added, for example Wikidata QIDs for associated people and taxa
 * Sensitive information – cultural, personal, location and financial data that shouldn’t be published

Ensure that data supporting image use is correct. For example:
 * Individual rights statements are consistently applied and meet the requirements of the external platform. For example, Wikimedia Commons requires images to be freely licensed or in the public domain.
 * Images are already (or queued to be) published on your own platform. This ensures users can verify that an image has in fact been officially published and is reusable.
 * Images are published at their highest resolution
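
Some of these checks can be automated before you go through the data by eye. The sketch below is one possible approach, not our actual process: the column names (`species_qid`, `date_collected`, `licence`), the accepted licence labels, and the placeholder QID are all assumptions.

```python
import re

# A QID is the letter Q followed by a positive integer
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")
# Accepts YYYY, YYYY-MM or YYYY-MM-DD
ISO_DATE_PATTERN = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def review_row(row, allowed_licences):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    qid = row.get("species_qid", "")
    if qid and not QID_PATTERN.fullmatch(qid):
        problems.append("species_qid is not a valid QID")
    date = row.get("date_collected", "")
    if date and not ISO_DATE_PATTERN.fullmatch(date):
        problems.append("date_collected is not YYYY, YYYY-MM or YYYY-MM-DD")
    if row.get("licence", "") not in allowed_licences:
        problems.append("licence is missing or not accepted")
    return problems


# Hypothetical labels and values, for illustration only
allowed = {"CC BY 4.0", "Public domain"}
good = {"species_qid": "Q12345", "date_collected": "2019-01-15", "licence": "CC BY 4.0"}
bad = {"species_qid": "12345", "date_collected": "15/01/2019", "licence": ""}
good_problems = review_row(good, allowed)
bad_problems = review_row(bad, allowed)
```

Checks like these only catch formatting problems. Sensitive information and wrongly placed values still need a human review.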

Wikidata prep
OpenRefine lets you reconcile columns of values against Wikidata items, thereby connecting each upload to structured data in all sorts of useful ways.

Reconciliation using OpenRefine

Linking up things like creators, species, what’s depicted in the image, and significant locations covers most of the things people want to know. You might also consider:
 * type status (both whether the specimen is a type, and what kind of type)
 * collection/institution it's held in
 * people involved in collecting or identifying it.

The easiest way to get a definite match is to include Wikidata identifiers – QIDs – in your source data.

Wikidata:Identifiers

Finding a QID on Wikidata
A lot of things are already on Wikidata, so there’s a good chance of finding a QID for the entity you’re working with. Sometimes, the difficult part is finding the right one.

Wikidata items are supposed to be one-to-one with a specific thing, so finding something that’s close isn’t going to be helpful. Alexander von Humboldt (Prussian naturalist) is not Alexander von Humboldt (boat), and a specimen of Myosotis antarctica subsp. traillii isn’t a specimen of Myosotis antarctica subsp. antarctica.



Start by searching from the box in the top right of Wikidata’s homepage. If the item you want doesn’t show up in the dropdown, hit enter to get a full search results page.

When looking for the right item, think about how you would be sure you’re looking at the right one:
 * Is the name at the right level of specificity?
 * Do birth/death dates, locations, associated institutions line up?
 * Has the name of the entity changed over time, with different ones being used in your data and on Wikidata?

You may find you need to do more research. If available information is scant and you can’t make a confirmed match, it may be safest to leave it out, and just use the entity’s name string instead.
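
If you have a lot of names to check, the search can also be run through the Wikidata API's `wbsearchentities` action, which returns candidate items with their labels and descriptions. The following is a minimal sketch rather than part of our workflow. The helper names are our own invention, and the sample response is hand-written with placeholder QIDs so the example runs without a network call.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(name, language="en", limit=5):
    """Build a wbsearchentities query URL for a name string."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def summarise_candidates(response):
    """Return (QID, label, description) for each search hit."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in response.get("search", [])
    ]


url = build_search_url("Alexander von Humboldt")

# Hand-written sample in the shape of a search response; the QIDs are placeholders
sample_response = {
    "search": [
        {"id": "Q111", "label": "Alexander von Humboldt", "description": "Prussian naturalist"},
        {"id": "Q222", "label": "Alexander von Humboldt", "description": "boat"},
    ]
}
candidates = summarise_candidates(sample_response)
```

The descriptions are what let you tell same-named items apart, so a match should still be confirmed by a person rather than by taking the first hit.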

Adding a new item to Wikidata
If there isn’t an item you can match, you can add your own one.

Help:Items tells you how to do that.

Create statements for the item to help make it clear what it is.



For example, a person’s record should include:
 * Instance of: human
 * Given name
 * Family name
 * Occupation
 * If you don’t have more definite information, add a contextually appropriate role here, like ‘botanical collector’
 * If it’s available, the identifier from your system. For us, this is Te Papa agent ID

See Heidi Meudt’s Wikidata page for a more filled-in example.
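
When drafting several new person items at once, a quick check that each one covers the basics can help. This sketch uses the Wikidata property IDs for instance of (P31), given name (P735), family name (P734), and occupation (P106), with Q5 as the item for human. The function and the draft structure are hypothetical, and the occupation value is a stand-in you would replace with a real QID.

```python
# Property IDs for the statements a person item should include
REQUIRED_PERSON_PROPERTIES = {
    "P31": "instance of",
    "P735": "given name",
    "P734": "family name",
    "P106": "occupation",
}


def missing_person_statements(statements):
    """List the required statements a draft person item does not yet have."""
    return [
        label
        for property_id, label in REQUIRED_PERSON_PROPERTIES.items()
        if not statements.get(property_id)
    ]


# A draft with only two statements filled in; the occupation value is a stand-in
draft = {"P31": "Q5", "P106": "<QID for botanical collector>"}
missing = missing_person_statements(draft)
```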

Wikimedia Commons prep
Categories in Wikimedia Commons (and Wikipedia) group content together and help make it findable.

When applied to uploads, it’s best to use the most specific applicable category. For example, this specimen upload is a Myosotis, but only has the Myosotis pansa category.
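
The "most specific category" rule can be sketched as dropping any category that is an ancestor of another one on the same upload. This is a simplified illustration: it assumes each category has a single parent, whereas Commons categories can have several, and the parent map here is hand-written rather than read from Commons.

```python
def most_specific(categories, parent_of):
    """Drop any category that is an ancestor of another category in the list."""
    redundant = set()
    for category in categories:
        ancestor = parent_of.get(category)
        seen = set()
        # Walk up the tree, guarding against accidental loops in the map
        while ancestor is not None and ancestor not in seen:
            redundant.add(ancestor)
            seen.add(ancestor)
            ancestor = parent_of.get(ancestor)
    return [category for category in categories if category not in redundant]


# Illustrative parent map, not taken from the live category tree
parent_of = {"Myosotis pansa": "Myosotis", "Myosotis": "Boraginaceae"}
result = most_specific(["Myosotis", "Myosotis pansa"], parent_of)
```

Here only Myosotis pansa is kept, because Myosotis is already implied by it.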

Commons:How to create new categories or subcategories

Data mapping and transformation
The data actually required to load images to Wikimedia Commons is very simple – a filename and a license statement. But it’s possible to provide a lot more data.

If including more complex data, you’ll want to use a template. Templates for some object types are much more mature than others.
 * Template:Artwork
 * Template:Specimen

Naturalis have created a more comprehensive specimen template called Biohist.

Harvesting data
With your selections and data mapping in place, you can now re-export your data in a format that’s easy to process and upload in OpenRefine.

Processing in OpenRefine
Load the fresh export of data to OpenRefine as a new project, and do a final review of your data.
 * Ensure the filenames and filepath are correct
 * Remember that some things may appear to be doubled up, as they’re covering both descriptive and structured metadata

Wikitext
Generate Wikitext for each item by transforming the Wikitext column with the following value (adjust as needed, of course):

"== ==\n" + "\n" + "====\n" + "\n" + "\n" + "\n" + "\n" + if(isBlank(cells.CategoryScientificName.value), "", "\n") + if(isBlank(cells.TypeStatus.value), "", "\n")

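The same idea, building a block of Wikitext from each row and only adding optional lines when a value is present, can be sketched in Python. This is not our actual expression or template: it uses the generic Commons Information template, the `Description`, `SourceUrl`, and `LicenceTemplate` columns are hypothetical, and turning `TypeStatus` into a category is an assumption made for illustration. Only the `CategoryScientificName` and `TypeStatus` column names come from the expression above.

```python
def is_blank(value):
    """Mirror a blank check: None, empty, or whitespace-only counts as blank."""
    return value is None or str(value).strip() == ""


def build_wikitext(row):
    """Assemble Wikitext for one upload, skipping optional lines that are blank."""
    lines = [
        "== {{int:filedesc}} ==",
        "{{Information",
        "|description = " + row["Description"],
        "|source = " + row["SourceUrl"],
        "}}",
        "",
        "== {{int:license-header}} ==",
        "{{" + row["LicenceTemplate"] + "}}",
        "",
    ]
    if not is_blank(row.get("CategoryScientificName")):
        lines.append("[[Category:" + row["CategoryScientificName"] + "]]")
    if not is_blank(row.get("TypeStatus")):
        # Assumption: type status is expressed as a category
        lines.append("[[Category:" + row["TypeStatus"] + "]]")
    return "\n".join(lines) + "\n"


# Hypothetical row, for illustration only
row = {
    "Description": "Herbarium specimen of Myosotis pansa",
    "SourceUrl": "https://example.org/record",
    "LicenceTemplate": "cc-by-4.0",
    "CategoryScientificName": "Myosotis pansa",
    "TypeStatus": "",
}
wikitext = build_wikitext(row)
```

Because `TypeStatus` is blank in this row, only the scientific name category is written.
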
Reporting and analytics
There are several tools that help gather analytics data about use of Wikipedia articles, Commons images, and more. They tend to provide a quantitative overview, so it’s good to supplement that with qualitative measures as well.

Using Wikimedia’s API to get pageviews
Wikimedia REST API documentation

This API gives you access to pretty much whatever you want to pull from Wikimedia, but what’s useful here is the pageviews data endpoint. This lets you send queries about how much use a given page is getting, customised with several parameters.

We run the following Python script monthly, creating a simple report from a couple of text files that have lists of URLs for the images and articles we want to keep track of.

import csv
import json
from urllib.parse import quote, unquote

from requests import get

headers = {"Accept": "application/json", "User-Agent": "[PUT YOUR LOGIN EMAIL HERE]"}


# Queries the API for each url, called by Report.get_views
class WikiAPI:
    def __init__(self):
        self.pageviews_base_url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

    def pageviews(self, project, access, agent, article, granularity, start, end):
        # Percent-encode the title so it is safe to use as a URL path segment
        # (decode first in case the url was copied already encoded)
        article = quote(unquote(article), safe="")
        slugs = [self.pageviews_base_url, project, access, agent, article, granularity, start, end]
        query_url = "/".join(slugs)

        response = json.loads(get(query_url, headers=headers).text)

        return response


# Takes a list of urls and query parameters, creates API queries, and writes the results to a csv
class Report:
    def __init__(self, mode=None, articles=None, start=None, end=None, granularity=None, project=None, access=None, agent=None):
        self.mode = mode
        self.articles = articles
        self.start = start
        self.end = end
        self.granularity = granularity
        self.project = project
        self.access = access
        self.agent = agent

        self.API = WikiAPI()

        if self.mode == "articles":
            self.report_file = "{start} - {end} wikipedia article views.csv".format(start=self.start, end=self.end)
        elif self.mode == "images":
            self.report_file = "{start} - {end} wikimedia image views.csv".format(start=self.start, end=self.end)

        self.open_file = open(self.report_file, "w", newline="", encoding="utf-8")

        self.write_report()

    def write_report(self):
        self.reportwriter = csv.writer(self.open_file, delimiter=",")
        self.reportwriter.writerow(["wikiUrl", "pageViews"])

        with open(self.articles, "r", encoding="utf-8") as f:
            lines = f.readlines()
            for line in lines:
                # Keep only the page title, the last part of the url
                wiki_url = line.split("/")[-1].strip()
                view_count = self.get_views(wiki_url)
                self.reportwriter.writerow([wiki_url, view_count])

        self.open_file.close()

    def get_views(self, article):
        view_count = 0
        response = self.API.pageviews(project=self.project, access=self.access, agent=self.agent, article=article, granularity=self.granularity, start=self.start, end=self.end)

        # Add up the views for each day (or month) in the date range
        if "items" in response:
            for day in response["items"]:
                view_count += day["views"]

        return view_count


# Use to set parameters for the report
def run_report(mode=None):
    # Can be daily or monthly
    granularity = "daily"
    # YYYYMMDD or YYYYMMDDHH
    start = "20221001"
    # YYYYMMDD or YYYYMMDDHH
    end = "20221031"

    # Can be all-access, desktop, mobile-app, or mobile-web
    access = "all-access"
    # Can be all-agents, user, automated, or spider
    agent = "user"

    if mode == "articles":
        project = "en.wikipedia.org"
        articles = "tracked_articles.txt"

    elif mode == "images":
        project = "commons.wikimedia.org"
        articles = "tracked_uploads.txt"

    Report(mode=mode, articles=articles, start=start, end=end, granularity=granularity, access=access, agent=agent, project=project)


# mode can be "articles" or "images"
run_report(mode="images")
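
One thing to watch for: the script above takes everything after the last "/" in each URL as the page title, which cuts short any title that itself contains a slash. A possible alternative, sketched here as a hypothetical helper that is not part of our script, is to take everything after `/wiki/` instead.

```python
from urllib.parse import unquote, urlparse


def title_from_url(url):
    """Extract the page title from a wiki page URL, keeping any slashes in it."""
    path = urlparse(url.strip()).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError("Not a /wiki/ page URL: " + url)
    # Decode percent-encoding so the title can be encoded once, consistently, later
    return unquote(path[len(prefix):])


article_title = title_from_url("https://en.wikipedia.org/wiki/Myosotis_pansa\n")
slashed_title = title_from_url("https://en.wikipedia.org/wiki/AC/DC")
file_title = title_from_url("https://commons.wikimedia.org/wiki/File:Example.jpg")
```

This assumes the tracked files hold standard `/wiki/` page URLs. Anything else raises an error instead of silently producing a wrong title.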

Use of images on Wiki project pages
Other tools let you see how categories of Commons images are used across the Wiki ecosystem, giving you a broad picture of how a set of images is being used and also letting you drill down.

We use Glamorous to check the usage of all images under Category:Collections of Te Papa.

Filtering to a date span shows a chart of views by project (such as English-language Wikipedia, Spanish-language Wikipedia, Wikidata) on the Daily views tab.



Usage is also charted on the Global file usage tab.



And the File usage details tab provides a complete breakdown of every image in the category, showing for each one:
 * Number of uses
 * Page views across projects
 * Which pages it’s linked on



Tracking contributions
It can be useful to see how interest by contributors is building, based on how active they are after significant releases or other work.

The Programs and Events Dashboard provides a combined view of multiple users' contributions. Users can be added to the overall campaign or individual events.

We’re using ours to see how staff interest is (hopefully) building as we release more material and publicise the work internally. Staff who are interested in contributing as part of their work are added to the board, and we then look at our collective impact.

Another tool we may use is Herding Sheep - the idea is to ask participants at public edit-a-thons we hold to share their usernames, so we can get an idea of what kind of session or topic inspires the most ongoing activity as an editor.

Qualitative data
Although the available tools mainly focus on raw numbers, the wider Wiki ecosystem does provide good ways to collate qualitative data, which may tell you things like:
 * What questions people are trying to answer when they go to Wikipedia
 * What sort of problems you’ve helped them solve
 * What they think is still missing

We’re keeping an eye on our user Talk pages, as well as those for articles we’ve edited and images we’ve uploaded.

Other existing channels, including our website pop-up survey and high-resolution image download questionnaire, are also being watched for relevant comments. We are currently receiving feedback through emails to individual staff, and may set up a digital outreach address to publicise as an easy point of contact.

The main trick is to actually record these comments as they’re received. Even adding them to our simple monthly reporting spreadsheet is enough to get that information aggregated, analysed, and shared with the right people.

In the future, we’re considering running observational user testing to get qualitative feedback on the specifics of how we’re using these platforms, particularly regarding user experience and content decisions.