User:Phil wink/Quantitative scansion code

Introduction
Scansion — the analysis and (usually) graphic notation of the metrical structure of verse — is a quagmire. It promises simplicity and clarity, but often delivers … lunacy. My first goal is to point out as many problems as I can right at the beginning, so that as this property is implemented there will be as few surprises as possible. Secondarily, I try to suggest paths that I feel overcome these problems. Disagreements about scansion are often the result of people talking past each other because of unstated assumptions about what it really is or should do; so I have tried to be explicit about both the facts (as I understand them) and my opinions. But I'm happy to try to clarify more.

I have a variety of ambitions for this property. They cannot all be met perfectly, so tradeoffs are inevitable. One ambition is that it be simple: The fewer characters, and the fewer exotic characters, the better. It should also be powerful: Our work should be applicable to as many germane languages, periods, traditions, and wiki uses as possible. It should be optimized both for humans and computers: Our scansion should be so tightly defined that it can truly function as data, yet it must communicate to humans who have typographical traditions and visual needs of their own.

The scope of the proposed property is quantitative meter, so
 * 1. We are not scanning individual lines, but mapping the pattern that all lines in a given meter follow.
 * 2. We are completely ignoring syllabic, accentual, accentual-syllabic, tonal, parallel, free, and all other types of verse which aren’t quantitative verse.
 * 3. Conversely, we will attempt to accommodate all types of quantitative verse by addressing the known challenges of the 3 central traditions: Greek/Latin, Sanskrit, and Arabic.

But first, let us "dot our i's and cross our t's — and mind our p's and q's"...

Preliminary criteria

 * 1. Symbols likely to display correctly under most conditions (font/browser/operating system)
 * 2. Monospaces correctly (really, a special instance of #1)
 * 3. One prosodic element = one character: avoid formatted characters, combining characters, repeated characters
 * 4. Easy to type, edit
 * 5. Visually meaningful; as far as possible, reflect scansion tradition

Brill symbols
Starting with the symbols listed in Brill's notes (Rietbroek 2008), I tested Criterion #1 — whether they'd actually display — over several relatively standard and capacious fonts I happened to have on my computer (Arial Unicode, Calibri, Consolas, Courier New, DejaVu Sans, Liberation Sans, Linux Libertine G, Lucida Sans Unicode, TeXGyreScholia, Times New Roman). Results were broadly similar. If the symbol usually appeared correctly across fonts, I gave it a Y or N in the 1:1 column, based on whether it passed Criterion #3; an asterisk (*) indicates that the symbol was generally supplied by a math font, not by the specified font (if no Y/N/* appears, this means the symbol failed to appear correctly across most tested fonts). I also downloaded and tested New Athena Unicode in which, as expected, all symbols worked. Note that the description is from the PDF and does not always correspond to the Unicode tag. Using this table, you can see for yourself which symbols do or do not render on your system(s).

Taking "Y" symbols as our first cut of "usable symbols", we note that many of these will likely not be very useful in a reasonably simple quantitative scansion, while some quite foundational symbols (like the longs & shorts over each other) are absent or rare.


 * "Y" symbols — first cut

×–|‖∫~¨

Additional symbols
At this point we may already begin to wonder if the solution must involve a distinct underlying code + transformed human-readable display. But let us first catalogue a few other symbols that might be of use.

Monospacing
Symbols which consistently render correctly in monospaced fonts are extremely valuable when demonstrating scansion online, as this allows WYSIWYG editing:

– ˘  ˘ –  ˘   ˘ – |  – –   –    – ˘  ˘  – –
Arma virumque cano, Troiae qui primus ab oris

To the extent that the symbols used in Wikidata's claims should match those used elsewhere in the Wikimedia Empire, then symbols likely to monospace well should also be used in Wikidata. Consider the "union" symbol:

1234567890 ∪∪∪∪∪∪∪∪∪∪

On my browsers, anyway, this supposedly monospaced character renders at about 1.2 times the standard character width! This is because it's actually being replaced with a character from a (non-monospaced) math font. So (1) "union" is bad at monospacing, (2) the proper Unicode metrical breve is largely unavailable, and (3) the crappy little breve above is too high and shrimpy. This pretty much leaves us with (gasp) "u":

– u  u –  u   u – |  – –   –    – u  u  – –
Arma virumque cano, Troiae qui primus ab oris

A truly ugly solution, but universally displayable, and easily entered.

Latin and Greek
Halporn, Ostwald, and Rosenmeyer (1980, pp 3-4, 61-62. Hereafter "HOR".) use the same symbols whether scanning Greek or Latin verse:

Nasty units
However, these symbols are combined ad lib to form a wide variety of compound symbols. The scansions discussed below are not just theoretically possible, they actually occur in HOR (and not as crazy exceptional cases).




 * Left (happy little emoji): This symbol (which, by analogy to anceps and biceps, I will call triceps) indicates a position which may hold 1 long or 1 short or 2 shorts. It does not occur in Unicode, but I believe it is necessary that Wikidata somehow come to terms with this unit, as it is a standard component of verse used by Plautus and Terence (and surely others). For display purposes, the best (non-LaTeX) approximation I can come up with is a combining double breve over two symbols, which renders as ∪͝∪ or ... probably a bit better ... u͝u. (A similar trick I find less appealing combines a double breve below with a double macron above, rendering u͜͞u.) But even if these are workable display solutions, they destroy any sense of the value as code.
 * Center (no skateboarding): This indicates a biceps (the 2 shorts in a dactyl which may be replaced by 1 long) which may optionally have a break within it (if, of course, it's the 2 shorts). I assume that an effect like this would require LaTeX. I will argue later that Wikidata may be best off ignoring all optional breaks.
 * Right (park bench with double rainbow): This is another biceps which is bridged with the following long … but of course depending upon which way the biceps is realized, it can be a long-long bridge or a short-long bridge. I assume this too would require LaTeX. I will also argue against coding for bridges in Wikidata.

Principled limitations in Greek and Latin
Wikidata can decide to code absolutely all structure in a verse line type, but the cost seems great, both for humans trying to edit the code, and browsers trying to render it.

Several principled limitations will greatly reduce these difficulties, without (I think) much compromising the verse structure:
 * 1. Code no bridges. The Greeks and Romans didn't know they had them (even though they did); why should we spend a combining character on them? Furthermore, at least some bridges are genre-specific, so if we were to code them, we'd either have to create distinct items for (say) tragic iambic trimeter versus non-tragic iambic trimeter — or we'd have to take the further step of distinguishing optional bridges... a bridge too far!
 * 2. Code no optional breaks. While mandatory breaks (like that in the middle of the 2nd line of an elegiac couplet) should be retained, optional breaks can be viewed as slightly less a feature of verse structure and slightly more a feature of a poet's style. And (as we saw above) when they occur in the midst of a biceps, they may be difficult to render.
 * 3. Do not distinguish the more likely syllable type. Anceps (×) and biceps positions are sometimes distinguished as being primarily filled with short, secondarily with long (or vice versa) by displaying the secondary symbol above the primary one. Let us simplify our character set by not caring which is more common.
 * 4. Do not code the 1%. Since we are coding line types, not individual lines, we aim to code all possibilities. In exceptional cases, established rules are broken, either by incompetence, miscopying, or by inspired willfulness. Let us not let outliers stand in the way of coding the norm.
 * 5. Code no foot/metron divisions. Many texts (especially introductory texts) graphically divide a line into its component feet or metra (Arma vi|rumque ca|no, Troi|ae qui |primus ab | oris). My sense is that more academic publications move away from this. For our purposes, ignoring foot/metron breaks eliminates a lot of clutter and frees up that break symbol (which would be either | or /) for use elsewhere. (This is probably the least important of my "principled limitations" and may well be reversed, especially considering the probable utility of foot/pāda markers in Sanskrit and Arabic verse).

We'll summarize just what elements are really needed, and what symbols to put to them, after addressing quantitative scansion in some additional languages.

Sanskrit
Closely analogous to Greek and Latin (G/L), Sanskrit measures heavy and light syllables. Unfortunately, the typical symbols are reversed:

Sanskrit also has its own traditional symbolic scansion, using Devanagari letters to indicate series (gaṇas) of 3 syllables. A section with syllable count not divisible by 3 results in a series of letters + 1 or 2 guru or laghu symbols tacked on at the end.

I do not know to what extent gaṇa (Sanskrit letter) and symbolic (light=| heavy=∪) scansions are still used; but these and standard G/L scansion should in principle be able to be losslessly transformed into one another. However, as far as I've seen, Sanskrit notation has no letter or symbol for anceps or any other type of position which allows multiple possibilities. This suggests that Sanskrit notation is in fact optimized for scanning realized verses, not the underlying metrical form. If this is the case, then while Sanskrit notation can be losslessly transformed to G/L, the reverse is not true, and our project — to scan the underlying meters — might require a G/L foundation.

Many types of Sanskrit verse (e.g. the shlokas of Mahabharata and Ramayana) typically mark the end of the first half-verse with | (#13) and the end of the second half verse with ‖ (#14, or 2 × #13). Sanskrit verses tend to be quite long, and in European versification would usually be conceptualized as couplets or short stanzas; even in Sanskrit they may be printed as 2, 3, or 4 lines.

The Indian elephant in the room
There is a big challenge with coding some Sanskrit verse, which is not seen (or seen rarely) in Greek or Latin. Consider one of the most important Sanskrit meters, the Anuṣṭubh. It comprises 4 pādas of 8 syllables each. Here are the scansions of 2 instances of (I think!) well-formed pādas:

u – u – u u u u
– u – – – – – –

We logically deduce that the underlying metrical structure must be:

x x x – x x x x

This would give us 2⁷ = 128 variants per pāda. But there aren't that many. This is because the acceptable values of an anceps are dependent upon the values of the other anceps within the pāda. Now different genres and periods can have slightly different dependency rules, but I will tabulate the list given by Brown (1869, pp 5-6). Assume that the 1st and 8th positions are independent: they can be heavy or light in any case. We are left with 2 sets of 3 positions each (that is, 2 gaṇas), which we can crosstabulate. "1" indicates the combination is valid for the 1st pāda, "2" for the 2nd pāda. For the 1st pāda only 18 out of the expected 64 possibilities are well-formed — for the 2nd pāda, only 5.
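The deduction of the underlying pattern, and the count of its theoretical variants, are mechanical enough to automate. A minimal sketch in Python (the ASCII stand-ins 'u', '-', and 'x' are my assumption, not a settled code):

```python
def underlying_pattern(instances):
    """Deduce an underlying metrical pattern from attested scansions:
    positions where all instances agree keep their symbol; positions
    that vary become an independent anceps ('x')."""
    return ''.join(
        position[0] if len(set(position)) == 1 else 'x'
        for position in zip(*instances)
    )

# The two attested paadas from above, as ASCII strings:
pattern = underlying_pattern(['u-u-uuuu', '-u------'])
print(pattern)                  # xxx-xxxx
# If the ancipitia were truly independent, the variant count would be:
print(2 ** pattern.count('x'))  # 128
```

Of course this only derives the naive, independent-anceps pattern; the dependency rules tabulated below are exactly what such a simple deduction misses.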

That's the first half-line; the second half-line exhibits exactly the same structure. How on earth to scan this efficiently will be addressed later. In the meantime, we merely observe that a simple formulation like

x x x x x x x x ¦ x x x x u – u x |

is true (in a sense) but does a very poor job of describing the actual metrical structure. Whereas displaying 18×5=90 distinct scansions per half-line is probably not a great idea either!

Mātrika (moric) meters
An entire category of Sanskrit verse is structured rather differently, though still quantitatively. Moric verse — rather than specifying valid strings of light and heavy syllables as we have encountered above — simply specifies a total length value for a given segment of verse, where light=1 mora, and heavy=2 morae. Evidently moric verse (mātrika in Sanskrit) is especially prevalent among Prakrit (related vernaculars) prosodies. Let us take the Āryā meter. It comprises 4 pādas with lengths 12-18-12-15. This might be symbolically represented:

12 morae ¦ 18 morae | 12 morae ¦ 15 morae ||

or, more granularly:

m m m m m m m m m m m m ¦ m m m m m m m m m m m m m m m m m m | m m m m m m m m m m m m ¦ m m m m m m m m m m m m m m m ||

where m=1 mora. The number of permutations of light and heavy syllables which would fill these pādas … well, I don’t want to think about it. However, it appears that in reality, these large pādas are composed of smaller measures of 4 morae, which by definition can only be realized in 5 ways:

m m m m  =  u u u u   or   u u –   or   – u u   or   u – u   or   – –
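The claim that a 4-mora measure has exactly 5 realizations is easy to verify by brute force. A small Python sketch (with 'u' = light = 1 mora and '-' = heavy = 2 morae, my ASCII stand-ins):

```python
def realizations(morae):
    """All strings of light ('u' = 1 mora) and heavy ('-' = 2 morae)
    syllables that total exactly the given number of morae."""
    if morae == 0:
        return ['']
    # A light syllable leaves (morae - 1) to fill...
    result = ['u' + rest for rest in realizations(morae - 1)]
    # ...a heavy syllable leaves (morae - 2).
    if morae >= 2:
        result += ['-' + rest for rest in realizations(morae - 2)]
    return result

print(realizations(4))       # ['uuuu', 'uu-', 'u-u', '-uu', '--']
print(len(realizations(4)))  # 5
```

(Incidentally, the counts for 1, 2, 3, 4, 5... morae run 1, 2, 3, 5, 8...: the Fibonacci numbers, which gives a sense of how fast the permutations of a whole 12- or 18-mora pāda would grow without the 4-mora constraint.)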

This suggests that coding for dependent anceps (discussed below), or something very similar, could be used to represent this meter, at least in the data. However, how this is optimally displayed is still an open question. Gasparov's clever symbol can be approximated with a combining double macron, which renders uu͞uu; or it can be quite closely duplicated with more exotic combining characters. Other less fussy options include:

uuuu uuuu  uuuu ¦ uuuu  uuuu  uuuu  uuuu  uu | uuuu  uuuu  uuuu ¦ uuuu  uuuu u uuuu  uu ||

or, more efficiently (and I think more clearly):

4m 4m 4m ¦ 4m 4m 4m 4m uu | 4m 4m 4m ¦ 4m 4m u 4m uu ||
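As a sanity check on this compact notation, the pāda totals can be recomputed mechanically. A Python sketch (the token values — "4m" = 4 morae, "uu" = 2, "u" = 1 — are my reading of the notation above):

```python
# Assumed mora values for each token of the compact notation:
MORAE = {'4m': 4, 'uu': 2, 'u': 1}

def pada_lengths(scansion):
    """Split a compact moric scansion at its dividers and
    total the morae of each paada."""
    cleaned = scansion.replace('||', '').replace('¦', '|')
    return [sum(MORAE[token] for token in part.split())
            for part in cleaned.split('|')]

print(pada_lengths('4m 4m 4m ¦ 4m 4m 4m 4m uu | 4m 4m 4m ¦ 4m 4m u 4m uu ||'))
# [12, 18, 12, 15]
```

The result matches the Āryā's defined lengths of 12-18-12-15, so at least arithmetically the compact form is lossless.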

I have several open questions about mātrika, and further research is needed. So my proposed solutions below only provisionally account for this metrical category.

Arabic
This section will only look at Classical Arabic verse, predominant from around the time of Muhammad through the mid-20th century. Pre-classical and modern Arabic verse may present other issues. Arabic prosody immediately presents 3 incompatibilities with Greek/Latin/Sanskrit practice: direction of text, prosodic units, and scansion (the latter 2 will be treated together).

Direction of text
Arabic is written right-to-left (RTL). Of course romanized Arabic runs LTR. What order should a symbol string representing the meter run in? All text and scansions in this essay will run LTR, and it is my tacit assumption that Wikidata's values will also run LTR — though this is open to discussion. Assuming WD's values do run LTR, if it is desirable to be able to display these scansions RTL under some circumstances, then what are our options? A second property? Automated transformations (e.g. via template)? Manual transformations?
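As a proof of concept for the "automated transformations" option, reversing symbol order is trivial in code; a template or Lua module could do the same. A sketch in Python (note this only reorders symbols, and deliberately ignores Unicode bidirectional rendering, which a real RTL display would also have to handle):

```python
def to_rtl(scansion):
    """Reverse the token order of a left-to-right scansion string.
    Symbol-order reversal only; no Unicode bidi handling."""
    return ' '.join(reversed(scansion.split()))

print(to_rtl('u - - | u - - -'))   # '- - - u | - - u'
```

Because the stored LTR value is untouched, the RTL view stays a pure display transformation, as advocated below.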

I will not address Persian verse in this essay, but it is (to my knowledge) the only other major language which features both RTL writing and quantitative prosody (there may be others). Persian verse is closely modeled on Arabic and, for the present, it is assumed that a system that supports Arabic scansion will provide a firm basis for Persian.

Prosodic units and scansion
Contrary to the examples of Greek, Latin, and Sanskrit, Arabic prosody does not conceptualize its verse to be composed of syllables, but of hierarchical groups ultimately defined by phoneme types. To briefly describe the prosodic system, moving from small to big, where C=consonant, v=short vowel, and V=long vowel…

The ḥarf is the minimal linguistic unit. It is either moving (Cv) or quiescent (C or V). The ḥarf is the basic unit of time, so is essentially equivalent to "mora". However, whereas in other prosodies (e.g. Latin, Japanese) a single mora can function as a prosodic unit, in Arabic the ḥarf is too brief, so its Elementary Prosodic Units (EPUs) are composed of multiple ḥarfs. The minimal EPU is CvC = 2 ḥarfs = "long syllable" (though, as stated, this concept is foreign).

The prosody admits 8 metrical feet, each composed of 1 peg + 1 or 2 cords. Arabic verse has no traditional abstract system of scansion. Metrical form is traditionally represented, not symbolically, but verbally by means of 8 mnemonic words which exemplify the feet.

*Theoretically defective, but necessary to explain some verse forms.

Like Sanskrit verses, Arabic verses tend to be quite long, with major divisions within them. Here are 3 different scansion views of the Ṭawīl verse:

ḥarfs  3  2    3  2 2    3  2    3  2 2    3  2    3  2 2    3  2    3  2 2
G/L    u – – | u – – – | u – – | u – – – ! u – – | u – – – | u – – | u – – – #
Mnem   faʿūlun mafāʿīlun faʿūlun mafāʿīlun faʿūlun mafāʿīlun faʿūlun mafāʿīlun

In European versification, this structure would tend to be thought of as a couplet.

Representation
Though Classical Arabic prosody does not use the concept of the syllable, all metrical structure can be reduced to short and long syllables. So using Greek/Latin symbols should render the structure losslessly. I assume that this will be our primary method. However, if it is also desirable to display Arabic meters with their traditional scansion, it may be necessary to keep the foot divisions in the G/L-style scansion (contra my G/L notes above), as this will help identify the correct mnemonic words. It is my understanding that ḥarfs, pegs, and cords, while essential furniture in Arabic prosody, are not considered part of the scansion.

This raises the possibility of 4 distinct displays for Arabic scansion:
 * 1) LTR Greek/Latin symbols
 * 2) RTL Greek/Latin symbols
 * 3) LTR Arabic mnemonics (romanized)
 * 4) RTL Arabic mnemonics (Arabic script)

Moving forward

 * Because the Sanskrit and Arabic systems are to some degree deficient in symbols descriptive of the underlying meter (as opposed to realized verse lines), I propose that Wikidata's values be founded upon the Greek/Latin system, though some modifications may be necessary.
 * Because of the difficulties of displaying certain units, I am advocating for a system in which the Wikidata values are strings of Code, which can be called up by templates which transform them into strings of the Display values. If later it is determined that better Display values are available, or if the Unicode options become sufficiently common, then these transforming templates can be updated and all displayed instances of the scansions will be automatically updated, without having to alter the underlying values. This also leaves the door open to the creation of additional transforming templates to display the coded scansions in formats native to Sanskrit, Arabic, or other prosodies (though as discussed above, the native display of Sanskrit underlying meters is questionable).
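A minimal sketch of that Code → Display separation, in Python rather than template syntax (the particular symbol choices here are illustrative assumptions, not a settled proposal):

```python
# One place to change if better display characters become available:
CODE_TO_DISPLAY = {
    'u': '\u23D1',   # METRICAL BREVE (where it renders at all)
    '-': '\u2013',   # en dash standing in for the longum
    'x': '\u00D7',   # independent anceps
    'X': '\u00D7',   # dependent anceps: displayed like 'x' for now
    '|': '|',        # mandatory break
}

def display(code):
    """Transform a stored ASCII code string into its display form."""
    return ' '.join(CODE_TO_DISPLAY.get(char, char) for char in code)

print(display('x-uu-uu-x|'))
```

Because the stored values never change, swapping in better display characters later (or adding a second mapping for, say, a Sanskrit-native display) means editing only the mapping, exactly as described above.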

Minimal prosodic units
In the table below, I try to answer these questions:
 * 1. Regardless of details of coding or display, what are the minimal prosodic units which must be included if we wish a single system to accommodate Greek, Latin, Sanskrit, and Arabic quantitative scansions?
 * 2. In a perfect Unicode world, what symbol is optimally mapped to the unit?
 * 3. In the existing world of unpredictable browsers and fonts, how do we optimally display the unit?
 * 4. Assuming that the Wikidata value will need to be transformed into different characters for display anyway, what are the optimal code values?

Dependent anceps
To my knowledge, the dependent anceps problem applies chiefly to Sanskrit verse, but this discussion may be germane in other situations.


 * The easy path

We may choose not to solve the problem, and simply display:

x x x x x x x x ¦ x x x x u – u x |

even though we know this is a poor description of the actual metrical structure. After all, this complex situation can never be fully explained with a mere scansion; this is what articles are for. However, even if we take this path of avoidance, I recommend that we still keep a discrete code for "Dependent anceps" (e.g. capital X), and just transform this code to the same display value as "Independent anceps". This is because if we later decide to make the system more robust, at least some of the required structure will already be there.


 * The hard path

The other option is to attempt to characterize the dependencies in the scansion code. Whether and how these complex dependencies are displayed is another question. Certainly our initial transforming templates would simply ignore this information, so that to the viewer there would initially be little or no difference between the easy and hard paths … the difference would be hidden within the coded value. The display might either be identical to the above example, or just subtly different, say:

x x x x x x x x ¦ x x x x u – u x |

or

x X X X X X X x ¦ x X X X u – u x |

However, the code would contain robust information. (For now, I am using the gaṇas for shorthand, since they are germane for Sanskrit; if we pursue this hard path, we'll have a fuller discussion of how this will be managed.) The slightly easier method is to specify each half-pāda's valid possibilities:

This correctly limits which gaṇas could appear in a given position, but fails to account for the interaction between the 2 gaṇas within the 1st pāda. To do this, more extensive coding would be required:

This perfectly reflects all dependencies (at least according to Brown). A more hierarchical approach is probably superior; in that case, we'd get something like this:

All of these examples scan only the 1st half-verse (though the second is a verbatim repeat of the first). Just for reference, the above formulation is real, but I suspect it to be one of the worst-case scenarios of complexity. But no promises. Naturally, all hard paths require additional reserved code characters.


 * Display

I think it would be valuable to house this complex dependent anceps information in our code, even if we find no adequate way to display it. At the moment, the only display solution that makes sense to me is a modification of Quinn's (see Phalaeceans, below). I propose the following rules:
 * 1. Begin the scansion with the entire verse on 1 line (no line breaks, even at !). This scansion displays the simple “X” (or whatever we decide to use) for each dependent anceps.
 * 2. Report out each variant segment (but only the variant bits) on its own line.
 * 3. When multiple sets of variants need to be reported, and they are not mutually dependent, extend broken bars as far down as they both run together.
 * 4. When one defined set of variants is repeated, do not repeat the list, but label the initial list, and repeat the label.

Given this system, Brown's Anuṣṭubh (again, this is probably a near-worst-case scenario) would be displayed:

x X X X X X X x ¦ x X X X u – u x | x X X X X X X x ¦ x X X X u – u x ||
  (A)           ¦   (B)              (A)              (B)
  – – – u – –   ¦   – – –
  – – – – u –   ¦   u – –
  – – – – u u   ¦   – – u
  – – – u u u   ¦   u – u
  u – – u – –   ¦   – u u
  u – – – u –
  u – – – u u
  u – – u u u
  – u – – – –
  – u – u – –
  – u – – u –
  – u – – u u
  – u – u u u
  – – u u – –
  – – u – u u
  – – u u u u
  u – u u – –
  – u u u – –

This system may work equally well for moric verse. The Āryā might be displayed:

4m 4m 4m ¦ 4m 4m 4m 4m uu | 4m 4m 4m ¦ 4m 4m u 4m uu ||
u u u u
u u –
– u u
u – u
– –

I do not imagine these complex displays replacing the simple displays, but being an alternative we can offer, once the "simple" display is worked out.

Additional code units
To implement dependent anceps at the Code level, something like these additional Units will be necessary. (These are not used consistently in the examples above.)

Additionally, for moric verse to be displayed, we will probably have to add:

Both of these will probably be displayed as coded, e.g.  is displayed: 4M. However, implications of using dependent anceps structure with moric verse have not yet been fully vetted.

Phalaeceans and their implications
A closer look at Latin hendecasyllabics (phalaeceans here, for better clarity) highlights some issues. Consider these scansions:

x x – u u – u – u – u   (PEPP3&4)
x x – u u – u – u – x   (Cole in Wimsatt; also PEPP1&2 since you asked)
x x – u u – u – u – –   (HOR)
– – – u u – u – u – x   (Quinn)
– u
u –

First, this underlines the importance of including references in our statements. This is a well-studied and (I think) basically uncontroversial meter, but the first 4 sources I checked all gave subtly different scansions. I suspect the final syllable in PEPP3&4 is simply an error sadly perpetuated through editions. If the final syllable in HOR includes an "understood" brevis in longo, then it is functionally equivalent to Cole. But I don't know.
 * References

Second, I wonder if Quinn's minority opinion on the first 2 positions is the result of his writing specifically about Catullus’s phalaeceans, not phalaeceans in general. This encourages us to include an "as used by/in" property as qualifier for our statements. The evolution of verse practice is such that — although conforming to the same overall pattern — often one poet allows certain configurations that another does not. (As a second example, the dependent anceps formulation of Kālidāsa's Anuṣṭubh is radically simpler than that of the Epics — probably each should appear as distinct values of Anuṣṭubh: quantitative metrical pattern, one with the qualifier "as used by Kālidāsa", the other with the qualifier "as used in the Mahābhārata and Rāmāyaṇa".)
 * Qualifiers

 * Third, the most interesting difference (to me, anyway) is that Quinn specifies that the first 2 syllables are dependent anceps. (That is, if they were independent, Quinn would have to list u u as a fourth alternative — all 4 PEPPs explicitly list all 4 possible combinations in the first 2 positions; Quinn lists 3, as shown.) Thus we have a non-Sanskrit instance of a scansion that ideally would be coded with a dependent anceps scansion. I have not yet been able to confirm this with a reliable source, but the English Wikipedia's article on meter suggests that even the highly regulated Classical Arabic has an instance too.
 * Dependent anceps