Talk:Plain text

Merge with text file
See discussion at Talk:Text_file --NealMcB 17:32, 25 April 2006 (UTC)
 * The discussion declined the proposal – most opinions drew a distinction between plain text and non-plain text. Said: Rursus ☺ ★ 17:27, 25 June 2007 (UTC)

Bad definition
The definition is bad – the article starts nicely, explaining that plain text is text that lacks structural and typographic markers, but then it levitates into the blue and avoids explaining the details of why plain text can be a good idea, and for what purposes. I must think about how to write this better... Said: Rursus ☺ ★ 17:27, 25 June 2007 (UTC)


 * I made one try to improve para 1, but para 3 and onwards shoot wildly and imprecisely with their smoking guns!! (I believe para 4 just shoots para 1 to death!) Said: Rursus ☺ ★ 17:50, 25 June 2007 (UTC)

Citations?
The file was tagged for not citing sources? Why? There's no explanation here. Whoever requested citations for something, please add a comment on this talk page about what is to be improved. Said: Rursus ☺ ★ 13:43, 28 June 2007 (UTC)


 * Hi there, I'm more than happy to explain. For starters, consider the following language:

In computing, plain text is textual material in a computer file which is unformatted and without very much processing readable by simple computer tools such as line printing text commands, in Windows'es DOS window type, and in Unix terminal window cat.


 * May I ask where this definition comes from? It is not only unsubstantiated by a citation to a reliable source, it seems to be a circular definition that conveys little or no new information to a General Audience. "plain text is textual material" ... "unformatted and without very much processing readable" ... ?


 * Respectfully, this sounds very close to an ad-hoc definition for an already tenuous concept. It does not appear to be supported by any academic, professional, or journalistic documentation. If someone wanted to look up this definition and research it for themselves, where would they go? How would this definition survive scrutiny if challenged under WP:OR? dr.ef.tymac 13:56, 28 June 2007 (UTC)


 * Update: I've added a merge recommendation. The previous rationale for opposing the merge didn't make much sense, and there still is zero substantiation for a stand-alone article on this subject, especially the primary definition in the lead of this article. dr.ef.tymac 16:14, 30 June 2007 (UTC)

Text editors used on Unices
I would add Emacs to the editors used in Unixy environments (and also on Mac OS X, which in fact is derived from (BSD) Unix). Nobody with a functional mind uses "ed" for real-world editing other than filtering; Emacs would be much, much more appropriate in this list, as its use is more widespread and it is much closer to what you would call a text editor on a modern computer system. —Preceding unsigned comment added by Sebastian42 (talk • contribs) 17:56, 26 November 2009 (UTC)

This makes no sense
[6 bits means] 64 characters -- which leaves no room for lower-case letters after you assign codes for A-Z, 0-9, and even one other code for space, punctuation, etc. I make 26 uppercase characters (A-Z), 10 numerals and "one other code" for punctuation a total of 37; a long way from the 64 possible characters assignable in 6 bits. 64's enough for a complete lowercase set, or without lowercase 6 bits allows a whole festival of 28 punctuation characters. With lowercase there's space for 26 uppercase, 26 lowercase, 10 digits and 2 for punctuation, includingtheratherusefulspace. Tonywalton Talk 23:29, 29 April 2012 (UTC)
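For anyone checking Tonywalton's arithmetic, here is a minimal Python sketch of the counting (nothing here is from the article itself, it just redoes the sums):

```python
# 6 bits give 2**6 = 64 distinct codes.
total = 2 ** 6

uppercase = 26  # A-Z
digits = 10     # 0-9

# Without lowercase: codes left over for space and punctuation.
remaining_without_lowercase = total - uppercase - digits
print(remaining_without_lowercase)  # 28, the "festival of punctuation"

# With a full lowercase set as well:
lowercase = 26  # a-z
remaining_with_lowercase = total - uppercase - lowercase - digits
print(remaining_with_lowercase)     # only 2 codes left, e.g. space plus one mark
```

So the quoted claim that 64 codes leave "no room" after A-Z, 0-9 and one extra code does not add up; even with a complete lowercase set, two codes remain.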


 * The subject statement also ignores the use and history of alternate code pages to stuff more characters into 6 bits. An example is the TeleTypeSetter ("TTS") telegraphy code introduced in 1928, which was broadly implemented in the newspaper publishing industry. Because it was inherently digital (6-bit) and the publishing industry was willing to pay for development, most fundamental word processing technology was developed using TTS, beginning in the early 1950s. By having two code pages --- one for the shifted keyboard and one for the unshifted keyboard (which requires only two characters to encode) --- the number of characters available in 6 bits was nearly doubled to 126 characters plus the two shift-switch characters. See e.g., [TTS character set image]. Marbux (talk) 03:02, 22 November 2012 (UTC)

plain language vs plain text
The telecommunication industry defined both plain text and plain language, and it recommends both International Alphabet No. 5 (and ASCII) and International Alphabet No. 2.

Shouldn't a link be drawn between the two?

Reference: http://www.itu.int/dms_pub/itu-s/oth/02/01/S020100001B4002PDFE.pdf (published by the General Secretariat of the International Telecommunication Union, Geneva, 1959, pp. 23–24). — Preceding unsigned comment added by 86.75.160.114 (talk) 21:36, 26 July 2012 (UTC)


 * Thank you for that reference. The "plain language" defined on p. 23-24 of that document is an important influence on what is now known as "plain text" today, and I agree this article should mention it. --DavidCary (talk) 14:21, 15 August 2015 (UTC)

Plain text, the Unicode definition
What's with the formatting for this section? 57.73.135.225 (talk) 03:46, 12 June 2013 (UTC)

Removing formatting
I'm not sure this qualifies for inclusion in the article, but a useful function of plain text is preventing code and commands from being inherited when information is copied from a web page or document and pasted into a new document. For example, when highlighting and copying something from a Wikipedia article, hyperlinks and font commands may also be included by the web browser. Likewise, when copying from a Microsoft Word document, a person might only want the text for another Word file, but end up with the font selection, color and other formatting that they don't need. Convert the text into plain text by pasting it into Notepad, then copy it right back out. —RRabbit42 (talk) 16:17, 26 February 2017 (UTC)

Flagging sentence that invokes a term, but then seems in direct contradiction with large sections of the full Wikipedia article that covers that term.
From current Introduction to this article, par.6, last sentence:

For example, all the problems of Endianness can be avoided (with encodings such as UCS-2 rather than UTF-8, endianness matters, but uniformly for every character, rather than for potentially-unknown subsets of it).

When I first read the article, there was no link for UCS-2, so I Googled it, discovered the Wikipedia article as the top hit, and added a link. Now, several hours and many, many readings of Wiki pages (and edits) later, I'm trying to wind up this session (but it keeps pulling me in ;'P ). When I finally got back to this article to pick up where I left off, what had stopped me in the first place was deep intrigue with the main claim of the above sentence.

Before I go further, let me first state that when implementing any standard there are often subtle differences that all too often you only come to understand far too late to avoid significant rework, while at worst simultaneously having created (and therefore now having to manage) major confusion on the part of stakeholders.

Therefore, I really do appreciate that someone has at least expressed the intention to raise what could very well be a meaningful distinction. Personally, while I had looked extensively into Unicode and the UTF standards, I had never even heard of UCS and ISO 10646. Hopefully, this comment can lead to a clear correction, or at least some discussion that others might find useful for future direction.

So, having said that, and having read through this extensively: the above sentence doesn't seem to make sense, at least not as it's presently worded. As explained in Universal Coded Character Set:

History:

par.1: The International Organization for Standardization (ISO) set out to compose the universal character set in 1989, and published the draft of ISO 10646 in 1990....

par.2: One could code the characters of this primordial ISO 10646 standard in one of three ways: .... 2. UCS-2, two bytes for every character, enabling the encoding of the first plane, 0x20, the Basic Multilingual Plane, containing the first 36,864 codepoints, straightforwardly, and other planes and groups by switching to them with ISO 2022 escape sequences;

par.3: In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO 10646. The software companies refused to accept the complexity and size requirement of the ISO standard and were able to convince a number of ISO National Bodies to vote against it.[citation needed] The ISO standardizers realized they could not continue to support the standard in its current state and negotiated the unification of their standard with Unicode....

par.4: ....ISO 10646 was limited to contain as many characters as could be encoded by UTF-16 and no more....

par.5: Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed mixed-width encoding, which came to be called UTF-8,[2] currently the most popular UCS encoding.

....

Relationship with Unicode:

par.1: Since 1991, the Unicode Consortium and the ISO have developed The Unicode Standard ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Unicode Version 2.0 exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After Unicode 3.0 was published in February 2000, corresponding new and updated characters entered the UCS via ISO/IEC 10646-1:2000. In 2003, parts 1 and 2 of ISO/IEC 10646 were combined into a single part, which has since had a number of amendments adding characters to the standard in approximate synchrony with the Unicode standard.

So, if I'm reading that correctly, ISO 10646 and Unicode are identical character sets, but Unicode is actually more complete.

However, perhaps the original intent of the sentence in question might have been to express some other differences between UTF-8 versus UTF-16?

Encoding forms:

par.1: ISO 10646 defines several character encoding forms for the Universal Coded Character Set. The simplest, UCS-2,[Note 1] uses a single code value (defined as a number, of which one or more represents a code point in general, but for UCS-2 it is strictly one code value that represents a code point) between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP that represents a character. UCS-2 cannot represent code points outside the BMP. (Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". UCS-16 does not exist; the authors who make this error usually intend to refer to UCS-2 or to UTF-16.)

par.2: The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Unicode also adopted UTF-16, but in Unicode terminology, the high-half zone elements become "high surrogates" and the low-half zone elements become "low surrogates".

par.3: .... in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate...
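To make the surrogate-pair mechanism quoted above concrete, here is a small Python sketch (the example characters are my own choices, not from the standard's text):

```python
# A BMP code point fits in one 16-bit code unit; a supplementary code
# point needs a surrogate pair (two code units) in UTF-16, which is
# exactly what UCS-2 cannot represent.
bmp_char = "\u20AC"         # EURO SIGN, U+20AC, inside the BMP
astral_char = "\U0001F600"  # an emoji, U+1F600, outside the BMP

print(len(bmp_char.encode("utf-16-be")))     # 2 bytes: one code unit
print(len(astral_char.encode("utf-16-be")))  # 4 bytes: a surrogate pair

# The surrogate pair for U+1F600, computed from the UTF-16 rules:
cp = ord(astral_char) - 0x10000
high = 0xD800 + (cp >> 10)   # "high surrogate" in Unicode terminology
low = 0xDC00 + (cp & 0x3FF)  # "low surrogate"
print(hex(high), hex(low))   # 0xd83d 0xde00
```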

Then again, it's possible that what was actually meant is that because ISO 10646 only covers character mapping, and doesn't cover implementation, it is somehow more compatible. But if I understand correctly, that is just not the case. While in some sense using ISO 10646 may be less ambiguous, that is only because it is more abstract. From what I'm gathering from going through this article (and others that add detail to components and/or history, as well as general coding experience): does UCS-2 avoid issues with "all the problems of Endianness"? Perhaps, albeit only by sacrificing other features of Unicode, as in the example below.

Differences from Unicode:

par.1: ISO 10646 and Unicode have an identical repertoire and numbers—the same characters with the same numbers exist on both standards, although Unicode releases new versions and adds new characters more often. Unicode has rules and specifications outside the scope of ISO 10646. ISO 10646 is a simple character map, an extension of previous standards like ISO 8859. In contrast, Unicode adds rules for collation, normalization of forms, and the bidirectional algorithm for right-to-left scripts such as Arabic and Hebrew. For interoperability between platforms, especially if bidirectional scripts are used, it is not enough to support ISO 10646; Unicode must be implemented.

par.2: To support these rules and algorithms, Unicode adds many properties to each character in the set such as properties determining a character’s default bidirectional class and properties to determine how the character combines with other characters. If the character represents a numeric value such as the European number ‘8’, or the vulgar fraction ‘¼’, that numeric value is also added as a property of the character. Unicode intends these properties to support interoperable text handling with a mixture of languages.

par.3: Some applications support ISO 10646 characters but do not fully support Unicode. One such application, Xterm, can properly display all ISO 10646 characters that have a one-to-one character-to-glyph mapping[clarification needed] and a single directionality. It can handle some combining marks by simple overstriking methods, but cannot display Hebrew (bidirectional), Devanagari (one character to many glyphs) or Arabic (both features). Most GUI applications use standard OS text drawing routines which handle such scripts, although the applications themselves still do not always handle them correctly.

But wait, hold the phone! Then there is this note as well:

Note 1: See UTF-16 for a more detailed discussion of UCS-2.

https://en.wikipedia.org/wiki/UTF-16

History:

par.1: .... The early 2-byte encoding was usually called "Unicode", but is now called "UCS-2". UCS-2 differs from UTF-16 by being a constant length encoding[4] and only capable of encoding characters of BMP.

....

par.4: UTF-16 is specified in the latest versions of both the international standard ISO/IEC 10646 and the Unicode Standard. "UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard." There are no plans to extend UTF-16 to support a higher number of code points, or the codes replaced by surrogates, as allocating code points for this would violate the Unicode Stability Policy with respect to general category or surrogate code points.[9] An example idea would be to allocate another BMP value to prefix a triple of low,low,high surrogates (the order swapped so that it cannot match a surrogate pair in searches), allowing 2^30 more code points to be encoded, but changing the purpose of a code point is disallowed (using no prefix is also not allowed as two of these characters next to each other would match a surrogate pair).

Description, U+0000 to U+D7FF and U+E000 to U+FFFF

par.1: Both UTF-16 and UCS-2 encode code points in this range as single 16-bit code units that are numerically equal to the corresponding code points. These code points in the Basic Multilingual Plane (BMP) are the only code points that can be represented in UCS-2. As of Unicode 9.0, some modern non-Latin Asian, Middle-Eastern, and African scripts fall outside this range, as do most emoji characters.

Okay, so as far as the official standard is concerned, that does seem rather definitive. At this point, one might think that, at least in theory, an argument could be made that the above comment might have been referring to UTF-16 and not UCS-2.

But then, 'just when you thought it was over', like a cheesy sequel to a horror movie, if we keep reading that article, we learn that, at least in the Microsoft Windows world, the implementation of UTF-16 is a whole other story, and apparently even worse for UTF-8 (unlike macOS {and *nix?}, where UTF-8 has been the basis for core text encoding at the kernel level for over a decade).

Usage:

par.1: UTF-16 is used for text in the OS API of all currently supported versions of Microsoft Windows..., it has improved UTF-8 support in addition to UTF-16; see Unicode in Microsoft Windows#UTF-8). Older Windows NT systems (prior to Windows 2000) only support UCS-2.[13] In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages.[14][15] Files and network data tend to be a mix of UTF-16, UTF-8, and legacy byte encodings.

....

And then in the same sub-section there are these fun facts. Ah, what! But hey, it is what it is.

par.10: String implementations based on UTF-16 typically return lengths and allow indexing in terms of code units, not code points. Neither code points nor code units correspond to anything an end user might recognise as a “character”; the things users identify as characters may in general consist of a base code point and a sequence of combining characters (or be a sequence of code points of other kind, for example Hangul conjoining jamos) – Unicode refers to this as a grapheme cluster[23] – and as such, applications dealing with Unicode strings, whatever the encoding, have to cope with the fact that they cannot arbitrarily split and combine strings.
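A short Python illustration of the code-unit/code-point/grapheme distinction that paragraph describes (the example strings are my own, chosen only to make the counts differ):

```python
# Python strings index by code point; UTF-16 lengths count code units;
# and neither matches what a user perceives as one "character".
s = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: displayed as a single é

print(len(s))                           # 2 code points
print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units
# ...but a user sees exactly 1 grapheme cluster.

emoji = "\U0001F600"
print(len(emoji))                           # 1 code point
print(len(emoji.encode("utf-16-le")) // 2)  # 2 code units (surrogate pair)
```

So a UTF-16-based string length (as in Java or JavaScript) and a code-point length can disagree, and both can disagree with the user-visible character count.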

It also mentions several operating systems and programming languages that still only support UCS-2, which in this case would seem to refer to the pre-1991 version explicitly.

Introduction:

UTF-16 arose from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set) once it became clear that more than 2^16 code points were needed.[1]

UTF-16 is used internally by systems such as Windows and Java and by JavaScript, and often for plain text and for word-processing data files on Windows. It is rarely used for files on Unix/Linux or macOS. It never gained popularity on the web, where UTF-8 is dominant (and considered "the mandatory encoding for all [text]" by WHATWG[2]). UTF-16 is used by under 0.01% of web pages themselves.[3] WHATWG recommends that for security reasons browser apps should not use UTF-16.[2]

Okay, that seems like something that could be important, but back to the original issue:

Byte order encoding schemes:

par.1: UTF-16 and UCS-2 produce a sequence of 16-bit code units. Since most communication and storage protocols are defined for bytes, and each unit thus takes two 8-bit bytes, the order of the bytes may depend on the endianness (byte order) of the computer architecture.

par.2: To assist in recognizing the byte order of code units, UTF-16 allows a Byte Order Mark (BOM), a code point with the value U+FEFF, to precede the first actual coded value....

par.3: If the BOM is missing, '''RFC 2781 recommends[nb 3] that big-endian encoding be assumed. In practice, due to Windows using little-endian order by default, many applications assume little-endian encoding.''' It is also reliable to detect endianess by looking for null bytes, on the assumption that characters less than U+0100 are very common. If more even bytes (starting at 0) are null, then it is big-endian.

par.4: The standard also allows the byte order to be stated explicitly by specifying UTF-16BE or UTF-16LE as the encoding type. When the byte order is specified explicitly this way, a BOM is specifically not supposed to be prepended to the text, and a U+FEFF at the beginning should be handled as a ZWNBSP character. Most applications ignore a BOM in all cases despite this rule.
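As a sketch of the detection logic those paragraphs describe, here is a hypothetical Python helper (the function name and return strings are my own invention, not any standard API):

```python
# Guess the byte order of UTF-16 data: check for a BOM first, then fall
# back to the null-byte heuristic mentioned in par.3 above.
def guess_utf16_endianness(data: bytes) -> str:
    if data[:2] == b"\xff\xfe":
        return "little-endian (BOM)"
    if data[:2] == b"\xfe\xff":
        return "big-endian (BOM)"
    # Heuristic: for mostly-Latin text, every other byte is NUL; if the
    # even-indexed bytes are the NULs, the high byte comes first (big-endian).
    even_nulls = sum(1 for b in data[0::2] if b == 0)
    odd_nulls = sum(1 for b in data[1::2] if b == 0)
    return "big-endian (guess)" if even_nulls > odd_nulls else "little-endian (guess)"

text = "plain text"
print(guess_utf16_endianness(text.encode("utf-16")))     # BOM present
print(guess_utf16_endianness(text.encode("utf-16-be")))  # no BOM, heuristic
```

Note that nothing like this is needed for UTF-8, since UTF-8 is a byte-oriented encoding with only one possible byte order.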

Okay, by this point I'm having a bit of a tough time understanding how, as the original sentence in question puts it,

"all the problems of Endianness can be avoided (with encodings such as UCS-2 rather than UTF-8,"

But then there's that second part, which states,

"endianness matters, but uniformly for every character, rather than for potentially-unknown subsets of it)."

Now, alright: since UCS-2, as well as UTF-16 for that matter, has basically been declared frozen, i.e. feature-complete, a.k.a. there just ain't no more room; and as the UTF-16 article gets into, no, they aren't going to be doing any more funny stuff to create more room, because dang it, they said they wouldn't, and so far they're sticking to it.

Well, good for them, but I'm still not clear on how UCS-2 (UTF-16) would have more "uniform... endianness" or how/where that would matter, even generally, not to mention specifically. I mean sure, maybe if UTF-8 is still evolving that could bring up new issues, but that's not the same as having some kind of underlying structural difficulties.

Maybe the article on UTF-8 might point out some issues:

From: https://en.wikipedia.org/wiki/UTF-8

Derivatives, CESU-8 (Main article: CESU-8)

par.2: Many programs added UTF-8 conversions for UCS-2 data and did not alter this UTF-8 conversion when UCS-2 was replaced with the surrogate-pair using UTF-16. In such programs each half of a UTF-16 surrogate pair is encoded as its own three-byte UTF-8 encoding, resulting in six-byte sequences rather than four bytes for characters outside the Basic Multilingual Plane. Oracle and MySQL databases use this, as well as Java and Tcl as described below, and probably many Windows programs where the programmers were unaware of the complexities of UTF-16.
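To see the six-byte behavior described above, here is a hypothetical `cesu8_encode` helper (my own sketch of the scheme from the quoted description, not a library function):

```python
# CESU-8 sketch: encode each UTF-16 surrogate separately as a 3-byte
# UTF-8 sequence, yielding 6 bytes per supplementary character instead
# of standard UTF-8's 4 bytes.
def cesu8_encode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")  # BMP characters: same as UTF-8
        else:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)   # split into a surrogate pair
            low = 0xDC00 + (cp & 0x3FF)
            for s in (high, low):        # 3 UTF-8-style bytes per surrogate
                out += bytes([0xE0 | (s >> 12),
                              0x80 | ((s >> 6) & 0x3F),
                              0x80 | (s & 0x3F)])
    return bytes(out)

emoji = "\U0001F600"
print(len(emoji.encode("utf-8")))  # 4 bytes in standard UTF-8
print(len(cesu8_encode(emoji)))    # 6 bytes in CESU-8
```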

Again, I don't have a dog in this race. I'm just trying to get a better picture of where potential issues might lie. I'm hoping we can all agree that being able to clearly describe potential issues in a neutral manner is constructive, and if not the best approach, at least generally a good place to start.

Having said that, so far the first part of that last paragraph sounds like there may be some complexity within issues converting between UTF-8 and UTF-16 for software that originally handled UCS-2. Then it goes on to finish with this last sentence:

Although this non-optimal encoding is generally not deliberate, a supposed benefit is that it preserves UTF-16 binary sorting order when CESU-8 is binary sorted.

From: https://en.wikipedia.org/wiki/CESU-8

par.1: The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26.[1] A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, like in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (3 bytes per surrogate) for each Unicode supplementary character while UTF-8 needs only four.

par.3: CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only.[2] It should be used exclusively for internal processing and never for external data exchange.

par.4: Supporting CESU-8 in HTML documents is prohibited by the W3C[3][4] and WHATWG[5] HTML standards, as it would present a cross-site scripting vulnerability.[6]

Okay, good to know, but I'm not sure if that or any other derivative implementation has anything to do with endianness being inconsistent. In any event, while internal coding issues are certainly relevant to these other pages, I imagine the relevance for a wiki page on Plain Text would tend to center around any noteworthy, documented, and hopefully verified fall-out for end users.

If anyone has some clue as to what the intended meaning is, or where it shows up, I invite them to please list it here, or better yet update the article page of this wiki, ideally with external, preferably third-party references (e.g. discussions?) if you have them.

Tree4rest (talk) 12:50, 26 August 2019 (UTC)

No underscores in wikilinks
Please don't include underscores in wikilinks. I removed the wikilink in this edit, which included an unnecessary underscore (the _ character). For further examples of properly formatted wikilinks, see MOS:LINK. Thanks,--Jimmy Olano (talk) 16:51, 28 August 2019 (UTC)

Duplicate text section
The following text section appears at two places: One of the duplicates should be removed.
 * before the Table of Contents box
 * below the heading Plain Text and Rich Text

Here’s the text: Files that contain markup or other meta-data are generally considered plain text, as long as the entirety remains in directly human-readable form (as in HTML, XML, and so on); as Coombs, Renear, and DeRose argue,[1] punctuation is itself markup. The use of plain text rather than bit-streams to express markup enables files to survive much better "in the wild", in part by making them largely immune to computer architecture incompatibilities.

According to The Unicode Standard:

"Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes." styled text, also known as rich text, is any text representation containing plain text completed by information such as a language identifier, font size, color, hypertext links.[2]

For instance, rich text such as SGML, RTF, HTML, XML, wiki markup, and TeX rely on plain text.

According to The Unicode Standard, plain text has two main properties in regard to rich text:

"plain text is the underlying content stream to which formatting can be applied." "Plain text is public, standardized, and universally readable."[2]

Oops, I forgot the signature: 176.199.144.109 (talk) 13:27, 1 June 2020 (UTC)


 * I have removed the duplicated text. Please check. – Tea2min (talk) 13:14, 2 June 2020 (UTC)

Updated Unicode Standard information
It's possible that the previous version of the Unicode Standard said what the article claimed that it said, but the current version says practically the opposite. It rather looks like someone took what the Unicode Standard said and made small changes to it, so that it ended up saying something different. Anyway, I updated it. The article used to say that e.g. HTML is plain text, but the current version of the Unicode Standard says precisely that HTML is **not** plain text. -- leuce (talk) 18:50, 12 October 2021 (UTC)