User:Ak1538/sandbox

There are five broad aspects to author identification in writeprint:


 * Lexical features - the analysis of the lexicon, the author's choice of vocabulary, using characters and words to identify preferences of an individual, such as use of uppercase and lowercase letters, frequency of certain letters, average word length, and mean length of the utterance itself;
 * Syntactic features - the analysis of the author's writing style and sentence structure, such as punctuation and hyphenation, use of passive voice, and sentence complexity;
 * Structural features - the analysis of the author's organization and structural arrangement of the work, including paragraph length, spacing, and indentation, as well as the arrangement of sentences within paragraphs and, in an email setting for example, the use of greetings, farewells and signatures;
 * Content-specific features - the analysis of the language that is contextually significant to the subject of the written work, including the use of slang or acronyms. More specifically, these features indicate the author's interests by pinpointing the keywords they use;
 * Idiosyncratic features - the analysis of errors and other ungrammatical elements that may be unique to the author, such as incorrect spelling, misuse of words and inaccurate verb forms. Because such errors are hard for an author to control, these features have achieved high accuracy in author identification when combined with other features.
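
As a rough illustration of how a few of these feature groups might be measured, the Python sketch below computes some lexical, syntactic and structural counts for a piece of text. The function name `writeprint_features`, the particular metrics chosen, and the naive sentence and paragraph splitting are all assumptions made for illustration; this is not the feature set or method of any published writeprint system.

```python
import re
import string
from collections import Counter


def writeprint_features(text: str) -> dict:
    """Compute a few illustrative stylometric features (hypothetical helper)."""
    # Words: runs of letters, optionally with one internal apostrophe (e.g. "don't").
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    letters = [c for c in text if c.isalpha()]
    # Naive segmentation: paragraphs are separated by blank lines,
    # sentences end at runs of '.', '!' or '?'.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = [c for c in text if c in string.punctuation]

    return {
        # Lexical features
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "uppercase_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        "letter_frequency": Counter(c.lower() for c in letters),
        # Syntactic feature
        "punctuation_per_sentence": len(punctuation) / len(sentences) if sentences else 0.0,
        # Structural feature
        "avg_paragraph_length_words": len(words) / len(paragraphs) if paragraphs else 0.0,
    }


if __name__ == "__main__":
    sample = "Hi all.\n\nBye now!"
    for name, value in writeprint_features(sample).items():
        print(name, value)
```

Content-specific and idiosyncratic features are left out of this sketch because they would need a domain keyword list and a spelling or grammar reference, neither of which is specified here.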

Machine learning
~

I think these two sections would be very valuable to this article. The application of writeprint in forensic linguistics should be covered in more detail, including specific cases and the process used. It may not need its own heading, since the information could be divided among other, preexisting sections. The section I want to focus on more is writeprint's application in computational linguistics. Computer science alone covers many topics, and its subfield of computational linguistics covers many more, so I believe the most appropriate place to detail how writeprint applies to computational linguistics is the writeprint article itself. I will be researching the relationship between the two and how computational linguistics can help with author identification.

Briannacardaci77 (talk) 17:33, 3 April 2020 (UTC)

~

I noticed that a few of the sentences do not match Wikipedia's typical style because they are ungrammatical or choppy. I reworded the first two paragraphs:

The "Dear Boss" letter was a message allegedly written by the notorious unidentified Victorian serial killer known as Jack the Ripper. It was dated September 25, 1888 and addressed to the Central News Agency of London, which received it, postmarked, on September 27, 1888. The letter was forwarded to Scotland Yard on September 29, 1888.

Although unlikely to have been written by the actual murderer, the "Dear Boss" letter was the first piece of correspondence received in which the author signed with the name "Jack the Ripper". From that day forward, the unidentified killer was known by that name.

(Marija Landeka)

~

It might be best to split the bulleted list, at risk of losing readability:

There are five broad aspects to author identification in writeprint: lexical, syntactic, structural, content-specific and idiosyncratic features. Lexical features refer to the features of the lexicon, or the author's choice of vocabulary and their preferences in terms of various metrics such as word length distributions or frequency of certain letters. Syntactic features refer to the author's writing style and sentence structure, such as their patterns of punctuation, use of passive voice, and typical sentence complexity. Structural features refer to the author's organization and structural arrangement of the work, including the arrangement of sentences within paragraphs, overall paragraph length, use of signatures or sign-offs, and spacing or indentation. Content-specific features refer to the use of words or phrases related to a particular subject or domain, such as the inclusion of technical terminology. Idiosyncratic features refer to errors and other ungrammatical elements that may be unique to the author, such as incorrect spelling, misuse of words and inaccurate verb forms.

However, this might make the section far too close to the source material. Also, the source in this citation notes that the list is not a sample of all features, or necessarily of the most important ones, but simply the set of features (excluding content-specific) that have "been shown to be more effective for online identification than feature sets containing only a subset of these feature groups". We should probably take that into account when reworking the list!

~