Author profiling



Author profiling is the analysis of a given set of texts in an attempt to uncover various characteristics of the author from stylistic and content-based features, or to identify the author. The characteristics analysed commonly include age and gender, though more recent studies have looked at other characteristics such as personality traits and occupation.

Author profiling is one of the three major fields in automatic authorship identification (AAI), the other two being authorship attribution and authorship identification. The process of AAI emerged at the end of the 19th century. Thomas Corwin Mendenhall, an American autodidact physicist and meteorologist, was the first to apply this process, using the works of Francis Bacon, William Shakespeare, and Christopher Marlowe. Mendenhall sought to uncover quantitative stylistic differences between these three historical figures by inspecting word lengths. Although much progress has been made in the 21st century, author profiling remains an unsolved problem.
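The idea of comparing authors by word length can be illustrated with a short sketch. It is a simplified illustration only, not a reconstruction of Mendenhall's manual procedure; the function name and the sample sentence are assumptions chosen for the example.

```python
import re
from collections import Counter


def word_length_distribution(text):
    """Return the relative frequency of each word length found in `text`."""
    # Treat runs of ASCII letters as words (a simplifying assumption).
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(len(word) for word in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: n / total for length, n in sorted(counts.items())}


sample = "To be or not to be that is the question"
dist = word_length_distribution(sample)
print(dist)
```

Distributions computed this way for two bodies of text can then be compared side by side, for instance by plotting them or by measuring the distance between them.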

Techniques
Through the analysis of texts, various author profiling techniques can be applied to predict information about the author. For example, function words and part-of-speech analysis can be used to predict the author's gender and the truthfulness of a text.

The process of author profiling usually involves the following steps:
 * 1) Identifying specific features to be extracted from the text
 * 2) Building an adopted, standard representation (e.g. Bag-of-words model) for the target profile
 * 3) Building a classification model using a standard classifier (e.g. Support Vector Machines) for the target profile
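The three steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it substitutes a small multinomial Naive Bayes classifier for the Support Vector Machine mentioned in step 3 so that it needs only the standard library, and the class name `NaiveBayesProfiler`, the labels `group_a` and `group_b`, and the training sentences are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(text):
    """Step 1: use lowercase word tokens as features."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Step 2: represent a text as a bag of words (token -> count)."""
    return Counter(extract_features(text))


class NaiveBayesProfiler:
    """Step 3: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        bag = bag_of_words(text)
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # log prior of the class
            for word, count in bag.items():
                if word in self.vocab:  # ignore words never seen in training
                    p = (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                    score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data: two hypothetical author groups.
texts = [
    "lol omg that party was so fun",
    "omg lol cant wait for the weekend",
    "the quarterly meeting was rescheduled",
    "please review the mortgage documents",
]
labels = ["group_a", "group_a", "group_b", "group_b"]
model = NaiveBayesProfiler().fit(texts, labels)
print(model.predict("omg so fun lol"))
```

A real study would use far more data, a held-out test set, and richer features than raw word counts.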

Machine learning algorithms for author profiling have become increasingly complex over time. Algorithms used in author profiling include:
 * Support Vector Machines
 * Naive Bayes classifiers
 * Deep averaging networks, which pass the mean of the word embeddings in a text through several neural network layers
 * Long short-term memory (LSTM) networks
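The forward pass of a deep averaging network can be sketched as follows. This is an illustrative sketch only: the two-dimensional embeddings and the layer weights are hand-picked rather than learned, and all function names are assumptions made for the example.

```python
import math


def average_embedding(tokens, embeddings, dim):
    """Average the embedding vectors of the tokens that have one."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def dense(x, weights, bias, activation):
    """Apply one fully connected layer followed by an activation."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dan_forward(tokens, embeddings, layers, dim):
    """Average the word embeddings, then pass the mean through each layer."""
    h = average_embedding(tokens, embeddings, dim)
    for weights, bias, activation in layers:
        h = dense(h, weights, bias, activation)
    return h


# Hand-picked toy values (not trained).
embeddings = {"good": [1.0, 0.0], "day": [0.0, 1.0]}
layers = [
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0], relu),  # hidden layer
    ([[2.0, 0.0]], [-1.0], sigmoid),                # output layer
]
output = dan_forward(["good", "day", "unknown"], embeddings, layers, dim=2)
print(output)
```

In practice the embeddings and weights would be learned from labelled texts, and the output would be interpreted as a class probability for the target profile.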

In the past, author profiling was limited to physical documents, often in the form of books and newspaper articles. Different combinations of textual attributes, including lexical and syntactic features, were identified and analyzed to profile their authors. Pioneering research in author profiling focused mostly on a single genre until the shift towards author profiling on social media and the Internet. While attributes such as content words and POS tags are effective in predicting author profiles from physical documents, their effectiveness on digital texts varies with the type of online content being analyzed.

With the advances in technology, author profiling on the Internet has become increasingly common. Digital texts, such as social media posts, blog posts and emails, are now being used. This has sparked greater research efforts because of the advantages analysing digital texts can bring to sectors like marketing and business. Author profiling on digital texts has also enabled predictions of a wider range of author characteristics such as personality, income and occupation.

The most effective attributes for author profiling on digital texts involve a combination of stylistic and content features. Author profiling on digital texts focuses on cross-genre author profiling, whereby one genre is used for training data and another genre is used for testing data, though both need to be relatively similar for good results.

There are some problems when performing author profiling techniques on online texts. These problems include:
 * Wide variation in lengths of texts used
 * Class imbalance in data
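Two generic remedies for the problems listed above can be sketched briefly: weighting classes inversely to their frequency, and splitting long texts into chunks of comparable size. These are common general-purpose techniques offered as an assumption, not methods attributed to any particular author profiling study, and the function names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in `labels`."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


def split_into_chunks(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


labels = ["a"] * 8 + ["b"] * 2
weights = balanced_class_weights(labels)
chunks = split_into_chunks(list(range(7)), 3)
print(weights, chunks)
```

With these weights, errors on the rarer class count for more during training, while chunking keeps very long texts from dominating very short ones.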

Author profiling and the Internet
The rise of the Internet in the late 20th and early 21st centuries catalysed an increase in author profiling research, since data could be mined from the web, including social media platforms, emails and blogs. Content from the web has been analysed in author profiling tasks to identify the age, gender, geographic origins, nationality and psychometric traits of web users. The information obtained has been used to serve various applications, including marketing and forensics.

Social media
The increased integration of social media into people's daily lives has made it a rich source of textual data for author profiling. This is mainly because users frequently upload and share content for various purposes, including self-expression, socialisation, and personal business. Social bots are also a frequent feature of social media platforms, especially Twitter, and generate content that may be analysed for author profiling. While different platforms contain similar data, they may also contain different features depending on the format and structure of the particular platform.

There are still limitations in using social media as a data source for author profiling, because the data obtained may not always be reliable or accurate. Users sometimes provide false information about themselves or withhold information. As a result, the training of algorithms for author profiling may be impeded by less accurate data. Another limitation is the irregularity of text on social media. Such text often deviates from normal linguistic standards, with spelling errors, unstandardised transliteration (such as the substitution of letters with numbers), shorthand, and user-created abbreviations for phrases, all of which may pose a challenge to author profiling. Researchers have adopted methods to overcome these limitations in training their algorithms for author profiling.
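One possible way to handle irregular social media text is to normalise it before feature extraction. The sketch below is only an assumption about what such a step might look like, not a method taken from the studies referred to above; the digit-to-letter table and the abbreviation table are small invented examples.

```python
import re

# Hypothetical lookup tables, far smaller than a real system would need.
LEET = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"})
ABBREVIATIONS = {"u": "you", "gr8": "great", "thx": "thanks", "idk": "i do not know"}


def normalise_token(token):
    """Expand a known abbreviation or undo digit-for-letter substitutions."""
    lower = token.lower()
    if lower in ABBREVIATIONS:
        return ABBREVIATIONS[lower]
    # Only rewrite digits in tokens that mix letters and digits,
    # so that genuine numbers such as "2019" are left alone.
    if re.search(r"[a-z]", lower) and re.search(r"\d", lower):
        return lower.translate(LEET)
    return lower


def normalise(text):
    """Normalise every whitespace-separated token in `text`."""
    return " ".join(normalise_token(t) for t in text.split())


print(normalise("thx u r gr8 h3llo 2019"))
```

Normalisation is a design choice rather than a requirement: the irregularities themselves may carry stylistic information about the author, so some approaches may prefer to keep them as features.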

Facebook
As a social networking service, Facebook is useful for author profiling studies, because a social network can be built, expanded, and used for social action on the site. In the process, users share personal content that may be used for author profiling studies. Textual data for author profiling is obtained from users' personal posts on Facebook, such as 'status updates'. These are collected to produce a corpus in the selected language(s), creating either a bilingual or multilingual database of content words, which may then be used for author profiling.

In the context of Facebook, author profiling mainly involves English textual data, but also uses non-English languages, including Roman Urdu, Arabic, Brazilian Portuguese and Spanish. While author profiling studies on Facebook have predominantly addressed gender and age-group identification, there have been attempts to derive attributes to predict religiosity, the IT background of users, and even basic emotions (as defined by Paul Ekman), among others.

Weibo
Sina Weibo is one of the few Asian social media platforms whose texts, written in Asian languages, have been analysed for author profiling. The primary content of focus for author profiling on Weibo includes classical Chinese characters, hashtags, emoticons, kaomoji, homogeneous punctuation, Latin sequences (due to the multilingualism of the text) and even poetic formats. Particularly popular Chinese expressions, POS tags and word types are also tracked for author profiling.

Author profiling for Weibo content requires algorithms different from those used for other social media platforms, mainly due to the linguistic differences between Mandarin Chinese and Western languages. For example, Chinese emoticons consist of Chinese characters describing a gesture or facial expression in brackets, such as [哈哈] 'laughter', [泪] 'tears', [偷笑] 'giggle', [爱你] 'love' and [心] 'heart'. This differs from the use of punctuation symbols for emoticons in Western languages, and from the common use of Unicode emojis on other platforms such as Facebook and Instagram. Further, while there are around 161 Western emoticons, around 2900 emoticons are regularly used in mainland China for web content such as that on Weibo. To tackle these differences, author profiling algorithms have been trained on Chinese emoticons and linguistic features. For example, author profiling algorithms have been designed to detect Chinese stylistic expressions conveying formality and sentiment, in place of algorithms detecting English linguistic features such as capital letters.
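Bracketed emoticons of the kind described above can be extracted with a regular expression and counted as features. This is a minimal sketch; the pattern, the function name and the example post are assumptions for illustration and do not reflect any specific published system.

```python
import re
from collections import Counter

# Match text enclosed in square brackets, e.g. "[哈哈]".
BRACKET_EMOTICON = re.compile(r"\[([^\[\]\s]+)\]")


def emoticon_counts(post):
    """Count each bracketed emoticon appearing in a post."""
    return Counter(BRACKET_EMOTICON.findall(post))


post = "今天真开心[哈哈][哈哈] 谢谢大家[心]"
counts = emoticon_counts(post)
print(counts)
```

The resulting counts could be added to a feature vector alongside word and punctuation features.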

Compared with other more popular, globalised platforms, texts on Weibo are not as commonly used for author profiling. This is likely because Weibo's user base is concentrated in mainland China, limiting its usage predominantly to Chinese nationals. Studies of this platform have used bots and machine learning algorithms to identify authors' age and gender. Data is acquired from the Weibo microblog posts of willing participants, analysed, and used to train algorithms that build concept-based profiles of users to a certain accuracy.

Chat logs
Chat logs have been studied for author profiling as they include much textual discourse, the analysis of which has contributed to applied studies of social trends and forensic science. Sources of data for author profiling from chat logs include platforms such as Yahoo!, AIM (software) and WhatsApp. Computational systems have been devised to produce concept-based profiles listing chat topics discussed in a single chat room or by independent users.

Blogs
Author profiling can be used to identify characteristics of blog writers, such as their age, gender and geographical location, based on their different writing styles. This is especially useful when it comes to anonymous blogs. The choice of content words, style-based features and topic-based features is analyzed to discover characteristics of the author.

In general, features that frequently occur in blogs include a high proportion of verbs per text and a relatively high use of pronouns. The frequencies of verbs, pronouns and other word classes are used to profile and classify emotions in the writings of authors, as well as their gender and age. Classification models that were used for author profiling on physical documents in the past, such as Support Vector Machines, have also been tested on blogs. However, they have proven unsuitable for blogs due to their low performance.
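A word-class frequency feature such as pronoun use can be computed as a simple ratio. The sketch below covers only pronouns, using a small hand-written English list that is an assumption for illustration; counting verbs would require a part-of-speech tagger, which is not shown.

```python
import re

# A small, non-exhaustive list of English pronouns (illustrative only).
PRONOUNS = {
    "i", "me", "my", "mine", "we", "us", "our", "you", "your",
    "he", "him", "his", "she", "her", "it", "its", "they", "them", "their",
}


def pronoun_rate(text):
    """Return the share of tokens in `text` that are pronouns."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in PRONOUNS for token in tokens) / len(tokens)


rate = pronoun_rate("I told her that my blog is new")
print(rate)
```

Rates like this one can be computed per blog post and passed to a classifier together with other stylistic features.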

The machine learning algorithms that work well for author profiling on blogs include:
 * Instance-based learning
 * Random Decision Forests

Email
Email has been a consistent focus for author profiling due to the rich textual data that can be found in various sections of a typical email platform. These sections include the sent, inbox, spam, trash, and archived folders. Multilingual approaches to author profiling for emails have included English, Spanish, and Arabic emails as data sources, among others. Through author profiling, details of email users may be identified, such as their age, gender, geographical origin, level of education, nationality and even psychometric personality traits, including neuroticism, agreeableness, conscientiousness and extraversion (as opposed to introversion) from the Big Five personality traits.

In author profiling for email, content is processed for important textual data, while unimportant features such as metadata and redundant HyperText Markup Language (HTML) are excluded. The parts of a Multipurpose Internet Mail Extensions (MIME) message that contain the content of the email are also included in the analysis. The data obtained is often parsed into various sections of content, including author text, signature text, advertisements, quoted text, and reply lines. Further analysis of email textual content in author profiling tasks involves the extraction of tone of voice, sentiment, semantics and other linguistic features to be processed.
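The separation of author text from quoted text, reply lines and signatures can be sketched with Python's standard `email` package. This is a rough sketch under stated assumptions: the heuristics (a leading `>` for quoted text, a line of the form "On ... wrote:" for reply lines, and the conventional `-- ` signature delimiter) are common conventions rather than rules given by the source, and the function name and sample message are invented.

```python
from email import message_from_string
from email.policy import default


def extract_author_text(raw_email):
    """Return the plain-text body with quotes, reply lines and signature removed."""
    msg = message_from_string(raw_email, policy=default)
    body = msg.get_body(preferencelist=("plain",))  # prefer the text/plain MIME part
    if body is None:
        return ""
    kept = []
    for line in body.get_content().splitlines():
        stripped = line.strip()
        if stripped == "--":  # conventional signature delimiter ("-- ")
            break
        if stripped.startswith(">"):  # quoted text from an earlier message
            continue
        if stripped.startswith("On ") and stripped.endswith("wrote:"):  # reply line
            continue
        kept.append(line)
    return "\n".join(kept).strip()


# Invented sample message.
raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Hi\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Thanks for the update.\n"
    "\n"
    "On Monday, Bob wrote:\n"
    "> Here is the report.\n"
    "-- \n"
    "Alice\n"
)
print(extract_author_text(raw))
```

Real mailboxes vary widely in how clients mark quotes and signatures, so a production system would need more robust rules than these.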

Applications
Author profiling has applications in various fields where there is a need to identify specific characteristics of an author of a text, with a growing importance in fields like forensics and marketing. Depending on its application, the task of author profiling can vary in terms of the characteristics to be identified, number of authors studied and number of texts available for analysis.

Although its applications have traditionally been limited to written texts, such as literary works, they have extended to online texts with the advancement of the computer and the Internet.

Forensic linguistics
In the context of forensic linguistics, author profiling is used to identify characteristics of the author of anonymous, pseudonymous or forged text, based on the author's use of the language. Through linguistic analysis, forensic linguists seek to identify the suspect's motivation and ideology, along with other class features, such as the suspect's ethnicity or profession. While this does not always lead to decisive author identification, such information can help law enforcement narrow the pool of suspects.

In most cases, author profiling in the context of forensic linguistics involves a single-text problem, in which there are few or no comparison texts available and no external evidence that points to the author. Examples of texts analysed by forensic linguists include blackmail letters, confessions, testaments, suicide letters and plagiarised writing. With the increasing number of cybercrimes committed on the Internet, this has also extended to online texts, such as sexually explicit online chat logs between middle-aged men and underaged girls.

One of the earliest and best-known examples of the use of author profiling is by Roger Shuy, who was asked to examine a ransom note linked to a notorious kidnapping case in 1979. Based on his analysis of the kidnapper's idiolect, Shuy was able to identify crucial elements of the kidnapper's identity from his misspellings and a dialect item: the kidnapper was well-educated and from Akron, Ohio. This eventually led to a successful arrest and a confession by the suspect.

However, there are criticisms that author profiling methods lack objectivity, since these methods are reliant on a forensic linguist's subjective identification of crucial sociolinguistic markers. These methods, such as those adopted by literary critic Donald Wayne Foster, are said to be speculative and based entirely on one's subjective experience, and therefore cannot be tested empirically.

Bot detection
Author profiling is used in the identification of social bots, the most common being Twitter bots. Social bots have been deemed a threat given their commercial, political and ideological influence, as in the 2016 United States presidential election, during which they polarised political conversations and spread misinformation and unverified information. In the context of marketing, social bots can artificially inflate the popularity of a product by posting positive reviews, and undermine the reputation of competing products with unfavourable reviews. Bot detection from an author profiling perspective is therefore a task of high importance.

Made to appear as human accounts, bots can mostly be identified by information on their profiles, like their username, profile photo and time of posting. However, the task of identifying bots solely from textual data (i.e. without meta-data) is significantly more challenging, requiring author profiling techniques. This usually involves a classification task based on semantic and syntactic features.

The task of bot and gender profiling was one of four shared tasks organised in the 2019 edition of PAN, which runs a series of scientific events and shared tasks on digital text forensics and stylometry. Participating teams achieved considerable success, with the best results for bot detection on English and Spanish tweets at 95.95% and 93.33% respectively.

Marketing
Author profiling is also useful from a marketing viewpoint, as it allows businesses to identify the demographics of people that like or dislike their products based on an analysis of blogs, online product reviews and social media content. This is important since most individuals post their reviews on products anonymously. Author profiling techniques are helpful to business experts in making better informed strategic decisions based on the demographics of their target group. In addition, businesses can target their marketing campaigns at groups of consumers who match the demographics and profile of current customers.

Author identification and influence tracing
Author profiling techniques are used to study traditional media and literature to identify the writing style of various authors as well as the topics they write about. Author profiling for literature has also been done to deduce the social networks of authors and their literary influence based on their bibliographic records of co-authorship. In cases of anonymous or pseudepigraphic works, the technique has sometimes been used to attempt to identify the author or authors, or to determine which works were written by the same person.

Some examples of author profiling studies on literature and traditional media include studies on the following:
 * The Bible (see Authorship of the Bible)
 * Gospels of the New Testament
 * Shakespeare's works
 * The Federalist Papers, in the 1960s and 1990s
 * Lithuanian literary texts
 * Primary Colors, a 1996 novel whose author was for a time anonymous
 * A Warning, a 2019 political book whose author was for a time anonymous

Library cataloguing
Another application of author profiling is in devising strategies for cataloguing library resources based on standard attributes. In this approach, author profiling techniques may improve the efficiency of library cataloguing in which library resources are automatically classified based on the authors' bibliographic records. This was a significant issue in the early 21st century when much of library cataloguing was still done manually.

In using author profiling for library cataloguing, researchers have used machine learning, such as Support Vector Machine (SVM) algorithms, to automate processes in the library. With the use of SVMs for author profiling, bibliographic records of authors within existing databases may be identified, tracked, and updated to identify an author based on his or her topics of literary content and expertise as indicated in his or her bibliographic records. In this case, author profiling uses the social structures of authors that may be derived from physical copies of published media to catalogue library resources.

In popular culture
Author profiling has been featured in popular culture. The 2017 Discovery Channel mini-series Manhunt: Unabomber is a fictionalised account of the FBI investigation surrounding the Unabomber. It features a criminal profiler who identifies defining characteristics of the Unabomber's identity based on his analysis of the Unabomber's idiolect in his published manifesto and letters. The show highlighted the importance of author profiling in criminal forensics, as it was critical in the capture of the real Unabomber in 1996.