Sequence analysis in social sciences

In social sciences, sequence analysis (SA) is concerned with the analysis of sets of categorical sequences that typically describe longitudinal data. Analyzed sequences are encoded representations of, for example, individual life trajectories such as family formation, school-to-work transitions, or working careers, but they may also describe daily or weekly time use or represent the evolution of observed or self-reported health, of political behaviors, or the development stages of organizations. Such sequences are chronologically ordered, unlike words or DNA sequences, for example.

SA is a longitudinal analysis approach that is holistic in the sense that it considers each sequence as a whole. SA is essentially exploratory. Broadly, SA provides a comprehensible overall picture of sets of sequences with the objective of characterizing the structure of the set of sequences, finding the salient characteristics of groups, identifying typical paths, comparing groups, and more generally studying how the sequences are related to covariates such as sex, birth cohort, or social origin.

Introduced in the social sciences in the 1980s by Andrew Abbott, SA gained much popularity after the release of dedicated software such as the SQ and SADI add-ons for Stata and the TraMineR R package with its companions TraMineRextras and WeightedCluster.

Despite some connections, the aims and methods of SA in social sciences strongly differ from those of sequence analysis in bioinformatics.

History
Sequence analysis methods were first imported into the social sciences from the information and biological sciences (see Sequence alignment) by the University of Chicago sociologist Andrew Abbott in the 1980s, and they have since developed in ways that are unique to the social sciences. Scholars in psychology, economics, anthropology, demography, communication, political science, organizational studies, and especially sociology have been using sequence methods ever since.

In sociology, sequence techniques are most commonly employed in studies of patterns of life-course development, cycles, and life histories. There has been a great deal of work on the sequential development of careers, and there is increasing interest in how career trajectories intertwine with life-course sequences. Many scholars have used sequence techniques to model how work and family activities are linked in household divisions of labor and the problem of schedule synchronization within families. The study of interaction patterns is increasingly centered on sequential concepts, such as turn-taking, the predominance of reciprocal utterances, and the strategic solicitation of preferred types of responses (see Conversation Analysis). Social network analysts (see Social network analysis) have begun to turn to sequence methods and concepts to understand how social contacts and activities are enacted in real time, and to model and depict how whole networks evolve. Social network epidemiologists have begun to examine social contact sequencing to better understand the spread of disease. Psychologists have used those methods to study how the order of information affects learning, and to identify structure in interactions between individuals (see Sequence learning).

Many of the methodological developments in sequence analysis came on the heels of a special section devoted to the topic in a 2000 issue of Sociological Methods & Research, which hosted a debate over the use of the optimal matching (OM) edit distance for comparing sequences. In particular, sociologists objected to the descriptive and data-reducing orientation of optimal matching, as well as to a lack of fit between bioinformatic sequence methods and uniquely social phenomena. The debate has given rise to several methodological innovations (see Pairwise dissimilarities below) that address limitations of early sequence comparison methods developed in the 20th century. In 2006, David Stark and Balazs Vedres proposed the term "social sequence analysis" to distinguish the approach from bioinformatic sequence analysis. However, apart from the book by Benjamin Cornwell, the term has seldom been used, probably because the context prevents any confusion in the SA literature. Sociological Methods & Research organized a special issue on sequence analysis in 2010, leading to what Aisenbrey and Fasang referred to as the "second wave of sequence analysis", which mainly extended optimal matching and introduced other techniques to compare sequences. Alongside sequence comparison, recent advances in SA have concerned, among others, the visualization of sets of sequence data, the measurement and analysis of the discrepancy of sequences, the identification of representative sequences, and the development of summary indicators of individual sequences. Raab and Struffolino have conceived more recent advances as the third wave of sequence analysis. This wave is largely characterized by the effort to bring together the stochastic and the algorithmic modeling cultures by jointly applying SA with more established methods such as analysis of variance, event history analysis, Markovian modeling, social network analysis, or causal analysis and statistical modeling in general.

Sociology
The analysis of sequence patterns has foundations in sociological theories that emerged in the middle of the 20th century. Structural theorists argued that society is a system that is characterized by regular patterns. Even seemingly trivial social phenomena are ordered in highly predictable ways. This idea serves as an implicit motivation behind social sequence analysts' use of optimal matching, clustering, and related methods to identify common "classes" of sequences at all levels of social organization, a form of pattern search. This focus on regularized patterns of social action has become an increasingly influential framework for understanding microsocial interaction and contact sequences, or "microsequences." This is closely related to Anthony Giddens's theory of structuration, which holds that social actors' behaviors are predominantly structured by routines, which in turn provide predictability and a sense of stability in an otherwise chaotic and rapidly moving social world. This idea is also echoed in Pierre Bourdieu's concept of habitus, which emphasizes the emergence and influence of stable worldviews that guide everyday action and thus produce predictable, orderly sequences of behavior. The role of routine as a structuring influence on social phenomena was first illustrated empirically by Pitirim Sorokin, who led a 1939 study that found that daily life is so routinized that a given person is able to predict with about 75% accuracy how much time they will spend doing certain things the following day.
Talcott Parsons's argument that all social actors are mutually oriented to their larger social systems (for example, their family and larger community) through social roles also underlies social sequence analysts' interest in the linkages that exist between different social actors' schedules and ordered experiences, which has given rise to a considerable body of work on synchronization between social actors and their social contacts and larger communities. All of these theoretical orientations together warrant critiques of the general linear model of social reality, which as applied in most work implies that society is either static or that it is highly stochastic in a manner that conforms to Markov processes. This concern inspired the initial framing of social sequence analysis as an antidote to general linear models. It has also motivated recent attempts to model sequences of activities or events as elements that link social actors in non-linear network structures. This work, in turn, is rooted in Georg Simmel's theory that experiencing similar activities, experiences, and statuses serves as a link between social actors.

Demography and historical demography
In demography and historical demography, from the 1980s the rapid appropriation of the life course perspective and methods was part of a substantive paradigmatic change that implied a stronger embedding of demographic processes into social sciences dynamics. After a first phase with a focus on the occurrence and timing of demographic events studied separately from each other with a hypothetico-deductive approach, from the early 2000s the need to consider the structure of life courses and to do justice to their complexity led to a growing use of sequence analysis with the aim of pursuing a holistic approach. At an inter-individual level, pairwise dissimilarities and clustering appeared as the appropriate tools for revealing the heterogeneity in human development. For example, the meta-narrations contrasting individualized Western societies with collectivist societies in the South (especially in Asia) were challenged by comparative studies revealing the diversity of pathways to legitimate reproduction. At an intra-individual level, sequence analysis integrates the basic life course principle that individuals interpret and make decisions about their lives according to their past experiences and their perception of contingencies. Interest in this perspective was also promoted by the changes in individuals' life courses for cohorts born between the beginning and the end of the 20th century. These changes have been described as de-standardization, de-synchronization, and de-institutionalization. Among the drivers of these dynamics, the transition to adulthood is key: for more recent birth cohorts this crucial phase along individual life courses implied a larger number of events and lengths of the state spells experienced.
For example, many postponed leaving the parental home and the transition to parenthood, in some contexts cohabitation replaced marriage as a long-lasting living arrangement, and the birth of the first child occurs more frequently while parents cohabit instead of within wedlock. Such complexity needed to be measured in order to compare quantitative indicators across birth cohorts (this questioning has also been extended to populations from low- and middle-income countries). Demography's long-standing ambition to develop a 'family demography' has found in sequence analysis a powerful tool to address research questions at the crossroads with other disciplines: for example, multichannel techniques represent valuable opportunities to deal with the issue of compatibility between working and family lives. Similarly, more recent combinations of sequence analysis and event history analysis have been developed and can be applied, for instance, to understand the link between demographic transitions and health.

Political sciences
The analysis of temporal processes in the domain of political sciences concerns how institutions, that is, systems and organizations (regimes, governments, parties, courts, etc.) that crystallize political interactions, formalize legal constraints and impose a degree of stability or inertia. Special importance is given, first, to the role of contexts, which confer meaning to trends and events, while shared contexts offer shared meanings; second, to changes over time in power relationships, and, subsequently, asymmetries, hierarchies, contention, or conflict; and, finally, to historical events that are able to shape trajectories, such as elections, accidents, inaugural speeches, treaties, revolutions, or ceasefires. Empirically, the unit of analysis of political sequences can be individuals, organizations, movements, or institutional processes. Depending on the unit of analysis, the sample size may be limited to a few cases (e.g., regions in a country when considering the turnover of local political parties over time) or include a few hundred (e.g., individuals' voting patterns). Three broad kinds of political sequences may be distinguished. The first and most common is careers, that is, formal, mostly hierarchical positions along which individuals progress in institutional environments, such as parliaments, cabinets, administrations, parties, unions or business organizations. The second is trajectories, that is, political sequences that develop in more informal and fluid contexts, such as activists evolving across various causes and social movements, or voters navigating a political and ideological landscape across successive polls.
Finally, processes relate to non-individual entities, such as: public policies developing through successive policy stages across distinct arenas; sequences of symbolic or concrete interactions between national and international actors in diplomatic and military contexts; and development of organizations or institutions, such as pathways of countries towards democracy (Wilson 2014).

Concepts
A sequence s is an ordered list of elements (s1,s2,...,sl) taken from a finite alphabet A. For a set S of sequences, three sizes matter: the number n of sequences, the size a = |A| of the alphabet, and the length l of the sequences (which may differ from one sequence to another). In social sciences, n is generally between a few hundred and a few thousand, the alphabet size remains limited (most often less than 20), while sequence length rarely exceeds 100.

We may distinguish between state sequences and event sequences: states last, while events occur at one time point and do not last but contribute, possibly together with other events, to state changes. For instance, the joint occurrence of the two events leaving home and starting a union provokes a state change from 'living at home with parents' to 'living with a partner'.

When a state sequence is represented as the list of states observed at the successive time points, the position of each element in the sequence conveys this time information and the distance between positions reflects duration. An alternative, more compact representation of a sequence is the list of the successive spells stamped with their duration, where a spell (also called episode) is a maximal run of consecutive positions in the same state. For example, in aabbbc, bbb is a spell of length 3 in state b, and the whole sequence can be represented as (a,2)-(b,3)-(c,1).
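
As an illustration of the two representations, the following minimal Python sketch converts a position-wise state sequence into its spell representation. The function names `to_spells` and `format_spells` are hypothetical helpers chosen for this example, not part of any SA package.

```python
from itertools import groupby


def to_spells(sequence):
    """Collapse a state sequence into a list of (state, duration) spells."""
    # groupby yields one group per run of identical consecutive states.
    return [(state, sum(1 for _ in run)) for state, run in groupby(sequence)]


def format_spells(spells):
    """Render spells in the (state,duration)-(state,duration) notation."""
    return "-".join(f"({state},{duration})" for state, duration in spells)


spells = to_spells("aabbbc")
print(format_spells(spells))  # (a,2)-(b,3)-(c,1)
```

The sum of the spell durations equals the length of the original sequence, so no time information is lost in the compact form.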

A crucial point when looking at state sequences is the timing scheme used to time-align the sequences. This could be historical calendar time, or a process time such as age, i.e. time since birth.

In event sequences, positions do not convey any time information. Therefore event occurrence time must be explicitly provided (as a timestamp) when it matters.

SA is essentially concerned with state sequences.

Methods
Conventional SA consists essentially in building a typology of the observed trajectories. Abbott and Tsay (2000) describe this typical SA as a three-step program: 1. Coding individual narratives as sequences of states; 2. Measuring pairwise dissimilarities between sequences; and 3. Clustering the sequences from the pairwise dissimilarities. However, SA is much more and also encompasses, among others, the description and visual rendering of sets of sequences, ANOVA-like analysis and regression trees for sequences, the identification of representative sequences, the study of the relationship between linked sequences (e.g. dyadic, linked-lives, or various life dimensions such as occupation, family, health), and sequence networks.

Describing and rendering state sequences
Given an alignment rule, a set of sequences can be represented in tabular form with sequences in rows and columns corresponding to the positions in the sequences.

Sequences of cross-sectional distributions
To describe such data, we may look at the columns and consider the cross-sectional state distributions at the successive positions.

The chronogram or density plot of a set of sequences renders these successive cross-sectional distributions. For each (column) distribution we can compute characteristics such as entropy or modal state and look at how these values evolve over the positions.
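
The column-wise view can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: all sequences have the same length, there are no missing values or weights, and the entropy is the unnormalized Shannon entropy with natural logarithm. The function names are hypothetical.

```python
from collections import Counter
from math import log


def cross_sectional_distributions(sequences):
    """Return, for each position, the proportion of sequences in each state.

    Assumes equal-length sequences without missing values.
    """
    n = len(sequences)
    return [
        {state: count / n for state, count in Counter(column).items()}
        for column in zip(*sequences)  # zip(*...) iterates over columns
    ]


def entropy(distribution):
    """Shannon entropy (natural log) of a state distribution."""
    return -sum(p * log(p) for p in distribution.values() if p > 0)


def modal_state(distribution):
    """Most frequent state (first one encountered in case of ties)."""
    return max(distribution, key=distribution.get)


sequences = ["aabb", "abbb", "aabc", "abcc"]
distributions = cross_sectional_distributions(sequences)
for position, dist in enumerate(distributions, start=1):
    print(position, dist, round(entropy(dist), 3), modal_state(dist))
```

In this toy example the entropy is zero at the first position, where all sequences are in state a, and increases as the sequences diverge.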

Characteristics of individual sequences
Alternatively, we can look at the rows. The index plot where each sequence is represented as a horizontal stacked bar or line is the basic plot for rendering individual sequences. We can compute characteristics of the individual sequences and examine the cross-sectional distribution of these characteristics.

Main indicators of individual sequences
 * Basic measures
   * Length
   * Number of states visited
   * Number of transitions (length of sequence of distinct successive states, DSS)
   * Number of subsequences
   * Recurrence
 * Diversity
   * Within sequence entropy
   * Variance of spell duration
 * Complexity of the sequence structure
   * Volatility
   * Complexity index
   * Turbulence
 * Measures that take account of the nature of the states
   * Normative volatility, i.e. proportion of positive spells
   * Integration index, also known as Quality index
   * Degradation
   * Badness
   * Precarity index
   * Insecurity
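
A few of the simplest indicators can be computed as in the sketch below. It is an illustration only: the function names are hypothetical, the number of transitions is taken as the length of the DSS minus one, and the within-sequence entropy is normalized by the logarithm of the alphabet size, which is an assumption of this example (it requires an alphabet of at least two states and a non-empty sequence).

```python
from collections import Counter
from itertools import groupby
from math import log


def distinct_successive_states(sequence):
    """Sequence of distinct successive states (DSS), e.g. aabbbc -> a, b, c."""
    return [state for state, _ in groupby(sequence)]


def sequence_indicators(sequence, alphabet):
    """Basic indicators of one non-empty state sequence."""
    length = len(sequence)
    dss = distinct_successive_states(sequence)
    proportions = [count / length for count in Counter(sequence).values()]
    raw_entropy = -sum(p * log(p) for p in proportions)
    return {
        "length": length,
        "states_visited": len(set(sequence)),
        "transitions": len(dss) - 1,
        # Assumption: entropy normalized by its maximum, log(|alphabet|).
        "entropy": raw_entropy / log(len(alphabet)),
    }


print(sequence_indicators("aabbbc", alphabet="abc"))
print(sequence_indicators("aaaaaa", alphabet="abc"))
```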

Other overall descriptive measures

 * Mean time in the different states (overall state distribution) and their standard errors
 * Transition probabilities between states.
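
Both overall measures can be estimated by simple counting, as in the following sketch. The function names are hypothetical, sequences are assumed to be unweighted and without missing values, transition probabilities are assumed to be time-homogeneous, and standard errors are left out.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next state | current state) from all adjacent pairs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for origin, destination in zip(seq, seq[1:]):
            counts[origin][destination] += 1
    return {
        origin: {dest: c / sum(dests.values()) for dest, c in dests.items()}
        for origin, dests in counts.items()
    }


def mean_time_in_states(sequences):
    """Average number of positions spent in each state per sequence."""
    totals = Counter()
    for seq in sequences:
        totals.update(seq)
    return {state: total / len(sequences) for state, total in totals.items()}


sequences = ["aabb", "abbb", "aabc"]
rates = transition_probabilities(sequences)
print(rates["a"])  # {'a': 0.4, 'b': 0.6}
print(rates["b"])  # {'b': 0.75, 'c': 0.25}
print(mean_time_in_states(sequences))
```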

Visualization
State sequences lend themselves well to graphical rendering, and such plots prove useful for interpretation purposes. As shown above, the two basic plots are the index plot that renders individual sequences and the chronogram that renders the evolution of the cross-sectional state distribution along the timeframe. Chronograms (also known as status proportion plots or state distribution plots) completely overlook the diversity of the sequences, while index plots are often too scattered to be readable. Relative frequency plots and plots of representative sequences attempt to increase the readability of index plots without falling into the oversimplification of a chronogram. In addition, there are many plots that focus on specific characteristics of the sequences. Below is a list of plots that have been proposed in the literature for rendering large sets of sequences. For each plot, we give examples of software (details in section Software) that produce it.


 * Index plot: renders the set of individual sequences (SADI, SQ, TraMineR)
 * Chronogram (status proportion plot, state distribution plot): renders the sequence of cross-sectional distributions (SADI, SQ, TraMineR)
 * Plot of multidomain/multichannel sequences grouped by channels (TraMineR, seqHMM) or by individuals
 * Plot of time series of cross-sectional indicators (entropy, modal state, ...) (SQ, TraMineR)
 * Frequency plot (SQ, TraMineR)
 * Relative frequency plot (TraMineR)
 * Representative sequences (TraMineR)
 * Mean time in the different states and their standard errors (TraMineR)
 * State survival plot (TraMineRextras)
 * Position-wise group typical states, i.e., with highest implication strength (TraMineRextras)
 * Transition patterns (SADI)
 * Transition plot (SQ, Gmisc) and plot of transition probabilities (seqHMM)
 * Parallel coordinate plot (TraMineR, SQ)
 * Probabilistic suffix trees (PST)
 * Sequence networks (see social network analysis, Social network analysis software)
 * Narrative networks

Pairwise dissimilarities
Pairwise dissimilarities between sequences serve to compare sequences, and many advanced SA methods are based on these dissimilarities. The most popular dissimilarity measure is optimal matching (OM), i.e. the minimal cost of transforming one sequence into the other by means of indel (insert or delete) and substitution operations, where the costs of these elementary operations may depend on the states involved. SA is so intimately linked with OM that it is sometimes named optimal matching analysis (OMA).
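
The OM distance can be computed with the standard dynamic programming scheme used for edit distances. The sketch below is a minimal illustration, not the implementation of any particular package; the function names, the single indel cost of 1 and the constant substitution cost of 2 are assumptions chosen for the example.

```python
def optimal_matching(x, y, substitution_cost, indel=1.0):
    """Minimal total cost of turning sequence x into sequence y.

    substitution_cost(a, b) gives the cost of replacing state a by state b;
    indel is the cost of one insertion or deletion.
    """
    n, m = len(x), len(y)
    # d[i][j] is the OM distance between the prefixes x[:i] and y[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,  # delete x[i-1]
                d[i][j - 1] + indel,  # insert y[j-1]
                d[i - 1][j - 1] + substitution_cost(x[i - 1], y[j - 1]),
            )
    return d[n][m]


def constant_cost(a, b, cost=2.0):
    """Constant substitution cost: 0 for identical states, `cost` otherwise."""
    return 0.0 if a == b else cost


print(optimal_matching("aabbbc", "aabbcc", constant_cost))  # 2.0
print(optimal_matching("aab", "ab", constant_cost))         # 1.0
```

A state-dependent cost scheme only requires passing a different `substitution_cost` function, for instance one that looks up the pair of states in a cost table.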

There are roughly three categories of dissimilarity measures:


 * Optimal matching and other edit distances
   * Examples: OM, OMloc (localized OM), OMslen (spell-length sensitive OM), OMspell (OM of spell sequences), OMstran (OM of sequences of transitions), TWED (time-warp edit distance), HAM (Hamming and generalized Hamming), DHD (Dynamic Hamming).
   * Strategies for setting the substitution and indel costs
     * Constant costs (all substitution costs identical and single indel cost)
     * Theory-based costs
     * Feature-based costs
     * Data-driven costs: based on transition probabilities or state frequencies
 * Measures based on the count of common attributes
   * Examples: LCS (derived from length of longest common subsequence), LCP (from length of longest common prefix), NMS (number of matching subsequences), and NMSMST and SVRspell, two variants of NMS.
 * Distances between within-sequence state distributions
   * Examples: CHI2 and EUCLID, defined as the average of respectively the Chi-squared and Euclidean distance between state distributions in successive sliding windows.
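
To give a concrete feel for some of the simpler measures listed above, the sketch below implements the basic Hamming distance and the LCS- and LCP-based dissimilarities. The function names are hypothetical, and the conversion of a common-attribute count A(x, y) into a dissimilarity as |x| + |y| - 2 A(x, y) is the form assumed here; actual software may apply other normalizations.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(x, y))


def lcs_length(x, y):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcp_length(x, y):
    """Length of the longest common prefix."""
    length = 0
    for a, b in zip(x, y):
        if a != b:
            break
        length += 1
    return length


def lcs_distance(x, y):
    return len(x) + len(y) - 2 * lcs_length(x, y)


def lcp_distance(x, y):
    return len(x) + len(y) - 2 * lcp_length(x, y)


x, y = "aabbbc", "aabbcc"
print(hamming(x, y), lcs_distance(x, y), lcp_distance(x, y))  # 1 2 4
```

The three values differ for the same pair of sequences, which illustrates that the choice of measure determines which aspects (timing, order, early agreement) drive the comparison.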

Dissimilarity-based analysis
Pairwise dissimilarities between sequences give access to a series of techniques to discover holistic structuring characteristics of the sequence data. In particular, dissimilarities between sequences can serve as input to cluster algorithms and multidimensional scaling, but also make it possible to identify medoids or other representative sequences, define neighborhoods, measure the discrepancy of a set of sequences, proceed to ANOVA-like analyses, and grow regression trees.


 * Cluster analysis
   * Descriptive: identification of main sequence patterns.
   * Clusters as dependent or independent variables in regression analysis: study of relationships with other variables of interest.
 * Multidimensional scaling (principal coordinates): numerical representation of sequences.
 * Discrepancy (ANOVA-like) analysis
 * Sequence of ANOVA-like analyses
 * Regression trees
 * Representative sequences
 * Multiple domains (multichannel analysis)
 * Dyadic and polyadic sequence data
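
Two of these tools need nothing more than the matrix of pairwise dissimilarities, as the sketch below illustrates. It is a simplified example under stated assumptions: dissimilarities are plain Hamming distances, sequences are unweighted, the medoid is taken as the sequence with the smallest sum of dissimilarities to all others, and the discrepancy is computed as the sum of all pairwise dissimilarities divided by 2n², which is the pseudo-variance form assumed here. The function names are hypothetical.

```python
def hamming(x, y):
    """Number of differing positions (equal-length sequences assumed)."""
    return sum(a != b for a, b in zip(x, y))


def medoid_index(distances):
    """Index of the sequence with the smallest total distance to the others."""
    return min(range(len(distances)), key=lambda i: sum(distances[i]))


def discrepancy(distances):
    """Pseudo-variance of a set: sum of all d[i][j] divided by 2 * n**2."""
    n = len(distances)
    return sum(sum(row) for row in distances) / (2 * n * n)


sequences = ["aabb", "abbb", "bbbb", "cccc"]
distances = [[hamming(x, y) for y in sequences] for x in sequences]

print(sequences[medoid_index(distances)])  # abbb
print(discrepancy(distances))              # 1.0
```

The same matrix could be passed to a clustering or multidimensional scaling routine; only the dissimilarities, not the sequences themselves, are needed at that stage.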

Other methods of analysis
Although dissimilarity-based methods play a central role in social SA, essentially because of their ability to preserve the holistic perspective, several other approaches also prove useful for analyzing sequence data.


 * Non dissimilarity-based clustering
   * Latent class analysis (LCA)
   * Markov model mixture and hidden Markov model mixture
   * Mixtures of exponential-distance models
 * Sequence networks
   * Representing a single sequence as a network
   * Meta network of sequences
   * Sequence network measures
 * Life history graph
 * Probabilistic approaches
   * Markovian and other transition distribution models. See also Markov model.
   * Probabilistic Suffix Tree (PST), also known as variable-order Markov model or variable-length Markov model.
 * Event sequences
   * Event structure models
   * Rendering of event sequences (parallel coordinate plots, ...)
   * Frequent subsequences
   * Discriminant subsequences
   * Dissimilarity-based analysis of event sequences

Advances: the third wave of sequence analysis
Some recent advances can be conceived as the third wave of SA. This wave is largely characterized by the effort of bringing together the stochastic and the algorithmic modeling culture by jointly applying SA with more established methods such as analysis of variance, event history, network analysis, or causal analysis and statistical modeling in general. Some examples are given below; see also "Other methods of analysis".


 * Effect of past trajectories on the hazard of an event: Sequence History Analysis, SHA
 * Effect of time varying covariates on trajectories: Competing Trajectories Analysis (CTA), and Sequence Analysis Multistate Model (SAMM)
 * Validation of cluster typologies
 * Discrepancy analysis to bring time back to qualitative comparative analysis (QCA)

Open issues and limitations
Although SA witnesses a steady inflow of methodological contributions that address the issues raised two decades ago, some pressing open issues remain. Among the most challenging, we can mention:


 * Sequences of different lengths, truncated sequences, and missing values.
 * Validation of cluster results
 * Sequence length vs importance of recency: for example, when analyzing 40-year-long biographical sequences from age 1 to 40, one can only consider individuals born at least 40 years earlier, and therefore the behavior of younger birth cohorts is disregarded.

Up-to-date information on advances, methodological discussions, and recent relevant publications can be found on the Sequence Analysis Association webpage.

Fields of application
These techniques have proved valuable in a variety of contexts. In life-course research, for example, research has shown that retirement plans are affected not just by the last year or two of one's life, but also by how one's work and family careers unfolded over a period of several decades. People who followed an "orderly" career path (characterized by consistent employment and gradual ladder-climbing within a single organization) retired earlier than others, including people who had intermittent careers, those who entered the labor force late, as well as those who enjoyed regular employment but who made numerous lateral moves across organizations throughout their careers. In the field of economic sociology, research has shown that firm performance depends not just on a firm's current or recent social network connectedness, but also on the durability or stability of its connections to other firms. Firms that have more "durably cohesive" ownership network structures attract more foreign investment than less stable or poorly connected structures. Research has also used data on everyday work activity sequences to identify classes of work schedules, finding that the timing of work during the day significantly affects workers' abilities to maintain connections with the broader community, such as through community events. More recently, social sequence analysis has been proposed as a meaningful approach to study trajectories in the domain of creative enterprise, allowing comparison of the idiosyncrasies of unique creative careers. While other methods for constructing and analyzing whole sequence structure have been developed during the past three decades, including event structure analysis, OM and other sequence comparison methods form the backbone of research on whole sequence structures.

Some examples of application include:

 * Sociology
   * Labor market entry sequences
   * De-standardization of the life course
   * Residential trajectories
   * Time use
   * Actual and idealized relationship scripts
   * Basic types of figures in ritual dances
   * Pathways of alcohol consumption
 * Demography and historical demography
   * Transition to adulthood
   * Partnership biographies
   * Family formation life course
   * Childbirth histories
 * Political sciences
   * Pathways towards democratization
   * Pathways of legislative processes
   * Bargaining between actors during national crises
 * Education and learning sciences
   * Study trajectories
   * Learning strategies

 * Psychology
   * Sequences of adolescents' social interactions
 * Medical research
   * Care trajectory in chronic disease
 * Survey methodology
   * Response in survey collection
 * Geography
   * Mobility studies
   * Regional development
   * Land use

Software
Two main statistical computing environments offer tools to conduct a sequence analysis in the form of user-written packages: Stata and R.


 * Stata: SQ and SADI are general SA toolkits. MICT is dedicated to imputation of missing elements in sequences.
 * R: TraMineR with its extension TraMineRextras is probably the most comprehensive SA toolkit; ggseqplot provides ggplot versions of most TraMineR plots; seqhandbook provides several specific tools such as heat maps of sequence data and the GIMSA method for measuring dissimilarities between multidomain sequences; seqimpute provides tools for imputing missing elements in sequences; seqHMM, although specialized in fitting Markov models, provides useful plotting facilities for rendering multichannel sequences and transition probabilities; WeightedCluster is a versatile clustering package with original tools for grouping identical sequences and rendering hierarchical trees of sequences; PST fits and renders probabilistic suffix trees of sequences.

Institutional development
The first international conference dedicated to social-scientific research that uses sequence analysis methods – the Lausanne Conference on Sequence Analysis, or LaCOSA – was held in Lausanne, Switzerland in June 2012. A second conference (LaCOSA II) was held in Lausanne in June 2016. The Sequence Analysis Association (SAA) was founded at the International Symposium on Sequence Analysis and Related Methods, in October 2018 at Monte Verità, TI, Switzerland. The SAA is an international organization whose goal is to organize events such as symposia and training courses, and to facilitate scholars' access to sequence analysis resources.