Multiple object tracking

In psychology and neuroscience, multiple object tracking (MOT) refers to the ability of humans and other animals to simultaneously monitor multiple objects as they move. It is also the term for certain laboratory techniques used to study this ability.

In an MOT study, several identical moving objects are presented on a display. Some of the objects are designated as targets while the rest serve as 'distractors'. The study participants try to monitor the changing positions of the targets as they and the distractors move about. At the end of the trial, the participants are typically asked to indicate the final positions of the targets.

The results of MOT experiments have revealed limitations on humans' ability to simultaneously monitor multiple moving objects. For example, awareness of features such as color and shape is disrupted by the objects' movement.

History
In the 1970s, researcher Zenon Pylyshyn postulated the existence of a "primitive visual process" in the human brain capable of "indexing and tracking features or feature-clusters". Using this process, cognitive processes can continuously refer to, or "track", objects despite movement of the objects causing them to stimulate different visual neurons over time. Data collected with Pylyshyn's MOT protocol and published in 1988 provided the first formal demonstration that the mind can keep track of the changing positions of multiple moving objects.

As a specific theory of this ability, Pylyshyn proposed "fingers of instantiation" theory (FINST), which claims that tracking is mediated by a fixed set of discrete pointers. While FINST theory has been very influential, many studies have found evidence that seems inconsistent with the theory.

Procedure


A typical MOT study involves the presentation of between eight and twelve objects. The participant is told to monitor the positions of a subset of the objects, which are referred to as targets. Often the targets are indicated by being presented initially in a distinct color. The targets then become identical in appearance to the other, distractor objects. The targets and distractors move about the screen for several seconds in an unpredictable fashion. The participant is then asked to indicate which of the objects are the targets. The accuracy of the participant's judgments indicates whether the participant mentally updated the positions of the targets as they moved.
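The procedure above can be sketched as a small simulation. This is only an illustrative sketch, not the protocol of any particular study: the function names (run_trial, score_response), the unit-square display, the bouncing random-walk motion, and all default parameter values are assumptions made for the example.

```python
import random

def run_trial(n_objects=8, n_targets=4, n_steps=300, speed=0.01, seed=0):
    """Simulate object motion for one MOT trial in a unit-square display.

    Returns the final positions and the set of target indices.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Random starting position and velocity for every object.
    pos = [[rng.random(), rng.random()] for _ in range(n_objects)]
    vel = [[rng.uniform(-speed, speed), rng.uniform(-speed, speed)]
           for _ in range(n_objects)]
    # A random subset of the objects is designated as targets.
    targets = set(rng.sample(range(n_objects), n_targets))
    for _ in range(n_steps):
        for p, v in zip(pos, vel):
            for axis in (0, 1):
                # A small random perturbation makes the path unpredictable.
                v[axis] += rng.uniform(-speed, speed) * 0.1
                p[axis] += v[axis]
                # Bounce off the edges of the display.
                if p[axis] < 0.0 or p[axis] > 1.0:
                    v[axis] = -v[axis]
                    p[axis] = min(max(p[axis], 0.0), 1.0)
    return pos, targets

def score_response(targets, response):
    """Proportion of the targets that the participant selected."""
    return len(set(targets) & set(response)) / len(targets)
```

In this sketch, a participant who selects two of four targets correctly receives a score of 0.5 from score_response.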

To ensure that the task requires participants to mentally update the targets' positions, displays are typically designed such that object paths cause the targets to swap positions with distractors, at least occasionally. With that constraint, MOT task variations have been designed to probe specific aspects of how the mind tracks moving objects. For example, to compare performance in the left visual field with performance in the right visual field, studies confine some or all of the moving objects to one of the visual fields. To avoid any contribution from spatial interference among mental object representations, some studies maintain a minimum distance between objects. Other studies have combined MOT with a concurrent task to investigate whether the two tasks draw on the same mental resource, and have changed target features such as color to assess whether study participants update their representations of those features.
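As an illustration of the minimum-distance constraint mentioned above, the following sketch checks whether a single display frame keeps all objects at least a given distance apart. The function names and the use of plain Euclidean distance in arbitrary display units are assumptions for the example; actual studies may define spacing differently, for instance in degrees of visual angle.

```python
from itertools import combinations
from math import dist

def min_pairwise_distance(positions):
    """Smallest Euclidean distance between any two objects in a frame."""
    return min(dist(a, b) for a, b in combinations(positions, 2))

def respects_spacing(positions, min_distance):
    """True if every pair of objects is at least min_distance apart."""
    return min_pairwise_distance(positions) >= min_distance
```

A display generator could call respects_spacing on each candidate frame and reject or adjust frames that violate the constraint.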

Capacity limits
MOT study results indicate that the number of targets that people can track is very limited. This reflects a bottleneck in the brain's processing architecture. While at the early, sensory stages of visual processing, dozens of objects may be fully processed, later processes such as those associated with cognition have much more limited capacity to process visual objects.

The specific number of visual objects that people can accurately track varies widely with display parameters, contrary to a common belief that people can track no more than four or five objects. Even for a fixed set of display parameters, rather than there being a clear limit, performance falls gradually with the number of targets. Such findings undermine Pylyshyn's FINST theory that tracking is mediated by a fixed set of discrete pointers.

The above limitations appear to stem from processes specific to the two cerebral hemispheres. The independence of the limits in the two hemifields is demonstrated by findings that when one is tracking the maximum number that can be tracked in the left hemifield (which is processed by the right cerebral hemisphere), one can add targets to the right hemifield (which is processed by the left cerebral hemisphere) at little to no cost to performance. For features other than position, capacity seems to be more limited—see § Updating of features other than position.

While the tracking capacity limit is largely set separately by the two cerebral hemispheres, a more unified and cognitive resource also can contribute to tracking. For example, if there is only one target, one can bring one's full cognitive abilities to bear, such as in predicting future positions, to facilitate tracking. When more targets are present, these resources may still play a role.

Spatiotemporal limits
If the objects of a display are not sufficiently widely spaced, the objects are difficult to identify and select with attention due to spatial crowding, which can prevent tracking. High object speeds have a similar effect—faster objects are harder to track, and humans are completely unable to track objects that move sufficiently fast. This "speed limit", however, is much slower than the maximum object speed at which humans can judge the object's movement direction. This dissociation between motion perception and object tracking is thought to reflect that direction judgments can be based on low-level and local motion detector responses that do not register the positions of objects.

As an object's speed is increased, temporal crowding can result and prevent tracking well before the tracking speed limit is reached. Temporal crowding refers to an impairment caused by distractors visiting a target's former location within a short interval. The phenomenon was revealed in a study with a display where distractors were evenly-spaced along a circular trajectory that was also shared by a target. Participants could not track three targets if the locations traversed were visited by objects more than three times per second, and this was true even if the objects were moving at a relatively slow speed. This temporal crowding limit on tracking becomes more severe as the number of targets increases.
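The arithmetic behind this kind of display can be made explicit. For objects evenly spaced along a shared circular trajectory, each location on the circle is visited once per object per revolution, so the visit rate is the number of objects multiplied by the revolution rate. The sketch below assumes exactly that geometry; the function names are hypothetical, and the 3-per-second figure used in the usage note is the limit reported above for three targets.

```python
def visit_rate_hz(n_objects, revolutions_per_second):
    """Visits per second to any one location on a shared circular path."""
    return n_objects * revolutions_per_second

def max_revolution_rate(n_objects, visit_limit_hz):
    """Fastest revolution rate that keeps the visit rate within the limit."""
    return visit_limit_hz / n_objects
```

For example, six objects circling at 0.5 revolutions per second produce 3 visits per second, and with twelve objects a 3-per-second limit is already reached at 0.25 revolutions per second. This is why the limit can bind even when the objects move slowly.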

As the spatial, temporal, and speed limits are approached, tracking performance decreases gradually. In typical MOT displays, it is unclear which of these limits, or what combination of them, determines the maximum number of targets that can be tracked. For the spatial limit, one study found little to no effect of spacing beyond the Bouma's law crowding zone. Many MOT studies do not enforce sufficient spacing between objects to avoid spatial crowding, making spatial crowding likely to be one factor in overall performance.

Role of prediction and trajectory information
Brains continuously predict some aspects of the future. In the case of multiple object tracking, however, several MOT studies have found evidence against extrapolation of future positions.

When future positions are predictable, human object tracking performance can be higher than when future positions are unpredictable. However, the benefit seems to disappear when there are more than one or two targets, suggesting that any prediction happening is more limited in processing capacity than other aspects of object tracking. One issue with those studies, however, is that predictability of objects' future positions appears to be confounded with the objects being distinguishable from each other (on the basis of maintaining particular and different motion directions). In such experiments, the difference in targets' and distractors' motion directions or accelerations may be the facilitator of tracking rather than prediction of future positions. Indeed, distinctiveness of motion directions alone facilitates tracking. The ability to detect a change in a target's trajectory worsens markedly with each increase in target number. This suggests motion direction is only utilized when there are few targets, and may explain why the predictability benefit is confined to when there are only a few targets.

Role of grouping and coordinate frames
The human brain represents the positions of objects with multiple reference frames or coordinate systems. Early stages of the visual system represent the locations of objects relative to the direction the eyes are pointing (retinotopic coordinates). Some later stages of human visual processing can represent object locations relative to each other or to the scene.

Regarding representation of relative locations, the relative positions of objects can be represented with an imaginary polygon, with each target a different vertex of that polygon. In studies of MOT, Steve Yantis drew participants' attention to the polygon formed by the targets and found that this benefited performance, as did setting the targets' trajectories to avoid much disruption of the constantly-morphing polygon. This suggests that shape tracking contributes to accurate performance, at least in some participants. One study measured an electrical brain response (ERP) to a probe that was flashed while the objects were moving. The earliest-detectable part of the neural response to the probe was significantly greater if the probe lay on the polygon defined by the targets rather than inside or outside the polygon. This suggests that at least some of the participants continuously tracked the polygon defined by the targets.

Displays with more complicated statistical relationships among moving targets have been devised to show that regularities in hierarchical relationships are extracted and utilized in multiple object tracking, including nesting of groups of objects within moving reference frames.

Updating of features other than position
The classic MOT task requires updating of targets' positions but not their other features. People appear to be less able to update the other features of targets, and have difficulty even in maintaining their knowledge of such features as the associated objects move. In one study, Pylyshyn assigned distinct identities to four identical targets, either by giving them names or by giving them easily-identifiable starting positions: the four corners of the screen. In addition to the usual task at the end of the trial of identifying which objects were the targets, participants also were asked about the identity of the targets – which one each was. Contrary to what Pylyshyn expected from his FINST theory, accuracy at identifying which target was which was very low, even when accuracy reporting the targets' positions was high.

To assess maintenance of knowledge of object identities, one series of experiments used cartoon animals as targets and distractors that all moved about the screen. By the end of each trial, the animals came to rest behind cartoons of cacti, so that their identities were no longer visible. Participants were asked where a particular target (e.g., the cartoon rabbit) had gone—that is, which occluder it was hiding behind. In this multiple identity tracking (MIT) task, performance was much worse than in the standard MOT task of reporting target locations irrespective of which target a location belonged to.

The deficit in updating the locations of featural and identity information may reflect a more general deficit in updating the locations of objects in visual short-term memory. In a study using a shell game in which the shells hid brightly-colored balls of wool, pairs of shells were swapped at a slow rate of once a second, but accuracy in judging which shell contained a particular color fell to 80% when there were four swaps in a simple three-shell display, compared to 95% for four swaps with a two-shell display.

The concept of an "object file" is that of a record in the brain that stores the features of a visual object, with the location record updated as the object moves. In the original studies that were motivated by this idea, one feature of an object disappears and the object moves to a new location. The feature is then presented in the new location, and people respond faster to that feature than to features that were not previously presented as part of the object. This finding of priming indicates that an object file was created and updated by the brain. One might expect this to tap into the same processing as that assessed by the MIT task. The relationship between the two is unclear, however, as there is evidence that attentional tracking can occur along a different trajectory than that which is the basis of updating the memory of an object's features.

In the studies mentioned so far, the objects involved did not change any of their features besides their positions, so the task was to maintain knowledge of (unchanging) features while updating their positions. Change blindness studies show that in many circumstances, people do poorly at noticing that features have changed. A famous demonstration involves placing a blank screen between the presentation of two versions of a screen to mask the flicker that would otherwise be associated with a change. Change blindness also occurs when the flicker evoked by the change is masked by the objects' motion. That, however, may only mean that nothing is comparing the features present before and after the change; it does not necessarily mean that object representations are not updated, so other studies are needed.

A related issue is whether tracking can occur on the basis not only of smooth changes in the position of an object, but also on the basis of smooth changes in an object's other features. In a tracking experiment in which two objects were always spatially superposed, the objects maintained their separate identities based on smooth continuity of their colors, orientations, and spatial frequencies. The participants could only track one such object, suggesting no ability to capitalize on spatiotemporal feature continuity for features other than position, although this has not yet been tested for cases in which the targets do not overlap (overlap may trigger figure-ground interference).

Difficulty tracking unusual objects and object parts
Many objects have clearly-visible parts. A dumbbell, for example, has a central bar part and has the weights at the bar's ends. Even when such parts are conspicuous, people can have difficulty tracking individual parts of multiple objects. When individual ends of multiple dumbbell-shaped drawings were designated as targets, tracking performance was poor. Performance was even worse when participants attempted to track one end of each of several moving lines, where the lines were uniform without distinct parts. Evidently, the mental processes that underlie tracking of multiple objects operate on a particular type of object representation that differs from what we can consciously recognize. Possibly the representation used for tracking is shared with that used when searching for a particular colored shape that is hidden among many other shapes; visual search is hindered by connecting targets to distractors.

For some types of "objects" that are not segmented as such by early visual processing, not even a single instance can be tracked. Stuart Anstis has shown that people are unable to track the intersection of two lines sliding over each other, except possibly at very slow speeds.

Some things change shape as they move, such as liquids and slinkys. For slinky-like objects that moved by extending their leading edges to a point and then retracting their trailing edges, Kristy vanMarle and Brian Scholl found that tracking performance was poor. The underlying reason for this is unclear, but reporting the location of even a lone object is impaired by growth or contraction of the object, which may contribute to the tracking failure.

Interference with concurrent performance of other tasks
Overlap among the processes underlying mental abilities can be revealed by what types of concurrent tasks interfere with each other. Attempting to track multiple visual objects typically interferes with other tasks, even for tasks with stimuli in other modalities. Unfortunately, it can be difficult to determine whether this reflects processing somewhat specific to our ability to track or instead reflects the processing necessary to initiate and sustain a wide variety of tasks.

One exception to the usual finding of interference with other tasks is that an auditory pitch discrimination task was found to not interfere with visual multiple object tracking. With a task designed as an auditory analog of tracking rather than just requiring discrimination of a few pitches, however, Daryl Fougnie et al. found that the task interfered approximately as much with visual object tracking as did a visual feature-tracking task. This suggests that auditory and visual tracking are limited by a common processing resource.

Neural basis
Neuroimaging studies find that activation of areas of the parietal cortex increases with the number of objects tracked, which is consistent with the suggestion that the parietal cortex plays a role in humans' limited tracking capacity. Activation of other brain areas also seems to increase with target load, but the particular areas may be less consistent across studies than the parietal cortex finding. The size of participants' pupils also increases with the number of objects tracked. The pupil size increase, which also is caused by mental effort in other tasks, may reflect norepinephrine release from the locus coeruleus.

Objects presented to the left visual hemifield are processed initially by the right cerebral hemisphere, while stimuli presented to the right visual hemifield are processed initially by the left cerebral hemisphere. The independent capacity limits in the two hemifields are very similar, although there may be a small right-hemifield advantage. A right hemifield advantage would be consistent with a contribution by both parietal cortices to tracking that hemifield, which was suggested because both parietal cortices are thought to contribute to other attentional functions in the right hemifield.

The neural basis of MOT has also been studied using electroencephalography (EEG). One such study found a robust correlation between tracking performance and the effect of number of targets on the N2pc event-related potential and also on contralateral delay activity. Multiple brain areas contribute to these signals, so such studies have not yet allowed researchers to determine exactly which brain areas mediate tracking.

Human variation and development
If a person is tested multiple times, their scores are usually similar to each other. This suggests that the variation in the number of objects people seem able to track (for one version of the task, capacities ranged between one and six targets) reflects real variation in ability. A caveat is that studies have failed to assess how much of this could be due to variation in individuals' motivation, but one study tested only top military recruits, a sample that was likely to be highly motivated, and also found substantial variation between individuals.

Most research has been conducted on healthy undergraduates at universities in Western countries, so little is known about other populations. Comparing children of different ages, however, two studies in North America found a marked increase with age in the number of objects the children could track, from 6 or 7 years old to adulthood. People with autism spectrum disorders have been found to have poorer MOT performance than typically-developing people. This was attributed to a deficit in attentional selection in autism.

Adults with Williams Syndrome have profound deficits on certain spatial assembly tasks, such as copying a four-block checkerboard pattern. For multiple object tracking, their performance is similar to that of typically-developing four- or five-year-old children. In contrast, their ability to remember the locations of MOT targets that do not move is more comparable to that of typically-developing 6-year-olds, which has led to the suggestion that maintaining attentional selection is a particular problem in Williams Syndrome.

Among older typically-developing adults, MOT performance falls steeply with age. Age-related increases in spatial crowding and temporal crowding likely contribute to this.

Several papers report that video game players perform substantially better in MOT tasks than those who do not play video games. However, it has been suggested that this could be an artifact of research practices such as selective publication of results.

Covariation of object tracking ability with other abilities
While some have used MOT in an attempt to ensure study participants sustain their attention over a long interval, a study with a large number of participants found little correlation with a continuous performance task specifically designed to measure lapses in attention. MOT may, then, be forgiving of lapses in attention, which is consistent with findings that for typical displays, participants can perform well in MOT even if they are occasionally briefly interrupted, with their tracking processes able to pick up where they left off.

One approach to investigating which tasks share underlying processing is to test participants on several different tasks to determine which tasks have the highest correlations across individuals. The results of studies that have done this with MOT have not been entirely consistent with each other, so which tasks yield the highest correlation with MOT performance is not yet clear. However, multiple studies find that visual working memory is one of the most highly-correlated tasks. That correlation is consistent with findings that working memory tasks are among the best predictors of performance in a range of tasks. This may reflect shared mechanisms such as maintaining goal-relevant information in memory (possibly including which objects are the targets) and disengaging from outdated or irrelevant information.
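The individual-differences approach described above rests on computing a correlation between task scores across participants. A minimal sketch follows, using the standard Pearson correlation coefficient; the function name is an assumption, and no particular study's analysis is being reproduced (published studies may use other statistics or corrections for measurement reliability).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Unnormalized standard deviations; the 1/n factors cancel in the ratio.
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

Given one list of MOT scores and one list of scores on another task, with participants in the same order, pearson_r returns a value between -1 and 1. Comparing this value across several candidate tasks indicates which task covaries most strongly with MOT performance.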

Use in ability testing and training
Some professional sports teams use laboratory-style MOT tests for ability assessment and for training. Associates of the company that makes the "NeuroTracker" MOT product claim that NeuroTracker is a "cognitive enhancer" that improves a variety of abilities relevant to performance on the sports field, but the evidence in the studies purporting to show this is weak. Another reason for skepticism of such claims is the poor track record of other commercial "brain training" products advertised for their cognitive-enhancing effects.

While it is unlikely that training on laboratory-style MOT tasks yields broad mental benefits, more rigorous studies may eventually provide firm evidence supporting the use of tasks related to MOT for specific screening or training purposes. Regarding screening, however, one study found that laboratory MOT performance did not predict driving test performance as well as the Montreal Cognitive Assessment, a trail-making task, or a useful field-of-view task. A multiple object avoidance (MOA) task, involving steering a ball with a computer mouse to prevent it from colliding with other moving balls on a computer screen, was found to correlate better with driving performance than MOT. In another study, MOA performance correlated strongly and positively with driving simulator performance and with years of driving experience. This may be because MOA includes control of movement, which is necessary for driving, but is not required for MOT.

Theories and models
Published computational models fit some aspects of tracking results, with most focusing on the pattern of performance decline with increasing number of targets, and some modeling the dissociation between position and non-position features. No published theory purports to explain all four of the following: the difficulty with tracking parts of objects, the role of temporal interference, the dissociation between position and non-positional features, and the pattern of performance decline with increasing number of targets.

Serial versus parallel processing
The independence of tracking in the left and right hemifields suggests that position updating in each hemifield occurs independently of and in parallel with position updating in the other hemifield (see § Capacity limits). Within a hemifield, it is not yet completely clear whether tracking of multiple objects happens in parallel or instead the target positions are updated one-by-one, but most recent theorists agree with Pylyshyn's original FINST theory that positions are updated in parallel. A finding that gives some support to the alternative of serial switching is the marked increase in temporal interference as the number of targets tracked increases. In particular, the amount of increase in time needed between when a target leaves a location and a distractor takes its place is approximately predicted by the theory that attention must visit each moving target one-by-one to update its location.
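The prediction of the serial-switching account can be illustrated with simple arithmetic. If attention visits the targets one at a time and spends a fixed dwell time on each, then each target is revisited only once per full cycle, so the time between updates of any one target grows in proportion to the number of targets. The sketch below assumes this simple round-robin scheme; the function names and the 100 ms dwell time in the usage note are hypothetical and are not estimates from the literature.

```python
def revisit_interval_ms(n_targets, dwell_ms):
    """Time between successive updates of one target under serial switching."""
    return n_targets * dwell_ms

def can_update_in_time(n_targets, dwell_ms, available_ms):
    """True if a target is revisited before a distractor takes its place."""
    return revisit_interval_ms(n_targets, dwell_ms) <= available_ms
```

With a hypothetical dwell time of 100 ms, a distractor arriving 250 ms after a target leaves a location would be tolerated with one target (100 ms cycle) but not with three (300 ms cycle), which mirrors the pattern of temporal interference increasing with target number.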

Some who theorize that position updating occurs simultaneously for multiple targets draw a contrast with features other than position, stating that they are updated by a process that must serially switch among the targets. A model by Lovett, Bridewell, & Bello published in 2019, for example, includes a parallel process to track changes in position and connect to visual pointers that are shared with visual short-term memory and other visual attention tasks. A serial selection process is also included, which operates on only one object at a time and enables access to a target's motion history and other features.

Slots versus resources
Central to Pylyshyn's FINST theory is the claim that a small set of discrete pointers mediates multiple object tracking. Subsequent researchers have suggested that rather than discrete pointers, a mental resource that is more continuous is divided among the targets. This dispute is similar to the "slots versus resources" debate in the study of working memory. A continuous resource naturally explains the smooth decline in performance with number of targets, although there is no agreement about what precisely about tracking becomes worse when less resource is provided. Possibilities include spatial resolution, temporal resolution, the maximum speed of the tracker, or all three (see § Spatiotemporal limits).
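The qualitative difference between the two accounts can be illustrated with a pair of toy models. These are deliberately simplified sketches and not published models: the slot count, the guessing rate of 0.5, the exponential form of the resource model, and the gain parameter are all assumptions chosen only to show the contrast in shape.

```python
from math import exp

def slot_accuracy(n_targets, n_slots=4, guess=0.5):
    """Toy slot model: up to n_slots targets are tracked perfectly,
    and any remaining targets are reported by guessing."""
    tracked = min(n_targets, n_slots) / n_targets
    return tracked + (1.0 - tracked) * guess

def resource_accuracy(n_targets, gain=3.0, guess=0.5):
    """Toy resource model: each target receives a 1/n share of a
    continuous resource, and accuracy falls smoothly as the share shrinks."""
    p_tracked = 1.0 - exp(-gain / n_targets)
    return guess + (1.0 - guess) * p_tracked
```

Under these assumptions, the slot model stays at ceiling until the number of targets exceeds the number of slots and only then declines, whereas the resource model declines with every added target, starting from the second one.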