Hancock (programming language)

Hancock is a C-based programming language first developed by researchers at AT&T Labs in 1998 to analyze data streams. Its creators intended the language to improve the efficiency and scale of data mining. Hancock works by creating profiles of individuals, using data to provide behavioral and social network information.

The development of Hancock was part of the telecommunications industry's use of data mining to detect fraud and to improve marketing. However, following the September 11, 2001 attacks and the increased government surveillance of individuals, Hancock and similar data mining technologies came under public scrutiny, especially regarding their perceived threat to individual privacy.

Background
Data mining research, including the work behind Hancock, grew during the 1990s as scientific, business, and medical interest in massive data collection, storage, and management increased. During the early 1990s, transactional businesses became increasingly interested in data warehousing, which provided storage, query, and management capabilities for the entirety of recorded transactional data. Database-oriented data mining research concentrated on creating efficient data structures and algorithms, particularly for data located outside main memory, on disk, for example. Padhraic Smyth observed that data mining researchers aimed to write algorithms that could scale to massive amounts of data while running in shorter amounts of time.

Researchers at AT&T Labs, including Corinna Cortes, pioneered the Hancock programming language from 1998 to 2004. Hancock, a C-based domain-specific programming language, was intended to make program code for computing signatures from large transactional data streams easier to read and maintain, thus serving as an improvement over the complex data mining programs written in C. Hancock also managed issues of scale for data mining programs.

Hancock programs were intended to compute hundreds of millions of signatures daily from the data streams they analyzed, making the language well suited for transactions such as telephone calls, credit card purchases, or website requests. At the time Hancock was developed, such data was usually amassed for billing or security purposes and, increasingly, to analyze how transactors behaved. Data mining can also be useful for identifying atypical patterns in transactor data. With regard to anti-terrorist activities, data mining's assistance in pattern-finding can help uncover links between terrorist suspects, through funding or arms transfers, for example.
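The signature idea can be illustrated with a short sketch (illustrative Python, not Hancock code; the record fields and decay constant are hypothetical): keep a small, fixed-size profile per transactor and fold each record into it as it streams past, without storing the records themselves.

```python
from dataclasses import dataclass

# Hypothetical per-transactor signature: a small behavioral profile
# updated incrementally from each record in the stream.
@dataclass
class Signature:
    calls: int = 0
    avg_duration: float = 0.0  # exponentially weighted average, in seconds

DECAY = 0.9  # weight given to history; (1 - DECAY) goes to the newest record

def update(sig: Signature, duration: float) -> None:
    """Fold one call record into the signature, then discard the record."""
    sig.calls += 1
    if sig.calls == 1:
        sig.avg_duration = duration
    else:
        sig.avg_duration = DECAY * sig.avg_duration + (1 - DECAY) * duration

# One pass over a tiny, made-up stream of (caller, duration) records.
signatures: dict[str, Signature] = {}
for caller, duration in [("555-0100", 60.0), ("555-0100", 120.0), ("555-0199", 30.0)]:
    sig = signatures.setdefault(caller, Signature())
    update(sig, duration)
```

The key property is that memory grows with the number of transactors, not with the (much larger) number of records.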

Data stream applications also include network monitoring, financial monitoring (such as security derivative pricing), prescription drug effect monitoring, and e-commerce. Firms can use data mining to find their most profitable consumers or to conduct churn analysis. Data mining can also help firms make credit-lending decisions by designing models that determine a customer's creditworthiness. These models are intended to minimize risky lending while maximizing sales revenue.

Besides Hancock, other data stream systems in existence by 2003 included Aurora, Gigascope, Niagara, STREAM, Tangram, Tapestry, Telegraph, and Tribeca.

Databases
Hancock is a language for data stream mining programs. Data streams differ from traditional stored databases in that they carry very high volumes of data and allow analysts to act on that data in near-real time. Stored databases, by contrast, involve data being entered for offline querying. Data warehouses, which consolidate data from different systems, can be costly to build and lengthy to implement; even simplified data warehouses can take months to build.

The scale of massive data stream mining poses problems to data miners. For example, internet and telephone network data mining might be tasked with finding persistent items, which are items that regularly occur in the stream. However, these items might be buried in a large amount of the network’s transactional data; while the items can eventually be found, data miners aim for increased time efficiency in their search.
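A standard small-memory approach to the persistent-item problem is a one-pass frequent-items summary such as the classic Misra-Gries algorithm (a general streaming technique, not something specific to Hancock). The sketch below tracks at most k-1 counters, yet guarantees that any item occurring in more than a 1/k fraction of the stream survives as a candidate.

```python
def misra_gries(stream, k):
    """One-pass candidate set for items occurring > len(stream)/k times.

    Keeps at most k-1 counters; surviving counts are underestimates of
    true frequencies, so candidates are verified in a second pass if
    exact counts are needed.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Toy stream: "a" occurs 5 times out of 9, so it must survive for k=3.
stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
candidates = misra_gries(stream, k=3)
```

Memory use is bounded by k regardless of the stream's length, which is the property that matters when the stream holds hundreds of millions of records.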

In database technology, users do not necessarily know where the data they are searching for is located. These users only have to issue queries for data, which the database management system returns. In a large data set, data can be contained in random-access memory (RAM), which is the primary storage, or disk storage, which is secondary storage. In 2000, Padhraic Smyth estimated that, using the most recent technology, data located in RAM could be accessed relatively quickly, "on the order of 10⁻⁷–10⁻⁸ seconds," while data in secondary storage took significantly longer to access, by a factor "on the order of 10⁴–10⁵."

Data mining
Data mining can be broken down into the processes of input, analysis, and reporting of results; it uses algorithms to find patterns and relationships among subjects and has been used by commercial companies to find patterns in client behavior. Data analysts are needed to collect and organize the data and to train the algorithms.

KianSing Ng and Huan Liu opine that even with straightforward data mining goals, the actual process is still complex. For example, they argue that real-world data mining can be challenged by data fluctuations, which would render prior patterns “partially invalid.” Another complication is that most databases in existence in 2000 were characterized by high dimensionality, which means that they contain data on many attributes. As Ng and Liu note, high dimensionality produces long computing times; this can be solved by data reduction in the pre-processing stage.

Hancock's process is as follows:

 * Hancock programs analyzed data in real time, as it arrived in data warehouses.
 * Hancock programs computed the signatures, or behavioral profiles, of transactors in the stream.
 * Data stream transactors included telephone numbers or IP addresses.
 * Signatures enabled analysts to discover patterns hidden in the data.
 * Telecommunications data streams consist of call records, which include information on the locations of callers and the times of calls, and sometimes include recordings of conversations.
 * Hancock was used to compute signatures based on data such as the length of phone calls and the number of calls to a particular area over a specified interval of time.
 * Hancock programs used link analysis to find "communities of interest," which connected signatures based on similarities in behavior. Link analysis requires that linkages between data be continually updated, and it is used to detect fraud networks.
 * Link analysis, which can be considered a form of association data mining, aims to find connections between relationships, such as call patterns in telecommunications. Association data mining aims to find relationships between variables. For example, one research paper suggested that a market could use association analysis to find the probability that a customer who purchases coffee also purchases bread; the market could then use that information to influence store layout and promotions.
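The "communities of interest" idea above can be illustrated with a toy sketch (illustrative Python, not Hancock; the top-k contact-set representation is a deliberate simplification of the richer, time-decayed signatures described in the literature): summarize each account by its most frequently called numbers, then link accounts whose contact sets overlap.

```python
from collections import Counter

def community_of_interest(call_log, account, k=3):
    """Top-k most frequently called numbers for an account (a toy
    stand-in for a time-decayed community-of-interest signature)."""
    calls = Counter(callee for caller, callee in call_log if caller == account)
    return {callee for callee, _ in calls.most_common(k)}

def overlap(coi_a, coi_b):
    """Jaccard similarity of two contact sets; high overlap links accounts."""
    if not coi_a or not coi_b:
        return 0.0
    return len(coi_a & coi_b) / len(coi_a | coi_b)

# Made-up (caller, callee) records: accounts A and B share two contacts.
log = [("A", "X"), ("A", "X"), ("A", "Y"), ("A", "Z"),
       ("B", "X"), ("B", "Y"), ("B", "W")]
sim = overlap(community_of_interest(log, "A"), community_of_interest(log, "B"))
```

Because each account is reduced to a small set, comparing millions of accounts stays tractable even when the underlying call volume is enormous.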

Because Hancock code performed efficiently, even with large amounts of data, the AT&T researchers claimed that it allowed analysts to create applications "previously thought to be infeasible."

Applications
The AT&T Labs researchers analyzed telecommunications data streams, including the company's entire long distance stream, which included around 300 million records from 100 million customer accounts daily. By 2004, the entirety of AT&T's long-distance phone call record signatures were written in Hancock, and the company used Hancock code to process nine gigabytes of network traffic nightly.

Telecommunications companies share information derived from data mining network traffic for research, security, and regulatory purposes.

Marketing
Hancock programs assisted in AT&T's marketing efforts. In the 1990s, large data stream mining and the increased automation of government public record systems allowed commercial corporations in the United States to personalize marketing. Signature profiles were developed from both transaction records and public record sources. Ng and Liu, for example, applied data mining to customer retention analysis, and found that mining of association rules allowed a firm to predict departures of influential customers and their associates. They argued that such knowledge subsequently empowers the company’s marketing team to target those customers, offering more attractive pitches.

Data mining assisted telecommunications companies in viral marketing, also known as buzz marketing or word-of-mouth marketing, which uses consumer social networks to improve brand awareness and profit. Viral marketing depends on connections between consumers to increase brand advocacy, which can either be explicit, such as friends recommending a product to other friends, or implicit, such as influential consumers purchasing a product. For firms, one of the goals of viral marketing is to find influential consumers who have larger networks. Another method of viral marketing is to target the neighbors of prior consumers, known as “network targeting.” Using Hancock programs, analysts at AT&T were able to find "communities of interest," or interconnected users who featured similar behavioral traits.

One of the issues viral marketing promoters encountered was the large size of marketing data sets, which, in the case of telecommunication companies, can include information on transactors and their descriptive attributes and transactions. Marketing data sets, when amounting in the hundreds of millions, can exceed the memory capacity of statistical analysis software. Hancock programs addressed data scaling issues and allowed analysts to make decisions as the data flowed into the data warehouses.

While the development of wireless communication devices allowed law enforcement to track the location of users, it also allowed companies to improve consumer marketing, such as by sending messages according to wireless users' proximity to particular businesses. Through cell site location data, Hancock programs were capable of tracking wireless users' movements.

According to academic Alan Westin, the increase of telemarketing during this period also increased consumer annoyance. Statisticians Murray Mackinnon and Ned Glick hypothesized in 1999 that firms hid their use of commercial data mining because of potential consumer backlash for mining customer records. As an example, Mackinnon and Glick cited a June 1999 lawsuit in which the state of Minnesota sued US Bancorp for releasing customer information to a telemarketing firm; Bancorp promptly responded to the lawsuit by restricting its usage of customer data.

Fraud detection
AT&T researchers, including Cortes, showed that Hancock-related data mining programs could be used for finding telecommunications fraud.

Telecommunications fraud detection includes subscription fraud, unauthorized calling card usage, and PBX fraud. It is similar to mobile communications and credit card fraud: in all three, firms must process large amounts of data in order to obtain information; they must deal with the unpredictability of human behavior, which makes finding patterns in the data difficult; and their algorithms must be trained to spot the relatively rare cases of fraud among the many legitimate transactions. According to Daskalaki et al., in 1998, telecommunications fraud incurred billions of dollars in annual losses globally.

Because fraud cases were relatively few compared to the hundreds of millions of daily telephone transactions, algorithms for data mining of telecommunication records needed to provide results quickly and efficiently. The researchers showed that communities of interest could identify fraudsters, since data nodes from fraudulent accounts are typically located closer to each other than to nodes from legitimate accounts.
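The closeness intuition can be sketched as a simple graph computation (illustrative Python, not the researchers' actual method; the call graph and account names are made up): flag accounts whose hop distance in the call graph to any known fraudulent account is small.

```python
from collections import deque

def distance_to_fraud(graph, start, fraud_nodes):
    """Shortest hop count from `start` to any known fraudulent account
    in an undirected call graph (adjacency-set representation)."""
    seen, frontier, hops = {start}, deque([start]), 0
    while frontier:
        for _ in range(len(frontier)):
            node = frontier.popleft()
            if node in fraud_nodes:
                return hops
            for nbr in graph.get(node, ()):
                if nbr not in seen:
                    seen.add(nbr)
                    frontier.append(nbr)
        hops += 1
    return None  # no path to any fraudulent account

# Toy call graph: F1 and F2 are known fraud; "new" shares a contact with F1.
graph = {
    "new": {"shared"}, "shared": {"new", "F1"},
    "F1": {"shared", "F2"}, "F2": {"F1"},
    "legit": {"other"}, "other": {"legit"},
}
d_new = distance_to_fraud(graph, "new", {"F1", "F2"})
d_legit = distance_to_fraud(graph, "legit", {"F1", "F2"})
```

An account two hops from known fraud would rank as far more suspicious than one with no path to fraudulent nodes at all.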

Through social network analyses and link analysis, they also found that the set of numbers that were targeted by fraudulent accounts, which were then disconnected, were often called on by fraudsters from different numbers; such connections could be used to identify fraudulent accounts. Link analysis methods are based on the assumption that fraudsters rarely deviate from their calling habits.

Relation to surveillance
In 2007, Wired magazine published an online article claiming that Hancock was created by AT&T researchers for "surveillance purposes." The article highlighted research papers written by Cortes et al., particularly the researchers' concept of "communities of interest." The article connected Hancock's concept with the recent public findings that the Federal Bureau of Investigation (FBI) had been making warrantless requests for records of "communities of interest" from telecommunication companies under the USA PATRIOT Act.

The article claims that AT&T "invented the concept and the technology" of creating "community of interest" records, citing the company's ownership of related data mining patents. Finally, the article noted how AT&T, along with Verizon, was, at the time, being sued in federal court for providing the National Security Agency (NSA) with access to billions of telephone records belonging to Americans. The NSA, the article claims, obtained such data with the intention of data mining it to find suspected terrorists and warrantless wiretapping targets.

FBI telecommunication records surveillance
Federal telecommunications surveillance is not a recent historical development in the United States. According to academic Colin Agur, telephone surveillance by law enforcement in the United States became more common in the 1920s. In particular, telephone wiretapping became a prevalent form of evidence collection by law enforcement officials, especially federal agents, during Prohibition. Agur argues that the Communications Act of 1934, which established the Federal Communications Commission, reined in law enforcement abuse of telephone surveillance. Under the act, telecommunications companies could keep records and report to the FCC illegal telecommunications interception requests. After the Federal Wiretap Act of 1968 and the Supreme Court's decision in Katz v. United States, both of which extended Fourth Amendment protections to telephone communications, federal telecommunications surveillance required warrants.

The FBI was first authorized to obtain national security letters (NSLs) for communication billing records, including those from telephone services, after Congress passed the Electronic Communications Privacy Act of 1986. The letters forced telephone companies to provide the FBI with customer information, such as names, addresses, and long-distance call records. Congress would eventually expand NSL authority to cover local call records as well.

After the September 11, 2001 attacks, Congress passed the USA PATRIOT Act, which made it easier for FBI investigators to be issued national security letters for terrorism investigations. Academics William Bendix and Paul Quirk contend that the PATRIOT Act allowed the FBI to access and collect the private data of many citizens without the approval of a judge. The FBI was allowed to keep a collection of records with no time limit on possession. It could also force NSL recipients to remain silent through the use of gag orders.

The Wired article claimed that the FBI began making warrantless requests to telecommunication companies for "communities of interest" records of suspects under the USA PATRIOT Act. The article claimed that law enforcement discovered the existence of such records based on research by Hancock's creators.

In 2005, government leaks revealed the FBI's abuse of NSLs. In 2006, when the PATRIOT Act was renewed, it included provisions requiring the Justice Department's inspector general to review NSL usage annually. The first inspector general report found that FBI agents had been granted 140,000 NSL requests, covering nearly 24,000 U.S. persons, from 2003 to 2005. The data was then added to databanks available to thousands of agents.

NSA telecommunication records surveillance
The public-private relationship of telecommunication companies extends into the homeland security domain. Telecommunication companies, including AT&T, Verizon, and BellSouth, cooperated with NSA requests for access to transactional records. Companies such as AT&T have also maintained partnerships with government agencies, like the Department of Homeland Security, to share information and address national cybersecurity issues. AT&T representatives sit on the board of the National Cyber Security Alliance (NCSA), which promotes cybersecurity awareness and computer user protection.

Analysts at the NSA, under authority of the secret Terrorist Surveillance Program, also used data mining to find terrorist suspects and sympathizers. In this search, the NSA intercepted communications, including telephone calls, leaving and entering the United States. Agents screened the information for possible links to terrorism, such as the desire to learn to fly planes or specific locations of the communication’s recipients, like Pakistan.

In 2005, the New York Times reported on the existence of the program, which the Bush administration defended as necessary in its counterterrorism efforts and limited to terrorist suspects and associates.

However, in 2007, the Wired article noted how AT&T and Verizon were being sued in federal court for providing the NSA with access to billions of telephone records belonging to Americans for anti-terrorism activities, such as using data mining to locate suspected terrorists and warrantless wiretapping targets.

In 2013, following the Snowden leaks, it was revealed that the program had also mined the communications of not just terrorist suspects, but also millions of American citizens. A 2014 independent audit by the Privacy and Civil Liberties Oversight Board found that the program had limited counterterrorism benefits.