Search engine privacy

Search engine privacy is a subset of internet privacy that deals with user data being collected by search engines. Both types of privacy fall under the umbrella of information privacy. Privacy concerns regarding search engines can take many forms, such as the ability of search engines to log individual search queries, browsing history, IP addresses, and cookies of users, and to conduct user profiling in general. The collection of personally identifiable information (PII) of users by search engines is referred to as tracking.

This is controversial because search engines often claim to collect a user's data in order to better tailor results to that specific user and to provide the user with a better searching experience. However, search engines can also abuse and compromise their users' privacy by selling their data to advertisers for profit. In the absence of regulations, users must decide what is more important to their search engine experience, relevance and speed of results or their privacy, and choose a search engine accordingly.

The legal framework in the United States for protecting user privacy is weak. The most popular search engines collect personal information, but search engines focused on privacy have also emerged. There have been several well-publicized breaches of search engine user privacy at companies like AOL and Yahoo. Individuals interested in preserving their privacy have options available to them, such as using software like Tor, which makes the user's location and personal information anonymous, or using a privacy-focused search engine.

Privacy policies
Search engines generally publish privacy policies to inform users about what data of theirs may be collected and what purposes it may be used for. While these policies may be an attempt at transparency by search engines, many people never read them and are therefore unaware of how much of their private information, like passwords and saved files, is collected from cookies and may be logged and kept by the search engine. This ties in with the phenomenon of notice and consent, which is how many privacy policies are structured.

Notice and consent policies essentially consist of a site showing the user a privacy policy and having them click to agree. This is intended to let the user freely decide whether or not to go ahead and use the website. This decision, however, may not actually be made so freely because the costs of opting out can be very high. Another big issue with putting the privacy policy in front of users and having them accept quickly is that such policies are often very hard to understand, even in the unlikely case that a user decides to read them. Privacy-minded search engines, such as DuckDuckGo, state in their privacy policies that they collect much less data than search engines such as Google or Yahoo, and may not collect any. As of 2008, search engines were not in the business of selling user data to third parties, though they did note in their privacy policies that they complied with government subpoenas.

Google and Yahoo
Google, founded in 1998, is the most widely used search engine, receiving billions of search queries every month. Google logs all search terms in a database along with the date and time of search, browser and operating system, IP address of user, the Google cookie, and the URL that shows the search engine and search query. The privacy policy of Google states that they pass user data on to various affiliates, subsidiaries, and "trusted" business partners.

Yahoo, founded in 1994, also collects user data. It is a well-known fact that users do not read privacy policies, even for services that they use daily, such as Yahoo! Mail and Gmail. This persistent failure of consumers to read these privacy policies can be disadvantageous to them because while they may not pick up on differences in the language of privacy policies, judges in court cases certainly do. This means that search engine and email companies like Google and Yahoo are technically able to keep up the practice of targeting advertisements based on email content since they declare that they do so in their privacy policies. A study was done to see how much consumers cared about privacy policies of Google, specifically Gmail, and their detail, and it determined that users often thought that Google's practices were somewhat intrusive but that users would not often be willing to counteract this by paying a premium for their privacy.

DuckDuckGo
DuckDuckGo, founded in 2008, claims to be privacy focused. DuckDuckGo does not collect or share any personal information of users, such as IP addresses or cookies, which other search engines usually do log and keep for some time. It also does not have spam, and protects user privacy further by anonymizing search queries from the website the user chooses and using encryption. Similarly privacy oriented search engines include Startpage, Ecosia, Qwant, MetaGer and Disconnect. Mojeek and Brave Search are privacy-focused search engines that build their own indexes.

Types of data collected by search engines
Most search engines can, and do, collect personal information about their users according to their own privacy policies. This user data could be anything from location information to cookies, IP addresses, search query histories, click-through history, and online fingerprints. This data is often stored in large databases, and users may be assigned numbers in an attempt to provide them with anonymity.

Data can be stored for an extended period of time. For example, the data collected by Google on its users is retained for up to 9 months. Some studies state that this number is actually 18 months. This data is used for various reasons such as optimizing and personalizing search results for users, targeting advertising, and trying to protect users from scams and phishing attacks. Such data can be collected even when a user is not logged in to their account or when using a different IP address by using cookies.
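To illustrate how a cookie can tie activity together even when the IP address changes, the following is a minimal sketch in Python. The log format, the sample entries, and the `group_by_cookie` helper are all hypothetical and chosen for illustration; they do not reflect any real search engine's logging schema.

```python
from collections import defaultdict

# Hypothetical log entries: (cookie_id, ip_address, query).
# The IP addresses come from ranges reserved for documentation.
LOG = [
    ("cookie-A", "203.0.113.5", "weather boston"),
    ("cookie-B", "198.51.100.7", "cheap flights"),
    ("cookie-A", "192.0.2.44", "boston pizza delivery"),
]


def group_by_cookie(entries):
    """Group queries by cookie ID, ignoring which IP address sent them."""
    profiles = defaultdict(list)
    for cookie_id, _ip, query in entries:
        profiles[cookie_id].append(query)
    return dict(profiles)


profiles = group_by_cookie(LOG)
for cookie_id, queries in profiles.items():
    print(cookie_id, queries)
```

In this sketch the two queries carrying `cookie-A` end up in the same profile even though they arrived from different IP addresses, which is the effect the paragraph above describes.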

User profiling and personalization
Once search engines have collected information about a user's habits, they often create a profile of that user, which helps the search engine decide which links to show for different search queries submitted by that user or which ads to target them with. An important development in this field is the use of machine learning. Using this, search engines can refine their profiling models to more accurately predict what any given user may want to click on by doing A/B testing of results offered to users and measuring the reactions of users.

Companies like Google, Netflix, YouTube, and Amazon have all started personalizing results more and more. One notable example is how Google Scholar takes into account the publication history of a user in order to produce results it deems relevant. Personalization also occurs when Amazon recommends books or when IMDb suggests movies by using previously collected information about a user to predict their tastes. For personalization to occur, a user need not even be logged into their account.

Targeted advertising
The internet advertising company DoubleClick, which helped advertisers target users for specific ads, was bought by Google in 2008 and was a subsidiary until June 2018, when Google rebranded and merged DoubleClick into its Google Marketing Platform. DoubleClick worked by depositing cookies on users' computers that would track the sites they visited that carried DoubleClick ads. While Google was in the process of acquiring DoubleClick, there was a privacy concern that the acquisition would let Google create even more comprehensive profiles of its users, since it would be collecting data about search queries and additionally tracking websites visited. This could lead to users being shown ads that are increasingly effective with the use of behavioral targeting. With more effective ads comes the possibility of more purchases from consumers that they may not have made otherwise. In 1994, a conflict between selling ads and relevance of results on search engines began. This was sparked by the development of the cost-per-click model, which challenged the methods of the already-created cost-per-mille model. The cost-per-click method was directly related to what users searched, whereas the cost-per-mille method was directly influenced by how much a company could pay for an ad, no matter how many times people interacted with it.
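The difference between the two billing models can be made concrete with a little arithmetic. The sketch below uses the standard definitions (cost-per-mille bills per thousand impressions, cost-per-click bills per click); the function names, rates, and traffic figures are invented for illustration and are not taken from any real campaign.

```python
def cpm_cost(impressions, rate_per_thousand):
    """Cost under cost-per-mille: billed per 1,000 impressions shown."""
    return impressions / 1000 * rate_per_thousand


def cpc_cost(clicks, rate_per_click):
    """Cost under cost-per-click: billed only when a user clicks the ad."""
    return clicks * rate_per_click


# Hypothetical campaign: the same ad is shown 50,000 times and clicked 200 times.
impressions = 50_000
clicks = 200

cpm_total = cpm_cost(impressions, rate_per_thousand=2.00)
cpc_total = cpc_cost(clicks, rate_per_click=0.25)

print(f"CPM bill: {cpm_total:.2f}")
print(f"CPC bill: {cpc_total:.2f}")
```

Under cost-per-mille the bill depends only on how often the ad is displayed, while under cost-per-click it depends on how often users actually interact with it, which is why the latter ties advertising revenue more closely to what users search for.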

Improving search quality
Besides ad targeting and personalization, Google also uses data collected on users to improve the quality of searches. Search result click histories and query logs are crucial in helping search engines optimize search results for individual users. Search logs also help search engines in the development of the algorithms they use to return results, such as Google's well known PageRank. An example of this is how Google uses databases of information to refine Google Spell Checker.

Privacy organizations
There are many who believe that user profiling is a severe invasion of user privacy, and there are organizations such as the Electronic Privacy Information Center (EPIC) and Privacy International that are focused on advocating for user privacy rights. In fact, EPIC filed a complaint in 2007 with the Federal Trade Commission claiming that Google should not be able to acquire DoubleClick on the grounds that it would compromise user privacy. The Open Search Foundation specifically targets search engine privacy by investigating ways of making search a public, collaborative good where people can search freely without their personal data being collected and evaluated.

Users' perception of privacy
Experiments have been done to examine consumer behavior when given information on the privacy of retailers by integrating privacy ratings with search engines. Researchers used a search engine for the treatment group called Privacy Finder, which scans websites and automatically generates an icon to show the level of privacy the site will give the consumer as it compares to the privacy policies that consumer has specified that they prefer. The results of the experiment were that subjects in the treatment group, those who were using a search engine that indicated privacy levels of websites, purchased products from websites that gave them higher levels of privacy, whereas the participants in the control groups opted for the products that were simply the cheapest. The study participants also were given financial incentive because they would get to keep leftover money from purchases. This study suggests that since participants had to use their own credit cards, they had a significant aversion to purchasing products from sites that did not offer the level of privacy they wanted, indicating that consumers value their privacy monetarily.

Ethical debates
Many individuals and scholars have recognized the ethical concerns regarding search engine privacy.

Pro data collection
The collection of user data by search engines can be viewed as a positive practice because it allows the search engine to personalize results. This implies that users would receive more relevant results, and be shown more relevant advertisements, when their data, such as past search queries, location information, and clicks, is used to create a profile for them. Also, search engines are generally free of charge for users and can remain afloat because one of their main sources of revenue is advertising, which can be more effective when targeted.

Anti-data collection
This collection of user data can also be seen as an overreach by private companies for their own financial gain or as an intrusive surveillance tactic. Search engines can make money using targeted advertising because advertisers are willing to pay a premium to present their ads to the most receptive consumers. Also, when a search engine collects and catalogs large amounts of data about its users, there is the potential for it to be leaked accidentally or breached. The government can also subpoena user data from search engines when they have databases of it. Search query database information may also be subpoenaed by private litigants for use in civil cases, such as divorces or employment disputes.

AOL search data leak
One major controversy regarding search engine privacy was the AOL search data leak of 2006. For academic and research purposes, AOL made public a list of about 20 million search queries made by about 650,000 unique users. Although they assigned unique identification numbers to the users instead of attaching names to each query, it was still possible to ascertain the true identities of many users simply by analyzing what they had searched, including locations near them and names of friends and family members. A notable example of this was how the New York Times identified Thelma Arnold through "reverse searching". Users also sometimes do "ego searches" where they search themselves to see what information about them is on the internet, making it even easier to identify supposedly anonymous users. Many of the search queries released by AOL were incriminating or seemingly extremely private, such as "how to kill your wife" and "can you adopt after a suicide attempt". This data has since been used in several experiments that attempt to measure the effectiveness of user privacy solutions.
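The weakness of replacing names with identification numbers can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the usernames, queries, and the `pseudonymize` and `queries_per_id` helpers are hypothetical and are not drawn from the released AOL data or from AOL's actual procedure.

```python
from collections import defaultdict
from itertools import count


def pseudonymize(log):
    """Replace each username with a stable numeric ID, keeping the queries."""
    ids = {}
    counter = count(1)
    released = []
    for username, query in log:
        if username not in ids:
            ids[username] = next(counter)
        released.append((ids[username], query))
    return released


def queries_per_id(released):
    """Collect every query that shares the same numeric ID."""
    by_id = defaultdict(list)
    for user_id, query in released:
        by_id[user_id].append(query)
    return dict(by_id)


# Hypothetical raw log of (username, query) pairs.
RAW_LOG = [
    ("alice", "landscapers in springfield"),
    ("bob", "used cars"),
    ("alice", "jane example springfield"),
]

released = pseudonymize(RAW_LOG)
grouped = queries_per_id(released)
print(grouped)
```

The names are gone from the released data, but because each user keeps the same number, all of that user's queries remain linked. Anyone reading the queries grouped under one number can combine the place names and personal names in them to narrow down who the person is.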

Google and Yahoo
Both Google and Yahoo were subjects of a Chinese hack in 2010. While Google responded to the situation seriously by hiring new cybersecurity engineers and investing heavily into securing user data, Yahoo took a much more lax approach. Google started paying hackers to find vulnerabilities in 2010 while it took Yahoo until 2013 to follow suit. Yahoo was also identified in the Snowden data leaks as a common hacking target for spies of various nations, and Yahoo still did not give its newly hired chief information security officer the resources to really effect change within the company. In 2012, Yahoo hired Marissa Mayer, previously a Google employee, to be the new CEO, but she chose not to invest much in the security infrastructure of Yahoo and went as far as to refuse the implementation of a basic and standard security measure to force the reset of all passwords after a breach.

Yahoo is known for being the subject of multiple breaches and hacks that have compromised large amounts of user data. As of late 2016, Yahoo had announced that at least 1.5 billion user accounts had been breached during 2013 and 2014. The breach of 2013 compromised over a billion accounts while the breach of 2014 included about 500 million accounts. The data compromised in the breaches included personally identifiable information such as phone numbers, email addresses, and birth dates as well as information like security questions (used to reset passwords) and encrypted passwords. Yahoo made a statement saying that their breaches were a result of state sponsored actors, and in 2017, two Russian intelligence officers were indicted by the United States Department of Justice as part of a conspiracy to hack Yahoo and steal user data. As of 2016, the Yahoo breaches of 2013 and 2014 were the largest of all time.

In October 2018, there was a Google+ data breach that potentially affected about 500,000 accounts which led to the shutdown of the Google+ platform.

Government subpoenas of data
The government may want to subpoena user data from search engines for any number of reasons, which is why it is a big threat to user privacy. In 2006, the government wanted it as part of its defense of COPA, and only Google refused to comply. While protecting children online may be an honorable goal, there are concerns about whether the government should have access to such personal data to achieve it. At other times, the government may want it for national security purposes; access to big databases of search queries in order to prevent terrorist attacks is a common example of this.

Whatever the reason, it is clear that the fact that search engines do create and maintain these databases of user data is what makes it possible for the government to access it. Another concern regarding government access to search engine user data is "function creep", a term that here refers to how data originally collected by the government for national security purposes may eventually be used for other purposes, such as debt collection. This would indicate to many a government overreach. While protections for search engine user privacy have started developing recently, the government has increasingly been on the side that wants to ensure search engines retain data, making users less protected and their data more available for anyone to subpoena.

Switching search engines
A different, although popular, route for a privacy centered user to take is to simply start using a privacy oriented search engine, such as DuckDuckGo. This search engine maintains the privacy of its users by not collecting data on or tracking its users. While this may sound simple, users must take into account the trade-off between privacy and relevant results when deciding to switch search engines. Results to search queries can be very different when the search engine has no search history to aid it in personalization.

Using privacy oriented browsers
Mozilla is known for its commitment to protecting user privacy on Firefox. Mozilla Firefox users have the capability to delete the tracking cookie that Google places on their computer, making it much harder for Google to group data. Firefox also has a button called "Clear Private Data", which allows users to have more control over their settings. Internet Explorer users have this option as well. When using a browser like Google Chrome or Safari, users also have the option to browse in "incognito" or "private browsing" modes respectively. When in these modes, the user's browsing history and cookies are not saved on the device after the session ends.

Opting out
The Google, Yahoo!, AOL, and MSN search engines all allow users to opt out of the behavioral targeting they use. Users can also delete search and browsing history at any time. The Ask.com search engine also has AskEraser, which, when used, purges user data from its servers. Deleting a user's profile and history of data from search engine logs also helps protect user privacy in the event a government agency wants to subpoena it. If there are no records, there is nothing the government can access. It is important to note that simply deleting browsing history does not delete all the information the search engine has on a user; some companies do not delete the data associated with an account when the user clears their browsing history. Companies that do delete user data usually do not delete all of it, keeping records of how the user used the search engine.

Social network solution
An innovative solution, proposed by researchers Viejo and Castellà-Roca, is a social network solution whereby user profiles are distorted. In their plan, each user would belong to a group, or network, of people who all use the search engine. Every time somebody wanted to submit a search query, it would be passed on to another member of the group to submit on their behalf until someone submitted it. This would ideally lead to all search queries being divvied up equally between all members of the network. This way, the search engine cannot make a useful profile of any individual user in the group since it has no way to discern which query actually belonged to each user.
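The effect of this scheme on what the search engine observes can be sketched with a small simulation. This is a simplified illustration, not the researchers' protocol: a uniformly random choice among the other group members stands in for the forwarding step, and the member names, queries, and the `submit_via_group` and `observed_log` helpers are hypothetical.

```python
import random


def submit_via_group(author, query, members, rng):
    """Pick a group member to submit the query on the author's behalf.

    Assumption: a uniformly random choice among the other members stands in
    for the forwarding protocol, which the text does not specify in detail.
    """
    others = [m for m in members if m != author]
    submitter = rng.choice(others)
    return submitter, query


def observed_log(real_queries, members, seed=0):
    """Return the (submitter, query) pairs that the search engine would see."""
    rng = random.Random(seed)
    return [submit_via_group(a, q, members, rng) for a, q in real_queries]


MEMBERS = ["u1", "u2", "u3", "u4"]
# Hypothetical (true author, query) pairs, known only inside the group.
REAL = [
    ("u1", "knee pain"),
    ("u1", "knee surgeon near me"),
    ("u2", "tax forms"),
]

log = observed_log(REAL, MEMBERS)
for submitter, query in log:
    print(submitter, query)
```

In this sketch the search engine only ever sees the submitter, never the true author, so the profile it builds for any one member is a mixture of other members' interests.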

Delisting and reordering
After the Google Spain v. AEPD case, it was established that people had the right to request that search engines delete personal information from their search results in compliance with other European data protection regulations. This process of simply removing certain search results is called de-listing. While effective in protecting the privacy of those who wish information about them to not be accessed by anyone using a search engine, it does not necessarily protect the contextual integrity of search results. For data that is not highly sensitive or compromising, reordering search results is another option where people would be able to rank how relevant certain data is at any given point in time, which would then alter results given when someone searched their name.

Anonymity networks
A sort of DIY option for privacy-minded users is to use software like Tor, which is an anonymity network. Tor functions by encrypting user data and routing traffic through a series of relays chosen from a volunteer network of thousands. While this process is effective at masking IP addresses, it can slow the speed of results. There have also been studies showing that simulated attacker software could still match search queries to users even when anonymized using Tor.

Unlinkability and indistinguishability
Unlinkability and indistinguishability are also well-known solutions to search engine privacy, although they have proven somewhat ineffective in actually providing users with anonymity from their search queries. Both unlinkability and indistinguishability solutions try to anonymize search queries from the user who made them, therefore making it impossible for the search engine to definitively link a specific query with a specific user and create a useful profile on them. This can be done in a couple of different ways.

Unlinkability
An unlinkability solution hides information such as the user's IP address from the search engine. This is perhaps the simpler option because any user can do it by using a VPN, although it still does not guarantee total privacy from the search engine.

Indistinguishability
An indistinguishability solution has the user run a plugin or software that generates multiple different search queries for every real search query the user makes. It functions by obscuring the real searches a user makes so that a search engine cannot tell which queries are the software's and which are the user's. It is then more difficult for the search engine to use the data it collects on a user to do things like target ads.
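The idea of hiding a real query among generated ones can be sketched as follows. This is a minimal illustration under stated assumptions: the decoy pool, the example query, and the `with_decoys` helper are hypothetical, and no particular plugin's behavior is being reproduced.

```python
import random

# Hypothetical pool of decoy queries; a real tool would need a far larger
# and more plausible source of cover traffic.
DECOY_POOL = [
    "weather tomorrow",
    "pasta recipes",
    "football scores",
    "movie times",
    "train schedule",
    "how to tie a tie",
]


def with_decoys(real_query, n_decoys, rng):
    """Return the real query mixed with n_decoys decoys in random order."""
    decoys = rng.sample(DECOY_POOL, n_decoys)
    batch = decoys + [real_query]
    rng.shuffle(batch)
    return batch


rng = random.Random(42)
batch = with_decoys("symptoms of flu", n_decoys=3, rng=rng)
for query in batch:
    print(query)
```

From the search engine's side, all four queries in the batch arrive from the same user. How well this protects the user depends on how hard the decoys are to tell apart from real queries, which is one reason such approaches have proven only partly effective.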

Legal rights and court cases
Because the internet and search engines are relatively recent creations, no solid legal framework for privacy protections in terms of search engines has been put in place. However, scholars do write about the implications of existing laws on privacy in general to inform what right to privacy search engine users have. As this is a developing field of law, there have been several lawsuits with respect to the privacy search engines are expected to afford to their users.

The Fourth Amendment
The Fourth Amendment is well known for the protections it offers citizens from unreasonable searches and seizures, but in Katz v. United States (1967), these protections were extended to cover intrusions of privacy of individuals, in addition to simply intrusion of property and people. Privacy of individuals is a broad term, but it is not hard to imagine that it includes the online privacy of an individual.

The Sixth Amendment
The Confrontation Clause of the Sixth Amendment is applicable to the protection of big data from government surveillance. The Confrontation Clause essentially states that defendants in criminal cases have the right to confront witnesses who provide testimonial statements. If a search engine company like Google gives information to the government to prosecute a case, these witnesses are the Google employees involved in the process of selecting which data to hand over to the government. The specific employees who must be available to be confronted under the Confrontation Clause are the producer who decides what data is relevant and provides the government with what they've asked for, the Google analyst who certifies the proper collection and transmission of data, and the custodian who keeps records. The data these employees of Google curate for trial use is then thought of as testimonial statement. The overall effectiveness of the Confrontation Clause on search engine privacy is that it places a check on how the government can use big data and provides defendants with protection from human error.

Katz v. United States
This 1967 case is prominent because it established a new interpretation of privacy under the Fourth Amendment, specifically that people have a reasonable expectation of it. Katz v. United States was about whether or not it was constitutional for the government to listen to and record, using an electronic listening device attached to the outside of a public phone booth, a conversation Katz had from that booth. The court ruled that it did violate the Fourth Amendment because the actions of the government were considered a "search" and that the government needed a warrant. When thinking about search engine data collected about users, the way telephone communications were classified under Katz v. United States could be a precedent for how it should be handled. In Katz v. United States, public telephones were deemed to have a "vital role" in private communications. It can be argued that the internet and search engines now have this vital role in private communications, and that people's search queries and IP addresses are analogous to the private phone calls placed from public booths.

United States v. Miller
This 1976 Supreme Court case is relevant to search engine privacy because the court ruled that when third parties gathered or had information given to them, the Fourth Amendment was not applicable. Jayni Foley argues that the ruling of United States v. Miller implies that people cannot have an expectation of privacy when they provide information to third parties. When thinking about search engine privacy, this is important because people willingly provide search engines with information in the form of their search queries and various other data points that they may not realize are being collected.

Smith v. Maryland
In the Supreme Court case Smith v. Maryland of 1979, the Supreme Court went off the precedent set in the 1976 United States v. Miller case about assumption of risk. The court ruled that the Fourth Amendment did not prevent the government from monitoring who dialed which phone numbers by using a pen register because it did not qualify as a "search".

Both the United States v. Miller and the Smith v. Maryland cases have been used to deny users Fourth Amendment privacy protections for the records that internet service providers (ISPs) keep. This is also articulated in the Sixth Circuit Guest v. Leis case as well as the United States v. Kennedy case, where the courts ruled that Fourth Amendment protections did not apply to ISP customer data since customers willingly provided ISPs with their information just by using the services of ISPs. Similarly, the current legal structure regarding privacy and assumption of risk can be interpreted to mean that users of search engines cannot expect privacy in regard to the data they communicate by using search engines.

Electronic Communications Privacy Act
The Electronic Communications Privacy Act (ECPA) of 1986 was passed by Congress in an effort to start creating a legal structure for privacy protections in the face of new forms of technologies, although it was by no means comprehensive because there are considerations for current technologies that Congress never imagined in 1986 and could not account for. The ECPA does little to regulate ISPs and mainly prevents government agencies from gathering information stored by ISPs without a warrant. What the ECPA does not do, unsurprisingly because it was enacted before internet usage became a common occurrence, is say anything about search engine privacy and the protections users are afforded in terms of their search queries.

Gonzales v. Google Inc.
The background of this 2006 case is that the government was trying to bolster its defense for the Child Online Protection Act (COPA). It was doing a study to see how effective its filtering software was in regards to child pornography. To do this, the government subpoenaed search data from Google, AOL, Yahoo!, and Microsoft to use in its analysis and to show that people search for information that is potentially compromising to children. The search data that the government wanted included both the URLs that appeared to users and the actual search queries of users. Of the search engines the government subpoenaed to produce search queries and URLs, only Google refused to comply, even after the request was reduced in size. Google claimed that handing over these logs would amount to handing over personally identifiable information and user identities. The court ruled that Google had to hand over 50,000 randomly selected URLs to the government but not search queries, because that could sow public distrust of the company and therefore compromise its business.

Law of Confidentiality
While not a strictly defined law enacted by Congress, the Law of Confidentiality is common law that protects information shared by a party who has trust and an expectation of privacy from the party they share the information with. If the content of search queries and the logs they are stored in is thought of in the same manner as information shared with a physician, as it is similarly confidential, then it ought to be afforded the same privacy protections.

Google Spain v. AEPD
The European Court of Justice ruled in 2014, in the Google Spain SL v. Agencia Española de Protección de Datos case, that European Union citizens have a "right to be forgotten". The decision did not create a right to demand that search engines wipe all data collected on a person; rather, the court interpreted existing law to mean that people had the right to request that some information about them be removed from search results provided by search engine companies like Google. The background of this case is that one Spanish citizen, Mario Costeja Gonzalez, set out to erase himself from Google's search results because they revealed potentially compromising information about his past debts. In the ruling in favor of Mario Costeja Gonzalez, the court noted that search engines can significantly impact the privacy rights of many people and that Google controlled the dissemination of personal data. This court decision did not claim that all citizens should be able to request that information about them be completely wiped from Google at any time, but rather that there are specific types of information, particularly information that is obstructing one's right to be forgotten, that do not need to be so easily accessible on search engines.

General Data Protection Regulation (GDPR)
The GDPR is a European regulation that was put in place to protect data and provide privacy to European citizens, regardless of whether they are physically in the European Union. This means that companies and organizations around the globe have had to comply with its rules so that any European citizen whose data they handle is afforded the proper protections. The regulation became enforceable in May 2018.