DNA encryption

DNA encryption is the process of hiding or obfuscating genetic information by a computational method in order to improve genetic privacy in DNA sequencing processes. The human genome is long and complex, but important, and identifying, information can be interpreted from small variations without reading the entire genome. A whole human genome is a string of 3.2 billion base-paired nucleotides, the building blocks of life, but individuals differ from one another in only 0.5% of it, an important 0.5% that accounts for all of human diversity, the pathology of different diseases, and ancestral history. Emerging strategies incorporate different methods, such as randomization algorithms and cryptographic approaches, to de-identify the genetic sequence from the individual and, fundamentally, to isolate only the necessary information while protecting the rest of the genome from unnecessary inquiry. The priority now is to ascertain which methods are robust, and how policy should ensure the ongoing protection of genetic privacy.

History
In 2003, the National Human Genome Research Institute and its affiliated partners successfully sequenced the first whole human genome, a project that cost just under $3 billion to complete. Four years later, James Watson, one of the co-discoverers of the structure of DNA, was able to sequence his genome for less than $1.5 million. As genetic sequencing technologies have proliferated, been streamlined, and been adapted to clinical use, they can now provide incredible insight into individual genetic identities at a much lower cost, with biotech competitors vying for the title of the $1,000 genome. Genetic material can now be extracted from a person's saliva, hair, skin, blood, or other sources, then sequenced, digitized, stored, and used for numerous purposes. Whenever data is digitized and stored, there is the possibility of privacy breaches. While modern whole genome sequencing technology has allowed for unprecedented access to and understanding of the human genome, and excitement about the potential of personalized medicine, it has also generated serious conversation about the ethics and privacy risks that accompany this process of uncovering an individual's essential instructions of being: their DNA sequence.

Research
Genetic sequencing is a pivotal component of producing scientific knowledge about disease origins, disease prevention, and developing meaningful therapeutic interventions. Much research uses large-group DNA samples or aggregate genome-wide datasets to compare and identify genes associated with particular diseases or phenotypes; therefore, there is much opposition to restricting access to genome databases and much support for fortifying such wide-scale research. For example, if an informed consent clause were enforced for all genetics research, existing genetic databases could not be reused for new studies: either all datasets would need to be destroyed at the end of every study, or all participants would need to re-authorize permissions with each new study. As genetic datasets can be extrapolated to closely related family members, this adds another dimension of required consent to the research process. This raises the fundamental question of whether these restrictions are necessary privacy protections or a hindrance to scientific progress.

Clinical Use
In medicine, genetic sequencing is important not only for traditional uses, such as paternity tests, but also for easing diagnosis and treatment. Personalized medicine has been heralded as the future of healthcare, as whole genome sequencing has provided the possibility of personalizing treatment to an individual's expression and experience of disease. As pharmacology and drug development are based on population studies, current treatments are normalized to whole-population statistics, which might reduce treatment efficacy for individuals, as everyone's response to a disease and to drug therapy is uniquely bound to their genetic predispositions. Already, genetic sequencing has expedited prognostic counseling in monogenic diseases that require rapid, differential diagnosis in neonatal care. However, the often blurred distinction between medical usage and research usage can complicate how privacy between these two realms is handled, as they often require different levels of consent and leverage different policies.

Commercial Use
Even in the consumer market, people have flocked to Ancestry.com and 23andMe to discover their heritage and elucidate their genotypes. As the nature of consumer transactions allows these electronic click-wrap models to bypass the traditional forms of consent used in research and healthcare, consumers may not completely comprehend the implications of having their genetic sequence digitized and stored. Furthermore, corporate privacy policies often operate outside the realm of federal jurisdiction, exposing consumers to informational risks, both in terms of their genetic privacy and their self-disclosed consumer profile, including self-disclosed family history, health status, race, ethnicity, social networks, and much more. Simply having databases invites potential privacy risks, as data storage inherently entails the possibility of data breaches and governmental solicitation of datasets. 23andMe has already received four requests from the Federal Bureau of Investigation (FBI) to access consumer datasets, and although those requests were denied, this reveals a conundrum similar to the FBI–Apple encryption dispute.

Forensic Use
DNA information can be used to solve criminal cases by establishing a match between a known suspect in a particular crime and an unknown suspect in an unsolved crime. However, DNA matching carries a certain expected probability of error, so DNA information should not be treated as entirely reliable evidence on its own.

Policy
As an individual's genomic sequence can reveal telling medical information about that individual and their family members, privacy proponents believe that certain protections should be in place to protect the privacy and identity of the individual. The major concern voiced is possible discrimination by insurance companies or employers. There have been instances in which genetic discrimination has occurred, often revealing how science can be misinterpreted by non-experts. In 1970, African-Americans were denied insurance coverage or charged higher premiums because they were known carriers of sickle-cell anemia; yet carriers do not have any medical problems themselves, and carrier status actually confers resistance against malaria. The legitimacy of these policies has been challenged by scientists who condemn this attitude of genetic determinism, the belief that genotype wholly determines phenotype. Environmental factors, differential development patterns, and the field of epigenetics all indicate that gene expression is much more complex, and that genes are neither a diagnosis nor a reliable predictor of an individual's medical future.

Early legislation has emerged in response to genetic exceptionalism, the heightened scrutiny expected of genomics research; one example is the 2008 Genetic Information Nondiscrimination Act (GINA) in the United States. In many cases, however, the scope and accountability of formal legislation are uncertain, as the science seems to be proceeding at a much more rapid pace than the law, and specialized ethics committees have had to fill this niche. Much of the criticism targets how policy fundamentally lacks an understanding of the technical issues involved in genome sequencing and fails to address the fact that, in the event of a data breach, an individual's personal genome cannot be replaced, complicating privacy protection even further. As computational genomics is such a technical field, translating expert language into policy is difficult, let alone into lay language. This presents a barrier to public understanding of the capabilities of current genomic sequencing technologies, which ultimately makes the discourse about protecting genetic privacy without impeding scientific advancement even more difficult to have.

Across the world, each country has unique healthcare and research frameworks that produce different policy needs – genetic privacy policy is further complicated when considering international collaborations on genetic research or international biobanks, databases that store biological samples and DNA information. Furthermore, research and healthcare are not the only fields that require formal jurisdiction; other areas of concern include the genetic privacy of those in the criminal justice system and those who engage with private consumer-based genomic sequencing.

England and Wales
In the National Criminal Intelligence DNA Database (NDNAD), the largest forensic DNA database in the world, 91% of the DNA information comes from residents of England and Wales. The NDNAD stores genetic information of criminally convicted individuals, those who were charged with but acquitted of a recordable offense, those who were arrested but never charged with a recordable offense, and those who are under counterterrorism control. Of the 5.5 million people in the database, which represents 10% of the total population, 1.2 million have never been convicted of a crime. The European Court of Human Rights decided, in the case of S and Marper v United Kingdom (2008), that the government must present sufficient justification for differential treatment of the DNA profiles of those in the criminal justice system compared to those of non-convicted individuals; essentially, there must be no abuse of retained biological materials and DNA information. The decision highlighted several issues with the current system that pose privacy risks for the individuals involved: the storage of personal information with genetic information, the storage of DNA profiles with the inherent capacity to determine genetic relationships, and, fundamentally, the very act of storing cellular samples and DNA profiles, which produces opportunities for privacy risks. As a result, the Protection of Freedoms Act 2012 was created to ensure proper use of collected DNA materials and to regulate their storage and destruction. However, many problems persist, as samples can still be retained indefinitely in databases regardless of whether or not the affected individual was convicted, and this includes the samples of juvenile delinquents. Critics have argued that this long-term retention could stigmatize affected individuals and inhibit their re-integration into society, and that retained samples are subject to misuse through discriminatory behavior innate to the criminal justice system.

Germany
In 1990, the Federal Supreme Court of Germany and the Federal Constitutional Court of Germany decided that sections of the German Code of Criminal Procedure provided a justifiable legal basis for the use of genetic fingerprinting in identifying criminals and absolving innocents. The decisions, however, lacked specific details on how biological materials can be obtained and how genetic fingerprinting can be used; only regulations on blood tests and physical examinations were explicitly outlined. In 1998, the German Parliament authorized the establishment of a national DNA database, due to mounting pressure to prevent cases of sexual abuse and homicide involving children. In 2001, the Federal Constitutional Court deemed this decision constitutional and supported by a compelling public interest, despite some criticism that the right of informational self-determination was violated. The court did mandate that DNA information and samples must be supported by evidence that the individual may commit a similar crime in the future. To address the legal uncertainty, the Act on Forensic DNA Analysis of 2005 introduced provisions that set out exact and limited legal grounds for the use of DNA-based information in criminal proceedings. Some sections stipulate that DNA samples may be used only if they are necessary to accelerate the investigation or eliminate suspects, and that a court must order genetic fingerprinting. Since the act's implementation, 8,000 new sets have been added to the database each month, calling into question the necessity of such wide-scale data collection and whether the wording of the provisions provides effective privacy protection. A recent controversial decision by the German government expanded the range of familial searching by DNA dragnet to identify genetic relatives of sexual and violent perpetrators, an action that the Federal Supreme Court of Germany had deemed to have no legal basis in 2012.

South Korea
The National Forensic Service of South Korea and the Public Prosecution Authority of South Korea established separate DNA analysis departments in 1991, despite initial public criticism that the data collection was enacted without considering the informational privacy of the subjects involved, a criticism that turned to support after a series of high-profile cases. In 2006, a bill proposed by the General Assembly on the collection and operationalization of DNA information outlined crime categories for the storage, control, and destruction of DNA samples and DNA information. However, the bill failed to pass, as it could not translate into any significant change in actual practice. The crime categories it included were not comprehensive and applied only to obtaining biological information without an individual's consent, and the protocol for destroying collected samples was unclear, exposing the samples to misuse.

The DNA Information Act of 2009 attempted to resolve these weaknesses, including provisions stating that biologically sensitive information may be collected only from convicted individuals, confined suspects, and crime scenes. Genetic fingerprinting was made permissible for specific crimes, including arson, murder, kidnapping, rape or sexual molestation, trespass upon a residence at night for stealing, larceny, burglary, and numerous other violent crimes. The act also required a written warrant for acquiring samples from convicted criminals or suspects if the individuals concerned do not give written consent. All samples must be destroyed in a timely manner if the individual concerned is proclaimed innocent or acquitted, if their prosecution is dismissed, or upon their death. Importantly, if collected samples are used to identify individuals at the crime scene, the DNA information must be destroyed upon successful identification. However, the legislation still has several flaws and draws criticism regarding how it clarifies the presumption of innocence, the weak enforcement of sample destruction (only 2.03% of samples are deleted annually), and the weak enforcement of the written warrant requirement (99.6% of samples are obtained without a warrant). There is still much debate about whether this legislation violates the right of informational self-determination.

United States
In the United States, biobanks are primarily under the jurisdiction of the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule and the Federal Policy for Protection of Human Subjects (Common Rule). As neither of these rules was conceived with the intention of regulating biobanks, and as regulation is decentralized across several levels, there have been many challenges in their application and enforcement. Federal law also fails to directly tackle international policy and how data can be shared outside of the EU-US Safe Harbor Agreement. An area that needs clarification is how federal and state laws are differentially and specifically applied to different biobanks, researchers, or projects, a situation further complicated by the fact that most biobanks are part of larger entities, or collaborate with other institutions, blurring the line between public and private interests. About 80% of all biobanks have internal oversight boards that regulate data collection, usage, and distribution. There are three basic access models for biobank samples and data: open access (unrestricted to anyone), tiered access (some restrictions on access, dependent on the nature of the project), and controlled access (tightly controlled access).

GINA provisions prohibit health insurers from requiring genetic testing or requesting genetic information for enrollment purposes, and prohibit employers from requesting genetic testing or genetic information for any type of employment assessment (hiring, promotion, termination). However, insurers can request genetic information to determine coverage of a specific procedure. Some groups are also excluded from following GINA's provisions, including insurers and employers of federal government employees, the military, and employers with fewer than 15 employees.

China
China has a widespread network of hospitals and research institutes. It is currently carrying out a plan to create a more cohesive framework for data sharing among existing biobanks, which were previously under the jurisdiction of overlapping and confusing regulatory laws. Many biobanks operate independently or within larger networks, the most prominent being the Shanghai Biobank network. Under this main network, guidelines detail specific de-identification policies and explicitly endorse broad consent. Recently, the Chinese Constitution has formally recognized individual privacy as a distinct and independent constitutional right, and legislators have therefore begun developing a Draft Ordinance on Human Genetics Resources to organize national laws on biobanking management measures, legal liability, and punishment for violations. International data sharing will be even more strictly regulated under these national laws.

Australia
Biobanks in Australia are regulated mainly by healthcare privacy guidelines and human research ethics committees. No formal biobank legislation exists, but international data sharing is widely permitted. The National Health and Medical Research Council (NHMRC) develops guidelines for and funds many of these institutions. There is ongoing discussion about adopting broad consent for biobanking.

Consumer Genetic Testing
The Electronic Frontier Foundation, a privacy advocate, found that existing legislation does not have formal jurisdiction in ensuring consumer privacy where DNA information is concerned. Genetic information stored by consumer businesses is not protected by HIPAA; therefore, these companies can share genetic information with third parties, subject to the conditions of their own privacy statements. Most genetic testing companies share only anonymized, aggregated data, with users' consent. Ancestry.com and 23andMe do sell such data to research institutions and other organizations, and can ask for case-by-case consent to release non-anonymized data to other parties, including employers or insurers. 23andMe even issues a warning that re-identification is possible. If a consumer explicitly refuses research use or requests that their data be destroyed, 23andMe is still allowed to use their identifying and behavioral consumer information, such as browsing patterns and geographical location, for other marketing services.

Areas of Concern
Many computational experts have developed, and are developing, more secure systems of genomic sequencing to protect the field from misguided jurisdiction and wrongful application of genetic data and, above all, to protect the genetic privacy of individuals. Privacy-preserving technologies are currently being developed in four major areas of genetics research:

String searching and comparison
Paternity tests, genetic compatibility tests, and ancestry testing are all types of medical tools that rely on string searching and comparison algorithms. Simply, this is a needle-in-a-haystack approach, in which a dataset is searched for a matching “string”, the sequence or pattern of interest. As these types of testing have become more common, and adapted to consumer genomic models, such as smartphone apps or trendy DNA tests, current privacy securing methods are focused on fortifying this process and protecting both healthcare and private usage.
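As a point of reference, the needle-in-a-haystack search described above can be sketched in its plain, non-private form. This is only an illustrative baseline: the function name find_marker and the toy sequences are assumptions made for this example, and the privacy-preserving methods discussed in this article aim to obtain the same answer without exposing the full sequence.

```python
def find_marker(genome: str, marker: str) -> list:
    """Return every 0-based start position where `marker` occurs in `genome`."""
    positions = []
    start = genome.find(marker)
    while start != -1:
        positions.append(start)
        # Advance by one base so that overlapping occurrences are also reported.
        start = genome.find(marker, start + 1)
    return positions


# Toy data (hypothetical): a short "genome" and a 4-base pattern of interest.
genome = "ACGTTAGACGTAGGACGT"
print(find_marker(genome, "ACGT"))  # [0, 7, 14]
```

In this unprotected form, whoever runs the search sees both the whole sequence and the pattern, which is exactly the exposure that secure string searching tries to remove.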

Aggregate data release
The modern age of big data and large scale genomic testing necessitates processing systems that minimize privacy risks when releasing aggregate genomic data, which essentially means ensuring that individual data cannot be discerned within a genomic database. This differential privacy approach is a simple evaluation of the security of a genomic database and many researchers provide "checks" on the stringency of existing infrastructures.

Alignment of raw genomic data
One of the most important developments in the field of genomics is the capacity for read mapping, in which millions of short sequences can be aligned to a reference DNA sequence in order to process large datasets efficiently. As this high-capacity process is often divided up between public and private computing environments, there is a lot of associated risk and stages where genetic privacy is particularly vulnerable; therefore, current studies focus on how to provide secure operations within two different data domains without sacrificing efficiency and accuracy.

Clinical use
With the advent of high-throughput genomic technology allowing unprecedented access to genetic information, personalized medicine is gaining momentum as the promised future of healthcare, making secure genomic testing models imperative for the progress of medicine. In particular, concerns have been voiced about how this process will involve multiparty engagement and access to data. The distinction between genetic sequencing for medical purposes and for research purposes is a contentious one; furthermore, any time healthcare is involved in a discussion, the dimension of patient privacy must be considered, as it may conflict with or complement genetic privacy.

Secure read mapping
Sensitive read mapping is essential to genomics research, as read mapping is important not only for DNA sequencing but also for identifying target regulatory molecules in RNA-Seq. One proposed solution splits read mapping into two tasks on a hybrid computing operation: the exact matching of reads using keyed hash values can be conducted on a public cloud, and the alignment of reads can be conducted on a private cloud. As only keyed hash values are exposed to public scrutiny, the privacy of the original sequence is preserved. However, as alignment processes tend to be high volume and work intensive, most sequencing schemes still functionally require third-party computing operations, which reintroduce privacy risks in the public cloud domain.
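The keyed-hash idea can be illustrated with a minimal sketch. It assumes HMAC-SHA256 as the keyed hash and fixed-length substrings ("seeds") as the unit of exact matching; the key, the seed length K, the sequences, and all function names are hypothetical choices for this example and are not taken from any specific published scheme.

```python
import hashlib
import hmac


def keyed_seed_hashes(sequence: str, k: int, key: bytes) -> list:
    """HMAC-SHA256 digest of every length-k substring (seed) of `sequence`."""
    return [
        hmac.new(key, sequence[i:i + k].encode("ascii"), hashlib.sha256).digest()
        for i in range(len(sequence) - k + 1)
    ]


def public_exact_match(read_hashes: list, reference_index: dict) -> list:
    """'Public' side: sees only digests, returns (read_offset, ref_pos) pairs."""
    hits = []
    for offset, digest in enumerate(read_hashes):
        for ref_pos in reference_index.get(digest, []):
            hits.append((offset, ref_pos))
    return hits


# Hypothetical values, held only on the private side.
SECRET_KEY = b"hypothetical-private-key"
K = 4
reference = "TTGACCGTAAGC"
read = "CCGTAA"

# Private side: hash the reference seeds and the read seeds with the same key.
reference_index = {}
for pos, digest in enumerate(keyed_seed_hashes(reference, K, SECRET_KEY)):
    reference_index.setdefault(digest, []).append(pos)
read_hashes = keyed_seed_hashes(read, K, SECRET_KEY)

# Public side: exact matching on digests only.
hits = public_exact_match(read_hashes, reference_index)

# Private side: each hit suggests a candidate alignment start in the reference.
candidate_starts = sorted({ref_pos - offset for offset, ref_pos in hits})
print(candidate_starts)  # [4]
```

Without the key, the public side cannot recompute digests for sequences of its own choosing, which is what distinguishes a keyed hash from a plain one in this setting. The candidate positions would then be handed to the private side for the actual alignment.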

Secure string searching
Numerous genetic screening tests rely on string searching and have become commonplace in healthcare; therefore, the privacy of such methodologies has been an important area of development. One protocol hides the position and size of partial substrings, allowing one party (the researcher or physician) with the digitized genome and a second party (the research subject or patient) with sole ownership of his or her DNA marker to conduct secure genetic tests. Only the researcher or the physician learns the conclusion of the string searching and comparison scheme, and neither party can access other information, ensuring privacy preservation.

Secure genome query
The basis of personalized medicine and preventative healthcare is establishing genetic compatibility by comparing an individual's genome against known variations to estimate susceptibility to diseases, such as breast cancer or diabetes, to evaluate pharmacogenomics, and to query biological relationships among individuals. For disease risk tests, studies have proposed a privacy-preserving technique that uses homomorphic encryption and secure integer comparison, and suggests storing and processing sensitive data in an encrypted form. To ensure privacy, the storage and processing unit (SPU) stores all the single-nucleotide polymorphisms (SNPs) as real SNPs, that is, the SNPs observed in the patient, together with redundant content from the set of potential SNPs. Another solution developed three protocols to securely calculate edit distance using intersections of Yao's Garbled Circuit and a banded alignment algorithm. The major drawback of this solution is its inability to perform large-scale computations while retaining accuracy.
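To make the banded alignment component concrete, the following sketch computes edit distance while filling in only the cells of the dynamic-programming table that lie within a fixed distance of the diagonal. It runs in the clear and does not implement garbled circuits or any other cryptographic protection; the function name and the band parameter are assumptions made for this illustration.

```python
from typing import Optional


def banded_edit_distance(a: str, b: str, band: int) -> Optional[int]:
    """Edit distance using only table cells with |i - j| <= band.

    Returns None when the lengths differ by more than `band`, because the
    final cell then lies outside the band. The result is exact whenever the
    true edit distance is at most `band`; otherwise it is an upper bound.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    # Row 0: turning the empty prefix of `a` into b[:j] costs j insertions.
    prev = [j if j <= band else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        if i <= band:
            curr[0] = i  # deleting the first i characters of `a`
        for j in range(max(1, i - band), min(m, i + band) + 1):
            substitution = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[m]


print(banded_edit_distance("GATTACA", "GACTATA", band=2))  # 2
print(banded_edit_distance("kitten", "sitting", band=3))   # 3
```

Restricting the computation to a band reduces the number of cells from roughly n × m to roughly n × (2 × band + 1), which matters when every cell has to be evaluated under a costly secure protocol. The price is that distances larger than the band are no longer guaranteed to be exact.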

Secure genome-wide association studies
Genome-wide association studies (GWAS) are important in locating specific variations in genome sequences that lead to disease. Privacy preserving algorithms that identify SNPs significantly associated with diseases are based on introducing random noise to aggregate statistics to protect individual privacy. In another study, the nature of linkage disequilibrium is utilized in selecting the most useful datasets while maximizing protection of patient privacy with injected noise; however, it may lack effective disease association capabilities. Critics of these methods note that a substantial amount of noise is required to satisfy differential privacy for a small ratio of SNPs, an impracticality in conducting efficient research.
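The noise-injection idea can be sketched with the Laplace mechanism, a standard way of achieving differential privacy for numeric statistics. The sketch assumes the released statistic is a table of minor-allele counts, that each individual contributes at most two alleles per SNP, and that datasets are compared by adding or removing one individual; the SNP identifiers, the counts, and the function names are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_allele_counts(counts: dict, epsilon: float, rng: random.Random) -> dict:
    """Release per-SNP minor-allele counts with Laplace noise added.

    Assumption: one individual changes each SNP count by at most 2, so the
    L1 sensitivity of the whole table is 2 * len(counts).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 2 * len(counts)
    scale = sensitivity / epsilon
    return {snp: count + laplace_noise(scale, rng) for snp, count in counts.items()}


# Hypothetical SNP identifiers and counts from a made-up cohort.
true_counts = {"rs0001": 412, "rs0002": 97, "rs0003": 260}
rng = random.Random(2024)  # seeded only to make the illustration repeatable
noisy_counts = private_allele_counts(true_counts, epsilon=1.0, rng=rng)
for snp, value in noisy_counts.items():
    print(snp, round(value, 1))
```

Because the noise scale grows with the number of SNPs released, publishing statistics for many SNPs at a fixed privacy budget quickly drowns out the signal, which is the practical objection raised by critics of these methods.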

Authenticated encryption storage
The nature of genomic sequences requires a specific encryption tool to protect against low-complexity (repetitive content) attacks and known-plaintext attacks (KPA), given several expected symbols. Cryfa uses packing (reducing the storage size), a shuffling mechanism (randomizing the symbol positions), and the Advanced Encryption Standard (AES) cipher to securely store FASTA, FASTQ, VCF, SAM and BAM files with authenticated encryption.
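The packing and shuffling stages can be illustrated with a toy sketch. It is not Cryfa's implementation: it assumes a sequence containing only A, C, G and T, packs four bases into each byte, and permutes the packed bytes with a seeded pseudorandom generator. The AES stage is omitted because the Python standard library has no AES, and random.Random is not cryptographically secure, so this example offers no real protection by itself. All names and the seed value are hypothetical.

```python
import random

BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack 4 bases per byte, 2 bits each (assumes only A/C/G/T)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASES.index(base)
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Inverse of pack(); `length` is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def shuffle(data: bytes, seed: int) -> bytes:
    """Permute byte positions with a permutation derived from `seed`."""
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    return bytes(data[i] for i in order)


def unshuffle(data: bytes, seed: int) -> bytes:
    """Undo shuffle() by regenerating the same permutation."""
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    out = bytearray(len(data))
    for dst, src in enumerate(order):
        out[src] = data[dst]
    return bytes(out)


seq = "ACGTACGGTTAACC"
packed = pack(seq)                       # 14 bases fit in 4 bytes
scrambled = shuffle(packed, seed=2024)   # hypothetical seed
restored = unpack(unshuffle(scrambled, seed=2024), len(seq))
print(len(seq), len(packed), restored == seq)  # 14 4 True
```

Packing removes the redundancy of storing one base per character, and shuffling breaks up the repetitive runs and predictable positions that low-complexity and known-plaintext attacks rely on. In a real tool, a cipher such as AES in an authenticated mode would then be applied to the result.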