Ancient protein

Ancient proteins are complex mixtures, and the term palaeoproteomics is used to describe the study of proteomes from the past. Ancient proteins have been recovered from a wide range of archaeological materials, including bones, teeth, eggshells, leathers, parchments, ceramics, painting binders and well-preserved soft tissues such as intestines. These preserved proteins have provided valuable information about taxonomic identification, evolutionary history (phylogeny), diet, health, disease, technology and social dynamics in the past.

Like modern proteomics, the study of ancient proteins has been enabled by technological advances. Various analytical techniques, for example amino acid profiling, racemisation dating, immunodetection, Edman sequencing, peptide mass fingerprinting and tandem mass spectrometry, have been used to analyse ancient proteins. The introduction of high-performance mass spectrometry (for example, Orbitrap) in 2000 revolutionised the field, since it allowed the entire preserved sequences of complex proteomes to be characterised.

Over the past decade, the study of ancient proteins has evolved into a well-established field in archaeological science. However, like research on aDNA (ancient DNA preserved in archaeological remains), it has been limited by several challenges, such as the coverage of reference databases, identification, contamination and authentication. Researchers have been working on standardising sampling, extraction, data analysis and reporting for ancient proteins. Novel computational tools such as de novo sequencing and open search may also improve the identification of ancient proteomes.

Philip Abelson, Edgar Hare and Thomas Hoering
Abelson, Hare and Hoering led the study of ancient proteins between the 1950s and the early 1970s. Abelson directed the Geophysical Laboratory at the Carnegie Institution (Washington, DC) between 1953 and 1971, and he was the first to discover amino acids in fossils. Hare joined the team and specialised in amino acid racemisation (the conversion of L- to D-amino acids after the death of organisms). D/L ratios were used to date various ancient materials such as bones, shells and marine sediments. Hoering was another prominent member, contributing to advances in isotope analysis and mass spectrometry. This golden trio drew many talented biologists, geologists, chemists and physicists to the field, including Marilyn Fogel, John Hedges and Noreen Tuross.

Ralph Wyckoff
Wyckoff was a pioneer in X-ray crystallography and electron microscopy. Using microscopic images, he demonstrated the variability and damage of collagen fibres in ancient bones and shells. His research contributed to the understanding of protein diagenesis (degradation) in the late 1960s, and highlighted that ancient amino acid profiles alone might not be sufficient for protein identification.

Margaret Jope and Peter Westbroek
Jope and Westbroek were leading experts in shell proteins and crystallisation. Westbroek later established the Geobiochemistry laboratory at the University of Leiden, focusing on biomineralisation and how this process facilitated protein survival. He also pioneered the use of antibodies for the study of ancient proteins in the 1970s and 1980s, utilising different immunological techniques such as Ouchterlony double immunodiffusion (interactions of antibodies and antigens in a gel).

Peggy Ostrom
Ostrom has championed the use of mass spectrometry since the 1990s. She was the first to improve the sequence coverage of ancient proteins by combining different techniques such as peptide mass fingerprinting and liquid chromatography-tandem mass spectrometry (LC-MS/MS).

Formation & incorporation
Understanding how ancient proteins are formed and incorporated into archaeological materials is essential for sampling, evaluating contamination and planning analyses. Generally, ancient proteins in proteinaceous tissues, notably collagens in bones, keratins in wool, amelogenins in tooth enamel and intracrystalline proteins in shells, are likely to have been incorporated at the time of tissue formation. However, the formation of proteinaceous tissues is often complex, dynamic and affected by various factors such as pH, metals, ion concentration and diet, plus other biological, chemical and physical parameters. One of the best-characterised phenomena is bone mineralisation, a process by which hydroxyapatite crystals are deposited within collagen fibres, forming a matrix. Despite extensive research, bone scaffolding is still a challenge, and the role of non-collagenous proteins (a wide range of proteoglycans and other proteins) remains poorly understood.

Another category is complex and potentially mineralised tissues, such as ancient human dental calculi and ceramic vessels. Dental calculi are defined as calcified biofilms, created and mediated by interactions between calcium phosphate ions and a wide range of oral microbial, human, and food proteins during episodic biomineralisation. Similarly, the minerals of a ceramic matrix might interact with food proteins during food processing and cooking. This is best explained by calcite deposits adhering to the inside of archaeological ceramic vessels. These protein-rich mineralised deposits might be formed during repeated cooking using hard water and subsequent scaling.

Preservation
Organic (carbon-containing) biomolecules like proteins are prone to degradation. For example, experimental studies demonstrate that even robust, fibrous and hydrophobic keratins, such as those in feathers and woollen fabrics, decay quickly at room temperature. Indeed, the survival of ancient proteins is exceptional, and they are often recovered from extreme burial contexts, especially dry and cold environments. This is because the lack of water and low temperature may slow down hydrolysis, microbial attack and enzymatic activities.

There are also proteins whose chemical and physical properties may enable their preservation in the long term. The best example is Type 1 collagen; it is one of the most abundant proteins in skin (80-85%) and bone (80-90%) extracellular matrices. It is also mineralised, organised in a triple helix and stabilised by hydrogen bonding. These characteristics may contribute to its stability over time, and Type 1 collagen has been routinely extracted from ancient bones, leathers and parchments. Another common protein in the archaeological record is milk beta-lactoglobulin, often recovered from ancient dental calculi. Beta-lactoglobulin is a small whey protein with a molecular mass of around 18,400 Da (daltons). It is resistant to heating and enzymatic degradation; structurally, it has a beta-barrel associated with binding to small hydrophobic molecules such as fatty acids, forming stable polymers.

Given that proteins vary in abundance, size, hydrophobicity (water insolubility), structure, conformation (shape), function and stability, understanding protein preservation is challenging. While there are common determinants of protein survival, including thermal history (temperature/time), burial conditions (pH/soil chemistry/water table) and protein properties (neighbouring amino acids/secondary structure/tertiary folding/proteome content), there is no clear answer and protein diagenesis is still an active research field.

Structure & damage patterns
Generally, proteins have four levels of structural complexity: quaternary (multiple polypeptides, or subunits), tertiary (the 3D folding of a polypeptide), secondary (alpha helices/beta sheets/random coils) and primary structure (linear amino acid sequences linked by peptide bonds). Ancient proteins are expected to lose their structural integrity over time, due to denaturation (protein unfolding) or other diagenetic processes.

Ancient proteins also tend to be fragmented, damaged and altered. Proteins can be cleaved into small fragments over time, since hydrolysis (the addition of water) breaks peptide bonds (covalent bonds between two neighbouring alpha-amino acids). In terms of post-translational modifications (changes that occur after RNA translation), ancient proteins are often characterised by extensive damage such as oxidation (methionine), hydroxylation (proline), deamidation (glutamine/asparagine), citrullination (arginine), phosphorylation (serine/threonine/tyrosine), the conversion of N-terminal glutamate to pyroglutamate and the addition of advanced glycation products to lysine or arginine. Among these modifications, glutamine deamidation is one of the most time-dependent processes. Glutamine deamidation is mostly a non-enzymatic process, by which glutamine is converted to glutamic acid (+0.98402 Da) via side-chain hydrolysis or the formation of a glutarimide ring. It is a slow conversion with a long half-life, depending on adjacent amino acids, secondary structures, 3D folding, pH, temperature and other factors. Bioinformatic tools are available to calculate bulk and site-specific deamidation rates of ancient proteins.

The structural manifestation of these chemical changes within ancient proteins was first documented using scanning electron microscopy (SEM). Type 1 collagen fibrils of a permafrost-preserved woolly mammoth (Yukon, Canada) were directly imaged and shown to retain their characteristic banding pattern. These were compared against Type 1 collagen fibrils from a temperate Columbian mammoth specimen (Montana, U.S.A.). The Columbian mammoth collagen fibrils, unlike those of the permafrost-frozen woolly mammoth, had lost their banding, indicating substantial chemical degradation of the constituent peptide sequences. This was also the first time that collagen banding, or the molecular structure of any ancient protein, had been directly imaged with scanning electron microscopy.
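
As a rough illustration of the deamidation arithmetic described above, the following Python sketch derives the mass shift from monoisotopic atomic masses and estimates a bulk deamidation rate from a list of identified peptides. The peptide notation (a lowercase `q` marking a deamidated glutamine), the function name and the example peptides are assumptions made for this example; they do not reproduce the conventions of any particular bioinformatic tool.

```python
# Monoisotopic atomic masses (Da)
MASS_O = 15.994915
MASS_N = 14.003074
MASS_H = 1.007825

# Deamidation replaces an amide NH2 with OH: net change +O, -N, -H
DEAMIDATION_SHIFT = MASS_O - MASS_N - MASS_H  # about +0.98402 Da


def bulk_deamidation(peptides):
    """Return the fraction of observed glutamine positions that are deamidated.

    Hypothetical notation: 'Q' = intact glutamine, 'q' = deamidated glutamine.
    Returns None when no glutamine position is observed at all.
    """
    intact = sum(p.count("Q") for p in peptides)
    deamidated = sum(p.count("q") for p in peptides)
    total = intact + deamidated
    if total == 0:
        return None
    return deamidated / total


# Made-up peptides, for illustration only
example_peptides = ["AqQ", "QQ"]
print(round(DEAMIDATION_SHIFT, 5))         # 0.98402
print(bulk_deamidation(example_peptides))  # 0.25
```

A site-specific rate could be obtained in the same way by grouping the counts by protein position instead of pooling all glutamines, although real tools typically also weight observations by spectral intensity.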

Overview
Palaeoproteomics is a fast-developing field that combines archaeology, biology, chemistry and heritage studies. As in its high-profile sister field, aDNA analysis, the extraction, identification and authentication of ancient proteins are challenging, since both ancient DNA and proteins tend to be ultrashort, highly fragmented, extensively damaged and chemically modified.

However, ancient proteins are still among the most informative biomolecules. Proteins, especially biomineralised proteins, tend to degrade more slowly than DNA. While ancient lipids can be used to differentiate between marine, plant and animal fats, ancient protein data offer higher resolution, with taxon and tissue specificity.

To date, ancient peptide sequences have been successfully extracted and securely characterised from various archaeological remains, including a 3.8 Ma (million-year-old) ostrich eggshell, 1.77 Ma Homo erectus teeth, a 0.16 Ma Denisovan jawbone and several Neolithic (6000-5600 cal BC) pots. Hence, palaeoproteomics has provided valuable insight into past evolutionary relationships, extinct species and societies.

Extraction
Generally, there are two approaches: a digestion-free, top-down method and bottom-up proteomics. Top-down proteomics is seldom used to analyse ancient proteins due to analytical and computational difficulties. In bottom-up, or shotgun, proteomics, ancient proteins are digested into peptides using enzymes, for example trypsin. Mineralised archaeological remains such as bones, teeth, shells, dental calculi and ceramics require an extra demineralisation step to release proteins from mineral matrices. This is often achieved by using a weak acid (ethylenediaminetetraacetic acid, EDTA) or cold (4 °C) hydrochloric acid (HCl) to minimise chemical modifications that may be introduced during extraction.

To make ancient proteins soluble, heat, sonication, chaotropic agents (urea/guanidine hydrochloride, GnHCl), detergents or other buffers can be used. Reduction and alkylation are often included to break the disulfide bonds of cysteine residues and prevent crosslinking.

After demineralisation, protein solubilisation, alkylation and reduction, buffer exchange is needed to ensure that extracts are compatible with downstream analysis. Currently, there are three widely used protocols for ancient proteins, based on gels (GASP), filters (FASP) and magnetic beads (SP3). Once buffer exchange is completed, extracts are incubated with digestion enzymes, then concentrated, purified and desalted.

For non-mineralised archaeological materials such as parchments, leathers and paintings, demineralisation is not necessary, and protocols can be changed depending on sample preservation and sampling size.
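
The enzymatic digestion step of the bottom-up workflow can be mimicked in silico, which is also how database search tools generate candidate peptides. The sketch below is a minimal illustration that assumes the commonly used simplified trypsin rule (cleavage after lysine or arginine, unless the next residue is proline); the function name, the `missed_cleavages` parameter and the example sequence are hypothetical.

```python
def trypsin_digest(sequence, missed_cleavages=0):
    """Split a protein sequence into tryptic peptides.

    Simplified rule (an assumption of this sketch): cleave C-terminal to
    K or R, except when the following residue is P.
    """
    # First pass: cut the sequence at every allowed cleavage site
    fragments = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal remainder

    # Second pass: join neighbouring fragments to model missed cleavages
    peptides = []
    for i in range(len(fragments)):
        for extra in range(missed_cleavages + 1):
            if i + extra < len(fragments):
                peptides.append("".join(fragments[i:i + extra + 1]))
    return peptides


# Made-up sequence, for illustration only
print(trypsin_digest("AKRPGKC"))     # ['AK', 'RPGK', 'C']
print(trypsin_digest("AKRPGKC", 1))  # ['AK', 'AKRPGK', 'RPGK', 'RPGKC', 'C']
```

Allowing one or two missed cleavages is a common choice for degraded samples, because digestion is rarely complete; for ancient proteins, semi-specific or non-specific searches may also be needed, since diagenetic hydrolysis does not follow enzyme rules.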

Instrumentation and data analysis
Nowadays, palaeoproteomics is dominated by two mass spectrometry-based techniques: MALDI-ToF (matrix-assisted laser desorption/ionisation-time-of-flight) and LC-MS/MS. MALDI-ToF is used to determine the mass-to-charge (m/z) ratios of ions and their peak patterns. Digested peptides are spotted on a MALDI plate and co-crystallised with a matrix (mainly α-cyano-4-hydroxycinnamic acid, CHCA). A laser excites the matrix, which desorbs and ionises the peptides; the time the resulting ions take to travel through a vacuum tube is then measured and converted to a spectrum of m/z ratios and intensities.
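
The conversion between flight time and m/z can be sketched with the idealised equation for a linear time-of-flight analyser, in which an ion accelerated through a potential U acquires kinetic energy zeU and then drifts a distance L, giving t = L·sqrt(m / (2zeU)). The accelerating voltage and drift length used below are assumed values for illustration, and the sketch ignores delayed extraction, reflectrons and initial velocity spread, all of which matter in real instruments.

```python
import math

E_CHARGE = 1.602176634e-19  # elementary charge, C
DALTON = 1.66053906660e-27  # unified atomic mass unit, kg


def flight_time(mz, voltage, length):
    """Ideal flight time (s) of an ion with the given m/z (Da per charge).

    voltage: accelerating potential in volts (assumed value in the example)
    length:  field-free drift length in metres (assumed value in the example)
    """
    mass_per_charge = mz * DALTON / E_CHARGE  # kg per coulomb
    return length * math.sqrt(mass_per_charge / (2 * voltage))


def mz_from_time(t, voltage, length):
    """Invert the ideal equation: m/z = 2 e U t^2 / L^2, expressed in Da."""
    return 2 * voltage * E_CHARGE * t ** 2 / (length ** 2 * DALTON)


# Assumed instrument settings: 20 kV acceleration, 1 m drift tube
t_light = flight_time(1000.0, 20000.0, 1.0)
t_heavy = flight_time(2000.0, 20000.0, 1.0)
print(t_heavy > t_light)  # True: heavier ions arrive later
```

Because flight time scales with the square root of m/z, doubling the mass increases the flight time by a factor of about 1.41 rather than 2.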

Since only peak patterns, not the entire amino acid sequences of digested peptides, are characterised, peptide markers are needed for pattern matching and ancient protein identification. In archaeological contexts, MALDI-ToF has been routinely used for bones and collagens in a field known as ZooMS (zooarchaeology by mass spectrometry).
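
The idea of matching observed peaks against peptide markers can be illustrated with a short Python sketch. The marker m/z values, taxon names and tolerance below are entirely hypothetical placeholders, not published ZooMS markers, and real workflows also consider peak intensity, isotope envelopes and deamidation shifts.

```python
def match_markers(observed_peaks, marker_table, tolerance=0.5):
    """Score each taxon by the fraction of its marker m/z values observed.

    observed_peaks: list of m/z values picked from a spectrum
    marker_table:   dict mapping taxon name -> list of marker m/z values
    tolerance:      absolute m/z window for a match (assumed value)
    """
    scores = {}
    for taxon, markers in marker_table.items():
        hits = [
            marker for marker in markers
            if any(abs(marker - peak) <= tolerance for peak in observed_peaks)
        ]
        scores[taxon] = len(hits) / len(markers)
    return scores


# Hypothetical markers and peaks, for illustration only
marker_table = {
    "taxon_A": [1000.5, 1500.8, 2000.1],
    "taxon_B": [1000.5, 1530.8, 2016.1],
}
observed = [1000.48, 1500.83, 2000.15, 1750.90]

scores = match_markers(observed, marker_table)
best = max(scores, key=scores.get)
print(best)  # taxon_A
```

Markers shared between taxa (such as the first value in this made-up table) carry no discriminating power on their own, which is why identifications usually rely on a panel of several markers.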

LC-MS/MS is another widely used approach. It is a powerful analytical technique to separate, sequence and quantify complex protein mixtures. The first step in LC-MS/MS is liquid chromatography. Protein mixtures are carried in a liquid mobile phase and separated on a column containing a stationary phase. How analytes interact with the stationary phase depends on their size, charge, hydrophobicity and affinity. These differences lead to distinct elution and retention times (when a component of a mixture exits the column). After chromatographic separation, protein components are ionised and introduced into the mass spectrometer. During a first mass scan (MS1), the m/z ratios of precursor ions are measured. Selected precursors are further fragmented and the m/z ratios of fragment ions are determined in a second mass scan (MS2). There are different fragmentation methods, for example higher-energy C-trap dissociation (HCD) and collision-induced dissociation (CID), but b- and y-ions are frequently targeted.
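
To make the MS2 step more concrete, the sketch below calculates the theoretical singly charged b- and y-ion m/z values of an unmodified peptide from standard monoisotopic residue masses. The function name and the example peptide are chosen for illustration; modifications, higher charge states and neutral losses are deliberately left out.

```python
# Standard monoisotopic residue masses (Da)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[k] is b(k+1), counted from the N-terminus;
    y_ions[k] is y(k+1), counted from the C-terminus.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b-ion: N-terminal residues plus one proton
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y-ion: C-terminal residues plus water plus one proton
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("PEPTIDE")
for k, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{k} = {b:.4f}    y{k} = {y:.4f}")
```

Search engines compare lists like these with the measured MS2 peaks; leucine and isoleucine have identical masses and therefore cannot be distinguished by this calculation.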

Search engines and software tools are often used to process ancient MS/MS data, including MaxQuant, Mascot and PEAKS. Protein sequence data can be downloaded from public databases (UniProt/NCBI) and exported as FASTA files for search algorithms. Recently, open search engines such as MetaMorpheus, pFind and FragPipe have received attention, because they make it possible to identify all modifications associated with peptide spectral matches (PSMs).
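
For readers unfamiliar with the FASTA format mentioned above, the following minimal sketch shows how such a file is structured and how it might be read into memory before a database search. The record names and sequences are made up, and the parser is a simplified illustration rather than the reader used by any of the tools named here.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of header -> sequence.

    A header line starts with '>'; the lines that follow, up to the next
    header, hold the sequence and are joined together.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up records, for illustration only
example = """>protein_1 hypothetical example
MKCLLL
ALALTC
>protein_2
GPPGPA
"""

database = parse_fasta(example)
print(database["protein_2"])  # GPPGPA
```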

De novo sequencing is also possible for the analysis of ancient MS/MS spectra. It is a sequencing technique that assembles amino acid sequences directly from spectra without reference databases. Advances in deep learning have also led to the development of multiple pipelines such as DeNovoGUI, DeepNovo2 and Casanovo. However, it may be challenging to evaluate de novo sequencing outputs, and optimisation may be required for ancient proteins to minimise false positives and overfitting.

Palaeoproteomes

 * Bones. Ancient bones are among the best-characterised and most iconic palaeoproteomes. Ancient bone proteomes have been sequenced from hominins, humans, mammoths, moas and now-extinct rhinoceroses. Fibrillar collagens are the most abundant proteins in modern bones; similarly, Type 1 and III collagens are also common in the archaeological record. Modern bones contain about 10% non-collagenous proteins (NCPs), and various NCPs have also been recorded in ancient bones, including osteocalcin, biglycan and lumican. Generally, NCPs are excellent targets for studying evolutionary history, since they have higher turnover rates than bones. Given the abundance of ancient bone proteomes, a bottom-up proteomic workflow known as SPIN (Species by Proteome INvestigation) is available for the high-throughput analysis of 150 million mammalian bones.
 * Teeth. Tooth enamel is one of the hardest and most mineralised tissues in the human body, since it is mainly composed of hydroxyapatite crystals. While an enamel proteome is small, ancient amelogenins and other ameloblast-relevant proteins are often well-preserved in a mineralised, closed system. Ancient enamel proteins are useful when aDNA or other proteins do not survive, and they have been analysed to understand extinct species and evolution.
 * Shells. Archaeological shells also contain rich palaeoproteomes. Like tooth enamel, they are more or less closed systems that isolate proteins from water or other forces of degradation. Struthiocalcin-1 and -2 have been securely identified in 3.8 Ma ostrich eggshell samples from the site of Laetoli in Tanzania. These C-type lectins are associated with biomineralisation, and they are also found in eggshells of extinct large birds collected from Australia. Given the age of the ostrich eggshell, its authenticity was verified by a combination of six methods, including analytical replication (same samples analysed in different labs), amino acid racemisation (D/L ratios), carry-over analysis (pre- and post-injection washes to evaluate the extent of carry-over in mass spectrometers), damage patterns (deamidation/oxidation/phosphorylation/amidation/decomposition) and aDNA studies. These independent procedures ensure the authenticity of the oldest peptide sequences.

Other complex mixtures

 * Ceramics & food crusts. Various ancient dietary proteins have been characterised from ceramics and associated food crusts (charred and calcite deposits on ceramic vessels). Cow, sheep and goat milk beta-lactoglobulin proteins are predominant in this context, but there are also milk caseins (alpha-, beta- and kappa-casein), animal blood haemoglobins and a wide range of plant proteins (wheat glutenins, barley hordeins, legumins and other seed storage proteins). The identification of these ancient foodstuffs may be used to understand how food was prepared, cooked and consumed in the past. It is also clear that archaeological ceramics and food crusts are complex mixtures that contain metaproteomes (multiple proteomes).

Analytical challenges
While palaeoproteomics is a useful tool for a wide array of research questions, there are some analytical challenges that prevent the field from reaching its potential. The first issue is preservation. Mineral-binding seems to stabilise proteins, but this is a complex, dynamic process that has not been systematically investigated in different archaeological and burial contexts.

Destructive sampling is another problem that can cause irreparable damage to archaeological materials. Although minimally-destructive or non-destructive sampling methods are being developed for parchments, bones, mummified tissues and leathers, it is unclear if they are suitable for other types of remains such as dental calculi, ceramics and food crusts.

It is equally difficult to extract mineral-bound proteins due to their low abundance, extensive degradation and often strong intermolecular interactions (hydrogen bonding, dispersion, ion-dipole and dipole-dipole interactions) with mineral matrices. Ancient proteins also vary in preservation state, hydrophobicity, solubility and optimum pH; methodological development is still required to maximise protein recovery.

Ancient protein identification is still a challenge, because database search algorithms are not optimised for low-intensity and damaged ancient proteins, increasing the probability of false positives and false negatives. There is also the issue of dark proteomes (unknown protein regions that cannot be sequenced); approximately 44-54% of proteins in eukaryotes such as animals and plants are dark. Reference databases are also biased towards model organisms such as yeasts and mice, and current sequence data may not cover all archaeological materials.

Lastly, while cytosine deamination (the conversion of cytosine to uracil over time, which causes misreadings) has been widely used in the authentication of aDNA, there are no standardised procedures to authenticate ancient proteins. This authentication issue is highlighted by the claimed identification of 78 Ma Brachylophosaurus canadensis (hadrosaur) and 68 Ma Tyrannosaurus rex collagen peptides. The lack of post-translational modifications and subsequent experimental studies suggest that these sequences may be derived from bacterial biofilms, the cross-contamination of control samples or modern laboratory procedures.

Future directions
Despite significant analytical challenges, palaeoproteomics is constantly evolving and adopting new technology. The latest high-performance mass spectrometers, for example timsTOF (trapped ion mobility-time-of-flight) in DIA (data-independent acquisition) mode, may help with the separation, selection and resolution of ancient MS/MS data. Novel extraction protocols such as DES (deep eutectic solvent)-assisted procedures may increase the numbers and types of extracted palaeoproteomes. Identification tools are also improving thanks to progress in bioinformatics, machine learning and artificial intelligence.

Public depositories for raw data

 * PRIDE-PRoteomics IDEntification Database
 * MassIVE-Mass spectrometry Interactive Virtual Environment

Reference databases

 * UniProt
 * NCBI, National Center for Biotechnology Information

Database search

 * MaxQuant
 * Mascot
 * AlphaPept

Open search

 * pFind
 * FragPipe
 * MetaMorpheus

De novo programmes

 * PEAKS
 * DeNovoGUI
 * Casanovo

Wikipedia pages

 * Amino acid
 * Protein
 * Proteomics
 * Ancient DNA
 * Archaeogenetics
 * Paleogenetics
 * Molecular clock