Molecular Evolutionary Genetics Analysis

Molecular Evolutionary Genetics Analysis (MEGA) is computer software for conducting statistical analysis of molecular evolution and for constructing phylogenetic trees. It includes many sophisticated methods and tools for phylogenomics and phylomedicine. It is licensed as proprietary freeware. The project for developing this software was initiated under the leadership of Masatoshi Nei in his laboratory at the Pennsylvania State University, in collaboration with his graduate student Sudhir Kumar and postdoctoral fellow Koichiro Tamura. Nei wrote a monograph (130 pp.) outlining the scope of the software and presenting new statistical methods that were included in MEGA. The entire set of computer programs was written by Kumar and Tamura. Because personal computers of the time could not send the monograph and software electronically, they were delivered by postal mail. From the start, MEGA was intended to be easy to use and to include only solid statistical methods.

MEGA version 2 (MEGA2), which was coauthored by an additional investigator, Ingrid Jakobsen, was released in 2001. All the computer programs and the readme files of this version could be sent electronically due to advances in computer technology. Around this time, the leadership of the MEGA project was taken over by Kumar (now at Temple University) and Tamura (now at Tokyo Metropolitan University). The monograph Molecular Evolutionary Genetics Analysis was often used as a textbook for new ways to study molecular evolution.

MEGA has been updated and expanded several times, and all of these versions are currently available from the MEGA website. The latest release, MEGA7, has been optimized for use on 64-bit computing systems. MEGA is available in two versions. A graphical user interface is available as a native Microsoft Windows program. A command-line version, MEGA-Computing Core (MEGA-CC), is available for native cross-platform operation. The software is widely used and cited. With millions of downloads across the releases, MEGA is cited in more than 85,000 papers. The 5th version has been cited over 25,000 times in 4 years.

Sequence alignment construction
Alignment Editor ― Within MEGA, the Alignment Editor is a tool for building and editing multiple sequence alignments. It integrates both the ClustalW and MUSCLE programs. All actions take place in the Analysis Explorer, which can be found in the main menu of MEGA. When a new alignment is being created, the user is presented with three options: create a new alignment, open a saved alignment session, or retrieve sequences from a file (importing sequences from NCBI). Once an option is selected, the user can choose either ClustalW or MUSCLE from the Alignment tab located at the top of the page. Parameters for the selected alignment program can then be specified, and a progress bar appears while the alignment is being computed. Aligned sequences replace the unaligned ones in the main section of the Alignment Editor. To perform further analysis in MEGA, it is advisable to save the alignment session in either MEGA or FASTA format.

Trace Data File Viewer/Editor ― The Trace Data File Viewer/Editor has many functionalities in the following three menus. All the commands are used to help specialize searches and alignments in MEGA.
 * Data Menu consists of Open File in New Window, Open File, Save File, Print, Add to Alignment Explorer, Export FASTA File, and Exit.
 * Edit Menu consists of Undo, Copy, Mask Upstream, Mask Downstream, and Reverse Complement. Mask Upstream masks or unmasks the region to the left of the cursor, while Mask Downstream does the same for the region to the right. Reverse Complement replaces the sequence with its reverse complement.
 * Search Menu consists of Find, Find Next, Find Previous, Next N, Find in File, and Do BLAST Search. Find, Find Next, and Find Previous locate occurrences of a query sequence. Next N jumps to the next indeterminate (N) nucleotide. Find in File allows a user to search another file for selected sequences. Do BLAST Search performs a BLAST search in a separate web browser window. The user can either select specific data to BLAST or use all sequences in the session.

Integrated web browser, sequence fetching ―  MEGA comes with a built-in web browser that allows users to access GenBank sequence data from the NCBI website. The integrated web browser can be accessed when creating a new alignment in the Alignment Editor. To successfully use sequences from NCBI, it is advised to change the searches to FASTA format and use the “Add to Alignment” button. Once completed, all the sequences will be imported into the MEGA application.

Multiple sequence alignment

Data handling
One of the challenges associated with evolutionary genetic analysis is the presence of ambiguous states such as R (A or G), Y (C or T), and N (any base). These states often arise from sequencing errors or incomplete datasets. MEGA offers several options for handling ambiguous states, including partial deletion, which removes sites whose coverage (the proportion of sequences with an unambiguous state at that site) falls below the Site Coverage Cutoff parameter.
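
The idea of a coverage cutoff can be sketched in a few lines of Python. This is only an illustration: the function names, the example alignment, and the exact rule (keep a column when the fraction of unambiguous bases is at least the cutoff) are assumptions, not MEGA's implementation.

```python
# Characters treated as unambiguous bases; everything else (R, Y, N, gaps,
# '?', ...) counts as ambiguous or missing in this sketch.
UNAMBIGUOUS = set("ACGTU")

def site_coverage(alignment, site):
    """Fraction of sequences with an unambiguous base in the given column."""
    column = [seq[site].upper() for seq in alignment]
    return sum(base in UNAMBIGUOUS for base in column) / len(column)

def partial_deletion(alignment, cutoff=0.95):
    """Keep only the columns whose coverage is at least `cutoff` (assumed rule)."""
    length = len(alignment[0])
    keep = [i for i in range(length) if site_coverage(alignment, i) >= cutoff]
    return ["".join(seq[i] for i in keep) for seq in alignment]

# Hypothetical toy alignment: column 4 holds R, Y, N and one real base.
aln = ["ACGTRA", "ACGTYA", "AC-TNA", "ACGTAA"]
filtered = partial_deletion(aln, cutoff=0.75)
```

With a cutoff of 0.75 the gapped third column is kept (three of four sequences are informative) while the fifth column, dominated by ambiguity codes, is dropped.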

MEGA's extended format allows users to save all data attributes, such as sequence length, nucleotide positions, gaps, and ambiguous states. Additionally, MEGA supports data import from other formats, such as Clustal, which ensures a seamless transition between popular file types.

After a dataset is imported, MEGA provides several data viewer options. For example, users can view statistical attributes and select subsets in the Sequence Data Explorer or use the Distance Data Explorer to inspect pairwise distance data. Another feature of MEGA is the visual specification of domains and groups. This allows users to group sequences by a specific characteristic and view the resulting phylogenetic trees.

Genetic code table
MEGA offers support for modifying the genetic code used for translating DNA sequences. By default, MEGA has 23 built-in genetic code variations including the standard code, vertebrate mitochondrial code, Drosophila mitochondrial code, and yeast mitochondrial code. Users may add, remove, or edit any genetic code table.

In addition, MEGA can compute the degeneracy of each codon position in a genetic code table, as well as the number of synonymous and non-synonymous sites, using the Nei-Gojobori method.
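
As a rough illustration of what codon-position degeneracy means, the sketch below builds the standard genetic code and counts how many of the four bases at a position leave the amino acid unchanged. The function names are hypothetical, and the convention of labelling a position with three synonymous bases as 2-fold is an assumption made here for simplicity.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def synonymous_alternatives(codon, position, code=STANDARD_CODE):
    """Count bases (including the original) at `position` that keep the amino acid."""
    target = code[codon]
    count = 0
    for base in BASES:
        variant = codon[:position] + base + codon[position + 1:]
        if code[variant] == target:
            count += 1
    return count

def degeneracy(codon, position, code=STANDARD_CODE):
    """Label a codon position as 0-, 2-, or 4-fold degenerate.

    Assumption: a count of 3 (e.g. the third position of ATT) is labelled 2-fold.
    """
    return {1: 0, 2: 2, 3: 2, 4: 4}[synonymous_alternatives(codon, position, code)]

third_position_of_glycine = degeneracy("GGT", 2)   # any third base still gives Gly
```

Passing a different dictionary as `code` would apply the same logic to another genetic code table, such as a mitochondrial one.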

Real-time caption expert engine
The Caption Expert is a part of MEGA that generates detailed, publication-style captions based on the properties of the analysis results. It can be used with distance matrices, phylogenies, statistical tests, and other analyses within MEGA (megasoftware).

Integrated text file editor
MEGA's integrated text file editor enables users to edit text files without the need for another program. Features like columnar block selection-editing aid in the performance of bulk operations, like changing letter case or font size. Additionally, the editor includes line numbers to assist with the navigation of large files and identifying areas of interest.

MEGA also provides several tools to format sequences. For example, the built-in reverse complement utility reverses the order of characters and replaces each with its complement.

The screenshots demonstrate the use of MEGA's reverse complement tool. The original sequence was reversed and each nucleotide was replaced with its complement to produce the reverse complement.
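
The operation itself is small enough to sketch directly. The following Python function is an illustrative equivalent, not MEGA's code; it handles only the four unambiguous bases in upper or lower case and leaves any other character unchanged.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the order of the characters."""
    return seq.translate(COMPLEMENT)[::-1]

example = reverse_complement("ATGC")
```

Applying the function twice returns the original sequence, which is a convenient sanity check.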

Sequence data viewer
MEGA provides a graphical interface for displaying and manipulating aligned nucleotide and protein sequences. The Sequence Data Explorer has multiple menu functionalities to help with exporting data, searching alignments, changing display features, highlighting sites, and computing statistics:
 * Data Menu consists of Write Data to File, Translate/Untranslate, Select Genetic Code Table, Setup/Select Genes and Domains, Setup/Select Taxa and Groups, and Quit Data Viewer. Translate/Untranslate and Select Genetic Code Table are only available for nucleotide sequences.
 * Search Menu consists of Find Motif, Find Next, Find Previous, Find Marked Site, and Highlight Motif.
 * Display Menu consists of Show Only Selected Sequences, Use Identical Symbol, Color Cells, Sort Sequences, Restore Input Order, Show Sequence Name, Show Group Name, and Change Font.
 * Highlight Menu consists of Conserved Sites, Variable Sites, Parsimony-Informative Sites, Singleton Sites, 0-fold Degenerate Sites, 2-fold Degenerate Sites, and 4-fold Degenerate Sites.
 * Statistics Menu consists of Nucleotide Composition, Nucleotide Pair Frequencies, Codon Usage, Amino Acid Composition, Use All Selected Sites, Use only Highlighted Sites, Display Results in Spreadsheet (Excel or Libre/Open Office), Display Results in Comma-Delimited (CSV), Display Result in Text Editor.

MCL-based estimation of nucleotide substitution patterns
MEGA offers several substitution models, with different attributes, for both DNA and protein sequences. The user can choose the substitution type, model, and other options that best fit the data at hand. The three main descriptions of the substitution pattern are the 4x4 Rate Matrix, the Transition-Transversion Rate Ratios (k1, k2), and the Transition-Transversion Rate Bias (R).

 Transition-Transversion Rate Ratios (k1, k2) ― The transition-transversion rate ratio is the ratio of the transition rate (a) to the transversion rate (b), k = a/b. MEGA reports one ratio for purines (k1) and one for pyrimidines (k2).

 Transition-Transversion Rate Bias (R) ― R is the ratio of the number of transitions to the number of transversions between a pair of sequences. MEGA allows a user to conduct an analysis of the data with a specified value of R. Because each nucleotide can undergo one transition but two transversions, R equals 0.5 when there is no bias towards either transitional or transversional substitution.
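
To make the distinction concrete, the sketch below counts observed transitions and transversions between two aligned sequences and takes their ratio. This raw count ratio is only an illustration; MEGA estimates R under a substitution model, and the function names and example sequences here are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_ts_tv(seq1, seq2):
    """Count transitional and transversional differences between aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical sites, gaps, and ambiguous states are skipped
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1   # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            tv += 1   # purine <-> pyrimidine
    return ts, tv

def observed_ratio(seq1, seq2):
    """Ratio of observed transitions to transversions (infinite if no transversions)."""
    ts, tv = count_ts_tv(seq1, seq2)
    return ts / tv if tv else float("inf")

ts, tv = count_ts_tv("AAGCTTAC", "GAGTTAAC")
```

In this toy pair there are two transitions (A/G and C/T) and one transversion (T/A), so the observed ratio is well above the unbiased value of 0.5.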

Substitution pattern homogeneity test
MEGA offers several approaches for testing substitution pattern homogeneity, such as the composition distance, the disparity index, and Monte Carlo tests. These methods are used to determine whether sequences have evolved with the same pattern of substitution.

Composition distance measures the difference in nucleotide composition between two sequences. MEGA computes this figure per site and excludes any gaps or missing data. A larger distance indicates a greater difference in composition, which suggests that the sequences evolved under different conditions.

The disparity index evaluates the difference in substitution patterns for a given pair of sequences. This value is calculated per site, and the associated test is thought to be more powerful than the chi-square test. A large value implies that the pattern of substitution was not the same for the given pair of sequences.

The Monte Carlo test is another approach to testing substitution pattern homogeneity; it generates a null distribution by simulation. MEGA requires the user to specify the number of replicates and a starting seed. For a reliable result, many simulations must be performed. Therefore, it is essential to consider the computational cost of the algorithm. The table above shows the computational complexity of different Monte Carlo methods as $$N$$ approaches infinity in relation to the parameter $$\alpha$$. While it is not clear which method MEGA employs, it is likely to be computationally intensive because all the methods listed in the table have a computational order greater than $$\Theta(N^2)$$.

Distance estimation methods
MEGA offers a wide variety of options for calculating the evolutionary distance between a pair of nucleotide or amino acid sequences, with or without standard errors. Distance methods are divided into three categories: nucleotide, synonymous-nonsynonymous, and amino acid.


 * Nucleotide - Sequences are compared nucleotide-by-nucleotide, available for both protein coding and non-coding sequences.
   * No. of differences
   * p-distance
   * Jukes-Cantor Model
   * Tajima-Nei Model
   * Kimura 2-Parameter Model
   * Tamura 3-Parameter Model
   * Tamura-Nei Model
   * Log-Det Method
   * Maximum Composite Likelihood Model
 * Syn-Nonsynonymous - Sequences are compared codon-by-codon, only available for protein coding sequences.
   * Nei-Gojobori Method
   * Modified Nei-Gojobori Method
   * Li-Wu-Luo Method
   * Pamilo-Bianchi-Li Method
   * Kumar Method
 * Amino Acid - Sequences are compared residue-by-residue, available for both protein coding and non-coding sequences.
   * No. of differences
   * p-distance
   * Poisson Model
   * Equal Input Model
   * Dayhoff Model
   * Jones-Taylor-Thornton Model

After a distance method is selected, the applicable subset of attributes becomes visible. The attributes are Substitutions to Include, Transition/Transversion Ratio, Pattern among Lineages, and Rates among Sites. For example, if a model allows rate variation among sites, the gamma parameter becomes visible. In addition, every distance method provides options for handling gaps and missing data, and for selecting codon positions where applicable.

Every substitution model has its own use case. One of the simplest models is the Jukes-Cantor model, which assumes equal rates for all substitutions. The Kimura 2-Parameter model extends it by distinguishing between transition rates ($$A \leftrightarrow G$$ and $$C \leftrightarrow T$$) and transversion rates ($$\{A, G\} \leftrightarrow \{C, T\}$$). The Kimura 3-Parameter model extends this further by distinguishing between transversions that conserve the nucleotide's weak/strong property ($$A \leftrightarrow T$$ and $$C \leftrightarrow G$$) and transversions that conserve the nucleotide's amino/keto property ($$A \leftrightarrow C$$ and $$G \leftrightarrow T$$). However, each extension adds more parameters and increases the risk of overfitting. The best substitution model depends on the data used. To help with selection, MEGA provides a Find Best-Fit Substitution Model option in the Model tab that runs each model and assigns it a Bayesian information criterion score.
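
The two simplest corrections can be written down directly from their standard formulas: the Jukes-Cantor distance is $$d = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)$$ for a proportion $$p$$ of differing sites, and the Kimura 2-Parameter distance is $$d = -\tfrac{1}{2}\ln(1 - 2P - Q) - \tfrac{1}{4}\ln(1 - 2Q)$$ for proportions $$P$$ of transitions and $$Q$$ of transversions. The Python sketch below is illustrative only; the function names are hypothetical and sites with gaps or ambiguous states are simply skipped pair by pair.

```python
import math

def proportions(seq1, seq2):
    """Return (P, Q): proportions of transitional and transversional differences."""
    purines = {"A", "G"}
    n = ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue                      # pairwise deletion of gaps/ambiguities
        n += 1
        if a != b:
            if (a in purines) == (b in purines):
                ts += 1                   # both purines or both pyrimidines
            else:
                tv += 1
    return ts / n, tv / n

def p_distance(seq1, seq2):
    """Proportion of sites at which the two sequences differ."""
    P, Q = proportions(seq1, seq2)
    return P + Q

def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance: all substitutions assumed equally likely."""
    p = p_distance(seq1, seq2)
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(seq1, seq2):
    """Kimura 2-Parameter distance: separate transition and transversion rates."""
    P, Q = proportions(seq1, seq2)
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

# Hypothetical pair with one transition (A/G) and one transversion (C/A) in 10 sites.
s1, s2 = "ACGTACGTAC", "GCGTACGTAA"
distances = (p_distance(s1, s2), jukes_cantor(s1, s2), kimura_2p(s1, s2))
```

Both corrected distances are larger than the raw p-distance because they account for multiple substitutions at the same site.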

Tests of selection
 Large sample Z-test ― The Z-test is used to compare the relative abundance of synonymous and nonsynonymous substitutions within a gene sequence, with the main objective of detecting positive selection. The test requires estimates of the number of synonymous substitutions per synonymous site (dS) and of nonsynonymous substitutions per nonsynonymous site (dN), along with their variances, Var(dS) and Var(dN). The formula used for the Z-test is:

 $$Z = \frac{dN - dS}{\sqrt{\mathrm{Var}(dS) + \mathrm{Var}(dN)}}$$

If dN is greater than dS, it indicates positive selection, while if dN is less than dS, it indicates purifying selection. The sign of Z therefore shows the direction of selection, and its magnitude, which depends on the variances of the synonymous and nonsynonymous estimates, determines whether the difference is statistically significant. In MEGA, these variances can be computed with analytical formulas or by bootstrap resampling.
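
A minimal numerical sketch of the statistic follows. The input values are made up for illustration, the function names are hypothetical, and the one-tailed P-value assumes that Z follows a standard normal distribution under the null hypothesis dN = dS.

```python
import math

def z_statistic(dN, dS, var_dN, var_dS):
    """Z = (dN - dS) / sqrt(Var(dS) + Var(dN))."""
    return (dN - dS) / math.sqrt(var_dS + var_dN)

def one_tailed_p(z):
    """P(Z >= z) for a standard normal variable (test of dN > dS)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Illustrative values only, not taken from any real data set.
z = z_statistic(dN=0.30, dS=0.10, var_dN=0.004, var_dS=0.006)
p_value = one_tailed_p(z)
```

With these numbers Z is about 2, giving a one-tailed P-value of roughly 0.023, so the null hypothesis would be rejected at the 5% level in favour of positive selection.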

 Fisher's exact test ― Fisher's Exact Test examines synonymous and nonsynonymous substitutions in sequences and is a one-tailed test suited to analyzing small samples for positive selection. The null hypothesis of neutrality is rejected when the P-value is less than 0.05. If the differences per synonymous site exceed those per nonsynonymous site, MEGA assigns a P-value of 1, indicating purifying selection rather than positive selection. The test is based on the hypergeometric distribution, whose probabilities are expressed in terms of factorials (Hoffman).
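
The hypergeometric calculation behind a one-tailed Fisher's exact test can be sketched as follows. The layout of the 2x2 table (rows for nonsynonymous and synonymous sites, columns for changed and unchanged sites) is an assumption made for illustration, as is the function name; the counts in the example are arbitrary.

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """One-tailed P-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose top-left cell is at least as large as the observed `a`.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(row1, k) * comb(row2, col1 - k) / denominator
    return p

# Assumed layout: a = nonsynonymous differences, b = unchanged nonsynonymous sites,
#                 c = synonymous differences,    d = unchanged synonymous sites.
p_value = fisher_one_tailed(3, 1, 1, 3)
```

For this small table the P-value is 17/70, about 0.24, so neutrality would not be rejected. The loop visits at most one table per possible value of the top-left cell.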

 Tajima's Neutrality Test —  The purpose of Tajima's Neutrality Test is to assess the relationship between the number of segregating sites per site and nucleotide diversity. When alleles are selectively neutral the product 4Nv can be estimated in two ways. N represents the effective population size and v is the mutation rate per site. By calculating the difference between these estimates, one can determine if there is evidence of non-neutral evolution.
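
The two estimates and the resulting statistic can be sketched with the standard formulas for Tajima's D. This is an illustration under simple assumptions (equal-length sequences, no gaps, hypothetical function name), not MEGA's code. The estimates are computed per sequence here; dividing both by the alignment length gives the per-site values without changing D.

```python
import math
from itertools import combinations

def tajima_d(seqs):
    """Return (pi, theta_w, D) for a list of aligned, gap-free sequences."""
    n = len(seqs)
    length = len(seqs[0])
    # S: number of segregating (polymorphic) sites
    S = sum(len({s[i] for s in seqs}) > 1 for i in range(length))
    # pi: mean number of differences over all pairs of sequences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = S / a1                      # estimate based on segregating sites
    if S == 0:
        return pi, theta_w, 0.0           # D is undefined without polymorphism
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    D = (pi - theta_w) / math.sqrt(e1 * S + e2 * S * (S - 1))
    return pi, theta_w, D

# Hypothetical sample of four sequences with three segregating sites.
pi, theta_w, D = tajima_d(["AAAA", "AAAT", "AATT", "ATTT"])
```

A value of D near zero, as in this toy sample, is what neutrality predicts; strongly negative or positive values point to departures such as selection or changes in population size.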

Molecular clock test
The molecular clock hypothesis suggests that all sequences have evolved at a constant rate over time. The molecular clock test evaluates whether the data provided by the user are consistent with this hypothesis. In MEGA, this test is performed by applying a maximum likelihood test to a given tree topology and sequence alignment. This produces two log-likelihood values, one with the clock hypothesis and one without, which are then compared. Another approach offered by MEGA is Tajima's relative rate test. This method compares the number of substitutions that two sequences have accumulated, using a third sequence as an outgroup. If the resulting numbers differ significantly, the molecular clock hypothesis may not be valid for the given data set.
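
Tajima's relative rate test is simple enough to sketch. For two sequences and an outgroup, it counts the sites where only the first sequence differs (m1) and the sites where only the second differs (m2), and compares them with $$\chi^2 = (m_1 - m_2)^2 / (m_1 + m_2)$$ on one degree of freedom. The function name and example sequences below are hypothetical, and gaps are not handled.

```python
import math

def relative_rate_test(seq1, seq2, outgroup):
    """Return (m1, m2, chi2, p) for Tajima's relative rate test."""
    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if a != b:
            if b == o:
                m1 += 1      # change unique to the lineage leading to seq1
            elif a == o:
                m2 += 1      # change unique to the lineage leading to seq2
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper tail of the chi-square distribution with one degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p

# Toy example: all four unique changes fall on the first lineage.
m1, m2, chi2, p = relative_rate_test("TTTTAAAA", "AAAAAAAA", "AAAAAAAA")
```

Here the statistic is 4, giving a P-value of about 0.046, so equal rates in the two lineages would be rejected at the 5% level.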

Tree-making methods
MEGA offers five methods for building a phylogenetic tree, listed below. Each method allows for a bootstrap phylogeny test with any number of replications. Neighbor joining and minimum evolution allow for an interior-branch test instead. The substitution models and parameters are the same as for the distance estimation methods.
 * Neighbor joining
 * Minimum evolution
 * UPGMA
 * Maximum parsimony
 * Maximum likelihood
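
Of these, UPGMA is the simplest to illustrate: it repeatedly merges the two closest clusters and averages their distances to the remaining clusters. The sketch below is a minimal version that returns only the topology as a Newick string. The function name, the input format (a dictionary keyed by pairs of labels), and the distances are assumptions for illustration, and branch lengths are omitted.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA and return the topology as a Newick string.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    """
    clusters = {name: 1 for name in names}    # cluster label -> number of taxa
    d = dict(dist)
    while len(clusters) > 1:
        labels = list(clusters)
        # Find the closest pair among the current clusters
        a, b = min(
            ((x, y) for i, x in enumerate(labels) for y in labels[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for other in clusters:
            # Size-weighted average of the distances to the two merged clusters
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"

# Hypothetical distances: A and B are close, C and D are close.
pairwise = {("A", "B"): 2, ("C", "D"): 4, ("A", "C"): 10,
            ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 10}
tree = upgma(["A", "B", "C", "D"], {frozenset(k): v for k, v in pairwise.items()})
```

Because UPGMA assumes a constant rate of evolution, it is the only one of the five methods that implicitly relies on a molecular clock.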

Tree explorers
MEGA provides a graphical interface for displaying a phylogenetic tree based on a variety of options. In the view menu, the tree can be displayed in three different styles: traditional, radiation, or circle. Traditional trees have three different branch styles: rectangular, straight, or curved. The view menu also offers toggling topology scaling, changing font type and size, arranging taxa, showing/hiding various details, and a general option for more control over the tree drawing aspects.

The subtree menu provides options for manipulating the tree, such as swapping branches, flipping the order of lineages, compressing/expanding subtrees, and moving the tree's root. A subtree can also be displayed in its own tree explorer with all the same features and options. The compute menu provides options for computing a condensed tree, a consensus tree, or a timetree with or without a molecular clock. The file menu provides options for saving, exporting, printing, and exiting. The tree topology can be exported to a file in MEGA tree format or, for timetrees, in a tabular format with the relevant information used when constructing the timetree. Other export options include the current timetree calibrations, analysis summary, partition list, and pairwise distances. Under the image menu, the tree explorer also provides options to save the current tree display in an image format or to copy it to the clipboard. The image formats supported are BMP, PNG, PDF, SVG, TIFF, and EMF. If the user chooses to build the tree with bootstrap replication, the tree explorer has two tabs, one with the original tree and one with the bootstrap consensus tree.