User:Riskanal/sandbox

The metalog distribution is a highly flexible continuous probability distribution that has simple closed form equations, can be directly parameterized by data, and belongs to the class of quantile-parameterized distributions. Together with its transforms, the metalog family of distributions provides an alternative to the Pearson distributions for data-fitting applications. Like the Pearson distributions, the metalogs can represent a wide range of uncertainties, such as those commonly encountered in economics, engineering, business, and science. In contrast to the Pearson distributions, the metalog family is more shape and bounds flexible, has simpler closed-form equations, is easier to fit to data, and is easier to simulate. The metalog distributions were developed by Tom Keelin and first published in 2016. The metalog distributions are also known as the Keelin distributions.

History
In the Age of Enlightenment, the normal distribution was first published as was Bayes’ theorem. The normal distribution laid the foundation for much of the development of classical statistics. In contrast, Bayes' theorem laid the foundation for state-of-information, belief-based probability representations. Because state-of-information probabilities can take on any shape and may have natural bounds, probability distributions flexible enough to accommodate both were needed. Moreover, many empirical and other data sets exhibited shapes that could not be well matched by the normal or other early continuous distributions. So began the search for shape- and bounds-flexible continuous probability distributions.

In the early 20th Century, the Pearson family of distributions, which includes the normal, beta, uniform, gamma, student-t, chi-square, F, and others, emerged as a major advance in shape-flexibility. These were shortly followed by the Johnson distributions. Both families can represent the first four moments of data (mean, variance, skewness, and kurtosis) with smooth continuous curves. But they have no ability to match fifth or higher-order moments as well. For a given skewness and kurtosis, there is no choice of bounds. Their equations include intractable integrals and complex statistical functions. Fitting to data often requires iterative methods.

In the early 21st century, decision analysts seeking continuous probability distributions that would exactly go through three points on the CDF (e.g. expert-elicited quantiles corresponding CDF probabilities $$0.10, 0.50$$ and $$0.90$$) found the Pearson and Johnson distributions generally inadequate for this purpose. In addition, decision analysts sought probability distributions that would be easy to parameterize with data (e.g. by using linear least squares). Introduced in 2011, the class of quantile-parameterized distributions (QPDs) accomplished both. While being a significant advance for this reason, the QPD used to illustrate this class, the SimpleQNormal distribution, had less shape flexibility than the Pearson and Johnson families and lacked boundedness flexibility. Shortly thereafter, Keelin developed the family of metalog distributions, an instance of the QPD class, which is more shape-flexible than the Pearson and Johnson families, offers a choice of boundedness, has closed form equations that can be fit to data with linear least squares, and has closed-form quantile functions which facilitate simulation.

Definition and Quantile Function
The metalog distribution is a generalization of the logistic distribution. The term "metalog" is short for "metalogistic." Starting with the logistic quantile function, $$x=\mu+s\mbox{ } ln\Bigl({y\over{1-y}}\Bigr)$$, Keelin substituted power series expansions in cumulative probability $$y$$ for the $$\mu$$ and the $$s$$ parameters, which control location and scale respectively.


 * $$\mu =a_1 +a_4(y -0.5)+a_5(y -0.5)^2 +a_7(y -0.5)^3+a_9(y -0.5)^4 + \dots $$
 * $$s=a_2+a_3(y -0.5)+a_6(y -0.5)^2 +a_8(y -0.5)^3+a_{10}(y -0.5)^4 + \dots $$

Keelin's rationale for this substitution was fivefold. First, the resulting quantile function would have significant shape flexibility depending on the coefficients $$a_i$$. Second, it would have a simple closed form that is linear in these coefficients, implying that they could easily be determined from CDF data by linear least squares. Third, the resulting quantile function would be smooth and differentiable so that a closed form PDF would be available. Fourth, simulation would be facilitated by the resulting closed-form inverse CDF. Fifth, like a Taylor series, any number of terms $$k$$ could be used depending on the degree of shape flexibility desired and other application needs.

Rewriting the logistic quantile function to incorporate the above substitutions yields the metalog quantile function, where cumulative probability $$0<y<1$$.

M_k(y)= \left\{ \begin{array}{ll} a_1+a_2ln\Bigl({y\over{1-y}}\Bigr) & \mbox{for } k=2\\ a_1+a_2ln\Bigl({y\over{1-y}}\Bigr)+a_3(y-0.5)ln\Bigl({y\over{1-y}}\Bigr) & \mbox{for } k=3\\ a_1+a_2ln\Bigl({y\over{1-y}}\Bigr)+a_3(y-0.5)ln\Bigl({y\over{1-y}}\Bigr)+a_4(y-0.5) & \mbox{for } k=4\\ M_{k-1}(y)+a_k(y-0.5)^{k-1\over2} & \mbox{for odd } k\geq5\\ M_{k-1}(y)+a_k(y-0.5)^{{k\over2}-1}ln\Bigl({y\over{1-y}}\Bigr) & \mbox{for even } k\geq6\\ \end{array}\right. $$

Equivalently, the metalog quantile function can be expressed in terms of basis functions: $$ M_k(y) = \sum_{i=1}^k a_i g_i(y)$$, where the metalog basis functions are $$ g_1(y)=1,\mbox{ } g_2(y)=ln\Bigl({y\over{1 -y}}\Bigr),$$ and each subsequent $$g_i(y)$$ is defined as the expression that is multiplied by $$a_i$$ above.

Note that coefficient $$a_1$$ is the median since all other terms are zero when $$y=0.5$$. Special cases of the metalog quantile function are the logistic distribution ($$k=2$$) and the uniform distribution ($$k\geq4,a_1 = 0.5,a_4 =1,a_i =0$$ otherwise).

Probability Density Function
Differentiating $$x=M_k(y)$$ with respect to $$y$$ yields $$dx/dy$$. The reciprocal of this quantity, $$dy/dx$$, is the probability density function (PDF),



m_k(y)= \left\{ \begin{array}{ll} {y(1-y)\over{a_2}} & \mbox{for } k=2\\ \Bigl({a_2\over{y(1-y)}}+a_3\Bigl({1-y\over{y(1-y)}}+ln{y\over{1-y}}\Bigr)\Bigr)^{-1} & \mbox{for } k=3\\ \Bigl({a_2\over{y(1-y)}}+a_3\Bigl({1-y\over{y(1-y)}}+ln{y\over{1-y}}\Bigr)+a_4\Bigr)^{-1} & \mbox{for } k=4\\ \Bigl({1\over{m_{k-1}(y)}}+a_k{k-1\over2}(y-0.5)^{(k-3)/2}\Bigr)^{-1} & \mbox{for odd } k\geq5\\ \Bigl({1\over{m_{k-1}(y)}}+a_k\Bigl({(y-0.5)^{k/2-1}\over{y(1-y)}}+({k\over2}-1)(y-0.5)^{(k/2-2)}ln{y\over{1-y}}\Bigr)\Bigr)^{-1} & \mbox{for even } k\geq6,\\ \end{array}\right. $$ which may be equivalently expressed in terms of basis functions as


 * $$m_k(y) = \left( \sum_{i=1}^k

a_i {{d g_i(y)}\over{dy}} \right)^{-1}$$ where $$0<y<1$$. Note that this PDF is expressed as a function of cumulative probability $$y$$ rather than variable-of-interest $$x$$. To plot it, as shown in the figures, vary $$y\in(0,1)$$ parametrically. Plot $$x=M_k(y)$$ on the horizontal axis and $$m_k(y)$$ on the vertical axis.

Based on the above equations, the family of metalog distributions is comprised of unbounded, semibounded, and bounded metalogs along with their SPT special cases.

Unbounded, Semibounded, and Bounded Metalog Distributions
As defined above, the metalog distribution is unbounded, except in the unusual special case where $$a_i = 0$$ for all terms that contain $$ln\Bigl({y\over{1-y}}\Bigr)$$. Yet, many applications require flexible probability distributions that have a lower bound $$b_l$$, an upper bound $$b_u$$, or both. To meet this need, Keelin used transformations to derive semi-bounded and bounded metalog distributions. Such transformations are governed by a general property of quantile functions: for any quantile function $$x=Q(y)$$ and increasing function $$z(x), x=z^{-1} (Q(y))$$ is a quantile function. For example, the quantile function of the normal distribution is $$x=\mu+\sigma \Phi^{-1} (y)$$. The natural logarithm, $$z(x)=\ln(x-b_l)$$, is an increasing function, so $$x=b_l+e^{\mu+\sigma \Phi^{-1} (y)}$$ is the quantile function of the lognormal distribution. Analogously applying this property to the metalog quantile function $$M_k(y)$$ using the transformations below yields the semibounded and bounded members of the metalog family. By considering $$z(x)$$ to be metalog-distributed, all members of the metalog family meet Keelin and Powley's definition of a quantile-parameterized distribution and thus possess the properties thereof.

\begin{array}{l|c|c|c|c|c} & & & & & \mbox{shape}\\ & \mbox{transformation} & \mbox{Quantile Function} & \mbox{PDF} & \mbox{where} &\mbox{parameters}\\ \hline \mbox{Metalog (Unbounded)} & z(x)=x & M_k(y) & m_k(y) & 0<y<1 & k-2\\ \hline \mbox{Log Metalog} & z(x)=ln(x-b_l) & M_k^{log}(y)=b_l+e^{M_k(y)} & m_k^{log}(y)=m_k(y)e^{-M_k(y)} &  0<y<1 & k-1\\ \mbox{(Semibounded)} && =b_l & =0 &  y=0\\ \hline \mbox{Negative-Log Metalog} & z(x)=-ln(b_u-x) & M_k^{nlog}(y)=b_u-e^{-M_k(y)} & m_k^{nlog}(y)=m_k(y)e^{M_k(y)}& 0<y<1 & k-1\\ \mbox{(Semibounded)} && =b_u & =0 &  y=1\\ \hline \mbox{Logit Metalog} & z(x)={ln(x-b_l)\over{ln(b_u-x)}} & M_k^{logit}(y)={b_l+b_u e^{M_k(y)}\over{1+e^{M_k(y)}}} & m_k^{logit}(y)={\bigl(1+e^{M_k(y)}\bigr)^2\over{(b_u-b_l)e^{M_k(y)}}} &  0<y<1 & k\\ \mbox{(Bounded)}&&=b_l & =0 &  y=0\\ & & =b_u & =0 &  y=1 \end{array} $$ Note that the number of shape parameters in the metalog family increases linearly with the number of terms $$k$$. So, any metalog may have any number of shape parameters. In contrast, the Pearson and Johnson families of distributions are limited to two shape parameters.

SPT Metalog Distributions
The SPT (symmetric-percentile triplet) metalog distributions are a three-term $$(k=3)$$ special case of the unbounded, semibounded, and bounded metalog distributions. These are parameterized by three $$(x,y)$$ points on the CDF of the form $$(x_1,\alpha)$$, $$(x_2,0.5)$$, and $$(x_3,1-\alpha)$$, where $$0 <\alpha<0.5$$. SPT metalogs are useful when, for example, quantiles $$(x_1,x_2,x_3)$$ corresponding to CDF probabilities $$(0.1, 0.5, 0.9)$$ are assessed from an expert and used to parameterize three-term metalog distributions. As noted below, certain mathematical properties are simplified by SPT parameterization.

Properties
All members of the metalog family of distributions share the following properties.

Feasibility
A function of the form of $$M_k(y)$$ or any of its above transforms is a feasible probability distribution if and only if its PDF is greater than zero for all $$y \in (0,1)$$. This implies a feasibility constraint on the set of coefficients $$\boldsymbol a=(a_1,...,a_k) \in \R^k$$:


 * $$m_k(y)>0$$ for all $$y \in (0,1)$$.

In practical applications, feasibility must generally be checked rather than assumed. For $$k=2$$, $$a_2>0$$ ensures feasibility. For $$k=3$$ (including SPT metalogs), the feasibility condition is $$a_2>0$$ and $${|a_3|/a_2}<1.66711$$. For $$k=4$$, a similar closed form has been derived. For $$k\geq5$$, feasibility is typically checked graphically or numerically.

For a given $$k$$, the unbounded metalog and its above transforms share the same set of feasible coefficients. So, for a given set of coefficients, confirming that $$m_k(y)>0$$ for all $$y \in (0,1)$$ is sufficient regardless of the transform or number of coefficients in use.

Convexity
The set of feasible metalog coefficients $$S_\boldsymbol a=\{\boldsymbol a\in\R^k |m_k(y) > 0$$ for  all $$y\in (0,1)\}$$ is convex. Because convex optimization problems require convex feasible sets, this property can simplify optimization problems involving metalogs. Moreover, this property guarantees that any convex combination of the $$\boldsymbol a$$ vectors of feasible metalogs is feasible, which is useful, for example, when combining the opinion of multiple experts or interpolating among feasible metalogs.

Fitting to Data
The coefficients $$\boldsymbol a$$ can be determined from data by linear least squares. Given $$n$$ data points $$(x_i,y_i)$$ that are intended to characterize a metalog CDF, and $$n \times k$$ matrix $$\boldsymbol Y$$ whose elements consist of $$g_j (y_i)$$, then, so long as $$\boldsymbol Y^T \boldsymbol Y$$ is invertible, coefficients' column vector $$\boldsymbol a$$ can be determined as $$\boldsymbol a=(\boldsymbol Y^T \boldsymbol Y)^{-1} \boldsymbol Y^T \boldsymbol z$$, where $$n\geq k$$ and column vector $$\boldsymbol z=(z(x_1),...,z(x_n))$$. If $$n=k$$, this equation reduces to $$\boldsymbol a=\boldsymbol Y^{-1} \boldsymbol z$$, where the resulting CDF runs through all data points exactly. For SPT metalogs, it further reduces to expressions in terms of three $$(x_i,y_i)$$ points directly.

An alternate fitting method, implemented as a linear program, determines the coefficients by minimizing the sum of absolute distances between the CDF and the data subject to feasibility constraints.

Shape Flexibility
Metalogs are highly shape flexible. In the original paper, Keelin showed that ten-term metalog distributions parameterized by 105 CDF points from 30 traditional source distributions (including normal, student-t, lognormal, gamma, beta, and extreme value) approximate each such source distribution within a K-S distance of 0.001 or less.

The animated figure on the right illustrates this for the standard normal distribution, where metalogs with various numbers of terms are parameterized by the same set of 105 points from the standard normal CDF. The metalog PDF converges to the standard normal PDF as the number of terms increases. With two terms, the metalog approximates the normal with a logistic distribution. With each increment in number of terms, the fit gets closer. With 10 terms, the metalog PDF and standard normal PDF are visually indistinguishable.

Similarly, nine-term semi-bounded metalog PDFs with $$b_l=0$$ are visually indistinguishable from a range of Weibull distributions. The six cases shown correspond to Weibull shape parameters (0.5, 0.8, 1.0, 1.5, 2, 4). In each case, the metalog is parameterized by the nine $$x$$ points from the Weibull CDF that correspond to cumulative probabilities $$y = (0.001, 0.02, 0.10, 0.25, 0.5, 0.75, 0.9, 0.98, 0.999)$$.

Such convergence is not unique to the normal and Weibull distributions. Keelin originally showed analogous results for a wide range of distributions and has provided further illustrations.

Moments
The $$m^{th}$$ moment of the unbounded metalog distribution, $$E[x^m] =\int_{y=0}^1 {M_k(y)}^m dy$$, is a special case of the more general formula for QPDs. For the unbounded metalog, such integrals evaluate to closed-form moments that are $$m^{th}$$ order polynomials in the coefficients $$a_i$$. The first four central moments of the four-term unbounded metalog are:

\begin{array}{rcl} \mbox{mean} & = & a_1 +{a_3\over2}\\ \mbox{variance} & = &\pi^2{{a_2}^2\over3}+{{a_3}^2\over{12}}+\pi^2{{a_3}^2\over{36}}+a_2a_4 +{{a_4}^2\over{12}}\\ \mbox{skewness} & = &\pi^2 {a_2}^2+\pi^2 {{a_3}^3 \over{24}} + {{{a_2} {a_3} {a_4}} \over{2}}+ \pi^2{{{a_2} {a_3} {a_4}} \over{6}} + {{{a_3} {a_4}^2} \over{8}}\\ \mbox{kurtosis} & = &7\pi^4 {{a_2}^4\over{15}}+3\pi^2 {{{a_2}^2{a_3}^2} \over{24}}+7\pi^4 {{{a_2}^2{a_3}^2} \over{30}}+{{{a_3}^4}\over{80}} + \pi^2{{a_3}^4 \over{24}} + 7\pi^4 {{a_3}^4\over{1200}} + 2\pi^2{a_2}^3{a_4}\\ &+&{{{a_2} {a_3}^2 {a_4}} \over{2}}+2\pi^2{{{a_2} {a_3}^2 {a_4}} \over{3}}+2{{a_2}^2 {a_4}^2}+\pi^2{{{a_2}^2 {a_4}^2 } \over{6}}+{{{a_3}^2 {a_4}^2 } \over{8}}+\pi^2{{{a_3}^2 {a_4}^2 } \over{40}}+{{{a_2} {a_4}^3 } \over{3}} +{{a_4}^4 \over{80}} \end{array} $$ Moments for fewer terms are subsumed in these equations. For example, moments of the three-term metalog are revealed by setting $$a_4$$ to zero. Moments for more terms and higher order moments ($$m>4$$) are also available. Moments for semi-bounded and bounded metalogs are not available in closed form.

Applications
Metalogs are interdisciplinary. Due to their shape and bounds flexibility, they can be used to represent empirical or other data in virtually any field of human endeavor.
 * Astronomy . Metalogs were applied to assess the risks of asteroid impact.
 * Cybersecurity . Metalogs were used in cyber security risk assessment.
 * Eliciting and Combining of Expert Opinion . Statistics Canada elicited expert opinion on future Canadian fertility rates from 18 experts, which included use spreadsheet-based real-time PDF feedback based on five-term metalogs. The individual-expert opinions where weighted and combined into an overall metalog-based forecast.
 * Empirical Data Exploration and Visualization . In fish biology, a 10-term log metalog distribution was fit to the weights of 3,474 steelhead trout caught and released on the Babine River in British Columbia during 2010-2014. The bimodality is attributed to the presence of both first-time and second-time spawners in the river, the latter of which tend to weigh more.
 * Hydrology . A 10-term semibounded metalog was used to model the probability distribution over annual river gauge height.
 * Oil Field Production . Semibounded SPT metalogs were used to analyze biases in projections of oil-field production when compared to observed production after the fact.
 * Simulation Input Distributions . Since quantile functions in the metalog family are expressed in closed form, they facilitate Monte Carlo simulation. Substituting in uniformly distributed random samples of $$y$$ produces random samples of $$x$$ in closed form, thereby eliminating the need to invert a CDF expressed as $$y=F(x)$$. This approach was used to simulate the total value of a portfolio of 259 financial assets.
 * Simulation Output Distributions . Metalogs have also been used to fit output data from simulations in order to represent those outputs (both CDFs and PDFs) as closed-form continuous distributions. Used in this way, they are typically more stable and smoother than histograms.
 * Sums of Lognormals . Metalogs enable a closed-form representation of known distributions whose CDFs have no closed-form expression. Keelin et al. (2019) apply this to the sum of independent identically distributed lognormal distributions, where quantiles of the sum can be determined by a large number of simulations. Nine such quantiles are used to parameterize a semi-bounded metalog distribution that runs through each of these nine quantiles exactly. Quantile parameters are stored in a table which can be interpolated for in-between values, which are guaranteed feasible by the metalogs' convexity property.

For a given application and data set, choosing the number of metalog terms $$k$$ requires judgment. For expert elicitation, 3 to 5 terms is usually sufficient. For data exploration and matching other probability distributions such as the sum of lognormals, 8 to 12 terms is usually sufficient. A metalog panel, which arrays the metalog PDFs for a range of $$k$$ for a given data set, may aid this judgment. For example, in the steelhead weight metalog panel above ... .Other tools such as Akaike information criterion and Bayesian information criteria may also be useful.

Related distributions
The following distributions are subsumed within metalog family:
 * The logistic distribution is a special case of the unbounded metalog where $$a_i = 0$$ for all $$i>2$$.
 * The uniform distribution is a special case of: 1) the unbounded metalog where $$k\geq4$$, $$a_1 = 0.5$$, $$a_4=1$$ and $$a_i=0$$ otherwise; and 2) the bounded metalog where $$k\geq2$$, $$b_l = 0$$, $$b_u = 1$$, $$a_2 = 1$$, and $$a_i=0$$ otherwise.
 * The log-logistic distribution, also known as the Fisk distribution in economics, is a special case of the log metalog where $$b_l = 0$$, and $$a_i = 0$$ for all $$i>2$$.
 * The log-uniform distribution is a special case of the log metalog where $$k\geq4$$, $$a_1 = 0.5$$, $$a_4=1$$, and $$a_i=0$$ otherwise.
 * The logit-logistic distribution is a special case of the logit metalog where $$a_i = 0$$ for all $$i>2$$.

Software
Freely available Commercially available
 * Excel Workbooks.
 * R. rmetalog (CRAN-approved).
 * Python. Pymetalog closely mirrors the R package. Metalogistic takes advantage of the SciPy platform.
 * Web browser. SPT metalog calculator and MakeDistribution.com both allow easy experimentation with fitting metalogs to data. Metalog calculator and ELD (equally likely data) metalog calculator are Excel online versions of the Excel Workbooks
 * SIPmath Modeler Tools, an Excel Add-In simulation environment.
 * FrontLine Solvers: Analytic Solver, RASON, and Solver SDK, software for analytic solutions.
 * Lumina Analytica, software for modeling and aiding difficult decisions.
 * Lone Star Analysis: TruNavigator and AnalyticsOS, predictive and prescriptive analytics.
 * SmartOrg Portfolio Navigator, software to aid new product development and portfolio management.

Convex Hull for Feasible Coefficients of Three-Term Metalogs
Feasibility condition for metalogs with $$k=3$$ terms: $$a_1$$ is any real number, $$a_2>0$$ and $$|a_3|/a_2\leq 1.66711$$.

Convex Hull for Feasible Coefficients of Four-Term Metalogs
Convex Hull for Feasible Coefficients of Four-Term Metalogs

Feasibility for metalogs with $$k=4$$ terms is defined as follows:
 * $$a_1$$ is any real number, and
 * $$a_2\geq0$$, and
 * If $$a_2=0$$, then $$a_3=0$$ and $$a_4>0$$ (uniform distribution exactly)
 * If $$a_2>0$$, then feasibility conditions are specified numerically
 * For a given $$|a_3|/a_2$$, feasibility requires that $$a_4/a_2\geq$$ number shown.
 * For a given $$a_4/a_2$$, feasibility requires that $$|a_3|/a_2\leq$$ number shown.
 * At the top of this table, the four-term metalog is symmetric and peaked, similar to a student-t distribution with 3 degrees of freedom.
 * At the bottom of this table, the four-term metalog is a uniform distribution exactly.
 * In between, it has varying degrees of skewness depending on $$a_3$$. Positive $$a_3$$ yields right skew. Negative $$a_3$$ yields left skew. When $$a_3=0$$, the four-term metalog is symmetric.

Convex Hull Equations
The feasible area can be closely approximated by an ellipse (dashed, gray curve), defined by center $$b =4.5$$ and semi-axis lengths $$c=8.5$$ and $$d =1.95$$. Supplementing this with linear interpolation outside its applicable range, feasibility, given $$a_2>0$$, can be closely approximated:

\ \approx \left\{ \begin{array}{rlcrll} {|a_3|\over a_2}&\leq{d\over c}\sqrt{c^2-({a_4\over a_2}-b)^2} & \text{ for } & -4.0& \leq{a_4\over a_2}\leq 4.5,\\ {|a_3|\over a_2}&\leq 0.014 ({a_4\over a_2} -4.5)+ 1.95 & \mbox{ for } & 4.5&<{a_4\over a_2}\leq 7.0, \\ {|a_3|\over a_2}& \leq 0.004 ({a_4\over a_2}-7.0) +1.984& \mbox{ for} & 7.0 &<{a_4\over a_2}\leq 10.0,\\ {|a_3|\over a_2}& \leq 2.0& \mbox{ for} & 10.0 &<{a_4\over a_2}. \end{array}\right. $$