Talk:Exponential smoothing/Archive 1

The original version
This article is giving an example of using a general technique, even though it says it is a bad example. RJFJR 17:18, 5 March 2006 (UTC)

The whole thing reads like a student essay. --A bit iffy 15:17, 27 April 2006 (UTC)

I do not have access to recent research papers etc., but I can give the following reference: Montgomery, Douglas C., "Forecasting and Time Series Analysis", 1976, McGraw-Hill Inc. martin.wagner@nestle.com 141.122.9.165 18:03, 11 August 2006 (UTC)

It's pretty messy. I've seen worse, though. One of the early formulas seems right, but I haven't looked closely at the text yet. Michael Hardy 20:59, 23 October 2006 (UTC)

For such a widely adopted forecasting method, this is extremely poor. At the very least, standard notation should be adopted, see Makridakis, S., Wheelwright, S. C., & Hyndman, R. J. (1998). Forecasting: Methods and applications (3rd ed.). New Jersey: John Wiley & Sons. State-space notation would also be a useful addition. Dr Peter Catt 02:29, 15 December 2006 (UTC)

I feel really sorry to see poor work like this on Wiki.

A complete rewrite
OK, nobody liked this article very much, and it even came up over on this talk page. So I've rewritten it in its entirety. I'll try to lay my hands on a few reference books in the next month or so, so that I can verify standard notation, etc. Additional information about double and triple exponential smoothing should also go in this article, but at least I've made a start. DavidCBryant 00:45, 10 February 2007 (UTC)

Unsatisfying Derivation of Alpha
The statement that there is no simple way to choose α is very unsatisfying.

If one considers the impulse response of this method, then the time delay of the response (its mean) is 1/α data points, and the rms width of the response is also on the order of (but not exactly) 1/α data points. Thus the method smooths with a smoothing width of 1/α data points, and this is a perfectly good way to choose an α.

208.252.219.2 (talk) 16:01, 18 August 2010 (UTC)
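To make the impulse-response argument above concrete, here is a small Python sketch (my own illustration; the function names are made up): it smooths a unit impulse and measures the mean delay of the response, which comes out near (1 - α)/α, i.e. roughly 1/α data points for small α.

```python
# Smooth a unit impulse with s_t = alpha*x_t + (1 - alpha)*s_{t-1} and
# measure the mean delay of the response.

def impulse_response(alpha, n):
    """Response of simple exponential smoothing to x = [1, 0, 0, ...], s_0 = x_0."""
    s = [1.0]
    for _ in range(n - 1):
        s.append((1 - alpha) * s[-1])   # all later inputs are zero
    return s

def mean_delay(alpha, n=2000):
    """Weighted mean lag of the impulse response; analytically (1 - alpha)/alpha."""
    r = impulse_response(alpha, n)
    return sum(k * w for k, w in enumerate(r)) / sum(r)

print(round(mean_delay(0.05), 1))  # close to (1 - 0.05)/0.05 = 19.0
```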


 * WP:SOFIXIT --Muhandes (talk) 10:03, 20 August 2010 (UTC)

Text from Initial value problem
I have removed the following text from the page Initial value problem which is about ODE theory. If somebody with knowledge of the domain thinks it belongs here, please integrate it. --138.38.106.191 (talk) 14:25, 10 May 2013 (UTC)
 * Exponential smoothing is a general method for removing noise from a data series, or producing a short term forecast of time series data.
 * Single exponential smoothing is equivalent to computing an exponential moving average. The smoothing parameter is determined automatically, by minimizing the squared difference between the actual and the forecast values. Double exponential smoothing introduces a linear trend, and so has two parameters. There are several methods for estimating the initial values; for example, these two formulas:


 * $$y'_0=\left(\frac{\alpha}{1-\alpha}\right)a_t+b_t$$
 * $$y''_0=\left(\frac{\alpha}{1-\alpha}\right)a_t+2b_t$$

redundant "Exponential moving average"
Two articles present similar content:
 * Exponential smoothing
 * Moving average

Sorry, I do not have enough motivation/time to check further and to manage the potential merge.

Oliver H (talk) 09:06, 6 March 2014 (UTC)

Summary is far from neutral and too specific.
The first paragraph of the summary starts by accusing statistics of having a trivial standard of proof and ends by claiming that the (uncited) practice of triple filtering is both standard and numerology. The second paragraph uses the loaded and insulting word "parrots". Having seen the density of bias in the summary, I will not waste time with the body of the article. Further, it is atypical for mathematical articles to establish notation in the summary -- notation is typically established in the Definition section. -- 66.103.116.83 (talk) 01:48, 18 July 2015 (UTC)

For reference, in case someone wants to back it out, here is the (anonymous) edit where the inappropriate sarcasm in the first paragraph, at least, was introduced:
 * https://en.wikipedia.org/w/index.php?title=Exponential_smoothing&oldid=668235733
 * 04:00, 23 June 2015‎
 * Edited by IP address 99.181.61.1

Dr. Hyndman's comment on this article
Dr. Hyndman has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:

"* Confused comment distinguishing "random process" from "an orderly, but noisy, process".
 * Notation used corresponding to engineering rather than statistics and econometrics. Hyndman, Koehler, Ord and Snyder (2008) has encouraged a standardization of exponential smoothing notation in econometrics and statistics. It would be better to use this if possible.
 * The background on simple and weighted moving averages is unnecessary, and should be in a separate article.
 * "FIR and IIR filters" used without explanation.
 * An EWMA is not equivalent to an ARIMA(0,1,1) model as stated. Rather, an ARIMA(0,1,1) model will give forecasts that are equivalent to an EWMA. There is a difference.
 * Derivation of formula from Newbold and Box out of place in this context. It is also derived backwards. This section should be removed as the derivation is given later in any case.
 * The time constant result seems to assume s_0=0.
 * The initial smoothed value is assumed to be set by the user. In modern implementations, this is estimated along with the smoothing parameter.
 * Optimization section refers to regression in a strange aside.
 * The section on "Comparison with moving average" is highly confused. It mixes up smoothing with forecasting, and makes false statements about the distribution of forecast errors. I'm not sure there is anything in this section worth keeping.
 * In double exponential smoothing section it is again assumed that s_0 and b_0 are to be determined by the user rather than estimated.
 * Triple exponential smoothing is discussed in Holt's 1957 paper. It is NOT due to Winters. Winters' 1960 contribution popularized the method.
 * There is a need for a new section (or perhaps a new article) on the link between exponential smoothing and innovations state space models, introduced by Ord, Koehler and Snyder (JASA 1997) and extended in Hyndman et al (2002,2008)."

We hope Wikipedians on this talk page can take advantage of these comments and improve the quality of the article accordingly.

Dr. Hyndman has published scholarly research which seems to be relevant to this Wikipedia article:


 * Reference : Christoph Bergmeir & Rob J Hyndman & Jose M Benitez, 2014. "Bagging Exponential Smoothing Methods using STL Decomposition and Box-Cox Transformation," Monash Econometrics and Business Statistics Working Papers 11/14, Monash University, Department of Econometrics and Business Statistics.

ExpertIdeasBot (talk) 11:13, 1 June 2016 (UTC)

Dr. Snyder's comment on this article
Dr. Snyder has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:

Article: Exponential Smoothing

Reviewer: Adjunct Associate Professor Ralph D Snyder
Affiliation: Department of Econometrics and Statistics, Monash University, Clayton, Victoria, Australia 3800
Date: July 2016

https://drive.google.com/file/d/0B8-wWhGFpGYCMEVwQ2k5VllXcHM/view?usp=sharing

Brief comments on current article
This article contains a traditional perspective of exponential smoothing, being very much captive to versions as they appeared historically in the literature and which have been overtaken by more modern integrated approaches. It places too much emphasis on technical details of methods which have been superseded or substantially simplified in more modern approaches. Some methods are misnamed and a general version of exponential smoothing, of which all the earlier methods are special cases, is ignored.

Too immersed in technical details
“Exponential smoothing is a rule of thumb technique for smoothing time series data, particularly for recursively applying as many as three low-pass filters with exponential window functions.”

The article begins with this sentence which contains the technical terms “low-pass filters” and “exponential window functions”. These terms are taken from an engineering oriented time series literature but would be unknown to most business forecasters and managers. Given that the application of exponential smoothing has been traditionally centred on short-term forecasting of inventory demands, any article written on the subject should recognise that many readers seeking help on this topic are likely to be immediately put off by this orientation. It is curious that such terms are used given that the topic is normally exposited [1] without recourse to them.

This orientation continues throughout the article.

“''It also introduces a phase shift into the data of half the window length. For example, if the data were all the same except for one high data point, the peak in the "smoothed" data would appear half a window length later than when it actually occurred. Where the phase of the result is important, this can be simply corrected by shifting the resulting series back by half the window length.''”

This level of detail is unnecessary and unenlightening in an introductory exposition.

Covers superseded methods
The last paragraph of the section Double exponential smoothing has a focus on Brown’s double exponential smoothing [2] without any explanation for the equations which define it. Brown was an important pioneer of the exponential smoothing methods and has an important place in any historical analysis of their evolution. His approaches all involve only one parameter α, however, and consequently are less flexible than multi-parameter analogues, usually provide poorer forecasts, and have largely been superseded.

Mislabeled methods

 * 1) The section Double exponential smoothing would be better labelled Exponential smoothing with local trends. It describes two approaches: one due to Holt[3] and the other due to Brown[2]. The latter is correctly termed double exponential smoothing because it involves two applications of exponential smoothing: one to the original data and a second to the smoothed data; see the equations for s_t' and s_t''. The first method, however, is traditionally referred to as trend-corrected exponential smoothing to distinguish it from the second method. The term ''trend-corrected'' is apt because the level term s_(t-1) is augmented with the trend term b_(t-1) in the formula for the revised level s_t. The application of the term double exponential smoothing to the first approach has been justified with a reference to note 12 in the Wikipedia article. However, it appears to me that this reference has not been published in an authoritative refereed journal and is likely to be an unreliable source for terminology.
 * 2) The article has a section on Triple exponential smoothing which has a discussion on seasonal methods. Again, the section header involves a confused use of terminology. The term normally applies to a method devised by Brown[2] to augment double exponential smoothing with yet a third application of simple exponential smoothing, this time to the doubly smoothed series. It has nothing to do with seasonal effects.

General versions of exponential smoothing are not covered
The above methods are often described as linear versions of exponential smoothing. Using the terms linear and exponential side by side is necessary, even if a bit confusing. The methods are considered linear because they yield forecasts which are linear functions of the series values. They are considered exponential because the coefficients of the series values in these linear forecast functions decline exponentially with the age of the data. The most general linear version of exponential smoothing was introduced in a much overlooked section of one of the most influential books on time series analysis: Box and Jenkins[4]. It does not rate a mention in what should be an authoritative overview of exponential smoothing. Moreover, the article ignores the most general version of exponential smoothing, encompassing both linear and nonlinear versions (Ord, Koehler and Snyder[5]). The advantage of both general approaches is that they unify the whole area of exponential smoothing and eliminate much of the complexity associated with traditional special cases.

The need to consider statistical models underlying exponential smoothing

 * 1) The methods proposed for seeding the associated recurrence relationships have been ad hoc.
 * 2) Many of the traditional methods proposed for determining the parameters like α and β have been ad hoc, with the exception of those that rely on minimising the sum of squared errors.
 * 3) Although exponential smoothing often yields good point predictions, it has little to say about the measurement of uncertainty surrounding them. What little that has been written on this issue is often based on assumptions which are inconsistent with those which underpin exponential smoothing.

These problems are largely resolved by augmenting exponential smoothing with appropriate statistical models. Then it is possible to bring standard statistical methods to bear on the estimation of seed values and parameters, together with the generation of prediction distributions. There are two main possibilities:
 * 1) Box and Jenkins demonstrated in their seminal book[4] on time series analysis, that any of their ARIMA models has an implied method of linear exponential smoothing and vice versa. Thus, the standard technology associated with the ARIMA framework can be invoked to obtain indirect estimates of exponential smoothing parameters and prediction distributions.
 * 2) Both the linear and nonlinear innovations state space frameworks in Ord, Koehler and Snyder[5] provide a more transparent link with both the linear and nonlinear forms of exponential smoothing. It can be used to directly obtain maximum likelihood estimates of the parameters and seed values associated with exponential smoothing and also derive prediction distributions. By encompassing nonlinear cases, it is more general than the Box and Jenkins framework.

Links to other modern forecasting methods are not covered

 * The Kalman filter[6], a pivotal time series method, is not mentioned. It has close links with exponential smoothing, the former converging to the latter as successive series values are processed in conjunction with time-invariant linear state space models.
 * Bayesian forecasting (Harrison and Stevens[7]), which relies on the Kalman filter, has its conceptual roots in exponential smoothing.
 * The structural time series framework (Harvey[8]) also has strong links with exponential smoothing.

Croston's Method for intermittent demand forecasting
I've never commented on Wikipedia before, so please excuse my ignorance.

I've recently been reviewing methods of intermittent demand forecasting and have applied the Croston method, which applies exponential smoothing based on the ratio of demand to demand intervals - see https://www.researchgate.net/publication/254044245_A_Review_of_Croston's_method_for_intermittent_demand_forecasting.

This method cannot be found on Wikipedia as far as I can tell, and I'm not sure if it belongs here, under the forecasting topic, or as a stand-alone topic. — Preceding unsigned comment added by S&opgeek (talk • contribs) 02:18, 23 September 2016 (UTC)
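For readers landing here, a minimal Python sketch of Croston's method as the linked review describes it (separately smooth the nonzero demand sizes and the intervals between demands, then forecast their ratio); the seeding with the first observed demand and interval is my own simplification:

```python
# Croston's method for intermittent demand: smooth demand sizes and
# inter-demand intervals separately, forecast their ratio.

def croston(demand, alpha=0.1):
    """Return the per-period demand-rate forecast after the last observation."""
    z = None      # smoothed demand size
    p = None      # smoothed demand interval
    q = 1         # periods since last nonzero demand
    for x in demand:
        if x > 0:
            if z is None:               # seed with the first demand and interval
                z, p = float(x), float(q)
            else:
                z += alpha * (x - z)
                p += alpha * (q - p)
            q = 1
        else:
            q += 1
    return z / p if z is not None else 0.0

series = [0, 0, 3, 0, 0, 0, 2, 0, 4, 0]
print(round(croston(series, alpha=0.1), 3))
```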

Problems with weighting
Now that the article has been re-written to be much clearer, I see that there are some problems which I had not previously noticed.
 * 1) Currently, the way the average is initialized gives much too much weight to the first observation. If α is 0.05 (corresponding to a moving average of about 20 values), when you initialize and then input a second observation, the average is (1*x1 + 19*x0)/20, which gives the first observation 19 times the weight of the second, when it would be more appropriate to give the first observation 19/20 of the weight of the second observation.
 * 2) No provision is made for the practical problem of missing data. If an observation is not made at some time, or it is made but lost, then what do we do?

Perhaps these difficulties could be addressed, in part, by separately computing a normalization factor: form a second sum in the same way, always using 1 as the data, and then divide it into the sum of the actual observations. JRSpriggs 03:45, 12 February 2007 (UTC)
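The normalization idea above can be sketched in Python (my own illustration, including a guessed treatment of missing observations, which simply decays both sums):

```python
# Exponentially weighted average with a separately smoothed normalizer, so
# that early observations are not over-weighted by the initialization.
# Missing observations (None) decay both sums, preserving the relative
# weights of the observations that are present.

def normalized_ewma(data, alpha):
    num = 0.0   # smoothed sum of the observations
    den = 0.0   # smoothed sum of the constant series 1, 1, 1, ...
    out = []
    for x in data:
        if x is None:                    # missing data point
            num *= (1 - alpha)
            den *= (1 - alpha)
        else:
            num = alpha * x + (1 - alpha) * num
            den = alpha * 1.0 + (1 - alpha) * den
        out.append(num / den if den > 0 else float("nan"))
    return out

# With alpha = 0.05, the second output weights x0 by 19/20 of x1's weight:
print(normalized_ewma([0.0, 1.0], 0.05))
```

This is the same correction pandas applies in its `ewm(adjust=True)` mode, for what it's worth.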


 * Do you see this as a problem with the article itself, or a problem with the statistical technique described in the article? —David Eppstein 08:06, 12 February 2007 (UTC)


 * Well, that is the question, is it not? I do not know enough about statistics to know whether the technique has been described incorrectly or whether we should point out that the technique has these limitations. JRSpriggs 09:59, 12 February 2007 (UTC)


 * When I learned about this technique, I think I remember learning that either of two methods could be used to initialize it (either copy the first data point enough times to fill in the array, or copy the most recent data point enough times). But I have no idea where I learned about this and I have no references on it, so I can't say what is done in practice. The article on moving average also has no discussion of how to initialize the array. It's hardly a limitation on the method because the method is intended for large data sets, not tiny ones. There seem to be some references at moving average like this one. CMummert · talk 13:27, 12 February 2007 (UTC)

Perhaps we should consider merging this article into Moving average. Otherwise, I think that the weighting should be more like this:
 * $$s_t = \frac {\sum_k \exp (-\alpha (t - t_k)) x_k}{\sum_k \exp (-\alpha (t - t_k))}$$

where the sum is over observations with $$t_k \leq t$$. What do you-all think? JRSpriggs 04:35, 13 February 2007 (UTC)
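A direct Python rendering of the proposed formula, for observations x_k at possibly irregular times t_k (the function name is made up):

```python
import math

# Evaluate s_t directly from the proposed weighting: a normalized sum of
# exp(-alpha*(t - t_k)) * x_k over all observations with t_k <= t.

def weighted_smooth(times, values, t, alpha):
    pairs = [(tk, xk) for tk, xk in zip(times, values) if tk <= t]
    weights = [math.exp(-alpha * (t - tk)) for tk, _ in pairs]
    return sum(w * x for w, (_, x) in zip(weights, pairs)) / sum(weights)

print(weighted_smooth([0, 1, 2], [3.0, 4.0, 5.0], 2, 0.5))
```

Note it needs every past observation on each evaluation, which is exactly the storage objection raised below in the thread.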

I agree on both counts, particularly the merge. MisterSheik 18:47, 16 February 2007 (UTC)

The issue with the proposed formula is that it removes a major benefit of smoothing, namely that you don't need to store all data points, only the most recent ones. The initialization is addressed in "Choosing the initial smoothed value". — Preceding unsigned comment added by 2620:0:100E:904:CDE3:35A3:390:9155 (talk) 17:30, 27 January 2017 (UTC)

least square optimisation of alpha
I do not understand exactly why optimizing alpha using LS methods should work. The sum of squares of differences is minimized for alpha=1, where it equals 0. By continuity of the optimization problem, I suppose there are no other non-trivial solutions. Please give some citation/reference. —Preceding unsigned comment added by 149.156.82.207 (talk) 18:25, 15 December 2010 (UTC)
 * I don't follow. For alpha=1, s_t=x_{t-1}, i.e. the estimate is always the last measurement. This minimizes the sum of squares only when x_{t-1}=x_t, i.e. only for a constant series. If you need a source on this, check section 6.4.3.1 of the NIST/SEMATECH e-Handbook of Statistical Methods, which is the source of most of the article. --Muhandes (talk) 19:15, 15 December 2010 (UTC)
 * For $$\alpha=1$$, $$s_t=x_t$$, i.e. smoothed value is equal to the last actual value, and the error as you put it - $$(s_t - x_t)^2$$ - is always 0. That's why many people were confused, including me. Optimizing mean square error only makes sense when exponential smoothing is used for predicting next value, but squared error should be calculated as $$(s_t - x_{t+1})^2$$ then, as $$s_t$$ in this case is an estimate for the future value of x, $$x_{t+1}$$. I will correct this and provide a reference. Shcha (talk) 13:41, 5 July 2017 (UTC)
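A Python sketch of the corrected criterion discussed above (minimize the one-step-ahead error $$(s_{t-1} - x_t)^2$$, not the in-sample fit); a simple grid search stands in for a proper optimizer, and the seeding s_0 = x_0 is my own choice:

```python
# Choose alpha by minimizing the sum of squared one-step-ahead forecast
# errors: s_{t-1} is the forecast of x_t, made before x_t is seen, so
# alpha=1 is no longer a trivial minimizer.

def sse_one_step(data, alpha):
    """Sum of squared one-step-ahead forecast errors, seeded with s_0 = x_0."""
    s = data[0]
    sse = 0.0
    for x in data[1:]:
        sse += (s - x) ** 2          # s was computed before x was seen
        s = alpha * x + (1 - alpha) * s
    return sse

def best_alpha(data, steps=100):
    grid = [k / steps for k in range(1, steps + 1)]
    return min(grid, key=lambda a: sse_one_step(data, a))

# On strongly trending data the smoothed value always lags, so the search
# pushes alpha to its upper limit:
print(best_alpha(list(range(1, 11))))
```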

The exponential moving average
The article says: For example, the method of least squares might be used to determine the value of α for which the sum of the quantities (sn − xn)^2 is minimized. Uh? I think such $$\alpha$$ would be one! Albmont 11:14, 28 February 2007 (UTC)


 * Thanks. It should have been sn-1 instead of sn. I changed it. JRSpriggs 12:25, 28 February 2007 (UTC)
 * Still wrong for the above mentioned reason, and no citation provided. Removed WP:NOR Shcha (talk) 13:18, 5 July 2017 (UTC)
 * Restored this bit with correction, see my comment in under least square optimisation of alpha, also added a reference. It's also related to the discussion under Is s(t) right. Shcha (talk) 13:56, 5 July 2017 (UTC)

Negative values for smoothing factor α
All references I have looked at suggest that the value of α must be chosen between 0 and 1. However, none offer any reason for this. Although such a range may be "intuitive", I have worked with datasets for which the optimal value for α (in a least-squares sense, as described in the article) is negative. Why would this be wrong? koochak 10:30, 5 March 2008 (UTC)

Look at the meaning of α. It is a percentage of the smoothed value that should be generated using the previous smoothed value. You cannot have a negative percentage. JLT 15:03, 16 Dec 2009 (CST)

I agree with koochak above, and with the fact that α can take negative values. The explanation that α must be between 0 and 1 because it is a percentage is just circular logic. So long as you can build a linear combination of observed and predicted values, the exponential smoothing formula holds. 07:42, 28 September 2014 (PC)

If $$\alpha$$ is negative, then $$1-\alpha$$ is greater than one, which leads to an unstable system. The result, then, is a growing exponential, instead of a decaying exponential, which is not smoothing, but in some cases could be useful. Gah4 (talk) 08:49, 11 April 2018 (UTC)
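Gah4's point can be seen directly from the weights: the observation k steps in the past carries weight α(1 - α)^k, which decays for 0 < α < 1 but grows in magnitude when α is negative. A two-line Python check (my own illustration):

```python
# Weight on an observation k steps in the past under simple exponential
# smoothing.  For negative alpha, (1 - alpha) > 1, so old data is amplified
# rather than smoothed: an unstable (growing) response.

def weight(alpha, k):
    return alpha * (1 - alpha) ** k

print(weight(0.3, 10))    # decays toward zero
print(weight(-0.3, 10))   # magnitude grows with k
```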

Double exponential smoothing != double exponential smoothing
After a lot of confusion and searching, I noticed that there are at least three approaches to calculate a double exponential smoothing.
 * 1) possibly the Holt method
 * This one just calculates a single exponential smoothing, the results $$s_{t}$$ are used as the starting values for the estimation line (i. e. $$F_{t+0}$$). Additionally, the trend itself $$b_{t}$$ is calculated and is used as the gradient of the estimation line. As a result, the estimation difference to single exponential smoothing is just that a trend is assumed, calculated and used, using the result of the simple exp. smoothing as the starting point.
 * Sources:
 * this German PDF, called “Exponentielle Glättung 2. Ordnung” there
 * German wikibook Mathematik: Statistik: Glättungsverfahren, called “gleitender Durchschnitt zweiter Ordnung” there
 * 2) the Brown method
 * This one first calculates a single exponential smoothing $$S'_{t}$$ over the data and then calculates another exponential smoothing $$S''_{t}$$ over that smoothed line, resulting in a double smoothing. The same α is used both times. The estimation line has the starting value $$2 \cdot S'_{t}-S''_{t}$$; the line gradient is described as $$\frac{\alpha}{1-\alpha} \cdot (S'_{t}-S''_{t})$$
 * Sources:
 * this German PDF, called “Trendkorrekturversion (Modell von Brown)” there
 * Duke University’s Averaging and Exponential Smoothing Models, called “Brown’s linear exponential smoothing” there
 * 3) allegedly the linear exponential smoothing by Holt/Winters (the one talked about in the article)
 * This one works similarly to the Brown method but instead of just taking the previous result of the single smoothing it takes into account the previously forecasted trend $$S''_{t-1}$$ by adding it to the previously forecasted level $$S'_{t-1}$$. Also, the new variable β is used to adjust the influence of the trend on the forecast.
 * Sources:
 * this German PDF, called “Lineare exponentielle Glättung von Holt-Winters” there
 * Time series Forecasting using Holt-Winters Exponential Smoothing, pages 3–4, called “double exponential smoothing” there

My point is: this must be clarified and explained properly.

I’m sorry for the lack of English resources, I hope you can find better ones than me. --Uncle Pain (talk) 14:29, 23 September 2011 (UTC)

A small addition after some comparison: the methods 1 and 2 are indeed very similar, as the German PDF implies by combining them in a connected row. The only difference seems to be the $$\frac{\alpha}{1-\alpha}$$ factor in method 2 which should be $$\frac{1}{1-\alpha}$$ to make it match the results of method 1. Both calculate the same gradient of the estimation line. --Uncle Pain (talk) 15:59, 23 September 2011 (UTC)
 * Very good points. I'll try to work it all out into the article tomorrow, thanks for the resources. --Muhandes (talk) 20:35, 24 September 2011 (UTC)
 * I added the second method. I never met the first method before, and I'm still trying to figure out if it is different from the second. If you are confident in your analysis, I suggest you add it yourself. --Muhandes (talk) 14:58, 25 September 2011 (UTC)
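For comparison purposes, here is a minimal Python sketch of the two methods discussed above (the initialization choices are my own, not from the cited sources): on an exactly linear series, both recover the slope.

```python
# Brown's method: two passes of simple smoothing with a single alpha.
# Holt's method: separate level and trend recursions with alpha and beta.

def brown(data, alpha):
    s1 = s2 = float(data[0])
    for x in data[1:]:
        s1 = alpha * x + (1 - alpha) * s1      # first smoothing
        s2 = alpha * s1 + (1 - alpha) * s2     # smoothing of the smoothed series
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level, trend

def holt(data, alpha, beta):
    s = float(data[0])
    b = float(data[1] - data[0])               # seed trend from the first step
    for x in data[1:]:
        s_prev = s
        s = alpha * x + (1 - alpha) * (s + b)
        b = beta * (s - s_prev) + (1 - beta) * b
    return s, b

line = [2 * t + 1 for t in range(50)]          # exact linear trend, slope 2
print(brown(line, 0.3))                        # trend estimate approaches 2
print(holt(line, 0.3, 0.2))                    # slope recovered here as well
```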

Sorry, but I have no idea how to go about this. There are three components in a time series: level, trend, and seasonality. Simple (single) exponential smoothing (SES) assumes no trend or seasonality. Holt's (double) exponential smoothing incorporates trend. Care should be taken, as there is also a double exponential smoothing that incorporates seasonality instead of trend, referred to as seasonal SES. Holt-Winters (triple) exponential smoothing incorporates both trend and seasonality. Which method is best depends on which components are present (level you always have, so it is about whether trend and seasonality are present). The main article currently incorrectly links a paper on Holt-Winters when discussing double exponential smoothing (open the reference and you see discussion of seasonality in the abstract). It also erroneously labels Holt-Winters as a double exponential smoothing process. Here are two definitive references for exponential smoothing:

Gardner Jr, E.S., 1985. Exponential smoothing: The state of the art. Journal of Forecasting, 4(1), pp. 1-28.

Gardner Jr, E.S., 2006. Exponential smoothing: The state of the art—Part II. International Journal of Forecasting, 22(4), pp. 637-666.

And here is a very good open-source textbook that discusses my points: https://otexts.org/fpp2/ — Preceding unsigned comment added by Direxd (talk • contribs) 13:14, 15 August 2018 (UTC)
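For completeness, a minimal additive Holt-Winters sketch of the three components described above (level, trend, seasonality); the simple first-two-seasons initialization is my own choice, not taken from the Gardner references:

```python
# Additive Holt-Winters: level s, trend b, seasonal offsets c (period m).

def holt_winters(data, m, alpha, beta, gamma):
    s = sum(data[:m]) / m                              # mean of the first season
    b = (sum(data[m:2 * m]) - sum(data[:m])) / m ** 2  # average per-period trend
    c = [data[i] - s for i in range(m)]                # rough seasonal offsets
    for t in range(m, len(data)):
        x = data[t]
        s_prev = s
        s = alpha * (x - c[t % m]) + (1 - alpha) * (s + b)
        b = beta * (s - s_prev) + (1 - beta) * b
        c[t % m] = gamma * (x - s) + (1 - gamma) * c[t % m]
    return s, b, c

season = [1.0, -1.0, 2.0, -2.0]
data = [10 + 0.5 * t + season[t % 4] for t in range(200)]
s, b, c = holt_winters(data, 4, 0.3, 0.1, 0.2)
print(round(b, 3))                     # recovers the underlying trend of 0.5
print(round(s + b + c[200 % 4], 2))    # one-step forecast of the next value
```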

"convert FIR filters to IIR filters"
the article uses the phrase "convert FIR filters to IIR filters", but readers like me have no idea what FIR and IIR mean. Jackzhp (talk) 09:26, 28 July 2017 (UTC)
 * I added links to the appropriate pages for those phrases, hope this helps. Benbradley (talk) 15:45, 29 December 2018 (UTC)