“M5 Accuracy Competition: Results, Findings, and Conclusions”, 2022-01-11:
In this study, we present the results of the M5 “Accuracy” Kaggle competition, which was the first of 2 parallel challenges in the latest M competition with the aim of advancing the theory and practice of forecasting.
The main objective in the M5 “Accuracy” competition was to accurately predict 42,840 time series representing the hierarchical unit sales for the largest retail company in the world by revenue, Walmart. The competition required the submission of 30,490 point forecasts for the lowest cross-sectional aggregation level of the data, which could then be summed up accordingly to estimate forecasts for the remaining upward levels.
We provide details of the implementation of the M5 “Accuracy” challenge, as well as the results and best performing methods, and summarize the major findings and conclusions.
Finally, we discuss the implications of these findings and suggest directions for future research.
[Keywords: forecasting competitions, M competitions, accuracy, time series, machine learning, retail sales forecasting]
…The M5 “Accuracy” competition involved 7,092 participants in 5,507 teams from 101 countries. Among these teams, 4,373 (79.4%) entered the competition during the “validation” phase and 1,134 (20.6%) during the “test” phase. Moreover, 1,434 teams (26.0%) made submissions during both the “validation” and “test” phases of the competition, but 2,939 (53.4%) only during the “validation” phase. In total, the participating teams made 88,136 submissions, most of which (78.3%) were submitted during the “validation” phase. Most of the teams made a single submission, but some made 3–20 submissions. It is worth mentioning that 1,563 participants participated in a Kaggle competition for the first time, including 15 in the top 100…Among the participating teams, 2,666 (48.4%) managed to outperform the Naive benchmark, 1,972 (35.8%) outperformed the sNaive (a naive method that accounts for seasonality) benchmark, and 415 (7.5%) beat the top-performing benchmark ([exponential smoothing] ES_bu; details of the benchmarks used in the M5 “Accuracy” competition and their performance are provided in the appendix of the supplementary material). It is important to note that these numbers refer to the forecasts selected by each team for the final evaluation of their performance, not to the “best” submission made in each case while the competition was still running. In the latter case, 3,510 (63.7%), 2,685 (48.8%), and 672 (12.2%) teams would have outperformed the Naive, sNaive, and ES_bu benchmarks, respectively. Thus, many teams failed to choose the best method that they developed, probably due to misleading validation scores…As discussed later, the winning teams, who used more sophisticated ML methods, outperformed the benchmarks by a notable margin, but this does not mean that the success of ML methods in general can be taken for granted.
…Among the 415 teams that managed to outperform all of the benchmarks in the competition, 5 obtained improvements greater than 20%, 42 greater than 15%, 106 greater than 10%, and 249 greater than 5%. These improvements are substantial and they demonstrate the superior performance of the winning M5 methods compared with the standard forecasting benchmarks. Moreover, the 5 winners of the competition were the only teams to obtain accuracy improvements greater than 20%, thereby achieving a clear victory.
…4.2. Winning submissions: …Before presenting the 5 winning methods, we note that most of the methods used LightGBM, an ML algorithm that performs nonlinear regression using gradient-boosted trees ( et al 2017). LightGBM has several advantages over other ML alternatives in forecasting tasks such as those in the M5 “Accuracy” competition because it allows the effective handling of multiple features (e.g., past sales and exogenous/explanatory variables) of various types (numeric, binary, and categorical). In addition, it is fast to compute compared with other gradient boosting methods (GBMs), does not depend on data pre-processing and transformations, and only requires the optimization of a relatively small number of parameters (e.g., learning rate, number of iterations, maximum number of bins, number of estimators, and loss function). In this regard, LightGBM is highly convenient for experimenting and for developing solutions that generalize accurately to a large number of series with cross-correlations. In fact, LightGBM can be considered the standard method of choice in Kaggle’s recent forecasting competitions: the winners of the “Corporación Favorita Grocery Sales Forecasting” and “Recruit Restaurant Visitor Forecasting” competitions built their approaches using this method (2021), and the discussions and notebooks posted on Kaggle for the M5 “Accuracy” competition focused on using LightGBM and its variants.
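As a rough, illustrative sketch of the gradient-boosting idea underlying LightGBM (the library itself adds histogram binning, leaf-wise tree growth, regularization, and native categorical handling, none of which is reproduced here), the following fits depth-1 regression trees (stumps) to the residuals of the running prediction:

```python
# Minimal gradient-boosting sketch: depth-1 trees fitted to residuals
# under squared loss. Illustrative only -- not LightGBM's implementation.

def fit_stump(x, residuals):
    """Find the single-feature threshold minimizing squared error."""
    best = None  # (sse, threshold, left_mean, right_mean)
    values = sorted(set(x))
    for i in range(len(values) - 1):
        t = (values[i] + values[i + 1]) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def boost(x, y, rounds=50, lr=0.1):
    """Additive model: base prediction plus `rounds` shrunken stumps."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, residuals)
        stumps.append((t, lm, rm))
        pred = [p + lr * (lm if xi <= t else rm) for xi, p in zip(x, pred)]
    return base, lr, stumps

def predict(model, xi):
    base, lr, stumps = model
    return base + sum(lr * (lm if xi <= t else rm) for t, lm, rm in stumps)
```

Each round moves the prediction a small step (the learning rate) toward the residuals, which is why the method captures nonlinear relationships without explicit feature transformations.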
The forecasting methods used by the 5 winning teams can be summarized as follows.
YJ_STU; YeonJun In: The winner of the competition was a senior undergraduate student at Kyung Hee University, South Korea, who used an equally weighted combination (arithmetic mean) of various LightGBM models, which were trained to produce forecasts for the product-store series using data pooled per store (10 models), store-category (30 models), and store-department (70 models). 2 variations were considered for each type of model, where the first applied a recursive and the second a non-recursive forecasting approach (Bontempi et al 2013). In total, 220 models were built and each series was forecast using the average of 6 models, where each exploited a different learning approach and training set. The models were optimized without considering early stopping and by minimizing the negative log-likelihood of the Tweedie distribution ( et al 2020), which is considered an effective approach when handling data with a probability mass at zero and a non-negative, highly right-skewed distribution.
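A sketch of why the Tweedie objective suits intermittent retail sales: for a power parameter 1 &lt; p &lt; 2, the unit Tweedie deviance remains finite at y = 0 (the distribution places a point mass there) while still penalizing errors on large, right-skewed demand. This is the standard deviance formula, not the winner's code:

```python
def tweedie_deviance(y, mu, p=1.5):
    """Unit Tweedie deviance for power parameter 1 < p < 2.

    Finite at y == 0 (point mass at zero), which makes the objective
    suitable for intermittent, right-skewed sales data. Zero when mu == y.
    """
    assert mu > 0 and 1 < p < 2
    term_y = y ** (2 - p) / ((1 - p) * (2 - p)) if y > 0 else 0.0
    return 2 * (term_y - y * mu ** (1 - p) / (1 - p) + mu ** (2 - p) / (2 - p))
```

Summing this deviance over training rows (and differentiating with respect to the prediction) yields the gradient signal that a boosting library fits against at each round.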
The method was fine-tuned using the last 4 28-day-long windows of available data for CV and by measuring both the mean and the standard deviation of the errors produced by the individual models and their combinations. In this manner, the final solution was selected such that it provided both accurate and robust forecasts. Among the features used, the models considered various identifiers, calendar-related information, special days, promotions, prices, and unit sales data in both recursive and non-recursive formats.
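The validation scheme described above, holding out the last 4 non-overlapping 28-day windows in turn, can be sketched as follows (the window count and length come from the text; everything else, including the day count in the usage example, is illustrative):

```python
def rolling_cv_splits(n_days, n_folds=4, horizon=28):
    """Yield (train_end, valid_days) pairs for the last `n_folds`
    non-overlapping 28-day validation windows, oldest fold first.

    Training for each fold uses days [0, train_end); validation uses
    the following `horizon` days, mimicking the live forecasting task.
    """
    for k in range(n_folds, 0, -1):
        valid_start = n_days - k * horizon
        yield valid_start, range(valid_start, valid_start + horizon)

# Usage with a hypothetical 1,913-day training history:
splits = list(rolling_cv_splits(1913))
```

Because each fold's validation window has exactly the competition's horizon length, the per-fold errors approximate the post-sample accuracy the submission will face.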
Matthias; Matthias Anderer: This method was also based on an equally weighted combination of various LightGBM models, but it was externally adjusted through multipliers according to the forecasts produced by N-BEATS (deep-learning NN for time series forecasting; et al 2019) for the top 5 aggregation levels of the data set. Essentially, LightGBM models were first trained per store (10 models) and then 5 different multipliers were used to adjust their forecasts and to correctly capture the trend. In total, 50 models were built and each series of the product-store level in the data set was forecast using a combination of 5 different models. A custom, asymmetric loss function was used. The last 4 28-day-long windows of available data were used for CV and model building. The LightGBM models were trained using only some basic features of calendar effects and prices (past unit sales were not considered), and the N-BEATS model was based solely on historical unit sales.
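The core multiplier idea, rescaling bottom-level forecasts so their aggregate matches a trusted forecast from a top-level model, can be sketched in a few lines (hypothetical numbers; this is not the actual N-BEATS pipeline, which applied multipliers at 5 aggregation levels):

```python
def apply_multiplier(bottom_forecasts, top_level_forecast):
    """Scale bottom-level forecasts so that their sum equals the
    aggregate forecast produced by a separate top-level model,
    preserving the relative proportions among the bottom series."""
    m = top_level_forecast / sum(bottom_forecasts)
    return [m * f for f in bottom_forecasts]

# E.g., 3 product forecasts adjusted to a store-level forecast of 20 units:
adjusted = apply_multiplier([2.0, 3.0, 5.0], 20.0)
```

The adjustment injects the trend visible at the aggregate level into granular forecasts without retraining the bottom-level models.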
mf; Yunho Jeon & Sihyeon Seong: This method used an equally weighted combination of 43 deep-learning NNs ( et al 2020), where each comprised multiple long short-term memory layers, which were employed to recursively predict the product-store series. Among the models trained, 24 considered dropout, whereas the other 19 did not. These 43 models derived from only 12 base models, being the last, more accurate instances observed for those models during training, as specified by CV (last 14 28-day-long windows of available data). Similar to the winner, the method considered Tweedie regression, but it was modified to optimize weights based on the sampled predictions instead of actual values. The Adam optimizer was used, with cosine annealing for learning rate scheduling. The NNs considered 100 features with similar characteristics to those used by the winning submission (sales data, calendar-related information, prices, promotions, special days, identifiers, and zero-sales periods).
monsaraida; Masanori Miyahara: This method produced forecasts for the product-store series in the data set using non-recursive LightGBM models trained per store (10 models). However, in contrast to the other methods, each week in the forecasting horizon was forecast separately using a different model (4 models per store). Thus, 40 models were built to produce the forecasts. The features used as inputs were similar to those applied by the winning submission, except for the recursive features. Tweedie regression was applied for training the models with no early stopping, and the training parameters were not optimized. The last 5 28-day-long windows of available data were used for CV.
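The one-model-per-week scheme corresponds to a "direct" multi-horizon setup: each model is trained on targets shifted by a different number of days, so no forecast is ever fed back recursively. A data-preparation sketch, with features simplified to the last 3 observed daily sales (the real feature set was far richer):

```python
def make_direct_training_set(sales, week):
    """Build (features, target) pairs for one of the 4 weekly models.

    The model for `week` w (0-based) serves lead times w*7+1 .. w*7+7
    days ahead, so targets are shifted by at least w*7+1 days past the
    latest lag feature available at the forecast origin.
    """
    min_gap = week * 7 + 1              # smallest lead time this model serves
    rows = []
    for t in range(3, len(sales) - min_gap + 1):
        features = sales[t - 3:t]       # lags observed at the forecast origin
        target = sales[t + min_gap - 1]
        rows.append((features, target))
    return rows
```

Training 4 such sets per store avoids the error accumulation of recursive forecasting, at the cost of maintaining separate models per lead-time block.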
Alan Lahoud; Alan Lahoud: This method used recursive LightGBM models, which were trained per department (7 models). After producing the forecasts for the product-store series, they were externally adjusted such that the mean of each of the series at the store-department level was the same as that for the previous 28 days, which was achieved using appropriate multipliers. The models were trained using Poisson regression with early stopping and validated using a random sample of 500 days. The features used as inputs were similar to those employed by the winning submission.…4.3. Key findings: The main findings related to the performance of the top 5 methods are summarized as follows.
Superior performance of ML methods: Over many years, empirical studies have demonstrated that simple methods are as accurate as complex or statistically sophisticated methods ( et al 2020c). Limited data availability, inefficient algorithms, the need for preprocessing, and restricted computational power are just some of the factors that reduce the accuracy of ML methods compared with statistical methods ( et al 2018b). M4 was the first forecasting competition to show that 2 ML-based approaches were substantially more accurate than simple statistical methods, thereby highlighting the potential value of ML methods for obtaining more accurate forecasts ( et al 2020c). The first method that won the M4 competition was a hybrid approach based on mixed recurrent NNs and exponential smoothing (2020), and the second-ranked method used XGBoost to optimally weight the forecasts produced by standard time series methods (Montero- et al 2020). Both of the winning M4 submissions were based on ML, but they were built on statistical, series-specific functionalities, and their accuracy was also similar to a simple combination of the median of 4 statistical methods (2020).
By contrast, M5 was the first competition where all of the top-performing methods were both “pure” ML approaches and better than all statistical benchmarks and their combinations. It was shown that LightGBM can be used effectively to process numerous correlated series and exogenous/explanatory variables, and to reduce forecast errors. Moreover, deep learning methods such as DeepAR and N-BEATS, using advanced, state-of-the-art ML implementations, have shown forecasting potential, motivating further research in this direction.
Value of combining: The M5 “Accuracy” competition confirmed the findings of the previous 4 M competitions as well as those of numerous other studies by demonstrating that the accuracy can be improved by combining forecasts obtained with different methods, even relatively simple ones (2020).
The winner of the M5 “Accuracy” competition employed a very simple, equal-weighted combination, involving 6 models, where each exploited a different learning approach and training set. Similarly, the runner-up used an equal-weighted combination of 5 models, where each obtained a different estimate of the trend, and the third best-performing method used an equal-weighted combination of 43 NNs. Simple combinations of models were also used by the methods ranked 14th, 17th, 21st, 24th, 25th, and 44th. Among these combination approaches, only that ranked 25th considered unequal weighting of the individual methods. The value of combining was also supported by comparisons made between the benchmarks in the competition. As shown in the appendix of the supplementary material, the combination of exponential smoothing and AutoRegressive Integrated Moving Average (ARIMA) models performed better than the individual methods, while a combination of top-down and bottom-up reconciliation methods outperformed both the top-down and bottom-up methods.
Therefore, our results support the long-standing belief that combining forecasts obtained with different methods can improve the forecasting accuracy and they confirm that there is no guarantee that an “optimal” forecast combination will perform better than a simpler, equal-weighted one ( et al 2016).
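The equal-weighted combinations described above reduce to a plain arithmetic mean across models at each horizon step; a minimal sketch with hypothetical forecasts:

```python
def combine_equal(forecast_lists):
    """Equal-weighted combination: arithmetic mean across models
    for each step of the forecast horizon."""
    n = len(forecast_lists)
    return [sum(step) / n for step in zip(*forecast_lists)]

# Two hypothetical models forecasting a 2-day horizon:
combined = combine_equal([[1.0, 2.0], [3.0, 4.0]])
```

The appeal of equal weights, as the text notes, is that estimating "optimal" weights adds estimation error that often outweighs any theoretical gain.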
Value of “cross-learning”: In the previous M competitions, most of the series were uncorrelated, with different frequencies and domains, and chronologically unaligned. Nevertheless, both of the top-performing M4 submissions used “cross-learning” from multiple series concurrently instead of one series at a time; however, their approach was difficult to implement effectively in practice, and it did not demonstrate the full potential of “cross-learning”.
By contrast, the M5 comprised aligned, highly correlated series structured in a hierarchical fashion, so “cross-learning” was much easier to apply and superior results were achieved compared with methods trained in a series-by-series manner. It should be noted that in addition to producing more accurate forecasts, “cross-learning” implies the use of a single model instead of multiple models, each trained using data from a different series, thereby reducing the overall computational cost and mitigating difficulties related to limited historical observations ( et al 2021). Essentially, all of the top 50 performing methods in M5 used “cross-learning” by exploiting all of the information in the data set.
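Mechanically, "cross-learning" amounts to pooling all series into one long training table, with a series identifier as a categorical feature, so that a single model is fitted across the whole hierarchy. A schematic sketch (not any team's actual feature set):

```python
def pool_series(series_by_id, n_lags=3):
    """Stack all series into one training table for a single model.

    Each row: (series_id, lag features, target). The series_id column
    lets one model learn series-specific behavior while sharing
    structure -- seasonality, promotion effects -- across all series.
    """
    rows = []
    for sid, sales in series_by_id.items():
        for t in range(n_lags, len(sales)):
            rows.append((sid, sales[t - n_lags:t], sales[t]))
    return rows

# Two short hypothetical series pooled into one table:
rows = pool_series({"A": [1, 2, 3, 4, 5], "B": [5, 4, 3, 2, 1]})
```

Series with very short histories still contribute rows and, conversely, borrow strength from the pooled data, which is the mitigation of limited observations mentioned above.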
Notable differences between the winning methods and benchmarks used for sales forecasting: The M5 “Accuracy” competition considered 24 benchmarks of various types that are typically used in sales forecasting applications, including traditional and state-of-the-art statistical methods, ML methods, and combinations. As shown in Figure 3 & Table 2, the winning submissions provided more accurate forecasts in terms of ranks compared with these benchmarks and they were also more than 20% better in terms of the average WRMSSE. The differences were much smaller at lower aggregation levels and in some cases negative, but the results clearly demonstrated their overall superiority, thereby motivating additional research into the area of ML forecasting methods that can be used to predict complex, nonlinear relationships between series, as well as including exogenous/explanatory variables. However, it should be noted that this finding is based on the performance of the winning teams alone. When the whole sample of participating teams was considered, we found that the vast majority (about 92.5%) failed to outperform the top performing benchmark, despite the latter being considerably simpler.
This finding suggests that standard time series forecasting methods, such as exponential smoothing, may still be useful for supporting decisions related to the operation of retail companies, and that the usage of ML methods does not necessarily guarantee better performance, at least when the methods employed are not built and trained correctly, as they were by the M5 winning teams. Similarly, we found that it was possible to use more sophisticated methods to improve the forecasting accuracy at a particular cross-sectional level, but the impact was minor at other levels, especially the most granular ones. Therefore, the adoption of more sophisticated methods should be carefully assessed by investigating whether the value added by these approaches in terms of accuracy is meaningful compared with their costs (2020).
Beneficial effects of external adjustments: Forecast adjustments are typically used when forecasters exploit external information as well as inside knowledge and their expertise to improve forecasting accuracy (2013). Such adjustments were applied in the M2 competition, where it was found that they did not improve the accuracy of pure statistical methods (Makridakis et al 1993).
In the M5 “Accuracy” competition, some of the top-performing methods, including those ranked 2nd and 5th, used adjustments in the form of multipliers to enhance the forecasts derived by the ML models. These adjustments were not based purely on judgment but instead on the analytical alignment of the forecasts produced at the lowest aggregation levels with those at the higher levels, and these adjustments proved to be beneficial, helping the models to reduce bias and better consider the longer-term trends that are easier to observe at higher aggregation levels (Kourentzes et al 2014). The concept of reconciling the forecasts produced at different aggregation levels is not new in the field of forecasting, and numerous studies have empirically demonstrated its benefits, especially when forecasts and information from the complete hierarchy are exploited (Hyndman et al 2011, et al 2020b). Therefore, further investigation is required to evaluate the actual value of the external adjustments used in M5 and to determine how they should preferably be selected in order to improve the accuracy in a more consistent and unbiased manner. Several studies have shown that judgmental adjustments are often unnecessary and can degrade the forecasting accuracy (Lawrence et al 2006). et al 2009 analyzed the forecasts produced by 4 supply-chain companies, including a retailer, and found that small positive adjustments generally reduced the accuracy, thereby suggesting a general bias toward optimism, whereas larger negative adjustments were more likely to be beneficial.
Value added by effective CV strategies: When dealing with complex forecasting tasks, adopting effective CV strategies is critical for objectively capturing the post-sample accuracy, avoiding overfitting, and mitigating uncertainty (2000). The importance of adopting such strategies is demonstrated by the results of the M5 “Accuracy” competition, which indicate that a substantial number of teams failed to select the most accurate set of forecasts from those submitted while the competition was still running (see §3). However, various CV strategies can be adopted and different conclusions can be drawn based on their design ( et al 2018). Selecting the time period when the CV will be performed, the size of the validation windows, how these windows will be updated, and the criteria used to summarize the forecasting performance are just some of the factors that forecasters must consider.
In the M5 “Accuracy” competition, the 4 best-performing methods and the vast majority of the top 50 submissions employed a CV strategy where at least the last 4 28-day-long windows of available data were used to assess the forecasting performance, thereby providing a reasonable approximation of the post-sample accuracy. In addition to this CV scheme, the winner measured both the mean and the standard deviation of the errors of the models that he developed. According to his validations, the recursive models in his approach were more accurate on average than the non-recursive models, but less stable. Thus, he decided to combine the 2 types of models to ensure that the forecasts produced were both accurate and stable. Spiliotis et al (2019a) stressed the necessity of considering the full distributions of forecasting errors, and especially their tails, when evaluating forecasting methods, thereby indicating that robustness is a prerequisite for achieving high accuracy. We hope that the M5 results will encourage more research in this area and contribute to the development of more powerful CV strategies.
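The winner's selection criterion, preferring candidates that are accurate on average and stable across validation windows, can be expressed as a simple mean-plus-dispersion rule; the trade-off weight below is an assumption for illustration, not a value from the text:

```python
from statistics import mean, stdev

def select_robust(cv_errors_by_model, risk_weight=1.0):
    """Pick the model minimizing mean CV error + risk_weight * std.

    `cv_errors_by_model` maps a model name to its per-window errors;
    risk_weight > 0 (an assumed knob) trades average accuracy against
    stability across validation windows.
    """
    def score(errors):
        return mean(errors) + risk_weight * stdev(errors)
    return min(cv_errors_by_model, key=lambda name: score(cv_errors_by_model[name]))

# Hypothetical per-window errors: the recursive model is better on
# average but far more volatile across the 4 windows.
errs = {"recursive": [0.60, 0.70, 0.80, 0.90],
        "non_recursive": [0.78, 0.80, 0.79, 0.81]}
```

With any positive risk weight the stable model can win even though its mean error is worse, which mirrors the winner's decision to blend the two model types rather than keep only the one with the best average score.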
Importance of exogenous/explanatory variables: Time series methods are usually sufficient for identifying and capturing historical data patterns (level, trend, and seasonality) and they can produce accurate forecasts by extrapolating these patterns. However, methods that rely solely on identifying and extrapolating historical data fail to effectively account for the effects of holidays, special days, promotions, prices, and possibly the weather. Moreover, these factors can affect historical data and distort the time series pattern unless they are removed before use for forecasting. In these settings, the information from exogenous/explanatory variables is of critical importance for improving the forecasting accuracy ( et al 2016).
In the M5 “Accuracy” competition, all of the winning submissions used external information to improve the forecasting performance of their models. For example, monsaraida and other top teams found that several price-related features were substantially important for improving the accuracy of their results. Furthermore, the importance of exogenous/explanatory variables was supported by comparisons made between the benchmarks in the competition, as shown in the appendix of the supplementary material. For instance, ESX used information about promotions and special days as exogenous variables within exponential smoothing models and performed 6% better than ES_td, which employed the same exponential smoothing models but without considering exogenous variables. The same was true in the case of the ARIMA models, where ARIMAX was found to be 13% more accurate than ARIMA_td.
The M5 “Accuracy” competition clearly showed that ML methods have entered the mainstream of forecasting applications, at least in the area of retail sales forecasting. The potential benefits of these methods are substantial and there is little doubt that retail firms will need to adopt them to improve the accuracy of their forecasts and support better decision making related to their operations and supply chain management.