
6

Statistical Estimation, Inference, and Prediction

In this chapter, we introduce four key statistical libraries in Python (statsmodels, pmdarima, fbprophet, and scikit-learn) by outlining key examples. These libraries are used to model time series and provide their forecast values, along with confidence intervals. In addition, we demonstrate how to use a classification model to predict percentage changes of a time series.

For this, we are going to cover the following use cases:

Introduction to statsmodels

Using a Seasonal Auto-Regressive Integrated Moving Average with eXogenous factors (SARIMAX) time-series model with pmdarima

Time series forecasting with Facebook's Prophet library

Introduction to scikit-learn regression and classification


Technical requirements

The Python code used in this chapter is available in the Chapter06 folder in the book's code repository.

Introduction to statsmodels

statsmodels is a Python library that allows us to explore data, perform statistical tests, and estimate statistical models.

This chapter focuses on statsmodels' modeling, analysis, and forecasting of time series.

Normal distribution test with Q-Q plots

An underlying assumption of many statistical learning techniques is that the observations/fields are normally distributed.

While there are many robust statistical tests for normal distributions, an intuitive visual method is known as a quantile-quantile plot (Q-Q plot). If a sample is normally distributed, its Q-Q plot is a straight line.

In the following code block, the statsmodels.graphics.api.qqplot(...) method is used to check if a numpy.random.uniform(...) distribution is normally distributed:

from statsmodels.graphics.api import qqplot

import numpy as np

fig = qqplot(np.random.uniform(size=10000), line='s')
fig.set_size_inches(12, 6)

The resulting plot, depicted in the following screenshot, shows a non-linear relationship between the two distributions, which was expected since we used a uniform distribution:


Figure 6.1 Q-Q plot for a dataset generated from a uniform distribution

In the following code block, we repeat the test, but this time with a numpy.random.exponential(...) distribution as our sample distribution:

fig = qqplot(np.random.exponential(size=10000), line='s')
fig.set_size_inches(12, 6)

The resulting Q-Q plot again confirms a non-normal relationship between the two distributions, as illustrated in the following screenshot:

Figure 6.2 Q-Q plot for a dataset generated from an exponential distribution


Finally, we will pick out 10,000 samples from a normal distribution using the numpy.random.normal(...) method and use qqplot(...) to observe them, as illustrated in the following code snippet:

fig = qqplot(np.random.normal(size=10000), line='s')
fig.set_size_inches(12, 6)

The result is a plot with a linear relationship, as expected, as illustrated in the following screenshot:

Figure 6.3 Q-Q plot for 10,000 samples sampled from a standard normal distribution

Q-Q plots compare two probability distributions, one of which is most often a normal distribution, by plotting their quantiles against one another. The preceding examples demonstrate how easy it is to visually test for a normal distribution.

Time series modeling with statsmodels

A time series is a sequence of numerical data points in time order.

A crucial part of working with time series data involves working with dates and times.

The statsmodels.api.tsa.datetools module provides some basic methods for generating and parsing dates and date ranges, such as dates_from_range(...).


In the following code snippet, we generate 12 datetime.datetime objects using a length=12 parameter and starting from 2010 with a yearly frequency:

import statsmodels.api as sm

sm.tsa.datetools.dates_from_range('2010', length=12)

That yields the following list of datetime objects:

[datetime.datetime(2010, 12, 31, 0, 0), datetime.datetime(2011, 12, 31, 0, 0), ...

datetime.datetime(2020, 12, 31, 0, 0), datetime.datetime(2021, 12, 31, 0, 0)]

The frequency of dates in the dates_from_range(...) method can be specified via the start date using a special format, where the m1 suffix means first month and monthly frequency, and q1 means first quarter and quarterly frequency, as illustrated in the following code snippet:

sm.tsa.datetools.dates_from_range('2010m1', length=120)

That yields the following list of datetime objects with monthly frequency:

[datetime.datetime(2010, 1, 31, 0, 0), datetime.datetime(2010, 2, 28, 0, 0), ...

datetime.datetime(2019, 11, 30, 0, 0), datetime.datetime(2019, 12, 31, 0, 0)]
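The quarterly format works the same way. For instance (an illustrative call of our own, not from the original text), the following should yield eight quarter-end dates starting from the first quarter of 2010:

sm.tsa.datetools.dates_from_range('2010q1', length=8)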

Let’s now perform an Error, Trend, Seasonality (ETS) analysis of a time series.

ETS analysis of a time series

The ETS analysis of a time series breaks down the data into three different components, as follows:

The trend component captures the overall trend of the time series.

The seasonality component captures cyclical/seasonal changes.

The error component captures noise in the data that could not be captured by the other two components.


Let’s generate 20 years of monthly dates as an index to the Pandas DataFrame dataset using the datetools.dates_from_range(...) method, as follows:

import pandas as pd

n_obs = 12 * 20
linear_trend = np.linspace(100, 200, num=n_obs)
cycle = np.sin(linear_trend) * 10
error_noise = np.random.randn(n_obs)

dataset = pd.DataFrame(
    linear_trend + cycle + error_noise,
    index=sm.tsa.datetools.dates_from_range('2000m1', length=n_obs),
    columns=['Price'])
dataset

The result is the following DataFrame with a Price field that is composed of the ETS components:

                 Price
2000-01-31   96.392059
2000-02-29   99.659426
...                ...
2019-11-30  190.067039
2019-12-31  190.676568
240 rows × 1 columns

Let’s visualize the time series dataset that we generated, as follows:

import matplotlib.pyplot as plt

dataset.plot(figsize=(12, 6), color='black')

The resulting time series dataset has an apparent linearly increasing trend with seasonal components mixed in, as illustrated in the following screenshot:


Figure 6.4 Plot displaying synthetic prices with ETS components

In the preceding screenshot, we can clearly see the seasonality component: the oscillation up and down around the median value. We can also see the error noise, since the oscillations are not perfect. Finally, we can see that the values are increasing overall: the trend component.

The Hodrick-Prescott filter

The Hodrick-Prescott (HP) filter is used to separate the trend and cyclical components of time series data by removing short-term fluctuations from the longer-term trend. In statsmodels, this is implemented as statsmodels.api.tsa.filters.hpfilter(...).
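As background (a standard definition we add here, not part of the original text), the HP filter chooses the trend series $\tau_t$ by minimizing

$$\sum_{t=1}^{T}(y_t-\tau_t)^2+\lambda\sum_{t=2}^{T-1}\big[(\tau_{t+1}-\tau_t)-(\tau_t-\tau_{t-1})\big]^2$$

where $y_t$ is the observed series; larger values of the smoothing parameter $\lambda$ (the lamb argument below) produce a smoother trend.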

Let's use it with a lamb=129600 smoothing parameter to perform the decomposition (the value 129600 is the recommended value for monthly data). We use the pair of series returned to generate a DataFrame with Price, HP_Cycle, and HP_Trend fields representing the price, the cyclical component, and the trend component, as illustrated in the following code snippet:

hp_cycle, hp_trend = \
    sm.tsa.filters.hpfilter(dataset['Price'], lamb=129600)
decomp = dataset[['Price']]
decomp['HP_Cycle'] = hp_cycle
decomp['HP_Trend'] = hp_trend
decomp


The decomp DataFrame contains the following data:

                 Price   HP_Cycle    HP_Trend
2000-01-31   96.392059  -4.731153  101.123212
2000-02-29   99.659426  -1.839262  101.498688
...                ...        ...         ...
2019-11-30  190.067039  -8.350371  198.417410
2019-12-31  190.676568  -8.107701  198.784269
240 rows × 3 columns

In the next section, we will look at the UnobservedComponents model.

UnobservedComponents model

Another way of breaking down a time series into ETS components is to use a statsmodels.api.tsa.UnobservedComponents object.

The UnobservedComponentsResults.summary(...) method generates statistics for the model, as follows:

uc = sm.tsa.UnobservedComponents(dataset['Price'],
    level='lltrend', cycle=True, stochastic_cycle=True)
res_uc = uc.fit(method='powell', disp=True)
res_uc.summary()

The output contains details about the model, as illustrated in the following code block:

Optimization terminated successfully.

Current function value: 2.014160

Iterations: 6

Function evaluations: 491

                    Unobserved Components Results
Dep. Variable:               Price   No. Observations:      240
Model:          local linear trend   Log Likelihood    -483.399
              + stochastic cycle     AIC                976.797
Date:             Fri, 12 Jun 2020   BIC                994.116
Time:                     08:09:46   HQIC               983.779
Sample:                 01-31-2000
                      - 12-31-2019
Covariance Type:               opg

                       coef   std err         z     P>|z|     [0.025     0.975]
sigma2.irregular     0.4962     0.214     2.315     0.021      0.076      0.916
sigma2.level      6.954e-17     0.123  5.63e-16     1.000     -0.242      0.242
sigma2.trend      2.009e-22  4.03e-05  4.98e-18     1.000  -7.91e-05   7.91e-05
sigma2.cycle         1.5485     0.503     3.077     0.002      0.562      2.535
frequency.cycle      0.3491     0.013    27.768     0.000      0.324      0.374

Ljung-Box (Q):            347.56   Jarque-Bera (JB):   0.42
Prob(Q):                    0.00   Prob(JB):           0.81
Heteroskedasticity (H):     0.93   Skew:              -0.09
Prob(H) (two-sided):        0.73   Kurtosis:           2.91

We can access the ETS/cyclical components using the resid, cycle.smoothed, and level.smoothed attributes and add them to the decomp DataFrame, as follows:

decomp['UC_Cycle'] = res_uc.cycle.smoothed
decomp['UC_Trend'] = res_uc.level.smoothed
decomp['UC_Error'] = res_uc.resid

decomp

The decomp DataFrame has the following new columns containing the Cycle, Trend, and Error terms from the UnobservedComponents model:

            ...   UC_Cycle    UC_Trend   UC_Error
2000-01-31  ...  -3.358954   99.743814  96.392059
2000-02-29  ...  -0.389834  100.163434   6.173967
...         ...        ...         ...        ...
2019-11-30  ...  -9.725420  199.613395   1.461497
2019-12-31  ...  -9.403885  200.033015   0.306881
240 rows × 6 columns

Next, we will look at the statsmodels.tsa.seasonal.seasonal_decompose(...) method.

statsmodels.tsa.seasonal.seasonal_decompose(...) method

Another way to perform ETS decomposition is by using the statsmodels.tsa.seasonal.seasonal_decompose(...) method.


The following code block uses an additive model by specifying the model='additive' parameter and adds SDC_Cycle, SDC_Trend, and SDC_Error columns to the decomp DataFrame by accessing the seasonal, trend, and resid attributes of the DecomposeResult object:

from statsmodels.tsa.seasonal import seasonal_decompose

s_dc = seasonal_decompose(dataset['Price'], model='additive')
decomp['SDC_Cycle'] = s_dc.seasonal

decomp['SDC_Trend'] = s_dc.trend

decomp['SDC_Error'] = s_dc.resid

decomp[118:122]

The decomp DataFrame now has three additional fields with values, as shown in the following code block:

            ...  SDC_Cycle   SDC_Trend  SDC_Error
2009-11-30  ...   0.438633  146.387392  -8.620342
2009-12-31  ...   0.315642  147.240112  -6.298764
2010-01-31  ...   0.228229  148.384061  -3.538544
2010-02-28  ...   0.005062  149.912202  -0.902362

Next, we will plot the various results we got from the preceding sections.

Plotting the results of the HP filter, the UnobservedComponents model, and the seasonal_decompose method

Let’s plot the trend components extracted from the HP filter, the UnobservedComponents model, and the seasonal_decompose method, as follows:

plt.title('Trend components')
decomp['Price'].plot(figsize=(12, 6), color='black',
    linestyle='-', legend='Price')
decomp['HP_Trend'].plot(figsize=(12, 6), color='darkgray',
    linestyle='--', lw=2, legend='HP_Trend')
decomp['UC_Trend'].plot(figsize=(12, 6), color='black',
    linestyle=':', lw=2, legend='UC_Trend')
decomp['SDC_Trend'].plot(figsize=(12, 6), color='black',
    linestyle='-.', lw=2, legend='SDC_Trend')

That gives us the following plot, with the trend components plotted next to the original price. All three models did a good job of identifying the overall increasing trend, with the seasonal_decompose(...) method capturing some non-linear/cyclical trend components in addition to the overall linearly increasing trend:

Figure 6.5 Plot showing trend components extracted from different ETS decomposition methods

The following code block plots the cycle/seasonal components obtained from the three models:

plt.title('Cycle/Seasonal components')
decomp['HP_Cycle'].plot(figsize=(12, 6), color='darkgray',
    linestyle='--', lw=2, legend='HP_Cycle')
decomp['UC_Cycle'].plot(figsize=(12, 6), color='black',
    linestyle=':', lw=2, legend='UC_Cycle')
decomp['SDC_Cycle'].plot(figsize=(12, 6), color='black',
    linestyle='-.', lw=2, legend='SDC_Cycle')


The following result shows that the seasonal_decompose(...) method generates seasonal components with very small fluctuations; this is because part of the seasonal component was absorbed into the trend plot we saw before:

Figure 6.6 Plot showing cyclical/seasonal components extracted
by different ETS decomposition methods

Finally, we will visualize the error terms in the UnobservedComponents and seasonal_decompose methods, as follows:

plt.title('Error components')

plt.ylim((-20, 20))

decomp['UC_Error'].plot(figsize=(12, 6), color='black',
    linestyle=':', lw=2, legend='UC_Error')
decomp['SDC_Error'].plot(figsize=(12, 6), color='black',
    linestyle='-.', lw=2, legend='SDC_Error')


The output is shown in the following screenshot:

Figure 6.7 Plot displaying error terms from different ETS decomposition models

The plot shown in the preceding screenshot demonstrates that the error terms oscillate around 0 and have no clear trend.

Augmented Dickey-Fuller test for stationarity of a time series

Stationary time series are time series whose statistical properties, such as mean, variance, and autocorrelation, are constant over time. Many statistical forecasting models assume that time series datasets can be transformed into stationary datasets by some mathematical operations, such as differencing.

An Augmented Dickey-Fuller (ADF) test is used to check whether a dataset is stationary: it computes the likelihood that a dataset is not stationary, and when that probability (the p-value) is very low, we can conclude that the dataset is stationary. We will look at the detailed steps in the following sections.


Step 1 – ADF test on the prices

Let's check for stationarity, and then convert our dataset into a stationary dataset using a differencing method. We start with the statsmodels.tsa.stattools.adfuller(...) method, as illustrated in the following code snippet:

from statsmodels.tsa.stattools import adfuller

result = adfuller(dataset['Price'])

print('Test Stat: {}\np value: {}\nLags: {}\nNum observations: {}'
      .format(result[0], result[1], result[2], result[3]))

That outputs the following values when applied to the Price field. The test statistic is a positive value and the p-value is 98%, meaning there is strong evidence that the Price field is not stationary. This was expected, since the Price field has strong trend and seasonality components in it:

Test Stat: 0.47882793726850786
p value: 0.9842151821849324
Lags: 14
Num observations: 225

Step 2 – First differencing on prices

Next, we apply a first differencing transformation; this computes the difference from one observation to the next. If we difference the differenced dataset again, we obtain the second difference, and so on.

We store the first-differenced pandas.Series dataset in the price_diff variable, as shown in the following code block:

price_diff = \
    (dataset['Price'].shift(-1) - dataset['Price']).fillna(0)
price_diff


That dataset contains the following values:

2000-01-31 4.951062

2000-02-29 5.686832 ...

2019-11-30 3.350694

2019-12-31 0.000000

Name: Price, Length: 240, dtype: float64
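As a side note (our observation, not part of the book's code), the preceding transformation is a forward difference (the next value minus the current value). pandas also offers a built-in method for the conventional backward first difference, which contains the same values shifted by one row:

price_diff_alt = dataset['Price'].diff().fillna(0)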

Step 3 – ADF test on the differenced prices

Now, we rerun the ADF test on this transformed dataset to check for stationarity, as follows:

result = adfuller(price_diff)

print('Test Stat: {}\np value: {}\nLags: {}\nNum observations: {}'
      .format(result[0], result[1], result[2], result[3]))

The test statistic now has a large negative value (values under -4 have a very high likelihood of being stationary). The probability of not being stationary is now reduced to an extremely low value, indicating that the transformed dataset is stationary, as illustrated in the following code snippet:

Test Stat: -7.295184662866956

p value: 1.3839111942229784e-10
Lags: 15

Num observations: 224
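To avoid repeating this pattern, the test can be wrapped in a small helper, shown below as a hypothetical convenience function (not part of the book's code), that returns True when the ADF p-value falls below a chosen significance level:

def is_stationary(series, alpha=0.05):
    # Reject non-stationarity when the ADF p-value is below alpha
    return adfuller(series)[1] < alpha

print(is_stationary(dataset['Price']))  # expected: False (trending prices)
print(is_stationary(price_diff))        # expected: True (differenced series)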

Autocorrelation and partial autocorrelation of a time series

Autocorrelation, or serial correlation, is the correlation of an observation with a delayed copy of itself, as a function of the delay. It measures whether the currently observed value has any relationship with values in the future/past.
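As a brief formal note (a standard definition, not from the original text), the autocorrelation of a stationary series $y_t$ at lag $k$ is

$$\rho_k=\frac{\mathrm{Cov}(y_t,\,y_{t-k})}{\mathrm{Var}(y_t)}$$

so $\rho_0=1$, and values close to $\pm 1$ indicate a strong linear relationship with the observation $k$ periods earlier.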


In our dataset, with its clear linear trend and some seasonal components, the autocorrelation slowly decreases as the number of lags increases, but for smaller lag values the dataset has high autocorrelation values due to the large overall linear trend. The statsmodels.graphics.tsaplots.plot_acf(...) method plots the autocorrelation of the Price field with lag values ranging from 0 to 100, as illustrated in the following code snippet:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig = plot_acf(dataset['Price'], lags=100)

fig.set_size_inches(12, 6)

The result indicates that autocorrelation remains relatively strong up to lag values of around 36, where it dips below 0.5. This is illustrated in the following screenshot:

Figure 6.8 Autocorrelation plot showing autocorrelation against different lag values

The statsmodels.graphics.tsaplots.plot_pacf(...) method lets us plot the partial autocorrelation values against different lag values. The difference between autocorrelation and partial autocorrelation is that partial autocorrelation uses only the correlation between an observation and the observation that many lag periods before it, with correlation effects from lower lag-value terms removed. This method is shown in the following code snippet:

fig = plot_pacf(dataset['Price'], lags=100)
fig.set_size_inches(12, 6)


The output can be seen in the following screenshot:

Figure 6.9 Partial autocorrelation plot showing partial autocorrelations against lag values

The plot shown in the preceding screenshot shows that partial autocorrelation drops sharply after the first two lag terms and then varies seasonally from positive to negative values roughly every 10 lag terms.

ARIMA time series model

The Auto-Regressive Integrated Moving Average (ARIMA) model is one of the best-known time series modeling and forecasting models. It is used to forecast time series with correlated data points.

The ARIMA model is composed of three components, outlined as follows:

Auto-regression (AR): This model uses the autocorrelation relationship we explored in the previous section. It accepts a p parameter, specifying the number of lags to use. In our case, based on the autocorrelation plots, we will specify p=36 when modeling the Price series with ARIMA.

Integrated (I): This is the differencing transformation the model uses to convert the time series into a stationary dataset. It accepts a d parameter, specifying the order of differencing to perform, which in our case will be d=1. As we saw in the Augmented Dickey-Fuller test for stationarity of a time series section, first-order differencing led to a stationary dataset.

Moving Average (MA): This is the component that applies an MA model to lagged observations. It accepts a single parameter, q, which is the size of the MA window. In our case, we will set this parameter based on the partial autocorrelation plots and use a value of q=2, because of the sharp drop-off in partial autocorrelation past a lag value of 1.

In statsmodels, the statsmodels.tsa.arima.model.ARIMA model builds a time series as an ARIMA model. Using an order=(36, 1, 2) parameter, we specify p=36, d=1, and q=2. Then, we call the ARIMA.fit(...) method to fit the model to our Price series, and call the summary(...) method to output information about the fitted ARIMA model.

Some other packages, for example pmdarima, offer auto_arima methods that find the ARIMA parameters by themselves. The fitting code is illustrated in the following snippet:

from statsmodels.tsa.arima.model import ARIMA

arima = ARIMA(dataset['Price'], order=(36, 1, 2))
res_ar = arima.fit()

res_ar.summary()

The following output describes the fitted parameters:

                        SARIMAX Results
Dep. Variable:             Price   No. Observations:      240
Model:           ARIMA(36, 1, 2)   Log Likelihood    -360.195
Date:           Sat, 13 Jun 2020   AIC                798.391
Time:                   09:18:46   BIC                933.973
Sample:               01-31-2000   HQIC               853.027
                    - 12-31-2019
Covariance Type:             opg

             coef   std err        z    P>|z|    [0.025    0.975]
ar.L1     -0.8184     0.821   -0.997    0.319    -2.428     0.791
ar.L2     -0.6716     0.495   -1.358    0.175    -1.641     0.298
...
ar.L35     0.3125     0.206    1.514    0.130    -0.092     0.717
ar.L36     0.1370     0.161    0.851    0.395    -0.178     0.452
ma.L1     -0.0244     0.819   -0.030    0.976    -1.630     1.581
ma.L2      0.1694     0.454    0.373    0.709    -0.721     1.060
sigma2     1.0911     0.144    7.586    0.000     0.809     1.373

Ljung-Box (Q):            13.99   Jarque-Bera (JB):   1.31
Prob(Q):                   1.00   Prob(JB):           0.52
Heteroskedasticity (H):    1.15   Skew:               0.09
Prob(H) (two-sided):       0.54   Kurtosis:           2.69

The statsmodels.tsa.arima.ARIMAResults.predict(...) method lets us use the fitted model to predict values over the specified start and end datetime indices (in this case, the entire dataset). We will save the predicted prices in the PredPrice field for comparison later. The code can be seen in the following snippet:

dataset['PredPrice'] = res_ar.predict(dataset.index[0],
                                      dataset.index[-1])
dataset

The result adds the new column with the predicted prices, as follows:

                 Price   PredPrice
2000-01-31   95.317833    0.000000
2000-02-29  100.268895   95.317901
...                ...         ...
2019-11-30  188.524009  188.944216
2019-12-31  191.874704  190.614641
240 rows × 2 columns

Now, we will plot the original Price and the PredPrice fields in the following code block to visually compare the two:

plt.ylim(70, 250)
dataset['Price'].plot(figsize=(12, 6), color='darkgray',
    linestyle='-', lw=4, legend='Price')
dataset['PredPrice'].plot(figsize=(12, 6), color='black',
    linestyle='-.', legend='PredPrice')


The predicted prices are quite accurate, and that is because the specified parameters (p, d, q) were precise. The result can be seen in the following screenshot:

Figure 6.10 Plot comparing the original price and the price predicted by an ARIMA (36, 1, 2) model

Let's use this fitted model to forecast values for dates in the future. First, we build an extended_dataset DataFrame with another 4 years' worth of datetime indices and no data (which will be filled in with NaN values), using the datetools.dates_from_range(...) method and the pandas.DataFrame.append(...) method, as follows:

extended_dataset = pd.DataFrame(
    index=sm.tsa.datetools.dates_from_range('2020m1', length=48))
extended_dataset = dataset.append(extended_dataset)
extended_dataset

Price PredPrice

2000-01-31 95.317833 0.000000

2000-02-29 100.268895 95.317901

... ... ...

2023-11-30 NaN NaN

2023-12-31 NaN NaN

288 rows × 2 columns
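Note that DataFrame.append(...) has been removed in recent pandas releases; on such versions, the same extension can be written with pd.concat, as in this equivalent one-liner we add for reference:

extended_dataset = pd.concat([dataset, pd.DataFrame(
    index=sm.tsa.datetools.dates_from_range('2020m1', length=48))])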


Then, we can call the ARIMAResults.predict(...) method again to generate predicted prices for the entire time series, and thus forecast onto the new dates we added, as follows:

extended_dataset['PredPrice'] = \
    res_ar.predict(extended_dataset.index[0],
                   extended_dataset.index[-1])
extended_dataset

                 Price   PredPrice
2000-01-31   95.317833    0.000000
2000-02-29  100.268895   95.317901
...                ...         ...
2023-11-30         NaN  215.441777
2023-12-31         NaN  220.337355
288 rows × 2 columns

The following code block plots the last 100 observations from the extended_dataset DataFrame:

extended_dataset['Price'].iloc[-100:].plot(figsize=(12, 6),
    color='darkgray', linestyle='-', lw=4, legend='Price')
extended_dataset['PredPrice'].iloc[-100:].plot(figsize=(12, 6),
    color='black', linestyle='-.', legend='PredPrice')


And that yields a plot with the forecasted PredPrice values, as illustrated in the following screenshot:

Figure 6.11 Historical and predicted prices forecasted by the ARIMA model

In the plot shown in the preceding screenshot, the predicted prices visibly follow the trend of past prices.

Using a SARIMAX time series model with pmdarima

SARIMA is an extension of the ARIMA model for univariate time series with a seasonal component.

SARIMAX is, then, the name of the model, which also supports exogenous variables. These are the three ARIMA parameters:

p = trend auto-regressive order

d = trend difference order

q = trend MA order


In addition to the preceding parameters, SARIMA introduces four more, as follows:

P = seasonal auto-regressive order

D = seasonal difference order

Q = seasonal MA order

m = the length of a single seasonal period in the number of time steps
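For reference, here is a minimal sketch (our own example, not from the original text) of fitting a SARIMAX model directly in statsmodels once these orders are known; the (p, d, q)(P, D, Q, m) values below are placeholders rather than tuned choices for this dataset:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Placeholder orders: (p, d, q) for the trend part, (P, D, Q, m) for the seasonal part
sarimax = SARIMAX(dataset['Price'], order=(1, 1, 1),
                  seasonal_order=(1, 1, 1, 12))
res_sarimax = sarimax.fit(disp=False)
print(res_sarimax.summary())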

Finding these parameters manually can be time-consuming, so it may be advantageous to use an auto-ARIMA model.

In Python, auto-ARIMA modeling is provided by the pmdarima library. Its documentation is available at http://alkaline-ml.com/pmdarima/index.html.

The installation is straightforward, as can be seen here:

pip install pmdarima

The auto-ARIMA model attempts to automatically discover the SARIMAX parameters by conducting various statistical tests, as illustrated here:

Figure 6.12 Table of the various statistical tests

Once we find the optimal d value, the auto-ARIMA model searches for the best fitting model within the ranges defined by start_p, max_p, start_q, and max_q. If the seasonal parameter is enabled, once we determine the optimal D value we use a similar procedure to find P and Q.

The best model is determined by minimizing the value of the chosen information criterion: the Akaike information criterion (AIC), corrected AIC, Bayesian information criterion (BIC), Hannan-Quinn information criterion (HQC), or out-of-bag (OOB) validation scoring.


If no suitable model is found, auto-ARIMA raises a ValueError.
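To make this search explicit, the ranges and information criterion described above can be passed directly to auto_arima; the bounds in the following sketch are illustrative values of our own choosing, not recommendations from the book:

import pmdarima as pm

model = pm.auto_arima(dataset['Price'],
                      seasonal=True, m=12,
                      start_p=0, max_p=3, start_q=0, max_q=3,
                      d=None,  # let auto-ARIMA pick d via its stationarity tests
                      information_criterion='aic',  # or 'bic', 'hqic', 'oob'
                      stepwise=True, suppress_warnings=True)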

Let's use auto-ARIMA with the previous dataset. The time series has a clear seasonality component with a periodicity of 12.

Notice in the following code block that we generate 95% confidence intervals for the predicted values, which is very useful for trading rules (for example, sell if the price is above the upper confidence interval value):

import pmdarima as pm

model = pm.auto_arima(dataset['Price'], seasonal=True,
                      stepwise=True, m=12)
print(model.summary())

extended_dataset = pd.DataFrame(
    index=sm.tsa.datetools.dates_from_range('2020m1', length=48))
extended_dataset['PredPrice'], conf_int = \
    model.predict(48, return_conf_int=True, alpha=0.05)

plt.plot(dataset['Price'], c='blue')
plt.plot(extended_dataset['PredPrice'], c='green')
plt.show()

print(extended_dataset)
print(conf_int)


The output is shown here:

Figure 6.13 SARIMAX result statistics from auto-ARIMA


The plot is shown in the following screenshot:

Figure 6.14 Historical and predicted price forecasted by the auto-ARIMA model

The output also includes the predicted prices, as follows:

             PredPrice
2020-01-31  194.939195
...                ...
2023-12-31  222.660698

[48 rows x 1 columns]

In addition, the output provides the confidence intervals for each predicted price, as follows:

[[192.39868933 197.4797007 ]
 [196.80033117 202.32443987]
 [201.6275806  207.60042584]
 ...
 [212.45091331 225.44676173]
 [216.11548707 229.20590827]]

We will now see time series forecasting with Facebook’s Prophet library.


Time series forecasting with Facebook's Prophet library

Facebook Prophet is a Python library used for forecasting univariate time series with strong support for seasonality and holiday effects. It is especially suitable for time series with frequent changes of trends and is robust enough to handle outliers.

More specifically, the Prophet model is an additive regression model with the following attributes:

Piecewise linear or logistic growth trend

Yearly seasonal component modeled with a Fourier series

Weekly seasonal component modeled with dummy variables

A user-provided list of holidays
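In other words (our summary of the standard Prophet formulation, not a quotation from the book), the forecast is modeled as the sum of these parts:

$$y(t)=g(t)+s(t)+h(t)+\varepsilon_t$$

where $g(t)$ is the piecewise trend, $s(t)$ the periodic seasonality, $h(t)$ the holiday effects, and $\varepsilon_t$ the error term.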

Installation of Prophet is more complicated, since it requires a compiler. The easiest way to install it is by using Anaconda, as follows:

conda install -c conda-forge fbprophet

The accompanying Git repository contains the conda environment set up with Prophet. The Prophet library requires the input DataFrame to include two columns: ds for the date and y for the value.

Let’s fit the Prophet model onto the previous dataset. Notice in the following code snippet that we explicitly tell Prophet we wish to receive monthly predictions (freq='M'):

from fbprophet import Prophet

prophet_dataset = \
    dataset.rename(columns={'Price': 'y'}).rename_axis('ds')\
    .drop('PredPrice', 1).reset_index()
print(prophet_dataset)

model = Prophet()
model.fit(prophet_dataset)

df_forecast = model.make_future_dataframe(periods=48, freq='M')
df_forecast = model.predict(df_forecast)


print(df_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())
model.plot(df_forecast, xlabel='Date', ylabel='Value')
model.plot_components(df_forecast)

The predicted values are very similar to those of the SARIMAX model, as can be seen here:

Figure 6.15 The Prophet library's output includes prediction values, along with the model components' values

The predicted values are stored in the yhat column, with the yhat_lower and yhat_upper confidence intervals.

Prophet also produces charts of the model components, which are useful for understanding the model's predictive power. The trend component chart can be seen here:

Figure 6.16 The trend component chart of the Prophet model


The following screenshot shows the yearly seasonality output:

Figure 6.17 The yearly seasonality component chart of the Prophet model

Here is the output of the forecast chart:

Figure 6.18 The forecast chart of the Prophet model, along with the confidence intervals


Each time series model is slightly different and is best suited to different classes of time series. In general, however, the Prophet model is very robust and the easiest to use in most scenarios.

Introduction to scikit-learn regression and classification

scikit-learn is a Python supervised and unsupervised machine learning library built on top of the numpy and scipy libraries.

Let’s demonstrate how to forecast price changes on a dataset with RidgeCV regression and classification using scikit-learn.

Generating the dataset

Let's start by generating the dataset for the following examples: a pandas DataFrame containing daily data for 20 years, with BookPressure, TradePressure, RelativeValue, and Microstructure fields representing synthetic trading signals (also known as features or predictors). The PriceChange field represents the daily change in prices that we are trying to predict (also known as the response or target variable). For simplicity, we make the PriceChange field a linear function of our predictors with random weights, plus some random noise. The Price field represents the actual price of the instrument, generated using the pandas.Series.cumsum(...) method. The code can be seen in the following snippet:

import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.date_range('2000', '2020'))
df['BookPressure'] = np.random.randn(len(df)) * 2
df['TradePressure'] = np.random.randn(len(df)) * 100
df['RelativeValue'] = np.random.randn(len(df)) * 50
df['Microstructure'] = np.random.randn(len(df)) * 10

true_coefficients = \
    np.random.randint(low=-100, high=101, size=4) / 10


df['PriceChange'] = ((df['BookPressure'] * true_coefficients[0])
    + (df['TradePressure'] * true_coefficients[1])
    + (df['RelativeValue'] * true_coefficients[2])
    + (df['Microstructure'] * true_coefficients[3])
    + (np.random.randn(len(df)) * 200))

df['Price'] = df['PriceChange'].cumsum(0) + 100000

Let's quickly inspect the true weights assigned to our four features, as follows:

true_coefficients

array([10. , 6.2, -0.9, 5. ])

Let’s also inspect the DataFrame containing all the data, as follows:

df

BookPressure TradePressure RelativeValue Microstructure PriceChange Price

2000-01-01 4.545869 -2.335894 5.953205 -15.025576 -263.749500 99736.250500

2000-01-02 -0.302344 -186.764283 9.150213 13.795346 -758.298833 98977.951667

... ... ... ... ... ... ...

2019-12-31 -1.890265 -113.704752 60.258456 12.229772 -295.295108 182827.332185

2020-01-01 1.657811 -77.354049 -39.090108 -3.294086 -204.576735 182622.755450

7306 rows × 6 columns


Let’s visually inspect the Price field, as follows:

df['Price'].plot(figsize=(12, 6), color='black', legend='Price')

The plot shows the following realistic-looking price evolution over 20 years:

Figure 6.19 Price plot for the synthetically generated dataset

Let’s display a scatter matrix of all columns but the Price column, as follows:

pd.plotting.scatter_matrix(df.drop('Price', axis=1), color='black', alpha=0.2, grid=True, diagonal='kde', figsize=(10, 10))


The output is shown here:

Figure 6.20 Scatter matrix for the synthetically generated dataset

The scatter matrix shows that there is a strong relationship between PriceChange and TradePressure.


Running RidgeCV regression on the dataset

Let's use a scikit-learn regression method to fit a linear regression model to our dataset. We will use the four features to fit and predict the PriceChange field.

First, we collect the features and target into a DataFrame and a Series, as follows:

features = df[['BookPressure', 'TradePressure',
               'RelativeValue', 'Microstructure']]
target = df['PriceChange']

We will use sklearn.linear_model.RidgeCV, a linear regression model with L2 regularization (an L2-norm penalty factor to avoid overfitting) that uses cross-validation to learn the optimal coefficients. We will use the sklearn.linear_model.RidgeCV.fit(...) method to fit the target values using the features. The code is shown in the following snippet:

from sklearn.linear_model import RidgeCV

ridge = RidgeCV()

ridge.fit(features, target)

The result is a RidgeCV object, as can be seen here:

RidgeCV(alphas=array([ 0.1, 1. , 10. ]), cv=None,

fit_intercept=True, gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)
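For context (a standard definition, not from the original text), ridge regression chooses the coefficient vector $w$ by minimizing the penalized least-squares objective

$$\lVert y-Xw\rVert_2^2+\alpha\lVert w\rVert_2^2$$

and RidgeCV selects the penalty strength $\alpha$ by cross-validation from the candidate values shown in the alphas array of the preceding output.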

We can access the weights/coefficients learned by the Ridge model using the RidgeCV.coef_ attribute and compare them with the actual coefficients, as follows:

true_coefficients, ridge.coef_

It seems the coefficients learned by the model are very close to the true weights, with small errors on each one of them, as can be seen in the following output:

(array([10. , 6.2, -0.9, 5. ]),

array([11.21856334, 6.20641632, -0.93444009, 4.94581522]))


The RidgeCV.score(...) method returns the R² score, representing the accuracy of the fitted model, as follows:

ridge.score(features, target)

That returns the following R² score; the maximum possible value is 1, so this model fits the data quite well:

0.9076861352499385
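For reference (a standard definition, not from the book), the R² score compares the residual sum of squares to the total variance of the target:

$$R^2=1-\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2}$$

so a value of roughly 0.91 means the model explains about 91% of the variance in PriceChange.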

The RidgeCV.predict(...) method outputs the predicted price change values, which we combine with the pandas.Series.cumsum(...) method to generate the predicted price series and then save in the PredPrice field, as follows:

df['PredPrice'] = \
    ridge.predict(features).cumsum(0) + 100000
df

That adds a new column to our DataFrame, as shown here:

            ...          Price      PredPrice
2000-01-01  ...   99736.250500   99961.011495
2000-01-02  ...   98977.951667   98862.549185
...         ...            ...            ...
2019-12-31  ...  182827.332185  183059.625653
2020-01-01  ...  182622.755450  182622.755450
7306 rows × 7 columns

In the following code block, the true Price field is plotted alongside the predicted PredPrice field:

df['Price'].plot(figsize=(12, 6), color='gray',
    linestyle='--', legend='Price')
df['PredPrice'].plot(figsize=(12, 6), color='black',
    linestyle='-.', legend='PredPrice')


The plot generated, shown in the following screenshot, reveals that PredPrice mostly tracks Price, with prediction errors during some time periods:

Figure 6.21 Plot comparing the original price and the predicted price from a Ridge regression model

We can zoom in to the first quarter of 2010 to inspect the prediction errors, as follows:

df['Price'].loc['2010-01-01':'2010-03-31']\
    .plot(figsize=(12, 6), color='darkgray', linestyle='-',
          legend='Price')
df['PredPrice'].loc['2010-01-01':'2010-03-31']\
    .plot(figsize=(12, 6), color='black', linestyle='-.',
          legend='PredPrice')


This yields the following plot, displaying the differences between Price and PredPrice for that period:

Figure 6.22 Plot comparing the original and predicted price from a Ridge regression model for 2010 Q1

We can compute the prediction errors and plot them using a density plot, as shown in the following code snippet:

df['Errors'] = df['Price'] - df['PredPrice']
df['Errors'].plot(figsize=(12, 6), kind='kde',
    color='black', legend='Errors')


This generates the plot shown in the following screenshot, displaying the distribution of errors:

Figure 6.23 Plot displaying the distribution of prediction errors for the Ridge regression model

The error plot displayed in the preceding screenshot shows that there is no strong bias in the errors.

Running a classification method on the dataset

Let’s demonstrate scikit-learn’s classification methods.

First, we need to create discrete categorical target labels for the classification model to predict. We cut the continuous PriceChange target into five equal-width bins, assign the numeric labels -2, -1, 0, 1, and 2 to those bins respectively, and save the discrete target labels in the target_discrete pandas.Series object, as follows:

target_discrete = pd.cut(target, bins=5,
                         labels=[-2, -1, 0, 1, 2]).astype(int)
target_discrete

The result is shown here:

2000-01-01     0
2000-01-02    -1
...

2019-12-28 -1


2019-12-29 0

2019-12-30 0

2019-12-31 0

2020-01-01 0

Freq: D, Name: PriceChange, Length: 7306, dtype: int64

We can visualize the distribution of the five labels by using the following code:

target_discrete.plot(figsize=(12, 6), kind='hist', color='black')

The result is a plot of the frequency of occurrence of the five labels, as shown in the following screenshot:

Figure 6.24 Frequency distribution of our discrete target price-change label values [-2, -1, 0, 1, 2]

For the classification, we use an ensemble of decision tree classifiers provided by sklearn.ensemble.RandomForestClassifier. Random forest is a classifier that uses the bagging ensemble method and builds a forest of decision trees by training each tree on datasets generated by random sampling with replacement from the original dataset. Using a max_depth=5 parameter, we limit the height of each tree to reduce overfitting, and then call the RandomForestClassifier.fit(...) method to fit the model, as follows:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth=5)

rf.fit(features, target_discrete)


This builds the following RandomForestClassifier fitted model:

RandomForestClassifier(

bootstrap=True, ccp_alpha=0.0, class_weight=None,

criterion='gini', max_depth=5, max_features='auto', max_leaf_nodes=None, max_samples=None,

min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,

min_weight_fraction_leaf=0.0, n_estimators=100,

n_jobs=None, oob_score=False, random_state=None,

verbose=0, warm_start=False)

The RandomForestClassifier.score(...) method returns the mean accuracy of the predictions compared to the true labels, as follows:

rf.score(features, target_discrete)

As we can see here, the accuracy score is 83.5%, which is excellent:

0.835340815767862
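Because the label distribution is heavily concentrated around 0 (see Figure 6.24), mean accuracy alone can be flattering; a confusion matrix, an additional check we add here (not part of the book's code), shows how each class is predicted:

from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels
print(confusion_matrix(target_discrete, rf.predict(features)))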

We add the DiscretePriceChange and PredDiscretePriceChange fields to the DataFrame to hold the true labels and the predicted labels using the RandomForestClassifier.predict(...) method, as follows:

df['DiscretePriceChange'] = target_discrete
df['PredDiscretePriceChange'] = rf.predict(features)
df

The result is the following DataFrame with the two additional fields:

            ...  DiscretePriceChange  PredDiscretePriceChange
2000-01-01  ...                    0                        0
2000-01-02  ...                   -1                       -1
...         ...                  ...                      ...
2019-12-31  ...                    0                       -1
2020-01-01  ...                    0                       -1
7306 rows × 10 columns


In the following code block, we plot two fields for the first quarter of 2010:

df['DiscretePriceChange'].loc['2010-01-01':'2010-03-31']\
    .plot(figsize=(12, 6), color='darkgray', linestyle='-',
          legend='DiscretePriceChange')
df['PredDiscretePriceChange'].loc['2010-01-01':'2010-03-31']\
    .plot(figsize=(12, 6), color='black', linestyle='-.',
          legend='PredDiscretePriceChange')

That yields a plot, as shown in the following screenshot, with some dislocations between the true and predicted labels:

Figure 6.25 Comparison of original and predicted discrete price-change labels from the RandomForest classification model for 2010 Q1

We can compute and plot the distribution of the classification errors with the following code:

df['ClassificationErrors'] = \
    df['DiscretePriceChange'] - df['PredDiscretePriceChange']
df['ClassificationErrors'].plot(figsize=(12, 6),
    kind='kde', color='black', legend='ClassificationErrors')


This yields the following error distribution:

Figure 6.26 Plot of distribution of classification errors from the RandomForest classifier model

The classification errors again show no bias and are negligible.

Summary

All advanced trading algorithms use statistical models, whether for a direct trading rule or just for deciding when to enter or exit trades. In this chapter, we have covered four key statistical libraries for Python: statsmodels, pmdarima, fbprophet, and scikit-learn.

In the next chapter, we discuss how to import key financial and economic data into Python.
