Predicting Cryptocurrencies Using Sparse Non-Gaussian State Space Models



In the present paper we develop a non‐Gaussian state‐space model to predict the price of three crypto‐currencies. Taking a Bayesian stance enables us to introduce shrinkage into the modeling framework, effectively controlling for model and specification uncertainty within the general class of state‐space models.

To control for potential outliers we propose a time‐varying parameter vector autoregressive (VAR) model (Cogley & Sargent, 2005; Primiceri, 2005) with heavy‐tailed innovations, as well as a stochastic volatility specification of the error variances.

Since the literature on robust determinants of price movements in crypto‐currencies is relatively sparse (for an example, see Cheah & Fry, 2015), we apply Bayesian shrinkage priors to decide whether using information from a set of potential predictors improves predictive accuracy.

The recent price dynamics of various crypto‐currencies point towards a set of empirical key features that an appropriate modeling strategy should accommodate. First, conditional heteroskedasticity appears to be an important regularity commonly observed (Chu, Chan, Nadarajah, & Osterrieder, 2017). This implies that volatility is changing over time in a persistent manner. If this feature is neglected, predictive densities are either too wide (during tranquil times) or too narrow (in the presence of tail events, i.e., pronounced movements in the price of a given asset).

Second, the conditional mean of the process is changing. This implies that, within a standard regression framework, the relationship between an asset price and a set of exogenous covariates is time varying. In the case of various crypto‐currencies this could be due to changes in the degree of adoption of institutional and/or private investors, regulatory changes, issuance of additional crypto‐currencies or general technological shifts (Böhme, Christin, Edelman, & Moore, 2015).

Thus it might be necessary to allow for such shifts by means of time‐varying regression coefficients. Third, and finally, various crypto‐currencies display a rather strong degree of co‐movement with each other (see Urquhart, 2017).

In our paper, we consider Bitcoin, Ethereum, and Litecoin—three popular choices. All three tend to be strongly correlated with each other, implying that a successful econometric framework should incorporate this information.

The goal of this paper is to systematically assess how different empirically relevant forecasting models perform when used to predict daily changes in the price of Bitcoin, Ethereum, and Litecoin.

The models considered include a wide range of univariate and multivariate models that are flexible along several dimensions. We consider VARs that feature drifting parameters as well as time‐varying error variances. To cope with the curse of dimensionality we introduce recent shrinkage priors (see Feldkircher, Huber, & Kastner, 2017) and a flexible specification for the law of motion of the regression parameters (Huber, Kastner, & Feldkircher, 2017).

In addition, we introduce a heavy‐tailed measurement error distribution to capture potential outlying observations (see, among others, Carlin, Polson, & Stoffer, 1992; Geweke & Tanizaki, 2001).

We jointly forecast the three crypto‐currencies considered by using daily data from November 2016 to October 2017, with the last 160 days being used as a hold‐out period. In a forecasting comparison, we find that time‐varying parameter VARs with some form of shrinkage perform well, beating univariate benchmarks like the AR(1) model with stochastic volatility (SV) as well as a random walk with SV.

Constant‐parameter VARs tend to be inferior to their counterparts that feature time‐varying parameters, but still prove to be relevant competitors. Especially during days which are characterized by large price changes, controlling for heteroskedasticity in combination with a flexible error variance–covariance structure pays off in terms of predictive accuracy. These findings are generally corroborated by considering probability integral transforms, showing that more flexible models lead to better calibrated predictive distributions.

Moreover, a trading exercise provides a comparable picture. Models that perform well in terms of predictive likelihoods also tend to do well when used to generate trading signals.

The remainder of this paper is structured as follows. Section 2 provides an overview of the data as well as empirical key features of the three crypto‐currencies considered. Moreover, this section details how the additional explanatory variables are constructed.

Section 3 introduces the econometric framework adopted, providing a brief discussion of the model as well as the Bayesian prior setup and posterior simulation. Section 4 presents the empirical forecasting exercise, while Section 5 focuses on applying the proposed models to perform portfolio allocation tasks.

Finally, the last section summarizes and concludes the paper.


In this section we first identify important empirical key features of crypto‐currencies and then propose a set of covariates that aim to explain the low‐to‐medium frequency behavior of the underlying price changes.

For the present paper, we focus on the daily change in the log price of Bitcoin, Ethereum, and Litecoin.

To explain movements in the price of the three crypto‐currencies considered, we include information on equity prices (measured through the log returns of the S&P 500 index), the relative number of search queries for each respective crypto‐currency from Google trends, the number of English Wikipedia page views as well as the difference between the weekly cumulative price trend from common mining hardware and similar, but mining‐unsuitable, GPU‐related products to capture the effect of supply‐side factors.

The data spans the period from 22 November 2016 to 3 October 2017, yielding a panel of 316 daily observations.


Bitcoin, Ethereum, and Litecoin closing prices are taken from a popular crypto‐currency meta‐platform. They originate from major crypto exchanges and are averaged according to their daily trading volume. Furthermore, alternative financial investments are represented by the S&P 500 index's daily closing prices. Additionally, demand‐side predictors like the relative number of worldwide search operations from Google Trends and the number of Wikipedia page views (in English) are used.

Because large‐scale crypto‐currency mining impacts supply and prices for the required equipment at the same time, hardware price trends are utilized to express changes in supply. To capture these effects, we gather GPU prices from Amazon's bestseller lists and extract the price trend of common mining hardware.

We construct this predictor by computing the difference between the weekly cumulative price trend from common mining hardware (e.g., AMD Radeon RX 480 graphics cards) and similar but mining‐unsuitable GPU‐related products (e.g., the AMD Radeon R5 230 graphics card).
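The construction of this supply‐side predictor can be sketched in a few lines of pandas. The price series, card choices, and the four‐week window below are made‐up stand‐ins, not the paper's actual data; only the mechanics (weekly resampling, cumulative percentage trend, differencing) mirror the description above.

```python
import pandas as pd
import numpy as np

# Toy daily price series: a mining-suitable GPU drifting upward with demand
# and a comparable but mining-unsuitable reference card; all numbers invented.
rng = pd.date_range("2017-05-01", periods=28, freq="D")
mining = pd.Series(np.linspace(250, 290, 28), index=rng)
non_mining = pd.Series(np.full(28, 40.0), index=rng)

def weekly_cumulative_trend(prices: pd.Series) -> pd.Series:
    """Cumulative sum of weekly percentage price changes."""
    weekly = prices.resample("W").last()
    return weekly.pct_change().fillna(0).cumsum()

# Supply-side predictor: trend of mining hardware minus trend of the reference
supply_signal = weekly_cumulative_trend(mining) - weekly_cumulative_trend(non_mining)
```

Differencing against a mining‐unsuitable card of the same vendor is what isolates mining demand from general GPU price movements.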

To provide additional information on the recent behavior of crypto‐currencies, Figure 1 presents the log returns (left panel) as well as the squared log returns (right panel) for all three currencies under scrutiny.

At least two features are worth emphasizing. First, notice that in the first part of the sample (i.e., the end of 2016 and the beginning of 2017), price changes have been comparatively small. This can be seen in both panels of the figure for Bitcoin and Litecoin. For Ethereum, the pattern is slightly different, but we still observe a general increase in variation during the second part of 2017.

Second, the degree of co‐movement between the three currencies increased markedly in 2017, where most major peaks and troughs coincide. This carries over to the squared returns, where we find that especially the sharp increase in volatility in September 2017 was common to all three currencies considered.

These two empirical regularities suggest that the proposed model should be able to capture co‐movement between Bitcoin, Litecoin, and Ethereum prices as well as changes in the first moment of the sampling density.


Moreover, the right panel indicates that large shocks appear to be quite common, calling for a flexible error distribution that allows for heteroskedasticity.

In order to provide further information on the amount of co‐movement in our dataset, Figure 2 shows a heatmap of the lower Cholesky factor of the empirical correlation matrix of the nine time series included.

The upper part of the figure reveals that all three assets display a pronounced degree of co‐movement. This indicates that each individual time series might carry important information on the behavior of the remaining two time series, pointing towards the necessity to control for this empirical regularity. For the remaining factors we do find nonzero correlation but these correlations appear to be rather muted. Nevertheless, we conjecture that the set of fundamentals above should be a reasonable starting point to explain movements in the price of crypto‐currencies.
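The quantity visualized in Figure 2 is straightforward to compute. A minimal sketch on simulated stand‐in data (a common factor plus idiosyncratic noise, chosen only to mimic strong co‐movement):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 300
# Three strongly co-moving "returns": shared factor plus small idiosyncratic noise
common = rng.normal(size=T)
returns = np.column_stack([common + 0.3 * rng.normal(size=T) for _ in range(3)])

corr = np.corrcoef(returns, rowvar=False)   # empirical correlation matrix
L = np.linalg.cholesky(corr)                # lower Cholesky factor, as in Figure 2
```

Plotting the entries of `L` as a heatmap then reproduces the kind of display described above.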


3.1 A multivariate state‐space model

To capture the empirical features of the three crypto‐currencies, a flexible econometric model is needed.

We assume that the three crypto‐currencies as well as the additional covariates are stored in an m‐dimensional vector $y_t$ that follows a VAR(p) model with time‐varying coefficients:

$$y_t = A_{1t}\, y_{t-1} + \dots + A_{pt}\, y_{t-p} + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \Sigma_t), \qquad (1)$$

with $A_{jt}$ (for $j = 1, \dots, p$) being a set of $m \times m$‐dimensional coefficient matrices and $\varepsilon_t$ being a multivariate vector of reduced‐form shocks with a time‐varying variance–covariance matrix:

$$\Sigma_t = U_t^{-1} H_t \left(U_t^{-1}\right)^{\top}. \qquad (2)$$

Hereby we let $U_t$ be a lower unitriangular matrix with $\mathrm{diag}(U_t) = \iota_m$, where $\iota_m$ denotes an m‐dimensional vector of ones.


Moreover, $H_t$ is a diagonal matrix with typical diagonal element $e^{h_{jt}}$.

The logarithmic volatilities are assumed to follow an AR(1) process:

$$h_{jt} = \mu_j + \rho_j \left(h_{j,t-1} - \mu_j\right) + \nu_{jt}, \quad \nu_{jt} \sim \mathcal{N}(0, \varsigma_j^2), \qquad (3)$$

where μj denotes the unconditional mean of the log‐volatility process, while ρj and ςj are the persistence and variance parameters, respectively.
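Simulating this law of motion makes the volatility‐clustering mechanism concrete. The parameter values below are purely illustrative, not estimates from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T, mu, rho, sigma_eta = 1000, -1.0, 0.95, 0.2   # illustrative SV parameters

h = np.empty(T)
h[0] = mu
for t in range(1, T):
    # AR(1) law of motion for the log-volatility
    h[t] = mu + rho * (h[t - 1] - mu) + sigma_eta * rng.normal()

# Heteroskedastic shocks: Gaussian with time-varying standard deviation e^{h_t/2}
eps = np.exp(h / 2) * rng.normal(size=T)
```

Because `rho` is close to one, quiet and turbulent spells of `eps` persist over time, which is exactly the regularity described for the crypto‐currency returns.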

Following Carriero, Clark, and Marcellino (2015) and Feldkircher et al. (2017), we rewrite Equation 1 as follows:

$$U_t \left(y_t - A_t x_t\right) = \eta_t, \qquad (4)$$

where $\eta_t = U_t \varepsilon_t$ is a vector of orthogonal shocks with a time‐varying variance–covariance matrix $H_t$.

Note that the ith equation (for i > 1) of this system can be written as

$$y_{it} = A_{i\bullet,t}\, x_t - \sum_{j=1}^{i-1} u_{ij,t}\, \varepsilon_{jt} + \eta_{it}, \qquad (5)$$

with $u_{ij,t}$ denoting the (i,j)th element of $U_t$.


We let $x_t = (y'_{t-1}, \dots, y'_{t-p})'$ be the stacked vector of covariates and $A_t = [A_{1t}, \dots, A_{pt}]$ the $m \times mp$ matrix of stacked coefficients, with $A_{i\bullet,t}$ denoting the ith row of the matrix concerned.

Equation 5 is a standard linear regression model with heteroskedastic innovations and the negatives of the reduced‐form shocks of the preceding i−1 equations as additional regressors. In the case of i=1, Equation 5 reduces to a simple univariate regression with xt as covariates.

It proves to be convenient to rewrite Equation 5 as follows:

$$y_{it} = z_{it}'\, \beta_{it} + \eta_{it}, \qquad (6)$$

where $\beta_{it} = (A_{i\bullet,t}, u_{i1,t}, \dots, u_{i,i-1,t})'$ is a $k_i = mp + (i-1)$‐dimensional vector of regression coefficients and $z_{it} = [x_t', -\varepsilon_{1t}, \dots, -\varepsilon_{i-1,t}]'$.

One important implication of Equation 6 is that the covariance parameters are effectively estimated in one step alongside the VAR coefficients.
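The triangularization described above can be verified numerically: any covariance matrix factors into a lower unitriangular piece and a diagonal piece, which is exactly why each equation only needs the shocks of the preceding equations as extra regressors. A small numpy sketch with made‐up numbers:

```python
import numpy as np

# Any reduced-form covariance matrix (toy numbers)
Sigma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 2.0, 0.8],
                  [0.3, 0.8, 1.5]])

C = np.linalg.cholesky(Sigma)     # Sigma = C C'
D = np.diag(np.diag(C))           # diagonal scales of the Cholesky factor
U_inv = C @ np.linalg.inv(D)      # lower unitriangular
H = D @ D                         # diagonal variances of the orthogonal shocks
U = np.linalg.inv(U_inv)

# Premultiplying by U diagonalizes the covariance: the i-th equation needs
# only the (negatives of the) shocks of the preceding i-1 equations.
orthogonalized = U @ Sigma @ U.T
```

Since `orthogonalized` is diagonal, the residuals of the recursive regressions are mutually uncorrelated, which is what makes equation‐by‐equation estimation valid.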

We assume that βit evolves according to a random walk process:

$$\beta_{it} = \beta_{i,t-1} + w_{it}. \qquad (7)$$


The shocks to the states follow a Gaussian distribution with diagonal variance–covariance matrix, $w_{it} \sim \mathcal{N}(0, \Theta_i)$ with $\Theta_i = \mathrm{diag}(\theta_{i1}, \dots, \theta_{ik_i})$.

To facilitate variable selection/shrinkage we follow Frühwirth‐Schnatter and Wagner (2010), Belmonte, Koop, and Korobilis (2014), and Bitto and Frühwirth‐Schnatter (2016), and rewrite the model given by Equations 6 and 7 as follows:

$$y_{it} = z_{it}'\, \beta_{i0} + z_{it}'\, \sqrt{\Theta_i}\, \tilde{\beta}_{it} + \eta_{it}, \qquad (8)$$
$$\tilde{\beta}_{it} = \tilde{\beta}_{i,t-1} + \tilde{w}_{it}, \quad \tilde{w}_{it} \sim \mathcal{N}(0, I_{k_i}). \qquad (9)$$

The matrix $\sqrt{\Theta_i}$ is a matrix square root such that $\sqrt{\Theta_i}\sqrt{\Theta_i} = \Theta_i$, with typical diagonal element $\sqrt{\theta_{ij}}$, and the jth element of $\beta_{it}$ reads $\beta_{ij,t} = \beta_{ij,0} + \sqrt{\theta_{ij}}\, \tilde{\beta}_{ij,t}$.

This parametrization, labeled the noncentered parametrization, implies that the state innovation variances are moved into the observation equation (see Equation 8) and treated as standard regression coefficients.


Thus, if $\sqrt{\theta_{ij}} = 0$, the coefficient associated with the jth element in zit is constant over time.
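The mechanics of this noncentered parametrization are easy to illustrate: the state innovation standard deviation acts as a simple multiplier on a standardized random walk, and setting it to zero switches time variation off. A toy sketch (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T, beta0 = 200, 0.5

# Standardized random-walk state: a cumulative sum of N(0,1) shocks
beta_tilde = np.cumsum(rng.normal(size=T))

beta_const = beta0 + 0.0 * beta_tilde   # sqrt(theta) = 0: constant coefficient
beta_tvp = beta0 + 0.2 * beta_tilde     # sqrt(theta) > 0: drifting coefficient
```

Because the multiplier enters the observation equation like an ordinary regression coefficient, shrinking it towards zero amounts to testing for time variation.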

Up to this point we have remained silent on the distributional assumptions on the measurement errors.

In what follows we depart from the literature on TVP‐VARs and assume that the measurement errors are heavy tailed and follow a t‐distribution. This choice is based on evidence in the literature (Gallant, Hsieh, & Tauchen, 1997; Geweke, 1994; Jacquier, Polson, & Rossi, 2004) which calls for heavy‐tailed distributions when used to model daily financial market data. As can be seen in Figure 1, we also observe multiple outlying observations for all three crypto‐currencies under consideration.

Since the assumption of non‐Gaussian errors would render typical estimation methods like the Kalman filter infeasible, we follow Harrison and Stevens (1976), West (1987), and Gordon and Smith (1990), and use a scale mixture of Gaussians to approximate the t‐distribution:

$$\eta_{it} \mid \phi_{it} \sim \mathcal{N}\!\left(0,\, \phi_{it}\, e^{h_{it}}\right), \qquad (10)$$
$$\phi_{it} \sim \mathcal{G}^{-1}\!\left(v_i/2,\, v_i/2\right), \qquad (11)$$

so that, after integrating out the latent scaling $\phi_{it}$, the measurement error follows a t‐distribution with $v_i$ degrees of freedom,

$$\eta_{it} \sim t_{v_i}\!\left(0,\, e^{h_{it}}\right). \qquad (12)$$


Note that the degree‐of‐freedom parameter vi is equation specific, implying that the excess kurtosis of the underlying error distribution is allowed to change across equations, a feature that might be important given the different time series involved.

The latent process ϕit simply serves to rescale the Gaussian distribution in the case of large shocks. Note that if ϕit = 1 for all i,t we obtain the standard time‐varying parameter VAR as in Primiceri (2005).
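The scale‐mixture representation can be checked numerically: integrating a Gaussian density against an inverse‐Gamma mixing distribution over its variance recovers the t density. A short sketch using scipy (unit scale, i.e., $e^{h_{it}} = 1$ for simplicity):

```python
import numpy as np
from scipy import stats, integrate

nu = 5.0  # illustrative degrees of freedom

def mixture_density(x: float) -> float:
    """Integrate N(x | 0, phi) against phi ~ InverseGamma(nu/2, nu/2)."""
    integrand = lambda phi: (stats.norm.pdf(x, scale=np.sqrt(phi))
                             * stats.invgamma.pdf(phi, a=nu / 2, scale=nu / 2))
    val, _ = integrate.quad(integrand, 0, np.inf)
    return val
```

Evaluating `mixture_density(x)` at any point matches `stats.t.pdf(x, df=nu)` up to quadrature error, which is why conditioning on the latent scalings restores conditionally Gaussian (and hence Kalman‐filter‐friendly) measurement errors.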

3.2 Prior specification

The prior setup adopted closely follows Feldkircher et al. (2017). More specifically, we use a normal‐gamma (NG) shrinkage prior on the elements of βi0 and $\sqrt{\Theta_i}$.

The NG prior comprises a Gaussian prior on the coefficients alongside a set of local and global shrinkage parameters for the first mp elements of βi0 and $\sqrt{\Theta_i}$:

$$\beta_{ij,0} \mid \tau^2_{\beta,ij}, \lambda_L \sim \mathcal{N}\!\left(0,\; \frac{2}{\lambda_L}\, \tau^2_{\beta,ij}\right), \qquad (13)$$
$$\sqrt{\theta_{ij}} \mid \tau^2_{\vartheta,ij}, \lambda_L \sim \mathcal{N}\!\left(0,\; \frac{2}{\lambda_L}\, \tau^2_{\vartheta,ij}\right), \qquad (14)$$

for i=1,…,m and j=1,…,mp.

Here we let $\tau^2_{s,ij}$ (for s ∈ {β,ϑ}) denote local shrinkage parameters with

$$\tau^2_{s,ij} \sim \mathcal{G}(\kappa, \kappa), \qquad (15)$$

where κ is a hyperparameter specified by the researcher and λL is a global shrinkage parameter that is lag specific, that is, applied to the elements in βi0 and $\sqrt{\Theta_i}$ associated with the Lth lag of yt, and constructed as follows:

$$\lambda_L = \lambda_{L-1}\, \pi_L, \quad \lambda_0 = 1, \quad \pi_l \sim \mathcal{G}(c_0, d_0). \qquad (16)$$

This implies that if πl > 1, the prior introduces more shrinkage with increasing lag orders.

The degree of overall shrinkage is controlled through the hyperparameters c0 and d0.

Note that this specification pools the parameters that control the amount of time variation as well as the time‐invariant regression parameters. This captures the notion that if a variable is not included initially, the probability of having a time‐varying coefficient also decreases (by increasing the lag‐specific shrinkage parameter λL).


For the covariance parameters indexed by j = mp + 1,…,ki the prior is specified analogously to Equations 13 and 14 but with λL replaced by ϱ. This choice implies that all covariance parameters as well as the corresponding process innovation variances are pushed to zero simultaneously.

For ϱ we again use a Gamma distributed prior:

$$\varrho \sim \mathcal{G}(a_0, b_0), \qquad (17)$$

with a0, b0 being hyperparameters.

This prior specification has the convenient property that the parameters λL and ϱ introduce prior dependence, pooling information across different coefficient types (i.e., regression coefficients and process innovation variances), introducing strong global shrinkage on all coefficients concerned.

By contrast, the introduction of the local scaling parameters τs,ij serves to provide flexibility in the presence of strong overall shrinkage introduced by λL and ϱ. Thus, even if the aforementioned global scaling parameters are large (i.e., heavy shrinkage is introduced in the model), the local scalings provide sufficient flexibility to pull posterior mass away from zero, allowing for nonzero coefficients. The role of the hyperparameter κ is to control the tail behavior of the prior.

If κ is small (close to zero), the prior places more mass on zero but the tails of the marginal prior obtained after integrating over the local scales become thicker (see Griffin & Brown, 2010, for a discussion).
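This tail behavior can be quantified exactly. For a normal‐gamma prior with the global scale normalized away (excess kurtosis is scale free), standard moment algebra gives an excess kurtosis of 3/κ, so small κ produces very heavy tails. A minimal check:

```python
import numpy as np

def marginal_excess_kurtosis(kappa: float) -> float:
    """Excess kurtosis of beta with beta | tau2 ~ N(0, tau2), tau2 ~ Gamma(kappa, rate=kappa).
    E[beta^2] = E[tau2] = 1 and E[beta^4] = 3 E[tau2^2] = 3 (kappa + 1) / kappa,
    so the excess kurtosis equals 3 / kappa."""
    return 3 * (kappa + 1) / kappa - 3

heavy = marginal_excess_kurtosis(0.1)   # small kappa: very fat tails
light = marginal_excess_kurtosis(10.0)  # large kappa: close to Gaussian
```

With κ = 0.1, the value later used in the forecasting exercise, the marginal prior has an excess kurtosis of 30, allowing occasional large coefficients to escape the otherwise heavy shrinkage.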

For the parameters of the log‐volatility equation (Equation 3) we follow Kastner and Frühwirth‐Schnatter (2014) and Kastner (2015a) and use a normally distributed prior on $\mu_j$, a Beta prior on $(\rho_j + 1)/2$, and a Gamma prior on $\varsigma_j^2$.

In addition, we specify a uniform prior on $v_i$, effectively ruling out the limiting case of a Gaussian distribution if vi becomes excessively large.

3.3 Full conditional posterior simulation

Estimation of the model is carried out using Markov chain Monte Carlo (MCMC) techniques. Our MCMC algorithm consists of the following blocks:

  1. Conditional on the remaining parameters/states in the model, simulate the full history of the latent states $\tilde{\beta}_{it}$ using a forward‐filtering backward sampling algorithm (Carter & Kohn, 1994; Frühwirth‐Schnatter, 1994) on an equation‐by‐equation basis.

  2. The full history of the log‐volatility process as well as the parameters of Equation 3 are obtained by relying on the algorithm proposed in Kastner and Frühwirth‐Schnatter (2014) and implemented in the R package stochvol (Kastner, 2015b).
  3. The time‐invariant components βi0 as well as θi=diag(Θi) are simulated from a multivariate Gaussian posterior that takes a standard form (see Feldkircher et al., 2017).

  4. The sequence of local scaling parameters is simulated from a generalized inverted Gaussian (GIG) distributed posterior distribution given by

    $$\tau^2_{s,ij} \mid \bullet \sim \mathcal{GIG}\!\left(\kappa - \tfrac{1}{2},\; \frac{\lambda_L\, b_{s,ij}^2}{2},\; 2\kappa\right)$$

    for s ∈ {β,ϑ}, with $b_{\beta,ij} = \beta_{ij,0}$ and $b_{\vartheta,ij} = \sqrt{\theta_{ij}}$, where $\mathcal{GIG}(p, a, b)$ denotes the parameterization with density proportional to $x^{p-1} \exp\{-(a/x + bx)/2\}$.

    The posterior distribution for the scalings associated with the covariance parameters is similar, with λL replaced by ϱ.

  5. We obtain draws from the posterior of the lag‐specific shrinkage parameter associated with the lth lag by combining the likelihood with the prior on πl.

    The resulting posterior distribution is a Gamma distribution:

    with the • indicating the conditioning on everything else, $R = 2pm^2$, and $\lambda_0 = 1$.

    The set selects all coefficients associated with the lth lag of yt.

    Similarly, the conditional posterior of ϱ is given by

    where ν=m(m−1) denotes the number of covariance parameters in addition to the number of process variances for the corresponding parameters.


  6. The full history of $\phi_{it}$ is obtained by independently simulating from an inverted Gamma distributed posterior (see Kastner, 2015c):

    $$\phi_{it} \mid \bullet \sim \mathcal{G}^{-1}\!\left(\frac{v_i + 1}{2},\; \frac{v_i + \eta_{it}^2\, e^{-h_{it}}}{2}\right)$$

    for t=1,…,T.

  7. To simulate the degrees of freedom vi, we perform an independent Metropolis–Hastings (MH) step described in Kastner (2015c).

This algorithm is repeated a large number of times, with the first Nburn draws being discarded as burn‐in. Note that the equation‐by‐equation algorithm yields significant computational gains relative to competing estimation algorithms that rely on full‐system estimation of the VAR model.
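Step 1 above relies on forward‐filtering backward sampling. The sketch below implements the recursions for a single random‐walk coefficient in a univariate toy regression; the data, noise variances, and initial state are invented for illustration, and the paper's actual sampler handles full coefficient vectors per equation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: y_t = z_t * beta_t + e_t with a slowly drifting coefficient
T, sig2, q = 300, 0.3 ** 2, 0.01          # illustrative variances
z = rng.normal(size=T)
beta_true = np.sin(np.arange(T) / 50.0)   # smooth "true" coefficient path
y = z * beta_true + np.sqrt(sig2) * rng.normal(size=T)

# Forward filter: Kalman recursions for the random-walk state
m, P = np.zeros(T), np.zeros(T)
m_prev, P_prev = 0.0, 1.0                 # vague initial state
for t in range(T):
    a, R = m_prev, P_prev + q             # predict
    S = z[t] ** 2 * R + sig2              # one-step-ahead variance
    K = R * z[t] / S                      # Kalman gain
    m[t] = a + K * (y[t] - z[t] * a)      # update mean
    P[t] = R - K * z[t] * R               # update variance
    m_prev, P_prev = m[t], P[t]

# Backward sampler: draw beta_T, then beta_{T-1}, ..., beta_1
draw = np.zeros(T)
draw[-1] = m[-1] + np.sqrt(P[-1]) * rng.normal()
for t in range(T - 2, -1, -1):
    B = P[t] / (P[t] + q)                 # smoothing gain
    mean = m[t] + B * (draw[t + 1] - m[t])
    var = P[t] * (1.0 - B)
    draw[t] = mean + np.sqrt(var) * rng.normal()
```

A single backward pass delivers one joint draw of the full coefficient path, which is what makes this building block efficient inside a Gibbs sampler.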



4.1 Model specification and design of the forecasting exercise

In this section, we briefly describe model specification and the design of the forecasting exercise.

The prior setup for our benchmark specification (henceforth labeled the t‐TVP NG model) closely follows the existing literature on NG shrinkage priors (Bitto & Frühwirth‐Schnatter, 2016; Feldkircher et al., 2017; Griffin & Brown, 2010; Huber & Feldkircher, 2017). More specifically, we set κ = 0.1, c0 = 1.5, and d0 = 1 to center the prior on πl above unity, while a0 = b0 = 0.01.


The choice for κ implies that we place a large amount of prior mass on zero while at the same time allowing for relatively thick tails. Our choice for the Gamma prior on ϱ introduces heavy shrinkage on the covariance parameters as well as the corresponding process standard deviations.

For all models we consider (i.e., the proposed model as well as the competitors introduced in the next subsection), we include a single lag of the endogenous variables. Higher lag orders are generally possible but, given the high dimension of the state space and the increased computational complexity, we stick to one lag. In addition, experimenting with slightly higher lag orders leads to models that are relatively unstable at several points in time in our estimation sample.

The design of our forecasting exercise is the following.

We start with an initial estimation period that spans the period from the end of November 2016 (22 November) to the end of April 2017 (26 April). The remaining 160 days are used as a hold‐out period. After obtaining the one‐step‐ahead predictive density for 27 April 2017, we subsequently expand the estimation sample by a single day until the end of the sample is reached.

This yields a sequence of 160 one‐day‐ahead predictive densities.

To assess the predictive fit of our model we use the log‐predictive likelihood (LPL), motivated in Geweke and Amisano (2010), for example, and the root mean square forecast error (RMSE). Using LPLs enables us to assess not only how well the model fits in terms of point predictions but also how well higher moments of the predictive density are captured.

In addition, to assess model calibration we use univariate probability integral transforms (Amisano & Geweke, 2017; Clark, 2011; Diebold, Gunther, & Tay, 1998).
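Computing probability integral transforms only requires evaluating each predictive CDF at the realized value. In the minimal sketch below the realizations are drawn from the same Gaussian density used for scoring, so the PITs should look uniform; for a miscalibrated model they would pile up in the center or the tails.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Toy setup: realizations actually come from the predictive density used below,
# i.e., the "model" is correctly calibrated by construction
realized = rng.normal(0, 0.04, size=160)
pits = stats.norm.cdf(realized, loc=0, scale=0.04)
```

A histogram of `pits` against the uniform benchmark is the usual graphical calibration check.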

4.2 Competing models

Our set of competing models ranges from univariate benchmark models that feature SV to a wide set of multivariate benchmark models. The first set of models comprises a random walk (RW‐SV) and an AR(1) model (henceforth labeled AR‐SV), both estimated with SV. We use a noninformative prior on the AR(1) regression coefficient and the same prior setup for the log‐volatility equation as discussed in the previous section.

These two models serve to illustrate whether a multivariate modeling approach pays off and, in addition, whether allowing for structural changes in the underlying regression parameters improves predictive capabilities.

In addition, we consider a set of nested multivariate benchmark models.

To quantify the accuracy gains of time‐varying parameter specifications, we estimate three constant parameter VARs with SV.


The first VAR uses the prior setup described above but with $\sqrt{\theta_{ij}} = 0$ for all i, j. The second model is a nonconjugate Minnesota VAR with asymmetric shrinkage across equations.

To select the hyperparameters we follow Giannone, Lenza, and Primiceri (2015) and place hyperpriors on all hyperparameters and estimate them using a random walk MH step. The last VAR we consider is a model that features a stochastic search variable selection (SSVS) prior specified as in George, Sun, and Ni (2008). This implies that a two‐component Gaussian prior is used, with the Gaussians differing in terms of their prior variance.


One component features a large prior variance (labeled the slab distribution), which introduces relatively little prior information, whereas the second component has a prior variance close to zero (the spike component) that strongly forces the posterior of the respective coefficient to zero. We set the hyperparameters (i.e., the prior standard deviations) for the slab distribution by using the OLS standard deviation times a constant (10 in our case), while the prior standard deviation on the spike component is set equal to 0.1 times the OLS standard deviation.
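The workhorse computation behind an SSVS prior is the posterior probability that a coefficient belongs to the slab rather than the spike. A self‐contained sketch using the scaling rule described above (10 and 0.1 times the OLS standard deviation; the 0.5 prior inclusion probability is an assumption for illustration):

```python
import numpy as np

def ssvs_inclusion_prob(beta: float, ols_sd: float,
                        slab_mult: float = 10.0, spike_mult: float = 0.1,
                        prior_incl: float = 0.5) -> float:
    """Posterior probability that beta is drawn from the slab component."""
    slab_sd, spike_sd = slab_mult * ols_sd, spike_mult * ols_sd

    def normal_pdf(x, sd):
        return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

    slab = prior_incl * normal_pdf(beta, slab_sd)
    spike = (1 - prior_incl) * normal_pdf(beta, spike_sd)
    return slab / (slab + spike)

p_large = ssvs_inclusion_prob(0.5, 0.1)    # sizeable coefficient: kept
p_small = ssvs_inclusion_prob(0.001, 0.1)  # near-zero coefficient: shrunk out
```

Inside a Gibbs sampler, a Bernoulli draw with this probability selects the component, which is how the prior mimics stochastic variable selection.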

Moreover, we include two time‐varying parameter models with SV and Gaussian measurement errors. The first TVP‐VAR considered (labeled TVP) is based on an uninformative prior (obtained by setting the prior variances to unity for both the initial states as well as the process standard deviations). The next benchmark model (called TVP NG) is our proposed specification with an NG prior but with Gaussian errors (i.e., ϕit = 1 for all i,t).

This choice serves to assess whether additional flexibility on the measurement errors is needed.

Finally, the last model considered is the most flexible specification in terms of the law of motion of the latent states.

This model, labeled the threshold TVP‐VAR (TTVP), is based on Huber et al. (2017) and captures the notion that parameter movements are only allowed if they are sufficiently large. To achieve this, a threshold specification for the process variances is adopted. This specification depends on a latent indicator that, in turn, is driven by the absolute size of parameter changes. Thus, if the change in a given regression parameter is large (i.e., exceeds a certain threshold we estimate), we use a large variance in Equation 7.

By contrast, if the change is small the process variance is set to a small constant that is close to zero. The prior specification adopted here closely follows the benchmark specification outlined in Huber et al. (2017) and we refer to the original paper for additional details.
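The core of this threshold mechanism reduces to a simple variance switch. The sketch below is a stylized rendering with made‐up variance values, not the paper's estimated specification (where the threshold itself is a parameter):

```python
import numpy as np

def ttvp_state_variance(delta_beta: float, threshold: float,
                        large_var: float = 0.1, small_var: float = 1e-6) -> float:
    """Threshold rule: sizeable parameter changes receive the large state
    variance; negligible ones an (almost zero) constant variance."""
    return large_var if abs(delta_beta) > threshold else small_var
```

This keeps coefficient paths locally constant during tranquil periods while still permitting abrupt shifts, which is the intended middle ground between constant‐parameter and freely drifting models.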

4.3 Out‐of‐sample forecasting performance

We start by considering the forecasting performance in terms of log predictive scores (LPS). Table 1 displays the LPS as well as the RMSEs for the competing models. The first column shows the joint LPS for the three crypto‐currencies considered, while the next three columns display the marginal LPS for a given crypto‐currency.

The final three columns show the RMSEs.

Table 1. Joint and marginal log predictive likelihoods for all models considered (left panel) and root mean square forecast errors (right panel). For the joint log predictive likelihood we integrate out the effect of the other variables included in yt and focus exclusively on the predictive performance for the three crypto‐currencies

              Log predictive score                     Root mean square error
Model         Joint    Bitcoin  Ethereum  Litecoin    Bitcoin  Ethereum  Litecoin
TVP NG        632.410  286.134  144.629   159.562     0.050    0.083     0.079
t‐TVP NG      643.873  277.679  161.768   166.988     0.050    0.084     0.078

Considering the joint LPS indicates that, across models, the t‐TVP NG specification outperforms the remaining models.

This points towards the necessity to allow for both a flexible error distribution as well as time‐varying parameters with appropriate shrinkage priors. Especially when compared to the constant‐parameter VAR models, all three TVP‐VAR specifications with some form of shrinkage yield pronounced accuracy gains.

Note also that the AR(1) model with SV proves to be a tough competitor relative to the set of Bayesian VARs.

The necessity of introducing shrinkage in the TVP‐VAR framework can be seen by comparing the joint forecasting performance of the TVP model with the remaining TVP‐VARs considered. Note that in our medium‐scale model a TVP‐VAR with relatively little shrinkage leads to overfitting issues, which in turn are detrimental to forecasting performance.

Zooming into the results for the three crypto‐currencies, we generally observe that models performing well in terms of the joint LPS also do well on average.

One interesting exception is our proposed t‐TVP NG specification. While the performance gains for Litecoin and Ethereum appear to be substantial vis‐à‐vis the competing models, we find that Bitcoin predictions appear to be inferior relative to the TTVP and the TVP NG specifications.

If the researcher is interested in predicting the price of Bitcoin, the two best‐performing models are the TTVP specification and the Bayesian VAR with a normal‐gamma shrinkage prior. Interestingly, note that the comparatively weaker joint performance of the BVAR models stems from weaker Litecoin and Ethereum predictions, whereas Bitcoin predictions appear to be rather precise.

The point forecasting performance generally corroborates the findings for density forecasts.

Here we again observe that models which yield precise predictive densities also work well when only point predictions are considered. Note, however, that the differences in terms of RMSE between multivariate models and the univariate AR(1) model are negligible. This somewhat highlights that forecasting gains in terms of predictive likelihoods stem from higher moments of the predictive density like the predictive variance (in terms of the marginal log scores) or a more appropriate modeling strategy for the predictive variance–covariance structure.

Next, we investigate whether differences in forecasting performance appear to be time varying.

Figure 3 shows the log predictive Bayes factors relative to the random walk with SV. Comparing model performances over time points towards a pronounced degree of heterogeneity.

Results for Bitcoin (see panel (a)) show that the two best‐performing models are the TTVP and the TVP NG specifications. While the former yields a slightly better performance over time, the latter proves to be the best‐performing model during the first part of the hold‐out period. For the remaining models we find only relatively little time variation in their predictive performance.


Considering the results for Litecoin (see panel (b)) we find pronounced movements in relative forecasting accuracy. More specifically, we find that while forecasting performance appears to be homogeneous during the first months of the hold‐out period, from May 2017 onward the t‐TVP NG specification starts to perform extraordinarily well, improving upon all competitors by large margins.