# Predicting stock market movements using network science: an information theoretic approach

- Minjun Kim^{1, 2}
- Hiroki Sayama^{1, 2}

**Received:** 30 May 2017

**Accepted:** 8 September 2017

**Published:** 10 October 2017

## Abstract

A stock market is considered one of the most highly complex systems; it consists of many components whose prices move up and down without a clear pattern. This complexity makes it difficult to reliably predict the market’s future movements. In this paper, we aim to build a new method for forecasting the future movements of the Standard & Poor’s 500 Index (S&P 500). We construct time-series complex networks of the companies underlying the S&P 500, connecting them with links whose weights are given by the mutual information of the 60-min price movements of each pair of companies, using 5340 consecutive minutes of price records. We show that changes in the strength distributions of the networks provide important information on the network’s future movements. We built several metrics using the strength distributions and network measurements such as centrality, and combined the best two predictors in a linear combination. We found that the combined predictor and the changes in the S&P 500 show a quadratic relationship, which allows us to predict the amplitude of the one-step-ahead change in the S&P 500. The results showed significant fluctuations in the S&P 500 Index when the combined predictor was high. To make actual index predictions, we built ARIMA models with and without network measurements and compared their predictive power. We found that adding the network measurements to the ARIMA models improves model accuracy. These findings are useful for financial market policy makers, as an indicator on which they can base interventions before the markets change drastically, and for quantitative investors seeking to improve their forecasting models.

## Introduction

Stock market crashes are hard to prevent because of the high complexity of the market, which is made up of many components behaving interdependently. “The Crash of 2:45” on May 6th 2010, which decreased the value of U.S. stock markets by about 6 percent in less than 30 min, and the flash crash in Singapore, which took away $6.9 billion from the Singapore Exchange, are a few examples of flash crashes. Studies (Menkveld and Yueshen 2016; U.S. Commodity Futures Trading Commission 2010; Kirilenko et al. 2017), including the report by the CFTC (U.S. Commodity Futures Trading Commission) and the SEC (U.S. Securities and Exchange Commission), suggested that the main cause was high-frequency algorithmic traders dumping high volumes of financial instruments onto the market at around the same time, exacerbating the volatility during the events. Those algorithms are developed using complex mathematical models based on theories from physics, statistics and other scientific fields, with the sole purpose of producing possible trading signals. When these algorithms are triggered to make trades, the market surges or falls drastically because of the high volatility they create in the market (Kirilenko et al. 2017). Considering that high-frequency trading accounts for over 70 percent of dollar trading volume in the U.S. financial market, and that flash crashes happened over 18,500 times between 2006 and 2011 (Frank 2010; Neil et al. 2012), forecasting flash crashes and preventing losses is strongly needed for the safety of unprotected ordinary individual investors, a healthy market ecology, and the whole economy.

There have been many studies and developments on predicting stock market movements using many different approaches, including deep learning algorithms with neural networks. One big branch of stock market forecasting methods has machines learn huge sets of data, such as historical stock prices, trading volumes, accounting performance, fundamental features of the stocks, and even the weather, and produce the future values of stocks or indices. It utilizes many learning, regression, classification, and neural network algorithms, such as support vector machines, random forests, logistic regression, naive Bayes, and recurrent neural networks, and tries to make accurate predictions by adjusting itself according to market changes (Guresen et al. 2011; Huang et al. 2004; Atsalakis and Valavanis 2009; Kim and Han 2000). Another popular method is to use natural language processing techniques, which let machines extract and understand information written and spoken in human languages, to capture stock market sentiment and make investment decisions based on the mood of the market (Schumaker et al. 2012; Schumaker and Chen 2009). Traditional finance and modern financial engineering also attempt to forecast the stock market using fundamental and technical analysis. While fundamental analysis is interested in estimating the intrinsic values of stocks based on companies’ performance and the state of the economy, technical analysis focuses on price and volume dynamics and tries to capture investment timing by developing technical indicators (Wong et al. 2010).

Some studies have adapted network science theories to study the stock market. Those studies mainly focus on analyzing the structural properties of stock market networks to find the major influencers and to detect the communities of the stock markets (Namaki et al. 2011; Tse et al. 2010; Huang et al. 2009). However, little research has been conducted on forecasting the future movements of the stock market using network science. One of the few studies built corporate news networks using the top 50 European companies in the STOXX 50 index as nodes, and the sum of the number of news items with a common topic for each company pair as link weights. This study found that the average eigenvector centrality of the news networks has an impact on the return and volatility of the STOXX 50 index (Creamer et al. 2013). Another study constructed a role-based trading network for each company, characterizing the daily trading relationships among its investors with transaction data. Specifically, nodes are traders involved in the transactions of a stock, and for each transaction between two traders there is a link from the seller to the buyer. By categorizing the nodes into three types (Hub, Periphery, and Connector) according to each node’s connectedness, this study created 9 different link types and found that the time series of the fractions of the link types P-H and C-H have predictive power, with a maximum accuracy of 69.2% (Sun et al. 2014).

Network science has been used and developed in many different fields. However, only a few studies have applied it to financial market time-series forecasting. Also, the previous studies did not show whether network analysis helps improve the performance of financial market time-series forecasting models. In this paper, we discuss our network analysis, which forecasts the amplitudes of S&P 500 changes one hour into the future and helps improve the performance of ARIMA models.

## Method

### Data gathering and preprocessing

Among the 504 companies in the S&P 500 at the time of the analysis, the stock price records of 475 companies were used because of data availability. The S&P 500 index record was also used as the dependent variable of our model. Each time-series record consists of 5340 closing prices at one-minute intervals, ranging from 9:30am on September 22nd 2016 to 4:00pm on October 11th 2016, which covers 89 consecutive trading hours. The stock price records for the 475 stocks and the S&P 500 index records were acquired from the Google Finance (https://www.google.com/finance) real-time price quotes. The data set contained missing values because some stocks were not traded at specific dates or times, when the stock exchanges halted or delayed trading of specific companies’ stocks pending news or because of significant imbalances in the pending buy and sell orders (Christie et al. 2002). The data set was preprocessed to handle the missing data with a hot deck imputation method, “last observation carried forward”, which replaces each missing value with the nearest available data point in the same time series, specifically the very last closing price (DiCesare 2006).
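
The “last observation carried forward” step can be illustrated with a minimal sketch. The function name `fill_last_observation` and the sample prices are hypothetical, and representing a missing quote as `None` is an assumption made for this illustration rather than a description of the original data format.

```python
from typing import List, Optional


def fill_last_observation(prices: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing value (None) with the most recent observed price."""
    filled = []
    last = None  # no observation seen yet
    for price in prices:
        if price is not None:
            last = price
        # A leading gap stays None because there is nothing to carry forward.
        filled.append(last)
    return filled


# Hypothetical one-minute closing prices with gaps (e.g. during a trading halt).
series = [100.0, None, None, 100.4, 100.3, None]
print(fill_last_observation(series))
# prints [100.0, 100.0, 100.0, 100.4, 100.3, 100.3]
```

With 5340 one-minute records per stock, the filled series can then be cut into 5340 / 60 = 89 hourly windows, matching the 89 trading hours stated above.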

### Network construction

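For each 60-min window, the link weight between a pair of stocks is their mutual information (Eq. 1):

$$
I(X_i; Y_j) = \sum_{x \in X_i} \sum_{y \in Y_j} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \tag{1}
$$

where $X_i$ and $Y_j$ are the price records of a pair of stocks in the specific window, $p(x)$ is the probability of a random sample $x$ occurring in $X_i$, $p(y)$ is the probability of a random sample $y$ occurring in $Y_j$, $i = \{s_1, s_2, \cdots, s_{475}\}$ and $j = \{s_1, s_2, \cdots, s_{475}\}$ represent the 475 companies, and $p(x, y)$ is the joint probability. Finally, we created the mutual information matrix for all the stock pairs, as shown in matrix (2):

$$
\begin{pmatrix}
I(X_{s_1}; Y_{s_1}) & I(X_{s_1}; Y_{s_2}) & \cdots & I(X_{s_1}; Y_{s_{475}}) \\
I(X_{s_2}; Y_{s_1}) & I(X_{s_2}; Y_{s_2}) & \cdots & I(X_{s_2}; Y_{s_{475}}) \\
\vdots & \vdots & \ddots & \vdots \\
I(X_{s_{475}}; Y_{s_1}) & I(X_{s_{475}}; Y_{s_2}) & \cdots & I(X_{s_{475}}; Y_{s_{475}})
\end{pmatrix} \tag{2}
$$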

As for the network links, we assigned each link a weight equal to the corresponding mutual information. This means that the networks are complete weighted graphs with links between every pair of nodes, and since mutual information is symmetric, the networks are also undirected. Some previous studies formed networks with a threshold value on the link weights for the specific purpose of detecting stock market clusters, to see whether stocks in the same industries or sectors fall into the same communities as in their market classifications (Namaki et al. 2011). In this study, however, we took all the link weights into consideration in order to study the dynamics of the strength distributions of the networks for forecasting future stock market fluctuations.
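
A minimal sketch of how link weights and node strengths could be computed for one window is given below. It rests on several assumptions that the text does not specify: price movements are encoded as up, down or flat symbols, probabilities are estimated by simple counting, and the logarithm is base 2. The function names and tickers are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def movements(prices):
    """Encode successive price changes as 1 (up), -1 (down) or 0 (flat)."""
    return [(b > a) - (b < a) for a, b in zip(prices, prices[1:])]


def mutual_information(xs, ys):
    """Counting estimate of I(X;Y) in bits from two equally long symbol sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2(p(x,y) / (p(x) p(y))), summed over observed symbol pairs.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def node_strengths(window):
    """window maps ticker -> prices in one window; returns ticker -> strength.

    The strength of a node is the sum of the weights of its links, and every
    pair of nodes is linked because the network is a complete graph.
    """
    moves = {ticker: movements(prices) for ticker, prices in window.items()}
    strength = dict.fromkeys(window, 0.0)
    for a, b in combinations(window, 2):
        weight = mutual_information(moves[a], moves[b])
        strength[a] += weight
        strength[b] += weight
    return strength


# Hypothetical prices: A and B move identically, C moves independently of both.
window = {"A": [1, 2, 1, 2, 1], "B": [1, 2, 1, 2, 1], "C": [1, 2, 3, 2, 1]}
print(node_strengths(window))
```

Because mutual information is symmetric, each pair is evaluated once and the weight is added to the strength of both endpoints.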

### Analysis

#### Metrics using strength distributions

We compared the strength distribution of each network ($P_t$) with ($Q_t$), the probability distribution of the aggregated average strength of the prior networks, calculated by Eq. (3), where $s$ is the number of prior networks:

$$
Q_t = \frac{1}{s} \sum_{k=1}^{s} P_{t-k} \tag{3}
$$

Kullback-Leibler divergence, also called relative entropy, is a measure of the difference between two probability distributions $P$ and $Q$, where $P$ is the distribution of the observation whose difference from the average prior distribution $Q$ we want to measure. For example, $Q_t$ for calculating KLD-$s$ of the $t$-th network is the average of the strength distributions of the $(t-1)$-th, $(t-2)$-th, ⋯, and $(t-s)$-th networks. Considering that the stock market runs 6.5 h daily, we computed the Kullback-Leibler divergence between each strength distribution and the average of the prior distributions over 3 h (0.5 trading day), 6 h (1 trading day), 9 h (1.5 trading days), 13 h (2 trading days) and the entire prior period, namely KLD-3, KLD-6, KLD-9, KLD-13 and KLD-All. The Kullback-Leibler divergence of $P$ from $Q$ was calculated by Eq. (4):

$$
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \tag{4}
$$
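
The KLD-*s* metric can be sketched as follows, assuming each strength distribution has already been binned into a list of probabilities over the same bins. The binning itself, the natural logarithm, and the small `eps` guard against empty prior bins are assumptions made for this illustration; the function names are hypothetical.

```python
import math


def average_prior(distributions, t, s):
    """Q_t: bin-wise mean of the s distributions preceding index t."""
    prior = distributions[t - s:t]
    return [sum(column) / s for column in zip(*prior)]


def kl_divergence(p, q, eps=1e-12):
    """Sum of P(x) * log(P(x) / Q(x)); bins with P(x) = 0 contribute nothing."""
    return sum(
        pi * math.log(pi / max(qi, eps))  # eps avoids division by an empty bin
        for pi, qi in zip(p, q)
        if pi > 0
    )


def kld_s(distributions, t, s):
    """Divergence of the t-th strength distribution from its s-network prior."""
    if t < s:
        raise ValueError("need at least s prior networks before index t")
    return kl_divergence(distributions[t], average_prior(distributions, t, s))


# Hypothetical two-bin strength distributions for three consecutive networks.
history = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
print(kld_s(history, t=2, s=2))
```

In this toy history the third network shifts most of its mass into one bin, so its divergence from the flat prior is positive, while a network identical to its prior would score zero.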

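We also measured the relative strength ($RS$) of the network. We calculated the average strength ($AD$) of the nodes of each network and divided it by the average strength of the prior networks, as shown in Eq. (5), where $s$ is the number of prior networks:

$$
RS_t = \frac{AD_t}{\frac{1}{s} \sum_{k=1}^{s} AD_{t-k}} \tag{5}
$$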
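
The ratio described here is simple enough to sketch directly. The function names and the sample values are hypothetical, and the sketch assumes the average node strength of each network is already available in time order.

```python
def average_strength(strengths):
    """Mean node strength of one network, given ticker -> strength."""
    return sum(strengths.values()) / len(strengths)


def relative_strength(avg_strengths, t, s):
    """Average strength at index t divided by the mean of the s prior values."""
    if t < s:
        raise ValueError("need at least s prior networks before index t")
    prior_mean = sum(avg_strengths[t - s:t]) / s
    return avg_strengths[t] / prior_mean


# Hypothetical average strengths of four consecutive hourly networks.
history = [2.0, 2.0, 2.0, 3.0]
print(relative_strength(history, t=3, s=3))
# prints 1.5
```

A value above 1 indicates that the current network is more strongly connected on average than its recent predecessors.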

#### Network centrality and modularity

Network centrality is often used in social network analysis to find the most influential or important nodes in a network by measuring their cohesiveness or involvement in the network. Because our stock market networks contained more high-strength nodes when there were large changes in the S&P 500 index, metrics based on the centrality of the nodes could explain the movements of the S&P 500 index. We computed the eigenvector and betweenness centralities of the nodes in our S&P 500 networks and formed metrics using their mean, median and maximum values. In the same sense, we also formed metrics using network modularity. Modularity is a clustering measure used to find the community structure of a network. A high modularity means that there are more links within specific groups in a network than would be expected if links were randomly distributed among groups in the network.
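
As an illustration of the eigenvector centrality metrics, the sketch below uses plain power iteration on a symmetric weight matrix and then summarizes the node scores by mean, median and maximum. The choice of power iteration, the unit-length normalization, and the toy weights are assumptions for this example; the text does not state which implementation was used.

```python
import math
import statistics


def eigenvector_centrality(weights, iterations=200, tol=1e-10):
    """Power iteration on a symmetric, non-negative weight matrix (list of lists)."""
    n = len(weights)
    x = [1.0 / n] * n  # uniform starting vector
    for _ in range(iterations):
        y = [sum(weights[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0:
            return x  # no links at all, centrality is undefined
        y = [v / norm for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


def centrality_metrics(weights):
    """Mean, median and maximum eigenvector centrality of one network."""
    scores = eigenvector_centrality(weights)
    return statistics.mean(scores), statistics.median(scores), max(scores)


# Hypothetical 3-node network: nodes 0 and 1 share the strongest link.
toy = [[0, 2, 1],
       [2, 0, 1],
       [1, 1, 0]]
print(centrality_metrics(toy))
```

In the toy network the two strongly linked nodes receive equal scores that exceed the score of the third node, which is the behavior the metric is meant to capture.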

For each metric we built, we tested its prediction performance against the actual changes, the squares of the changes, and the absolute values of the changes in the S&P 500 index. Second- and third-order polynomial regressions and a simple linear regression were used for the fitting tests. Results are shown in the “Results” section.
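
The fitting tests can be sketched with a small least-squares polynomial fit and its R-squared. Solving the normal equations by Gauss-Jordan elimination is an assumption chosen to keep the example dependency-free; the data are hypothetical, and the function names are not from the original work.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    m = degree + 1
    rows = []
    for i in range(m):
        row = [sum(x ** (i + j) for x in xs) for j in range(m)]
        row.append(sum(y * x ** i for x, y in zip(xs, ys)))
        rows.append([float(v) for v in row])
    for col in range(m):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(m):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [u - factor * v for u, v in zip(rows[r], rows[col])]
    return [rows[i][m] / rows[i][i] for i in range(m)]


def r_squared(xs, ys, degree):
    """Share of the variance in ys explained by a polynomial fit of the given degree."""
    coeffs = polyfit(xs, ys, degree)
    fitted = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical data with a purely quadratic relationship (y = x squared).
x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]
for degree in (1, 2, 3):
    print(degree, round(r_squared(x, y, degree), 4))
```

For this toy data the linear fit explains essentially none of the variance while the second-order fit explains all of it, which is the kind of contrast that motivates testing both linear and polynomial regressions. The sketch needs more distinct x values than the polynomial degree and a non-constant y.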

## Results

Table 1 Correlation coefficients between S&P 500 changes and the predictors with polynomial and linear regressions

Correlation matrix | Act. S&P | Act. S&P | Sqrs. S&P | Abs. S&P |
---|---|---|---|---|
**Strength distribution** | | | | |
KLD 3 | 0.5628 / 0.6081 | 0.0895 | 0.6705 | 0.6454 |
KLD 6 | 0.5752 / 0.5984 | 0.0837 | 0.6823 | 0.6360 |
KLD 9 | 0.5333 / 0.5334 | 0.0582 | 0.6498 | 0.6725 |
KLD 13 | 0.4794 / 0.4886 | 0.0408 | 0.6182 | 0.6219 |
KLD All | 0.5582 / 0.5635 | 0.1175 | 0.6521 | 0.6587 |
RS 3 | 0.4185 / 0.4455 | 0.0173 | 0.3811 | 0.6630 |
RS 6 | 0.4326 / 0.4845 | 0.0159 | 0.3838 | 0.6615 |
RS 9 | 0.4196 / 0.4855 | 0.0093 | 0.3750 | 0.6526 |
RS 13 | 0.4385 / 0.4674 | 0.0024 | 0.4077 | 0.6685 |
RS All | 0.4065 / 0.4447 | 0.0134 | 0.3649 | 0.6552 |
Mean | 0.4189 / 0.4583 | 0.0129 | 0.3640 | 0.6536 |
Variance | 0.1487 / 0.1641 | 0.0175 | 0.3548 | 0.6407 |
Skewness | 0.5471 / 0.5610 | 0.0644 | 0.6265 | 0.5716 |
Kurtosis | 0.5425 / 0.5581 | 0.0192 | 0.4047 | 0.6532 |
**Eigenvector centrality** | | | | |
Mean | 0.1526 / 0.1795 | 0.0099 | 0.2591 | 0.5351 |
Median | 0.3168 / 0.3200 | 0.0110 | 0.2720 | 0.5509 |
Maximum | 0.4272 / 0.4494 | 0.0068 | 0.2175 | 0.4836 |
**Betweenness centrality** | | | | |
Mean | 0.0435 / 0.0482 | 0.0111 | 0.2797 | 0.5482 |
Median | 0.0288 / 0.0289 | 0.0102 | 0.0089 | 0.0332 |
Maximum | 0.0288 / 0.0288 | 0.0162 | 0.2445 | 0.4350 |
**Network modularity** | | | | |
Modularity | 0.2973 / 0.2982 | 0.0082 | 0.1503 | 0.3906 |

A low *KLD* means that the node strengths of a network at a specific time were distributed similarly to the average prior distribution, and a hike means that the strength distribution of the network deviates drastically from the average prior distribution, with more high-strength nodes. This reflects that when *KLD* hikes, the movements of the stocks in the networks are more correlated with each other. As we can see in this graph and the scatter plot (see Fig. 4), KLD has a clear positive correlation with the absolute values of S&P 500 changes, and shows a quadratic relationship with the actual changes in the S&P 500, having both negative and positive relationships.

Among the statistical measures of the strength distributions, skewness was the top performer in the Act. S&P 500 * case (R-squared = 0.5471), and kurtosis was the top performer in the Abs. S&P 500 ** case (R-squared = 0.6532). The mean and variance performed much better in predicting the absolute changes in the S&P 500 than in predicting the actual changes.

We tested the average, median, and maximum values of the eigenvector and betweenness centralities. They showed correlations with the actual and absolute values of the S&P 500 changes. In particular, when predicting the absolute values of the changes in the S&P 500, they showed an R-squared of about 0.5. This relationship was not strong, but it still indicated that the dynamics of the S&P 500 index can be explained by the structural properties of the stock market networks: when the nodes in the networks were clustered, grouped or tied more strongly, large changes in the S&P 500 index followed. Modularity, however, performed worse than the other metrics, showing no clear relationship with the S&P 500 changes.

We formed three combined predictors as linear combinations of two metrics: $C_1 = a\,KLD_3 + (1-a)\,S$, $C_2 = a\,KLD_6 + (1-a)\,S$, and $C_3 = a\,KLD_9 + (1-a)\,RS_{13}$, where $S$ is the skewness of the strength distribution. We optimized the two constants, $a$ and $1-a$, to maximize the correlation between the combined metrics and the S&P 500 changes, using a grid search to find the optimal value of $a$. We found that all three metrics have statistically significant correlations with the S&P 500 changes. Table 2 shows the correlations between the S&P 500 changes and the three predictors, as well as the optimal value of the constant $a$.
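
The grid search over the constant *a* can be sketched as below. The step size of 0.001, the use of the absolute Pearson correlation as the objective, and the sample series are assumptions for this illustration; whether the metrics were rescaled before being combined is not stated in the text, so the sketch combines them as given.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def best_weight(metric_1, metric_2, target, steps=1000):
    """Grid-search a in [0, 1] maximizing |corr(a*m1 + (1-a)*m2, target)|."""
    best_a, best_r = 0.0, -1.0
    for k in range(steps + 1):
        a = k / steps
        combined = [a * u + (1 - a) * v for u, v in zip(metric_1, metric_2)]
        r = abs(pearson(combined, target))
        if r > best_r:
            best_a, best_r = a, r
    return best_a, best_r


# Hypothetical metric series and absolute index changes for four networks.
kld = [1.0, 2.0, 3.0, 4.0]
rs = [4.0, 1.0, 3.0, 2.0]
abs_change = [1.0, 2.0, 3.0, 4.0]
print(best_weight(kld, rs, abs_change))
```

In this toy case the first metric already tracks the target perfectly, so the search puts all the weight on it; with real data the optimum would generally fall strictly between 0 and 1.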

$C_3$ performed the best in predicting the amplitudes of the S&P 500 changes. It showed an R-squared of 0.7301 in the linear regression with the optimized constant $a = 0.834$. $C_2$ and $C_1$ in the polynomial regressions explained about 64% of the variance in the actual values of the S&P 500 changes. As this result shows, we could predict the amplitude better by working with the absolute values of the S&P 500 changes with linear regression than by working with the actual values with polynomial regressions.

Table 2 Combined predictors vs. S&P 500 changes

Combinations | Constant (*a*) | Linear | Polynomial (k=2) | Polynomial (k=3) |
---|---|---|---|---|
KLD3 + Skewness | 0.798 | - | - | 0.6481 |
KLD6 + Skewness | 0.767 | - | 0.6409 | - |
KLD9 + RS13 | 0.834 | 0.73012 | - | - |

Table 3 Mean squared errors of ARIMA models

Models | MSE | Models | MSE |
---|---|---|---|
ARIMA(1,1,1) | 25.19 | ARIMA(1,1,0) + KLD | 26.17 |
ARIMA(1,1,0) + RS | 20.27 | ARIMA(1,1,0) + Skewness | 26.32 |
ARIMA(1,1,0) + Kurtosis | 27.02 | ARIMA(1,1,0) + Mean | 25.7 |
ARIMA(1,1,0) + Variance | 25.66 | ARIMA(1,1,1) + Modularity | 25.23 |
ARIMA(1,1,0) + Eigenvector cent | 26.95 | ARIMA(1,1,0) + Betweenness cent | 24.4 |

To sum up, some of the network measurements we built in this research have forecasting power for the amplitudes of S&P 500 changes. KLD, RS and skewness of the strength distributions were the top performers, with significant correlations of over 0.64. Also, adding RS to the ARIMA model improved the model performance by about 20%.
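
As a quick check of the improvement figure, assuming it compares the reported MSE of the plain ARIMA(1,1,1) model (25.19) with that of ARIMA(1,1,0) + RS (20.27), the relative reduction in mean squared error works out as:

$$
\frac{25.19 - 20.27}{25.19} = \frac{4.92}{25.19} \approx 0.195
$$

that is, a reduction of roughly 19.5%, consistent with the stated figure of about 20%.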

## Discussions and conclusions

In this study, we demonstrated a new approach to forecasting future S&P 500 changes using network science, and showed that the predictors we built were strongly correlated with the amplitude of the S&P 500 changes. This result arose because we were able to capture the market dynamics by analyzing the S&P 500 networks. Networks showing high connectedness among all the companies (nodes) mean that the stocks are more highly correlated. Stocks are highly correlated when they are bought or sold together. From a whole-market point of view, when most of the stocks are moving together, the index is likely to move in the same direction as the majority of stocks. Our ARIMA models were improved by adding RS. The proposed method still needs to be tested and validated through out-of-sample evaluation, which is beyond the scope of this paper and is part of our future research. The results might be used as a new indicator that could advise financial policy makers in dealing with huge sudden market fluctuations that bring the market serious problems. The results can also be used by quantitative investors to improve their existing ARIMA models. In this paper, we tested the usefulness of network measurements with ARIMA models only. In the future, we will investigate whether the network measurements help improve other financial market time-series forecasting models, such as machine learning models.

In this study, we were able to obtain the stock price records of 475 of the 504 companies. It might be possible to improve the performance of the models with the price records for all the companies. Another improvement might be achieved by using finer data, such as price records at half-minute intervals or finer. With finer data sets, we might be able to improve the model for forecasting one hour into the future, and also to forecast the nearer future, such as 30 min or 15 min ahead.

## Declarations

### Availability of data and materials

Under the terms of Google Finance (https://www.google.com/googlefinance/disclaimer/), the data set may not be redistributed for any personal or enterprise use.

### Authors’ contributions

MK and HS equally contributed to this work. Both authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

- Atsalakis, G, Valavanis K (2009) Forecasting stock market short-term trends using a neuro-fuzzy based methodology. Expert Syst Appl 36:10696–10707. doi:10.1016/j.eswa.2009.02.043.
- Creamer, GG, Ren Y, Nickerson J (2013) Impact of dynamic corporate news networks on assets return and volatility In: Soc Comput (SocialCom) 2013 ASE/IEEE International Conference. doi:10.2139/ssrn.2196572. https://ssrn.com/abstract=2196572.
- Christie, WG, Corwin SA, Harris JH (2002) Nasdaq trading halts: The impact of market mechanisms on prices, trading activity, and execution costs. J Finance 57:1443–1478. doi:10.1111/1540-6261.00466.
- Curme, C, Tumminello M, Mantegna RN, Eugene SH, Kenett DY (2015) How Lead-Lag correlations affect the intraday pattern of collective stock dynamics. Off Financ Res Work Pap No15–15. doi:10.17016/feds.2015.090.
- DiCesare, G (2006) Imputation, estimation and missing data in finance. UWSpace. http://hdl.handle.net/10012/2920.
- David, E, de Prado MM, O’Hara M (2012) Flow toxicity and liquidity in a high frequency world. Rev Financial Stud 25:1457–1493. doi:10.1093/rfs/hhs053.
- Frank, Z (2010) High-frequency trading, stock volatility, and price discovery. doi:10.2139/ssrn.1691679. https://ssrn.com/abstract=1691679.
- Guresen, E, Kayakutlu G, Daim TU (2011) Using artificial neural network models in stock market index prediction. Expert Syst Appl 38:10389–10397. doi:10.1016/j.eswa.2011.02.068.
- Huang, W, Nakamori Y, Wang S-Y (2004) Forecasting stock market movement direction with support vector machine. Comput Oper Res 32:2513–2522. doi:10.1016/j.cor.2004.03.016.
- Huang, W-Q, Zhuang X-T, Yao S (2009) A network analysis of the Chinese stock market. Physica A: Stat Mech Appl 388:2956–2964. doi:10.1016/j.physa.2009.03.028.
- Junior, LS, Mullokandov A, Kenett DY (2015) Dependency relations among international stock market indices. J Risk Financ Manag 8(2):227–265. doi:10.3390/jrfm8020227.
- Kirilenko, A, Kyle AS, Samadi M, Tuzun T (2017) The flash crash: high-frequency trading in an electronic market. J Finance. doi:10.1111/jofi.12498.
- Kim, K-J, Han I (2000) Genetic algorithms approach to feature discretization in artificial neural networks for the prediction of stock price index. Expert Syst Appl 19:125–132. doi:10.1016/S0957-4174(00)00027-0.
- Li, W (1990) Mutual information functions versus correlation functions. J Stat Phys 60:823–837. doi:10.1007/BF01025996.
- Levy-Carciente, S, Kenett DY, Avakian A, Stanley HE, Havlin S (2015) Dynamical macroprudential stress testing using network theory. J Banking Finance 59:164–181. doi:10.1016/j.jbankfin.2015.05.008.
- Menkveld, AJ, Yueshen BZ (2016) The Flash Crash: A Cautionary tale about highly fragmented markets. doi:10.2139/ssrn.2243520. https://ssrn.com/abstract=2243520.
- Namaki, A, Shirazi AH, Raei R, Jafari GR (2011) Network analysis of a financial market based on genuine correlation and threshold method. Physica A: Stat Mech Appl 390:3835–3841. doi:10.1016/j.physa.2011.06.033.
- Neil, J, Guannan Z, Eric H, Jing M, Amith R, Spencer C, Brian T (2012) Financial Black Swans driven by ultrafast machine ecology. doi:10.2139/ssrn/2243520. https://ssrn.com/abstract=2243520.
- Schumaker, RP, Chen H (2009) Textual analysis of stock market prediction using breaking financial news: The azfin text system. ACM Trans Inf Syst 27(2). doi:10.1145/1462198.1462204.
- Schumaker, RP, Zhang Y, Huang C-N, Chen H (2012) Evaluating sentiment in financial news articles. Decis Support Syst 53:458–464. doi:10.1016/j.dss.2012.03.001.
- Sun, X-Q, Shen H-W, Cheng X-Q (2014) Trading network predicts stock price. Sci Reports 4(3711). doi:10.1038/srep03711.
- Tse, CK, Liu K, Lau FCM (2010) A network perspective of the stock market. J Empirical Finance 17:659–667. doi:10.1016/j.jempfin.2010.04.008.
- U.S. Commodity Futures Trading Commission, USSEC (2010). https://www.sec.gov/news/studies/2010/marketevents-report.pdf.
- Vinh, NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: Is a Correlation for Chance Necessary? J Mach Learn Res 11:2837–2854.
- Wong, W-K, Manzur M, Chew B-K (2010) How rewarding is technical analysis? evidence from Singapore stock market. Appl Financial Econ 543–551.