Head back to my homework portfolio here.
To assess this question, I will perform a sentiment analysis on online posts and compare how the results vary over time with fluctuations in the S&P 500's closing price and trading volume. I chose to collect my text data from Reddit because it makes its data particularly accessible through the Python Reddit API Wrapper (PRAW).
The above chart shows my first data pull for this project; the code can be found in the Colab notebook or on GitHub. The challenge I faced here was that standard PRAW cannot restrict a fetch to a given timeframe, and it will only pull around 900 posts per run. I therefore switched to Pushshift PRAW; however, this API was limited to 100 posts per run, which I circumvented with a loop. An example of this notebook, including outputs for one day, can be found on my GitHub here. After fetching the posts, the notebook uses TextBlob to assign a polarity value to each, where 1 is entirely positive and -1 is entirely negative. Due to the large amount of data being processed, I ran this loop in one-month intervals and then merged the results in Pandas; see the code on GitHub. Once I had collected data for the whole year, I collapsed it to daily values in Stata (see the Do file here), converted the dates to a Vega-friendly format in Colab, and plotted the year's time series below:
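The pagination workaround and the daily collapse can be sketched as follows. Here `fetch_page` is a hypothetical stand-in for the real Pushshift call, and the field names are illustrative; this is a sketch of the approach under those assumptions, not the notebook's exact code.

```python
import pandas as pd

def fetch_all_posts(fetch_page, start_utc, end_utc, page_size=100):
    """Page through an API capped at `page_size` posts per call (as Pushshift
    was) by advancing the `after` cursor past the last timestamp returned.
    `fetch_page` is a hypothetical stand-in for the real API call."""
    posts, after = [], start_utc
    while after < end_utc:
        page = fetch_page(after=after, before=end_utc, limit=page_size)
        if not page:  # no more posts in the window
            break
        posts.extend(page)
        after = page[-1]["created_utc"] + 1  # resume just past the last post
    return posts

def collapse_daily(df):
    """Average per-post polarity to one value per day (the collapse step,
    here done in pandas rather than Stata)."""
    df = df.copy()
    df["date"] = pd.to_datetime(df["created_utc"], unit="s").dt.date
    return df.groupby("date")["polarity"].mean()
```

In the real run, `fetch_page` would wrap the Pushshift search and each post's polarity would come from TextBlob before the collapse.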
As my aim was to test whether the sentiment displayed in the forum over time influences changes in the market over time, I chose to test for Granger causality, that is, whether the time series of sentiment is useful in predicting the time series of the market index. For each direction, the null hypothesis is that one series does not Granger-cause the other. In order to perform this test, both of my time series needed to be stationary.
Predictably, the series for S&P 500 Index Values was not stationary, and required first-differencing: see the Colab notebook, or view the code on Github.
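First-differencing in pandas is a one-liner; the prices below are made up for illustration, not taken from the project's data.

```python
import pandas as pd

# Illustrative closing prices on business days (not the project's data)
close = pd.Series([4000.0, 4010.5, 3995.2, 4020.1, 4018.7],
                  index=pd.date_range("2022-01-03", periods=5, freq="B"))

# First difference: day-over-day change; the leading NaN is dropped
d_close = close.diff().dropna()
print(d_close.iloc[0])  # 10.5
```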
My time series for sentiment already appeared stationary according to an augmented Dickey-Fuller test; see the code.
Both series now had Dickey-Fuller p-values reported as 0.000, meaning that if either series were in fact non-stationary, data this extreme would occur with probability below 0.05%. I therefore treated both series as stationary and proceeded to the Granger test.
An extended version of my S&P fetch, which includes this Granger test, can be found on my GitHub. The smallest p-value in the final Granger causation matrix was 0.2723, too large to reject the null hypothesis, so I found no evidence that sentiment in r/stocks Granger-causes changes in the S&P 500 closing price.
This result is unsurprising considering that a regression of the raw values yields an R-squared of effectively zero.
Repeating the analysis for trading volume rather than closing price, however, produced a minimum Granger-test p-value of 0.0023, small enough to reject the null hypothesis of no Granger causality at the 5% significance level. This indicates that sentiment in the subreddit is Granger-causal of changes in S&P 500 trading volume; see the code on GitHub.
While I was not able to show Granger causality for stock prices, the Granger-causal relationship for volumes indicates more broadly that text data is incredibly valuable. Already, there are automated traders which incorporate opinion dynamics into their decision making. My first steps to continue this work will be to extend the breadth of my text data to current events and other media such as Twitter, in order to make it more applicable to general market indices.
Data containing polarity scores for every post, prior to daily averaging, is available on my GitHub.
PRAW: read the docs
Pushshift PRAW: read the docs
TextBlob: read the docs
David Little at Arizona State University: Reddit Sentiment