Rick Steinert⋆ and Saskia Altmann1
⋆Europa-Universität Viadrina, Große Scharrnstraße 59, 15230 Frankfurt (Oder), Germany, steinert@europa-uni.de, ORCID: 0000-0002-1680-9811, corresponding author 1Europa-Universität Viadrina, Große Scharrnstraße 59, 15230 Frankfurt (Oder), Germany, altmann@europa-uni.de, ORCID: 0009-0009-7308-4835
September 1, 2023
Abstract
This paper investigates the potential improvement of the GPT-4 Large Language Model (LLM) in comparison to BERT for modeling same-day daily stock price movements of Apple and Tesla in 2017, based on sentiment analysis of microblogging messages. We recorded daily adjusted closing prices and translated them into up-down movements. Sentiment for each day was extracted from messages on the Stocktwits platform using both LLMs. We develop a novel method to engineer a comprehensive prompt for contextual sentiment analysis which unlocks the true capabilities of modern LLMs. This enables us to carefully retrieve sentiments, perceived advantages or disadvantages, and the relevance towards the analyzed company. Logistic regression is used to evaluate whether the extracted message contents reflect stock price movements. As a result, GPT-4 exhibited substantial accuracy, outperforming BERT in five out of six months and substantially exceeding a naive buy-and-hold strategy, reaching a peak accuracy of 71.47% in May. The study also highlights the importance of prompt engineering in obtaining desired outputs from GPT-4’s contextual abilities. However, the costs of deploying GPT-4 and the need for fine-tuning prompts highlight some practical considerations for its use.
Keywords: Sentiment, GPT-4, Prompt Engineering, Microblogging, Stock Price
Statement of Funding: The authors report they have not received any financial funding.
Statement of Disclosure: The authors report there are no competing interests to declare.
1 Introduction and main idea
The intersection of artificial intelligence, sentiment analysis, and financial markets has been a fertile ground for research over the last decade. As the ability of models to understand and classify human sentiments has improved, so has the allure of using these capabilities to predict stock price movements. In this paper, we harness the power of GPT-4 to evaluate the sentiments of financial tweets and examine their effect on same-day stock price movements for Apple and Tesla. We additionally compare the results to sentiments derived by BERT.
Microblogging, exemplified by platforms like Stocktwits or Twitter, is a digital communication form where users share short, frequent updates. Thus, it allows for the rapid dissemination of information and sentiment, often faster than traditional news outlets. Events like the GameStop stock surge of 2021 illustrated how microblogging communities can rally around and influence stock prices, demonstrating the power of collective sentiment in real-time financial decisions.
Sentiment analysis can be applied to a variety of tasks, such as assessing customer feedback, analysing reactions to breaking news or perceptions of social media content. It can provide insights into people’s feelings about news related to stock companies and can help determine how such news is received by the public, whether it’s perceived as positive or negative. These perceptions might be linked to stock price movements. Given the high frequency of news and social media content in this digital age, the importance of computer-driven sentiment analysis grows. Computers can process content instantly, which is crucial for identifying relevant content both quickly and accurately. As such, enhancing sentiment analysis techniques to detect nuances like sarcasm and irony through context-based analysis becomes even more vital.
In our study, we assume that a higher correlation between investor sentiment and same-day stock return movements indicates a better interpretation of the social media feed. Therefore, we assume that finding a stronger predictability between same-day price movements and the sentiment of social media feed suggests that the utilized sentiment model (GPT-4 vs BERT) is more effective in capturing investor sentiment. To focus our study solely on the performance of modeling microblogging sentiment towards stock movements, we remove the potentially diluting effect of an uncertain link between past sentiments and future returns. Sentiment derived from tweets might possess a limited shelf-life, implying that today’s hot topics or sentiments may not necessarily influence the stock’s movement tomorrow or the day after. By predicting same-day movements, we capture the most immediate and potent effects of sentiment.
Despite the abundance of research articles on forecasting specific returns, it remains evident that two schools of thought prevail. One asserts the possibility of gaining abnormal returns through certain investment strategies, while the other contends that the efficient market hypothesis is fundamentally true, resulting in such returns being unsustainable over the long term. Supporting our approach, the Efficient Market Hypothesis (EMH) (Fama (1970)) posits that stock prices rapidly adjust to new information, making it challenging to predict future price movements based solely on historical data. Consequently, modeling same-day reactions is more feasible than attempting to forecast longer-term changes. Given that we have employed large and closely monitored companies like Apple and Tesla, we assume that the markets are quite efficient in these domains. Thus, detecting a possible link between future returns and current sentiments would likely require further steps in these specific contexts. However, should such a link be established—meaning for stocks lacking efficient markets—the methodology we have developed here would be adequately applicable. It would aid in exploring these patterns more comprehensively. To reiterate, our main intention is to demonstrate the potential for a connection between sentiment and capital stock price movement, contingent upon an intrinsic link between the two factors and showing how investors can capture this link using a sophisticated prompt.
Following this introduction, section 2 provides an overview of the relevant literature. In section 3, we detail the model setup, our methodology, message handling, prompt engineering, and our approach to stock data and sentiment matching. Section 4 presents the results of our analysis, highlighting key findings. Finally, in section 5, we engage in a thorough discussion of our results and conclude the article with major takeaways and potential future research directions.
2 Review of the literature
Sentiment analysis in stock price prediction has been explored through many approaches and techniques. To offer a clear comparative perspective, Table 1 encapsulates key details from the highlighted studies. It summarizes the data sources, time spans, sentiment methods, performance metrics, and prediction models utilized by various authors in recent years. This tabulation provides an organized reference, emphasizing the diversity and nuances in methodologies adopted by researchers in this field.
The field of sentiment analysis has seen significant advancements in recent years, with researchers exploring various models and methodologies to understand and predict stock market movements. The adoption of sophisticated models for extracting investor sentiment is a trend that has piqued the interest of numerous scholars. A noteworthy contribution in this context is the work by Leippold (2023), which emphasized the significance of using more advanced sentiment models. Specifically, they showcased how sentences manipulated using GPT-3 can yield semantically incoherent results that are, however, easily recognized by humans. By instructing GPT-3 to generate synonyms for negative words and then using these to rephrase sentences, they evaluated the robustness of models. They underscore FinBERT’s resilience against adversarial attacks, especially when compared to traditional keyword-based methods.
While there’s a surge in studies employing cutting-edge models for sentiment analysis, there’s no shortage of holistic overviews and thorough reviews that trace the evolution of this field over time. Wankhade et al. (2022) provided an overview of sentiment analysis, offering a comprehensive review of the subject, its methodologies, applications, and developments in the field. The review of Rodríguez-Ibáñez et al. (2023) encompassed both traditional methods and newer models, including BERT and GPT-2/3, spotlighting their roles and advancements in the domain of sentiment analysis.
Within the ambit of sentiment analysis, GPT models, particularly the more recent GPT variants, have emerged as favorites in contemporary studies. Kheiri and Karimi (2023) employed the GPT-3.5 Turbo model to undertake sentiment analysis on social media posts. For comparative analysis and benchmarking, the authors utilized RoBERTa. Their study assigned the model the specific role of a social scientist. Belal et al. (2023) also used the GPT-3.5 Turbo variant for their sentiment analysis on Amazon reviews. They revealed a major enhancement in accuracy. As benchmark models for sentiment classification, they used VADER and TextBlob. However, the specifics of the prompt they employed remained unspecified in their publication. Lopez-Lira and Tang (2023) employed a range of AI models, including GPT-1, GPT-2, GPT-3, GPT-4, and BERT, to predict stock market returns based on news headlines. Interestingly, they found that GPT-1, GPT-2, and BERT models are not particularly effective in accurately predicting returns. Further, they used a regression model to predict future returns. Meanwhile, Zhang et al. (2023) predicted Chinese stock price movements using sentiment analysis, employing models such as ChatGPT, Erlangshen RoBERTa, and Chinese FinBERT.
Besides more recent work, earlier studies have also explored the correlation between public mood and stock market movements. Bollen et al. (2011) analyzed 9.9 million tweets to uncover a correlation between the Dow Jones Industrial Average and public mood. They employed OpinionFinder to determine daily positive vs. negative sentiments and the Google-Profile of Mood States to assess moods in six dimensions. Their predictive model utilized Granger causality analysis and neural networks. Building upon Bollen’s work, Mittal and Goel (2012) crafted a unique mood analysis based on four classes (calm, happy, alert, kind) derived from the POMS questionnaire. Alongside the Granger causality test, they integrated machine learning models like SVM and neural networks for enhanced prediction accuracy. Rao and Srivastava (2012) demonstrated a significant correlation between stock prices and Twitter sentiment, including the DJIA, NASDAQ-100, and 13 other major technology stocks. In addition to assessing a “bullishness” sentiment metric, they computed an agreement measure among positive and negative tweets. Their findings revealed a positive correlation between stock returns and concurrent-day bullishness. Oliveira et al. (2017) utilized Twitter data to predict returns, volatility, and trading volume of indices and portfolios, analyzing 31 million tweets related to 3,800 US stocks. Their methodology incorporated rolling window and regression models. For sentiment analysis, they employed a lexicon-based approach. Similar to Rao and Srivastava (2012), they computed a Bullish/Bearish ratio, an agreement ratio, and a variation measure to capture the nuances in public sentiment. Ranco et al. (2015) examined Twitter sentiments concerning 30 stock companies from the DJIA. Their study revealed a low Granger causality and Pearson correlation between stock prices and Twitter sentiments.
To determine the sentiment, 100,000 tweets were manually annotated, with 6,000 of them annotated twice to gauge expert consensus. Support Vector Machines (SVM) were utilized as the primary classification method. Coqueret (2020) looks at the link between company sentiment and future returns and does not find news sentiment to be a powerful predictor. Hamraoui and Boubaker (2022) explored the relationship between Twitter sentiments and the Tunisian financial market. The study found a low Pearson correlation and Granger causality between the two. Matthies et al. (2023) employed Twitter data to investigate its correlation with subsequent-day stock volatility. For this, they used linear regression models and did not uncover a significant relationship between sentiment, as measured by the VADER sentiment analysis tool, and stock performance. Their dataset comprised approximately 2 million tweets associated with four stocks known for high price volatility. Notably, their findings underscored that sentiment typically did not enhance the predictive power of their models significantly. In contrast, Smailović et al. (2013) indicated that sentiment trends could predict price changes days ahead. They calculated a daily “positive sentiment ratio” by comparing the number of positive tweets to the total tweets, and looked at how this ratio changed daily. Si et al. (2013) leveraged topic-based sentiments from Twitter to forecast the stock market. Aiming for a one-day-ahead prediction of the stock index, they employed a Vector Autoregression (VAR) model for predictions. Their approach started with topic modeling using the Dirichlet Processes Mixture (DPM) to determine the number of topics from daily tweets. Once topics were identified, they crafted a sentiment time series based on these topics. Finally, both the stock index and the sentiment time series were integrated into an autoregressive framework for predictions. Bing et al.
(2014) explored the potential of public sentiment in predicting the stock prices of specific companies. Their study, spanning 30 companies from NASDAQ and NYSE, analyzed 15 million tweets. Pagolu et al. (2016) investigated the correlation between stock price movements of a company and the public opinions expressed in tweets about that same company. To analyze sentiment, they employed techniques such as word2vec and n-gram models. Jaggi et al. (2021) introduced the FinALBERT and ALBERT models in their examination of Stocktwits messages. They categorized the messages based on same-day stock price changes and used various machine learning techniques to predict these categories.
Some inquiries have shifted their focus from merely the content to the credibility and influence of certain Twitter personalities or financial circles in sentiment analysis. This underscores not just the essence of the content but also the significance of its originator. Yang et al. (2015) focused on identifying relevant Twitter users within the financial community. The study then found a correlation between a weighted sentiment measure using messages from these essential users and major financial market indices. Groß-Klußmann et al. (2019) explored the connection between Twitter sentiment and stock returns. They focused on expert users whose tweets predominantly revolved around financial topics. For sentiment analysis, they utilized a dictionary-based approach. Sul et al. (2017) looked at the feelings expressed in 2.5 million Twitter posts about individual S&P 500 companies and compared this to the stock market performance of those companies. They found that tweets from users with fewer than 171 followers (which was the median) that weren’t retweeted had a big effect on a company’s stock performance the next day, and also 10 and 20 days later. They used the Harvard-IV dictionary to analyze the sentiments in the tweets.
The timeframe of the sentiment analysis is also an important factor. Recent global events, such as pandemics, have influenced the direction of research. Valle-Cruz et al. (2022) studied the relationship between Twitter sentiment and the performance of financial indices during pandemics, specifically focusing on H1N1 and COVID-19. They considered both fundamental and technical indicators. To gauge sentiment, they used a lexicon-based approach, and their analysis was centered on tweets from finance-focused Twitter accounts. Katsafados et al. (2023) examined tweets related to the COVID pandemic, utilizing the VADER sentiment analysis approach. They measured the degree of positivity and negativity in tweets and linked this information to the stock market. Their results indicated that, in the short term, heightened positivity is linked to increased returns, whereas negativity was associated with diminished returns.
In addition to sentiment analysis, some studies have introduced other methods to consolidate news for stock market forecasts. Jiang et al. (2021) introduced an innovative approach to aggregating news. They constructed portfolios based on news-driven and non-news driven returns, buying when news returns were high, selling when low, and holding for 5 days. Bustos and Pomares-Quimbaya (2020) presented a comprehensive overview of stock movement prediction. They highlighted that most papers in their study used market information or technical indicators as input variables, with some also incorporating news, blogs, or social network data. Sousa et al. (2019) employed BERT for sentiment analysis of news articles and predicted future DJI index movements. They also manually labelled news data as positive, neutral, or negative, and fine-tuned the BERT model. Additionally, they looked at the hours of sentiment before opening time and predicted subsequent DJI day trends.
Lastly, the integration of various data sources, including social media texts and technical indicators, has been a focal point for many researchers. Ji et al. (2021) incorporated both social media texts and technical indicators into their predictive model. They utilized daily stock price data. Rather than deriving a sentiment measure from the textual content, they employed Doc2Vec to extract text features. To effectively process multiple posts from a single day, they aggregated the data into daily groupings.
Table 1: Structured Literature Overview
3 Model setup
3.1 BERT and GPT
The development of deep learning models, particularly the Transformer architecture, has marked a significant shift in the way textual data is processed. At the heart of this transformation lie two influential models, GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers). These models harness the power of the Transformer architecture, developed by Google AI and introduced in the paper of Vaswani et al. (2017). The main objective of the Transformer model was to enhance and outperform existing text processing paradigms. Traditional models often processed text word-by-word, leading to limitations. Transformers, on the other hand, allow entire sentences to be input at once, thus enabling not only parallelization but also the consideration of word context. Vaswani et al. (2017) also introduced the concept of self-attention, which enables the model to process words in the context of their surrounding words. In their original Transformer model, the architecture is divided into an encoder and a decoder. However, variations of the Transformer architecture might use only one of these components or both, depending on the specific application. For example, BERT leverages the encoder (Devlin et al. (2018)), while GPT uses a multi-layer decoder (Radford et al. (2018)). BERT has two predominant variants: BERT-Base, composed of 12 encoder blocks, and BERT-Large, using 24 encoder blocks (Devlin et al. (2018)). During its pre-training phase, BERT is trained on masked token prediction and next sentence prediction. This training harnesses data from BooksCorpus (comprising 800M words) and English text passages from Wikipedia (encompassing 2,500M words). Devlin et al. (2018) showcased the model’s adaptability by fine-tuning BERT on eleven NLP tasks.
GPT was first introduced by OpenAI in 2018 with GPT-1 and has since seen multiple iterations, each improving upon its predecessor in terms of model size, training data, and the tasks it can handle. It has been trained on a diverse range of datasets including Wikipedia articles, books, and internet data. The architectural divergence between BERT and GPT is primarily based on directionality. BERT operates bidirectionally: it comprehends a word by analyzing its preceding and succeeding context. GPT, in contrast, is unidirectional, primarily forecasting a word based on preceding words. GPT-3 (Brown et al. (2020)) uses the same model architecture as GPT-2 (Radford et al. (2019)). GPT-4, released in 2023, is a successor to GPT-3 and builds on the same foundational ideas. Like its predecessors, GPT-4 is a generative model from OpenAI built upon the Transformer architecture. However, its distinguishing features lie in its sheer scale, improved fine-tuning capabilities, and broader application potential. The model excels in few-shot learning, adapting to specialized tasks with minimal examples. With a broader training dataset and potential architectural tweaks, GPT-4 sets a new standard in the domain of natural language processing. Further information about GPT-4 can be found in the technical report on the OpenAI website. For our GPT sentiment analysis, we engage the GPT-4 API with a customized prompt, which we describe in section 3.2.2 of this study.
3.2 Data and Methodology
To investigate the potential improvement of the GPT-4 LLM in comparison to BERT, we have decided to study the daily stock price movements of the companies Apple and Tesla for the year 2017. For that, we have recorded adjusted closing prices $C_t$ for each day $t$ and company of the study from the Yahoo Finance platform. These are translated into up-down movements:

$$U_t = \begin{cases} 1 & \text{if } C_t \geq C_{t-1} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
This results in overall T = 502 combined daily observations, of which Apple (App) contributes TApp = 251 and Tesla (Tes) TTes = 251 observations. To create a sentiment for each day of these observations, we have used the microblogging website Stocktwits. Stocktwits is a social media platform designed for investors, traders, and individuals interested in the stock market and financial markets in general. It functions as a microblogging platform where users can share their thoughts, opinions, and insights about various stocks, cryptocurrencies, and other financial instruments. Users post short messages, often referred to as “tweets”, containing stock tickers, charts, news articles, and commentary on market trends. These messages are public and can include text, images, and links. The Stocktwits messages have been obtained from the researcher team of Jaggi et al. (2021) to ensure comparability between research in the field. The selected Stocktwits messages for Apple and Tesla have then been matched to the stock price data by date. For the whole time period of the year 2017, we have extracted 214,548 messages, which form the foundation for the sentiment analysis. Each of these messages has then been analyzed to extract a sentiment using the two LLMs.
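The translation of adjusted closes into up-down movements can be sketched as follows; this is a minimal illustration in which the function name, the toy prices, and the convention that ties count as upward are our own assumptions, not taken from the paper:

```python
# Minimal sketch: translate adjusted closing prices C_t into up-down
# movements U_t (1 = up, 0 = down). Ties are counted as up here,
# which is an assumption; the paper does not state its tie convention.
def up_down_movements(closes):
    """Return U_t = 1 if C_t >= C_{t-1}, else 0, for t = 1..T-1."""
    return [1 if curr >= prev else 0
            for prev, curr in zip(closes, closes[1:])]

closes = [170.2, 171.5, 170.9, 172.3]  # hypothetical adjusted closes
print(up_down_movements(closes))       # [1, 0, 1]
```

Applying this per company yields the 251 binary observations each for Apple and Tesla described above.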
3.2.1 Message handling
As both LLMs have different capabilities, we needed to clean the messages differently for each model. In the case of the GPT-4 model, we have removed all URLs or other references to websites, removed duplicate entries of tweets (as they are most likely advertisements), removed all images, and reformatted every tweet to lowercase. To remove duplicates, we have used the full data range of Jaggi et al. (2021) starting from 2010-06-28, so we can more efficiently detect potential advertisements, which are assumed to carry no relevant sentiment. The BERT model needed further cleaning, as we have observed that certain elements prevent it from producing a plausible sentiment. Therefore, in addition to the cleaning steps for GPT-4, we have also removed all hashtags (#-led words), cashtags ($-led words), mentions (@-led words), plain unicode characters (e.g. emoticons) and numbers (e.g. dates, percentages) as well as all special characters for BERT only. In message box 1 you can find an exemplary Stocktwits message compared with the formatting for each model. To extract the sentiment for BERT, we have used the pretrained
weights for BERT called “bert-base-multilingual-uncased-sentiment” from NLP Town, downloadable at www.huggingface.co. This specific variant of BERT was trained on product reviews rated from 1 star (negative) to 5 stars (positive) and is therefore specifically tailored for sentiment analysis. We have not made any further changes to the model. As a result of classifying all messages using this model and its weights, we receive the logits for each sentiment, which can be converted into probabilities for each class from 1 to 5. The class with the highest probability is consequently chosen as the final sentiment classification. For GPT-4, we have used the API model version of August 2023 in combination with the 8k token environment. To remove unnecessary creativity in the LLM’s answers and ensure reproducibility of results, we have set the model parameter “temperature” to 0. However, as GPT is not tailor-made to produce sentiments from messages, we needed to fine-tune the user prompts to come up with a comparable classification, while at the same time utilizing the powers of the LLM beyond simple sentiment analysis.
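The two cleaning pipelines described in this section can be sketched with a few regular expressions. The function names and the exact patterns are our own approximation of the listed steps (URL removal and lowercasing for GPT-4; additionally hashtags, cashtags, mentions, numbers, and special characters for BERT); duplicate and image removal happen at the dataset level and are omitted here:

```python
import re

def clean_for_gpt4(msg):
    """Approximation of the GPT-4 cleaning: strip URLs, lowercase, trim."""
    msg = re.sub(r"https?://\S+|www\.\S+", "", msg)
    return msg.lower().strip()

def clean_for_bert(msg):
    """Additional BERT-only steps: drop hashtags (#), cashtags ($),
    mentions (@), numbers, and all remaining special characters."""
    msg = clean_for_gpt4(msg)
    msg = re.sub(r"[#$@]\S+", "", msg)   # hashtags, cashtags, mentions
    msg = re.sub(r"\d+", "", msg)        # numbers, dates, percentages
    msg = re.sub(r"[^a-z\s]", "", msg)   # emoticons, punctuation, unicode
    return re.sub(r"\s+", " ", msg).strip()

tweet = "$AAPL to the moon!  50% gains #bullish @trader http://example.com"
print(clean_for_gpt4(tweet))
print(clean_for_bert(tweet))  # "to the moon gains"
```

A real pipeline would additionally deduplicate against the full 2010–2017 message history as described above.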
3.2.2 Prompt engineering
In order to access the full capabilities of an LLM, a new field called “prompt engineering” has emerged. The term refers to the practice of designing and refining input prompts (e.g. commands) to guide the output of language models, in order to achieve desired results or improve the model’s performance on specific tasks. It involves the careful crafting of questions, instructions, or other inputs to maximize the quality and relevance of the model’s responses. Determining the correct prompts is therefore essential for sentiment analysis using LLMs. On the one hand, we need the machine to produce concise and comparable statements; on the other hand, we need to allow the machine to use its context-based capabilities to provide high-quality outputs. The former is relevant as each request to such a machine is costly, in terms of CPU/GPU time and money. Each unnecessary input or output token will therefore increase the processing time as well as the money paid, and consequently might render the original purpose of the analysis pointless. Eventually, the results of each prompt need to be summarized in some way, e.g. by averaging, counting, or semantic analysis. Therefore, the otherwise highly appreciated variety in answers is usually not desired in sentiment analysis. Truncating the prompt too much to save resources, however, leads to issues with the quality of the output, which in sentiment analysis stems mostly from the skill to analyze news based on context. Message box 2 outlines this issue.
In the original tweet, we can clearly see a negative general sentiment, indicated by words like “lose” and “fight”. In the results of the imprecise prompt, we can see that this is captured quite well, as the negative sentiment 1 has the highest probability with 70 percent. However, from the perspective of the company Apple this message does not convey bad news. In fact, it should be perceived as good news, as the president losing is a win for Apple in this case. Our prompt captures this ambiguity much more clearly by assigning each sentiment the same probability of 20 percent. Further note that both prompts were designed to explicitly provide the sentiment probabilities without further text, fulfilling the requirement of a concise and comparable prompt. In code box 1 we therefore show our final prompt, which has been sent to the API.
Our prompt has several features which we have found to be relevant for a correct sentiment. These features guarantee the three aforementioned requirements of producing concise, comparable, and contextual results. These are:
• Providing a role for GPT-4: It is of crucial importance to allow the GPT-4 LLM to assess the situation from a specific perspective. For that, specifying a role like “financial analyst” can aid the process of achieving better contextual results. (Line 11)
• Set a subject and goal: The GPT-4 LLM can benefit from further context, if a specific situation or setting is described, which allows for more tailor-made answers. For that, setting an aim for the analysis can tremendously help with the outcome. Here, we have set that potential benefits for a company need to be evaluated. The subject of interest, e.g. Apple, must be defined clearly, especially if multiple words for expressing the subject (e.g. Apple and AAPL) exist. (Line 11)
• Add “catch-all” elements: Even if we are interested in the sentiment alone, it is very beneficial to add further perspectives on the matter. Keep in mind that many of the microblogging tweets will have totally unrelated information or could be sarcastic. Therefore, allowing the LLM to think of the sentiment in terms of advantages for or relatedness to the company improves the prompt even further. (Lines 1,2,3,14)
• Extracting probabilities: Probabilities help the model and the user understand that some decisions are easier than others and that there is very often room for interpretation. These probabilities, which are usually expressed as logits in other LLMs, help clarify these uncertainties. (Line 13)
• Using lists: To shrink the output and therefore the number of tokens required, it is very useful to command GPT-4 to use a specific format. Here, the python syntax of [] for a list helps. This also allows for better comparison of results. (Line 14)
• Specify an alternative: If all efforts are in vain, then GPT needs an alternative. Otherwise, the LLM will very likely produce a text body explaining why it could not come up with an answer. (Line 15)
• Zero shot prompt: Zero shot prompts will use one prompt to immediately access the power of the GPT-4 LLM to achieve a result. This helps to reduce cost, but can lead to deterioration of quality in comparison to few shot prompts, where the machine can see some exemplary results.
Besides that, we have also used specific formatting, e.g. three consecutive double quotes for multi-line strings. For proper guidance on formatting, we refer to the OpenAI course for GPT on the deeplearning.ai platform.
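The features listed above can be combined into a prompt-building sketch like the following. The wording, function name, and message delimiter here are purely illustrative and are not the authors' actual prompt from code box 1; the commented-out request shows how such a zero-shot prompt could be sent with temperature 0 via the OpenAI Python library:

```python
def build_prompt(message, company="Apple", ticker="AAPL"):
    """Assemble an illustrative zero-shot prompt exercising the listed
    features. NOT the authors' prompt; wording is hypothetical."""
    return (
        "You act as a financial analyst. "                                  # role
        f"Evaluate the following Stocktwits message about {company} "
        f"({ticker}) with respect to its benefit for the company. "         # subject and goal
        "First judge whether the message is related to the company at all "
        "and whether it describes an advantage or a disadvantage for it. "  # catch-all elements
        "Then estimate probabilities for the sentiment classes "
        "1 (negative) to 5 (positive). "                                    # probabilities
        "Answer only with a Python list of five probabilities, "
        "e.g. [0.1, 0.1, 0.2, 0.3, 0.3]. "                                  # list format
        "If no assessment is possible, answer NA. "                         # alternative
        f"Message: <msg>{message}</msg>"
    )

prompt = build_prompt("aapl crushes earnings expectations")
# A single zero-shot request could then be sent, e.g.:
# openai.ChatCompletion.create(model="gpt-4", temperature=0,
#                              messages=[{"role": "user", "content": prompt}])
```

Requesting only the probability list keeps the output token count, and hence the per-request cost, small.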
Each of the 214,548 Stocktwits messages has been handled according to section 3.2.1 and then sent to GPT-4 using the prompt described here. The prompt has been adjusted for Tesla by simply replacing the words Apple and AAPL with Tesla and TSLA. As each Stocktwits message was of different character length, and in about 15-20 percent of cases the returned answer was simply “NA”, requests exhibited varying costs. On average, however, each prompt cost roughly 0.01 USD.
3.2.3 Stock data and sentiment matching
The resulting microblogging sentiment probabilities $P^S$ do not follow an equidistant time grid and are of higher frequency than daily; they therefore need to be matched properly to be used for explaining daily price movements. We have therefore decided to aggregate the data using averages as well as counts. We apply the following procedure for each of the companies separately.
For the sentiments, we have classified each message based on its probabilities, where the highest probability determined the final sentiment $s$:

$$s_i = \arg\max_{j} P^S_{ij}, \quad i = 1, \dots, m,$$
where $j$ ranges over all columns of $P^S$ and $m$ is the number of all messages. As can be seen from code box 1, we have sentiments as integers from 1 (negative) to 5 (positive), i.e. $j \in \{1,2,3,4,5\}$. We have then calculated the average of the sentiments $s$ for a time window from 16:00 ET on day $t-1$ to 16:00 ET on day $t$ for every day $t$ in the study and deducted 3 to obtain 0 as the baseline case:

$$\bar{s}_t = \frac{1}{N} \sum_{i=1}^{N} s_i - 3, \qquad (2)$$
where $N$ is the number of messages in the given time window. For the GPT-4 decision of “advantage”, we again have classified the presumed advantage of each message for the company by taking the $m$ provided probabilities $p^a$ from the GPT-4 answers. We then calculate the simple sums of occurrences of advantage $a_t$ and disadvantage $d_t$ during the same time frame as the sentiment $\bar{s}_t$:

$$a_t = \sum_{i=1}^{N} \mathbb{1}\{\text{message } i \text{ classified as advantage}\}, \qquad d_t = \sum_{i=1}^{N} \mathbb{1}\{\text{message } i \text{ classified as disadvantage}\} \qquad (3)$$
This allows us to retrieve sentiments s, advantages a and disadvantages d as regressors to match with the stock price movements Ut from equation (1) for each company, yielding exactly TApp and TTes observations, respectively. We have also constructed the average advantage āt equivalently to (2), but will only use this object for the exploratory data analysis and not for the model, as the combined objects at and dt also carry relevant information on the number of messages per day.
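As a hedged illustration of the matching procedure above — not the authors' implementation, which was written in R — the classification, windowed averaging, and advantage counting might be sketched as follows. The timestamps, probabilities, and the rule that exactly 0.5 counts as neither advantage nor disadvantage are our assumptions for this toy example.

```python
import numpy as np
import pandas as pd

# Illustrative sketch of the aggregation steps: classify each message by its
# highest class probability, average sentiments per 16:00-ET-to-16:00-ET
# window, and count advantage/disadvantage messages per day.
rng = np.random.default_rng(0)

# Toy stand-in for the probabilities PS: one row of 5 class probabilities
# per message; sentiment classes run from 1 (negative) to 5 (positive).
P_S = rng.dirichlet(np.ones(5), size=6)
s = P_S.argmax(axis=1) + 1

# Hypothetical message timestamps (ET) and GPT-4 advantage probabilities.
times = pd.to_datetime([
    "2017-05-01 09:30", "2017-05-01 15:59", "2017-05-01 16:30",
    "2017-05-02 10:00", "2017-05-02 15:00", "2017-05-02 17:00",
])
p_a = np.array([0.9, 0.8, 0.5, 0.2, 0.5, 0.1])

msgs = pd.DataFrame({"time": times, "s": s, "p_a": p_a})
# Shifting by 8 hours turns the 16:00 cutoff into a calendar-day boundary,
# so the 16:30 message on May 1 falls into the window attributed to May 2.
msgs["day"] = (msgs["time"] + pd.Timedelta(hours=8)).dt.normalize()

g = msgs.groupby("day")
s_bar = g["s"].mean() - 3  # average sentiment, shifted so 0 is neutral
# Assumption: p_a > 0.5 counts as advantage, < 0.5 as disadvantage, and
# exactly 0.5 (the undecided GPT-4 case) as neither.
a_t = g["p_a"].apply(lambda p: int((p > 0.5).sum()))
d_t = g["p_a"].apply(lambda p: int((p < 0.5).sum()))
```

By construction, the resulting daily averages lie in [−2, 2], matching the support discussed in the exploratory analysis below.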
For the final modeling part, we now need to stack these regressors to be able to conduct a regression on the full sample with T = 502 observations while allowing for differences in parameters between the companies. Stacking regressors is a common procedure to isolate group-specific effects, see e.g. Otto and Steinert (2023) for a similar application. We apply this to the sentiments s̄App,t of Apple and s̄Tes,t of Tesla as follows:

$$ s_{\mathrm{App}} = \left( \bar{s}_{\mathrm{App},1}, \dots, \bar{s}_{\mathrm{App},T_{\mathrm{App}}}, 0, \dots, 0 \right)^{\top}, \qquad s_{\mathrm{Tes}} = \left( 0, \dots, 0, \bar{s}_{\mathrm{Tes},1}, \dots, \bar{s}_{\mathrm{Tes},T_{\mathrm{Tes}}} \right)^{\top}, \qquad (3) $$

where there are TApp = TTes zero entries per vector. The same procedure is applied to at and dt for each company, resulting in aApp, aTes and dApp, dTes.
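A minimal numerical sketch of this stacking, with toy vectors of length 3 standing in for the actual 251 daily averages per company:

```python
import numpy as np

# Sketch of the stacking in (3): pad each company's regressor with zeros so
# that a single regression over all T = T_App + T_Tes days can carry
# company-specific coefficients. Values below are toy numbers.
s_App_bar = np.array([0.4, -0.1, 0.7])   # Apple daily average sentiments (toy)
s_Tes_bar = np.array([0.0, 0.2, -0.5])   # Tesla daily average sentiments (toy)

s_App = np.concatenate([s_App_bar, np.zeros_like(s_Tes_bar)])
s_Tes = np.concatenate([np.zeros_like(s_App_bar), s_Tes_bar])
```

The same zero-padding would then be applied to the advantage and disadvantage counts.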
Please note that this setup allows us to model the upward or downward movement of Apple and Tesla solely based on the sentiment and contextual analysis of Stocktwits messages of that same day. Detecting a link between the context captured by GPT-4 and the price movements would allow us to infer that this model is indeed sufficient to be applied in a financial context. We therefore clearly want to point out that we do not pursue the goal of proving that there is an exploitable link from microblogging messages to future returns. All following results have been obtained using the programming language R; depictions are created using the package DescTools (Signorell et al., 2017).
4 Results
First and foremost, we have analyzed the outcome of our prompt engineering approach described in sections 3.2.2 and 3.2.3, respectively. Here, we are specifically interested in the addition of the advantage estimates for each message. Figure 1 provides an overview of the different classifications by GPT-4.
It can be seen that during our study period in 2017, most microbloggers on Stocktwits published favorable information about the companies. This is reflected by the fact that almost 50 percent of messages have been classified as advantageous for either Apple or Tesla. In twenty percent of cases, however, GPT-4 could not decide whether a message favors the company or not, and therefore specified the probability for advantage and disadvantage as 0.5. Message box 3 provides more insights for this case:4
In this box we can clearly see that the message does not display any clear favor for the mentioned companies, as it is not clear what "trade idea" is being proposed – it could be buying or selling the stock. GPT-4 captures this by assigning the same probability to advantage and disadvantage. BERT, on the other hand, which is not capable of analyzing the context, simply leans towards a positive sentiment, without understanding that a trade idea alone could be good or bad for the company.
It is furthermore relevant to understand whether our proposed regressor "Advantage" is not simply the same as a positive sentiment, as this would render all efforts of a more sophisticated prompt useless. We have therefore decided to specifically analyze the univariate and bivariate relationship of both, using the previously mentioned average sentiment s̄t and average advantage āt. Keep in mind that, by construction, the average sentiment has support on [−2,2] and the average advantage on [−1,1], with 0 being neither positive nor negative, or neither advantageous nor disadvantageous, respectively. In addition, we analyze how the grouping variable company affects the distribution of sentiment and advantage, as we specifically constructed company-based regressors such as the ones seen in (3). Figure 2 visualizes our analysis:
This figure, containing three subplots, provides various relevant pieces of information. First and foremost, we can see from subfigures (a) and (b) that the distributions of sentiment and advantage are seemingly smooth, with some heavy tails for the average advantage, i.e. some days for which we find very strong implications for an advantage or disadvantage. This is in accordance with stock price movements when they are modeled as returns: it is well known that such returns exhibit heavy tails, see e.g. Cont (2001). Besides that, the bivariate scatterplot shows that sentiment and advantage have a clear and strong correlation, which is to be expected. In most cases, messages directed to the companies' Stocktwits feeds convey an advantage or disadvantage for a company through the tone of the message. Nonetheless, many dots in the scatterplot deviate from the generally linear relation of sentiment and advantage. Message box 4 provides insights into this:
Analyzing this message, it is clear that it is not good news for Apple. However, as there are words like "awarded" and "+8.6", GPT-4 decides on a positive sentiment. Due to its contextual skills, it can nevertheless understand that this message is disadvantageous for Apple, as there seems to be a payment to VHC being due. Interestingly, BERT does capture the negative sentiment, very likely because of the word "disaster" and the fact that for BERT numbers like "+8.6" have been removed. Further analyzing Figure 2 also helps to understand why it is necessary to treat the companies individually in any further modeling. For the data of 2017 it is clear that there is great favoritism for Apple, indicated by the mode of both distributions lying in the positive range. For Tesla, the mode is around 0 for both sentiment and advantage. We therefore conclude that our adjustments in equation (3) are justified.
Other interesting insights into the contextual capabilities of GPT-4 emerge from the fact that it can also use the background knowledge it was trained on. This becomes particularly noteworthy when it comes to finance-based terminology, as shown in message box 5.
This message states that the user is going "long" and has bought "calls", which is finance terminology for buying an asset and betting on increasing stock prices, respectively. Paired with the word "fine", GPT-4 can easily deduce that this message is not only advantageous for Apple, but also indicative of a highly positive emotion towards the stock. This cannot be done using simpler models like BERT, which lack this background knowledge. Thus, BERT only classifies this message as slightly positive.5
Finally, we have investigated the capabilities of both the BERT and GPT-4 LLMs for modeling stock price movements solely based on microblogging sentiment. For that, we have chosen a logistic regression approach using several training and test splits.6 Logistic regression is a statistical method for analyzing datasets in which one or more independent variables determine an outcome. In our case, the outcome is the movement of stock prices, which can be categorized as either "up" or "down". Using the logistic regression model, we can predict the likelihood of a stock price moving up based on our created regressors. For logistic regression, the log-likelihood function ℓ(β) for a given set of coefficients β is defined as:

$$ \ell(\beta) = \sum_{i=1}^{T} \left[ U_i \log(p_i) + (1 - U_i) \log(1 - p_i) \right], $$

where T is the number of observations in the dataset, Ui is the observed stock price movement of Apple or Tesla for the ith observation and pi is the predicted probability of the ith observation being 1, given by the logistic function:

$$ p_i = \frac{1}{1 + \exp(-x_i^{\top}\beta)}, $$

where xi is the vector of regressors for observation i.
The objective in logistic regression is to find the values of β that maximize this log-likelihood function. We use Fisher scoring to obtain the coefficient estimates. For BERT, we only have kBERT = 2 regressors, the stacked average sentiments s^BERT_App for Apple and s^BERT_Tes for Tesla. For GPT-4 we have kGPT = 6 regressors, the average sentiments s^GPT_App and s^GPT_Tes as well as the stacked counts of advantage and disadvantage aApp, aTes, dApp and dTes, see (3) for an explanation. To preserve a possible temporal relationship in the daily stock price movements, we split our data into training and test observations using connected periods of calendar months. We have conducted several splits, where the training period always starts in January 2017 and ends in different months. The test period starts with the following month and always ends in December 2017. This is done to provide information on the stability of the results and to overcome issues with potential seasonal effects in the data. We measure the success of a model in terms of accuracy, a metric for evaluating classification models that represents the proportion of correct predictions among all predictions made. In terms of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), accuracy is given by:

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. $$
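As an illustrative re-implementation on simulated data — the study itself used R's glm — the estimation and evaluation can be sketched as follows, using iteratively reweighted least squares (IRLS), which coincides with Fisher scoring for logistic regression. All data below are simulated toy values.

```python
import numpy as np

def fit_logistic(X, U, n_iter=25):
    """Maximize the Bernoulli log-likelihood l(beta) via IRLS / Fisher scoring."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # logistic function p_i
        W = p * (1.0 - p)                     # Fisher information weights
        z = X @ beta + (U - p) / W            # working response
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

def accuracy(U_true, p_hat):
    """Share of correct up/down calls: (TP + TN) / (TP + TN + FP + FN)."""
    return np.mean((p_hat > 0.5).astype(int) == U_true)

# Simulated toy data: an intercept plus one sentiment-like regressor.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
p_true = 1.0 / (1.0 + np.exp(-(0.2 + 1.5 * X[:, 1])))
U = rng.binomial(1, p_true)

beta_hat = fit_logistic(X, U)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
acc = accuracy(U, p_hat)
```

In the study, X would instead hold the stacked company-specific regressors and U the daily up/down movements, with the split into training and test months applied before fitting.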
As a benchmark model we use a simple buy-and-hold strategy and refer to it as "Naive". In addition to the accuracy comparison, we have investigated the p-values of McNemar's test for the paired series of BERT versus Naive predictions on the test data, and likewise for GPT-4 versus Naive. The results of our experimental setup can be obtained from Table 2.
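To illustrate the test on paired correct/incorrect series, the following hedged sketch implements the chi-square version of McNemar's test with continuity correction; we do not claim this is the exact variant used in the study, and the example counts are toy values.

```python
import math
import numpy as np

def mcnemar_p(correct_model, correct_naive):
    """McNemar's test for two classifiers scored on the same observations.

    Chi-square version with continuity correction; the p-value is the
    chi^2(1) upper tail, computed via the complementary error function.
    """
    correct_model = np.asarray(correct_model, dtype=bool)
    correct_naive = np.asarray(correct_naive, dtype=bool)
    b = int(np.sum(correct_model & ~correct_naive))   # model right, naive wrong
    c = int(np.sum(~correct_model & correct_naive))   # naive right, model wrong
    if b + c == 0:
        return 1.0                                    # no discordant pairs
    stat = (abs(b - c) - 1) ** 2 / (b + c)            # continuity-corrected
    return math.erfc(math.sqrt(stat / 2.0))           # P(chi^2_1 > stat)

# Toy example: the model disagrees with the benchmark on 30 days, winning 25.
correct_model = [True] * 25 + [False] * 5 + [True] * 40
correct_naive = [False] * 25 + [True] * 5 + [True] * 40
p = mcnemar_p(correct_model, correct_naive)
```

Only the discordant days (where exactly one of the two classifiers is correct) enter the statistic, which is why the test is well suited to comparing a model against a fixed benchmark on the same test sample.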
Table 2: Comparison of accuracies for stock price movements: BERT vs. GPT-4. Number of observations in the test sample shown in ntest. All p-values are calculated using the naive buy-and-hold strategy as benchmark.
Across the evaluation period, the Naive baseline displayed a fluctuating accuracy, ranging from 46.99% to 52.06%. BERT's performance consistently surpassed this, with its best accuracy of 66.47% observed in May. In terms of statistical significance, BERT's p-values remained notably low, underscoring the reliability of its predictions; in April, May, and July in particular, the p-values dropped below 0.001, indicating that BERT is able to translate microblogging sentiment into stock price movements.
GPT-4, on the other hand, consistently demonstrated a commendable performance, outpacing BERT in five out of the six months. Its peak accuracy reached 71.47% in May, with the lowest p-value observed in the same month at 4.91×10−5. All p-values are below the common threshold of 0.05. This suggests a consistent and statistically significant predictive power of the GPT-4 model in the context of stock price sentiment analysis.7
5 Discussion and Conclusion
The application of GPT-4 in stock market sentiment analysis offers a novel approach to understanding market dynamics and investor sentiment. Our study primarily focused on the sentiments of Apple and Tesla stock from various microblogging messages over a year. The study underscores the power and potential of GPT-4 in analyzing microblogging sentiments for financial applications. By refining input prompts, GPT-4 was able to capture nuanced sentiments and context that might be missed by other models like BERT. The comparison between the two models highlighted the advanced contextual understanding capabilities of GPT-4, making it a promising tool for sentiment analysis in finance.
However, while our findings suggest a strong correlation between sentiment derived from GPT-4 and stock price movements, there are several considerations and implications to be discussed.
While traditional models for sentiment analysis, such as BERT and its variants, have been widely used, GPT-4's ability to understand context and nuance might offer a more accurate representation of sentiment. Its performance in our study supports this claim, but further comparative analyses are needed to establish its superiority conclusively. This is especially true as the significance of prompt engineering in obtaining desired outputs from GPT-4 cannot be overstated. The quality and specificity of the prompts play a pivotal role in sentiment extraction. Future studies could explore optimal prompt engineering strategies tailored to financial sentiment analysis.
The use of GPT-4 introduces a new dimension in sentiment analysis, but its cost might be prohibitive for some investors or analysts. However, the potential benefits in terms of accuracy and insight generation might outweigh the costs, especially for institutional investors.
Beyond sentiment analysis, there is potential to use GPT-4 and similar models in other areas of financial analysis, such as forecasting, risk assessment, and portfolio optimization. Stock prices can be influenced by a myriad of factors beyond microblogging sentiments, which we have not explored within this study. The versatility of GPT-4 opens avenues for multifaceted financial research.
The insights derived from this study can also be beneficial for hedge funds, institutional investors, and individual traders in making informed decisions. Moreover, news agencies and financial platforms might integrate such sentiment analysis tools to offer real-time sentiment scores alongside stock prices, providing an additional layer of information for their audience.
In conclusion, while the GPT-4 LLM offers promising insights and capabilities in the realm of financial sentiment analysis, further studies and refinements are essential for its practical application in modeling stock price movements.
References
Belal, M., She, J., and Wong, S. (2023). Leveraging chatgpt as text annotation tool for sentiment analysis. arXiv preprint arXiv:2306.17177.
Bing, L., Chan, K. C., and Ou, C. (2014). Public sentiment analysis in twitter data for prediction of a company’s stock price movements. In 2014 IEEE 11th International Conference on e-Business Engineering, pages 232–239. IEEE.
Bollen, J., Mao, H., and Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8.
7 We have excluded test periods beginning in January, February, March, October, November and December, as in all of these cases either the training or the test period would comprise only three months for estimation or application, respectively. This would most likely lead to unstable and unrepresentative results.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., et al. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Bustos, O. and Pomares-Quimbaya, A. (2020). Stock market movement forecast: A systematic review. Expert Systems with Applications, 156:113464.
Cont, R. (2001). Empirical properties of asset returns: stylized facts and statistical issues. Quantitative finance, 1(2):223–236.
Coqueret, G. (2020). Stock-specific sentiment and return predictability. Quantitative Finance, 20(9):1531– 1551.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25(2):383–417.
Groß-Klußmann, A., König, S., and Ebner, M. (2019). Buzzwords build momentum: Global financial twitter sentiment and the aggregate stock market. Expert Systems with Applications, 136:171–186.
Hamraoui, I. and Boubaker, A. (2022). Impact of twitter sentiment on stock price returns. Social Network Analysis and Mining, 12(1):28.
Jaggi, M., Mandal, P., Narang, S., Naseem, U., and Khushi, M. (2021). Text mining of stocktwits data for predicting stock prices. Applied System Innovation, 4(1):13.
Ji, X., Wang, J., and Yan, Z. (2021). A stock price prediction method based on deep learning technology. International Journal of Crowd Science, 5(1):55–72.
Jiang, H., Li, S. Z., and Wang, H. (2021). Pervasive underreaction: Evidence from high-frequency data. Journal of Financial Economics, 141(2):573–599.
Katsafados, A. G., Nikoloutsopoulos, S., and Leledakis, G. N. (2023). Twitter sentiment and stock market: a covid-19 analysis. Journal of Economic Studies.
Kheiri, K. and Karimi, H. (2023). Sentimentgpt: Exploiting gpt for advanced sentiment analysis and its departure from current machine learning. arXiv preprint arXiv:2307.10234v2.
Leippold, M. (2023). Sentiment spin: Attacking financial sentiment with gpt-3. Finance Research Letters, 55:103957.
Lopez-Lira, A. and Tang, Y. (2023). Can chatgpt forecast stock price movements? return predictability and large language models. arXiv preprint arXiv:2304.07619.
Matthies, T., L¨ohden, T., Leible, S., and Raabe, J.-P. (2023). To the moon: Analyzing collective trading events on the wings of sentiment analysis. arXiv preprint arXiv:2308.09968v1.
Mittal, A. and Goel, A. (2012). Stock prediction using twitter sentiment analysis. Stanford University, CS229 project report.
Oliveira, N., Cortez, P., and Areal, N. (2017). The impact of microblogging data for stock market prediction: Using twitter to predict returns, volatility, trading volume and survey sentiment indices. Expert Systems with Applications, 73:125–144.
Otto, P. and Steinert, R. (2023). Estimation of the spatial weighting matrix for spatiotemporal data under the presence of structural breaks. Journal of Computational and Graphical Statistics, 32(2):696–711.
Pagolu, V. S., Reddy, K. N., Panda, G., and Majhi, B. (2016). Sentiment analysis of twitter data for predicting stock market movements. In 2016 International Conference on Signal Processing, Communication, Power and Embedded System (SCOPES), pages 1345–1350. IEEE.
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training. OpenAI Preprint.
Ranco, G., Aleksovski, D., Caldarelli, G., Grčar, M., and Mozetič, I. (2015). The effects of twitter sentiment on stock price returns. PloS one, 10(9):e0138441.
Rao, T. and Srivastava, S. (2012). Analyzing stock market movements using twitter sentiment analysis. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), pages 119–123.
Rodríguez-Ibánez, M., Casánez-Ventura, A., Castejón-Mateos, F., and Cuenca-Jiménez, P.-M. (2023). A review on sentiment analysis from social media platforms. Expert Systems with Applications, 223:119862.
Si, J., Mukherjee, A., Liu, B., Li, Q., Li, H., and Deng, X. (2013). Exploiting topic based twitter sentiment for stock prediction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 24–29.
Signorell, A., Aho, K., Alfons, A., and mult. al. (2017). DescTools: Tools for Descriptive Statistics. R package version 0.99.23.
Smailović, J., Grčar, M., Lavrač, N., and Žnidaršič, M. (2013). Predictive sentiment analysis of tweets: A stock market application. In Human-Computer Interaction and Knowledge Discovery in Complex, Unstructured, Big Data, pages 77–88. Springer.
Sousa, M. G., Sakiyama, K., de Souza Rodrigues, L., Moraes, P. H., Fernandes, E. R., and Matsubara, E. T. (2019). Bert for stock market sentiment analysis. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 1597–1601. IEEE.
Sul, H. K., Dennis, A. R., and Yuan, L. (2017). Trading on twitter: Using social media sentiment to predict stock returns. Decision Sciences, 48(3):454–488.
Valle-Cruz, D., Fernandez-Cortez, V., López-Chau, A., and Sandoval-Almazán, R. (2022). Does twitter affect stock market decisions? financial sentiment analysis during pandemics: A comparative study of the h1n1 and the covid-19 periods. Cognitive Computation, 14:372–387.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wankhade, M., Rao, A. C. S., and Kulkarni, C. (2022). A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review, 55(7):5731–5780.
Yang, S. Y., Mo, S. Y. K., and Liu, A. (2015). Twitter financial community sentiment and its predictive relationship to stock market movement. Quantitative Finance, 15(10):1637–1656.
Zhang, H., Hua, F., Xu, C., Guo, J., Kong, H., and Zuo, R. (2023). Unveiling the potential of sentiment: Can large language models predict chinese stock price movements? arXiv preprint arXiv:2306.14222.