Sep 1, 2023
Junwon Sung*, Woojin Heo, Yunkyung Byun, Youngsam Kim§
EG Asset Pricing, Republic of Korea
Abstract
In the rapidly advancing domain of artificial intelligence, state-of-the-art language models such as OpenAI’s GPT-3.5-turbo and GPT-4 offer new opportunities for automating complex tasks. This paper examines the capabilities of these models for semantically analyzing corporate disclosures in the Korean context, specifically timely disclosures. The study focuses on the top 50 publicly traded companies listed on the Korean KOSPI, based on market capitalization, and examines their monthly disclosure summaries over a period of 17 months. Each summary was assigned a sentiment rating on a scale ranging from 1 (very negative) to 5 (very positive). To gauge the effectiveness of the language models, their sentiment ratings were compared with those given by human experts. Our findings reveal a notable performance disparity between GPT-3.5-turbo and GPT-4, with the latter agreeing more closely with the human ratings. For GPT-4, the Spearman correlation coefficient was 0.61 and the simple concordance rate was 0.82. This research contributes insights into the evaluative characteristics of GPT models, laying the groundwork for future work on automated semantic monitoring.
1 Introduction
In recent years, large language models like OpenAI’s ChatGPT have caught people’s attention [5]. What sets these models apart is their ability to learn from the context they’re given, making them incredibly versatile when it comes to handling new data [2, 9]. This means they are great for a wide range of language tasks, without needing specialized training data or fine-tuning [7, 6, 8, 13].
This is especially useful for tasks that require keeping up with constantly changing information, like monitoring the news in real-time. In our study, we are using these advanced models to analyze the sentiment in corporate announcements. This is an intricate task because these announcements are usually filled with complex data that takes time and expertise to understand. We are specifically looking at how well OpenAI’s latest models, like GPT-3.5-turbo and GPT-4, can perform sentiment analysis on corporate announcements from Korean companies.
Conceptually, corporate disclosure can be broadly categorized into two types: periodic and continuous disclosures [11]. Periodic disclosures encompass routine reports that companies are obligated to submit to the pertinent regulatory bodies. These reports often encompass financial statements, quarterly updates, and annual summaries. Conversely, continuous disclosures encompass significant information that emerges outside the regular reporting timetable and could potentially influence the valuation of a company’s shares or other financial instruments. In South Korea, these ongoing reports are termed ‘timely disclosures,’ while in the United States, they are denoted as ‘current disclosures’ or simply 8-K reports.
Our monitoring system focuses on continuous disclosures, given their often immediate impact on a company’s financial standing and, subsequently, investor decisions. This type of disclosure may include, but is not limited to, merger announcements, changes in executive leadership, regulatory investigations, or any other material events that could affect the stock price or investor perception. Given the increasing importance of continuous corporate disclosure in shaping investor sentiment and market dynamics, the ability to rapidly and accurately assess these communications is crucial. Traditional methods of sentiment analysis often involve manual annotations or rely on simpler algorithms that may not capture the complex events often found in these announcements [10, 1]. The limitations of these approaches become even more pronounced when dealing with non-English languages, where cultural and linguistic subtleties can significantly impact the interpretation of sentiment.
Our study aims to fill a gap in the existing literature by leveraging the capabilities of state-of-the-art language models to perform sentiment analysis on Korean corporate announcements. We seek to answer the following research questions:
1. How effective are large language models, specifically GPT-3.5-turbo and GPT-4, in analyzing sentiment in corporate announcements?
2. What challenges or limitations are associated with using large language models like GPT-3.5-turbo and GPT-4 for sentiment analysis in the context of corporate announcements?
The second question aims to uncover any potential drawbacks or limitations of using these advanced models for this specific task. While these models offer a range of capabilities, they are not without challenges. These could include computational constraints or the risk of model biases affecting the analysis. Understanding these challenges is essential for evaluating the practicality and reliability of using large language models for real-time semantic monitoring.
By addressing these research questions, we aim to offer a balanced view of both the capabilities and limitations of using large language models for sentiment analysis in Korean corporate announcements. This should contribute to both academic discussions and provide actionable insights for industry practitioners.
2 Methodology
2.1 Data Collection
• Targeted Enterprises: The top 50 companies by market capitalization were selected from the Korea KOSPI as of June 28, 2023. This decision was based on their significant influence on the market and the general interest in these leading companies.
• Data Collection Period: From January 1, 2022, to May 31, 2023.
• Data Collection Methodology: The dataset for this study was sourced from the Korea Investor’s Network for Disclosure System (KIND). To maintain coherence and relevance, the disclosures were summarized into single sentences using the GPT-3.5 model prior to the sentiment rating process. As noted in the introduction, periodic disclosures were excluded. Specifically, fair reports, business reports, semi-annual reports, and quarterly reports were not included. It is worth noting that the average token length of these periodic disclosures is 65,657, whereas the average token length of timely disclosures is 2,172.
2.2 Data Preprocessing
Data obtained from corporate disclosures was systematically converted into a monthly time-series format to enable effective sentiment analysis, as depicted in Table 2. Each data point includes key elements such as the date, time, title, and a succinct summary of the disclosure’s content. The collection process is limited to a maximum of 15 disclosures per month for each company. In cases where a company releases more than 15 disclosures within a single month, only the most recent 15 are chosen for analysis. This limitation is imposed primarily due to the context-length constraints of the language models being used. For example, if a company were to issue daily disclosures from June 1 to June 30, analytically, it would be more valuable to consider the disclosures from June 16 to June 30 rather than the initial 15 from June 1 to June 15.
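The monthly cap described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(company, timestamp, summary)`, the function name `latest_per_month`, and the company name `ACME` are assumptions introduced here.

```python
from collections import defaultdict
from datetime import datetime

MAX_PER_MONTH = 15  # cap stated in the text


def latest_per_month(disclosures, limit=MAX_PER_MONTH):
    """Group (company, timestamp, summary) records by company and month,
    keeping only the `limit` most recent records in each group."""
    groups = defaultdict(list)
    for company, timestamp, summary in disclosures:
        key = (company, timestamp.year, timestamp.month)
        groups[key].append((timestamp, summary))
    result = {}
    for key, items in groups.items():
        items.sort(key=lambda item: item[0])  # oldest first
        result[key] = items[-limit:]          # keep the most recent `limit`
    return result


# Hypothetical company issuing one disclosure per day in June 2023.
records = [
    ("ACME", datetime(2023, 6, day, 16, 0), f"summary {day}")
    for day in range(1, 31)
]
kept = latest_per_month(records)[("ACME", 2023, 6)]
print(kept[0][0].date(), kept[-1][0].date())  # 2023-06-16 2023-06-30
```

With daily disclosures across June, the sketch keeps June 16 through June 30, matching the example in the text.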
2.3 Utilization of the GPT Model
The GPT model was directed to assign scores between 1 and 5 based on criteria delineated in Table 1.
Score | Criteria |
1 (Very Negative) | The company’s overall situation is very unfavorable, indicating a decline in revenue and profit. Financial conditions are unstable, market share is decreasing, and there are concerns about the ability of management and social responsibility. The future outlook in this situation is highly uncertain, facing threats to the company’s sustainability. |
2 (Negative) | The company’s condition is unfavorable, but certain improvements are possible. The trend of declining revenue and profit continues, and financial conditions are unstable. Market share may vary depending on competitive situations, and evidence of innovation or growth potential is limited. The outlook for the future is not very bright. |
3 (Neutral) | The company’s situation has not changed significantly, indicating that revenue and profit are stable. Financial conditions are stable, and competitiveness in the market is consistently maintained. Innovation and growth potential are average, and the future outlook remains stable without significant changes. |
4 (Positive) | The company is showing significant revenue and growth, indicating that it is being operated well overall. Financial conditions are positive, and there is a trend of increasing market share. There are positive expectations regarding innovation and growth potential, and the outlook for the future is positive. |
5 (Very Positive) | The company is achieving explosive revenue and profit, occupying an outstanding position in the market as a result. Financial conditions are very stable, and market share is dominant. The company possesses excellent innovation and growth potential, and the expectations for the future are very high. |
Table 1: Scoring criteria (common for GPT model and human participants)
As illustrated in Table 1, the model was prompted to evaluate several key factors, including the company’s financial health, market share, and growth potential. Using the OpenAI API, both the prompt (Table 1) and summarized data (Table 2) were dispatched to the model, from which the GPT’s rating was then retrieved.
Date | Time | Details |
2023-06-13 | 16:30 | Additional Listing (Domestic CB Conversion): CJ CGV Co., Ltd. has additionally listed 383 registered common shares. The issuance price for the 6th time is 26,600 KRW, and for the 9th time is 22,000 KRW. The issuance period is from May 16 to 31, 2023. The dividend base date is January 1, 2023. The capital increase method is a domestic CB conversion, and the listing date is June 16, 2023. |
2023-06-20 | 15:49 | Capital Increase Decision: CJ CGV decided through a board meeting on June 20, 2023, to increase capital by issuing a total of 74,700,000 common shares. They plan to raise 100 billion KRW for facility funds, 90 billion KRW for operational funds, and 380 billion KRW for debt repayment. The issue price of new shares is 7,630 KRW per share. The assignment date for the new shares is July 31, 2023, and the listing date is expected to be September 27, 2023. |
2023-06-20 | 16:04 | Loan Decision: A decision was made to lend money to CJ CGV’s Hong Kong corporation, CGI HOLDINGS LIMITED. The loan amount is 102.456 billion KRW with an interest rate of 7.37%. The loan period is from June 20, 2023, to December 20, 2023. This loan is an extension of an existing loan for the improvement of the subsidiary’s financial structure. The board decision date is June 20, 2023. |
2023-06-20 | 16:09 | Transactions with Affiliates: CJ CGV conducted product and service transactions with its affiliate, CJ OliveNetworks, in the third quarter of 2023. The transaction amount totals 12.349 billion KRW, which is 1.75% of the previous fiscal year’s sales. The transaction details include software and other service contracts, and the contract method is by mutual agreement. |
2023-06-20 | 16:12 | Bond Warning: CJ CGV 35CB (new type) (KR6079161C75) is designated as a bond of concern for investment. This is due to its closing price falling below 80% of its face value. |
2023-06-27 | 17:04 | Additional Listing (Domestic CB Conversion): CJ CGV has additionally listed 378 registered common shares. The issuance prices for the 6th and 9th issues are 26,600 KRW and 22,000 KRW, respectively. The issuance date ranges from June 8 to 15, 2023. The dividend calculation date is January 1, 2023. The method of capital increase is domestic CB conversion, and the listing date is June 30, 2023. |
2023-06-30 | 15:50 | Decision to Provide Collateral for Others: The company decided to provide collateral for CGI Holdings Limited’s borrowings of 26.256 billion KRW from KEB Hana Bank Hongkong Branch. The collateral amount is 29.343 billion KRW, which is 7.46% of the company’s equity capital of 393.089 billion KRW. The collateral provision period is from June 30 to September 27, 2023, and the collateral property is in KRW deposit. |
Table 2: CJ CGV prompt data as of end of June
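As a rough illustration of how the scoring criteria and a month of summaries might be combined into a single text prompt, consider the sketch below. The prompt wording, the function name `build_prompt`, and the abbreviated criteria are assumptions made for this example; the exact prompt used in the study is not reproduced here, and the API call itself is omitted.

```python
def build_prompt(criteria, company, rows):
    """Assemble one text prompt from scoring criteria and disclosure rows.

    The wording is illustrative only, not the prompt used in the study.
    """
    lines = ["Rate the company's situation on a 1-5 scale using these criteria:"]
    for score, description in criteria:
        lines.append(f"{score}: {description}")
    lines.append(f"Disclosures for {company} (date | time | details):")
    for date, time, details in rows:
        lines.append(f"{date} | {time} | {details}")
    lines.append("Answer as '<score> (<label>) | <reason>'.")
    return "\n".join(lines)


# Abbreviated criteria and a single row, for illustration only.
criteria = [(1, "Very Negative"), (3, "Neutral"), (5, "Very Positive")]
rows = [("2023-06-20", "16:12", "Bond Warning: designated as a bond of concern.")]
prompt = build_prompt(criteria, "CJ CGV", rows)
print(prompt)
```

Keeping the criteria block identical across all queries is one simple way to hold the prompt uniform, so that only the disclosure rows vary from company to company and month to month.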
To ensure a uniform and comprehensive analysis, responses from the GPT models are formatted as depicted in Table 3. For enhanced interpretability, each generated response is structured to incorporate a brief rationale along with its corresponding evaluation score.
Rating Score | Reasons for the score |
2 (Negative) | CJ CGV is making efforts to secure funds through additional listing and paid-in capital increases. However, given the extension of loans to affiliates, the use of capital increase funds for debt repayment, and the forecast for designation as an investment cautionary stock, the company’s financial status is perceived as unstable. Such circumstances could increase uncertainty about the company’s future growth and potentially weaken its competitiveness. |
Table 3: Response example from the GPT-4 model.
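A response in the format of Table 3 can be turned into a numeric score with a small parser. The sketch below assumes the response arrives as a single `score (label) | reason` string; that serialization, the `LABELS` mapping, and the function name `parse_rating` are assumptions for illustration.

```python
import re

# Labels from Table 1, keyed by score.
LABELS = {
    1: "Very Negative",
    2: "Negative",
    3: "Neutral",
    4: "Positive",
    5: "Very Positive",
}

# "<score> (<label>) | <reason>", with an optional trailing "|".
_PATTERN = re.compile(r"^\s*([1-5])\s*\(([^)]+)\)\s*\|\s*(.+?)\s*\|?\s*$", re.DOTALL)


def parse_rating(response):
    """Return (score, reason); raise ValueError on malformed responses."""
    match = _PATTERN.match(response)
    if match is None:
        raise ValueError(f"unrecognized response format: {response!r}")
    score = int(match.group(1))
    label = match.group(2).strip()
    if label.lower() != LABELS[score].lower():
        raise ValueError(f"label {label!r} does not match score {score}")
    return score, match.group(3)


score, reason = parse_rating("2 (Negative) | The financial status is perceived as unstable. |")
print(score)  # 2
```

Rejecting responses whose label disagrees with the score is a cheap consistency check before the rating enters any statistics.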
2.4 Evaluation Method
In our research, we specifically selected the two language models:
• ChatGPT-3.5-turbo-16K
• GPT-4
For the evaluation, both the GPT models and human assessors scrutinized a set of 815 evaluation queries related to the top 50 companies listed on the KOSPI index. These queries were assessed using a standardized 1-5 point scale. To ensure consistency, the prompted queries were kept uniform across all evaluations. It should be noted that the queries were translated into English to optimize the performance of the model, a technique commonly employed to enhance accuracy.
For human evaluation, we collaborated with two highly proficient experts boasting over a decade of experience within the realm of financial data analysis. These experts conducted individual assessments of the data, employing identical criteria to those utilized for evaluating the GPT models. To ensure a fluid and efficient evaluation process, we equipped the experts with a user-friendly web-based interface, as depicted in Figure 1.
We utilized Cohen’s Kappa statistic to measure the degree of consistency between human evaluators, yielding a value of 0.352. This result indicates a fair level of inter-rater agreement. Additionally, the simple agreement rate was observed to be 68%, further substantiating the reliability of the assessment. The scores attributed to each query by the two experts were summed and subsequently averaged. Any fractional values resulting from this averaging process were methodically rounded down to the nearest tenth, reflecting a conservative approach to the rating values.
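For reference, Cohen’s Kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater’s marginal score frequencies. The sketch below computes both statistics on a small made-up pair of rating lists; the data and function names are illustrative and unrelated to the study’s 815 queries.

```python
from collections import Counter


def simple_agreement(a, b):
    """Fraction of items on which the two raters give the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two raters scoring the same items."""
    n = len(a)
    p_o = simple_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the raters' marginal frequencies per score.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1:
        return float("nan")  # both raters used one identical score throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up ratings for ten items, dominated by the neutral score 3.
rater_a = [3, 3, 3, 4, 2, 3, 3, 4, 3, 3]
rater_b = [3, 3, 4, 4, 3, 3, 3, 3, 3, 2]
print(simple_agreement(rater_a, rater_b))          # 0.6
print(round(cohens_kappa(rater_a, rater_b), 3))    # 0.13
```

The example shows why the two statistics can diverge: when most ratings fall on one score, chance agreement p_e is already high, so a moderate raw agreement rate translates into a low kappa.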
Figure 1: Disclosure rating screen for human raters
3 Experimental Results
3.1 Conditions for Rating Adjustments
A previous study has shown that GPT model summaries of public disclosures can emphasize a positive or negative tone more than human intuition might suggest [6]. Therefore, this study considered the possibility of bias in the sentiment scores assigned by the GPT models. To evaluate the effect of this potential bias, we constructed several artificial rating adjustment conditions as follows:
• Condition 1: No adjustments. This serves as a control group, with no adjustments made, to have a baseline performance metric.
• Condition 2: Subtract 1 point if the GPT score is 4 or above. This adjustment helps balance the model’s tendency to be overly positive, making its evaluations more aligned with human judgment.
• Condition 3: Add 1 point if the GPT score is 2 or lower. Adding a point to lower scores tests the reverse hypothesis: that the model might be unduly negative, overstating negative tones.
• Condition 4: Add 1 point if the GPT score is 2 or below, and subtract 1 point if the score is 4 or above. This combines Conditions 2 and 3 to examine whether the model has both overestimation and underestimation biases at the same time.
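The four conditions above amount to a simple mapping on the 1-5 scale, sketched below. The function name `adjust` is introduced here for illustration.

```python
def adjust(score, condition):
    """Apply the rating adjustment for Conditions 1-4 to a GPT score (1-5)."""
    if condition not in (1, 2, 3, 4):
        raise ValueError(f"unknown condition: {condition}")
    if condition in (2, 4) and score >= 4:
        return score - 1  # temper high scores
    if condition in (3, 4) and score <= 2:
        return score + 1  # lift low scores
    return score          # Condition 1, or a score the condition leaves alone


for condition in (1, 2, 3, 4):
    print(condition, [adjust(s, condition) for s in range(1, 6)])
# 1 [1, 2, 3, 4, 5]
# 2 [1, 2, 3, 3, 4]
# 3 [2, 3, 3, 4, 5]
# 4 [2, 3, 3, 3, 4]
```

Note that Condition 2 maps both 3 and 4 to 3, and Condition 4 additionally maps 2 to 3, so the adjusted scores are pulled toward the neutral rating.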
3.2 Results of the Conditions
The correlation and concordance rate results for each condition are presented in Table 4. In Condition 1, the concordance rate is higher for ChatGPT-3.5 than for GPT-4, yet GPT-4 has higher Spearman and Kendall coefficients. The GPT-4 model records its best results in Condition 2 (0.82 concordance rate, 0.61 Spearman, and 0.59 Kendall). However, both GPT models show reduced correlations compared to the baseline condition in Conditions 3 and 4.
The key findings from these results are summarized as follows:
• The GPT-4 model demonstrates higher correlations than GPT-3.5 in all conditions.
• The GPT-4 model in Condition 2 shows the highest performance across all settings for every measure.
• The effects of rating adjustments are clearly observed in correlation measures.
The third point on the previous list emphasizes a significant imbalance within the rating scales. As depicted in Figure 2, human evaluators assigned roughly 75% (615/815) of their ratings to a score of 3. This skew allowed the GPT models to achieve relatively high performance in terms of concordance rate under Condition 4, even while registering the lowest correlation values.
The performance pattern of the ChatGPT-3.5 model differs subtly from that of the GPT-4 model. Specifically, ChatGPT-3.5 achieves its peak concordance rate under Condition 2, while it registers the highest correlation values in Condition 1. This variation could suggest that the ChatGPT-3.5 model’s ratings are less consistent than those generated by the GPT-4 model. For a detailed view of the rating distribution across various conditions for the models, please refer to Figures 2 through 5.
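To make the distinction between the two kinds of measure concrete, the sketch below computes a concordance rate (share of exact matches) and a Spearman coefficient (Pearson correlation of tie-averaged ranks) in plain Python. The ratings are made up to mimic a distribution in which 75% of human scores are 3; they are not data from the study, and the function names are illustrative.

```python
import math


def concordance_rate(human, model):
    """Share of items where the model score equals the human score."""
    return sum(h == m for h, m in zip(human, model)) / len(human)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return math.nan  # undefined when one side is constant
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


human = [3, 3, 3, 3, 3, 3, 2, 4]    # 75% neutral, as in the skew described above
swapped = [3, 3, 3, 3, 3, 3, 4, 2]  # matches every 3 but inverts the extremes
shifted = [4, 4, 4, 4, 4, 4, 3, 5]  # never matches, but preserves the ordering

print(concordance_rate(human, swapped), spearman(human, swapped))  # 0.75 -1.0
print(concordance_rate(human, shifted), spearman(human, shifted))  # 0.0 1.0
```

In this toy case a rater who gets the direction of every non-neutral item wrong still reaches a 0.75 concordance rate, while a rater who orders every item correctly scores 0.0. This is why the concordance rate and the correlation measures should be read together when the rating distribution is skewed.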
Table 4: Performance of GPT models
Figure 2: Condition 1
Figure 3: Condition 2
Figure 4: Condition 3
Figure 5: Condition 4
4 Discussions
ChatGPT-3.5 vs. GPT-4. While ChatGPT-3.5 shows a higher concordance rate than the GPT-4 model in Condition 1 (0.6 versus 0.43), GPT-4 surpasses ChatGPT-3.5 when comparing their peak concordance rates in Condition 2 (0.82 versus 0.77). Notably, the correlations between GPT-4 and human ratings consistently outperform those of ChatGPT-3.5, indicating that GPT-4 offers superior consistency. This outcome is not unexpected, given GPT-4’s superior performance records in tasks related to common sense reasoning [9].
Rating Adjustments. The empirical finding [6] that GPT models might over-emphasize positive tones appears to be supported by our results. Figures 2 and 3 illustrate that both GPT models tend to overestimate positive disclosures when compared to human evaluators. Interestingly, this tendency is not symmetrical. As a result, Conditions 3 and 4 yielded suboptimal performance for the GPT models overall. The GPT-4 model under Condition 2 aligns most closely with the human evaluators and exhibits the highest levels of accuracy and correlation in our results. Our empirical study suggests that prompt engineering alone may be insufficient to address the domain-specific nuances of financial texts like corporate disclosures. This is because large language models may struggle to accurately estimate the impact of such financial events.
Limitations and Challenges. While this study marks a significant stride in the application of Large Language Models (LLMs) like GPT to corporate disclosures, several challenges and limitations must be acknowledged. First and foremost, the GPT models operated without the advantage of background knowledge about the companies being analyzed. This presented a significant drawback, as it limited the depth of their analysis and their ability to understand context. Without knowledge of a company’s history, market position, or unique financial complexities, the insights generated by the models were inherently limited.
Secondly, the absence of external sources such as financial data in tables or news articles further constrained the models’ performance. Financial reports are often intricate documents accompanied by various supplementary data and market analyses. The inability to integrate and interpret these additional resources may have prevented a more comprehensive and nuanced evaluation of corporate disclosures. Furthermore, the mathematical capabilities of the GPT models present another set of limitations. While they can perform basic arithmetic and some algebraic operations, their ability to understand and analyze complex financial formulas or perform advanced statistical analyses is limited [4]. This is especially pertinent when evaluating corporate financial reports, which often require a sophisticated understanding of accounting principles and mathematical models for accurate interpretation. However, it is worth noting that some of these limitations can be mitigated by employing expert libraries for the problems that GPT models struggle with [12]. This hybrid approach could potentially offer a more accurate and nuanced analysis, bringing together the text-processing strengths of GPT models with the computational rigor of specialized libraries.
Lastly, LLMs like ChatGPT are subject to modifications, which could lead to inconsistent performance [3]. This inherent risk of variability must be addressed to ensure the stability of any monitoring system that relies on these models.
Future Work. There is significant room for growth and expansion of this research. Potential avenues include:
• Incorporating the GPT-4 model for the summarization process, hypothesizing that its advanced capabilities might yield even more accurate summaries than ChatGPT-3.5-turbo.
• Developing mechanisms to feed contextual information about companies to the GPT models, allowing for a richer and more nuanced analysis.
• Incorporating external data sources into the analysis, such as financial data tables and news articles, for a more comprehensive sentiment analysis.
5 Conclusion
In conclusion, our study demonstrates the potential of large language models, such as ChatGPT-3.5-turbo and GPT-4, in performing sentiment analysis on corporate announcements from Korea’s top 50 KOSPI companies. These models exhibit varying degrees of success in evaluating the sentiment of continuous disclosures. Through empirical analysis, we observed that GPT-4 outperforms ChatGPT-3.5-turbo in terms of both consistency and correlation with human evaluations, indicating its superior ability to comprehend and assess complex financial language. The study also highlights the importance of addressing biases and limitations in LLMs’ output, as evidenced by the need for rating adjustments to align model-generated scores with human judgment. However, challenges such as the lack of background knowledge and the absence of external data integration remind us of the evolving nature of LLMs’ capabilities. Future research should explore approaches to improve contextual understanding, integrate external data, and enhance mathematical analysis within the framework of sentiment monitoring in the financial domain. This research contributes to a deeper understanding of the capabilities and limitations of LLMs for real-time semantic monitoring and presents insights for both academia and industry practitioners engaged in sentiment analysis of corporate disclosures.
References
[1] Sudeep R. Bapat, Saumya Kothari, and Rushil Bansal. Sentiment analysis of ESG disclosures on stock market, 2022.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[3] Lingjiao Chen, Matei Zaharia, and James Zou. How is ChatGPT’s behavior changing over time?, 2023.
[4] Simon Frieder, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, and Julius Berner. Mathematical capabilities of ChatGPT, 2023.
[5] Will Douglas Heaven. The inside story of how ChatGPT was built from the people who made it. Technical report, MIT Technology Review, 2023.
[6] Alex Kim, Maximilian Muhn, and Valeri Nikolaev. Bloated disclosures: Can ChatGPT help investors process financial information? arXiv preprint arXiv:2306.10224, 2023.
[7] Alex Kim, Maximilian Muhn, and Valeri Nikolaev. Can GPT-4 support analysis of textual data in tasks requiring highly specialized domain expertise? arXiv preprint arXiv:2306.10224, 2023.
[8] Jessica López Espejel, El Hassane Ettifouri, Mahaman Sanoussi Yahaya Alassan, El Mehdi Chouham, and Walid Dahhane. GPT-3.5 vs GPT-4: Evaluating ChatGPT’s reasoning performance in zero-shot learning. arXiv preprint arXiv:2305.12477, 2023.
[9] OpenAI. GPT-4 technical report, 2023.
[10] Sridhar Ravula. Bankruptcy prediction using disclosure text features, 2021.
[11] Jin-Young Yang. Continuous disclosure practices in the Korean equity market. Macroeconomy and Financial Markets, 2013.
[12] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023.
[13] Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong Bing. Sentiment analysis in the era of large language models: A reality check. arXiv preprint arXiv:2305.15005, 2023.
Appendix: Individual results of the KOSPI 50 Companies
* Concordance Rate and correlation index of the companies in condition 2.
* NaN: When the human ratings, the model ratings, or both are identical across all of a company’s queries, the values cannot be ranked and rank correlation coefficients such as Spearman or Kendall are undefined.
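The NaN case can be reproduced with a direct implementation of Kendall’s tau-b, whose denominator becomes zero when every pair is tied on one side. The sketch below is a plain O(n²) illustration on made-up ratings; the choice of the tau-b variant and the function name are assumptions, since the appendix does not state which variant was used.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b; returns NaN when all pairs are tied in x or in y."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied in x (possibly also tied in y)
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denominator = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denominator == 0:
        return math.nan  # one side is constant, so no ordering exists
    return (concordant - discordant) / denominator


# Hypothetical company whose human ratings are all 3.
print(kendall_tau_b([3, 3, 3, 3], [2, 3, 4, 3]))  # nan
print(kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
```

Given that most human ratings are 3, companies with only a handful of monthly queries can easily end up with constant human scores, which is the situation this note describes.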