
Constituency Parsing using LLMs

October 31, 2023

Xuefeng Bai♠, Jialong Wu♡, Yulong Chen♠, Zhongqing Wang★, Yue Zhang♠♢∗

School of Engineering, Westlake University

School of Computer Science and Engineering, Southeast University

Natural Language Processing Lab, Soochow University

Institute of Advanced Technology, Westlake Institute for Advanced Study

Abstract

Constituency parsing is a fundamental yet unsolved natural language processing task. In this paper, we explore the potential of recent large language models (LLMs), which have exhibited remarkable performance across various domains and tasks, to tackle this task. We employ three linearization strategies to transform output trees into symbol sequences, comparing their performance against state-of-the-art constituency parsers. Our experiments encompass zero-shot, few-shot, and full-training learning settings, and we evaluate the models on one in-domain and five out-of-domain test datasets. Our findings reveal insights into LLMs’ performance, generalization abilities, and challenges in constituency parsing.

1        Introduction

Constituency parsing is a fundamental natural language processing task which aims to predict the syntactic structure of a given sentence according to a phrase structure grammar (Marcus et al., 1993; Collins, 1997; Charniak, 2000; Petrov and Klein, 2007). As shown in Figure 1, given the sentence “Singapore is located in Asia.”, a constituency parser automatically produces its phrase structure tree. This task is not only appealing in computational linguistics studies but has also served as a foundation for improving various natural language processing tasks over the past decades, facilitating numerous downstream applications (Wang et al., 2018; Jiang and Diesner, 2019; Xu and Durrett, 2019). Despite its importance, constituency parsing remains a challenging problem.

Figure 1: The constituency tree for the sentence “Singapore is located in Asia.”.

The best results on the news domain reach 96% (Kitaev et al., 2019), while the performance on the review and literature domains is only 84% and 86%, respectively (Yang et al., 2022).

Recently, large language models (LLMs) trained on massive data have achieved remarkable performance on many NLP tasks, such as text classification (Schick and Schütze, 2020; Chen et al., 2022), commonsense reasoning (Zhang et al., 2022b), text summarization (Goyal et al., 2022; Chen et al., 2023b), machine translation (Jiao et al., 2023) and dialogue systems (Qin et al., 2023), using prompting technologies. The primary idea is to convert NLP tasks into a text-to-text problem, enabling knowledge transfer from pre-training tasks to downstream tasks. However, because the output of constituency parsing is represented as a tree structure, which is rarely found in the textual data on which LLMs are trained, there has been limited research utilizing LLMs for constituency parsing. It remains an open question whether LLMs can generalize well to constituency parsing.

To fill this gap, we take the first step toward studying the impact of LLMs on constituency parsing. To enable LLMs to process constituency trees, we first transform constituency trees into sequences of symbols using various linearization strategies. We then empirically evaluate LLMs from the two most representative families, namely the GPTs (Brown et al., 2020; Ouyang et al., 2022; OpenAI, 2023) (including ChatGPT, GPT-4, etc.) and LLaMAs (Touvron et al., 2023) (including Vicuna (Chiang et al., 2023), Alpaca (Taori et al., 2023), etc.), comparing them with state-of-the-art constituency parsers (Kitaev et al., 2019). We conduct experiments under multiple learning settings: zero-shot and few-shot scenarios to measure how much syntactic knowledge LLMs acquire during pre-training, and a full-training setting to study to what extent LLMs boost the performance of constituency parsing.

We evaluate state-of-the-art parsers and LLMs on one in-domain test dataset and five out-of-domain test datasets to measure their capability and cross-domain generalization abilities, respectively. Experimental results show that: 1) LLMs greatly boost the performance of sequence-based constituency parsing methods; 2) fine-tuned LLM-based parsers give competitive results compared to state-of-the-art chart-based and transition-based parsers; 3) LLMs (except commercial ones) are not good few-shot learners for constituency parsing; 4) to our surprise, while LLMs improve the cross-domain generalization ability of sequence-based methods, LLM-based parsers are weaker than chart-based parsers in cross-domain parsing. Furthermore, we conduct an error analysis to study the main challenges of LLM-based models for constituency parsing. Our analyses show that: 1) LLM-based parsers suffer from hallucination and have limited ability to learn extremely long constituents; 2) the main challenge for LLM-based constituency parsing stems from invalid trees, which may be because tree-structured data is seldom present in LLMs’ pre-training corpora. Our code and trained models are available at https://github.com/goodbai-nlp/LLM-ConParsing.

2        Related Work

Constituency Parsing. Current approaches to constituency parsing are predominantly divided into three categories: chart-based (Collins et al., 1999; Durrett and Klein, 2015), transition-based (Yamada and Matsumoto, 2003; Sagae and Lavie, 2005; Nivre and McDonald, 2008; Zhang and Clark, 2009), and sequence-based methods (Vinyals et al., 2015; Kamigaito et al., 2017). Chart-based methods assign scores to all spans within a sentence, then employ the CKY dynamic programming algorithm to identify the optimal parse tree (Kitaev and Klein, 2018; Kitaev et al., 2019; Zhang et al., 2020). Transition-based methods function by sequentially processing input sentences and progressively building the resultant constituency trees through a succession of locally predicted transition actions (Dyer et al., 2016; Cross and Huang, 2016; Liu and Zhang, 2017). Sequence-based methods solve constituency parsing as a sequence-to-sequence (Seq2Seq) generation problem (Fernández-González and Gómez-Rodríguez, 2020; Yang and Tu, 2022). This work belongs to the category of sequence-based methods. Different from existing methods, we solve constituency parsing with generative LLMs, which have shown impressive performance in many NLP applications. To our knowledge, we are the first to systematically study the impact of LLMs on constituency parsing.

Large Language Models. Large Language Models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023) have substantially influenced the field of Natural Language Processing (NLP). As pioneering work, Radford et al. (2019) and Brown et al. (2020) demonstrate the capability of language models to solve a task with minimal task supervision. Follow-up work shows that LLMs are adept at leveraging textual instructions to perform various tasks (Sanh et al., 2022; Webson and Pavlick, 2022; Liang et al., 2022; Zhang et al., 2022a). More recently, LLMs trained with Reinforcement Learning from Human Feedback (RLHF) can further generate responses that are well aligned with human values (Ouyang et al., 2022; Lambert et al., 2022; Bai et al., 2022; Gao et al., 2023). LLMs now showcase remarkable performance on a wide array of tasks, including commonsense reasoning (Zhang et al., 2022b), text summarization (Goyal et al., 2022; Chen et al., 2023b), and massive multitask language understanding (Hendrycks et al., 2021). Different from previous work, our paper focuses on the task of constituency parsing, which poses a unique challenge for LLMs since its structured output is inherently divergent from plain text.

3        Constituency Parsing with LLMs

We reformulate constituency parsing as a sequence-to-sequence problem such that constituency trees can be autoregressively generated by LLMs.

Figure 2: The constituency tree for “Singapore is located in Asia.” under three different linearization strategies.

We first adopt three linearization strategies to transform constituency trees into sequences of symbols (§3.1). To study the zero-shot and few-shot learning abilities of LLMs, we design prompts that guide them to generate outputs without any parameter update (§3.2). In addition, we fine-tune on the full training dataset to exploit the potential of open-source LLMs on constituency parsing (§3.3).

3.1        Tree Linearization

We use three tree-isomorphic linearization strategies to serialize a constituent tree into a sequence so that LLMs can learn constituency parsing by predicting the linearized tree. Figure 2 shows the linearized trees of the constituent tree for the sentence “Singapore is located in Asia” under the three linearization strategies, which we detail below:

Bracket-based We follow Vinyals et al. (2015) and linearize a constituent tree as a bracket sequence from top to bottom. The bracket-based linearization uses parentheses to mark visit depth and restore the tree structure. The bracket-based sequence is different from natural language but similar to code, which accounts for a certain proportion of the pre-training corpora of LLMs.

Transition-based We follow Liu and Zhang (2017) and convert a constituent tree into a sequence of transition actions. The transition sequence assumes a stack and a buffer for representing a state, where the stack holds partially built constituents and the buffer contains the next incoming words. The parse tree can be constructed by executing transition actions, which incrementally consume words from the buffer and build output structures on the stack. The transition-based sequence contains a set of special tokens (i.e., transition actions) that are new to LLMs, and it differs from the natural language distribution.

Span-based We consider a novel method, transforming a tree into a sequence of short sentences. Specifically, we extract all phrases as well as their labels from the tree and convert them into short sentences by filling the template “A is a B”, where “A” denotes a phrase and “B” is its corresponding label. Compared with the above two methods, the span-based linearization is the most similar to natural language but is much longer due to word/phrase repetition.
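To make the formats concrete, the sketch below (Python, using nltk) shows how the bracket-based and span-based linearizations could be derived from a PTB-style tree; the transition-based variant additionally requires an oracle that simulates the stack and buffer and is omitted here. The exact spacing, templates, and tokenization are assumptions rather than the released implementation.

```python
from nltk import Tree

ptb = ("(S (NP (NNP Singapore)) (VP (VBZ is) (VP (VBN located) "
       "(PP (IN in) (NP (NNP Asia))))) (. .))")
tree = Tree.fromstring(ptb)

def bracket_linearize(t: Tree) -> str:
    # Top-down bracket sequence: the PTB bracketing itself, on one line.
    return t.pformat(margin=10**6)

def span_linearize(t: Tree) -> str:
    # Turn every labeled phrase into a short sentence "A is a B".
    sentences = []
    for sub in t.subtrees():
        phrase = " ".join(sub.leaves())
        sentences.append(f"{phrase} is a {sub.label()}")
    return " . ".join(sentences)

print(bracket_linearize(tree))
print(span_linearize(tree))
```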

3.2        Prompting LLMs for Constituency Parsing

For ChatGPT and GPT-4, we follow an instruction-prompt scheme to teach LLMs to generate linearized constituency trees (Ouyang et al., 2022). Figure 3 shows the detailed structure of our prompt. In the zero-shot setting, our prompt contains three parts: Task Introduction, Instruction, and Task Input. Task Introduction explains the constituency parsing task and introduces constituent tags, ranging from clause level and phrase level to word level. Instruction describes the task that needs to be solved and specifies the required output format. Task Input refers to the input sentence to parse. In the few-shot setting, a small number of Training Instances are additionally provided as a context to instruct the LLMs to predict the constituency tree of the input sentence.
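The following sketch illustrates the prompt layout of Figure 3 for the zero-shot and few-shot settings. Only the overall structure (Task Introduction, Instruction, optional Training Instances, Task Input) follows the description above; the concrete wording, the Sentence/Tree field names, and the build_prompt helper are hypothetical.

```python
# Assumed wording; only the overall prompt structure follows the paper.
TASK_INTRO = (
    "Constituency parsing predicts the phrase-structure tree of a sentence. "
    "Constituent tags range from clause-level (e.g., S, SBAR) and phrase-level "
    "(e.g., NP, VP, PP) to word-level (e.g., NNP, VBZ, IN) labels."
)
INSTRUCTION = (
    "Parse the following sentence and output its constituency tree "
    "as a bracketed sequence."
)

def build_prompt(sentence: str, demos: list[tuple[str, str]] | None = None) -> str:
    parts = [TASK_INTRO, INSTRUCTION]
    # Few-shot setting: prepend a handful of (sentence, linearized tree) pairs.
    for demo_sent, demo_tree in demos or []:
        parts.append(f"Sentence: {demo_sent}\nTree: {demo_tree}")
    parts.append(f"Sentence: {sentence}\nTree:")
    return "\n\n".join(parts)

print(build_prompt("Singapore is located in Asia ."))
```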

3.3        Tuning LLMs for Constituency Parsing

To explore the full potential of LLMs on constituency parsing, we fine-tune LLMs on the full training dataset. Specifically, we follow the training paradigm of standard generative language modeling to fine-tune LLMs.

Figure 3: Illustration of prompts for zero/few-shot learning.

We concatenate the Instruction, the Task Input, and the linearized constituency tree into a single sequence, and train the model to generate the whole sequence incrementally.

Formally, denoting the concatenated sequence as Z, we fine-tune the model by minimizing the following training objective:

$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log P(Z_i \mid Z_{<i}; \theta),$

where Z_i is the i-th token of the output sequence, Z_{<i} is the sequence generated before time step i, θ denotes the model parameters, and N is the length of the output sequence.
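As a minimal sketch of this objective: with Hugging Face causal language models, passing labels=input_ids computes the token-level cross-entropy -Σ_i log P(Z_i | Z_<i; θ) over the concatenated sequence (the labels are shifted internally). The small gpt2 checkpoint is only a stand-in for the LLaMA models fine-tuned in the paper, and the concrete strings are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # stand-in for a LLaMA checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

instruction = "Parse the sentence into a bracketed constituency tree."
sentence = "Singapore is located in Asia ."
tree = ("(S (NP (NNP Singapore)) (VP (VBZ is) (VP (VBN located) "
        "(PP (IN in) (NP (NNP Asia))))) (. .))")

z = instruction + "\n" + sentence + "\n" + tree            # concatenated sequence Z
batch = tokenizer(z, return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])            # cross-entropy over every token of Z
out.loss.backward()                                        # an optimizer step would follow
```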

Apart from full-parameter fine-tuning, we also explore low-rank adaptation (LoRA; Hu et al., 2022) for parameter-efficient learning.
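A minimal LoRA sketch with the peft library is shown below. The rank, alpha, dropout, and target modules are assumptions; the paper reports only the 1e-4 LoRA learning rate (Section 4.3), not its full adapter configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for a LLaMA checkpoint
lora_cfg = LoraConfig(
    r=8,                        # low-rank dimension (assumed)
    lora_alpha=16,              # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection; module names differ per model family
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```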

4        Experimental Setup

4.1        Datasets

In-domain We use the Penn Treebank (PTB; Marcus et al., 1993) dataset for in-domain evaluation, which covers 40k instances from the news domain. We follow the standard splits, with sections 02-21 for training, section 22 for development, and section 23 for testing.

Cross-domain To investigate the capability of large-scale language models in cross-domain learning scenarios, we conduct an evaluation on the MCTB dataset (Yang et al., 2022), which covers five domains: dialogue, forum, law, literature, and review.

The statistics of the above datasets are summarized in Table 1.

Table 1: Dataset statistics.

4.2        Baselines and basic LLMs

We consider the following non-LLM baselines for comparison:

  • SEPar (Kitaev et al., 2019) is a chart-based model consisting of a self-attentive encoder and a chart-based decoder.
  • SAPar (Tian et al., 2020) augments SEPar with the span-based attention mechanism to better learn n-gram features.
  • TGCN (Yang and Deng, 2020) is a transition-based parser which employs a graph convolutional network to encode the partial tree and generate actions.
  • LSTM (Vinyals et al., 2015) is a sequence-based parser which uses three LSTM layers together with an attention mechanism for better generation.
  • Vanilla Transformer (Vaswani et al., 2017) has been employed in parsing, the encoder of which takes the sentence as input while the decoder generates the linearized constituency tree.
  • GPT-2 (Radford et al., 2019) parser is fine-tuned by taking sentences as input and linearized constituency trees as output. We use the large version of GPT-2 for experiments.

We experiment with the following LLMs:

  • ChatGPT is a conversational version of the GPT-3.5 model (Ouyang et al., 2022). We use the turbo API from OpenAI.
  • GPT-4 (OpenAI, 2023) is the most advanced GPT model from OpenAI. We use the gpt-4 API.
  • LLaMA (Touvron et al., 2023) is a transformer decoder which is pre-trained on 1.4T tokens.
  • OPT (Zhang et al., 2022a) is a decoder-only transformer pre-trained on 180B tokens.
  • Alpaca (Taori et al., 2023) is fine-tuned from a 7B LLaMA model on 52K instruction-following data generated by GPT-3.5.
  • Vicuna (Chiang et al., 2023) is trained by fine-tuning a LLaMA model on 70K user-shared conversations.

4.3        Settings

Pre-Processing Our preliminary experiments show that punctuation marks often lead to invalid constituency trees. To reduce such errors, we replace all punctuation marks with special symbols. More details can be found in our released code.
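A minimal sketch of this pre-processing step, assuming a hypothetical punctuation-to-symbol mapping; the actual mapping is defined in the released code.

```python
# Hypothetical punctuation-to-symbol mapping; the released code defines the real one.
PUNCT_MAP = {",": "<comma>", ".": "<period>", ":": "<colon>", ";": "<semicolon>",
             "!": "<excl>", "?": "<quest>", '"': "<quote>", "(": "<lrb>", ")": "<rrb>"}

def replace_punct(tokens: list[str]) -> list[str]:
    # Swap punctuation tokens for special symbols before prompting or fine-tuning.
    return [PUNCT_MAP.get(tok, tok) for tok in tokens]

print(replace_punct("Singapore is located in Asia .".split()))
# ['Singapore', 'is', 'located', 'in', 'Asia', '<period>']
```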

Post-processing LLM-based parsers may generate constituency trees that are not necessarily valid. To address this, we add parentheses to, or remove them from, the generated tree to make it as valid as possible in terms of the bracket form. In addition, we remove any token which is not a possible continuation given the token that precedes it. Furthermore, we truncate the generated tree according to the input sentence to deal with hallucinated tokens. It should be noted that our post-processing addresses errors limited to a few tokens and cannot handle span-level errors.

Evaluation Metric Following previous work (Kitaev and Klein, 2018; Kitaev et al., 2019; Zhang et al., 2020; Yang et al., 2022; Cui et al., 2022), we evaluate model performance using the evalb tool, which measures constituent-level labeled precision (LP), recall (LR), and F1 scores (F1).
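The bracket-repair part of this post-processing might look like the sketch below, which drops unmatched closing parentheses and closes any still-open constituents; the token-filtering and truncation steps mentioned above are omitted, and the exact heuristics in the released code may differ.

```python
def balance_brackets(pred: str) -> str:
    # Drop unmatched ")" and close any constituents left open at the end.
    out, depth = [], 0
    for ch in pred:
        if ch == ")" and depth == 0:
            continue                          # unmatched closing parenthesis
        depth += 1 if ch == "(" else (-1 if ch == ")" else 0)
        out.append(ch)
    out.extend(")" * depth)                   # close still-open constituents
    return "".join(out)

print(balance_brackets("(S (NP (NNP Singapore)) (VP (VBZ is)"))
# -> "(S (NP (NNP Singapore)) (VP (VBZ is)))"
```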

Hyper-parameters. We set the decoding temperature to 0 to increase the model’s determinism. For fine-tuning, we use a learning rate of 3e-5 for the 7B and 13B models, and 1e-5 for the 33B and 65B models. We set the maximum sequence length to 2048.

Figure 4: Comparison among transition-based, span-based, bracket-based linearization strategies.

Different from full-parameter fine-tuning, we use a learning rate of 1e-4 for LoRA fine-tuning. We evaluate model performance on the validation dataset after each epoch and choose the model with the highest F1 score as the best model for testing. Our models are implemented with the PyTorch and Hugging Face Transformers libraries.

5        Results

5.1        Development Experiment Results

We first investigate which serialization format is most preferred by LLMs for constituency parsing. Specifically, we take LLaMA-7B as the base model and compare its performance on the PTB test dataset. Figure 4 compares the three serialization formats in the zero-shot, few-shot, and fine-tuning settings. In the zero-shot setting, LLMs initially exhibit weak capabilities in generating all three types of trees. This is intuitive since linearized trees are different from pre-training text and the task of constituency parsing is new to LLMs. In the few-shot setting, span-based linearization gives the best result among the three variants, with an F1 score of 13.72. The reason is that the span-based linearization is expressed in a natural language format, which is easier for LLMs to predict than the other two methods. Bracket-based linearization obtains the second-best result, about 4 points lower than the span-based method. In contrast, the performance of LLMs is poor when they are prompted to generate transition-based trees. This suggests that transition-based trees are harder to generate than bracket-based and span-based linearizations.

In addition, bracket-based linearization achieves better results than the transition-based and span-based methods in the fine-tuning setting.

Figure 5: Impact of decoding strategy.

The reason can be twofold: 1) transition-based constituency parsing requires implicitly maintaining a stack and a buffer, which is hard for LLMs to learn; 2) the transition-based and span-based linearizations are longer than the bracket-based one, and thus suffer more from error propagation. In the subsequent experiments, we opt for the bracket-based linearization strategy, as it is relatively shorter and generally delivers better performance than the transition-based method.

We then study the impact of different decoding strategies during inference. Figure 5 compares the performance of greedy search and beam search with beam sizes ranging from 2 to 5 on the PTB development dataset. In general, the decoding strategy does not impact parsing performance. In particular, for LLaMA-7B, greedy search yields an F1 score only 0.1 points lower than beam search with a beam size of 5. This suggests that the decoding strategy is not the bottleneck for LLM-based constituency parsing. Moreover, the impact of the decoding strategy on LLaMA-33B shows a similar trend to that on LLaMA-7B.
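For reference, the comparison in Figure 5 corresponds to standard Hugging Face decoding calls such as the sketch below (greedy search versus beam search with do_sample=False); gpt2 is again only a small stand-in for the fine-tuned LLaMA parsers, and the prompt text is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = "Parse the sentence into a bracketed constituency tree.\nSingapore is located in Asia .\nTree:"
inputs = tok(prompt, return_tensors="pt")

greedy = lm.generate(**inputs, max_new_tokens=64, do_sample=False)             # greedy search
beam = lm.generate(**inputs, max_new_tokens=64, do_sample=False, num_beams=5)  # beam search, size 5
print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(beam[0], skip_special_tokens=True))
```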

5.2        The Capacity of LLMs on Constituency Parsing

To study the full potential of LLMs on constituency parsing, we fine-tune LLMs on the full PTB training dataset and compare their performance with state-of-the-art methods. Table 2 presents the results of different systems on the PTB test set. Among all LLM-based methods, LLaMA-65B achieves the highest result, with an F1 score of 95.90. Compared with sequence-based baselines, fine-tuned LLMs yield substantially better results. Notably, LLaMA-65B gives an improvement of approximately 2.2 F1 points over GPT-2. This observation suggests that LLMs can substantially boost the performance of sequence-based constituency parsers.

Table 2: Fine-tuning results on PTB. LR: labeled recall. LP: labeled precision. ♡ means chart-based models. ♣ means transition-based models. ⋆ means sequence-based models. [IT] means instruction-tuned LLMs. The best results among all methods are bolded and the best sequence-based results are underlined.

Compared with the state-of-the-art chart-based and transition-based systems, LLaMA-65B gives competitive results. Specifically, LLaMA-65B outperforms the SEPar parser (95.72 F1), narrowing the gap between sequence-based and chart-based parsers. This indicates that LLM-based parsers can serve as a strong backbone for constituency parsing. LLaMA-65B produces relatively lower results than SAPar and TGCN, likely because SAPar and TGCN employ external mechanisms to learn local features.

Regarding the model scale, LLaMA-13B improves LLaMA-7B by approximately 0.18 in F1 score, while LLaMA-33B surpasses LLaMA-13B by 0.32 in F1 score. Moreover, LLaMA-65B further delivers a 0.09 F1 score improvement over LLaMA-33B. These results suggest that increasing the model scale can still bring performance improvements even with billions of parameters.

As shown in the last two rows of Table 2, both instruction-tuned models deliver slightly lower performance than vanilla LLMs. This indicates that instruction-tuning cannot benefit constituency parsing under the fine-tuning setting. The reason can be that instruction-tuning tasks primarily focus on semantics, resulting in forgetting of constituent knowledge.

Figure 6: Comparison between Full-tuning and LoRA-tuning.

Figure 6 shows the impact of LoRA-tuning (Hu et al., 2022). Compared to standard fine-tuning, LoRA-tuning yields relatively lower performance. For instance, the LoRA-tuned LLaMA-7B records an F1 score of 92.96, which is 2.35 points lower than that of LLaMA-7B. This observation is different from existing findings on sentence classification and text generation tasks (Hu et al., 2022; Xu et al., 2023), which suggest LoRA-tuning can deliver results that are either slightly lower or comparable to those achieved through fine-tuning. A potential reason could be that the output format of constituency parsing is more complex than that of other text-based tasks, thereby necessitating more substantial parameter modifications.

5.3        Are LLMs Still Few-shot Learners for Constituency Parsing?

Previous work has shown that LLMs have strong generalization abilities across tasks and domains (Wei et al., 2022; Ge et al., 2023; Zhou et al., 2023). To study this, we evaluate the generalization ability of LLMs in both in-domain and cross-domain settings. Table 3 documents the zero-shot and 5-shot learning performance of different LLMs on the PTB test set. It is noteworthy that commercial LLMs (ChatGPT and GPT-4) deliver impressive results in both scenarios, achieving F1 scores of approximately 60 and 70 in the zero-shot and 5-shot settings, respectively. In contrast, open-source LLMs show much worse performance. Specifically, all open-source LLMs deliver low results in the zero-shot setting and F1 scores below 30 in the 5-shot setting.

Table 3: Zero-shot and few-shot results on PTB.

This suggests that the syntactic knowledge acquired during pre-training cannot be directly transferred to the constituency parsing task, thereby confirming our assumption. Additionally, LLMs show better performance as model parameters increase, suggesting a positive correlation between generalization ability and model size. Furthermore, when compared with vanilla LLMs, instruction-tuned LLMs achieve relatively better results. A possible reason is that instruction tuning helps LLMs shape the output according to instructions.

To study the cross-domain generalization capability of LLMs, we evaluate the zero-shot domain adaptation performance in five cross-domain test sets. Table 4 shows the F1 score on different datasets and the relative performance reduction rate. Remarkably, ChatGPT and GPT-4 obtain impressive performance without training. Compared with the ChatGPT series, the fine-tuned models (SEPar, SAPar, GPT-2, LLaMA) give relatively better F1 scores across all cross-domain sets.

For all models, the overall performance on cross-domain testsets exhibits a large drop compared to the in-domain testset, ranging from 8.9% to 17.0%. Compared with GPT-2, LLaMAs show relatively lower performance reduction rates, and these reduction rates decrease as the model parameters increase. This suggests that larger models have relatively better generalization ability under domain shift.

The relative performance reduction rates of LLMs (including LLaMAs and GPT-2) are larger than those of SEPar and SAPar. This shows, somewhat surprisingly, that LLM-based parsers are weaker than traditional chart-based parsers in cross-domain parsing despite being pre-trained on large amounts of text. The reason can be that the sequence-based parsing paradigm lacks the ability to generalize across different domains.

Table 4: Results on cross-domain test datasets. ∆F1: relative reduction rate of F1, lower is better. The best results among all methods are bolded and the best sequence-based results are underlined.

Table 5: Valid F1/Overall F1/Invalid rate on in-domain and cross-domain testsets.

6        Analysis

We conduct error analysis to study what factors influence the performance of LLMs. We choose the fine-tuned LLaMA-7B as the basic model and analyze its parsing results on both in-domain and cross-domain testsets.

6.1        Invalidation Analysis

Different from chart-based models, a drawback of sequence-based models, including LLMs, is that there is no mechanism, such as the CKY algorithm (Pereira and Warren, 1983) or a state buffer (Yamada and Matsumoto, 2003), to ensure the validity of the generated constituency parse trees. To study the impact of invalid generated trees, we calculate the valid F1 and invalid rate of fine-tuned LLaMA-7B on the PTB test set and compare them with SEPar. As shown in Table 5, invalid trees greatly hurt the overall performance of LLaMA-7B. In particular, the F1 score on the law domain is 90.69 for valid trees, which is about 4 points higher than the overall performance. In contrast, the invalid rates of SEPar are consistently zero on all testsets.

Table 6: The number of invalid instances on in-domain and cross-domain datasets.

In addition, the invalid rates on cross-domain testsets are higher than on the in-domain testset; the reason is that cross-domain testsets contain novel patterns that do not exist in the in-domain training data.

To further understand the problem of invalid tree generation, we divide invalid trees into different groups and analyze the distribution of each group, and the results are presented in Table 6. Following the EVALB evaluation script, we categorize invalid trees into three types: length mismatch, word mismatch, and prediction failure. “length mismatch” refers to a situation where the number of words in the predicted tree differs from the original text; “word mismatch” refers to a situation where a word in the predicted constituent parse tree has been modified, resulting in a mismatch with the original text; “prediction failure” refers to a scenario where the model totally fails to predict a tree.
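A minimal sketch of this three-way categorization, assuming predictions are bracketed trees read with nltk; the actual checks follow the EVALB script and may be stricter.

```python
from nltk import Tree

def classify_prediction(pred: str, gold_words: list[str]) -> str:
    try:
        pred_words = Tree.fromstring(pred).leaves()
    except ValueError:
        return "prediction failure"        # output cannot be read as a tree at all
    if len(pred_words) != len(gold_words):
        return "length mismatch"           # different number of words than the input
    if pred_words != gold_words:
        return "word mismatch"             # a word was modified by the model
    return "valid"

print(classify_prediction("(S (NP Singapore) (VP is here))",
                          ["Singapore", "is", "located", "in", "Asia", "."]))
# -> "length mismatch"
```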

The most frequent error is the length mismatch error, accounting for 71.4% of the total number. We further categorize length mismatch errors into three subcategories: repetition, task-defocus, and other errors.

Figure 7: Distribution of hallucination sources of LLaMA-7B.

Among them, “repetition” is frequently observed in generative models and refers to the phenomenon where model outputs contain large amounts of repeated content (Holtzman et al., 2019). “Task-defocus” refers to the phenomenon where LLMs modify the given sentence before performing constituency parsing. “Others” encompasses errors such as incorrect word segmentation or missing words. In addition, word mismatch accounts for about a quarter of invalid trees. The number of word mismatch errors is larger on cross-domain testsets than on the in-domain testset. Finally, prediction failures account for a small proportion of invalid trees. Our human analysis shows that prediction failures are typically caused by informal input sentences or phrases.

6.2        Hallucination Analysis

Hallucination (Maynez et al., 2020; Bang et al., 2023) remains an unsolved problem for LLMs and has been observed in many generation tasks. In constituency parsing, we find that LLM-based parsers frequently generate constituency trees containing new tokens that do not exist in the input sentences, which is the main cause of length mismatch and word mismatch. In particular, hallucinations account for about 47.0% and 77.1% of length mismatch and word mismatch errors, respectively. To further understand this, we conduct a human evaluation to study the main factors that cause hallucinations.

Figure 7 shows the distribution of hallucination causes across all cross-domain testsets. We see that the main factor is the use of informal language (including consecutive punctuation marks such as “!!!”, “–”, etc.), with a proportion of about 69%. LLMs often pay much attention to these tokens and then generate content that is unrelated or irrelevant to the input sentence. The second source is “non-fluencies”, where generative LLMs automatically correct tokens or phrases that they consider non-fluent.

Figure 8: Comparison of model performance (F1) with respect to input sentence length (# tokens).

In addition, other informal content, such as capitalized tokens, non-English tokens, and numbers, also contributes substantially to hallucination.

6.3        Impact of Sentence Length

Since LLMs are trained with longer contexts, they are expected to handle long inputs better. To study this, we partition the test set into three subsets based on sentence length (≤20, 20-40, and ≥40 tokens) and evaluate model performance on each. We compare an LLM-based parser (LLaMA-7B) with a sequence-based baseline (GPT-2) as well as a chart-based system (SAPar).

Figure 8 shows the performance of the three models against input length. All three models give relatively lower results when parsing long input sentences. GPT-2 shows the largest performance reduction among the three models. Compared with GPT-2, LLaMA-7B sees smaller performance decreases on long sentences, showing that LLMs are more robust to long sentences. Compared with SAPar, LLaMA-7B shows comparable performance decreases when input sentences have fewer than 40 tokens, while the gap becomes larger when parsing longer sentences. This indicates that LLMs are unable to fundamentally address the performance limitation of sequence-based parsers.

6.4        Performance against Span Length

Since LLMs are trained with a longer context than traditional models, they are expected to have a better ability to predict long spans. To verify this, we calculate the F1 score by span length and compare the performance of an LLM-based parser with a traditional sequence-based parser. Figure 9 shows the F1 scores of LLaMA-7B and GPT-2 on constituent spans of different lengths.

Figure 9: Results across different span lengths for GPT-2 and LLaMA-7B on PTB dataset.

Generally, LLaMA-7B gives better results than GPT-2 in most span-length conditions. Moreover, LLaMA-7B shows larger improvements on medium and long spans, suggesting that LLM-based parsers have a better ability to learn sophisticated constituency label dependencies. Interestingly, both models show great performance drops when the span length is greater than 35, suggesting that generative models still have limitations in handling extremely long spans. This limitation may be attributed to the fact that sequence-based parsers are naturally weak at modeling long-range nested dependencies.
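A minimal sketch of this per-span-length evaluation: labeled spans are collected from the gold and predicted trees, bucketed by length, and scored with F1 per bucket. The span extraction with nltk and the bucketing granularity are assumptions about the setup.

```python
from collections import defaultdict
from nltk import Tree

def labeled_spans(tree: Tree) -> set:
    # Collect (label, start, end) for every constituent in the tree.
    spans = set()
    def visit(node, start):
        if isinstance(node, str):          # a word token
            return start + 1
        end = start
        for child in node:
            end = visit(child, end)
        spans.add((node.label(), start, end))
        return end
    visit(tree, 0)
    return spans

def f1_by_span_length(gold: Tree, pred: Tree) -> dict:
    g, p = labeled_spans(gold), labeled_spans(pred)
    buckets = defaultdict(lambda: [0, 0, 0])   # length -> [matched, gold, predicted]
    for label, s, e in g:
        buckets[e - s][1] += 1
        buckets[e - s][0] += (label, s, e) in p
    for label, s, e in p:
        buckets[e - s][2] += 1
    # F1 = 2 * matched / (gold + predicted) within each span-length bucket.
    return {n: 2 * m / (n_gold + n_pred) if n_gold + n_pred else 0.0
            for n, (m, n_gold, n_pred) in buckets.items()}

gold = Tree.fromstring("(S (NP (NNP Singapore)) (VP (VBZ is) (ADJP (JJ small))))")
pred = Tree.fromstring("(S (NP (NNP Singapore)) (VP (VBZ is) (NP (JJ small))))")
print(f1_by_span_length(gold, pred))
```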

6.5        Data Efficiency

LLMs are observed to have high data efficiency on text-to-text NLP tasks, achieving competitive performance with a small proportion (e.g., 0.5%) of the full training data (Bao et al., 2023; Chen et al., 2023a). We study this in the context of constituency parsing. Specifically, we evaluate the performance of LLaMA-7B when fine-tuned on data of different sizes and compare the results with GPT-2 and SAPar. We randomly sample {100, 500, 1000, 5000, 10000} instances from PTB for fine-tuning and evaluate model performance on the PTB test set. The results are shown in Figure 10, where we report the overall F1 score and the F1 score of valid trees for LLaMA-7B and GPT-2. Since the chart-based system SAPar does not generate invalid trees, its overall F1 score and valid-tree F1 score are identical.

As the amount of training data increases, all models exhibit improvements. Different from previous observations (Chen et al., 2023a), LLaMA-7B does not reach its peak performance until it uses the full training data.

Figure 10: Performance comparison on the PTB dataset with different training data scales.

The reason can be that the structured output of constituency parsing is more challenging for generative LLMs than that of other text-based tasks. In addition, LLaMA-7B outperforms GPT-2 in all training settings, especially when there are fewer than 5000 training instances. Also, LLaMA-7B gives better results than SAPar when the number of training instances is less than 1000. This shows that LLMs have better generalization abilities than medium-sized models. As the number of training instances increases, the performance gap between LLaMA-7B and GPT-2 narrows. Furthermore, both models suffer greatly from invalid trees when trained with limited data, and the gap between overall F1 and valid F1 narrows as the number of training instances increases. Compared with GPT-2, LLaMA-7B shows a smaller gap, indicating that LLMs are better at generating valid trees.

7        Conclusion

We investigated the use of large language models (LLMs) for constituency parsing. Our experiments demonstrate that LLMs greatly enhance the performance of sequence-based methods and can yield competitive results compared to state-of-the-art chart-based and transition-based parsers. However, we observe that LLMs do not exhibit strong few-shot and cross-domain generalization abilities in constituency parsing, and our error analysis reveals that LLM-based parsers are prone to hallucination. Our analysis further suggests that, while generative LLMs show promise for improving constituency parsing, there is still large room for improvement; future work can focus on addressing the remaining challenges, for instance, the failure to produce strictly well-formed structured output.

References

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.

Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. arXiv preprint arXiv:2305.00447.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics.

Hao Chen, Yiming Zhang, Qi Zhang, Hantao Yang, Xiaomeng Hu, Xuetao Ma, Yifan Yanggong, and Junbo Zhao. 2023a. Maybe only 0.5% data is needed: A preliminary exploration of low training data instruction tuning. arXiv preprint arXiv:2305.09246.

Yulong Chen, Yang Liu, Li Dong, Shuohang Wang, Chenguang Zhu, Michael Zeng, and Yue Zhang. 2022. AdaPrompt: Adaptive model training for prompt-based NLP. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6057–6068, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Yulong Chen, Yang Liu, Ruochen Xu, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Yue Zhang. 2023b. UniSumm and SummZoo: Unified model and diverse benchmark for few-shot summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12833–12855, Toronto, Canada. Association for Computational Linguistics.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 16–23, Madrid, Spain. Association for Computational Linguistics.

Michael Collins, Jan Hajic, Lance Ramshaw, and Christoph Tillmann. 1999. A statistical parser for czech. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics, pages 505–512.

James Cross and Liang Huang. 2016. Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1–11, Austin, Texas. Association for Computational Linguistics.

Leyang Cui, Sen Yang, and Yue Zhang. 2022. Investigating non-local features for neural constituency parsing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2065–2075, Dublin, Ireland. Association for Computational Linguistics.

Greg Durrett and Dan Klein. 2015. Neural crf parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 302–312.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California. Association for Computational Linguistics.

Daniel Fernández-González and Carlos Gómez-Rodríguez. 2020. Enriched in-order linearization for faster sequence-to-sequence constituent parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4092–4099, Online. Association for Computational Linguistics.

Leo Gao, John Schulman, and Jacob Hilton. 2023. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR.

Yingqiang Ge, Wenyue Hua, Jianchao Ji, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. 2023. Openagi: When llm meets domain experts. arXiv.

Tanya Goyal, Junyi Jessy Li, and Greg Durrett. 2022. News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In International Conference on Learning Representations.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Ming Jiang and Jana Diesner. 2019. A constituency parsing tree based method for relation extraction from abstracts of scholarly publications. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 186–191, Hong Kong. Association for Computational Linguistics.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is chatgpt a good translator? a preliminary study. arXiv preprint arXiv:2301.08745.

Hidetaka Kamigaito, Katsuhiko Hayashi, Tsutomu Hirao, Hiroya Takamura, Manabu Okumura, and Masaaki Nagata. 2017. Supervised attention for sequence-to-sequence constituency parsing. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 7–12, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Nikita Kitaev, Steven Cao, and Dan Klein. 2019. Multilingual constituency parsing with self-attention and pre-training. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3499–3505, Florence, Italy. Association for Computational Linguistics.

Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2676–2686, Melbourne, Australia. Association for Computational Linguistics.

Nathan Lambert, Louis Castricato, Leandro von Werra, and Alex Havrilla. 2022. Illustrating reinforcement learning from human feedback (RLHF). Hugging Face Blog. https://huggingface.co/blog/rlhf.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

Jiangming Liu and Yue Zhang. 2017. In-order transition-based constituent parsing. Transactions of the Association for Computational Linguistics, 5:413–424.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of english: The penn treebank. Comput. Linguist., 19(2):313–330.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

Joakim Nivre and Ryan McDonald. 2008. Integrating graph-based and transition-based dependency parsers. In Proceedings of ACL-08: HLT, pages 950–958, Columbus, Ohio. Association for Computational Linguistics.

OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Fernando CN Pereira and David HD Warren. 1983. Parsing as deduction. In 21st annual meeting of the association for computational linguistics, pages 137–144.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404–411, Rochester, New York. Association for Computational Linguistics.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132, Vancouver, British Columbia. Association for Computational Linguistics.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask prompted training enables zero-shot task generalization. In ICLR 2022-Tenth International Conference on Learning Representations.

Timo Schick and Hinrich Schütze. 2020. Exploiting cloze questions for few-shot text classification and natural language inference. Computing Research Repository, arXiv:2001.07676.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.

Yuanhe Tian, Yan Song, Fei Xia, and Tong Zhang. 2020. Improving constituency parsing with span attention. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1691–1703, Online.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. Advances in neural information processing systems, 28.

Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. 2018. A tree-based decoder for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4772–4777, Brussels, Belgium. Association for Computational Linguistics.

Albert Webson and Ellie Pavlick. 2022. Do prompt-based models really understand the meaning of their prompts?  In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300–2344, Seattle, United States. Association for Computational Linguistics.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. 2023. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196.

Jiacheng Xu and Greg Durrett. 2019. Neural extractive text summarization with syntactic compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3292–3303, Hong Kong, China. Association for Computational Linguistics.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the Eighth International Conference on Parsing Technologies, pages 195–206, Nancy, France.

Kaiyu Yang and Jia Deng. 2020. Strongly incremental constituency parsing with graph neural networks. Advances in Neural Information Processing Systems, 33:21687–21698.

Sen Yang, Leyang Cui, Ruoxi Ning, Di Wu, and Yue Zhang. 2022. Challenges to open-domain constituency parsing. In Findings of the Association for Computational Linguistics: ACL 2022, pages 112–127, Dublin, Ireland. Association for Computational Linguistics.

Songlin Yang and Kewei Tu. 2022. Bottom-up constituency parsing and nested named entity recognition with pointer networks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2403–2416, Dublin, Ireland. Association for Computational Linguistics.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022a. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Yu Zhang, Houquan Zhou, and Zhenghua Li. 2020. Fast and accurate neural CRF constituency parsing. In Proceedings of IJCAI, pages 4046–4053.

Yue Zhang and Stephen Clark. 2009. Transition-based parsing of the Chinese treebank using a global discriminative model. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09), pages 162–171, Paris, France. Association for Computational Linguistics.

Zhuosheng Zhang, Shuohang Wang, Yichong Xu, Yuwei Fang, Wenhao Yu, Yang Liu, Hai Zhao, Chenguang Zhu, and Michael Zeng. 2022b. Task compass: Scaling multi-task pre-training with task prefix. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5671–5685, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2023. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.