
Interpretable-by-Design Text Classification with Iteratively Generated Concept Bottleneck

October 30, 2023

Josh Magnus Ludan, Qing Lyu, Yue Yang, Liam Dugan, Mark Yatskar, Chris Callison-Burch

University of Pennsylvania

{jludan, lyuqing, yueyang1, ldugan, myatskar, ccb}@seas.upenn.edu

Abstract

Deep neural networks excel in text classification tasks, yet their application in high-stakes domains is hindered by their lack of interpretability. To address this, we propose Text Bottleneck Models (TBMs), an intrinsically interpretable text classification framework that offers both global and local explanations. Rather than directly predicting the output label, TBMs predict categorical values for a sparse set of salient concepts and use a linear layer over those concept values to produce the final prediction. These concepts can be automatically discovered and measured by a Large Language Model (LLM), without the need for human curation. On 12 diverse datasets, using GPT-4 for both concept generation and measurement, we show that TBMs can rival the performance of established black-box baselines such as GPT-4 fewshot and finetuned DeBERTa, while falling short against finetuned GPT-3.5. Overall, our findings suggest that TBMs are a promising new framework that enhances interpretability, with minimal performance tradeoffs, particularly for general-domain text.

1        Introduction

Interpretability has become a critical aspect of deep learning systems, especially in high-stakes domains such as law, finance, and medicine, where understanding and analyzing model behavior is crucial (Bhatt et al., 2020; Dwivedi et al., 2023). A promising line of work focuses on “self-interpretable” models, which provide built-in explanations along with their predictions (Du et al., 2019; Linardatos et al., 2020). These model-provided explanations can come in various forms: token-level importance scores, influential training examples, or even free text. However, these types of explanations oftentimes provide only local justification for individual predictions and fail to offer global insights into the

Figure 1: Unlike end-to-end black-box models (left), Text Bottleneck Models (right) first discover and measure a set of human-interpretable concepts and then predict the label with a linear layer.

overarching principles that guide model behavior (Bhatt et al., 2020).

An alternative form of explanation that addresses this issue is concept-based explanations (Madsen et al., 2022). A concept is an abstract feature representing some aspect of the input text, such as “food quality” for a restaurant review. Concept-based explanations can provide both global and local insights by identifying important concepts across the dataset and localizing how these concepts relate to each individual prediction. However, concept-based approaches typically involve extensive human labor to implement, since they require experts to curate a set of concepts for each new task, and the concept values need to be further annotated on each training example (Abraham et al., 2022). Additionally, current approaches often lack sparsity, including hundreds or even thousands of concepts in their explanations (Rajagopal et al., 2021). With such large concept spaces, it remains difficult to draw useful takeaways on the global behavior of the model (Ramaswamy et al., 2022).

In this work, we propose Textual Bottleneck Models (TBMs), an extension of Concept Bottle-

Figure 2: Demonstration of the system with an example from the CEBaB (Abraham et al., 2022) dataset. Given an input example (a restaurant review), Concept Generation (a) iteratively discovers new concepts (e.g., “Restaurant Variety”). Concept Measurement (b) measures the value of each concept by identifying relevant snippets (e.g., “food for everyone”) and providing a numerical concept score (e.g., +1). Finally, the Prediction Layer (c) aggregates all concept scores for the input and learns their relative weights to make the final prediction of the task label.

neck Models (CBMs) from the vision domain (Koh et al., 2020) to text classification and regression tasks. Our system has three modules, all fully automated: Concept Generation, Concept Measurement, and Prediction Layer, as shown in Figure 2. Given a dataset of input texts (e.g., restaurant reviews), the Concept Generation module iteratively discovers a sparse set of concepts (e.g., “Restaurant Variety”) that help discriminate between texts with different output labels. The Concept Measurement module then determines the value of each concept (e.g., “wide variety”) for a text as a numerical score (e.g., +1). Finally, these concept scores are aggregated into the final prediction by a white-box Prediction Layer (e.g., a linear layer).

Using GPT-4 to generate and measure concepts, we evaluate our system on 12 diverse datasets, spanning from fake news detection to sentiment classification. TBMs perform competitively with strong black-box baselines including GPT-4 fewshot and finetuned DeBERTa, but lag behind state-of-the-art models like finetuned GPT-3.5. In particular, TBMs are highly competitive for sentiment comprehension and natural language inference tasks, though there is room for improvement in specialized domains like news and science.

To understand where the error comes from, we perform a manual evaluation of each module. We find that the Concept Generation module can consistently generate relevant and unambiguous concepts, but can occasionally struggle with redundancy and leakage. The Concept Measurement module is found to score the majority of concepts in sentiment analysis with high accuracy, whereas those in fake news detection are harder to measure, which might be a reason behind the performance difference in these domains. Finally, the concept learning curves make it transparent what concepts are learned over time and their relative impact, which can offer valuable insights for model understanding and debugging.

Our contributions are as follows:

  1. We introduce TBMs, a text classification framework that provides both global and local interpretability, by automatically constructing sparse concept bottlenecks using LLMs without any human effort.
  2. We demonstrate that, on average, TBMs perform competitively with strong, but not state-of-the-art, black-box baselines across 12 diverse datasets.
  3. We provide an in-depth human evaluation and analysis of each module in the TBM and show how the system allows for easier model interaction and debugging.

2        Related Work

Self-interpretable NLP models aim to provide a built-in explanation along with the prediction, without relying on post-hoc explanation methods. They offer diverse forms of explanation. Token-based explanations, such as rationales (Lei et al., 2016; Bastings et al., 2019), provide a span of important tokens that are minimally sufficient for the prediction. Example-based explanations (Han et al., 2020; Das et al., 2022) identify the training examples most similar to the example being classified. Free-text explanations (Camburu et al., 2018; Nye et al., 2021; Wei et al., 2022) generate a free-form natural-language justification for the prediction. We note that these only provide local interpretability; our approach differs in that it provides both local and global insights into model behavior owing to the use of concept-based explanations.

Concept Bottleneck Models were first introduced by Koh et al. (2020) for vision tasks such as image classification. In their work, they tasked experts with manually crafting a set of human-interpretable concepts that then became the only input for a classifier model. Stakeholders could then intervene on these concepts and correct them, allowing easier analysis of model behavior. Collins et al. (2023) describe several problems with CBMs, such as information leakage (Mahinpei et al., 2021) and having too many concepts (Ramaswamy et al., 2022). Information leakage makes the concept bottleneck unfaithful (Jacovi and Goldberg, 2020; Lyu et al., 2022) by effectively encoding the labeling task itself as a concept. Having too many concepts causes information overload for the user, preventing them from developing a general understanding of model behavior. We note that these problems can also exist in the text domain, so we carefully evaluate them in our manual analysis. To reduce the cost of concept generation, previous work in computer vision has also used LLMs to automate this process for image classification (Yang et al., 2023; Pratt et al., 2023). Our work extends this method to the text domain, with additional benefits such as sparsity.

Concept-based explanations in NLP can be broadly categorized into two lines of work. The first focuses on mechanistic interpretability, analyzing what latent concepts are represented by different neurons in pre-trained LMs (Sheng and Uthus, 2020; Bills et al., 2023; Vig et al., 2020). The second focuses on explaining why models make certain decisions, providing explicit concepts as supporting evidence for predictions (Rajagopal et al., 2021; Wu et al., 2023). Our work belongs to the second category.

Within this category, SELF-EXPLAIN (Rajagopal et al., 2021) is an explainable framework that jointly predicts the final label and identifies both globally similar concepts from the training set and locally relevant concepts from the current example. Notably, there is no bottleneck structure in their approach, which makes information leakage more likely. Also, they define each phrase (e.g., “for days”, “the lack of”, etc.) in each example as a concept, resulting in an enormous concept space of hundreds of thousands of phrases. By contrast, our concepts are high-level, categorical features, resulting in a sparse space of ≤ 30 concepts for each dataset, from which it is easier to draw useful takeaways. Another representative work (Wu et al., 2023) trains a Causal Proxy Model that mimics the behavior of a black-box model using human-annotated counterfactual data. Our definition of concepts is consistent with theirs, but our method does not require expert data curation.

3        Method

Figure 2 provides an overview of our system. It consists of three components: Concept Generation, which iteratively discovers new concepts using misclassified examples; Concept Measurement, which measures the concept scores for each example; and the Prediction Layer, which predicts the output label with only the concept scores as input. The first two modules are implemented by prompting an LLM, and the last module is implemented by training a linear layer.

3.1        Method Formulation

We describe the structure of TBMs as follows: Given a text classification or regression dataset with a training set D_train and a test set D_test, each instance can be denoted as a text-label pair (t, y). During training, we generate a set of N concepts C = {c_1, c_2, ..., c_N} using D_train, where each concept c_i is a categorical feature (e.g., “restaurant variety”) with multiple possible values (e.g., high, low, mixed, or unmentioned). For each text t_train, we measure the values of all concepts as a list of numerical scores, [s(t_train, c_i) | c_i ∈ C] (e.g., +1, −1, 0). The sign of the score represents the polarity of a concept in the text, i.e., a positive/negative score

Table 1: JSON Representation for the concept “Build Quality” for a hypothetical product review dataset included in the Concept Generation prompt as an in-context example.

indicates that the concept is positively/negatively reflected, and a zero score represents uncertainty or absence of the concept. The magnitude of the score represents the intensity of a concept, with a larger magnitude indicating higher intensity. These concept scores are then used as the only input to train a white-box prediction layer to predict the label y_train. During inference, given a new input text t_test ∈ D_test, we measure the score of each concept in the generated concept set, [s(t_test, c_i) | c_i ∈ C], and use the trained prediction layer to predict the final label y_test.
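As a minimal sketch of this data flow, the snippet below represents a text as its vector of concept scores. The concept names, the toy keyword-based scorer, and the example review are illustrative placeholders; in the actual system, the scores come from the LLM-based Concept Measurement module (Sec 3.4).

```python
from typing import Callable, List

Concept = str
ScoreFn = Callable[[str, Concept], float]  # s(t, c), e.g. in {-1, 0, +1}

def concept_vector(text: str, concepts: List[Concept], score: ScoreFn) -> List[float]:
    """Represent a text as its list of concept scores [s(t, c_i) | c_i in C]."""
    return [score(text, c) for c in concepts]

# Toy stand-in for the LLM-based Concept Measurement module.
def toy_score(text: str, concept: Concept) -> float:
    keywords = {"Food Quality": ("delicious", "bland"),
                "Restaurant Variety": ("variety", "limited")}
    pos, neg = keywords[concept]
    if pos in text.lower():
        return +1.0
    if neg in text.lower():
        return -1.0
    return 0.0  # concept absent or uncertain

concepts = ["Food Quality", "Restaurant Variety"]
review = "Delicious food, but a very limited menu."
print(concept_vector(review, concepts, toy_score))  # -> [1.0, -1.0]
```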

In the following sections, using Figure 2 as a running example, we describe our specific implementation of each TBM module in terms of how concepts are represented, generated and measured, and how these concept measurements are turned into predictions.

3.2        Concept Representations

Each concept consists of the following components, represented as a JSON object in our prompts:

  • Concept Name: The name of the concept.
  • Concept Description: A description of the concept and the factors relevant to measuring it.
  • Concept Question: The question we use to measure the concept value.
  • Possible Responses: The set of possible responses to the concept question.
  • Response Guide: A list of criteria for possible responses, to guide the process of answering the concept question.
  • Response Mapping: A dictionary mapping each possible response to a numerical score.

Table 1 shows an example representation of the concept “Build Quality” for a product review dataset. The concept question and response guide are particularly important during the Concept Measurement stage.
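To make the representation concrete, the snippet below spells out such a record as a Python dictionary with the six fields listed above. It is a hypothetical reconstruction; the exact wording of the “Build Quality” entry in Table 1 is not reproduced here.

```python
# Hypothetical concept record mirroring the fields described in Sec 3.2.
build_quality = {
    "concept_name": "Build Quality",
    "concept_description": "Whether the review comments on how sturdy or "
                           "well-made the product feels, including materials "
                           "and workmanship.",
    "concept_question": "How does the review describe the build quality of "
                        "the product?",
    "possible_responses": ["Positive", "Negative", "Mixed", "Unmentioned"],
    "response_guide": {
        "Positive": "The review praises sturdiness, materials, or finish.",
        "Negative": "The review reports flimsiness, defects, or breakage.",
        "Mixed": "The review mentions both strengths and weaknesses.",
        "Unmentioned": "Build quality is not discussed.",
    },
    "response_mapping": {"Positive": 1, "Negative": -1, "Mixed": 0, "Unmentioned": 0},
}
```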

3.3        Concept Generation

At a high level, we generate concepts by prompting an LLM to iteratively discover new concepts that help discriminate between misclassified examples. As outlined in Algorithm 1, given the training set (e.g., restaurant reviews), we initialize the TBM with an empty concept set C. In each iteration, to generate a new concept c, we first identify training examples that have similar representations in the existing concept space but have a high prediction error under the current Prediction Layer. For example, if the current concept space C contains only “Atmosphere” (c_1) and “Food Quality” (c_2), then the two reviews Great food and ambiance, but quite limited choices on the menu (3-star) and Food, atmosphere, variety of choices… everything was excellent! (5-star) will both be represented as [+1, +1]. However, a new concept “Restaurant Variety” can help differentiate between them. Therefore, we construct the concept generation prompt (GeneratePrompt) using the dataset metadata (description and labelling scheme) and these hard examples as in-context exemplars, in order to encourage the generation of a new discriminative concept. To reduce concept duplication, we also include the list of previously generated concepts in this prompt. Taking GeneratePrompt as input, the LLM generates a new candidate concept c, which is then refined through RefinePrompt. RefinePrompt

contains a few examples of problematic concepts, such as those with ambiguous questions or invalid JSON formatting, and how they are fixed. The resulting refined concept c, together with the existing concept set C, is used to train a new Prediction Layer, yielding a candidate TBM. If the candidate TBM outperforms the existing TBM on a random subset of D_train by at least some threshold γ, the concept is retained; otherwise, it is discarded. This procedure is executed iteratively for N cycles, resulting in the final concept set C.
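Since Algorithm 1 itself is not reproduced in this text, the sketch below summarizes the loop just described. All callables (llm, evaluate, hard_examples, generate_prompt, refine_prompt) are stand-ins supplied by the caller, not the paper's actual code.

```python
def generate_concepts(train_set, metadata, llm, evaluate, hard_examples,
                      generate_prompt, refine_prompt, n_iters, gamma):
    """Iteratively grow the concept set, keeping only concepts that help."""
    concepts = []
    best = evaluate(train_set, concepts)          # score of the empty bottleneck
    for _ in range(n_iters):
        # 1. Examples that look alike in the current concept space but are
        #    badly predicted by the current Prediction Layer.
        exemplars = hard_examples(train_set, concepts)
        # 2. Propose a new discriminative concept (GeneratePrompt), conditioned
        #    on dataset metadata and previously generated concepts.
        candidate = llm(generate_prompt(metadata, exemplars, concepts))
        # 3. Repair ambiguity / formatting issues (RefinePrompt).
        candidate = llm(refine_prompt(candidate))
        # 4. Retain the concept only if the candidate TBM beats the current one
        #    on a random subset of the training data by at least gamma.
        score = evaluate(train_set, concepts + [candidate])
        if score > best + gamma:
            concepts.append(candidate)
            best = score
    return concepts
```

The γ threshold is what keeps the bottleneck sparse: a candidate concept survives only if it measurably improves the downstream Prediction Layer.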

3.4        Concept Measurement

With the generated concept set C, the Concept Measurement module determines the scores [s(t, c_i) | c_i ∈ C] for any given text t. To measure a concept, we prompt an LLM in a zero-shot fashion to answer the concept question associated with that concept, using the concept description and response guide as context (see Sec 3.2). For instance, consider the concept “Restaurant Variety” in Figure 2. Given a restaurant review, the concept question asks, “How does the review describe the variety and originality of the restaurant?” The possible answers could be “wide variety”, “low variety”, and so on. The response given by the LLM is then converted into a numerical concept score using the concept’s response mapping (+1 for Positive, -1 for Negative, 0 otherwise). In addition to the categorical answer, the prompt also instructs the LLM to provide relevant snippets in the input text as supporting evidence, for example, “food for everyone, with 4 generations to feed” as a supporting snippet for “wide variety”.
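A minimal sketch of this step is shown below, assuming the dictionary-style concept representation from Sec 3.2. The prompt wording, the ask_llm callable, and the toy concept are illustrative placeholders, not the paper's actual prompt (given in Appendix E.3).

```python
import json

def measure_concept(text, concept, ask_llm):
    """Zero-shot measurement: ask the concept question, map the answer to a score."""
    prompt = (
        f"Concept: {concept['concept_description']}\n"
        f"Question: {concept['concept_question']}\n"
        f"Options: {', '.join(concept['possible_responses'])}\n"
        f"Guide: {json.dumps(concept['response_guide'])}\n\n"
        f"Text: {text}\n"
        "Answer with one option and quote the supporting snippets."
    )
    answer, snippets = ask_llm(prompt)                  # e.g. ("wide variety", [...])
    score = concept["response_mapping"].get(answer, 0)  # unmapped answers default to 0
    return score, snippets

# Toy stand-in for the zero-shot LLM call and a hypothetical concept record.
fake_llm = lambda prompt: ("wide variety", ["food for everyone, with 4 generations to feed"])
variety = {
    "concept_description": "Breadth of menu options for different diners.",
    "concept_question": "How does the review describe the variety of the restaurant?",
    "possible_responses": ["wide variety", "low variety", "mixed", "unmentioned"],
    "response_guide": {"wide variety": "Many options are praised.",
                       "low variety": "Limited options are criticized.",
                       "mixed": "Both are mentioned.",
                       "unmentioned": "Not discussed."},
    "response_mapping": {"wide variety": 1, "low variety": -1, "mixed": 0, "unmentioned": 0},
}
print(measure_concept("Food for everyone, with 4 generations to feed.", variety, fake_llm))
```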

3.5        Prediction Layer

To combine the concept scores [s(t, c_i) | c_i ∈ C] into a final prediction y, we train a Prediction Layer on D_train, using linear regression for regression tasks and logistic regression for classification tasks. It learns a weight associated with each concept using y as the supervision signal. For a new input example at inference time, its measured concept scores are multiplied by their weights and summed into the final prediction logit. For example, in Figure 2, across the dataset, “Customer Recommendation” and “Food Quality” are the most important concepts, while “Restaurant Variety” is less crucial. On the given review, “Customer Recommendation” and “Restaurant Variety” are positively scored, but “Atmosphere” and “Value for Money” are negatively scored. Their weighted sum results in a final prediction of 3 stars.

Finally, the concept weights provide a global explanation for their relative importance across the dataset, and the concept scores and supporting snippets provide a local explanation for the decision on each individual example.
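The Prediction Layer is the only trained component and can be sketched in a few lines of scikit-learn, the library named in Sec 4; the concept names and score matrix below are synthetic illustration data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

concepts = ["Food Quality", "Service", "Restaurant Variety"]
# Rows are training examples, columns are concept scores from the Measurement module.
X_train = np.array([[+1, +1,  0],
                    [-1,  0, -1],
                    [+1, -1, +1],
                    [ 0, -1, -1]])
y_reg = np.array([5.0, 2.0, 4.0, 2.5])   # e.g. star ratings (regression task)
y_clf = np.array([1, 0, 1, 0])           # e.g. positive / negative (classification task)

reg = LinearRegression().fit(X_train, y_reg)     # regression tasks
clf = LogisticRegression().fit(X_train, y_clf)   # classification tasks

# Global explanation: one learned weight per concept.
print(dict(zip(concepts, reg.coef_.round(2))))

# Local explanation for a new review: weighted concept scores -> prediction.
x_new = np.array([[+1, 0, -1]])
print(reg.predict(x_new), clf.predict(x_new))
```

Because the layer is linear, its coefficients double as the global explanation, and a single dot product with a new example's concept vector gives the local one.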

4        Experimental Setup

Implementation Details.  We use GPT-4 (GPT-4-0613) (OpenAI, 2023) as the underlying LLM for both Concept Generation and Concept Measurement and use Scikit-learn (Pedregosa et al., 2011) to implement linear and logistic regression (with default parameters). See Appendix D for implementation details and prompts.

Datasets. We evaluate on a total of 12 datasets, which we split into 7 “general domain” and 5 “specialized domain” datasets. Specialized-domain tasks include Fake News Detection (Zhong et al., 2023), News Partisanship Classification (Kiesel et al., 2019), Citation Intent Detection (Cohan et al.,

Table 2: Model performance on 12 datasets. ✗ and ✓ denote whether the model is interpretable or not. For each dataset, the highest performance is bold, and the second highest is underlined.

Figure 3: Average performance on 4 General-domain classification datasets (left), 3 General-domain regression datasets (middle), and 5 Specific-domain classification datasets (right).

2019), AG News (Gulli, 2004), and Patent Classification (Sharma et al., 2019). General-domain tasks include Stanford Natural Language Inference (SNLI) (Bowman et al., 2015), Hate Speech Detection (Kennedy et al., 2020), and five sentiment analysis datasets (Rotten Tomatoes (Pang and Lee, 2005), Amazon reviews (McAuley and Leskovec, 2013), Yelp reviews (Zhang et al., 2015), CEBaB (Abraham et al., 2022), and Poem Sentiment (Sheng and Uthus, 2020)). They differ in that the former requires domain-specific knowledge (mainly in the news and science domains) to solve, whereas the latter can be solved mostly based on common sense and world knowledge. More details of these datasets can be found in Appendix B. Three of these datasets involve a regression task (CEBaB, Yelp, and Hate Speech), while the rest involve classification. With few noted exceptions, we train TBMs using 250 examples and test on 250 examples.

Evaluation Metrics. We evaluate TBMs in three ways. First, we compute end-to-end performance (Mean Squared Error (MSE) for regression and accuracy for classification) against the baselines above. Second, we evaluate the Concept Generation and Concept Measurement modules using human annotation (see metrics in Sec 5). Finally, we analyze model behavior through the concept learning curves (Sec 6).

5        Results

5.1       End-to-End Performance

TBMs perform competitively with black-box baselines except for finetuned GPT-3.5. As shown in Figure 3, TBMs achieve the second-highest average accuracy across all sentiment classification datasets (0.89) and the second-lowest average MSE (0.713) across all regression datasets, surpassing all the baselines except for finetuned GPT-3.5. Compared to black-box baselines such as

Figure 4: Expert concept annotations for concept generation quality on five aspects: Redundancy (Rdy) captures concept duplication, with “bad” indicating repetition; Relevance (Rlv) captures pertinence to the task, with “bad” identifying spurious concepts; Leakage (Lkg) checks whether the concept directly performs the task, with “bad” indicating leakage; Objectivity (Obj) captures measurability clarity, with “bad” indicating subjectivity; and Difficulty (Dfc) checks the complexity of measuring the concept, with “bad” meaning the concept is harder to measure than the dataset task itself.
Figure 5: Human evaluation of concept measurement. Machine-human correlation is the Pearson correlation between the concept scores measured by the LLM and those from human annotators. Exact Match is the accuracy of the LLM in predicting the exact string label for a concept, using the human annotation as the gold standard.

GPT-4 fewshot and finetuned DeBERTa, TBMs exhibit competitive and consistent performance. These results are particularly surprising given that, compared to black-box models, TBMs have access to much less information due to the concept bottleneck, yet still maintain performance. By contrast, the interpretable baseline, Naive Bayes, falls far behind.

Zooming into individual datasets in Table 2, TBMs achieve the highest or second highest performance on 5 datasets. On the remaining 7 datasets, TBMs are visibly outperformed by black-box baselines, but the performance gap tends to be small.

On datasets where the TBMs underperform the best model, the average performance difference between TBMs and the best model is 14% (9.6% when excluding finetuned GPT-3.5). This gap shrinks for sentiment classification tasks (Rotten Tomatoes, Amazon Reviews, Poem Sentiment), where the average performance gap is 1.4%, indicating minimal interpretability-performance trade-offs for this domain.

TBMs excel on general-domain texts but struggle with domain-specific texts. After further examining the results in different domains, we observe that TBMs perform well on general-domain tasks, including sentiment comprehension and natural language inference. However, they fall behind on specialized-domain tasks, including those in the news and science domains. Below, we compare the performance of TBMs against all other baselines excluding finetuned GPT-3.5 to allow for a cleaner comparison with these established baselines.

Among all 7 general-domain datasets, compared to all baselines except finetuned GPT-3.5, TBMs achieve the best performance on 3 of them (Rotten Tomatoes, Poem Sentiment, and Hate Speech). On Amazon Reviews and SNLI, TBMs closely match the best baseline, with an accuracy difference ≤ 0.004. The only exceptions are CEBaB and Yelp, where TBMs are outperformed by a large margin (relative differences of 37% and 10%), which we do not yet fully understand. Overall, we hypothesize that the encouraging performance on these tasks comes from the fact that they do not require domain-specific knowledge, making it easier for LMs to discover concepts by relying on knowledge learned during pretraining.

In specialized domains such as news (Fake News, AG News, News Partisanship) and science (Patent, SciCite), TBMs are consistently outperformed by either GPT-4 fewshot or finetuned DeBERTa. We postulate that this is because it is more challenging for LMs to discover relevant concepts for these tasks in a zero-shot fashion, without domain-specific knowledge. Another factor may be that certain generated concepts in specialized domains, such as “Fact Checking” for detecting fake news, are as difficult to measure as the target label itself. It is therefore challenging for the Concept Measurement module to accurately assess the concept value in a zero-shot manner. This hypothesis is further investigated in Sec 5.3. Other potential factors remain to be explored in future work.

Overall, all the above results demonstrate that TBMs are competitive with GPT-4 fewshot and finetuned DeBERTa on average, with exceptional performance on sentiment classification and NLI tasks, but still have room for improvement in domains such as news and science. To further understand where the error comes from, we manually evaluate each module in the TBM pipeline in the next two subsections.

5.2        Concept Generation Module Evaluation

To assess the Concept Generation module, we manually evaluate the generated concepts along five aspects: Redundancy, Relevance, Leakage, Objectivity, and Difficulty, each explained in the caption of Figure 4. Three annotators, who are all authors of this paper, perform this evaluation for each concept on six datasets, with conflicts resolved by a simple majority vote.

According to Figure 4, across all datasets, the overwhelming majority of concepts are of high quality, except for Poem Sentiment. On average, Redundancy emerges as the most common issue (25%), followed by Leakage (15%), while the other issues, including Difficulty (9%), Objectivity (6%), and Relevance (1%), are less frequent. This suggests that the module has almost no problem discovering concepts that are relevant to the task label and can mostly ensure that the concepts are unambiguous and easy to measure. However, the Concept Generation module occasionally accepts unnecessary concepts that are too similar to previously generated ones or concepts that directly leak the task label. The prevalence of these issues varies across datasets. For instance, Poem Sentiment shows high concept error rates in almost all aspects except Relevance, while Hate Speech concepts have mostly redundancy issues.

Redundant concepts unnecessarily increase the size of the concept space, which can increase the cognitive load of users trying to interpret the model behavior. Leaky concepts can undermine the faithfulness of provided explanations, making the “self-explanatory” claim invalid. To mitigate these issues, we are exploring other heuristics to filter problematic concepts during generation, in addition to the performance improvement threshold.

5.3        Concept Measurement Module Evaluation

To determine whether the Concept Measurement module measures concepts correctly, we compare the concept scores rated by the LLM with those rated by humans on the CEBaB and Fake News datasets. We asked a group of crowdworkers to answer the questions generated by the model for each concept, with the concept description and response guide as additional context. This is the same information that the LLM receives when performing Concept Measurement. We compute the exact match and correlation between the human and LLM judgments. If annotators do not have a clear majority decision for an instance, it is labeled as “uncertain”.
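The two agreement metrics can be computed directly from paired judgments; the sketch below uses SciPy on synthetic LLM and human scores (the real evaluation compares string labels for exact match, which we approximate here with numeric scores).

```python
import numpy as np
from scipy.stats import pearsonr

human = np.array([+1, -1, 0, +1, -1, 0, +1])   # majority-vote human concept scores
llm   = np.array([+1, -1, 0, +1,  0, 0, -1])   # scores from the Concept Measurement module

corr, _ = pearsonr(human, llm)                  # machine-human correlation
exact_match = float(np.mean(human == llm))      # fraction of identical judgments
print(f"Pearson r = {corr:.3f}, exact match = {exact_match:.3f}")
```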

Figure 5 (a) shows the histogram of the correlations and accuracies for all the concepts in the CEBaB dataset. We see that the TBM can measure a majority of the concepts it generates accurately: the median correlation and accuracy are high at 0.814 and 0.893 respectively, with the average being 0.759 for correlation and 0.824 for accuracy. This level of agreement is remarkable since concept measurement is done in a zero-shot manner, with no training data about the specific concept being measured.

In contrast, the performance for the Fake News dataset, as shown in Figure 5 (b), is modest: the median correlation and accuracy are 0.317 and 0.549, respectively, while the average scores are 0.305 for correlation and 0.571 for accuracy. Most of this reduced performance comes from hard-to-measure concepts such as “Fact Checking”, where the LLM asserts that a text can be fact-checked despite no access to external resources.

Meanwhile, this stark difference in performance between the two datasets reflects the transparency and auditability of TBMs. The exemplary performance on the CEBaB dataset validates the potential

Figure 6: Concept learning curves of TBMs on 6 datasets. The x-axis represents the TBM’s performance (MSE for regression task and Accuracy for classification tasks) at each iteration, and the y-axis indicates the specific concept added to the bottleneck during that iteration. The size of each node is determined by the magnitude of the weight of the corresponding concept in the prediction layer.

and effectiveness of this module. Simultaneously, the suboptimal results on Fake News provide clear signals for potential pitfalls that require human intervention and debugging.

6        Analysis of Learning Curves

One unique advantage of TBMs is that their interpretable structure allows users to analyze model behavior in a more granular way than black-box models. We demonstrate this by plotting the concept learning curves of TBMs on four of our datasets in Figure 6. These learning curves show how the TBM’s performance changes on the test data as it iteratively generates new concepts. The figure also shows the importance of each concept, calculated as the absolute value of its weight learned by the final prediction layer. These curves make it easy to see how the system’s performance changes as it progressively adds new concepts to the bottleneck, allowing users to directly identify the most helpful concepts. For example, in the Yelp dataset, the introduction of the “Customer Recommendation” concept led to a significant drop in MSE. However, the concept “Emotional Intensity” appears less informative for the Poem Sentiment task, as accuracy decreases after it is added to the bottleneck.

Interestingly, we also observe that the most important concepts in terms of weight are not always discovered immediately. Instead, they can still show up at later stages of iteration. For example, in CEBaB, “Expectations Met” has one of the highest importance weights but is discovered last.

These learning curves can be contrasted with the learning curves in black-box models, where sudden increases in model performance require in-depth investigation to identify the cause of improvement. We include two additional examples of how this added interpretability can be useful in our Appendix. Appendix A.1 shows an analysis of how the performance of different training runs on the same dataset can be explained using the discovered concepts, and Appendix A.2 shows how we can explain the overfitting of our TBM on a small dataset based on the discovery of spurious concepts.
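For reference, a plot in the style of Figure 6 can be produced with a few lines of matplotlib; the concepts, MSE values, and weights below are made-up illustration data, not numbers from the paper.

```python
import matplotlib.pyplot as plt

concepts = ["Food Quality", "Service", "Customer Recommendation", "Variety"]
mse = [1.20, 0.95, 0.55, 0.52]        # test MSE after each concept is added
weights = [0.8, 0.5, 1.4, 0.2]        # |weight| of each concept in the prediction layer

fig, ax = plt.subplots()
ax.plot(mse, range(len(concepts)), color="gray", zorder=1)
ax.scatter(mse, range(len(concepts)), s=[300 * w for w in weights], zorder=2)
ax.set_yticks(range(len(concepts)))
ax.set_yticklabels(concepts)
ax.set_xlabel("Test MSE (lower is better)")
ax.set_ylabel("Concept added at each iteration")
plt.tight_layout()
plt.show()
```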

7        Discussion

In this section, we discuss some additional benefits of using TBMs.

TBMs allow users to intuitively interact with the concept bottleneck. Since concepts are fully represented in natural language, practitioners can easily add or delete concepts in the concept space without relying on an LLM. This allows experts to directly inject domain-specific inductive bias at a high level of abstraction. Additionally, they can tweak how concepts are measured by simply rewriting the instructions in the Concept Measurement prompt. For example, if a practitioner wants to increase the granularity of the concept “Noise”, which currently has two options, “noisy” or “not noisy”, they can edit the responses to add options such as “moderately noisy” and “unbearably noisy”. These interactions make it easier to steer the behavior of TBMs compared to black-box models.

TBMs can be used to characterize domain shifts. In addition to greater applicability in high-stakes domains, TBMs provide a more grounded handle that we can use to characterize domain shifts. For example, a hypothetical TBM fit on predicting popular movies on dataset A can be re-fit on dataset B using the same concepts. If dataset A contains reviews from expert critics and dataset B contains reviews from casual fans, the difference in reviewer tendencies can be described based on the shifts in the concept weights. For instance, “character quality” may have a higher weight in dataset A compared to dataset B, indicating that expert critics might have placed a greater emphasis on well-written characters.
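As an illustration of this re-fitting procedure, the sketch below fits the same hypothetical concept set on two synthetic datasets and compares the learned weights; all names and data are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

concepts = ["character quality", "plot pacing", "special effects"]

# Concept scores and popularity labels for two hypothetical review datasets:
# A (expert critics) and B (casual fans). Random data stands in for real measurements.
rng = np.random.default_rng(0)
X_a, y_a = rng.choice([-1, 0, 1], (50, 3)), rng.random(50)
X_b, y_b = rng.choice([-1, 0, 1], (50, 3)), rng.random(50)

w_a = LinearRegression().fit(X_a, y_a).coef_   # weights learned on dataset A
w_b = LinearRegression().fit(X_b, y_b).coef_   # weights learned on dataset B (same concepts)

for name, wa, wb in zip(concepts, w_a, w_b):
    print(f"{name:20s} A: {wa:+.2f}  B: {wb:+.2f}  shift: {wb - wa:+.2f}")
```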

8        Conclusion

In this paper, we present Text Bottleneck Models (TBMs)—an innovative text classification framework that is interpretable by construction. TBMs provide both global and local interpretability with sparse concept-level explanations, allowing users to understand the general principles being used for inference as well as the specific reasoning for individual examples. TBMs can be fully automated, requiring no human-curated concept set. In our evaluation, we show that TBMs achieve competitive performance against strong black-box baselines such as GPT-4 fewshot and finetuned DeBERTa across 12 diverse text regression and classification datasets, despite being constrained by an information bottleneck. Human evaluations reveal that the concepts generated by the system are mostly relevant and objective, but issues of redundancy and leakage remain. Overall, we demonstrate that TBMs are a promising general architecture for constructing a highly interpretable predictor with minimal performance trade-offs on general-domain text.

9        Limitations

Scalability. Given the heavy reliance of the model on large language models for concept measurement, the current implementation is not very scalable. This is because, for every text we want to measure, the number of times we have to run inference on an LLM is equal to the number of concepts in the bottleneck. To improve the scalability of the system, it may be possible to finetune smaller language models that perform the concept measurement after the TBM has scored enough texts. This will reduce the number of LLM calls to only scale with regard to the number of concepts, rather than scaling in proportion with both the number of texts and the number of concepts.

Redundant and Leaky Concepts. The analysis of generated concepts reveals the existence of duplicate concepts and concepts that leak classification labels in some datasets. To mitigate these issues, future work can include steps in concept generation to filter problematic concepts since currently we only filter concepts that do not improve model performance.

10        Ethics Statement

Potential risks. We note that, despite being designed to be more interpretable, the system we present in this paper still relies fundamentally on LLMs, whose outputs may be unpredictable or unsafe. Accordingly, a proper deployment of our system would require safeguards to reduce the risk of harm.

Acknowledgements

This research is based upon work supported in part by the Air Force Research Laboratory (contract FA8750-23-C-0507), the DARPA KAIROS Program (contract FA8750-19-2-1004), the IARPA HIATUS Program (contract 2022-22072200005), and the NSF (Award 1928631). Approved for Public Release, Distribution Unlimited. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of AFRL, DARPA, IARPA, NSF, or the U.S. Government.

We would also like to thank Matthew Pressimone and Saurabh Shah for their help in assessing the initial feasibility of the project.

References

Eldar David Abraham, Karel D’Oosterlinck, Amir Feder, Yair Ori Gat, Atticus Geiger, Christopher Potts, Roi Reichart, and Zhengxuan Wu. 2022. CEBaB: Estimating the causal effects of real-world concepts on NLP model behavior.

Jasmijn Bastings, Wilker Aziz, and Ivan Titov. 2019. Interpretable neural predictions with differentiable binary variables. arXiv preprint arXiv:1905.08160.

Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José MF Moura, and Peter Eckersley. 2020. Explainable machine learning in deployment. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pages 648–657.

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html. Accessed: 14.05.2023.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-snli: Natural language inference with natural language explanations. Advances in Neural Information Processing Systems, 31.

Zhoujun Cheng, Jungo Kasai, and Tao Yu. 2023. Batch prompting: Efficient inference with large language model apis. arXiv preprint arXiv:2301.08721.

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. 2019. Structural scaffolds for citation intent classification in scientific publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3586–3596, Minneapolis, Minnesota. Association for Computational Linguistics.

Katherine Maeve Collins, Matthew Barker, Mateo Espinosa Zarlenga, Naveen Raman, Umang Bhatt, Mateja Jamnik, Ilia Sucholutsky, Adrian Weller, and Krishnamurthy Dvijotham. 2023. Human uncertainty in concept-based ai systems. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 869–889.

Anubrata Das, Chitrank Gupta, Venelin Kovatchev, Matthew Lease, and Junyi Jessy Li. 2022. Prototex: Explaining model decisions with prototype tensors. arXiv preprint arXiv:2204.05426.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Mengnan Du, Ninghao Liu, and Xia Hu. 2019. Techniques for interpretable machine learning. Communications of the ACM, 63(1):68–77.

Rudresh Dwivedi, Devam Dave, Het Naik, Smiti Singhal, Rana Omer, Pankesh Patel, Bin Qian, Zhenyu Wen, Tejal Shah, Graham Morgan, et al. 2023. Explainable AI (XAI): Core ideas, techniques, and solutions. ACM Computing Surveys, 55(9):1–33.

Antonio Gulli. 2004. AG’s corpus of news articles.

Xiaochuang Han, Byron C. Wallace, and Yulia Tsvetkov. 2020. Explaining black box predictions and unveiling data artifacts through influence functions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5553–5563, Online. Association for Computational Linguistics.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.

Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online. Association for Computational Linguistics.

Chris J Kennedy, Geoff Bacon, Alexander Sahn, and Claudia von Vacano. 2020. Constructing interval variables via faceted rasch measurement and multi-task deep learning: a hate speech application. arXiv preprint arXiv:2009.10277.

Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. 2019. SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829–839, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. 2020. Concept bottleneck models. In Proceedings of the 37th International Conference on Machine Learning.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. arXiv preprint arXiv:1606.04155.

Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. 2020. Explainable ai: A review of machine learning interpretability methods. Entropy, 23(1):18.

Qing Lyu, Marianna Apidianaki, and Chris Callison-Burch. 2022. Towards faithful model explanation in nlp: A survey. arXiv preprint arXiv:2209.11326.

Andreas Madsen, Siva Reddy, and Sarath Chandar. 2022. Post-hoc interpretability for neural nlp: A survey. ACM Computing Surveys, 55(8):1–42.

Anita Mahinpei, Justin Clark, Isaac Lage, Finale Doshi-Velez, and Weiwei Pan. 2021. Promises and pitfalls of black-box concept learning models. arXiv preprint arXiv:2106.13314.

Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pages 165–172.

Andrew McCallum, Kamal Nigam, et al. 1998. A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization, volume 752, pages 41–48. Madison, WI.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.

OpenAI. 2023. Gpt-4 technical report.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Andrew Peng, Michael Wu, John Allard, Logan Kilpatrick, and Steven Heidel. 2023. GPT-3.5 Turbo fine-tuning and API updates.

Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. 2023. What does a platypus look like? Generating customized prompts for zero-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15691–15701.

Dheeraj Rajagopal, Vidhisha Balachandran, Eduard H Hovy, and Yulia Tsvetkov. 2021. SELFEXPLAIN: A self-explaining architecture for neural text classifiers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 836–850, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Vikram V Ramaswamy, Sunnie SY Kim, Ruth Fong, and Olga Russakovsky. 2022. Overlooked factors in concept-based explanations: Dataset choice, concept salience, and human capability. arXiv preprint arXiv:2207.09615.

Eva Sharma, Chen Li, and Lu Wang. 2019. Bigpatent: A large-scale dataset for abstractive and coherent summarization. arXiv preprint arXiv:1906.03741.

Emily Sheng and David Uthus. 2020. Investigating societal biases in a poetry composition system.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems, 33:12388–12401.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Zhengxuan Wu, Karel D’Oosterlinck, Atticus Geiger, Amir Zur, and Christopher Potts. 2023. Causal proxy models for concept-based model explanations. In International Conference on Machine Learning, pages 37313–37334. PMLR.

Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. 2023. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19187–19197.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28.

Ruiqi Zhong, Peter Zhang, Steve Li, Jinwoo Ahn, Dan Klein, and Jacob Steinhardt. 2023. Goal driven discovery of distributional differences via language descriptions. arXiv preprint arXiv:2302.14233.

A         Further Analysis

A.1        Does the model generate similar concepts across repeated runs?

To evaluate the variance in concept generation, we compare the concepts generated by a TBM across five runs on the CEBaB dataset. Figure 9 visualizes the concepts generated across model runs and Figure 10 shows the learning curves for the TBMs. We can see that concepts such as “Menu variety”, “Food Quality”, “Reviewer Expectations”, “Ambiance Quality”, and “Service Quality” are generated by the majority of the TBMs. Inspecting the concept learning curves also reveals that these concepts tend to be highly important in the model. Overall, these results indicate that TBMs can consistently discover the important concepts across replicated model runs.

The MSEs of the final models are 0.43, 0.48, 0.29, 0.36, and 0.48, respectively. We note that the two models that achieve the best performance contain concepts that leak the task label, such as “Dining Experience” and “Overall Restaurant Quality”.

A.2        Can TBMs work on small training sets?

To evaluate the effect of training on a small dataset, we train three TBMs on the CEBaB dataset after limiting the size of the training set to 50 examples. The learning curves for these TBMs can be seen in Figure 8. In this figure, we see that it is possible for TBMs to overfit after generating too many concepts. Across all three runs, performance increases until around 5 concepts and starts to drop afterwards. This drop can be explained by the fact that the TBM starts to admit concepts that are too specific; for instance, in the third model we see a “Menu Misrepresentation” concept, which does not exist in any of the five full-size TBM replications. Another explanation for this drop is that there is not enough information to determine the importance of concepts relative to one another. Thus, even if the correct concepts are generated, the weights assigned to them with so few training samples can be unstable and generalize poorly outside of the training distribution.

A.3        Learning Curves on all datasets

Figure 7 shows the concept learning curves on all datasets, in addition to the four reported in Section 6.

B        Dataset Details

Table 3 shows the dataset description, possible labels, and an example from the dataset. Among all datasets, Yelp Reviews, CEBaB, and Hate Speech Detection involve a regression task, while others involve a classification task.

C        Human Evaluation for Concept Generation and Measurement

C.1        Concept Generation

The authors used the following guide to annotate each concept’s quality (Sec 5.2). Quality scores equal to 1 indicate no problems, while quality scores greater than 1 indicate issues.

Evaluation Metrics. For evaluation, we use the following metrics:

Redundancy (Rdy): 1 – No issues; 2 – Given the rest of the concepts already generated, this concept is redundant.

Relevance (Rlv): 1 – This concept is related to the task; 2 – This concept is unrelated to the task.

Leakage (Lkg): 1 – This concept does not leak the labelling task; 2 – This concept leaks the labelling task.

Objectivity (Obj): 1 – This concept can be measured objectively; 2 – This concept is subjective.

Difficulty (Dfc): 1 – Answering this concept question is easier than the labelling task; 2 – Answering this question is around the same difficulty as the labelling task; 3 – This question is harder than the labelling task.

C.2        Concept Measurement

To evaluate the TBM’s performance on concept measurement, we generate a questionnaire for each dataset and ask human crowdworkers to measure the scores of TBM-generated concepts. To avoid cases where the questions do not have an answer, we insert a “None of the Above” response at the end. Our annotators are students from a graduate-level AI class at the University of Pennsylvania, with good English proficiency. Both tasks are given as optional extra-credit assignments in the class; participation is solely voluntary. Before participation, students can preview the tasks and are given a clear description of how the data will be used at the beginning of the instructions. The population size is 98.

Table 3: Summary of Datasets.

Figure 7: Concept learning curves on all datasets.

We design an Amazon Mechanical Turk interface for the task, which can be found in the Supplementary Materials (Fig 11). For an hour of work, students can earn extra credit worth 1% of the overall course grade.

Figure 8: Concept learning curves on small CEBaB datasets with different runs.

D         Additional Details of Implementation

Concept Generation

D.1 Prompt Structure

The prompt contains three main sections: the instruction set, dataset information, and TBM state information. The instruction set contains details about what the concept generation task is, what the format of a concept is, and three examples of valid concepts for toxicity detection, product sentiment analysis, and scam detection. This is followed by the dataset information section, where we insert the dataset description, label descriptions, and examples from the dataset with different labels. Finally, to avoid generating duplicate concepts, we load the list of previously generated concepts at the end, in the TBM state information section.

D.2        Selecting in-context examples

We selectively load highly misclassified examples during the concept generation stage to increase the chances that we generate concepts relevant to these misclassified examples. We obtain these examples by examining the 10 nearest neighbors of each training example under the current concept feature space and then selecting the 20 examples with the highest “neighborhood loss”, obtained by averaging each neighborhood’s MSE (for regression) or accuracy (for classification). We then check whether this set of examples exceeds the token limit. If it does, we iteratively remove an example with the most common label within the group to ensure diverse representation. If we end up with fewer than 4 examples, we restart the process but truncate the texts by a factor of 0.8. We note that when the TBM generates its first concept, this sampling is random.
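A sketch of the neighborhood-loss selection for a regression task is given below; the helper name and the use of scikit-learn's NearestNeighbors are our assumptions, not the paper's code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hardest_examples(concept_scores, labels, predictions, k=10, n_select=20):
    """Indices of the examples whose concept-space neighborhoods have the highest loss."""
    nn = NearestNeighbors(n_neighbors=k).fit(concept_scores)
    _, neighbor_idx = nn.kneighbors(concept_scores)          # shape: (n_examples, k)
    sq_err = (labels - predictions) ** 2                      # per-example squared error
    neighborhood_loss = sq_err[neighbor_idx].mean(axis=1)     # average error over neighbors
    return np.argsort(neighborhood_loss)[-n_select:]

# Toy usage with random concept scores and predictions.
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], (100, 4)).astype(float)
y, y_hat = rng.random(100), rng.random(100)
print(hardest_examples(X, y, y_hat))
```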

D.3       Managing Token Limits

The number of iterations we can perform is bounded primarily by the token limit of the LLM we are using. As the number of iterations increases, the length of the TBM state information grows, and at some point concept generation fails because the prompt exceeds the token limit. It is possible to truncate this section, but doing so can lead to redundant concepts. To help manage token limits in other parts of the prompt, we dynamically truncate the text examples loaded in, to ensure that the prompt stays within the token budget of the LLM being used.
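The sketch below shows one simplified way such a budget check could work: it drops examples first and then shortens the remaining texts by a factor of 0.8, whereas the actual procedure also restarts selection and prefers dropping examples with the most common label (Sec D.2). The count_tokens callable is a placeholder, e.g. a tiktoken-based counter.

```python
def fit_examples_to_budget(examples, fixed_tokens, budget, count_tokens,
                           shrink=0.8, min_examples=4):
    """Drop or shorten in-context examples until the prompt fits the token budget."""
    texts = list(examples)
    while texts and fixed_tokens + sum(count_tokens(t) for t in texts) > budget:
        if len(texts) > min_examples:
            texts.pop()                                          # drop one example
        else:
            texts = [t[: int(len(t) * shrink)] for t in texts]   # shorten each text
            if all(len(t) == 0 for t in texts):
                break                                            # nothing left to trim
    return texts

# Toy usage with a whitespace "tokenizer" standing in for the real token counter.
count = lambda s: len(s.split())
examples = ["a very long review " * 50, "short review", "another review", "ok", "fine"]
print(len(fit_examples_to_budget(examples, fixed_tokens=200, budget=300, count_tokens=count)))
```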

Concept Measurement

This module relies heavily on the concept question and response guide associated with each concept. The module is flexible, allowing various prompting methods, such as directly answering the question or chain-of-thought prompting. In this paper, we structure the prompt as a three-step process that involves extracting pertinent snippets from the text and reasoning over them before yielding a final answer. The prompt takes as input the text we want to measure along with the JSON of the concept being evaluated. The prompt returns a JSON object representing the salient snippets in the text for each possible classification, the model’s reasoning over those snippets, and the final classification. We perform batch inference (Cheng et al., 2023) to reduce LLM costs. In cases where the generated text fails to parse as valid JSON or does not contain text we can turn

Figure 9: Concepts across multiple CEBaB training runs, visualized in a Multi-Dimensional Scaling (MDS) plot.
Figure 10: Concepts across multiple CEBaB training runs, as visualized in learning curves.

into a score using the response mapping, we return a concept score of 0.
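The fallback just described can be implemented with a small parsing helper; the field name "final_classification" and the error handling below are assumptions for illustration, not the paper's code.

```python
import json

def parse_measurement(raw_output: str, response_mapping: dict) -> float:
    """Map an LLM measurement output to a score, defaulting to 0 if it cannot be parsed."""
    try:
        result = json.loads(raw_output)
        return float(response_mapping[result["final_classification"]])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0

mapping = {"Positive": 1, "Negative": -1, "Unmentioned": 0}
print(parse_measurement('{"final_classification": "Positive"}', mapping))  # 1.0
print(parse_measurement("not valid json", mapping))                        # 0.0
```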

E         Prompts

E.1       Concept Generation Prompt

E.2 Concept Improvement Prompt

E.3       Concept Measurement Prompt

F        Supplementary Material

Figure 11 shows an example of a survey question used to measure the concept “Quality of Service”.

Figure 11: Survey format