
MaScQA: A Question Answering Dataset for Investigating Materials Science Knowledge of Large Language Models

Aug 17, 2023

Mohd Zaki1, Jayadeva2, Mausam3,4, N. M. Anoop Krishnan1,3

1Department of Civil Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India

2Department of Electrical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India

3Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India

4Department of Computer Science & Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India

Abstract

Information extraction and textual comprehension from materials literature are vital for developing an exhaustive knowledge base that enables accelerated materials discovery. Language models have demonstrated their capability to answer domain-specific questions and retrieve information from knowledge bases. However, there are no benchmark datasets in the materials domain that can evaluate the understanding of the key concepts by these language models. In this work, we curate a dataset of 650 challenging questions from the materials domain that require the knowledge and skills of a materials science student who has completed an undergraduate degree. We classify these questions based on their structure and the materials science domain-based subcategories. Further, we evaluate the performance of GPT-3.5 and GPT-4 models on solving these questions via zero-shot and chain of thought prompting. We observe that GPT-4 gives the best performance (~62% accuracy) compared with GPT-3.5. Interestingly, in contrast to the general observation, no significant improvement in accuracy is observed with chain of thought prompting. To evaluate the limitations, we performed an error analysis, which revealed conceptual errors (~64%) as the major contributor compared to computational errors (~36%) to the reduced performance of LLMs. We hope that the dataset and analysis performed in this work will promote further research in developing better materials science domain-specific LLMs and strategies for information extraction.

Keywords: Large language models, materials science, materials discovery, chain of thought

Introduction

Large language models (LLMs) are machine learning (ML) models based on the transformer neural network architecture [1]. These models are called large because of their billions of parameters. The increase in the number of model parameters and different training strategies have improved the performance of these models on natural language tasks such as question answering[2,3], text summarization[4,5], sentiment analysis[1,3], machine translation[6], conversational abilities[7–9], and code generation[10]. Numerous datasets allow researchers to benchmark the performance and evaluate the different capabilities of LLMs. MMLU, a question-answering dataset, comprises questions under four broad categories: humanities, social sciences, STEM, and others. The categories cover domains such as high school subjects, clinical knowledge, and mathematics, to name a few[11,12]. Another dataset is HellaSwag[13], a benchmark for commonsense natural language inference, where the input is a sentence and the output should be a sentence that completes the given input. A similar dataset, WinoGrande[14], was proposed by Sakaguchi et al. (2020), building on the original set of 273 expert-crafted pronoun-resolution problems. HumanEval[10] is a dataset created to evaluate the performance of LLMs in writing code. Dua et al. (2019) proposed the DROP[15] dataset for assessing the performance of LLMs on reading comprehension tasks. To evaluate the performance of LLMs on grade-school mathematics problems, Cobbe et al. (2021) introduced the GSM8K[16] dataset, comprising linguistically diverse mathematical word problems. The AI2 Reasoning Challenge (ARC) contains school-level science questions and has been used to demonstrate the state-of-the-art performance achieved by GPT-4[17]. However, a review of the literature, including the GPT-4 technical report and the papers introducing other LLMs such as Chinchilla[18] and PaLM[2], reveals that there are no materials science datasets on which these LLMs have been benchmarked yet.

The datasets that exist in the materials science domain are mainly for tasks like named entity recognition (NER)[19,20], classification[21–23], synthesis process and relation classification[24], and composition extraction from tables[25], which researchers use to benchmark the performance of materials-domain language models. These models, namely MatSciBERT[22] (the first materials-domain language model), MatBERT[26], MaterialsBERT[27], OpticalBERT[28], and BatteryBERT[23], have been trained on domain-specific texts, which resulted in state-of-the-art results on the tasks mentioned above. However, there are no large and diverse datasets in the materials domain that can be used to evaluate the natural language question-answering ability of LLMs. The development of such a dataset is thus crucial for investigating the materials-domain knowledge of these LLMs so that they can be further used to address challenging problems related to materials discovery in areas such as manufacturing, energy, environment, and sustainability. This information is also important for understanding the gaps in the knowledge of such LLMs, which are being proposed for use in several domains such as manufacturing, planning, materials synthesis, and materials discovery[22,27].

To address this challenge, we present a question-answering dataset on the materials domain. Specifically, we try to answer the following questions in this paper:

  1. How well do general-purpose LLMs perform in answering complex questions from the materials science domain?
  2. Can we improve the performance of the LLMs by using the chain of thought prompting methods?
  3. What are the factors limiting the performance of these LLMs on this dataset?

To this end, we collected questions that require students to have an undergraduate-level understanding of materials science topics to solve them. These questions and answers are carefully curated from the original questions in the Graduate Aptitude Test in Engineering (GATE) exam, a national-level examination for graduate admission in India. More than 800,000 students take this exam annually, with an average of 100,000 students in major disciplines, such as mechanical or civil engineering, to enroll in masters/doctoral courses in the premier institutes in India. We classify these questions based on (a) their structure and (b) the domain knowledge required to solve them. We then evaluate the performance of state-of-the-art proprietary models, GPT-3.5 and GPT-4, in solving these questions. We used the API of these models to obtain answers to the questions in two ways: first, by directly prompting the models to answer the questions (zero-shot prompting), and second, by asking the models to solve the questions step by step, also known as chain of thought prompting[29]. The availability of MaScQA will allow researchers to benchmark existing models and prompting strategies. Specifically, the analysis from a domain-specific perspective will allow researchers to train better domain-specific LLMs and help them decide where these models can be used in the materials discovery pipeline.

Methodology

Dataset preparation

We are motivated to investigate how LLMs perform on questions that require an undergraduate-level understanding of materials science topics for their solution. To compile a dataset of such questions, we take question papers related to materials science and metallurgical engineering from the GATE examination conducted in India for admission to masters and doctoral courses. To this end, we compiled 650 questions and classified them into four types based on their structure: multiple choice questions (MCQs), match-the-following questions (MATCH), numerical questions with options (MCQN), and numerical questions without options (NUM). MCQs are generally conceptual and provide four options, of which usually one, and occasionally more than one, is correct (Fig. 1(a)). In MATCH, two lists of entities are given that are to be matched with each other; these questions also provide four options, of which one contains the correct set of matched entities (Fig. 1(b)). In MCQN, the question has four choices, of which the correct one is identified after solving the numerical problem stated in the question (Fig. 1(c)). NUM questions have numerical answers, rounded off to the nearest integer or floating-point number as specified in the question (Fig. 1(d)).

To understand the performance of LLMs from a domain perspective, we classified the questions into 14 categories. The list of categories was prepared in consultation with domain experts who teach materials science subjects at the institute where this research was conducted. Each question was then assigned one of the categories by two experts, and conflicts in the category assignments were resolved through discussion and mutual agreement. Figure 2 shows the number of questions in each category. The color of the bars represents the broad category of materials science topics under which each subtopic is shown in the graphical abstract. The dataset can be accessed at https://github.com/M3RG-IITD/MaScQA.
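For concreteness, a single entry in such a dataset could be represented as follows. This is only an illustrative sketch: the field names, enum values, and the example record are assumptions rather than the released format (see the GitHub repository for the actual files).

```python
from dataclasses import dataclass
from enum import Enum

class QType(str, Enum):
    """Four structural question types used in MaScQA."""
    MCQ = "mcq"      # multiple choice, mostly conceptual
    MATCH = "match"  # match two lists of entities
    MCQN = "mcqn"    # numerical question with four choices
    NUM = "num"      # numerical answer (integer or float), no choices

@dataclass
class Question:
    """Illustrative schema for one MaScQA record (field names are assumptions)."""
    qid: str      # e.g., exam year + question number
    text: str     # full question statement, including any options
    qtype: QType  # structural category
    domain: str   # one of the 14 domain labels, e.g. "thermodynamics"
    answer: str   # official key: option label(s) or a numerical value

# Example record (contents invented purely for illustration)
q = Question(
    qid="example-001",
    text="Which of the following crystal structures has the highest packing fraction? ...",
    qtype=QType.MCQ,
    domain="atomic structure",
    answer="A",
)
```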

Figure 1. Sample questions from each category (a) multiple choice question (MCQ), (b) matching type question (MATCH), (c) numerical question with multiple choices (MCQN), and (d) numerical question (NUM).

Solutions using LLMs

In this work, we benchmark the question-answering ability of the GPT-3.5 and GPT-4 models on the MaScQA dataset. The questions are provided to each model in two ways: first, by directly asking the model to solve the question, and second, by asking the model to solve the given question with a detailed step-by-step solution. We call the first approach zero-shot question answering and the second approach chain of thought (CoT) reasoning[29]. The questions are fed to the models using the OpenAI API with the appropriate model type selected. The prompt used in the first approach is “Solve the following question. Write the correct answer inside a list at the end.” For the second approach, the prompt is “Solve the following question with highly detailed step-by-step explanation. Write the correct answer inside a list at the end.” The last sentence in the prompt was intended to allow the correct option/answer to be retrieved automatically from the model output and matched with the answer key. However, the model did not always produce output in the desired format. Hence, the entire model output was saved as a text file, from which the answers were extracted manually and compared with the answers provided in the official answer keys of the respective papers. The solutions to all the questions obtained using the two approaches for both models can be accessed at https://github.com/M3RG-IITD/MaScQA. The official answer keys were obtained from the website of IIT Kharagpur, one of the organizing institutes of the GATE exam (https://gate.iitkgp.ac.in/old_question_papers.html). The performance of the LLMs under the two prompting methods is discussed in detail in the following sections.
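The querying loop described above could be implemented roughly as follows. This is a minimal sketch assuming the official openai Python client (v1 interface); the model identifiers, the temperature setting, and the regex used to pull the final bracketed answer are illustrative assumptions, while the two prompts are quoted verbatim from the text.

```python
import re
from openai import OpenAI  # assumes the official openai Python package (v1 interface)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ZERO_SHOT = "Solve the following question. Write the correct answer inside a list at the end."
COT = ("Solve the following question with highly detailed step-by-step explanation. "
       "Write the correct answer inside a list at the end.")

def ask(question_text: str, model: str = "gpt-4", cot: bool = False) -> str:
    """Send one question to the chosen model and return the raw completion text."""
    prompt = (COT if cot else ZERO_SHOT) + "\n\n" + question_text
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # illustrative choice; not specified in the text
    )
    return response.choices[0].message.content

def extract_answer(output: str):
    """Heuristically pull the final bracketed answer, e.g. '[C]' or '[42.5]'.
    The model does not always follow the requested format, so failures return None
    and the full output is kept for manual inspection, as described in the text."""
    matches = re.findall(r"\[([^\[\]]+)\]", output)
    return matches[-1].strip() if matches else None
```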

Results

Figure 2 shows the details of the dataset, comprising a total of 650 questions in different categories. First, we categorize the questions based on their structure. The largest category of questions (285) is MCQs, while 70 are MATCH-type questions. Further, 67 questions are MCQN, while the remaining 228 questions are NUM, which do not provide any choices. Next, we analyze the different materials domains covered by this set of questions. To this end, the questions are categorized into 14 domains: thermodynamics, atomic structure, mechanical behavior, materials manufacturing, material applications, phase transition, electrical properties, material processing, transport phenomena, magnetic properties, material characterization, fluid mechanics, material testing, and miscellaneous.

Figure 2 shows the number of questions in the different domain-specific categories. To visualize the frequently used words in each domain-specific category of questions, word clouds are shown in Figure 3. The maximum number of questions (114) is in the thermodynamics category, which covers questions related to the enthalpy of formation, energy balance during chemical reactions, transition temperatures, activation energy, and heat transfer (Fig. 3(a)). The atomic structure category comprises 100 questions based on concepts such as dislocations, diffraction planes, and crystal structures (Fig. 3(b)). The mechanical behavior category is based on the concepts of the stress-strain behavior of materials, creep, fatigue, and fracture mechanics (Fig. 3(c)). In materials manufacturing (Fig. 3(d)) and material applications (Fig. 3(e)), the questions test knowledge of the extraction of materials from their respective ores and why a particular material is used for a specific application. Thus, these questions require a logical understanding connecting multiple concepts: first, “recall” or “deduce” the properties of a material based on its composition, label, or processing conditions; second, “identify” the properties required for a particular application; and then connect these two to “derive” a logical explanation that arrives at the correct answer. The questions on phase transition test knowledge of how phase transitions can be induced in materials, how to calculate the percentages of different phases in a material, and the characteristics of different phases; this is also indicated by the high frequency of words related to different phases of materials (Fig. 3(f)). The questions on electrical properties include fuel cells, the characteristics of materials used in batteries, and semiconductor devices (Fig. 3(g)). The questions on material processing cover topics such as annealing, tempering, recrystallization, and welding (Fig. 3(h)). The questions on transport phenomena test concepts related to the diffusion or transport of ions (Fig. 3(i)). The questions related to magnetic properties test knowledge of magnetization and the characteristics of different magnetic materials (Fig. 3(j)). The material characterization topic has questions related to methods like scanning electron microscopy, diffraction studies, and backscattered electron microscopy (Fig. 3(k)). The fluid mechanics topic comprises questions on the viscosity of fluids and the movement of particles in a viscous medium (Fig. 3(l)). In the material testing topic, the questions are mostly based on non-destructive material testing methods (Fig. 3(m)). The miscellaneous category deals with questions requiring a simultaneous understanding of multiple materials science domains for their solution (Fig. 3(n)).

Figure 2. The number of questions in each materials science sub-domain. The bar chart shows the distribution of questions in different sub-domains. The pie chart shows the number of questions classified according to question structure.
Figure 3. Word clouds for different topics in MaScQA: (a) Thermodynamics, (b) Atomic structure, (c) Mechanical behavior, (d) Material manufacturing, (e) Material applications, (f) Phase transition, (g) Electrical properties, (h) Material processing, (i) Transport phenomena, (j) Magnetic properties, (k) Material characterization, (l) Fluid mechanics, (m) Material testing, and (n) Miscellaneous.
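Per-domain word clouds like those in Figure 3 could be generated with something like the sketch below; the use of the wordcloud package and the preprocessing choices are assumptions, not necessarily what was done for the figure.

```python
from collections import defaultdict
from wordcloud import WordCloud, STOPWORDS  # pip install wordcloud
import matplotlib.pyplot as plt

def domain_wordclouds(questions):
    """questions: iterable of objects with .domain and .text fields (see the schema sketch above)."""
    texts = defaultdict(list)
    for q in questions:
        texts[q.domain].append(q.text)  # group question texts by domain label
    for domain, chunks in texts.items():
        wc = WordCloud(width=800, height=400, background_color="white",
                       stopwords=STOPWORDS).generate(" ".join(chunks))
        plt.figure()
        plt.imshow(wc, interpolation="bilinear")
        plt.axis("off")
        plt.title(domain)
    plt.show()
```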

Now, we evaluate the performance of the LLMs on MaScQA and the effect of the prompting methods on performance, corresponding to the first two questions posed in this work. Table 1 reports the accuracy of the LLMs on the MaScQA corpus. The scores corresponding to the model names GPT-3.5 and GPT-4 represent the accuracy of the models when questions are asked directly, that is, zero-shot answering. The model names with the suffix “CoT” indicate that we asked the models to provide detailed stepwise solutions to the given questions. For MCQs, we observe that GPT-4 significantly outperforms GPT-3.5. Further, we also observe that CoT provides only a marginal improvement for GPT-3.5 and GPT-4. Here, GPT-4-CoT gives an accuracy of 77.11%, which is a high score considering the difficulty level of this exam. Also, the performance of GPT-4-CoT is ~20% higher than that of GPT-3.5-CoT for MCQ-type questions. For MATCH questions, GPT-4-CoT exhibits the maximum performance with a score of 92.86%, a very high score considering the amount of knowledge required to connect the entities. In contrast, the variants of GPT-3.5 performed poorly on MATCH questions, with scores of 40% and 38.57% for the variants without and with CoT, respectively. In this case, GPT-4-CoT provides a ~4% improvement over direct prompting. For MCQN, GPT-4 gives the best performance with a score of 58.82%, while CoT reduces the model’s performance to 51.47%. The same trend of reduced performance on these questions is observed with the GPT-3.5 model. This implies that CoT prompting may not always lead to better performance. Now, we focus on the numerical questions. Among all the categories, the models exhibit the worst performance on the NUM category. Here, GPT-4 and GPT-4-CoT obtain the maximum scores of 33.33% and 36.84%, respectively. Interestingly, we observe that CoT yields poorer results in the case of GPT-3.5, while it yields better accuracy in the case of GPT-4. Finally, regarding overall performance, GPT-4-CoT gives the best score of 62%, with GPT-4 following closely at 60.15%. It should be noted that among the MCQs, there are 13 questions where more than one option is correct, of which GPT-4 and GPT-4-CoT answered six and seven correctly, respectively. Interestingly, we observe that CoT does not always give improved results. In fact, for GPT-3.5, CoT gives poorer results in all cases except MCQs, and it gives marginally better results for GPT-4 in all cases except MCQN. Note that this contrasts with the general observation that CoT prompting improves the performance of LLMs on QA tasks. This is evaluated in detail later.

| Evaluation Method | MCQ (285) | Matching (MATCH) (70) | Numerical with MCQ (MCQN) (67) | Numerical (NUM) (228) | Overall accuracy |
|---|---|---|---|---|---|
| Baseline scores | 25 | 25 | 25 | 0 | – |
| GPT-3.5 | 56.49 | 40.00 | 35.82 | 15.79 | 38.31 |
| GPT-3.5-CoT | 56.84 | 38.57 | 34.33 | 14.04 | 37.38 |
| GPT-4 | 74.74 | 88.57 | 59.70 | 33.77 | 60.15 |
| GPT-4-CoT | 76.84 | 92.86 | 52.24 | 37.28 | 62.00 |

Table 1. Performance (% accuracy) of different evaluation styles using GPT models on various question types. The number in parentheses represents the total number of questions in the respective category.
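As a quick consistency check, the overall accuracy in Table 1 is the question-count-weighted average of the per-type accuracies; for example, for GPT-3.5:

```python
# Question counts and GPT-3.5 per-type accuracies (%) taken from Table 1
counts = {"MCQ": 285, "MATCH": 70, "MCQN": 67, "NUM": 228}
acc = {"MCQ": 56.49, "MATCH": 40.00, "MCQN": 35.82, "NUM": 15.79}

# Weighted average over the 650 questions
overall = sum(counts[k] * acc[k] for k in counts) / sum(counts.values())
print(round(overall, 2))  # 38.31, matching the overall accuracy reported in Table 1
```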

In addition to the performance of GPT models in answering different types of questions like multiple choice, numerical, and matching, which test different mental abilities of the students, it is also important to analyze the performance of the models from a domain perspective. To this end, we classify all the questions of our dataset into 14 broad categories. Table 2 shows the accuracy of the GPT-4-CoT prompting method while answering the questions.

It is observed that questions related to the mechanical and electrical behavior of materials have the highest percentage of incorrectly answered questions (~60%). The questions on thermodynamics, atomic structure, phase transition, transport phenomena, and magnetic properties have more than ~40% incorrectly answered questions in their respective categories. Further, more than 15% of materials manufacturing, application, and characterization questions are incorrectly answered, with the lowest error rates for material characterization and no mistakes made on material testing questions. To gain further insight into the factors limiting the LLMs’ performance, we discuss the mistakes by classifying them into two categories, as explained in the Discussion section.

| Category | Correct (# questions) | Correct (%) | Incorrect (# questions) | Incorrect (%) | Total |
|---|---|---|---|---|---|
| Thermodynamics | 63 | 55.26 | 51 | 44.74 | 114 |
| Atomic structure | 59 | 59.00 | 41 | 41.00 | 100 |
| Mechanical behavior | 43 | 44.79 | 53 | 55.21 | 96 |
| Material manufacturing | 62 | 68.13 | 29 | 31.87 | 91 |
| Material applications | 46 | 86.79 | 7 | 13.21 | 53 |
| Phase transition | 25 | 60.98 | 16 | 39.02 | 41 |
| Electrical properties | 15 | 41.67 | 21 | 58.33 | 36 |
| Material processing | 31 | 88.57 | 4 | 11.43 | 35 |
| Transport phenomena | 15 | 62.50 | 9 | 37.50 | 24 |
| Magnetic properties | 9 | 60.00 | 6 | 40.00 | 15 |
| Material characterization | 10 | 71.43 | 4 | 28.57 | 14 |
| Fluid mechanics | 12 | 85.71 | 2 | 14.29 | 14 |
| Material testing | 9 | 100.00 | 0 | 0.00 | 9 |
| Miscellaneous | 5 | 62.50 | 3 | 37.50 | 8 |

Table 2. Performance of GPT-4-CoT on questions classified from a materials science domain perspective.

Discussion

Error Analysis

To use LLMs effectively and to identify areas that require further research, it is important to understand the mistakes made by the LLMs in the materials domain. Answering a question requires retrieving the correct concepts/facts, applying them to the scenario posed in the question by substituting appropriately in the relevant formulae, and then solving it correctly through the relevant computational steps. Accordingly, we divide the errors into three categories: (i) conceptual error, where the correct concept, equation, or facts related to the problem are not retrieved, or the LLM hallucinates some facts; (ii) grounding error, where the relevant concepts are not correctly applied to the scenario or incorrect values are substituted in the equations (for example, a ºC to K conversion is not applied); and (iii) computational error, where the numerical computation is performed incorrectly [32]. Note that CoT prompting enables the model to reflect upon the knowledge it already has, connect it with the multiple choices, and then arrive at the answer. Thus, in general, it has been observed that CoT helps reduce grounding errors (in our case, it virtually eliminates them).
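For the manual annotation described above, each incorrect solution receives one label; a minimal sketch of such a taxonomy is given below (the enum names are illustrative, not the authors' annotation code).

```python
from enum import Enum

class ErrorType(Enum):
    """Error taxonomy used for the manual analysis of incorrect solutions."""
    CONCEPTUAL = "conceptual"        # wrong or missing concept, equation, or fact; hallucinated facts
    GROUNDING = "grounding"          # right concept applied wrongly, e.g. a missing degC-to-K conversion
    COMPUTATIONAL = "computational"  # arithmetic or algebraic slip in an otherwise correct setup

# When a solution contains both a conceptual and a computational slip, only the
# conceptual error is counted, since it occurs first in the solution chain (see text).
```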

To analyze the different errors, we created a subset of 100 randomly selected questions that GPT-4-CoT answered incorrectly. Of these 100 questions, 54 are NUM, 27 are MCQs, 14 are MCQN, and five are matching-type questions (MATCH) (Table 3). All incorrectly answered questions from domains with fewer than ten mistakes under GPT-4-CoT prompting are included (see Table 2); the remaining questions are randomly sampled from the other categories. The number of questions across materials science sub-domains in this subset of 100 questions is shown in Table 4. Note that a question may contain both conceptual and computational errors, but we consider only the conceptual error in such cases since it is the first to occur; if the retrieved concept is incorrect, we deem the computational error secondary.
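The sampling scheme described above can be sketched as follows, assuming per-question correctness flags from the GPT-4-CoT run are available; the function and variable names are hypothetical, while the subset size of 100 and the threshold of ten mistakes follow the text.

```python
import random

def sample_error_subset(incorrect_by_domain, target=100, threshold=10, seed=0):
    """incorrect_by_domain: dict mapping domain -> list of incorrectly answered questions.
    Keep every incorrect question from domains with fewer than `threshold` mistakes,
    then fill the rest of the subset by random sampling from the remaining domains."""
    rng = random.Random(seed)
    subset, pool = [], []
    for domain, questions in incorrect_by_domain.items():
        if len(questions) < threshold:
            subset.extend(questions)   # take all mistakes from low-error domains
        else:
            pool.extend(questions)     # candidates for random sampling
    subset.extend(rng.sample(pool, target - len(subset)))
    return subset
```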

Table 3 shows the errors made by GPT-4-CoT for the different question categories. The analysis of the 100 questions reveals that most errors are conceptual. Even in numerical problems, we observe that as many conceptual errors are made as computational errors. It is interesting to observe that GPT-4-CoT is equally poor at retrieving concepts and at performing calculations on NUM-type questions, which explains the lowest performance of the LLMs on these questions. Further, in MCQ and MATCH-type questions, the error is always conceptual, because answering such questions requires retrieving the appropriate concepts and facts and then connecting them with the relevant options. For MCQN, computational errors are more prevalent than conceptual errors. Overall, most of the questions (64%) were answered incorrectly due to conceptual errors, implying the need for domain-specific models or better prompting and problem-solving approaches.

As mentioned above, we observe that GPT-4-CoT makes no grounding errors. To evaluate whether this is due to the effectiveness of CoT, we investigate the questions that are answered incorrectly by GPT-4 but correctly by GPT-4-CoT. For these 65 questions from the entire dataset, GPT-4’s solutions had ~70% conceptual errors, ~30% computational errors, and no grounding errors. Further, we also analyzed the errors made by GPT-4-CoT on questions that are answered correctly by GPT-4. There were 53 such questions in the complete dataset. Of these, the solutions of 42 questions (~79%) had conceptual errors, one question had a grounding error, and the remaining ten questions had computational errors when solved using GPT-4-CoT. Since there are little to no grounding errors for either GPT-4 or GPT-4-CoT, both models are adept in this regard, and CoT prompting mainly helps reduce some numerical errors.

| Question Type | Conceptual error (# questions) | Conceptual error (%) | Computational error (# questions) | Computational error (%) |
|---|---|---|---|---|
| MCQs | 27 | 100 | 0 | 0 |
| MATCH | 5 | 100 | 0 | 0 |
| MCQN | 5 | 35.71 | 9 | 64.29 |
| NUM | 27 | 50 | 27 | 50 |

Table 3. Types of errors on the 100 questions classified based on question structure.

Table 4 shows the domain-wise distribution of conceptual and computational errors on the same subset of 100 questions. Most questions in all categories have conceptual errors, except for thermodynamics, transport phenomena, and fluid mechanics. We now discuss some conceptual errors in different domains. The list of all questions subjected to this analysis is provided in the GitHub repository of this work.

| Category | Total questions | Conceptual error (# questions) | Conceptual error (%) | Computational error (# questions) | Computational error (%) |
|---|---|---|---|---|---|
| Thermodynamics | 11 | 4 | 36.36 | 8 | 72.73 |
| Atomic structure | 11 | 7 | 63.64 | 4 | 36.36 |
| Mechanical behavior | 11 | 7 | 63.64 | 4 | 36.36 |
| Material manufacturing | 11 | 8 | 72.73 | 3 | 27.27 |
| Electrical properties | 11 | 6 | 54.55 | 5 | 45.45 |
| Phase transition | 10 | 6 | 60.00 | 4 | 40.00 |
| Transport phenomena | 9 | 4 | 44.44 | 5 | 55.56 |
| Material applications | 7 | 7 | 100.00 | 0 | 0 |
| Magnetic properties | 6 | 4 | 66.67 | 2 | 33.33 |
| Material characterization | 4 | 4 | 100.00 | 0 | 0 |
| Material processing | 4 | 4 | 100.00 | 0 | 0 |
| Miscellaneous | 3 | 3 | 100.00 | 0 | 0 |
| Fluid mechanics | 2 | 0 | 0 | 2 | 100.00 |

Table 4. Types of errors made by GPT-4-CoT on the 100 questions classified according to the domain expertise required to solve them.

Fig. 4(a) shows an example of a conceptual error made on a question related to thermodynamics. In this question, instead of taking the coefficient of thermal expansion to be the same in the two in-plane directions, GPT-4-CoT took the coefficient in the perpendicular direction to be the same as that in one of the in-plane directions. Mathematically, instead of obtaining the final coefficient as 2 × (parallel) + (perpendicular), GPT-4-CoT used (parallel) + 2 × (perpendicular), leading to an incorrect answer. While solving a question on atomic structure (Fig. 4(b)), GPT-4-CoT mistook the relation between the lattice parameter (a) and the atomic diameter (D) as a = (√3/2) D instead of a = (2/√3) D. In a question on the electrical properties of materials (Fig. 4(c)), GPT-4-CoT answered that all the given statements were correct and hence could not choose from the four options given as answers. According to the materials science domain and the Wikipedia entry on Pourbaix diagrams, one of their major limitations is that they do not estimate actual corrosion rates; also, these diagrams cannot be used to study corrosion due to chloride ions. Hence, statement R is incorrect, making (C) the correct choice. While solving the question shown in Fig. 4(d), GPT-4-CoT did not convert the lattice parameter into the atomic diameter and treated them as the same when substituting into the formula required for solving the problem. For a question on materials manufacturing, GPT-4-CoT retrieved the functions of (P) blast furnace slag and (R) torpedo car as swapped, thus leading to the wrong answer C when the correct option was A.
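For reference, the relations at stake in the first two examples are summarized below; the BCC geometry for the atomic-structure question is inferred from the quoted expressions rather than stated explicitly in the figure, so it should be read as an assumption.

```latex
% Final (volumetric) expansion coefficient when the two in-plane coefficients are equal:
\alpha_V \approx 2\,\alpha_{\parallel} + \alpha_{\perp}
\qquad \text{(GPT-4-CoT instead used } \alpha_{\parallel} + 2\,\alpha_{\perp}\text{)}

% Lattice parameter vs. atomic diameter, assuming a BCC cell whose body diagonal a\sqrt{3} = 2D:
a = \frac{2}{\sqrt{3}}\,D \approx 1.155\,D
\qquad \text{(not } a = \tfrac{\sqrt{3}}{2}\,D \approx 0.866\,D\text{)}
```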

Figure 4. Visualizing some of the questions where GPT-4-CoT made conceptual errors in the solution.

Comparative analysis

Finally, to answer the third question raised in this work, i.e., what factors limit the performance of LLMs on MaScQA, we examine the mistakes made by GPT-3.5-CoT alongside the solutions provided by GPT-4-CoT. Fig. 5 shows one example where GPT-4-CoT yielded the correct solution. If we check the Wikipedia page for the phase rule (cite), the first expression there is the one proposed as a solution by GPT-3.5. However, GPT-4 reaches the correct expression, which is also available on the same Wikipedia page. Although the details of the datasets on which these models are trained are unknown to users, openly available sources like Wikipedia are commonly used by researchers while training such language models[18,30]. Thus, it is interesting to note that while GPT-3.5 displays a shallow understanding of the concepts, GPT-4 can provide a deeper understanding based on the context.

Figure 5. Visualizing output of GPT models on a sample MCQ question.

The matching-type questions require an understanding of different topics and the ability to interlink them. An example of a matching question, with the solutions given by GPT-3.5-CoT and GPT-4-CoT, is shown in Fig. 6. The scores in Table 1 indicate the exceptionally high performance of the GPT-4 models in answering matching-type questions, which is more than twice the performance of the GPT-3.5 models. It can be seen from the response of GPT-3.5-CoT that it is only able to determine the material properties required for the missile cone heads. Interestingly, GPT-3.5-CoT tries to arrive at the correct answer by eliminating options. In contrast, GPT-4-CoT relies on understanding the topics and answers the question after interrelating the relevant information. This reinforces the observation that GPT-3.5 has a shallow understanding of the concepts.

Figure 6. Visualizing output of GPT models on a sample matching type question

An example of a numerical question with multiple options is shown in Fig. 7. The GPT-3.5-CoT solution used the correct concept but made calculation errors, leading to an incorrect final answer. GPT-4-CoT, in contrast, used the correct concept and did not make calculation mistakes. As observed in Table 1, GPT-4 and GPT-4-CoT achieve similar accuracy in answering MCQN questions. The red-colored text in the GPT-3.5-CoT solution shows the source of the error, which led to an incorrect answer.

Figure 7. Visualizing the output of GPT models on a numerical question with multiple options

Now, we compare the solutions by GPT-3.5-CoT and GPT-4-CoT on a sample numerical question (NUM) related to platinum’s crystal structure, shown in Fig. 8. Both models applied the correct concept. However, GPT-3.5-CoT made a calculation mistake in obtaining the interplanar distance “d”, which is highlighted in boldface and red in Fig. 8. Calculation mistakes are a known issue with such LLMs in the literature[7–9,18,31], where a similar order of accuracy was achieved on numerical question-solving tasks. The low accuracy of the LLMs may also imply a lack of materials science concepts, as previously observed for MCQ and MATCH-type questions, in addition to limited calculation capability.
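For context, the standard relations presumably involved in such a diffraction calculation for a cubic crystal such as FCC platinum are given below; the specific plane indices and X-ray wavelength used in the question are not reproduced here.

```latex
% Interplanar spacing of the (hkl) planes in a cubic lattice of parameter a:
d_{hkl} = \frac{a}{\sqrt{h^2 + k^2 + l^2}}

% Bragg's law relating d_{hkl} to the diffraction angle \theta and X-ray wavelength \lambda:
n\lambda = 2\, d_{hkl} \sin\theta
```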

Figure 8: Visualizing output of GPT models on a sample numerical question

We now discuss the performance of GPT-4-CoT from the materials science domain perspective. The topics in Table 2 are arranged in decreasing order of the total number of questions in each category. The maximum percentage of incorrect answers occurs for questions under the electrical properties topic. The incorrectly answered questions require solving problems related to battery cells and redox reactions or identifying the potentials between electrodes. The number of numerical questions answered incorrectly is 3–5 times that of the other question types. On questions related to the mechanical behavior of materials, GPT-4-CoT has the second-worst performance; out of 53 incorrectly answered questions, 34 are numerical. The questions where mistakes occurred were based on the concepts of the stress-strain curve of materials, fracture mechanics, and creep behavior. The thermodynamics category has the maximum number of questions and quite a high percentage of incorrectly answered questions (~46%). The incorrectly answered questions require understanding concepts of formation energy, specific heat, heat transfer, and chemical equations, to name a few, and solving complex equations correctly. The atomic structure category has ~42% incorrectly answered questions, mostly related to the analysis of X-ray diffraction studies to identify the crystal structure of materials. This reflects that the LLMs are unable to correlate theoretical concepts with experimental outcomes. The magnetism category has fewer questions (15), of which only nine are answered correctly. The performance of the LLMs on these questions reflects their inability to retrieve related concepts such as magnetic moment and saturation magnetization and to avoid numerical errors. In phase transitions, the incorrectly answered questions (~41%) relate to solving for the composition of different phases after the transitions and the conditions required for phase transition. The next category is transport phenomena, where the incorrectly answered questions (~38%) require understanding diffusion phenomena and concepts of thermodynamics and battery cell reactions.

To summarize, CoT prompting cannot significantly improve the LLMs’ performance because the mistakes are mainly conceptual. This makes a strong case for a domain-specific LLM for materials, and potentially for domain-specific prompting strategies. Further, for the questions where the LLMs give an incorrect response due to a computational error, the solutions involve unit conversions, logarithms, exponentials, and numbers with large multiplying factors (e.g., 10¹⁰). Recent works in the literature suggest methods for improving calculations and for reducing concept-based mistakes[33]. Introducing such heuristics while prompting can help researchers in two ways: (1) probe the existing LLMs more deeply, and (2) generate datasets to train LLMs with fewer parameters, thus making the use of these models economical. This answers the third research question (the limiting factors for LLMs) raised in this work.
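One such heuristic is to offload the arithmetic that the model writes down to an external calculator rather than trusting its own computation. The sketch below illustrates the idea using sympy; it is an assumption about how such a pipeline could look, not a method used in this work.

```python
import sympy

def safe_eval(expression: str) -> float:
    """Evaluate an arithmetic expression produced by an LLM with a symbolic math
    library instead of relying on the model's own arithmetic. This handles
    logarithms, exponentials, and large multiplying factors such as 1e10."""
    return float(sympy.sympify(expression, evaluate=True))

# Example: an expression the model might write down instead of computing itself
print(safe_eval("2*log(10**10, 10) + exp(1)"))  # 22.718281828...
```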

Conclusion

Due to the increasing availability of large datasets and computational capabilities, developing an LLM is becoming relatively easier. In materials discovery, machine learning and natural language processing have played an instrumental role in identifying new materials or existing materials for new applications, discovering optimal synthesis pathways, and planning. We are living in an era where machine learning, humans, and machines work together in the pipeline of discovering new materials. At this juncture, it is crucial to ask how well LLMs understand the materials science domain, as the answer will determine their applications in such pipelines. To this end, our new dataset, MaScQA, which tests both the abilities required to solve the questions and the understanding of the materials science domain and its interrelated concepts, provides a means to gain deeper insights. We observed that the LLMs make both numerical and conceptual mistakes. There are several core materials science areas where the LLMs show poor performance, such as the atomic and crystal structure of materials and their electrical, magnetic, and thermodynamic behavior. Hence, to enable their use in the materials discovery pipeline, language models must be finetuned on domain-specific datasets.

Moreover, the performance of the LLMs on MaScQA can enable a deeper understanding of the lacunae in the LLMs, thereby providing new research avenues. For instance, the LLMs’ poor performance on NUM questions suggests that a pipeline connecting the LLM to a math calculator can potentially yield improved results. Further, the conceptual mistakes made by the LLMs suggest areas where improvements are required. Materials science is a field that derives concepts from physics, chemistry, and mechanics; a benchmark like MaScQA will therefore allow researchers to compare their results against a standard dataset. Further, the correct solutions can help researchers create new datasets for training lightweight models, which are economical and hence can be easily deployed on low-memory industrial devices for materials discovery and for educational purposes.

References:

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. NAACL, Association for Computational Linguistics, Minneapolis, Minnesota, 2019: pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423.

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H.W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A.M. Dai, T.S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, N. Fiedel, PaLM: Scaling Language Modeling with Pathways, (2022). https://doi.org/10.48550/arXiv.2204.02311.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P.J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, (2020). https://doi.org/10.48550/arXiv.1910.10683.

A. Kedia, S.C. Chinthakindi, W. Ryu, Beyond reptile: Meta-learned dot-product maximization between gradients for improved single-task regularization, in: Find. Assoc. Comput. Linguist. EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021: pp. 407–420. https://doi.org/10.18653/v1/2021.findings-emnlp.37.

B. Pang, E. Nijkamp, W. Kryściński, S. Savarese, Y. Zhou, C. Xiong, Long Document Summarization with Top-down and Bottom-up Inference, (2022). https://doi.org/10.48550/arXiv.2203.07586.

A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Celebi, G. Wenzek, V. Chaudhary, N. Goyal, T. Birch, V. Liptchinsky, S. Edunov, E. Grave, M. Auli, A. Joulin, Beyond english-centric multilingual machine translation, ArXiv Prepr. (2020).

OpenAI, GPT-4 Technical Report, (2023). https://doi.org/10.48550/arXiv.2303.08774.

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and Efficient Foundation Language Models, (2023). https://doi.org/10.48550/arXiv.2302.13971.

B. Peng, C. Li, P. He, M. Galley, J. Gao, Instruction Tuning with GPT-4, (2023). http://arxiv.org/abs/2304.03277 (accessed June 1, 2023).

M. Chen, J. Tworek, H. Jun, Q. Yuan, H.P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F.P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W.H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A.N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, W. Zaremba, Evaluating large language models trained on code, (2021).

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, Proc. Int. Conf. Learn. Represent. ICLR. (2021).

D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, J. Steinhardt, Aligning AI with shared human values, Proc. Int. Conf. Learn. Represent. ICLR. (2021).

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a machine really finish your sentence?, in: Proc. 57th Annu. Meet. Assoc. Comput. Linguist., Association for Computational Linguistics, Florence, Italy, 2019: pp. 4791–4800. https://doi.org/10.18653/v1/P19-1472.

K. Sakaguchi, R. Le Bras, C. Bhagavatula, Y. Choi, WinoGrande: An Adversarial Winograd Schema Challenge at Scale, Proc. AAAI Conf. Artif. Intell. 34 (2020) 8732– 8740. https://doi.org/10.1609/aaai.v34i05.6399.

D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, M. Gardner, DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs, in: Proc. 2019 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. Vol. 1 Long Short Pap., Association for Computational Linguistics, Minneapolis, Minnesota, 2019: pp. 2368–2378. https://doi.org/10.18653/v1/N19-1246.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, others, Training verifiers to solve math word problems, ArXiv Prepr. ArXiv211014168. (2021).

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? try arc, the ai2 reasoning challenge, ArXiv Prepr. ArXiv180305457. (2018).

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de L. Casas, L.A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J.W. Rae, O. Vinyals, L. Sifre, Training Compute-Optimal Large Language Models, (2022). https://doi.org/10.48550/arXiv.2203.15556.

L. Weston, V. Tshitoyan, J. Dagdelen, O. Kononova, A. Trewartha, K.A. Persson, G. Ceder, A. Jain, Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature, J. Chem. Inf. Model. 59 (2019) 3692–3702. https://doi.org/10.1021/acs.jcim.9b00470.

K. Cruse, A. Trewartha, S. Lee, Z. Wang, H. Huo, T. He, O. Kononova, A. Jain, G. Ceder, Text-mined dataset of gold nanoparticle synthesis procedures, morphologies, and size entities, Sci. Data. 9 (2022) 234. https://doi.org/10.1038/s41597-022-01321-6.

V. Venugopal, S. Sahoo, M. Zaki, M. Agarwal, N.N. Gosvami, N.M.A. Krishnan, Looking through glass: Knowledge discovery from materials science literature using natural language processing, Patterns. 2 (2021) 100290. https://doi.org/10.1016/j.patter.2021.100290.

T. Gupta, M. Zaki, N.M.A. Krishnan, Mausam, MatSciBERT: A materials domain language model for text mining and information extraction, Npj Comput. Mater. 8 (2022) 102. https://doi.org/10.1038/s41524-022-00784-w.

S. Huang, J.M. Cole, BatteryBERT: A Pretrained Language Model for Battery Database Enhancement, J. Chem. Inf. Model. 62 (2022) 6365–6377. https://doi.org/10.1021/acs.jcim.2c00035.

S. Mysore, Z. Jensen, E. Kim, K. Huang, H.-S. Chang, E. Strubell, J. Flanigan, A. McCallum, E. Olivetti, The materials science procedural text corpus: Annotating materials synthesis procedures with shallow semantic structures, in: Proc. 13th Linguist. Annot. Workshop, Association for Computational Linguistics, Florence, Italy, 2019: pp. 56–64. https://doi.org/10.18653/v1/W19-4007.

T. Gupta, M. Zaki, D. Khatsuriya, K. Hira, N.M.A. Krishnan, Mausam, DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science Articles, (2022). https://doi.org/10.48550/arXiv.2207.01079.

A. Trewartha, N. Walker, H. Huo, S. Lee, K. Cruse, J. Dagdelen, A. Dunn, K.A. Persson, G. Ceder, A. Jain, Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science, Patterns. 3 (2022) 100488. https://doi.org/10.1016/j.patter.2022.100488.

P. Shetty, A.C. Rajan, C. Kuenneth, S. Gupta, L.P. Panchumarti, L. Holm, C. Zhang, R. Ramprasad, A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing, Npj Comput. Mater. 9 (2023) 1–12. https://doi.org/10.1038/s41524-023-01003-w.

J. Zhao, S. Huang, J.M. Cole, OpticalBERT and OpticalTable-SQA: Text- and Table- Based Language Models for the Optical-Materials Domain, J. Chem. Inf. Model. (2023). https://doi.org/10.1021/acs.jcim.2c01259.

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q.V. Le, D. Zhou, others, Chain-of-thought prompting elicits reasoning in large language models, Adv. Neural Inf. Process. Syst. 35 (2022) 24824–24837.

J.W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, others, Scaling language models: Methods, analysis & insights from training gopher, ArXiv Prepr. ArXiv211211446. (2021).

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C.C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P.S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E.M. Smith, R. Subramanian, X.E. Tan, B. Tang, R. Taylor, A. Williams, J.X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open Foundation and Fine-Tuned Chat Models, (2023). https://doi.org/10.48550/arXiv.2307.09288.

D. Arora, H.G. Singh, Mausam, Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models, (2023). https://doi.org/10.48550/arXiv.2305.15074.

S. Gunasekar, Y. Zhang, J. Aneja, C.C.T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi, A. Salim, S. Shah, H.S. Behl, X. Wang, S. Bubeck, R. Eldan, A.T. Kalai, Y.T. Lee, Y. Li, Textbooks Are All You Need, ArXiv.Org. (2023). https://arxiv.org/abs/2306.11644v1 (accessed June 28, 2023).