
Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation

October 5, 2023

Tung Phung (mphung@mpi-sws.org)
Max Planck Institute for Software Systems

Victor-Alexandru Pădurean (vpadurea@mpi-sws.org)
Max Planck Institute for Software Systems

Anjali Singh (singhanj@umich.edu)
University of Michigan

Christopher Brooks∗ (brooksch@umich.edu)
University of Michigan

José Cambronero∗ (jcambronero@microsoft.com)
Microsoft

Sumit Gulwani∗ (sumitg@microsoft.com)
Microsoft

Adish Singla∗ (adishs@mpi-sws.org)
Max Planck Institute for Software Systems

Gustavo Soares∗ (gsoares@microsoft.com)
Microsoft

Abstract

Generative AI and large language models hold great promise in enhancing programming education by automatically generating individualized feedback for students. We investigate the role of generative AI models in providing human tutor-style programming hints to help students resolve errors in their buggy programs. Recent works have benchmarked state-of-the-art models for various feedback generation scenarios; however, their overall quality is still inferior to human tutors and not yet ready for real-world deployment. In this paper, we seek to push the limits of generative AI models toward providing high-quality programming hints and develop a novel technique, GPT4Hints-GPT3.5Val. As a first step, our technique leverages GPT-4 as a “tutor” model to generate hints – it boosts the generative quality by using symbolic information of failing test cases and fixes in prompts. As a next step, our technique leverages GPT-3.5, a weaker model, as a “student” model to further validate the hint quality – it performs an automatic quality validation by simulating the potential utility of providing this feedback. We show the efficacy of our technique via extensive evaluation using three real-world datasets of Python programs covering a variety of concepts ranging from basic algorithms to regular expressions and data analysis using the pandas library.

1         Introduction

Generative AI and large language models (LLMs) have the potential to drastically improve the landscape of computing and programming education by powering next-generation educational technologies. This potential lies in the advanced capabilities of state-of-the-art models—like OpenAI’s GPT-4 [1] and ChatGPT (based on GPT-3.5) [2]—to automatically generate high-quality personalized content and feedback for students [3–5]. A series of recent works have already shown us sparks of their capabilities for various programming education scenarios, including generating new programming assignments [6, 7], providing code explanations [6, 8], repairing buggy programs [9, 10], enhancing programming error messages [10, 11], and acting as pair programmers [12, 13]. In this paper, we investigate the role of LLMs in providing human tutor-style programming hints to help students resolve errors in their buggy programs.

More concretely, given a programming task and a student’s current buggy program, we want to generate natural language hints to help the student resolve bug(s) and make progress, inspired by how a human tutor would give pedagogical feedback. With the current scale of enrollments in introductory programming courses [14], it has become infeasible for human tutors to promptly provide individualized feedback to students, thereby motivating the need to develop automatic feedback generation techniques. To this end, we aim to leverage generative AI and LLMs for automating human tutor-style programming feedback to support students’ learning and reduce human tutors’ workload.

Recent works have studied state-of-the-art LLMs for generating various forms of programming feedback for students, including detailed explanations about bugs or single-sentence hints [4, 10, 11]. Despite promising initial results, the overall quality of feedback generated by LLMs is substantially inferior to that of human tutors and not yet ready for deployment in real-life classroom settings. For instance, a recent benchmark study [4] evaluated GPT-4 in generating hints for buggy programs on introductory Python programming tasks and assessed its quality performance using expert annotations – GPT-4’s performance in terms of hint quality is only about 60%, in contrast to human tutors’ performance of over 90%. This performance gap between GPT-4 and human tutors can be attributed to several factors, as discussed next. First, state-of-the-art models still struggle with the symbolic reasoning and program execution abilities crucial for understanding the underlying bugs and possible student misconceptions [3–5, 15]. Second, these models also suffer from hallucination issues, and the generated feedback text—even though seemingly plausible—may contain inaccurate information that could have detrimental effects on students’ learning [15–17]. Third, these models still lack a calibration mechanism to decide whether the generated content is of high quality or not [10]; in particular, they are unable to reason from a student’s perspective, as a human tutor would, and judge whether the generated feedback would likely help the student.

1.1       Our Approach and Contributions

In this paper, we seek to push the limits of generative AI and state-of-the-art LLMs toward providing high-quality programming hints. Given a base model, this requires improving the model’s abilities at the input level by developing better prompting strategies [18], at the output level by developing mechanisms to validate the generated content [10, 19, 20], or at the model level itself by fine-tuning (when considering open-source models [21]). In our work, we consider OpenAI’s GPT-4 [1] as the base model—the latest model, presumably with over a trillion parameters—as it has been shown to drastically improve over existing models across various programming education scenarios [4].

We develop a novel technique, GPT4Hints-GPT3.5Val, to provide human tutor-style high-quality programming hints. Our technique leverages the GPT-4 model in the role of a “tutor” to generate hints and boosts the generative quality at the input level by prompting it with symbolic information of failing test cases and fixed programs. At the output level, it further validates the hint quality by leveraging the GPT-3.5 model as a “student” to simulate the potential utility of providing this feedback to human students. This validation step is designed to provide a quality assurance layer and decides whether the generated feedback should be provided to the human student or not – thereby trading off coverage (how many students are given automatic feedback) and precision (quality of the given feedback). We show the efficacy of our technique by conducting an extensive evaluation using three real-world datasets of Python programs covering a variety of concepts ranging from writing basic algorithms to regular expressions and data analysis using pandas [22]. Figures 1 and 2 showcase GPT4Hints-GPT3.5Val on two different buggy programs.1 More broadly, our work makes the following contributions in leveraging generative AI and large language models for computing and programming education:

Figure 1: Illustrative example showcasing GPT4Hints-GPT3.5Val for the Palindrome problem shown in (a) from the BasicAlgo dataset. (b) shows a real-world buggy program. (c) shows a fixed program generated by the technique in an intermediate step, and (d) shows a test case where the buggy program fails to produce the correct output. (e) shows a detailed explanation generated by the technique that is used later in the validation stage. (f) shows the generated feedback (a single-sentence hint). (g) highlights that the validation stage of the technique successfully accepted the generated feedback as high-quality and suitable for sharing with the student.
  1. We showcase the utility of prompting the models with symbolic information, such as failing test cases and fixed programs, to enhance their reasoning abilities about the underlying bugs crucial for providing high-quality hints.
  2. We showcase the utility of using LLMs in a flipped role as a “student” model to simulate the potential effect of feedback on real human students. Our results highlight that using a weaker model (GPT-3.5, instead of GPT-4) provides better validation of programming hints from GPT-4. This flipped role opens up new opportunities in utilizing generative AI for in-context student modeling for automatic assessments, learning analytics, and simulations.
  3. Our technique achieves a precision of over 90% (reaching the quality of human tutors in our evaluation) while maintaining a high coverage of over 70% across three real-world datasets covering a variety of Python programming concepts.
Figure 2: Similar to Figure 1, this example showcases GPT4Hints-GPT3.5Val on a buggy program from the DataAnalysis dataset.

1.2       Related Work

Feedback generation for programming education. Prior to recent developments in generative AI and LLMs, research on feedback generation for programming education had primarily focused on fixing buggy programs because of the challenges in automatically generating natural language explanations [23, 24]. A parallel line of research explored crowdsourcing approaches to obtain explanations provided by other students/tutors [25]. Our work builds on recent developments in leveraging LLMs for generating programming feedback [4, 10, 11, 26], in particular, motivated by a recent benchmark study [4] that highlighted a substantial gap in GPT-4’s performance in terms of hint quality w.r.t. human tutors. Another closely related work is [10], which proposed the PyFiXV technique for generating high-precision feedback for syntax errors. PyFiXV has a run-time feedback validation mechanism that leverages OpenAI’s Codex-Edit model [27] at varying temperatures as a “student” model. Inspired by [10], we also leverage an LLM-based “student” model to perform validation. However, the validation method used in PyFiXV is not directly applicable to our setting, as it is designed only for syntax errors, which substantially simplify the validation process; crucially, GPT4Hints-GPT3.5Val is designed to provide feedback for any type of error a student might encounter, including errors related to the program’s time complexity.

Enhancing a model’s generative performance. A series of recent works have focused on enhancing the generative performance of a base model in a black-box setting, given the high monetary or computational costs involved in fine-tuning state-of-the-art models (in fact, OpenAI’s latest GPT-4 model does not offer public APIs for fine-tuning). These works operate either at the input level, by developing better prompting strategies [18], or at the output level, by analyzing and correcting the generated content [10, 19, 20]. Among output-level enhancements, Self-Debugging [19] and Self-Refine [20] are two recently proposed methods that enable an LLM to analyze and correct its output automatically. Another recent work [28] introduced the concept of Self-Repair and showed substantial performance gains when allowing an LLM to repair its output by receiving feedback from a more powerful LLM or an expert. The key intuition behind the validation mechanism in GPT4Hints-GPT3.5Val differs from these works and is more related to [10] discussed above: we utilize another LLM as a “student” model to simulate the potential effect of feedback on real human students.

2         Problem Setup

Programming task and student’s buggy program as input. We start with a programming task 𝓨 and a buggy program Pb. A task 𝓨, such as shown in Figures 1a and 2a, is represented by a textual description of the programming problem. Additionally, this description encompasses all requisite information essential for solving the problem, such as the expected algorithmic complexity and any constraints on the input, as applicable. In cases where the task necessitates interaction with an external file, 𝓨 should also contain all pertinent information about that file crucial for solving the problem, such as the file’s format or structure. Pb, as illustrated in Figures 1b and 2b, is an unsuccessful attempt by the student to solve 𝓨. This program fails to pass at least one of the test cases in the test suite for 𝓨. In general, Pb may contain one or multiple errors, spanning various error types including syntax and semantic errors.

Tutor-style hint as output and quality assessment. Given 𝓨 and Pb, we aim to generate a human tutor-style natural language hint H as feedback to aid the student in understanding and resolving the programming error. The quality assessment of such hints follows the rubric used in [4], which spans multiple distinctive dimensions. Firstly, HCorrect evaluates the correctness of the hint’s information concerning the actual bugs in Pb. Secondly, HInformative assesses whether the hint provides valuable insights to assist the student in grasping and resolving at least one bug (in case there are multiple bugs in Pb). Thirdly, HConceal gauges the hint’s capacity to maintain conciseness and abstraction, preventing a direct revelation of the solution to the student. Fourthly, HComprehensible examines the clarity and absence of redundant information within the hint. All of these rubric dimensions are binary, with a value of 1 indicating that the requirement is satisfied and 0 otherwise. Further, HInformative and HConceal are conditioned on HCorrect, i.e., when the hint is incorrect, these two conditioned dimensions are automatically assigned a value of 0. HOverall measures the overall quality of the hint feedback and is 1 only when all the aforementioned dimensions are satisfied. For our final evaluation, we will assess generated feedback based on HOverall and any additional details about the hint.2
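To make the rubric concrete, the following minimal Python sketch shows how the binary dimensions combine into HOverall, with HInformative and HConceal conditioned on HCorrect (the function name and interface are ours, for illustration only):

    def overall_hint_quality(h_correct, h_informative, h_conceal, h_comprehensible):
        """Aggregate the binary rubric dimensions into HOverall."""
        if h_correct == 0:
            # HInformative and HConceal are conditioned on HCorrect:
            # an incorrect hint forces both to 0.
            h_informative, h_conceal = 0, 0
        # HOverall is 1 only when every dimension is satisfied.
        return int(h_correct and h_informative and h_conceal and h_comprehensible)

    # Example: a correct, informative, concealing, but not comprehensible hint.
    print(overall_hint_quality(1, 1, 1, 0))  # -> 0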

Metrics and objective. Next, we elaborate on the metrics employed for the evaluation of hint generation techniques, as inspired by [10]. In general, given a task 𝓨 and a buggy program Pb, a technique has the option to either provide feedback or not, leading to a certain coverage. Furthermore, the provided feedback can be either of high or low quality, defining the technique’s precision. In particular, coverage is the percentage of instances in which feedback is delivered to the student; precision is the percentage of delivered feedback instances that are of high quality. Having high precision is our main objective, necessary for ensuring that only high-quality feedback is given to students. In this work, we aim to develop a hint generation technique that trades off coverage for precision to achieve a precision level comparable to that of human tutors without compromising too much of its coverage.
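As a small illustration of these two metrics (the data structure below is an assumption of ours, not part of the paper’s evaluation pipeline):

    def coverage_and_precision(decisions):
        """Compute coverage and precision (both in %) from per-instance outcomes.

        `decisions` is a list of (delivered, high_quality) pairs; `high_quality`
        is only meaningful when feedback was actually delivered.
        """
        delivered = [quality for was_delivered, quality in decisions if was_delivered]
        coverage = 100.0 * len(delivered) / len(decisions) if decisions else 0.0
        precision = 100.0 * sum(delivered) / len(delivered) if delivered else 0.0
        return coverage, precision

    # Feedback delivered for 3 of 4 buggy programs; 2 of the 3 are high-quality.
    print(coverage_and_precision([(True, 1), (True, 1), (True, 0), (False, 0)]))
    # -> coverage 75.0%, precision ~66.7%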

3         Our Technique: GPT4Hints-GPT3.5Val

This section gives details about our proposed technique, GPT4Hints-GPT3.5Val, which leverages and improves upon generative AI models for feedback generation. Figure 3 shows an overview of our technique. In essence, GPT4Hints-GPT3.5Val employs GPT-4 as a simulated “tutor” model for generating feedback and GPT-3.5 as a simulated “student” model for feedback validation. In Section 3.1, we describe two types of symbolic information that are helpful for generating feedback and how to obtain them; in Section 3.2, we describe the process of feedback generation augmented with this symbolic information. Subsequently, in Section 3.3, we introduce a novel validation method aiming to elevate the precision of the delivered feedback while maintaining a high level of coverage.

Figure 3: Illustration of different stages in GPT4Hints-GPT3.5Val’s feedback generation process.

3.1       Stage-1: Generate Symbolic Data

Overview and intuition. As discussed in Section 1, there remains a notable performance gap between state-of-the-art generative AI models and human tutors regarding hint generation. One key factor contributing to this disparity is their limited ability to perform symbolic reasoning and program execution. GPT-4 lacks the capability to execute the given code to retrieve an output, which could help it gain a deeper understanding of the underlying bugs. In an effort to mitigate this gap, we employ external tools to execute programs and extract useful symbolic information. We then supply this relevant information to GPT-4 for feedback generation. Our approach centers on leveraging two categories of symbolic data: failing test cases and fixed programs.

Input/output for a failing test case. To highlight the error in the buggy program Pb, we provide GPT-4 with a test case for which Pb fails to produce the expected output. To acquire this test case, we run Pb on the existing test suite given for the corresponding task 𝓨 . The first test case in which Pb fails is selected. We denote the triplet comprising this input, the output generated by Pb, and the expected output, as ω and include it in the prompt for feedback generation.
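A minimal sketch of this step is shown below; it assumes that each test case is an (input, expected output) pair of strings and that the buggy program reads from stdin and writes to stdout, which are simplifying assumptions of ours rather than details fixed by the technique:

    import subprocess

    def first_failing_test(buggy_program_path, test_suite, timeout_seconds=5):
        """Return the triplet omega = (input, buggy output, expected output)
        for the first failing test case, or None if all test cases pass."""
        for test_input, expected_output in test_suite:
            try:
                result = subprocess.run(
                    ["python", buggy_program_path],
                    input=test_input, capture_output=True, text=True,
                    timeout=timeout_seconds,
                )
                actual_output = result.stdout.strip()
            except subprocess.TimeoutExpired:
                actual_output = "<timeout>"  # e.g., time-complexity violations
            if actual_output != expected_output.strip():
                return (test_input, actual_output, expected_output)
        return None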

Fixed program. The fixed program, denoted as Pf, is generated using GPT-4, employing a procedure adapted from the work in [10]. To be more specific, we initiate the process by requesting the model to produce 10 independent fixed programs. For this purpose, we include 𝓨 and Pb in the prompt to ask for 10 outputs (each output contains a fixed program) with the hyperparameter temperature set to 0.5. Then, from this set of 10, we take the programs that pass the test suite for 𝓨 and, among them, identify Pf as the one with the smallest token-edit distance w.r.t. Pb. To compute the token-edit distance between two programs, we first tokenize them using the Pygments library [29] and then calculate the Levenshtein edit distance based on the tokenized strings. If Pf is found, we include it in the prompt for feedback generation. If, however, none of the generated programs is correct, we opt to exclude this symbolic information from the prompt.
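A minimal sketch of this selection step is given below, assuming a passes_test_suite callback that runs a candidate program against the test suite for 𝓨 (the helper names are ours):

    from pygments.lexers import PythonLexer

    def tokenize(code):
        """Tokenize Python code with Pygments, dropping whitespace-only tokens."""
        return [value for _, value in PythonLexer().get_tokens(code) if value.strip()]

    def token_edit_distance(code_a, code_b):
        """Levenshtein edit distance computed over the token sequences."""
        tokens_a, tokens_b = tokenize(code_a), tokenize(code_b)
        previous_row = list(range(len(tokens_b) + 1))
        for i, token_a in enumerate(tokens_a, 1):
            current_row = [i]
            for j, token_b in enumerate(tokens_b, 1):
                current_row.append(min(
                    previous_row[j] + 1,                        # deletion
                    current_row[j - 1] + 1,                     # insertion
                    previous_row[j - 1] + (token_a != token_b)  # substitution
                ))
            previous_row = current_row
        return previous_row[-1]

    def select_fixed_program(buggy_program, candidate_fixes, passes_test_suite):
        """Among candidates that pass the test suite, pick the one closest to
        the buggy program; return None if no candidate is correct."""
        correct = [p for p in candidate_fixes if passes_test_suite(p)]
        if not correct:
            return None
        return min(correct, key=lambda p: token_edit_distance(buggy_program, p))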

3.2       Stage-2: Generate Feedback

Overview and intuition. In this stage, we aim to obtain a human tutor-style hint to be given to the student, as previously mentioned in Section 2. In addition to requesting a hint from GPT-4, we also ask for a detailed explanation, denoted as X, of the bugs in Pb. The decision to request this detailed explanation draws inspiration from Chain-of-Thought [18], an established method renowned for enhancing the reasoning capabilities of LLMs. The essence of the Chain-of-Thought approach lies in encouraging LLMs to explain their thought process meticulously, step by step, prior to presenting the final output. Within the specific context of hint generation, we allow the model to elaborate its reasoning through X before coming up with the eventual concise single-sentence hint H. The hint is essentially an abstracted version of the explanation. Furthermore, X will also play a pivotal role in the subsequent feedback validation stage, which will be elaborated upon in Section 3.3.

Prompt for feedback generation. In Figure 4 (top), we provide our prompt for generating feedback. This prompt comprises the problem description for 𝓨 , the buggy program Pb, the symbolic information as extracted from the previous stage, and a request for an explanation X along with a hint H. To get a response from GPT-4, we use this prompt while configuring the hyperparameter temperature to 0, indicating our preference for the most probable answer. All other hyperparameters are kept at their default settings. Following this, X and H are then extracted automatically from the output.
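A minimal sketch of this generation call is shown below; it uses the chat-completions interface of the openai Python package as it existed for these model versions, and the prompt text is an abbreviated stand-in for the full prompt of Figure 4 (top), not a verbatim reproduction:

    import openai

    def generate_explanation_and_hint(task_description, buggy_program,
                                      failing_test=None, fixed_program=None):
        """Query the 'tutor' model (GPT-4) once at temperature 0 for a detailed
        explanation X followed by a single-sentence hint H."""
        prompt = (f"Problem description:\n{task_description}\n\n"
                  f"Buggy program:\n{buggy_program}\n\n")
        if failing_test is not None:
            test_input, actual_output, expected_output = failing_test
            prompt += (f"Failing test case input:\n{test_input}\n"
                       f"Program output:\n{actual_output}\n"
                       f"Expected output:\n{expected_output}\n\n")
        if fixed_program is not None:
            prompt += f"Fixed program:\n{fixed_program}\n\n"
        prompt += ("First explain the bug(s) in detail, then give a single-sentence "
                   "hint that does not directly reveal the fix.")
        response = openai.ChatCompletion.create(
            model="gpt-4-0613",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # most probable answer; other hyperparameters at defaults
        )
        # X and H would then be parsed out of this content (parsing omitted here).
        return response["choices"][0]["message"]["content"]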

3.3       Stage-3: Validate Feedback

Overview and intuition. This stage aims to enhance the precision of the feedback provided to the student. It is worth noting that despite the inclusion of augmented symbolic information in the prompt, the hint generated in Stage-2 may not always align with the desired quality criteria outlined in Section 2. To mitigate this issue, we introduce a validation approach that leverages an additional AI model, specifically GPT-3.5, to simulate students’ interaction with feedback. The fundamental idea behind this approach is to evaluate the quality of feedback by assessing its impact on the simulated students’ ability to fix the bugs. If, with the help of the feedback, the simulated students find it easier to fix Pb, then the feedback is deemed high-quality and can be subsequently provided to the real student. More concretely, we will use the detailed explanation X (instead of the single-sentence hint H) and a weaker “student” model GPT-3.5 (instead of GPT-4) to assess the utility of feedback for fixing the bugs. In our evaluation (Section 4.4 and Figure 7), we will demonstrate the effectiveness of these design choices.

Two prompts for validation. Figure 4 (middle and bottom) illustrates the two prompts for feedback validation. Both prompts essentially instruct the “student” model (GPT-3.5) to fix Pb. The primary distinction lies in the fact that, in contrast to the second (standard) prompt, the first (augmented) prompt additionally incorporates the explanation X. For each prompt, we ask GPT-3.5 to generate a set of n = 10 independent outputs (with the temperature set to 0.5, as in Stage-1), effectively utilizing GPT-3.5 in the role of 10 simulated students. We denote the number of correct output programs resulting from the standard prompt as n1, and the number of correct output programs resulting from the augmented prompt as n2. The correctness of a program is determined by its ability to pass the whole test suite for the corresponding task 𝓨. Next, we explain how we use these quantities for feedback validation.
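A minimal sketch of this sampling step is shown below (the program-extraction step and the passes_test_suite callback are simplifications of ours); it is called once with the standard prompt to obtain n1 and once with the augmented prompt to obtain n2:

    import openai

    def count_correct_repairs(prompt, passes_test_suite, n=10, temperature=0.5):
        """Ask the 'student' model (GPT-3.5) for n independent repair attempts
        and count how many of the returned programs pass the test suite."""
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            n=n,  # n = 10 samples act as 10 simulated students
        )
        # Simplification: treat each completion as the repaired program; in
        # practice the program would be extracted from the model's output.
        programs = [choice["message"]["content"] for choice in response["choices"]]
        return sum(passes_test_suite(program) for program in programs)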

Validation rules. Our main idea for validation is that good feedback should make it easier for students to fix the buggy program than without it. Thus, the primary rule for feedback validation is to have n2 ≥ n1. Nonetheless, in situations where n1 assumes particularly low values, e.g., n1 = 0 or n1 = 1, this condition becomes less stringent and any feedback, regardless of its quality, may pass the validation. To address this, we incorporate an additional requirement to ensure that n2 attains a sufficient level independently. This is achieved through the inclusion of the following condition: (n2/n ≥ α) ∨ (n2/n ≥ n1/n + β), where we instantiate α as 0.50 and β as 0.25. In other words, we require the ratio of correct output programs generated with the help of the explanation to either exceed a fixed threshold (i.e., n2/n ≥ 0.50) or be substantially higher than the ratio of correct output programs generated without the explanation (i.e., n2/n ≥ n1/n + 0.25), or both. Consequently, our final validation method approves a feedback instance only when the following condition holds true: (n2 ≥ n1) ∧ ((n2/n ≥ 0.50) ∨ (n2/n ≥ n1/n + 0.25)), and rejects it otherwise. In our experiments (Section 4), we will also compare the performance of different variants of threshold rules.
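Written out as code, the validation rule reads as follows (a direct transcription of the condition above, with n = 10):

    def validate_feedback(n1, n2, n=10, alpha=0.50, beta=0.25):
        """Accept the feedback only if the augmented prompt does not hurt
        (n2 >= n1) and its success ratio is either high in absolute terms
        or clearly higher than the baseline."""
        relative_condition = n2 >= n1
        absolute_threshold = (n2 / n) >= alpha
        margin_over_baseline = (n2 / n) >= (n1 / n) + beta
        return relative_condition and (absolute_threshold or margin_over_baseline)

    # Examples from the qualitative analysis in Section 4.4:
    print(validate_feedback(n1=2, n2=6))  # True  (accepted, as in Figure 1)
    print(validate_feedback(n1=8, n2=0))  # False (rejected, as in Figure 11)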

Multiple trials. When a feedback instance is rejected by the validation method, we may decide not to deliver it to the student [10]. However, adhering strictly to this inevitably leads to a reduction in coverage. To enhance precision while still maintaining a high level of coverage, we introduce an additional layer to the overall procedure. In particular, if a feedback instance is rejected, we restart the entire process, including the acquisition of symbolic information, the generation of a hint, and the subsequent validation. We maintain this iterative cycle until either a generated feedback instance is approved by the validation mechanism or a predefined maximum number of iterations, denoted as k, is reached (we set k = 3). If none of the feedback instances pass validation after k trials, we terminate this outer loop and do not deliver any feedback to the student. In real-world scenarios, such cases likely correspond to challenging bugs; under these circumstances, a human tutor could step in and take over the work of providing feedback to the student.
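The outer loop can be summarized by the following sketch, where the three stage procedures described in Sections 3.1-3.3 are passed in as callables (an organizational choice of this sketch, not of the technique itself):

    def generate_validated_feedback(task, buggy_program,
                                    stage1_symbolic, stage2_generate, stage3_validate,
                                    k=3):
        """Re-run Stages 1-3 until a feedback instance passes validation or
        k trials are exhausted; return None if no feedback is delivered."""
        for _ in range(k):
            symbolic_info = stage1_symbolic(task, buggy_program)
            explanation, hint = stage2_generate(task, buggy_program, symbolic_info)
            if stage3_validate(task, buggy_program, explanation):
                return hint  # validated feedback, delivered to the student
        return None          # withhold feedback; a human tutor can step in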

4         Experimental Evaluation

In this section, we evaluate our technique, GPT4Hints-GPT3.5Val, across three datasets spanning different domains of introductory Python programming. We assess the performance of GPT4Hints-GPT3.5Val in comparison to baselines such as GPT-4 and human tutors. Furthermore, we compare our validation stage with various alternative variants. In our experiments, we use OpenAI’s GPT-4 (model=gpt-4-0613) as the “tutor” model and ChatGPT based on GPT-3.5 (model=gpt-3.5-turbo-0613) as the “student” model unless otherwise stated.

Figure 4: Prompts employed by GPT4Hints-GPT3.5Val for feedback generation (top) and feedback validation (middle and bottom).
Figure 5: Overview of the datasets used in this work. See Section 4.1 for details.

4.1       Datasets

To comprehensively assess the techniques’ performance across diverse domains within introductory programming education, we use three datasets representing different types of learning objectives, as summarized in Figure 5. All datasets consist of students’ buggy programs written in the Python programming language. Below, we provide a detailed description of each of these datasets.

The first dataset, BasicAlgo, was introduced in [4]. It covers five popular introductory Python problems, and for each problem, there are five corresponding buggy programs. These problems capture a diverse set of basic programming concepts. The problems are: GCD (finding the greatest common divisor of two given numbers), Fibonacci (generating the list of Fibonacci numbers up to a given value), DivisorsDiv3 (counting the divisors of a given number that are divisible by 3), Palindrome (checking whether a given string is a palindrome or not), and MergeStrs (merging two given strings alternately). The buggy programs come from different users on the geeksforgeeks.org platform [30] and demonstrate a variety of bug types and code lengths. Some of the bugs present in this dataset are: unawareness of the mutability of lists (Figure 1), misuse of the ‘range’ function, violation of the required time complexity (Figure 10), and misordering of variables (Figure 11).
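As a hypothetical illustration of the list-mutability bug type (not one of the actual buggy programs in the dataset; Figure 1b shows a real instance):

    def is_palindrome_buggy(s):
        chars = list(s)
        reversed_chars = chars          # bug: aliases the same list object,
        reversed_chars.reverse()        # so reversing it also "reverses" chars
        return chars == reversed_chars  # always True

    def is_palindrome_fixed(s):
        chars = list(s)
        return chars == chars[::-1]     # slicing returns a new, reversed copy

    print(is_palindrome_buggy("abc"), is_palindrome_fixed("abc"))  # True False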

The second dataset, DataRegex, comes from an introductory data science programming course. This course is part of an online master’s degree program in applied data science; students enrolling in the course are required to have basic Python programming and statistics knowledge. We examine Exercise 2 of the first assignment of the course, which requires the students to use regular expressions to extract some information from a text file. In particular, the text file contains people’s names and their corresponding grades; the students need to fix a given buggy function so that it correctly reads the text file, matches a regular expression, and captures and returns a list of people who got a grade of B. To solve the problem, students need to be able to understand and apply basic regular expression concepts such as wildcard characters, grouping, lookaround, and quantification. This dataset contains 24 buggy submissions, each from a unique student. For each student with multiple buggy submissions, we take only the median one (w.r.t. submission time) to include in the dataset. Some of the common types of bugs students made are: mishandling of grouping (Figure 9), returning the names of all people in the text file, and returning only people’s last names. It is worth noting that there is only one test case in the test suite for this problem; this is in contrast to algorithmic problems, such as the ones in BasicAlgo, in which the test suites usually comprise a large number of input/output cases.
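As a hypothetical illustration of the kind of solution the exercise expects (the actual file format and regular expression are not reproduced here, so the "Name: Grade" line format below is an assumption made purely for illustration):

    import re

    def students_with_grade_b(text):
        """Capture the names of people whose grade is B, one record per line."""
        # Group 1 captures the full name; incorrect grouping is exactly the
        # kind of mistake highlighted in Figure 9.
        pattern = r"^(\w+ \w+): B$"
        return re.findall(pattern, text, flags=re.MULTILINE)

    sample = "Ada Lovelace: A\nAlan Turing: B\nGrace Hopper: B\n"
    print(students_with_grade_b(sample))  # ['Alan Turing', 'Grace Hopper']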

The third dataset, DataAnalysis, is from the second exercise of the second assignment in the same data science course. At that point, the students had learned to use data manipulation libraries such as pandas to load, filter, and extract meaningful information from data-frames. For this problem, the students are given a csv file containing the data-frame, a 252-page data guide file, a problem description, and a function signature. The problem asks the students to complete the given empty function to compute the ratios of vaccinated children who contracted chickenpox versus those who were vaccinated but did not contract chickenpox, separated by sex. To solve this problem, besides basic Python syntax, the students also need to know how to select and use relevant libraries (such as pandas), understand and search for relevant information in the extensive data guide, and deal with missing data. We sample 30 buggy programs, each from a different student, using the same procedure as above, to form this dataset. Some bugs in the dataset are: mis-filtering of data (Figure 2), misreading the requirements and computing a wrong ratio, and forgetting to handle or wrongly handling missing values.
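As a hypothetical illustration of the kind of computation the exercise expects (the real column names come from the assignment's data guide; the names used below are placeholders chosen for illustration):

    import pandas as pd

    def chickenpox_ratio_by_sex(df):
        """Ratio of vaccinated children who contracted chickenpox to vaccinated
        children who did not, computed separately for each sex."""
        # Keep only vaccinated children with a recorded chickenpox outcome;
        # mishandling this filtering / missing-data step is a common bug.
        vaccinated = df[(df["vaccinated"] == 1) & df["had_chickenpox"].notna()]
        ratios = {}
        for sex, group in vaccinated.groupby("sex"):
            contracted = (group["had_chickenpox"] == 1).sum()
            not_contracted = (group["had_chickenpox"] == 0).sum()
            ratios[sex] = float(contracted) / float(not_contracted)
        return ratios

    toy = pd.DataFrame({
        "sex": ["male", "male", "female", "female", "female"],
        "vaccinated": [1, 1, 1, 1, 0],
        "had_chickenpox": [1, 0, 0, 0, 1],
    })
    print(chickenpox_ratio_by_sex(toy))  # {'female': 0.0, 'male': 1.0}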

4.2       Baselines and Variants of Our Technique

Baseline GPT-4 and human tutors. As our first baseline, we employ GPT-4 in a straightforward manner by presenting it with the task description and the buggy program in the prompt to generate feedback. The format of the prompt closely resembles that depicted in Figure 4 (top), albeit without the inclusion of additional symbolic information. The second baseline employs human tutors with experience in Python programming and tutoring, who serve as the gold standard for our technique to match. In our experiments, two human tutors are employed to give hints independently. From here on, we refer to these baselines as GPT4Hints-Base and TutorHints, respectively.

Variants of our technique without validation. As mentioned previously, we introduce two additional types of symbolic information into our prompt for feedback generation. These additions consist of a failing test case and a fixed program, given that a correct fixed program can be produced (refer to Section 3.1). Accordingly, we have formulated two variant techniques: (i) GPT4Hints-IO enhances GPT4Hints-Base by incorporating the failing test case into the prompt; (ii) GPT4Hints-IOFix integrates both of these types of symbolic information into the prompt. Note that neither of these techniques employs validation, i.e., the generated feedback is always deemed suitable for sharing.

Variations of the validation stage in our technique. Next, we consider variants of GPT4Hints-GPT3.5Val in terms of the validation stage. First, we look at the role of multiple trials when a feedback instance fails validation. We compare our technique with a variant where there is only a single trial (i.e., k = 1). Second, we examine the performance if GPT-4 is used as the simulated “student” model instead of GPT-3.5. Third, we investigate the case wherein the generated single-sentence hint, instead of the detailed explanation, is utilized in the validation process. Fourth and last, we vary the threshold rule used for validation. In this regard, there are three variations: (n2 ≥ n1) ∧ (n2/n ≥ α), i.e., β is not considered in the rule; (n2 ≥ n1) ∧ (n2/n ≥ n1/n + β), i.e., α is not considered in the rule; and (n2/n ≥ α) ∨ (n2/n ≥ n1/n + β), i.e., the relative condition n2 ≥ n1 is not considered in the rule.

4.3       Evaluation Procedure

Feedback ratings. The evaluation of hint feedback is carried out following the rubric outlined in Section 2. In essence, we base our feedback evaluation on the overall hint quality (HOverall) and the quality of the explanation (see Footnote 2), denoted as ECorrect. More concretely, the Overall rating is 1 if and only if both HOverall and ECorrect are 1, and 0 otherwise.

Reported metrics. We employ two human evaluators to independently annotate the feedback produced by all techniques.3 Then, we compute the precision and coverage for each technique and each evaluator separately, before reporting the final results for each technique aggregated over the two evaluators as mean (stderr).

4.4       Results

Comparison with baselines and human tutors. Figure 6 illustrates a comparison between our technique and the baselines. It is evident that GPT4Hints-Base exhibits a substantial performance gap when compared to TutorHints. This gap is partially mitigated by the incorporation of failing test cases and fixed programs in the prompt, as seen with GPT4Hints-IO and GPT4Hints-IOFix, respectively.4 Our final technique, GPT4Hints-GPT3.5Val, consistently achieves precision levels comparable to TutorHints, surpassing 90% across all datasets.5 Importantly, the trade-off in coverage required to attain such high precision remains modest, as our technique still maintains a coverage rate exceeding 70% for every dataset. More detailed results w.r.t. all quality dimensions are shown in Figure 8. This figure also highlights that

Figure 6: Results for different techniques on three real-world Python programming datasets. For each technique and dataset, results are averaged across two evaluators and reported as mean (stderr) as per the evaluation procedure in Section 4.3. Our technique, GPT4Hints-GPT3.5Val, performs validation of the generated feedback to achieve a higher quality of the feedback in terms of precision level, thereby trading off precision and coverage. Our technique is able to achieve a precision of over 90% reaching the quality of human tutors while maintaining a high coverage of over 70% across three real-world datasets; see Section 4.4 for a detailed discussion of results.
Figure 7: Comparison of performance between GPT4Hints-GPT3.5Val and different variants w.r.t. the validation stage. The first four variations (single trial, GPT-4 student model, using H, and threshold without considering n1) show how different design choices in our validation stage help improve the precision-coverage trade-off. The last two variations with simplified threshold rules show the robustness of the default threshold rule in terms of α and β. See Sections 3.3 and 4.4 for further details.
Figure 8: Fine-grained results w.r.t. evaluation rubric that assesses the quality of generated feedback across different attributes as discussed in Sections 2 and 4.3. For our technique, these fine-grained results demonstrate a high correlation between generating a correct detailed explanation (reasoning) and generating a high-quality hint.

for our technique, there is a close relation between reasoning (demonstrated through generating a correct detailed explanation) and generating a high-quality hint. This further justifies why the explanation can be used to validate the hint.

Comparison with variations of the validation stage. Figure 7 shows the performance of different variants in comparison to GPT4Hints-GPT3.5Val. Notably, with a single trial (i.e., k = 1), there is a substantial decrease in coverage across all datasets. This result underscores the marked effect of incorporating multiple trials in maintaining a high coverage level. Intriguingly, when we substitute GPT-3.5 with the more advanced

Figure 9: Similar to Figure 1, this example showcases GPT4Hints-GPT3.5Val on a buggy program from the DataRegex dataset.

model, GPT-4, as the simulated “student” model, there is actually a reduction in precision. This suggests that weaker models can be better suited for the role of simulated student, showing that, in some cases, weaker models still hold advantages over stronger ones. Similarly, using hints instead of explanations for validation yields inferior performance in general, as the explanation contains more details about the bugs and fixes and thus better widens the performance gap between the standard and the augmented prompt. Regarding variants of the validation rule, the overall performance remains relatively stable when α or β is excluded from the rule, suggesting robust performance irrespective of the specific settings of these hyperparameters. However, a noticeable decline in performance is observed when the relative condition (n2 ≥ n1) is omitted, highlighting its importance in the validation process.

Qualitative analysis. We have included a few illustrative examples to showcase the effectiveness of our technique. Figures 1, 2, 9, and 10 exemplify cases where GPT4Hints-GPT3.5Val generated high-quality feedback during Stage-2 that was then successfully accepted during Stage-3. Conversely, for the scenario in Figure 11, GPT4Hints-GPT3.5Val’s Stage-2 failed to produce high-quality feedback in all three trials, but Stage-3 successfully rejected all of those low-quality feedback instances. To be more specific, the values of n1 and n2 for the three trials in this case were {n1 = 8, n2 = 0}, {n1 = 6, n2 = 0}, and {n1 = 5, n2 = 0}, respectively. In contrast, in the example shown in Figure 1, GPT4Hints-GPT3.5Val’s Stage-2 generated high-quality feedback during the first trial, and Stage-3 subsequently accepted it with values {n1 = 2, n2 = 6}.

5         Concluding Discussions

We investigated the role of generative AI and large language models in providing human tutor-style programming hints to help students resolve errors in their buggy programs. In particular, we focused on improving the quality of generated feedback, which is crucial for deployment in real-life classroom settings. We developed a novel technique, GPT4Hints-GPT3.5Val, that leverages GPT-4 as a “tutor” model to generate hints and GPT-3.5 as a “student” model to validate the hint quality. This validation step provides a layer of quality assurance, thereby trading off coverage (how many students are given automatic feedback) and precision (quality of the given feedback).

Figure 10: Similar to Figure 1, this example showcases GPT4Hints-GPT3.5Val on a buggy program for the GCD problem from BasicAlgo.

We performed an extensive evaluation to showcase the efficacy of our technique on three real-world Python programming datasets, reaching the precision level of human tutors.

Our work has two important implications for the research community interested in leveraging generative AI and large language models for computing and programming education. First, our results show how we can effectively utilize these models as a “tutor” by prompting them with symbolic data such as failing test cases. This symbolic data essentially provides in-context information to enhance the reasoning and execution abilities of these models where they typically struggle. Second, our results show how we can utilize these models in a flipped role as a “student” to simulate the effect of feedback on a real human student. Interestingly, we also showed that a weaker model (GPT-3.5, instead of GPT-4) serves as a better “student” model for validating the effect of feedback generated by GPT-4. This flipped role opens up new opportunities in utilizing generative models as in-context student models for automatic assessments, learning analytics, and simulations.

Next, we discuss some limitations of our current work and ideas to tackle them in the future. First, our work involved OpenAI’s GPT family of models; it would be useful to evaluate alternative generative models, in particular, open-source variants like Llama-2. Second, our work didn’t leverage historical data on a given problem when generating hints, e.g., hints provided by human tutors for previous students’ buggy attempts on a problem. It would be important to develop techniques that can leverage this data, e.g., by fine-tuning open-source variants to generate better-quality hints. Third, our evaluation considered small datasets comprising a total of 79 buggy programs; it would be useful to scale up the studies by considering larger-scale datasets. Fourth, we focused only on Python programming education; it would be interesting to conduct a similar study for other programming languages and other domains beyond programming. Fifth, our evaluation only considered expert-based annotations and didn’t involve students; it would be important to conduct studies with students to evaluate techniques from their perspectives.

Figure 11: Similar to Figure 1, this example showcases GPT4Hints-GPT3.5Val on a buggy program for the MergeStrs problem from the BasicAlgo dataset. For this example, the generated detailed explanation and single-sentence hint feedback are not correct (e.g., the explanation suggests fixing the program based on a different slicing strategy, which is not related to the bug in this program). The validation stage of the technique (which evaluates the potential utility of this detailed explanation, cf. Figure 3) successfully rejected the generated hint as low-quality and not suitable for sharing with the student. See Section 4.4 for further discussion of results.

Acknowledgments. Funded/Co-funded by the European Union (ERC, TOPS, 101039090). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.

References

[1]   OpenAI. GPT-4 Technical Report. CoRR, abs/2303.08774, 2023.

[2]   OpenAI. ChatGPT. https://openai.com/blog/chatgpt, 2023.

[3]   Sébastien Bubeck et al. Sparks of Artificial General Intelligence: Early Experiments with GPT-4. CoRR, abs/2303.12712, 2023.

[4]   Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors. In ICER V.2, 2023.

[5]   Adish Singla. Evaluating ChatGPT and GPT-4 for Visual Programming. In ICER V.2, 2023.

[6]   Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In ICER, 2022.

[7]   Victor-Alexandru Pădurean, Georgios Tzannetos, and Adish Singla. Neural Task Synthesis for Visual Programming. CoRR, abs/2305.18342, 2023.

[8]   Stephen MacNeil, Andrew Tran, Arto Hellas, Joanne Kim, Sami Sarsa, Paul Denny, Seth Bernstein, and Juho Leinonen. Experiences from Using Code Explanations Generated by Large Language Models in a Web Software Development E-Book. In SIGCSE, 2023.

[9]   Jialu Zhang, José Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen. Repairing Bugs in Python Assignments Using Large Language Models. CoRR, abs/2209.14876, 2022.

[10]    Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. Generating High-Precision Feedback for Programming Syntax Errors using Large Language Models. In EDM, 2023.

[11]    Juho Leinonen, Arto Hellas, Sami Sarsa, Brent N. Reeves, Paul Denny, James Prather, and Brett A. Becker. Using Large Language Models to Enhance Programming Error Messages. In SIGCSE, 2023.

[12]    GitHub. GitHub Copilot: Your AI Pair Programmer. https://github.com/features/copilot, 2022.

[13]    Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming. CoRR, abs/2210.14306, 2022.

[14]    Samim Mirhosseini, Austin Z. Henley, and Chris Parnin. What is Your Biggest Pain Point? An Investigation of CS Instructor Obstacles, Workarounds, and Desires. In SIGCSE, 2023.

[15]    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. CoRR, abs/2302.04023, 2023.

[16]    Natalie Kiesler, Dominic Lohr, and Hieke Keuning. Exploring the Potential of Large Language Models to Generate Formative Programming Feedback. 2023.

[17]    Tiffany Wenting Li, Silas Hsu, Max Fowler, Zhilin Zhang, Craig B. Zilles, and Karrie Karahalios. Am I Wrong, or Is the Autograder Wrong? Effects of AI Grading Mistakes on Learning. In ICER, 2022.

[18]    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS, 2022.

[19]    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching Large Language Models to Self-Debug. CoRR, abs/2304.05128, 2023.

[20]    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative Refinement with Self-Feedback. CoRR, abs/2303.17651, 2023.

[21]    Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR, abs/2307.09288, 2023.

[22]    Wes McKinney et al. pandas: A Foundational Python Library for Data Analysis and Statistics. Python for High Performance and Scientific Computing, 14(9):1–9, 2011.

[23]    Rishabh Singh, Sumit Gulwani, and Armando Solar-Lezama. Automated Feedback Generation for Introductory Programming Assignments. In PLDI, 2013.

[24]    Sumit Gulwani, Ivan Radicek, and Florian Zuleger. Automated Clustering and Program Repair for Introductory Programming Assignments. In PLDI, 2018.

[25]    Andrew Head, Elena L. Glassman, Gustavo Soares, Ryo Suzuki, Lucas Figueredo, Loris D’Antoni, and Björn Hartmann. Writing Reusable Code Feedback at Scale with Mixed-Initiative Program Synthesis. In Learning @ Scale, 2017.

[26]    Maciej Pankiewicz and Ryan Shaun Baker. Large Language Models (GPT) for Automating Feedback on Programming Assignments. CoRR, abs/2307.00150, 2023.

[27]    OpenAI. Codex-Edit. https://beta.openai.com/playground?mode=edit&model=code-davinci-edit-001, 2022.

[28]    Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. Demystifying GPT Self-Repair for Code Generation. CoRR, abs/2306.09896, 2023.

[29]    Georg Brandl, Matthäus Chajdas, and Jean Abou-Samra. Pygments. https://pygments.org/, 2006.

[30]    geeksforgeeks.org. GeeksforGeeks: A Computer Science Portal for Geeks. https://www.geeksforgeeks.org/, 2009.

[31]    William G Cochran. The χ2 Test of Goodness of Fit. The Annals of Mathematical Statistics, 1952.