Predictive Process Mining
Towards Reproducibility in Predictive Process Mining: SPICE - A Deep Learning Library
In recent years, Predictive Process Mining (PPM) techniques based on artificial neural networks have evolved as a method for monitoring the future behavior of unfolding business processes and predicting Key Performance Indicators (KPIs). However, many PPM approaches lack reproducibility, transparency in decision making, and usability for incorporating novel datasets and benchmarking, which makes comparisons among different implementations very difficult. In this paper, we propose SPICE (Standardized Process Intelligence Comparison Engine), a Python framework that reimplements three popular deep-learning-based baseline methods for PPM in PyTorch on top of a common base framework with rigorous configurability, enabling reproducible and robust comparison of past and future modelling approaches. We compare SPICE against the originally reported metrics as well as under fair metrics on 11 datasets.
Executive Impact: Enhancing Trust and Performance in PPM
This paper addresses the significant reproducibility challenges in Predictive Process Mining (PPM), a field crucial for monitoring and forecasting business process behavior. It highlights how inconsistent data splitting, varying preprocessing strategies, and the lack of standardized benchmarks have made comparing different deep learning implementations difficult. By proposing SPICE, a new Python library, the authors aim to standardize methods, enable robust comparisons, and mitigate common experimental design flaws, ultimately fostering more trustworthy and comparable research in PPM.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Introduction
The rapid digitalization of business processes and advances in digital monitoring technologies have enabled organizations to systematically track real-world operational events through digital traces. The volume and velocity of the captured process data have created a pressing need for meaningful analytical methods. Process Mining (PM) addresses this need by providing post-hoc analysis of event logs, while the sub-field of Predictive Process Mining (PPM) extends these capabilities to support proactive decision-making. By performing predictive tasks on ongoing process instances, such as suffix prediction, remaining time prediction, and outcome prediction, PPM enables organizations to anticipate critical Key Performance Indicators (KPIs) and intervene before process completion.

Since process data is sequential by nature, unfolding over time, advances from the domain of Natural Language Processing (NLP), especially neural-network-based architectures such as Long Short-Term Memory (LSTM) [12] and Transformer-based models [33], appear as a natural fit and have inspired numerous publications in PPM [31, 6, 5, 24]. However, while the domain has seen advances in model architectures, standardized and openly available real-world datasets remain rare, and the research field of PPM has therefore focused on transferring promising methods from other areas of application to PPM use cases before creating a level playing field. Other domains, such as computer vision [18, 8, 17, 19] or NLP [20, 34], have created datasets with clear splits, enabling fair conditions for future research and continuous benchmarking. The most prominent PPM datasets, the BPI Challenges, have never introduced such predefined splits, leading to contributions with different splitting criteria and, moreover, diverging preprocessing and filtering strategies (e.g., using a filtered subset of a dataset while other authors use the original full one). This renders those contributions hardly reproducible and barely comparable with already published work. The phenomenon is not specific to PPM but also appears in many other domains and is referred to as the reproducibility crisis [3, 16]. While others, such as Rama-Maneiro et al. [24], have already tried to reproduce PPM results, some of the original mistakes have been copied, thereby also carrying over experimental design flaws. Nevertheless, the pioneering works [31, 6, 5] are still cited as benchmark results in recent PPM publications without correcting these deviations.
SPICE - Standardized Process Intelligence Comparison Engine
In this work, we aim to highlight these flaws. To address them, we introduce SPICE - Standardized Process Intelligence Comparison Engine, an open-source¹ Python library that tackles three major pain points in PPM:

1. Reimplementation of prominent baseline models in a single library using PyTorch [22] for Next Activity, Next Timestamp, Remaining Time and Suffix Prediction.
2. Standardization of common reproducibility concerns such as data splitting, shuffling, and pre-processing, and the removal of experimental design flaws, to enable fair comparison across architectures and datasets.
3. High modularization to easily incorporate future methods and improve existing architectures.

The workflow of SPICE is visualized in Figure 1.
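To illustrate the kind of configurability such a shared base framework has to pin down, the sketch below defines a single experiment configuration covering dataset, split strategy, seed, and model choice. All names here (`ExperimentConfig` and its fields) are hypothetical illustrations and not the actual SPICE API.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentConfig:
    """Hypothetical experiment configuration (not the actual SPICE API).

    Illustrates the settings a shared PPM framework has to pin down so that
    runs are reproducible and comparable across architectures and datasets.
    """
    dataset: str = "helpdesk"
    model: str = "tax_lstm"                    # e.g. "tax_lstm", "camargo_lstm", "bukhsh_transformer"
    tasks: tuple = ("next_activity", "next_timestamp", "remaining_time", "suffix")
    split_strategy: str = "case_id"            # alternatives: "temporal", "hybrid"
    split_fractions: tuple = (0.7, 0.1, 0.2)   # train / validation / test
    seed: int = 42
    hyperparameters: dict = field(default_factory=dict)

# One object fully describes a run and can be stored alongside its results.
config = ExperimentConfig(dataset="bpi_2012", model="bukhsh_transformer", seed=7)
print(config)
```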
Deep Learning in PPM
These prediction tasks have motivated the adoption of deep learning approaches in PPM, with researchers drawing inspiration from NLP techniques due to the sequential nature of business process event logs. The application of deep learning to PPM began with the pioneering work of Evermann et al. [9], who introduced a fundamental shift in the PPM modeling paradigm by applying LSTM networks directly to event sequences without requiring explicit process models. Drawing parallels between event logs and natural language, they employed dense embeddings and a two-layer LSTM architecture for next-event and time prediction, though their approach was constrained to categorical variables. Extending this foundation, Tax et al. [31] introduced multi-task LSTM models that simultaneously predict the next activity and times. Their approach combines one-hot encoding for activities with time-based feature enrichment, capturing temporal patterns such as work hours. This multi-task framework improved early-stage predictions and generalized successfully to full trace and remaining time predictions, although it exhibited scalability limitations with large event spaces. Camargo et al. [6] advanced LSTM-based PPM by integrating embeddings with tailored network architectures and sophisticated preprocessing strategies. Through comprehensive experiments across nine event logs, they demonstrated improved suffix prediction capabilities, particularly when employing shared architectures and log-normalized time features. Their approach addressed the scalability challenges of large vocabularies while indicating strong potential for simulation applications.

Following the broader transformative impact of Transformer architectures across machine learning (ML) domains [15, 33], Bukhsh et al. [5] introduced a significant paradigm shift to PPM by replacing sequential LSTM-based processing with self-attention mechanisms that enable parallel computation over entire event sequences. This architectural departure addresses fundamental limitations of recurrent approaches by capturing long-range dependencies more effectively while eliminating the sequential bottleneck that constrains the scalability of LSTMs and RNNs. The attention-based framework reduces dependence on manual feature engineering and demonstrates enhanced adaptability across heterogeneous event logs, though this flexibility comes with trade-offs in model interpretability and computational overhead compared to simpler sequential approaches. More recent research has applied Transformer-based architectures to PPM tasks as well [10, 11, 35].

We focus on these three approaches [31, 6, 5] because of (1) their high citation impact and status as de-facto benchmarks, (2) the accessibility of their respective code bases, and (3) the fact that they represent key methodological advances in PPM: the introduction of multi-task learning paradigms, the optimization of LSTM architectures with enhanced preprocessing and embedding strategies, and the paradigm shift to attention-based architectures. In future versions of SPICE we aim to incorporate additional baseline models to provide broader coverage of the PPM landscape.
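To make the multi-task idea concrete, the following is a minimal PyTorch sketch of the architecture family described above: a shared LSTM encoder with one head classifying the next activity and one regressing the next time value. It is an illustration only, not the original implementation of Tax et al. [31] or Camargo et al. [6]; all layer sizes and names are chosen for brevity.

```python
import torch
import torch.nn as nn

class MultiTaskLSTM(nn.Module):
    """Minimal multi-task sketch: shared LSTM encoder with two prediction heads.

    Illustrative only; not the original architecture of Tax et al. [31].
    """

    def __init__(self, num_activities: int, emb_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        # Index 0 is reserved for padding shorter prefixes.
        self.embedding = nn.Embedding(num_activities + 1, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
        # Head 1: classification over the activity vocabulary (next activity).
        self.activity_head = nn.Linear(hidden_dim, num_activities + 1)
        # Head 2: regression of the next time value. The Softplus keeps predictions
        # non-negative -- an activation the original approaches omit (see the
        # discussion of evaluation metrics below).
        self.time_head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Softplus())

    def forward(self, activity_ids: torch.Tensor):
        x = self.embedding(activity_ids)      # (batch, prefix_len, emb_dim)
        _, (h_n, _) = self.encoder(x)         # h_n: (num_layers, batch, hidden_dim)
        last_hidden = h_n[-1]                 # final layer's state for each prefix
        return self.activity_head(last_hidden), self.time_head(last_hidden).squeeze(-1)

# Toy usage: a batch of 4 prefixes, each 10 events long, from a 12-activity vocabulary.
model = MultiTaskLSTM(num_activities=12)
prefixes = torch.randint(1, 13, (4, 10))
activity_logits, next_time = model(prefixes)
print(activity_logits.shape, next_time.shape)  # torch.Size([4, 13]) torch.Size([4])
```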
Experimental Design & Common Mistakes
Rigorous experimental design is essential for the credible evaluation of all ML research. Poor design choices such as biased dataset selection, inconsistent preprocessing, unfair hyperparameter tuning, or inappropriate metrics can lead to misleading conclusions that overstate model performance or obscure genuine algorithmic contributions. A well-designed experiment ensures reproducibility, enables fair comparison between competing approaches, and provides reliable evidence of a method's practical value. This requires careful consideration of dataset representativeness, evaluation protocols, baseline selection, and statistical validation to ensure that reported improvements reflect real advances rather than experimental artifacts. Rama-Maneiro et al. [24] have already conducted a comprehensive review that attempted to reproduce results from multiple papers, including the approaches chosen in this paper. While they fixed some of the experimental design flaws, such as uneven data splits, they did not outline implementation changes such as fixing crucial errors, information leakage problems, or unfair preprocessing in the respective modelling; instead, they focused on the comparison of metrics.

Data Splitting: Data splitting in ML is the practice of dividing the available data into separate subsets, typically for training, validation, and testing. This enables model fitting, hyperparameter selection, and an unbiased estimation of model generalization performance [25, 28]. When performed correctly, data splitting is essential to prevent overfitting and to ensure that a model's predictive ability is accurately evaluated on unseen data, as emphasized in comparative studies and methodological papers. In the context of PPM, data splitting can be performed in various ways: by case ID [31, 6, 5], by time-based splits [1], or by combining both approaches [35]. The latter strategy, however, raises open questions, for example how to handle cases that begin in one split period but end in another, as any temporal split (e.g., train/validation/test cutoff dates) can slice cases across boundaries. Ideally, one would only include finished cases that start and end within the same split. Yet, in practice, datasets with many such overlapping traces (for instance, long case durations within a relatively short data collection window) pose challenges for this approach. Furthermore, activities and (sub-)sequences are often unbalanced, making stratified splits a valid option in some cases. An overview of different strategies is given in Table 1.

Information Leakage: Information leakage happens when a model is exposed during training to data that it would not have access to during real-world predictions. This can lead to misleadingly high performance in evaluations and result in unreliable models [16, 7]. It includes identical or similar samples appearing in both train and test sets, as well as more subtle forms such as temporal leakage or improper preprocessing. Since Table 1 illustrates that there are multiple valid ways of splitting a dataset in PPM, avoiding leakage is challenging. While repeated (sub-)sequences across splits are expected and valid, it may be crucial in some settings to prevent training on future or test-set information [1]. As discussed in Abb et al. [1], datasets are sometimes unbalanced, with some (sub-)sequences observed very often; PPM thus faces a problem of low variation, putting model training at risk of overfitting.
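As a concrete illustration of the case-based splitting discussed above, the snippet below partitions an event log by case ID so that every case ends up in exactly one of the train, validation, or test sets, avoiding the most explicit form of leakage. It is a generic pandas sketch, not the splitting routine shipped with SPICE; the column names `case_id` and `activity` are assumptions.

```python
import numpy as np
import pandas as pd

def split_by_case(log: pd.DataFrame, val_frac: float = 0.1, test_frac: float = 0.2, seed: int = 42):
    """Split an event log by case ID so that no case is shared across splits.

    Generic sketch (not the SPICE API); assumes a 'case_id' column. A time-based
    split would instead cut by timestamp (see Table 1 for an overview of strategies).
    """
    rng = np.random.default_rng(seed)
    case_ids = log["case_id"].unique()
    rng.shuffle(case_ids)

    n_test = int(len(case_ids) * test_frac)
    n_val = int(len(case_ids) * val_frac)
    test_cases = set(case_ids[:n_test])
    val_cases = set(case_ids[n_test:n_test + n_val])

    test = log[log["case_id"].isin(test_cases)]
    val = log[log["case_id"].isin(val_cases)]
    train = log[~log["case_id"].isin(test_cases | val_cases)]
    return train, val, test

# Toy log with five cases; every case lands in exactly one split.
log = pd.DataFrame({
    "case_id":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "activity": ["A", "B", "A", "C", "A", "B", "A", "C", "A", "B"],
})
train, val, test = split_by_case(log, val_frac=0.2, test_frac=0.2)
print(sorted(train["case_id"].unique()), val["case_id"].unique(), test["case_id"].unique())
```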
PPM experiments must define upfront what the models are built for and what they are meant to achieve. After that, splits must be designed carefully so that temporal and causal constraints are respected and training only uses data that would realistically be available in an inference setting. Addressing both explicit (identical sequences) and implicit (temporal, preprocessing, data drifts) leakage is essential for a reliable evaluation of model performance.

Reproducibility and Setting Random Seeds: Controlling sources of randomness by setting random seeds is fundamental for reproducibility in ML, as the stochastic processes inherent in many algorithms can substantially impact results, leading to different outcomes from the same data if seeds are not fixed. Managing random seeds allows experiments to be repeated reliably and ensures that scientific findings are trustworthy and comparable [27]. While it is known that full reproducibility across different hardware configurations might not be achievable in every case², one should always try to mitigate these shortcomings by publishing random seeds and trained model artifacts or by relying on standardized datasets [27, 16].

Evaluation Metrics: While PPM experiments differ greatly in their design, one evaluation metric they commonly share is Accuracy for Next Activity Prediction (see Eq. (7)). The shortcomings of plain Accuracy have been widely discussed, and alternatives such as resampling or algorithmic splitting of data have been introduced [30, 4]. Figure 2 shows the highly unbalanced counts of activities for the Helpdesk dataset with Ncases = 4580. This highlights the need for strategies to deal with imbalanced target labels and thus rules out plain Accuracy as the metric of choice for Next Activity Prediction. We therefore also compare results with the Balanced Accuracy (Eq. (8)) for multiclass predictions, where K = Nactivities. For measuring time differences in next timestamp and remaining time prediction, the Mean Absolute Error (MAE) (Eq. (9)) is commonly used. For this evaluation it has to be noted that neither of the approaches that recurrently predict time targets in suffix prediction [6, 31] uses any kind of positive activation function (e.g., softplus, ReLU) in its models³. This can lead to negative time predictions, especially when exhaustive time padding yields many zero target values and the predicted time distributions are learned from (padded) historic time differences. This is even more problematic when the evaluation metric is MAE⁴, as absolute errors will hide models that tend to predict near-zero values without escaping negative ones. We replicated these shortcomings for the sake of reproducibility.

When measuring errors on suffix prediction, the most commonly used metric in PPM [31, 6, 24] is the Normalized Damerau-Levenshtein (DL) Similarity simDL = 1 − dDL / max(|a|, |b|), which uses the DL-Distance dDL (Eq. (10)). The DL-Distance was originally developed as a distance measure for string comparisons with four possible operations, counting the operations needed to transform a source string a into a target string b.
In PPM, it is applied not to strings but to full token sequences, and it penalizes generated sequences whose activities are merely in the wrong order less than sequences with wrongly predicted activities (transposing two adjacent activities costs one operation, while deleting a wrongly predicted activity and inserting the correct one costs two). Normalization is done by dividing the DL-Distance by the length of the longest sequence, defined as the maximum of the ground truth sequence length and the predicted sequence length. Since the ultimate goal in suffix prediction is to predict correct sequences, we deem this a fair approach. Nonetheless, there are also other metrics commonly used in NLP for comparing sequences, such as the BLEU score (Eq. (11)) or the Jaccard Similarity (Eq. (12)).
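For illustration, a compact implementation of the (restricted) Damerau-Levenshtein distance over activity sequences and the derived normalized similarity is sketched below. It follows the textbook dynamic-programming formulation rather than any specific implementation from the compared papers.

```python
def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance between
    two activity sequences, using the four operations: insert, delete, substitute,
    and transpose adjacent elements."""
    n, m = len(a), len(b)
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]

def dl_similarity(predicted, ground_truth):
    """Normalized DL similarity: 1 - dDL / max(|predicted|, |ground_truth|)."""
    longest = max(len(predicted), len(ground_truth))
    if longest == 0:
        return 1.0
    return 1.0 - damerau_levenshtein(predicted, ground_truth) / longest

# Swapping two adjacent activities counts as one edit, so ordering errors are penalized less.
print(dl_similarity(["A", "C", "B", "D"], ["A", "B", "C", "D"]))  # 0.75
```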
Evaluation & Metric Comparison
In this section we report the metrics obtained with our reimplementations in comparison to the originally reported ones for each of the three chosen approaches. To keep the comparisons as simple as possible, we first report results for next activity and next timestamp prediction. For suffix and remaining time prediction we aligned scales (days vs. hours) and report similarity scores as simDL. We also provide balanced accuracy metrics for next activity prediction in subsection 6.2. In general it has to be noted that our results for suffix prediction might be worse by design, as we drastically increase the pad sizes, which leaves more room for errors due to hallucination.

Overall it becomes clear that while each paper compares its results to the others, they lacked a common experimental design or used different variations of the chosen datasets. The only dataset used by all authors in its original form is the Helpdesk dataset. We do not include derived variants of datasets (such as BPI 2012W or BPI 2012WC), as we strive to create a baseline for the comparison of past and future approaches; the plethora of different dataset combinations is not helpful for a sound and scientifically rigorous comparison. We provide results for all datasets in Table 3 for each respective approach. For datasets that were not used in the original paper, we chose hyperparameters from the most similar dataset in terms of dataset characteristics (see Table 3). For Bukhsh et al. [5], all hyperparameters were kept at their default settings. For Camargo et al. [6] and Tax et al. [31], we re-used the Helpdesk hyperparameters for BPI 2020d, BPI 2020i and Road Traffic, as all of these datasets share a medium number of unique activities and considerably short maximum case lengths. For Hospital, which contains some long cases, we used the best settings from BPI 2013 (BPI 2012 for Tax et al.). Additionally, for Tax et al. [31] we re-used the parameters from BPI 2012 for the BPI 2013 and BPI 2015-1 datasets.

Table 7 reports the next activity prediction metrics of all models, comparing balanced accuracy to plain accuracy scores. All models perform worse on the former, which was expected. The drop in performance for datasets with high imbalance in activities, such as Helpdesk (compare Figure 2), is alarming, as it shows that the models are not able to capture deviations from standard process paths. We again want to emphasize that these deviations represent the most valuable sequences to identify, especially when considering downstream applications. Such deviations are often overlooked in otherwise straightforward processes, yet they provide the most actionable insights. A model that only learns the standard process model and is biased toward frequently occurring process paths might be of little or no practical use.
Conclusion and Future Work
This work outlines a number of deviations in the experimental designs of papers that are frequently used to benchmark new approaches and research ideas in the field of PPM. We have collected and aggregated opinions and findings that emerged while researching the respective modelling approaches and discussed fundamental questions regarding fair data splitting, the use of class-imbalance-aware metrics, and fundamental design flaws. In doing so, we illustrate the shortcomings of the de-facto baseline models in PPM [31, 6, 5]. In our opinion, future research should keep the deviations highlighted in this work in mind and refrain from building on top of these flawed designs. We therefore provide a framework on which future research can build, enabling ablation studies, out-of-the-box comparison of newer methods with older ones, and the recreation of trustworthy baseline metrics using modern ML standards. We encourage researchers to reuse, modify, and extend our framework in the future. Furthermore, we believe that research in PPM should not stop at providing metrics but should also aim for practical impact by including evaluations with key-user studies to demonstrate real-world usefulness, while acknowledging that public process data are scarce and that finding key users for evaluations is challenging. Without such evidence, however, the risk of creating just another theoretical research bubble increases.
Enterprise Process Flow
| Approach | Key Contributions | SPICE Implementation Notes |
|---|---|---|
| Evermann et al. (LSTM) | First application of LSTMs directly to event sequences without explicit process models; dense embeddings and a two-layer LSTM for next-event and time prediction, limited to categorical variables. | Discussed as the pioneering approach; not among the three baselines reimplemented in SPICE. |
| Tax et al. (Multi-task LSTM) | Multi-task LSTMs predicting next activity and times simultaneously; one-hot activity encoding with time-based feature enrichment; generalizes to suffix and remaining time prediction. | Reimplemented in PyTorch as a SPICE baseline with standardized splitting and preprocessing. |
| Camargo et al. (LSTM with Embeddings) | Embeddings combined with tailored architectures and preprocessing; shared architectures and log-normalized time features improve suffix prediction and handle large vocabularies. | Reimplemented in PyTorch as a SPICE baseline with standardized splitting and preprocessing. |
| Bukhsh et al. (Transformer-based) | Self-attention replaces recurrent processing, enabling parallel computation, better long-range dependencies, and reduced manual feature engineering. | Reimplemented in PyTorch as a SPICE baseline with standardized splitting and preprocessing. |
Case Study: Helpdesk Dataset Activity Imbalance
The Helpdesk dataset, with 4580 cases, exhibits highly unbalanced activity counts, as visualized in Figure 2 of the paper. This imbalance significantly affects model performance, particularly when evaluating Next Activity Prediction. While plain Accuracy might appear high, the Balanced Accuracy metric reveals a substantial drop (e.g., from 0.865 Accuracy to 0.398 Balanced Accuracy for Camargo et al.'s model on Helpdesk). This demonstrates that models tend to perform poorly on minority classes, failing to capture deviations from standard process paths. For enterprises, these deviations often represent the most valuable sequences for proactive intervention, indicating that models biased towards frequent paths have limited practical use. SPICE addresses this by providing balanced accuracy metrics for a fairer evaluation.
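The effect described above can be reproduced with a few lines of scikit-learn; the label distribution in this sketch is synthetic and only meant to mimic the kind of imbalance seen in Helpdesk, not the actual data or the reported scores.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

rng = np.random.default_rng(0)

# Synthetic, heavily imbalanced next-activity labels: 90% of events belong to a
# single dominant activity. This mimics the *kind* of imbalance in Helpdesk; it is
# not the actual dataset.
y_true = rng.choice(["resolve", "escalate", "reopen"], size=1000, p=[0.90, 0.07, 0.03])

# A degenerate "model" that always predicts the majority activity.
y_pred = np.full_like(y_true, "resolve")

print(f"accuracy:          {accuracy_score(y_true, y_pred):.2f}")           # ~0.90, looks strong
print(f"balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.2f}")  # ~0.33, reveals the bias
```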
Estimate Your ROI from Reproducible AI
By adopting standardized frameworks like SPICE, organizations can significantly reduce time spent on debugging, re-implementing, and validating predictive process mining models. This leads to faster deployment of reliable AI solutions, improved decision-making, and substantial operational savings. Use our calculator to estimate potential annual savings and reclaimed hours for your enterprise.
SPICE Implementation Roadmap
A structured approach to integrating SPICE into your enterprise, ensuring a smooth transition to reproducible and high-performing predictive process mining.
Phase 1: Initial Integration & Baseline Setup
Integrate SPICE into your existing Python environment. Configure event log ingestion and establish baseline predictive models (Next Activity, Remaining Time) using provided examples. Focus on standardizing data splits and preprocessing pipelines.
Phase 2: Advanced Model Evaluation & Benchmarking
Leverage SPICE's comprehensive metrics (Balanced Accuracy, simDL) to rigorously evaluate baseline models. Benchmark against existing proprietary solutions or previous research. Identify areas where current models underperform due to data imbalance or experimental flaws.
Phase 3: Custom Model Development & Expansion
Utilize SPICE's modular architecture to integrate novel deep learning models or refine existing ones. Experiment with multi-step prediction and suffix generation. Contribute back to the SPICE community to foster collaborative advancements.
Phase 4: Operational Deployment & Continuous Improvement
Deploy validated, reproducible predictive models into production for real-time process monitoring. Establish MLOps practices using SPICE's logging capabilities (MLflow). Continuously refine models based on operational feedback and new data, ensuring robust and reliable performance.
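As a sketch of the MLflow-based tracking mentioned in this phase, the snippet below logs a run's configuration (including the random seed) and its evaluation metrics. It uses the generic MLflow API rather than a SPICE-specific integration, and the experiment name, parameter names, and metric values are illustrative placeholders.

```python
import random

import mlflow

SEED = 42
random.seed(SEED)  # pin randomness so the logged run can be reproduced

mlflow.set_experiment("ppm-next-activity")  # illustrative experiment name

with mlflow.start_run(run_name="transformer-helpdesk-baseline"):
    # Log everything needed to reproduce the run ...
    mlflow.log_params({
        "dataset": "helpdesk",
        "model": "transformer",
        "split_strategy": "case_id",
        "seed": SEED,
    })
    # ... and the metrics used for comparison (values below are placeholders).
    mlflow.log_metrics({
        "accuracy": 0.86,
        "balanced_accuracy": 0.40,
        "mae_days": 3.2,
        "dl_similarity": 0.75,
    })
```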
Ready to Transform Your Process Mining?
Embrace reproducibility and unlock the full potential of Predictive Process Mining with SPICE. Our experts are ready to guide you through implementation and help you build trustworthy, high-impact AI solutions for your enterprise.