Enterprise AI Analysis
An Empirical Framework for Evaluating Semantic Preservation Using Hugging Face
This paper presents an empirical framework to evaluate semantic preservation in learning-enabled software systems (LESS) by analyzing model evolution data from Hugging Face. It addresses the critical challenge of ensuring trustworthiness in ML systems, especially in high-autonomy domains, where traditional software refactoring methods fall short due to the nondeterministic nature of ML semantics. The framework identifies and quantifies semantic drift and refactoring patterns through a large-scale analysis of model commit histories, Model Cards, and performance metrics.
Key Impact Metrics
Our analysis reveals critical insights into ML model evolution and the prevalence of semantic shifts.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
As machine learning (ML) becomes an integral part of high-autonomy systems, ensuring the trustworthiness of learning-enabled software systems (LESS) is critical. Yet the nondeterministic, run-time-defined semantics of ML complicate traditional software refactoring. We define semantic preservation in LESS as the property that optimizations of intelligent components do not alter the system's overall functional behavior. This paper introduces an empirical framework to evaluate semantic preservation in LESS by mining model evolution data from Hugging Face. We extract commit histories, Model Cards, and performance metrics from a large number of models, and we establish baselines through case studies in three domains, tracing performance changes across versions. Our analysis demonstrates how semantic drift can be detected via evaluation metrics across commits and reveals common refactoring patterns based on commit message analysis. Although API constraints prevented us from estimating a full-scale threshold, our pipeline offers a foundation for defining community-accepted boundaries for semantic preservation. Our contributions are: (1) a large-scale dataset of ML model evolution, curated from 1.7 million Hugging Face entries via a reproducible pipeline built on the native HF Hub API; (2) a practical pipeline for evaluating semantic preservation on a subset of 536 models and 4,000+ metrics; and (3) empirical case studies illustrating semantic drift in practice. Together, these contributions advance the foundations for more maintainable and trustworthy ML systems.
In traditional software engineering, behavior-preserving system transformation is a well-understood concept. As first introduced by Opdyke (1992), refactoring in object-oriented programming involves systematically restructuring code without altering its external behavior. However, in learning-enabled software systems (LESS)—where machine learning (ML) models and data drive system behavior—refactoring is far more ambiguous. How can we verify that fine-tuning produces a trustworthy transformation that maintains both the system's behavioral integrity and interpretable decision-making? This uncertainty is especially problematic in high-autonomy domains such as safety-critical infrastructure, finance, and healthcare, where ML components must remain reliable and explainable under continuous evolution. Yet, despite this need, there is currently no empirical baseline for what counts as a safe or semantics-preserving change when updating models, training data, or documentation. This gap creates risks not only of performance regression but also to trust, reproducibility, and downstream system reliability.
To mitigate these risks, it is increasingly critical to understand how ML models evolve while retaining their original intent. Unlike traditional software artifacts, ML models are not static—they are rapidly updated through fine-tuning, performance optimization, and documentation updates. Our goal in this paper is to uncover patterns and boundaries of semantic preservation during system transformation in widely used pretrained ML models that are hosted on Hugging Face. The Hugging Face platform offers a uniquely rich and open environment to observe these dynamics at scale, allowing researchers to trace intra-repository evolution through version-controlled Model Cards and commit histories.
This study addresses its research questions by detecting semantic drift across versions of ML/DL models using publicly available metadata and documentation from Hugging Face. For RQ1, millions of pre-trained model repositories were curated and filtered to retain only non-trivial cases. For RQ2, functional behavior metrics (accuracy, F1-score) were extracted from versioned Model Card README documentation, and metric trends were analyzed across commits to evaluate semantic preservation. The methodology comprises metadata collection and filtering, performance metric extraction, and evaluation case studies. The pipeline aims to quantify practical refactoring thresholds in LESS.
The process begins by accessing the HF Hub via the HfApi interface, retrieving model and user metadata for over 1.7 million models. A multi-stage filtering strategy was applied: 1) excluding repositories with missing or empty Model Card files, 2) discarding models lacking a valid architecture field in config.json (reducing the dataset to 751,964 models), and 3) filtering versioned commits for non-functional keywords. This process resulted in a subset of 536 models with 4,297 metrics for detailed analysis.
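The filtering stages can be approximated with the huggingface_hub client. The sketch below is a minimal, illustrative version of stages 1 and 2 (Model Card presence and a valid architecture field in config.json); the repository limit only keeps the example cheap to run, and the paper's exact filter criteria may differ.

```python
# Minimal sketch of the metadata collection and filtering stage, assuming the
# `huggingface_hub` client library; the limit and criteria are illustrative.
import json
from huggingface_hub import HfApi, hf_hub_download
from huggingface_hub.utils import EntryNotFoundError

api = HfApi()
kept = []

# Stage 0: enumerate model repositories (the study reports ~1.7M entries).
for model in api.list_models(full=True, limit=1000):
    files = {s.rfilename for s in (model.siblings or [])}
    # Stage 1: require a Model Card (README.md) in the repository.
    if "README.md" not in files:
        continue
    # Stage 2: require a config.json with a valid architecture field.
    if "config.json" not in files:
        continue
    try:
        cfg_path = hf_hub_download(repo_id=model.id, filename="config.json")
    except EntryNotFoundError:
        continue
    with open(cfg_path) as fh:
        cfg = json.load(fh)
    if not cfg.get("architectures"):
        continue
    kept.append(model.id)

print(f"{len(kept)} candidate repositories after filtering")
```

Stage 3 (keyword-based filtering of versioned commits) builds on the commit-history access shown in the later keyword-analysis sketch.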
Performance metrics such as accuracy, precision, F1-score, and loss were extracted. A rule-based extraction pipeline was used initially, but given the diversity of documentation formats, an extraction step based on GPT-5 API calls was adopted; it produces structured summaries of performance metrics and overcomes the limitations of static regular expressions. Semantic preservation is defined as consistent predictive performance across model versions.
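For illustration, a minimal version of the initial rule-based step might look like the following; the regex pattern and the example call are assumptions, and, as noted above, the study ultimately replaced this brittle approach with GPT-5-based extraction to handle heterogeneous Model Card formats.

```python
# Hedged sketch of rule-based metric extraction from a versioned Model Card.
# The regex and the placeholder repo ID are illustrative assumptions.
import re
from huggingface_hub import hf_hub_download

METRIC_PATTERN = re.compile(
    r"(accuracy|precision|recall|f1(?:-score)?|loss)\s*[:=]?\s*([0-9]*\.?[0-9]+)",
    re.IGNORECASE,
)

def extract_metrics(repo_id: str, revision: str = "main") -> dict:
    """Return {metric_name: value} parsed from the README at `revision`."""
    readme_path = hf_hub_download(repo_id=repo_id, filename="README.md",
                                  revision=revision)
    with open(readme_path, encoding="utf-8") as fh:
        text = fh.read()
    metrics = {}
    for name, value in METRIC_PATTERN.findall(text):
        metrics[name.lower()] = float(value)
    return metrics

# Usage (hypothetical repository and commit):
# print(extract_metrics("some-org/some-model", revision="main"))
```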
The empirical study identifies several patterns of refactoring and maintenance in Hugging Face model repositories. Automated metric extraction successfully identifies both positive drift (optimization) and semantic preservation patterns. Commit histories and public documentation serve as valuable, though underexplored, artifacts for refactoring analysis in ML: the generic keyword 'update' accounts for 95.6% of commits, overshadowing explicit mentions of 'refactor' or 'optimize' (each 0.1%), which suggests that refactoring behaviors are often under-documented or embedded within general-purpose commit labels.
Documentation quality significantly impacts the observability of model evolution, underscoring the need for community-wide standards. The framework enabled detection of different semantic drift patterns: Optimized Drift (metrics improve steadily), Semantic Preservation (no significant change despite updates), and Performance Degradation (performance declines despite refactoring efforts). Overall, 16.6% of the analyzed models exhibited semantic drift, though statistical significance was limited due to data variability. The pipeline provides concrete evidence of semantic preservation in LESS by showing where key performance metrics remain stable across model updates.
Analysis of commit messages across sampled repositories revealed prevalent refactoring patterns. The vast majority of commits (95.6%) utilize the generic keyword 'update'. Other common, yet less frequent, keywords include 'test' (2.0%), 'style' (1.3%), and 'improve' (0.8%). Explicit mentions of non-functional maintenance activities such as 'refactor', 'optimize', or 'security' are notably rare, each accounting for only 0.1% of commits. This imbalance indicates that refactoring behaviors are often under-documented or embedded within general-purpose labels, making it challenging to precisely trace the intent and impact of changes. This highlights a critical need for more sophisticated information distillation tools to confirm whether changes preserve or improve semantic functionality.
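A rough sketch of this keyword analysis is shown below, assuming huggingface_hub's HfApi.list_repo_commits and simple matching on commit titles; the keyword list mirrors the categories reported here, and the repositories are placeholders.

```python
# Illustrative commit-message keyword analysis over sampled repositories.
from collections import Counter
from huggingface_hub import HfApi

KEYWORDS = ["update", "test", "style", "improve", "refactor", "optimize", "security"]

def keyword_frequencies(repo_ids: list[str]) -> Counter:
    """Count keyword occurrences across the commit titles of the given repos."""
    api = HfApi()
    counts = Counter()
    for repo_id in repo_ids:
        for commit in api.list_repo_commits(repo_id):
            title = (commit.title or "").lower()
            counts.update(k for k in KEYWORDS if k in title)
    return counts

# counts = keyword_frequencies(["some-org/some-model"])  # hypothetical repo
# total = sum(counts.values()) or 1
# for keyword, count in counts.most_common():
#     print(f"{keyword}: {100 * count / total:.1f}%")
```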
By tracking performance metrics over time, three recurring patterns of semantic drift were observed across models (a classification sketch follows the list):
- Optimized Drift: Metrics steadily improved across commits, often reflecting intentional tuning. For example, some models showed a 40.09% accuracy delta in image detection due to fine-tuning.
- Semantic Preservation: No significant change in performance despite updates. This indicates successful maintenance of the model's original intent. For instance, several models maintained high accuracy (e.g., 98% accuracy for image detection and 79.6% for text classification) over multiple commits.
- Performance Degradation: Performance did not always improve, sometimes fluctuating or declining. For example, a tabular data classification task showed accuracy fluctuating from 85.5% down to 84.8% over multiple commits.
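The sketch below shows one way such metric trajectories could be bucketed into the three patterns; the tolerance value and the example trajectories are illustrative assumptions, not thresholds or data points established by the study.

```python
# Minimal trajectory classification sketch; the 1-point tolerance is hypothetical.
def classify_drift(values: list[float], tolerance: float = 1.0) -> str:
    """Label a chronological metric series (e.g., accuracy per commit)."""
    if len(values) < 2:
        return "insufficient data"
    delta = values[-1] - values[0]          # net change, first vs. latest commit
    if delta > tolerance:
        return "optimized drift"            # metrics improved across commits
    if delta < -tolerance:
        return "performance degradation"    # metrics declined despite updates
    return "semantic preservation"          # within tolerance: intent maintained

# Illustrative trajectories loosely based on the case studies (accuracy in %):
print(classify_drift([18.0, 35.2, 58.09]))  # optimized drift (FAID-style)
print(classify_drift([98.0, 98.0, 98.0]))   # semantic preservation
print(classify_drift([86.4, 85.5, 84.8]))   # performance degradation
```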
The methodology's reliance on documentation quality and commit accessibility on Hugging Face presents limitations. Inconsistent use of structured evaluation metadata, varying README.md file formats, and diverse metric formats introduce noise despite extraction efforts. While aggregate metrics offer a starting point, they can mask bias or insufficiency at a system level.
Future work will address these limitations by expanding beyond aggregate metrics to more granular analyses, for example examining model predictions on specific data slices to detect subtle shifts in subgroup performance and decision boundaries. The framework will also be extended to fairness, efficiency, and other multi-objective dimensions, with the aim of improving robustness across diverse documentation styles.
Enterprise Process Flow for Semantic Preservation Evaluation
| Traditional Software Refactoring | ML Software Refactoring |
|---|---|
| Behavior-preserving, well-understood concept | Ambiguous, non-deterministic semantics |
| Systematic code restructuring | Data and model-driven behavior changes |
| Aims not to alter external behavior | Verification of trustworthy transformation is complex |
| Established empirical baselines for changes | Lack of empirical baseline for safe updates |
| Focus on code structure and function | Focus on behavioral integrity and interpretable decisions |
Real-world Semantic Drift Examples
Fairface Age Image Detection (FAID): A Vision Transformer (ViT) variant fine-tuned for fair facial recognition. Over 10 days and 13 commits, accuracy improved from 18% to 58.09%, a total performance delta of 40.09%. This gradual refinement indicates optimized semantic drift, rather than major architecture changes.
NYC SQF ARR MLP (Tabular Classification): A model trained on tabular data related to stop-and-frisk incidents. After refactoring feature sets, parameters, and training data, the model became more aggressive in predicting the positive class. This led to an increase in recall (70.7% to 73.4%) but a simultaneous decrease in precision (81.9% to 78%) and overall accuracy (86.4% to 84.8%), highlighting performance degradation driven by a surge in false positives.
Atari Icehockey Superhuman (Reinforcement Learning): A reinforcement learning task focused on enhancing throughput of APPO RL models. The initial mean reward of -3.4 (±3.35) increased significantly to 27.7 (±7.13) after refactoring. This increase is attributed to maintenance and optimization for the specific Atari environment, not major architecture changes, demonstrating positive semantic drift towards optimization.
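For metrics reported with uncertainty, such as the Atari mean reward above, a simple heuristic check for preservation is whether the before/after intervals overlap. The sketch below uses a ±1 standard deviation rule, which is an illustrative assumption rather than a boundary proposed by the paper.

```python
# Hedged overlap check for metrics reported as mean ± std.
def preserved(mean_before: float, std_before: float,
              mean_after: float, std_after: float) -> bool:
    """Treat the change as semantics-preserving if the ±1-std intervals overlap."""
    low_before, high_before = mean_before - std_before, mean_before + std_before
    low_after, high_after = mean_after - std_after, mean_after + std_after
    return low_after <= high_before and low_before <= high_after

# Atari Icehockey mean reward before/after refactoring (values from the case study):
print(preserved(-3.4, 3.35, 27.7, 7.13))  # False: a clear (positive) semantic shift
```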
| Keyword | Frequency |
|---|---|
| update | 95.6% |
| test | 2.0% |
| style | 1.3% |
| improve | 0.8% |
| refactor | 0.1% |
| optimize | 0.1% |
| security | 0.1% |
Your Enterprise AI Implementation Roadmap
A structured approach to integrating semantic preservation into your ML lifecycle.
Initial Assessment & ML Landscape Mapping
Evaluate existing ML models and processes, identify critical systems for semantic preservation analysis, and define performance baselines.
Hugging Face Data & Metric Integration
Set up automated pipelines using the Hugging Face API to extract versioned Model Cards, commit histories, and performance metrics across chosen models.
Semantic Drift Detection & Threshold Establishment
Implement the empirical framework to detect and quantify semantic drift. Analyze metric stability over time and propose community-accepted thresholds for preservation.
Continuous Monitoring & Alerting
Integrate the semantic preservation framework into CI/CD pipelines for real-time monitoring, enabling proactive alerts on performance regressions or undesirable semantic shifts.
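As a sketch of what such a CI/CD gate could look like, the script below compares baseline and candidate metrics exported as JSON by earlier pipeline stages and fails the build when a tracked metric regresses beyond a budget; the file names and budget value are hypothetical.

```python
# Illustrative CI gate for semantic regressions; paths and budget are assumptions.
import json
import sys

REGRESSION_BUDGET = 0.01  # maximum tolerated drop per metric (hypothetical)

def check(baseline_path: str = "baseline_metrics.json",
          candidate_path: str = "candidate_metrics.json") -> int:
    """Return a nonzero exit code if any tracked metric regresses too far."""
    with open(baseline_path) as fh:
        baseline = json.load(fh)
    with open(candidate_path) as fh:
        candidate = json.load(fh)
    failures = []
    for name, old in baseline.items():
        new = candidate.get(name)
        if new is not None and old - new > REGRESSION_BUDGET:
            failures.append(f"{name}: {old:.4f} -> {new:.4f}")
    for line in failures:
        print("semantic regression:", line)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(check())
```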
Refinement, Governance & Explainability
Refine semantic preservation thresholds, establish MLOps governance policies, and enhance documentation practices to improve model traceability and explainability over its lifecycle.
Ready to Future-Proof Your AI Investments?
Don't let semantic drift undermine your intelligent systems. Partner with us to build more maintainable and trustworthy AI.