Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training
Revolutionizing AI Training: SICKLE Achieves 38x Energy Savings with Enhanced Model Accuracy
This research introduces SICKLE, a Sparse Intelligent Curation framework for Learning Efficiently, designed to train better models with significantly less data through intelligent subsampling. Focusing on extreme-scale turbulence datasets from Direct Numerical Simulations (DNS), SICKLE employs a novel maximum entropy (MaxEnt) sampling approach alongside scalable training and energy benchmarking on the Frontier supercomputer. The study demonstrates that intelligent subsampling can dramatically improve model accuracy while substantially reducing energy consumption, with observed reductions of up to 38x, and up to two orders of magnitude (100x) in some scenarios, compared to training on full datasets or using naive sampling methods. This approach is critical for developing energy-efficient scientific foundation models as traditional hardware scaling reaches its limits.
Executive Impact & Core Metrics
Our analysis reveals the transformative potential of intelligent subsampling for scientific AI, showcasing significant gains in efficiency and accuracy.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
SICKLE Framework
SICKLE (Sparse Intelligent Curation framework for Learning Efficiently) is designed to enable machine learning on intelligently extracted data subsets from extreme-scale scientific simulations. It integrates state-of-the-art subsampling approaches, performance benchmarking, and energy efficiency evaluations.
Key features include MaxEnt sampling for optimal data selection, scalable training on HPC platforms like Frontier, and significant reductions in file storage requirements.
MaxEnt Sampling
The core of SICKLE's intelligent subsampling, MaxEnt, is a two-phase process based on maximum entropy principles.
- Phase 1: Hypercube Selection (Hmaxent) reduces dense datasets into sparse hypercubes using clustering and entropy-weighted random sampling, parallelized with MPI for efficiency.
- Phase 2: Point Selection (Xmaxent) involves further clustering and entropy-based selection within each hypercube, drawing samples based on node strengths. This method prioritizes informative regions, leading to more accurate models with less data.
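As a rough illustration of the entropy-weighted selection idea behind both phases, the sketch below partitions a 1-D field into clusters, scores each cluster by the Shannon entropy of its value histogram, and draws samples in proportion to those scores. The function name, the quantile-based "clustering", and all parameters are simplifications for illustration; SICKLE's MPI-parallel implementation operates on hypercubes of 3-D DNS fields, not 1-D arrays.

```python
import numpy as np

def entropy_weighted_sample(data, n_clusters, n_samples, rng=None):
    """Illustrative sketch of entropy-weighted random sampling.

    Clusters are approximated by quantile binning (SICKLE uses proper
    clustering); each cluster is weighted by the Shannon entropy of its
    internal value histogram, so information-rich regions are sampled
    more often. Returns indices into `data`.
    """
    rng = np.random.default_rng(rng)
    # Quantile edges give equally populated "clusters" of the 1-D field.
    edges = np.quantile(data, np.linspace(0.0, 1.0, n_clusters + 1))
    labels = np.clip(np.searchsorted(edges, data, side="right") - 1,
                     0, n_clusters - 1)
    # Shannon entropy of each cluster's internal value distribution.
    weights = np.zeros(n_clusters)
    for k in range(n_clusters):
        hist, _ = np.histogram(data[labels == k], bins=16)
        p = hist[hist > 0] / max(hist.sum(), 1)
        weights[k] = -(p * np.log(p)).sum() if p.size else 0.0
    weights = weights / weights.sum()
    # Draw clusters with entropy-proportional probability, then pick a
    # random member point from each drawn cluster.
    picks = [int(rng.choice(np.flatnonzero(labels == k)))
             for k in rng.choice(n_clusters, size=n_samples, p=weights)]
    return np.array(picks)
```

In this toy form, a near-constant region collapses to near-zero entropy and is rarely sampled, while high-variability regions dominate the draw, which is the qualitative behavior MaxEnt exploits.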
Phase-space Sampling (UIPS)
Uniform-in-Phase-Space (UIPS) sampling estimates probability density functions (PDFs) over the data's phase space and uses them to guide sample selection. While effective for 2D datasets, UIPS can exhibit clumping in 3D anisotropic flows, which limits how uniformly it represents complex data structures. It nonetheless provides valuable generalization improvements by covering low-probability tail regions.
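A minimal sketch of the phase-space idea, substituting a simple histogram density estimate for UIPS's actual learned density model: each point is kept with probability inversely proportional to its local phase-space density, which flattens the sampled PDF and up-weights rare tail regions. All names here are illustrative assumptions.

```python
import numpy as np

def uniform_in_phase_space(samples, n_keep, bins=32, rng=None):
    """Sketch of density-inverse sampling: estimate the phase-space PDF
    with a D-dimensional histogram, then draw points with probability
    inversely proportional to their local bin count. Illustrative only;
    the published UIPS method uses a trained density estimator."""
    rng = np.random.default_rng(rng)
    hist, edges = np.histogramdd(samples, bins=bins)
    # Locate each sample's bin along every dimension to read its density.
    idx = [np.clip(np.searchsorted(e, samples[:, d], side="right") - 1,
                   0, bins - 1) for d, e in enumerate(edges)]
    density = hist[tuple(idx)]
    weights = 1.0 / np.maximum(density, 1.0)   # rare bins get high weight
    weights /= weights.sum()
    return rng.choice(len(samples), size=n_keep, replace=False, p=weights)
```

With a histogram this coarse, the 3D clumping issue noted above shows up as empty or sparsely populated bins dominating the weights, so dense anisotropic structures can end up unevenly represented.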
Temporal Sampling
Beyond spatial sparsification, SICKLE also incorporates intelligent temporal sampling. This strategy identifies and discards solution snapshots that do not provide novel training data, particularly for periodic or redundant solution trajectories. This prevents overfitting and ensures that the model trains on truly informative time instances, expanding the input PDF representation.
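One way to sketch this redundancy test (the actual SICKLE criterion may differ) is to keep a snapshot only when its value histogram meaningfully extends the PDF accumulated from the snapshots kept so far; periodic or repeated trajectories then contribute a single representative. The L1 histogram distance and threshold below are hypothetical choices.

```python
import numpy as np

def select_novel_snapshots(snapshots, bins=16, threshold=0.05):
    """Sketch of redundancy-aware temporal sampling: keep a snapshot
    only if its value histogram differs enough (total-variation
    distance) from the PDF accumulated over kept snapshots.
    Hypothetical criterion, not SICKLE's actual implementation."""
    lo = min(s.min() for s in snapshots)
    hi = max(s.max() for s in snapshots)
    edges = np.linspace(lo, hi, bins + 1)
    accumulated = np.zeros(bins)
    kept = []
    for t, snap in enumerate(snapshots):
        h, _ = np.histogram(snap, bins=edges)
        h = h / h.sum()
        acc = accumulated / accumulated.sum() if accumulated.sum() else None
        # Keep the first snapshot, or any whose PDF is sufficiently new.
        if acc is None or 0.5 * np.abs(h - acc).sum() > threshold:
            kept.append(t)
            accumulated += h
    return kept
```

For a periodic signal, phase-shifted snapshots share one value distribution and are dropped after the first, while a snapshot with genuinely new dynamics (e.g. doubled amplitude) passes the test and expands the training PDF.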
Model Training & Architectures
SICKLE leverages PyTorch for scalable training, supporting various neural network architectures:
- LSTM: For predicting single scalar values over time.
- MLP-Transformer: Takes unstructured down-sampled data to predict full flowfields.
- CNN-Transformer: Utilizes structured hypercubes to predict full flowfields.
The framework also includes mixed-precision training and scalable hyperparameter optimization via DeepHyper to optimize architectures and configurations.
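To make the mixed-precision training side concrete, the toy loop below fits a small PyTorch MLP surrogate on a subsampled batch under CPU bfloat16 autocast. The model size, optimizer, and hyperparameters are illustrative placeholders, not SICKLE's Frontier configuration, which additionally involves distributed training and DeepHyper-tuned architectures.

```python
import torch
from torch import nn

def train_on_subsample(x, y, epochs=200, lr=1e-2):
    """Minimal sketch: train a tiny MLP surrogate on a sampled subset
    using autocast mixed precision (bfloat16 forward pass on CPU).
    Illustrative placeholder, not SICKLE's actual training setup."""
    model = nn.Sequential(nn.Linear(x.shape[1], 32), nn.ReLU(),
                          nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        # Forward pass runs in bfloat16; gradients stay in float32.
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model, loss.item()
```

On GPUs the same pattern uses `device_type="cuda"` (typically with a gradient scaler for float16); the point of the sketch is only that reduced-precision arithmetic compounds the energy savings obtained from subsampling.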
Our intelligent MaxEnt subsampling approach on large-scale DNS datasets yielded up to 38x energy reduction while maintaining or improving model accuracy, a critical gain as traditional hardware scaling reaches its limits. In some cases, reductions reached two orders of magnitude (100x).
SICKLE Spatiotemporal Model Training Workflow
| Sampling Method | Key Advantages | Performance Context |
|---|---|---|
| MaxEnt Sampling (SICKLE) | Prioritizes information-rich regions; trains more accurate models from less data; up to 38x energy reduction | Optimal for large, anisotropic datasets with significant redundancy. Requires initial computational cost for clustering. |
| Random Sampling | Simple to implement; negligible selection overhead | Can miss rare, information-rich regions in high-variability datasets. Less reproducible than MaxEnt due to higher variance. |
| Phase-space Sampling (UIPS) | Covers low-probability tail regions; improves generalization | Tends to concentrate samples unevenly in 3D anisotropic flows, where uniformity breaks down. Less effective in energy savings compared to MaxEnt. |
Intelligent Sampling for Extreme-Scale Turbulence
Turbulence is a highly complex, multiscale, chaotic, and nonlinear physical phenomenon crucial to many scientific applications. High-fidelity Direct Numerical Simulations (DNS) generate petabytes of data, posing immense storage and processing challenges. Our SICKLE framework addresses this by intelligently subsampling these vast datasets, making the training of machine-learned surrogates and scientific foundation models significantly more efficient and sustainable. For instance, on the SST-P1F100 dataset, MaxEnt sampling achieved a 171x speedup in parallel processing, demonstrating its efficacy for extreme-scale scientific computing.
Quantify Your AI Efficiency Gains
Use our calculator to estimate potential annual savings and reclaimed operational hours by implementing intelligent data curation in your enterprise.
Strategic Implementation Roadmap
Our phased approach ensures a smooth transition and maximizes the benefits of intelligent data sampling within your existing infrastructure.
Adaptive Temporal Sampling
Develop and integrate adaptive temporal sampling strategies that respond to transient phenomena and evolving model uncertainty, ensuring optimal data selection over time.
In-situ & Online Training Integration
Integrate SICKLE with in-situ, streaming, and online training frameworks such as SmartSim to enable real-time learning from simulations.
Federated Learning Support
Extend SICKLE to support federated learning across distributed HPC facilities, using frameworks like APPFL to leverage decentralized data while maintaining privacy and scalability.
Enhanced Visualization & Analysis
Develop and integrate enhanced visualization and analysis tools compatible with VTK and ParaView, providing better insights into sampled data and model performance.
Cross-Domain Applications
Expand SICKLE's application to other critical scientific domains, including climate modeling and fusion energy research, demonstrating its generalizability.
Foundation Model Integration
Integrate SICKLE into broader spatio-temporal foundation model frameworks, such as MATEY, to support training across diverse datasets of varying fidelity and scale for comprehensive scientific AI.
Ready to Transform Your Data Strategy?
Discover how intelligent sampling can revolutionize your AI model training, reduce costs, and accelerate scientific discovery.