Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training
Revolutionizing AI Training: SICKLE Achieves 38x Energy Savings with Enhanced Model Accuracy
This research introduces SICKLE, a Sparse Intelligent Curation framework for Learning Efficiently, designed to train better models with significantly less data through intelligent subsampling. Focusing on extreme-scale turbulence datasets from Direct Numerical Simulations (DNS), SICKLE employs a novel maximum entropy (MaxEnt) sampling approach alongside scalable training and energy benchmarking on the Frontier supercomputer. The study demonstrates that intelligent subsampling can dramatically improve model accuracy while substantially reducing energy consumption, with observed reductions of up to 38x, and up to two orders of magnitude (100x) in some scenarios, compared to training on full datasets or using naive sampling methods. This approach is critical for developing energy-efficient scientific foundation models as traditional hardware scaling reaches its limits.
Executive Impact & Core Metrics
Our analysis reveals the transformative potential of intelligent subsampling for scientific AI, showcasing significant gains in efficiency and accuracy.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
SICKLE Framework
SICKLE (Sparse Intelligent Curation framework for Learning Efficiently) is designed to enable machine learning on intelligently extracted data subsets from extreme-scale scientific simulations. It integrates state-of-the-art subsampling approaches, performance benchmarking, and energy efficiency evaluations.
Key features include MaxEnt sampling for optimal data selection, scalable training on HPC platforms like Frontier, and significant reductions in file storage requirements.
MaxEnt Sampling
The core of SICKLE's intelligent subsampling, MaxEnt, is a two-phase process based on maximum entropy principles.
- Phase 1: Hypercube Selection (Hmaxent) reduces dense datasets into sparse hypercubes using clustering and entropy-weighted random sampling, parallelized with MPI for efficiency.
- Phase 2: Point Selection (Xmaxent) involves further clustering and entropy-based selection within each hypercube, drawing samples based on node strengths. This method prioritizes informative regions, leading to more accurate models with less data.
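As a rough illustration of the entropy-weighted selection idea behind both phases, the sketch below partitions a 1-D field into clusters, scores each cluster by the Shannon entropy of its value histogram, and draws samples in proportion to those scores. The function name, the quantile-based "clustering", and all parameters are simplifications for illustration; SICKLE's MPI-parallel implementation operates on hypercubes of 3-D DNS fields, not 1-D arrays.

```python
import numpy as np

def entropy_weighted_sample(data, n_clusters, n_samples, rng=None):
    """Illustrative sketch of entropy-weighted random sampling.

    Clusters are approximated by quantile binning (SICKLE uses proper
    clustering); each cluster is weighted by the Shannon entropy of its
    internal value histogram, so information-rich regions are sampled
    more often. Returns indices into `data`.
    """
    rng = np.random.default_rng(rng)
    # Quantile edges give equally populated "clusters" of the 1-D field.
    edges = np.quantile(data, np.linspace(0.0, 1.0, n_clusters + 1))
    labels = np.clip(np.searchsorted(edges, data, side="right") - 1,
                     0, n_clusters - 1)
    # Shannon entropy of each cluster's internal value distribution.
    weights = np.zeros(n_clusters)
    for k in range(n_clusters):
        hist, _ = np.histogram(data[labels == k], bins=16)
        p = hist[hist > 0] / max(hist.sum(), 1)
        weights[k] = -(p * np.log(p)).sum() if p.size else 0.0
    weights = weights / weights.sum()
    # Draw clusters with entropy-proportional probability, then pick a
    # random member point from each drawn cluster.
    picks = [int(rng.choice(np.flatnonzero(labels == k)))
             for k in rng.choice(n_clusters, size=n_samples, p=weights)]
    return np.array(picks)
```

In this toy form, a near-constant region collapses to near-zero entropy and is rarely sampled, while high-variability regions dominate the draw, which is the qualitative behavior MaxEnt exploits.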
Phase-space Sampling (UIPS)
Uniform-in-Phase-Space (UIPS) sampling estimates probability density functions (PDFs) over the data's phase space and uses them to guide sample selection. While effective for 2D datasets, UIPS can exhibit clumping in 3D anisotropic flows, which limits how uniformly it represents complex data structures. It nonetheless provides valuable generalization improvements by covering low-probability tail regions.
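A minimal sketch of the phase-space idea, substituting a simple histogram density estimate for UIPS's actual learned density model: each point is kept with probability inversely proportional to its local phase-space density, which flattens the sampled PDF and up-weights rare tail regions. All names here are illustrative assumptions.

```python
import numpy as np

def uniform_in_phase_space(samples, n_keep, bins=32, rng=None):
    """Sketch of density-inverse sampling: estimate the phase-space PDF
    with a D-dimensional histogram, then draw points with probability
    inversely proportional to their local bin count. Illustrative only;
    the published UIPS method uses a trained density estimator."""
    rng = np.random.default_rng(rng)
    hist, edges = np.histogramdd(samples, bins=bins)
    # Locate each sample's bin along every dimension to read its density.
    idx = [np.clip(np.searchsorted(e, samples[:, d], side="right") - 1,
                   0, bins - 1) for d, e in enumerate(edges)]
    density = hist[tuple(idx)]
    weights = 1.0 / np.maximum(density, 1.0)   # rare bins get high weight
    weights /= weights.sum()
    return rng.choice(len(samples), size=n_keep, replace=False, p=weights)
```

With a histogram this coarse, the 3D clumping issue noted above shows up as empty or sparsely populated bins dominating the weights, so dense anisotropic structures can end up unevenly represented.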
Temporal Sampling
Beyond spatial sparsification, SICKLE also incorporates intelligent temporal sampling. This strategy identifies and discards solution snapshots that do not provide novel training data, particularly for periodic or redundant solution trajectories. This prevents overfitting and ensures that the model trains on truly informative time instances, expanding the input PDF representation.
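One way to sketch this redundancy test (the actual SICKLE criterion may differ) is to keep a snapshot only when its value histogram meaningfully extends the PDF accumulated from the snapshots kept so far; periodic or repeated trajectories then contribute a single representative. The L1 histogram distance and threshold below are hypothetical choices.

```python
import numpy as np

def select_novel_snapshots(snapshots, bins=16, threshold=0.05):
    """Sketch of redundancy-aware temporal sampling: keep a snapshot
    only if its value histogram differs enough (total-variation
    distance) from the PDF accumulated over kept snapshots.
    Hypothetical criterion, not SICKLE's actual implementation."""
    lo = min(s.min() for s in snapshots)
    hi = max(s.max() for s in snapshots)
    edges = np.linspace(lo, hi, bins + 1)
    accumulated = np.zeros(bins)
    kept = []
    for t, snap in enumerate(snapshots):
        h, _ = np.histogram(snap, bins=edges)
        h = h / h.sum()
        acc = accumulated / accumulated.sum() if accumulated.sum() else None
        # Keep the first snapshot, or any whose PDF is sufficiently new.
        if acc is None or 0.5 * np.abs(h - acc).sum() > threshold:
            kept.append(t)
            accumulated += h
    return kept
```

For a periodic signal, phase-shifted snapshots share one value distribution and are dropped after the first, while a snapshot with genuinely new dynamics (e.g. doubled amplitude) passes the test and expands the training PDF.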
Model Training & Architectures
SICKLE leverages PyTorch for scalable training, supporting various neural network architectures:
- LSTM: For predicting single scalar values over time.
- MLP-Transformer: Takes unstructured down-sampled data to predict full flowfields.
- CNN-Transformer: Utilizes structured hypercubes to predict full flowfields.
The framework also includes mixed-precision training and scalable hyperparameter optimization via DeepHyper to optimize architectures and configurations.
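To make the mixed-precision training side concrete, the toy loop below fits a small PyTorch MLP surrogate on a subsampled batch under CPU bfloat16 autocast. The model size, optimizer, and hyperparameters are illustrative placeholders, not SICKLE's Frontier configuration, which additionally involves distributed training and DeepHyper-tuned architectures.

```python
import torch
from torch import nn

def train_on_subsample(x, y, epochs=200, lr=1e-2):
    """Minimal sketch: train a tiny MLP surrogate on a sampled subset
    using autocast mixed precision (bfloat16 forward pass on CPU).
    Illustrative placeholder, not SICKLE's actual training setup."""
    model = nn.Sequential(nn.Linear(x.shape[1], 32), nn.ReLU(),
                          nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        # Forward pass runs in bfloat16; gradients stay in float32.
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model, loss.item()
```

On GPUs the same pattern uses `device_type="cuda"` (typically with a gradient scaler for float16); the point of the sketch is only that reduced-precision arithmetic compounds the energy savings obtained from subsampling.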
Our intelligent MaxEnt subsampling approach on large-scale DNS datasets yielded up to 38x energy reduction while maintaining or improving model accuracy, a critical gain as traditional hardware scaling reaches its limits. In some cases, reductions reached two orders of magnitude (100x).
SICKLE Spatiotemporal Model Training Workflow
| Sampling Method | Key Advantages | Performance Context |
|---|---|---|
| MaxEnt Sampling (SICKLE) | Prioritizes information-rich regions; trains more accurate models from less data; up to 38x energy reduction | Optimal for large, anisotropic datasets with significant redundancy. Requires initial computational cost for clustering. |
| Random Sampling | Simple to implement; negligible selection overhead | Can miss rare, information-rich regions in high-variability datasets. Less reproducible than MaxEnt due to higher variance. |
| Phase-space Sampling (UIPS) | Covers low-probability tail regions; improves generalization | Tends to concentrate samples unevenly in 3D anisotropic flows, where uniformity breaks down. Less effective in energy savings compared to MaxEnt. |
Intelligent Sampling for Extreme-Scale Turbulence
Turbulence is a highly complex, multiscale, chaotic, and nonlinear physical phenomenon crucial to many scientific applications. High-fidelity Direct Numerical Simulations (DNS) generate petabytes of data, posing immense storage and processing challenges. Our SICKLE framework addresses this by intelligently subsampling these vast datasets, making the training of machine-learned surrogates and scientific foundation models significantly more efficient and sustainable. For instance, on the SST-P1F100 dataset, MaxEnt sampling achieved a 171x speedup in parallel processing, demonstrating its efficacy for extreme-scale scientific computing.
Quantify Your AI Efficiency Gains
Use our calculator to estimate potential annual savings and reclaimed operational hours by implementing intelligent data curation in your enterprise.
Strategic Implementation Roadmap
Our phased approach ensures a smooth transition and maximizes the benefits of intelligent data sampling within your existing infrastructure.
Adaptive Temporal Sampling
Develop and integrate adaptive temporal sampling strategies that respond to transient phenomena and evolving model uncertainty, ensuring optimal data selection over time.
In-situ & Online Training Integration
Integrate SICKLE with in-situ, streaming, and online training frameworks such as SmartSim to enable real-time learning from simulations.
Federated Learning Support
Extend SICKLE to support federated learning across distributed HPC facilities, using frameworks like APPFL to leverage decentralized data while maintaining privacy and scalability.
Enhanced Visualization & Analysis
Develop and integrate enhanced visualization and analysis tools compatible with VTK and ParaView, providing better insights into sampled data and model performance.
Cross-Domain Applications
Expand SICKLE's application to other critical scientific domains, including climate modeling and fusion energy research, demonstrating its generalizability.
Foundation Model Integration
Integrate SICKLE into broader spatio-temporal foundation model frameworks, such as MATEY, to support training across diverse datasets of varying fidelity and scale for comprehensive scientific AI.
Ready to Transform Your Data Strategy?
Discover how intelligent sampling can revolutionize your AI model training, reduce costs, and accelerate scientific discovery.