Enterprise AI Analysis: Reinforcement learning-based multi-objective task scheduling for energy-efficient and cost-effective cloud-edge computing

Enterprise AI Analysis

Reinforcement learning-based multi-objective task scheduling for energy-efficient and cost-effective cloud-edge computing

This paper presents RL-MOTS, a novel Reinforcement Learning-Based Multi-Objective Task Scheduling framework for hybrid cloud-edge environments. Leveraging Deep Q-Networks (DQNs), RL-MOTS intelligently and adaptively allocates resources, optimizing for task latency, energy consumption, and operational cost. It incorporates a priority-aware dynamic queueing mechanism and a state-reward tensor to capture complex trade-offs in real time. Simulations using CloudSim demonstrate RL-MOTS's robustness, achieving up to a 28% reduction in energy consumption and a 20% improvement in cost efficiency, while significantly reducing makespan and deadline violations compared to baseline strategies such as FCFS, Min-Min, and multi-objective heuristic models, all while maintaining strict QoS.

Executive Impact Summary

The rapid proliferation of IoT devices and latency-sensitive applications has amplified the need for efficient task scheduling in hybrid cloud-edge environments. Traditional heuristic and metaheuristic algorithms often fall short in addressing the dynamic nature of workloads and the conflicting objectives of performance, energy efficiency, and cost-effectiveness. RL-MOTS (Reinforcement Learning-Based Multi-Objective Task Scheduling) leverages Deep Q-Networks (DQNs) for intelligent and adaptive resource allocation. It formulates scheduling as a Markov Decision Process and incorporates a priority-aware dynamic queueing mechanism together with a multi-objective reward function that balances task latency, energy consumption, and operational costs. The framework also employs a state-reward tensor to capture trade-offs among objectives, enabling real-time decision-making across heterogeneous cloud and edge nodes.

28% Reduction in Energy Consumption
20% Improvement in Cost Efficiency
Reduced Makespan and Deadline Violations

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

This section examines task scheduling in cloud-edge computing as a multi-objective optimization problem, aiming to minimize waiting time, energy consumption, and cost through a DQN-based reinforcement learning approach. It defines the state and action spaces, a multi-objective reward function, and key metrics such as completion time and QoS constraints. The formulas below support continuous optimization in this dynamic setting.

The state space encodes the system's condition so the DQN-based scheduler can make informed decisions:

St = {Q(Tk), Ur(Ei, Vj), Ec(t), Cc(t)} (10)

This defines the state at time t, comprising the task queue Q(Tk), resource utilization Ur, current energy consumption Ec(t), and current cost Cc(t). The dimensionality of the state space depends on the number of tasks and resources in the system. At each decision point, the state vector contains: (a) task queue data Q(Tk), (b) resource utilization levels Ur(R, t), (c) current energy consumption Ec(t), and (d) current cost Cc(t). With |Q| pending tasks and |R| resources, the state dimension grows approximately as d = |Q| + 3|R|, since each resource contributes utilization, energy, and cost parameters.

The action space defines the set of possible scheduling decisions for task placement in the cloud-edge context:

At = { ak,i,j | ak,i,j = assign Tk to Ei or Vj } (11)

where ak,i,j is the action at time t, denoting the allocation of task Tk to either edge node Ei or cloud VM Vj; Tk is the k-th task in the task queue; Ei is the i-th edge node; and Vj is the j-th virtual machine in the cloud infrastructure.

To achieve multi-objective optimization, the reward function combines performance, energy efficiency, and cost efficiency with dynamic weights:

Rt = w1·Pt + w2·Et + w3·Ct, with w1 + w2 + w3 = 1 (12)

where Pt is the performance metric (deadline compliance and waiting-time minimization), Et the energy-efficiency metric, Ct the cost-efficiency metric, and w1, w2, w3 the adaptive dynamic weights [42].

The learning process of the DQN is driven by the Q-value update rule, which allows the system to converge to an optimal scheduling policy:

Q(St, at) ← Q(St, at) + α·[rt + γ·max_a Q(St+1, a) − Q(St, at)] (13)

This updates the Q-value for state St and action at, using learning rate α, discount factor γ, and reward rt.

Waiting time is a key performance metric indicating how long a task remains unassigned before scheduling:

WT(Tk) = t_current − t_arrival(Tk) (14)

The waiting time of task Tk is the difference between the current time and the task's arrival time [43].

Completion time measures the total duration required for a task to finish, covering both waiting and execution:

CT(Tk) = WT(Tk) + ET(Tk, R) (15)

The completion time of task Tk is the sum of its waiting time WT(Tk) and its execution time ET(Tk, R) on resource R.

Energy consumption is estimated to support energy-efficient scheduling, based on resource power draw and execution duration:

E(Tk, R) = Pr(R) × ET(Tk, R) (16)

where the energy consumed by task Tk on resource R is the product of the resource's power consumption Pr(R) and the execution time ET(Tk, R).

Operational cost is evaluated to improve financial efficiency, accounting for both time-based and data-transfer expenses:

C(Tk, R) = Cr(R) × ET(Tk, R) + Ct(R) (17)

where Cr(R) × ET(Tk, R) is the time-based cost of running task Tk on resource R and Ct(R) is the fixed data-transfer cost.

To balance cost, energy, and performance priorities in the reward function, the dynamic weights are updated according to resource utilization:

wi(t+1) = wi(t) + β × Ur(Ri, t) / Σ_R Ur(R, t) (18)

where wi is the weight for objective i, β the update rate, and Ur(Ri, t) the utilization of resource Ri at time t.
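To make the reward construction above concrete, the following is a minimal Python sketch of how the per-task metrics (eqs. 14-17) and the dynamically weighted reward (eqs. 12 and 18) could be computed. The function names, the weight-to-resource mapping, and the β value are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the multi-objective reward (eqs. 12, 14-18).
# Names and structure are assumptions for exposition, not the paper's code.

def waiting_time(t_current, t_arrival):            # eq. (14): WT = t_current - t_arrival
    return t_current - t_arrival

def completion_time(wt, execution_time):           # eq. (15): CT = WT + ET
    return wt + execution_time

def energy(power_r, execution_time):               # eq. (16): E = Pr(R) * ET
    return power_r * execution_time

def cost(rate_r, execution_time, transfer_cost):   # eq. (17): C = Cr(R) * ET + Ct(R)
    return rate_r * execution_time + transfer_cost

def update_weights(weights, utilizations, beta=0.05):
    """Eq. (18): shift each weight in proportion to normalized resource
    utilization, then renormalize so w1 + w2 + w3 = 1 as eq. (12) requires."""
    total = sum(utilizations) or 1.0
    new_w = [w + beta * u / total for w, u in zip(weights, utilizations)]
    s = sum(new_w)
    return [w / s for w in new_w]

def reward(perf_metric, energy_metric, cost_metric, weights):   # eq. (12)
    w1, w2, w3 = weights
    return w1 * perf_metric + w2 * energy_metric + w3 * cost_metric
```

In practice the three metrics passed to reward() would be normalized to comparable scales before weighting; the paper does not spell out that normalization, so it is left implicit here.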
Resource capacity constraints guarantee that the cumulative execution time of the tasks assigned to a resource stays within its capacity:

Σ_{Tk ∈ Ri} ET(Tk, Ri) ≤ Cp(Ri) (19)

i.e., the total execution time of the tasks assigned to resource Ri cannot exceed its processing capacity Cp(Ri). The computational complexity of training the DQN is driven mainly by the size of the state space, the neural-network architecture, and the frequency of training updates. Each training step requires a forward and backward pass through the network, costing

O(d·h + h²) (20)

operations, where d is the input dimension and h the number of hidden units per layer. Given a batch size B and a total of E training episodes, the overall training complexity is O(E·B·(d·h + h²)).
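As a quick worked example of the complexity bound above, the dominant operation count can be estimated as follows. The concrete values of d, h, and E are illustrative assumptions rather than the paper's settings (only the batch size of 64 comes from the reported experimental setup).

```python
# Back-of-the-envelope estimate of DQN training cost, O(E * B * (d*h + h^2)).
# Values below are assumptions for illustration, except the batch size of 64.

Q_PENDING = 50                    # assumed number of pending tasks |Q|
R_RESOURCES = 20                  # assumed number of resources |R|
d = Q_PENDING + 3 * R_RESOURCES   # state dimension: d = |Q| + 3|R| = 110
h = 128                           # hidden units per layer (assumed)
B = 64                            # mini-batch size (as reported in the setup)
E = 20_000                        # training steps/episodes (assumed)

ops_per_sample = d * h + h * h    # eq. (20): scale of one forward + backward pass
total_ops = E * B * ops_per_sample
print(f"d = {d}, ops per sample ≈ {ops_per_sample:,}, total ≈ {total_ops:,}")
```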

This section presents RL-MOTS, a scheduling system designed for heterogeneous cloud-edge environments; Figure 1 illustrates the overall architecture. The growing complexity and dynamic nature of these environments call for a decision-making system capable of optimizing multiple conflicting objectives concurrently. Consequently, RL-MOTS is designed to simultaneously improve energy efficiency, cost-effectiveness, and priority handling, which are essential and frequently conflicting concerns in large-scale distributed systems. To achieve these objectives, the system employs a priority-sensitive queue management strategy built on adaptive queue structures and virtual waiting-time matrices. These matrices evolve dynamically with system status and task characteristics, allowing the scheduling agent to make context-sensitive decisions about task execution.

This mechanism ensures that tasks are executed according to their computational needs, urgency levels, and resource availability, thereby improving responsiveness and fairness. Task scheduling in cloud-edge infrastructures involves the real-time assignment of incoming tasks to a variety of computing units, comprising resource-limited edge nodes and high-capacity cloud virtual machines. This process must account for various dynamic factors, including fluctuating workloads, heterogeneous resource capabilities, and changing network conditions. A significant challenge arises from the decentralized architecture of cloud-edge ecosystems, which complicates global coordination and consistent performance delivery. Furthermore, the scheduling method must guarantee compliance with rigorous QoS requirements by minimizing execution latency and preventing deadline violations. These requirements underscore the need for a learning-oriented adaptive policy that evolves over time, responds effectively to environmental changes, and supports scalable decision-making.

Figure 2 illustrates the scheduling architecture based on the Deep Q-Network (DQN). The state extractor first processes the task queue and monitoring data, producing the proposed state-reward tensor with layers for waiting time, energy, and cost. This tensor is fed into the DQN, and the reward feedback from the performance evaluator is aligned with the objectives encoded in the tensor. The DQN outputs Q-values that drive the action selector and guide task allocation to edge or cloud resources.
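The state-reward tensor described above stacks a waiting-time layer, an energy layer, and a cost layer for each task-resource pair. A minimal sketch of how such a tensor might be assembled is shown below; the tensor shape, dictionary fields (arrival, exec_time, power, rate, transfer_cost), and helper structure are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def build_state_reward_tensor(task_queue, resources, now):
    """Assemble a (3, |Q|, |R|) tensor with waiting-time, energy, and cost layers.
    Illustrative sketch only; field names are assumed, not taken from the paper."""
    n_tasks, n_res = len(task_queue), len(resources)
    tensor = np.zeros((3, n_tasks, n_res), dtype=np.float32)
    for k, task in enumerate(task_queue):
        wait = now - task["arrival"]                        # WT(Tk), eq. (14)
        for j, res in enumerate(resources):
            et = task["exec_time"][j]                       # ET(Tk, R) on resource j
            tensor[0, k, j] = wait + et                     # waiting/completion-time layer
            tensor[1, k, j] = res["power"] * et             # energy layer, eq. (16)
            tensor[2, k, j] = res["rate"] * et + res["transfer_cost"]  # cost layer, eq. (17)
    return tensor
```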

The scheduling process is governed by a DQN algorithm designed to learn an optimal policy through ongoing interaction with the cloud-edge environment. The method uses a value-iteration approach in which the Q-function Q(St, At) is iteratively updated to estimate the expected discounted reward. Training starts with randomly initialized network weights and a replay memory that stores experience tuples (St, At, Rt, St+1). At each iteration, the agent observes the state St, selects an action At using an ε-greedy strategy, executes the action, and computes the reward Rt from the multi-objective function [41]. The Q-value is updated using the Bellman equation:

Q(St, At) ← Q(St, At) + α·[Rt + γ·max_A Q(St+1, A) − Q(St, At)] (9)

where α denotes the learning rate and γ the discount factor. The procedure handles preemption through the virtual FIFO queue and converges to a policy that balances waiting time, energy, and cost objectives while respecting QoS constraints. This provides a resilient foundation for scalable and adaptive task scheduling in dynamic cloud-edge environments.

Algorithm 1 introduces RL-MOTS to address the complex problem of task scheduling in hybrid cloud-edge computing systems, which requires the simultaneous optimization of multiple objectives: performance, energy efficiency, and cost-effectiveness. In contrast to conventional heuristic-based scheduling methods, which often struggle to adapt to fluctuating workloads and resource limitations, RL-MOTS uses a DQN to dynamically assign tasks to edge nodes or cloud virtual machines (VMs). This use of reinforcement learning enables real-time adaptive decision-making and addresses the limitations of static optimization techniques.
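The training loop described above (random initialization, replay memory, ε-greedy selection, and the Bellman update of eq. 9) can be sketched as follows. This is a simplified, self-contained outline under assumed interfaces (env.reset, env.step, env.sample_action, q_net, target_net and their fit/load_weights methods), not the paper's actual algorithm listing.

```python
import random
from collections import deque

def train_dqn(env, q_net, target_net, episodes=500, gamma=0.95,
              epsilon=1.0, eps_min=0.1, eps_decay_steps=20_000,
              batch_size=64, target_update=100, buffer_size=10_000):
    """Simplified RL-MOTS-style training loop. env, q_net, and target_net are
    assumed interfaces: q_net(state) returns Q-values over scheduling actions,
    env.step(action) performs the assignment and returns (next_state, reward, done)."""
    replay = deque(maxlen=buffer_size)
    step = 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy choice over "assign task Tk to edge Ei / cloud VM Vj" actions
            if random.random() < epsilon:
                action = env.sample_action()
            else:
                action = max(range(env.n_actions), key=lambda a: q_net(state)[a])
            next_state, reward, done = env.step(action)      # multi-objective reward, eq. (12)
            replay.append((state, action, reward, next_state, done))

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                # Bellman targets: r + gamma * max_a Q_target(s', a), eqs. (9)/(13)
                targets = [r + (0.0 if d else gamma * max(target_net(s2)))
                           for (_, _, r, s2, d) in batch]
                q_net.fit(batch, targets)                    # gradient step (assumed API)

            if step % target_update == 0:
                target_net.load_weights(q_net)               # periodic target sync (assumed API)
            epsilon = max(eps_min, epsilon - (1.0 - eps_min) / eps_decay_steps)
            state = next_state
            step += 1
    return q_net
```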

DQN was selected as the primary learning algorithm because of the discrete nature of the scheduling actions, for which value-based methods are well suited. To complement the proposed methodology, PPO and A3C are examined as additional baselines in the results section, with RL-MOTS consistently demonstrating superior performance. Furthermore, the off-policy nature of DQN with replay memory improves data efficiency under non-stationary and bursty workloads, and the target-network mechanism stabilizes training, features that proved essential for online scheduling. Thus, PPO and A3C are reported as actor-critic baselines, and the discussion notes where multi-agent RL (MARL) would be preferable when decentralized control is required.

To assess the efficacy of the proposed RL-MOTS system, simulations were performed using CloudSim 3.0.3, emulating a hybrid cloud-edge scenario. The simulation comprises three categories of virtual machines with varying configurations (Table 2), edge nodes with limited computing resources, provisions for virtual machine resilience, and the ability to dynamically allocate and deallocate virtual machines according to workload requirements. The values listed under cost/second in Table 2 denote Cr(R) and are used to calculate the overall cost of task execution in the experiments. The hyperparameters of the DQN scheduler were carefully tuned to ensure a fair and reproducible evaluation. Following an initial grid search, the following configuration was used: learning rate α = 1 × 10⁻³, discount factor γ = 0.95, and a replay buffer of 10,000 transitions. The mini-batch size was set to 64, and the target network was updated every 100 steps. The ε-greedy exploration strategy started at ε = 1.0 and decayed linearly to ε = 0.1 over 20,000 steps, after which it remained constant. Tasks are generated following a Poisson distribution with rate λ = 3. Each task is defined by its computational duration, input/output size, deadline, cost, and preemption status. Tasks are categorized into three classes (Table 3). Deadlines are drawn from normal distributions centered on the average execution time of each class, ensuring realistic SLA constraints.
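For reference, the reported hyperparameters and the Poisson task-arrival process can be captured in a small configuration and generator sketch. The numeric values mirror those listed above; the dataclass layout and the task-generation details (exponential inter-arrival sampling, placeholder task attributes) are assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class DQNConfig:
    # Hyperparameters as reported in the experimental setup
    learning_rate: float = 1e-3
    gamma: float = 0.95            # discount factor
    buffer_size: int = 10_000      # replay buffer transitions
    batch_size: int = 64
    target_update: int = 100       # steps between target-network updates
    eps_start: float = 1.0
    eps_end: float = 0.1
    eps_decay_steps: int = 20_000  # linear decay horizon

def generate_arrivals(n_tasks, lam=3.0, seed=None):
    """Poisson arrival process with rate lambda = 3: inter-arrival gaps are
    exponential with mean 1/lambda. Task attributes here are placeholders;
    the paper additionally assigns duration, I/O size, deadline, cost, and
    preemption status per task class (Table 3)."""
    rng = random.Random(seed)
    t = 0.0
    tasks = []
    for k in range(n_tasks):
        t += rng.expovariate(lam)          # exponential gap -> Poisson arrivals
        tasks.append({"id": k, "arrival": t})
    return tasks
```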

The proposed methodology is evaluated against FCFS (First-Come-First-Served), Min-Min scheduling, priority-based scheduling, a multi-objective function (MOF) approach, and non-dominated sorting with thresholding. Each baseline represents a distinct scheduling philosophy, from static heuristics to cost-performance trade-offs. Performance was evaluated using the following metrics: makespan (total time from the arrival of the first task to the completion of the last), average waiting time, total energy consumption, operational cost, deadline violation rate, number of active virtual machines over time, and Q-value convergence during training to assess the reinforcement-learning dynamics.

The makespan, defined as the total duration from the arrival of the first task to the completion of the last, is a fundamental measure of scheduling efficacy. Figure 5 shows that RL-MOTS consistently attains lower makespan values as task volumes increase, compared to conventional baseline approaches. This illustrates the framework's ability to dynamically assign tasks to suitable resources, reducing idle periods and improving throughput. Energy usage and expenditure are critical in large-scale distributed systems. Figure 6 shows that RL-MOTS markedly decreases energy use, achieving up to 28% lower consumption relative to heuristic-based approaches, which is attributable to the DQN agent's ability to prioritize energy-efficient nodes in real time. Figure 7 illustrates total operational expenditure, where RL-MOTS exhibits stronger cost optimization, realizing roughly 20% savings. This cost-effectiveness arises from dynamic virtual machine selection and fewer deadline violations, which frequently incur penalties. Adhering to SLAs is essential: as illustrated in Figure 8, RL-MOTS shows a significantly lower deadline violation rate across all task volumes, indicating improved responsiveness and reliability, especially under high load. Figure 9 illustrates the model's elasticity by tracking the number of active VMs over time; RL-MOTS dynamically adjusts VM utilization in response to variable workloads, demonstrating effective resource allocation and deallocation. Figure 10 depicts the convergence of average Q-values over training episodes to evaluate the learning efficiency of the DQN agent; the consistent convergence trend confirms the stability and efficacy of the learning process and the agent's ability to develop an effective scheduling policy.

Taken together, the experimental results demonstrate that RL-MOTS significantly improves performance, energy efficiency, cost reduction, SLA compliance, and learning stability relative to traditional scheduling methods, making it a strong candidate for practical deployment in dynamic cloud-edge settings. These improvements can be directly attributed to the architecture of RL-MOTS. The dynamic weight update mechanism allows the scheduler to prioritize energy efficiency during periods of high utilization, reducing total energy consumption by up to 28% compared to the baselines. The priority-aware queue management mechanism advances urgent or cost-critical tasks, reducing waiting times and thereby cutting overall operational expenses by around 20%. Furthermore, the state-reward tensor representation gives the DQN a structured view of performance, energy, and cost objectives simultaneously, enabling the policy to exploit trade-offs more effectively.

28% Reduction in Energy Consumption
20% Improvement in Cost Efficiency

RL-MOTS Scheduling Process Flow

Task Queue & Monitor Data
State Extractor
State-Reward Tensor (Waiting Time | Energy | Cost)
Deep Q-Network (DQN)
Action Selector (ε-greedy policy)
Assign Task to Resource
Reward Feedback

Comparative Performance Metrics

Algorithm | Completion Rate (%) | Energy Saving (%) | Cost Reduction (%)
RL-MOTS | 98 | 30 | 25
FCFS | 85 | 10 | 8
Min-Min | 88 | 15 | 12
Priority-Based | 90 | 18 | 14
MOF | 92 | 22 | 18

Real-World Cloud-Edge Testbeds Performance (100 Tasks)

Scenario: Evaluation on AWS Greengrass (100 tasks) and Azure IoT Edge (100 tasks) demonstrates RL-MOTS's superior performance in real-world scenarios, maintaining robust scheduling and low decision latency under heterogeneous resources and realistic network conditions.

Findings:

  • On AWS Greengrass (100 tasks), RL-MOTS lowers costs to $178 and deadline violations to 8.9%, cutting makespan to 470 s.
  • On Azure IoT Edge (100 tasks), RL-MOTS completes tasks in 485 s with a cost of $171 and a 9.1% deadline violation rate.

Quantify Your Potential ROI

Use our interactive calculator to estimate the financial and operational benefits of implementing advanced AI-driven task scheduling in your enterprise.

Estimated Annual Savings
Hours Reclaimed Annually

Your Path to Advanced Scheduling

Our phased implementation roadmap ensures a smooth transition to an RL-MOTS driven scheduling system, tailored to your enterprise needs.

Phase 1: Deep Q-Network Integration & State-Reward Tensor Development

Integrate DQN for adaptive resource allocation. Develop the multi-dimensional state-reward tensor to capture latency, energy, and cost trade-offs, enabling intelligent real-time decision-making.

Phase 2: Priority-Aware Queue & Dynamic Weight Optimization

Implement the priority-aware dynamic queueing mechanism and dynamic weight adjustment for the reward function, ensuring Pareto-optimal solutions and effective management of high-priority tasks.

Phase 3: Hybrid Scheduling & Real-time Adaptability Validation

Integrate virtual FIFO queue for preemptive/non-preemptive scheduling. Conduct comprehensive simulations using CloudSim to validate robustness under varying workloads and dynamic conditions.

Phase 4: Deployment & Continuous Learning in Cloud-Edge Environments

Deploy RL-MOTS in hybrid cloud-edge testbeds (e.g., AWS Greengrass, Azure IoT Edge) for real-world validation and continuous learning, adapting to non-stationary workloads and network conditions.

Ready to Transform Your Operations?

Unlock unparalleled efficiency and cost savings with our Reinforcement Learning solutions. Schedule a free consultation with our AI experts today.

Ready to Get Started?

Book Your Free Consultation.

Let's Discuss Your AI Strategy!

Let's Discuss Your Needs


AI Consultation Booking