Task graph scheduling in the presence of performance fluctuations of computational resources

Most existing work on task graph scheduling assumes resources with fixed processing capacity, and the resulting algorithms rely on estimates of task execution times on different resources. In practice, however, the performance of cloud resources fluctuates, so algorithms that depend on such estimates face difficulties in these environments. In this paper, we focus on the problem of fault-tolerant scheduling of task graphs in the presence of performance fluctuations of computational resources. To reduce the adverse impact of both soft errors and resource performance degradation, we propose an opportunistic task replication scheme that uses the idle periods of resources to replicate tasks. Unlike previous work, the proposed algorithm does not rely on estimated task execution times to find idle resources. We introduce the notion of concurrency graphs and propose a graph-theoretic algorithm for finding the number of idle resources during the execution of a set of tasks. The redundancy level of each task is then chosen according to the number of idle resources and the characteristics of the set of tasks being processed concurrently. Simulation experiments show that, in most situations, the proposed algorithm outperforms previous algorithms in terms of average execution time and cost.
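The abstract does not define concurrency graphs or give the algorithm itself, so the following is only an illustrative sketch under stated assumptions, not the method of the paper. It assumes that two tasks are "concurrent" when neither precedes the other in the task graph, that the concurrency graph joins every such pair, and that the number of idle resources can be bounded from below by subtracting the largest set of mutually concurrent tasks from the total number of resources. All function names (`reachability`, `concurrency_graph`, `max_concurrency`, `idle_resources`) are hypothetical. Note that the sketch uses only precedence constraints and no execution time estimates.

```python
from itertools import combinations


def reachability(succ):
    """Map each task to the set of all its descendants in the DAG.

    `succ` maps every task to a list of its direct successors
    (every task must appear as a key, possibly with an empty list).
    """
    reach = {}

    def visit(task):
        if task not in reach:
            reach[task] = set()
            for nxt in succ[task]:
                reach[task].add(nxt)
                reach[task] |= visit(nxt)
        return reach[task]

    for task in succ:
        visit(task)
    return reach


def concurrency_graph(succ):
    """Assumed definition: an edge joins two tasks with no precedence path between them."""
    reach = reachability(succ)
    return {
        frozenset((a, b))
        for a, b in combinations(succ, 2)
        if b not in reach[a] and a not in reach[b]
    }


def max_concurrency(succ):
    """Largest set of mutually concurrent tasks (the width of the DAG).

    By Dilworth's theorem this equals the minimum number of chains covering
    the tasks, which is (number of tasks) - (maximum bipartite matching on
    the ancestor/descendant relation).
    """
    reach = reachability(succ)
    match = {}  # descendant -> ancestor currently matched to it

    def augment(u, seen):
        for v in reach[u]:
            if v not in seen:
                seen.add(v)
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    matched = sum(augment(u, set()) for u in succ)
    return len(succ) - matched


def idle_resources(succ, total_resources):
    """Lower bound on resources that stay idle while this task set executes."""
    return max(0, total_resources - max_concurrency(succ))


# Hypothetical diamond-shaped task graph: A -> {B, C} -> D.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(concurrency_graph(diamond))   # only B and C can overlap
print(idle_resources(diamond, 4))   # 2
```

In this example at most two tasks (B and C) can ever run at the same time, so with four resources at least two are always free and could host replicas. How many replicas each task should actually receive would depend on the selection criteria of the paper, which the abstract does not specify.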