Learn how to use machine learning to improve cluster performance.
This talk describes the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events.
The performance of batch processing systems such as YARN is generally measured by throughput: the amount of workload (tasks) completed in a given time window. For a given cluster size, throughput can be increased by running as much workload as possible on each host, so that all of the host's free resources are used. Because each node runs a complex combination of different tasks and containers, the performance characteristics of the cluster change dynamically. As a result, there is always a danger of overutilizing host memory, which can result in extreme swapping, or thrashing. The impact of thrashing can be severe: it can actually reduce throughput instead of increasing it.
Using fine-grained (5-second) data from many production clusters running very different workloads, we have trained a generalized model that detects the onset of thrashing within seconds of the first symptom. This detection has proven fast enough to enable effective mitigation of the negative effects of thrashing, allowing hosts to keep providing high throughput.
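To make the idea of detecting onset from 5-second samples concrete, here is a minimal sketch. It is not the trained model described in this talk: it swaps in a simple fixed-threshold rule, and the `Sample` fields, the `detect_onset` function, the threshold, and the window length are all hypothetical values chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Sample:
    """One 5-second host measurement (field names are hypothetical)."""
    t: int            # seconds since the start of the trace
    swap_in: float    # pages swapped in per second
    swap_out: float   # pages swapped out per second


def detect_onset(samples: Iterable[Sample],
                 rate_threshold: float = 1000.0,
                 window: int = 3) -> Iterator[int]:
    """Yield the timestamp at which sustained swapping is confirmed.

    A sample is "hot" when both swap-in and swap-out exceed rate_threshold
    (an assumed value, not one taken from the talk). An onset is reported
    once `window` consecutive samples are hot, and is not reported again
    until a non-hot sample has been seen.
    """
    recent = deque(maxlen=window)
    alarmed = False
    for s in samples:
        hot = s.swap_in > rate_threshold and s.swap_out > rate_threshold
        recent.append(hot)
        if not hot:
            alarmed = False
        elif len(recent) == window and all(recent) and not alarmed:
            alarmed = True
            yield s.t


# Synthetic trace for illustration only; these are not measured values.
trace = [
    Sample(0, 0, 0),
    Sample(5, 50, 1200),      # swap-out only, so not hot
    Sample(10, 1500, 2000),
    Sample(15, 1800, 2500),
    Sample(20, 2200, 2600),   # third consecutive hot sample
    Sample(25, 2100, 2400),
    Sample(30, 10, 20),
]
onsets = list(detect_onset(trace))
print(onsets)
```

With 5-second samples and a three-sample window, a rule like this confirms an onset about 10 seconds after the first hot sample. A learned model would replace the fixed threshold with features derived from many hosts and workloads.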
To build this system, we used hand-labeling of bad events combined with large-scale data processing using Hadoop, HBase, Spark, and IPython for experimentation. We will discuss the methods used as well as novel findings about Big Data cluster performance.