Fixing Azkaban’s excessive memory usage

Project Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves execution ordering through job dependencies and provides an easy-to-use web user interface to maintain and track users' workflows.

Some Azkaban servers may grow huge, using tens of gigabytes of memory. The heap dumps we analyzed came from a JVM running with a 45G max heap. It routinely used more than 35G, suffered long GC pauses, and occasionally crashed with OutOfMemoryError.

JXRay was the only tool that could process such heap dumps in a reasonable time. It was also the only tool that could run on the big machines in the data center, with no screen attached. We found several common problems, such as duplicate strings and underutilized collections, that were fairly easy to fix. We also found a more curious issue: duplicate objects other than Strings.
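To illustrate the duplicate-string problem, here is a minimal sketch (not Azkaban's actual code) of the canonicalization idea: equal strings loaded from different places are distinct objects on the heap, and a simple canonicalization map lets them all share one copy. The class and method names here are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: deduplicating equal strings so that only one
// canonical copy of each distinct value remains on the heap.
// JDK's String.intern() is an alternative, but an explicit map gives
// more control over the cache's lifetime.
public class StringDedup {
    private static final ConcurrentHashMap<String, String> CACHE =
        new ConcurrentHashMap<>();

    // Returns the canonical instance equal to s, registering s as the
    // canonical copy if no equal string has been seen before.
    public static String dedup(String s) {
        String prev = CACHE.putIfAbsent(s, s);
        return prev != null ? prev : s;
    }

    public static void main(String[] args) {
        String a = new String("flow.name");
        String b = new String("flow.name");
        System.out.println(a == b);                 // false: two distinct objects
        System.out.println(dedup(a) == dedup(b));   // true: one canonical copy
    }
}
```

After deduplication, all references point at the same canonical object, so the duplicate copies become garbage and are reclaimed by the GC.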

In one dump, there were nearly 200 million instances of class FlowProps, but only about 39,000 of them were unique. All other FlowProps objects were duplicates (copies) of those 39k. Together they wasted about 7G of memory, or roughly 20% of the used heap.

It turned out that most FlowProps instances never change after creation. That observation, plus some creativity, allowed the Azkaban team to rewrite the FlowProps code, replacing the class with ImmutableFlowProps. See the pull request where this was done. Since ImmutableFlowProps objects don't change after creation, they can be safely and easily deduplicated (interned). Once this change was applied, roughly 200 million FlowProps instances were effectively replaced with just 39k ImmutableFlowProps objects. Needless to say, saving 20% of the heap and preventing the repeated creation and GCing of long-lived objects helped both the memory footprint and GC performance of Azkaban.
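The interning pattern behind this fix can be sketched as follows. This is a simplified, hypothetical version of ImmutableFlowProps, not the actual Azkaban class: the field names are illustrative, and the real change may use a different interning mechanism. The key ingredients are final fields, value-based equals/hashCode, and a factory method that returns a canonical instance instead of a fresh one.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of an interned immutable class. Because all fields
// are final, instances never change after construction, so equal
// instances can safely share one canonical copy.
public final class ImmutableFlowProps {
    private static final ConcurrentHashMap<ImmutableFlowProps, ImmutableFlowProps> POOL =
        new ConcurrentHashMap<>();

    private final String parentSource;  // illustrative fields
    private final String propSource;

    private ImmutableFlowProps(String parentSource, String propSource) {
        this.parentSource = parentSource;
        this.propSource = propSource;
    }

    // Factory method: returns the canonical instance for these values,
    // creating one only if no equal instance exists yet.
    public static ImmutableFlowProps createFlowProps(String parentSource, String propSource) {
        ImmutableFlowProps candidate = new ImmutableFlowProps(parentSource, propSource);
        ImmutableFlowProps existing = POOL.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ImmutableFlowProps)) return false;
        ImmutableFlowProps other = (ImmutableFlowProps) o;
        return Objects.equals(parentSource, other.parentSource)
            && Objects.equals(propSource, other.propSource);
    }

    @Override
    public int hashCode() {
        return Objects.hash(parentSource, propSource);
    }
}
```

One design note: a plain ConcurrentHashMap pool keeps canonical instances alive forever. If the set of distinct values can shrink over time, a weak-reference interner (such as Guava's Interners.newWeakInterner()) lets unreferenced canonical copies be garbage-collected.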
