Background of Memory Leakings
I encountered another memory leak problem recently. I’ve debugged a few of memory leak problems in the past, including the one in Splice Machine, an open source distributed SQL engine based on HBase but was sadly discontinued. The memory leak problems are interesting because it’s challenging to find the root cause. However, I’ve never written a blog about it. Memory leak problems are not so usual, so when I encountered a new one, I kind of need to remember what tools I’ve used. So this time, even though not as interesting as some other memory leak problems I’ve debugged in the past, I decide to write it down as a note for my own reference in the future. The tool set I used this time is relatively simple. I guess I can write more when I use others in the future. This is more like a dev log instead of a tutorial. The “boring” in the title means it’s a pretty standard process and the problem is not that hard to find this time.
Most of the memory leak bugs are very easy to fix once found the root cause, but the part of finding the root cause is tricky. First of all, it’s hard to reproduce: sometime it only happens on production environment. Without knowing the cause, it’s hard to reproduce locally. Even it can be reproduced consistently, it may take some time to let the memory accumulate so the debugging loop can be time consuming sometimes. Last of all, unlike some other bugs that you have an exception and a nice stack trace to help you identify which code causes the problem, it’s almost impossible to find the root cause without the help of a profiler, which itself has challenging parts depending on the platform.
Luckily, JVM has good profilers. That’s one of the reasons Scala, a JVM based language, is my favorite language. (The criteria of a good production language for me is not only the language itself, but also the ecosystem like library, IDE and profilers. JVM based language makes lots of the criteria easy to meet.) This time I uses a very popular profiler JProfiler. Other popular choices that I have used are VisulaVM and Java Msission Control. But I found JProfiler is both powerful and easy to use. The only downside is you need to buy a license. It has free trail and open source license. So if you have an open source project or just need to use it for a few days, you can still use it for free.
Okay, enough of the background. Let’s dive into the memory leaking problem I encountered this time. As mentioned in the previous blog Migrate Scala2grpc to Cats Effect 3, I migrated one of my side projects to Cats Effect 3 as well, with a lot of other dependencies. This side project is RSS Brain. There are two parts on the backend: one for serving client requests with gRPC and gRPC web, another one for fetching RSS feeds. The fetcher gets the RSS feeds that haven’t been fetched for a while from the database with the help of doobie and quill, and fetch them in parallel with the help of fs2 stream and Cats Effect.
After the mass upgrades, I looked into the metrics to make sure everything is okay. Then I found the fetcher’s memory starts to increase slowly. Looks like a memory leak problem to me. Here is the memory usage graph:
Seems eventually the JVM will run out of memory but I didn’t wait for it. It’s good to try if force a full gc will reclaim the memory or not, in my case full gc doesn’t help much.
Another metrics to look at is the GC metrics. Only after I shipped the fix, I realized the GC didn’t look normal when there was this memory problem:
Before the blue line starts, GCs are all “Copy” and “MarkSweepCompact”, which means the memory are mostly being moved around instead of reclaimed. After the blue line starts, which was when the fix was shipped, we starts to see normal young and old generation GC.
So these metrics indicates that we may have a memory leak issue. Let’s starts to debug it.
In this case I’m using JProfiler. But as I mentioned above, VisualVM or Java Mission Control should also be able to do the job.
JProfiler has a nice wizard to let you setup the profiler. In my case, since I run the service in Kubernetes, I need to select remote server profiling and go through the wizard. We are going to use
kubectl to forward the debugging port to local, so that we can just use
localhost:8849 as the remote address. At the end of the setup wizard, it will prompt you to download the profiler agent and include it with a Java command line argument. Since the service is running in container, I added the following lines to the Dockerfile in order to include the agent in it:
Also add this flag to Java command line when starting the service:
After the new container is deployed, we can port forward 8849 from the service to our localhost with kubectl:
Since it’s a memory leaking problem, we want to find out what objects are increasing. First let’s restart the JVM, connect JProfiler to it and take a snapshot of all objects in live memory:
We can see
byte takes the most memory but it doesn’t mean it’s responsible for memory leak, since we need to look at the increase of the memory.
So we need to wait for a while for the memory problem starts to happen. In my case, obvious memory increase can be occurred after the JVM run for about 12 hours. Normally if this is a work related thing, I may want to make it faster by increasing the work load. In this case, the code is fetching RSS feeds, so I could make the interval shorter so that it makes more requests. But since this is only a side project, I don’t need to continues working on it, and I also don’t quite like the idea to increase the requests to target RSS websites to increase their load. So I decide just let the JVM run during the night and take another look next day:
scala.collection.mutable.LinkedHashMap$LinkedEntry increased a lot. But is there anything else? Conveniently, JProfiler has the feature to compare 2 snapshots. Just go to “Session” -> “Start Center” -> “Open Snapshots” -> “Compare Multiple Snapshots”. After open those 2 snapshots, select both of them on the left and then compare memory:
We can see
LinkedEntry indeed increased the most by instance count. However, if we sort by size, we fill find
byte increased the most by memory size.
A False Root Cause
byte increased the most by memory size, I’d like to start there. By using “Allocation Call Tree”, we can check which code allocates
byte the most. After profiling for a while, we get the following result:
Okay, the top allocation goes to my own code
me.binwang.rss.parser.SourceParser. It’s the class that parse the xml from RSS feeds. So I looked into it if it has any code that can cause memory leak and I found this:
So there is an exception thrown in a
Resource.use makes sure to cleanup the resource when the
use scope is over. But what will happen if it throws an example in there? I thought it will cause
use to not handle the cleanup properly. So I changed it to use
IO.raiseError instead of throw it directly.
However, while I deploying the code, I thought I should really test it. So I wrote a piece of simple code to see whether
Resource will still be cleaned up if there is any exception thrown in
use, and the answer is yes. So this shouldn’t be the root cause. And the deployment result also confirms that: the memory kept increasing with this fix.
The Real Root Cause
byte just happened to uses more memory because it’s parsing a large xml at that time. It’s okay that it isn’t the real root cause since we have another lead:
scala.collection.mutable.LinkedHashMap$LinkedEntry. From the profiling, its allocation tree looks like this:
Okay, so seems most of them come from quill. quill is a library that compiles Scala DSL to SQL queries. It is fairly complex since it uses macros. I checked the code in the allocation tree and couldn’t find out what is wrong.
Then I tried to check the object reference to see which instances are pointed to the these LinkedEntry:
No surprise, they are basically all from quill as well. However, I couldn’t understand the internal AST represents of quill and not sure where are they coming from.
It’s time to search the Internet to see if there is any known issue in quill about memory leak. Maybe I didn’t have the right query, I didn’t find proper results from Internet.
After struggle for a while, I went to its Github repo to search “Memory leak” directly and found 3 issues. That’s good! And there is one describes the exact problem we have. If we see the allocation tree above, we can find there is a call from
NormalizeCaching (at the bottom of the tree in the picture), which is the class that the issue describes. I guess I didn’t go that far enough to check this class. I’m glad someone else did and found the issue! Basically the root cause is there is a map in the caching doesn’t have any bound. So the cache triggered by dynamic queries never got expired and is growing more and more:
Fix the Memory Leak
The issue is pretty old and is related to a core feature. I’m surprised it’s not fixed yet. As I said, once we found the root cause, the fix should be easy. We just need a way to make the cache expire. I replaced the cache implementation with Guava’s cache, and after the suggestion of maintainer changed it to Caffeine’s cache implementation. Here is the PR.
I built quill with the fix locally and tested with RSS Brain. The memory leak is indeed fixed! How exciting it is!
Let’s review the process of fixing the memory leak in this case:
- Setup profiler.
- Run full GC cannot resolve the memory issue.
- Compare the snapshots between when JVM first started and when the memory increases. See which classes increased most.
- Using allocation tree to find out which part of the code is creating the instances.
- Using references in heap walker to check which classes holds references of those instances.
- Check the identified code and classes.
- If it’s a third party library and we cannot find the root cause, check if the issue is reported. Otherwise report the issue.
- Fix the memory leak based on the root cause.