12-Hour Disaster, Production Down — Node.js (Nest.js) on AWS EKS Kubernetes: Memory Leak, JS Heap Out of Memory
Why 9 Hours Were Wasted — No Root Cause Found
We speculated that whatever we had patched that day was what broke production.
Reverting works 95% of the time, so we spent a lot of time reverting and commenting out anything that might be the problem.
Under pressure, most people try this first. I did too, and that's okay.
However …
The golden rule is to set a cut-off time, say 2 hours.
After that, we need to tell users that this is going to take a LONG time … and start a deep root cause analysis.
Root Cause: Code Written 7 Months Ago
This system supports data input, and there are content updates every day.
TL;DR: We finally found that the new data input was breaking very old code our team wrote a long, long time ago.
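To make that concrete, here is a minimal sketch of the kind of latent leak this can be. The names (Content, contentCache, handleDailyUpdate) and the data shape are hypothetical, not our actual code: an old in-memory cache stays harmless while the content set is small, but a new kind of daily data input keeps adding entries that are never evicted, so the JS heap grows until it hits the limit.

// Hypothetical illustration only, not the real service code.
// A module-level cache written long ago that nothing ever clears.
type Content = { id: string; body: string };

const contentCache = new Map<string, Content>(); // lives for the whole process

export function addContent(item: Content): void {
  // Written with a small, mostly stable content set in mind,
  // so entries are inserted but never evicted.
  contentCache.set(item.id, item);
}

// The new daily data input produces fresh unique ids on every update,
// so the Map only grows and the JS heap slowly fills up.
export function handleDailyUpdate(items: Content[]): void {
  for (const item of items) {
    addContent(item);
  }
}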
How to Debug a Memory Leak — Heap Snapshot
The steps are:
- Enable the logging option
- Copy the heap snapshot out of the pod
- Open and inspect the heap snapshot
- Now we can see where it leaks → fix the code
Step 1 — How to enable the logging option
This is not enabled by default, because logging the heap affects performance, so you should only turn it on when needed.
You pass the Node.js options like below, in your package.json, Dockerfile, or run command:
node \
--max-old-space-size=2048 \
--heapsnapshot-near-heap-limit=3 \
dist/main.js
Other optional debugging flags:
--heapsnapshot-signal=SIGUSR2 \
--heap-prof \
--report-on-fatalerror \
--report-uncaught-exception \
--inspect=0.0.0.0:9229
# I don't use this one; it's a real mess with networking, port forwarding, etc.
Step 2 — How to get the heap snapshot
When Node.js is about to run out of JS heap (close to the --max-old-space-size you configured), it will automatically write a heap snapshot file (.heapsnapshot) to disk.
Then run some kubectl (k) commands to get it out in time:
# <ns> = namespace; if you didn't configure one, omit this option
# <pod> = pod name like `foo-1fmaw32`
k -n <ns> logs -f <pod>
# observe heapsnapshot write
k cp <ns>/<pod>:/usr/src/app/Heap.20251001.124051.25.0.001.heapsnapshot ./20251001.124051.25.0.001.heapsnapshot
# then copy it out; the file is big and you have to be fast
# there might be a `tar` warning about a leading `/`, but it's okay
ls -lah
# repeat this command and you will see the file size going up
Tips: If you are not quick enough, the --heapsnapshot-signal=SIGUSR2 option above helps. You exec into the pod and send the signal to Node.js to create a snapshot on demand.
kubectl exec -it <pod-name> -n traveljoy-prd -- kill -USR2 1
There are other ways, like a volume mount, but I think it is harder, and the pressure is high when we see an issue like this, so we don't feel like doing it.
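If you would rather not race the pod at all, one more option (not what we did here, just a sketch) is to write the snapshot from inside the app with Node's built-in v8.writeHeapSnapshot(), for example behind an admin-only Nest.js endpoint, then kubectl cp the file out as above. The DebugController name and route below are made up for illustration; lock anything like this down before shipping it.

import { writeHeapSnapshot } from 'v8';
import { Controller, Post } from '@nestjs/common';

// Hypothetical admin-only controller for on-demand heap snapshots.
@Controller('debug')
export class DebugController {
  @Post('heap-snapshot')
  dumpHeap(): { file: string } {
    // Writes a .heapsnapshot file into the working directory and
    // returns the generated filename, so you know what to `kubectl cp`.
    const file = writeHeapSnapshot();
    return { file };
  }
}

Note that writeHeapSnapshot() blocks the event loop while it serializes the heap, so only call it on an instance you are okay with pausing for a moment.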
Step 3 — How to open and inspect the heap snapshot
Open Chrome, go to chrome://inspect, then open the dedicated DevTools for Node.
Example of a heap snapshot. Expand the entries to see the object that is leaking.
Now you will see the retained memory that CANNOT BE FREED ranked at the top.
This retained memory can add up to a lot, because one object can be the parent or child of another.
Now we can pinpoint the problematic object and the leak.
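Continuing the hypothetical cache sketch from earlier: once the snapshot points at the unbounded Map, the fix can be as small as capping it. A minimal sketch, again with made-up names:

// Hypothetical fix for the earlier sketch: cap the cache so retained
// memory stays bounded no matter how much new content arrives.
type Content = { id: string; body: string };

class BoundedContentCache {
  private readonly entries = new Map<string, Content>();

  constructor(private readonly maxEntries = 10_000) {}

  set(item: Content): void {
    if (!this.entries.has(item.id) && this.entries.size >= this.maxEntries) {
      // Evict the oldest entry; Map iteration follows insertion order.
      const oldestKey = this.entries.keys().next().value;
      if (oldestKey !== undefined) this.entries.delete(oldestKey);
    }
    this.entries.set(item.id, item);
  }

  get(id: string): Content | undefined {
    return this.entries.get(id);
  }
}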
Hope this helps!