How to Identify the Process Terminated by OOM

Posted on Wed, 11 Sep 2024 in Other

You can find the PID of a process killed by the out-of-memory (OOM) killer with a single command. If your logs include PIDs, you can then grep them to find out what exactly the killed process was doing.

The OOM killer is heartless. It may look as if it picks a process to kill at random, but under the hood there is no randomness in its logic. By looking at the process list and inside a couple of files, you can predict which process would be killed if an OOM happened right now.
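A minimal sketch of such a prediction, assuming the files in question are the per-process entries under /proc, such as /proc/<pid>/oom_score (Linux only). The function name oom_candidates and its optional root-directory argument are made up for illustration. A higher score means a more likely victim.

```shell
#!/bin/sh
# Print "<oom_score> <pid> <name>" for the ten processes the kernel
# currently rates as the most likely OOM victims (highest score first).
# oom_candidates is a hypothetical helper; the optional first argument
# lets you point it at a directory other than /proc.
oom_candidates() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue
        # A process can exit between the check and the read, so skip failures.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || continue
        printf '%s %s %s\n' "$score" "${dir##*/}" "$name"
    done | sort -rn | head -n 10
}

oom_candidates
```

The list reflects only the current state of the machine, so it says nothing about a process that has already been killed.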

That is useless for finding a process that is already gone, though. By the time I start my investigation, the machine is in a completely different state. At the very least, the killed process no longer exists.

To find the PID and the application name of the process killed by the OOM killer, use the command below.

dmesg -T | grep -i 'killed process'

For a Python application, the name in the output is python. That helps narrow things down a bit. If your logs include the PID, you have everything you need to find out which process was killed.
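Here is a rough sketch of that last step. It assumes the kernel message has the common shape "... Killed process 12345 (python) ..."; the exact wording differs between kernel versions, so adjust the pattern to what your dmesg prints. The function name oom_killed_pids and the log directory /var/log/myapp are hypothetical.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print "<pid> <name>" for every
# line that reports an OOM kill. Lines that do not match print nothing.
# oom_killed_pids is a hypothetical helper name.
oom_killed_pids() {
    sed -n -E 's/.*[Kk]illed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}

# Usage sketch (/var/log/myapp is a made-up path, use your own):
# dmesg -T | oom_killed_pids | while read -r pid name; do
#     echo "== $name ($pid) =="
#     grep -rw -- "$pid" /var/log/myapp
# done
```

The -w flag makes grep match the PID as a whole word, so PID 123 does not also match 1234.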

Why is it important to find out which process was killed? For web applications whose main process is controlled by a supervisor, it is not super important: one process was killed, a new one was started...

But what if the OOM killer killed a long-running calculation? A new calculation will start only in a couple of hours. If you understand what happened on the machine, you can notify users that the results will be delayed, and restart the calculation without wasting extra hours.

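The OOM killer does not work randomly. It uses a badness score. In the classic heuristic, the more memory a process uses, the higher its score, and the longer a process has been alive, the lower its score. Newer kernels base the score mainly on memory usage, adjusted by the per-process oom_score_adj value. More about the OOM killer and the badness score is in the article Taming the OOM killer on LWN.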

---
Got a question? Hit me on Twitter: avkorablev