[torqueusers] Possible bug in TM implementation

Martin Schafföner martin.schaffoener at e-technik.uni-magdeburg.de
Fri Feb 3 09:43:29 MST 2006


I think I have discovered a bug in the TM implementation. Consider a scenario 
where a job master script spawns many single tasks during its lifetime. 
For each task, OSC's mpiexec is started; it runs a command on a remote (or 
the local) node and waits for the TM interface to notify it of the remote 
command's termination. This works for quite a long time and then suddenly 
stops working: one or two mpiexec processes are never notified of their 
task's termination.

The last entry in mom's logfile regarding the job in question is:

02/02/2006 23:00:41;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 3395, sid 30410, cmd /bin/sh
02/02/2006 23:02:11;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 562 terminated, sid 30410

I would have expected the last line to report task 3395, since TID 562 had 
already terminated more than seven hours earlier:

02/02/2006 15:32:12;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 562, sid 30410, cmd /bin/sh
02/02/2006 15:33:33;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 562 terminated, sid 30410
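
Both start_process entries carry sid 30410. To check whether this happens for 
other tasks as well, a small script along the following lines could list every 
sid that was started under more than one tid. This is only a sketch: the 
function name and the regular expression are my own, and it assumes each log 
record sits on a single line in mom's logfile.

```python
import re
from collections import defaultdict

# Matches the start_process records shown above (assumed to be on one line each).
START_RE = re.compile(r"start_process: task started, tid (\d+), sid (\d+)")

def sids_with_multiple_tids(lines):
    """Return {sid: [tid, ...]} for every sid that appears with more than one tid."""
    tids_by_sid = defaultdict(list)
    for line in lines:
        match = START_RE.search(line)
        if match is None:
            continue  # not a start_process record
        tid, sid = int(match.group(1)), int(match.group(2))
        if tid not in tids_by_sid[sid]:
            tids_by_sid[sid].append(tid)
    return {sid: tids for sid, tids in tids_by_sid.items() if len(tids) > 1}

# The two start_process records quoted in this message.
sample = [
    "02/02/2006 15:32:12;0008;   pbs_mom;Job;6222.master;start_process: "
    "task started, tid 562, sid 30410, cmd /bin/sh",
    "02/02/2006 23:00:41;0008;   pbs_mom;Job;6222.master;start_process: "
    "task started, tid 3395, sid 30410, cmd /bin/sh",
]
reused = sids_with_multiple_tids(sample)
print(reused)  # {30410: [562, 3395]}
```

On a real logfile one would pass the open file object instead of the sample 
list.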

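So my guess is that session IDs wrapped around and were reused during the 
lifetime of the job (both tasks have sid 30410), and that the lookup from SID 
to TID then matches the old, already terminated task 562 instead of the 
running task 3395.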
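
To illustrate the failure mode I suspect, here is a minimal sketch. It is not 
the actual pbs_mom code, which I have not traced; the Task class and both 
lookup functions are hypothetical. It only shows how a "first match by SID" 
lookup returns the stale task once a session ID is reused, and how skipping 
already exited tasks would avoid that.

```python
from dataclasses import dataclass

@dataclass
class Task:
    tid: int              # task ID assigned by the MOM
    sid: int              # session ID of the task's process
    exited: bool = False  # set once termination has been reported

def find_task_by_sid_naive(tasks, sid):
    """Return the first task with this sid, even if it already exited."""
    for task in tasks:
        if task.sid == sid:
            return task
    return None

def find_task_by_sid_live(tasks, sid):
    """Return the first task with this sid that has not exited yet."""
    for task in tasks:
        if task.sid == sid and not task.exited:
            return task
    return None

# The situation from the log: tid 562 finished long ago, tid 3395 got the same sid.
tasks = [Task(tid=562, sid=30410, exited=True), Task(tid=3395, sid=30410)]

naive = find_task_by_sid_naive(tasks, 30410)
live = find_task_by_sid_live(tasks, 30410)
print(naive.tid, live.tid)  # 562 3395
```

With the naive lookup, the termination would be attributed to task 562 and 
whoever waits on task 3395 would never hear about it, which matches what I 
see with mpiexec.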

What do you think about this?
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063
