[torqueusers] Possible bug in TM implementation

Martin Schafföner martin.schaffoener at e-technik.uni-magdeburg.de
Fri Feb 3 09:43:29 MST 2006


Hi,

I think I have discovered a bug in the TM implementation. Consider a scenario 
where a job master script spawns a lot of single tasks during its lifetime. 
In this case OSC's mpiexec is spawned which runs a command on a remote (or 
the local) node and waits for the TM interface to notify it of the remote 
command's termination. This works for quite a long time, and then it suddenly 
stops working with one or two mpiexec processes not getting notified about a 
task's termination.

The last entry in mom's logfile regarding the job in question is:

02/02/2006 23:00:41;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 3395, sid 30410, cmd /bin/sh
02/02/2006 23:02:11;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 562 terminated, sid 30410

I would have expected the task to be 3395 in the last line, as TID 562 had 
terminated a long time ago:

02/02/2006 15:32:12;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 562, sid 30410, cmd /bin/sh
02/02/2006 15:33:33;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 562 terminated, sid 30410
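
A mismatch like this can be searched for mechanically. The following is only a minimal sketch: it assumes the mom log contains exactly the two message shapes quoted in this mail, with each entry on a single line, and the function name find_suspect_terminations is made up for illustration. It reports every termination whose task ID differs from the task most recently started with the same SID.

```python
import re

# Patterns for the two pbs_mom message shapes quoted in this mail
# (assumed to be stable; other mom versions may log differently).
START_RE = re.compile(r"start_process: task started, tid (\d+), sid (\d+)")
TERM_RE = re.compile(r"scan_for_terminated: job \S+ task (\d+) terminated, sid (\d+)")


def find_suspect_terminations(lines):
    """Return (reported_tid, expected_tid, sid) for each termination whose
    reported task ID is not the task most recently started with that SID."""
    latest_tid_for_sid = {}
    suspects = []
    for line in lines:
        started = START_RE.search(line)
        if started:
            tid, sid = int(started.group(1)), int(started.group(2))
            latest_tid_for_sid[sid] = tid
            continue
        terminated = TERM_RE.search(line)
        if terminated:
            tid, sid = int(terminated.group(1)), int(terminated.group(2))
            expected = latest_tid_for_sid.get(sid)
            if expected is not None and expected != tid:
                suspects.append((tid, expected, sid))
    return suspects


# The four log entries from this mail, in chronological order.
log = [
    "02/02/2006 15:32:12;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 562, sid 30410, cmd /bin/sh",
    "02/02/2006 15:33:33;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 562 terminated, sid 30410",
    "02/02/2006 23:00:41;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 3395, sid 30410, cmd /bin/sh",
    "02/02/2006 23:02:11;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 562 terminated, sid 30410",
]
suspects = find_suspect_terminations(log)
print(suspects)
```

On these four entries the only flagged termination is the one at 23:02:11, which names task 562 although task 3395 was the last one started with SID 30410.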

So my guess is that session IDs wrapped around during the lifetime of the 
job, so that SID 30410 was reused, and that the mapping from SID to TID then 
picked the old task (562) instead of the new one (3395), or some such.
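
The suspected failure can be shown with a toy model. This is a sketch under stated assumptions, not pbs_mom's actual code: the Task class and both lookup functions are hypothetical, and the real data structures in mom may look quite different. It only demonstrates how a lookup by SID that returns the first match would report the long-finished task 562 once the SID has been reused.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical stand-in for a per-task record; not TORQUE's real struct.
    tid: int
    sid: int
    exited: bool = False


def find_task_by_sid_first_match(tasks, sid):
    """Naive lookup: return the first task with this SID, even if it
    has already exited."""
    for task in tasks:
        if task.sid == sid:
            return task
    return None


def find_task_by_sid_live_only(tasks, sid):
    """Variant that ignores tasks already marked as exited."""
    for task in tasks:
        if task.sid == sid and not task.exited:
            return task
    return None


# State after SID 30410 has been reused: task 562 finished hours ago,
# task 3395 is the one that is actually terminating now.
tasks = [Task(tid=562, sid=30410, exited=True), Task(tid=3395, sid=30410)]

naive = find_task_by_sid_first_match(tasks, 30410)
careful = find_task_by_sid_live_only(tasks, 30410)
print(naive.tid, careful.tid)
```

In this model the naive lookup yields task 562 and the live-only lookup yields task 3395, which would match what the log shows if mom does something similar.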

What do you think about this?
-- 
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063

