[torqueusers] Possible bug in TM implementation
Martin Schafföner
martin.schaffoener at e-technik.uni-magdeburg.de
Fri Feb 3 09:43:29 MST 2006
Hi,
I think I have discovered a bug in the TM implementation. Consider a scenario
where a job master script spawns a lot of single tasks during its lifetime.
In this case each task is started through OSC's mpiexec, which runs a command
on a remote (or the local) node and waits for the TM interface to notify it of
the remote command's termination. This works for quite a long time, and then
it suddenly stops working: one or two mpiexec processes are never notified of
their task's termination.
The last two entries in mom's logfile regarding the job in question are:
02/02/2006 23:00:41;0008; pbs_mom;Job;6222.master;start_process: task
started, tid 3395, sid 30410, cmd /bin/sh
02/02/2006 23:02:11;0080; pbs_mom;Job;6222.master;scan_for_terminated: job
6222.master task 562 terminated, sid 30410
I would have expected the task to be 3395 in the last line, as TID 562 had
terminated a long time ago:
02/02/2006 15:32:12;0008; pbs_mom;Job;6222.master;start_process: task
started, tid 562, sid 30410, cmd /bin/sh
02/02/2006 15:33:33;0080; pbs_mom;Job;6222.master;scan_for_terminated: job
6222.master task 562 terminated, sid 30410
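To look for further occurrences, a small script along these lines could scan
the mom log. This is only a rough sketch: it assumes each log entry sits on a
single line (the entries above were wrapped by the mail client), and the
function name find_mismatches is made up for illustration. It remembers the
most recently started TID per SID and reports every termination message that
names a different TID for that SID.

```python
import re

# Patterns for the two pbs_mom messages quoted above, one entry per line.
START = re.compile(r"start_process: task started, tid (\d+), sid (\d+)")
TERM = re.compile(r"scan_for_terminated: job \S+ task (\d+) terminated, sid (\d+)")

def find_mismatches(lines):
    """Return (reported_tid, expected_tid, sid) for suspicious terminations."""
    latest_tid = {}   # sid -> tid of the most recently started task
    mismatches = []
    for line in lines:
        m = START.search(line)
        if m:
            latest_tid[int(m.group(2))] = int(m.group(1))
            continue
        m = TERM.search(line)
        if m:
            tid, sid = int(m.group(1)), int(m.group(2))
            expected = latest_tid.get(sid)
            # Flag a termination that names an older task for this SID.
            if expected is not None and expected != tid:
                mismatches.append((tid, expected, sid))
    return mismatches

# The four entries from this mail, in chronological order, unwrapped.
log = [
    "02/02/2006 15:32:12;0008; pbs_mom;Job;6222.master;start_process: "
    "task started, tid 562, sid 30410, cmd /bin/sh",
    "02/02/2006 15:33:33;0080; pbs_mom;Job;6222.master;scan_for_terminated: "
    "job 6222.master task 562 terminated, sid 30410",
    "02/02/2006 23:00:41;0008; pbs_mom;Job;6222.master;start_process: "
    "task started, tid 3395, sid 30410, cmd /bin/sh",
    "02/02/2006 23:02:11;0080; pbs_mom;Job;6222.master;scan_for_terminated: "
    "job 6222.master task 562 terminated, sid 30410",
]

print(find_mismatches(log))  # [(562, 3395, 30410)]
```
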
So I guess the SID range has wrapped around during the lifetime of the job,
so that SID 30410 was reused, which breaks the mapping from SID to TID.
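To illustrate the kind of mapping problem I have in mind, here is a toy model.
It is not taken from the TORQUE source; the Task class and both lookup
functions are hypothetical, and I am only assuming that mom keeps finished
tasks in the job's task list and resolves an exited session by SID. Under that
assumption a lookup that returns the first task with a matching SID picks the
stale entry, while one that skips already-exited tasks finds the right one.

```python
from dataclasses import dataclass

@dataclass
class Task:
    tid: int              # TM task id
    sid: int              # session id of the task's process
    exited: bool = False  # True once termination has been reported

def find_by_sid_naive(tasks, sid):
    """Hypothetical lookup: first task whose SID matches, exited or not."""
    for t in tasks:
        if t.sid == sid:
            return t
    return None

def find_by_sid_live(tasks, sid):
    """Hypothetical alternative: ignore tasks that have already exited."""
    for t in tasks:
        if t.sid == sid and not t.exited:
            return t
    return None

# Task 562 finished hours ago; task 3395 later got the same, reused SID.
tasks = [Task(562, 30410, exited=True), Task(3395, 30410)]

print(find_by_sid_naive(tasks, 30410).tid)  # 562, the stale task
print(find_by_sid_live(tasks, 30410).tid)   # 3395, the task that just ended
```

If something like the first lookup is what happens, the termination would be
attributed to task 562 and the mpiexec waiting on task 3395 would never be
notified, which matches what I see.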
What do you think about this?
--
Martin Schafföner
Cognitive Systems Group, Institute of Electronics, Signal Processing and
Communication Technologies, Department of Electrical Engineering,
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063