[torquedev] torque 2.5.10 - interactive jobs startup
Lukasz Flis
l.flis at cyf-kr.edu.pl
Wed Feb 22 05:19:55 MST 2012
Hi,
Torque: 2.5.10
On busy clusters where many jobs share the same node, we observe
that some interactive jobs get interrupted during startup.
From the user's point of view, the problem manifests itself with the message:
qsub: Job apparently deleted
The corresponding pbs_mom log file indicates an interrupted system call
during read() on a socket in the function rcvttype:
Feb 19 11:00:35 <daemon.err> n6-2-32.local pbs_mom[]:
LOG_ERROR::Interrupted system call (4) in TMomFinalizeChild, cannot get
termtype
It looks like the read is interrupted by SIGCHLD or SIGALRM (pbs_mom
defines a 5-second limit for rcvttype() to return, which might not be
long enough on busy systems).
I didn't have time to write the fix for it, but it should be trivial.
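
A rough sketch of the kind of fix I have in mind is below. It is not taken
from the Torque source: read_retry_eintr() is a hypothetical helper, and it
assumes the time limit can be enforced with poll() and a monotonic clock
instead of an alarm signal. The idea is simply to retry when a signal such
as SIGCHLD interrupts the call, and to give up only once the overall
deadline has passed.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Torque): read up to len bytes from fd.
 * A call interrupted by a signal (EINTR) is retried until timeout_ms
 * milliseconds have elapsed in total.
 * Returns the byte count from read(), or -1 with errno set
 * (ETIMEDOUT if the deadline expired before any data arrived).
 */
ssize_t read_retry_eintr(int fd, void *buf, size_t len, int timeout_ms)
{
    struct timespec start, now;
    int remaining = timeout_ms;

    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, remaining);

        if (rc > 0) {
            /* Data, EOF or an error is pending: try the read itself. */
            ssize_t n = read(fd, buf, len);
            if (n >= 0 || errno != EINTR)
                return n;
        } else if (rc == 0) {
            /* poll() waited the full remaining time without any data. */
            errno = ETIMEDOUT;
            return -1;
        } else if (errno != EINTR) {
            /* A real poll() failure, not a signal interruption. */
            return -1;
        }

        /* Interrupted by a signal: work out how much time is left. */
        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (long)(now.tv_sec - start.tv_sec) * 1000L
                        + (now.tv_nsec - start.tv_nsec) / 1000000L;
        if (elapsed_ms >= timeout_ms) {
            errno = ETIMEDOUT;
            return -1;
        }
        remaining = timeout_ms - (int)elapsed_ms;
    }
}
```

The caller would pass the qsub socket and whatever limit is considered safe
for a loaded node, for example read_retry_eintr(sock, buf, sizeof(buf), 30000).
The 30-second value is only an illustration, not a tested recommendation.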
Cheers
--
Lukasz Flis