[torqueusers] torque 2.5.10 - interactive jobs startup

Lukasz Flis l.flis at cyf-kr.edu.pl
Wed Feb 22 05:19:55 MST 2012


Hi,

Torque: 2.5.10

On busy clusters where many jobs share the same node, we can observe 
that some interactive jobs get interrupted during startup.

From the user's point of view, the problem manifests itself with the message:
qsub: Job apparently deleted

The corresponding pbs_mom log indicates an "Interrupted system call" 
(EINTR) during read() on a socket in the function rcvttype():

Feb 19 11:00:35 <daemon.err> n6-2-32.local pbs_mom[]: 
LOG_ERROR::Interrupted system call (4) in TMomFinalizeChild, cannot get 
termtype

It looks like the read() is interrupted by SIGCHLD or SIGALRM (pbs_mom 
defines a 5-second limit for rcvttype() to return, which might not be 
long enough on busy systems).
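
To illustrate the kind of change I have in mind, here is a minimal sketch 
of a read() wrapper that restarts the call when a signal handler interrupts 
it. This is not the actual pbs_mom code: the helper name read_retry_eintr() 
and the timed_out flag are hypothetical, and I am assuming the SIGALRM 
handler could set such a flag so that a genuine timeout still ends the wait 
while a stray SIGCHLD does not.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of TORQUE.
 * Reads up to len bytes from fd and restarts read() whenever it fails
 * with EINTR, unless *timed_out has been set (for example by a SIGALRM
 * handler). Pass NULL for timed_out to always retry on EINTR.
 * Returns what read() returns: bytes read, 0 on EOF, -1 on error.
 */
static ssize_t read_retry_eintr(int fd, void *buf, size_t len,
                                const volatile sig_atomic_t *timed_out)
{
    ssize_t n;

    assert(buf != NULL);

    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR &&
             !(timed_out != NULL && *timed_out));

    return n;
}
```

With something like this, an interruption by SIGCHLD would simply repeat 
the read, and only the alarm expiring would make the call give up with 
EINTR. Whether 5 seconds is the right limit on a loaded node is a separate 
question.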


I didn't have time to write a fix for it, but it should be trivial.


Cheers
--
Lukasz Flis


