[torquedev] stage(in) failed jobs policy

Ramon Bastiaans ramon.bastiaans at sara.nl
Wed Nov 3 05:07:14 MDT 2010


Is there a mechanism/policy in Torque for handling jobs whose 
stagein fails?

Because of various submission scripts (unrelated to this issue), jobs 
are sometimes submitted to our cluster with "stagein" directives 
pointing to non-existent files.

Whenever this happens, the server log shows the following:

11/03/2010 11:35:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from TRANSIT-TRANSICM to QUEUED-PRESTAGEIN (1-11)
11/03/2010 11:35:22;0100;PBS_Server;Job;8056228.ce.XXX;enqueuing into infra, state 1 hop 1
11/03/2010 11:35:22;0008;PBS_Server;Job;8056228.ce.XXX;Job Queued at request of ramonb at ce.gina.sara.nl, owner = ramonb at ce.XXX, job name = testjob.sh, queue = infra
11/03/2010 11:35:30;0008;PBS_Server;Job;8056228.ce.XXX;Job Run at request of root at ce.XXX
11/03/2010 11:35:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/03/2010 11:35:35;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)

Then what? Is there a way to tell Torque, for example, to discard/delete 
all jobs where staging fails?

What happens now is that a CPU remains allocated (in Maui), and these 
stagefail jobs clog up the scheduling and prevent new jobs from being 
run on that CPU.

In addition, once the number of jobs that have failed staging equals 
the "RESERVATIONDEPTH" in Maui, no new jobs get started at all and 
the cluster stays empty.

It's also impossible to test the staging in the prologue, because the 
prologue is only executed AFTER the staging is done.

I could write a shell script to automatically delete stagefail-ed jobs, 
but that seems like a dirty workaround to me. If there is no mechanism 
in place for this, perhaps this is a feature worth considering.
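
For reference, the workaround I have in mind would look roughly like the 
sketch below. It rests on assumptions I have not verified everywhere: that 
"qstat -f" prints a "Job Id:" header per job followed by a "substate = N" 
attribute line, and that 37 is the STAGEFAIL substate (taken from the 
"WAITING-STAGEFAIL (3-37)" log line above). The function names are my own, 
not part of Torque.

```python
import re
import subprocess

# Assumption: substate 37 is STAGEFAIL, as in "WAITING-STAGEFAIL (3-37)".
STAGEFAIL_SUBSTATE = 37


def stagefail_job_ids(qstat_full_output):
    """Return the IDs of jobs whose substate is STAGEFAIL.

    Expects text in the assumed `qstat -f` layout: a "Job Id: <id>" line
    per job, followed by indented "name = value" attribute lines.
    """
    job_ids = []
    current_id = None
    for line in qstat_full_output.splitlines():
        header = re.match(r"\s*Job Id:\s*(\S+)", line)
        if header:
            current_id = header.group(1)
            continue
        attr = re.match(r"\s*substate\s*=\s*(\d+)\s*$", line)
        if attr and current_id and int(attr.group(1)) == STAGEFAIL_SUBSTATE:
            job_ids.append(current_id)
    return job_ids


def delete_stagefail_jobs():
    """Run `qstat -f`, then `qdel` every job found in STAGEFAIL."""
    result = subprocess.run(
        ["qstat", "-f"], capture_output=True, text=True, check=True
    )
    deleted = []
    for job_id in stagefail_job_ids(result.stdout):
        # check=False: one failed qdel should not stop the rest.
        subprocess.run(["qdel", job_id], check=False)
        deleted.append(job_id)
    return deleted
```

Calling delete_stagefail_jobs() from cron on the server would clear the 
stuck jobs periodically, but it only polls; it does not stop the CPU from 
being held between runs.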


Kind regards,
- Ramon.

-- 
R. Bastiaans, B.ICT :: Systems Programmer, HPC&V

SARA - Computing & Networking Services
Science Park 121     PO Box 94613
1098 XG Amsterdam NL 1090 GP Amsterdam NL
P.+31 (0)20 592 3000 F.+31 (0)20 668 3167

