[torqueusers] stage(in) failed jobs policy
Ramon Bastiaans
ramon.bastiaans at sara.nl
Wed Nov 3 05:07:14 MDT 2010
Is there a mechanism/policy in Torque for handling jobs where
stagein fails?
Due to miscellaneous submission scripts, unrelated to this issue,
sometimes jobs are submitted to our cluster with "stagein" directives
pointing to non-existent files.
Whenever this happens, the server log shows the following:
11/03/2010 11:35:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
setting job 8056228.ce.XXX state from TRANSIT-TRANSICM to
QUEUED-PRESTAGEIN (1-11)
11/03/2010 11:35:22;0100;PBS_Server;Job;8056228.ce.XXX;enqueuing into
infra, state 1 hop 1
11/03/2010 11:35:22;0008;PBS_Server;Job;8056228.ce.XXX;Job Queued at
request of ramonb at ce.gina.sara.nl, owner = ramonb at ce.XXX, job name =
testjob.sh, queue = infra
11/03/2010 11:35:30;0008;PBS_Server;Job;8056228.ce.XXX;Job Run at
request of root at ce.XXX
11/03/2010 11:35:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
setting job 8056228.ce.XXX state from QUEUED-PRESTAGEIN to
RUNNING-STAGEGO (4-15)
11/03/2010 11:35:35;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
setting job 8056228.ce.XXX state from RUNNING-STAGEGO to
WAITING-STAGEFAIL (3-37)
Then what? Is there a way to tell Torque for example to discard/delete
all jobs where staging fails?
What happens now is that a CPU remains allocated (in Maui), and these
stagefail jobs clog up the scheduling and prevent new jobs
from being run on that CPU.
In addition, once the number of jobs that have failed staging reaches
the "RESERVATIONDEPTH" in Maui, no new jobs get started at all and
the cluster stays empty.
It's also impossible to test the staging in the prologue, because
prologue is only executed AFTER the staging is done.
I could write a shell script to automatically delete stagefail-ed jobs,
but that seems like a dirty workaround to me. If there is no mechanism
in place for this, perhaps this is a feature worth considering.
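
For illustration, the workaround mentioned above could look roughly like the sketch below (in Python rather than shell). It rests on assumptions: that "qstat -f" prints a "Job Id:" line and a "substate = N" line for each job, and that substate 37 means STAGEFAIL, as the "(3-37)" in the log above suggests. The function names are made up for this example, and it has not been run against a live server.

```python
import re
import subprocess

# Assumption: 37 is the STAGEFAIL substate, taken from
# "WAITING-STAGEFAIL (3-37)" in the server log above.
STAGEFAIL_SUBSTATE = 37


def find_stagefail_jobs(qstat_full_output):
    """Return the ids of jobs whose substate equals STAGEFAIL_SUBSTATE.

    Expects text in the style of "qstat -f": a "Job Id: <id>" line
    followed by indented "attribute = value" lines for that job.
    """
    failed = []
    current_job = None
    for line in qstat_full_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Job Id:"):
            current_job = stripped.split(":", 1)[1].strip()
            continue
        match = re.fullmatch(r"substate\s*=\s*(\d+)", stripped)
        if match and current_job and int(match.group(1)) == STAGEFAIL_SUBSTATE:
            failed.append(current_job)
    return failed


def delete_stagefail_jobs(dry_run=True):
    """List stagefail jobs via qstat and, unless dry_run is set, qdel them."""
    result = subprocess.run(
        ["qstat", "-f"], capture_output=True, text=True, check=True
    )
    jobs = find_stagefail_jobs(result.stdout)
    for job_id in jobs:
        if dry_run:
            print("would delete", job_id)
        else:
            subprocess.run(["qdel", job_id], check=True)
    return jobs
```

Calling delete_stagefail_jobs() from cron with dry_run=True first would show which jobs would be removed before anything is actually deleted.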
Kind regards,
- Ramon.
--
R. Bastiaans, B.ICT :: Systems Programmer, HPC&V
SARA - Computing & Networking Services
Science Park 121 PO Box 94613
1098 XG Amsterdam NL 1090 GP Amsterdam NL
P.+31 (0)20 592 3000 F.+31 (0)20 668 3167