[torqueusers] using dependencies and arrays

Darren R Gitelman d-gitelman at northwestern.edu
Wed Sep 28 07:54:08 MDT 2011


I am having a problem with dependencies and job arrays. I've seen several messages
on the list about this, but no resolution. We are using Torque 2.5.8.

I first submit several jobs in an array: qsub -l -t 1-3
This returns jobid1[]

If I then submit a second job with qsub -l depend:afteranyarray:jobid1[]

the second job doesn't wait until the job array (jobid1[]) has completed. It
starts about 20 seconds after the jobs in the array have started and, of course,
fails because the results of jobid1 aren't ready yet. I've also tried
depend:afterokarray, depend:afterok and depend:afterany.

I've also tried submitting the second job with: qsub -W
depend:afteranyarray:jobid1[] (as well as the same permutations as above). In this
case the second job does hold... forever.
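
For reference, here is a minimal sketch of the two-step submission as I understand
it. The depend=afteranyarray:<id>[] form is the one recorded in the Submit Args of
the held job further down. The helper names short_job_id and depend_arg, and the
ARRAY_SCRIPT and NEXT_SCRIPT variables, are hypothetical and only for illustration.
Stripping the server suffix from the returned ID is an assumption made to match
that recorded form, not something I know to be required.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of TORQUE.

# Strip the server suffix from an ID returned by qsub,
# e.g. "1183766[].qsched01" becomes "1183766[]".
short_job_id() {
    printf '%s\n' "${1%%.*}"
}

# Build the value passed to -W, e.g. "depend=afteranyarray:1183766[]".
# $1 is the dependency type, $2 is the ID of the job array.
depend_arg() {
    printf 'depend=%s:%s\n' "$1" "$(short_job_id "$2")"
}

# Only contact the batch system when qsub exists and both scripts are named.
if command -v qsub >/dev/null 2>&1 \
    && [ -n "${ARRAY_SCRIPT:-}" ] && [ -n "${NEXT_SCRIPT:-}" ]; then
    # Submit the three-element array and keep the ID that qsub prints.
    array_id=$(qsub -t 1-3 "$ARRAY_SCRIPT")
    # Submit the follow-up job so that it depends on the whole array.
    qsub -W "$(depend_arg afteranyarray "$array_id")" "$NEXT_SCRIPT"
fi
```

Run with ARRAY_SCRIPT and NEXT_SCRIPT set to the two job scripts; without them the
script only defines the helpers and submits nothing.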

When I run checkjob on each job in the array I find they have all completed
successfully with an exit status of 0.
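
The per-job check can be sketched as a loop like the one below. The helper name
member_ids and the ARRAY_BASE_ID variable are hypothetical, and the <id>[<index>]
naming of the individual array members is an assumption about how the sub-jobs
are addressed at a given site.

```shell
#!/bin/sh
# Sketch only: member_ids is a hypothetical helper, not part of TORQUE or Moab.

# Print one sub-job ID per line for indices FIRST..LAST.
# Usage: member_ids BASE FIRST LAST
# The "BASE[INDEX]" naming is an assumption about the local setup.
member_ids() {
    i=$2
    while [ "$i" -le "$3" ]; do
        printf '%s[%s]\n' "$1" "$i"
        i=$((i + 1))
    done
}

# Only query the scheduler when checkjob exists and a base ID was given.
if command -v checkjob >/dev/null 2>&1 && [ -n "${ARRAY_BASE_ID:-}" ]; then
    # Read the IDs line by line so the brackets are never glob-expanded.
    member_ids "$ARRAY_BASE_ID" 1 3 | while IFS= read -r id; do
        checkjob "$id"
    done
fi
```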

When I run checkjob on the held job I get:

[xxx at quser04 ~]$ checkjob -vvv 1183767
job 1183767 (RM job '1183767.qsched01')

AName: xxx.defragment
State: Hold
Creds:  user:xxx  group:xxx  account:t20213  class:short
WallTime:   00:00:00 of 3:58:20
SubmitTime: Tue Sep 27 11:33:33
  (Time Queued  Total: 1:45:28  Eligible: 00:00:05)

NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1
Total Requested Nodes: 1

Req[0]  TaskCount: 1  Partition: ALL 
NodeCount:  1

IWD:            /home/xxx
UMask:          0000
OutputFile:     quser04:/home/xxx/./xxx_logs/xxx.defragment.o1183767
ErrorFile:      quser04:/home/xxx/./xxx_logs/xxx.defragment.e1183767
Partition List: quest1,quest2,questgpu1,SHARED
SrcRM:          torque  DstRM: torque  DstRMJID: 1183767.qsched01
Submit Args:    -V -d . -r y -q short -M d-xxx at xxx.edu -N xxx.defragment -m abe -o
./xxx_logs/ -e ./xxx_logs/ -l walltime=14300 -W depend=afteranyarray:1183766[]
/home/xxx/tempcmd20332
Flags:          RESTARTABLE
Attr:           checkpoint
StartPriority:  256
PE:             1.00
 NOTE:  job cannot run  (job has hold in place)
NOTE:  job violates constraints for partition hyperthread (non-idle state 'Hold')

NOTE:  job violates constraints for partition quest1 (non-idle state 'Hold')

NOTE:  job violates constraints for partition quest2 (non-idle state 'Hold')

NOTE:  job violates constraints for partition questgpu1 (non-idle state 'Hold')

NOTE:  job violates constraints for partition pim (non-idle state 'Hold')

BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)

I don't really understand what constraints the job is violating and why the
dependency isn't working with either -l or -W.

Thanks
Darren