[Mauiusers] Problem with Torque/Maui Interaction
Wickliffe, Blake W
blake.wickliffe at aramco.com
Wed Mar 19 06:34:09 MDT 2008
Howdy,
I'm having a bit of a problem with the way Torque interacts with Maui. I've done a lot of searching on the web and the mail archive, but I can't seem to find anyone who has had the same problem.
We have a cluster of heterogeneous nodes. Most of them are compute nodes, but some are "master" nodes which have very high I/O capacity. Whenever we submit a job to the cluster, we assign one I/O node (master node) and some number of CPU (or compute) nodes, so a job submission looks something like:
echo "job.sh" | qsub -l nodes=1:master:ppn=2+128:compute:ppn=2
So far, so good. This works as expected with Torque and the pbs_sched scheduler or Torque and Maui.
But we'd like to make this easier for the users, so in qmgr we define a default queue, "parallel", which has, among other things:
create queue parallel
set queue parallel queue_type = Execution
set queue parallel resources_default.neednodes = 1:master:ppn=2+128:compute:ppn=2
set queue parallel resources_default.nodect = 129
set queue parallel resources_default.nodes = 1:master:ppn=2+128:compute:ppn=2
set queue parallel enabled = True
set queue parallel started = True
This way, the job submission above becomes:
echo "job.sh" | qsub
Still so far, so good... with pbs_sched.
Then, we replace pbs_sched with Maui and everything breaks. If you do a checkjob on a job submitted into a Torque/Maui environment, you get:
Req[0] TaskCount: 2 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [1][master][ppn=2+128][compute][ppn=2]
Req[1] TaskCount: 10 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [compute]
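It looks as if Maui is treating EVERYTHING separated by a colon in the parallel queue's resources_default.nodes line as a node feature, without first splitting the request at the "+" (note the [ppn=2+128] entry under Req[0]). No job ever runs.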
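To illustrate what I suspect is going on, here is a minimal shell sketch. It is only my guess at the behaviour, not Maui's actual parsing code, and the function names split_on_colons and split_per_request are made up for this example. Splitting the default nodes string on colons alone reproduces the feature list shown under Req[0], while splitting on "+" first and then on colons gives the two separate requests I expected.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; this is NOT Maui source code.

# Split a nodes spec on ':' only, ignoring '+', and print each token in
# brackets (the same style checkjob uses for its Features list).
# The function body is a subshell, so the IFS and set -f changes stay local.
split_on_colons() (
    IFS=':'
    set -f                      # no filename globbing on the tokens
    out=""
    for token in $1; do
        out="${out}[${token}]"
    done
    printf '%s\n' "$out"
)

# Split on '+' first (one line per node request), then on ':' inside each.
split_per_request() (
    IFS='+'
    set -f
    for request in $1; do
        split_on_colons "$request"
    done
)

spec='1:master:ppn=2+128:compute:ppn=2'

split_on_colons "$spec"
# prints: [1][master][ppn=2+128][compute][ppn=2]

split_per_request "$spec"
# prints: [1][master][ppn=2]
#         [128][compute][ppn=2]
```

The first output matches the Features line of Req[0] above token for token, which is why I think the "+" separator is being ignored when the value comes from the queue default rather than from the qsub command line.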
I am at my wit's end here. Has anyone seen this before? Better still, has anyone seen it and solved it?
Thanks in advance,
Blake Wickliffe
Saudi Aramco
ENOD/CSYS/USG HPC Team
(873-4417)