[Mauiusers] Can't get past "ALERT: jobs active on node but state is Idle"
Nicolas Bigaouette
nbigaouette at gmail.com
Tue Nov 30 16:23:37 MST 2010
Hi,
I've been trying to set up Torque+Maui without success.
The cluster here is composed of 70 nodes connected with InfiniBand (IB) and
5 nodes without IB but with better processors. Each node has 2 quad-core
processors with hyperthreading enabled, for a total of 16 logical cores per
node. I want to set up a queue to run on the non-IB nodes, which I call the
supernodes. There are 80 cores on these. Here is Torque's queue config:
set queue supernodes queue_type = Execution
set queue supernodes resources_available.nodes = 5
set queue supernodes resources_default.neednodes = supernode
set queue supernodes resources_default.walltime = 00:10:00
set queue supernodes enabled = True
set queue supernodes started = True
set server scheduling = True
set server acl_hosts = mydomain.com
set server operators = root at mydomain.com
set server operators += me at mydomain.com
set server default_queue = supernodes
set server log_events = 511
set server mail_from = adm
set server resources_available.ncpus = 1200
set server resources_available.nodect = 75
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server auto_node_np = False
set server next_job_number = 90842
The node file "/var/spool/torque/server_priv/nodes" contains (for the
supernodes):
node101 np=16 supernode
node102 np=16 supernode
node103 np=16 supernode
node104 np=16 supernode
node105 np=16 supernode
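
As a sanity check on the 80-core figure, here is a minimal Python sketch that tallies the `np` values for one node feature. The parsing rules (whitespace-separated tokens, `np=<int>`, any other bare token treated as a feature, `np` defaulting to 1) are my assumptions about the nodes file format, and `cores_with_feature` is a hypothetical helper, not part of Torque.

```python
# Sketch: count the cores Torque should see for one node feature.
# NODES_TEXT copies the nodes-file excerpt above.
NODES_TEXT = """\
node101 np=16 supernode
node102 np=16 supernode
node103 np=16 supernode
node104 np=16 supernode
node105 np=16 supernode
"""


def cores_with_feature(text, feature):
    """Sum np over all nodes whose line lists the given feature."""
    total = 0
    for line in text.splitlines():
        tokens = line.split()
        # Skip blank lines and comment lines.
        if not tokens or tokens[0].startswith("#"):
            continue
        np_count = 1  # assumed default when np= is not given
        features = []
        for token in tokens[1:]:
            if token.startswith("np="):
                np_count = int(token[3:])
            elif "=" not in token:
                features.append(token)
        if feature in features:
            total += np_count
    return total


print(cores_with_feature(NODES_TEXT, "supernode"))  # prints 80
```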
I can submit and run a job only if the number of CPUs requested is less than
or equal to the number of cores of a single node, e.g. -l nodes=1:ppn=5. But
if I request 5 cores on each of 2 different nodes (-l nodes=2:ppn=5), the job
stays in a "queued" state indefinitely. Maui's log file does not seem to
report anything suspicious, even at log level 7. Checkjob reports which nodes
the job is scheduled to run on (node104 and node105):
$ checkjob 90843
checking job 90843

State: Running
Creds: user:me group:me class:supernodes qos:DEFAULT
WallTime: 00:00:00 of 00:00:30
SubmitTime: Tue Nov 30 18:04:11
(Time Queued Total: 00:12:19 Eligible: 00:12:19)

StartTime: Tue Nov 30 18:16:30
Total Tasks: 10

Req[0] TaskCount: 10 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [supernode]
Allocated Nodes:
[node105:5][node104:5]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 739
PartitionMask: [ALL]
Reservation '90843' (00:00:00 -> 00:00:30 Duration: 00:00:30)
PE: 10.00 StartPriority: 12
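
The "Total Tasks: 10" above is what I would expect from the request, since nodes=2:ppn=5 means 2 nodes times 5 processors each. A small Python sketch of that arithmetic follows. `task_count` is a hypothetical helper that only handles the simple numeric `nodes=N:ppn=M` form used here; it is not how Torque or Maui parse the request.

```python
# Sketch: total tasks implied by a simple "-l nodes=N:ppn=M" request.
def task_count(spec):
    """Return nodes * ppn for a spec such as 'nodes=2:ppn=5'.

    Only a numeric node count is supported; a missing ppn counts as 1
    (an assumption made for this illustration).
    """
    fields = spec.split(":")
    key, _, value = fields[0].partition("=")
    if key != "nodes":
        raise ValueError("expected a spec starting with 'nodes='")
    nodes = int(value)
    ppn = 1
    for field in fields[1:]:
        name, _, number = field.partition("=")
        if name == "ppn":
            ppn = int(number)
    return nodes * ppn


print(task_count("nodes=1:ppn=5"))  # prints 5, the request that runs
print(task_count("nodes=2:ppn=5"))  # prints 10, the request that stays queued
```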
But when I then check each of those nodes with checknode, the node is reported as idle:
$ checknode -v node104
checking node node104

State: Idle (in current state for 00:00:00)
Expected State: Running SyncDeadline: Tue Nov 30 18:27:34
Configured Resources: PROCS: 16 MEM: 23G SWAP: 23G DISK: 1M
Utilized Resources: PROCS: 5
Dedicated Resources: PROCS: 5
Opsys: DEFAULT Arch: [NONE]
Speed: 1.00 Load: 5.000
Location: Partition: DEFAULT Frame/Slot: 1/1
Network: [DEFAULT]
Features: [supernode]
Attributes: [Batch]
Classes: [supernodes 11:16]

Total Time: INFINITY Up: INFINITY (73.79%) Active: 00:01:42 (0.00%)

Reservations:
Job '90843'(x5) 00:00:00 -> 00:00:30 (00:00:30)
JobList: 90843
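
To make the mismatch concrete: the node is reported as Idle while 5 processors are dedicated to the job. The Python sketch below flags that combination in checknode text. The regular expressions are my assumptions based only on the output shown above, and `idle_but_dedicated` is a hypothetical helper; it illustrates the symptom and does not reproduce Maui's internal check.

```python
import re

# Relevant lines copied from the checknode output above.
CHECKNODE_TEXT = """\
checking node node104
State: Idle (in current state for 00:00:00)
Expected State: Running SyncDeadline: Tue Nov 30 18:27:34
Configured Resources: PROCS: 16 MEM: 23G SWAP: 23G DISK: 1M
Utilized Resources: PROCS: 5
Dedicated Resources: PROCS: 5
"""

# Match "State:" only at the start of a line, so that the
# "Expected State:" line is not picked up by mistake.
STATE_RE = re.compile(r"^[> \t]*State:\s+(\w+)", re.MULTILINE)
DEDICATED_RE = re.compile(r"Dedicated Resources:\s+PROCS:\s+(\d+)")


def idle_but_dedicated(text):
    """True if the node state is Idle while processors are dedicated."""
    state = STATE_RE.search(text)
    dedicated = DEDICATED_RE.search(text)
    # If either line is absent, report no mismatch.
    if state is None or dedicated is None:
        return False
    return state.group(1) == "Idle" and int(dedicated.group(1)) > 0


print(idle_but_dedicated(CHECKNODE_TEXT))  # prints True
```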
Google does not say much about the "ALERT: jobs active on node but state is
Idle"... Does anyone have a clue?
Thank you very much.
Regards,
Nicolas