[torqueusers] Can't get past "ALERT: jobs active on node but state is Idle"
Nicolas Bigaouette
nbigaouette at gmail.com
Tue Nov 30 18:39:07 MST 2010
Hi,
I sent an email to mauiusers at supercluster.org, but I think my issue is more
about Torque than Maui, so I'm forwarding it here. I also discovered
something else while trying to diagnose the problem.
I've been trying to set up Torque+Maui without success.
The cluster here is composed of 70 nodes connected with InfiniBand (IB) and
5 nodes without IB but with better processors. Each node has 2 quad-core
processors with hyperthreading enabled, for a total of 16 logical cores per node.
I want to set up a queue to run on the non-IB nodes, which I call the
supernodes. There are 80 cores on these. Here is Torque's queue config:
set queue supernodes queue_type = Execution
set queue supernodes resources_available.nodes = 5
set queue supernodes resources_default.neednodes = supernode
set queue supernodes resources_default.walltime = 00:10:00
set queue supernodes enabled = True
set queue supernodes started = True
set server scheduling = True
set server acl_hosts = mydomain.com
set server operators = root at mydomain.com
set server operators += me at mydomain.com
set server default_queue = supernodes
set server log_events = 511
set server mail_from = adm
set server resources_available.ncpus = 1200
set server resources_available.nodect = 75
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server auto_node_np = False
set server next_job_number = 90842
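As a quick cross-check, the counts in this config can be derived from the hardware described above. This is only an arithmetic sketch; the constant names are illustrative and not Torque settings.

```python
# Hardware figures as described in this message.
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4
THREADS_PER_CORE = 2  # hyperthreading enabled
IB_NODES = 70
SUPERNODES = 5

logical_cpus_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET * THREADS_PER_CORE
total_nodes = IB_NODES + SUPERNODES
supernode_slots = SUPERNODES * logical_cpus_per_node
total_cpus = total_nodes * logical_cpus_per_node

print(logical_cpus_per_node)  # compare with np= in the nodes file
print(supernode_slots)        # cores available to the supernodes queue
print(total_nodes)            # compare with resources_available.nodect
print(total_cpus)             # compare with resources_available.ncpus
```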
The node file "/var/spool/torque/server_priv/nodes" contains (for the
supernodes):
node101 np=16 supernode
node102 np=16 supernode
node103 np=16 supernode
node104 np=16 supernode
node105 np=16 supernode
I can submit and run a job only if the number of CPUs requested is less than or
equal to the number of cores of a single node: -l nodes=1:ppn=5. But if I
request 5 cores on each of 2 different nodes (-l nodes=2:ppn=5), the job stays in a
"queued" state indefinitely. Maui's log file does not seem to report
anything suspicious, even at log level 7. checkjob reports which nodes the job is
scheduled to run on (node104 and node105):
$ checkjob 90843
checking job 90843
State: Running
Creds: user:me group:me class:supernodes qos:DEFAULT
WallTime: 00:00:00 of 00:00:30
SubmitTime: Tue Nov 30 18:04:11
(Time Queued Total: 00:12:19 Eligible: 00:12:19)
StartTime: Tue Nov 30 18:16:30
Total Tasks: 10
Req[0] TaskCount: 10 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [supernode]
Allocated Nodes:
[node105:5][node104:5]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 739
PartitionMask: [ALL]
Reservation '90843' (00:00:00 -> 00:00:30 Duration: 00:00:30)
PE: 10.00 StartPriority: 12
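To rule out the request itself as the problem, here is a minimal sketch that checks whether a simple "nodes=N:ppn=P" request can be satisfied by the nodes file shown earlier. The function names are hypothetical, only the plain N:ppn=P form is handled, and treating a missing np= as 1 is an assumption of this sketch.

```python
NODES_FILE = """\
node101 np=16 supernode
node102 np=16 supernode
node103 np=16 supernode
node104 np=16 supernode
node105 np=16 supernode
"""


def parse_nodes_file(text):
    """Map node name -> (np, set of properties) from nodes-file lines."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        np, props = 1, set()  # np=1 when unspecified is an assumption here
        for field in fields[1:]:
            if field.startswith("np="):
                np = int(field[3:])
            elif "=" not in field:
                props.add(field)
        nodes[fields[0]] = (np, props)
    return nodes


def parse_request(spec):
    """Split 'nodes=N:ppn=P' into (N, P); ppn falls back to 1 if absent."""
    parts = spec.split(":")
    count = int(parts[0].split("=", 1)[1])
    ppn = 1
    for part in parts[1:]:
        if part.startswith("ppn="):
            ppn = int(part[4:])
    return count, ppn


def request_fits(spec, nodes, feature):
    """Return (fits, total_tasks) for nodes carrying the given property."""
    count, ppn = parse_request(spec)
    eligible = [name for name, (np, props) in nodes.items()
                if feature in props and np >= ppn]
    return len(eligible) >= count, count * ppn


print(request_fits("nodes=2:ppn=5", parse_nodes_file(NODES_FILE), "supernode"))
```

For -l nodes=2:ppn=5 this yields 10 tasks, which agrees with the "Total Tasks: 10" line in the checkjob output, so the request looks satisfiable on paper.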
But when I then check each node with checknode, the node appears idle:
$ checknode -v node104
checking node node104
State: Idle (in current state for 00:00:00)
Expected State: Running SyncDeadline: Tue Nov 30 18:27:34
Configured Resources: PROCS: 16 MEM: 23G SWAP: 23G DISK: 1M
Utilized Resources: PROCS: 5
Dedicated Resources: PROCS: 5
Opsys: DEFAULT Arch: [NONE]
Speed: 1.00 Load: 5.000
Location: Partition: DEFAULT Frame/Slot: 1/1
Network: [DEFAULT]
Features: [supernode]
Attributes: [Batch]
Classes: [supernodes 11:16]
Total Time: INFINITY Up: INFINITY (73.79%) Active: 00:01:42 (0.00%)
Reservations:
Job '90843'(x5) 00:00:00 -> 00:00:30 (00:00:30)
JobList: 90843
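The inconsistency in this output (state Idle while processors are dedicated and a job is listed) can be spotted mechanically across many nodes. Below is a rough sketch that scans checknode text for that combination. The function name is hypothetical, and the parsing assumes the field labels exactly as they appear above and that an empty job list leaves nothing after "JobList:".

```python
import re


def idle_but_busy(checknode_output):
    """True when a node reports State: Idle yet lists jobs or dedicated procs."""
    # Anchor at line start so the "Expected State:" line is not matched.
    state = re.search(r"^[>\s]*State:\s*(\S+)", checknode_output, re.MULTILINE)
    jobs = re.search(r"^[>\s]*JobList:[ \t]*(\S.*)$", checknode_output,
                     re.MULTILINE)
    procs = re.search(r"Dedicated Resources:\s*PROCS:\s*(\d+)",
                      checknode_output)
    if state is None or state.group(1) != "Idle":
        return False
    has_procs = procs is not None and int(procs.group(1)) > 0
    return jobs is not None or has_procs


# Excerpt of the node104 output quoted in this message.
SAMPLE = """\
checking node node104
State: Idle (in current state for 00:00:00)
Expected State: Running SyncDeadline: Tue Nov 30 18:27:34
Utilized Resources: PROCS: 5
Dedicated Resources: PROCS: 5
JobList: 90843
"""

print(idle_but_busy(SAMPLE))
```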
Google does not say much about the "ALERT: jobs active on node but state is
Idle" message. Looking at the pbs_mom log files on the two nodes, I get these:
node105:
11/30/2010 20:10:09;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
11/30/2010 20:10:09;0008; pbs_mom;Job;90857.unicron.cl.uottawa.ca;Job Modified at request of PBS_Server at mycluster
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
11/30/2010 20:10:09;0008; pbs_mom;Job;90857.mycluster;checking job post-processing routine
11/30/2010 20:10:09;0080; pbs_mom;Job;90857.mycluster;obit sent to server
11/30/2010 20:10:10;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 90857.mycluster, job_start_error from node 10.0.0.104:15003 in job_start_error
11/30/2010 20:10:10;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 90857.mycluster, abort attempted 16 times in job_start_error. ignoring abort request from node 10.0.0.104:15003
[repeated many times, until I cancel the job]
node104:
11/30/2010 20:31:31;0008; pbs_mom;Job;90858.mycluster;no group entry for group me, user=me, errno=0 (Success)
11/30/2010 20:31:31;0008; pbs_mom;Job;90858.mycluster;ERROR: received request 'ABORT_JOB' from 10.0.0.105:1023 for job '90858.mycluster' (job does not exist locally)
[repeated many times, until I cancel the job]
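Because these records repeat many times, it helps to collapse them into counts per distinct message. The sketch below splits each record on semicolons, following the layout visible in the lines above. The field names (timestamp, event, daemon, and so on) are my own labels and not official Torque terminology.

```python
from collections import Counter


def parse_mom_record(line):
    """Split one semicolon-delimited log record into a dict, or return None."""
    parts = line.split(";", 5)  # the message itself may contain ';'
    if len(parts) != 6:
        return None
    keys = ("timestamp", "event", "daemon", "kind", "name", "message")
    return dict(zip(keys, (part.strip() for part in parts)))


def summarize(lines):
    """Count how often each (event, message) pair occurs."""
    counts = Counter()
    for line in lines:
        record = parse_mom_record(line)
        if record is not None:
            counts[(record["event"], record["message"])] += 1
    return counts


# Two records copied from the node104 log above, the first one repeated.
LOG = [
    "11/30/2010 20:31:31;0008; pbs_mom;Job;90858.mycluster;no group entry "
    "for group me, user=me, errno=0 (Success)",
    "11/30/2010 20:31:31;0008; pbs_mom;Job;90858.mycluster;no group entry "
    "for group me, user=me, errno=0 (Success)",
    "11/30/2010 20:31:31;0008; pbs_mom;Job;90858.mycluster;ERROR: received "
    "request 'ABORT_JOB' from 10.0.0.105:1023 for job '90858.mycluster' "
    "(job does not exist locally)",
]

for (event, message), count in summarize(LOG).most_common():
    print(count, event, message)
```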
On the headnode, running "groups" as my user returns my username as the
group, but on each node it returns 1009, the group ID. Could that be the
problem? Using "getent" on the headnode I get:
$ getent passwd me
me:x:1001:1009:My Name:/home/me:/bin/bash
while on a compute node I get:
$ getent passwd me
me:Password_checksum:1001:1009:My Name:/home/me:/bin/bash
On either the headnode or compute nodes, using "ypcat group" returns
nothing. Shouldn't that return the same groups as "groups" on the headnode?
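One way to compare the headnode and the compute nodes is to ask each machine whether the user's primary GID resolves to a group name. This is the same kind of lookup that "groups" performs, and a failure would be consistent with the "no group entry for group me" line in the node104 log. Here is a minimal sketch using Python's standard pwd and grp modules; the function name is hypothetical, and the lookups are passed in as parameters only so the logic can be exercised without a real account.

```python
import grp
import pwd


def primary_group_status(user, getpwnam=pwd.getpwnam, getgrgid=grp.getgrgid):
    """Return (gid, group_name); group_name is None if the GID has no entry."""
    gid = getpwnam(user).pw_gid
    try:
        return gid, getgrgid(gid).gr_name
    except KeyError:
        # The passwd entry exists but its GID is missing from the group
        # database: "groups" would then print the bare number.
        return gid, None
```

Running primary_group_status("me") on the headnode and on a compute node should show whether GID 1009 has a name on each of them.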
I really don't know what to check next.
Thank you very much for any suggestion.
Regards,
Nicolas