[torqueusers] Can't get past "ALERT: jobs active on node but state is Idle"

Nicolas Bigaouette nbigaouette at gmail.com
Tue Nov 30 18:39:07 MST 2010


Hi,

I sent an email to mauiusers@supercluster.org, but I think my issue is more
about Torque than Maui, so I'm forwarding it here. Also, I discovered
something else while trying to diagnose my problem.

I've been trying to set up Torque+Maui without success.

The cluster here is composed of 70 nodes connected with InfiniBand (IB) and
5 nodes without IB but with better processors. Each node has 2 quad-core
processors with hyperthreading enabled, for a total of 16 cores per node.

I want to set up a queue to run on the non-IB nodes, which I call the
supernodes. There are 80 cores in total on these. Here is Torque's queue config:

> set queue supernodes queue_type = Execution
> set queue supernodes resources_available.nodes = 5
> set queue supernodes resources_default.neednodes = supernode
> set queue supernodes resources_default.walltime = 00:10:00
> set queue supernodes enabled = True
> set queue supernodes started = True
>
> set server scheduling = True
> set server acl_hosts = mydomain.com
> set server operators = root@mydomain.com
> set server operators += me@mydomain.com
> set server default_queue = supernodes
> set server log_events = 511
> set server mail_from = adm
> set server resources_available.ncpus = 1200
> set server resources_available.nodect = 75
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server auto_node_np = False
> set server next_job_number = 90842
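(The settings above are in qmgr's dump format; for reference, they get entered
with commands along these lines -- a sketch, not the full list:)

    # Rough form of the qmgr commands behind the config above (sketch only)
    qmgr -c "create queue supernodes queue_type=execution"
    qmgr -c "set queue supernodes resources_default.neednodes = supernode"
    qmgr -c "set queue supernodes resources_default.walltime = 00:10:00"
    qmgr -c "set queue supernodes enabled = true"
    qmgr -c "set queue supernodes started = true"
    qmgr -c "set server default_queue = supernodes"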


The node file "/var/spool/torque/server_priv/nodes" contains (for the
supernodes):

> node101 np=16 supernode
> node102 np=16 supernode
> node103 np=16 supernode
> node104 np=16 supernode
> node105 np=16 supernode
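For reference, the submissions I describe below look roughly like this (the
script name is just a placeholder; supernodes is the default queue, so no -q
is needed):

    # Runs fine: 5 cores on a single supernode
    qsub -l nodes=1:ppn=5,walltime=00:00:30 myjob.sh

    # Stays queued forever: 5 cores on each of 2 supernodes
    qsub -l nodes=2:ppn=5,walltime=00:00:30 myjob.sh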


I can submit and run a job only if the number of CPUs requested is less than
or equal to the number of cores on a single node (-l nodes=1:ppn=5). But if I
request 5 cores on each of 2 different nodes (-l nodes=2:ppn=5), the job stays
in the "queued" state indefinitely. Maui's log file does not seem to report
anything suspicious, even at log level 7. Checkjob reports which nodes the job
is scheduled to run on (node104 and node105):

> $ checkjob 90843
>
> checking job 90843
>
> State: Running
> Creds:  user:me  group:me  class:supernodes  qos:DEFAULT
> WallTime: 00:00:00 of 00:00:30
> SubmitTime: Tue Nov 30 18:04:11
>   (Time Queued  Total: 00:12:19  Eligible: 00:12:19)
>
> StartTime: Tue Nov 30 18:16:30
> Total Tasks: 10
>
> Req[0]  TaskCount: 10  Partition: DEFAULT
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [supernode]
> Allocated Nodes:
> [node105:5][node104:5]
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 739
> PartitionMask: [ALL]
> Reservation '90843' (00:00:00 -> 00:00:30  Duration: 00:00:30)
> PE:  10.00  StartPriority:  12


But when I then check each of those nodes with checknode, the node appears idle:

> $ checknode -v node104
>
> checking node node104
>
> State:      Idle  (in current state for 00:00:00)
> Expected State:  Running   SyncDeadline: Tue Nov 30 18:27:34
> Configured Resources: PROCS: 16  MEM: 23G  SWAP: 23G  DISK: 1M
> Utilized   Resources: PROCS: 5
> Dedicated  Resources: PROCS: 5
> Opsys:       DEFAULT  Arch:      [NONE]
> Speed:      1.00  Load:       5.000
> Location:   Partition: DEFAULT  Frame/Slot:  1/1
> Network:    [DEFAULT]
> Features:   [supernode]
> Attributes: [Batch]
> Classes:    [supernodes 11:16]
>
> Total Time:   INFINITY  Up:   INFINITY (73.79%)  Active: 00:01:42 (0.00%)
>
> Reservations:
>   Job '90843'(x5)  00:00:00 -> 00:00:30 (00:00:30)
> JobList:  90843
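(For reference, Torque's own view of the node can be checked from the headnode
as well; these are standard Torque commands:)

    pbsnodes node104         # the server's record of the node: state, np, properties, jobs
    momctl -d 3 -h node104   # query the pbs_mom on the node directly (may need root)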


Google does not say much about the "ALERT: jobs active on node but state is
Idle" message. Looking at the pbs_mom log file on the two nodes, I get these:

node105:

> 11/30/2010 20:10:09;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 11/30/2010 20:10:09;0008;   pbs_mom;Job;90857.unicron.cl.uottawa.ca;Job Modified at request of PBS_Server@mycluster
> 11/30/2010 20:10:09;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 11/30/2010 20:10:09;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 11/30/2010 20:10:09;0080;   pbs_mom;Svr;preobit_reply;cannot locate job that triggered req
> 11/30/2010 20:10:09;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 11/30/2010 20:10:09;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 11/30/2010 20:10:09;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 11/30/2010 20:10:09;0008;   pbs_mom;Job;90857.mycluster;checking job post-processing routine
> 11/30/2010 20:10:09;0080;   pbs_mom;Job;90857.mycluster;obit sent to server
> 11/30/2010 20:10:10;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 90857.mycluster, job_start_error from node 10.0.0.104:15003 in job_start_error
> 11/30/2010 20:10:10;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 90857.mycluster, abort attempted 16 times in job_start_error.  ignoring abort request from node 10.0.0.104:15003

[repeated many times, until I cancel the job]

node104:

> 11/30/2010 20:31:31;0008;   pbs_mom;Job;90858.mycluster;no group entry for group me, user=me, errno=0 (Success)
> 11/30/2010 20:31:31;0008;   pbs_mom;Job;90858.mycluster;ERROR:    received request 'ABORT_JOB' from 10.0.0.105:1023 for job '90858.mycluster' (job does not exist locally)

[repeated many times, until I cancel the job]
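The "Bad UID" and "no group entry" messages make me suspect user/group lookup
on the compute nodes, so I compared how my account resolves on the headnode
and on the supernodes, along these lines (a sketch; it assumes passwordless
ssh to the nodes):

    # Compare UID/GID resolution on the headnode and on each supernode
    id me
    for n in node101 node102 node103 node104 node105; do
        echo "== $n =="
        ssh $n "id me; getent group 1009"
    done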


On the headnode, running "groups" as my username returns my username as the
group, but on each node it returns 1009, the group ID. Could that be the
problem? Using "getent" on the headnode I get:

> $ getent passwd me
> me:x:1001:1009:My Name:/home/me:/bin/bash

while on a compute node I get:

> $ getent passwd me
> me:Password_checksum:1001:1009:My Name:/home/me:/bin/bash


On both the headnode and the compute nodes, running "ypcat group" returns
nothing. Shouldn't that return the same groups that "groups" reports on the
headnode?
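Concretely, the kind of comparison I have in mind is the following (standard
NSS/NIS tools; whether NIS is really what distributes the accounts here is
part of what I'm unsure about):

    getent group me      # what NSS resolves for the group named "me"
    getent group 1009    # resolve the numeric GID that "groups" shows on the nodes
    ypcat group          # dump the NIS group map (this is what returns nothing)
    ypwhich              # which NIS server the machine is bound to, if any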

I really don't know what to check next.

Thank you very much for any suggestion.

Regards,

Nicolas