[Mauiusers] Can't get past "ALERT: jobs active on node but state is Idle"

Nicolas Bigaouette nbigaouette at gmail.com
Tue Nov 30 16:23:37 MST 2010


I've been trying to set up Torque+Maui without success.

The cluster here is composed of 70 nodes connected with InfiniBand (IB) and
5 nodes without IB but with better processors. Each node has 2 quad-core
processors with hyperthreading enabled, for a total of 16 logical cores per
node.

I want to set up a queue to run on the non-IB nodes, which I call the
supernodes. There are 80 cores on these. Here is Torque's queue config:

> set queue supernodes queue_type = Execution
> set queue supernodes resources_available.nodes = 5
> set queue supernodes resources_default.neednodes = supernode
> set queue supernodes resources_default.walltime = 00:10:00
> set queue supernodes enabled = True
> set queue supernodes started = True
> set server scheduling = True
> set server acl_hosts = mydomain.com
> set server operators = root@mydomain.com
> set server operators += me@mydomain.com
> set server default_queue = supernodes
> set server log_events = 511
> set server mail_from = adm
> set server resources_available.ncpus = 1200
> set server resources_available.nodect = 75
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server auto_node_np = False
> set server next_job_number = 90842

The node file "/var/spool/torque/server_priv/nodes" contains (for the
supernodes):

> node101 np=16 supernode
> node102 np=16 supernode
> node103 np=16 supernode
> node104 np=16 supernode
> node105 np=16 supernode
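
As a sanity check on the 80-core figure, here is a minimal sketch that sums
the np= values for nodes carrying a given property. The function name
cores_with_property is hypothetical, and the nodes-file text is embedded as a
string rather than read from /var/spool/torque/server_priv/nodes.

```python
# Hypothetical helper: count the cores advertised for one node property.
# The text below mirrors the supernode entries of the nodes file.
NODES_FILE_TEXT = """\
node101 np=16 supernode
node102 np=16 supernode
node103 np=16 supernode
node104 np=16 supernode
node105 np=16 supernode
"""


def cores_with_property(text, prop):
    """Sum np= values over nodes-file lines that carry the given property."""
    total = 0
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and comment lines.
        if not fields or fields[0].startswith("#"):
            continue
        # fields[0] is the node name; the rest are np= and properties.
        if prop not in fields[1:]:
            continue
        for field in fields[1:]:
            if field.startswith("np="):
                total += int(field[3:])
    return total


print(cores_with_property(NODES_FILE_TEXT, "supernode"))  # prints 80
```
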

I can submit and run a job only if the number of CPUs requested is less than
or equal to the number of cores on a single node: -l nodes=1:ppn=5. But if I
request 5 cores on each of 2 different nodes (-l nodes=2:ppn=5), the job stays
in a "queued" state indefinitely. Maui's log file does not seem to report
anything suspicious, even at log level 7. checkjob reports which nodes the job
is scheduled to run on (node104 and node105):

> $ checkjob 90843
> checking job 90843
> State: Running
> Creds:  user:me  group:me  class:supernodes  qos:DEFAULT
> WallTime: 00:00:00 of 00:00:30
> SubmitTime: Tue Nov 30 18:04:11
>   (Time Queued  Total: 00:12:19  Eligible: 00:12:19)
> StartTime: Tue Nov 30 18:16:30
> Total Tasks: 10
> Req[0]  TaskCount: 10  Partition: DEFAULT
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [supernode]
> Allocated Nodes:
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 739
> PartitionMask: [ALL]
> Reservation '90843' (00:00:00 -> 00:00:30  Duration: 00:00:30)
> PE:  10.00  StartPriority:  12
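
The TaskCount of 10 is consistent with the -l nodes=2:ppn=5 request (2 nodes
times 5 processors per node). A minimal sketch of that arithmetic follows; the
function task_count is hypothetical and handles only the simple numeric
"nodes=N:ppn=M" form, not the full Torque resource syntax.

```python
def task_count(request):
    """Return nodes * ppn for a simple 'nodes=N:ppn=M' request string.

    Illustrative only: parts that are not nodes= or ppn= (for example a
    node property) are ignored, and missing values default to 1.
    """
    nodes, ppn = 1, 1
    for part in request.split(":"):
        key, _, value = part.partition("=")
        if key == "nodes":
            nodes = int(value)
        elif key == "ppn":
            ppn = int(value)
    return nodes * ppn


print(task_count("nodes=2:ppn=5"))  # prints 10
print(task_count("nodes=1:ppn=5"))  # prints 5
```
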

But when I check each node with checknode, the node appears idle:

> $ checknode -v node104
> checking node node104
> State:      Idle  (in current state for 00:00:00)
> Expected State:  Running   SyncDeadline: Tue Nov 30 18:27:34
> Configured Resources: PROCS: 16  MEM: 23G  SWAP: 23G  DISK: 1M
> Utilized   Resources: PROCS: 5
> Dedicated  Resources: PROCS: 5
> Opsys:       DEFAULT  Arch:      [NONE]
> Speed:      1.00  Load:       5.000
> Location:   Partition: DEFAULT  Frame/Slot:  1/1
> Network:    [DEFAULT]
> Features:   [supernode]
> Attributes: [Batch]
> Classes:    [supernodes 11:16]
> Total Time:   INFINITY  Up:   INFINITY (73.79%)  Active: 00:01:42 (0.00%)
> Reservations:
>   Job '90843'(x5)  00:00:00 -> 00:00:30 (00:00:30)
> JobList:  90843
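
The inconsistency in this output is that the node reports State: Idle while
5 PROCS are dedicated to a job. A minimal sketch that detects this mismatch in
checknode text is given here. The function idle_but_busy is hypothetical, and
the assumption that this condition is what triggers the ALERT has not been
verified against Maui's source.

```python
import re

# Excerpt of the checknode output for node104.
CHECKNODE_TEXT = """\
State:      Idle  (in current state for 00:00:00)
Expected State:  Running   SyncDeadline: Tue Nov 30 18:27:34
Configured Resources: PROCS: 16  MEM: 23G  SWAP: 23G  DISK: 1M
Utilized   Resources: PROCS: 5
Dedicated  Resources: PROCS: 5
"""


def idle_but_busy(text):
    """Return True if the node state is Idle while PROCS are dedicated.

    Assumption: this mirrors the condition behind the ALERT message;
    it is a guess, not Maui's actual check.
    """
    # '^State:' with MULTILINE skips the 'Expected State:' line.
    state = re.search(r"^State:\s+(\S+)", text, re.MULTILINE)
    dedicated = re.search(
        r"^Dedicated\s+Resources:.*?PROCS:\s*(\d+)", text, re.MULTILINE
    )
    if state is None or dedicated is None:
        return False
    return state.group(1) == "Idle" and int(dedicated.group(1)) > 0


print(idle_but_busy(CHECKNODE_TEXT))  # prints True
```
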

Google does not say much about the "ALERT:  jobs active on node but state is
Idle"... Does anyone have a clue?

Thank you very much.

