[Mauiusers] Question about corrupted job requirements

Tom Rudwick tomr at intrinsity.com
Tue Oct 23 16:40:59 MDT 2007


We are using Maui 3.2.6p19 and Torque 2.1.9.

When using the software resource to track global floating license
usage, we see the job requirements become corrupted, which causes
Maui to generate bogus error messages. The output at the end of this
message shows pairs of successive checkjob commands for two jobs. In
the first output of each pair, Req[0] has the node phobos allocated.
In the second, 5 to 6 seconds later, Req[0] has been corrupted: the
allocated node phobos has been replaced by the GLOBAL node, which is
used by Req[1]. Everything on the Torque side seems to be OK. The job
runs normally to completion on the correct node, and the software
license seems to be counted correctly, but we get errors from Maui
like this:

10/22 18:39:18 ALERT:    task 0 changed from 'phobos' to 'GLOBAL' for active job '25349'
10/22 18:39:18 INFO:     task 0 assigned to job '25349'
10/22 18:39:18 ALERT:    RM state corruption.  job '25349' has idle node 'GLOBAL' allocated (node forced to active state)
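
As a rough aid for spotting these events in a Maui log, here is a
minimal Python sketch that picks out the "task N changed" ALERT lines.
The regular expression is inferred only from the three sample lines
above, so other Maui versions may format the message differently; the
name find_task_changes is hypothetical and not part of Maui.

```python
import re

# Pattern inferred from the sample ALERT line above (an assumption, not a
# documented Maui log format).
TASK_CHANGED = re.compile(
    r"ALERT:\s+task (\d+) changed from '([^']+)' to '([^']+)' "
    r"for active job '([^']+)'"
)


def find_task_changes(lines):
    """Return (job, task, old_node, new_node) for each matching ALERT line."""
    hits = []
    for line in lines:
        m = TASK_CHANGED.search(line)
        if m:
            task, old, new, job = m.groups()
            hits.append((job, int(task), old, new))
    return hits


# The three log lines quoted above, used as sample input.
sample = [
    "10/22 18:39:18 ALERT:    task 0 changed from 'phobos' to 'GLOBAL' for active job '25349'",
    "10/22 18:39:18 INFO:     task 0 assigned to job '25349'",
    "10/22 18:39:18 ALERT:    RM state corruption.  job '25349' has idle node 'GLOBAL' allocated (node forced to active state)",
]

for job, task, old, new in find_task_changes(sample):
    print(f"job {job}: task {task} moved {old} -> {new}")
```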

I also suspect that other errors are generated indirectly by the
corrupted node usage information, when jobs try to start on a node
that Maui considers "idle" but that is not really idle.

If anyone has seen this, or can suggest what part of the code may
be modifying these requirements, I would appreciate any help
you can provide.

Thanks,
Tom

Output from successive checkjobs:

[root@metx01 server_priv]# checkjob 25120


checking job 25120

State: Running
Creds:  user:customer1  group:users  class:linux  qos:DEFAULT
WallTime: 00:00:00 of 00:01:00
SubmitTime: Mon Oct 22 15:50:56
   (Time Queued  Total: 00:01:17  Eligible: 00:01:17)

StartTime: Mon Oct 22 15:52:13
Total Tasks: 2

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: linux  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 100M
NodeCount: 1
Allocated Nodes:
[phobos:1]

Req[1]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory NC 0  Disk NC 0  Swap NC 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: hsimplus: 1
NodeCount: 1
Allocated Nodes:
[GLOBAL:1]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '25120' (00:00:00 -> 00:01:00  Duration: 00:01:00)
PE:  1.00  StartPriority:  100

[root@metx01 server_priv]# checkjob 25120


checking job 25120

State: Running
Creds:  user:customer1  group:users  class:linux  qos:DEFAULT
WallTime: 00:00:06 of 00:01:00
SubmitTime: Mon Oct 22 15:50:56
   (Time Queued  Total: 00:01:17  Eligible: 00:01:17)

StartTime: Mon Oct 22 15:52:13
Total Tasks: 1

Req[0]  TaskCount: 0  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: linux  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 100M
Allocated Nodes:
[GLOBAL:1]
WARNING:  allocated tasks do not match requested tasks (1 != 0)

Req[1]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory NC 0  Disk NC 0  Swap NC 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: hsimplus: 1
NodeCount: 1
Allocated Nodes:
[GLOBAL:1]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '25120' (-00:00:06 -> 00:00:54  Duration: 00:01:00)
PE:  0.00  StartPriority:  100




[root@metx01 server_priv]# checkjob 25358


checking job 25358

State: Running
Creds:  user:customer1  group:users  class:linux  qos:DEFAULT
WallTime: 00:00:00 of 00:01:00
SubmitTime: Mon Oct 22 19:16:46
   (Time Queued  Total: 00:01:29  Eligible: 00:01:29)

StartTime: Mon Oct 22 19:18:15
Total Tasks: 2

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: linux  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 100M
NodeCount: 1
Allocated Nodes:
[phobos:1]

Req[1]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory NC 0  Disk NC 0  Swap NC 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: hsimplus: 1
NodeCount: 1
Allocated Nodes:
[GLOBAL:1]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '25358' (00:00:00 -> 00:01:00  Duration: 00:01:00)
PE:  1.00  StartPriority:  100

[root@metx01 server_priv]# checkjob 25358


checking job 25358

State: Running
Creds:  user:customer1  group:users  class:linux  qos:DEFAULT
WallTime: 00:00:05 of 00:01:00
SubmitTime: Mon Oct 22 19:16:46
   (Time Queued  Total: 00:01:29  Eligible: 00:01:29)

StartTime: Mon Oct 22 19:18:15
Total Tasks: 1

Req[0]  TaskCount: 0  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: linux  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 100M
Allocated Nodes:
[GLOBAL:1]
WARNING:  allocated tasks do not match requested tasks (1 != 0)

Req[1]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory NC 0  Disk NC 0  Swap NC 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: hsimplus: 1
NodeCount: 1
Allocated Nodes:
[GLOBAL:1]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '25358' (-00:00:05 -> 00:00:55  Duration: 00:01:00)
PE:  0.00  StartPriority:  100
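
To compare two checkjob outputs like the pairs above without reading
them by eye, a small Python sketch along these lines could report which
Req has a different node list in the second output. It assumes the
"Req[N]  TaskCount:" and "Allocated Nodes:" layout shown in this
message; parse_reqs and changed_allocations are hypothetical helper
names, and the sample text is a trimmed excerpt of the job 25120 output.

```python
import re

# Layout assumptions taken from the checkjob output in this message.
REQ_HEADER = re.compile(r"^Req\[(\d+)\]\s+TaskCount:\s+(\d+)")
NODE_ENTRY = re.compile(r"\[([^\[\]:]+):(\d+)\]")


def parse_reqs(text):
    """Map req index -> {'tasks': int, 'nodes': [(name, count), ...]}."""
    reqs = {}
    current = None
    in_nodes = False
    for raw in text.splitlines():
        line = raw.strip()
        header = REQ_HEADER.match(line)
        if header:
            current = int(header.group(1))
            reqs[current] = {"tasks": int(header.group(2)), "nodes": []}
            in_nodes = False
        elif current is not None and line.startswith("Allocated Nodes:"):
            in_nodes = True
        elif in_nodes:
            entries = NODE_ENTRY.findall(line)
            if entries:
                reqs[current]["nodes"] += [(n, int(c)) for n, c in entries]
            else:
                in_nodes = False  # any other line ends the node list
    return reqs


def changed_allocations(before, after):
    """Return {req: (old_nodes, new_nodes)} for reqs whose node list differs."""
    a, b = parse_reqs(before), parse_reqs(after)
    return {
        i: (a[i]["nodes"], b[i]["nodes"])
        for i in sorted(a.keys() & b.keys())
        if a[i]["nodes"] != b[i]["nodes"]
    }


first = """Req[0]  TaskCount: 1  Partition: DEFAULT
Allocated Nodes:
[phobos:1]

Req[1]  TaskCount: 1  Partition: ALL
Allocated Nodes:
[GLOBAL:1]
"""

second = """Req[0]  TaskCount: 0  Partition: ALL
Allocated Nodes:
[GLOBAL:1]
WARNING:  allocated tasks do not match requested tasks (1 != 0)

Req[1]  TaskCount: 1  Partition: ALL
Allocated Nodes:
[GLOBAL:1]
"""

for req, (old, new) in changed_allocations(first, second).items():
    print(f"Req[{req}] allocation changed: {old} -> {new}")
```

This only detects the symptom after the fact; it says nothing about
which part of Maui rewrites the requirement.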



