[Mauiusers] Question about corrupted job requirements

Tom Rudwick tomr at intrinsity.com
Tue Nov 13 14:06:48 MST 2007


I've attached a patch that we used to fix the corrupted job
requirements problem.

Tom

Tom Rudwick wrote:
> We are using Maui 3.2.6p19 and Torque 2.1.9.
> 
> When using the software resource to track global floating license
> usage, we are seeing corruption of the job requirements, which
> causes bogus error messages in Maui. The example below shows
> output from two sequential checkjob commands. In the first, you can
> see that Req[0] has the node phobos allocated. In the second, a few
> seconds later, you can see that Req[0] has been corrupted: the
> allocated node phobos has been replaced with the GLOBAL node, which
> is used by Req[1]. All the information on the Torque side seems
> to be OK. The job runs normally to completion on the correct
> node. The software license seems to be counted correctly, but we
> get errors from Maui like this:
> 
> 10/22 18:39:18 ALERT:    task 0 changed from 'phobos' to 'GLOBAL' for 
> active job '25349'
> 10/22 18:39:18 INFO:     task 0 assigned to job '25349'
> 10/22 18:39:18 ALERT:    RM state corruption.  job '25349' has idle node 
> 'GLOBAL' allocated (node forced to active state)
> 
> Additionally, I think the corrupted node usage information indirectly
> generates other errors when jobs try to start on the "idle" node,
> which is not really idle.
> 
> If anyone has seen this, or can suggest what part of the code may
> be modifying these requirements, I would appreciate any help
> you can provide.
> 
> Thanks,
> Tom
> 
> Output from successive checkjob commands:
> 
> [root at metx01 server_priv]# checkjob 25120
> 
> 
> checking job 25120
> 
> State: Running
> Creds:  user:customer1  group:users  class:linux  qos:DEFAULT
> WallTime: 00:00:00 of 00:01:00
> SubmitTime: Mon Oct 22 15:50:56
>   (Time Queued  Total: 00:01:17  Eligible: 00:01:17)
> 
> StartTime: Mon Oct 22 15:52:13
> Total Tasks: 2
> 
> Req[0]  TaskCount: 1  Partition: DEFAULT
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: linux  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1  MEM: 100M
> NodeCount: 1
> Allocated Nodes:
> [phobos:1]
> 
> Req[1]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory NC 0  Disk NC 0  Swap NC 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: hsimplus: 1
> NodeCount: 1
> Allocated Nodes:
> [GLOBAL:1]
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
> 
> Reservation '25120' (00:00:00 -> 00:01:00  Duration: 00:01:00)
> PE:  1.00  StartPriority:  100
> 
> [root at metx01 server_priv]# checkjob 25120
> 
> 
> checking job 25120
> 
> State: Running
> Creds:  user:customer1  group:users  class:linux  qos:DEFAULT
> WallTime: 00:00:06 of 00:01:00
> SubmitTime: Mon Oct 22 15:50:56
>   (Time Queued  Total: 00:01:17  Eligible: 00:01:17)
> 
> StartTime: Mon Oct 22 15:52:13
> Total Tasks: 1
> 
> Req[0]  TaskCount: 0  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: linux  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1  MEM: 100M
> Allocated Nodes:
> [GLOBAL:1]
> WARNING:  allocated tasks do not match requested tasks (1 != 0)
> 
> Req[1]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory NC 0  Disk NC 0  Swap NC 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: hsimplus: 1
> NodeCount: 1
> Allocated Nodes:
> [GLOBAL:1]
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
> 
> Reservation '25120' (-00:00:06 -> 00:00:54  Duration: 00:01:00)
> PE:  0.00  StartPriority:  100
> 
> 
> 
> 
> [root at metx01 server_priv]# checkjob 25358
> 
> 
> checking job 25358
> 
> State: Running
> Creds:  user:customer1  group:users  class:linux  qos:DEFAULT
> WallTime: 00:00:00 of 00:01:00
> SubmitTime: Mon Oct 22 19:16:46
>   (Time Queued  Total: 00:01:29  Eligible: 00:01:29)
> 
> StartTime: Mon Oct 22 19:18:15
> Total Tasks: 2
> 
> Req[0]  TaskCount: 1  Partition: DEFAULT
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: linux  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1  MEM: 100M
> NodeCount: 1
> Allocated Nodes:
> [phobos:1]
> 
> Req[1]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory NC 0  Disk NC 0  Swap NC 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: hsimplus: 1
> NodeCount: 1
> Allocated Nodes:
> [GLOBAL:1]
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
> 
> Reservation '25358' (00:00:00 -> 00:01:00  Duration: 00:01:00)
> PE:  1.00  StartPriority:  100
> 
> [root at metx01 server_priv]# checkjob 25358
> 
> 
> checking job 25358
> 
> State: Running
> Creds:  user:customer1  group:users  class:linux  qos:DEFAULT
> WallTime: 00:00:05 of 00:01:00
> SubmitTime: Mon Oct 22 19:16:46
>   (Time Queued  Total: 00:01:29  Eligible: 00:01:29)
> 
> StartTime: Mon Oct 22 19:18:15
> Total Tasks: 1
> 
> Req[0]  TaskCount: 0  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: linux  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1  MEM: 100M
> Allocated Nodes:
> [GLOBAL:1]
> WARNING:  allocated tasks do not match requested tasks (1 != 0)
> 
> Req[1]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory NC 0  Disk NC 0  Swap NC 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Dedicated Resources Per Task: hsimplus: 1
> NodeCount: 1
> Allocated Nodes:
> [GLOBAL:1]
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
> 
> Reservation '25358' (-00:00:05 -> 00:00:55  Duration: 00:01:00)
> PE:  0.00  StartPriority:  100
> 
> 
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers


-------------- next part --------------
--- src/moab/MPBSI.c~	2007-10-22 15:43:16.000000000 -0500
+++ src/moab/MPBSI.c	2007-11-09 15:01:48.000000000 -0600
@@ -4112,40 +4112,58 @@
         }
       else if (!strcmp(AP->resource,"software"))
         {
-        /* NOTE:  old hack (map software to node feature */
+      int rqindex;
 
-        /* MReqSetAttr(J,RQ,mrqaReqNodeFeature,(void **)AP->value,mdfString,mAdd); */
+      int RIndex;
 
-        /* NOTE:  software handled at job load time, no support for dynamic software spec */
+      mreq_t *tmpRQ;
+
+      if ((RIndex = MUMAGetIndex(eGRes,AP->value,mAdd)) == 0)
+        {
+        /* cannot add support for generic res */
 
-        /* Food for further ruminations:
+        DBG(1,fPBS) DPrint("ALERT:    cannot add support for GRes software '%s'\n",
+          AP->value);
+ 
+        continue;
+        }
+
+      /* verify software req does not already exist */
+
+      for (rqindex = 0;J->Req[rqindex] != NULL;rqindex++)
+        {
+        if (J->Req[rqindex]->DRes.GRes[RIndex].count > 0)
+          break;
+        }  /* END for (rqindex) */
 
-            * software licenses can be either floating or node-locked
+      if (J->Req[rqindex] != NULL)
+        {
+        /* software req already added */
 
-            * the above works in the situation of a node-locked license
-               for unlimited users; limiting # of concurrent uses could
-              be accomplished by forcing users to submit to a specific
-               queue/class and limit the number of concurrent jobs in
-              that class
+        continue;
+        }
 
-            * one can imagine future support looking something like this (from the POV
-              of the config file):
+      /* add software req */
 
-              # Node-locked on a single host, unlimited concurrent usage
-               SOFTWARECFG[pkg1] HOSTLIST=node01
+      if (MReqCreate(J,NULL,&tmpRQ,FALSE) == FAILURE)
+        {
+        DBG(1,fPBS) DPrint("ALERT:    cannot add req to job %s for GRes software '%s'\n",
+          J->Name,
+          AP->value);
 
-              # Node-locked on a single host, limited to one concurrent use
-               SOFTWARECFG[pkg2] HOSTMAXCOUNT=1 HOSTLIST=node02
+        continue;
+        }
 
-               # Floating across several hosts, global maximum on concurrent usage
-               SOFTWARECFG[pkg3] MAXCOUNT=5 HOSTLIST=node[1-4][0-9]
+      /* NOTE:  PBS currently supports only one license request per job */
 
-              # Floating across several hosts, global and per-host maxima on concurrent usage
-              SOFTWARECFG[pkg4] MAXCOUNT=10 HOSTMAXCOUNT=2 HOSTLIST=node[5-8][0-9]
+      tmpRQ->DRes.GRes[RIndex].count = 1;
+      tmpRQ->DRes.GRes[0].count      = 1;
+      tmpRQ->TaskCount               = 1;
+      tmpRQ->NodeCount               = 1;
+ 
+      /* NOTE:  prior workaround (map software to node feature) */
 
-            * this would probably also require support in diagnose ("diagnose -S",
-              maybe?)
-        */
+      /* MReqSetAttr(J,RQ,mrqaReqNodeFeature,(void **)AP->value,mdfString,mAdd); */
         }
       else
         {


