[Mauiusers] Maui dies

Gerson Galang gerson.sapac at gawab.com
Tue Feb 15 18:42:37 MST 2005


Hi,

I've tried Jeffery's suggestion of setting the default walltime of a
queue to a very high value, and it looks like it has solved Maui's
crashing problem. I'm not yet 100% sure about this, so I'm still testing
and monitoring Maui.

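For reference, that is the qmgr setting from Jeffery's message quoted
below (his queue is named dque; substitute your own queue name):

  set queue dque resources_default.walltime = 2376:00:00
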
What I want to know now is how Maui's scheduling algorithm works when it
comes to resuming suspended jobs.

I have three levels of queues: special, parallel, and batch. special can
preempt both parallel and batch, and parallel can only preempt batch.
I've taken pbstop snapshots to show you what's happening with the jobs
I've submitted to the resource manager.

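For context, here is a minimal maui.cfg sketch of one way such a
hierarchy can be expressed with QOS-based preemption. This is
illustrative only, not necessarily the exact config: the QOS names are
made up, the special-class priority is a guess, and the 35/10 class
priorities are taken from the diagnose -p output further down.

  # suspend preemptees rather than requeueing them
  PREEMPTPOLICY        SUSPEND

  # special preempts; parallel both preempts and can be preempted;
  # batch can only be preempted
  QOSCFG[qspecial]     QFLAGS=PREEMPTOR
  QOSCFG[qparallel]    QFLAGS=PREEMPTOR,PREEMPTEE
  QOSCFG[qbatch]       QFLAGS=PREEMPTEE

  # map each PBS queue (class) to its QOS and give it a class priority
  CLASSCFG[special]    QDEF=qspecial   PRIORITY=100
  CLASSCFG[parallel]   QDEF=qparallel  PRIORITY=35
  CLASSCFG[batch]      QDEF=qbatch     PRIORITY=10
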
As you can see from the 2nd snapshot, taken at 951am, job 834 was
preempted by job 840 because 840 was submitted to the parallel queue.
The 3rd and 4th snapshots show that even the parallel job (840) was
suspended once jobs were submitted to the special queue. Everything is
still okay up to this point.

What I don't understand is why Maui decided to run job 835 instead of
simply resuming job 834 or 840. Can one of the developers tell me how
Maui decides which job to run after the jobs submitted to the preemptor
queue finish executing?

I've also captured the priorities (shown from the 9th snapshot to the
last) of all the jobs in the queue. You can see there that jobs 840 and
834 really do have the highest priorities of all the jobs. If they have
the highest priorities, why aren't they being run, and why does it look
like they're being starved?

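(Reading the diagnose -p output below: with every weight set to 1, the
displayed priority works out to just the class priority plus minutes
queued. At 1022am, for example, job 840 is 35.0 + 31.1 ~= 66 and job
834 is 10.0 + 31.5 ~= 42, so the two suspended jobs keep accruing
priority yet never get resumed.)
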
Regards,
Gerson

PS. I'm using Maui 3.2.6p11.

1st snapshot - 951am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 
 
-------------------------------------------------------------------------
   green01 .. gg gg gg gg gg gg
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
   g = 834   gerson    batch    mpitest-6-     6   R       --/2376:
       835   gerson    batch    mpitest-6-     6   Q       --/2376:
       836   gerson    batch    mpitest-6-     6   Q       --/2376:
       837   gerson    batch    mpitest-6-     6   Q       --/2376:
       838   gerson    batch    mpitest-6-     6   Q       --/2376:
       839   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   Q       --/2376:


2nd snapshot - 951am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 
 
-------------------------------------------------------------------------
   green01 .. .. gg gg gg .. ..
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
       835   gerson    batch    mpitest-6-     6   Q       --/2376:
       836   gerson    batch    mpitest-6-     6   Q       --/2376:
       837   gerson    batch    mpitest-6-     6   Q       --/2376:
       838   gerson    batch    mpitest-6-     6   Q       --/2376:
       839   gerson    batch    mpitest-6-     6   Q       --/2376:
   g = 840   gerson    parallel mpitest-3-     3   R       --/2376:

3rd snapshot - 952am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 

-------------------------------------------------------------------------
   green01 .. gg gg gg gg gg gg 
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
       835   gerson    batch    mpitest-6-     6   Q       --/2376:
       836   gerson    batch    mpitest-6-     6   Q       --/2376:
       837   gerson    batch    mpitest-6-     6   Q       --/2376:
       838   gerson    batch    mpitest-6-     6   Q       --/2376:
       839   gerson    batch    mpitest-6-     6   Q       --/2376:
   g = 840   gerson    parallel mpitest-3-     3   R       --/2376:
   g = 841   gerson    special  mpitest-3-     3   R       --/2376:
       842   gerson    special  mpitest-3-     3   Q       --/2376:


4th snapshot - 953am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 

-------------------------------------------------------------------------
   green01 .. .. gg gg gg .. ..
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
       835   gerson    batch    mpitest-6-     6   Q       --/2376:
       836   gerson    batch    mpitest-6-     6   Q       --/2376:
       837   gerson    batch    mpitest-6-     6   Q       --/2376:
       838   gerson    batch    mpitest-6-     6   Q       --/2376:
       839   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:
   g = 842   gerson    special  mpitest-3-     3   R    00:00/2376:
 


5th snapshot - 955am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 

-------------------------------------------------------------------------
   green01 .. gg gg gg gg gg gg
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
   g = 835   gerson    batch    mpitest-6-     6   R       --/2376:
       836   gerson    batch    mpitest-6-     6   Q       --/2376:
       837   gerson    batch    mpitest-6-     6   Q       --/2376:
       838   gerson    batch    mpitest-6-     6   Q       --/2376:
       839   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:



6th snapshot - 1004am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 
 
-------------------------------------------------------------------------
   green01 .. gg gg gg gg gg gg
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
   g = 838   gerson    batch    mpitest-6-     6   R    00:01/2376:
       839   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:

7th snapshot - 1009am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 
 
-------------------------------------------------------------------------
   green01 .. gg gg gg gg gg gg
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
   g = 843   gerson    batch    mpitest-6-     6   R    00:02/2376:
       844   gerson    batch    mpitest-6-     6   Q       --/2376:
       845   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:

8th snapshot - 1012am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 

-------------------------------------------------------------------------
   green01 .. gg gg gg gg gg gg
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
   g = 845   gerson    batch    mpitest-6-     6   R       --/2376:
       846   gerson    batch    mpitest-6-     6   Q       --/2376:
       847   gerson    batch    mpitest-6-     6   Q       --/2376:
       848   gerson    batch    mpitest-6-     6   Q       --/2376:
       849   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:


****************************************************
batch jobs submitted randomly within every 2 minutes
****************************************************

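A loop along these lines produces that submission pattern (just a
sketch; the job script name is made up):

  # submit one six-node batch job at a random point in each
  # two-minute window
  while true; do
      off=$((RANDOM % 120))
      sleep "$off"
      qsub -q batch -l nodes=6 mpitest-6.sh
      sleep $((120 - off))
  done
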
9th snapshot - 1022am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 
 
-------------------------------------------------------------------------
   green01 .. gg gg gg gg gg gg
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
   g = 848   gerson    batch    mpitest-6-     6   R       --/2376:
       849   gerson    batch    mpitest-6-     6   Q       --/2376:
       850   gerson    batch    mpitest-6-     6   Q       --/2376:
       851   gerson    batch    mpitest-6-     6   Q       --/2376:
       852   gerson    batch    mpitest-6-     6   Q       --/2376:
       853   gerson    batch    mpitest-6-     6   Q       --/2376:
       854   gerson    batch    mpitest-6-     6   Q       --/2376:
       855   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:

[gerson at drongo maui-3.2.6p11]$ diagnose -p
diagnosing job priority information (partition: ALL)

Job                    PRIORITY*   Cred(  QOS:Class)  Serv(QTime)
              Weights   --------       1(    1:    1)     1(    1)

840                          66    52.9(  0.0: 35.0)  47.1( 31.1)
834                          42    24.1(  0.0: 10.0)  75.9( 31.5)
849                          20    50.4(  0.0: 10.0)  49.6(  9.8)
850                          18    56.1(  0.0: 10.0)  43.9(  7.8)
851                          17    57.4(  0.0: 10.0)  42.6(  7.4)
852                          16    61.3(  0.0: 10.0)  38.7(  6.3)
853                          16    64.4(  0.0: 10.0)  35.6(  5.5)
854                          13    77.8(  0.0: 10.0)  22.2(  2.9)
855                          12    85.7(  0.0: 10.0)  14.3(  1.7)
856                          10    99.8(  0.0: 10.0)   0.2(  0.0)

Percent Contribution   --------    54.6(  0.0: 54.6)  45.4( 45.4)

10th snapshot - 1042am

  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 
 
-------------------------------------------------------------------------
   green01 .. gg gg gg gg gg gg
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
   g = 855   gerson    batch    mpitest-6-     6   R    00:01/2376:
       856   gerson    batch    mpitest-6-     6   Q       --/2376:
       857   gerson    batch    mpitest-6-     6   Q       --/2376:
       858   gerson    batch    mpitest-6-     6   Q       --/2376:
       859   gerson    batch    mpitest-6-     6   Q       --/2376:
       860   gerson    batch    mpitest-6-     6   Q       --/2376:
       861   gerson    batch    mpitest-6-     6   Q       --/2376:
       862   gerson    batch    mpitest-6-     6   Q       --/2376:
       863   gerson    batch    mpitest-6-     6   Q       --/2376:
       864   gerson    batch    mpitest-6-     6   Q       --/2376:
       865   gerson    batch    mpitest-6-     6   Q       --/2376:
       866   gerson    batch    mpitest-6-     6   Q       --/2376:
       867   gerson    batch    mpitest-6-     6   Q       --/2376:
       868   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:


[gerson at drongo maui-3.2.6p11]$ diagnose -p
diagnosing job priority information (partition: ALL)

Job                    PRIORITY*   Cred(  QOS:Class)  Serv(QTime)
              Weights   --------       1(    1:    1)     1(    1)

840                          86    40.9(  0.0: 35.0)  59.1( 50.5)
834                          61    16.4(  0.0: 10.0)  83.6( 50.9)
856                          29    34.0(  0.0: 10.0)  66.0( 19.4)
857                          28    36.3(  0.0: 10.0)  63.7( 17.6)
858                          27    36.7(  0.0: 10.0)  63.3( 17.2)
859                          26    39.0(  0.0: 10.0)  61.0( 15.7)
860                          24    41.7(  0.0: 10.0)  58.3( 14.0)
861                          23    43.4(  0.0: 10.0)  56.6( 13.0)
862                          20    48.8(  0.0: 10.0)  51.2( 10.5)
863                          18    54.2(  0.0: 10.0)  45.8(  8.4)
864                          17    58.1(  0.0: 10.0)  41.9(  7.2)
865                          16    62.6(  0.0: 10.0)  37.4(  6.0)
866                          13    74.4(  0.0: 10.0)  25.6(  3.4)
867                          13    77.2(  0.0: 10.0)  22.8(  3.0)
868                          10    99.8(  0.0: 10.0)   0.2(  0.0)

Percent Contribution   --------    42.5(  0.0: 42.5)  57.5( 57.5)

11th snapshot - 1051am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 

-------------------------------------------------------------------------
   green01 .. gg gg gg gg gg gg
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
   g = 859   gerson    batch    mpitest-6-     6   R    00:01/2376:
       860   gerson    batch    mpitest-6-     6   Q       --/2376:
       861   gerson    batch    mpitest-6-     6   Q       --/2376:
       862   gerson    batch    mpitest-6-     6   Q       --/2376:
       863   gerson    batch    mpitest-6-     6   Q       --/2376:
       864   gerson    batch    mpitest-6-     6   Q       --/2376:
       865   gerson    batch    mpitest-6-     6   Q       --/2376:
       866   gerson    batch    mpitest-6-     6   Q       --/2376:
       867   gerson    batch    mpitest-6-     6   Q       --/2376:
       868   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:

[gerson at drongo maui-3.2.6p11]$ diagnose -p
diagnosing job priority information (partition: ALL)

Job                    PRIORITY*   Cred(  QOS:Class)  Serv(QTime)
              Weights   --------       1(    1:    1)     1(    1)

840                          96    36.4(  0.0: 35.0)  63.6( 61.2)
834                          72    14.0(  0.0: 10.0)  86.0( 61.6)
861                          34    29.6(  0.0: 10.0)  70.4( 23.8)
862                          31    32.0(  0.0: 10.0)  68.0( 21.2)
863                          29    34.3(  0.0: 10.0)  65.7( 19.2)
864                          28    35.8(  0.0: 10.0)  64.2( 17.9)
865                          27    37.5(  0.0: 10.0)  62.5( 16.7)
866                          24    41.4(  0.0: 10.0)  58.6( 14.2)
867                          24    42.2(  0.0: 10.0)  57.8( 13.7)
868                          21    48.2(  0.0: 10.0)  51.8( 10.8)

Percent Contribution   --------    32.4(  0.0: 32.4)  67.6( 67.6)


12th snapshot - 1101am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 
 
------------------------------------------------------------------------- 
  green01 .. gg gg gg gg gg gg
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
   g = 863   gerson    batch    mpitest-6-     6   R    00:01/2376:
       864   gerson    batch    mpitest-6-     6   Q       --/2376:
       865   gerson    batch    mpitest-6-     6   Q       --/2376:
       866   gerson    batch    mpitest-6-     6   Q       --/2376:
       867   gerson    batch    mpitest-6-     6   Q       --/2376:
       868   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:

[gerson at drongo maui-3.2.6p11]$ diagnose -p
diagnosing job priority information (partition: ALL)

Job                    PRIORITY*   Cred(  QOS:Class)  Serv(QTime)
              Weights   --------       1(    1:    1)     1(    1)

840                         106    32.9(  0.0: 35.0)  67.1( 71.3)
834                          82    12.2(  0.0: 10.0)  87.8( 71.7)
865                          37    27.2(  0.0: 10.0)  72.8( 26.8)
866                          34    29.2(  0.0: 10.0)  70.8( 24.2)
867                          34    29.6(  0.0: 10.0)  70.4( 23.8)
868                          31    32.4(  0.0: 10.0)  67.6( 20.8)

Percent Contribution   --------    26.3(  0.0: 26.3)  73.7( 73.7)

13th snapshot - 1111am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 
 
------------------------------------------------------------------------- 
  green01 .. gg gg gg gg gg gg
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
   g = 867   gerson    batch    mpitest-6-     6   R       --/2376:
       868   gerson    batch    mpitest-6-     6   Q       --/2376:
       869   gerson    batch    mpitest-6-     6   Q       --/2376:
       870   gerson    batch    mpitest-6-     6   Q       --/2376:
       871   gerson    batch    mpitest-6-     6   Q       --/2376:
       872   gerson    batch    mpitest-6-     6   Q       --/2376:
       873   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:

[gerson at drongo maui-3.2.6p11]$ diagnose -p
diagnosing job priority information (partition: ALL)

Job                    PRIORITY*   Cred(  QOS:Class)  Serv(QTime)
              Weights   --------       1(    1:    1)     1(    1)

840                         115    30.4(  0.0: 35.0)  69.6( 80.2)
834                          91    11.0(  0.0: 10.0)  89.0( 80.6)
868                          40    25.2(  0.0: 10.0)  74.8( 29.7)
869                          18    54.9(  0.0: 10.0)  45.1(  8.2)
870                          15    65.0(  0.0: 10.0)  35.0(  5.4)
871                          13    76.6(  0.0: 10.0)  23.4(  3.0)
872                          13    78.4(  0.0: 10.0)  21.6(  2.8)
873                          10    99.7(  0.0: 10.0)   0.3(  0.0)

Percent Contribution   --------    33.3(  0.0: 33.3)  66.7( 66.7)

14th snapshot - 1130am
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 
 
-------------------------------------------------------------------------
   green01 .. gg gg gg gg gg gg
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
   g = 875   gerson    batch    mpitest-6-     6   R       --/2376:
       876   gerson    batch    mpitest-6-     6   Q       --/2376:
       877   gerson    batch    mpitest-6-     6   Q       --/2376:
       878   gerson    batch    mpitest-6-     6   Q       --/2376:
       879   gerson    batch    mpitest-6-     6   Q       --/2376:
       880   gerson    batch    mpitest-6-     6   Q       --/2376:
       881   gerson    batch    mpitest-6-     6   Q       --/2376:
       882   gerson    batch    mpitest-6-     6   Q       --/2376:
       883   gerson    batch    mpitest-6-     6   Q       --/2376:
       884   gerson    batch    mpitest-6-     6   Q       --/2376:
       885   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:

[gerson at drongo maui-3.2.6p11]$ diagnose -p
diagnosing job priority information (partition: ALL)

Job                    PRIORITY*   Cred(  QOS:Class)  Serv(QTime)
              Weights   --------       1(    1:    1)     1(    1)

840                         134    26.1(  0.0: 35.0)  73.9( 99.2)
834                         110     9.1(  0.0: 10.0)  90.9( 99.6)
876                          26    38.5(  0.0: 10.0)  61.5( 15.9)
877                          23    42.7(  0.0: 10.0)  57.3( 13.4)
878                          21    47.7(  0.0: 10.0)  52.3( 10.9)
879                          20    50.7(  0.0: 10.0)  49.3(  9.7)
880                          17    57.5(  0.0: 10.0)  42.5(  7.4)
881                          16    64.2(  0.0: 10.0)  35.8(  5.6)
882                          14    69.9(  0.0: 10.0)  30.1(  4.3)
883                          14    72.0(  0.0: 10.0)  28.0(  3.9)
884                          12    80.9(  0.0: 10.0)  19.1(  2.4)
885                          10    96.9(  0.0: 10.0)   3.1(  0.3)

Percent Contribution   --------    34.7(  0.0: 34.7)  65.3( 65.3)

15th snapshot - 1208pm
  visible CPUs: 0,1
           1  2  3  4  5  6  7  8  9  0    1  2  3  4  5  6  7  8  9  0 
 
------------------------------------------------------------------------- 
  green01 .. gg gg gg gg gg gg 
------------------------------------------------------------------------- 
 

 

       Job#  Username  Queue    Jobname    Nodes   S  Elapsed/Requested
       834   gerson    batch    mpitest-6-     6   S       --/2376:
   g = 890   gerson    batch    mpitest-6-     6   R       --/2376:
       891   gerson    batch    mpitest-6-     6   Q       --/2376:
       892   gerson    batch    mpitest-6-     6   Q       --/2376:
       893   gerson    batch    mpitest-6-     6   Q       --/2376:
       894   gerson    batch    mpitest-6-     6   Q       --/2376:
       895   gerson    batch    mpitest-6-     6   Q       --/2376:
       896   gerson    batch    mpitest-6-     6   Q       --/2376:
       897   gerson    batch    mpitest-6-     6   Q       --/2376:
       898   gerson    batch    mpitest-6-     6   Q       --/2376:
       899   gerson    batch    mpitest-6-     6   Q       --/2376:
       900   gerson    batch    mpitest-6-     6   Q       --/2376:
       901   gerson    batch    mpitest-6-     6   Q       --/2376:
       902   gerson    batch    mpitest-6-     6   Q       --/2376:
       903   gerson    batch    mpitest-6-     6   Q       --/2376:
       904   gerson    batch    mpitest-6-     6   Q       --/2376:
       905   gerson    batch    mpitest-6-     6   Q       --/2376:
       906   gerson    batch    mpitest-6-     6   Q       --/2376:
       907   gerson    batch    mpitest-6-     6   Q       --/2376:
       908   gerson    batch    mpitest-6-     6   Q       --/2376:
       909   gerson    batch    mpitest-6-     6   Q       --/2376:
       910   gerson    batch    mpitest-6-     6   Q       --/2376:
       911   gerson    batch    mpitest-6-     6   Q       --/2376:
       912   gerson    batch    mpitest-6-     6   Q       --/2376:
       913   gerson    batch    mpitest-6-     6   Q       --/2376:
       840   gerson    parallel mpitest-3-     3   S       --/2376:

Jeffery Ludwig wrote:
> Sorry for the undescriptive subject line; I'm hoping the mail software 
> will thread this properly for future searches. This is a continuation 
> of a thread from this September regarding the scheduler "randomly" 
> dying.  See:
> 
> http://www.supercluster.org/pipermail/mauiusers/2004-September/001338.html
> 
> We've recently upgraded our scheduling software from the OpenPBS default 
> to Maui on two separate Beowulf clusters I administer.  The first, 
> running SUSE 9.1, Linux 2.6.5-7.111.5-default, and the OpenPBS server 
> (as provided by SUSE), has been working flawlessly for a week; the 
> pbs_server was not even shut down to make the change.
> 
> The second cluster has to this point been riddled with problems; I think 
> I can provide some information via a compare-and-contrast to help 
> isolate this bug if it is still unresolved.  The second cluster is 
> running Red Hat 7.3, Linux 2.4.20-28.7.  Originally this cluster was 
> running stock OpenPBS 2.3.16/maui-3.2.6p11; it is now running 
> torque-1.2.0p0/maui, and problems occurred with both setups.  We are 
> using LAM 6.5.9/MPI 2 for the most part; the stable cluster is using 
> LAM 7.0.6/MPI 2 C++/ROMIO.  It appears the cluster using LAM 7.0.6 is 
> accounting for usage in fairshare properly, while the one on 6.5.9 is not.
> 
> The key config difference between the setups, as noted previously in the 
> thread, is that the problematic cluster is using standing reservations 
> while the other is not.
> 
> Based on our logs, Maui seemed to run into problems as it was dumping 
> parallel jobs due to wallclock violations.  It (or rather torque) is 
> unable to actually kill jobs running with LAM-MPI (or it accounts for 
> their time improperly); they remain running even after Maui thinks they 
> are gone.  Since issuing:
> 
> set queue dque resources_default.walltime = 2376:00:00
> 
> everything seems much more stable, save for the fairshare values being 
> off; I have my fingers crossed right now.  Sorry for the brain dump... 
> if it does crash again, I will run it through gdb to get a stack trace. 
> Both are "production" machines, so I'm really not free to test too much...
> 

