[Mauiusers] hurting in a bad way

James A. Peltier jpeltier at cs.sfu.ca
Mon Mar 2 13:58:27 MST 2009


Sorry for the cross post all, but I'm hurting in a bad way.  I attempted
an upgrade to the latest 2.4 beta snapshot to check out the new features,
and ever since then things have gone downhill fast.  The cluster began
scheduling to only half of the available compute resources, jobs were
getting cancelled, and all sorts of other odd behaviour followed.

Out of desperation I reinstalled the whole cluster, reverting to the
latest released stable snapshots, yet I'm still seeing odd behaviour.
For example, jobs are being "lost": they appear in showq as idle,
immediately go blocked, and then if I try to releasehold them Maui says
the job doesn't exist.
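
In case it helps, here is roughly the sequence I mean, using job 3692
from the blocked list below as the example (standard torque/maui
commands; -a on releasehold just releases all hold types):

# showq -b               (job shows up in the blocked list)
# checkjob 3692          (reports State: Hold -- output further down)
# releasehold -a 3692    (this is the step that says the job doesn't exist)
# qstat -f 3692          (to compare what pbs_server itself still knows)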

As it sits right now, this is a freshly installed cluster running the
latest stable torque and maui.  Some jobs are running normally; others
are not.

Any help would be greatly appreciated.


Details below:

3963MB Nodes
             NodeName Available      Busy NodeState
               sdats0    96.30%    58.27%      Idle
               sdats1    94.82%    70.07%      Idle
Summary:    2 3963MB Nodes   95.56% Avail   61.28% Busy  (Current: 100.00% Avail    0.00% Busy)

4565MB Nodes
             NodeName Available      Busy NodeState
              a05-nll    95.78%    59.02%   Running
              a06-nll    95.51%    59.19%   Running
              a07-nll    95.51%    60.19%   Running
              a08-nll    95.14%    57.64%      Idle
Summary:    4 4565MB Nodes   95.48% Avail   56.35% Busy  (Current: 100.00% Avail    0.00% Busy)

8120MB Nodes
             NodeName Available      Busy NodeState
              a02-nll    94.36%    59.91%   Running
Summary:    1 8120MB Nodes   94.36% Avail   56.53% Busy  (Current: 100.00% Avail    0.00% Busy)

16047MB Nodes
             NodeName Available      Busy NodeState
              ilhpc01    97.36%    49.23%      Idle
              ilhpc02    97.98%    48.86%      Idle
             linear-a    97.98%    71.86%      Idle
             linear-b    96.26%    77.95%      Idle
Summary:    4 16047MB Nodes   97.39% Avail   60.31% Busy  (Current: 100.00% Avail    0.00% Busy)

30775MB Nodes
             NodeName Available      Busy NodeState
                prism    95.16%    57.72%   Running
Summary:    1 30775MB Nodes   95.16% Avail   54.92% Busy  (Current: 100.00% Avail    0.00% Busy)

32169MB Nodes
             NodeName Available      Busy NodeState
             r1s40c01    97.98%    46.65%      Idle
             r1s39c02    97.98%    46.44%      Idle
             r1s38c03    97.36%    46.69%      Idle
             r1s37c04    96.91%    48.08%      Idle
             r1s36c05    96.60%    50.12%      Idle
             r1s35c06    96.91%    49.09%      Idle
             r1s34c07    96.60%    43.57%      Idle
             r1s33c08    96.91%    48.87%      Idle
             r1s31c09    96.60%    58.53%      Idle
Summary:    9 32169MB Nodes   97.09% Avail   47.25% Busy  (Current: 100.00% Avail    0.00% Busy)

32189MB Nodes
             NodeName Available      Busy NodeState
              rosetta    95.51%    62.02%      Busy
Summary:    1 32189MB Nodes   95.51% Avail   59.23% Busy  (Current: 100.00% Avail  100.00% Busy)

61606MB Nodes
             NodeName Available      Busy NodeState
               icarus    99.57%    56.49%   Running
Summary:    1 61606MB Nodes   99.57% Avail   56.25% Busy  (Current: 100.00% Avail    0.00% Busy)

128997MB Nodes
             NodeName Available      Busy NodeState
             r1s26c10    96.30%    75.33%   Running
             r1s22c11    95.51%    58.83%      Idle
             r1s18c12    96.60%    57.63%      Idle
             r1s14c13    96.30%    59.44%      Idle
             r1s10c14    96.30%    45.73%      Idle
Summary:    5 128997MB Nodes   96.20% Avail   57.13% Busy  (Current: 100.00% Avail    0.00% Busy)

System Summary:   30 Nodes   90.09% Avail   50.90% Busy  (Current:  93.33% Avail    3.33% Busy)


Idle Jobs

            JobName    Priority  XFactor  Q      User    Group  Procs     WCLimit     Class      SystemQueueTime

               2901*      20058      1.1  -  user students      1  7:12:00:00     batch  Sun Mar  1 19:11:47
               2924       19966      1.1  -  user students      1  7:12:00:00     batch  Sun Mar  1 19:16:22
               2925       19966      1.1  -  user students      1  7:12:00:00     batch  Sun Mar  1 19:16:22
               2921       19916      1.1  -  user students      1  7:12:00:00     batch  Sun Mar  1 19:18:52
               2922       19916      1.1  -  user students      1  7:12:00:00     batch  Sun Mar  1 19:18:52

Jobs: 5  Total Backlog:  900.00 ProcHours  (3.17 Hours)




BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

3692                 user2        Hold     1    00:00:00  Mon Mar  2 10:52:55
3693                 user2        Hold     1    00:00:00  Mon Mar  2 10:52:55
3694                 user2        Hold     1    00:00:00  Mon Mar  2 10:52:55
3695                 user2        Hold     1    00:00:00  Mon Mar  2 10:52:56
3696                 user2        Hold     1 99:23:59:59  Mon Mar  2 10:52:56
3697                 user2        Hold     1 99:23:59:59  Mon Mar  2 10:52:56
3698                 user2        Hold     1 99:23:59:59  Mon Mar  2 10:52:56
3699                 user2        Hold     1 99:23:59:59  Mon Mar  2 10:52:56
3700                 user2        Hold     1 99:23:59:59  Mon Mar  2 10:52:56
3701                 user2        Hold     1 99:23:59:59  Mon Mar  2 10:52:56

Total Jobs: 62   Active Jobs: 47   Idle Jobs: 5   Blocked Jobs: 10



# checkjob 2901


checking job 2901

State: Idle
Creds:  user:user  group:students  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 7:12:00:00
SubmitTime: Sun Mar  1 19:00:35
   (Time Queued  Total: 17:53:13  Eligible: 17:42:01)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 2048M
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 647  StartCount: 3
PartitionMask: [ALL]
Flags:       HOSTLIST RESTARTABLE
HostList:
   [a07-nll:1]
Reservation '2901' (2:08:40:19 -> 9:20:40:19  Duration: 7:12:00:00)
Messages:  cannot start job - RM failure, rc: 15044, msg: 'Resource temporarily unavailable REJHOST=a07-nll MSG=cannot allocate node 'a07-nll' to job - node not currently available (nps needed/free: 1/0,  joblist: 2539.queen:0,2540.queen:1,2899.queen:2,2542.queen:3)'
PE:  1.00  StartPriority:  20159
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 276  feasible procs:   0

Rejection Reasons: [CPU          :    1][HostList     :   29]
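
Since that rejection names a07-nll specifically, the node can be
cross-checked from both sides -- checknode gives maui's view, while
pbsnodes and momctl ask pbs_server and the mom on the node directly
(node name taken from the message above):

# checknode a07-nll
# pbsnodes a07-nll
# momctl -d 3 -h a07-nll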


# checkjob 3692


checking job 3692

State: Hold
Creds:  user:user2  group:othergroup  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 00:00:00
SubmitTime: Mon Mar  2 10:52:55
   (Time Queued  Total: 2:02:40  Eligible: 00:00:00)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: x86_64  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 400M
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

WARNING:  job has not been detected in 2:02:39
PE:  1.00  StartPriority:  2787
cannot select job 3692 for partition DEFAULT (non-idle state 'Hold')
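
That "job has not been detected" warning makes me suspect pbs_server no
longer knows about these jobs at all.  A rough check over the held job
IDs from the blocked list should show whether torque and maui have
drifted apart:

# for j in 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701; do
>   qstat $j > /dev/null 2>&1 || echo "$j: unknown to pbs_server"
> done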


-- 
James A. Peltier
Systems Analyst (FASNet), VIVARIUM Technical Director
Simon Fraser University - Burnaby Campus
Phone   : 778-782-6573
Fax     : 778-782-3045
E-Mail  : jpeltier at sfu.ca
Website : http://www.fas.sfu.ca | http://vivarium.cs.sfu.ca
            http://blogs.sfu.ca/people/jpeltier
MSN     : subatomic_spam at hotmail.com

Your mouse has moved.  Windows has detected hardware
changes that require a reboot. Click OK to reboot.

