[torqueusers] maui diagnose reports no nodes - Scheduler does not start jobs

Anne M. Hammond hammond at txcorp.com
Wed Feb 28 15:39:13 MST 2007


The maui scheduler still does not start jobs.  However,
"sudo qrun {nn}" works.  Does this mean it's a permissions
issue?

I upgraded maui to patch 19 and implemented Lawrence Sorillo's
change to maui.cfg (as suggested by Paul Gray (thanks Paul)):

# Resource Manager Definition
#
#RMCFG[STORAGE3.CL.TXCORP.COM] TYPE=PBS@RMNMHOST@
RMCFG[STORAGE3.CL.TXCORP.COM] TYPE=PBS
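
A minimal sketch for confirming that no unsubstituted configure placeholders of the form @NAME@ remain on active (uncommented) lines of maui.cfg. The function name is hypothetical and not part of maui; the config path in the usage note is an assumption about the install location.

```shell
# Hypothetical helper (not part of maui): print the active maui.cfg lines
# that still contain an unsubstituted placeholder such as @RMNMHOST@.
find_unsubstituted_placeholders() {
    # First grep: lines containing an @UPPERCASE_NAME@ token, with line numbers.
    # Second grep: drop lines whose content starts with '#' (comment lines).
    grep -n '@[A-Z_][A-Z_]*@' "$1" | grep -v '^[0-9]*:[[:space:]]*#'
}
```

Usage would be something like `find_unsubstituted_placeholders /usr/local/maui/maui.cfg` (path assumed); no output means no active line carries a leftover placeholder.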

maui is running as user hammond.
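
On the permissions question, a rough sketch of one check. As far as I understand it, maui expects the account it runs as to be listed on the ADMIN1 line of maui.cfg, and Torque only accepts run requests from accounts listed as server managers or operators (which `qmgr -c 'print server'` should show). The helper name below is hypothetical, and it only tests whether a user appears anywhere on an active ADMIN1 line, not whether it is listed first.

```shell
# Hypothetical helper (not part of maui): succeed if USER appears on an
# uncommented ADMIN1 line of the given maui.cfg.
admin1_includes_user() {
    cfg=$1
    user=$2
    # A commented line has "#ADMIN1" or "#" as its first field, so it is skipped.
    awk -v u="$user" '
        $1 == "ADMIN1" { for (i = 2; i <= NF; i++) if ($i == u) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$cfg"
}
```

For example, `admin1_includes_user maui.cfg hammond && echo listed` would print "listed" only if hammond is named on an active ADMIN1 line.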

Now diagnose reports the nodes:

[hammond@storage3 bin]$ ./diagnose -n
diagnosing node table (5120 slots)
Name                    State  Procs     Memory         Disk          Swap Speed  Opsys   Arch Par   Load Res Classes                        Network Features
....

Total Nodes: 24  (Active: 20  Idle: 4  Down: 0)

-----

Maui version 3.2.6p19
torque-2.1.6

--------

But maui still does not start a queued job.  The job runs fine
if I start it by hand with "sudo qrun {nn}".

In the log, maui checks every 30 seconds:
02/28 15:31:36 INFO:     scheduling complete.  sleeping 30 seconds

but jobs remain queued:

                                                                    Req'd Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
83.storage3.cl.txcor swsides  s3opt8   a059111920  24440     5   2 4000mb 24:00 R 01:03
85.storage3.cl.txcor swsides  s3opt8   a059104223  21310     5   2 4000mb 24:00 R 00:36
86.storage3.cl.txcor swsides  s3opt8   a059104403  20584     5   2 4000mb 24:00 R 00:36
87.storage3.cl.txcor swsides  s3opt8   a059104528   5971     5   2 4000mb 24:00 R 00:36
88.storage3.cl.txcor hammond  s3opt24  s.sh          --      1   2 4000mb 24:00 Q   --
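
A small sketch for pulling just the queued job IDs out of a listing like this one, assuming unwrapped `qstat -a` output in which the state is the tenth whitespace-separated field and job names contain no spaces. The function name is hypothetical.

```shell
# Hypothetical helper: read `qstat -a` style output on stdin and print the
# job ID of every job whose state column (field 10) is Q.
queued_jobs() {
    # Header and separator lines never have "Q" in field 10, so they are skipped.
    awk 'NF >= 11 && $10 == "Q" { print $1 }'
}
```

It would be used as `qstat -a | queued_jobs`.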

This is from maui.log for job 88:

02/28 15:35:13 INFO:     job '88' Priority:        1A
02/28 15:35:13 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)

Any suggestions appreciated.

Anne

Anne M. Hammond - Systems / Network Administration - Tech-X Corp
                   hammond_at_txcorp.com 720-974-1840

On Tue, 6 Feb 2007, Paul Gray wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Mon, Feb 05, 2007 at 11:48:31PM -0700, Anne Hammond wrote:
>> torque 2.1.6
>> maui client version 3.2.6p18
>>
>> The maui scheduler is not starting jobs.  They will start if you
>> do a "qrun nn", but otherwise they remain queued.
>>
>> I've searched for the cause and found a couple of symptoms:
>>
>> [hammond@storage3 bin]$  sudo ./diagnose -n --host=storage3.xx.xxxxxx.com -v
>> diagnosing node table (5120 slots)
>> Name                    State  Procs     Memory         Disk          Swap Speed  Opsys   Arch Par   Load Res Classes                        Network Features
>>
>> -----                     ---   0:0        0:0           0:0           0:0
>>
>> Total Nodes: 0  (Active: 0  Idle: 0  Down: 0)
>>
>>
>> --------------------------------
>> However, pbsnodes -a lists all nodes as free.
>>
>> This also fails:
>>
>> [hammond@storage3 bin]$ sudo ./checkjob 51.storage3.xx.xxxxxx.com
>> ERROR:    'checkjob' failed
>> ERROR:  cannot locate job '51.storage3.xx.xxxxxx.com'
>> -------------------------------
>
> These symptoms are similar to those that I have when configuring Maui on
> Debian boxes.  Maui starts, torque is going strong, but the two just don't want to
> communicate.  Your issue might be caused by the same maui.cfg configuration
> recently discussed on the mauiusers list here:
>   http://www.supercluster.org/pipermail/mauiusers/2007-February/thread.html
>
> See if Lawrence's suggestion on tweaking the Resource Manager's Definition
> and restarting maui helps to address the issue.
>
> - --
> Paul Gray                                         -o)
> 314 East Gym, Dept. of Computer Science           /\\
> University of Northern Iowa                      _\_V
> Message void if penguin violated ...  Don't mess with the penguin
> No one says, "Hey, I can't read that ASCII attachment ya sent me."
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.6 (GNU/Linux)
>
> iD8DBQFFyIPWOH45TZW7mh4RAnUaAJwPQa39Or/ns2demmi5tGF34KwxZgCg7TCR
> 5TKNheBj/1e69atXuav9rrE=
> =Lc2O
> -----END PGP SIGNATURE-----
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

