[torqueusers] maui diagnose reports no nodes - Scheduler does not start jobs
Anne M. Hammond
hammond at txcorp.com
Wed Feb 28 15:39:13 MST 2007
The maui scheduler still does not start jobs. However,
"sudo qrun {nn}" works. Does this mean it's a permissions
issue?
I upgraded maui to patch 19 and implemented Lawrence Sorillo's
change to maui.cfg (as suggested by Paul Gray (thanks Paul)):
# Resource Manager Definition
#
#RMCFG[STORAGE3.CL.TXCORP.COM] TYPE=PBS@RMNMHOST@
RMCFG[STORAGE3.CL.TXCORP.COM] TYPE=PBS
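For anyone who wants to script the same edit, here is a minimal sketch. It assumes the shipped line ended in the literal placeholder "@RMNMHOST@" (the list archive renders "@" as " at "). The helper name strip_rmnmhost is hypothetical and not part of Maui; it only rewrites text and leaves commented-out lines alone.

```python
import re

# Hypothetical helper, not part of Maui. Matches an active (uncommented)
# RMCFG line whose TYPE=PBS value still carries the "@RMNMHOST@" placeholder.
_PLACEHOLDER = re.compile(r"^(\s*RMCFG\[[^\]]+\]\s+TYPE=PBS)@RMNMHOST@\s*$")


def strip_rmnmhost(cfg_text):
    """Return cfg_text with the placeholder removed from active RMCFG lines."""
    fixed = []
    for line in cfg_text.splitlines():
        # Lines starting with "#" do not match the pattern and pass through.
        fixed.append(_PLACEHOLDER.sub(r"\1", line))
    return "\n".join(fixed)


sample = (
    "# Resource Manager Definition\n"
    "RMCFG[STORAGE3.CL.TXCORP.COM] TYPE=PBS@RMNMHOST@"
)
print(strip_rmnmhost(sample))
```

Maui has to be restarted after maui.cfg changes, as Paul noted.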
maui is running as user hammond.
Now diagnose reports the nodes:
[hammond@storage3 bin]$ ./diagnose -n
diagnosing node table (5120 slots)
Name State Procs Memory Disk Swap Speed Opsys Arch Par Load Res Classes Network Features
....
Total Nodes: 24 (Active: 20 Idle: 4 Down: 0)
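A rough way to check this automatically is to parse the summary line. This is only a sketch: it assumes the "Total Nodes: ..." line has exactly the layout shown in this message, and node_summary is a hypothetical helper name.

```python
import re

# Pattern for the summary line as it appears in this message; other Maui
# versions may format it differently (assumption).
_SUMMARY = re.compile(
    r"Total Nodes:\s*(\d+)\s*\(Active:\s*(\d+)\s+Idle:\s*(\d+)\s+Down:\s*(\d+)\)"
)


def node_summary(diagnose_output):
    """Return node counts from 'diagnose -n' text, or None if no summary line."""
    m = _SUMMARY.search(diagnose_output)
    if m is None:
        return None
    total, active, idle, down = (int(g) for g in m.groups())
    return {"total": total, "active": active, "idle": idle, "down": down}


counts = node_summary("Total Nodes: 24 (Active: 20 Idle: 4 Down: 0)")
if counts is None or counts["total"] == 0:
    print("maui sees no nodes; check the RMCFG line in maui.cfg")
else:
    print("maui sees %d nodes" % counts["total"])
```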
-----
Maui version 3.2.6p19
torque-2.1.6
--------
But maui still does not start a queued job, although the job runs
fine when started manually with "sudo qrun {nn}".
In the log, maui checks every 30 seconds:
02/28 15:31:36 INFO: scheduling complete. sleeping 30 seconds
but jobs remain queued:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
83.storage3.cl.txcor swsides  s3opt8   a059111920  24440     5   2 4000mb 24:00 R 01:03
85.storage3.cl.txcor swsides  s3opt8   a059104223  21310     5   2 4000mb 24:00 R 00:36
86.storage3.cl.txcor swsides  s3opt8   a059104403  20584     5   2 4000mb 24:00 R 00:36
87.storage3.cl.txcor swsides  s3opt8   a059104528   5971     5   2 4000mb 24:00 R 00:36
88.storage3.cl.txcor hammond  s3opt24  s.sh           --     1   2 4000mb 24:00 Q    --
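To pick out the stuck jobs from a listing like this, something along these lines should work. It is a sketch under assumptions: each job occupies a single unwrapped line with the eleven columns shown here, the state is the tenth column, and queued_jobs is a hypothetical helper name.

```python
def queued_jobs(qstat_text):
    """Return the job IDs whose state column is 'Q' in qstat-style text."""
    queued = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job rows have 11 columns and a job ID starting with a digit;
        # header and separator lines fail one of these checks.
        if len(fields) == 11 and fields[0][0].isdigit() and fields[9] == "Q":
            queued.append(fields[0])
    return queued


sample = """\
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
83.storage3.cl.txcor swsides s3opt8 a059111920 24440 5 2 4000mb 24:00 R 01:03
88.storage3.cl.txcor hammond s3opt24 s.sh -- 1 2 4000mb 24:00 Q --
"""
print(queued_jobs(sample))
```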
This is from maui.log for job 88:
02/28 15:35:13 INFO: job '88' Priority: 1A
02/28 15:35:13 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
Any suggestions appreciated.
Anne
Anne M. Hammond - Systems / Network Administration - Tech-X Corp
hammond_at_txcorp.com 720-974-1840
On Tue, 6 Feb 2007, Paul Gray wrote:
> On Mon, Feb 05, 2007 at 11:48:31PM -0700, Anne Hammond wrote:
>> torque 2.1.6
>> maui client version 3.2.6p18
>>
>> The maui scheduler is not starting jobs. They will start if you
>> do a "qrun nn", but otherwise they remain queued.
>>
>> I've searched for the cause and found a couple of symptoms:
>>
>> [hammond@storage3 bin]$ sudo ./diagnose -n --host=storage3.xx.xxxxxx.com -v
>> diagnosing node table (5120 slots)
>> Name State Procs Memory Disk Swap Speed Opsys Arch Par Load Res Classes Network Features
>>
>> ----- --- 0:0 0:0 0:0 0:0
>>
>> Total Nodes: 0 (Active: 0 Idle: 0 Down: 0)
>>
>>
>> --------------------------------
>> However, pbsnodes -a lists all nodes as free.
>>
>> This also fails:
>>
>> [hammond@storage3 bin]$ sudo ./checkjob 51.storage3.xx.xxxxxx.com
>> ERROR: 'checkjob' failed
>> ERROR: cannot locate job '51.storage3.xx.xxxxxx.com'
>> -------------------------------
>
> These symptoms are similar to those that I have when configuring Maui on
> Debian boxes. Maui starts, torque is going strong, but the two just don't want to
> communicate. Your issue might be caused by the same maui.cfg configuration
> recently discussed on the mauiusers list here:
> http://www.supercluster.org/pipermail/mauiusers/2007-February/thread.html
>
> See if Lawrence's suggestion on tweaking the Resource Manager's Definition
> and restarting maui helps to address the issue.
>
> --
> Paul Gray -o)
> 314 East Gym, Dept. of Computer Science /\\
> University of Northern Iowa _\_V
> Message void if penguin violated ... Don't mess with the penguin
> No one says, "Hey, I can't read that ASCII attachment ya sent me."