[torqueusers] RE: Nagios Plugins for Torque (Chris Vaughan)
Brian_Gupta at timeinc.com
Mon Apr 16 12:30:48 MDT 2007
I would definitely monitor whether individual queues are working.
I feel that the following should also be monitored and graphed over
time: CPU/node load distribution, job run times, memory usage, and
network IO. If you are using a Cluster file system, I would also monitor
IOWait, and FS usage.
These and other metrics are useful for cluster capacity planning.
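
As a starting point for the queue check mentioned above, here is a minimal sketch of a Nagios-style plugin in Python. It is not an existing plugin. It assumes that `qstat -Q` prints one row per queue with the columns Queue, Max, Tot, Ena, Str, ... and that "yes" in the Ena and Str columns means the queue is enabled and started; please verify that against your own TORQUE version. The function names (parse_qstat_q, check_queues, main) are made up for illustration. The exit codes 0/1/2/3 are the usual Nagios plugin convention for OK/WARNING/CRITICAL/UNKNOWN.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_qstat_q(text):
    """Return {queue: (enabled, started)} from `qstat -Q`-style text.

    Assumption: columns are ordered Queue Max Tot Ena Str ...;
    check this against the output of your own installation.
    """
    queues = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row and the dashed separator row.
        if len(fields) < 5 or fields[0] == "Queue" or fields[0].startswith("-"):
            continue
        queues[fields[0]] = (fields[3] == "yes", fields[4] == "yes")
    return queues


def check_queues(queues):
    """Map parsed queue states to a Nagios (exit code, message) pair."""
    if not queues:
        return UNKNOWN, "QUEUES UNKNOWN - no queues found"
    bad = sorted(q for q, (ena, sta) in queues.items() if not (ena and sta))
    if bad:
        return CRITICAL, "QUEUES CRITICAL - not enabled/started: " + ", ".join(bad)
    return OK, "QUEUES OK - %d queue(s) enabled and started" % len(queues)


def main():
    """Run qstat, print one status line, and return the exit code."""
    try:
        out = subprocess.run(["qstat", "-Q"], capture_output=True,
                             text=True, check=True, timeout=30).stdout
    except (OSError, subprocess.SubprocessError) as exc:
        print("QUEUES UNKNOWN - could not run qstat: %s" % exc)
        return UNKNOWN
    code, message = check_queues(parse_qstat_q(out))
    print(message)
    return code
```

To use it as an actual plugin, the script would end with `import sys; sys.exit(main())` so that Nagios sees the exit code. Parsing is kept separate from running qstat so the logic can be tested without a TORQUE server.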
One thing I am wondering: why don't you use Moab? It integrates with
Torque/OpenPBS and provides monitoring and reporting (among other
things), and your company sells it.
P.S. - I don't work with Torque anymore, but if you are interested, I
can set up a test environment at home and work with you on getting
Nagios/Torque integration working. I've been meaning to test some HA
ideas I had, but since I don't work with it anymore and time has been
scarce, it fell by the wayside.
From: Chris Vaughan <chris at clusterresources.com>
Subject: [torqueusers] Nagios Plugins for Torque
I was wondering if anyone has developed any Nagios plugins for Torque?
If so would you mind sharing?
I'm looking for something to monitor the moms and the server. Can
anyone think of any other services related to torque that would be good
EMEA Systems Engineer
Cluster Resources, Ltd.
Direct - UK Office: +44 (0)1223 437 132
US Headquarters: +1 801 717 3700
Date: Mon, 16 Apr 2007 18:25:14 +0200
From: Jan Ploski <Jan.Ploski at offis.de>
Subject: Re: [torqueusers] Job stdout/stderr file empty after transfer
To: torqueusers at supercluster.org
<OF63219B6C.73AC4ACC-ONC12572BF.0059BCCF-422572BF.005A33AC at offis.uni-old
Content-Type: text/plain; charset="US-ASCII"
torqueusers-bounces at supercluster.org wrote on 04/13/2007 03:22:45 PM:
> I am using TORQUE 2.1.6, trying to transfer stdout of a job using the
> option of qsub. Unfortunately, no matter whether I transfer via scp or set
> up $usecp, the transferred file is created with size 0 (zero). When I use
> the option "-k oe" instead, the file remains in $HOME on the execute
> machine and contains the expected output. Can anyone please explain this
> or give a tip which log file to inspect or what experiments to perform to
> gather more information?
Solved. The disk with /var/spool/torque on the execute machine was full.
I'd classify it as an error handling bug in TORQUE. We had to strace the
child process to debug it - shouldn't be necessary.
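
A full spool disk like this is the kind of condition a monitoring check can catch before jobs start losing output. The following is a rough sketch of a Nagios-style free-space check in Python, not something shipped with TORQUE. The path /var/spool/torque is taken from the message above; the 10% and 5% thresholds and the function names (classify_free, check_spool) are assumptions chosen for illustration.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_free(free_bytes, total_bytes, warn_pct=10.0, crit_pct=5.0):
    """Turn free/total byte counts into a Nagios (exit code, message) pair.

    The default thresholds are illustrative assumptions; tune them to
    the size of your spool file system.
    """
    if total_bytes <= 0:
        return UNKNOWN, "SPOOL UNKNOWN - reported size is zero"
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct < crit_pct:
        code, label = CRITICAL, "CRITICAL"
    elif free_pct < warn_pct:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, "SPOOL %s - %.1f%% free" % (label, free_pct)


def check_spool(path="/var/spool/torque"):
    """Check free space on the file system that holds the spool directory."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, "SPOOL UNKNOWN - cannot stat %s: %s" % (path, exc)
    return classify_free(usage.free, usage.total)
```

Run on each execute host (for example via NRPE), the plugin would print the message and exit with the returned code. It only reports free space; it does not change how TORQUE itself handles a failed write.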
Best regards -
Dipl.-Inform. (FH) Jan Ploski
Escherweg 2 - 26121 Oldenburg - Germany
Fon: +49 441 9722 - 184 Fax: +49 441 9722 - 202
E-Mail: Jan.Ploski at offis.de - URL: http://www.offis.de
torqueusers mailing list
torqueusers at supercluster.org
End of torqueusers Digest, Vol 33, Issue 17