[torqueusers] torque requires --enable-debug to work

Andrew Sharpe andrew.sharpe1 at jcu.edu.au
Tue Nov 22 02:26:30 MST 2005


Hi again, and thanks for the suggestions, Garrick. I didn't actually get to 
try them out, as I found the problem instead.

It turns out the problem was the size of our LDAP database, which is very 
big.  The mom would take so long to go through the groups (see init_groups 
in start_exec.c) that the server would time out and report failure. The 
client, however, would continue running (as no fault had occurred), then 
try to submit the obituary for the job and fail.
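
For anyone wanting to check whether their own directory is big enough to 
cause this, here is a minimal sketch (not part of torque) that times one 
full pass over the group database using the same setgrent/getgrent/endgrent 
calls. The function name count_groups_timed is made up for illustration, 
and the numbers will depend on your nsswitch setup.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <grp.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper, not torque code: walk every entry in the group
 * database once and report how long the walk took.
 * Returns the number of group entries seen.
 */
long count_groups_timed(double *elapsed_sec)
{
    struct timespec start, end;
    long count = 0;

    assert(elapsed_sec != NULL);

    clock_gettime(CLOCK_MONOTONIC, &start);

    setgrent();                    /* rewind the group database */
    while (getgrent() != NULL)     /* each call fetches the next entry */
        count++;
    endgrent();                    /* release any resources held */

    clock_gettime(CLOCK_MONOTONIC, &end);

    *elapsed_sec = (double)(end.tv_sec - start.tv_sec)
                 + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return count;
}
```

Calling this from a small main() and printing the count and elapsed time 
on a compute node gives a rough idea of how the delay compares with the 
server's tcp_timeout.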

I tried some simple things, like defining NGROUPS_MAX to something 
small such as 16 and setting /proc/sys/kernel/ngroups_max to the same 
value, with no success.  It seems as though setgrent/getgrent/endgrent 
want to read all of the groups no matter what.  Even when using nscd 
(name service cache daemon), all of the groups get read (at least on the 
first pass).  So my solution was to return from the function init_groups 
(start_exec.c:4178) just before the call to setgrent.
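
To illustrate the shape of that change, here is a rough standalone sketch. 
It is not the torque source: collect_groups and its scan_all parameter are 
hypothetical names, and the real init_groups differs in detail. It only 
shows the primary group being recorded first and the full enumeration 
being skipped when asked.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <grp.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical stand-in for the patched logic, not the torque function.
 * Stores the primary gid in out[0]; if scan_all is nonzero, also walks
 * the whole group database looking for groups that list `user` as a
 * member. Returns the number of gids written to out (at most max).
 */
int collect_groups(const char *user, gid_t primary,
                   gid_t *out, int max, int scan_all)
{
    struct group *grp;
    int n = 0;

    assert(user != NULL && out != NULL && max > 0);

    out[n++] = primary;

    if (!scan_all)
        return n;                  /* early return: no enumeration at all */

    setgrent();
    while (n < max && (grp = getgrent()) != NULL) {
        if (grp->gr_gid == primary)
            continue;              /* already recorded */
        for (char **m = grp->gr_mem; *m != NULL; m++) {
            if (strcmp(*m, user) == 0) {
                out[n++] = grp->gr_gid;
                break;
            }
        }
    }
    endgrent();

    return n;
}
```

The obvious cost of the early return is that the job only gets the user's 
primary group, so supplementary group memberships are not picked up.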

I've attached a patch that allows the compilation option 
--disable-group-cache to (surprisingly) disable the group caching. If 
the option is not specified, group caching is enabled by default.  I 
couldn't get autoconf (or whatever is needed) to play nice, so I just 
patched configure and the source directly.
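
One caveat: the patched source tests GRP_CACHE as an ordinary C expression, 
so it relies on configure always defining it as 0 or 1. If the header is 
ever generated some other way, a fallback along these lines would keep the 
default behaviour. This is only a sketch and is not in the attached patch; 
group_cache_enabled is a made-up name.

```c
#include <assert.h>

/* Assumed fallback, not part of the patch: if configure did not define
 * GRP_CACHE, default to caching enabled (the documented default). */
#ifndef GRP_CACHE
#define GRP_CACHE 1
#endif

/* Hypothetical helper: returns 1 when group caching is compiled in. */
int group_cache_enabled(void)
{
    return GRP_CACHE ? 1 : 0;
}
```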

Maybe this will help someone sometime, maybe not.

Thanks again for your time and efforts, Andrew


 
Andrew Sharpe wrote:

> Hi Garrick, thanks for the reply.
>
> I've already tried without "--disable-rpp" with the same results, and 
> I've tried a heap of other configurations as well.  I'll have a go at 
> everything you've suggested and post results.
>
> Thanks, Andrew
>
> Garrick Staples wrote:
>
>> On Wed, Oct 12, 2005 at 01:50:35PM +1000, Andrew Sharpe alleged:
>>  
>>
>>> Hi all,
>>>
>>> I'm fairly new to torque but I think this is non-standard 
>>> behaviour.  I've verified that this problem exists on
>>> torque-1.2.0p4
>>> torque-1.2.0p6
>>> torque-1.2.0p7-snap.1127772314
>>> using CentOS4.1 (full install) on x86_64.
>>>
>>> The problem is that torque only works if I compile it with 
>>> --enable-debug.  Here are the steps I follow to obtain the results:
>>>
>>> NOTE: pbs is a CNAME to machine1, which correctly resolves to 
>>> 10.1.1.12, machine2 correctly resolves to 10.1.1.13 - by resolves I 
>>> mean forward and reverse lookups are ok.
>>>
>>> 1. compile any of the above versions using the following commands
>>> ./configure --prefix=/usr/local --set-default-server=pbs 
>>> --set-server-home=/var/spool/PBS --enable-server --enable-docs 
>>> --enable-mom --enable-clients --enable-syslog --disable-rpp
>>> make
>>> make install
>>>   
>>
>>
>> Thanks for the very good debugging info!  Did you happen to check for
>> anything in /var/log/messages?  Since you configured with
>> --enable-syslog, there might be some valuable info.
>>
>> What is the server tcp_timeout ('p s tcp_timeout' in qmgr)?  That should
>> be at least 6 or so.
>>
>> Please do 'set server log_level = 7' in qmgr and run 'momctl -q
>> loglevel=7 -h machine1,machine2' and retry the tests.  That will add
>> additional info to the server and mom logs.
>>
>> Also try again without '--disable-rpp' (be sure to reinstall and
>> restart server and moms).
>>
>>  
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>  
>>
>
-------------- next part --------------
diff -ruN torque-1.2.0p4.orig/configure torque-1.2.0p4/configure
--- torque-1.2.0p4.orig/configure	2005-05-06 03:41:14.000000000 +1000
+++ torque-1.2.0p4/configure	2005-11-22 18:54:37.937548664 +1000
@@ -191,6 +191,9 @@
   --enable-syslog         enable the use of syslog for error reporting"
 ac_help="$ac_help
 
+  --enable-group-cache    enable caching of group information"
+ac_help="$ac_help
+
   --set-sched=TYPE        sets the scheduler type. If TYPE is
                           \"c\" the scheduler will be written in C
                           \"tcl\" the server will use a Tcl based scheduler
@@ -1450,6 +1453,23 @@
 fi
 
 
+# Check whether --enable-group-cache or --disable-group-cache was given.
+if test "${enable_group_cache+set}" = set; then
+  enableval="$enable_group_cache"
+  :
+fi
+
+if test "${enable_group_cache}" = "no" ; then
+    cat >> confdefs.h <<\EOF
+#define GRP_CACHE 0
+EOF
+else
+    cat >> confdefs.h <<\EOF
+#define GRP_CACHE 1
+EOF
+fi
+
+
 # Check whether --set-sched was given.
 if test "${set_sched+zzq}" = zzq; then
   setval="$set_sched"
diff -ruN torque-1.2.0p4.orig/src/include/pbs_config.h.in torque-1.2.0p4/src/include/pbs_config.h.in
--- torque-1.2.0p4.orig/src/include/pbs_config.h.in	2005-05-06 03:41:08.000000000 +1000
+++ torque-1.2.0p4/src/include/pbs_config.h.in	2005-11-22 18:56:15.239760848 +1000
@@ -171,6 +171,9 @@
 /* PBS specific: Define if PBS should use syslog for error reporting */
 #undef SYSLOG
 
+/* PBS specific: Define if PBS should use caching for group info */
+#undef GRP_CACHE
+
 /* PBS specific: Define if PBS should use Tcl in its tools */
 #undef TCL
 
diff -ruN torque-1.2.0p4.orig/src/resmom/start_exec.c torque-1.2.0p4/src/resmom/start_exec.c
--- torque-1.2.0p4.orig/src/resmom/start_exec.c	2005-06-03 03:07:25.000000000 +1000
+++ torque-1.2.0p4/src/resmom/start_exec.c	2005-11-22 18:55:26.804126389 +1000
@@ -4192,6 +4192,9 @@
   if (pwgrp != 0)
     *(groups + n++) = pwgrp;
 
+  if(!GRP_CACHE)
+    return n;
+
   setgrent();
 
   while ((grp = getgrent())) 

