[torquedev] Re: [torqueusers] Torque 2.3.1-snap.200805191843 out of filedescriptor errors

Chad Vizino vizino at psc.edu
Wed May 28 06:36:24 MDT 2008


Hi Chris,

I haven't looked at the source recently, but there was a file descriptor
leak in cpuset cleanup, and that could be the problem here.  I'm not sure
whether a fix has been applied yet; see my note below, with a fix, from a
couple of months ago.

Regards,

   -Chad

> Subject: [torquedev] 2.3.0 cpuset fixes
> Date: Thu, 27 Mar 2008 19:45:14 -0400
> From: Chad Vizino <vizino at psc.edu>
> To: torquedev at supercluster.org
> 
> I misdirected the message below to torquedev-request.  Sorry about that.
> 
> Perhaps the fixes below could be considered for the next 2.3.0 snap if 
> they haven't been repaired already.
> 
>   -Chad
> 
> -------- Original Message --------
> Subject: Re: [torqueusers] ncpus=? 0
> Date: Thu, 27 Mar 2008 08:20:52 -0400
> From: Chad Vizino <vizino at psc.edu>
> To: siri <didier.siri at univ-provence.fr>
> CC: torquedev-request at supercluster.org, torqueusers at supercluster.org
> References: <F020B8F2-B338-45A4-8555-AD3E4FBF705A at univ-provence.fr>
> 
> Greetings,
> 
> We have been using 2.3.0 (the release, not a snap) for a week or so on
> our Altix systems (one with 144 cpus, the other with 768) with cpusets
> enabled.  There are a few bugs that you should be aware of:
> 
> 1) ncpus not obtained and showing ncpus=? in pbsnodes output.
> 
> Fix: src/resmom/linux/mom_mach.c
> at lines 3333-3338, delete if condition around fscanf.
> 
> 2) cpuset handling has a logic error in displaying cleanup messages.
> There's a file descriptor leak in cpuset cleanup and, left unchecked,
> pbs_mom will hit its file descriptor limit and stop working.  Also,
> cpuset cleanups are slow (N seconds, where N is the number of cpus in
> the cpuset) due to an unnecessary sleep.  Finally, depending on how big
> your machine is, you may need to increase the size of an array in the
> cpuset creation routine.
> 
> Fix: src/resmom/linux/cpuset.c
> at line 67, remove "!" before cpuset_delete(childpath)
> at line 84, add "fclose(fd);"
> at line 85, delete "sleep(1);"
> at line 92, add "closedir(dir);"
> at line 222, array size for cpusbuf[] may not be big enough; it depends
> on how many cpus you have (allow about 4 chars per cpu to be safe)
> 
> Not a bug, but depending on the size of your machine, you may need to
> increase the number of file descriptors per process in
> setup_program_environment() in src/resmom/mom_mach.c at line 6271.  We
> added this and upped the limit in /etc/security/limits.conf:
> 
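>   /* temporary hack to work around 1024 limit */
>   if (getrlimit(RLIMIT_NOFILE, &rlimit) == 0 &&
>       (rlimit.rlim_cur < 4096 || rlimit.rlim_max < 4096)) {
>           /* raise whichever limit is below 4096; never lower one */
>           if (rlimit.rlim_cur < 4096)
>                   rlimit.rlim_cur = 4096;
>           if (rlimit.rlim_max < 4096)
>                   rlimit.rlim_max = 4096;
>           setrlimit(RLIMIT_NOFILE, &rlimit);
>   }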
> 
> 
> In our server "nodes" file we have:
> 
> host np=N
> 
> (note no ":ts" after host).  We choose to make N 4 cpus smaller than the
> physical number of cpus on the system since our boot cpuset is 4 cpus
> and 1 memory node.
> 
> When submitting a job, use "-l nodes=1:ppn=4" for example.  cpusets are
> not constructed when using "-l ncpus=...".
> 
> Hope this helps a little.  We're still playing with
> exclusive/non-exclusive cpuset settings and limiting the job to specific
> memory nodes to see how things work.  I'd be interested in your experiences.
> 
> Regards,
> 
> Chad Vizino


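On the cpusbuf[] sizing remark in the quoted note: one way to avoid guessing
a fixed array size is to derive the buffer length from the cpu count.  The
sketch below assumes the buffer holds a comma-separated list of cpu indices,
which the note does not actually state, and format_cpu_list() is a
hypothetical helper rather than anything in Torque.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: build "0,1,...,ncpus-1" in a heap buffer whose
 * size is computed from ncpus.  The caller frees the result.
 * Returns NULL on bad input or allocation failure.
 */
char *format_cpu_list(int ncpus)
{
    size_t digits = 1;      /* decimal digits in the largest index */
    size_t cap, used = 0;
    char *buf;
    int n, i;

    if (ncpus <= 0)
        return NULL;

    for (n = ncpus - 1; n >= 10; n /= 10)
        digits++;

    /* worst case per cpu: all digits plus one separator; +1 for the NUL */
    cap = (size_t)ncpus * (digits + 1) + 1;
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (i = 0; i < ncpus; i++) {
        int w = snprintf(buf + used, cap - used, i ? ",%d" : "%d", i);

        if (w < 0 || (size_t)w >= cap - used) {
            free(buf);      /* should not happen given the bound above */
            return NULL;
        }
        used += (size_t)w;
    }
    return buf;
}
```

For the 768-cpu machine mentioned in the note this allocates
768 * 4 + 1 = 3073 bytes, in line with the "about 4 chars per cpu" estimate.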

Chris Samuel wrote:
> Never quite sure whether these should go to the users
> or the dev list, so this is going to both. :-)
> 
> With Torque 2.3.1-snap.200805191843 I'm suddenly seeing
> pbs_moms dying with:
> 
> 05/28/2008 16:23:22;0001;   pbs_mom;Svr;pbs_mom;Too many open files (24) in mom_get_sample, 31772: get_proc_stat
> 
> This is odd because the maximum number of open files
> according to ulimit is 1024:
> 
> open files                      (-n) 1024
> 
> This is with cpusets enabled and with two simple patches
> applied to extend inter-mom TCP timeouts and to put
> tm_spawn tasks in the jobset rather than the per-vnode
> cpusets (as we're using OpenMPI).
> 
> Looking at the output of lsof they are almost all open file
> descriptors on various deleted cpuset files for past jobs,
> and I've attached a sample output for one that hasn't (yet)
> keeled over.
> 
> Will attempt to see if I can spot why they're not getting closed..
> 
> cheers!
> Chris

