[torqueusers] pbs_mom getsize() failed errors.

James A. Peltier jpeltier at cs.sfu.ca
Sat Mar 14 02:10:42 MDT 2009


On Thu, 5 Mar 2009, James A. Peltier wrote:

> Hi All,
>
> Things are starting to stabilize on my cluster again.  However, a couple of 
> nodes are still seeing errors.  i386 nodes seem to be at issue now.
>
> Mar  5 20:58:29 a08-nll pbs_mom: TMomFinalizeChild, about to create cpuset 
> for job 10204.queen.
> Mar  5 20:58:29 a08-nll pbs_mom: create_jobset, CPUSET: 0 job 10204.queen 
> path /dev/cpuset/torque/10204.queen/cpus
> Mar  5 20:58:29 a08-nll pbs_mom: create_jobset, TASKSET: 
> /dev/cpuset/torque/10204.queen/0/cpus cpus 0
> Mar  5 20:58:29 a08-nll pbs_mom: move_to_jobset, CPUSET MOVE: 
> /dev/cpuset/torque/10204.queen/tasks  9486
> Mar  5 20:58:29 a08-nll pbs_mom: Bad file descriptor (9) in 
> TMomFinalizeChild, getsize() failed for mem/pmem in mom_set_limits
>
>
> a08-nll
>      state = free
>      np = 4
>      properties = matlab,freesurfer_v4.1.0
>      ntype = cluster
>      status = arch=i386,opsys=linux,uname=Linux a08-nll 
> 2.6.18-92.1.22.el5PAE #1 SMP Tue Dec 16 12:36:25 EST 2008 i686,sessions=2527 
> 2612 3548 9247 21042 24153 27555 
> 28510,nsessions=8,nusers=7,idletime=1730,totmem=6715704kb,availmem=6366156kb,physmem=4675460kb,ncpus=4,loadave=0.29,netload=632908548,size=25348656kb:25614624kb,state=free,jobs=,varattr=,rectime=1236316680
I found that I was able to stop these errors by switching from a PAE 
kernel to a standard kernel.  This limits the amount of memory available 
to the node but has stopped the errors from occurring.  Perhaps there is 
code in pbs_mom that is not written to take PAE into account, but for the 
time being this is fixed.
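
One possible explanation, which I have not checked against the TORQUE 
source, is that a memory size gets converted to bytes and stored in a 
32-bit unsigned long on i386.  The sketch below only does that arithmetic 
for the values in the status line above.  The byte conversion and the 
helper name fits_in_32bit_bytes are assumptions made for illustration, 
not anything taken from pbs_mom.

```python
# Sketch only: checks whether a size reported in kb still fits in a 32-bit
# unsigned long once converted to bytes. That pbs_mom performs this
# conversion is an assumption, not confirmed from the TORQUE source.

ULONG_MAX_32 = 2**32 - 1  # largest value of a 32-bit unsigned long


def fits_in_32bit_bytes(size_kb):
    """Return True if size_kb kilobytes, expressed in bytes, fits in 32 bits."""
    return size_kb * 1024 <= ULONG_MAX_32


# Values copied from the pbsnodes status line for a08-nll.
reported_kb = {
    "physmem": 4675460,
    "availmem": 6366156,
    "totmem": 6715704,
}

for name, kb in reported_kb.items():
    # Print the size in bytes and whether it fits in 32 bits.
    print(name, kb * 1024, fits_in_32bit_bytes(kb))
```

All three values come out larger than a 32-bit unsigned long can hold.  
With a standard kernel, physmem stays below about 4 GB, which would be 
consistent with the errors going away, but that is a guess on my part.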

-- 
James A. Peltier
Systems Analyst (FASNet), VIVARIUM Technical Director
Simon Fraser University - Burnaby Campus
Phone   : 778-782-6573
Fax     : 778-782-3045
E-Mail  : jpeltier at sfu.ca
Website : http://www.fas.sfu.ca | http://vivarium.cs.sfu.ca
            http://blogs.sfu.ca/people/jpeltier
MSN     : subatomic_spam at hotmail.com

Your mouse has moved.  Windows has detected hardware
changes that require a reboot. Click OK to reboot.

