[torqueusers] pbs_mom getsize() failed errors.
James A. Peltier
jpeltier at cs.sfu.ca
Sat Mar 14 02:10:42 MDT 2009
On Thu, 5 Mar 2009, James A. Peltier wrote:
> Hi All,
>
> Things are starting to stabilize on my cluster again. However, a couple of
> nodes are still seeing errors. The i386 nodes seem to be the issue now.
>
> Mar 5 20:58:29 a08-nll pbs_mom: TMomFinalizeChild, about to create cpuset
> for job 10204.queen.
> Mar 5 20:58:29 a08-nll pbs_mom: create_jobset, CPUSET: 0 job 10204.queen
> path /dev/cpuset/torque/10204.queen/cpus
> Mar 5 20:58:29 a08-nll pbs_mom: create_jobset, TASKSET:
> /dev/cpuset/torque/10204.queen/0/cpus cpus 0
> Mar 5 20:58:29 a08-nll pbs_mom: move_to_jobset, CPUSET MOVE:
> /dev/cpuset/torque/10204.queen/tasks 9486
> Mar 5 20:58:29 a08-nll pbs_mom: Bad file descriptor (9) in
> TMomFinalizeChild, getsize() failed for mem/pmem in mom_set_limits
>
>
> a08-nll
> state = free
> np = 4
> properties = matlab,freesurfer_v4.1.0
> ntype = cluster
> status = arch=i386,opsys=linux,uname=Linux a08-nll
> 2.6.18-92.1.22.el5PAE #1 SMP Tue Dec 16 12:36:25 EST 2008 i686,sessions=2527
> 2612 3548 9247 21042 24153 27555
> 28510,nsessions=8,nusers=7,idletime=1730,totmem=6715704kb,availmem=6366156kb,physmem=4675460kb,ncpus=4,loadave=0.29,netload=632908548,size=25348656kb:25614624kb,state=free,jobs=,varattr=,rectime=1236316680
I found that I was able to stop these errors by switching from a PAE
kernel to a standard kernel. This limits the amount of memory available,
but it has stopped the errors from occurring. Perhaps some code does not
properly take PAE into account, but for the time being the problem is
worked around.
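One possible explanation, which I have not confirmed against the TORQUE source, is a 32-bit overflow. The sketch below assumes that the memory sizes are converted to bytes and held in a C unsigned long, which is 32 bits wide on i386. Under that assumption, the physmem and totmem values from the status line quoted above no longer fit once the PAE kernel exposes more than 4 GB. The function names are made up for illustration and are not pbs_mom functions.

```python
# Values copied from the node status line above, in kilobytes.
PHYSMEM_KB = 4675460
TOTMEM_KB = 6715704

# Assumption: on i386 a C unsigned long is 32 bits wide.
ULONG_MAX_32 = 2**32 - 1


def fits_in_32bit_ulong(size_kb):
    """Return True if size_kb, converted to bytes, fits in 32 bits."""
    return size_kb * 1024 <= ULONG_MAX_32


def wrapped_bytes(size_kb):
    """Byte count as a 32-bit unsigned multiply would yield it (mod 2**32)."""
    return (size_kb * 1024) % 2**32


if __name__ == "__main__":
    for name, kb in (("physmem", PHYSMEM_KB), ("totmem", TOTMEM_KB)):
        print(name, kb * 1024, fits_in_32bit_ulong(kb), wrapped_bytes(kb))
```

With a standard (non-PAE) kernel the reported memory stays below 4 GB, so the byte counts would fit in 32 bits. That is at least consistent with the errors going away, although it does not prove this is the cause.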
--
James A. Peltier
Systems Analyst (FASNet), VIVARIUM Technical Director
Simon Fraser University - Burnaby Campus
Phone : 778-782-6573
Fax : 778-782-3045
E-Mail : jpeltier at sfu.ca
Website : http://www.fas.sfu.ca | http://vivarium.cs.sfu.ca
http://blogs.sfu.ca/people/jpeltier
MSN : subatomic_spam at hotmail.com
Your mouse has moved. Windows has detected hardware
changes that require a reboot. Click OK to reboot.