[torqueusers] Kernel upgrade breaks torque's idea of how many cpus a
node has
John Hanks
griznog at gmail.com
Mon Sep 1 06:42:18 MDT 2008
I just upgraded the kernel on my compute nodes from the stock CentOS
2.6.18-53.1.21.el5 to a vanilla 2.6.25 with the perfctr patch applied, and
now I'm seeing some odd behavior. On nodes with the stock kernel, this
works fine:
qsub -I -l nodes=1:ppn=4
but on nodes with the new vanilla kernel it fails, and checkjob
reports several different reasons. While the job is waiting to start,
checkjob -v JOBID alternates between these two messages:
NOTE: job violates constraints for partition Moab (startdate in '00:00:01')
and
NOTE: job can run in partition Moab (4 procs available 4 procs required)
Then eventually it is rejected with
Message[0] job cancelled - job was rejected
Starting a job with:
qsub -I -l nodes=1:ppn=3
has no problems. momctl reports the same CPU count for a stock
kernel node and a patched/vanilla kernel node:
# Stock
[root@jobs ~]# momctl -q ncpus -h uinta-0000
uinta-0000: ncpus = 'ncpus=4'
# Patched/vanilla
[root@jobs ~]# momctl -q ncpus -h uinta-0001
uinta-0001: ncpus = 'ncpus=4'
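A possible cross-check is to count the processors the kernel itself
exposes on each node and compare that with the momctl value. This is a
minimal sketch, assuming /proc/cpuinfo lists one "processor" line per
logical CPU on both kernels; count_cpus is a hypothetical helper name,
not part of torque or Moab:

```shell
# Hypothetical helper: count "processor" entries in a cpuinfo-style file.
# Defaults to /proc/cpuinfo; the optional argument allows pointing it at
# a saved copy of the file from another node.
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# When run on a compute node, print the count for comparison with momctl.
if [ -r /proc/cpuinfo ]; then
    echo "kernel reports $(count_cpus) cpus"
fi
```

If the count matches ncpus=4 on both nodes, the kernel's CPU count can
probably be ruled out and the difference lies elsewhere.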
Any suggestions as to why the newer kernel causes this and how to fix
it would be greatly appreciated.
Thanks,
jbh