[torqueusers] Kernel upgrade breaks torque's idea of how many cpus a node has

John Hanks griznog at gmail.com
Mon Sep 1 06:42:18 MDT 2008


I just upgraded the kernel on my compute nodes from the stock CentOS
2.6.18-53.1.21.el5 to a vanilla 2.6.25 with the perfctr patch applied,
and now I'm seeing some odd behavior. On nodes with the stock kernel,
this works fine:

qsub -I -l nodes=1:ppn=4

but on nodes with the new vanilla kernel it fails, and checkjob
reports several different reasons. While the job is waiting to start,
checkjob -v JOBID alternates between these two messages:

NOTE:  job violates constraints for partition Moab (startdate in '00:00:01')

and

NOTE:  job can run in partition Moab (4 procs available  4 procs required)

Eventually the job is rejected with:

Message[0] job cancelled - job was rejected

Starting a job with:

qsub -I -l nodes=1:ppn=3

has no problems. momctl doesn't report any difference between a stock
kernel node and a patched/vanilla kernel node:

# Stock
[root@jobs ~]# momctl -q ncpus -h uinta-0000
  uinta-0000:        ncpus = 'ncpus=4'
# Patched/vanilla
[root@jobs ~]# momctl -q ncpus -h uinta-0001
  uinta-0001:        ncpus = 'ncpus=4'
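
For anyone wanting to repeat the comparison across more nodes, here is a
rough sketch of two shell helpers. The function names parse_ncpus and
count_processors are made up for illustration and are not part of TORQUE.
The first only reshapes momctl output of the form shown above; the second
counts "processor" entries in /proc/cpuinfo on the node itself, on the
assumption that the kernel's own CPU count is worth checking against what
pbs_mom reports. Neither is a known fix for the rejection.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of TORQUE.

# Read momctl output on stdin and print "host ncpus" for each matching line.
# Expects lines shaped like:   uinta-0000:        ncpus = 'ncpus=4'
parse_ncpus() {
    sed -n "s/^ *\([^:]*\): *ncpus = 'ncpus=\([0-9][0-9]*\)'.*/\1 \2/p"
}

# Count "processor" entries in a cpuinfo-style file (default: /proc/cpuinfo).
count_processors() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Example use on the server (left commented so the file can be sourced safely):
# for h in uinta-0000 uinta-0001; do
#     momctl -q ncpus -h "$h" | parse_ncpus
# done
```

Running count_processors on a stock node and on a patched node would show
whether the two kernels expose the same number of CPUs to user space.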


Any suggestions as to why the newer kernel causes this and how to fix
it would be greatly appreciated.

Thanks,

jbh

