[torqueusers] new nodes will not run jobs

Aaron Sims aaron_sims at ncsu.edu
Wed Feb 10 08:05:36 MST 2010


Sorry if this is a duplicate, but I originally sent this before I joined 
the lists, so I didn't know whether it actually posted.

I've added 2 new nodes (node2 and node3) to my cluster. I am able to run on 
all nodes using mpirun at the command line (outside the batch queue), 
and I can submit a job (qsub) to all my other nodes. But if I submit a 
job that includes these nodes, it just hangs: the status is reported as 
Running, but nothing is happening. A pbsnodes -a command shows the nodes 
in the list just like all the other ones, and they have the same 
status. When I run tracejob on the job ID, all my nodes say "JOIN 
JOB as node ..." except these two. I have the server name listed 
as node1-ib. I've rebooted everything, etc. Any ideas on how to get 
these 2 servers into the mix?

Thanks,
Aaron

pbsnodes -a (just listing a few here)
node1-ib
    state = job-exclusive
    np = 4
    ntype = cluster
    jobs = 0/10916.node1, 1/10916.node1, 2/10916.node1, 3/10916.node1
    status = opsys=linux,uname=Linux node1 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008 x86_64,sessions=3990,nsessions=1,nusers=1,idletime=1833,totmem=10201656kb,availmem=10028356kb,physmem=3920448kb,ncpus=4,loadave=0.01,netload=4511950,state=free,jobs=10916.node1,varattr=,rectime=1265756258


node2-ib
    state = job-exclusive
    np = 4
    ntype = cluster
    jobs = 0/10916.node1, 1/10916.node1, 2/10916.node1, 3/10916.node1
    status = opsys=linux,uname=Linux node2 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=2056,totmem=6017592kb,availmem=5910824kb,physmem=3920448kb,ncpus=4,loadave=0.00,netload=851015,state=free,jobs=,varattr=,rectime=1265756256


node3-ib
    state = job-exclusive
    np = 4
    ntype = cluster
    jobs = 0/10916.node1, 1/10916.node1, 2/10916.node1, 3/10916.node1
    status = opsys=linux,uname=Linux node3 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=2060,totmem=10146360kb,availmem=10041140kb,physmem=4051520kb,ncpus=4,loadave=0.00,netload=747045,state=free,jobs=,varattr=,rectime=1265756253


node4-ib
    state = job-exclusive
    np = 4
    ntype = cluster
    jobs = 0/10916.node1, 1/10916.node1, 2/10916.node1, 3/10916.node1
    status = opsys=linux,uname=Linux node4 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008 x86_64,sessions=3642,nsessions=1,nusers=1,idletime=2104,totmem=10337000kb,availmem=10240836kb,physmem=4047564kb,ncpus=4,loadave=0.00,netload=3655182,state=free,jobs=10916.node1,varattr=,rectime=1265756256



