[torqueusers] Torque on a Scyld cluster

Alexander Soudackov souda at chem.psu.edu
Fri Sep 28 17:39:31 MDT 2007


Has anybody experienced problems with Torque on a Scyld cluster? I'm 
running Torque 2.1.8 on a Scyld cluster (Clusterware 4.13). When the 
master node is defined as a single node with 48 processes (we have 24 
duals as compute nodes) and the server, MOM, and scheduler (Maui) all 
run on the master node, everything seems to work. However, after a 
recent upgrade of Torque (from the Scyld repository), the new 
configuration runs MOMs on all the compute nodes, and the compute nodes 
are explicitly listed in the .../server_priv/nodes file. Now Maui 
doesn't work at all (all jobs are deferred), and the original Torque 
scheduler works but behaves inexplicably. Serial jobs (submitted using 
beorun) are fine, but MPI jobs behave very strangely. When started with 
mpiexec from within a PBS script (with #PBS -l nodes=<n>:ppn=2), the 
processes don't communicate with each other, and each process runs 
independently, as if with ncpus=1. When started with mpirun from within 
a PBS script, all the processes are started on a single node. Could 
somebody tell me what is going on and how we can fix it?
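
For anyone trying to reproduce this, a minimal diagnostic job script 
along the following lines might help show what Torque actually hands to 
a job. It is only a sketch: it assumes Torque sets PBS_NODEFILE inside 
the job, the job name and the nodes=2:ppn=2 request are example values, 
and count_slots is a hypothetical helper, not part of Torque.

```shell
#!/bin/sh
#PBS -N nodecheck
#PBS -l nodes=2:ppn=2
# Example values above: "nodecheck" and nodes=2:ppn=2 are placeholders.

# Hypothetical helper: print "<host> <slots>" for each host in a node file,
# where <slots> is the number of times the host is listed.
count_slots() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job, Torque is assumed to set PBS_NODEFILE to the list of
# allocated slots (one host name per line). Outside a job this is skipped.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Job script running on: $(hostname)"
    count_slots "$PBS_NODEFILE"
fi
```

With nodes=<n>:ppn=2 one would expect <n> distinct hosts with 2 slots 
each; a single host listed for every slot would match the "all on one 
node" behaviour described above.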

Thanks,
Alex