[torqueusers] cpuset on Torque 2.4.16, CentOS 6.2, AMD Opteron Bulldozer

Gus Correa gus at ldeo.columbia.edu
Wed Mar 28 18:05:33 MDT 2012


Dear Torque Pros

I am having trouble with cpuset again,
and would love to hear your suggestions.

I installed Torque 2.4.16 on a standalone machine
running CentOS 6.2. Processors are AMD Opteron Bulldozer.

Torque 2.4.11 with cpuset works correctly for me on
CentOS 5.2 with AMD Opteron Shanghai,
and on CentOS 5.4 with AMD Opteron Magny-Cours.

Is anybody out there using the same Torque
and CentOS versions with Bulldozer
and getting cpuset to work?

**********************************************
More info [sorry, lengthy, hopefully useful]
**********************************************

The configure line looks like this:

../configure \
--prefix=${MYINSTALLDIR} \
--with-server-home=${MYINSTALLDIR} \
--enable-cpuset \
--enable-geometry-requests \
--with-pam \
2>&1 | tee configure_${build_id}.log

However, the test jobs flip forever between Q and R states,
and never run.
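
The Q/R flipping can be recorded with timestamps rather than watched by eye. Below is a minimal sketch, assuming qstat is on the PATH and that "qstat -f" prints the state as a "job_state = X" line; job_state and watch_job are hypothetical helper names, not Torque commands.

```shell
#!/bin/sh
# Sketch only: job_state and watch_job are hypothetical helpers, not part of Torque.

# Extract the state letter from "qstat -f" output read on stdin.
# Assumes the attribute is printed as "    job_state = Q".
job_state() {
    awk '$1 == "job_state" && $2 == "=" { print $3; exit }'
}

# Print a timestamped state sample for a job.
# $1: job id, $2: number of samples, $3: seconds between samples
watch_job() {
    i=0
    while [ "$i" -lt "$2" ]; do
        printf '%s %s\n' "$(date +%T)" "$(qstat -f "$1" | job_state)"
        i=$((i + 1))
        sleep "$3"
    done
}

# Example (job id taken from the logs in this message):
# watch_job 1.galera.ldeo.columbia.edu 12 5
```

A run of alternating Q and R samples would confirm that the job is repeatedly dispatched and requeued rather than held.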

*********

syslog messages look like this [repeated forever]:
Mar 28 19:21:36 galera pbs_mom: LOG_ERROR::TMomFinalizeChild, Could not create cpuset for job 1.galera.ldeo.columbia.edu.

********

server_logs show these errors [repeated ad nauseam until it exits]:
03/28/2012 19:21:44;0008;PBS_Server;Job;1.galera.ldeo.columbia.edu;Job Run at request of Scheduler at galera.ldeo.columbia.edu
03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command recyc
03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command new

*********

mom_logs report errors like the ones below [also repeated many times].

Indeed, the file /dev/cpuset/cpus
that pbs_mom cannot locate doesn't exist.
It seems to have been renamed [in CentOS 6.2 perhaps?] to
/dev/cpuset/cpuset.cpus.
Actually, most file names there seem to have benefited from this
wonderful prefix "cpuset."
Oh, well, innovation has no bounds ...

Could this filename mismatch be the problem?
Is there any workaround or patch if it really is the reason for the failure?
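
To pin down which naming a host actually exposes, a check along these lines may help. It is a minimal sketch: cpuset_naming is a hypothetical helper (not part of Torque), and /dev/cpuset is assumed to be the mount point because that is the path pbs_mom reports.

```shell
#!/bin/sh
# Sketch only: cpuset_naming is a hypothetical helper, not part of Torque.

# Report which file naming a cpuset directory uses.
# $1: directory to inspect (defaults to /dev/cpuset, the path in the mom log)
cpuset_naming() {
    dir="${1:-/dev/cpuset}"
    if [ -e "$dir/cpus" ]; then
        echo "legacy"      # unprefixed name, the one pbs_mom says it cannot locate
    elif [ -e "$dir/cpuset.cpus" ]; then
        echo "prefixed"    # "cpuset."-prefixed name, as seen on this CentOS 6.2 host
    else
        echo "missing"     # neither file found: cpuset may not be mounted here
    fi
}

cpuset_naming /dev/cpuset

# Show the filesystem type and mount options in use, if it is mounted at all.
grep ' /dev/cpuset ' /proc/mounts 2>/dev/null || echo "/dev/cpuset is not mounted"
```

If this prints "prefixed", the mount options on the last line are worth a look: the kernel's cpuset documentation describes a "noprefix" cgroup mount option that yields the unprefixed file names. Whether remounting that way is enough for Torque 2.4.16 is untested here.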



03/28/2012 19:09:57;0002;   pbs_mom;Svr;Log;Log opened
03/28/2012 19:09:57;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.16, loglevel = 0
03/28/2012 19:09:57;0002;   pbs_mom;Svr;setpbsserver;galera.ldeo.columbia.edu
03/28/2012 19:09:57;0002;   pbs_mom;Svr;mom_server_add;server galera.ldeo.columbia.edu added
03/28/2012 19:09:57;0002;   pbs_mom;Svr;usecp;*:/home  /home
03/28/2012 19:09:57;0002;   pbs_mom;Svr;usecp;*:/data00  /data00
03/28/2012 19:09:57;0002;   pbs_mom;n/a;initialize;independent
03/28/2012 19:09:57;0080;   pbs_mom;Svr;pbs_mom;before init_abort_jobs
03/28/2012 19:09:57;0002;   pbs_mom;Svr;initialize_root_cpuset;Init TORQUE cpuset /dev/cpuset/torque.
03/28/2012 19:09:57;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::initialize_root_cpuset, cannot locate /dev/cpuset/cpus - cpusets not configured/enabled on host
03/28/2012 19:09:57;0002;   pbs_mom;Svr;pbs_mom;Is up
03/28/2012 19:09:57;0002;   pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /data00/sw/torque/2.4.16/gnu-4.4.6/sbin/pbs_mom 1332200273
03/28/2012 19:09:57;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.16, loglevel = 0
03/28/2012 19:09:57;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server galera.ldeo.columbia.edu
03/28/2012 19:11:14;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
03/28/2012 19:11:14;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
03/28/2012 19:11:14;0080;   pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
03/28/2012 19:11:14;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
03/28/2012 19:11:14;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
03/28/2012 19:11:14;0080;   pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
03/28/2012 19:11:14;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
03/28/2012 19:11:14;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
03/28/2012 19:11:14;0080;   pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
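
Since the same entries repeat many times, it can be easier to collapse the log into one line per distinct error with a count. The sketch below assumes the log file has one ';'-separated entry per line (date and time, code, daemon, class, name, message) and treats code 0001 as the error marker, since that is the code on both kinds of error entry above. summarize_errors is a hypothetical helper name, and the log path in the usage comment is an assumption.

```shell
#!/bin/sh
# Sketch only: summarize_errors is a hypothetical helper, not part of Torque.

# Count distinct entries whose second field is 0001, ignoring timestamps.
# $1: path to a pbs_mom log file
summarize_errors() {
    # Keep error entries, drop the first four fields (timestamp, code,
    # daemon, class), then count identical name;message pairs.
    awk -F';' '$2 == "0001"' "$1" | cut -d';' -f5- | sort | uniq -c | sort -rn
}

# Run only when a log file is given on the command line, for example
# (assumed path and file name): ${MYINSTALLDIR}/mom_logs/20120328
if [ -n "${1:-}" ]; then
    summarize_errors "$1"
fi
```

Each output line is a count followed by the entry name and message, most frequent first.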


**********

Thank you,
Gus Correa


