[torqueusers] cpuset on Torque 2.4.16, CentOS 6.2, AMD Opteron Bulldozer
Gus Correa
gus at ldeo.columbia.edu
Wed Mar 28 18:05:33 MDT 2012
Dear Torque Pros,
I am having trouble with cpuset again,
and would love to hear your suggestions.
I installed Torque 2.4.16 on a standalone machine
running CentOS 6.2. Processors are AMD Opteron Bulldozer.
I have Torque 2.4.11 with cpuset working correctly on
CentOS 5.2 + AMD Opteron Shanghai,
and on CentOS 5.4 + AMD Opteron Magny-Cours.
Is anybody there using the same Torque
and CentOS versions with Bulldozer
and getting cpuset right?
**********************************************
More info [sorry, lengthy, hopefully useful]
**********************************************
The configure line looks like this:
../configure \
    --prefix=${MYINSTALLDIR} \
    --with-server-home=${MYINSTALLDIR} \
    --enable-cpuset \
    --enable-geometry-requests \
    --with-pam \
    2>&1 | tee configure_${build_id}.log
However, the test jobs flip forever between Q and R states,
and never run.
*********
syslog messages look like this [repeated forever]:
Mar 28 19:21:36 galera pbs_mom: LOG_ERROR::TMomFinalizeChild, Could not create cpuset for job 1.galera.ldeo.columbia.edu.
********
The server log shows these errors [repeated ad nauseam until it exits]:
03/28/2012 19:21:44;0008;PBS_Server;Job;1.galera.ldeo.columbia.edu;Job Run at request of Scheduler at galera.ldeo.columbia.edu
03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command recyc
03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command new
*********
The mom_logs report errors like the ones below [also repeated many times].
Indeed, the file /dev/cpuset/cpus
that pbs_mom cannot locate doesn't exist.
It seems to have been renamed [in CentOS 6.2 perhaps?] to
/dev/cpuset/cpuset.cpus.
Actually, most file names there seem to have benefited from this
wonderful prefix "cpuset."
Oh, well, innovation has no bounds ...
Could this filename mismatch be the problem?
Is there any workaround or patch, if this is really the reason for the failure?
03/28/2012 19:09:57;0002; pbs_mom;Svr;Log;Log opened
03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.16, loglevel = 0
03/28/2012 19:09:57;0002; pbs_mom;Svr;setpbsserver;galera.ldeo.columbia.edu
03/28/2012 19:09:57;0002; pbs_mom;Svr;mom_server_add;server galera.ldeo.columbia.edu added
03/28/2012 19:09:57;0002; pbs_mom;Svr;usecp;*:/home /home
03/28/2012 19:09:57;0002; pbs_mom;Svr;usecp;*:/data00 /data00
03/28/2012 19:09:57;0002; pbs_mom;n/a;initialize;independent
03/28/2012 19:09:57;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
03/28/2012 19:09:57;0002; pbs_mom;Svr;initialize_root_cpuset;Init TORQUE cpuset /dev/cpuset/torque.
03/28/2012 19:09:57;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::initialize_root_cpuset, cannot locate /dev/cpuset/cpus - cpusets not configured/enabled on host
03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Is up
03/28/2012 19:09:57;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /data00/sw/torque/2.4.16/gnu-4.4.6/sbin/pbs_mom 1332200273
03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.16, loglevel = 0
03/28/2012 19:09:57;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server galera.ldeo.columbia.edu
03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
03/28/2012 19:11:14;0080; pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
03/28/2012 19:11:14;0080; pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
03/28/2012 19:11:14;0080; pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
**********
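The filename mismatch suspected above (/dev/cpuset/cpus versus
/dev/cpuset/cpuset.cpus) could be confirmed on a given node with something
like the sketch below. It only inspects the mount point and the mounts
table; it does not change anything. The function names cpuset_naming and
cpuset_fstype are made up for this illustration and are not part of Torque,
and /dev/cpuset is assumed as the default only because that is the path
the mom log mentions.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not part of Torque).

# Report which cpuset file-naming scheme is present under a mount point.
# $1: directory to inspect (assumed default: /dev/cpuset, as in the mom log)
cpuset_naming() {
    dir="${1:-/dev/cpuset}"
    if [ -e "$dir/cpus" ]; then
        echo "legacy"      # unprefixed file name: cpus
    elif [ -e "$dir/cpuset.cpus" ]; then
        echo "prefixed"    # prefixed file name: cpuset.cpus
    else
        echo "none"        # neither file found at this path
    fi
}

# Print the filesystem type recorded for a mount point in a mounts table.
# $1: mount point, $2: mounts file (assumed default: /proc/mounts)
cpuset_fstype() {
    awk -v mp="${1:-/dev/cpuset}" '$2 == mp { print $3 }' "${2:-/proc/mounts}"
}
```

Running "cpuset_naming /dev/cpuset" followed by "cpuset_fstype /dev/cpuset"
on the failing node would show whether pbs_mom is looking for a file name
that the mounted hierarchy simply does not provide, and which filesystem
type the hierarchy was mounted with.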
Thank you,
Gus Correa