[torqueusers] cpuset on Torque 2.4.16, CentOS 6.2, AMD Opteron Bulldozer
David Gabriel Simas
dsimas at stanford.edu
Fri Mar 30 16:28:06 MDT 2012
----- Original Message -----
> Dear Torque Pros
>
> I am having trouble with cpuset again,
> and would love to hear your suggestions.
>
> I installed Torque 2.4.16 on a standalone machine
> running CentOS 6.2. Processors are AMD Opteron Bulldozer.
>
> I have Torque 2.4.11 with cpuset working right on
> CentOS 5.2 + AMD Opteron Shanghai,
> and on CentOS 5.4 AMD Opteron Magny-Cours.
>
> Anybody there using the same Torque
> and CentOS versions with Bulldozer
> and getting cpuset right?
>
> **********************************************
> More info [sorry, lengthy, hopefully useful]
> **********************************************
>
> The configure line looks like this:
>
> ../configure \
> --prefix=${MYINSTALLDIR} \
> --with-server-home=${MYINSTALLDIR} \
> --enable-cpuset \
> --enable-geometry-requests \
> --with-pam \
> 2>&1 | tee configure_${build_id}.log
>
> However, the test jobs flip forever between Q and R states,
> and never run.
>
> *********
>
> syslog messages look like this [repeated forever]:
> Mar 28 19:21:36 galera pbs_mom: LOG_ERROR::TMomFinalizeChild, Could not create cpuset for job 1.galera.ldeo.columbia.edu.
>
> ********
>
> server logs show these errors [repeated ad nauseam until it exits]:
> 03/28/2012 19:21:44;0008;PBS_Server;Job;1.galera.ldeo.columbia.edu;Job Run at request of Scheduler at galera.ldeo.columbia.edu
> 03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command recyc
> 03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command new
>
> *********
>
> mom_logs report errors like the ones below [also repeated many
> times].
>
> Indeed the file /dev/cpuset/cpus
> that it cannot locate doesn't exist.
> It seems to have been renamed [in CentOS 6.2 perhaps?] to
> /dev/cpuset/cpuset.cpus .
> Actually, most file names there seem to have benefited from this
> wonderful prefix "cpuset."
> Oh, well, innovation has no bounds ...
>
> Would this filename mismatch be the problem?
> Any workaround or patch if this is really the reason for the failure?
>
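One quick way to confirm the mismatch is to look at which file names the
mounted cpuset hierarchy actually exposes. A minimal sketch, not taken from
Torque: cpuset_naming is a made-up helper name, and /dev/cpuset is simply
the mount point named in the mom log.

```shell
# Report which naming scheme a cpuset mount point uses.
# cpuset_naming is a hypothetical helper, not part of Torque.
cpuset_naming() {
    dir=${1:-/dev/cpuset}
    if [ -e "$dir/cpus" ]; then
        echo "legacy"      # unprefixed name, the one the mom log says it cannot locate
    elif [ -e "$dir/cpuset.cpus" ]; then
        echo "prefixed"    # name with the "cpuset." prefix, as seen on CentOS 6.2
    else
        echo "missing"     # nothing recognisable mounted at $dir
    fi
}
```

Running "cpuset_naming /dev/cpuset" on the node should print "prefixed" if
the renaming is what pbs_mom is tripping over.
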
A work-around I found in testing Torque 4.0.0 is:

    umount /dev/cpuset
    umount /sys/fs/cgroup/cpuset
    mount -t cgroup -o cpuset,noprefix X /sys/fs/cgroup/cpuset

The noprefix option mounts the hierarchy without the "cpuset." prefix
on the file names. Then pbs_mom starts up and works with no errors.
However, Torque doesn't seem to understand the difference
between cores and hyperthreads. With cpusets enabled, a
one-processor job will be bound to a whole core, giving it
two hyperthreads.
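
To see how many hardware threads the kernel groups under one core, the
sysfs topology files can be read directly. This is only a sketch:
count_cpus and threads_per_core are made-up helper names, and it assumes
the kernel exposes topology/thread_siblings_list under
/sys/devices/system/cpu.

```shell
# Count the CPUs named by a kernel CPU list such as "0-1" or "0,8".
# Runs in a subshell so the IFS change does not leak to the caller.
count_cpus() (
    IFS=,
    total=0
    for part in $1; do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-}
                 total=$((total + hi - lo + 1)) ;;   # a range "lo-hi"
            *)   total=$((total + 1)) ;;              # a single CPU number
        esac
    done
    echo "$total"
)

# Print how many hardware threads share a core with the given CPU.
# The optional second argument overrides the sysfs root (handy for testing).
threads_per_core() {
    cpu=${1:-0}
    root=${2:-/sys/devices/system/cpu}
    count_cpus "$(cat "$root/cpu$cpu/topology/thread_siblings_list")"
}
```

If "threads_per_core 0" prints 2, then a cpuset holding one whole core
covers two logical CPUs, which matches what I see for one-processor jobs.
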
DGS
>
>
> 03/28/2012 19:09:57;0002; pbs_mom;Svr;Log;Log opened
> 03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.16, loglevel = 0
> 03/28/2012 19:09:57;0002; pbs_mom;Svr;setpbsserver;galera.ldeo.columbia.edu
> 03/28/2012 19:09:57;0002; pbs_mom;Svr;mom_server_add;server galera.ldeo.columbia.edu added
> 03/28/2012 19:09:57;0002; pbs_mom;Svr;usecp;*:/home /home
> 03/28/2012 19:09:57;0002; pbs_mom;Svr;usecp;*:/data00 /data00
> 03/28/2012 19:09:57;0002; pbs_mom;n/a;initialize;independent
> 03/28/2012 19:09:57;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
> 03/28/2012 19:09:57;0002; pbs_mom;Svr;initialize_root_cpuset;Init TORQUE cpuset /dev/cpuset/torque.
> 03/28/2012 19:09:57;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::initialize_root_cpuset, cannot locate /dev/cpuset/cpus - cpusets not configured/enabled on host
> 03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Is up
> 03/28/2012 19:09:57;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /data00/sw/torque/2.4.16/gnu-4.4.6/sbin/pbs_mom 1332200273
> 03/28/2012 19:09:57;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.16, loglevel = 0
> 03/28/2012 19:09:57;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server galera.ldeo.columbia.edu
> 03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
> 03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
> 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 03/28/2012 19:11:14;0080; pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
> 03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
> 03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
> 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 03/28/2012 19:11:14;0080; pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
> 03/28/2012 19:11:14;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
> 03/28/2012 19:11:14;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
> 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 03/28/2012 19:11:14;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 03/28/2012 19:11:14;0080; pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
>
>
> **********
>
> Thank you,
> Gus Correa
>