[torqueusers] cpuset on Torque 2.4.16, CentOS 6.2, AMD Opteron Bulldozer

David Gabriel Simas dsimas at stanford.edu
Fri Mar 30 16:28:06 MDT 2012


----- Original Message -----
> Dear Torque Pros
> 
> I am having trouble with cpuset again,
> and would love to hear your suggestions.
> 
> I installed Torque 2.4.16 on a standalone machine
> running CentOS 6.2. Processors are AMD Opteron Bulldozer.
> 
> I have Torque 2.4.11 with cpuset working right on
> CentOS 5.2 + AMD Opteron Shanghai,
> and on CentOS 5.4 AMD Opteron Magny-Cours.
> 
> Anybody there using the same Torque
> and CentOS versions with Bulldozer
> and getting cpuset right?
> 
> **********************************************
> More info [sorry, lengthy, hopefully useful]
> **********************************************
> 
> The configure line looks like this:
> 
> ../configure \
> --prefix=${MYINSTALLDIR} \
> --with-server-home=${MYINSTALLDIR} \
> --enable-cpuset \
> --enable-geometry-requests \
> --with-pam \
> 2>&1 | tee configure_${build_id}.log
> 
> However, the test jobs flip forever between Q and R states,
> and never run.
> 
> *********
> 
> syslog messages look like this [repeated forever]:
> Mar 28 19:21:36 galera pbs_mom: LOG_ERROR::TMomFinalizeChild, Could not create cpuset for job 1.galera.ldeo.columbia.edu.
> 
> ********
> 
> server logs show these errors [repeated ad nauseam until it exits]:
> 03/28/2012 19:21:44;0008;PBS_Server;Job;1.galera.ldeo.columbia.edu;Job Run at request of Scheduler at galera.ldeo.columbia.edu
> 03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command recyc
> 03/28/2012 19:21:44;0040;PBS_Server;Svr;galera.ldeo.columbia.edu;Scheduler was sent the command new
> 
> *********
> 
> mom_logs report errors like the ones below [also repeated many times].
> 
> Indeed the file /dev/cpuset/cpus
> that it cannot locate doesn't exist.
> It seems to have been renamed [in CentOS 6.2 perhaps?] to
> /dev/cpuset/cpuset.cpus .
> Actually, most file names there seem to have benefited from this
> wonderful prefix "cpuset."
> Oh, well, innovation has no bounds ...
> 
> Would this filename mismatch be the problem?
> Any workaround or patch if this is really the reason for the failure?
> 

A work-around I found in testing Torque 4.0.0 is:

   umount /dev/cpuset
   umount /sys/fs/cgroup/cpuset
   mount -t cgroup -o cpuset,noprefix X /sys/fs/cgroup/cpuset

Then pbs_mom starts up and works with no errors.
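
To check which naming scheme a given cpuset mount actually uses before
starting pbs_mom, something like the following may help. It is only a
minimal sketch: the function name cpuset_naming is made up for this
example, and it assumes that the presence of "cpus" versus "cpuset.cpus"
in the mount point is enough to tell a noprefix mount from a prefixed one.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): report which file-naming
# scheme a cpuset hierarchy uses.
# $1 is the mount point to inspect; defaults to /dev/cpuset.
cpuset_naming() {
    dir="${1:-/dev/cpuset}"
    if [ -e "$dir/cpus" ]; then
        echo "noprefix"     # legacy names, as the old pbs_mom expects
    elif [ -e "$dir/cpuset.cpus" ]; then
        echo "prefixed"     # cgroup-style names with the "cpuset." prefix
    else
        echo "absent"       # not mounted here, or not a cpuset hierarchy
    fi
}
```

Running "cpuset_naming /dev/cpuset" (or with whatever mount point your
pbs_mom looks at) should print "noprefix" once the remount has taken
effect.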

However, Torque does not seem to distinguish between cores
and hyperthreads. With cpusets enabled, a one-processor job
is bound to a whole core, which gives it two hyperthreads.
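
To see which logical CPUs the kernel treats as siblings on the same
core, one can read the topology files under sysfs. This is a rough
sketch, not anything Torque does itself: the function name
thread_sibling_sets is invented here, and it assumes the kernel exposes
topology/thread_siblings_list for each CPU.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): print each distinct set of
# sibling hardware threads, one set per line.
# $1 is the sysfs CPU directory; defaults to /sys/devices/system/cpu.
thread_sibling_sets() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/topology/thread_siblings_list; do
        # Skip CPUs that do not expose the file (or an unmatched glob).
        [ -r "$f" ] && cat "$f"
    done | sort -u          # siblings report the same list; keep one copy
}
```

Each output line is one core's worth of logical CPUs, so a line with
two entries means a job bound to that core gets both of them.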

DGS


> 
> 
> 03/28/2012 19:09:57;0002;   pbs_mom;Svr;Log;Log opened
> 03/28/2012 19:09:57;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.16, loglevel = 0
> 03/28/2012 19:09:57;0002;   pbs_mom;Svr;setpbsserver;galera.ldeo.columbia.edu
> 03/28/2012 19:09:57;0002;   pbs_mom;Svr;mom_server_add;server galera.ldeo.columbia.edu added
> 03/28/2012 19:09:57;0002;   pbs_mom;Svr;usecp;*:/home  /home
> 03/28/2012 19:09:57;0002;   pbs_mom;Svr;usecp;*:/data00  /data00
> 03/28/2012 19:09:57;0002;   pbs_mom;n/a;initialize;independent
> 03/28/2012 19:09:57;0080;   pbs_mom;Svr;pbs_mom;before init_abort_jobs
> 03/28/2012 19:09:57;0002;   pbs_mom;Svr;initialize_root_cpuset;Init TORQUE cpuset /dev/cpuset/torque.
> 03/28/2012 19:09:57;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::initialize_root_cpuset, cannot locate /dev/cpuset/cpus - cpusets not configured/enabled on host
> 03/28/2012 19:09:57;0002;   pbs_mom;Svr;pbs_mom;Is up
> 03/28/2012 19:09:57;0002;   pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /data00/sw/torque/2.4.16/gnu-4.4.6/sbin/pbs_mom 1332200273
> 03/28/2012 19:09:57;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.16, loglevel = 0
> 03/28/2012 19:09:57;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server galera.ldeo.columbia.edu
> 03/28/2012 19:11:14;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
> 03/28/2012 19:11:14;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
> 03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 03/28/2012 19:11:14;0080;   pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
> 03/28/2012 19:11:14;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
> 03/28/2012 19:11:14;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
> 03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 03/28/2012 19:11:14;0080;   pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
> 03/28/2012 19:11:14;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
> 03/28/2012 19:11:14;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters for job 1.galera.ldeo.columbia.edu
> 03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 03/28/2012 19:11:14;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 03/28/2012 19:11:14;0080;   pbs_mom;Job;1.galera.ldeo.columbia.edu;obit sent to server
> 
> 
> **********
> 
> Thank you,
> Gus Correa
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 

