[torqueusers] cpuset.c sleep timer causing job deferrals with 'RM Failure' response

Curry, William D wcurry at tulane.edu
Thu Oct 6 16:11:45 MDT 2011


Dear list,

I've been having trouble with jobs being deferred when they are assigned to
nodes that have just finished a larger (high-ppn) job. `checkjob` reports an
"RM Failure". After waiting a minute or two, `releasehold` usually allows the
job to run.

I noticed in the mom's log that a ppn=32 job spent 32 seconds deleting
task directories in /dev/cpuset after the job finished. In
src/resmom/linux/cpuset.c there is a 'sleep(1)' call, marked with a FIXME
note, that appears to be responsible for this delay. After removing that
line and restarting the mom, my tests no longer end with deferred jobs.
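
For anyone weighing the same change, here is a minimal sketch of a possible
middle ground between the unconditional sleep and no wait at all. It is not
TORQUE code: the helper name rmdir_retry_busy and its parameters are
hypothetical. It assumes the sleep is there because rmdir() on a cpuset
directory can fail with EBUSY while tasks are still attached, which I have
not confirmed is what the FIXME refers to.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() under strict -std modes */
#endif
#include <errno.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of TORQUE): remove a directory, pausing
 * only when the kernel reports EBUSY, instead of sleeping before every
 * removal. Returns 0 on success, -1 on failure with errno set.
 */
int rmdir_retry_busy(const char *path, int max_tries, long delay_ms)
{
    struct timespec delay;
    int attempt;

    if (max_tries < 1)
        max_tries = 1;            /* always try at least once */

    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (attempt = 0; attempt < max_tries; attempt++) {
        if (rmdir(path) == 0)
            return 0;             /* removed without any waiting */
        if (errno != EBUSY)
            return -1;            /* a different error: retrying won't help */
        if (attempt + 1 < max_tries)
            nanosleep(&delay, NULL);  /* may return early on a signal */
    }

    errno = EBUSY;                /* still busy after max_tries attempts */
    return -1;
}
```

With something like rmdir_retry_busy(path, 10, 100), a directory that is
already empty of tasks is removed immediately, and a busy one costs at most
about a second before giving up, so the wait is only paid where it is needed.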

What are the potential pitfalls of doing this?

Sincerely,

-Will

--
Will Curry
Sr. HPC Systems Analyst
Center for Computational Science
Tulane University
office: (504) 862-8393
--