[torqueusers] cpuset.c sleep timer causing job deferrals with 'RM Failure' response
Curry, William D
wcurry at tulane.edu
Thu Oct 6 16:11:45 MDT 2011
I've had trouble with jobs being deferred when they are assigned to nodes
on which a larger (many-ppn) job has just finished. `checkjob` reports an
"RM Failure". Usually, after waiting a minute or two, `releasehold` allows
the job to run.
I noticed in the mom's log that a ppn=32 job spent 32 seconds deleting
task directories in /dev/cpuset after the job finished, which would be
consistent with a one-second delay per task directory. Inside
src/resmom/linux/cpuset.c there is a 'sleep(1)' call, accompanied by a
FIXME note, that is apparently responsible for this behavior. After
removing this line and restarting the mom, my tests no longer end with
deferred jobs.
What are the potential pitfalls of doing this?
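For discussion, here is a minimal sketch of one alternative to an
unconditional sleep(1): retry rmdir() only when it fails with EBUSY, with
a short bounded delay. This is not the TORQUE code. It assumes the sleep
exists to give the kernel time to release the cpuset before the directory
is removed, which the source does not confirm. The function name
remove_dir_with_retry and its parameters are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not part of TORQUE.
 * Tries to remove a directory, retrying only while rmdir() reports EBUSY.
 * Returns 0 on success, -1 on failure with errno describing the last error. */
static int remove_dir_with_retry(const char *path, int max_tries, long delay_ms)
{
    struct timespec delay;
    int attempt;

    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (attempt = 0; attempt < max_tries; attempt++) {
        if (rmdir(path) == 0)
            return 0;              /* removed, no waiting needed */
        if (errno != EBUSY)
            return -1;             /* a different error: retrying will not help */
        if (attempt + 1 < max_tries)
            nanosleep(&delay, NULL);   /* brief pause before the next try */
    }

    errno = EBUSY;                 /* still busy after all attempts */
    return -1;
}
```

With something like remove_dir_with_retry(path, 10, 100), a directory that
can be removed immediately costs no delay at all, and a busy one waits at
most about 0.9 seconds in total (nine 100 ms pauses between ten attempts)
rather than a fixed second every time.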
Sr. HPC Systems Analyst
Center for Computational Science
office: (504) 862-8393