[torqueusers] Job eligible, nodes free, but job would not start
narora at Princeton.EDU
Fri Oct 13 10:43:00 MDT 2006
I have noticed that when these jobs are stuck, one way to get them
started is to use qalter to set a walltime lower than the default
walltime. We set a default_walltime of 9999:00:00 at the server level
and require users to specify the CPU time they need.
This default was set a long time ago and has not caused any issues until
now. It now seems that, with this default in place, a job submitted with
an explicit -l walltime=<time> specification runs while older jobs that
carry the default walltime keep waiting.
Can someone please shed some light on this? I am out of clues here.
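For anyone who wants to try the same workaround, here is a minimal sketch of the qalter step described above. The job id 1234 and the 24:00:00 value are placeholders, not values from our system, and set_walltime and DRY_RUN are hypothetical helper names used only for illustration. By default the helper only prints the command, so nothing is changed until DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch only: give a stuck job an explicit walltime below the server default.
# The job id and walltime used at the bottom are placeholders.

# Print (default) or run the qalter command for a single job.
#   $1 = job id, $2 = walltime in HH:MM:SS
# DRY_RUN defaults to 1, which prints the command instead of executing it.
set_walltime() {
    jobid=$1
    walltime=$2
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "qalter -l walltime=$walltime $jobid"
    else
        qalter -l "walltime=$walltime" "$jobid"
    fi
}

# Dry run: shows what would be executed for placeholder job 1234.
set_walltime 1234 24:00:00
```

To apply the change for real, one would run something like DRY_RUN=0 set_walltime <jobid> <time> with a value below the 9999:00:00 default. Running qmgr -c 'print server' should list the server-level settings, which is one way to confirm the default being undercut.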
Neelesh Arora wrote:
> Hi All,
> I am using torque-2.0.0p2 and maui-3.2.6p13, and notice the following
> behavior today:
> - There are several jobs in the queue that are in the Q state. When I do
> checkjob <jobid>, I get (among other things):
> "job can run in partition DEFAULT (63 procs available. 1 procs required)"
> but the job remains in Q forever. As the above message indicates, this
> is not a case of an unmet resource requirement.
> - nothing untoward in the torque logs
> - I see several of these messages in maui.log:
> MSysRegEvent(JOBCORRUPTION: job 'jobid' has the following idle node(s)
> allocated: 'node114' ,0,0,1)
> but these are for the running jobs, not the Q'ed jobs in question
> - I also see messages like these in the maui.log:
> INFO: PBS node node114 set to state Idle (free)
> INFO: node 'node114' changed states from Running to Idle
> although this node has 2 of its 4 procs busy.
> This message is repeated for several nodes.
> - restarting torque and maui did not help either
> - if I say qrun <jobid> for the stuck jobs, I get:
> qrun: Resource temporarily unavailable <jobid>
> - but if I do runjob <jobid>, the jobs start!
> I am unable to correlate all this information. Does anyone know what
> could be going wrong, or where else I can look?