[torqueusers] Maui and Torque not agreeing on jobstate

Craig West cwest at astro.umass.edu
Wed Sep 24 08:21:32 MDT 2008


I think you will find that the job, or a node it runs on, has an error, 
and the job is being restarted continuously. Notice that start_count is 
very high (1756), and that exit_status is set; that variable doesn't 
usually appear until the job has exited at least once.
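
As a rough sketch of how to pull those two values out, the snippet below 
filters "qstat -f" style output for a single attribute. The helper name 
job_attr is made up for illustration and is not part of Torque. It also 
assumes the attribute fits on one line; long values that qstat wraps onto 
continuation lines would be cut short.

```shell
# job_attr NAME: print the value of attribute NAME from `qstat -f` style
# output read on stdin. Hypothetical helper, not a Torque command.
job_attr() {
    awk -v key="$1" '
        # Match lines of the form "    name = value".
        $1 == key && $2 == "=" {
            sub(/^[^=]*= /, "")   # drop everything up to and including "= "
            print
            exit
        }'
}

# Usage on a live server (assumes qstat is on PATH):
#   qstat -f <jobid> | job_attr start_count
#   qstat -f <jobid> | job_attr exit_status
```

A start_count that keeps climbing between two runs of this is a good sign 
the job is stuck in a restart loop.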

I think the job is being continuously re-queued. You may want to put a 
hold on it, or delete it, until you understand why.
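
The hold itself is "qhold <jobid>", and "qdel <jobid>" removes the job. 
If you want to decide from the restart count, something like the sketch 
below might do. The function name suggest_hold and the default threshold 
of 10 are my own assumptions, so pick a limit that suits your site. It 
only prints the command and does not run it.

```shell
# suggest_hold JOBID [LIMIT]: read `qstat -f` style output on stdin and
# print a qhold command if start_count is above LIMIT.
# Hypothetical helper; the default LIMIT of 10 is an arbitrary assumption.
suggest_hold() {
    jobid="$1"
    limit="${2:-10}"
    # Take the value from the "start_count = N" line, if there is one.
    count=$(awk '$1 == "start_count" && $2 == "=" { print $3; exit }')
    if [ "${count:-0}" -gt "$limit" ]; then
        echo "qhold $jobid"
    fi
}

# Usage on a live server (assumes qstat is on PATH):
#   qstat -f <jobid> | suggest_hold <jobid>
```

Once the underlying problem is fixed, "qrls <jobid>" releases the hold 
again.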

I would check the logs on the server to see which nodes it is trying to 
run on, then check those nodes to see if there is a problem. 
"tracejob <jobid>" should show some useful information, but it needs to 
be run as root to get the full details. The "exec_host" variable will 
tell you which nodes the job is trying to run on.
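
The exec_host value is a "+"-separated list of node/slot entries, so a 
small filter can reduce it to the distinct node names worth checking. 
This is only a sketch: exec_nodes is a made-up helper name, and the node 
names in the example are placeholders, not taken from your output.

```shell
# exec_nodes: read an exec_host value on stdin, e.g.
#   node01/0+node01/1+node02/0
# and print each distinct node name once. Hypothetical helper.
exec_nodes() {
    tr '+' '\n' |     # one node/slot entry per line
    sed 's|/.*||' |   # strip the "/slot" suffix
    sort -u           # keep each node name once
}

# Usage: feed it the exec_host value from `qstat -f <jobid>`, then check
# each node it prints, for example with:
#   pbsnodes <nodename>
```

On the server, "tracejob <jobid>" (as root) should then show what 
happened on each attempt.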


>     etime = Wed Sep 24 14:08:27 2008
>     exit_status = -3
>     submit_args = qsubtest.com
>     start_time = Wed Sep 24 14:08:28 2008
>     start_count = 1756
> I don't understand the priority being zero, as maui lists the 
> startpriority as 60. Something appears to be not communicating 
> somewhere. Could someone shed some light on it?
