[torqueusers] Enabling BLCR on Torque roll (problems with "qhold")

Al Taufer ataufer at adaptivecomputing.com
Wed Mar 16 10:38:51 MDT 2011


qhold stops a running job only if the job is checkpointed successfully. Check the syslog and the pbs_mom logs on the node where the job is running to see why the checkpoint is failing.
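
A minimal sketch of one way to do that log check, assuming Python is available on the compute node. The function name `checkpoint_lines` and the keyword list are illustrative assumptions, not part of Torque or BLCR; the script only filters log lines that mention the job id or a checkpoint-related term.

```python
import re
from typing import Iterable, List

# Assumption: these terms are likely to show up in checkpoint-related
# syslog or pbs_mom log lines. Adjust them to match what your logs contain.
KEYWORDS = re.compile(r"checkpoint|chkpt|blcr", re.IGNORECASE)


def checkpoint_lines(lines: Iterable[str], job_id: str) -> List[str]:
    """Return the log lines that mention job_id or a checkpoint keyword."""
    hits = []
    for line in lines:
        text = line.rstrip("\n")
        if job_id in text or KEYWORDS.search(text):
            hits.append(text)
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python chkpt_grep.py <job_id> <log_file> [<log_file> ...]
    job_id, paths = sys.argv[1], sys.argv[2:]
    for path in paths:
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(path, errors="replace") as fh:
            for hit in checkpoint_lines(fh, job_id):
                print(f"{path}: {hit}")
```

Run it on the node where the job is executing, passing the job id and the log files to scan. The pbs_mom logs are usually under the mom_logs directory of the Torque home (often /var/spool/torque/mom_logs/, one file per day), and the syslog location depends on the distribution; both paths are assumptions and may differ on a given install.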

Al Taufer

----- Original Message -----
> Thank you for the help.
> I was able to solve my problem with the pbs_mom error by replacing
> the blcr scripts from the online tutorial with the ones available in
> the source code of Torque (I'm using Torque 3.0.0, by the way).
> 
> Unfortunately I am having another problem.
> I am able to run the checkpoint commands (qhold and qchkpt) without
> getting any error message, but neither command creates a checkpoint
> file for the job. In fact, the qhold command does not even stop the
> running job.
> The jobs are submitted with checkpointing enabled and a path for the
> checkpoint file.
> 
> This is the result of the tracejob of a qhold command:
> 
> 03/16/2011 15:58:39 S Holds u set at request of <some user>@cluster.PAC
> 03/16/2011 15:58:39 S Job Modified at request of root at compute-0-1.local
> 03/16/2011 15:58:39 S Holds uos released at request of root at compute-0-1.local
> 
> Does anybody have any idea why the checkpoint commands are not
> generating a checkpoint file and why qhold does not stop a running
> job?
> 
> Thank you.
> 

