[torqueusers] Enabling BLCR on Torque roll (problems with "qhold")

Al Taufer ataufer at adaptivecomputing.com
Mon Mar 28 15:33:17 MDT 2011


I assume that you are using the -c option on the qsub command; without -c, checkpointing will not occur.
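
As a minimal sketch of such a submission (not taken from this thread): the value "enabled" is one of the checkpoint options listed in the Torque qsub man page as far as I recall, so please check it against your installed version. The function name, the QSUB override variable, and the job script name are hypothetical and used only for illustration.

```shell
# Sketch only: submit a job script with checkpointing requested via -c.
# QSUB is a hypothetical override so the command line can be previewed
# (for example QSUB=echo) on a machine without Torque installed.
submit_checkpointable() {
    job_script="$1"
    # "-c enabled" asks Torque to allow checkpointing for this job.
    "${QSUB:-qsub}" -c enabled "$job_script"
}
```

Running `QSUB=echo submit_checkpointable myjob.sh` shows the arguments that would be passed to qsub without actually submitting anything.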

Please look at the syslog on the mom node to see what might be happening.  For a successful checkpoint you should see messages such as the following:

Mar 28 15:20:36 molo pbs_mom: LOG_DEBUG::blcr_checkpoint_job, checkpoint args: /var/spool/torque/mom_priv/checkpoint_script 3310 684000.molo ataufer ataufer /var/spool/torque/checkpoint/684000.molo.CK ckpt.684000.molo.1301347236 0 - 2>&1 1>/dev/null

Mar 28 15:20:36 molo checkpoint_script: Invoked: /var/spool/torque/mom_priv/checkpoint_script 3310 684000.molo ataufer ataufer /var/spool/torque/checkpoint/684000.molo.CK ckpt.684000.molo.1301347236 0 -

Mar 28 15:20:36 molo checkpoint_script: Subcommand (cr_checkpoint --tree 3310 --file ckpt.684000.molo.1301347236) yielded rc=0:

Do you see anything like these or any other messages with "checkpoint" in them?
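
One way to run that search on the mom node is sketched below. The default log path /var/log/messages is an assumption (syslog locations vary by distribution), and both function names are hypothetical. The second function only pulls out the rc value from "Subcommand ... yielded rc=N" lines like the one shown above.

```shell
# Sketch only: list syslog lines that mention "checkpoint".
# The default path is an assumption; pass your real syslog file as $1.
show_checkpoint_msgs() {
    log_file="${1:-/var/log/messages}"
    grep -i 'checkpoint' "$log_file"
}

# Print the return code from each checkpoint_script "Subcommand" line.
# A value of 0 means the logged subcommand reported success.
checkpoint_return_codes() {
    log_file="${1:-/var/log/messages}"
    sed -n 's/.*checkpoint_script: Subcommand .* yielded rc=\([0-9][0-9]*\).*/\1/p' "$log_file"
}
```

If show_checkpoint_msgs prints nothing at all, the checkpoint script was probably never invoked; a non-zero value from checkpoint_return_codes points at the cr_checkpoint step itself.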

Al Taufer

----- Original Message -----
> Hi Taufer,
> 
> Thank you for the reply.
> I've looked through the mom logs and I can't find any errors. Every time I
> send a hold command to a running job, pbs_mom receives it and tries to
> hold the job. As far as I can understand, the logs say the checkpoint
> was done successfully. Unfortunately, the checkpoints are not being
> made correctly: not only does the job not stop, but no checkpoint
> file is created either.
> 
> The mom log file is attached to this mail. The log starts right after
> pbs_mom receives the hold signal.
> 
> 
> 
> On Mar 16, 2011, at 4:38 PM, Al Taufer wrote:
> 
> > Qhold will only stop the running job if the job is successfully
> > checkpointed. You need to look at the syslog and mom logs on the
> > node where the job is running to see why the checkpoint is failing.
> >
> > Al Taufer
> >
> > ----- Original Message -----
> >> Thank you for the help.
> >> I was able to solve my problem with the pbs_mom error, and I replaced
> >> the blcr scripts from the online tutorial with the ones available
> >> in the source code of Torque (I'm using Torque 3.0.0, by the way).
> >>
> >> Unfortunately, I am having another problem.
> >> I am able to run the checkpoint commands (qhold and qchkpt) without
> >> getting any error message, but neither command creates a checkpoint
> >> file for the job. In fact, the qhold command does not even stop the
> >> running job.
> >> The jobs are submitted with checkpointing enabled and a path for the
> >> checkpoint file.
> >>
> >> This is the result of the tracejob of a qhold command:
> >>
> >> 03/16/2011 15:58:39 S Holds u set at request of <some user>@cluster.PAC
> >> 03/16/2011 15:58:39 S Job Modified at request of root at compute-0-1.local
> >> 03/16/2011 15:58:39 S Holds uos released at request of root at compute-0-1.local
> >>
> >> Does anybody have any idea why the checkpoint commands are not
> >> generating a checkpoint file and why qhold does not stop a running
> >> job?
> >>
> >> Thank you.
> >>
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers

