[torqueusers] Enabling BLCR on Torque roll (problems with "qhold")

Ricardo Alves rdq.alves at gmail.com
Fri Mar 11 08:20:38 MST 2011


Hi,

I have been trying to enable BLCR (Berkeley Lab Checkpoint/Restart) in TORQUE on my cluster, without success.
I followed the tutorial on the webpage (http://www.clusterresources.com/torquedocs21/2.6jobcheckpoint.shtml) to enable BLCR support, and everything compiled without errors.
After installation I can submit jobs normally, both without checkpointing and with checkpointing enabled ("qsub -c enabled ..."), but I cannot hold a running job ("qhold <job number>") that was submitted with checkpointing enabled.
Every time I try to hold a job I get the following message:
"qhold: something specified didn't exist MSG=MOM rejected hold request: 15204 <job number>.<cluster name>" 

The cluster runs Rocks 5.4 with BLCR 0.8.2. The BLCR kernel modules are loaded on the compute nodes, and the paths to the BLCR commands and libraries are present in $PATH and $LD_LIBRARY_PATH respectively.
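
For anyone who wants to repeat these checks on a compute node, here is a minimal sketch. It assumes the BLCR kernel modules are named blcr and blcr_imports and that the BLCR tools are cr_run, cr_checkpoint and cr_restart; the helper names missing_cmds and module_loaded are hypothetical and only for illustration.

```shell
#!/bin/sh
# Print the name of every given command that cannot be found via $PATH.
# Prints nothing when all of them are found.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Succeed only if a kernel module with exactly this name is listed in
# /proc/modules (each line there starts with the module name and a space).
module_loaded() {
    [ -r /proc/modules ] && grep -q "^$1 " /proc/modules
}

# Example use on a compute node (module and tool names are assumptions):
#   missing_cmds cr_run cr_checkpoint cr_restart
#   module_loaded blcr_imports && module_loaded blcr && echo "BLCR modules loaded"
```

These checks only confirm that the modules and tools are visible to the shell that runs them; pbs_mom may run with a different environment, so a clean result here does not rule out a path problem on the MOM side.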

The output of tracejob for one of these jobs is attached.

I would really appreciate some help with this problem.
Thank you.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: qtracejob.txt
Url: http://www.supercluster.org/pipermail/torqueusers/attachments/20110311/4c390e39/attachment.txt 