[torqueusers] BLCR and Torque

meysam miralipoor ir_m_m_a_p at yahoo.com
Mon Apr 30 06:00:58 MDT 2012


Hi,
Although integrating BLCR with Torque is already documented, I could not find a solution to my checkpointing problem.
When I checkpoint a job with qchkpt, the command appears to succeed, but no checkpoint file is created.
I found the following errors in /var/log/messages:

Apr 30 11:01:10 node1 checkpoint_script: Invoked: /var/spool/torque/mom_priv/checkpoint_script 7366 1.server root root /var/spool/torque/checkpoint/1.server.CK ckpt.1.server.1335798070 0 -
Apr 30 11:01:10 node1 kernel: blcr: Retry request on -CR_ENOSUPPORT
Apr 30 11:01:10 node1 checkpoint_script: Subcommand (cr_checkpoint --tree 7366 --file ckpt.1.server.1335798070) failed with rc=52:#012- Retry request on -CR_ENOSUPPORT#012Checkpoint failed: support missing from application
Apr 30 11:01:10 node1 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 52
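
The `#012` sequences in the checkpoint_script line above are syslog's octal escape for a newline, so the message is really three lines long. A minimal sketch for unfolding such a line when reading the log (the helper name `unfold_syslog` is hypothetical and not part of Torque or BLCR):

```shell
#!/bin/sh
# unfold_syslog: hypothetical helper, not provided by Torque or BLCR.
# Reads log text on stdin and replaces each "#012" (syslog's escaped
# newline) with a real line break.
unfold_syslog() {
    sed 's/#012/\
/g'
}

# Message text copied from the checkpoint_script log line above.
line='Subcommand (cr_checkpoint --tree 7366 --file ckpt.1.server.1335798070) failed with rc=52:#012- Retry request on -CR_ENOSUPPORT#012Checkpoint failed: support missing from application'

printf '%s\n' "$line" | unfold_syslog
```

The same filter could be applied to the output of a grep for checkpoint_script on /var/log/messages.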
Here is some additional information.
**************PBS Script*******************
#!/bin/sh
# Beginning of PBS batch script.
#PBS -l nodes=1:ppn=4
##PBS -j oe
#PBS -o /share/output$JOB_ID.log
#PBS -e /share/error$JOB_ID.log
#PBS -N NOTMPI
#PBS -q batch
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/lib64

/share/ex2

# End of PBS batch script.

**************status of relevant process on node1******************
#ps -A|grep 7366
7366 ?        00:00:00 bash
#ps -A|grep ex2
7368 ?        00:48:07 ex2
***********************************************************


It may be useful to know that I can checkpoint the running ex2 process (7368) directly with cr_checkpoint, but checkpointing the bash process (7366) returns the same error message.
Any help is appreciated.

Meysam
miralipoor at ipm.ir