[torqueusers] torque/blcr integration
robinr at muohio.edu
Wed Sep 22 09:49:44 MDT 2010
Thanks. I took another look and got further: the PATH is set in /etc/profile.d/, and apparently it did get propagated to that Perl script.
I'm getting an error where pbs_mom has trouble scp'ing.
However, if I do it manually (log in as the running user and repeat the same scp command listed below), the scp goes through fine.
Sep 22 11:39:00 compute-1-1 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /usr/local/torque/2.4.6/var/spool/torque/checkpoint/110761.torque.hpc.muohio.edu.CK/ckpt.110761.torque.hpc.muohio.edu.1285169919 robinr at robin-head-1.hpc.muohio.edu:/usr/local/torque/2.4.6/var/spool/torque/checkpoint/110761.torque.hpc.muohio.edu.CK/ckpt.110761.torque.hpc.muohio.edu.1285169919' failed with status=1, giving up after 4 attempts
I'm not fully certain what the difference is between me running it manually and pbs_mom invoking it. The necessary files and directories do exist before the logged scp attempt.
The status=1 in the log corresponds to scp exit code 1: general error in the file copy.
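
One way to narrow this down, as a minimal sketch rather than a known fix: pull the failed command out of the mom log line and re-run it with a stripped-down environment and no stdin, which is closer to how a daemon would invoke it than an interactive login. The assumption here (not confirmed by anything in the logs) is that the difference lies in the environment, e.g. no SSH agent, no terminal, or a different HOME. The helper names parse_sys_copy_error and rerun_without_login_env are made up for illustration, and the regex only matches the message format shown in the log line above.

```python
import re
import shlex
import subprocess

# Matches the pbs_mom sys_copy error format quoted in the log line above.
SYS_COPY_RE = re.compile(
    r"sys_copy, command '(?P<cmd>.+)' failed with status=(?P<status>\d+), "
    r"giving up after (?P<attempts>\d+) attempts"
)


def parse_sys_copy_error(line):
    """Return the failed command as an argv list plus status and attempts.

    Returns None if the line is not a sys_copy failure message.
    """
    match = SYS_COPY_RE.search(line)
    if match is None:
        return None
    return {
        "argv": shlex.split(match.group("cmd")),
        "status": int(match.group("status")),
        "attempts": int(match.group("attempts")),
    }


def rerun_without_login_env(argv, timeout=60):
    """Re-run argv with a minimal environment and no stdin.

    Hypothetical helper: the PATH value is an assumption, and this only
    approximates the conditions under which pbs_mom runs the copy.
    """
    return subprocess.run(
        argv,
        env={"PATH": "/usr/bin:/bin"},  # no HOME, no SSH_AUTH_SOCK, etc.
        stdin=subprocess.DEVNULL,       # no terminal to prompt on
        capture_output=True,
        text=True,
        timeout=timeout,
    )


if __name__ == "__main__":
    # Hypothetical sample line; paste the real line from the mom log
    # (with user@host written out) in its place.
    sample = (
        "pbs_mom: LOG_ERROR::sys_copy, command "
        "'/usr/bin/scp -rpB /tmp/src user@host:/tmp/dst' "
        "failed with status=1, giving up after 4 attempts"
    )
    info = parse_sys_copy_error(sample)
    print(info["argv"], info["status"], info["attempts"])
```

Running rerun_without_login_env(info["argv"]) as the job's user and printing the returned stderr should show what scp itself complains about, which the mom log does not include. If it succeeds even with the stripped environment, the environment is probably not the cause.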
On Sep 21, 2010, at 4:15 PM, Al Taufer wrote:
> The 2 scripts from the contrib/blcr directory may need to be modified for your installation. They both need to set the PATH correctly so they can find certain executables. You might verify that this is correct for your system.
> ----- Original Message -----
>> Thanks, I just tried it with the ones in contrib dir.
>> I'm getting the same error, the return code matches as if not enough
>> comment = Checkpoint script failed with return value of 255
>> On Sep 21, 2010, at 2:26 PM, Al Taufer wrote:
>>> I do not know how up to date the scripts are on the web page but
>>> there are 2 scripts included with the distribution, they are in the
>>> torque/contrib/blcr directory and should be up to date. They are
>>> checkpoint_script and restart_script; I would try using these.
>>> Al Taufer
>>> Adaptive Computing
>>> ----- Original Message -----
>>>> I'm following the instructions on
>>>> Torque is compiled with --enable-blcr, version 2.4.10, I'm aware
>>>> the doc is for 2.5.x, I did not easily find the doc for 2.4.x.
>>>> Attached are my
>>>> It's essentially the scripts from the doc, but the script on the
>>>> needs correction (or it would not run).
>>>> blcr_checkpoint_script was edited to declare variable $depth and
>>>> a missing comma -- the aim was to fix the syntax (I didn't spend
>>>> time on the scripts).
>>>> [ It would be nice to see the webpage has the code fixed. ]
>>>> I submitted my test job with "qsub -c enabled test.job", then issued
>>>> qhold jobid. It did not checkpoint the job; under qstat -f, there's this
>>>> output line for that job:
>>>> comment = "Usage:
>>>> Mom logs say that the blcr_checkpoint_script exited with code
>>>> which is consistent with running the script without parameters.
>>>> I take it that pbs_mom did not issue the blcr_checkpoint_script
>>>> command with all the required parameters.
>>>> Any comments, helpful hints, or outright help will be most welcome.
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org