[torqueusers] torque/blcr integration

Robin robinr at muohio.edu
Wed Sep 22 12:06:00 MDT 2010


Thanks Al. 
That filled in the info I need. It would be nice if this behavior were documented on the Torque documentation site.

Robin

On Sep 22, 2010, at 12:11 PM, Al Taufer wrote:

> I think the issue is the ssh keys used by scp.  When the pbs_mom issues the scp command it does so as root but the destination is to robinr at ....  This means the ssh keys must be set up so root can transfer to the user on the destination machine.  We are currently trying to rework the scp copy portion so the transfer occurs as either root or as the user but not a combination of both.
> 
> Al
> 
> ----- Original Message -----
>> Thanks. I looked at it again and got further: the PATH is set in
>> /etc/profile.d/ and apparently it did get propagated to that Perl
>> script.
>> 
>> I'm now getting an error where pbs_mom has trouble scp'ing.
>> However, if I log in as the running user and repeat the same scp
>> command listed below by hand, the scp goes through fine.
>> 
>> Sep 22 11:39:00 compute-1-1 pbs_mom: LOG_ERROR::sys_copy, command
>> '/usr/bin/scp -rpB
>> /usr/local/torque/2.4.6/var/spool/torque/checkpoint/110761.torque.hpc.muohio.edu.CK/ckpt.110761.torque.hpc.muohio.edu.1285169919
>> robinr@robin-head-1.hpc.muohio.edu:/usr/local/torque/2.4.6/var/spool/torque/checkpoint/110761.torque.hpc.muohio.edu.CK/ckpt.110761.torque.hpc.muohio.edu.1285169919'
>> failed with status=1, giving up after 4 attempts
>> 
>> I'm not sure what the difference is between running it manually and
>> having pbs_mom invoke it. The necessary files and directories
>> do exist before the logged scp attempt.
>> 
>> Code 1: General error in the file copy
>> 
>> Robin
>> 
>> 
>> 
>> On Sep 21, 2010, at 4:15 PM, Al Taufer wrote:
>> 
>>> The 2 scripts from the contrib/blcr directory may need to be
>>> modified for your installation. They both need to set the PATH
>>> correctly so they can find certain executables. You might verify
>>> that this is correct for your system.
>>> 
>>> Al
>>> ----- Original Message -----
>>>> Thanks, I just tried it with the ones in the contrib dir.
>>>> I'm getting the same error; the return code looks as if not enough
>>>> parameters were passed.
>>>> 
>>>> ===
>>>> comment = Checkpoint script failed with return value of 255
>>>> ===
>>>> 
>>>> 
>>>> Robin
>>>> 
>>>> On Sep 21, 2010, at 2:26 PM, Al Taufer wrote:
>>>> 
>>>>> I do not know how up to date the scripts on the web page are, but
>>>>> there are 2 scripts included with the distribution. They are in the
>>>>> torque/contrib/blcr directory and should be up to date. They are
>>>>> checkpoint_script and restart_script; I would try using these.
>>>>> 
>>>>> Al Taufer
>>>>> Adaptive Computing
>>>>> 
>>>>> ----- Original Message -----
>>>>>> Hi,
>>>>>> 
>>>>>> I'm following the instructions on
>>>>>> http://www.clusterresources.com/products/torque/docs/2.6jobcheckpoint.shtml
>>>>>> Torque is compiled with --enable-blcr, version 2.4.10. I'm aware that
>>>>>> the doc is for 2.5.x; I did not easily find the doc for 2.4.x.
>>>>>> 
>>>>>> Attached are my
>>>>>> mom_priv/{config,epilogue,blcr_checkpoint_script,blcr_restart_script}.
>>>>>> They are essentially the scripts from the doc, but the script in the
>>>>>> doc needs correction (or it would not run).
>>>>>> blcr_checkpoint_script was edited to declare the variable $depth and
>>>>>> add a missing comma. The aim was to fix the syntax (I didn't spend
>>>>>> much time on the scripts).
>>>>>> [ It would be nice to see the code on the webpage fixed. ]
>>>>>> 
>>>>>> I submitted my test job with "qsub -c enabled test.job", then issued
>>>>>> qhold jobid. It did not checkpoint the job; under qstat -f, there's
>>>>>> an output line for that job:
>>>>>> comment = "Usage:
>>>>>> /usr/local/torque/current/var/spool/torque/mom_priv/blcr_checkpoint_script"
>>>>>> 
>>>>>> The MOM logs say that blcr_checkpoint_script exited with code 255,
>>>>>> which is consistent with running the script without parameters.
>>>>>> 
>>>>>> I take it that pbs_mom did not issue the blcr_checkpoint_script
>>>>>> command with all the required parameters.
>>>>>> 
>>>>>> Any comments, helpful hints, or outright help will be most
>>>>>> welcome.
>>>>>> 
>>>>>> Thanks,
>>>>>> Robin
>>>>>> 
>>>>>> _______________________________________________
>>>>>> torqueusers mailing list
>>>>>> torqueusers at supercluster.org
>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers

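For anyone hitting the same sys_copy failure: Al's explanation in this thread is that pbs_mom runs scp as root while the destination is the job owner's account, so the copy only works if root's ssh keys are accepted for that user. A minimal sketch for reproducing the daemon's copy by hand is below. The function name build_scp_cmd and the placeholder path, user, and host are made up for illustration; only the scp flags (-r recursive, -p preserve times and modes, -B batch mode, which never prompts for a password) come from the logged command.

```shell
#!/bin/sh
# Sketch: rebuild the scp command pbs_mom logged, so it can be retried as root.
# build_scp_cmd and the default values below are hypothetical, not from Torque.

build_scp_cmd() {
    # $1 = checkpoint path (same on both ends, as in the log)
    # $2 = destination user, $3 = destination host
    printf '/usr/bin/scp -rpB %s %s@%s:%s\n' "$1" "$2" "$3" "$1"
}

# Print the command. Run the printed line from a root shell on the compute
# node: if it fails there but works as the job owner, root's keys are the issue.
build_scp_cmd "${CKPT_PATH:-/path/to/checkpoint}" \
              "${DEST_USER:-someuser}" \
              "${DEST_HOST:-head.example.org}"
```

Because -B disables password prompts, a key problem shows up as an immediate failure, which is presumably why the manual test as the job owner succeeded while pbs_mom gave up with status=1.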


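On the earlier PATH advice: the contrib/blcr scripts have to find certain executables, so it can help to check what a given PATH actually resolves before blaming pbs_mom. The sketch below is one way to do that; missing_tools is a made-up helper, and the tool names cr_checkpoint and cr_restart are an assumption about what the scripts call, so substitute whatever your copies of the scripts invoke.

```shell
#!/bin/sh
# Sketch: list the programs that cannot be found on a given PATH.
# missing_tools is a hypothetical helper; the tool names are assumptions.

missing_tools() {
    # $1 = PATH value to test, remaining arguments = program names
    search_path=$1
    shift
    for tool in "$@"; do
        # Do the lookup in a subshell so the caller's PATH is left untouched.
        # Note: command -v also reports shell builtins and functions.
        if ! ( PATH=$search_path; command -v "$tool" ) >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Prints one line per missing program, nothing if all are found.
missing_tools "/usr/local/bin:/usr/bin:/bin" cr_checkpoint cr_restart
```

Passing the PATH that the checkpoint script sets, rather than your login PATH, gives the closest approximation of what the script sees when pbs_mom runs it.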