[torqueusers] Torque 2.3.7 and connection Refused

Prakash Velayutham prakash.velayutham at cchmc.org
Fri Jul 10 09:38:35 MDT 2009


Hi,

Please copy the list in your emails as this might help others looking  
for the same answers.

On Jul 10, 2009, at 11:22 AM, Anil Thapa wrote:

> Hello
>
> Thanks for your help. I removed Torque completely from the server
> and the client, then rebuilt with ./configure --with-scp on both the
> head node and the client. Job submission worked fine, but the job
> output always ended up in the /var/spool/torque/undelivered
> directory. Then I followed "6.1.5 - Enabling Bi-Directional SCP
> Access" from Cluster Resources
> (http://www.clusterresources.com/products/torque/docs/6.1scpsetup.shtml).
> I created identical local users on both the head node and the
> compute node, with the same uid. Then it worked as it is supposed
> to: jobs are sent to the compute node and the results come back to
> the user's /home/user directory. At least this appears to work, but
> it is not an ideal way.
>
> It would be ideal if I didn't have to create a user and a home
> directory for every user. I was thinking of adding every compute
> node to the LDAP server and exporting the home directories to them,
> as is done for my head node. What are your thoughts on this?

It is supposed to work that way if LDAP and automounter are configured  
correctly and your home directory server's export list allows head  
node and compute nodes to mount the relevant directory with the  
required permissions.
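
As a quick sanity check (my own sketch, not part of Torque), something
like the following Python could be run on each compute node to confirm
that the name service resolves a user and that the home directory is
reachable. The function name check_user and the username anil are only
illustrative. On an automounted setup, touching the directory is
normally what triggers the mount.

```python
import os
import pwd


def check_user(username):
    """Report whether this node knows the user and can see the home directory."""
    try:
        # Goes through the sources listed in /etc/nsswitch.conf (files, ldap, ...).
        entry = pwd.getpwnam(username)
    except KeyError:
        return {"known": False, "uid": None, "home": None, "home_present": False}
    return {
        "known": True,
        "uid": entry.pw_uid,
        "home": entry.pw_dir,
        # Checking the path is usually enough to make the automounter mount it.
        "home_present": os.path.isdir(entry.pw_dir),
    }


if __name__ == "__main__":
    # "anil" is just an example account name; substitute a real cluster user.
    print(check_user("anil"))
```

If "known" is False on a compute node, the name service setup on that
node is the first thing to look at. If "known" is True but
"home_present" is False, look at the automounter and the export list.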

>
> However, do users still have to run ssh-keygen -t rsa in both
> directions so that the compute node can send the results back (or
> is there a better option)?

Not needed. Please see the $usecp directive in the PBS MOM's
configuration file. As the home directories are uniform and
automounted across the whole cluster, MOM just needs to cp the
output and error files instead of using any kind of remote copy.
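
To illustrate the idea, here is a small Python sketch of how a $usecp
rule maps a remote destination onto a local path. It is a simplified
model written for this email, not MOM's actual code. The rule
"$usecp *:/test /test" is an assumption based on the home directory
/test/anil in your log, and the function names are made up.

```python
from fnmatch import fnmatch


def parse_usecp(line):
    """Split a '$usecp host_pattern:src_prefix dst_prefix' line into its parts."""
    keyword, source, dst_prefix = line.split()
    if keyword != "$usecp":
        raise ValueError("not a $usecp directive: " + line)
    host_pattern, src_prefix = source.split(":", 1)
    return host_pattern, src_prefix, dst_prefix


def local_destination(rule, destination):
    """Return the local path to cp to, or None if the rule does not apply."""
    host_pattern, src_prefix, dst_prefix = rule
    host, path = destination.split(":", 1)
    # Simplified matching: wildcard on the host, plain prefix on the path.
    if fnmatch(host, host_pattern) and path.startswith(src_prefix):
        return dst_prefix + path[len(src_prefix):]
    return None  # no match, so a remote copy (rcp/scp) would be used instead


if __name__ == "__main__":
    # Assumed rule for a cluster where /test is mounted on every node.
    rule = parse_usecp("$usecp *:/test /test")
    print(local_destination(rule, "bhairab.rhi.hi.is:/test/anil/test.sh.o62"))
```

With a rule like that in the MOM configuration file on each compute
node (and pbs_mom restarted), the output file is simply copied into
the shared home directory, so no per-user SSH keys are involved.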

>
> Thanks and have a good weekend.
>
> A

Prakash

> Prakash Velayutham wrote:
>> Hi,
>>
>> On Jul 10, 2009, at 5:00 AM, Anil Thapa wrote:
>>
>>> Hello !
>>>
>>> Thanks for your response
>>>
>>> Prakash Velayutham wrote:
>>>> Hi,
>>>>
>>>> On Jul 9, 2009, at 8:31 PM, Anil Thapa wrote:
>>>>
>>>>> Hello again,
>>>>>
>>>>> I ran into a number of problems today while testing version 2.3.7.
>>>>>
>>>>> 1. After rebuilding all the packages and doing a clean
>>>>> installation, pbs_mom now reports to the server, with all the
>>>>> firewalls turned off.
>>>>> 2. Jobs can be submitted on the front node: when I submit a job,
>>>>> it gives me a job ID etc. as usual. For example, I tested with
>>>>> echo "sleep 30" | qsub. This submits the job, and with watch qstat
>>>>> the job activity can be seen. That is okay and normal.
>>>>> 3. Then I wrote a very simple script and submitted it. The job is
>>>>> submitted, but the output is not delivered. It looks like the job
>>>>> was submitted and finished, but I don't know what happened.
>>>>> mom_log shows this:
>>>>>
>>>>> pbs_mom;Req;dis_request_read;decoding command CopyFiles from PBS_Server
>>>>> pbs_mom;Req;;Type CopyFiles request received from PBS_Server@p34.test.local, sock=12
>>>>> pbs_mom;Job;dispatch_request;dispatching request CopyFiles on sd=12
>>>>> pbs_mom;Job;62.bhairab.rhi.hi.is;attempting to copy file 'bhairab.rhi.hi.is:/test/anil/test.sh.o62'
>>>>> pbs_mom;Job;62.bhairab.rhi.hi.is;forking to user, uid: 501  gid: 501  homedir: '/test/anil'
>>>>> pbs_mom;n/a;mom_close_poll;entered
>>>>> pbs_mom;Svr;mom_get_sample;proc_array load started
>>>>> pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126
>>>>> pbs_mom;Svr;mom_get_sample;proc_array load started
>>>>> pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126
>>>>> pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
>>>>> pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
>>>>> pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
>>>>> pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
>>>>> pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
>>>>> pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
>>>>> pbs_mom;Job;scan_for_terminated;pid 5027 not tracked, exitcode=0
>>>>> pbs_mom;Req;dis_request_read;decoding command DeleteJob from PBS_Server
>>>>
>>>> I think this means your job completed successfully. Not sure what  
>>>> the job is and what it is supposed to do.
>>>>
>>> Okay! The job was just to output the hostname. Normally, when a
>>> job finishes, it puts the output in a file named jobid.e01 or
>>> similar in the user's home directory.
>>>>>
>>>>> I am not sure if this is an issue with this version or just a
>>>>> configuration error. In the past I did it a similar way with
>>>>> 2.1.7 and had no problem. Can someone point me in the right
>>>>> direction to fix this installation or configuration?
>>>>>
>>>>> 4. I joined the head node to our LDAP server so users can log in
>>>>> to the head node with their own username and password. Users'
>>>>> home directories are on different servers, which are mounted when
>>>>> they log in, so users get their own home drive space. But when
>>>>> users submit their jobs, do their home drives and accounts also
>>>>> have to be created on all the nodes? Or what is the general
>>>>> practice?
>>>>
>>>> I guess you mean LDAP for name services and authentication and
>>>> automounted home directories, right? If yes, are LDAP and the
>>>> automounter configured correctly on the head node and all the
>>>> compute nodes, and is the automounter daemon running on the head
>>>> node and all the compute nodes?
>>>>
>>>> If LDAP is not configured correctly on the compute nodes, user  
>>>> jobs won't even start as the compute nodes won't know anything  
>>>> about the user. And if automounter is not running, compute nodes  
>>>> will not run a job as a shell cannot be started.
>>>>
>>> Yes, this is what I am talking about. On the head node, home
>>> directories are automounted and users get their home directories
>>> there. Isn't the LDAP configuration for the nodes similar?
>>
>> That depends. There are 2 things going on here.
>> Name services are configured in /etc/nsswitch.conf. If LDAP is
>> listed as a source in that file for authentication (shadow and
>> passwd) and for automounts (automount), then LDAP will be used.
>> Next, the LDAP configuration (for system-level authentication)
>> generally is in /etc/ldap.conf (it could be somewhere else in your
>> distribution). So, if that file is the same on the head node and
>> the compute nodes, they should all behave the same way.
>>
>>>>>
>>>>> Finally, the last time I built the packages it also created the
>>>>> torque pam package, but I do not see it in this version by
>>>>> default. I would be grateful if someone could point me to a link
>>>>> to the torque pam package.
>>>>
>>>> Not sure about 2.3.7, but 2.3.6 has it.
>>>>
>>>>>
>>> What does this pam package actually do?
>>
>> It is supposed to let you control who can gain access to the  
>> compute nodes through services other than Torque (like SSH). You  
>> would not want users to be able to SSH into compute nodes and start  
>> any process outside of Torque. But you would want to enable users  
>> to be able to SSH to a compute node where they already have a job  
>> running. That is what pam_pbssimpleauth module lets you configure.
>>
>>>
>>> Thanks
>>> A
>>>>
>>>>> Anil Thapa wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I just built the new version and installed it. I am having a
>>>>>> small problem. Everything looks fine; however, when a job is
>>>>>> submitted it always stays in the R state and the other jobs
>>>>>> remain in the Q state when I run qstat. I looked at the node's
>>>>>> mom_logs and they have loads of this error:
>>>>>>
>>>>>> 07/09/2009 16:18:10;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Connection refused (111) in scan_for_exiting, cannot bind to port 1023 in client_to_svr - connection refused
>>>>>>
>>>>>> Note: the submitted job is a very simple script, and for
>>>>>> testing purposes the firewall has been temporarily turned off.
>>>>>>
>>>>>> Has anyone come across this error? Any tips, help, and
>>>>>> suggestions would be appreciated.
>>>>>>
>>>>>> Regards,
>>>>>> \A
>>
>> Hope that helps,
>> Prakash
>


