[torqueusers] Torque 2.3.7 and connection Refused

Prakash Velayutham prakash.velayutham at cchmc.org
Thu Jul 9 20:04:25 MDT 2009


Hi,

On Jul 9, 2009, at 8:31 PM, Anil Thapa wrote:

> Hello again,
>
> I have run into a number of problems today while testing version 2.3.7.
>
> 1. After rebuilding all the packages and doing a clean installation,
> pbs_mom is now reporting to the server with all firewalls turned off.
> 2. Jobs can be submitted to the front node. When I submit a job, it
> returns a job ID and so on as usual. For example, I tested with
> echo "sleep 30" | qsub
> which submits the job, and the job activity can then be watched. That
> is okay and normal.
> 3. Then I wrote a very simple script and submitted it. The job gets
> submitted, but the output is not copied back. It looks like the job was
> submitted and finished, but I don't know what happened. The mom_log
> shows this:
>
> pbs_mom;Req;dis_request_read;decoding command CopyFiles from PBS_Server
> pbs_mom;Req;;Type CopyFiles request received from PBS_Server at p34.test.local, sock=12
> pbs_mom;Job;dispatch_request;dispatching request CopyFiles on sd=12
> pbs_mom;Job;62.bhairab.rhi.hi.is;attempting to copy file 'bhairab.rhi.hi.is:/test/anil/test.sh.o62'
> pbs_mom;Job;62.bhairab.rhi.hi.is;forking to user, uid: 501  gid: 501 homedir: '/test/anil'
> pbs_mom;n/a;mom_close_poll;entered
> pbs_mom;Svr;mom_get_sample;proc_array load started
> pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126
> pbs_mom;Svr;mom_get_sample;proc_array load started
> pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126
> pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
> pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
> pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
> pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
> pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
> pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
> pbs_mom;Job;scan_for_terminated;pid 5027 not tracked, exitcode=0
> pbs_mom;Req;dis_request_read;decoding command DeleteJob from PBS_Server

I think this means your job completed successfully. I am not sure what
the job is or what it is supposed to do.
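
If it helps, here is a rough Python sketch for pulling the server-to-mom
requests out of a mom_log excerpt like the one you pasted. As far as I can
tell, a CopyFiles request followed by DeleteJob is what you would expect
once a job has finished and the server is copying output back and cleaning
up. The sketch assumes the abbreviated field layout in your excerpt
(daemon;class;name;message), optionally preceded by a timestamp and an event
code. The function names are my own invention and are not part of Torque.

```python
import re


def parse_mom_log_line(line):
    """Split one mom_log line into (daemon, class, name, message), or None."""
    # Drop mail quoting ("> ") and surrounding whitespace.
    text = line.lstrip("> ").strip()
    fields = [f.strip() for f in text.split(";")]
    # Full log lines start with a timestamp and an event code; skip both.
    if len(fields) >= 6 and re.match(r"\d{2}/\d{2}/\d{4} ", fields[0]):
        fields = fields[2:]
    if len(fields) < 4:
        return None
    daemon, klass, name = fields[:3]
    message = ";".join(fields[3:])
    return daemon, klass, name, message


def server_requests(lines):
    """Return the request names decoded by dis_request_read, in log order."""
    requests = []
    for line in lines:
        parsed = parse_mom_log_line(line)
        if parsed is None:
            continue
        _, _, name, message = parsed
        match = re.match(r"decoding command (\w+) from", message)
        if name == "dis_request_read" and match:
            requests.append(match.group(1))
    return requests


# Three lines taken from the excerpt in this thread.
sample = [
    "pbs_mom;Req;dis_request_read;decoding command CopyFiles from PBS_Server",
    "pbs_mom;Job;scan_for_terminated;pid 5027 not tracked, exitcode=0",
    "pbs_mom;Req;dis_request_read;decoding command DeleteJob from PBS_Server",
]
print(server_requests(sample))
```

If CopyFiles shows up but the output file never arrives, the copy step
itself would be the next thing I would look at.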

>
> I am not sure if this is an issue with this version or just a
> configuration error. In the past I did it the same way with 2.1.7 and
> had no problems. Can someone point me in the right direction to fix this
> installation or configuration?
>
> 4. I joined the head node to our LDAP server so users can log in to the
> head node with their own username and password. Users' home directories
> are on different servers, which are mounted when they log in, so users
> get their own home drive space. But when users submit their jobs, do
> their home directories and accounts also need to be created on all the
> nodes? Or what is the general practice?

I guess you mean LDAP for name services and authentication, plus
automounted home directories, right? If so, are LDAP and the automounter
configured correctly on the head node and all the compute nodes, and is
the automounter daemon running on the head node and all the compute nodes?

If LDAP is not configured correctly on the compute nodes, user jobs
won't even start, as the compute nodes won't know anything about the
user. And if the automounter is not running, the compute nodes will not
run a job, as a shell cannot be started.
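
A quick way to check both points on a compute node is something like the
Python sketch below. It is only an illustration: the helper name check_user
is made up, and the username "anil" is just taken from the homedir in your
log. It looks the user up through the system's name service (so LDAP users
count if LDAP is wired in) and then tests whether the home directory is
reachable, which on an automounted setup should also prompt the mount.

```python
import os
import pwd


def check_user(username):
    """Report whether a user resolves and whether the home directory exists."""
    try:
        # pwd uses the system's name service, so LDAP-backed users show up
        # here only if the node is configured to consult LDAP.
        entry = pwd.getpwnam(username)
    except KeyError:
        return {"user": username, "resolves": False,
                "home": None, "home_ok": False}
    # Accessing the path is what prompts an automounter to mount it.
    home_ok = os.path.isdir(entry.pw_dir)
    return {"user": username, "resolves": True,
            "home": entry.pw_dir, "home_ok": home_ok}


# Example username only; substitute a real account on your cluster.
print(check_user("anil"))
```

Run it on the head node and on each compute node. If "resolves" is False
on a compute node, look at the LDAP setup there; if the user resolves but
"home_ok" is False, look at the automounter.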

>
> Finally, the last time I built the packages it also created the torque
> pam package, but I do not see it in this version by default. I would be
> grateful if someone could point me to a link for the torque pam package.

Not sure about 2.3.7, but 2.3.6 has it.

>
> Any input, help, or suggestions would be great.
>
> Apologies for the long message.
>
> Regards,
> A

Good luck,
Prakash

> Anil Thapa wrote:
>> Hi all,
>>
>> I just built the new version and installed it. I am having a small
>> problem. Everything looks fine, but when a job is submitted it always
>> stays in the R state, and the other jobs stay in the Q state when I run
>> qstat. I looked at the node's mom_logs, which have loads of this error:
>>
>> 07/09/2009 16:18:10;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Connection refused (111) in scan_for_exiting, cannot bind to port 1023 in client_to_svr - connection refused
>>
>> Note: the submitted job is a very simple script, and for testing
>> purposes the firewall has been temporarily turned off.
>>
>> Has anyone come across this error? Any tips, help, and suggestions
>> would be appreciated.
>>
>> Regards,
>> A
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>


