[torqueusers] Torque 2.3.7 and connection Refused

Anil Thapa anilth at hi.is
Thu Jul 9 18:31:18 MDT 2009


Hello again,

I have run into a number of problems today while testing version 2.3.7.

1. After rebuilding all the packages and doing a clean installation, 
pbs_mom is now reporting to the server, with all firewalls turned off.
2. Jobs can be submitted from the front node: when I submit a job I get 
a job ID etc. as usual. For example, I tested with echo "sleep 30" 
| qsub. This submits the job, and the job activity can be seen with 
watch qstat. That is okay and normal.
3. Then I wrote a very simple script and submitted it. The job is 
accepted, but the output is not copied back. It looks as if the job was 
submitted and finished, but I don't know what happened. The mom_log 
shows this:
  
   pbs_mom;Req;dis_request_read;decoding command CopyFiles from PBS_Server
   pbs_mom;Req;;Type CopyFiles request received from PBS_Server at p34.test.local, sock=12
   pbs_mom;Job;dispatch_request;dispatching request CopyFiles on sd=12
   pbs_mom;Job;62.bhairab.rhi.hi.is;attempting to copy file 'bhairab.rhi.hi.is:/test/anil/test.sh.o62'
   pbs_mom;Job;62.bhairab.rhi.hi.is;forking to user, uid: 501  gid: 501  homedir: '/test/anil'
   pbs_mom;n/a;mom_close_poll;entered
   pbs_mom;Svr;mom_get_sample;proc_array load started
   pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126
   pbs_mom;Svr;mom_get_sample;proc_array load started
   pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126
   pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
   pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
   pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
   pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
   pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
   pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
   pbs_mom;Job;scan_for_terminated;pid 5027 not tracked, exitcode=0
   pbs_mom;Req;dis_request_read;decoding command DeleteJob from PBS_Server

I am not sure whether this is an issue with this version or just a 
configuration error. In the past I did it the same way with 2.1.7 and 
had no problems. Can someone point me in the right direction to fix this 
installation or configuration?
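
To make the excerpt above easier to read, here is a minimal Python sketch that pulls out only the records belonging to one job. It assumes the four-field form daemon;class;name;message exactly as pasted above (the timestamp and event-code fields that a full mom_log line starts with are not handled). The names MomRecord, parse_record and records_for_job are made up for illustration and are not part of Torque.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class MomRecord(NamedTuple):
    """One record in the four-field form pasted above (hypothetical type)."""
    daemon: str
    event_class: str
    name: str
    message: str


def parse_record(line: str) -> Optional[MomRecord]:
    """Split 'daemon;class;name;message'; return None for other lines."""
    # maxsplit=3 keeps any ';' inside the message text intact.
    parts = line.strip().split(";", 3)
    if len(parts) != 4:
        return None
    return MomRecord(*(p.strip() for p in parts))


def records_for_job(lines: Iterable[str], job_id: str) -> List[MomRecord]:
    """Keep records whose name field is job_id or whose message mentions it."""
    # The lookbehind stops '2.host' from matching inside '62.host'.
    mention = re.compile(r"(?<!\d)" + re.escape(job_id) + r"(?!\w)")
    matches = []
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue
        if record.name == job_id or mention.search(record.message):
            matches.append(record)
    return matches


# A few lines copied from the excerpt above.
SAMPLE = [
    "   pbs_mom;Job;62.bhairab.rhi.hi.is;attempting to copy file "
    "'bhairab.rhi.hi.is:/test/anil/test.sh.o62'",
    "   pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is",
    "   pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is",
    "   pbs_mom;Job;scan_for_terminated;pid 5027 not tracked, exitcode=0",
]

if __name__ == "__main__":
    for rec in records_for_job(SAMPLE, "62.bhairab.rhi.hi.is"):
        print(rec.event_class, rec.name, rec.message, sep=" | ")
```

Run against the full log, this shows the copy attempt for job 62 with no success or failure record after it, which is the part I cannot explain.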

4. I joined the head node to our LDAP server so users can log in to the 
head node with their own username and password. Users' home directories 
are on different servers and are mounted at login, so users get their 
own home drive space. But when users submit their jobs, do their home 
directories and accounts also have to be created on all the nodes? Or 
what is the general practice?

Finally, the last time I built the packages it also created the torque 
pam package, but I do not see it in this version by default. I would be 
grateful if someone could point me to a link for the torque pam package.

Any input, help or suggestions would be great.

Apologies for the long message.

Regards,
A
Anil Thapa wrote:
> Hi all,
>
> I just built the new version and installed it. I am having a small 
> problem. Everything looks fine, but when jobs are submitted one always 
> stays in the R state and the rest stay in the Q state when I run qstat. 
> I looked at the node's mom_logs and they contain loads of this error:
>
> 07/09/2009 16:18:10;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Connection 
> refused (111) in scan_for_exiting, cannot bind to port 1023 in 
> client_to_svr - connection refused
>
> Note: the submitted job is a very simple script, and for testing 
> purposes the firewall has been temporarily turned off.
>
> Has anyone come across this error? Any tips, help and suggestions 
> would be appreciated.
>
> Regards,
> A


