[torqueusers] Torque 2.3.7 and connection Refused
Anil Thapa
anilth at hi.is
Thu Jul 9 18:31:18 MDT 2009
Hello again,
I ran into a number of problems today while testing version 2.3.7.
1. After rebuilding all the packages and doing a clean installation,
pbs_mom is now reporting to the server, with all firewalls turned off.
2. Jobs can be submitted to the front node: when I submit a job, it
gives me a job ID and so on as usual. For example, I tested with
echo "sleep 30" | qsub, which submits the job, and the job's activity
can be seen with watch qstat. That is okay and normal.
3. Then I wrote a very simple script and submitted it. The job is
submitted, but the output is not copied back. It looks like the job was
submitted and finished, but I don't know what happened. The mom_log
shows this:
pbs_mom;Req;dis_request_read;decoding command CopyFiles from PBS_Server
pbs_mom;Req;;Type CopyFiles request received from
PBS_Server at p34.test.local, sock=12
pbs_mom;Job;dispatch_request;dispatching request CopyFiles on sd=12
pbs_mom;Job;62.bhairab.rhi.hi.is;attempting to copy file
'bhairab.rhi.hi.is:/test/anil/test.sh.o62'
pbs_mom;Job;62.bhairab.rhi.hi.is;forking to user, uid: 501 gid: 501
homedir: '/test/anil'
pbs_mom;n/a;mom_close_poll;entered
pbs_mom;Svr;mom_get_sample;proc_array load started
pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126
pbs_mom;Svr;mom_get_sample;proc_array load started
pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=126
pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 62.bhairab.rhi.hi.is
pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 61.bhairab.rhi.hi.is
pbs_mom;Job;scan_for_terminated;pid 5027 not tracked, exitcode=0
pbs_mom;Req;dis_request_read;decoding command DeleteJob from PBS_Server
I am not sure whether this is an issue with this version or just a
configuration error. In the past I did it the same way with 2.1.7 and
had no problems. Can someone point me in the right direction to fix this
installation or configuration?
4. I joined the head node to our LDAP server so users can log in to the
head node with their own username and password. Users' home directories
are on different servers and are mounted when they log in, so users get
their own home drive space. But when users submit their jobs, do their
home directories and accounts also need to exist on all the nodes? Or
what is the general practice?
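For item 3, a small script can make a long mom_log easier to read by
pulling out only the records that belong to one job. The following is a
minimal sketch, not a Torque tool: it assumes the lines look like the
excerpt above (daemon;class;source;message, with long messages wrapped
onto the next line), and the names MomLogRecord, parse_excerpt and
records_for_job are made up for illustration. A full mom_log line also
carries a timestamp and a code in front, which this sketch does not
handle.

```python
from typing import Iterable, List, NamedTuple


class MomLogRecord(NamedTuple):
    """One record from the excerpt (hypothetical name, not a Torque type)."""
    daemon: str
    event_class: str
    source: str
    message: str


def parse_excerpt(lines: Iterable[str]) -> List[MomLogRecord]:
    """Parse lines shaped like 'pbs_mom;Job;<source>;<message>'.

    Assumption: a line that does not start with 'pbs_mom;' is the wrapped
    tail of the previous record, as in the pasted excerpt.
    """
    joined: List[str] = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        if text.startswith("pbs_mom;") or not joined:
            joined.append(text)
        else:
            joined[-1] += " " + text

    records = []
    for text in joined:
        # Split on the first three semicolons only, so a message that
        # itself contains ';' stays in one piece.
        parts = text.split(";", 3)
        if len(parts) == 4:
            records.append(MomLogRecord(*parts))
    return records


def records_for_job(records: Iterable[MomLogRecord], job_id: str) -> List[MomLogRecord]:
    """Keep records whose source or message mentions the job id."""
    return [r for r in records if job_id in r.source or job_id in r.message]


EXCERPT = """\
pbs_mom;Req;dis_request_read;decoding command CopyFiles from PBS_Server
pbs_mom;Job;62.bhairab.rhi.hi.is;attempting to copy file
'bhairab.rhi.hi.is:/test/anil/test.sh.o62'
pbs_mom;Job;62.bhairab.rhi.hi.is;forking to user, uid: 501 gid: 501
homedir: '/test/anil'
pbs_mom;Job;scan_for_terminated;pid 5027 not tracked, exitcode=0
"""

if __name__ == "__main__":
    for record in records_for_job(parse_excerpt(EXCERPT.splitlines()),
                                  "62.bhairab.rhi.hi.is"):
        print(record.event_class, "|", record.message)
```

Run against the excerpt, this keeps only the two records for job 62:
the attempted copy of test.sh.o62 and the fork to uid 501.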
Finally, the last time I built the packages, the build also created the
torque pam package, but I do not see it in this version by default. I
would be grateful if someone could point me to a link for the torque pam
package.
Any input, help, or suggestions would be great.
Apologies for the long message.
Regards,
A
Anil Thapa wrote:
> Hi all,
>
> I just built the new version and installed it. I am having a small
> problem. Everything looks fine; however, when a job is submitted it
> always stays in the R state, and the other jobs stay in the Q state
> when I run qstat. I looked at the node's mom_logs and they have loads
> of this error:
>
> 07/09/2009 16:18:10;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Connection
> refused (111) in scan_for_exiting, cannot bind to port 1023 in
> client_to_svr - connection refused
>
> Note: the submitted job is a very simple script, and for testing
> purposes the firewall has been temporarily turned off.
>
> Has anyone come across this error? Any tips, help, and suggestions
> would be appreciated.
>
> Regards,
> A