[torqueusers] Torque + Maui Configuration
Tomás Soares
tomas at lsd.ufcg.edu.br
Thu Oct 22 08:23:43 MDT 2009
Arnau Bria wrote:
> On Thu, 22 Oct 2009 10:24:00 -0300
> Tomás Soares wrote:
>
>
>> Hello All,
>>
> Hi,
>
>
>> I have 9 nodes with torque and maui installed, and I'm using these
>> versions: torque-2.3.0 and maui-3.2.6p20.
>> I was trying to submit a simple job using "qsub -q prod script.sh",
>> but when I run qstat the job's status keeps switching between Queued, E
>> and Running; the job never returns any output and never leaves the
>> qstat listing. Does anyone have an idea of what is wrong?
>>
>
> status goes from Q -> E -> R?
>
Yes, the status goes from Q -> E -> R -> Q. The job then waits for a
while and repeats the same cycle.
> what does a qstat -f show (on R status)?
>
The state switches quickly all the time, but I managed to capture this:
Job Id: 6647.server1.my.host.name
Job_Name = script.sh
Job_Owner = prod000 at server1.my.host.name
job_state = E
queue = prod
server = server1.my.host.name
Checkpoint = u
ctime = Thu Oct 22 10:11:10 2009
Error_Path = server1.my.host.name:/home/prod000/script.sh.e6647
exec_host = WN9.my.host.name/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Thu Oct 22 11:12:16 2009
Output_Path = server1.my.host.name:/home/prod000/script.sh.o6647
Priority = 0
qtime = Thu Oct 22 10:11:10 2009
Rerunable = True
Resource_List.cput = 48:00:00
Resource_List.neednodes = WN9.my.host.name
Resource_List.walltime = 72:00:00
session_id = 19632
substate = 60
Variable_List = PBS_O_HOME=/home/prod000,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=prod000,
PBS_O_PATH=/usr/kerberos/bin:/opt/edg/bin:/opt/glite/bin:/opt/lcg/bin
:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/globus/bin:/home/eel
aprod000/bin,PBS_O_MAIL=/var/spool/mail/prod000,
PBS_O_SHELL=/bin/bash,PBS_SERVER=server1.my.host.name,
PBS_O_HOST=server1.my.host.name,PBS_O_WORKDIR=/home/prod000,
PBS_O_QUEUE=prod
euser = prod000
egroup = prod
hashname = 6647.ce
queue_rank = 49
queue_type = E
etime = Thu Oct 22 10:11:10 2009
exit_status = -4
submit_args = -q prod script.sh
start_time = Thu Oct 22 10:17:55 2009
start_count = 525
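The fields in this output that say most about the loop are job_state, exit_status and start_count. A start_count of 525 suggests the server has started the job hundreds of times, and, as far as I understand TORQUE, a negative exit_status is set by pbs_mom itself rather than by the job script. A minimal sketch for pulling single attributes out of "qstat -f" output so they can be watched while the job cycles; the helper name qstat_field is hypothetical and not part of TORQUE, and it only handles attributes whose value fits on one line:

```shell
# qstat_field NAME: read `qstat -f` output on stdin and print the value
# of the first attribute called NAME (hypothetical helper, not a TORQUE tool).
qstat_field() {
    awk -v key="$1" '{
        line = $0
        sub(/^[ \t]+/, "", line)              # drop leading indentation
        if (index(line, key " = ") == 1) {    # line starts with "NAME = "
            print substr(line, length(key) + 4)
            exit
        }
    }'
}

# Example use against a live job (requires qstat on the PATH):
#   qstat -f 6647 | qstat_field start_count
```

Running it repeatedly for start_count should show whether the counter is still increasing, which would confirm that the job keeps being restarted.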
> have you checked the undelivered dir on the WN for stderr/stdout?
>
In /var/spool/pbs/mom_logs/ I found:
10/22/2009 11:12:13;0080; pbs_mom;Req;scan_for_exiting;no contact with server at hostaddr 96a50f32, port 15001, jobid 6647. server1.my.host.name errno 111
10/22/2009 11:12:16;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
10/22/2009 11:12:16;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
10/22/2009 11:12:16;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
10/22/2009 11:12:16;0008; pbs_mom;Job;scan_for_terminated;checking job post-processing routine
10/22/2009 11:12:16;0080; pbs_mom;Job;6647. server1.my.host.name;obit sent to server
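The first log line says pbs_mom could not reach the server at hostaddr 96a50f32, port 15001, with errno 111, which on Linux is ECONNREFUSED (connection refused). A minimal sketch for turning the hex hostaddr into a dotted-quad address, so it can be compared with the address pbs_server is expected to listen on; it assumes the value is printed most significant byte first, and the helper name decode_hostaddr is hypothetical:

```shell
# decode_hostaddr HEX: convert an 8-digit hex IPv4 address, as printed in
# the pbs_mom log, to dotted-quad form (assumes most significant byte first).
decode_hostaddr() {
    hex=$1
    a=$(printf '%s' "$hex" | cut -c1-2)
    b=$(printf '%s' "$hex" | cut -c3-4)
    c=$(printf '%s' "$hex" | cut -c5-6)
    d=$(printf '%s' "$hex" | cut -c7-8)
    # printf %d accepts C-style 0x constants, so each byte is converted here
    printf '%d.%d.%d.%d\n' "0x$a" "0x$b" "0x$c" "0x$d"
}

decode_hostaddr 96a50f32
```

If the decoded address is not the one pbs_server uses, or nothing is listening on that port there, that would be consistent with the refused connection in the log.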
>
> does the same happen if you only submit a dummy command outside of the script?
> echo sleep 1000|qsub -q prod
>
> if answer is no:
>
>
Yes, the same problem occurs with this command.
> is passwordless copy allowed between hosts?
>
> http://www.clusterresources.com/torquedocs/6.1scpsetup.shtml
>
>
Yes, the copy is allowed.
>
>
>> Thanks a lot!!!
>>
> Cheers,
> Arnau
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
Thanks...