[torqueusers] Torque + Maui Configuration

Tomás Soares tomas at lsd.ufcg.edu.br
Thu Oct 22 08:23:43 MDT 2009


Arnau Bria wrote:
> On Thu, 22 Oct 2009 10:24:00 -0300
> Tomás Soares wrote:
>
>   
>> Hello All,
>>     
> Hi,
>  
>   
>> I have 9 nodes with torque and maui installed, running these
>> versions: torque-2.3.0 and maui-3.2.6p20.
>> I was trying to submit a simple job using "qsub -q prod script.sh",
>> but when I run qstat the job's status keeps switching between Queued, E
>> and Running; it never returns any output and never leaves the
>> qstat listing. Does anyone have an idea what is wrong?
>>     
>
> status goes from Q -> E -> R? 
>   
Yes, the status goes from Q -> E -> R -> Q. It then waits for a while and 
repeats the same cycle.
> What does qstat -f show (while in R status)?
>   
The state switches quickly all the time, but I managed to capture this:

Job Id: 6647.server1.my.host.name
    Job_Name = script.sh
    Job_Owner = prod000 at server1.my.host.name
    job_state = E
    queue = prod
    server = server1.my.host.name
    Checkpoint = u
    ctime = Thu Oct 22 10:11:10 2009
    Error_Path = server1.my.host.name:/home/prod000/script.sh.e6647
    exec_host = WN9.my.host.name/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Oct 22 11:12:16 2009
    Output_Path = server1.my.host.name:/home/prod000/script.sh.o6647
    Priority = 0
    qtime = Thu Oct 22 10:11:10 2009
    Rerunable = True
    Resource_List.cput = 48:00:00
    Resource_List.neednodes = WN9.my.host.name
    Resource_List.walltime = 72:00:00
    session_id = 19632
    substate = 60
    Variable_List = PBS_O_HOME=/home/prod000,PBS_O_LANG=en_US.UTF-8,
    PBS_O_LOGNAME=prod000,
    PBS_O_PATH=/usr/kerberos/bin:/opt/edg/bin:/opt/glite/bin:/opt/lcg/bin
    :/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/globus/bin:/home/eel
    aprod000/bin,PBS_O_MAIL=/var/spool/mail/prod000,
    PBS_O_SHELL=/bin/bash,PBS_SERVER=server1.my.host.name,
    PBS_O_HOST=server1.my.host.name,PBS_O_WORKDIR=/home/prod000,
    PBS_O_QUEUE=prod
    euser = prod000
    egroup = prod
    hashname = 6647.ce
    queue_rank = 49
    queue_type = E
    etime = Thu Oct 22 10:11:10 2009
    exit_status = -4
    submit_args = -q prod script.sh
    start_time = Thu Oct 22 10:17:55 2009
    start_count = 525
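
The fields that seem most relevant here are job_state, exit_status and start_count. As a rough sketch, they can be pulled out of the "qstat -f" output with awk; job_field is only an illustrative helper name, and the sample text below is copied from the output above:

```shell
# Print the value of one attribute from "qstat -f" output read on stdin.
# job_field is a hypothetical helper, not part of Torque.
job_field() {
    awk -v key="$1" '
        # Attribute lines look like "    name = value"
        $1 == key && $2 == "=" { sub(/^[^=]*= /, ""); print; exit }
    '
}

# Sample lines taken from the qstat -f output above.
sample='    job_state = E
    exit_status = -4
    start_count = 525'

printf '%s\n' "$sample" | job_field start_count
```

On the server the same helper could be fed directly, e.g. "qstat -f 6647 | job_field start_count", to watch whether the start count keeps growing.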

> Have you checked the undelivered dir on the WN for stderr/stdout?
>   
In /var/spool/pbs/mom_logs/ I found:

10/22/2009 11:12:13;0080;   pbs_mom;Req;scan_for_exiting;no contact with server at hostaddr 96a50f32, port 15001, jobid 6647. server1.my.host.name errno 111
10/22/2009 11:12:16;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
10/22/2009 11:12:16;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
10/22/2009 11:12:16;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
10/22/2009 11:12:16;0008;   pbs_mom;Job;scan_for_terminated;checking job post-processing routine
10/22/2009 11:12:16;0080;   pbs_mom;Job;6647. server1.my.host.name;obit sent to server
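
The hostaddr value in the first log line appears to be the server's IPv4 address printed in hexadecimal, and errno 111 on Linux is ECONNREFUSED, meaning nothing accepted the connection on that port. A small sketch for decoding the address, assuming it is printed most significant byte first (hex_to_ip is just an illustrative helper name, not a Torque tool):

```shell
# Convert an 8-digit hex address, as shown after "hostaddr" in the
# pbs_mom log, to dotted-quad notation.
# Assumption: the value is printed most significant byte first.
hex_to_ip() {
    h=$1
    printf '%d.%d.%d.%d\n' \
        "0x$(printf '%s' "$h" | cut -c1-2)" \
        "0x$(printf '%s' "$h" | cut -c3-4)" \
        "0x$(printf '%s' "$h" | cut -c5-6)" \
        "0x$(printf '%s' "$h" | cut -c7-8)"
}

hex_to_ip 96a50f32
```

If the decoded address is the one server1 resolves to, it may be worth checking from the WN whether pbs_server is actually reachable on port 15001.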

>
> Does the same happen if you submit only a dummy command instead of the script?
> echo sleep 1000|qsub -q prod
>
> if answer is no:
>
>   
Yes, the same problem occurs with this command.
> Is copying without a password allowed between hosts?
>
> http://www.clusterresources.com/torquedocs/6.1scpsetup.shtml
>
>   
Yes, the copy is allowed.

>
>   
>> Thanks a lot!!!
>>     
> Cheers,
> Arnau
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>   
Thanks...


