[torqueusers] node is down in logs but free and online in pbsnodes

Sreedhar Manchu sm4082 at nyu.edu
Tue Nov 26 14:10:41 MST 2013


Hi Ricardo,

Just to add to what Gus wrote: I believe you can run qrun with the -H option
to pick the hosts yourself. If you wanted to push your job onto, let's say,
compute-3-2 and compute-3-3, something like this should do it:

qrun -H compute-3-2:ppn=8+compute-3-3:ppn=8 <job ID>
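
If you need to do this for more than a couple of nodes, typing the host list by hand gets tedious. Here is a minimal sketch, assuming bash. Note that build_hostlist is just a hypothetical helper name of mine, not a Torque command, and all it does is assemble a string in the same node:ppn=N+node:ppn=N form as the line above. It prints the qrun command rather than running it, so you can check it first.

```shell
#!/usr/bin/env bash
# build_hostlist PPN NODE [NODE...]
# Hypothetical helper (not part of Torque): joins NODE:ppn=PPN entries
# with '+', matching the host list form used in the qrun -H line above.
build_hostlist() {
    local ppn=$1
    shift
    local hostlist="" node
    for node in "$@"; do
        # Prepend a '+' separator only when the list is already non-empty.
        hostlist="${hostlist:+$hostlist+}${node}:ppn=${ppn}"
    done
    printf '%s\n' "$hostlist"
}

# Print (do not execute) the resulting command; <job ID> stays a placeholder.
echo "qrun -H $(build_hostlist 8 compute-3-2 compute-3-3) <job ID>"
```

Once the printed line looks right, replace <job ID> with the real job ID and run it as a Torque manager or operator.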

Sreedhar.

Sreedhar Manchu
HPC Systems Administrator
eSystems & Research Services
New York University, New York 10012



On Tue, Nov 26, 2013 at 4:03 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:

> Hi Ricardo
>
> 1. I think qrun will use the "worst possible node"
> unless you give it a host with -H.
> See 'man qrun'.
>
> 2. Which scheduler are you using?  There were problems
> (job in Q state forever) reported
> with pbs_sched on Torque 4.X (probably fixed at this point).
> Maui should start the jobs.
>
> 3. Did you enable scheduling?
> qmgr -c 'set server scheduling = True'
>
> I hope this helps,
> Gus Correa
>
> On 11/26/2013 03:15 PM, Ricardo Román Brenes wrote:
> > Hello everyone.
> >
> > I have a server running Torque and a lot of nodes. I'm trying to make ONE
> > of those nodes run something.
> >
> > I'm issuing
> >
> > [rroman@meta ~]$ qsub -I -q ps3 -l nodes=1:ppn=2
> >
> > and the job never runs.
> >
> > I tried to force-run it with qrun, but oddly it started the
> > interactive job on a node that does not belong to that queue:
> >
> > [rroman@meta ~]$ qsub -I -q ps3 -l nodes=1:ppn=2
> > qsub: waiting for job 1038.meta.cnca to start
> > qsub: job 1038.meta.cnca ready
> >
> > [rroman@cadejos-0 ~]$ qstat -f 1038
> > Job Id: 1038.meta.cnca
> >      Job_Name = STDIN
> >      Job_Owner = rroman@meta.cnca
> >      job_state = R
> >      queue = ps3
> >      server = meta.cnca
> >      Checkpoint = u
> >      ctime = Tue Nov 26 20:27:26 2013
> >      Error_Path = /dev/pts/0
> >      exec_host = cadejos-0.cnca/1+cadejos-0.cnca/0
> >      Hold_Types = n
> >      interactive = True
> >      Join_Path = n
> >      Keep_Files = n
> >      Mail_Points = a
> >      mtime = Tue Nov 26 20:38:17 2013
> >      Output_Path = /dev/pts/0
> >      Priority = 0
> >      qtime = Tue Nov 26 20:27:26 2013
> >      Rerunable = False
> >      Resource_List.nodect = 1
> >      Resource_List.nodes = 1:ppn=2
> >      Resource_List.walltime = 01:00:00
> >      session_id = 1668
> >      Variable_List = PBS_O_QUEUE=ps3,PBS_O_HOME=/home/rroman,
> > PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=rroman,
> > PBS_O_PATH=/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/lo
> > cal/sbin:/usr/sbin:/sbin:/home/rroman/bin,
> > PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash,
> > PBS_O_HOST=meta.cnca,PBS_SERVER=meta.cnca,PBS_O_WORKDIR=/home/rroman,
> > HOSTNAME=meta.cnca,TERM=xterm,SHELL=/bin/bash,HISTSIZE=1000,
> > SSH_CLIENT=10.4.255.9 59203 22,QTDIR=/usr/lib64/qt-3.3,
> > QTINC=/usr/lib64/qt-3.3/include,SSH_TTY=/dev/pts/1,USER=rroman,
> > LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=
> > 40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=3
> > 0;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj
> > =01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.
> > zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01
> > ;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=0
> > 1;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpi
> > o=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.b
> > mp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*
> > .xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;
> > 35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=
> > 01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.v
> > ob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.r
> > mvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*
> > .dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:
> > *.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36
> > :*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=0
> > 1;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=
> > 01;36:*.xspf=01;36:,MAIL=/var/spool/mail/rroman,
> > PATH=/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sb
> > in:/usr/sbin:/sbin:/home/rroman/bin,PWD=/home/rroman,LANG=en_US.UTF-8,
> > SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,
> > HISTCONTROL=ignoredups,SHLVL=1,HOME=/home/rroman,LOGNAME=rroman,
> > QTLIB=/usr/lib64/qt-3.3/lib,CVS_RSH=ssh,
> > SSH_CONNECTION=10.4.255.9 59203 192.168.0.2 22,
> > LESSOPEN=|/usr/bin/lesspipe.sh %s,DISPLAY=localhost:11.0,
> > G_BROKEN_FILENAMES=1,_=/usr/bin/qsub
> >      etime = Tue Nov 26 20:27:26 2013
> >      submit_args = -V -I -q ps3 -l nodes=1:ppn=2
> >      start_time = Tue Nov 26 20:38:17 2013
> >      Walltime.Remaining = 3585
> >      start_count = 1
> >      fault_tolerant = False
> >      submit_host = meta.cnca
> >      init_work_dir = /home/rroman
> >
> > [rroman@cadejos-0 ~]$
> >
> >
> > my nodes file:
> > [root@meta server_priv]# cat nodes
> > cadejos-0.cnca np=8 tesla xeon
> > cadejos-1.cnca np=8 tesla xeon
> > cadejos-2.cnca np=8 tesla xeon
> > cadejos-3.cnca np=8 xeon
> > cadejos-4.cnca np=8 xeon
> > zarate-0.cnca np=2 ps3
> > zarate-1.cnca np=2 ps3
> > zarate-2.cnca np=2 ps3
> > zarate-3.cnca np=2 ps3
> >
> >
> >
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers