[torqueusers] node is down in logs but free and online in pbsnodes

glen.beane at gmail.com
Tue Nov 26 14:48:49 MST 2013


qrun circumvents whatever policies you put in place with Maui, so it will ignore things like queue-to-host mappings in Maui, reservations, etc.
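
For illustration, using the job id from the qstat output quoted below (the host name is just an example from the nodes file; qrun's -H flag and Maui's checkjob are the usual tools here, but treat the exact invocations as a sketch):

    qrun 1038.meta.cnca                   # pbs_server places the job itself; Maui is bypassed
    qrun -H zarate-0.cnca 1038.meta.cnca  # force a specific host, still bypassing Maui
    checkjob -v 1038                      # ask Maui why it is not starting the job on its own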

Sent from my iPhone

> On Nov 26, 2013, at 4:32 PM, Ricardo Román Brenes <roman.ricardo at gmail.com> wrote:
> 
> thanks to everyone for your input!
> 
> @Gus:
> I'm running Maui 3.3.1 and the Torque 2.5.7 server on the server machine, and the Torque 2.2.1 client on the compute node.
> 
> Scheduling is enabled.
> 
> @Sree:
> Thanks, I've been forcing it just for testing!
> 
> @aaron:
> I was thinking of reconfiguring, but there are users logged in right now :O
> 
> 
>> On Tue, Nov 26, 2013 at 3:15 PM, Aaron Cordero <cordero.aaron at gene.com> wrote:
>> Just verified what I posted on my Torque 4.2.6 cluster and it worked as expected (echo sleep 60 | qsub -q batch -l nodes=node1:ppn=2).
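>> 
>> (To confirm where such a test job actually landed, you can check its exec_host, e.g.:
>> 
>>     qstat -f <jobid> | grep exec_host
>> )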
>> 
>> Is the server_priv/nodes file correct?  
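>> 
>> One way to cross-check it (standard Torque commands; the host name is just an example taken from the nodes file quoted below):
>> 
>>     pbsnodes zarate-0.cnca         # state, np and properties as pbs_server sees them
>>     momctl -d 3 -h zarate-0.cnca   # diagnostics straight from that node's pbs_mom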
>> 
>> Also, I have noticed that sometimes the Torque database gets corrupted.  It's an easy fix.
>> 
>> 1. Save the current configuration:  echo p s @default | qmgr > /tmp/config.out
>> 2. Shut down the server.
>> 3. Issue pbs_server -t create and type y to overwrite the database.
>> 4. Reload the configuration:  cat /tmp/config.out | qmgr
>> 5. Type qterm, then restart pbs_server.
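>> 
>> Roughly, as a shell sketch (untested; assumes default paths, and note that pbs_server -t create prompts before overwriting):
>> 
>>     echo p s @default | qmgr > /tmp/config.out   # dump the server config before wiping the db
>>     qterm                                        # stop pbs_server
>>     pbs_server -t create                         # recreate a blank serverdb; answer y to overwrite
>>     cat /tmp/config.out | qmgr                   # replay the saved configuration
>>     qterm                                        # stop again so the next start is clean
>>     pbs_server                                   # normal restart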
>> 
>> If you don't want to spend a bunch of time troubleshooting, sometimes uninstalling, deleting the Torque home dir (/var/spool/torque), and reinstalling is quicker (on everything: server/sched/moms).
>> 
>> But what you are seeing is definitely not normal.  Just the fact that submitting a job to a particular node does not work is suspicious.
>> 
>> ac
>> 
>> 
>> 
>> 
>>> On Tue, Nov 26, 2013 at 12:47 PM, Ricardo Román Brenes <roman.ricardo at gmail.com> wrote:
>>> I've tried with -l nodes=NODENAME and it's still not working.
>>> 
>>> I could live with Torque not sending the jobs to the specified nodes, but sending them to OTHER nodes? That's just not right.
>>> 
>>> 
>>>> On Tue, Nov 26, 2013 at 2:27 PM, Aaron Cordero <cordero.aaron at gene.com> wrote:
>>>> Try:
>>>> 
>>>> qsub -q ps3 -l nodes=<nodename>:ppn=2
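>>>> 
>>>> or, since the zarate nodes carry a ps3 property in the nodes file quoted below, request the property directly:
>>>> 
>>>>     qsub -q ps3 -l nodes=1:ps3:ppn=2    # one node with the ps3 property, two processors per node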
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Tue, Nov 26, 2013 at 12:15 PM, Ricardo Román Brenes <roman.ricardo at gmail.com> wrote:
>>>>> Hello everyone.
>>>>> 
>>>>> I have a server running Torque and a lot of nodes. I'm trying to make ONE of those nodes run something.
>>>>> 
>>>>> I'm issuing:
>>>>> 
>>>>> [rroman at meta ~]$ qsub -I -q ps3 -l nodes=1:ppn=2
>>>>> 
>>>>> and the job never runs.
>>>>> 
>>>>> I tried to force-run it with qrun, but oddly it started the interactive job on a node that does not belong to that queue...
>>>>> 
>>>>> [rroman at meta ~]$ qsub -I -q ps3 -l nodes=1:ppn=2
>>>>> qsub: waiting for job 1038.meta.cnca to start
>>>>> qsub: job 1038.meta.cnca ready
>>>>> 
>>>>> [rroman at cadejos-0 ~]$ qstat -f 1038
>>>>> Job Id: 1038.meta.cnca
>>>>>     Job_Name = STDIN
>>>>>     Job_Owner = rroman at meta.cnca
>>>>>     job_state = R
>>>>>     queue = ps3
>>>>>     server = meta.cnca
>>>>>     Checkpoint = u
>>>>>     ctime = Tue Nov 26 20:27:26 2013
>>>>>     Error_Path = /dev/pts/0
>>>>>     exec_host = cadejos-0.cnca/1+cadejos-0.cnca/0
>>>>>     Hold_Types = n
>>>>>     interactive = True
>>>>>     Join_Path = n
>>>>>     Keep_Files = n
>>>>>     Mail_Points = a
>>>>>     mtime = Tue Nov 26 20:38:17 2013
>>>>>     Output_Path = /dev/pts/0
>>>>>     Priority = 0
>>>>>     qtime = Tue Nov 26 20:27:26 2013
>>>>>     Rerunable = False
>>>>>     Resource_List.nodect = 1
>>>>>     Resource_List.nodes = 1:ppn=2
>>>>>     Resource_List.walltime = 01:00:00
>>>>>     session_id = 1668
>>>>>     Variable_List = PBS_O_QUEUE=ps3,PBS_O_HOME=/home/rroman,
>>>>> 	PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=rroman,
>>>>> 	PBS_O_PATH=/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/lo
>>>>> 	cal/sbin:/usr/sbin:/sbin:/home/rroman/bin,
>>>>> 	PBS_O_MAIL=/var/spool/mail/rroman,PBS_O_SHELL=/bin/bash,
>>>>> 	PBS_O_HOST=meta.cnca,PBS_SERVER=meta.cnca,PBS_O_WORKDIR=/home/rroman,
>>>>> 	HOSTNAME=meta.cnca,TERM=xterm,SHELL=/bin/bash,HISTSIZE=1000,
>>>>> 	SSH_CLIENT=10.4.255.9 59203 22,QTDIR=/usr/lib64/qt-3.3,
>>>>> 	QTINC=/usr/lib64/qt-3.3/include,SSH_TTY=/dev/pts/1,USER=rroman,
>>>>> 	LS_COLORS=[long terminal color map trimmed],MAIL=/var/spool/mail/rroman,
>>>>> 	PATH=/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sb
>>>>> 	in:/usr/sbin:/sbin:/home/rroman/bin,PWD=/home/rroman,LANG=en_US.UTF-8,
>>>>> 	SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,
>>>>> 	HISTCONTROL=ignoredups,SHLVL=1,HOME=/home/rroman,LOGNAME=rroman,
>>>>> 	QTLIB=/usr/lib64/qt-3.3/lib,CVS_RSH=ssh,
>>>>> 	SSH_CONNECTION=10.4.255.9 59203 192.168.0.2 22,
>>>>> 	LESSOPEN=|/usr/bin/lesspipe.sh %s,DISPLAY=localhost:11.0,
>>>>> 	G_BROKEN_FILENAMES=1,_=/usr/bin/qsub
>>>>>     etime = Tue Nov 26 20:27:26 2013
>>>>>     submit_args = -V -I -q ps3 -l nodes=1:ppn=2
>>>>>     start_time = Tue Nov 26 20:38:17 2013
>>>>>     Walltime.Remaining = 3585
>>>>>     start_count = 1
>>>>>     fault_tolerant = False
>>>>>     submit_host = meta.cnca
>>>>>     init_work_dir = /home/rroman
>>>>> 
>>>>> [rroman at cadejos-0 ~]$ 
>>>>> 
>>>>> 
>>>>> My nodes file:
>>>>> [root at meta server_priv]# cat nodes
>>>>> cadejos-0.cnca 	np=8 	tesla xeon
>>>>> cadejos-1.cnca 	np=8 	tesla xeon
>>>>> cadejos-2.cnca 	np=8	tesla xeon
>>>>> cadejos-3.cnca 	np=8 	xeon
>>>>> cadejos-4.cnca 	np=8 	xeon
>>>>> zarate-0.cnca	np=2	ps3
>>>>> zarate-1.cnca	np=2	ps3
>>>>> zarate-2.cnca	np=2	ps3
>>>>> zarate-3.cnca	np=2	ps3
>>>>> 
>>>>> 
>>>>> 
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers