[torqueusers] torqueusers Digest, Vol 112, Issue 20

suryawanshiprashant069 suryawanshiprashant069 at gmail.com
Mon Nov 18 07:13:40 MST 2013


language in hindi

On 18/11/2013, torqueusers-request at supercluster.org
<torqueusers-request at supercluster.org> wrote:
> Send torqueusers mailing list submissions to
> 	torqueusers at supercluster.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://www.supercluster.org/mailman/listinfo/torqueusers
> or, via email, send a message with subject or body 'help' to
> 	torqueusers-request at supercluster.org
>
> You can reach the person managing the list at
> 	torqueusers-owner at supercluster.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of torqueusers digest..."
>
>
> Today's Topics:
>
>    1. Re: newbe Q. about killing job arrays (Andrus, Brian Contractor)
>    2. Re: [newbie] {problems with, alternatives to} 'tail log'
>       (Gustavo Correa)
>    3. Re: [newbie] {problems with, alternatives to} 'tail log'
>       (Gustavo Correa)
>    4. Help Queue Priority (Juno Kim | AGM)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 17 Nov 2013 02:34:44 +0000
> From: "Andrus, Brian Contractor" <bdandrus at nps.edu>
> Subject: Re: [torqueusers] newbe Q. about killing job arrays
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID:
> 	<ADC981242279AD408816CB7141A2789D7B422DCD at HORNET.ern.nps.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> Ian,
>
> Are you using just the jobid, or the array form?
> qdel 123
> vs.
> qdel 123[]
>
> Also: do you have permission to delete those jobs?
> What state are the jobs in?
> Finally, what response do you get when you make the attempt?
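>
> A quick sketch for checking on and deleting an array (job id 123 is just a
> placeholder):
>
>   qstat -t "123[]"     # expand the array and show each sub-job's state
>   qdel "123[]"         # delete the whole array
>   qdel "123[5]"        # or delete a single sub-job by index
>
> If sub-jobs are stuck in the E or C state, an administrator may need
> 'qdel -p' to purge them.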
>
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
>
>
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ian Miller
> Sent: Friday, November 15, 2013 10:59 AM
> To: Torque Users Mailing List
> Subject: [torqueusers] newbe Q. about killing job arrays
>
> Hi All,
> I'm having a hard time killing off a large job array and want to make sure
> I'm attempting it correctly:
>>qdel <array id>
> should kill the whole array, but the jobs don't die.
> Torque version 4.1.5.1
> Maui 3.3.1
>
>
> ------------------------------
>
> Message: 2
> Date: Sun, 17 Nov 2013 13:11:01 -0500
> From: Gustavo Correa <gus at ldeo.columbia.edu>
> Subject: Re: [torqueusers] [newbie] {problems with,	alternatives to}
> 	'tail log'
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID: <37B30D48-5B52-4FC8-B7E2-B11F3E868922 at ldeo.columbia.edu>
> Content-Type: text/plain; charset=us-ascii
>
> Hi Tom
>
> If you want stdout/stderr written directly to the work directory, you can
> add:
>
> $spool_as_final_name
>
> to the $TORQUE/mom_priv/config files on your compute nodes.
>
> See the Torque Admin Guide:
>
> http://docs.adaptivecomputing.com/torque/help.htm#topics/12-appendices/parameters.htm%3FTocPath%3DAppendices|Appendix%20C%3A%20Node%20manager%20%28MOM%29%20configuration|_____1
>
> Be aware that this may add some load to your networked file system I/O.
> See this recent discussion thread in the list archives:
>
> http://www.supercluster.org/pipermail/torqueusers/2013-October/016352.html
>
> If you want only the output of a particular executable or script inside
> the submitted job, you can redirect stdout of that part with something
> like "> $PBS_O_WORKDIR/program.log".
> This can be a bit tricky with MPI programs, though.
> Again, this may tax NFS to some extent.
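>
> For example, a minimal sketch of a job script doing that kind of per-step
> redirection (the script and program names below are just placeholders):
>
>   #!/bin/bash
>   #PBS -N myjob
>   cd $PBS_O_WORKDIR
>   # stdout/stderr of this step go straight to the work directory,
>   # so they can be tailed while the job is still running
>   ./my_program > $PBS_O_WORKDIR/program.log 2>&1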
>
> Besides the above, if you are testing/debugging a program,
> you can submit an interactive job instead of a batch job.
> That will put you on the compute node,
> with access to the program's stdout/stderr:
>
> qsub -I ...
>
> See 'man qsub' for more details about the -I qsub switch
> (that is a capital letter i, not a lowercase L).
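>
> For instance (queue name and resource request are just examples):
>
>   qsub -I -q batch -l nodes=1:ppn=2,walltime=01:00:00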
>
> For a well tested/debugged program I guess there is little need to check
> log/stdout/stderr files while the job is running.
> Hence, I prefer the latter solution, keeping the stdout/stderr files
> in the local compute node's spool directory until the job ends.
>
> My two cents,
> Gus Correa
>
>
> On Nov 11, 2013, at 11:19 PM, Tom Roche wrote:
>
>>
>> How to make Torque flush stderr/stdout to file before end-of-job? Can I do
>> this as a user, or would I need privileges? Or is this just not doable,
>> and I should try Something Completely Different? What I mean, why I ask:
>>
>> I'm a new PBS/Torque user, relatively new to scientific computing, but a
>> linux/unix user for many years, and not GUI-dependent. I'm accustomed to
>> monitoring progress by
>>
>> 1. redirecting (or `tee`ing) stderr/stdout to one or more logfiles
>> 2. `tail`ing the resulting log(s)
>>
>> so I'm annoyed that my Torque jobs only write their logs after the job
>> stops (whether abnormally or normally). I can get some progress feedback
>> from examining other output, but not with the detail (much less the
>> `grep`ability) I can get from the logfile. Hence I'd greatly prefer to be
>> able to flush Torque's output more often ... but I'm not finding much
>> information about this problem, beyond one SO post
>>
>> http://stackoverflow.com/questions/10527061/pbs-refresh-stdout
>>
>> which suggests that I'd need admin on Torque. Is that correct? (I'm "just
>> a user," but I could probably get an admin to make required change(s) if I
>> could present them with clear directions (and $20 :-) and performance was
>> not noticeably degraded.) If not,
>>
>> 1. Are there alternatives I can pursue as "just a user"? E.g., I notice in
>> the above
>>
>> http://stackoverflow.com/questions/10527061/pbs-refresh-stdout
>>> I ended up capturing stdout outside the queue
>>
>>  but am not sure what is meant, much less how that would be done.
>>
>> 2. Is this just not feasible? Should I be focusing my effort on, e.g.,
>> better parsing my output to determine progress?
>>
>> Apologies if the above is a FAQ, but I didn't see it @
>>
>> http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/11-troubleshooting/faq.htm
>>
>> and am having trouble googling this topic (given the real-world
>> overloading of the terms 'PBS' and 'Torque'). Pointers to relevant docs
>> and other information sources esp appreciated.
>>
>> FWIW, my job has two parts: a Torque-aware, `qsub`ing, "outer" bash script
>> (written by me) that wraps a previously-written (not by me),
>> unmanaged/serial-native, "inner" csh script. The outer script sets and
>> tests a bazillion environment variables, before delivering payload like
>>
>> QSUB_ARG="-V -q ${queue_name} -N ${job_name} -l
>> nodes=${N}:ppn=${PPN},walltime=${WT} -m ${mail_opts} -j ${join_opts} -o
>> ${path_to_logfile_capturing_inner_script_stdout}"
>> for CMD in \
>>  "ls -alh ${path_to_outer_script_logfile}" \
>>  "ls -alh ${path_to_logfile_capturing_inner_script_stdout}" \
>>  "find ${path_to_output_directory}/ -type f | wc -l" \
>>  "du -hs ${path_to_output_directory}/" \
>>  "ls -alt ${path_to_output_directory}/" \
>>  "qsub ${QSUB_ARG} ${path_to_inner_script}" \
>> ; do
>>  echo -e "$ ${CMD}" 2>&1 | tee -a "${path_to_outer_script_logfile}"
>>  eval "${CMD}" 2>&1 | tee -a "${path_to_outer_script_logfile}"
>> done
>>
>> after which I want to be able to do
>> `tail ${path_to_logfile_capturing_inner_script_stdout}`, but can't.
>>
>> your assistance is appreciated, Tom Roche <Tom_Roche at pobox.com>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
> ------------------------------
>
> Message: 3
> Date: Sun, 17 Nov 2013 13:16:16 -0500
> From: Gustavo Correa <gus at ldeo.columbia.edu>
> Subject: Re: [torqueusers] [newbie] {problems with,	alternatives to}
> 	'tail log'
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID: <C9A5CC76-FEF7-4DE0-ABA0-4D9947F629F1 at ldeo.columbia.edu>
> Content-Type: text/plain; charset=us-ascii
>
> Oops.  Correction:
>
> $spool_as_final_name true
>
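> For reference, a minimal sketch of a compute node's mom_priv/config with
> that parameter (the server hostname below is just a placeholder):
>
>   # $TORQUE/mom_priv/config
>   $pbsserver            headnode.example.org
>   # write stdout/stderr to their final destination instead of spooling
>   $spool_as_final_name  true
>
> pbs_mom has to re-read its config (a restart, or I believe a SIGHUP, will
> do) before the change takes effect.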
>
> On Nov 17, 2013, at 1:11 PM, Gustavo Correa wrote:
>
>> [Gus's earlier reply and Tom Roche's original message, quoted in full;
>> see Message 2 above.]
>
>
>
> ------------------------------
>
> Message: 4
> Date: Mon, 18 Nov 2013 11:21:57 -0200
> From: Juno Kim | AGM <redes03 at agm.com.br>
> Subject: [torqueusers] Help Queue Priority
> To: Torque Users Mailing List <torqueusers at supercluster.org>
> Message-ID: <528A1475.3010302 at agm.com.br>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hello everyone,
>
> How do I set the priorities of my queues in Torque?
> I have the following queues:
> batch
> sapda
> user1
> user2
> user3
>
> I want queues user1, user2 and user3 to have the same priority, and
> queues batch and sapda to have a higher priority than the user queues.
>
> All of the queues may compete with one another according to priority,
> and when users submit jobs to user1, user2 and user3, those jobs should
> also compete with one another, so that a job from each user gets to run.
>
>
> The same goes for the batch and sapda queues.
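>
> (For reference, a minimal sketch of setting the Torque queue 'priority'
> attribute with qmgr, run as a Torque manager on the server host; the
> numbers are arbitrary examples, higher means more important:
>
>   qmgr -c "set queue batch priority = 100"
>   qmgr -c "set queue sapda priority = 100"
>   qmgr -c "set queue user1 priority = 10"
>   qmgr -c "set queue user2 priority = 10"
>   qmgr -c "set queue user3 priority = 10"
>   qmgr -c "list queue batch"   # verify
>
> Whether and how queue priority is honored depends on the scheduler: the
> stock pbs_sched respects it, while Maui usually takes per-class priorities
> from maui.cfg, e.g. CLASSCFG[batch] PRIORITY=100, instead.)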
> ====================
> Best regards,
>
> Juno Costa Kim
> Network Department
> AGM Telecom
> ====================
> IP Phone: +55 (48) 3221-0100
> Fax : +55 (48) 3222-7747
> Email : redes03 at agm.com.br
> Website: www.agm.com.br
> Rua Joe Collaço, 163
> 88037-010 - Santa Mônica - Florianópolis - SC
>
> On 15-11-2013 18:11, Jagga Soorma wrote:
>> So, this is a brand new install of torque without anything running on
>> the server/client except the torque processes.  I checked and I don't
>> think the server is running into any process limits.
>>
>> I set up the server & sched processes on the client itself and am now
>> running everything on the client host to rule out external components.
>> I still see the same problem connecting to port 15002. I also had a
>> 1 Gig copper connection on this server and moved my network to a
>> completely different NIC, and that did not help either.
>>
>> This is really a bizarre one that I can't seem to find the cause for.
>> Are there any other things you think might help me troubleshoot this problem?
>>
>> Thanks,
>> -J
>>
>>
>> On Fri, Nov 15, 2013 at 4:05 AM, Jonathan Barber
>> <jonathan.barber at gmail.com> wrote:
>>
>>     On 15 November 2013 03:18, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>         I changed the log level and here is what I see on the server:
>>
>>         Looks like it is intermittently having issues connecting to
>>         port 15002 on the client.  This client was just fine under the
>>         2.5.9 Torque production environment that we have, but it seems to
>>         be intermittently having issues in the 2.5.13 test environment
>>         that is set up with GPU support.
>>
>>     [snip]
>>
>>
>>         11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
>>         setting job 7352.server1.xxx.com state from QUEUED-QUEUED to
>>         RUNNING-PRERUN (4-40)
>>         11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;forking
>>         in send_job
>>         11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;attempting
>>         connect to host 72.34.135.64 port 15002
>>         11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;cannot connect
>>         to host port 15002 - cannot establish connection () - time=0 seconds
>>         11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;attempting
>>         connect to host 72.34.135.64 port 15002
>>         11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;cannot connect
>>         to host port 15002 - cannot establish connection () - time=0 seconds
>>         11/14/2013 19:15:22;0008;PBS_Server;Job;7352.server1.xxx.com;entering
>>         post_sendmom
>>
>>
>>     You might be running up against limits on the number of file
>>     descriptors the pbs_server process or the OS is allowed to have
>>     open. You can use tools such as lsof to see how many files the
>>     pbs_server has open:
>>     $ sudo lsof -c pbs_server
>>
>>     It's also possible that you're running out of ports to bind to.
>>     Running lsof/netstat and looking to see if there are massive
>>     numbers of connections/files open will reveal this.
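>>
>>     For example (a rough sketch; the exact commands may vary by distro):
>>     $ cat /proc/$(pidof pbs_server)/limits | grep 'open files'
>>     $ sudo lsof -c pbs_server | wc -l
>>     $ netstat -tan | grep -c 15002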
>>
>>     Although you say there is no firewall configured on the servers,
>>     do you know if there is a firewall between the pbs_server and the nodes?
>>
>>     You can do a simple TCP connect to the mom to see if it's listening:
>>     $ nmap -p 15002 ava01.grid.fe.up.pt -oG -
>>     # Nmap 6.40 scan initiated Fri Nov 15 11:52:17 2013 as: nmap -p
>>     15002 -oG - ava01.grid.fe.up.pt
>>     Host: 192.168.147.1 (ava01.grid.fe.up.pt)   Status: Up
>>     Host: 192.168.147.1 (ava01.grid.fe.up.pt)   Ports: 15002/open/tcp//unknown///
>>     # Nmap done at Fri Nov 15 11:52:17 2013 -- 1 IP address (1 host
>>     up) scanned in 0.04 seconds
>>     $
>>
>>     Or continuously with hping3 (I'm sure there are other tools that
>>     will do this as well):
>>     $ sudo hping3 -S -p 15002 ava01.grid.fe.up.pt
>>     HPING ava01.grid.fe.up.pt (em1 192.168.147.1): S set, 40 headers + 0 data bytes
>>     len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=0
>>     win=14600 rtt=1.5 ms
>>     len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=1
>>     win=14600 rtt=0.8 ms
>>     len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=2
>>     win=14600 rtt=0.6 ms
>>     len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=3
>>     win=14600 rtt=1.0 ms
>>     len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=4
>>     win=14600 rtt=1.2 ms
>>
>>     (SA means it's open)
>>
>>     HTH
>>     --
>>     Jonathan Barber <jonathan.barber at gmail.com>
>>
>>     _______________________________________________
>>     torqueusers mailing list
>>     torqueusers at supercluster.org
>>     http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> ------------------------------
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> End of torqueusers Digest, Vol 112, Issue 20
> ********************************************
>

