[torqueusers] [newbie] {problems with, alternatives to} 'tail log'

Tom Roche Tom_Roche at pobox.com
Mon Nov 11 21:19:34 MST 2013


How can I make Torque flush a job's stdout/stderr to file before the job ends? Can I do this as a user, or would I need admin privileges? Or is this just not doable, so that I should try Something Completely Different? What I mean, and why I ask:

I'm a new PBS/Torque user, relatively new to scientific computing, but a linux/unix user for many years, and not GUI-dependent. I'm accustomed to monitoring progress by

1. redirecting (or `tee`ing) stderr/stdout to one or more logfiles
2. `tail`ing the resulting log(s)

so I'm annoyed that my Torque jobs write their logs only after the job stops (whether normally or abnormally). I can get some progress feedback by examining other output, but not with the detail (much less the `grep`ability) I get from the logfile. Hence I'd greatly prefer that Torque flush job output more often ... but I'm not finding much information about this problem, beyond one StackOverflow post

http://stackoverflow.com/questions/10527061/pbs-refresh-stdout

which suggests that I'd need admin privileges on Torque. Is that correct? (I'm "just a user," but I could probably get an admin to make the required change(s) if I could present them with clear directions (and $20 :-) and if performance were not noticeably degraded.) If not,

1. Are there alternatives I can pursue as "just a user"? E.g., I notice in the above

http://stackoverflow.com/questions/10527061/pbs-refresh-stdout
> I ended up capturing stdout outside the queue

  but I'm not sure what is meant, much less how that would be done.

2. Is this just not feasible? Should I be focusing my effort on, e.g., better parsing my output to determine progress?
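One possible reading of "capturing stdout outside the queue" (an assumption; the SO answer doesn't spell it out) is to have the job script redirect its own output to a logfile, so that output never goes through the stream Torque spools and copies back at end-of-job. A minimal bash sketch, assuming the log directory is on a filesystem visible from both the compute node and the login node; `LOGFILE` and the name `inner_job.log` are hypothetical, while `PBS_O_WORKDIR` is the variable Torque sets to the directory `qsub` was run from:

```shell
#!/bin/bash
# Hypothetical log path. PBS_O_WORKDIR (set by Torque to the directory qsub was
# run from) is usually on shared storage; fall back to $PWD outside Torque.
LOGFILE="${PBS_O_WORKDIR:-$PWD}/inner_job.log"

# Send everything this script writes to stdout/stderr straight to LOGFILE,
# rather than to the job's stdout/stderr that Torque copies back at the end.
exec >"${LOGFILE}" 2>&1

echo "job started: $(date)"
for step in 1 2 3; do
  # Stand-in for real work; echo is a shell builtin, so each line is written
  # to LOGFILE immediately.
  echo "step ${step} done"
done
echo "job finished: $(date)"
```

While the job runs, `tail -f` on the same path from the login node should then show progress. Programs that write through C stdio often block-buffer when their output is a file, so their lines may arrive in chunks rather than one at a time.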

Apologies if the above is a FAQ, but I didn't see it @

http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/11-troubleshooting/faq.htm

and am having trouble googling this topic (given the real-world overloading of the terms 'PBS' and 'Torque'). Pointers to relevant docs and other information sources are especially appreciated.

FWIW, my job has two parts: a Torque-aware, `qsub`ing, "outer" bash script (written by me) that wraps a previously written (not by me), unmanaged/serial-native, "inner" csh script. The outer script sets and tests a bazillion environment variables before delivering a payload like

QSUB_ARG="-V -q ${queue_name} -N ${job_name} -l nodes=${N}:ppn=${PPN},walltime=${WT} -m ${mail_opts} -j ${join_opts} -o ${path_to_logfile_capturing_inner_script_stdout}"
# Log each command, then run it, appending both to the outer-script logfile.
for CMD in \
  "ls -alh ${path_to_outer_script_logfile}" \
  "ls -alh ${path_to_logfile_capturing_inner_script_stdout}" \
  "find ${path_to_output_directory}/ -type f | wc -l" \
  "du -hs ${path_to_output_directory}/" \
  "ls -alt ${path_to_output_directory}/" \
  "qsub ${QSUB_ARG} ${path_to_inner_script}" \
; do
  echo -e "$ ${CMD}" 2>&1 | tee -a "${path_to_outer_script_logfile}"
  eval "${CMD}" 2>&1 | tee -a "${path_to_outer_script_logfile}"
done

after which I want to be able to do `tail ${path_to_logfile_capturing_inner_script_stdout}`, but can't.
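If the goal is to leave the inner csh script untouched, the same redirect idea could be applied from the outer script by submitting a small wrapper job that runs the inner script with its output sent to the logfile. A sketch under these assumptions: the variable names below are hypothetical stand-ins for the outer script's own (underscores only, since bash variable names cannot contain hyphens), `csh` is available on the compute nodes, and the logfile is on shared storage. Because the wrapper itself writes the log, Torque's own `-o` should point at a different file (or be dropped); otherwise the file Torque copies back at end-of-job could overwrite the live log.

```shell
#!/bin/bash
# Hypothetical stand-ins for values the outer script already sets.
path_to_inner_script="${path_to_inner_script:-${PWD}/inner.csh}"
path_to_logfile_capturing_inner_script_stdout="${path_to_logfile_capturing_inner_script_stdout:-${PWD}/inner_script.log}"

# Write a one-line bash job script that runs the inner csh script with its
# stdout and stderr sent straight to the logfile. The heredoc delimiter is
# unquoted, so the paths are expanded now and stored literally in the wrapper.
make_wrapper() {
  local wrapper="$1"
  cat >"${wrapper}" <<EOF
#!/bin/bash
csh "${path_to_inner_script}" >"${path_to_logfile_capturing_inner_script_stdout}" 2>&1
EOF
}

wrapper="$(mktemp "${TMPDIR:-/tmp}/inner_wrapper.XXXXXX")"
make_wrapper "${wrapper}"

if command -v qsub >/dev/null 2>&1; then
  # QSUB_ARG as built in the outer script, but with -o pointing somewhere
  # other than the logfile above. qsub sends the script contents to the
  # server at submission time, so a wrapper in TMPDIR is fine.
  qsub ${QSUB_ARG} "${wrapper}"
else
  echo "qsub not found; wrapper script left at ${wrapper}"
fi
```

Once the job starts, `tail -f` on the logfile from the login node should show the inner script's output as it is written, subject to the inner programs' own output buffering.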

your assistance is appreciated, Tom Roche <Tom_Roche at pobox.com>

