[torquedev] Disappearance of /dev/null
Eygene Ryabinkin
rea+maui at grid.kiae.ru
Thu Aug 5 11:50:40 MDT 2010
Me again.
Today I faced a problem where the majority of our nodes had
/dev/null as an empty regular file (and not a character device).
It is a known problem: it hits our cluster from time to time,
and others are experiencing it as well: [1], [2]. There is even
a bug [3] in the Torque Bugzilla.
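For reference, the broken state is easy to detect with stat(): a
healthy /dev/null is a character device, while the damaged one shows
up as a regular file. A minimal sketch; the helper name
is_char_device() is made up for illustration and is not part of
Torque:

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of Torque.
 * Returns 1 if `path` exists and is a character device,
 * 0 otherwise (including the case when stat() itself fails).
 */
static int is_char_device(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return 0;

    return S_ISCHR(st.st_mode) ? 1 : 0;
}
```

A node health check could call is_char_device("/dev/null") and take
the node offline when it returns 0.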
I briefly examined the sources of pbs_mom and found that the function
preobit_reply() has the following code, which is executed if the
variable deletejob is set to 1:
{{{
if (!(pjob->ji_wattr[(int)JOB_ATR_interactive].at_flags & ATR_VFLAG_SET) ||
    (pjob->ji_wattr[(int)JOB_ATR_interactive].at_val.at_long == 0))
  {
  int x; /* dummy */

  /* do this if not interactive */
  unlink(std_file_name(pjob, StdOut, &x));
  unlink(std_file_name(pjob, StdErr, &x));
  unlink(std_file_name(pjob, Checkpoint, &x));
  }
}}}
The thing is that std_file_name() can return "/dev/null" if
the condition
{{{
(pjob->ji_wattr[(int)JOB_ATR_keep].at_flags & ATR_VFLAG_SET) &&
(strchr(pjob->ji_wattr[(int)JOB_ATR_keep].at_val.at_str, key))
}}}
evaluates to false.
I cannot judge whether these two conditions are mutually exclusive,
but it seems to me that they are not, so it may well happen that
std_file_name() really returns "/dev/null" and unlink() is
called on it.
By the way, the log from pbs_mom in [3] says that the unlink happens
after the message "top of preobit_reply" and after the message
"unknown on server, deleting locally". And deletejob is set to
1 at the point where pbs_mom emits the latter message. So my
scenario does not look so improbable.
I propose a simple fix for this: proxy all unlink calls via a new
routine, pbs_unlink(), that will refuse to delete "/dev/null"
(or the like) and will instead write a log message (preferably
with a stack trace; it looks like Linux supports this,
http://www.gnu.org/software/libc/manual/html_node/Backtraces.html)
that will go to both syslog and the pbs_mom log.
That's not a long-term solution, but it will allow one to
- avoid these _very_ harmful errors: the node becomes a job sucker
and destroyer, since every job ends with rc=-9 after such an accident;
- catch the cases where /dev/null is about to be destroyed and
supply the developers with additional useful information.
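To make the proposal concrete, here is a rough sketch of what such a
pbs_unlink() could look like. It is not a patch against the Torque
tree: the list of protected paths, the helper pbs_path_is_protected()
and the choice of EPERM as the error code are my assumptions, and the
sketch logs to syslog only, since wiring it into the pbs_mom log
depends on the mom's own logging routines. backtrace() and
backtrace_symbols() are the glibc calls described at the URL above.

```c
#include <errno.h>
#include <execinfo.h>   /* backtrace(), backtrace_symbols(): glibc-specific */
#include <stdlib.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

/* Assumed list of paths that must never be removed; extend as needed. */
static const char *const protected_paths[] = {
    "/dev/null", "/dev/zero", "/dev/tty", NULL
};

/* Hypothetical helper: returns 1 if `path` is on the list, else 0. */
static int pbs_path_is_protected(const char *path)
{
    size_t i;

    if (path == NULL)
        return 0;

    for (i = 0; protected_paths[i] != NULL; i++)
        if (strcmp(path, protected_paths[i]) == 0)
            return 1;

    return 0;
}

/*
 * Drop-in replacement for unlink(): refuses to remove a protected
 * path, logs the attempt together with a stack trace, and returns
 * -1 with errno set to EPERM. All other paths go to unlink().
 */
int pbs_unlink(const char *path)
{
    if (pbs_path_is_protected(path)) {
        void  *frames[32];
        int    nframes = backtrace(frames, 32);
        char **symbols = backtrace_symbols(frames, nframes);
        int    i;

        syslog(LOG_ERR, "pbs_unlink: refusing to unlink %s", path);

        /* symbols may be NULL if memory allocation failed */
        for (i = 0; symbols != NULL && i < nframes; i++)
            syslog(LOG_ERR, "pbs_unlink:   %s", symbols[i]);

        free(symbols);
        errno = EPERM;
        return -1;
    }

    return unlink(path);
}
```

Note that the comparison is a plain string match, so an alias such as
"/dev//null" or a symlink pointing to /dev/null would slip through;
comparing device numbers obtained from stat() would be more robust.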
I will try to come up with a patch, but maybe after the weekend,
since I am currently not in the mood to leave my cluster in an
experimental state over Saturday and Sunday ;))
Thanks for your time.
[1] http://permalink.gmane.org/gmane.comp.clustering.torque.user/7844
[2] https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1007&L=LCG-ROLLOUT&F=&S=&P=100149
[3] http://www.clusterresources.com/bugzilla/show_bug.cgi?id=61
--
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"