[torqueusers] Torque 4.1.2 does not accept hostname with '-'

Michael Jennings mej at lbl.gov
Fri Oct 19 16:34:22 MDT 2012


On Thursday, 18 October 2012, at 09:03:03 (+0800),
Clotho Tsang wrote:

> The following problem is found in Torque 4.1.2, but not in 4.1.0.
> 
> On RHEL6, if the headnode hostname contains the character "-",
> jobs will keep running and never stop; checkjob shows the message
> "cannot start job - RM failure, rc: 15033, msg: 'End of File' "
> 
> The problem is not found if the hostname has no "-".

We are seeing the same issue at our site.  (Our master node's name
ends in "-00".)  We have a ticket open with Adaptive for this, but so
far the cause has proved very elusive.

Looking at the code, the only place that really sticks out to me as
handling '-' specially (at least in terms of hostnames) has to do
with NUMA.  NUMA nodes appear to be named using a hyphen followed by
one or more digits.

I noticed that your hostname also had a hyphen followed by a digit.
Have you by any chance tried a hostname with hyphens but no numbers in
it?
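
If it helps to sort candidate hostnames into "looks like a NUMA-style
name" and "hyphen but no trailing digits", here is a minimal sketch in
Python.  The regex is only my reading of "a hyphen followed by one or
more digits" (taken as a suffix of the short host name); it is not
copied from the Torque source, and the example names are hypothetical.

```python
import re

# ASSUMPTION: "NUMA-style" means the short host name ends in a hyphen
# followed by one or more digits.  This is a guess at the pattern, not
# the actual check used inside Torque.
NUMA_STYLE_SUFFIX = re.compile(r"-\d+$")


def looks_like_numa_suffix(hostname: str) -> bool:
    """Return True if the short host name ends in '-<digits>'."""
    # Drop any domain part so "master-00.example.org" is still caught.
    short_name = hostname.split(".", 1)[0]
    return NUMA_STYLE_SUFFIX.search(short_name) is not None


if __name__ == "__main__":
    # Hypothetical names: one ending in "-00", one with a hyphen but no
    # digits, and one with digits but no hyphen.
    for name in ("master-00", "head-node", "node01"):
        print(name, looks_like_numa_suffix(name))
```

A name for which this returns False but which still contains a hyphen
would be the interesting test case for the question above.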

Have you had any luck tracking down the issue in the code?  I've been
looking at it, but I don't see anything jumping out at me.

Michael

-- 
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615