[torqueusers] 3 jobs falsely scheduled to one host with 2 processors

Michael Krause grid-admin at mpib-berlin.mpg.de
Thu Nov 11 09:10:54 MST 2010


On 12.07.10 14:04, Grid-Admins wrote:
>>>>> we just set up a torque system and are experiencing some weird behaviour.
>>>>> Although all of our nodes have 2 processors (np=2 in
>>>>> /var/spool/pbs/server_priv/nodes) the very first one (and only this
>>>>> server) is always getting 3 jobs.
>>>>> Does anyone know why this could be?
>>>>>
>>>> I've seen this in 2 cases: suspended jobs (this is normal), and broken torque
>>>> in early 2.3.x releases.
>>>>
>>> Sadly none of this is the case. We just switched to the newborn debian
>>> packages (2.4.8) and no job was suspended.
>>>
>>> Do you have any other ideas?
>>>
>> Sorry for asking, but have you excluded the "typo in nodes file" kind of
>> problem?
>> What does pbsnodes say about that node and another one? What happens if
>> you set that first one offline, will another node then get the 3 jobs?
>> What happens if you run that first machine as a client under a different
>> name than it uses as a server, i.e. say your current node name is
>> "torqueserver" and you have a second name for the machine (say node00)
>> which is used when it acts as a client?
>>
>> No other ideas for the moment.
>
> Thank you for your suggestions. We double-checked your first hint.
> The nodes file is correct; pbsnodes and qmgr -c "l n node-x"
> list the same settings for all nodes.
> We removed node-1 (the one that got 3 jobs instead of 2) and restarted
> the torque server.
> What happens now is that the new first node in the list (node-2) gets 3
> instead of 2 jobs. I could work around this problem by assigning node-1
> np=1 instead of np=2 so that it would get only 2 jobs at a time.

Hello everyone,

Sadly, I have to bump this thread again. The issue mentioned above is
causing a lot of entries in the server log files, and I don't feel very
comfortable with the workaround.

The error messages are:


$DATE;0008;PBS_Server;Job;15383.$host;could not locate requested 
resources '1#shared' (node_spec failed) cannot allocate node 'node-1' to 
job - node not currently available (nps needed/free: 1/-1,  joblist: 
15370.$host:0,15341.$host:0)

$DATE;0080;PBS_Server;Req;req_reject;Reject reply code=15044(Resource 
temporarily unavailable REJHOST=node-1 MSG=cannot allocate node 'node-1' 
to job - node not currently available (nps needed/free: 1/-1,  joblist: 
15370.$host:0,15341.$host:0)), aux=0, type=RunJob, from Scheduler@$host
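
To gauge how often this happens, the counters can be pulled out of such messages with a small script. The following is only a minimal Python sketch, not part of Torque: the function name `parse_rejection` and the regular expression are illustrative, and the field layout is inferred solely from the two log lines above.

```python
import re

# Matches the fragment "nps needed/free: 1/-1,  joblist: a,b" as seen in the
# log excerpt above. The layout is an assumption based on that sample only.
NPS_RE = re.compile(
    r"nps needed/free:\s*(-?\d+)/(-?\d+),\s*joblist:\s*([^)\s]*)"
)

def parse_rejection(line):
    """Return (needed, free, jobs) from a rejection line, or None if absent."""
    match = NPS_RE.search(line)
    if match is None:
        return None
    needed, free = int(match.group(1)), int(match.group(2))
    # The joblist is a comma-separated list of job ids; drop empty entries.
    jobs = [job for job in match.group(3).split(",") if job]
    return needed, free, jobs

sample = (
    "cannot allocate node 'node-1' to job - node not currently available "
    "(nps needed/free: 1/-1,  joblist: 15370.$host:0,15341.$host:0)"
)
needed, free, jobs = parse_rejection(sample)
print(needed, free, len(jobs))
```

On the sample this reports one processor needed, a free count of -1, and two jobs in the joblist. That would fit the np=1 workaround if the server derives the free count as np minus the number of listed jobs, but that derivation is an assumption and has not been checked against the Torque source.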

I am using Torque 2.4.8 (Debian package).

Can anyone come up with an idea on what might cause this?

cheers,
-- 
Michael - MPIB Berlin

