[torqueusers] torque server mail_domain being ignored

Edward Walter ewalter at cs.cmu.edu
Thu Mar 13 10:12:21 MDT 2014


Hi Gus... answers inline.

On 03/13/2014 10:39 AM, Gus Correa wrote:
> Hi Edward
>
> 1) Which mail server is running on the pbs_server? (sendmail,
> postfix, ...)

We're using postfix (this is a Rocks 5 cluster).

> 2) Can you send email with "mail" from the pbs_server machine?
> If not, it is likely to be a sendmail/postfix problem,
> configuration or other.

Normal email from the cluster works properly.  We get logwatch
notifications, RAID alerts, etc. without any trouble.

> 3) Did you add, say,
>
> #PBS -m abe
> #PBS -M user_name at cs.cmu.edu
>
> to the job script?

When I explicitly specify an address using "#PBS -M ewalter at cs.cmu.edu"
in my job script, the job notifications get sent to the correct address
as expected.

> 4) Are there any symptoms/hints/messages about the Torque
> email failure in your $TORQUE/server_logs/YYYYMMDD files?

There isn't anything indicating an email failure in the Torque logs.

> 5) Are there any symptoms/hints/messages about the general
> email failure in your /var/log/maillog files?

The email server logs on the frontend show it sending job status
messages out to our relay host using the format <user>@<submit-host> for
almost all of the jobs.  Oddly enough, the email notices for my test
jobs are addressed correctly, as <user>@<domain>.
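
A quick way to see which recipient addresses dominate is to tally the
"to=<...>" fields in the mail log.  The following is only a sketch: it
assumes the standard Postfix log line format, and the function name is
made up for illustration.

```shell
# Hypothetical helper: count how often each recipient address appears
# in a Postfix log file, most frequent first.
count_recipients() {
    # $1: path to the log file, e.g. /var/log/maillog
    grep -oE 'to=<[^>]+>' "$1" | sort | uniq -c | sort -rn
}
```

Running "count_recipients /var/log/maillog" on the frontend should make
it obvious whether the <user>@<submit-host> form is the common case.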

As unlikely as it seems, I believe we have multiple users submitting
jobs where they instruct PBS to send email to their ID at the submit
host.  I've found a few (not many) occurrences of this in some
.bash_history files.  What I really suspect is that their automated job
submission scripts are sourcing some shared file which tells PBS to send
email to an incorrect address.  I haven't identified the shared source
file yet, though.
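
One way to hunt for such a file is a recursive grep for "-M" options
that point at a submit host.  This is a rough sketch only: the function
name, the search root, and the host pattern are all assumptions, and it
will only find files the invoking user can read.

```shell
# Hypothetical helper: list files under a directory that pass an
# address at a matching host to PBS via -M (qsub option or #PBS line).
find_bad_mail_targets() {
    dir="$1"            # directory tree to search, e.g. /home
    host_pattern="$2"   # regex for the start of the host part, e.g. rocks
    grep -rlE -e "-M[[:space:]]+[^[:space:]]+@${host_pattern}" "$dir" 2>/dev/null
}
```

For example, "find_bad_mail_targets /home rocks" would list scripts and
history files that name an address such as <user>@rocks-login.local,
while leaving <user>@cs.cmu.edu alone.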

We've created a generic postfix map on the PBS server that rewrites all 
email destined for the submit host to use the domain suffix instead of 
the submit host's hostname.  This seems to have fixed the problem.
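
A map of this kind might look roughly like the sketch below.  The
hostnames come from the submit_hosts list quoted further down; the file
paths and the choice of smtp_generic_maps are assumptions rather than
the configuration actually deployed, so check generic(5) for your
Postfix version before copying anything.

```
# /etc/postfix/main.cf (path assumed)
smtp_generic_maps = hash:/etc/postfix/generic

# /etc/postfix/generic (file name assumed)
# Rewrite any address at a submit host to the same user at the mail domain.
@rocks-login.is.cs.cmu.edu    @cs.cmu.edu
@rocks-login.local            @cs.cmu.edu
@rocks.is.cs.cmu.edu          @cs.cmu.edu

# After editing, run "postmap /etc/postfix/generic" and "postfix reload".
```

A result of the form @otherdomain keeps the original user part, which
is what makes a single line per submit host sufficient.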

Thanks.

-Ed

> I hope this helps,
> Gus Correa
>
> On 03/12/2014 01:42 PM, Edward Walter wrote:
>> Hello,
>>
>> We have a torque installation (torque 2.4.11) with multiple submit
>> hosts.  On this particular installation, we've explicitly specified the
>> mail_domain variable in the server settings.  Despite this, torque is
>> attempting to send mail to <user>@<submit host>.
>>
>> Is there some other parameter we need to adjust for this to work
>> properly?  The mail_domain variable seems to do the right thing on other
>> clusters with this torque version and only a single submit host.
>>
>> Here are our server attributes (from qmgr -c "p s"):
>>
>>> #
>>> # Set server attributes.
>>> #
>>> set server scheduling = True
>>> set server acl_host_enable = False
>>> set server acl_hosts = rocks.is.cs.cmu.edu
>>> set server managers = maui at rocks.is.cs.cmu.edu
>>> set server managers += root at rocks.is.cs.cmu.edu
>>> set server default_queue = normal
>>> set server log_events = 511
>>> set server mail_from = adm
>>> set server query_other_jobs = True
>>> set server scheduler_iteration = 600
>>> set server node_check_rate = 150
>>> set server tcp_timeout = 12
>>> set server job_nanny = True
>>> set server mom_job_sync = True
>>> set server mail_domain = cs.cmu.edu
>>> set server submit_hosts = rocks.is.cs.cmu.edu
>>> set server submit_hosts += rocks-login.local
>>> set server submit_hosts += rocks-login
>>> set server submit_hosts += rocks-login.is.cs.cmu.edu
>>> set server next_job_number = 748586
>>
>> Thanks much.
>>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
