[torqueusers] Checkpoint script failed with return value of 13
Sreedhar Manchu
sm4082 at nyu.edu
Tue Jan 31 08:32:40 MST 2012
Hi,
When I try to checkpoint a simple job, I see this error in the qstat -f output:
Checkpoint script failed with return value of 13
I see this in the system messages:
Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
Jan 31 10:09:04 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13
Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::Operation not permitted (1) in blcr_checkpoint_job, cannot change checkpoint directory owner
Jan 31 10:09:37 compute-4-14 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 13
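The "Operation not permitted (1)" text is errno 1 (EPERM), which suggests that a chown on the checkpoint directory was refused. As a rough first check, here is a minimal sketch that tests whether a directory is owned by a given numeric uid. The helper name owned_by is hypothetical, the directory path has to be supplied by the caller (the actual checkpoint directory is not shown in the logs above), and it assumes GNU stat is available on the compute node.

```shell
#!/bin/sh
# owned_by DIR UID  (hypothetical helper, not part of TORQUE or BLCR)
# Returns 0 if DIR is a directory owned by numeric UID,
#         1 if DIR is owned by some other uid,
#         2 if DIR is missing or cannot be inspected.
owned_by() {
    dir=$1
    want_uid=$2
    [ -d "$dir" ] || return 2
    # GNU stat: %u prints the numeric uid of the owner
    have_uid=$(stat -c '%u' "$dir") || return 2
    [ "$have_uid" = "$want_uid" ]
}
```

Running it on the compute node with the checkpoint directory and the job owner's uid (from id -u) shows whether the ownership change pbs_mom attempted was needed in the first place.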
I found this in checkpoint_script:
# Note also that a request was made to identify whether this script was invoked
# by the job's owner or by a system administrator. While this information is
# known to pbs_server, it is not propagated to pbs_mom and thus it is not
# possible to pass this to the script. Therefore, a workaround is to invoke
# qmgr and attempt to set a trivial variable. This will fail if the invoker is
# not a manager.
Does anyone know exactly what I need to do here? I am not sure which trivial variable I need to set with qmgr.
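One possible reading of that comment, as a minimal sketch: wrap a qmgr set request in a function and use its exit status to tell managers from non-managers. The function name invoker_is_manager and the QMGR override variable are hypothetical, and the sketch assumes qmgr exits non-zero when the request is refused. Which attribute counts as "trivial" is a guess; re-asserting a value the server already has looks like the least intrusive option.

```shell
#!/bin/sh
# invoker_is_manager DIRECTIVE  (hypothetical helper)
# Succeeds if qmgr accepts DIRECTIVE, fails otherwise.
# QMGR may be set to another program for testing; it defaults to qmgr.
# Assumption: qmgr returns a non-zero exit status when the caller
# is not allowed to change the attribute.
invoker_is_manager() {
    "${QMGR:-qmgr}" -c "$1" >/dev/null 2>&1
}
```

With the server attributes listed below, a call such as invoker_is_manager "set server scheduler_iteration = 600" would re-set an attribute to its current value, so it should change nothing for a manager and fail for everyone else. A failure can also mean that qmgr is missing or the server is unreachable, so the result is only a hint.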
Our Server Attributes:
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = crunch.its.nyu.edu
set server acl_hosts += crunch.local
set server managers = root@crunch.local
set server operators = root@crunch.local
set server default_queue = route
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server submit_hosts = login-0-1
set server submit_hosts += login-0-0
set server submit_hosts += login-0-3
set server submit_hosts += login-0-2
set server allow_node_submit = False
set server next_job_number = 139165
If anyone knows how to get around this error, please let me know. I'd appreciate your help.
Thanks,
Sreedhar.
---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110