Discussion:
[Puppet Users] Command Exceeded Timeout on Basic Shell Commands
Franck
2015-06-16 02:12:03 UTC
Permalink
We've been experiencing a lot of "Command exceeded timeouts" on basic shell
commands using the "exec" type for tasks that should execute fairly fast:

Jun 15 15:45:44 host1 puppet-agent[57648]:
(/Stage[main]/Timezone::Utc/Exec[/bin/rm -f /etc/localtime && /bin/ln -s
/usr/share/zoneinfo/UTC /etc/localtime]) Command exceeded timeout
Jun 10 21:15:24 host1 puppet-agent[57081]:
(/Stage[main]/Open-vm-tools::Package/Exec[/usr/bin/vmware-uninstall-tools.pl]/onlyif)
Check "/usr/bin/test -f /usr/bin/vmware-uninstall-tools.pl" exceeded timeout
Jun 10 23:56:02 host1 puppet-agent[40286]:
(/Stage[main]/Open-vm-tools::Package/Exec[/usr/bin/yum install -y
open-vm-tools.x86_64]/unless) Check "/bin/rpm -q open-vm-tools" exceeded
timeout

All these commands can be run locally to the host and return fairly
quickly, but when puppet executes them they time out. Extending the timeout
is an option but ridiculous since default is 300 seconds and none of these
commands should take 5 minutes or more to return.

Some of the things observed is that this only affects CentOS 6.x hosts as
we also have Ubuntu 14.x hosts and they do not experience these problems.
Also, we've played around with different versions of the puppet agent
along with different versions of Ruby and none of them had any effect as
this condition persists regardless. Also, this does not seem to affect all
of our CentOS 6.x hosts but only certain ones -- randomly. Running puppet
agent in debug mode does not seem to uncover what's going on as it just
hangs when it gets to the "exec".

It's very annoying and actually dangerous in some cases as the puppet agent
will continue spawning multiple "applying configuration" processes which
will cause hosts to swap memory as each takes up more and more memory and
in some instances will hose them entirely.
We've had to remove these manifests that cause these conditions in the
interim but right now we have a lot of hosts we need to manage with puppet
so we need to be able to use this.

Basic info on the hosts in question:

- Puppet: 3.7.5
- Ruby: 2.1.2
- CentOS 6.6

Anyone have any ideas as to what could be causing this?

Thanks.
Franck
--
You received this message because you are subscribed to the Google Groups "Puppet Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to puppet-users+***@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-users/f2071040-32e1-4c9b-8656-5a33650c67ca%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Ashish Jaiswal
2015-06-16 13:05:16 UTC
Permalink
hi,

Could you test the same with puppet apply on the client itself
Post by Franck
We've been experiencing a lot of "Command exceeded timeouts" on basic
shell commands using the "exec" type for tasks that should execute fairly
(/Stage[main]/Timezone::Utc/Exec[/bin/rm -f /etc/localtime && /bin/ln -s
/usr/share/zoneinfo/UTC /etc/localtime]) Command exceeded timeout
(/Stage[main]/Open-vm-tools::Package/Exec[/usr/bin/
vmware-uninstall-tools.pl]/onlyif) Check "/usr/bin/test -f /usr/bin/
vmware-uninstall-tools.pl" exceeded timeout
(/Stage[main]/Open-vm-tools::Package/Exec[/usr/bin/yum install -y
open-vm-tools.x86_64]/unless) Check "/bin/rpm -q open-vm-tools" exceeded
timeout
All these commands can be run locally to the host and return fairly
quickly, but when puppet executes them they time out. Extending the timeout
is an option but ridiculous since default is 300 seconds and none of these
commands should take 5 minutes or more to return.
Some of the things observed is that this only affects CentOS 6.x hosts as
we also have Ubuntu 14.x hosts and they do not experience these problems.
Also, we've played around with different versions of the puppet agent along
with different versions of Ruby and none of them had any effect as this
condition persists regardless. Also, this does not seem to affect all of
our CentOS 6.x hosts but only certain ones -- randomly. Running puppet
agent in debug mode does not seem to uncover what's going on as it just
hangs when it gets to the "exec".
It's very annoying and actually dangerous in some cases as the puppet
agent will continue spawning multiple "applying configuration" processes
which will cause hosts to swap memory as each takes up more and more memory
and in some instances will hose them entirely.
We've had to remove these manifests that cause these conditions in the
interim but right now we have a lot of hosts we need to manage with puppet
so we need to be able to use this.
- Puppet: 3.7.5
- Ruby: 2.1.2
- CentOS 6.6
Anyone have any ideas as to what could be causing this?
Thanks.
Franck
--
You received this message because you are subscribed to the Google Groups
"Puppet Users" group.
To unsubscribe from this group and stop receiving emails from it, send an
To view this discussion on the web visit
https://groups.google.com/d/msgid/puppet-users/f2071040-32e1-4c9b-8656-5a33650c67ca%40googlegroups.com
<https://groups.google.com/d/msgid/puppet-users/f2071040-32e1-4c9b-8656-5a33650c67ca%40googlegroups.com?utm_medium=email&utm_source=footer>
.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "Puppet Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to puppet-users+***@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-users/CAMm4wzLCYjZhfMx3H-G%2BF9zTdFDfgdwdcuE3Owvd6E%2Bs83YzAg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
jcbollinger
2015-06-17 13:19:39 UTC
Permalink
Post by Franck
We've been experiencing a lot of "Command exceeded timeouts" on basic
shell commands using the "exec" type for tasks that should execute fairly
(/Stage[main]/Timezone::Utc/Exec[/bin/rm -f /etc/localtime && /bin/ln -s
/usr/share/zoneinfo/UTC /etc/localtime]) Command exceeded timeout
(/Stage[main]/Open-vm-tools::Package/Exec[/usr/bin/
vmware-uninstall-tools.pl]/onlyif) Check "/usr/bin/test -f /usr/bin/
vmware-uninstall-tools.pl" exceeded timeout
(/Stage[main]/Open-vm-tools::Package/Exec[/usr/bin/yum install -y
open-vm-tools.x86_64]/unless) Check "/bin/rpm -q open-vm-tools" exceeded
timeout
All these commands can be run locally to the host and return fairly
quickly, but when puppet executes them they time out.
Very strange.
Post by Franck
Extending the timeout is an option but ridiculous since default is 300
seconds and none of these commands should take 5 minutes or more to return.
No, probably not a viable option. If these particular commands are not
completing within the standard timeout, then there's no particular reason
to think that they would *ever* complete, no matter what timeout you set.
Post by Franck
Some of the things observed is that this only affects CentOS 6.x hosts as
we also have Ubuntu 14.x hosts and they do not experience these problems.
Also, we've played around with different versions of the puppet agent
along with different versions of Ruby and none of them had any effect as
this condition persists regardless. Also, this does not seem to affect all
of our CentOS 6.x hosts but only certain ones -- randomly.
There is surely some pattern to which machines are affected and which not.
Discovering that pattern would be a big step in solving the problem.
Post by Franck
Running puppet agent in debug mode does not seem to uncover what's going
on as it just hangs when it gets to the "exec".
You could try running Puppet under strace to get a low-level view of
exactly what Puppet gets stuck on. Nevertheless, if the problem sticks to
particular computers across different Puppet versions and different Ruby
versions, then the root of the problem must be outside Puppet itself.

You could compare the lists of installed packages between an affected
machine and a non-affected one. Perhaps the problem is caused by a
specific package or package version.

You should compare the catalogs applied to the machines that suffer from
this problem with those for the machines that are not affected. It may
help to narrow down the problem if you find that it is associated with a
small number of specific resources.

You should check how Puppet is running on affected machines vs.
non-affected ones. Is it running as a privileged user? The same one?
Post by Franck
It's very annoying and actually dangerous in some cases as the puppet
agent will continue spawning multiple "applying configuration" processes
which will cause hosts to swap memory as each takes up more and more memory
and in some instances will hose them entirely.
Have you actually observed that behavior? If so, then something is
dreadfully wrong. Puppet should never start a new catalog run when one is
already underway. It has safeguards in place to prevent that. If you have
stumbled across a way in which those can be circumvented, then I'm sure
PuppetLabs would appreciate a bug report.
Post by Franck
We've had to remove these manifests that cause these conditions in the
interim but right now we have a lot of hosts we need to manage with puppet
so we need to be able to use this.
- Puppet: 3.7.5
- Ruby: 2.1.2
- CentOS 6.6
Anyone have any ideas as to what could be causing this?
You haven't given us much to work with, and I, at least, have never before
heard of such an issue. I do not know what is causing it, but I suggest
you try narrowing down the configuration being applied to one of the
affected nodes to find a minimal set that is sufficient to trigger the
issue. For example, if you apply only class Timezone::Utc, is that
sufficient to cause puppet to exhibit the problem? Please provide the
actual manifests involved.


John
--
You received this message because you are subscribed to the Google Groups "Puppet Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to puppet-users+***@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-users/2b617318-fe37-44d7-8461-b36ceadccde9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Franck
2015-06-20 20:33:37 UTC
Permalink
Thank you for your response John.

You made some good suggestions especially with running it in strace to
retrieve more information.

In regards to your question, whether we have observed the behavior of
puppet spawning multiple processes, I can definitely confirm this. When
some of our nodes suffer from this, we are alerted to memory hitting a
threshold and when we look at the process on these nodes, we clearly see
multiple "applying configurations" in the list of processes. At that
point, we stop the puppet daemon and kill all the "applying configurations".

We'll try retrieving some more information this week to see if we can
narrow down some more possibilities. This does seem to affect a few random
nodes but on a consistent basis so we can definitely reproduce this.

Thanks again for your response.

Franck
Post by jcbollinger
Post by Franck
We've been experiencing a lot of "Command exceeded timeouts" on basic
shell commands using the "exec" type for tasks that should execute fairly
(/Stage[main]/Timezone::Utc/Exec[/bin/rm -f /etc/localtime && /bin/ln -s
/usr/share/zoneinfo/UTC /etc/localtime]) Command exceeded timeout
(/Stage[main]/Open-vm-tools::Package/Exec[/usr/bin/
vmware-uninstall-tools.pl]/onlyif) Check "/usr/bin/test -f /usr/bin/
vmware-uninstall-tools.pl" exceeded timeout
(/Stage[main]/Open-vm-tools::Package/Exec[/usr/bin/yum install -y
open-vm-tools.x86_64]/unless) Check "/bin/rpm -q open-vm-tools" exceeded
timeout
All these commands can be run locally to the host and return fairly
quickly, but when puppet executes them they time out.
Very strange.
Post by Franck
Extending the timeout is an option but ridiculous since default is 300
seconds and none of these commands should take 5 minutes or more to return.
No, probably not a viable option. If these particular commands are not
completing within the standard timeout, then there's no particular reason
to think that they would *ever* complete, no matter what timeout you set.
Post by Franck
Some of the things observed is that this only affects CentOS 6.x hosts as
we also have Ubuntu 14.x hosts and they do not experience these problems.
Also, we've played around with different versions of the puppet agent
along with different versions of Ruby and none of them had any effect as
this condition persists regardless. Also, this does not seem to affect all
of our CentOS 6.x hosts but only certain ones -- randomly.
There is surely some pattern to which machines are affected and which
not. Discovering that pattern would be a big step in solving the problem.
Post by Franck
Running puppet agent in debug mode does not seem to uncover what's going
on as it just hangs when it gets to the "exec".
You could try running Puppet under strace to get a low-level view of
exactly what Puppet gets stuck on. Nevertheless, if the problem sticks to
particular computers across different Puppet versions and different Ruby
versions, then the root of the problem must be outside Puppet itself.
You could compare the lists of installed packages between an affected
machine and a non-affected one. Perhaps the problem is caused by a
specific package or package version.
You should compare the catalogs applied to the machines that suffer from
this problem with those for the machines that are not affected. It may
help to narrow down the problem if you find that it is associated with a
small number of specific resources.
You should check how Puppet is running on affected machines vs.
non-affected ones. Is it running as a privileged user? The same one?
Post by Franck
It's very annoying and actually dangerous in some cases as the puppet
agent will continue spawning multiple "applying configuration" processes
which will cause hosts to swap memory as each takes up more and more memory
and in some instances will hose them entirely.
Have you actually observed that behavior? If so, then something is
dreadfully wrong. Puppet should never start a new catalog run when one is
already underway. It has safeguards in place to prevent that. If you have
stumbled across a way in which those can be circumvented, then I'm sure
PuppetLabs would appreciate a bug report.
Post by Franck
We've had to remove these manifests that cause these conditions in the
interim but right now we have a lot of hosts we need to manage with puppet
so we need to be able to use this.
- Puppet: 3.7.5
- Ruby: 2.1.2
- CentOS 6.6
Anyone have any ideas as to what could be causing this?
You haven't given us much to work with, and I, at least, have never before
heard of such an issue. I do not know what is causing it, but I suggest
you try narrowing down the configuration being applied to one of the
affected nodes to find a minimal set that is sufficient to trigger the
issue. For example, if you apply only class Timezone::Utc, is that
sufficient to cause puppet to exhibit the problem? Please provide the
actual manifests involved.
John
--
You received this message because you are subscribed to the Google Groups "Puppet Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to puppet-users+***@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-users/0d2c1448-9881-40a4-8dcd-75dbd20c5039%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
'Dan Finn' via Puppet Users
2015-06-23 21:45:20 UTC
Permalink
Hello,
I am Franck's coworker. I spent some time looking into this today and have
some more info but unfortunately am still really lost as to what is going
on or what is causing this.

Here is the output captured from a puppet agent -t --debug run where a node
hangs:

https://github.com/danfinn/errors/blob/master/puppet_debug_output_during_hang.txt

And here is the output from strace captured during the same host hanging at
the same point in the manifest (hangs at line 3441):

https://github.com/danfinn/errors/blob/master/puppet_hang_strace_no_follow_forks.txt

I understand that this strace would be more helpful with -ff enabled but
unfortunately I couldn't get puppet to run like that, I'm not sure why but
it generated a bunch of errors and never ran the manifest.

One thing I noticed is that while puppet is hanging running /usr/bin/test,
or at least it is telling me that it's running that and waiting, that
process does not show up in the command list. Also, running that identical
/usr/bin/test command at the same time that puppet is hanging on it works
fine, it returns very quickly.

I was able to get this test system to hang both via puppet agent and
running the manifest locally with puppet apply.
--
You received this message because you are subscribed to the Google Groups "Puppet Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to puppet-users+***@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-users/aa1469c2-6b6d-4091-8c7a-6605dbec86d1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
'Dan Finn' via Puppet Users
2015-06-23 22:03:24 UTC
Permalink
Just noticed something else interesting. When running puppet apply or
puppet agent and it doesn't hang, puppet only spawns 1 process and there is
no puppet process left running afterwards, it seems to clean itself up and
die off. When puppet runs and hangs, it spawns 2 processes each looking
identical with the same name and even though the output from puppet apply
or agent says it finished running (after letting it timeout on the hang) it
leaves 1 process behind still running. I attached strace to this running
process and the output looks like so:

pid 61271] futex(0x7f5e2454b6a0, FUTEX_WAIT_PRIVATE, 2, {0, 263395792}) =
-1 ETIMEDOUT (Connection timed out)
[pid 61271] futex(0x7f5e2454b6a0, FUTEX_WAIT_PRIVATE, 2, {0, 248281920}) =
-1 ETIMEDOUT (Connection timed out)
[pid 61271] futex(0x7f5e2454b6a0, FUTEX_WAIT_PRIVATE, 2, {0, 170310944}) =
-1 ETIMEDOUT (Connection timed out)
[pid 61271] futex(0x7f5e2454b6a0, FUTEX_WAIT_PRIVATE, 2, {0, 199996576}) =
-1 ETIMEDOUT (Connection timed out)
[pid 61271] futex(0x7f5e2454b6a0, FUTEX_WAIT_PRIVATE, 2, {0, 244956432}) =
-1 ETIMEDOUT (Connection timed out)
[pid 61271] futex(0x7f5e2454b6a0, FUTEX_WAIT_PRIVATE, 2, {0, 141958480}) =
-1 ETIMEDOUT (Connection timed out)
[pid 61271] futex(0x7f5e2454b6a0, FUTEX_WAIT_PRIVATE, 2, {0, 22737840}) =
-1 ETIMEDOUT (Connection timed out)
[pid 61271] futex(0x7f5e2454b6a0, FUTEX_WAIT_PRIVATE, 2, {0, 78315952}) =
-1 ETIMEDOUT (Connection timed out)

and as best as I can tell it would just keep repeating that indefinitely.

This seems quite odd to me and has to be related in some way but I'm not
sure how.
Hello,
I am Franck's coworker. I spent some time looking into this today and
have some more info but unfortunately am still really lost as to what is
going on or what is causing this.
Here is the output captured from a puppet agent -t --debug run where a
https://github.com/danfinn/errors/blob/master/puppet_debug_output_during_hang.txt
And here is the output from strace captured during the same host hanging
https://github.com/danfinn/errors/blob/master/puppet_hang_strace_no_follow_forks.txt
I understand that this strace would be more helpful with -ff enabled but
unfortunately I couldn't get puppet to run like that, I'm not sure why but
it generated a bunch of errors and never ran the manifest.
One thing I noticed is that while puppet is hanging running /usr/bin/test,
or at least it is telling me that it's running that and waiting, that
process does not show up in the command list. Also, running that identical
/usr/bin/test command at the same time that puppet is hanging on it works
fine, it returns very quickly.
I was able to get this test system to hang both via puppet agent and
running the manifest locally with puppet apply.
--
You received this message because you are subscribed to the Google Groups "Puppet Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to puppet-users+***@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-users/5949b475-91fb-463a-9ed8-4e4e672aaee6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Franck
2015-06-29 22:00:56 UTC
Permalink
In case anyone is interested, we figured out what was causing this issue.

Being that we run a Ruby on Rails application and naturally Puppet runs on
Ruby as well, we actually load TC Malloc memory allocator for our
application. By running one of the "applying configuration" processes in
gstack, we were able to discover that it was getting stuck on
libtcmalloc.so.

The version of TC Malloc we were running that caused this issue was 4.2.0
but since upgrading to latest 4.2.6 we have not experienced another
occurrence of this issue.
Post by Franck
We've been experiencing a lot of "Command exceeded timeouts" on basic
shell commands using the "exec" type for tasks that should execute fairly
(/Stage[main]/Timezone::Utc/Exec[/bin/rm -f /etc/localtime && /bin/ln -s
/usr/share/zoneinfo/UTC /etc/localtime]) Command exceeded timeout
(/Stage[main]/Open-vm-tools::Package/Exec[/usr/bin/
vmware-uninstall-tools.pl]/onlyif) Check "/usr/bin/test -f /usr/bin/
vmware-uninstall-tools.pl" exceeded timeout
(/Stage[main]/Open-vm-tools::Package/Exec[/usr/bin/yum install -y
open-vm-tools.x86_64]/unless) Check "/bin/rpm -q open-vm-tools" exceeded
timeout
All these commands can be run locally to the host and return fairly
quickly, but when puppet executes them they time out. Extending the timeout
is an option but ridiculous since default is 300 seconds and none of these
commands should take 5 minutes or more to return.
Some of the things observed is that this only affects CentOS 6.x hosts as
we also have Ubuntu 14.x hosts and they do not experience these problems.
Also, we've played around with different versions of the puppet agent
along with different versions of Ruby and none of them had any effect as
this condition persists regardless. Also, this does not seem to affect all
of our CentOS 6.x hosts but only certain ones -- randomly. Running puppet
agent in debug mode does not seem to uncover what's going on as it just
hangs when it gets to the "exec".
It's very annoying and actually dangerous in some cases as the puppet
agent will continue spawning multiple "applying configuration" processes
which will cause hosts to swap memory as each takes up more and more memory
and in some instances will hose them entirely.
We've had to remove these manifests that cause these conditions in the
interim but right now we have a lot of hosts we need to manage with puppet
so we need to be able to use this.
- Puppet: 3.7.5
- Ruby: 2.1.2
- CentOS 6.6
Anyone have any ideas as to what could be causing this?
Thanks.
Franck
--
You received this message because you are subscribed to the Google Groups "Puppet Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to puppet-users+***@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-users/64526946-314f-46c2-8ba5-d7fbf1ee09e8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Loading...