Integrating systems from two different vendors can be a challenging endeavor.

However, these challenges often present unique opportunities for problem-solving. With today's innovative tools such as Red Hat Event-Driven Ansible and OpenNMS, we now have more robust and sustainable solutions at our disposal for such issues.

I'd like to share a personal experience from my previous job where we dealt with a Network Attached Storage (NAS) system issue.

Check out this Ansible demo:

https://www.youtube.com/watch?v=aqQq5vD8-n0

Story Time

Our NAS setup featured a policy server that scrutinized every operation to prevent protected data from ending up in incorrect locations within our drive shares. This process was managed synchronously via a TCP connection initiated by the policy server. Every operation on the NAS required either approval or rejection from this server.

However, a bug brought this process to a standstill: the policy server would set its TCP window size to zero. A 'zero window' tells the sender that the receiver, in this case the policy server, cannot accept any more data. The NAS respected the zero window and stopped sending operations for approval. Under normal conditions this is a good thing, since a receiver such as the policy server may be temporarily overloaded. Unfortunately, in this case it never recovered. Because the connection appeared alive to both the policy server and the NAS, neither end reset it to restore the window, which left drive shares inaccessible and halted users' work.
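To make the failure mode concrete, here is a minimal Python sketch of how a stuck zero window might be told apart from a brief, healthy one. The WindowSample type, the zero_window_stall function, and the 30-second limit are hypothetical names and values chosen for illustration; they are not part of our original scripts or of any product.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class WindowSample:
    """One observation of the receiver's advertised TCP window (hypothetical shape)."""
    timestamp: float   # seconds since some reference point
    window_bytes: int  # advertised receive window in bytes


def zero_window_stall(samples: Iterable[WindowSample], max_stall_seconds: float) -> bool:
    """Return True if the window stayed at zero for at least max_stall_seconds."""
    stall_start: Optional[float] = None
    for sample in samples:
        if sample.window_bytes == 0:
            if stall_start is None:
                stall_start = sample.timestamp  # first zero-window sample in this run
            if sample.timestamp - stall_start >= max_stall_seconds:
                return True
        else:
            stall_start = None  # the receiver reopened its window, so reset the timer
    return False


# A brief zero window that recovers is normal flow control.
recovering = [WindowSample(0, 65535), WindowSample(1, 0), WindowSample(2, 0), WindowSample(3, 8192)]
# A zero window that never reopens is the condition we hit.
stuck = [WindowSample(0, 0), WindowSample(10, 0), WindowSample(40, 0)]

print(zero_window_stall(recovering, max_stall_seconds=30))
print(zero_window_stall(stuck, max_stall_seconds=30))
```

The point of the timer is that a zero window by itself is not an error. Only one that persists well beyond a normal overload should be treated as a fault.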

Finding a Fix

Resolving this issue swiftly became a critical priority. Our immediate fix involved manually resetting the connection for the policy server. We probably lost some transactions during this, but the alternative of a full stop to everything was far worse. To automate this process, we crafted several bespoke scripts. However, these were problem-specific and led to a duplication of effort. Thankfully, with the advent of modern solutions, we can now address such situations in a more efficient, reliable, and sophisticated manner.

Integrate & Automate

Red Hat Event-Driven Ansible is a fantastic tool that enables instant responses to events detected by OpenNMS. In the context of our NAS system issue, the policy server traffic remained above 1 megabit per second under normal circumstances, but when the bug appeared, the traffic plunged to less than 100 kilobits per second. OpenNMS was already collecting network interface data via SNMP from the policy server, and we could simply configure this as a low threshold event.
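As a rough illustration of what a low threshold on interface traffic involves, the Python sketch below converts two readings of an SNMP octet counter into a bit rate and fires once the rate stays below 100 kilobits per second. This is not OpenNMS's implementation or configuration format. The LowThreshold class, its field names, the re-arm level of 1 megabit per second, and the requirement of two consecutive low samples are all assumptions made for the example, and counter wrap-around is ignored.

```python
from dataclasses import dataclass, field
from typing import Optional


def bits_per_second(prev_octets: int, curr_octets: int, interval_seconds: float) -> float:
    """Convert two readings of an octet counter into an average bit rate."""
    return (curr_octets - prev_octets) * 8 / interval_seconds


@dataclass
class LowThreshold:
    """Hypothetical low-threshold evaluator with a re-arm level."""
    value_bps: float   # fire when the rate falls below this
    rearm_bps: float   # re-arm once the rate is back at or above this
    trigger: int       # consecutive low samples required before firing
    _low_count: int = field(default=0, init=False)
    _armed: bool = field(default=True, init=False)

    def observe(self, rate_bps: float) -> Optional[str]:
        """Feed one sample; return 'exceeded', 'rearmed', or None."""
        if self._armed:
            if rate_bps < self.value_bps:
                self._low_count += 1
                if self._low_count >= self.trigger:
                    self._armed = False
                    self._low_count = 0
                    return "exceeded"
            else:
                self._low_count = 0  # a healthy sample breaks the run of low samples
        elif rate_bps >= self.rearm_bps:
            self._armed = True
            return "rearmed"
        return None


threshold = LowThreshold(value_bps=100_000, rearm_bps=1_000_000, trigger=2)
rates = [1_500_000, 80_000, 60_000, 50_000, 1_200_000]
results = [threshold.observe(rate) for rate in rates]
print(results)
```

Requiring more than one low sample and a separate re-arm level keeps a single quiet polling interval from raising an event, and prevents repeated events while the traffic is still low.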

The amalgamation of Ansible's real-time automation and OpenNMS's comprehensive monitoring and event management capabilities provides a powerful solution for network system issues. When OpenNMS triggers an event like a traffic drop below a set threshold, this information is sent to Ansible. Ansible then springs into action, running a playbook that resets the TCP connection, validates service restoration, and sends an all-clear signal.
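The flow just described can be sketched in plain Python. In a real Event-Driven Ansible setup this logic would live in a rulebook and a playbook rather than in a script, so treat the following only as an outline of the control flow. The event fields ("uei", "host"), the event identifier, and the three injected callables are hypothetical, and the escalation branch is an assumption added for the case where the reset does not restore service.

```python
from typing import Callable, List, Mapping


def handle_event(
    event: Mapping[str, str],
    reset_connection: Callable[[str], None],
    service_healthy: Callable[[str], bool],
    notify: Callable[[str], None],
    trigger_uei: str = "example/lowTrafficThreshold",  # hypothetical event identifier
) -> str:
    """React to one monitoring event: reset, validate, then report."""
    if event.get("uei") != trigger_uei:
        return "ignored"  # not the event this handler is responsible for
    host = event["host"]
    reset_connection(host)  # step 1: reset the TCP connection
    if service_healthy(host):  # step 2: validate that service is restored
        notify(f"all clear: {host}")  # step 3: send the all-clear signal
        return "resolved"
    notify(f"escalate: {host} still unhealthy after reset")  # assumed fallback
    return "escalated"


# Usage with stand-in functions instead of real network actions.
resets: List[str] = []
messages: List[str] = []

outcome = handle_event(
    {"uei": "example/lowTrafficThreshold", "host": "policy-server"},
    reset_connection=resets.append,
    service_healthy=lambda host: True,
    notify=messages.append,
)
print(outcome, resets, messages)
```

Passing the actions in as functions keeps the decision logic testable without touching a live connection.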

In conclusion, modern tools and technology have revolutionized how we address complex problems, simplifying our work and providing elegant solutions. With the power of Red Hat Event-Driven Ansible and OpenNMS at your fingertips, you're well-prepared to tackle any system issues head-on!

About the Author: Mike Huot

Published On: October 3rd, 2023
Last Updated: September 26th, 2023
3 min read