Wireshark's finest hour
Long before the DevOps revolution, traditional software development and infrastructure management teams shared an antagonistic relationship. In plain words, they used to fight like cats and dogs. This is a true story, from my own experience, about one of these quasi-aggressive interactions.
From 2000 to 2008, I was part of the infrastructure team at the Superior Court of Justice (STJ), one of Brazil's highest courts. Besides taking care of the network, one of my responsibilities was to provide and support Tomcat application servers, the workhorses that pulled the Java application wagons of STJ's intranet.
STJ software developers didn't have any constraints on production at the time: they could deploy a new application or update an existing one at any time. Yes, I know: “Big mistake, Rafael”. But give me a break, it was a long time ago, when Information Technology at STJ was, in reality, improvised. Our “servers” were just overglorified desktops spread across two rows of stands.
I was no Java developer. I had programmed a lot in C++ and Pascal during my university years and kept writing shell and Perl scripts to support my role at STJ. On the other hand, I had developed great skill at troubleshooting applications at the network level using a packet sniffer, Sniffer Pro at first and, later, Wireshark. When I had spare time, I used to mirror busy server ports, and I loved spending hours making sense of the cascade of Ethernet frames.
I felt like Cypher staring at the Nebuchadnezzar's Matrix console.
Unfortunately I didn't see the beautiful women that Cypher describes, but I knew the traffic baseline like no one else. It was long before this cryptography-everywhere era; HTTPS was an unnecessary burden for our deskt… err, servers. Bottom line: I could see everything, requests, responses and database queries, and I could measure response times and correlate them all.
Modesty aside, I was good. I was so good that other IT departments lined up at my door for a troubleshooting analysis when things got rough. Database queries, authentication protocols, file server communication: I debugged them all. They came with the well-known symptoms: “My application is slow”, “It fails at random times”, and the most common of all:
“I tested it thoroughly in my local development environment (application, files and database on a local desktop) and it worked fine every time. Now that I've moved to production, my performance is crap. Your network sucks!”
Nine times out of ten I could pinpoint the bottleneck: chattiness, huge file accesses, or even some leftover code that queried non-existent servers until a timeout. After a while they all learned to stop blaming my top-notch network and just came asking for help, which I gladly provided every time.
STJ had even given me a bonus for teaching a network troubleshooting course to other network administrators, where I taught every trick I had up my sleeve: packet filtering, reordering, time anchoring, TCP stream following, windowing, data reconstruction, HTTP request to database query correlation… Everything. Unfortunately, no other employee ever reached my troubleshooting skill level.
The network was my kingdom and I was the king.
Enough blowing my own trumpet, let's move on with the story. It was just another day at work when, suddenly, one of my Tomcat servers crashes. First response: restart the service. The server runs fine again.
A couple of days later, another inexplicable crash in the middle of the afternoon. My boss complains. I ask the Java development team if they have made any change to the application recently. They say the code is months old and nothing has been touched.
Some days later, another crash. I've had enough. Our users are complaining, my boss is blaming my servers, and the Java developers swear they did not change anything. I put down my coffee mug, open my server console and get to work. Time to get my hands dirty.
I enable every Java management extension I can find and start logging. Now I am just looking forward to the next crash, which, of course, happens soon enough.
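For the curious: on a Tomcat of that vintage, “enabling every management extension” boiled down to exposing JMX and polling the heap. The sketch below is a reconstruction from memory, not the original configuration; the port, the pid file path, the log path and the sampling interval are all illustrative.

    # Reconstruction from memory -- port and paths are placeholders.
    # Expose the JVM's management extensions (JMX) so heap usage can be watched remotely
    # (no auth/SSL: acceptable on an isolated 2000s intranet, not something to copy today).
    CATALINA_OPTS="$CATALINA_OPTS \
      -Dcom.sun.management.jmxremote \
      -Dcom.sun.management.jmxremote.port=9010 \
      -Dcom.sun.management.jmxremote.authenticate=false \
      -Dcom.sun.management.jmxremote.ssl=false"
    export CATALINA_OPTS

    # Log heap/GC utilisation every 10 seconds, waiting for the spike
    # (the pid file location is hypothetical):
    jstat -gcutil $(cat /var/run/tomcat.pid) 10000 >> /var/log/tomcat-heap.log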
I go through the logs, and what I find is a sharp surge in the memory usage of the Java process: it spikes to the maximum heap size and crashes the server, jumping from a healthy 50% to 100% in a matter of minutes. A memory leak, one of my favorite bugs. However, I still have no clue which application or class is misbehaving.
I politely ask the developers once again: negative answers and a few frowns in my direction. Now I am pissed, but I say nothing. Without the digital smoking gun, I cannot bring the case to court. Not yet, my boy. Not yet…
With no management extensions left to try, I decide to call in the big guns on this case:
RELEASE THE KRAKEN!
- Wireshark laptop deploy: Check!
- Server port mirror: Check!
- Promiscuous mode: Check!
- Spurious traffic filters: Check!
- Circular buffers: Check!
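In today's terms, that checklist boils down to something like the capture below. It's a sketch, not the original command line: the interface name, the server address and the ring-buffer sizes are stand-ins.

    # -i eth0               the laptop NIC plugged into the mirrored switch port
    #                       (promiscuous mode is dumpcap's default behaviour)
    # -f "host 192.0.2.10"  capture filter: drop spurious traffic, keep only the Tomcat box
    # -b ...                circular buffer: rotate 20 files of ~100 MB so the disk never fills
    dumpcap -i eth0 -f "host 192.0.2.10" \
            -b filesize:102400 -b files:20 \
            -w /captures/tomcat.pcapng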
As the dusty fan of my beat-up laptop starts roaring, an involuntary sneer appears on my face. I just have to wait for the incident to happen again, and sure enough, it does. But this time, the Shark has caught something. A phrase slips out between my teeth:
“Let's get down to business, shall we…”
I look at the Java management extension logs and the server port usage to pinpoint the approximate time the server started misbehaving. My clues lead me to the exact moment the first wheel of the wagon fell off: 12:45 PM!
I deep dive into the packets around that time and, to my dismay, there are zillions of them. I see blondes, brunettes, redheads, dwarfs, zebras, giants, E.T.s… But not a single defective request. The server is at a very busy time of day, and the catch is simply too big to digest. A needle in a haystack. I have two clues, but no smoking gun. Yet.
Think, McFly, think! How is a misbehaving request different from a well-behaved one?
OF COURSE!!! There is no spoon!
I crack my knuckles and start writing filters. First, only HTTP requests. Then I add the responses. I separate the individual streams, fiddle with the time references and ordering and voilà! I locate the third clue: the only HTTP request with absolutely no response. Just a TCP ACK and radio silence. And it occurs minutes before the crash, right at the start of the memory climb.
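With current Wireshark field names, that whole hunt collapses into a single display filter: a request frame only gets an http.response_in field once a matching response is seen, so its absence flags the silent ones. Back then it took a lot more stream-by-stream comparison; the capture file name below is a placeholder.

    # List HTTP requests that never got a response, with timestamp and URI:
    tshark -r tomcat.pcapng \
           -Y 'http.request && !http.response_in' \
           -T fields -e frame.number -e frame.time -e http.request.uri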
Now I remove some filters and analyze the packets that show up around this problem. Sure enough, I find the fourth clue: a database query immediately after the responseless HTTP request. And another. And another. The misbehaving request has triggered an infinite loop of repeated database queries, slowly crushing my poor server's memory.
My last clue leads me into the depths of the Java code. I extract the offending SQL query text from the repeating packets.
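Nowadays that extraction is one command away: follow the TCP stream of the database connection and, with nothing encrypted, the SQL text is readable straight out of the payload. The capture file name and the stream index 42 below are invented for the example.

    # Dump the whole database conversation as ASCII; the query text shows up in cleartext.
    tshark -r tomcat.pcapng -q -z follow,tcp,ascii,42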
Now I have all the clues I need to solve this mystery. It is time to locate the faulty code. I run a grep over my server files, searching for the SQL query I have just extracted. I find the Java file containing the query and, of course, it sits inside a loop.
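The grep itself was nothing fancy, roughly the shape below; the query text and the webapps path are placeholders, not the real ones.

    # -r: walk the deployed applications; -l: print only matching file names;
    # -i: SQL keywords were not consistently cased in the code.
    grep -r -l -i "select nome, numero from processos" /opt/tomcat/webapps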
The proverbial smoking gun, covered in developers' fingerprints, is right in front of my eyes!
I march to the Java developers' office, kick the door in and, without warning, yell the name of the defective Java file. A guy in the corner of the office stares at me with big, startled eyes and, in an almost inaudible squeak, moans:
“This change shouldn't have caused a problem…”
Long story short: developers were kicked out of the production servers. From then on, they had to open a formal request to update applications, outside of business hours only. And we started a long-overdue IT governance project.
Epilogue
In 2008, two months after I left STJ, I received a call from the new network administrator asking for help with a misbehaving application. I instructed him to mirror the correct server port, capture some data while the issue was happening, and send me the file.
Sure enough, I found the problem.
I recently heard that there are still tales around STJ about a legendary guy who used to roam the building solving all kinds of IT problems with his unusual companion: a shark that lived in the network cables. The almighty…