Monday, February 3, 2020

How I Recovered from a Network Disaster


The year was 2000, when I first landed a job where I was also in charge of a large network comprising thousands of LAN and WAN nodes. It didn’t take me long to realize the bottlenecks and the challenges that lay ahead. On the LAN alone there were around 4,000 nodes; with the MPLS links and the WAN connections, the number was probably around 18,000. The craziest part was that most of these networks were running on 10/100 switches, which were used as “dummy hubs”. In those days certifications like CCNA and MCSE did exist, but people who held them were very scarce.

I admit I am not a network engineer, but an IT Manager who knows the importance of a network; preferably a healthy one. In this case, it was far from perfect: segmentation and routing were implemented with expensive hardware routers (yeah, even on the LAN!) and plain IP subnets.

To give you an idea of the dire s**t I realized I was in, this company was a multi-million-US$ GSM operator running a live 24/7 operation. I guess you can imagine the impact of a network disruption, and most equipment didn’t have a fallback system. So if the traffic got interrupted, it was a total mess, as it was virtually impossible to detect where the problem was arising from. I’m sure the IT crowd of such operations know what I mean.

In a 24/7 operation there should be several fallback/backup systems, as well as monitoring, to minimize the risk of disruption, because a failure means millions of US$ lost by the hour! So if you wish to send the top IT officer to the guillotine, this is the most guaranteed method.

Ground Zero
I was only a few months into my IT Technical Manager position. Whilst still trying to orient myself within the company culture and the messy organization, one morning at 3 am my phone started ringing persistently (yeah, I guess I pressed “decline” a few times). I could feel the fear in the voice of the night-shift techie. He was talking about a total network disruption, which also affected the NSS / GSM core network MSCs, and as the domino effect followed, it was already causing buffer overflows on the HLRs, making the GSM core network fail.

As the Technical Manager, it was my duty to alert the networking team and activate the emergency protocol ASAP (which generally meant that the networking team needed to come to the operation centre ASAP).

By 05:00 am, most of the core networking and support teams were at the operation centre. We had already lost 2.5 hours, which didn’t yet translate into millions of dollars as it wasn’t peak hours, but we had only 2 hours before 07:00 am, when the peak period began (on a normal work day there were around 12 million GSM-related transactions taking place – calls, SMS, GPRS, pings, broadcasts etc. – of which 40% were in the peak hours). If we could not get the system up by then, it would count as the worst-case scenario of a 24/7 operation, and yes, getting fired is the best thing that can happen in that case.

The goal was to make sure the HLRs were online and the core GSM network was active, with the MSCs recording the transactions (so that they could be billed). As the LANs and the linked NSS networks were unable to communicate, and the switches were not set up properly (no VLANs, no redundancy, no bandwidth management, no packet management, virtually nothing), we had to do a total sweep, eliminating the possibilities, within 2 hours!

Starting with the HLRs and working our way backwards, we realized that there was no communication between the HLRs and the NSS, the core GSM network MSCs, as they were set to automatically go into fault mode on disconnection, disabling the GSM network. By the way, we are talking about the year 2000: the MSCs (Nokia DX200) ran on a primitive version of Linux, and the MSCs were outside our department, so we needed to cooperate, wake more techies from their beds and drag them to work.

The error message on the Nokia DX200 was simple. It stated that there was an IP conflict! Now you are probably asking yourselves: how can there be an IP conflict on a dedicated network with static IP addresses? You are right, there cannot be! But when we pinged the address, we got a reply. When we tried to trace it, it seemed to be somewhere in the void of the unknown LAN (yeah, they had placed the NSS servers on the office network, like the rest of the servers), and it seemed someone was sabotaging the whole setup by conflicting with one very specific IP address!
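For context, this kind of ghost-chasing is easier today: you ping the contested address and then look at the ARP cache to see which MAC address actually answered. Below is a minimal sketch of the idea in Python; the address 10.1.1.50, the Linux host with the standard ping and ip tools, and being on the same broadcast domain are all assumptions for illustration, not details from the actual incident.

```python
#!/usr/bin/env python3
"""Rough sketch: find out *who* is answering on an address that should be static.

Assumptions (not from the original incident): a Linux host on the same broadcast
domain as the suspect, the placeholder address 10.1.1.50, and the standard
`ping` and `ip neigh` tools being available.
"""
import subprocess

SUSPECT_IP = "10.1.1.50"  # placeholder for the conflicting address


def who_answers(ip: str):
    """Ping once to refresh the ARP cache, then return the MAC that replied."""
    subprocess.run(["ping", "-c", "1", "-W", "1", ip],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    result = subprocess.run(["ip", "neigh", "show", ip],
                            capture_output=True, text=True)
    # Typical output: "10.1.1.50 dev eth0 lladdr aa:bb:cc:dd:ee:ff REACHABLE"
    for line in result.stdout.splitlines():
        if "lladdr" in line:
            return line.split("lladdr")[1].split()[0]
    return None


if __name__ == "__main__":
    mac = who_answers(SUSPECT_IP)
    if mac:
        print(f"{SUSPECT_IP} answers from MAC {mac} - check it against the DX200's interface.")
    else:
        print(f"Nothing answered on {SUSPECT_IP}.")
```

If the MAC that answers doesn’t belong to the legitimate machine, a second host is squatting on the address; the hard part back then was that the flat network gave us no clue where that host physically sat.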

Whilst we were trying to swim through the spaghetti, the “call” came from the CEO. As if I weren’t under enough pressure already, I had to squeak to the CEO for 10 minutes whilst assuring him that the problem would be solved soon! As you can expect, the board of trustees had already been alerted and was pressuring the CEO, and the pressure rolled all the way down the ladder to “me”.

With time running short, we had to think outside the box and make a “crazy Ivan” manoeuvre [1].
During the Cold War, one submarine would frequently attempt to follow another by hiding in its baffles. That is, turning to observe the blind spot and detect any followers. Related manoeuvres included the Soviets’ “Crazy Ivan”, a hard turn to clear the baffles and position the submarine to attack any followers, and “Angles and Dangles”, a five-hour process of rapid direction and speed changes to ensure that all items aboard were properly secured for hard manoeuvring and would not fall or shift suddenly, producing noise that the enemy could detect.

As mentioned before, nearly all network nodes were bundled into the same network segment and could communicate with every other device, so the only logical action was to disconnect whatever wasn’t needed for the operation, narrowing the circumference in the hope of eventually disconnecting the conflicting network node. Of course, the simple question might come to your mind: “why don’t you just change the IP of the NSS?” But changing that meant changing the IP settings on 7,000+ HLRs throughout the country so that they could communicate with the NSSs.

So, after a minor vote and on my directive, the teams started turning off core active network devices in an attempt to isolate the system with the conflicting IP. We are talking about shutting down the network of a building with 2,000+ employees and over 10,000 actively connected nodes, including servers, data lines, security cameras, alarm systems, security card systems, fire systems, printers, scanners, HVAC systems and many others I cannot remember after 20 years, all connected to the same network and separated only by IP addresses.
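Looking back, what we were doing by hand was essentially a binary search over network segments: take half offline, check whether the rogue still answers, keep the guilty half, repeat. The sketch below only illustrates that idea; the helper functions for disabling and restoring segments and for probing the conflicting IP are hypothetical stand-ins, since in 2000 this all meant physically pulling uplinks and patch cables.

```python
#!/usr/bin/env python3
"""Illustrative only: narrowing down which network segment hides a host that
answers on a conflicting IP, by switching segments off and re-probing.

The callables passed in are hypothetical; in our case "disable" meant
physically pulling uplinks, and "probe" meant pinging the contested address.
"""
from typing import Callable, List


def find_offending_segment(segments: List[str],
                           disable: Callable[[List[str]], None],
                           restore: Callable[[List[str]], None],
                           conflict_still_present: Callable[[], bool]) -> str:
    """Binary-search the segment list for the one hosting the rogue device."""
    candidates = list(segments)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        disable(half)                      # take one half offline
        if conflict_still_present():       # rogue still answers...
            candidates = candidates[len(candidates) // 2:]   # ...so it's in the other half
        else:                              # conflict gone...
            candidates = half              # ...so it was in the half we cut
        restore(half)                      # bring the half back before the next round
    return candidates[0]
```

With a dozen segments that is only four rounds of disconnections instead of sweeping most of the building, but it presumes segments you can actually switch off independently, which is exactly what this flat network did not give us.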

By 08:30 am, after having manually disconnected 80% of the core internal network, we still hadn’t been able to isolate the conflict! The only physical networks we were left with were the GSM technical department (the department that actually manages the NSS systems and the MSCs), the core network switches/routers, the WAN connectors and the NSS systems. I admit my stress level continued to rise. So we continued into the GSM Tech department’s systems. By now, the non-critical personnel were in the office, unable to work, and the phones continued to ring non-stop. As we got to 90% of the systems, we started the second phase, which consisted of questioning the teams about whether they had installed anything new in the last 10 hours or so. None of the answers solved our problem. Lastly, although the GSM Tech director strictly forbade us to disconnect his department’s systems, I overrode his authority and approved the disconnection. Then, poof! Suddenly the problem disappeared. The MSCs and the HLRs started responding after a few minutes.

Of course you can imagine the shock. We had shut down 95% of a multi-billion-$ company’s 24/7 operational systems to find out that there was a tiny server under a table, belonging to an R & D techie who had set up a SCO Unix server for test purposes and decided to run a DHCP service with an IP range identical to that of the MSCs and the HLRs… Apparently some other test machine in the same department had picked up the IP address, which caused a huge conflict with the DX200’s IP address, making it unreachable for the HLRs and causing massive disruption due to buffer overflows.
We isolated the culprits manually and managed to return the network to its previous state by 12:00. The loss for the day was calculated to be around 6 million US$.
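As a closing side note: today a rogue DHCP server like that one is easy to flush out. The rough sketch below uses the third-party scapy library (my choice for illustration, nothing like it was in our 2000 toolkit): broadcast a DHCPDISCOVER and list every server that sends back an offer; anything other than your sanctioned DHCP server is a suspect. The interface name is a placeholder.

```python
#!/usr/bin/env python3
"""Rough sketch: flush out rogue DHCP servers by broadcasting a DHCPDISCOVER
and listing everyone who answers. Requires the third-party scapy library and
root privileges; the interface name below is a placeholder."""
from scapy.all import (Ether, IP, UDP, BOOTP, DHCP,
                       srp, get_if_hwaddr, conf)

IFACE = "eth0"  # placeholder interface name

conf.checkIPaddr = False  # offers come from the server's own IP, not the broadcast we sent to
mac = get_if_hwaddr(IFACE)

discover = (Ether(src=mac, dst="ff:ff:ff:ff:ff:ff") /
            IP(src="0.0.0.0", dst="255.255.255.255") /
            UDP(sport=68, dport=67) /
            BOOTP(chaddr=bytes.fromhex(mac.replace(":", "")), xid=0x1234) /
            DHCP(options=[("message-type", "discover"), "end"]))

# multi=True keeps collecting answers so several DHCP servers can show up
answered, _ = srp(discover, iface=IFACE, multi=True, timeout=5, verbose=False)

for _, offer in answered:
    print(f"DHCP offer from {offer[IP].src} ({offer[Ether].src})")
```

Any offer coming from a machine that isn’t the sanctioned DHCP server points straight at the culprit, with no need to power down a whole building to find it.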

Of course, some people’s key cards didn’t work the next day, including the GSM Tech director’s. Meanwhile, as the leader of the ERT (emergency response team), I didn’t get any commendations; instead, I was given the task of reshaping the whole network infrastructure of the organization so that a scenario like this could never happen again.

With a team of professionals, a hefty budget and 24 months, we managed to revive and reconfigure the huge network, with all active devices configured, managed and monitored 24/7 (I shall talk about this in my next log). As I hear from colleagues, they are still using the same operational lifecycle fundamentals I established back in 2002 (based on an early version of ITIL and enterprise architecture).

Since that day, I have often prayed that I would never face a challenge like that again. It was all good until 2019, which I will talk about in my next article.
