It was the year 2000, when I first landed a job where I was also in charge of a
large network comprising thousands of LAN and WAN nodes. It didn't take me
long to realize the bottlenecks and challenges that lay ahead. On the LAN alone
there were around 4,000 nodes; with the MPLS and WAN connections, the total was
probably around 18,000 nodes.
The craziest part was that most of these networks were running on
10/100 switches, which were being used as "dumb hubs". In those days certifications
like CCNA and MCSE did exist, but certified people were very scarce.
I admit I am not a network engineer, but an IT manager who knows the
importance of a network, preferably a healthy one. This one was far
from healthy: subnetting and routing were implemented with
expensive hardware routers (yeah, even on the LAN!) and plain IP subnets.
To give you an idea of the dire s**t I realized I was in, this company
was a multi-million US$ GSM operator running a 24/7 live operation. I
guess you can imagine the impact of a network disruption, and most equipment didn't
have a fallback system. So if traffic got interrupted, it was a total mess,
as it was virtually impossible to detect where the problem was coming from.
I'm sure the IT crowd of such operations know what I mean.
In a 24/7 operation there should be several fallback/backup systems as
well as monitoring to minimize the risk of disruption, as a failure means
millions of US$ lost by the hour! So if you wish to send the top IT officer to
the guillotine, this is the surest method.
Ground Zero
I guess it was the first few months into my IT Technical Manager position.
While I was still trying to get oriented to the company culture and the messy
organization, one morning at 3 am my phone started ringing persistently (yeah, I guess I hit "decline" a few times). I
could feel the fear in the voice of the night-shift techie. He was talking
about a total network disruption, which also affected the NSS / GSM core
network MSCs, and as the domino effect followed, it was already causing buffer
overflows on the HLRs, making the GSM core network fail.
As the Technical Manager, it was my duty to alert the networking team and
activate the emergency protocol (which generally meant that the networking team
had to get to the operation centre ASAP).
By 05:00 am, most of the core networking and support teams were at the
operation centre. We had already lost 2.5 hours, which didn't yet translate into
millions of dollars since we weren't in peak hours, but we had only 2 hours
before the peak period began at 07:00 am (on a normal work day around 12 million
GSM-related transactions took place – calls, SMS, GPRS, pings, broadcasts etc. –
around 40% of them in peak hours). If we could not get the system up by then,
it would count as the worst-case scenario for a 24/7 operation, and yes, getting
fired is the best thing that can happen in that case.
The goal was to get the HLRs back online and the core GSM network active,
with the MSCs recording the transactions (so that they could be billed). As the
LANs and the linked NSS networks could not communicate, and the switches were
not set up properly (no VLANs, no redundancy, no bandwidth management, no packet
management, virtually nothing), we had to do a total sweep, eliminating the
possibilities one by one, within 2 hours!!
Starting with the HLRs and working our way backwards, we realized that
there was no communication between the HLRs and the NSS, the core GSM network
MSCs, as they were set to automatically go into fault mode on disconnection,
disabling the GSM network. By the way, we are talking about the year 2000: the
MSCs (Nokia DX200) ran on a primitive version of Linux, and they were outside
our department, so we needed to cooperate, wake more techies from their beds
and drag them to work.
The error message on the Nokia DX200 was simple. It stated that
there was an IP conflict! Now you are probably asking yourselves, how
can there be an IP conflict on a dedicated network with static IP addresses?
You are right, there cannot be! But when we pinged the address, we got a reply.
When we tried to trace it, it seemed to be somewhere in the void of the unknown
LAN (yeah, they had placed the NSS servers on the office network, like all the
other servers), and it looked as if someone was sabotaging the whole setup by
conflicting with one very specific IP address!
While we were trying to swim through the spaghetti, the "call" came from the
CEO. As if I weren't under enough pressure already, I had to squeak to the CEO
for 10 minutes while assuring him that the problem would be solved soon! As you
can expect, the board of trustees had already been alerted and was pressuring
the CEO, and so on down the ladder all the way to "me".
Due to the shortage of time, we had to think outside the box and pull a
"Crazy Ivan" manoeuvre [1].
During the Cold War, one submarine
would frequently attempt to follow another by hiding in its baffles. Clearing
the baffles meant turning to observe the blind spot and detect any followers.
Related manoeuvres included the Soviets' "Crazy Ivan", a hard turn to clear the
baffles and position the submarine to attack any followers, and "Angles and
Dangles", a five-hour process of rapid direction and speed changes to
ensure that all items aboard were properly secured for hard manoeuvring and
would not fall or shift suddenly, producing noise that the enemy could detect.
As mentioned before, nearly all network nodes were bundled into the
same network segment and could communicate with every other device, so the
only logical thing we could do was disconnect whatever wasn't needed for the
operation, narrowing the circle and hoping we would eventually cut off the
conflicting node. Of course the obvious question comes to mind: "why didn't you
just change the IP of the NSS?" But changing that meant changing the IP settings
on 7000+ HLRs throughout the country so that they could still communicate with
the NSSs.
So, after a quick vote and on my directive, the teams started turning off
core active network devices in an attempt to isolate the system with the
conflicting IP. We are talking about shutting down the network of a building
with 2,000+ employees and over 10,000 actively connected nodes, including
servers, data lines, security cameras, alarm systems, access card systems, fire
systems, printers, scanners, HVAC systems and many others I cannot
remember after 20 years, all connected to the same network and separated only
by IP addresses.
By 08:30 am, after having manually disconnected 80% of the core internal
network, we still weren't able to isolate the conflict! The only physical
networks we were left with were the GSM Technical department (the department that
actually managed the NSS systems and the MSCs), the core network switches/routers,
the WAN connections and the NSS systems. I admit my stress level continued to rise.
So we continued into the GSM Tech department systems. By now, the non-critical
personnel were in the office, unable to work, and the phones continued to ring
nonstop. As we got to 90% of the systems, we started the second phase, which
consisted of asking the teams whether they had installed anything new in the
last 10 hours or so. None of the answers solved our problem. Lastly, although
the GSM Tech director strictly forbade us from disconnecting his department's
systems, I overrode his authority and approved the disconnection. Then, poof!
The problem suddenly disappeared. The MSCs and the HLRs started responding after
a few minutes.
Of course you can imagine the shock. We had shut down 95% of a
multi-billion $ company's 24/7 operational systems to find out that there was a
tiny server under a table, belonging to an R&D techie, who had set up an SCO
Unix server for test purposes and decided to run a DHCP service with the same
IP range as the MSCs and the HLRs… apparently some other test machine in the
same department had picked up an address from it, causing a conflict with the
DX200's IP address, making it unreachable for the HLRs and causing massive
disruption due to buffer overflows.
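For the record, this is exactly the kind of thing that is trivial to catch today with DHCP snooping on managed switches, or even with a quick probe from any host on the segment. Below is a rough, illustrative sketch assuming the scapy Python library, an interface name of "eth0" and a made-up list of legitimate servers; it broadcasts a DHCP DISCOVER and reports every server that answers with an offer, rogue SCO boxes under tables included.

# Hedged illustration: find every DHCP server answering on this segment and
# flag the ones that are not on the known list. Requires scapy and root.
from scapy.all import Ether, IP, UDP, BOOTP, DHCP, srp, get_if_raw_hwaddr, conf

KNOWN_DHCP_SERVERS = {"10.0.0.1"}  # hypothetical legitimate server(s)

def discover_dhcp_servers(iface="eth0", timeout=3.0):
    """Return the source IPs of every DHCP server that answers a DISCOVER."""
    conf.checkIPaddr = False              # offers are not addressed to our IP
    _, hw = get_if_raw_hwaddr(iface)      # raw MAC of the probing interface
    discover = (
        Ether(dst="ff:ff:ff:ff:ff:ff")
        / IP(src="0.0.0.0", dst="255.255.255.255")
        / UDP(sport=68, dport=67)
        / BOOTP(chaddr=hw)
        / DHCP(options=[("message-type", "discover"), "end"])
    )
    answered, _ = srp(discover, iface=iface, multi=True,
                      timeout=timeout, verbose=False)
    return {reply[IP].src for _, reply in answered}

rogues = discover_dhcp_servers() - KNOWN_DHCP_SERVERS
if rogues:
    print("Unexpected DHCP servers on this segment: %s" % rogues)

Of course, none of this existed on that network in 2000; the fix at the time was physical isolation and, later, a proper redesign.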
We isolated the culprits manually and managed to return the network to
its previous state by 12:00. The day's loss was calculated at around 6
million US$.
Of course, some people's key cards didn't work the next day, including
the GSM Tech director's. Meanwhile, as the leader of the ERT (emergency
response team), I didn't get any commendations; instead I was given the
task of reshaping the organization's entire network infrastructure so that a
scenario like this could never happen again.
With a team of professionals and a big, hefty budget, we managed to revive and
reconfigure the huge network within 24 months, with all active devices
configured, managed and monitored 24/7 (I shall talk about this in my next
log). As I hear from colleagues, they are still using the same operational
lifecycle fundamentals I established back in 2002 (based on an early version of ITIL and enterprise architecture).
Since that day, I have often prayed that I would never face a challenge like
this again. It was all good until 2019, which I will talk about in my next
article.