Juniper EX Switch Stack By drkfiber on Flickr
So, not so long ago, we were having big trouble with our network when one of the switches that serves the area where IT is located started playing up. Here’s a rundown of what was happening.
It was late in the afternoon, just before we were about to knock off, when we noticed that we had lost internet. We couldn’t get any email, or any other network access at all, actually. Mr Chief had spent all day trying to resolve another network problem, so we assumed it had something to do with that and just dismissed it. When we came in the next day, it was all still down, which caught us by surprise. Mr Chief had already noticed the problem, and had started investigating. With a problem like this, the first thing you need to do is isolate the cause or source. At first we thought it was House-wide but, when we asked other people who came past, it turned out it was just us.
The more people we talked to, the more isolated we found the issue to be. It was only affecting the IT area, and nowhere else. That led us to a set of switches that serve our area specifically. Standing in front of them, staring at the little flickering lights, Mr Chief declared “It’s a looped cable.” This means that someone’s plugged both ends of an Ethernet cable into two ports on the wall. This creates a loop back to the switch, which generates excessive traffic, which, according to Mr Chief, could cause our issues. So we went looking at every desk, at every wire; we had to check that none of them were somehow looped, or plugged in where they shouldn’t have been. But really, this cause didn’t make much sense. Out of all the departments, IT is the last bunch you’d think would do something like that.
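For anyone wondering why one looped cable can cause so much grief, here is a rough toy model in Python. It is only a sketch under my own assumptions: a single imaginary switch with no loop protection that floods a broadcast out of every port except the one it arrived on. The function name, the port numbers and the number of rounds are all made up for illustration, and a real switch is far more complicated than this.

```python
def count_flooded_frames(num_ports, looped_pair, rounds):
    """Count the frame copies a toy switch sends while flooding one broadcast.

    num_ports   -- number of ports on the hypothetical switch
    looped_pair -- (a, b) ports joined by a single cable, or None for no loop
    rounds      -- maximum number of forwarding rounds to simulate
    """
    # The broadcast first arrives on port 0 from an ordinary computer.
    arrivals = [0]
    sent = 0
    for _ in range(rounds):
        next_arrivals = []
        for in_port in arrivals:
            for out_port in range(num_ports):
                if out_port == in_port:
                    continue  # never flood back out of the port it came in on
                sent += 1
                if looped_pair and out_port in looped_pair:
                    # The looped cable delivers this copy straight back
                    # to the switch on the other end of the pair.
                    a, b = looped_pair
                    next_arrivals.append(b if out_port == a else a)
        arrivals = next_arrivals
        if not arrivals:
            break  # nothing came back, so the broadcast has died out
    return sent


if __name__ == "__main__":
    print("no loop:  ", count_flooded_frames(8, None, 5))
    print("with loop:", count_flooded_frames(8, (1, 2), 5))
```

Without the loop, the broadcast is sent once on each of the other ports and then it is over. With the loop, two copies keep coming back and get flooded again every round, so the traffic never stops for as long as the cable stays plugged in.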
The day went by, and we still hadn’t found the cause, so we were still without internet, email or network access the next day. Mr Chief was still convinced it was a looped wire, but decided to look into other possibilities. He connected to the switch stack via telnet (command line) and found that the second switch kept dropping its ports from the channel, then picking them up again, then dropping them again.
You see, the switches are set up in sets called stacks, and they’re linked together at the back via HDMI cables which serve as 10 Gb links between them. This lets them communicate between themselves super fast, and act as one unit. Also, the first five or so ports on each switch connect back to the core switch stack, which acts as the backbone for the whole House’s network. Those five ports are treated as one connection, and this is called a channel. The switch software does this by creating the channel and adding the ports to it. So, for some reason, the ports weren’t staying part of the channel.
Mr Chief diagnosed this problem as being caused by either a faulty switch or bad HDMI cables running in and out of that switch. Hence the logical thing to do was to try replacing these things and see if it fixed it. The HDMI cables being easier, we started with them. Despite the fact each switch had come with two HDMI cables, we could only find one in over two hours of searching in our crowded storeroom. Mr Chief still tried swapping it in to see if it fixed the issue, but it didn’t seem to make any difference.
Beginning to get frustrated, Mr Chief called D-Link, who manufacture the switches, for support. All they had to say was “We’ll call you back when we can.” Mr Chief then decided to have a poke around at the switch stack settings, via the web interface this time. But it refused to load. It was looking bad, but after switching through four browsers, and waiting for over ten minutes for the page to load, Mr Chief was able to access the web interface. It’s at this point he got the idea to upgrade the switch firmware. He had downloaded it a while ago, so he thought it might be worth a shot. Luckily, before he did anything else, he backed up all the configuration settings because, apparently, firmware updates wipe them out.
To upgrade the firmware for the switch stack, two files had to be replaced on each switch: the actual firmware and the boot files, similar to the ROM and bootloader for Android. But when Mr Chief tried, he found that he couldn’t update the firmware for the whole stack together because it claimed some file was in use. This meant that each switch had to be done individually, and using telnet rather than the web interface. Although this is more time-consuming, it didn’t really matter to us much, as long as it worked. Of course, as with any firmware upgrade, there is always the possibility of something going wrong and the whole thing ending up bricked. The big problem was Mr Chief wasn’t sure how to do a firmware upgrade with telnet, so he spent a large amount of time using 3G internet to read through the documentation to find out how.
An hour later, Mr Chief had succeeded in updating the firmware and boot files for all the switches except switch 1, which is the Master, meaning it controls the rest of the stack. He had updated its firmware, but when he tried to replace the boot files, it gave him the same “in use” message as before. He decided that the way around this was to make switch 2 the Master. Now, I was there when he did this, and I saw that he changed switch 2 to be Master, but he also changed the switch stack order so that the second switch thought it was No.1 and the first switch thought it was No.2. Thus, he cancelled out the change he had just made. Not realising what he had done, he agreed when it asked him to reboot so that it could apply these changes. It’s also worth noting here that switch 2 was the one that was playing up, and now it’s supposedly Master?!? Not good!
And that bricked it. While they seemed to reboot, none of the stack was accessible via the web interface or telnet. So, we tried hard resetting them… Nope, nothing. It’s at this point that D-Link support decided to call back, which was just a *little* bit too late. The only helpful thing they seemed to be able to tell us was that, if it turned out that the switches were really dead, D-Link should cover it because it happened as a result of a firmware upgrade. As a last resort, Mr Chief tried connecting to the switches individually using a console cable, which looks like a VGA cable and plugs into the back of the switch. Using this method, he was able to totally reset the switches and put all the firmware and configs back on. Even after that, switches 1 and 2 were still confused about who was Master. After resetting that, we were back at square one with switch 2 dropping ports.
Getting desperate now, Mr Chief set up a spare switch as switch 2 and we swapped them over. This wasn’t a straightforward task because we had to try and memorise which ports the Ethernet cables were plugged into, so we could disconnect them, pull out the switch, put in the other one, and plug all the cables in again. And after all that, it was still dropping ports from the channel. We admitted defeat for the day, and collected any diagnostics we could to send to D-Link, to see if they could tell us what was going on.
From what we and they could deduce, it seemed that switch 2 was constantly disconnecting from the stack and then reconnecting again. The stack is set up to compensate for a failure, so if a switch drops out or is added, the ordering is renegotiated to match. However, with switch 2 doing this so much and so fast, the stack was doing nothing but renegotiating order. Consequently, it didn’t have any time to do what it was supposed to be doing. This led Mr Chief back to thinking it could be caused by the HDMI cables being damaged, but with no cables to swap them with, we were stuck. As it turned out, the problem resolved itself for a few days, during which switch 2 would go down briefly in the mid-morning before recovering. This gave us the time to finally find some more HDMI cables so we could replace both links to switch 2.
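To put the renegotiation problem into rough numbers, here is a tiny back-of-the-envelope sketch in Python. The function and every figure in it are hypothetical, chosen only to show the reasoning: it assumes the stack can do no useful work while it is renegotiating, and that each drop-and-rejoin by switch 2 costs a fixed amount of renegotiation time. We never measured either of those things.

```python
def forwarding_fraction(flap_interval_s, renegotiation_s):
    """Estimate the share of time a flapping stack has left for real traffic.

    flap_interval_s  -- assumed seconds between one drop-out and the next
    renegotiation_s  -- assumed seconds the stack spends renegotiating per flap
    """
    if flap_interval_s <= 0:
        raise ValueError("flap_interval_s must be positive")
    # Renegotiation cannot take up more time than there is between flaps.
    busy = min(renegotiation_s, flap_interval_s)
    return 1.0 - busy / flap_interval_s


if __name__ == "__main__":
    # Made-up figures: an occasional flap versus a constant one.
    print(forwarding_fraction(60, 3))  # one flap a minute, mostly fine
    print(forwarding_fraction(2, 3))   # flapping faster than it can recover
```

Once the flaps come faster than the stack can finish renegotiating, the fraction hits zero, which matches what we saw: the stack was up, the lights were flickering, and nothing was getting through.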
When it was all working again, Mr Chief’s official statement was that it seemed to have been caused by damaged HDMI cables, possibly damaged by an electrician during the recent installation of new cabling. It turned out Mr Chief’s intuition about a looped cable was also right; we accidentally created one when we hastily plugged the cables back in after replacing switch 2.
So, that’s the whole story, in blow-by-blow detail. As always, if you have something to say, feel free to tell me in the comments below, or on the new Facebook fan page.
Post Again Soon,