Understand troubleshooting methodologies and approaches to efficiently resolve network issues.
- Troubleshooting methodology
- Define the problem
- Gather information
- Analyze information
- Eliminate potential causes
- Propose hypothesis
- Test hypothesis
- Solve problem and document solution
- Top-down method: start at the top of the OSI model
- Bottom-up method: start at the bottom of the OSI model
- Divide and conquer method: start in the middle of the OSI model
- Compare configuration to another similar device
- Follow the path, checking each device in the path as you go
- Swap components
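As a rough illustration of the "compare configuration to another similar device" approach, here is a minimal Python sketch that uses the standard library's difflib to list only the lines that differ between a known-good config and a suspect one. The two config snippets and the function name changed_lines are made up for the example; in practice you would paste in the output from the real devices.

```python
import difflib
from typing import List


def changed_lines(reference: str, suspect: str) -> List[str]:
    """Return only the added/removed lines between two config texts."""
    diff = difflib.unified_diff(
        reference.splitlines(),
        suspect.splitlines(),
        fromfile="reference",
        tofile="suspect",
        lineterm="",
    )
    # Keep "+"/"-" lines, but drop the "+++"/"---" file header lines.
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Hypothetical configs: a working switch port and the port being troubleshot.
working_port = """interface GigabitEthernet0/1
 switchport mode access
 switchport access vlan 10
 no shutdown"""

suspect_port = """interface GigabitEthernet0/1
 switchport mode access
 switchport access vlan 20
 no shutdown"""

for line in changed_lines(working_port, suspect_port):
    print(line)
```

Anything the diff prints is a candidate cause worth checking, not proof of one; two similar devices can differ for perfectly good reasons.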
My two cents:
When troubleshooting, I like the divide and conquer method, where you methodically isolate issues to ensure you find the root cause. I have found it to be the most efficient and effective approach. What I mean is that once you define the problem you are trying to solve, the next step is to isolate it. Luckily for us, everything on a network is connected, which makes most troubleshooting tests binary: connectivity either works or it doesn't. Bandwidth and latency issues tend to be non-binary and are more difficult to troubleshoot. Anyway, if you can divide your network into segments and test the problem over part of the network instead of the entire network, then you will be able to isolate the problem much more efficiently.
For example, your coworker's device can't reach the internet, so they call you to troubleshoot it. You know that between their device and the internet there are a switch, a router, and a firewall. You divide the network in half and test whether their device can reach the router. Then you run a separate test to see whether the router can get to the internet. One of the two tests will most likely fail and the other will pass. Focus on the segment of the network that failed. Say it was the client to the router. Now divide that segment in half, which would be at the switch, and test again. Test connectivity from the client to the switch, then from the switch to the router. Say the client is not getting to the switch.
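The halving in this example is basically a binary search over the devices in the path. Here is a minimal Python sketch of that idea. It assumes there is exactly one fault on the path, so it only tests the near half each round and infers the far half from the result (in real life you would run both tests, as described above). The names isolate_fault and simulated_probe are hypothetical, and the probe is a simulation standing in for a real ping or traceroute.

```python
from typing import Callable, Optional, Sequence, Tuple

Probe = Callable[[str, str], bool]


def isolate_fault(path: Sequence[str], reachable: Probe) -> Optional[Tuple[str, str]]:
    """Return the adjacent pair of devices between which connectivity breaks.

    Assumes a single fault on the path. Returns None if the end-to-end
    test passes, since there is nothing to isolate.
    """
    lo, hi = 0, len(path) - 1
    if hi < 1 or reachable(path[lo], path[hi]):
        return None
    # Invariant: the segment path[lo]..path[hi] is known to fail.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reachable(path[lo], path[mid]):
            lo = mid  # near half passes, so the fault is in the far half
        else:
            hi = mid  # near half fails, so keep narrowing it
    return path[lo], path[hi]


def simulated_probe(path: Sequence[str], broken_after: int) -> Probe:
    """Hypothetical stand-in for a real test: the link after index broken_after is down."""
    def reachable(src: str, dst: str) -> bool:
        a, b = sorted((path.index(src), path.index(dst)))
        return not (a <= broken_after < b)
    return reachable


path = ["client", "switch", "router", "firewall", "internet"]
# Simulate the scenario above: the client cannot get to the switch.
print(isolate_fault(path, simulated_probe(path, broken_after=0)))
```

With five devices this takes two or three tests instead of four, and the savings grow with the length of the path. If there are multiple faults, the single-fault assumption breaks and you need to re-run the isolation after each fix.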
At this point you need to select either the client device or the switch and start working your way up the OSI model. Is the client machine physically connected to the network with its link light green (layer 1)? Does its ARP table show entries (layer 2)? Does it have an IP address (layer 3)? And so on. If you fail to find the problem on the client, go to the switch and check that there is a physical cable connected to the port on the switch (layer 1). Check that the status of the port is up/up, check the VLAN, and look for ARPs from the client (layer 2). Put an SVI on the switch and see if you can ping the client device (layer 3). Long story short, divide and conquer to accomplish efficient fault isolation, and then step up the OSI model for resolution. Don't forget to test access to the internet (layer 7) once you think you've resolved the issue. There are times when multiple unrelated problems occur simultaneously. What happens when you fix a problem and tell the customer they are good without testing it yourself? They try to get to the internet and it looks like you didn't fix anything. Don't do that to yourself. Trust but verify!
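The step-up-the-OSI-model pass on a single device can be thought of as an ordered checklist where you stop at the first failure. Here is a minimal Python sketch under that assumption. The LayerCheck type, the first_failing_check function, and the hard-coded True/False results are all hypothetical placeholders; in practice each check would be something you look at or run on the device.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class LayerCheck(NamedTuple):
    layer: int                  # OSI layer the check belongs to
    question: str               # what you are verifying
    passes: Callable[[], bool]  # placeholder for the real check


def first_failing_check(checks: Iterable[LayerCheck]) -> Optional[LayerCheck]:
    """Run the checks from the lowest layer up and return the first one that fails."""
    for check in sorted(checks, key=lambda c: c.layer):
        if not check.passes():
            return check
    return None


# Hypothetical results for the client in the example above.
client_checks = [
    LayerCheck(1, "Is the link light green?", lambda: True),
    LayerCheck(2, "Does the ARP table show entries?", lambda: True),
    LayerCheck(3, "Does the client have an IP address?", lambda: False),
    LayerCheck(7, "Can the client reach the internet?", lambda: False),
]

failed = first_failing_check(client_checks)
if failed:
    print(f"Start at layer {failed.layer}: {failed.question}")
else:
    print("All checks passed")
```

Keeping the layer 7 check at the end of the list is the "trust but verify" step: after a fix, re-run the whole list and only call it resolved when nothing fails.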