
Network Troubleshooting Primer

https://www.sangfor.com

It is the sine qua non of software developers that we have to be prepared to debug the programs we write. No program is perfect the first time, and being able to debug a program–one you’ve written or one you’ve inherited–is a necessary skill.

“Testing proves a programmer’s failure. Debugging is the programmer’s vindication”. Boris Beizer

There are many debugging tools and methods. IDEs often have breakpoint features and context information that make it possible to find code- or logic-specific bugs with a high level of efficiency. These tools and methods work well when the problem is within the program (intrinsic), but often fail to help when the problem is extrinsic.

There are many types of extrinsic bugs (database access, API access, etc.), but a specific category of extrinsic bugs is often difficult for many programmers to troubleshoot: networking issues.

This is understandable. While on the surface networking seems simple enough, it can actually be quite complex under the hood. Programmers already have plenty to be experts in, so networking (protocols, behaviors, and so on) is often not their long suit.

In that context, I want to offer a few simple things that can help enormously in a programmer’s network troubleshooting. They may not allow the programmer to solve the problem directly, but they yield enough information to help those who can solve it (e.g., Enterprise IT) find and fix it more quickly and easily.

Let’s use a simple example of a program encountering a network problem as the basis for discussing the tools and procedures to use in narrowing down its source.

Imagine you have a program that attempts to connect to a database that is on a local or remote network to read some data from a table. The program references the database endpoint by FQDN: say, “accounts.example.com” and provides a port number for the connection–say, 3306.

Running the program results in a connection failure with “Error establishing a database connection”. Other failures, of course, provide more information (e.g., “User authentication failure” is much more instructive), but receiving this generic failure message is not uncommon.

What do we do now?

We could reach out to the Enterprise IT personnel and ask for help, but the uninformative nature of the error message is just as likely to baffle them as it does you. We need to be more informed before calling them.

So, what to do?

Let’s start by understanding how a network connection works.

In the example given here, we are attempting a TCP connection to a named host and port. Why might it have failed?

Domain Name System

First: let’s check to make sure the hostname we used for the connection is translatable to an IP address–in other words, let’s check the DNS process that takes place as the first step of this connection attempt.

On the host on which the program is running we will need to do a DNS lookup on the hostname “accounts.example.com”. If we can obtain a shell on the host, great. We can use one of the DNS tools–nslookup, dig, etc.–to see if the hostname can be transformed to an IP address.

So, on a Linux system, we might do:

dig accounts.example.com

If we get back a result such as this:

; <<>> DiG 9.16.30-RH <<>> accounts.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37258
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;accounts.example.com.			IN	A

;; ANSWER SECTION:
accounts.example.com.		165	IN	A	104.18.26.120
accounts.example.com.		165	IN	A	104.18.27.120

;; Query time: 12 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Sun Dec 21 14:04:40 CST 2025

we know the hostname is valid and can be translated by the local DNS system.

If, instead, we get something like this:

; <<>> DiG 9.16.30-RH <<>> accounts.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 8474
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;accounts.example.com.			IN	A

;; AUTHORITY SECTION:
com.			30	IN	SOA	a.gtld-servers.net. nstld.verisign-grs.com. 1766347559 1800 900 604800 900

;; Query time: 67 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Sun Dec 21 14:06:26 CST 2025
;; MSG SIZE  rcvd: 116

then the local network does not know a host by the name “accounts.example.com”. We have found a source of the error (at least the proximate source, as there may be others): we have the wrong hostname, the hostname is not registered with the DNS server, or we are using the wrong DNS server.

Providing the output of the DNS lookup to the Enterprise IT group should help them get right to the source of the problem, with a quick resolution.

If it is not possible to get to a shell on the host on which your program is running, you can do the equivalent of this DNS lookup programmatically. For instance, Python has a dnspython library, and examples of how to run a simple DNS query abound online.
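As a rough illustration, here is a minimal sketch of such a programmatic lookup. It uses Python’s standard socket module rather than dnspython, simply so that nothing needs to be installed. The function name resolve_host and its lookup parameter are hypothetical names made up for this example. One caveat: socket.getaddrinfo goes through the operating system’s resolver, so it may also consult local sources such as /etc/hosts, which dig does not.

```python
import socket


def resolve_host(hostname, port=3306, lookup=socket.getaddrinfo):
    """Return (addresses, error): sorted unique IPs, or [] plus an error message.

    `lookup` is a hypothetical injection point so the function can be
    exercised without a network; by default it uses the OS resolver.
    """
    try:
        results = lookup(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Raised when the name cannot be resolved. This covers the NXDOMAIN
        # case, but also resolver timeouts and similar failures.
        return [], str(exc)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string for both IPv4 and IPv6.
    addresses = sorted({entry[4][0] for entry in results})
    return addresses, None
```

Calling resolve_host("accounts.example.com") from the host where the program runs should return either the list of addresses or an error string, either of which is worth passing along to Enterprise IT.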

There is a second item to be noted here: even if the hostname is transformed to an IP address by the local DNS server, the IP address itself may not be correct. This would imply either an incorrect hostname or an incorrect entry in the DNS server tables. Either one can be investigated and fixed by the appropriate network engineer.

Connection issues

If the hostname is not our problem, what next?

Well, the first step a network connection has to take after resolving the hostname is to establish an end-to-end TCP connection to the specific host.

There are many things that could prevent such a connection: firewall rules, network permissions, and so on–but they all fall into a single category: establishing a connection from the local host to an endpoint.

Here is where another tool comes into play: ping.

Again, in a shell, if possible (and programmatically, if not), we can run a command:

ping <ipaddress>

If we get a successful response:

PING 23.46.216.147 (23.46.216.147) 56(84) bytes of data.
64 bytes from 23.46.216.147: icmp_seq=1 ttl=54 time=1.24 ms
64 bytes from 23.46.216.147: icmp_seq=2 ttl=54 time=1.23 ms
64 bytes from 23.46.216.147: icmp_seq=3 ttl=54 time=1.23 ms
64 bytes from 23.46.216.147: icmp_seq=4 ttl=54 time=1.21 ms
64 bytes from 23.46.216.147: icmp_seq=5 ttl=54 time=1.24 ms

we know the host is up, and a route to it is available.

Otherwise:

PING 1.2.3.4 (1.2.3.4) 56(84) bytes of data.
.
.
.

with no response suggests we have a connection problem. (Some networks block ping traffic, so a silent ping is a strong hint rather than proof.)

There are many reasons that a connection is not possible, and getting to the bottom of those is a bit more complex, and beyond the scope of this primer, but at least you can tell Enterprise IT that you can’t reach the host, and they should be able to take it from there.
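If no shell is available, the same check can be driven from code. The following is a minimal sketch that assumes a Linux host with the iputils version of ping installed; the -c and -W options and the meaning of the exit status differ on other platforms. The names ping_command and host_answers_ping, and the run parameter, are hypothetical and exist only for this example.

```python
import subprocess


def ping_command(host, count=3, timeout_s=2):
    """Build the argument list for Linux (iputils) ping."""
    # -c: number of echo requests to send; -W: seconds to wait for each reply.
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def host_answers_ping(host, count=3, timeout_s=2, run=subprocess.run):
    """Return True if ping exits with status 0.

    On iputils ping, status 0 normally means at least one reply came back.
    `run` is a hypothetical injection point so the logic can be tested
    without sending real packets.
    """
    result = run(ping_command(host, count, timeout_s),
                 capture_output=True, text=True)
    return result.returncode == 0
```

A False result here carries the same meaning, and the same caveats, as a silent ping in the shell.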

But let’s say you successfully pinged the host–what now? After all, our program still doesn’t connect.

This is where a powerful tool comes into play: Nmap.

Nmap is a network tool that has many features, but one simple usage of it can help further delineate the source of the connection problem.

Again, in a shell, if possible (and programmatically, if not), we can run a command:

nmap <ipaddress>

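This form of the Nmap command probes the 1000 most commonly used TCP ports (UDP ports are not scanned unless requested separately) and reports on whether each is “open” (ready to receive a connection attempt), “closed”, or “filtered” (no usable response, often because of a firewall). In output like the example that follows, only the open ports are listed individually; the rest are summarized in a “Not shown” line.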

We might see an output like this:

Nmap scan report for <hostname> (<ipaddress>)
Host is up (0.052s latency).
Not shown: 995 filtered tcp ports (no-response)
PORT     STATE SERVICE
22/tcp   open  ssh
80/tcp   open  http
443/tcp  open  https
3000/tcp open  ppp
9000/tcp open  cslistener

Nmap done: 1 IP address (1 host up) scanned in 5.48 seconds

We now know that ports 22, 80, 443, 3000, and 9000 are open for connections. We also know that these ports are normally used for ssh, http, https, ppp, and cslistener. (The services behind them may not actually be using the ports for their customary purposes; all we care about is that the port number is open.)

If, instead, we see something like this:

Nmap scan report for <hostname> (<ipaddress>)
Host is up (0.052s latency).
filtered tcp ports (no-response)
PORT     STATE SERVICE

Then we know none of the scanned ports are open.

In our case, since we are trying to connect to port 3306, we know this port is not open and receiving connections. Why could this be?

Either the program using that port (usually MySQL) is not running on the remote host, or that TCP port is blocked by a network firewall rule. How do we tell the difference?
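One quick, rough first check is a single TCP connection attempt from code, looking at how it fails. The sketch below uses only Python’s standard socket module; the function name probe_tcp_port and the three labels it returns are hypothetical choices for this example, loosely borrowed from Nmap’s vocabulary.

```python
import socket


def probe_tcp_port(host, port, timeout_s=3.0):
    """Try one TCP connection and return "open", "closed", or "filtered"."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            # The handshake completed, so something is listening.
            return "open"
    except ConnectionRefusedError:
        # The connection was actively refused, usually meaning the host
        # is reachable but nothing is listening on that port.
        return "closed"
    except socket.timeout:
        # No answer at all within the timeout, which often points to a
        # firewall silently dropping the packets.
        return "filtered"
    except OSError as exc:
        # Anything else, e.g. no route to host or a name lookup failure.
        return "error: %s" % exc
```

For the running example, probe_tcp_port("accounts.example.com", 3306) returning "closed" would suggest the database is not listening, while "filtered" would suggest something in between is blocking the port. Neither is proof on its own, since firewalls can also be configured to refuse connections actively.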

Here’s where the most powerful networking tool comes into play: Wireshark.

Wireshark (and its non-UI, terminal-only counterpart tshark) is a complex and heavyweight application, so depending on your tolerance for complexity and your ability to install the tool on the local host, you may choose to forgo this next step and leave it to the experts to figure out.

But, let’s assume you’re up for adventure. We will assume you’re using tshark for the following work–Wireshark is GUI-oriented, but has the same features.

Once tshark is installed, we can run a live packet capture to see what’s going on when we are attempting, but failing, to make the desired connection from our program.

We start tshark and have it listen for traffic to and from port 3306 (ignoring other traffic so we don’t get cluttered output) with the following command:

tshark -i <interface name> -f "tcp port 3306"

We should see the tshark program output something like this:

Running as user "root" and group "root". This could be dangerous.
Capturing on 'eth0'
 ** (tshark:89717) 14:47:31.330867 [Main MESSAGE] -- Capture started.
 ** (tshark:89717) 14:47:31.331040 [Main MESSAGE] -- File: "/tmp/wireshark_eth0JKA4H3.pcapng"

This is the indication that the program is running, capturing packets on the “eth0” interface, and looking only for packets coming from or going to TCP port 3306.

(Note: we are doing only simple filtering for this example. tshark can also apply more complex filters, which can include an IP address, and it can limit the total number of packets captured, etc. Typing “tshark --help” or reading the online documentation can provide more details.)
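To make those options a little more concrete, here is a small sketch that assembles such a command line in Python. It relies on the -i (interface), -f (capture filter), and -c (packet count) options of tshark; the function names tshark_command and format_for_shell are hypothetical, and 203.0.113.10 is only a documentation placeholder address, not a value from the example above.

```python
import shlex


def tshark_command(interface, port, host=None, max_packets=None):
    """Build a tshark argument list with a capture filter."""
    capture_filter = "tcp port %d" % port
    if host:
        # Narrow the capture to traffic to or from a single IP address.
        capture_filter += " and host %s" % host
    args = ["tshark", "-i", interface, "-f", capture_filter]
    if max_packets is not None:
        # Stop automatically after this many packets.
        args += ["-c", str(max_packets)]
    return args


def format_for_shell(args):
    """Render the argument list as a copy-and-paste shell command."""
    return shlex.join(args)


print(format_for_shell(
    tshark_command("eth0", 3306, host="203.0.113.10", max_packets=50)))
# prints: tshark -i eth0 -f 'tcp port 3306 and host 203.0.113.10' -c 50
```

The same argument list could be handed to subprocess.run instead of being printed, assuming the program has the privileges needed to capture packets.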

Now, we start our program and let it run and fail with the connection error.

If things are working properly, we would see something like this:

Capturing on 'Loopback: lo'
 ** (tshark:93028) 15:37:38.444422 [Main MESSAGE] -- Capture started.
 ** (tshark:93028) 15:37:38.444718 [Main MESSAGE] -- File: "/tmp/wireshark_lo4255H3.pcapng"
    1 0.000000000    127.0.0.1 → 127.0.0.1    MySQL 116 Request Query
    2 0.002171291    127.0.0.1 → 127.0.0.1    MySQL 80 Response  OK 
    3 0.002289567    127.0.0.1 → 127.0.0.1    TCP 66 40782 → 3306 [ACK] Seq=51 Ack=15 Win=512 Len=0 TSval=1301090850 TSecr=1301090849
    4 0.305042025    127.0.0.1 → 127.0.0.1    MySQL 148 Request Query
    5 0.309678250    127.0.0.1 → 127.0.0.1    MySQL 85 Response  OK 
    6 0.309760804    127.0.0.1 → 127.0.0.1    TCP 66 40770 → 3306 [ACK] Seq=83 Ack=20 Win=512 Len=0 TSval=1301091157 TSecr=1301091157
    7 1.528887608    127.0.0.1 → 127.0.0.1    MySQL 150 Request Query
    8 1.534546258    127.0.0.1 → 127.0.0.1    MySQL 85 Response  OK 
    9 1.534634645    127.0.0.1 → 127.0.0.1    TCP 66 40770 → 3306 [ACK] Seq=167 Ack=39 Win=512 Len=0 TSval=1301092382 TSecr=1301092382
^C   10 3.727212615    127.0.0.1 → 127.0.0.1    MySQL 151 Request Query
   11 3.731105058    127.0.0.1 → 127.0.0.1    MySQL 85 Response  OK 
   12 3.731149223    127.0.0.1 → 127.0.0.1    TCP 66 40770 → 3306 [ACK] Seq=252 Ack=58 Win=512 Len=0 TSval=1301094578 TSecr=1301094578

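The details of the output above are a bit too complex for this discussion, but note that packets 1 and 2 show a MySQL query going to the server and an OK response coming back, and packet 3 is the TCP acknowledgment of that response. An exchange like this can only happen over an established connection, so we know we connected to the remote host and had an SQL interaction with it.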

However, we would now need to look into the actual SQL request sent to the server to see if something is wrong there. We could look into the captured packets or use debugging tools on the program itself to determine what it sent.

If, instead, we see this:

Capturing on 'Loopback: lo'
 ** (tshark:93028) 15:37:38.444422 [Main MESSAGE] -- Capture started.
 ** (tshark:93028) 15:37:38.444718 [Main MESSAGE] -- File: "/tmp/wireshark_lo4255H3.pcapng"

with no further output after running the program, then our program never established a connection with the remote host on port 3306.

This, then, would indicate some type of networking issue beyond the scope of this primer–or, general network debugging for that matter. Time to reach out to Enterprise IT with this info.

However, if we see this:

Capturing on 'Loopback: lo'
 ** (tshark:93028) 15:37:38.444422 [Main MESSAGE] -- Capture started.
 ** (tshark:93028) 15:37:38.444718 [Main MESSAGE] -- File: "/tmp/wireshark_lo4255H3.pcapng"
    1 0.000000000    127.0.0.1 → 127.0.0.1    MySQL 116 Request Query

with our request going out but no acceptance of it coming back, we’re most likely dealing with a problem on the remote server. The same is true if the packet capture shows a rejection of the request.
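In TCP terms, the three outcomes above correspond to what happens to the opening handshake: the client sends a SYN, an accepting server answers with SYN+ACK, and a rejecting host answers with RST. The sketch below encodes that reasoning. The packet representation (a list of sender and flag-set pairs) and the function name classify_attempt are hypothetical simplifications for illustration, not tshark’s output format.

```python
def classify_attempt(packets):
    """Classify a connection attempt from simplified packet records.

    `packets` is a list of (sender, flags) tuples, where sender is
    "client" or "server" and flags is a set such as {"SYN"} or
    {"SYN", "ACK"}. This is a made-up, simplified representation.
    """
    client_syn = any(s == "client" and "SYN" in f for s, f in packets)
    server_synack = any(s == "server" and {"SYN", "ACK"} <= f
                        for s, f in packets)
    server_rst = any(s == "server" and "RST" in f for s, f in packets)

    if not client_syn:
        return "not-sent"    # nothing captured: the attempt never left here
    if server_synack:
        return "connected"   # handshake accepted: look at the SQL exchange
    if server_rst:
        return "rejected"    # request was refused
    return "no-reply"        # request went out, nothing came back


print(classify_attempt([("client", {"SYN"})]))
# prints: no-reply
```

Whichever label applies, the capture itself is the evidence to hand over.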

At this point–and at earlier points as we’ve seen–we can gather enough information about the network interaction of our program to determine the source of the problem, and provide details to the networking personnel who can track it down and fix it.

What can one do next?

Well, learning more about networking protocols and how to use Wireshark/tshark would make it possible to get even more detail on problems like this, which should make them even easier to track down and fix.

Enjoy!
