Troubleshooting application connectivity

Problem

Reporting and acting on network-related issues requires accurate data collection, visualization, and interpretation. Without reliable network performance indicators, fixing connection issues becomes a guessing game, which ultimately leads to a poor employee experience and wasted resources.

Solution

Application Connectivity troubleshooting follows a set of investigation principles. Application Connectivity helps you to:

  • Identify the root causes behind network-related issues or exclude possible root causes.

  • Effectively troubleshoot network-related issues with targeted solutions.

  • Stop the “blame game” by enabling a fact-based discussion between involved teams.

  • Control device data privacy and ensure compliance within your organization.

To achieve this, the Application Connectivity framework relies on connections data, which includes connection metrics, destination information, and data privacy compliance at the application-device level.

The NQL queries on this page are examples of how to use connections data to investigate network-related issues. Similar queries are supported by the different query-based features available in the Nexthink web interface.

You can also use Network view for connections data visualization, filtering and drill-downs (transport protocols, devices, binaries, destinations, etc.).

Prerequisites

  • Connections events are only available for devices with Collectors that report 'Infinity only'.

  • Minimum Collector version of 2023.10.

Connections data

The connections data used by Network view and the NQL queries on this page include:

  • Connection events

  • Connection metrics

  • Destination decorations

  • Connection event aggregations

Jump to the Network troubleshooting with Connections data section to learn how to use Application Connectivity queries.

Connection events

A connection event represents an outgoing TCP connection (established by a device) or outgoing UDP packets. Each connection event provides the following information:

  • Start time, end time, and duration of the event’s bucket

  • The source of the connection event

  • The destination of the event

  • The transport protocol and IP version

  • Metrics about the connection

Connection events are sampled events, meaning Nexthink reports connection events in buckets of 15 minutes and 1 day.

The connection namespace of the NQL data model contains one main table:

  • The connection.events table contains events for outgoing TCP connections and UDP packets.

The following two tables in the connection namespace are deprecated and will be removed in the future:

  • The connection.tcp_events table contains events for outgoing TCP connections.

  • The connection.udp_events table contains events for outgoing UDP packets.

Some metrics, like the number of failed connections or the connection establishment time, are only available for TCP connection events.

Refer to the NQL data model documentation for more information.

Connection events association

Connection events are linked to the following objects:

  • The device that establishes the connection.

  • The binary that uses the connection.

  • The user of the process that runs the binary.

  • Optionally, the desktop application configured for this binary.

  • Optionally, the network application matching the configured destinations.

Connection destination decoration

Nexthink decorates connection events data with additional destination information.

The IP and the network port define the destination of a connection. Additionally, Nexthink decorates connection events data with a destination type, the 'subnet address' and optional information:

  • Domain name of the destination.

  • Owner of the destination.

  • Country and location of the destination (GeoIP information).

  • Datacenter region name provided by the owner of the destination.

The destination subnet address equals the IP address with the last 8 bits set to zero.
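
As a minimal sketch of this rule, the following Python snippet clears the last 8 bits of an address. The function name subnet_address is hypothetical and is not part of any Nexthink API. Applying the same rule literally to IPv6 addresses is an assumption, since the text above does not say how IPv6 is handled.

```python
import ipaddress


def subnet_address(ip: str) -> str:
    """Return the IP address with its last 8 bits set to zero."""
    addr = ipaddress.ip_address(ip)
    # Clear the lowest 8 bits of the integer form of the address.
    masked = int(addr) & ~0xFF
    # Rebuild an address of the same family (IPv4Address or IPv6Address).
    return str(type(addr)(masked))


print(subnet_address("192.168.10.57"))  # 192.168.10.0
```

For an IPv4 address, this is equivalent to keeping the /24 network part of the address.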

Nexthink uses GeoIP data and published IP address ranges to enrich the destination information of connections. See the table below.

Information about the destination owner, country, and data center region is not always available, or is only partially available. In that case, the corresponding fields are NULL.

Connection event metrics

Connection events provide the following metrics:

  • The connection round-trip time (RTT): The average round-trip time for all established TCP connections. The round-trip time is measured between sending the SYN message and receiving the SYN-ACK message from the remote party during the three-way handshake that establishes the TCP connection. This metric is only available for TCP events with at least one established connection.

  • Incoming and outgoing traffic in bytes. Data received (TCP only) and sent (TCP and UDP) by the application during the event.

  • The ratio of all failed TCP connections over all attempted TCP connections, that is, all established and failed TCP connections.

  • Number of connections per status in the event.

Number of connections per status

  • Established connections: The number of connections that have been established in the current time bucket. (TCP and UDP)

  • Alive connections: The number of connections that were established in a previous time bucket and continue into the current time bucket. (TCP and UDP)

  • Successful connections: The sum of established and alive connections.

  • Failed connections (TCP only) can be one of the following:

    • Connections failed with no host: The number of connections that failed due to the device not reaching the destination host.

    • Connections failed with no service: The number of connections that failed due to the device not reaching the service on the destination host.

    • Rejected connections: The number of outgoing connections that have been rejected on the device of the user.

  • Attempted connections: The number of TCP connections a process tried to establish in a bucket i.e., the sum of established and failed connections.

  • The number of connections that Nexthink aggregates into one event is the total number of failed and successful connections.
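
The relationships between these counters can be sketched in Python as follows. This is an illustration of the definitions above, not Nexthink code: the class and attribute names are hypothetical, and returning None when there are no attempted connections is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusCounts:
    established: int        # established in the current bucket
    alive: int              # established earlier, still open in this bucket
    failed_no_host: int     # TCP only
    failed_no_service: int  # TCP only
    rejected: int           # TCP only

    @property
    def failed(self) -> int:
        # A failed connection is one of: no host, no service, rejected.
        return self.failed_no_host + self.failed_no_service + self.rejected

    @property
    def successful(self) -> int:
        # Sum of established and alive connections.
        return self.established + self.alive

    @property
    def attempted(self) -> int:
        # Sum of established and failed connections.
        return self.established + self.failed

    @property
    def total(self) -> int:
        # Connections aggregated into one event: failed plus successful.
        return self.failed + self.successful

    @property
    def failed_ratio(self) -> Optional[float]:
        # Failed over attempted; None when nothing was attempted (assumption).
        return self.failed / self.attempted if self.attempted else None


counts = StatusCounts(established=8, alive=3, failed_no_host=1,
                      failed_no_service=1, rejected=0)
print(counts.successful, counts.attempted, counts.total, counts.failed_ratio)
# 11 10 13 0.2
```

Note that alive connections count toward successful and total connections but not toward attempted connections, so they do not affect the failed connections ratio.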

To better understand the connection status and how the failed connections ratio is computed, look at the following image that shows connections on a timeline. Each line represents a specific connection and illustrates its duration. For example:

  • C1 starts before the selected timeframe and ends within it.

  • C2 starts before the selected timeframe and ends after it.

  • C5 attempts to connect in the selected timeframe but fails.

The table below describes which connections are considered to calculate the number of connections per their status within the specified timeframe.

Connection events aggregation

Nexthink aggregates connections into buckets of 15 minutes and 1 day.

15-min aggregation

Nexthink aggregates all connections happening within the same 15-minute bucket and sharing the following characteristics into one event:

  • Same device

  • Same binary

  • Same user running the binary

  • Same IP address and network port

  • Same transport protocol (TCP or UDP)

When aggregating connections into a 15-minute bucket, Nexthink computes the following:

  • The average of the connection round-trip times for established TCP connections.

  • The total sum of incoming and outgoing traffic of all connections in the bucket.

  • The number of connections per status.

  • The ratio of failed connections.
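
To make the grouping concrete, here is a simplified Python sketch of a 15-minute aggregation. It is not the Nexthink implementation: the dictionary keys (start, device, binary, user, ip, port, protocol, status, rtt_ms, incoming_bytes, outgoing_bytes) are hypothetical, and alive connections are left out to keep the example short.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 15 * 60


def bucket_start(ts: datetime) -> datetime:
    """Round a timezone-aware timestamp down to the start of its 15-minute bucket."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % BUCKET_SECONDS, tz=timezone.utc)


def aggregate_15min(connections: list[dict]) -> dict:
    """Group connections by bucket and shared characteristics, then compute metrics."""
    groups = defaultdict(list)
    for c in connections:
        # One event per bucket, device, binary, user, IP, port, and protocol.
        key = (bucket_start(c["start"]), c["device"], c["binary"], c["user"],
               c["ip"], c["port"], c["protocol"])
        groups[key].append(c)

    events = {}
    for key, conns in groups.items():
        established = [c for c in conns if c["status"] == "established"]
        failed = [c for c in conns if c["status"] == "failed"]
        # RTT is averaged over established connections that report one.
        rtts = [c["rtt_ms"] for c in established if c.get("rtt_ms") is not None]
        attempted = len(established) + len(failed)
        events[key] = {
            "avg_rtt_ms": sum(rtts) / len(rtts) if rtts else None,
            "incoming_bytes": sum(c.get("incoming_bytes", 0) for c in conns),
            "outgoing_bytes": sum(c.get("outgoing_bytes", 0) for c in conns),
            "established": len(established),
            "failed": len(failed),
            "failed_ratio": len(failed) / attempted if attempted else None,
        }
    return events
```

Two connections from the same binary to the same IP address and port at 10:02 and 10:12 end up in one event, while a third connection at 10:16 starts a new event in the next bucket.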

1-day subnet aggregation

When Nexthink aggregates 15-minute buckets into 1-day buckets, the system sets the IP address to NULL and combines all 15-minute events that share the following characteristics into one 1-day event:

  • Same device

  • Same binary

  • Same user running the binary

  • Same subnet address and network port

  • Same transport protocol (TCP or UDP)

  • Same destination fields

When aggregating connections into a 1-day bucket, Nexthink computes the following:

  • The weighted average of the connection round-trip times for established TCP connections.

  • The total sum of incoming and outgoing traffic of all connections in the bucket.

  • The number of connections per status.

  • The ratio of failed connections.
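
The difference between a weighted and a simple average matters when 15-minute events contain very different numbers of connections. The Python sketch below assumes that each 15-minute average is weighted by its number of established connections. The text above does not state the weights, so treat this choice, and the function name weighted_avg_rtt, as assumptions.

```python
from typing import Optional


def weighted_avg_rtt(events: list[tuple[Optional[float], int]]) -> Optional[float]:
    """Combine 15-minute (avg_rtt_ms, established_connections) pairs into one average.

    Each 15-minute average is weighted by its number of established
    connections (assumed weighting), so busy buckets count for more.
    """
    usable = [(rtt, n) for rtt, n in events if rtt is not None and n > 0]
    total_weight = sum(n for _, n in usable)
    if total_weight == 0:
        return None  # no established TCP connections during the day
    return sum(rtt * n for rtt, n in usable) / total_weight


# One bucket with a single 20 ms connection, one with nine 50 ms connections.
print(weighted_avg_rtt([(20.0, 1), (50.0, 9)]))  # 47.0
```

A simple average of the two bucket values would give 35 ms, which understates the RTT that most connections experienced.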

Network troubleshooting with Connections data

Use connections data to troubleshoot network-related issues. To find the root cause of a network issue or exclude possible root causes, you must identify the relevant population (devices, apps, destinations) affected by network issues and when the impacted connection metric (failed connection ratio, establishment time, traffic) changed.

You can apply the same troubleshooting principle with Network view. Network view allows you to visually identify the relevant population by selecting a connection metric and filtering by the device, application, and destination dimensions.

Identifying the relevant population

Focus on the three dimensions of the connections data:

  • Device Dimension: Which devices are impacted, and how many?

    • Determine which impacted devices share the same characteristics and location.

  • Application Dimension: Which desktop applications are impacted, and how many?

  • Destination Dimension: Which destinations are impacted, and where are they located?

Device Dimension

Connection events are linked to the device object that created the network connection. This allows you to investigate the connections data of a single device (by devices.name) or of a group of devices, for example by devices.entity, GeoIP-based location, or other custom organizational unit classifications.

Refer to the Product configuration documentation for more information.

To group devices by GeoIP-based location, use the location context of the connection event, for example:

Code
connection.events during past 7d
| where transport_protocol == TCP
| where binary.name == "*outlook*"
| summarize Avg_RTT = establishment_time.avg() by context.location.country

The location context is where the device was at the time of the event. It requires the geolocation feature to be activated and works best when the Collector traffic is routed directly to the Internet and not through a VPN.

Alternatively, you can use the organizational context, for example:

Code
devices during past 7d
| with connection.events during past 7d
| where transport_protocol == tcp
| where context.organization.entity == "XYZ"
| compute Failed_Connections = number_of_failed_connections.sum()

Application Dimension

Connection events are linked to the binary object that initiated the connection.

Code
connection.events during past 7d
| where transport_protocol == TCP
| where binary.name == "ABC"
| summarize Avg_RTT = establishment_time.avg() by context.location.country

Additionally, the connection event is linked to a desktop application if the binary is part of the application definition.

Code
connection.events during past 7d
| where transport_protocol == TCP
| where application.name == "XYZ"
| summarize Avg_RTT = establishment_time.avg() by context.location.country

Destination Dimension

The destination is a structured field of the connection event. For example:

Code
connection.events during past 7d
| where transport_protocol == TCP
| where destination.port == 135
| where number_of_failed_connections > 0

Note that it is not possible to summarize by IP address because the cardinality of IP addresses is too high. Instead, you can configure a Network Application based on IP address, IP subnet, network port, or domain name. Afterward, you can filter connection events using the Network Application name, for example:

Code
connection.events during past 7d
| where transport_protocol == TCP
| where network_application.name == "XYZ"
| where number_of_failed_connections > 0

Investigating TCP Connections

The two main metrics to gain visibility on the quality of TCP connections are:

  • The connection round-trip time (RTT): The connection RTT is available for all TCP events with established connections and can be accessed through establishment_time.avg(). It is a good indicator of slow connections.

  • The failed connections ratio: The number of failed connections over the number of attempted connections (established and failed). It can be accessed through failed_connection_ratio.avg().

Always evaluate the failed connections ratio and its fluctuation together with the number of failed connections or the number of attempted connections. For example, if the ratio of failed connections is 100% but the number of attempted connections equals 1, the result is not worth investigating further.
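
One way to apply this principle is to ignore ratios that are based on too few attempts. The Python sketch below shows the idea. The function name and both thresholds are hypothetical values chosen for illustration, not Nexthink defaults.

```python
MIN_ATTEMPTS = 20   # hypothetical: ignore ratios based on fewer attempts
MIN_RATIO = 0.10    # hypothetical: ratio considered worth a closer look


def worth_investigating(failed: int, attempted: int,
                        min_attempts: int = MIN_ATTEMPTS,
                        min_ratio: float = MIN_RATIO) -> bool:
    """Flag a failed connections ratio only when enough attempts back it up."""
    if attempted == 0 or attempted < min_attempts:
        return False  # too little data for the ratio to be meaningful
    return failed / attempted >= min_ratio


print(worth_investigating(failed=1, attempted=1))     # False: 100% of 1 attempt
print(worth_investigating(failed=30, attempted=100))  # True: 30% of 100 attempts
```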

Example: Investigation of VPN connectivity issues

Find an example below of a live dashboard to investigate VPN connectivity issues. Notice that the application dimension is fixed to the VPN binaries.

Find below the NQL queries from the example of investigating VPN connectivity issues:

Code
connection.tcp_events during past 24h
| where binary.name in ["VPN_binary_Windows", "VPN_binary_macOS"]
| where destination.domain == "VPN_edge_domain_name"
| where number_of_established_connections > 0
| summarize Devices__ = device.name.count(), Avg_RTT = establishment_time.avg(), Failed_Connections_Ratio_in_percent = (number_of_failed_connections.sum()) / ((number_of_established_connections.sum()) + (number_of_failed_connections.sum())) * 100, Failed_Connections = number_of_failed_connections.sum() by context.location.country, destination.country, destination.datacenter_region
| sort Failed_Connections desc

Code
connection.tcp_events during past 720min
| where binary.name in ["VPN_binary_Windows", "VPN_binary_macOS"]
| where destination.domain == "VPN_edge_domain_name"
| where number_of_established_connections > 0
| summarize average_RTT = establishment_time.avg() by 15min

Investigating UDP Traffic

Because UDP is connectionless, the investigation of UDP network traffic is more limited than that of TCP network traffic. The main approach is to look for changes and differences in the amount of outgoing UDP traffic, for example by comparing the average traffic per device for one application.

Code
connection.events during past 7d
| where transport_protocol == UDP
| where number_of_successful_connections > 0
| where outgoing_traffic == 0
| summarize idle_connections = number_of_successful_connections.sum() by binary.name
| sort idle_connections desc

You can apply the same troubleshooting approach to Network view for connections data visualization, filtering and drill-downs (including transport protocols).

Application Connectivity in Nexthink Infinity

Find below the related documentation of some of the Nexthink features compatible with Application Connectivity’s connections data and queries:

  • Investigations using the NQL editor

  • Connections Timeline in the Device View

  • System Monitors for Alerts:

    • Binary connection establishment time increase

    • Binary failed connection ratio increase

  • Network view enabled for Network and Desktop Applications, Investigations, and Device view.

You can use the Application Connectivity queries included on this page for all NQL-based features in the Nexthink web interface.

Overseeing data privacy

Refer to the Configuring Collector level anonymization and Roles documentation to anonymize, filter, and control connections data privacy.

Implementation Aspects

The following implementation aspects impact connection events:

Domain name reporting

The Collector retrieves the destination domain names from the TLS Server Name Indication (SNI) extension and also from monitoring DNS requests on the device. The Collector does not report the domain name if it cannot retrieve the domain name from SNI or the application does not use the OS for the DNS look-up.

Some applications, such as the Google Chrome web browser, encrypt the SNI. This is called TLS Encrypted Client Hello (ECH). The Collector cannot report the domain name for connections that use ECH. Users can disable ECH in the application to ensure that the domain name is reported.

The Collector reports “multiple domain names” instead of a real domain name when there are multiple connection events for the same binary and user, on the same device, at the same time, and to the same destination (same IP address and port), but with different domain names.
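
The rule can be illustrated with a short Python sketch. The function name reported_domain is hypothetical, and the exact value stored in the domain field is assumed to be the literal text quoted above.

```python
from typing import Iterable, Optional


def reported_domain(domains: Iterable[Optional[str]]) -> Optional[str]:
    """Pick the domain to report for events that share the same binary, user,
    device, time, IP address, and port."""
    distinct = {d for d in domains if d}
    if not distinct:
        return None  # no domain name could be retrieved
    if len(distinct) == 1:
        return next(iter(distinct))
    # Several different names resolved to the same destination.
    return "multiple domain names"


print(reported_domain(["a.example.com", "a.example.com"]))  # a.example.com
print(reported_domain(["a.example.com", "b.example.com"]))  # multiple domain names
```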

Domain name shortening

The Collector shortens long and randomized domain names by evaluating various mathematical and heuristic properties.

Maximum number of connections per binary or user

The Collector reports up to 512 connections per binary and user within 5 minutes. If there are more than 512 connections for one binary and user, for example during a “port scan”, the Collector reports only one event for all these connections for this time interval. For such events, the following fields are NULL:

  • destination.ip_address

  • destination.ip_subnet

  • destination.port

  • destination.type

  • destination.owner

  • destination.country

  • destination.datacenter_region

  • destination.domain

  • establishment_time
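
The following Python sketch illustrates the effect of this limit on the reported data. It is a simplified model and not the Collector's implementation: the function name and the connection_count key are hypothetical.

```python
MAX_CONNECTIONS = 512  # per binary and user within one 5-minute interval

# Fields that are NULL in the single collapsed event (from the list above).
NULLED_FIELDS = (
    "destination.ip_address", "destination.ip_subnet", "destination.port",
    "destination.type", "destination.owner", "destination.country",
    "destination.datacenter_region", "destination.domain",
    "establishment_time",
)


def report_interval(connections: list[dict]) -> list[dict]:
    """Return what is reported for one binary and user in one 5-minute interval."""
    if len(connections) <= MAX_CONNECTIONS:
        return connections  # reported individually, with destination details
    # Over the limit: one event for all connections, destination details dropped.
    collapsed = {field: None for field in NULLED_FIELDS}
    collapsed["connection_count"] = len(connections)  # hypothetical key
    return [collapsed]


scan = [{"destination.port": port} for port in range(1, 1001)]
print(len(report_interval(scan)))  # 1
```

When investigating a destination, keep in mind that such collapsed events cannot be matched by filters on destination fields because those fields are NULL.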

Connections through a VPN

VPN and secure access service edge (SASE) products implement various solutions to intercept and route connections. The Collector reports connections data for many VPN and SASE products out-of-the-box. For some products, the Collector reports incomplete data. This depends on the product, the operating system of the device, and the configuration of the product.

