Detecting issues impacting multiple devices

Detect issues impacting multiple devices to allow application and network L2+ teams to proactively respond to global issues in their specific areas. Notify relevant application owners about issues impacting their applications. Using the following use cases, evaluate:

  • The number of impacted devices or users, for example, the number of devices with specific application crashes.

  • Frequent issues across devices, for example, the number of specific application crashes across all devices.

Both approaches are vital and often complement each other. Use one or both of them when configuring monitor trigger conditions so that the system does not trigger alerts or send notifications for issues that are not relevant to the recipient. For example, the system triggers an alert only when the number of specific application crashes across all devices exceeds 20 and the crashes affect more than 5 devices. The system then notifies the application owner.
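
The combined condition in the example above can be sketched in Python. This only illustrates the logic and is not how monitors are configured in the product. The function name should_alert and its parameters are hypothetical; the default thresholds of 20 crashes and 5 devices come from the example.

```python
def should_alert(total_crashes: int, devices_with_crashes: int,
                 crash_threshold: int = 20, device_threshold: int = 5) -> bool:
    """Return True only when both conditions from the example hold.

    Hypothetical helper: the names and defaults are illustrative assumptions.
    """
    return (total_crashes > crash_threshold
            and devices_with_crashes > device_threshold)


# Many crashes on a single device: not relevant to the application owner.
print(should_alert(total_crashes=40, devices_with_crashes=1))  # False
# Many crashes spread over several devices: alert.
print(should_alert(total_crashes=40, devices_with_crashes=8))  # True
```

Requiring both conditions filters out the case where one faulty device produces most of the crashes.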

The following sections describe two use cases in detail.

Monitoring the number of devices or users with issues

Detect the number of devices or users with an issue to proactively monitor issues impacting multiple devices.

Create an NQL query that returns a summarized number of devices. Optionally, you can use the by keyword to group your results. The system triggers an alert per group.

devices during past 7d
| with execution.crashes during past 7d
| where binary.name = "teams.exe"
| summarize nr_of_devices = count() by entity

Notifications

The system sends notifications for all devices at once, or if the query includes the by clause, for each group separately.

Alerts overview dashboard

In the Alerts overview dashboard, the alert is displayed in a single line without the context-related label. If grouping has been added, the alert is displayed for each group in a separate line with context about the grouping.

Monitoring frequent issues across devices

Detect an issue across multiple devices that is reflected in an aggregated metric value.

Create an NQL query that returns a summarized metric value. Optionally, group your results using the by keyword. The system triggers an alert per group.

execution.crashes during past 24h
| summarize 
  total_number_of_crashes = count(), 
  devices_with_crashes = device.count()
by binary.name

Notifications

The system sends notifications for a single metric, or if the query includes the by clause, for each group separately.
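
As a rough illustration of per-group alerting, the sketch below evaluates a threshold against each row of a grouped result and returns one entry per group that exceeds it. The helper groups_to_alert, the sample rows, and the threshold of 20 are assumptions made for illustration, not product behavior or real data.

```python
from typing import Dict, List


def groups_to_alert(rows: List[Dict], metric: str, group_key: str,
                    threshold: float) -> List[str]:
    """Return the group values whose metric exceeds the threshold.

    Each returned value stands for one separate alert (hypothetical helper).
    """
    return [row[group_key] for row in rows if row[metric] > threshold]


# Made-up query result: one row per binary.name group.
rows = [
    {"binary.name": "teams.exe", "total_number_of_crashes": 35,
     "devices_with_crashes": 12},
    {"binary.name": "outlook.exe", "total_number_of_crashes": 4,
     "devices_with_crashes": 3},
]

alerts = groups_to_alert(rows, "total_number_of_crashes", "binary.name",
                         threshold=20)
print(alerts)  # ['teams.exe']
```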

Alerts overview dashboard

In the Alerts overview dashboard, the alert is displayed in a single line without the context-related label. If grouping has been added, the alert is displayed for each group in a separate line with context about the grouping.

Refer to the NQL examples below and the NQL data model documentation for more information about NQL.

NQL Examples

Below is a list of NQL query examples to help you create and edit monitors. Review the queries and pick the one most similar to the monitor you are creating or editing. Copy the query and adjust it to your use case, including the thresholds that have been provided as an example.

Detect specific web errors for an application.

This NQL query returns the aggregated number of errors and devices with errors for a specific application and triggers the alert per specific error code separately:

web.errors during past 1h
| where application.name in ["Jenkins"]
| where error.code !in [405, 404, 403]
| summarize nr_of_devices_impacted = device.count(), nr_of_errors = count() by label

Detect applications with a high web error ratio.

Set additional thresholds, for example on the number of page views and on the number of users with issues, so that alerts trigger only when there is enough usage volume and enough errors. This avoids false positives.

application.applications
| with web.page_views during past 60min
| where is_soft_navigation = false
| compute total_number_of_page_views = number_of_page_views.sum(), all_users = user.count()
| with web.errors during past 60min
| compute number_page_views_with_error = error.number_of_errors.sum(), users_with_errors = user.count()
| summarize
  web_errors_ratio = number_page_views_with_error.sum() * 100 / total_number_of_page_views.sum(),
  number_of_errors = number_page_views_with_error.sum(),
  users_with_issues = users_with_errors.sum(),
  ratio_of_users_with_issues = users_with_errors.sum() * 100 / all_users.sum()
by application.name
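
The ratio arithmetic in this query, combined with volume thresholds, can be sketched as follows. The function names and all threshold values (10 percent, 100 page views, 5 users) are hypothetical and only show why a high ratio alone is not enough.

```python
def web_error_ratio(page_views_with_error: int, total_page_views: int) -> float:
    """Percentage of page views with errors, mirroring web_errors_ratio."""
    if total_page_views == 0:
        return 0.0  # no usage, so no meaningful ratio
    return page_views_with_error * 100 / total_page_views


def is_relevant(page_views_with_error: int, total_page_views: int,
                users_with_errors: int, ratio_threshold: float = 10.0,
                min_page_views: int = 100,
                min_users_with_errors: int = 5) -> bool:
    """Combine the ratio with volume thresholds (all values are assumptions)."""
    ratio = web_error_ratio(page_views_with_error, total_page_views)
    return (ratio > ratio_threshold
            and total_page_views >= min_page_views
            and users_with_errors >= min_users_with_errors)


# 1 error out of 2 page views is a 50% ratio, but the volume is too low.
print(is_relevant(1, 2, users_with_errors=1))     # False
# 30 errors out of 200 page views (15%) affecting 6 users.
print(is_relevant(30, 200, users_with_errors=6))  # True
```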

Detect a high number of crashes for binaries.

execution.crashes during past 24h
| summarize total_number_of_crashes = count(), devices_with_crashes = device.count() by binary.name
| sort total_number_of_crashes desc

Detect a high number of devices with a long boot time, grouped by country using Geolocation.

A long boot time is defined as time_until_desktop_is_visible >= 60s.

devices
| with session.logins during past 24h
| compute total_devices = device.count(), avg_time_until_desktop_ready = time_until_desktop_is_ready.avg(), avg_time_until_desktop_visible = time_until_desktop_is_visible.avg()
| include session.logins during past 24h
| where time_until_desktop_is_visible >= 60s
| compute number_of_device_with_long_login = device.count()
| summarize
  percentage_of_devices_with_issue = number_of_device_with_long_login.sum() * 100 / total_devices.sum(),
  average_time_until_desktop_ready = avg_time_until_desktop_ready.avg(),
  average_time_until_desktop_visible = avg_time_until_desktop_visible.avg(),
  number_of_devices_with_issue = number_of_device_with_long_login.sum()
by public_ip.country

Virtualization alert for when the average CPU queue length per desktop pool is >= 3.

device_performance.events during past 30min
| where device.virtualization.desktop_pool != null
| summarize Average_cpu_queue_length = cpu_queue_length.avg() / number_of_logical_processors.avg() by device.virtualization.desktop_pool
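
The query divides the average queue length by the average number of logical processors, so the metric is normalized per logical processor. A minimal Python sketch of that arithmetic for one desktop pool follows. The function name and the sample values are made up for illustration, and applying the threshold of 3 to the normalized value is an assumption.

```python
from statistics import mean
from typing import List, Tuple


def normalized_queue_length(samples: List[Tuple[float, int]]) -> float:
    """Average CPU queue length divided by average logical processor count.

    samples holds (cpu_queue_length, number_of_logical_processors) pairs for
    one desktop pool and must not be empty. Like the query, this takes the
    ratio of the two averages, not the average of per-sample ratios.
    """
    avg_queue = mean(queue for queue, _ in samples)
    avg_processors = mean(processors for _, processors in samples)
    return avg_queue / avg_processors


# Made-up events for one pool: queue lengths 16 and 8 on 4 logical processors.
value = normalized_queue_length([(16, 4), (8, 4)])
print(value)       # 3.0
print(value >= 3)  # True, so this pool would meet the example threshold
```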
