Reduce Noise With Alert Deduplication

In this section, we’ll cover how text embedding models can be leveraged for automated alert deduplication to reduce alert noise in incident response.

What is alert deduplication?

Alert deduplication is the process of identifying multiple alerts that refer to the same underlying issue and consolidating them into a single alert to avoid redundancy. The primary goal is to reduce noise and prevent incident response teams from being overwhelmed by multiple notifications for the same issue.

Using Embeddings Similarity Search for Alert Deduplication

There are many methods to implement alert deduplication. These methods range from simple rule-based systems to more complex machine learning models, each with its own set of advantages and applications. However, traditional machine learning techniques like clustering and classification often require a solid understanding of data science principles and involve a more hands-on approach by data scientists. This section introduces an approach based on vector embeddings and the use of pre-trained models, which makes it more accessible for individuals without deep data science expertise.

To begin, we'll explore the necessary concepts for this method.

Vector embeddings are a mathematical representation of data in a high-dimensional space, where each point (or vector) represents a specific piece of data, such as a word, sentence, or an entire document. These embeddings capture the semantic relationships between data points, meaning that similar items are placed closer together in the vector space. This technique is widely used in natural language processing (NLP) and machine learning to enable computers to understand and process human language by converting text into a form that algorithms can work with. When you use ChatGPT, for example, your prompts are transformed into a series of numbers first (a vector). Similarly, we will transform alerts into vectors using an embedding model.

An embedding model is a type of machine learning model that learns to represent complex data, such as words, sentences, images, or graphs, as dense vectors of real numbers in a lower-dimensional space. The key idea behind embedding models is to capture the semantic relationships and features of the data in a way that positions similar items closer together in the embedding space. This transformation enables algorithms to perform mathematical operations on these embeddings, facilitating tasks like similarity comparison, clustering, and classification more effectively.

Example
// Input
"A sentence like this will be transformed into a series of (thousands) number"  

// Output 
[
  -0.006929283495992422,
  -0.005336422007530928,
  -4.547132266452536e-05,
  -0.024047505110502243,
  ... // thousands more numbers
]

OK, but how can we use this for alert deduplication?

We will transform alerts into vector embeddings using OpenAI’s text embedding model. By comparing these vectors, we identify and deduplicate alerts that are semantically similar, even if they do not match exactly on a textual level.

The following sections detail the steps involved in the process:

Step 1: Preprocessing Alerts

  • Normalization: Standardize the format of incoming alerts to ensure consistency. If you’re using an alerting system like ilert, which sits on top of multiple alert sources and observability tools, alerts are already normalized into a common format.

  • Cleaning:

    • Remove irrelevant information or noise from alerts, such as timestamps (which might be unique to each alert but irrelevant for deduplication) or alert IDs.

    • Use plain text and avoid markdown or JSON. This not only reduces the number of tokens used, but also prevents formatting artifacts from influencing the similarity comparison.
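
As an illustration, a minimal cleaning step might look like the Python sketch below. The summary and details field names are assumptions; adapt them to the payload of your alert source.

import re

def clean_alert(alert: dict) -> str:
    # Flatten the alert into plain text; "summary" and "details" are
    # hypothetical field names, adapt them to your alert payload.
    text = " ".join(filter(None, [alert.get("summary"), alert.get("details")]))
    # Strip ISO-8601 timestamps: unique per alert, irrelevant for deduplication.
    text = re.sub(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?", "", text)
    # Strip UUID-style alert IDs for the same reason.
    text = re.sub(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}", "", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()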

Step 2: Vectorization / Generating Text Embeddings

  • Text Embeddings Model Selection: Choose an appropriate text embeddings model that can convert alert messages into vectors. Models like BERT, OpenAI’s text embeddings, or Sentence-BERT (specially designed for sentence embeddings) can be suitable.

  • Vectorization: Each incoming alert is transformed into a vector using the selected model and stored in a vector database. Models trained on large datasets, including natural language text, can capture a wide range of semantic meanings, making them suitable for encoding the information contained in alerts.
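
A minimal vectorization sketch using the OpenAI Python SDK might look as follows. The model name is only an example, and the in-memory list stands in for a real vector database.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed_alert(text: str) -> list[float]:
    # Convert the cleaned alert text into a vector embedding.
    response = client.embeddings.create(
        model="text-embedding-3-small",  # example model choice
        input=text,
    )
    return response.data[0].embedding

# In production, vectors would be stored in a vector database;
# a plain list of (alert_id, vector) pairs is enough to illustrate the flow.
stored_alerts: list[tuple[str, list[float]]] = []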

Step 3: Deduplication Logic

  • Similarity Measurement: Use a similarity measure to compare the vectorized alerts. The similarity between embeddings is measured with metrics such as cosine similarity or Euclidean distance, which quantify how close two embeddings are in the vector space; the closer two embeddings are, the more similar their semantic content. OpenAI recommends using cosine similarity.

  • Threshold Setting: A threshold is set to determine when two alerts are considered duplicates. If the similarity score between an incoming alert and any existing alert exceeds this threshold, the alerts are considered duplicates. This threshold can be tuned based on the precision and recall requirements of your use case.

  • Deduplication and Clustering: When two alerts are identified as duplicates, they are consolidated into a single alert record, with a counter to indicate the number of duplicate alerts received.

  • Optional Summary Generation: Use a GenAI model to generate concise summaries for clusters of duplicate alerts. This step can aggregate the key information from multiple alerts into a single, easily digestible notification.
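
Putting the deduplication logic together, a sketch of the similarity check could look like this. The threshold of 0.9 is purely illustrative, and the linear scan stands in for a vector database query.

import numpy as np

SIMILARITY_THRESHOLD = 0.9  # illustrative starting point; tune empirically

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means identical direction.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_duplicate(new_vector, stored_alerts):
    # stored_alerts: list of (alert_id, vector) pairs.
    # Returns the id of the closest stored alert above the threshold, or None.
    best_id, best_score = None, SIMILARITY_THRESHOLD
    for alert_id, vector in stored_alerts:
        score = cosine_similarity(new_vector, vector)
        if score >= best_score:
            best_id, best_score = alert_id, score
    return best_id

If find_duplicate returns an id, the incoming alert is merged into that record and its duplicate counter is incremented; otherwise it is stored as a new alert.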

Step 4: Feedback Loop

Implement a feedback mechanism where operators can mark false positives or missed duplicates. Use this feedback to fine-tune the similarity threshold.
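
One simple, hypothetical way to act on that feedback is to nudge the threshold in the direction that reduces the more frequent error type; a production system would rather re-evaluate the threshold against a labeled set of alert pairs.

def adjust_threshold(threshold: float, false_positives: int,
                     missed_duplicates: int, step: float = 0.01) -> float:
    # Hypothetical tuning rule based on operator feedback counts.
    if false_positives > missed_duplicates:
        threshold += step  # merging too aggressively: demand higher similarity
    elif missed_duplicates > false_positives:
        threshold -= step  # missing duplicates: allow lower similarity
    return min(max(threshold, 0.0), 1.0)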

The screenshot below shows how you can enable intelligent alert grouping in the alert source settings.

Advantages

The advantages of using embeddings for alert deduplication include:

Semantic Understanding:

Unlike exact text matching, embeddings can capture the meaning of alerts, allowing for the deduplication of alerts that are semantically similar but not textually identical.

Flexibility:

This method can handle variations in alert wording or structure, making it robust against changes in alert formats or sources.

Scalability:

Embeddings and similarity searches can be efficiently implemented using vector databases and libraries, making this approach scalable to handle large volumes of alerts.

Challenges and Considerations

Model Selection:

The effectiveness of embeddings for deduplication depends on the quality of the embedding model. Domain-specific models or fine-tuned models may offer better performance by capturing relevant nuances.

Threshold Tuning:

Determining the optimal threshold for deduplication requires balancing between false positives (incorrectly merging distinct alerts) and false negatives (failing to identify duplicates). This may require empirical testing and adjustment.

Continuous Learning:

Over time, the nature of alerts may evolve, necessitating updates to the model or reevaluation of the similarity threshold to maintain deduplication effectiveness.
