LLM Observability

Our journey into integrating Large Language Models into our incident response platform has been a revelation in both capability and complexity. LLMs, by their nature, are nondeterministic black boxes, offering powerful capabilities while presenting unique challenges. One of the most profound lessons we've learned is that the real-world application of LLMs unfolds in ways that are impossible to fully anticipate during the development phase. Users engage with LLM-based applications with an unpredictability that demands adaptability and insight.

In response to this, ilert has embraced a philosophy where real-world usage data becomes the cornerstone of our AI feature development and refinement process. Recognizing that user interactions provide the richest insights for improvement, we’re incorporating user feedback and have implemented an intermediate observability layer that collects telemetry data for every interaction with an LLM:

  1. User Feedback Collection: Simple yet effective, a thumbs-up or thumbs-down response solicited directly from our users. This immediate gauge of user satisfaction allows us to quickly identify and address areas needing refinement.

  2. Intermediate Observability Layer: To deepen our understanding and enhance the responsiveness of our AI features, we've established an intermediate layer that captures telemetry data for every interaction, including the items below (a minimal code sketch follows the list):

  - User Inputs: What queries or commands users are submitting to the system.
  - LLM Outputs: The responses generated by the LLM, which are crucial for assessing the appropriateness and accuracy of the model's outputs.
  - Error Logging: Beyond mere system failures, we track instances where the LLM's output, although generated successfully, leads to errors downstream due to being contextually off-target or otherwise inappropriate.
  - Token Usage Metrics: Monitoring the total number of input and output tokens used helps us optimize our models for efficiency and cost-effectiveness.
  - LLM Response Time: We track and monitor the response times of LLMs; the most advanced models usually have longer response times.
  - Prompt Version and LLM Model: For every interaction, we store which model and which version of our prompt was used.
  - Feedback Integration: The direct feedback from users is linked with specific interactions, allowing us to pinpoint and prioritize enhancements.
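To make this concrete, here is a minimal sketch of what such a telemetry wrapper could look like. The field and function names are illustrative assumptions, not our actual schema or code:

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Illustrative telemetry record; field names are assumptions, not an actual schema.
@dataclass
class LLMInteraction:
    interaction_id: str
    prompt_version: str
    model: str
    user_input: str
    llm_output: Optional[str] = None
    error: Optional[str] = None          # downstream or contextual errors, not only API failures
    input_tokens: int = 0
    output_tokens: int = 0
    response_time_ms: float = 0.0
    user_feedback: Optional[str] = None  # "up" / "down", linked later via interaction_id

def observed_call(
    model: str,
    prompt_version: str,
    user_input: str,
    llm_call: Callable[[str], Tuple[str, int, int]],
) -> LLMInteraction:
    """Wrap a single LLM call and capture telemetry for it."""
    record = LLMInteraction(
        interaction_id=str(uuid.uuid4()),
        prompt_version=prompt_version,
        model=model,
        user_input=user_input,
    )
    start = time.monotonic()
    try:
        # llm_call is any function returning (text, input_tokens, output_tokens)
        text, in_tok, out_tok = llm_call(user_input)
        record.llm_output = text
        record.input_tokens = in_tok
        record.output_tokens = out_tok
    except Exception as exc:
        record.error = str(exc)
    finally:
        record.response_time_ms = (time.monotonic() - start) * 1000
        # persist(record)  # e.g. hand the record to the telemetry store / analytics pipeline
    return record
```

The thumbs-up or thumbs-down from the UI would then be attached to the stored record via its interaction_id, so each piece of feedback can be traced back to the exact prompt version and model that produced the response.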

Model Selection Strategy

Our approach to model selection emphasizes starting with high-performance models to ensure the best possible results, prioritizing outcome quality over cost and response time. This initially allows us to confirm the effectiveness of an AI feature.

Subsequently, we consider transitioning to more cost-efficient models after validating the feature's success and gathering sufficient real-world usage data. This ensures that any move to a less powerful model does not compromise user experience.
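As an illustrative sketch of how the collected telemetry can feed this decision, a feature can be flagged as a candidate for a more cost-efficient model once it has enough validated usage. The thresholds and names below are assumptions; in practice the switch is a deliberate evaluation, not an automatic one:

```python
# Illustrative thresholds; real values depend on the feature and are assumptions here.
MIN_INTERACTIONS = 500        # enough real-world usage data collected
MIN_POSITIVE_FEEDBACK = 0.9   # share of thumbs-up responses from users

def downgrade_candidate(interaction_count: int, positive_feedback_rate: float) -> bool:
    """Flag a feature whose telemetry suggests a cheaper model could be evaluated."""
    return (
        interaction_count >= MIN_INTERACTIONS
        and positive_feedback_rate >= MIN_POSITIVE_FEEDBACK
    )

# Example: 1,200 logged interactions with 93% positive feedback -> worth evaluating a smaller model
print(downgrade_candidate(1200, 0.93))  # True
```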

This comprehensive observability framework ensures that our AI features do not just exist in a vacuum but evolve in a symbiotic relationship with our user base. It acknowledges the dynamic nature of LLM applications and the necessity of an iterative development process informed by real-world application. At ilert, we believe that the key to building reliable, user-centric AI-driven systems lies in embracing the unpredictability of user interaction, leveraging it as a rich source of feedback and innovation.
