
 

 

WHAT TO MONITOR | OBJECT DETECTION & CLASSIFICATION

 

A Model Monitoring Framework for

Computer Vision in Production

 

Introduction and Purpose

 
Object detection and classification models power critical use cases—from autonomous vehicles and factory automation to medical diagnostics and smart retail. But in production, even a small detection miss can have massive downstream consequences—missed anomalies, mislabeling, or a dangerous lapse in safety.

This guide gives you a framework to monitor computer vision systems in production. It's designed to help teams spot failure modes early, quantify risk by class or scenario, and route actionable signals back into your development loop.

Why Traditional Monitoring Falls Short

 
Object classification models often fail in ways that are easy to miss but have big consequences. Performance can slowly decline for certain object classes, angles, or environments. Camera changes or lighting shifts may quietly disrupt input quality. Label distribution drift can cause rare classes to be underrepresented. And missing or incomplete human-in-the-loop feedback makes it hard to validate model outputs.

mAP and latency alone won’t catch these issues. You need granular monitoring across model behavior, data integrity, and human feedback to stay ahead.

 

 Intro Videos | Model Performance Insights for Computer Vision

 

 

 

Use Case: Flower Detection & Classification

 

 
In this general overview of Mona's model performance insights platform for object detection and classification, we'll walk you through how Mona detects anomalous behavior, generates root cause intelligence and alerts your team that there's an issue. 
 
 

 

Use Case: Model Performance Insights for CT Scan Models

 

 
In this overview of Mona's model performance insights platform for AI-driven radiology systems in the healthcare industry, we'll walk through the experience of anomaly detection, investigation and root cause analysis, and alerts & collaboration.
 
 
 

Core Pillars of Computer Vision Model Monitoring

 

Your monitoring strategy should cover five core pillars, each with specific metrics that help you proactively detect, understand, and resolve model degradation in real-world environments.
 

Model Effectiveness & Detection Quality

 

Aggregate mAP might look stable, but your model may be silently failing for specific objects, regions, or scenarios. Slice performance down to the class or environment level and spot performance cliffs early.
 
  Goal: Know when and where your object model is underperforming—even if the system isn’t crashing.

Data Integrity

 

A camera upgrade, a cropping bug, or a mislabeled feed can quietly destroy detection accuracy. Mona maps model performance directly to input quality and source integrity so you can trace root causes fast.
 
  Goal: Catch issues in data capture or preprocessing pipelines before they affect predictions.

Data Representation Quality

 
Models trained in ideal lab conditions often struggle in real-world lighting, background, or occlusion scenarios. Mona highlights exactly where real-world inputs diverge from your training assumptions.
 
  Goal: Ensure your live image data still matches what your model was trained to see.

Application Performance

 
Computer vision workloads are heavy. Mona integrates operational metrics with model behavior so you know if a GPU spike is affecting specific camera feeds or object classes—not just that something’s slow.
 
  Goal: Ensure your model performs efficiently under production conditions.

Human-in-the-Loop

 

For many industries, expert review is part of the production loop. Mona can include this data in your monitoring pipeline so you can pinpoint exactly where the model and human judgment diverge.
 
  Goal: Track how human reviewers and annotators interact with the model’s predictions.

 

What to monitor for 

Model Effectiveness & Detection Quality

 

 

 

mAP (mean Average Precision):

 
Over time, across object classes and bounding box thresholds

Measures how accurately the model detects and classifies objects across all classes and bounding box thresholds. Monitoring mAP over time helps teams track whether detection quality is improving, staying consistent, or degrading—especially useful when evaluating production performance by object class or detection scenario.
 
  TRAINING, VALIDATION, PRODUCTION

Per-class precision and recall:

 
Are some objects being over/under-detected?

These metrics evaluate how often the model correctly identifies each object class (precision) and how well it finds all instances of those classes (recall). Monitoring by class detects class-specific blind spots where the model may be overconfident or missing detections entirely.
 
 
 TRAINING, VALIDATION, PRODUCTION
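
As a minimal sketch of how per-class precision and recall might be tracked, assuming you already have per-image counts of true positives, false positives, and false negatives keyed by class (the class names and record structure here are illustrative):

```python
from collections import defaultdict

def per_class_precision_recall(detection_records):
    """Aggregate per-class precision and recall from matched detections.

    `detection_records` is assumed to be an iterable of dicts like
    {"class": "person", "tp": 3, "fp": 1, "fn": 0} produced by your own
    matching logic (e.g., IoU-based matching against ground truth).
    """
    totals = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for rec in detection_records:
        for key in ("tp", "fp", "fn"):
            totals[rec["class"]][key] += rec.get(key, 0)

    metrics = {}
    for cls, t in totals.items():
        precision = t["tp"] / (t["tp"] + t["fp"]) if (t["tp"] + t["fp"]) else 0.0
        recall = t["tp"] / (t["tp"] + t["fn"]) if (t["tp"] + t["fn"]) else 0.0
        metrics[cls] = {"precision": precision, "recall": recall}
    return metrics

# Illustrative records: the rare class is clearly struggling
records = [
    {"class": "person", "tp": 8, "fp": 1, "fn": 2},
    {"class": "forklift", "tp": 2, "fp": 3, "fn": 4},
]
print(per_class_precision_recall(records))
```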

False positives / false negatives:

 
By class, especially for safety-critical or rare items
 
Tracking these metrics reveals whether the model frequently detects objects that aren’t there (false positives) or misses objects it should detect (false negatives). This is especially critical for rare or safety-critical classes, where incorrect predictions have a high operational cost.
 
 
TRAINING, VALIDATION, PRODUCTION
Intersection over Union (IoU):
 
How tight are bounding boxes to the real object?
 
This measures how well the predicted bounding box overlaps with the actual object. It helps spot issues with localization accuracy that can vary by camera angle, resolution, or lighting conditions.
 
 VALIDATION, PRODUCTION
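
For reference, IoU between a predicted and a ground-truth box can be computed along these lines (boxes here are assumed to be [x_min, y_min, x_max, y_max] in pixels):

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes [x_min, y_min, x_max, y_max]."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Tight vs. loose localization of the same object
print(iou([10, 10, 50, 50], [12, 12, 52, 52]))  # high overlap
print(iou([10, 10, 50, 50], [40, 40, 90, 90]))  # poor overlap
```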

Confidence score trends:

 

Are prediction confidence levels shifting?
 
Tracks the distribution of model confidence scores over time and by class. Unusual changes can signal drift, uncertainty, or overfitting to specific input conditions.


 
 
 VALIDATION, PRODUCTION
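
One simple way to watch for shifts in the confidence distribution is to compare a recent window of scores against a reference window with a two-sample test, as in this sketch (the p-value threshold and window sizes are illustrative, and SciPy is assumed to be available):

```python
import numpy as np
from scipy.stats import ks_2samp

def confidence_shift(reference_scores, recent_scores, p_threshold=0.01):
    """Flag a shift in confidence scores using a two-sample KS test."""
    stat, p_value = ks_2samp(reference_scores, recent_scores)
    return {
        "ks_statistic": float(stat),
        "p_value": float(p_value),
        "shifted": p_value < p_threshold,
        "reference_mean": float(np.mean(reference_scores)),
        "recent_mean": float(np.mean(recent_scores)),
    }

# Illustrative example: recent scores drift lower than the reference window
rng = np.random.default_rng(0)
reference = rng.beta(8, 2, size=5000)   # mostly confident predictions
recent = rng.beta(5, 3, size=5000)      # noticeably less confident
print(confidence_shift(reference, recent))
```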

Model version comparison:

  
Did the new version outperform the old one in production?
 
 Compares detection performance across model versions to ensure updates improve detection quality rather than introduce new issues.
 
 VALIDATION, PRODUCTION

 

What to monitor for 

Data Integrity

 

 

 

Missing or corrupt image files:

Are input images complete and valid?

Identifies instances where the model receives unusable input due to corruption or failed image retrieval. Frequent occurrences can indicate upstream pipeline issues that disrupt inference quality.
 
 
  DEPLOYMENT, VALIDATION, PRODUCTION

Image resolution drift:

 
Are image sizes staying consistent?

Tracks changes in image resolution over time, which may happen due to camera updates or cropping bugs. Since model accuracy is sensitive to input size, monitoring this helps prevent silent quality drops.
 
 DEPLOYMENT, VALIDATION, PRODUCTION

Aspect ratio inconsistencies:

 
Are image dimensions getting distorted?

Monitors variations in image shape (height-to-width ratio), which can stretch or compress image content and degrade model performance. Drift in aspect ratio may signal problems in preprocessing or camera configuration.
 
 DEPLOYMENT, VALIDATION, PRODUCTION
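
A lightweight sketch of the corrupt-file, resolution, and aspect-ratio checks above, using Pillow (the expected resolution and tolerance are assumptions you would set per camera or source):

```python
from PIL import Image

EXPECTED_SIZE = (1920, 1080)       # assumed nominal resolution for this source
ASPECT_TOLERANCE = 0.05            # allowed relative deviation from the expected ratio

def check_image(path):
    """Return basic integrity signals for one image file."""
    result = {"path": path, "corrupt": False, "size": None,
              "resolution_changed": False, "aspect_drift": False}
    try:
        with Image.open(path) as img:
            img.verify()           # detects truncated/corrupt files
        with Image.open(path) as img:
            width, height = img.size
    except Exception:
        result["corrupt"] = True
        return result

    result["size"] = (width, height)
    result["resolution_changed"] = (width, height) != EXPECTED_SIZE

    expected_ratio = EXPECTED_SIZE[0] / EXPECTED_SIZE[1]
    ratio = width / height
    result["aspect_drift"] = abs(ratio - expected_ratio) / expected_ratio > ASPECT_TOLERANCE
    return result

# Example (the path is illustrative)
# print(check_image("camera_07/frame_000123.jpg"))
```
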
Unexpected file formats or encodings:
 
Are image formats consistent with training?
 
Flags image inputs that deviate from expected formats or encodings. Such discrepancies can cause pipeline failures, inference skips, or subtle shifts in model behavior.
 
  DEPLOYMENT, VALIDATION

Sudden spikes/drops in input volume:

 

Is image traffic stable?
 
Monitors how many images are being processed over time. Sharp changes may reflect ingestion delays, sensor outages, or batch processing errors that affect operational reliability.
 
 
PRODUCTION

Camera/source ID mismatches or drift:

 

Are images coming from the correct sources?
 
Detects when images are coming from unexpected or incorrect sources. Source-level drift can introduce unfamiliar data distributions and reduce the relevance of monitoring signals.
 
 DEPLOYMENT, VALIDATION, PRODUCTION

 

What to monitor for 

Data Representation Quality

 

 

 

Class distribution drift:

 
Are object classes showing up as expected?

Tracks how frequently different object classes appear in live data. A shift in this balance may cause the model to underperform for rare or newly prominent classes that are underrepresented in training.
 
 TRAINING, VALIDATION, PRODUCTION
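
A minimal sketch of comparing live class frequencies against a training baseline with a Population Stability Index style score (the class names and the 0.2 alert threshold are illustrative starting points):

```python
import math
from collections import Counter

def class_distribution_psi(baseline_labels, live_labels, eps=1e-6):
    """Population Stability Index between baseline and live class frequencies."""
    classes = set(baseline_labels) | set(live_labels)
    base_counts = Counter(baseline_labels)
    live_counts = Counter(live_labels)
    n_base, n_live = len(baseline_labels), len(live_labels)

    psi = 0.0
    for cls in classes:
        p = base_counts[cls] / n_base + eps
        q = live_counts[cls] / n_live + eps
        psi += (q - p) * math.log(q / p)
    return psi

baseline = ["car"] * 700 + ["person"] * 250 + ["bicycle"] * 50
live = ["car"] * 550 + ["person"] * 150 + ["bicycle"] * 300   # bicycles surged

psi = class_distribution_psi(baseline, live)
print(f"PSI = {psi:.3f}", "-> investigate" if psi > 0.2 else "-> stable")
```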

Lighting/environmental condition changes:

 
Is scene lighting changing significantly?

Monitors image brightness, contrast, and histogram characteristics to flag shifts in scene lighting or background. These environmental factors strongly affect visual recognition performance.
 
 
 VALIDATION, PRODUCTION

Bounding box size/location distribution changes:

 
Are objects showing up in new sizes or spots?

Tracks whether detected object sizes or their position within the frame are shifting. These changes may indicate different camera angles or scene compositions that weren’t present during training.
 
 VALIDATION, PRODUCTION
Camera angle / perspective variation:
 
Are new viewpoints appearing in production?
 
Detects changes in camera perspective based on metadata or derived features. New viewpoints can reduce detection accuracy if not represented in the training data.
 
 VALIDATION, PRODUCTION

Training vs inference dataset drift:

 

Is live data drifting from training data?
 
Compares feature-level distributions in training vs production using embedding-based metrics. Divergence between these datasets highlights when the model is making predictions on inputs it wasn’t optimized for.
 
 
TRAINING, VALIDATION, PRODUCTION
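
One rough way to quantify embedding-level drift is to compare the centroids of training and production embedding batches, as in this sketch (the embedding dimensionality and the simulated shift are placeholders; how you extract embeddings depends on your model):

```python
import numpy as np

def centroid_drift(train_embeddings, prod_embeddings):
    """Cosine distance between the mean embedding of training and production batches."""
    mu_train = train_embeddings.mean(axis=0)
    mu_prod = prod_embeddings.mean(axis=0)
    cos_sim = np.dot(mu_train, mu_prod) / (
        np.linalg.norm(mu_train) * np.linalg.norm(mu_prod)
    )
    return 1.0 - float(cos_sim)

# Illustrative data: production embeddings shifted relative to training
rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, size=(2000, 256))
prod = rng.normal(0.3, 1.0, size=(2000, 256))   # mean shift simulates drift

drift = centroid_drift(train, prod)
print(f"centroid cosine distance = {drift:.4f}")  # compare against a baseline you establish
```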

Training set version skew:

 

Did retraining shift label balance?
 
Flags inconsistencies introduced during retraining, such as class rebalancing or label changes. Helps teams understand if training adjustments unintentionally harmed production behavior.
 
 
 TRAINING, VALIDATION

 

What to monitor for 

Application Performance

 

 

 

Inference latency:

 
Is the model staying within real-time limits?

Measures the time it takes for the model to process each image. High or inconsistent latency can disrupt real-time systems and should be tracked across environments.
 
  DEPLOYMENT, PRODUCTION
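
A basic pattern for tracking per-image inference latency and its tail, assuming `model_predict` is your own inference call (shown here as a stub):

```python
import time
import numpy as np

def model_predict(image):
    """Stub standing in for your actual inference call."""
    time.sleep(0.02)  # simulate ~20 ms of work
    return []

def timed_predict(image, latencies):
    """Run inference and record wall-clock latency in milliseconds."""
    start = time.perf_counter()
    detections = model_predict(image)
    latencies.append((time.perf_counter() - start) * 1000.0)
    return detections

latencies = []
for _ in range(50):
    timed_predict(image=None, latencies=latencies)

print(f"p50 = {np.percentile(latencies, 50):.1f} ms, "
      f"p95 = {np.percentile(latencies, 95):.1f} ms")
```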

Dropped frames or image timeouts:

 
Are any frames being skipped?

Identifies instances where images are skipped or timeout before being processed. This can create blind spots in detection pipelines and lead to missed events.
 
  PRODUCTION

Throughput:

 
How many frames per second are processed?

Tracks how many frames per second the system can handle. Drops in throughput may indicate infrastructure bottlenecks or overloads.
 
  PRODUCTION

Memory and compute usage:

 
Is the model resource-efficient?

Monitors how much CPU, GPU, or RAM is used per inference. High resource usage can lead to crashes, throttling, or delayed processing.
 
 PRODUCTION

Model warm-up time:

 
How long to spin up the model?

Measures the time it takes for a model to become responsive after startup. Cold starts can delay detection in edge deployments or when scaling dynamically.
 
 DEPLOYMENT, PRODUCTION

Error logs by environment/camera/source:

 
Where do failures cluster?

Analyzes where errors occur most frequently in the system. This helps pinpoint whether failures are model-specific or due to environment-level issues.
 
  PRODUCTION

 

What to monitor for 

Human-in-the-Loop

 

 

 

Human review trigger rate:

 
How often does the model defer to humans?

Monitors how often the model flags inputs for manual review. Rising review rates can suggest the model is encountering unfamiliar inputs or degrading.
 
 VALIDATION, PRODUCTION

Manual override frequency:

 
How often are predictions fixed by people?

Tracks how often humans adjust the model’s outputs, such as editing bounding boxes or relabeling classes. High correction rates highlight weak spots in model predictions.
 
 VALIDATION, PRODUCTION
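
A small sketch of computing per-class override rates from review records (the record fields are illustrative; your review tooling will have its own schema):

```python
from collections import defaultdict

def override_rates(review_records):
    """Fraction of reviewed predictions that humans corrected, per class.

    Each record is assumed to look like:
    {"predicted_class": "car", "overridden": True}
    """
    reviewed = defaultdict(int)
    overridden = defaultdict(int)
    for rec in review_records:
        cls = rec["predicted_class"]
        reviewed[cls] += 1
        if rec["overridden"]:
            overridden[cls] += 1
    return {cls: overridden[cls] / reviewed[cls] for cls in reviewed}

records = [
    {"predicted_class": "car", "overridden": False},
    {"predicted_class": "car", "overridden": False},
    {"predicted_class": "pedestrian", "overridden": True},
    {"predicted_class": "pedestrian", "overridden": False},
]
print(override_rates(records))   # e.g., {'car': 0.0, 'pedestrian': 0.5}
```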

Analyst agreement rates:

 
Do human reviewers agree?

Measures how often human reviewers agree with each other and with the model. Low agreement may point to unclear labeling guidelines or model confusion.
 
 VALIDATION, PRODUCTION

Label latency:

 
How fast is human feedback returned?

Tracks the delay between an image being captured and receiving a validated label. Long feedback loops slow retraining and response time.
 
 VALIDATION, PRODUCTION

Coverage gaps in rare/regulated object classes:

 
Are rare classes getting reviewed?

Identifies whether rare or regulated object classes are being reviewed adequately. Helps ensure risk-prone categories aren’t slipping through the cracks.
 
 VALIDATION, PRODUCTION

Jumpstart This Monitoring Plan

 

You don’t have to implement everything at once—but you do need a framework that grows as you scale. 
Here’s how data scientists, MLOps engineers, and other model-centric roles typically get started with Mona:
 

STEP 1: Consider what can go wrong.

 

Use a combination of experience (what went wrong before that you wish you had caught earlier) and theoretical thinking (you built this model; where could its weak or blind spots be?). This doesn’t have to be comprehensive. Start with the obvious and grow over time.
 
STEP 2: Define a monitoring schema.
 
Use this guide to think through all the data you need to track in order to find issues. The relevant fields can usually be drawn from the following categories (a small schema sketch follows after the list):
 
Raw data properties (e.g., image/video resolution, size, brightness…)
Model input features and output scores
Technical metadata (model and other component versions, data sources, camera hardware/firmware classes and versions, …)
Business metadata (things like customer IDs, geographical regions, …)
Feedback and performance metrics

You don’t have to calculate all the derived metrics on your own. A good monitoring platform should be able to do that for you.
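
As one hypothetical example of what such a schema could look like for a single inference event (field names here are made up for illustration, not a Mona-specific or required format):

```python
# Hypothetical monitoring record for one inference event, grouped by the
# categories above. Field names are illustrative, not a required format.
monitoring_record = {
    # Raw data properties
    "image_width": 1920,
    "image_height": 1080,
    "mean_brightness": 0.47,
    # Model inputs / outputs
    "num_detections": 3,
    "max_confidence": 0.91,
    "detected_classes": ["person", "forklift"],
    # Technical metadata
    "model_version": "v2.3.1",
    "camera_id": "dock-07",
    "firmware_version": "1.8.2",
    # Business metadata
    "customer_id": "acme-logistics",
    "region": "eu-west",
    # Feedback / performance (may arrive later and be joined by an ID)
    "human_reviewed": False,
    "latency_ms": 38.2,
}
```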
 

STEP 3: Initiate data logging.

 
If it isn’t already in place, make sure the relevant information from the previous step is logged in an orderly fashion that allows for future analysis. Here too, don’t be alarmed if you can’t gather everything on day one. Things like human reviewer or customer feedback, as well as some specific business or technical metadata, might be out of reach or logged in different places. Start with the basics and chart a path to the full schema.
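
A minimal sketch of logging such records as structured JSON lines so they can be joined with later feedback (the file names and the idea of joining on an event ID are assumptions about your setup, not a prescribed format):

```python
import json
import time
import uuid

def log_inference_event(record, path="events.jsonl"):
    """Append one monitoring record as a JSON line, with an ID for later joins."""
    record = dict(record)
    record.setdefault("event_id", str(uuid.uuid4()))
    record.setdefault("timestamp", time.time())
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["event_id"]

def log_feedback(event_id, overridden, path="feedback.jsonl"):
    """Log human feedback separately; it can be joined to events on event_id."""
    with open(path, "a") as f:
        f.write(json.dumps({"event_id": event_id,
                            "overridden": overridden,
                            "timestamp": time.time()}) + "\n")
```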
 

STEP 4: Set up your first alerting rules.

 

These will usually be basic drift, sudden-change, and outlier detections for the metrics discussed in this guide, sliced along the technical and business metadata dimensions you identified in STEP 2.
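
As a first, very simple alerting rule, you might flag sudden changes in a metric (say, daily recall for one class at one camera) relative to a trailing baseline, along these lines (the window size and three-sigma threshold are arbitrary starting points):

```python
import statistics

def sudden_change_alert(series, window=14, sigmas=3.0):
    """Alert if the latest value deviates from the trailing window by more than `sigmas` stddevs."""
    if len(series) < window + 1:
        return None  # not enough history yet
    baseline = series[-(window + 1):-1]
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return None
    latest = series[-1]
    z = (latest - mean) / stdev
    return {"latest": latest, "baseline_mean": mean, "z": z,
            "alert": abs(z) > sigmas}

daily_recall = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90,
                0.91, 0.93, 0.92, 0.91, 0.90, 0.92, 0.91, 0.78]  # sudden drop
print(sudden_change_alert(daily_recall))
```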
 

 Start Monitoring Object Detection Models with Confidence

 
Image data is messy. Real-world conditions change fast. If you're monitoring with surface metrics alone, you're leaving blind spots that degrade performance and trust. Mona is built to give you deep visibility into your object detection model’s real-world behavior, by class, camera, and condition. It’s model-aware, drift-sensitive, and built to make MLOps actionable.

 

 

Get started today.

 
Schedule a demo to learn how Mona helps teams in computer vision
detect failure early, trace root causes, and scale with confidence.

 
