Building memory-driven, spatial intelligence layers on top of raw deep learning predictions to enable conveyor belt pick-and-place automation.
Raw deep learning models like YOLO are excellent detectors, but they operate frame by frame: each image is processed as an independent snapshot, with no memory of what came before. To automate a physical conveyor belt, knowing that an item exists in a single frame is not enough. The automation layer needs to verify whether an object identified in Frame A is the same piece seen in Frame B, count the objects moving past a threshold (throughput), and pinpoint the moment an item enters the robot's working area (pick-zone).
To bridge the gap between raw bounding boxes and mechanical action, we engineered an advanced spatial logic pipeline. Combining the lightweight YOLOv8 detector with the high-performance ByteTrack multi-object tracking engine and native OpenCV vector operations, we built a low-latency, modular system that converts raw video feeds into actionable, coordinate-level automation triggers.
The foundation of spatial logic is identity tracking. We integrated ByteTrack directly with our fine-tuned YOLOv8 model. Many trackers discard low-confidence detection boxes, which often leads to lost tracks when items are partially blocked or shadowed. ByteTrack instead associates nearly every detection box: high-score detections are matched to existing tracks first, and the remaining low-score detections are then matched against the tracks that are still unassigned, with a Kalman filter predicting where each track should appear in the new frame.
By comparing the Intersection over Union (IoU) between predicted track boxes and new detections, ByteTrack assigns a persistent, unique ID to every detected item. This ID normally stays tied to the physical object for as long as it remains in the camera's field of view. This "memory" is what prevents the throughput counter from double-counting the same item, and it provides a continuous path of coordinates that a robotic arm can follow.
# Integrating multi-object tracking with YOLOv8 using ByteTrack
from ultralytics import YOLO

# Load fine-tuned conveyor model weights
model = YOLO("best.pt")

# Run inference with persistent tracking; stream=True yields one result per frame
results = model.track(source="conveyor_feed.mp4", persist=True, tracker="bytetrack.yaml", stream=True)

for frame_result in results:
    for box in frame_result.boxes:
        # box.id is None when the tracker has not assigned an ID to this detection
        track_id = int(box.id[0]) if box.id is not None else None
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"Object ID: {track_id} | Top-left corner: ({x1:.1f}, {y1:.1f})")
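To make the IoU comparison concrete, here is a minimal sketch of the overlap measure for two axis-aligned boxes. It is an illustration only, not ByteTrack's internal code; the function name `iou` and the sample boxes are our own assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width and height clamp to zero when the boxes are disjoint
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: the same item shifted 10 px to the right between frames
prev_box = (100, 50, 200, 150)
curr_box = (110, 50, 210, 150)
print(f"IoU: {iou(prev_box, curr_box):.3f}")  # prints "IoU: 0.818"
```

A high IoU between a track's predicted box and a new detection is strong evidence that both belong to the same physical item, which is why slow, steady conveyor motion is a favorable case for this kind of tracker.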
With unique IDs assigned, we implemented virtual zones to simulate the workspace boundaries of an industrial robot. Rather than restricting zones to simple rectangles, we added support for arbitrary 2D polygons to model complex, real-world pick-zones or machine exclusion spaces.
We defined the green "Pick Zone" as an array of polygon vertex coordinates in the format OpenCV expects. In each frame, the centroid $(C_x, C_y)$ of each active track ID is computed. We then call cv2.pointPolygonTest:
import cv2

# Check whether the object's centroid lies within the custom pick polygon.
# With measureDist=False the result is +1.0 (inside), 0.0 (on edge), or -1.0 (outside).
point = (float(centroid_x), float(centroid_y))
dist = cv2.pointPolygonTest(polygon_coords, point, measureDist=False)
if dist >= 0:
    # Target object is inside the pick zone: trigger the robotic pickup event
    trigger_pick_action(track_id, centroid_x, centroid_y)
This allows the system to register precisely when objects enter the pick area, incrementing a live 'IN ZONE' monitor and feeding exact localized coordinates directly to the robot arm's actuation system.
Real-time zone tracking. The system draws a virtual green polygon and successfully increments the 'IN ZONE' counter only when the tracked centroids of the fine-tuned classes enter the designated area.
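The zone logic above can be sketched end to end without any dependencies. The example below is an assumption-laden illustration rather than our production code: it uses a pure-Python ray-casting test as a stand-in for cv2.pointPolygonTest, and the names `centroid`, `point_in_polygon`, `ZoneMonitor`, and the sample coordinates are all hypothetical.

```python
def centroid(x1, y1, x2, y2):
    """Center point of an (x1, y1, x2, y2) bounding box."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def point_in_polygon(point, polygon):
    """Ray-casting test: True if the point lies inside the polygon.

    Pure-Python stand-in for cv2.pointPolygonTest; points exactly on an
    edge are not handled specially here.
    """
    px, py = point
    inside = False
    n = len(polygon)
    for i in range(n):
        xa, ya = polygon[i]
        xb, yb = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (ya > py) != (yb > py):
            # X position where the edge crosses that horizontal line
            x_cross = xa + (py - ya) * (xb - xa) / (yb - ya)
            if px < x_cross:
                inside = not inside
    return inside


class ZoneMonitor:
    """Tracks which IDs are currently inside the zone and reports new entries."""

    def __init__(self, polygon):
        self.polygon = polygon
        self.in_zone = set()

    def update(self, tracks):
        """tracks: dict of track_id -> (x1, y1, x2, y2). Returns newly entered IDs."""
        now_inside = {
            tid for tid, box in tracks.items()
            if point_in_polygon(centroid(*box), self.polygon)
        }
        entered = sorted(now_inside - self.in_zone)
        self.in_zone = now_inside
        return entered


# Hypothetical trapezoidal pick zone in pixel coordinates
pick_zone = [(300, 100), (500, 100), (550, 300), (250, 300)]
monitor = ZoneMonitor(pick_zone)

frame_1 = {1: (100, 150, 160, 210), 2: (380, 170, 440, 230)}
frame_2 = {1: (280, 150, 340, 210), 2: (400, 170, 460, 230)}
print(monitor.update(frame_1))   # [2]
print(monitor.update(frame_2))   # [1]
print(len(monitor.in_zone))      # 2
```

Reporting only newly entered IDs means a pickup event would fire once per item rather than on every frame the item spends inside the zone, while `len(monitor.in_zone)` gives the kind of live occupancy figure an 'IN ZONE' display needs.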
The second application is a high-speed **tripwire throughput counter**. This is designed to replace traditional optoelectronic sensors, with the added capability of classifying exactly what kind of product crossed the line.
The system defines the tripwire as a line between two points. As items move from left to right down the conveyor belt, the tracking engine records the history of their coordinates. By comparing both the previous and the current X-coordinate of a given object ID with the X-coordinate of the tripwire, the system detects a crossing: the object was left of the line in the previous frame and is on or past it now. Once an object has crossed, its ID is added to a "counted set" to prevent duplicate counting, and the global throughput count is updated immediately.
Real-time throughput counting. A virtual red tripwire is established. As the multi-object tracker maintains the ID of the moving items, the top-left 'COUNT' updates precisely as each centroid crosses the threshold.
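The tripwire logic can be sketched as follows. This is a simplified illustration that assumes a vertical tripwire and left-to-right motion; the class `TripwireCounter`, the class names "bottle" and "can", and the sample coordinates are hypothetical and not taken from our pipeline.

```python
from collections import Counter


class TripwireCounter:
    """Counts each track ID once when its centroid crosses a vertical line, left to right."""

    def __init__(self, line_x):
        self.line_x = line_x
        self.last_x = {}            # track_id -> centroid X in the previous frame
        self.counted = set()        # IDs that have already been counted
        self.per_class = Counter()  # class name -> number of crossings

    def update(self, track_id, centroid_x, class_name):
        """Feed one observation; returns True if it is a new crossing."""
        prev_x = self.last_x.get(track_id)
        self.last_x[track_id] = centroid_x
        crossed = (
            prev_x is not None
            and prev_x < self.line_x <= centroid_x
            and track_id not in self.counted
        )
        if crossed:
            self.counted.add(track_id)
            self.per_class[class_name] += 1
        return crossed

    @property
    def total(self):
        return len(self.counted)


counter = TripwireCounter(line_x=400)
# (track_id, centroid_x, class_name) observations over four frames; values are made up
observations = [
    (7, 380, "bottle"), (8, 300, "can"),
    (7, 405, "bottle"), (8, 350, "can"),
    (7, 430, "bottle"), (8, 398, "can"),
    (7, 455, "bottle"), (8, 402, "can"),
]
for track_id, cx, name in observations:
    counter.update(track_id, cx, name)

print(counter.total, dict(counter.per_class))  # 2 {'bottle': 1, 'can': 1}
```

Keeping the counted IDs in a set makes the count robust to jitter: if a centroid wobbles back and forth across the line, the item is still counted only once, and the per-class tally is what lets this replace a sensor that can only report that something passed.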
This project highlights that much of the value of deep learning in industrial contexts is unlocked by the post-processing logic built on top of the detector. By keeping our vision pipeline lightweight and computationally efficient, we achieve robust spatial intelligence that runs on standard hardware.