The most significant design decision was not using a trained classifier. Building a heuristic pipeline over the 21 hand landmarks delivered by MediaPipe has a concrete advantage: it’s fully interpretable and requires no training data. The trade-off is rigidity: the rules work well within the defined gesture set, but they degrade when the user’s hand posture or the lighting varies significantly.
The pipeline follows four stages. OpenCV captures the webcam frame and manages visual feedback on the video stream. MediaPipe extracts real-time 3D coordinates for all 21 hand landmarks. A custom classification layer evaluates each finger’s state (extended or flexed) and calculates geometric distances between key points to identify the active gesture. PyAutoGUI then translates that gesture into the corresponding system action: navigation, zoom, undo/redo, or screenshot.
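To make the classification stage concrete, here is a minimal sketch of how finger states and a normalized distance could be computed from the 21 landmarks. It is not the project's actual rule set. The landmark indices follow MediaPipe's hand model (0 is the wrist, fingertips are 4, 8, 12, 16, 20), but the "tip farther from the wrist than the middle joint" test, the `pinch_threshold` value, and the gesture names (`pinch`, `fist`, `open_palm`, `point`) are all assumptions chosen for illustration.

```python
import math

WRIST = 0
MIDDLE_MCP = 9  # base of the middle finger, used here as a palm-size reference

# (reference joint, fingertip) per finger, using MediaPipe hand landmark indices.
FINGERS = {
    "thumb": (2, 4),
    "index": (6, 8),
    "middle": (10, 12),
    "ring": (14, 16),
    "pinky": (18, 20),
}


def finger_states(landmarks):
    """Return {finger: True if extended} for a list of 21 (x, y, z) tuples.

    Assumed heuristic: a finger counts as extended when its tip is farther
    from the wrist than its reference joint. This is a rough rule and is
    least reliable for the thumb.
    """
    wrist = landmarks[WRIST]
    return {
        name: math.dist(landmarks[tip], wrist) > math.dist(landmarks[joint], wrist)
        for name, (joint, tip) in FINGERS.items()
    }


def pinch_ratio(landmarks):
    """Thumb-tip to index-tip distance, divided by a palm-size reference.

    Normalizing keeps the value comparable when the hand moves closer to
    or farther from the camera.
    """
    palm = math.dist(landmarks[WRIST], landmarks[MIDDLE_MCP])
    if palm == 0:
        return float("inf")
    return math.dist(landmarks[4], landmarks[8]) / palm


def classify(landmarks, pinch_threshold=0.25):
    """Map landmarks to a hypothetical gesture label."""
    if pinch_ratio(landmarks) < pinch_threshold:
        return "pinch"
    extended = [name for name, is_up in finger_states(landmarks).items() if is_up]
    if not extended:
        return "fist"
    if len(extended) == len(FINGERS):
        return "open_palm"
    if extended == ["index"]:
        return "point"
    return "unknown"
```

The functions take plain `(x, y, z)` tuples, so they can be exercised without a camera. With MediaPipe's Hands solution, the tuples would typically be built from each detected hand's landmark list, for example `[(p.x, p.y, p.z) for p in hand.landmark]`. Because every rule is an explicit comparison, a misclassification can be traced to a specific distance, which is the interpretability benefit of the heuristic approach.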
The project delivers on its premise but has clear limits: sensitivity to lighting conditions, latency that accumulates on modest hardware, and no temporal gating to prevent accidental activations, so a gesture held across several frames can fire its action more than once. The natural next step is introducing a per-gesture cooldown and evaluating whether a lightweight classifier trained directly on the landmark data improves robustness without sacrificing pipeline speed.
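The per-gesture cooldown could look something like the sketch below. It is one possible design, not something the project implements yet. The class name `GestureCooldown`, its parameters, and the gesture labels are hypothetical; the clock is injectable so the logic can be checked without waiting in real time.

```python
import time


class GestureCooldown:
    """Allow each gesture to fire at most once per cooldown window."""

    def __init__(self, default_seconds=1.0, per_gesture=None, clock=time.monotonic):
        self.default_seconds = default_seconds
        # Optional overrides, e.g. a longer window for a screenshot gesture.
        self.per_gesture = dict(per_gesture or {})
        self.clock = clock
        self._last_fired = {}

    def ready(self, gesture):
        """Return True and start the cooldown if the gesture may fire now."""
        now = self.clock()
        wait = self.per_gesture.get(gesture, self.default_seconds)
        last = self._last_fired.get(gesture)
        if last is not None and now - last < wait:
            return False
        self._last_fired[gesture] = now
        return True
```

In the frame loop, the action would be dispatched only when `ready(gesture)` returns `True`, for example by guarding a call such as `pyautogui.hotkey("ctrl", "z")`. Tracking the timestamp per gesture rather than globally means a long window on one action does not block the others.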