Computer Vision Gesture Controller

Real-time hand gesture recognition system that translates hand movements into desktop actions using computer vision.



Academic project focused on applied computer vision.

Case notes

The most significant design decision was to avoid a trained classifier. A heuristic pipeline built on the 21 hand landmarks that MediaPipe provides has a concrete advantage: it’s fully interpretable and requires no training data. The trade-off is rigidity: the rules work well within the defined gesture set, but they generalize poorly when the user’s hand posture or the lighting conditions vary significantly.

The pipeline follows four stages. OpenCV captures the webcam frame and manages visual feedback on the video stream. MediaPipe extracts real-time 3D coordinates for all 21 hand landmarks. A custom classification layer evaluates each finger’s state (extended or flexed) and calculates geometric distances between key points to identify the active gesture. PyAutoGUI then translates that gesture into the corresponding system action: navigation, zoom, undo/redo, or screenshot.
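The classification stage described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual rules: it assumes the landmarks have already been converted to a list of 21 `(x, y, z)` tuples in MediaPipe's standard index order (0 is the wrist, 4 is the thumb tip, 8 is the index tip, and so on). The gesture labels, the `pinch_ratio` threshold, and the function names are hypothetical.

```python
import math

# MediaPipe Hands landmark indices: 0 is the wrist, then four points per finger.
WRIST = 0
FINGER_JOINTS = {  # finger name -> (PIP joint index, tip index)
    "index": (6, 8),
    "middle": (10, 12),
    "ring": (14, 16),
    "pinky": (18, 20),
}
THUMB_IP, THUMB_TIP = 3, 4
INDEX_TIP, MIDDLE_MCP, PINKY_MCP = 8, 9, 17


def finger_states(lm):
    """Return {finger: True if extended} for 21 (x, y, z) landmarks."""
    states = {}
    for name, (pip, tip) in FINGER_JOINTS.items():
        # Extended when the tip is farther from the wrist than the PIP joint.
        states[name] = math.dist(lm[tip], lm[WRIST]) > math.dist(lm[pip], lm[WRIST])
    # The thumb folds across the palm, so compare against the pinky base instead.
    states["thumb"] = (
        math.dist(lm[THUMB_TIP], lm[PINKY_MCP]) > math.dist(lm[THUMB_IP], lm[PINKY_MCP])
    )
    return states


def classify(lm, pinch_ratio=0.35):
    """Map landmarks to a hypothetical gesture label (assumed rule set)."""
    extended = {name for name, up in finger_states(lm).items() if up}
    if not extended:
        return "fist"
    # Normalise by palm length so the threshold is independent of camera distance.
    palm = math.dist(lm[WRIST], lm[MIDDLE_MCP])
    if palm > 0 and math.dist(lm[THUMB_TIP], lm[INDEX_TIP]) / palm < pinch_ratio:
        return "pinch"
    if len(extended) == 5:
        return "open_palm"
    if extended == {"index"}:
        return "point"
    return "unknown"
```

Comparing distances rather than raw y coordinates keeps the finger test independent of hand rotation in the frame, which is one way to soften the rigidity mentioned earlier. The rules remain hand-written and would need tuning against real footage.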

The project delivers on its premise but has clear limits: sensitivity to lighting conditions, latency accumulation on modest hardware, and no temporal control to prevent accidental activations. The natural next step is introducing a per-gesture cooldown and evaluating whether a lightweight classifier trained directly on the landmark data improves robustness without sacrificing pipeline speed.
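A per-gesture cooldown of the kind proposed above could look like the sketch below. It is not part of the current project. The class name `GestureCooldown`, the default window of one second, and the injectable `clock` parameter (included only to make the logic testable) are all assumptions.

```python
import time


class GestureCooldown:
    """Allow each gesture to fire at most once per cooldown window."""

    def __init__(self, default_seconds=1.0, per_gesture=None, clock=time.monotonic):
        self.default_seconds = default_seconds
        self.per_gesture = dict(per_gesture or {})  # optional overrides per label
        self.clock = clock
        self._last_fired = {}  # gesture label -> timestamp of last accepted firing

    def ready(self, gesture):
        """Return True and start the cooldown if the gesture may fire now."""
        now = self.clock()
        window = self.per_gesture.get(gesture, self.default_seconds)
        last = self._last_fired.get(gesture)
        if last is not None and now - last < window:
            return False  # still cooling down, so suppress the repeated activation
        self._last_fired[gesture] = now
        return True
```

In the main loop, the system action would run only when `ready(label)` returns `True`, so a gesture held across many consecutive frames triggers once instead of on every frame.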

Key highlights

Real-time hand landmark tracking from webcam with visual feedback overlaid on the live frame.
Gesture classification through per-finger state detection and geometric distances between key landmarks.
Direct system integration for navigation, zoom, undo/redo, and screenshot capture.
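
As an illustration of the system integration listed above, a gesture-to-action table might be wired as in the sketch below. The gesture labels, the key bindings, and the screenshot filename are assumptions rather than the project's actual mapping. `gui` stands for the imported `pyautogui` module, or any object exposing `hotkey` and `screenshot`.

```python
def build_actions(gui):
    """Return a gesture -> callable table bound to a PyAutoGUI-like object."""
    return {
        # Key bindings are assumed; they vary by application and platform.
        "swipe_left": lambda: gui.hotkey("alt", "left"),    # navigate back
        "swipe_right": lambda: gui.hotkey("alt", "right"),  # navigate forward
        "zoom_in": lambda: gui.hotkey("ctrl", "+"),
        "zoom_out": lambda: gui.hotkey("ctrl", "-"),
        "undo": lambda: gui.hotkey("ctrl", "z"),
        "redo": lambda: gui.hotkey("ctrl", "y"),
        "screenshot": lambda: gui.screenshot("gesture_capture.png"),
    }


def dispatch(gesture, actions):
    """Run the action mapped to a gesture; return False for unmapped labels."""
    action = actions.get(gesture)
    if action is None:
        return False
    action()
    return True
```

Calling `dispatch("undo", build_actions(pyautogui))` would send Ctrl+Z to the focused window. Passing the backend in as a parameter keeps the table usable without a desktop session, for example in tests.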

Stack

Python

OpenCV

MediaPipe

NumPy

PyAutoGUI