Imagine a bustling warehouse floor: Forklifts weave between towering shelves, workers navigate through a maze of inventory, and conveyor belts hum with activity. In this complex environment, safety is paramount. Now, picture a safety manager reviewing footage, not by tediously scrubbing through hours of video, but by simply typing: "Show me all near-miss incidents involving forklifts and pedestrians in the loading bay this week." Within moments, relevant clips appear, each a potential learning opportunity to enhance workplace safety. This isn't a futuristic concept; it's the present reality of AI-powered video analytics in warehouse safety management.
However, this advancement is not without its complexities. Interpreting intricate real-world scenes, full of contextual ambiguity and constant movement, is a significant challenge. Obtaining high-quality, diverse examples of such safety hazards is a monumental task. And even after a model is built, delivering results in real time means striking a delicate balance between precision, speed, and cost. Only recent shifts in AI technology have turned what was previously magic into reality, and this new age of semantic AI is just getting started.
The missing key: Adding top-down understanding to bottom-up systems
Historically, AI as a field was primarily concerned with training models to perform narrowly defined tasks using large datasets: object detection in images, text recognition, language translation, and so on. These models were only good at those exact tasks.
These models learned by understanding data in a bottom-up way. For example, by observing variations in pixel data across many examples, a model would find recurring patterns and use them to draw conclusions. The key point is that such models understand the world not through a conceptual grasp of their task, but through subtle, repeated patterns in the underlying data.
These approaches are extremely effective at specialization. However, when a bottom-up AI model encounters an example unlike anything it has seen before, it can produce completely unexpected results. For instance, suppose you are detecting cats and only provide examples of black cats. Shown a white cat, the model may fail to generalize to the new fur color, and its accuracy can drop unpredictably.
In contrast, a new class of models focused on semantics has been improving rapidly. Although they are trained on similar data, these models are instead encouraged to discover and connect concepts through associations. These developments arrived first in the text domain, but images quickly followed, yielding models that can understand fuzzy, loosely specified semantics.
From syntactic parsing to contextual embeddings with text
In the early days of natural language processing (NLP), systems relied heavily on syntactic parsing and rule-based approaches. They meticulously analyzed parts of speech, grammatical structures, and predefined semantic rules. This bottom-up approach, while foundational, struggled with ambiguity and contextual nuances.
The advent of transformer models, particularly BERT (Bidirectional Encoder Representations from Transformers), marked a pivotal shift. These models introduced contextual embeddings, where the representation of a word is dynamically influenced by its surrounding context. Researchers discovered that by simply exposing models to billions of words of text, from news articles to books, the models would implicitly learn an intuitive sense of how close in meaning words are to one another. This allowed for a more nuanced understanding of language, capturing semantic relationships that were previously elusive.
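To make this concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint, of how the same word receives different embeddings depending on its context:

```python
# Minimal sketch: contextual embeddings give the same word different
# representations depending on its surroundings.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_embedding(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

# The token "bank" lands in different regions of embedding space
# depending on whether the context is financial or geographic.
river = word_embedding("She sat on the bank of the river.", "bank")
money = word_embedding("He deposited cash at the bank downtown.", "bank")
loan = word_embedding("The bank approved her loan application.", "bank")

cos = torch.nn.functional.cosine_similarity
print(cos(money, loan, dim=0))   # relatively high: both financial senses
print(cos(river, money, dim=0))  # lower: different senses of "bank"
```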
Computer vision's parallel evolution: From convolutions to semantic understanding
Computer vision followed a similar trajectory. Initial approaches relied on hand-crafted features like SIFT (Scale-Invariant Feature Transform) and HOG (Histogram of Oriented Gradients), meticulously engineering low-level visual descriptors such as the edge gradients across a person's face or the diagonal stroke of an "R" when reading text. The advent of Convolutional Neural Networks (CNNs) automated this feature extraction process, leading to significant improvements in tasks like object detection and image classification.
However, the true game-changer came very recently with the introduction of CLIP (Contrastive Language-Image Pre-training). CLIP bridged the gap between vision and language, employing a novel training methodology that learns visual concepts directly from natural language supervision. Just as with text, by exposing models to millions of image-text pairings, the AI learns that a picture of a cat sits closer to the phrase "that cat is funny" than to "a car racing down a track."
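As an illustration, here is a hedged sketch using the publicly released openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers; the image file name is a placeholder for any photo of a cat:

```python
# Minimal sketch of CLIP's shared embedding space: one image scored
# against two candidate captions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical example image
captions = ["that cat is funny", "a car racing down a track"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each caption;
# the cat photo should score noticeably higher against the cat caption.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```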
None of these developments matter unless they are cheaply accessible to the everyday user. As these technologies matured, a parallel trend emerged: a push towards efficiency and accessibility. Just as your smartphone can now recognize faces or objects in real time, a feat that once required supercomputers, the ability to understand the semantics of text and images is becoming more accessible and widespread.
In this landscape of rapid innovation, our approach stands out by synthesizing top-down semantic understanding with traditional bottom-up computer vision techniques. Advancements in both efficiency and AI sophistication have helped democratize access to whole new classes of tools and workflows. This is our mission for our customers: as AI gets smarter and more efficient, bringing compute and AI to edge video data allows us to deliver more intelligence and keep workplaces safer and more secure. In doing so, we continue to drive accessibility to this powerful technology. But how do we do it?
Spot AI: Where magic meets method
Imagine Spot AI's text search system as an artificial brain, a complex neural network dedicated to warehouse safety. Like the human brain, it's composed of specialized regions that process different aspects of sensory input, all working in concert to create a holistic understanding of the environment.
These regions work in harmony to process and interpret the complex visual landscape of a warehouse environment. At the system's core lies the "visual cortex," powered by advanced convolutional neural networks such as YOLO that track objects moving through the scene. This region rapidly processes incoming video feeds, identifying and categorizing key elements: workers, forklifts, motion, and more. Building on this visual understanding are specialized neural networks analogous to distinct cortical regions in the human brain. One network, focused on facial recognition and reading text, identifies familiar faces and extracts important textual information like license plates.
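As a rough illustration of this bottom-up detection stage (not our production pipeline), here is a sketch using the open-source ultralytics YOLO package and a generic COCO-trained checkpoint; the video path is hypothetical:

```python
# Illustrative sketch of the "visual cortex" stage: a YOLO detector tagging
# objects in sampled frames from a camera feed.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # small, general-purpose COCO checkpoint
capture = cv2.VideoCapture("loading_bay.mp4")  # hypothetical camera feed

frame_index = 0
while True:
    ok, frame = capture.read()
    if not ok:
        break
    # Run detection on every 30th frame to keep the example cheap.
    if frame_index % 30 == 0:
        result = model(frame, verbose=False)[0]
        for box in result.boxes:
            label = model.names[int(box.cls)]
            confidence = float(box.conf)
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            print(f"frame {frame_index}: {label} ({confidence:.2f}) at "
                  f"({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
    frame_index += 1
capture.release()
```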
With these networks alone, our system was world class at providing hard data points and surfacing relevant video for customers fast. But what about the generalization needed to handle any search query a customer might ask? What we needed to incorporate was a sense of reasoning and semantic understanding. The true innovation lies in how we've integrated these bottom-up detections with this new reasoning layer and allowed it to power deeper video retrieval.
Here's where our CLIP-based semantic models play a crucial role: they compare visual content and natural language queries by encoding each into a point in a shared embedding space and measuring how similar those points are. By leveraging CLIP's ability to handle ambiguous, novel concepts, the system can identify and search for elements it wasn't explicitly trained on. More specifically, the semantic layer enables the system to reason about relationships between detected objects and actions, inferring higher-level concepts it was never explicitly shown, like "missing glove." By combining semantic understanding with text search, we can suddenly reason about the activities of a specific car, tied to a license plate, alongside the other features we've detected.
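To illustrate the idea of a joint embedding space, here is a simplified sketch, again using a public CLIP checkpoint. The precomputed frame-embedding file is a hypothetical stand-in for embeddings generated offline from video frames with CLIP's image encoder:

```python
# Hedged sketch of retrieval in a joint embedding space: the text query and
# each video frame live in the same vector space, so similarity is just a
# dot product between normalized vectors.
import numpy as np
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Shape (num_frames, 512): one CLIP image embedding per sampled frame,
# loaded from a hypothetical offline indexing job.
frame_embeddings = np.load("frame_embeddings.npy")
frame_embeddings /= np.linalg.norm(frame_embeddings, axis=1, keepdims=True)

def search(query: str, top_k: int = 5) -> list[int]:
    """Return indices of the frames most similar to the text query."""
    tokens = tokenizer([query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_embedding = model.get_text_features(**tokens)[0].numpy()
    text_embedding /= np.linalg.norm(text_embedding)
    scores = frame_embeddings @ text_embedding
    return np.argsort(scores)[::-1][:top_k].tolist()

print(search("a forklift close to a pedestrian"))
```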
An example of how we'd understand a query
When a query like "Identify instances of tailgating at secure entrances between 2200 and 0600 hours" is received, the system performs several steps (a simplified sketch follows the list):
- Query parsing: The natural language query is parsed to extract key elements (action: tailgating, location: secure entrances, time range: 2200–0600).
- Semantic expansion: The query is expanded using the semantic understanding layer to include related concepts (e.g., "unauthorized entry," "multiple people entering on single authorization").
- Multi-modal matching: The system then performs a matching operation in the joint embedding space, comparing the expanded query against pre-computed embeddings of video segments to gauge how close each segment's visual content is to the query's concepts.
- Temporal and spatial filtering: Results are filtered based on the specified time range and location information, leveraging metadata associated with the video segments.
- Ranking and retrieval: Matched segments are ranked based on their semantic similarity scores and presented to the user.
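Below is an illustrative skeleton of this flow. The data structures, field names, and toy similarity scores are assumptions made for the sketch, not our actual schema; in practice, the similarity values come from the joint-embedding comparison described above.

```python
# Illustrative skeleton of the query flow: parse, expand, filter, and rank.
from dataclasses import dataclass
from datetime import time

@dataclass
class ParsedQuery:
    action: str      # e.g. "tailgating"
    location: str    # e.g. "secure entrances"
    start: time      # e.g. 22:00
    end: time        # e.g. 06:00

@dataclass
class Segment:
    segment_id: str
    zone: str
    timestamp: time
    similarity: float  # best score against the expanded query phrases

def expand(query: ParsedQuery) -> list[str]:
    """Semantic expansion: related phrasings of the requested concept."""
    return [query.action,
            "unauthorized entry",
            "multiple people entering on single authorization"]

def in_window(t: time, start: time, end: time) -> bool:
    """True if t falls in the window, including overnight spans like 22:00-06:00."""
    return (start <= t or t <= end) if start > end else (start <= t <= end)

def retrieve(query: ParsedQuery, segments: list[Segment], top_k: int = 10) -> list[Segment]:
    """Temporal and spatial filtering followed by ranking on semantic similarity."""
    matches = [s for s in segments
               if s.zone == query.location and in_window(s.timestamp, query.start, query.end)]
    return sorted(matches, key=lambda s: s.similarity, reverse=True)[:top_k]

query = ParsedQuery("tailgating", "secure entrances", time(22, 0), time(6, 0))
segments = [
    Segment("clip-014", "secure entrances", time(23, 15), 0.81),
    Segment("clip-022", "loading bay",      time(23, 40), 0.77),  # wrong zone
    Segment("clip-031", "secure entrances", time(9, 5),   0.90),  # outside window
]
print(expand(query))
print([s.segment_id for s in retrieve(query, segments)])  # -> ['clip-014']
```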
Future horizons
As we peer into the future of this technology, several exciting possibilities emerge:
- Improved generalization: Advances in few-shot and zero-shot learning could enable systems to recognize and search for concepts with minimal or no specific training.
- Multimodal integration: Incorporating audio analysis and other external data from Spot Connect alongside video could provide even richer contextual understanding.
- Temporal understanding: Enhanced modeling of long-term temporal dependencies could improve the detection of complex, time-dependent events.
The convergence of advanced NLP and computer vision techniques in video analysis represents a significant leap in our ability to derive insights from visual data. By bridging the gap between human language and machine perception, systems like Spot AI's are not just changing how we search through videos — they're fundamentally altering our relationship with visual information.
As this technology continues to evolve, becoming more sophisticated and accessible, its impact will likely be felt across numerous domains, from enhancing security and optimizing industrial processes to unlocking new realms of scientific discovery and creative expression. As these systems become more complex, ensuring their decisions are interpretable and accountable becomes increasingly important too, enabling both better control and more responsible use.
The story of AI in video analysis is far from complete. Each advancement brings new capabilities and new questions, challenging us to push the boundaries of what's possible while carefully considering the implications of our creations. As we write the next chapters of this technological narrative, our task is clear: to harness the power of AI to unlock the full potential of visual data, while ensuring that our innovations serve businesses, workers, and end customers rather than harm them.