Enhancing CCTV Security with Large Language Models and Computer Vision

In the CCTV monitoring and security sector, the use of AI technology has become the norm, with many manufacturers offering advanced image analysis features built into their devices. Manufacturers such as Hikvision and Dahua have for years been offering devices with features such as motion detection (human vehicle target classification), line crossing detection, intrusion detection, region entrance detection, region exiting detection (support alarm triggering by specified target types (human and vehicle)), scene change detection, face detection, and recognition.
Capabilities Offered by Built-in AI
These AI technologies offer a range of capabilities, such as:
- Object recognition and classification (people, vehicles, animals).
- Behaviour and motion path analysis.
- Intrusion and line crossing detection.
- Abandoned object detection.
- Crowd density analysis and safety regulation monitoring.
The Problem of False Alarms
Despite all the advantages, AI-based systems still face the challenge of false alarms, which can arise from various causes, such as:
- Variable lighting conditions may cause erroneous motion or object detection.
- A dynamic environment, such as trees moved by the wind or animals passing through the scene, may be mistakenly interpreted as threats.
- Technological limitations in recognising context and complex scenarios, sometimes resulting in difficulty distinguishing trivial events from genuine threats.
Reducing false alarms is crucial for the effectiveness of security systems, as too high a number of false alarms can lead to “alarm fatigue,” where users begin to ignore notifications, potentially overlooking real threats.
Reducing false alarms is not the only advantage of using LLM (Large Language Models) and Computer Vision (CV); other benefits can include:
Flexibility in Defining Alarm Scenarios
The integration of CCTV with LLM and CV offers users unparalleled flexibility in defining the scenarios that should trigger alarms or notifications. Users can precisely specify the conditions that must be met for the system to respond – from very detailed ones, such as detecting a vehicle of a specific colour or a person carrying a tool, to general ones like increased activity in a specific area. Scenarios can also be defined where an alarm will be triggered in the event of a fire or flooding of the premises or any other environmental hazards.
Automatic Monitoring Reports
One of the most innovative capabilities is the generation of automatic or on-demand monitoring reports that describe the current situation at the monitored site. Thanks to advanced analysis capabilities, the system can provide information about the presence of employees, the stage of a construction project, or the status of gate closures. This allows for more informed security management, enabling quick responses to anomalies. This functionality also paves the way for creating a system that could support, for example, security personnel (monitoring) and might even replace their work in the future.
A Practical Exploration
As a company that installs and integrates CCTV systems, we are keen to explore how we can utilize large language models with image analysis capabilities (LLM CV) in video surveillance systems. Models such as GPT-4V or Claude 3 have tremendous capabilities in image analysis, outperforming the AI systems built into CCTV devices.
Let’s examine how language models fare in analyzing images from CCTV cameras in practice.
For our tests, we will use the GPT-4V language model available through the OpenAI API. We will analyze a set of several photos from different locations where our company has installed CCTV systems. The surveillance images were obtained with the consent of the property owners. Photos in which a human face is visible have been anonymized. Moreover, language models are not capable of identifying individuals or describing any human features that would enable identification. This limitation is imposed by the LLM providers.
Data Security and Privacy
At this point, it’s crucial to address data security and privacy concerns, as images from cameras will be sent for analysis via APIs from external providers such as OpenAI. According to what can be read in the terms and conditions of the model provider, among other points:
Ownership: You own and control your data
- We do not train on your business data (data from ChatGPT Team, ChatGPT Enterprise, or our API Platform)
- You own your inputs and outputs (where allowed by law)
Security: Comprehensive compliance
- We’ve been audited for SOC 2 compliance (ChatGPT Enterprise and API)
- Data encryption at rest (AES-256) and in transit (TLS 1.2+)
The most crucial information for us is that data sent to the API will not be used for training the model, meaning they will not be processed further. Additionally, as can be read on the website https://platform.openai.com/docs/guides/vision, “After an image has been processed by the model, it is deleted from OpenAI servers and not retained.”
Scenario no. 1
Let’s start by checking how LLM handles describing what is found in images. We will start with a simple prompt.

Prompt 1
Describe what you see in the camera image.
LLM Answer:
This appears to be a still image from a security camera overlooking a street scene at night. The timestamp in the top left corner indicates that the image was captured on March 3rd, 2021, at 18:40:57. The camera view is labeled “Street View” in the bottom right corner.
In the image, we see a well-lit urban street with a sidewalk in the foreground. There are several people visible: one person is standing near the corner of the street, another is walking along the sidewalk, and two others are crossing the street in the distance. There is a parked bicycle with a green basket attached to the front, and another bicycle locked to a metal fence next to a trash bin. The street itself has a few cars, with one car visible in motion and the brake lights of another indicating it might be slowing down or stopped.
The lighting suggests it’s evening or night, and the businesses in the background have their lights on. The quality of the image is relatively clear, allowing for details like these to be discerned.
As you can see, the language model described in detail what is happening in the photo. However, I would like it to provide a shorter and more concise answer. Let’s see what the response will look like when we put the language model in the role of a person analysing CCTV camera photos.
Prompt 2
As a CCTV Operator specializing in security oversight, your task is to analyze images from a CCTV camera with a sharp focus on detecting potential security breaches or dangers. Your report should be succinct, omitting normal, non-suspicious activities to concentrate exclusively on notable concerns. In your analysis, please include:
Unusual Activities: Briefly describe actions or behaviors that deviate from the expected norm and might indicate a security concern.
Suspicious Objects: Mention any objects that are out of place or could pose a risk, excluding everyday items that do not raise suspicion.
Potential Threats: Identify any immediate threats to safety, such as unauthorized entry attempts or clear signs of conflict.
Your description should be concise and factual, aimed at providing essential information for quick decision-making by security personnel.
LLM Answer:
In the provided image, there are no immediately apparent unusual activities, suspicious objects, or potential threats to safety. The scene depicts a typical street view with pedestrians walking and vehicles on the road. There is a bicycle parked on the sidewalk, but it does not seem out of place or suspicious. All individuals in the image appear to be engaged in normal activities, such as walking or waiting to cross the street. There are no clear signs of unauthorized entry attempts or conflict. The image does not require further action based on the visible content.
It’s better; the model returned information relevant during CCTV analysis in terms of security or dangerous situations, but it’s still too lengthy. Let’s try to compel the model to provide even shorter information.
Prompt 3
Analyze the CCTV image for security concerns. Directly report any observed unusual activities, suspicious objects, or potential threats. If no irregularities are found, simply state that everything is okay. Focus on:
Unusual Activities: Only mention if actions or behaviors suggest security concerns.
Suspicious Objects: Highlight objects that could pose a risk, if any.
Potential Threats: Report immediate safety threats, if visible.
Avoid introductory phrases. Be brief and to the point.
LLM Answer:
No unusual activities, suspicious objects, or potential threats are visible in the image. Everything appears to be okay.
As can be seen, it’s possible to compel the model to include only the information that is relevant to us and in a specific scenario. Let’s see if we can modify the prompt further so that the analysis also contains information regarding bicycles left on the sidewalk.
Prompt 4
Analyze the CCTV image for security concerns. Directly report any observed unusual activities, suspicious objects, or potential threats. If no irregularities are found, simply state that everything is okay. Focus on:
– Pay attention to whether any bicycles have been left behind, and if so, count them.
Avoid introductory phrases. Be brief and to the point.
LLM Answer:
Two bicycles are present; one is secured to a bike rack and the other has a green basket attached. No unusual activities, suspicious objects, or potential threats are observed. Everything appears to be okay.
Scenario no. 2
In this scenario, we’ll place the model in the role of a parking attendant. We’ll see if we can use the model to count the parked cars and check if there is a car parked in a specific color.


Prompt 1
You are analyzing an image from a CCTV camera, and your task is to check which cars are parked in a private parking lot. Determine the number of cars parked in the lot, and if possible, identify the types of vehicles (e.g., SUV, van, sedan, etc.) that are parked.
LLM Answer (Pic. 1):
In the image provided, I can see two vehicles parked in the lot:
- On the left side of the image, there is a white van. It appears to be a light commercial vehicle, commonly used for deliveries or trades.
- In the center, parked against the brick wall, there is a dark-colored car that looks like a hatchback or a small sedan.
These are the only two vehicles visible in the parking lot from this angle.
LLM Answer (Pic. 2):
In the image, there are six vehicles parked in the private parking lot. Here is a description of the types of vehicles I can identify:
- On the far left, there is a white van.
- Next to the van, there is a dark-colored hatchback.
- The third vehicle from the left is a light-colored hatchback.
- In the center, there is a red hatchback.
- To the right of the red car, there is a black sedan parked perpendicular to the other vehicles.
- On the far right, partially obscured by the staircase, there is a dark-colored sedan.
Please note that the specific makes and models of the vehicles cannot be determined with certainty from this image.
Now, we will instruct the model to, firstly, return the response as structured data in the JSON format for application use, and secondly, to add information to the prompt to pay attention to whether a red hatchback is parked. Such custom scenarios can be dynamically added to the prompt as needed.
Prompt 2
Your task is to analyze images from a single CCTV camera, and generate a response in JSON format.
In your analysis, pay special attention to objectives listed below:
– custom scenario 1: Pay attention to whether red hatchback is parked.
Your description must be concise, focusing solely on the details relevant to the objectives.
JSON Format Example###:
{{
“number_of_cars”:2,
“is_space_available”: true/false,
“custom_scenario_1_condition_met”: true/false,
}}
Ensure accuracy and attention to detail in your analysis.
LLM Answer (Pic. 1)
{
"number_of_cars": 2,
"is_space_available": true,
"custom_scenario_1_condition_met": false
}
LLM Answer (Pic. 2)
{
"number_of_cars": 6,
"is_space_available": false,
"custom_scenario_1_condition_met": true
}
Scenario no. 3
In this scenario, we will use the model to check if the doors are open and whether there are people visible in the camera image.


Prompt 1
Your task is to analyze images from a single CCTV camera, and generate a response in JSON format.
In your analysis, pay special attention to objectives listed below:
– custom scenario 1: Pay attention to whether the doors are open.
JSON Format Example###:
{{
"people_present": true/false,
"custom_scenario_1_condition_met": true/false,
}}
Ensure accuracy and attention to detail in your analysis.
LLM Answer (Pic. 1)
{
"people_present": false,
"custom_scenario_1_condition_met": false
}
LLM Answer (Pic. 2)
{
"people_present": false,
"custom_scenario_1_condition_met": true
}
Scenario no. 4
This time, we will put the model in the role of a Health and Safety Executive and instruct it to check whether entry into the hall is safe and possible for vehicles. The area that the model should pay attention to will be marked in the images from the CCTV cameras. We are only interested in the area marked in red.


Prompt 1
Your task is to analyze the image from a CCTV camera monitoring a warehouse loading bay entrance, and generate a response in JSON format. In your analysis, focus only on the area marked by the RED rectangle.
Pay special attention to the following safety objectives:
– Check if the loading bay entrance is clear and safe for vehicles to enter. Look for any obstructions or hazards on the ground that could impede safe vehicle access.
Provide your response in JSON format only, without any additional text.
JSON Format Example###:
{
"entrance_clear": true/false,
"hazards_identified": ["list any hazards or obstructions if present, otherwise leave empty"]
}
Ensure accuracy and attention to detail in your safety analysis of the marked loading bay entrance area.
LLM Answer (Pic. 1)
{
"entrance_clear": true,
"hazards_identified": []
}
LLM Answer (Pic. 2)
{
"entrance_clear": false,
"hazards_identified": ["Pallet with materials on the ground"]
}
Scenario no. 5
Now we will check how the model copes with the verification of false alarms. The CCTV footage comes from Hikvision cameras (AcuSense series). The cameras use motion detection or intrusion detection functionality, but they should only react to the presence of a human. However, it turns out that the cameras are triggering events even if there are no people in the camera’s view.




Prompt 1
Your task is to analyze images from a single CCTV camera, which might consist of either a single image or a sequence of images, and generate a response in JSON format. This analysis should provide a focused overview of the captured scene, highlighting key aspects relevant to security and safety without making assumptions about the personal characteristics of individuals observed. In your analysis, pay special attention to the objectives listed below:
– Identify the presence of people in the area.
– Assess any environmental hazards present (e.g., fire, flooding).
The JSON response should encapsulate your analysis, whether it is based on a single image or multiple images, with a structure that allows for informed security-related decisions, such as whether to activate alarms. Your description must be concise, focusing solely on the details relevant to the objectives. If no significant events or concerns are observed, simply state that “everything appears normal.”
JSON Format Example###:
{{
“time_of_day”: “morning/afternoon/evening/night/unknown”,
“people_present”: true/false,
“vehicles_present”: true/false,
“environmental_hazards”: true/false,
“gate_open”: true/false,
“door_open”: true/false,
“short_description”: “A brief description of the scene, focusing on the presence of key elements. If no significant changes or incidents, state ‘everything appears normal’. Avoid introductory phrases. Be brief and to the point.”
}}
Ensure accuracy and attention to detail in your analysis, focusing on activities and objects present. Avoid overemphasizing the absence of events or elements unless it’s directly relevant to the security assessment. The goal is to provide a precise and useful security overview without unnecessary speculation about the individuals involved.
LLM Answer (Pic. 1)
{
"time_of_day": "night",
"people_present": false,
"vehicles_present": true,
"environmental_hazards": false,
"gate_open": false,
"door_open": false,
"short_description": "The area is clear of people. One vehicle is parked inside. No environmental hazards or open gates/doors are visible. Everything appears normal."
}
LLM Answer (Pic. 2)
{
"time_of_day": "night",
"people_present": false,
"vehicles_present": true,
"environmental_hazards": false,
"gate_open": false,
"door_open": false,
"short_description": "The area is illuminated and shows one vehicle parked. No people are visible, and there are no immediate environmental hazards. The gate is closed. Everything appears normal."
}
LLM Answer (Pic. 3)
{
"time_of_day": "night",
"people_present": false,
"vehicles_present": true,
"environmental_hazards": false,
"gate_open": false,
"door_open": false,
"short_description": "The area is clear of people. Several vehicles are parked. No environmental hazards or open gates/doors are visible. Everything appears normal."
}
LLM Answer (Pic. 4)
{
"time_of_day": "night",
"people_present": false,
"vehicles_present": false,
"environmental_hazards": false,
"gate_open": false,
"door_open": false,
"short_description": "everything appears normal"
}
In this case, the model performed well during the verification of “false alarms” – there are no people in the photos. However, due to poor lighting conditions and weather, it provided an incorrect count of parked cars in Pic. 1 and 2. In Pic. 3, the model mistook pallets with construction materials for cars. But in this scenario, our goal was to check whether there were actually any people present in the images from the cameras or not.
Conclusions
The integration of Large Language Models (LLM) with Computer Vision (CV) has proven to be a highly effective approach in reducing false alarms in CCTV monitoring systems. Our company has successfully implemented an AI-based middleware system that verifies potential alarms before sending notifications to recipients, resulting in a remarkable 90% reduction in false alarms. Additionally, we have employed LLM with CV to generate automatic reports from CCTV monitoring, further enhancing the efficiency and effectiveness of our security solutions. However, it is crucial to acknowledge that language models are inherently non-deterministic, and as such, the responses generated by these models should be subject to additional verification. Occasionally, the results produced by the language model may be erroneous or inadequate in relation to the given prompt. Despite these limitations, the application of LLM with CV in CCTV monitoring systems has demonstrated significant potential in improving security management and reducing the reliance on human monitoring. As we continue to refine and develop these technologies, we anticipate even greater advancements in the field of AI-assisted security solutions.