Processing and narrating a video with GPT's visual capabilities and the TTS API


Original author: Kai Chen, published Nov 6, 2023

Original article: https://cookbook.openai.com/examples/gpt_with_vision_for_video_understanding

This notebook demonstrates how to use GPT's visual capabilities with a video. GPT-4 doesn't take videos as input directly, but we can use vision and the new 128K context window to describe the static frames of a whole video at once. We'll walk through two examples:

1. Using GPT-4 to get a description of a video
2. Generating a voiceover for a video with GPT-4 and the TTS API

Code block

from IPython.display import display, Image, Audio

import cv2  # We're using OpenCV to read video
import base64
import time
import openai
import os
import requests

1. Using GPT's visual capabilities to get a description of a video

First, we use OpenCV to extract frames from a nature video containing bison and wolves:

Code block

video = cv2.VideoCapture("data/bison.mp4")

base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break  # No more frames to read
    # Encode the frame as JPEG, then as a base64 string
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")

618 frames read.
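
Holding all 618 frames is fine for display, but a request usually needs far fewer. As a minimal sketch, the helper below keeps every Nth item of a list. The name `sample_every_nth` and the placeholder strings are hypothetical and only stand in for the encoded frames.

```python
def sample_every_nth(frames, step):
    """Return every `step`-th item of `frames`, starting with the first."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return frames[::step]


# Placeholder strings standing in for the 618 base64-encoded frames.
placeholder_frames = [f"frame-{i}" for i in range(618)]

sampled = sample_every_nth(placeholder_frames, 10)
print(len(sampled), "of", len(placeholder_frames), "frames kept")
# prints: 62 of 618 frames kept
```

A larger step sends fewer images per request, at the cost of possibly missing short events in the video.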

Display the frames to make sure we've read them in correctly:

Code block

display_handle = display(None, display_id=True)
for img in base64Frames:
    # Decode each base64 string back to JPEG bytes and redraw in place
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8"))))
    time.sleep(0.025)
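
The extraction and display steps rely on a base64 round trip: JPEG bytes become a UTF-8 string for storage, and the string is decoded back to bytes for display. The sketch below isolates that round trip using only the standard library. The helper names `encode_frame` and `decode_frame` are hypothetical, and the byte string is a stand-in, not a real image.

```python
import base64


def encode_frame(jpeg_bytes: bytes) -> str:
    """Encode raw JPEG bytes as a base64 text string."""
    return base64.b64encode(jpeg_bytes).decode("utf-8")


def decode_frame(encoded: str) -> bytes:
    """Recover the raw JPEG bytes from a base64 text string."""
    return base64.b64decode(encoded.encode("utf-8"))


# Stand-in payload: JPEG files begin with the bytes FF D8, but this is
# not a decodable image, only sample data for the round trip.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 8

encoded = encode_frame(fake_jpeg)
restored = decode_frame(encoded)
print(restored == fake_jpeg)  # prints: True
```

Base64 text is about one third larger than the bytes it encodes, which is one more reason to limit how many frames are sent.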


Once we have the video frames, we craft our prompt and send a request to GPT (note that we don't need to send every frame for GPT to understand what's going on):

Code block

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.",
            # Send only every 10th frame
            *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::10]),
        ],
    },
]
params = {
    "model": "gpt-4-vision-preview",
    "messages": PROMPT_MESSAGES,
    "api_key": os.environ["OPENAI_API_KEY"],
    "headers": {"Openai-Version": "2020-11-07"},
    "max_tokens": 200,
}

result = openai.ChatCompletion.create(**params)
print(result.choices[0].message.content)
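
The request above uses the message shape and the `openai.ChatCompletion` interface from the notebook as published. Version 1.x of the `openai` Python package removed `openai.ChatCompletion` in favor of a client object, and the Chat Completions API documents image inputs as typed content parts carrying a data URL. As a hedged sketch of that alternative shape, the hypothetical helper `build_vision_messages` below only builds the message list and makes no API call. Model names and other request parameters are deliberately left out because they change over time.

```python
def build_vision_messages(prompt, encoded_frames):
    """Build a single user message with one text part and one
    image part per base64-encoded JPEG frame."""
    content = [{"type": "text", "text": prompt}]
    for frame in encoded_frames:
        content.append(
            {
                "type": "image_url",
                # Data URL wrapping the base64-encoded JPEG
                "image_url": {"url": f"data:image/jpeg;base64,{frame}"},
            }
        )
    return [{"role": "user", "content": content}]


# Placeholder strings stand in for real base64-encoded frames.
messages = build_vision_messages(
    "These are frames from a video. Describe what happens.",
    ["AAAA", "BBBB"],
)
print(len(messages[0]["content"]))  # prints: 3
```

With a real frame list, the sampled slice (for example `base64Frames[0::10]`) would be passed as `encoded_frames`.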
