Hello, and thank you for your work. In the video feature extraction part of the paper you write: "This involves scene detection, object detection (Ren et al., 2015), face detection (Zhang et al., 2017), face tracking, and audio-visual active speaker detection (Tao et al., 2021), as described in (Zhang et al., 2022a). This process can generate more than 1,000K high-quality keyframes with speaker bounding boxes in approximately 5 days. Next, we use these annotated RoIs and employ the instance segmentation method, Mask R-CNN (He et al., 2017), pre-trained on the COCO (Lin et al., 2014) dataset to extract visual features." Roughly how many days did the subsequent speaker feature extraction over those 1,000K frames take? And which GPU model, and how many GPUs, did you use for the whole pipeline (scene detection, object detection, face tracking, feature extraction, and so on)?