Date of Original Version

5-2014

Type

Conference Proceeding

Journal Title

Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

First Page

1360

Last Page

1364

Rights Management

© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract or Description

Audio semantic concepts (sound events) play an important role in audio-based content analysis. Capturing semantic information effectively from the complex occurrence patterns of sound events in YouTube-quality videos is a challenging problem. This paper presents a novel framework for extracting semantic information from real-world videos under these complex conditions, evaluated on the NIST Multimedia Event Detection (MED) task. We compute an occurrence-confidence matrix of sound events and explore multiple strategies for generating clip-level semantic features from this matrix. We evaluate performance on the TRECVID 2011 MED dataset, where the proposed method outperforms a previous HMM-based system. A late-fusion experiment with low-level features and a text feature (ASR) shows that audio semantic concepts capture complementary information in the soundtrack.

DOI

10.1109/ICASSP.2014.6853819


Published In

Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1360-1364.