Audio Captioning Based on Combined Audio and Semantic Embeddings

Eren, Aysegul Ozkaya; Sert, Mustafa

dc.contributor.author	Eren, Aysegul Ozkaya
dc.contributor.author	Sert, Mustafa
dc.date.accessioned	2023-09-08T08:24:56Z
dc.date.available	2023-09-08T08:24:56Z
dc.date.issued	2020
dc.description.abstract	Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. Most existing approaches use an encoder-decoder model without semantic information. In this study, we propose a bi-directional Gated Recurrent Unit (BiGRU) model based on an encoder-decoder architecture that uses audio and semantic embeddings. To obtain semantic embeddings, we extract subject-verb embeddings using the subjects and verbs from the audio captions. At the testing stage, we use a Multilayer Perceptron classifier to predict the subject-verb embeddings of test audio clips. To extract audio features, in addition to log Mel energies, we use a pretrained audio neural network (PANN) as a feature extractor, which is used here for the first time in the audio captioning task, to explore the usability of audio embeddings for this task. We combine audio embeddings and semantic embeddings to feed the BiGRU-based encoder-decoder model. We then evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms state-of-the-art results across different evaluation metrics and that the inclusion of semantic information enhances captioning performance.	en_US
dc.identifier.endpage	48	en_US
dc.identifier.isbn	978-1-7281-8697-9	en_US
dc.identifier.scopus	2-s2.0-85101449935	en_US
dc.identifier.startpage	41	en_US
dc.identifier.uri	http://hdl.handle.net/11727/10554
dc.identifier.wos	000654273000008	en_US
dc.language.iso	eng	en_US
dc.relation.isversionof	10.1109/ISM.2020.00014	en_US
dc.relation.journal	22nd IEEE International Symposium on Multimedia (IEEE ISM)	en_US
dc.rights	info:eu-repo/semantics/closedAccess	en_US
dc.subject	audio captioning	en_US
dc.subject	PANNs	en_US
dc.subject	GRU	en_US
dc.subject	BiGRU	en_US
dc.title	Audio Captioning Based on Combined Audio and Semantic Embeddings	en_US
dc.type	Conference Object	en_US
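
The abstract states that audio embeddings and semantic embeddings are combined before being fed to the BiGRU-based encoder-decoder, but this record does not say how. The sketch below illustrates one common option, assumed here for illustration only: appending a single clip-level semantic vector to every audio frame vector. The function name `combine_embeddings`, the concatenation strategy, and the toy dimensions are all hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def combine_embeddings(audio_frames: List[Vector], semantic: Vector) -> List[Vector]:
    """Append one clip-level semantic vector to every audio frame vector.

    Assumption: combination is done by concatenation along the feature axis.
    The record above does not specify the actual combination method.
    """
    if not audio_frames:
        raise ValueError("audio_frames must contain at least one frame")
    width = len(audio_frames[0])
    if any(len(frame) != width for frame in audio_frames):
        raise ValueError("all audio frames must have the same dimensionality")
    # Each output frame has len(frame) + len(semantic) features.
    return [list(frame) + list(semantic) for frame in audio_frames]


# Toy values only: 3 frames of 4-dimensional audio features and
# one 2-dimensional semantic (subject-verb) vector.
audio = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
semantic = [0.25, -0.75]
combined = combine_embeddings(audio, semantic)
```

In this sketch the resulting sequence keeps the same number of frames, and each frame grows by the size of the semantic vector, so a recurrent encoder would see the semantic information at every time step.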

Collections

Mühendislik Fakültesi / Faculty of Engineering