Audio Captioning Based on Combined Audio and Semantic Embeddings

dc.contributor.author: Eren, Aysegul Ozkaya
dc.contributor.author: Sert, Mustafa
dc.date.accessioned: 2023-09-08T08:24:56Z
dc.date.available: 2023-09-08T08:24:56Z
dc.date.issued: 2020
dc.description.abstract: Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. Most existing approaches use an encoder-decoder model without semantic information. In this study, we propose a bidirectional Gated Recurrent Unit (BiGRU) model based on an encoder-decoder architecture that uses audio and semantic embeddings. To obtain semantic embeddings, we extract subject-verb embeddings using the subjects and verbs from the audio captions. At the testing stage, we use a Multilayer Perceptron classifier to predict the subject-verb embeddings of test audio clips. To extract audio features, in addition to log Mel energies, we use a pretrained audio neural network (PANN) as a feature extractor, which is used here for the first time in the audio captioning task, to explore the usability of audio embeddings for this task. We combine audio embeddings and semantic embeddings to feed the BiGRU-based encoder-decoder model. We then evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms state-of-the-art results across different evaluation metrics and that the inclusion of semantic information enhances captioning performance. [en_US]
dc.identifier.endpage: 48 [en_US]
dc.identifier.isbn: 978-1-7281-8697-9 [en_US]
dc.identifier.scopus: 2-s2.0-85101449935 [en_US]
dc.identifier.startpage: 41 [en_US]
dc.identifier.uri: http://hdl.handle.net/11727/10554
dc.identifier.wos: 000654273000008 [en_US]
dc.language.iso: eng [en_US]
dc.relation.isversionof: 10.1109/ISM.2020.00014 [en_US]
dc.relation.journal: 22nd IEEE International Symposium on Multimedia (IEEE ISM) [en_US]
dc.rights: info:eu-repo/semantics/closedAccess [en_US]
dc.subject: audio captioning [en_US]
dc.subject: PANNs [en_US]
dc.subject: GRU [en_US]
dc.subject: BiGRU [en_US]
dc.title: Audio Captioning Based on Combined Audio and Semantic Embeddings [en_US]
dc.type: Conference Object [en_US]
