Audio Captioning with Composition of Acoustic and Semantic Information

dc.contributor.author: Eren, Aysegul Ozkaya
dc.contributor.author: Sert, Mustafa
dc.date.accessioned: 2022-09-05T09:46:51Z
dc.date.available: 2022-09-05T09:46:51Z
dc.date.issued: 2021
dc.description.abstract: Generating audio captions is a new research area that combines audio and natural language processing to create meaningful textual descriptions for audio clips. To address this problem, previous studies mostly use encoder-decoder-based models without considering semantic information. To fill this gap, we present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings. We extract semantic embeddings by obtaining subjects and verbs from the audio clip captions and combine these embeddings with audio embeddings to feed the BiGRU-based encoder-decoder model. To enable semantic embeddings for the test audio clips, we introduce a Multilayer Perceptron classifier to predict the semantic embeddings of those clips. We also present exhaustive experiments to show the efficiency of different features and datasets for our proposed model in the audio captioning task. To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings. Extensive experiments on two audio captioning datasets, Clotho and AudioCaps, show that our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics and that using semantic information improves captioning performance.
dc.identifier.endpage: 160
dc.identifier.issn: 1793-351X
dc.identifier.issue: 02
dc.identifier.scopus: 2-s2.0-85109474276
dc.identifier.startpage: 143
dc.identifier.uri: https://arxiv.org/pdf/2105.06355.pdf
dc.identifier.uri: http://hdl.handle.net/11727/7509
dc.identifier.volume: 15
dc.identifier.wos: 000670288200002
dc.language.iso: eng
dc.relation.isversionof: 10.1142/S1793351X21400018
dc.relation.journal: International Journal of Semantic Computing
dc.relation.publicationcategory: Article - International Refereed Journal
dc.rights: info:eu-repo/semantics/openAccess
dc.subject: Audio captioning
dc.subject: PANNs
dc.subject: VGGish
dc.subject: GRU
dc.subject: BiGRU
dc.title: Audio Captioning with Composition of Acoustic and Semantic Information
dc.type: article

Files

Original bundle

Name: ds98.pdf
Size: 812.2 KB
Format: Adobe Portable Document Format
