Mühendislik Fakültesi / Faculty of Engineering

Permanent URI for this collection: https://hdl.handle.net/11727/1401


Search Results

Now showing 1 - 10 of 22
  • Item
    Multimodal Video Captioning Using Object-Auditory Information Fusion with Transformers
    (2023) Selbes, Berkay; Sert, Mustafa
    Video captioning aims to generate natural language descriptions of an input video. Generating coherent natural language sentences is challenging due to the complex nature of video content: it requires object and scene understanding, extraction of object- and event-specific auditory information, and acquisition of the relationships among objects. In this study, we address the problem of efficiently modeling object interactions in scenes, as they carry crucial information about the events in the visual scene. To this end, we propose to use object features along with auditory information to better model the audio-visual scene appearing within the video. Specifically, we extract object features with Faster R-CNN and auditory features with VGGish, and design a transformer encoder-decoder architecture in a multimodal setup. Experiments on MSR-VTT show encouraging results: object features, combined with auditory information, model object interactions better than ResNet features.
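    To make the fusion idea concrete, the following is a minimal PyTorch sketch of how per-frame object features and clip-level auditory features might be projected to a shared dimension, concatenated as one token sequence, and fed to a transformer encoder-decoder. The feature dimensions, vocabulary size, and layer settings are illustrative assumptions, not the configuration used in the paper.
```python
import torch
import torch.nn as nn

class MultimodalCaptioner(nn.Module):
    """Toy transformer encoder-decoder over fused object and audio tokens.
    Dimensions are illustrative placeholders, not the paper's configuration."""
    def __init__(self, obj_dim=2048, aud_dim=128, d_model=512, vocab=10000):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, d_model)  # project Faster R-CNN-style object features
        self.aud_proj = nn.Linear(aud_dim, d_model)  # project VGGish-style auditory features
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.embed = nn.Embedding(vocab, d_model)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, obj_feats, aud_feats, captions):
        # Fuse the modalities by concatenating their token sequences for the encoder.
        src = torch.cat([self.obj_proj(obj_feats), self.aud_proj(aud_feats)], dim=1)
        tgt_mask = self.transformer.generate_square_subsequent_mask(captions.size(1))
        dec = self.transformer(src, self.embed(captions), tgt_mask=tgt_mask)
        return self.out(dec)  # (batch, caption_len, vocab) logits

# Dummy batch: 8 object tokens, 4 audio tokens, 12-token captions.
model = MultimodalCaptioner()
logits = model(torch.randn(2, 8, 2048), torch.randn(2, 4, 128), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```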
  • Item
    Efficient Recognition of Human Emotional States from Audio Signals
    (2014) Erdem, Ernur Sonat; Sert, Mustafa; https://orcid.org/0000-0002-7056-4245; AAB-8673-2019
    Automatic recognition of human emotional states is an important task for efficient human-machine communication. Most existing works focus on the recognition of emotional states using audio signals alone, visual signals alone, or both. Here we propose empirical methods for feature extraction and classifier optimization that consider the temporal aspects of audio signals, and introduce our framework for efficiently recognizing human emotional states from audio signals. The framework is based on the prediction of input audio clips described using representative low-level features. In the experiments, seven (7) discrete emotional states (anger, fear, boredom, disgust, happiness, sadness, and neutral) from the EmoDB dataset are recognized and tested based on nineteen (19) audio features (15 standalone, 4 joint) using the Support Vector Machine (SVM) classifier. Extensive experiments demonstrate the effect of the feature extraction and classifier optimization methods on the recognition accuracy of the emotional states. Our experiments show that the feature extraction and classifier optimization procedures lead to a significant improvement of over 11% in emotion recognition. As a result, the overall recognition accuracy achieved for seven emotions in the EmoDB dataset is 83.33%, compared to the baseline accuracy of 72.22%.
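    As an illustration of the classifier optimization step, the scikit-learn sketch below grid-searches RBF-SVM hyperparameters with cross-validation over placeholder clip-level audio features. The feature dimensionality, labels, and parameter grid are assumptions for demonstration; real experiments would use features extracted from EmoDB.
```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder clip-level audio features with 7 emotion labels; real experiments
# would use features extracted from the EmoDB recordings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 39))
y = np.arange(200) % 7  # anger, fear, boredom, disgust, happiness, sadness, neutral

# Classifier optimization: grid-search RBF-SVM hyperparameters with cross-validation.
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```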
  • Item
    Audio-based Event Detection in Office Live Environments Using Optimized MFCC-SVM Approach
    (2015) Kucukbay, Selver Ezgi; Sert, Mustafa; 0000-0002-7056-4245; AAB-8673-2019
    Audio data contains many kinds of sounds and is an important source for multimedia applications. One such category is unstructured environmental sounds (also referred to as audio events), which have noise-like characteristics with flat spectra. Therefore, recognition methods designed for music and speech data are generally not appropriate for environmental sounds. In this paper, we propose an MFCC-SVM based approach that exploits the effect of feature representation and learner optimization for efficient recognition of audio events from audio signals. The proposed approach considers efficient representation of MFCC features using different window and hop sizes and different numbers of Mel coefficients, as well as optimization of the SVM parameters. Moreover, 16 different audio events from the IEEE Audio and Acoustic Signal Processing (AASP) Challenge Dataset, namely alert, clear throat, cough, door slam, drawer, keyboard, keys, knock, laughter, mouse, page turn, pen drop, phone, printer, speech, and switch, collected from live office environments, are utilized in the evaluations. Our empirical evaluations show that, with the selected MFCC feature settings and SVM classifier, 5-fold cross-validation yields Precision, Recall, and F-measure scores of 62%, 58%, and 55%, respectively. Extensive experiments on audio-based event detection using the IEEE AASP Challenge dataset show the effectiveness of the proposed approach.
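    A minimal sketch of such an MFCC-SVM pipeline, using librosa and scikit-learn with synthetic audio in place of the AASP recordings, is shown below. The window size, hop size, number of coefficients, and SVM parameters are illustrative knobs rather than the tuned values reported above.
```python
import numpy as np
import librosa
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def clip_mfcc(signal, sr, n_mfcc=13, n_fft=1024, hop_length=512):
    """Frame-level MFCCs summarized into one clip-level vector (mean and std).
    The window size, hop size, and coefficient count are the tunable knobs."""
    m = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])

# Synthetic one-second clips standing in for labelled office sounds (AASP data in the paper).
sr = 16000
rng = np.random.default_rng(1)
X = np.array([clip_mfcc(rng.normal(size=sr), sr) for _ in range(160)])
labels = np.arange(160) % 16  # 16 event classes, e.g. cough, door slam, keyboard

# 5-fold cross-validation of an RBF SVM, mirroring the evaluation protocol above.
print(round(cross_val_score(SVC(C=10, gamma="scale"), X, labels, cv=5).mean(), 3))
```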
  • Item
    Audio Captioning Based on Combined Audio and Semantic Embeddings
    (2020) Eren, Aysegul Ozkaya; Sert, Mustafa
    Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. Most existing approaches use the encoder-decoder model without semantic information. In this study, we propose a bi-directional Gated Recurrent Unit (BiGRU) model based on an encoder-decoder architecture using audio and semantic embeddings. To obtain semantic embeddings, we extract subject-verb embeddings using the subjects and verbs from the audio captions. At the testing stage, a Multilayer Perceptron classifier predicts the subject-verb embeddings of test audio clips. To extract audio features, in addition to log Mel energies, we use a pretrained audio neural network (PANN) as a feature extractor, employed here for the first time in audio captioning to explore the usability of audio embeddings for the task. We combine the audio embeddings and semantic embeddings to feed the BiGRU-based encoder-decoder model. We evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms state-of-the-art results across different evaluation metrics, and that the inclusion of semantic information enhances captioning performance.
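    The toy PyTorch model below sketches one way such audio and semantic embeddings could be fused by concatenation and passed through a BiGRU encoder followed by a GRU decoder. All dimensions and the exact fusion point are assumptions for illustration, not the architecture reported in the paper.
```python
import torch
import torch.nn as nn

class BiGRUCaptioner(nn.Module):
    """Toy BiGRU encoder-decoder over audio embeddings concatenated with a
    clip-level semantic (subject-verb) embedding; all sizes are illustrative."""
    def __init__(self, aud_dim=2048, sem_dim=300, hid=256, vocab=5000):
        super().__init__()
        self.encoder = nn.GRU(aud_dim + sem_dim, hid, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab, hid)
        self.decoder = nn.GRU(hid, 2 * hid, batch_first=True)
        self.out = nn.Linear(2 * hid, vocab)

    def forward(self, aud_seq, sem_vec, captions):
        # Broadcast the semantic embedding over time and fuse by concatenation.
        sem_seq = sem_vec.unsqueeze(1).expand(-1, aud_seq.size(1), -1)
        _, h = self.encoder(torch.cat([aud_seq, sem_seq], dim=-1))
        h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)  # merge both GRU directions
        dec_out, _ = self.decoder(self.embed(captions), h0)
        return self.out(dec_out)  # (batch, caption_len, vocab) logits

model = BiGRUCaptioner()
logits = model(torch.randn(2, 10, 2048), torch.randn(2, 300), torch.randint(0, 5000, (2, 15)))
print(logits.shape)  # torch.Size([2, 15, 5000])
```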
  • Item
    Combining Acoustic and Semantic Similarity for Acoustic Scene Retrieval
    (2019) Sert, Mustafa; Basbug, Ahmet Melih
    Automatic retrieval of acoustic scenes in large audio collections is a challenging task due to the complex structure of these sounds. A robust and flexible retrieval system should address both the acoustic and semantic aspects of these sounds and how to combine them. In this study, we introduce an acoustic scene retrieval system that uses a combined acoustic and semantic similarity method. To address the acoustic aspects of sound scenes, we use a cascaded convolutional neural network (CNN) with a gated recurrent unit (GRU). The acoustic similarity is calculated in feature space using the Euclidean distance, and the semantic similarity is obtained using the Path Similarity method of WordNet. Two performance datasets, TAU Urban Acoustic Scenes 2019 and TUT Urban Acoustic Scenes 2018, are used to compare the proposed retrieval system with the literature and with the developed baseline. Results show that the semantic similarity improves the mAP and P@k scores.
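    A minimal sketch of a combined similarity score, using Euclidean-distance-based acoustic similarity and NLTK's WordNet path similarity between scene labels, is given below. The mixing weight and the placeholder embeddings are assumptions, not values from the paper.
```python
import numpy as np
from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")

def acoustic_similarity(query_emb, item_emb):
    """Turn the Euclidean distance between embeddings into a similarity in (0, 1]."""
    return 1.0 / (1.0 + np.linalg.norm(query_emb - item_emb))

def semantic_similarity(label_a, label_b):
    """WordNet path similarity between the first noun senses of two scene labels."""
    a, b = wn.synsets(label_a, pos=wn.NOUN)[0], wn.synsets(label_b, pos=wn.NOUN)[0]
    return a.path_similarity(b) or 0.0

def combined_score(query_emb, query_label, item_emb, item_label, alpha=0.5):
    # alpha is an assumed mixing weight, not a value taken from the paper.
    return alpha * acoustic_similarity(query_emb, item_emb) + \
           (1 - alpha) * semantic_similarity(query_label, item_label)

# Toy query against two indexed scenes described by random placeholder embeddings.
rng = np.random.default_rng(2)
query = rng.normal(size=128)
for label, emb in [("park", rng.normal(size=128)), ("airport", rng.normal(size=128))]:
    print(label, round(combined_score(query, "street", emb, label), 3))
```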
  • Item
    Multimodal Classification of Obstructive Sleep Apnea using Feature Level Fusion
    (2017) Memis, Gokhan; Sert, Mustafa; 0000-0002-7056-4245; AAB-8673-2019
    Obstructive sleep apnea (OSA) is a sleep disorder with long-term consequences, including sleep-related issues and cardiovascular diseases. OSA is often diagnosed with an overnight sleep test called a polysomnogram. Monitoring can be costly, with long wait times for diagnosis, so efficient computer-based algorithms are needed. Here, we employ a multimodal approach that performs feature-level fusion of two physiological signals, namely the electrocardiogram (ECG) and peripheral oxygen saturation (SpO2), for efficient OSA classification. We design Naive Bayes (NB), k-nearest neighbor (kNN), and Support Vector Machine (SVM) classifiers as the learning algorithms and present extensive empirical information regarding the utilized fusion strategy. Unlike existing methods, which either consider a single signal modality or test on subjects with the same sleep apnea severity (i.e., high degree of apnea, low degree of apnea, or no apnea), we also define a test scenario that employs subjects with different sleep apnea severities to show the effectiveness of our approach. Our experimental results on real clinical examples from the PhysioNet database show that the proposed multimodal feature-level fusion approach gives the best classification rates when using SVM, with an average accuracy of 96.64% across all test scenarios: within subject with the same severity (99.49%), between subjects with the same sleep apnea severity (95.35%), and between subjects with distinct sleep apnea severities (95.07%).
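    The scikit-learn sketch below illustrates feature-level fusion in its simplest form: per-segment ECG and SpO2 feature vectors are concatenated before classification and compared against single-modality baselines. The placeholder features and SVM settings are assumptions; real experiments would derive features from PhysioNet recordings.
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder per-segment descriptors standing in for ECG and SpO2 features
# (real experiments would compute these from PhysioNet recordings).
rng = np.random.default_rng(3)
n = 300
ecg_feats = rng.normal(size=(n, 20))
spo2_feats = rng.normal(size=(n, 6))
labels = np.arange(n) % 2  # apnea vs. non-apnea segment

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10))

# Feature-level fusion: concatenate the two modality vectors before classification.
fused = np.hstack([ecg_feats, spo2_feats])
for name, X in [("ECG only", ecg_feats), ("SpO2 only", spo2_feats), ("fused", fused)]:
    print(name, round(cross_val_score(clf, X, labels, cv=5).mean(), 3))
```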
  • Item
    Feature-level Fusion of Deep Convolutional Neural Networks for Sketch Recognition on Smartphones
    (2017) Boyaci, Emel; Sert, Mustafa; 0000-0002-7056-4245; AAB-8673-2019
  • Item
    Efficient Bag of Words Based Concept Extraction for Visual Object Retrieval
    (2016) Ergun, Hilal; Sert, Mustafa; https://orcid.org/0000-0002-7056-4245; AAB-8673-2019
    The recent burst of multimedia content available on the Internet is pushing expectations of multimedia retrieval systems to even higher grounds. Multimedia retrieval systems should offer better performance in terms of both speed and memory consumption while maintaining good accuracy compared to state-of-the-art implementations. In this paper, we discuss alternative implementations of visual object retrieval systems based on the popular bag-of-words model and show an optimal selection of processing steps. We demonstrate our approach using both keyword- and example-based retrieval queries on three frequently used benchmark databases, namely Oxford, Paris, and Pascal VOC 2007. Additionally, we investigate the effect of different distance comparison metrics on retrieval accuracy. Results show that relatively simple but efficient vector quantization, together with the adapted inverted index structure, can compete with more sophisticated feature encoding schemes.
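    The sketch below illustrates the basic bag-of-words retrieval machinery discussed above: hard vector quantization of local descriptors against a k-means codebook and an inverted index for example-based queries. The random descriptors, vocabulary size, and voting scheme are simplifying assumptions, not the paper's optimized pipeline.
```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import MiniBatchKMeans

# Random local descriptors stand in for SIFT-like features of a tiny image collection.
rng = np.random.default_rng(4)
images = {i: rng.normal(size=(200, 64)) for i in range(5)}

# Vector quantization: learn a small visual vocabulary and hard-assign each
# descriptor to its nearest visual word, as in the plain bag-of-words model.
codebook = MiniBatchKMeans(n_clusters=32, n_init=3, random_state=0)
codebook.fit(np.vstack(list(images.values())))

# Inverted index: visual word -> set of image ids that contain it.
inverted = defaultdict(set)
for img_id, descriptors in images.items():
    for word in codebook.predict(descriptors):
        inverted[word].add(img_id)

# Example-based query: vote for images that share visual words with the query.
query_words = codebook.predict(rng.normal(size=(50, 64)))
votes = defaultdict(int)
for word in query_words:
    for img_id in inverted[word]:
        votes[img_id] += 1
print(sorted(votes.items(), key=lambda kv: -kv[1]))
```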
  • Item
    Early and Late Level Fusion of Deep Convolutional Neural Networks for Visual Concept Recognition
    (2016) Ergun, Hilal; Akyuz, Yusuf Caglar; Sert, Mustafa; Liu, Jianquan; 0000-0002-7056-4245; B-1296-2011; D-3080-2015; AAB-8673-2019
    Visual concept recognition has been an active research field over the last decade. Reflecting this attention, deep learning architectures have shown great promise in various computer vision domains, including image classification, object detection, event detection, and action recognition in videos. In this study, we investigate various aspects of convolutional neural networks for visual concept recognition. We analyze recent studies and different network architectures in terms of both running time and accuracy. In our proposed visual concept recognition system, we first discuss important properties of the popular convolutional network architectures under consideration. Then we describe our method for feature extraction at different levels of abstraction. We present extensive empirical information along with best practices for big-data practitioners. Using these best practices, we propose efficient fusion mechanisms for both single and multiple network models. We present state-of-the-art results on benchmark datasets while keeping computational costs at a low level. Our results show that these results can be reached without extensive data augmentation techniques.
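    To clarify the two fusion levels, the sketch below contrasts early fusion (concatenating the features of two networks before a single classifier) with late fusion (averaging per-model class probabilities). Placeholder features and a linear classifier stand in for the CNN activations and learners used in the paper.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder activations of two different networks on the same images.
rng = np.random.default_rng(5)
n = 400
feats_a = rng.normal(size=(n, 256))  # e.g. penultimate layer of network A
feats_b = rng.normal(size=(n, 128))  # e.g. penultimate layer of network B
y = rng.integers(0, 3, size=n)       # three visual concepts

tr, te = train_test_split(np.arange(n), test_size=0.25, random_state=0)

# Early fusion: concatenate the feature vectors, then train a single classifier.
early = LogisticRegression(max_iter=1000).fit(np.hstack([feats_a, feats_b])[tr], y[tr])
early_scores = early.predict_proba(np.hstack([feats_a, feats_b])[te])

# Late fusion: train one classifier per network and average their class probabilities.
clf_a = LogisticRegression(max_iter=1000).fit(feats_a[tr], y[tr])
clf_b = LogisticRegression(max_iter=1000).fit(feats_b[tr], y[tr])
late_scores = (clf_a.predict_proba(feats_a[te]) + clf_b.predict_proba(feats_b[te])) / 2

print("early fusion accuracy:", (early_scores.argmax(1) == y[te]).mean())
print("late fusion accuracy: ", (late_scores.argmax(1) == y[te]).mean())
```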
  • Item
    Multimodal Vehicle Type Classification Using Convolutional Neural Network and Statistical Representations of MFCC
    (2017) Selbes, Berkay; Sert, Mustafa; 0000-0002-7056-4245; AAB-8673-2019
    Recognition of vehicle types in real-life traffic scenarios is a challenging task due to the diversity of vehicles and uncontrolled environments. Efficient methods and feature representations are needed to cope with these challenges. In this paper, we address the vehicle type classification problem in real-life traffic scenarios and propose a multimodal method that uses efficient representations of audio-visual modalities in a fusion context. We first separate the audio and visual modalities from the video data by extracting keyframes and the corresponding audio fragments. Then we extract deep convolutional neural network (CNN) features and Mel Frequency Cepstral Coefficient (MFCC) features from the visual and audio modalities, respectively. The Principal Component Analysis (PCA) algorithm is applied to the visual features, and various statistical representations of the MFCC feature vectors are calculated to select representative features. These representations are then fused to form a robust multimodal feature. Finally, we train Support Vector Machine (SVM) classifiers for the final classification of vehicle types using the obtained multimodal features. We evaluate the effectiveness of the proposed method on the TRECVID 2012 SIN video performance dataset for both single- and multi-modal cases. Our results show that fusing the proposed MFCC representations with the GoogLeNet CNN features improves the classification accuracy.
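    A minimal sketch of the kind of pipeline described above is given below: clip-level statistical summaries of MFCC frames, PCA-reduced placeholder visual features, concatenation into a multimodal vector, and an SVM classifier. The specific statistics, PCA dimensionality, and SVM settings are assumptions for illustration, not the paper's tuned configuration.
```python
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def mfcc_stats(signal, sr, n_mfcc=13):
    """Statistical representation of frame-level MFCCs: per-coefficient mean, std, min, max."""
    m = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([m.mean(axis=1), m.std(axis=1), m.min(axis=1), m.max(axis=1)])

# Placeholder data: random audio fragments and random "CNN" keyframe features.
rng = np.random.default_rng(6)
sr, n = 16000, 80
audio_feats = np.array([mfcc_stats(rng.normal(size=sr), sr) for _ in range(n)])
visual_feats = rng.normal(size=(n, 1024))
labels = np.arange(n) % 4  # e.g. car, bus, truck, motorcycle

# Reduce the visual features with PCA, fuse by concatenation, and classify with an SVM.
visual_small = PCA(n_components=32).fit_transform(visual_feats)
fused = np.hstack([visual_small, audio_feats])
print(round(cross_val_score(SVC(kernel="rbf", C=10), fused, labels, cv=5).mean(), 3))
```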