Mühendislik Fakültesi / Faculty of Engineering

Permanent URI for this collection: https://hdl.handle.net/11727/1401

Search Results

Now showing 1 - 10 of 34
  • Item
    Multimodal Video Captioning Using Object-Auditory Information Fusion with Transformers
    (2023) Selbes, Berkay; Sert, Mustafa
    Video captioning aims to generate natural language sentences describing an input video. Generating coherent natural language sentences is challenging due to the complex nature of video content, which requires object and scene understanding, extraction of object- and event-specific auditory information, and acquisition of the relationships among objects. In this study, we address the problem of efficiently modeling object interactions in scenes, as they carry crucial information regarding the events in the visual scene. To this end, we propose to use object features along with auditory information to better model the audio-visual scene appearing within the video. Specifically, we extract object features using Faster R-CNN and auditory features using VGGish, and we design a transformer encoder-decoder architecture in a multimodal setup. Experiments on the MSR-VTT dataset show encouraging results: together with the auditory information, object features model object interactions better than ResNet features.
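    A minimal sketch of the fusion idea, assuming precomputed Faster R-CNN object features and VGGish audio embeddings; all dimensions and the plain nn.Transformer stand-in are illustrative assumptions, not the authors' exact architecture:

    ```python
    # Hypothetical sketch: fuse precomputed object (Faster R-CNN) and audio
    # (VGGish) features, then decode captions with a transformer.
    import torch
    import torch.nn as nn

    class MultimodalCaptioner(nn.Module):
        def __init__(self, obj_dim=2048, aud_dim=128, d_model=512, vocab=10000):
            super().__init__()
            self.obj_proj = nn.Linear(obj_dim, d_model)  # project object features
            self.aud_proj = nn.Linear(aud_dim, d_model)  # project auditory features
            self.embed = nn.Embedding(vocab, d_model)    # caption token embeddings
            self.transformer = nn.Transformer(d_model, batch_first=True)
            self.out = nn.Linear(d_model, vocab)

        def forward(self, obj_feats, aud_feats, tokens):
            # Concatenate the two modalities along the sequence axis.
            memory = torch.cat([self.obj_proj(obj_feats),
                                self.aud_proj(aud_feats)], dim=1)
            tgt = self.embed(tokens)
            mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
            dec = self.transformer(memory, tgt, tgt_mask=mask)
            return self.out(dec)

    model = MultimodalCaptioner()
    logits = model(torch.randn(2, 36, 2048),            # 36 object regions per video
                   torch.randn(2, 10, 128),             # 10 VGGish frames
                   torch.randint(0, 10000, (2, 20)))    # caption token ids
    print(logits.shape)  # (2, 20, 10000)
    ```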
  • Item
    Efficient Recognition of Human Emotional States from Audio Signals
    (2014) Erdem, Ernur Sonat; Sert, Mustafa; https://orcid.org/0000-0002-7056-4245; AAB-8673-2019
    Automatic recognition of human emotional states is an important task for efficient human-machine communication. Most existing works focus on recognizing emotional states from audio signals alone, visual signals alone, or both. Here, we propose empirical methods for feature extraction and classifier optimization that consider the temporal aspects of audio signals, and we introduce our framework for efficiently recognizing human emotional states from audio signals. The framework is based on the prediction of input audio clips that are described using representative low-level features. In the experiments, seven (7) discrete emotional states (anger, fear, boredom, disgust, happiness, sadness, and neutral) from the EmoDB dataset are recognized and tested using nineteen (19) audio features (15 standalone, 4 joint) with the Support Vector Machine (SVM) classifier. Extensive experiments have been conducted to demonstrate the effect of the feature extraction and classifier optimization methods on the recognition accuracy of the emotional states. Our experiments show that the feature extraction and classifier optimization procedures lead to a significant improvement of over 11% in emotion recognition: the overall recognition accuracy achieved for the seven emotions in the EmoDB dataset is 83.33%, compared to the baseline accuracy of 72.22%.
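    A hedged sketch of the classifier-optimization step, assuming clip-level feature vectors have already been extracted; the feature dimensions, labels, and parameter grid are placeholders, not the paper's exact settings:

    ```python
    # Illustrative sketch: optimize an SVM over audio emotion features.
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X = np.random.rand(535, 38)       # placeholder: one feature vector per clip
    y = np.random.randint(0, 7, 535)  # 7 discrete emotions (EmoDB has 535 clips)

    grid = GridSearchCV(
        make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        param_grid={"svc__C": [1, 10, 100],
                    "svc__gamma": ["scale", 0.01, 0.001]},
        cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)
    ```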
  • Item
    Audio-based Event Detection in Office Live Environments Using Optimized MFCC-SVM Approach
    (2015) Kucukbay, Selver Ezgi; Sert, Mustafa; https://orcid.org/0000-0002-7056-4245; AAB-8673-2019
    Audio data contains many kinds of sounds and is an important source for multimedia applications. One such kind is unstructured environmental sounds (also referred to as audio events), which have noise-like characteristics with flat spectra. Therefore, recognition methods designed for music and speech data are generally not appropriate for environmental sounds. In this paper, we propose an MFCC-SVM based approach that exploits feature representation and learner optimization for efficient recognition of audio events from audio signals. The proposed approach considers efficient representation of MFCC features using different window and hop sizes and varying numbers of Mel coefficients, as well as optimization of the SVM parameters. In the evaluations, we use 16 different audio events collected from live office environments in the IEEE Audio and Acoustic Signal Processing (AASP) Challenge Dataset, namely alert, clear throat, cough, door slam, drawer, keyboard, keys, knock, laughter, mouse, page turn, pen drop, phone, printer, speech, and switch. Our empirical evaluations show that, with the proposed MFCC feature representation and SVM configuration, tests conducted using 5-fold cross-validation give Precision, Recall, and F-measure scores of 62%, 58%, and 55%, respectively. Extensive experiments on audio-based event detection using the IEEE AASP Challenge dataset show the effectiveness of the proposed approach.
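    An illustrative sketch of the kind of MFCC analysis the approach varies (window size, hop size, number of Mel coefficients); the synthetic clips, clip-level statistics, and SVM settings are assumptions:

    ```python
    # Hedged sketch of an MFCC-SVM pipeline with 5-fold cross-validation.
    import librosa
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def clip_features(y, sr, n_mfcc=13, win=1024, hop=512):
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=win, hop_length=hop)
        # Summarize frame-level MFCCs with clip-level statistics.
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # Placeholder clips: in practice, load the office recordings with librosa.load.
    X = np.stack([clip_features(np.random.randn(44100), 44100) for _ in range(80)])
    y = np.repeat(np.arange(16), 5)  # 16 office audio event classes
    print(cross_val_score(SVC(C=10), X, y, cv=5).mean())
    ```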
  • Item
    Video Scene Classification Using Spatial Pyramid Based Features
    (2014) Sert, Mustafa; Ergun, Hilal; https://orcid.org/0000-0002-7056-4245; AAB-8673-2019
    Recognition of video scenes is a challenging problem due to the unconstrained structure of video content. Here, we propose a spatial pyramid based method for the recognition of video scenes and explore the effect of parameter optimization on the recognition accuracy. In the experiments, different sampling methods, dictionary sizes, kernel methods, and pyramid levels are examined. The Support Vector Machine (SVM) is employed for classification due to its success in pattern recognition applications. Our experiments show that the dictionary size and a proper choice of pyramid levels in the feature representation drastically enhance the recognition accuracy.
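    A rough sketch of a generic spatial pyramid representation, assuming local descriptors have already been quantized against a visual dictionary; the grid levels, dictionary size, and normalization are illustrative choices:

    ```python
    # Illustrative spatial pyramid: pool quantized descriptor codes over
    # 1x1, 2x2, and 4x4 grids and concatenate the per-cell histograms.
    import numpy as np

    def spatial_pyramid(codes, xy, img_w, img_h, dict_size, levels=(1, 2, 4)):
        """codes: dictionary index per descriptor; xy: its (x, y) position."""
        hists = []
        for g in levels:
            cell = np.minimum((xy / [img_w, img_h] * g).astype(int), g - 1)
            for cx in range(g):
                for cy in range(g):
                    mask = (cell[:, 0] == cx) & (cell[:, 1] == cy)
                    hists.append(np.bincount(codes[mask], minlength=dict_size))
        h = np.concatenate(hists).astype(float)
        return h / max(h.sum(), 1.0)  # L1-normalize the pyramid vector

    codes = np.random.randint(0, 1000, 500)    # 500 quantized local descriptors
    xy = np.random.rand(500, 2) * [640, 480]   # their frame coordinates
    print(spatial_pyramid(codes, xy, 640, 480, dict_size=1000).shape)  # (21000,)
    ```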
  • Item
    Audio Captioning Based on Combined Audio and Semantic Embeddings
    (2020) Eren, Aysegul Ozkaya; Sert, Mustafa
    Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. Most existing approaches use an encoder-decoder model without semantic information. In this study, we propose a bi-directional Gated Recurrent Unit (BiGRU) model based on an encoder-decoder architecture using audio and semantic embeddings. To obtain semantic embeddings, we extract subject-verb embeddings using the subjects and verbs from the audio captions. At the testing stage, we use a Multilayer Perceptron classifier to predict the subject-verb embeddings of test audio clips. To extract audio features, in addition to log Mel energies, we use a pretrained audio neural network (PANN) as a feature extractor, employed here for the first time in the audio captioning task, to explore the usability of audio embeddings for captioning. We combine the audio and semantic embeddings to feed the BiGRU-based encoder-decoder model, and we evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms state-of-the-art results across different evaluation metrics and that the inclusion of semantic information enhances captioning performance.
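    A minimal sketch of the combined-embedding idea, assuming precomputed PANN audio embeddings and a clip-level subject-verb semantic embedding; all dimensions and the way the two are concatenated are assumptions for illustration:

    ```python
    # Sketch: feed combined audio + semantic embeddings to a BiGRU encoder-decoder.
    import torch
    import torch.nn as nn

    class BiGRUCaptioner(nn.Module):
        def __init__(self, aud_dim=2048, sem_dim=300, hid=256, vocab=5000):
            super().__init__()
            self.encoder = nn.GRU(aud_dim + sem_dim, hid, bidirectional=True,
                                  batch_first=True)
            self.embed = nn.Embedding(vocab, hid)
            self.decoder = nn.GRU(hid, 2 * hid, batch_first=True)
            self.out = nn.Linear(2 * hid, vocab)

        def forward(self, audio, semantic, tokens):
            # Tile the clip-level semantic embedding across audio time steps.
            sem = semantic.unsqueeze(1).expand(-1, audio.size(1), -1)
            _, h = self.encoder(torch.cat([audio, sem], dim=-1))
            h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)  # merge directions
            dec, _ = self.decoder(self.embed(tokens), h0)
            return self.out(dec)

    model = BiGRUCaptioner()
    logits = model(torch.randn(2, 31, 2048), torch.randn(2, 300),
                   torch.randint(0, 5000, (2, 15)))
    print(logits.shape)  # (2, 15, 5000)
    ```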
  • Item
    Analysis of Deep Neural Network Models for Acoustic Scene Classification
    (2019) Basbug, Ahmet Melih; Sert, Mustafa
    Acoustic scene classification is an active field in both the audio signal processing and machine learning communities. Due to uncontrolled environment characteristics and the great diversity of environmental sounds, the classification of acoustic environment recordings by computer systems is a challenging task. In this study, we analyze the performance of deep learning algorithms on the acoustic scene classification problem, in which sound events carry continuous information. For this purpose, the success of AlexNet- and VGGish-based 4- and 8-layered convolutional neural networks utilizing long short-term memory recurrent neural network (LSTM-RNN) and Gated Recurrent Unit recurrent neural network (GRU-RNN) architectures is analyzed for this classification task. In this direction, we adapt the LSTM-RNN and GRU-RNN models to the 4- and 8-layered CNN architectures for the classification. Our experimental results show that the 4-layered CNN with the GRU structure improves the accuracy.
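    A rough sketch of a 4-layer CNN feeding a GRU over time, in the spirit of the analyzed CNN-GRU models; the filter counts, pooling scheme, and input shape are assumptions, not the exact architectures from the study:

    ```python
    # Hypothetical 4-layer CNN + GRU for acoustic scene classification.
    import torch
    import torch.nn as nn

    class CNNGRU(nn.Module):
        def __init__(self, n_mels=64, n_classes=10):
            super().__init__()
            layers, ch = [], 1
            for out_ch in (32, 64, 128, 128):            # 4 conv layers
                layers += [nn.Conv2d(ch, out_ch, 3, padding=1), nn.ReLU(),
                           nn.MaxPool2d((2, 1))]         # pool frequency, keep time
                ch = out_ch
            self.cnn = nn.Sequential(*layers)
            self.gru = nn.GRU(128 * (n_mels // 16), 128, batch_first=True)
            self.fc = nn.Linear(128, n_classes)

        def forward(self, spec):                         # (B, 1, n_mels, T)
            x = self.cnn(spec)                           # (B, 128, n_mels/16, T)
            x = x.flatten(1, 2).transpose(1, 2)          # (B, T, features)
            _, h = self.gru(x)
            return self.fc(h[-1])                        # classify from last state

    print(CNNGRU()(torch.randn(2, 1, 64, 500)).shape)    # (2, 10)
    ```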
  • Item
    Classification of Obstructive Sleep Apnea using Multimodal and Sigma-based Feature Representation
    (2019) Memis, Gokhan; Sert, Mustafa
    Obstructive sleep apnea (OSA) is a sleep disorder characterized by decreases in blood oxygen saturation and repeated arousals from sleep. Diagnosis typically requires monitoring a full night of sleep with a polysomnography device, so there is a need for computer-based methods for the diagnosis of OSA. In this study, a method based on feature selection is proposed for OSA classification using oxygen saturation and electrocardiogram signals. Standard deviation (sigma) based features are created to increase accuracy and reduce computational complexity. To evaluate their effectiveness, the obtained features were compared using the Naive Bayes (NB), k-nearest neighbor (kNN), and Support Vector Machine (SVM) classifiers. Tests performed on the PhysioNet dataset, consisting of real clinical samples, show that the use of sigma-based features results in an average performance increase of 1.98% across all test scenarios.
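    A minimal sketch of sigma-based features, assuming per-window standard deviations of the SpO2 and ECG signals fused into one vector; the data, window length, and labels are placeholders:

    ```python
    # Sketch: per-window standard-deviation (sigma) features from two signals.
    import numpy as np
    from sklearn.svm import SVC

    def sigma_features(signal, win=600):
        # One standard-deviation value per non-overlapping window.
        n = len(signal) // win
        return signal[:n * win].reshape(n, win).std(axis=1)

    spo2 = np.random.rand(2, 6000)   # 2 recordings of oxygen saturation
    ecg = np.random.rand(2, 6000)    # matching ECG segments
    X = np.stack([np.concatenate([sigma_features(s), sigma_features(e)])
                  for s, e in zip(spo2, ecg)])  # multimodal feature fusion
    y = np.array([0, 1])                        # apnea vs. normal labels
    print(SVC().fit(X, y).predict(X))
    ```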
  • Item
    Combining Acoustic and Semantic Similarity for Acoustic Scene Retrieval
    (2019) Sert, Mustafa; Basbug, Ahmet Melih
    Automatic retrieval of acoustic scenes in large audio collections is a challenging task due to the complex structures of these sounds. A robust and flexible retrieval system should address both the acoustic and semantic aspects of these sounds and how to combine them. In this study, we introduce an acoustic scene retrieval system that uses a combined acoustic and semantic similarity method. To address the acoustic aspects of sound scenes, we use a cascaded convolutional neural network (CNN) with a gated recurrent unit (GRU). The acoustic similarity is calculated in feature space using the Euclidean distance, and the semantic similarity is obtained using the Path Similarity method of WordNet. Two performance datasets, from TAU Urban Acoustic Scenes 2019 and TUT Urban Acoustic Scenes 2018, are used to compare the performance of the proposed retrieval system with the literature and with the developed baseline. Results show that the semantic similarity improves the mAP and P@k scores.
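    A hedged sketch of the combined similarity: Euclidean distance between acoustic embeddings plus WordNet path similarity between scene labels. The distance-to-similarity mapping and the weighting term alpha are assumptions, not the paper's combination rule:

    ```python
    # Combined acoustic + semantic similarity (illustrative only).
    import numpy as np
    from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

    def acoustic_similarity(f1, f2):
        # Turn Euclidean distance in feature space into a similarity score.
        return 1.0 / (1.0 + np.linalg.norm(f1 - f2))

    def semantic_similarity(label1, label2):
        s1, s2 = wordnet.synsets(label1)[0], wordnet.synsets(label2)[0]
        return s1.path_similarity(s2)  # WordNet Path Similarity

    def combined_similarity(f1, f2, l1, l2, alpha=0.5):  # alpha is an assumption
        return alpha * acoustic_similarity(f1, f2) + \
               (1 - alpha) * semantic_similarity(l1, l2)

    print(combined_similarity(np.random.rand(128), np.random.rand(128),
                              "park", "street"))
    ```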
  • Item
    The Effectiveness of Feature Selection Methods on Physical Activity Recognition
    (2018) Memis, Gokhan; Sert, Mustafa
    Monitoring physical activity over long activity times can be costly, so there is a need for efficient computer-based algorithms. Smartphone sensors such as the accelerometer, magnetometer, and gyroscope are used for physical activity recognition in many studies. In this study, we propose a multimodal approach that classifies different physical activities by fusing electrocardiography (ECG), accelerometer, magnetometer, and gyroscope signals at the feature level. We use Support Vector Machine (SVM), nearest neighbors, Naive Bayes, Random Tree, and Bagging RepTree classifiers as learning algorithms and provide comprehensive empirical results on the fusion strategy. Our experimental results on real clinical examples from the MHealth dataset show that the proposed feature-level fusion approach gives an average accuracy of 98.40% using SVM, the highest value across all scenarios. We also observe that the SVM classifier with the gyroscope signal alone, the best-performing single modality, gives an average accuracy of 96.27%. We achieve a significant improvement in comparison with existing studies.
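    A minimal sketch of feature-level fusion, assuming windowed signals from each sensor; the window statistics, shapes, and labels are placeholders:

    ```python
    # Sketch: concatenate per-window statistics from each sensor before SVM.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def window_stats(sig):
        # Mean and standard deviation per window for one sensor stream.
        return np.column_stack([sig.mean(axis=1), sig.std(axis=1)])

    n = 120  # number of windows (placeholder)
    ecg, acc = np.random.rand(n, 256), np.random.rand(n, 256)
    mag, gyr = np.random.rand(n, 256), np.random.rand(n, 256)

    # Feature-level fusion: stack all sensor features side by side.
    X = np.hstack([window_stats(s) for s in (ecg, acc, mag, gyr)])  # (n, 8)
    y = np.repeat(np.arange(12), n // 12)  # 12 MHealth activity labels
    print(cross_val_score(SVC(), X, y, cv=5).mean())
    ```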
  • Item
    Continuous Valence Prediction Using Recurrent Neural Networks with Facial Expressions and EEG Signals
    (2018) Sen, Dogancan; Sert, Mustafa
    Automatic analysis of human emotions by computer systems is an important task for human-machine interaction. Recent studies show that the temporal characteristics of emotions play an important role in the success of automatic recognition. The use of different signals (facial expressions, bio-signals, etc.) is also important for understanding emotions. In this study, we propose a multimodal method based on feature-level fusion of human facial expressions and electroencephalogram (EEG) data to predict human emotions in the continuous valence dimension. For this purpose, a recurrent neural network with long short-term memory units (LSTM-RNN) is designed. The proposed method is evaluated on the MAHNOB-HCI performance dataset.
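    A minimal sketch of the feature-level fusion idea: per-time-step facial-expression and EEG features concatenated and fed to an LSTM that regresses valence. The feature dimensions and sequence length are assumptions for illustration:

    ```python
    # Sketch: LSTM regressing continuous valence from fused face + EEG features.
    import torch
    import torch.nn as nn

    class ValenceLSTM(nn.Module):
        def __init__(self, face_dim=49, eeg_dim=160, hid=128):
            super().__init__()
            self.lstm = nn.LSTM(face_dim + eeg_dim, hid, batch_first=True)
            self.head = nn.Linear(hid, 1)  # one valence value per time step

        def forward(self, face, eeg):
            fused, _ = self.lstm(torch.cat([face, eeg], dim=-1))  # feature-level fusion
            return self.head(fused).squeeze(-1)

    model = ValenceLSTM()
    valence = model(torch.randn(4, 100, 49), torch.randn(4, 100, 160))
    print(valence.shape)  # (4, 100): continuous valence over time
    ```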