Browsing by Author "Selbes, Berkay"
Now showing 1 - 4 of 4
Item: Çok kipli video kavram sınıflandırılması [Multimodal video concept classification] (Başkent Üniversitesi Fen Bilimleri Enstitüsü, 2018) Selbes, Berkay; Sert, Mustafa

Abstract: Multimedia data are continuously produced and shared at a high rate as a result of growing internet usage. Consequently, the size of multimedia data is increasing rapidly, and automated methods are needed to analyze the content of the data produced. Video data is an important component of multimedia data. Video content analysis, which can be defined as the automatic detection of temporal or spatial events and concepts in video content, is an important research topic for applications such as audio-visual surveillance and content-based search and retrieval. It is a difficult task because of the complex nature of video content, and it requires efficient algorithms to extract the high-level information the content carries; the growing size of video data makes the task even harder. This thesis proposes a method for multimodal analysis of video data based on the fusion of the audio and visual modalities and implements it on a big data platform. The proposed method fuses representations of Mel-Frequency Cepstral Coefficient (MFCC) features with Convolutional Neural Network (CNN) features and is implemented on the Apache Spark big data platform. Its performance is evaluated on the TRECVID 2012 SIN data set. The results show that the multimodal approach improves the accuracy of the single-modality approach, and that the big data platform significantly reduces the computation time of the multimodal video content analysis method.

Item: Multimodal Vehicle Type Classification Using Convolutional Neural Network and Statistical Representations of MFCC (2017) Selbes, Berkay; Sert, Mustafa

Abstract: Recognizing vehicle types in real-life traffic scenarios is challenging because of the diversity of vehicles and the uncontrolled environment; efficient methods and feature representations are needed to cope with these challenges. In this paper, we address vehicle type classification in real-life traffic scenarios and propose a multimodal method that fuses efficient representations of the audio and visual modalities. We first separate the modalities by extracting keyframes and the corresponding audio fragments from the video data. We then extract deep convolutional neural network (CNN) features from the visual modality and Mel-Frequency Cepstral Coefficient (MFCC) features from the audio modality. Principal Component Analysis (PCA) is applied to the visual features, and various statistical representations of the MFCC feature vectors are computed to select representative features. These representations are fused into a robust multimodal feature. Finally, Support Vector Machine (SVM) classifiers are trained on the fused features for the final classification of vehicle types. We evaluate the effectiveness of the proposed method on the TRECVID 2012 SIN video performance dataset for both the single-modal and multimodal cases. Our results show that fusing the proposed MFCC representations with GoogLeNet CNN features improves the classification accuracy.
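The two abstracts above describe the same basic recipe: frame-level MFCC vectors are summarized with simple statistics, concatenated with CNN features extracted from keyframes, and classified with an SVM. The sketch below illustrates that pipeline; it is not the authors' code, and the use of librosa for MFCC extraction, mean and standard deviation as the MFCC statistics, a 1024-dimensional GoogLeNet keyframe vector, and a 64-component PCA are assumptions made for illustration.

```python
# Minimal sketch of feature-level audio-visual fusion for an SVM classifier.
# Assumptions (not from the papers): librosa for MFCCs, precomputed CNN
# keyframe features, mean/std as the MFCC statistics, 64-component PCA.
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mfcc_statistics(wav_path, n_mfcc=13):
    """Summarize frame-level MFCCs with per-coefficient mean and std."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def fuse(cnn_features, audio_features, pca):
    """Concatenate PCA-reduced CNN keyframe features with MFCC statistics."""
    visual = pca.transform(cnn_features)            # (n_samples, 64)
    return np.hstack([visual, audio_features])      # (n_samples, 64 + 2*n_mfcc)

def train(cnn_train, audio_train, y_train):
    """cnn_train: (n_videos, 1024) keyframe CNN features (assumed precomputed),
    audio_train: stacked MFCC statistics, y_train: class labels."""
    pca = PCA(n_components=64).fit(cnn_train)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(fuse(cnn_train, audio_train, pca), y_train)
    return pca, clf
```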
Item: Multimodal Video Captioning Using Object-Auditory Information Fusion with Transformers (2023) Selbes, Berkay; Sert, Mustafa

Abstract: Video captioning aims to generate natural language sentences describing an input video. Generating coherent sentences is challenging because of the complex nature of video content: it requires object and scene understanding, extraction of object- and event-specific auditory information, and modeling of the relationships among objects. In this study, we address the efficient modeling of object interactions in scenes, since these interactions carry crucial information about the events in the visual scene. To this end, we propose to use object features together with auditory information to better model the audio-visual scene appearing in the video. Specifically, we extract object features with Faster R-CNN and auditory features with VGGish, and design a transformer encoder-decoder architecture in a multimodal setup. Experiments on MSR-VTT show encouraging results, with the object features modeling object interactions better than ResNet features when combined with the auditory information.
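A rough sketch of such a multimodal encoder-decoder is shown below. It is not the paper's implementation; the feature dimensions (2048-dimensional Faster R-CNN region features, 128-dimensional VGGish embeddings), the fusion by concatenating the two projected token sequences, and all hyperparameters are assumptions made for illustration.

```python
# Minimal sketch of a multimodal transformer captioner.
# Assumptions (not from the paper): 2048-d Faster R-CNN region features,
# 128-d VGGish clip features, concatenation of the projected token sequences
# as the fusion strategy, toy hyperparameters. Positional encodings are
# omitted for brevity.
import torch
import torch.nn as nn

class MultimodalCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, obj_dim=2048, aud_dim=128):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, d_model)   # object tokens -> d_model
        self.aud_proj = nn.Linear(aud_dim, d_model)   # audio tokens  -> d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                           num_encoder_layers=4,
                                           num_decoder_layers=4,
                                           batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, obj_feats, aud_feats, captions):
        # obj_feats: (B, N_obj, 2048), aud_feats: (B, N_aud, 128),
        # captions: (B, T) token ids used for teacher forcing.
        src = torch.cat([self.obj_proj(obj_feats),
                         self.aud_proj(aud_feats)], dim=1)   # fused encoder input
        tgt = self.embed(captions)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(captions.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(hidden)                              # (B, T, vocab_size)

# Example with random tensors standing in for extracted features.
model = MultimodalCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 10, 128),
               torch.randint(0, 10000, (2, 20)))
```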
Item: Multimodal Video Concept Classification based on Convolutional Neural Network and Audio Feature Combination (2017) Selbes, Berkay; Sert, Mustafa

Abstract: Video concept classification is an important task for applications such as content-based video indexing and search. In this study, we propose a multimodal video classification method based on feature-level fusion of audio-visual signals. We extract Mel-Frequency Cepstral Coefficient (MFCC) features from the audio part and convolutional neural network (CNN) features from the visual part of the video signal, and compute three statistical representations of the MFCC feature vectors. We perform feature-level fusion of both modalities using the concatenation operator and train Support Vector Machine (SVM) classifiers on the resulting multimodal features. We evaluate the effectiveness of the proposed method on the TRECVID video performance dataset for both the single-modal and multimodal cases. Our results show that fusing the standard deviation representation of the audio modality with GoogLeNet CNN features improves the classification accuracy.
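The 2018 thesis listed above additionally reports that running the multimodal pipeline on Apache Spark substantially reduces computation time. The sketch below shows one way per-video fusion and scoring could be distributed with PySpark; it is not the thesis implementation. The Parquet paths, column names, the pandas-free UDF approach, and the use of a broadcast scikit-learn SVM are all assumptions made for illustration.

```python
# Minimal sketch of distributing per-video multimodal scoring with Apache Spark.
# Assumptions (not from the thesis): features are precomputed and stored in a
# Parquet table with 'cnn' and 'mfcc_stats' array columns, and an SVM trained
# offline is broadcast to the executors.
import numpy as np
from sklearn.svm import SVC
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("multimodal-concept-scoring").getOrCreate()

# Stand-in model fit on random data; in practice this would be the SVM trained
# on the fused training features described in the abstracts above.
trained_svm = SVC().fit(np.random.rand(20, 1152), np.random.randint(0, 2, 20))
svm_broadcast = spark.sparkContext.broadcast(trained_svm)

videos = spark.read.parquet("hdfs:///videos/features.parquet")   # hypothetical path

@udf(returnType=IntegerType())
def classify(cnn, mfcc_stats):
    """Fuse the two modality vectors and score them with the broadcast SVM."""
    fused = np.concatenate([np.array(cnn), np.array(mfcc_stats)]).reshape(1, -1)
    return int(svm_broadcast.value.predict(fused)[0])

predictions = videos.withColumn("concept", classify("cnn", "mfcc_stats"))
predictions.write.parquet("hdfs:///videos/predictions.parquet")  # hypothetical path
```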