See Google Scholar
for all publications.
-
Video question answering with limited supervision
Deniz Engin
PhD Thesis 2024
abstract |
pdf
Video content has significantly increased in volume and diversity in the digital era, and
this expansion has highlighted the necessity for advanced video understanding
technologies that transform vast volumes of unstructured data into practical insights by
learning from data. Driven by this necessity, this thesis explores semantic video understanding, leveraging multiple perceptual modes, as in human cognitive processes, and efficient learning from limited supervision, akin to human learning
capabilities. Multimodal semantic video understanding synthesizes visual, audio, and
textual data to analyze and interpret video content, facilitating comprehension of
underlying semantics and context. This thesis focuses specifically on video question answering, one of the main video understanding tasks. Our first
contribution addresses long-range video question answering, which involves answering
questions about long videos, such as TV show episodes. These questions require an
understanding of extended video content. While recent approaches rely on human-generated external sources, we instead generate video summaries directly from raw data. Our
following contribution explores zero-shot and few-shot video question answering, aiming
to enable efficient learning from limited data. We leverage the knowledge of existing large-scale models while addressing the challenges of adapting pre-trained models to limited data, such as overfitting, catastrophic forgetting, and the cross-modal gap between vision and language. We introduce a parameter-efficient method that combines
multimodal prompt learning with a transformer-based mapping network while keeping the
pre-trained vision and language models frozen. We demonstrate that these contributions
significantly enhance the capabilities of multimodal video question-answering systems,
particularly when human-annotated labeled data is limited or unavailable.
-
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Deniz Engin,
Yannis Avrithis
ICCV Workshops 2023
[Oral Presentation]
abstract |
pdf |
project page |
code
Recent vision-language models are driven by large-scale pretrained models. However, adapting pretrained models to limited data presents challenges such as overfitting, catastrophic forgetting, and the cross-modal gap between vision and language. We introduce a parameter-efficient method to address these challenges, combining multimodal prompt learning and a transformer-based mapping network, while keeping the pretrained models frozen. Our experiments on several video question answering benchmarks demonstrate the superiority of our approach in terms of performance and parameter efficiency in both zero-shot and few-shot settings.
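The general recipe is easy to picture in code. Below is a minimal PyTorch sketch of the idea, learnable prompts plus a small trainable transformer that maps frozen visual features into the token space of a frozen language model; all module names, dimensions, and hyperparameters here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PromptedMapper(nn.Module):
    """Sketch: learnable prompts plus a small transformer that maps frozen
    visual features into the token space of a frozen language model.
    Only this module would be trained."""

    def __init__(self, vis_dim=768, lm_dim=2048, n_prompts=8, n_layers=2):
        super().__init__()
        # Learnable prompt vectors, prepended to the mapped visual tokens.
        self.prompts = nn.Parameter(torch.randn(n_prompts, lm_dim) * 0.02)
        self.proj = nn.Linear(vis_dim, lm_dim)
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8,
                                           batch_first=True)
        self.mapper = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, vis_feats):            # (B, T, vis_dim), from a frozen encoder
        x = self.mapper(self.proj(vis_feats))        # contextualized visual tokens
        p = self.prompts.expand(x.size(0), -1, -1)   # (B, n_prompts, lm_dim)
        return torch.cat([p, x], dim=1)      # prompt + visual tokens, fed to the LM
```

Only the prompts and the mapper carry gradients; both pretrained backbones stay frozen, which is what keeps the approach parameter-efficient.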
-
On the hidden treasure of dialog in video
question answering
Deniz Engin, François Schnitzler, Ngoc Q. K. Duong, Yannis Avrithis
ICCV 2021
abstract |
pdf |
project page |
code
High-level understanding of stories in videos such as movies and TV shows from raw data is
extremely challenging. Modern video question answering (VideoQA) systems often use
additional human-made sources like plot synopses, scripts, video descriptions or
knowledge bases. In this work, we present a new approach to understand the whole story
without such external sources. The secret lies in the dialog: unlike any prior work, we
treat dialog as a noisy source to be converted into text description via dialog
summarization, much like recent methods treat video. The input of each modality is
encoded by transformers independently, and a simple fusion method combines all
modalities, using soft temporal attention for localization over long inputs. Our model
outperforms the state of the art on the KnowIT VQA dataset by a large margin, without
using question-specific human annotation or human-made plot summaries. It even
outperforms human evaluators who have never watched any whole episode before. Code is
available at https://engindeniz.github.io/dialogsummary-videoqa
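To make the fusion step concrete, here is a minimal PyTorch sketch of question-conditioned soft temporal attention used to pool one encoded modality stream; the module and its dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SoftTemporalAttention(nn.Module):
    """Sketch: question-conditioned soft attention over time steps, pooling
    a long encoded stream (video or dialog summary) into a single vector."""

    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, stream, question):
        # stream: (B, T, D) per-modality transformer outputs; question: (B, D)
        logits = self.score(stream * question.unsqueeze(1)).squeeze(-1)  # (B, T)
        weights = logits.softmax(dim=-1)     # soft localization over long inputs
        return (weights.unsqueeze(-1) * stream).sum(dim=1)               # (B, D)
```

Each modality stream would be pooled this way and the pooled vectors combined by a simple fusion, for example averaging per-modality answer scores.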
-
Offline Signature Verification on Real-World Documents
Deniz Engin*, Alperen Kantarcı*, Seçil Arslan, Hazım Kemal Ekenel
CVPR Workshops 2020
abstract |
pdf
Research on offline signature verification has explored a large variety of methods on
multiple signature datasets, which are collected under controlled conditions. However,
these datasets may not fully reflect the characteristics of the signatures in some
practical use cases. Real-world signatures extracted from formal documents may
contain different types of occlusions, for example, stamps, company seals, ruling lines,
and signature boxes. Moreover, they may have very high intra-class variations, where
even genuine signatures resemble forgeries. In this paper, we address a real-world writer-independent offline signature verification problem, in which occluded signatures extracted from a bank’s customer transaction request documents are compared with the customers’ clean reference signatures. Our proposed method consists of two main components: a stamp cleaning method based on CycleGAN and a signature representation based on CNNs. We extensively evaluate different verification setups, fine-tuning strategies, and signature representation approaches to provide a thorough analysis of the problem.
Moreover, we conduct a human evaluation to show the challenging nature of the problem.
We run experiments both on our custom dataset and on the publicly available Tobacco-800 dataset. The experimental results validate the difficulty of offline
signature verification on real-world documents. However, by employing the stamp cleaning
process, we improve the signature verification performance significantly.
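The verification pipeline itself reduces to comparing embeddings of a cleaned query signature against a reference. A minimal sketch, assuming a trained CycleGAN generator `cleaner` and a signature CNN `embedder` (both names hypothetical):

```python
import torch
import torch.nn.functional as F

def verify(query_img, reference_img, cleaner, embedder, threshold=0.7):
    """Sketch of the two-stage pipeline: remove occlusions, then compare CNN
    embeddings. `cleaner` stands in for a trained CycleGAN generator and
    `embedder` for a signature CNN; both are hypothetical callables."""
    cleaned = cleaner(query_img)                    # strip stamps, seals, lines
    q = F.normalize(embedder(cleaned), dim=-1)      # unit-norm embeddings (B, D)
    r = F.normalize(embedder(reference_img), dim=-1)
    similarity = (q * r).sum(dim=-1)                # cosine similarity per pair
    return similarity >= threshold                  # accept as genuine or reject
```

In practice the decision threshold would be tuned on a validation set.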
-
Cycle-Dehaze: Enhanced CycleGAN for Single Image Dehazing
Deniz Engin*, Anıl Genç*, Hazım Kemal Ekenel
CVPR Workshops 2018
abstract |
pdf |
code
In this paper, we present an end-to-end network, called Cycle-Dehaze, for the single image
dehazing problem, which does not require pairs of hazy and corresponding ground truth
images for training. That is, we train the network by feeding clean and hazy images in
an unpaired manner. Moreover, the proposed approach does not rely on estimation of the
atmospheric scattering model parameters. Our method enhances CycleGAN formulation by
combining cycle-consistency and perceptual losses in order to improve the quality of
textural information recovery and generate visually better haze-free images. Typically,
deep learning models for dehazing take low resolution images as input and produce low
resolution outputs. However, in the NTIRE 2018 challenge on single image dehazing, high
resolution images were provided. Therefore, we apply bicubic downscaling to the inputs. After
obtaining low-resolution outputs from the network, we utilize the Laplacian pyramid to
upscale the output images to the original resolution. We conduct experiments on
NYU-Depth, I-HAZE, and O-HAZE datasets. Extensive experiments demonstrate that the
proposed approach improves the CycleGAN method both quantitatively and qualitatively.
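The enhanced objective can be sketched as an ordinary L1 cycle-consistency term plus a perceptual term computed in VGG16 feature space. The layer cut and the weight `gamma` below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG16 feature extractor for the perceptual term (layer cut is an
# illustrative choice; inputs are assumed to be 3-channel image batches).
_vgg = vgg16(weights="DEFAULT").features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def cycle_perceptual_loss(real, reconstructed, gamma=1e-4):
    """Cycle-consistency (pixel L1) plus a perceptual loss in VGG feature
    space; the weight `gamma` is an assumed value, not the paper's."""
    cycle = F.l1_loss(reconstructed, real)
    perceptual = F.mse_loss(_vgg(reconstructed), _vgg(real))
    return cycle + gamma * perceptual
```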
-
Face Frontalization for Cross-Pose Facial Expression Recognition
Deniz Engin, Christophe Ecabert, Hazım Kemal Ekenel, Jean-Philippe Thiran
EUSIPCO 2018
abstract |
pdf
In this paper, we explore the effect of pose normalization on cross-pose facial expression recognition. We first present an expression-preserving face frontalization method. After the face frontalization step, for facial expression representation and classification, we employ both a traditional approach, using hand-crafted features, namely local binary patterns, in combination with support vector machine classification, and a more recent approach based on convolutional neural networks. To evaluate the impact of face frontalization on facial expression recognition performance, we conduct cross-pose, subject-independent expression recognition experiments on the BU3DFE database. Experimental results show that pose normalization improves cross-pose facial expression recognition performance. The improvement is especially significant when local binary patterns are combined with a support vector machine classifier, since this representation and classification scheme does not handle pose variations. The convolutional neural network-based approach handles pose variations more successfully when fine-tuned on a dataset that contains face images with varying pose angles, and its performance is further enhanced by face frontalization.
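For reference, the traditional branch is the classic grid-of-LBP-histograms descriptor fed to an SVM. A minimal sketch with scikit-image and scikit-learn follows; the grid size and LBP parameters are illustrative assumptions:

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_descriptor(gray_face, P=8, R=1, grid=(7, 7)):
    """Uniform LBP histograms computed per grid cell and concatenated;
    grid size and (P, R) are illustrative choices."""
    lbp = local_binary_pattern(gray_face, P, R, method="uniform")
    h, w = lbp.shape
    cells = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = lbp[i * h // grid[0]:(i + 1) * h // grid[0],
                       j * w // grid[1]:(j + 1) * w // grid[1]]
            hist, _ = np.histogram(cell, bins=P + 2, range=(0, P + 2),
                                   density=True)
            cells.append(hist)
    return np.concatenate(cells)

# Training would then be: SVC(kernel="linear").fit(descriptors, labels)
```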