SATNet: Top-Down 3D Quality Assessment via a Stereo Attention Network

🔗 Preprint:

Towards Top-Down Stereoscopic Image Quality Assessment via Stereo Attention

Stereoscopic image quality assessment (SIQA) plays a crucial role in evaluating and improving the visual experience of 3D content. Existing binocular properties and attention-based methods for...

https://arxiv.org/abs/2308.04156

🔗 Code:

SATNet

Fanning-Zhang • Updated Aug 20, 2024

This blog is a brief introduction to the representative work of my master's studies, SATNet: Towards Top-Down Stereo Image Quality Assessment via Stereo Attention. It is meant to help readers quickly grasp the background, idea, and method. Please feel free to point out any mistakes or shallow spots in my understanding and to share your insights with me. 🤝🤝🤝 Huilin Zhang (张慧林), Aug 2023.

1. Background
1.1 What is SIQA?
1.2 Bottom-Up or Top-Down?
2. Idea and Method
2.1 Stereo AttenTion
2.2 Energy Coefficient (EC)
2.3 Dual-Pooling
3. Experiments and Results
3.1 Overall Performance
3.2 Effects of EC
4. Conclusion

1. Background

1.1 What is SIQA?

Stereo Image Quality Assessment (SIQA, also called 3D-IQA): People perceive the world through their eyes every day, and we can subjectively judge the quality of stereo images and videos (such as 3D movies) through the collaboration of our visual and cognitive systems. Our work aims to turn this subjective process into an objective one by developing SIQA algorithms that assess 3D image quality automatically and accurately. With 3D content growing explosively, such algorithms are what make assessment feasible at scale.

Why it matters: Since stereo images inevitably degrade during acquisition, processing, and transmission, an efficient SIQA system that accurately reflects human visual judgment is crucial for guiding other image processing technologies, e.g., denoising, super-resolution, and other reconstruction tasks.

Workflow: SIQA algorithms take a stereo image (composed of left and right views) as input and produce a quality score, either in the form of Differential Mean Opinion Score (DMOS) or Mean Opinion Score (MOS).

1.2 Bottom-Up or Top-Down?

The key to SIQA is to simulate the Human Visual System (HVS), which contains two perceptual reasoning mechanisms: bottom-up and top-down. Existing CNN-based SIQA models all follow the bottom-up strategy, with no guidance from higher-level signals to lower-level ones. They fall into two general types:

As depicted in the upper part of Fig. 1, the interaction between monocular features, or between binocular and monocular features (or both), is missing or only implicit during feature extraction.

As illustrated in the lower part of Fig. 1, monocular features from different levels are combined into fusion or difference features (or both), which serve as multi-level binocular interactive information and are passed on for regression.

Our Stereo AttenTion Network (SATNet) is designed from a top-down perspective, which is more in line with the characteristics of SIQA. Quality assessment is a high-level task: prior knowledge and quality-related expectations, such as salient regions and the type and degree of distortion, influence how we perceive and assess the quality of an image.

Specifically, we remodel mainstream attention modules into a generalized Stereo AttenTion (SAT) structure, adapting their components and input/output for stereo scenarios. SAT leverages the fusion-generated attention map as a higher-level binocular modulator that guides two lower-level monocular features, allowing progressive recalibration of both throughout the pipeline.

2. Idea and Method

2.1 Stereo AttenTion

Why do we choose the attention mechanism to implement the top-down philosophy?

Saliency: attention behaves, to some extent, like visual saliency
Higher-level output: the output is relatively higher-level information than the input
Information integration: attention can aggregate information
Non-linearity: attention is a non-linear operation

From Attention to Stereo AttenTion (SAT)

Single input/output → dual input/output
Energy Coefficient (EC) introduced
Sigmoid → Softmax
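The three changes listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the released SATNet code: it assumes a squeeze-and-excitation style block, and the function name `stereo_attention_se`, the weight matrices `w1` and `w2`, the placement of the EC on the summed descriptor, and the choice to run the softmax across the two views are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def stereo_attention_se(left, right, w1, w2, ec):
    """Hypothetical SE-style stereo attention block (illustration only).

    left, right: monocular feature maps, shape (C, H, W)
    w1: (C_hidden, C) and w2: (2 * C, C_hidden) stand in for learned weights
    ec: scalar Energy Coefficient scaling the binocular descriptor
    """
    c = left.shape[0]
    # Squeeze: global average pooling turns each view into a (C,) descriptor.
    z_left = left.mean(axis=(1, 2))
    z_right = right.mean(axis=(1, 2))
    # Dual input: fuse both views; the EC scales the summed response (assumed form).
    z_bin = ec * (z_left + z_right)
    # Excitation: a small two-layer mapping with ReLU in between.
    hidden = np.maximum(w1 @ z_bin, 0.0)
    logits = (w2 @ hidden).reshape(2, c)
    # Softmax instead of sigmoid: per channel, the two view weights sum to 1.
    attn = softmax(logits, axis=0)
    # Dual output: the binocular attention modulates both monocular features.
    out_left = left * attn[0][:, None, None]
    out_right = right * attn[1][:, None, None]
    return out_left, out_right, attn


rng = np.random.default_rng(0)
left = rng.random((4, 6, 6))
right = rng.random((4, 6, 6))
w1 = rng.standard_normal((2, 4))
w2 = rng.standard_normal((8, 2))
out_left, out_right, attn = stereo_attention_se(left, right, w1, w2, ec=0.8)
print(out_left.shape, out_right.shape, attn.shape)
```

With a softmax over the two views, raising the weight of one view for a channel necessarily lowers the weight of the other, which is one way a single binocular signal can recalibrate both monocular branches at once.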

2.2 Energy Coefficient (EC)

Biological research has shown that binocular responses in the primate primary visual cortex are less than the sum of the monocular responses of the two eyes [25]. Zhang et al. [26] also confirmed this by investigating neuronal responses in macaque V1 with two-photon calcium imaging. However, to the best of our knowledge, almost all previous works take the full monocular responses of both views into account, which is unreasonable to some extent. We therefore introduce an Energy Coefficient (EC) into the SAT structure, so that the network can adaptively learn a suitable binocular response magnitude.
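To make "adaptively learn" concrete, here is a toy sketch in which a single scalar EC is fitted by gradient descent. It assumes the EC enters as a multiplier on the summed monocular responses, i.e. binocular ≈ EC · (left + right); that form, the function name `fit_energy_coefficient`, and the target factor 0.7 used to simulate a sub-additive binocular response are illustrative assumptions, not values or code from the paper.

```python
import numpy as np


def fit_energy_coefficient(left, right, target, ec_init=1.0, lr=0.1, steps=200):
    """Fit a scalar EC so that ec * (left + right) matches target (toy example).

    Minimizes the mean squared error with plain gradient descent and
    returns the final EC together with its value after every step.
    """
    summed = left + right
    ec = ec_init
    history = [ec]
    for _ in range(steps):
        residual = ec * summed - target
        # d/d(ec) of mean((ec * summed - target)^2)
        grad = 2.0 * np.mean(residual * summed)
        ec = ec - lr * grad
        history.append(ec)
    return ec, history


rng = np.random.default_rng(0)
left = rng.random(1000)
right = rng.random(1000)
# Simulated binocular response that is smaller than the monocular sum.
target = 0.7 * (left + right)
ec, history = fit_energy_coefficient(left, right, target)
print(f"learned EC: {ec:.4f} after {len(history) - 1} steps")
```

Starting from EC = 1 (the plain sum used by earlier methods), the coefficient drifts toward the factor that generated the data. In a real network the EC would be one more trainable parameter updated by the quality-regression loss rather than by a dedicated objective like this one.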

2.3 Dual-Pooling

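The dual-pooling strategy uses both max-pooling and min-pooling for downsampling. Specifically, adding the features of the left and right branches gives the fusion map, and subtracting them gives the difference map; we apply max-pooling to the fusion map and min-pooling to the difference map.

Min-pooling may look unusual, because most CV tasks fall into one of two cases: 1) max-pooling the convolved features to preserve "local invariance"; 2) global average pooling over a feature map to extract contextual information.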

For the SIQA task, however, good features should capture quality properties such as noise and blur, which usually appear as the smaller values in the feature map. Most prior SIQA studies employ only max-pooling after feature encoding, to reduce computational overhead and maintain local structure invariance. To add quality-sensitive attributes to our model, we employ min-pooling while retaining max-pooling, and call the combination dual-pooling. Empirical evaluation demonstrates the validity of the proposed dual-pooling strategy and its effectiveness in screening the most crucial structure and distortion information for quality regression.
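The strategy described in this section can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the helper names `pool2d` and `dual_pooling`, the non-overlapping 2×2 window, and the `(C, H, W)` layout are choices made for the example, and the pairing (max on the fusion map, min on the difference map) follows the description in this section and is exposed as parameters so it can be swapped.

```python
import numpy as np


def pool2d(x, size, reduce_fn):
    """Non-overlapping size x size pooling over a (C, H, W) array."""
    c, h, w = x.shape
    h_out, w_out = h // size, w // size
    # Drop trailing rows/columns that do not fill a complete window.
    x = x[:, : h_out * size, : w_out * size]
    blocks = x.reshape(c, h_out, size, w_out, size)
    return reduce_fn(blocks, axis=(2, 4))


def dual_pooling(left, right, size=2, fusion_reduce=np.max, difference_reduce=np.min):
    """Pool the fusion (sum) and difference maps with different reductions."""
    fusion = left + right
    difference = left - right
    pooled_fusion = pool2d(fusion, size, fusion_reduce)
    pooled_difference = pool2d(difference, size, difference_reduce)
    return pooled_fusion, pooled_difference


left = np.arange(16, dtype=float).reshape(1, 4, 4)
right = np.zeros((1, 4, 4))
pooled_fusion, pooled_difference = dual_pooling(left, right)
print(pooled_fusion[0])      # max of each 2x2 window of left + right
print(pooled_difference[0])  # min of each 2x2 window of left - right
```

In deep learning frameworks that offer max-pooling but no min-pooling layer, min-pooling is commonly obtained by negating the input, applying max-pooling, and negating the result.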

3. Experiments and Results

3.1 Overall Performance

3.2 Effects of EC

We train SATNet-SE on four databases (LIVE 3D Phase I&II and WIVC 3D Phase I&II) and visualize how the ECs in the first, third, fifth, and seventh SAT-SE blocks update during training; the results are shown in Fig. 5.

Fig. 5 shows that:

The value at which the EC of a given block stabilizes differs across databases.

On the same database, the ECs of different blocks also stabilize at different values.

4. Conclusion

From a top-down perspective, we propose a Stereo AttenTion Network (SATNet) for NR-SIQA, which realizes the guidance from higher-level binocular signals down to lower-level monocular signals, allowing progressive recalibration of both throughout the pipeline.

We design a generalized Stereo AttenTion (SAT) block, where components and input/output are adapted for stereo scenarios, and the attention map is leveraged as the higher-level binocular modulator of two lower-level monocular features, implementing the top-down philosophy. Moreover, an Energy Coefficient (EC) is introduced for binocular feature formation, reflecting the fact that binocular responses in the primate primary visual cortex are less than the sum of monocular responses.

To screen the most discriminative quality information from the summation and subtraction of the two branches of monocular features, we apply min-pooling and max-pooling to them, respectively, namely the dual-pooling strategy. Empirical evidence verifies the superiority of our strategy.