您的瀏覽器不支援JavaScript語法,網站的部份功能在JavaScript沒有啟用的狀態下無法正常使用。

中央研究院 資訊科學研究所

研究

友善列印

列印可使用瀏覽器提供的(Ctrl+P)功能

近期研究成果

:::

HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection

The Fourteenth International Conference on Learning Representations (ICLR), April 2026

Chin-Chia Yang, Yung-Yu Chuang, Hwann-Tzong Chen and Tyng-Luh Liu

Tyng-Luh Liu

Abstract

Synthetic image generators evolve rapidly, challenging detectors to generalize across current methods and adapt to new ones. We study domain-incremental synthetic image detection with a two-phase evaluation. Phase I trains on either diffusion- or GAN-based data and tests on the combined group to quantify bidirectional cross-generator transfer. Phase II sequentially introduces renders from 3D Gaussian Splatting (3DGS) head avatar pipelines, requiring adaptation while preserving earlier performance. We observe that CLIP-based detectors inherit text-image alignment semantics that are irrelevant to authenticity and hinder generalization. We introduce a Hilbert-Schmidt Independence Criterion (HSIC) bottleneck loss on intermediate CLIP ViT features, encouraging representations predictive of real versus synthetic while independent of generator identity and caption alignment. For domain-incremental learning, we propose HSIC-Guided Replay (HGR), which selects per-class exemplars via a hybrid score combining HSIC relevance with k-center coverage, yielding compact memories that mitigate forgetting. Empirically, the HSIC bottleneck improves transfer between diffusion and GAN families, and HGR sustains prior accuracy while adapting to 3DGS renders. These results underscore the value of information-theoretic feature shaping and principled replay for resilient detection under shifting generative regimes.

Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement

IEEE Transaction on Audio, Speech and Language Processing, February 2026

Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, and Berlin Chen

Hsin-Min Wang

Abstract

Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, particularly in the presence of unseen noise and channel distortions. In view of this, we in this paper present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture that consists of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, facilitating the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To enhance generalization further, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.

Cross-Attention Reprogramming for ASR: Bridging Discrete Speech Units and Pretrained Language Models

IEEE Access, January 2026

Pei-Jun Liao, Hung-Yi Lee, and Hsin-Min Wang

Hsin-Min Wang

Abstract

In automatic speech recognition (ASR), an emerging trend involves converting continuous speech features into sequences of discrete speech units (DSUs) via quantization. A key advantage of DSU representations is their compatibility with pretrained language models (PLMs), where DSUs are directly mapped to PLM token indices and the embedding layer is fine-tuned. However, this conventional strategy often relies heavily on large-scale training data to mitigate the inherent modality mismatch. In light of this, we explore a more effective way to exploit the PLM embedding dictionary. Drawing inspiration from Time-LLM, a recent time-series forecasting model, we propose a cross-attention reprogramming mechanism that incorporates codebook information from the DSU quantizer to better align the DSUs with the PLM embeddings. Compared to direct fine-tuning of PLM embeddings, our method consistently achieves improvements on the Discrete Audio and Speech Benchmark (DASB), reaching state-of-the-art performance across most DASB-style settings. We also evaluate our method on LibriSpeech-960, LibriLight-10, and Swedish, Czech, and Hungarian data from Common Voice, and observe similar trends. Notably, the proposed reprogramming method demonstrates significant gains over the fine-tuning baseline, particularly in cross-lingual and low-resource scenarios. This study proposes a new approach to using PLM embedding dictionaries in DSU-based ASR, and lays a foundation for combining speech representations with large language models in other discriminative tasks of speech processing such as speech emotion recognition and spoken question answering.

Can We Formalise Type Theory Intrinsically without Any Compromise? A Case Study in Cubical Agda

Proceedings of the 15th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP '26), January 2026

Liang-Ting Chen, Fredrik Nordvall Forsberg, Tzu-Chun Tsai

Liang-Ting Chen Fredrik Nordvall Forsberg

Abstract

We present an intrinsic representation of type theory in the proof assistant Cubical Agda, inspired by Awodey’s natural models of type theory. The initial natural model is defined as quotient inductive-inductive-recursive types, leading us to a syntax accepted by Cubical Agda without using any transports, postulates, or custom rewrite rules. We formalise some meta-properties such as the standard model, normalisation by evaluation for typed terms, and strictification constructions. Since our formalisation is carried out using Cubical Agda's native support for quotient inductive types, all our constructions compute at a reasonable speed. When we try to develop more sophisticated metatheory, however, the 'transport hell' problem reappears. Ultimately, it remains a considerable struggle to develop the metatheory of type theory using an intrinsic representation that lacks strict equations. The effort required is about the same whether or not the notion of natural model is used.

Efficient Column-Wise N:M Pruning on RISC-V CPU

Journal of Systems Architecture (JSA), March 2026

Chi-Wei Chu, Ding-Yong Hong, Jan-Jan Wu

Chi-Wei Chu Ding-Yong Hong Jan-Jan Wu

Abstract

In deep learning frameworks, weight pruning is a widely used technique for improving computational efficiency by reducing the size of large models. This is especially critical for convolutional operators, which often act as performance bottlenecks in convolutional neural networks (CNNs). However, the effectiveness of pruning heavily depends on how it is implemented, as different methods can significantly impact both computational performance and memory footprint. In this work, we propose a column-wise N:M pruning strategy applied at the tile level and modify XNNPACK to enable efficient execution of pruned models on the RISC-V vector architecture. Additionally, we propose fusing the operations of im2col and data packing to minimize redundant memory accesses and memory overhead. To further optimize performance, we incorporate AITemplate’s profiling technique to identify the optimal implementation for each convolutional operator. Our proposed approach effectively increases ResNet inference throughput by as much as 4×, and preserves ImageNet top-1 accuracy within 2.1% of the dense baseline.

AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection

IEEE Transactions on Cognitive and Developmental Systems, December 2025

Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, and Hsin-Min Wang

Yu Tsao Hsin-Min Wang

Abstract

The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous studies on detecting artificial intelligence-generated fake videos only utilize visual modality or audio modality. While some methods exploit audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multimodal datasets of deepfake videos involving acoustic and visual manipulations, and are mostly based on convolutional neural networks with low detection accuracy. Considering that human cognition instinctively integrates multisensory information including audio and visual cues to perceive and interpret content and the success of transformer in various fields, this study introduces the audio-visual transformer-based ensemble network (AVTENet). This innovative framework tackles the complexities of deepfake technology by integrating both acoustic and visual manipulations to enhance the accuracy of video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture video, audio, and audio-visual salient cues to reach a consensus in prediction. For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that the proposed model outperforms all existing methods and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset. We also compare AVTENet against humans in detecting video forgery. The results show that AVTENet significantly outperforms humans.

Uncertainty-Guided Exploration for Efficient AlphaZero Training

Annual Conference on Neural Information Processing Systems (NeurIPS), December 2025

Scott Cheng, Meng-Yu Tsai, Ding-Yong Hong, Mahmut Kandemir

Scott Cheng Ding-Yong Hong

Abstract

AlphaZero has achieved remarkable success in complex decision-making problems through self-play and neural network training. However, its self-play process remains inefficient due to limited exploration of high-uncertainty positions, the overlooked runner-up decisions in Monte Carlo Tree Search (MCTS), and high variance in value labels. To address these challenges, we propose and evaluate uncertainty-guided exploration by branching from high-uncertainty positions using our proposed Label Change Rate (LCR) metric, which is further refined by a Bayesian inference framework. Our proposed approach leverages runner-up MCTS decisions to create multiple variations, and ensembles value labels across these variations to reduce variance. We investigate three key design parameters for our branching strategy: where to branch, how many variations to branch, and which move to play in the new branch. Our empirical findings indicate that branching with 10 variations per game provides the best performance-exploration balance. Overall, our end-to-end results show an improved sample efficiency over the baseline by 58.5% on 9x9 Go in the early stage of training and by 47.3% on 19x19 Go in the late stage of training.

A Grouping Algorithm for Training Tree-Shaped Models on Multiple GPUs with High Efficiency

IEEE International Conference on Computers, Software, and Applications (COMPSAC), July 2025

Cai-Feng Lin, Ding-Yong Hong, Tzu-Hsien Tsai, Pangfeng Liu, Jan-Jan Wu

Ding-Yong Hong Jan-Jan Wu

Abstract

Graph Neural Network (GNN) is an important tool in deep learning to handle structured data, where graphs with nodes and edges represent entities and their relationships. Various challenges arise when GNN is tree-shaped, with irregular connectivity patterns and varying depth. It is difficult to distribute and process the dynamic structure for parallel execution on multiple GPUs. In addition, tree data dependency demands the processing of parent nodes before their children, severely limiting execution parallelism. This research aims to improve the training speed of treeshaped GNN on multi-GPU systems. First, we introduce a cost model that estimates the running time of the training across multiple GPUs. Then, we demonstrate that finding an optimal way to distribute tree-structured data across GPUs is an NP-complete problem on this cost model. We then propose a practical heuristic method for distributing data that improves efficiency while maintaining training quality. The heuristic method first assigns data to batches based on our cost model and then assigns data in each batch to the devices. We also show that our device assigning algorithm is a 4-approximation algorithm. That is, it guarantees that its cost is four times the optimal running time in each training batch, ensuring that it performs effectively in practice. We implement the algorithm and conduct the experiments. The results show that our algorithm achieves a significant increase in training time. The speedup is up to 1.86 for two GPUs, 3.43 for four GPUs, and 7.25 for eight GPUs.

Bridging Local and Global Knowledge via Transformer in Board Games

the thirty-fourth International Joint Conference on Artificial Intelligence (IJCAI), August 2025

Yan-Ru Ju, Tai-Lin Wu, Chung-Chin Shih, Ti-Rong Wu

Yan-Ru Ju Ti-Rong Wu

Abstract

Although AlphaZero has achieved superhuman performance in board games, recent studies reveal its limitations in handling scenarios requiring a comprehensive understanding of the entire board, such as recognizing long-sequence patterns in Go. To address this challenge, we propose ResTNet, a network that interleaves residual and Transformer blocks to bridge local and global knowledge. ResTNet improves playing strength across multiple board games, increasing win rate from 54.6% to 60.8% in 9x9 Go, 53.6% to 60.9% in 19x19 Go, and 50.4% to 58.0% in 19x19 Hex. In addition, ResTNet effectively processes global information and tackles two long-sequence patterns in 19x19 Go, including circular pattern and ladder pattern. It reduces the mean square error for circular pattern recognition from 2.58 to 1.07 and lowers the attack probability against an adversary program from 70.44% to 23.91%. ResTNet also improves ladder pattern recognition accuracy from 59.15% to 80.01%. By visualizing attention maps, we demonstrate that ResTNet captures critical game concepts in both Go and Hex, offering insights into AlphaZero's decision-making process. Overall, ResTNet shows a promising approach to integrating local and global knowledge, paving the way for more effective AlphaZero-based algorithms in board games. Our code is available at https://rlg.iis.sinica.edu.tw/papers/restnet.