Institute of Information Science, Academia Sinica

Recent Research Results

Efficient Beam Search for Large Language Models Using Trie-Based Decoding

The 30th Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), November 2025

Brian J Chan, Mao-xun Huang, Jui-Hung Cheng, Chao-Ting Chen, Hen-Hsen Huang

Hen-Hsen Huang

Abstract

This work presents a novel trie (prefix-tree)-based parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single KV cache across beams with common prefixes, our approach dramatically reduces memory usage and enables efficient decoding. We evaluated our method across three attention architectures--Multi-Head Attention (Phi-3.5-mini-instruct), Grouped Query Attention (Llama-3.1-8B-Instruct), and Sliding Window Attention (Mistral-Small-24B-Instruct-2501)--using CNN/DailyMail for abstractive summarization and HumanEval for code generation. Our experiments demonstrate substantial memory savings (4–8×) and notable decoding speed improvements (up to 2.4×), without compromising generation quality. These results highlight the method's suitability for memory-constrained environments and large-scale deployments.
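The memory argument in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it only counts how many per-token cache entries a prefix tree would hold compared with storing each beam separately. The class name `TrieNode`, both helper functions, and the token ids are hypothetical.

```python
from typing import Dict, List


class TrieNode:
    """One decoded token; a real decoder would attach its KV-cache entry here."""

    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}


def count_trie_nodes(beams: List[List[int]]) -> int:
    """Insert every beam into a trie and count the token nodes created."""
    root = TrieNode()
    nodes = 0
    for beam in beams:
        node = root
        for token in beam:
            if token not in node.children:
                # A new node is needed only where this beam leaves the shared prefix.
                node.children[token] = TrieNode()
                nodes += 1
            node = node.children[token]
    return nodes


def count_batch_entries(beams: List[List[int]]) -> int:
    """Batch-style storage keeps a separate copy of every token for every beam."""
    return sum(len(beam) for beam in beams)


# Hypothetical token ids: four beams that all share the prefix [5, 9, 2].
beams = [
    [5, 9, 2, 7, 1],
    [5, 9, 2, 7, 3],
    [5, 9, 2, 4, 8],
    [5, 9, 2, 4, 6],
]
trie_entries = count_trie_nodes(beams)
batch_entries = count_batch_entries(beams)
print(trie_entries, batch_entries)
```

In this toy case the trie holds 9 entries against 20 for per-beam storage; the actual savings reported in the abstract depend on the model, the beam width, and how long the beams stay on a common prefix.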

BadVim: Unveiling Backdoor Threats in Visual State Space Model

European Conference on Artificial Intelligence (ECAI), October 2025

Cheng-Yi Lee, Yu-Hsuan Chiang, Zhong-You Wu, Chia-Mu Yu, and Chun-Shien Lu

Cheng-Yi Lee Yu-Hsuan Chiang Chun-Shien Lu

Abstract

Visual State Space Models (VSSMs) have shown remarkable performance in various computer vision tasks. However, backdoor attacks pose significant security challenges, causing compromised models to predict target labels when specific triggers are present while maintaining normal behavior on benign samples. In this paper, we investigate the robustness of VSSMs against backdoor attacks. Specifically, we carefully design a novel framework for VSSMs, dubbed BadVim, which applies state-wise low-rank perturbations to uncover their impact on state transitions during training. By poisoning only 0.3% of the training data, our attacks cause any trigger-embedded input to be misclassified to the targeted class with a high attack success rate (over 97%) at inference time. Our findings suggest that the state-space representation property of VSSMs, which enhances model capability, may also contribute to its vulnerability to backdoor attacks. Our attack is effective across three datasets and even bypasses state-of-the-art defenses against such attacks. Extensive experiments show that the backdoor robustness of VSSMs is comparable to that of Transformers (ViTs) and superior to that of Convolutional Neural Networks (CNNs). We believe our findings will prompt the community to reconsider the trade-offs between performance and robustness in model design.

MaXsive: High-Capacity and Robust Training-Free Generative Image Watermarking in Diffusion Models

ACM International Conference on Multimedia (ACM MM), October 2025

Po-Yuan Mao, Cheng-Chang Tsai, and Chun-Shien Lu

Po-Yuan Mao Cheng-Chang Tsai Chun-Shien Lu

Abstract

The great success of the diffusion model in image synthesis has led to the release of gigantic commercial models, raising the issues of copyright protection and inappropriate content generation. Training-free diffusion watermarking provides a low-cost solution for these issues. However, prior works remain vulnerable to rotation, scaling, and translation (RST) attacks. Although some methods employ meticulously designed patterns to mitigate this issue, they often reduce watermark capacity, which can result in identity (ID) collusion. To address these problems, we propose MaXsive, a training-free generative watermarking technique for diffusion models that has high capacity and robustness. MaXsive makes the best use of the initial noise to watermark the diffusion model. Moreover, instead of using a meticulously designed repetitive ring pattern, we propose injecting an X-shape template to recover from RST distortions. This design significantly increases robustness without losing any capacity, making ID collusion less likely to happen. The effectiveness of MaXsive has been verified on two well-known watermarking benchmarks under the scenarios of verification and identification.

Bridging Local and Global Knowledge via Transformer in Board Games

The Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), August 2025

Yan-Ru Ju, Tai-Lin Wu, Chung-Chin Shih, Ti-Rong Wu

Ti-Rong Wu

Abstract

Although AlphaZero has achieved superhuman performance in board games, recent studies reveal its limitations in handling scenarios requiring a comprehensive understanding of the entire board, such as recognizing long-sequence patterns in Go. To address this challenge, we propose ResTNet, a network that interleaves residual and Transformer blocks to bridge local and global knowledge. ResTNet improves playing strength across multiple board games, increasing win rate from 54.6% to 60.8% in 9x9 Go, 53.6% to 60.9% in 19x19 Go, and 50.4% to 58.0% in 19x19 Hex. In addition, ResTNet effectively processes global information and tackles two long-sequence patterns in 19x19 Go: the circular pattern and the ladder pattern. It reduces the mean squared error for circular pattern recognition from 2.58 to 1.07 and lowers the attack probability against an adversary program from 70.44% to 23.91%. ResTNet also improves ladder pattern recognition accuracy from 59.15% to 80.01%. By visualizing attention maps, we demonstrate that ResTNet captures critical game concepts in both Go and Hex, offering insights into AlphaZero's decision-making process. Overall, ResTNet offers a promising approach to integrating local and global knowledge, paving the way for more effective AlphaZero-based algorithms in board games. Our code is available at https://rlg.iis.sinica.edu.tw/papers/restnet.

NeuroAMP: A Novel End-to-end General Purpose Deep Neural Amplifier for Personalized Hearing Aids

IEEE Transactions on Artificial Intelligence, August 2025

Shafique Ahmed, Ryandhimas E. Zezario, Hui-Guan Yuan, Amir Hussain, Hsin-Min Wang, Wei-Ho Chung, and Yu Tsao

Hsin-Min Wang Yu Tsao

Abstract

The prevalence of hearing aids is increasing. However, optimizing their amplification remains challenging due to the complexity of integrating multiple components in traditional methods. To address this, we present NeuroAMP, a novel deep neural network for end-to-end, personalized amplification in hearing aids. NeuroAMP leverages spectral features and the listener’s audiogram as inputs, and we explore four architectures: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Convolutional Recurrent Neural Network (CRNN), and Transformer. We also introduce Denoising NeuroAMP, an extension that integrates noise reduction with amplification for improved real-world performance. To enhance generalization, we employed a comprehensive data augmentation strategy during training on diverse speech (TIMIT, TMHINT) and music (Cadenza Challenge MUSIC) datasets. Evaluation using the Hearing Aid Speech Perception Index (HASPI), Hearing Aid Speech Quality Index (HASQI), and Hearing Aid Audio Quality Index (HAAQI) shows that the Transformer-based NeuroAMP achieves the best performance, with SRCC scores of 0.9927 (HASQI) and 0.9905 (HASPI) on TIMIT, and 0.9738 (HAAQI) on the Cadenza dataset. Notably, the augmentation strategy maintains robust performance on unseen datasets (e.g., VoiceBank-DEMAND, MUSDB18-HQ). Furthermore, Denoising NeuroAMP outperforms both the conventional NAL-R+WDRC method and a two-stage baseline on the VoiceBank-DEMAND dataset, achieving a HASPI of 0.90 and a HASQI of 0.59. These results highlight the strong potential of NeuroAMP and Denoising NeuroAMP to provide a novel and effective framework for personalized hearing aid amplification.

HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids

IEEE Transactions on Audio, Speech and Language Processing, February 2025

Dyah A. M. G. Wisnu, Stefano Rini, Ryandhimas E. Zezario, Hsin-Min Wang, and Yu Tsao

Hsin-Min Wang Yu Tsao

Abstract

This paper introduces HAAQI-Net, a non-intrusive music audio quality assessment model for hearing aid users. Unlike traditional methods such as Hearing Aid Audio Quality Index (HAAQI), which requires intrusive reference signal comparisons, HAAQI-Net offers a more accessible and computationally efficient alternative. Leveraging a bidirectional long short-term memory architecture with attention mechanisms and features extracted from a pre-trained BEATs model, it can predict HAAQI scores directly from music audio clips and hearing loss patterns. The experimental results demonstrate that, compared to the traditional HAAQI as the reference, HAAQI-Net achieves a linear correlation coefficient (LCC) of 0.9368, a Spearman's rank correlation coefficient (SRCC) of 0.9486, and a mean squared error (MSE) of 0.0064, while significantly reducing the inference time from 62.52 seconds to 2.54 seconds. Furthermore, a knowledge distillation strategy was applied, reducing the parameters by 75.85% and inference time by 96.46%, while maintaining strong performance (LCC: 0.9071, SRCC: 0.9307, MSE: 0.0091). To expand its capabilities, HAAQI-Net was adapted to predict subjective human scores, mean opinion score (MOS), by fine-tuning. This adaptation significantly improved the prediction accuracy. Furthermore, the robustness of HAAQI-Net was evaluated under varying sound pressure level (SPL) conditions, revealing optimal performance at a reference SPL of 65 dB, with the accuracy gradually decreasing as SPL deviated from this point. The advancements in subjective score prediction, SPL robustness, and computational efficiency position HAAQI-Net as a reliable solution for music audio quality assessment, significantly contributing to the development of efficient and accurate models in audio signal processing and hearing aid technology.
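For readers unfamiliar with the three agreement measures quoted in this abstract (LCC, SRCC, MSE), the sketch below shows how they are commonly computed between reference scores and predicted scores. It is a generic illustration in plain Python, not code from the paper; the function names and the four score pairs are made up for the example.

```python
from math import sqrt
from typing import List


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between predictions and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson(x: List[float], y: List[float]) -> float:
    """Linear (Pearson) correlation coefficient, i.e. the LCC."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(sxx * syy)


def ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up scores for illustration only.
reference = [0.2, 0.4, 0.6, 0.8]
predicted = [0.25, 0.35, 0.7, 0.75]
print(pearson(predicted, reference), spearman(predicted, reference), mse(predicted, reference))
```

LCC rewards predictions that are linearly related to the reference, SRCC only checks that the ordering is preserved, and MSE measures absolute error, which is why the abstract reports all three.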

Bottom-up computation using trees of sublists

Journal of Functional Programming, December 2024

Shin-Cheng Mu

Shin-Cheng Mu

Abstract

Some top-down problem specifications, if executed, may compute sub-problems repeatedly. Instead, we may want a bottom-up algorithm that stores solutions of sub-problems in a table to be reused. How the table can be represented and efficiently maintained, however, can be tricky. We study a special case: computing a function h taking lists as inputs such that h xs is defined in terms of all immediate sublists of xs. Richard Bird studied this problem in 2008 and presented a concise but cryptic algorithm without much explanation. We give this algorithm a proper derivation and discover a key property that allows it to work. The algorithm builds trees that have certain shapes: the sizes along the left spine form a prefix of a diagonal in Pascal’s triangle. The crucial function we derive transforms one diagonal to the next.
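The problem setting can be made concrete with a small sketch. It contrasts a naive top-down specification with a level-by-level bottom-up computation that keeps one table per sublist length. This is a plain dictionary-based table, not Bird's tree-based algorithm or the derivation in the paper; the names `top_down`, `bottom_up`, `base`, and `step` are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Step = Callable[[List[int], List[int]], int]


def immediate_sublists(xs: Sequence[int]) -> List[Tuple[int, ...]]:
    """All sublists obtained by dropping exactly one element of xs."""
    return [tuple(xs[:i]) + tuple(xs[i + 1:]) for i in range(len(xs))]


def top_down(xs: Sequence[int], base: int, step: Step) -> int:
    """Direct specification: recomputes shared sub-problems many times."""
    if not xs:
        return base
    results = [top_down(ys, base, step) for ys in immediate_sublists(xs)]
    return step(list(xs), results)


def bottom_up(xs: Sequence[int], base: int, step: Step) -> Tuple[int, List[int]]:
    """Compute the same value level by level, reusing the previous level.

    Sublists are keyed by the positions they keep, so repeated elements
    in xs are not confused. Also returns the table size at each level.
    """
    n = len(xs)
    table: Dict[Tuple[int, ...], int] = {(): base}
    sizes: List[int] = []
    for k in range(1, n + 1):
        new_table: Dict[Tuple[int, ...], int] = {}
        for idxs in combinations(range(n), k):
            # Results for the k immediate sublists come from the previous level.
            subs = [table[idxs[:i] + idxs[i + 1:]] for i in range(k)]
            new_table[idxs] = step([xs[i] for i in idxs], subs)
        sizes.append(len(new_table))
        table = new_table
    return table[tuple(range(n))], sizes


# Example step: sum the results of the immediate sublists (gives n! for base 1).
value, sizes = bottom_up([1, 2, 3, 4], 1, lambda ys, rs: sum(rs))
print(value, sizes)
```

The level sizes are binomial coefficients, which hints at why Pascal's triangle appears in the shape of the trees described in the abstract.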

GPU Memory Usage Optimization for Backward Propagation in Deep Network Training

Journal of Parallel and Distributed Computing (JPDC), May 2025

Ding-Yong Hong, Tzu-Hsien Tsai, Ning Wang, Pangfeng Liu, Jan-Jan Wu

Ding-Yong Hong Jan-Jan Wu

Abstract

In modern deep learning, it has been a trend to design larger Deep Neural Networks (DNNs) to execute more complex tasks with better accuracy. Meanwhile, Convolutional Neural Networks (CNNs) have become the standard method for most computer vision tasks. However, the memory allocation for the intermediate data in convolution layers can cause severe memory pressure during model training. Many solutions have been proposed to resolve the problem. Besides hardware-dependent solutions, a general methodology, rematerialization, can reduce GPU memory usage by trading computation for memory efficiently. The idea is to select a set of intermediate results during the forward phase as checkpoints, and only save them in memory to reduce memory usage. The backward phase recomputes the intermediate data from the closest checkpoints in memory as needed. This recomputation increases execution time but saves memory by not storing all intermediate results in memory during the forward phase. In this paper, we focus on efficiently finding the optimal checkpoint subset to achieve the least peak memory usage during model training. We first describe the theoretical background of the training of a neural network using mathematical equations. We use these equations to identify all essential data required during both forward and backward phases to compute the gradient of weights of the model. We then identify the checkpoint selection problem and propose a dynamic programming algorithm with time complexity to solve the problem of finding the optimal checkpoint subset. With extensive experiments, we formulate a more accurate description of the problem using our theoretical analysis and revise the objective function based on the tracing, and propose an -time algorithm for finding the optimal checkpoint subset.
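The trade-off described in this abstract can be sketched with a toy cost model. This is not the paper's objective function or its dynamic programming algorithm: it assumes a simple chain of layers whose outputs all have unit size, and it finds the best checkpoint subset by exhaustive search, which is only feasible for very small chains. The function names are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def checkpoint_cost(n_layers: int, checkpoints: List[int]) -> Tuple[int, int]:
    """Return (peak_stored_outputs, recomputed_layers) under a toy model.

    Assumptions: layers are numbered 1..n_layers in a chain, every output
    has unit size, checkpointed outputs stay in memory, and the backward
    phase recomputes and holds one gap between checkpoints at a time.
    """
    bounds = [0] + sorted(checkpoints) + [n_layers + 1]
    # Number of non-checkpointed layers between consecutive boundaries.
    gaps = [b - a - 1 for a, b in zip(bounds, bounds[1:])]
    peak = len(checkpoints) + max(gaps)
    recomputed = sum(gaps)
    return peak, recomputed


def best_checkpoints(n_layers: int) -> Tuple[int, int, Tuple[int, ...]]:
    """Exhaustive search: lowest peak first, then least recomputation."""
    best = None
    for k in range(n_layers + 1):
        for subset in combinations(range(1, n_layers + 1), k):
            peak, recomputed = checkpoint_cost(n_layers, list(subset))
            candidate = (peak, recomputed, subset)
            if best is None or candidate < best:
                best = candidate
    return best


store_all = checkpoint_cost(16, list(range(1, 17)))  # keep every output
every_4th = checkpoint_cost(16, [4, 8, 12, 16])      # evenly spaced checkpoints
best_8 = best_checkpoints(8)
print(store_all, every_4th, best_8)
```

In this model, keeping every output of a 16-layer chain stores 16 outputs with no recomputation, while a checkpoint every fourth layer stores at most 7 at the cost of recomputing 12 layers. Real networks have non-uniform layer sizes and more complex dependencies, which is what makes the selection problem studied in the paper non-trivial.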