Recent Research Results

DiffUMI: Training-Free Universal Model Inversion via Unconditional Diffusion for Face Recognition

IEEE Transactions on Information Forensics and Security, April 2026

Hanrui Wang, Shuo Wang, Chun-Shien Lu, and Isao Echizen

Abstract

Face recognition poses serious privacy risks due to its reliance on sensitive and immutable biometric data. While modern systems mitigate privacy risks by mapping facial images to embeddings (commonly regarded as privacy-preserving), model inversion attacks reveal that identity information can still be recovered, exposing critical vulnerabilities. However, existing attacks are often computationally expensive and lack generalization, especially those requiring target-specific training. Even training-free approaches suffer from limited identity controllability, hindering faithful reconstruction of nuanced or unseen identities. In this work, we propose DiffUMI, the first diffusion-driven, training-free model inversion attack. DiffUMI introduces a novel pipeline combining robust latent code initialization, a ranked adversarial refinement strategy, and a statistically grounded, confidence-aware optimization objective. DiffUMI applies directly to unseen target identities and face recognition models, offering greater adaptability than training-dependent approaches while significantly reducing computational overhead. Our method achieves 84.42%–92.87% attack success rates against inversion-resilient systems and outperforms the best prior training-free GAN-based approach by 4.01%–9.82%. The implementation is available at https://github.com/azrealwang/DiffUMI.

Harnessing Sequence Embedding and Ensemble Learning to Identify Antifungal Peptides with Low Hemolytic Risk

ACS Omega, April 2026

Chung-Yen Lin, Wen-Chih Cheng, U-Lin Chen, Tzu-Tang Lin, Li-Hang Hsu, Yang-Hsin Shih, I-Hsuan Lu, Ying-Lien Chen, Shu-Hwa Chen

Abstract

The increasing prevalence of fungal infections represents a growing threat to human health, driven in part by the misuse of antibiotics and the rising incidence of resistance to conventional antifungal agents. Antifungal peptides (AFPs) have emerged as promising alternatives due to their diverse mechanisms of action and their relatively low propensity to develop resistance. To facilitate the systematic discovery of AFPs, we developed AI4AFP. This computational framework integrates curated antifungal peptide resources with advanced machine learning approaches to predict antifungal potential directly from peptide sequences.

Using a comprehensive dataset, we constructed a seven-model ensemble that combines multiple sequence encoding strategies, including ProtBERT-BFD, PC6, and Doc2Vec, with diverse learning algorithms, including random forests, support vector machines, convolutional neural networks, and fine-tuned BERT models. This ensemble demonstrated robust performance on an independent test set, achieving 0.94 in accuracy and 0.89 in Matthews correlation coefficient, outperforming existing AFP prediction methods. Importantly, the predicted AFP score is intended to reflect the general antifungal potential rather than species-specific potency.

Experimental validation against representative fungal pathogens, including Candida albicans, Candida glabrata, and Cryptococcus neoformans, revealed that peptides with high predicted AFP scores exhibited context-dependent antifungal activity. Several candidates displayed pronounced inhibitory effects against specific species, despite limited activity against others, highlighting the inherent species-dependence of antifungal efficacy and supporting the role of AI4AFP as a prioritization tool rather than a species-specific predictor.

To complement antifungal prediction, we further developed a hemolysis classifier that incorporates both peptide sequence and applied concentration as continuous inputs, enabling explicit modeling of the dose-dependent nature of hemolytic toxicity. Experimental determination of the minimum concentration inducing 10% hemolysis (MHC₁₀) provided an empirical safety reference, allowing antifungal activity to be interpreted alongside concentration-dependent toxicity. All models and validation results are implemented on a user-friendly web server, AI4AFP (https://axp.iis.sinica.edu.tw/AI4AFP), providing an accessible platform for the discovery and prioritization of antifungal peptides, with consideration of both efficacy and safety.
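
The ensemble and the Matthews correlation coefficient (MCC) mentioned in this abstract can be illustrated with a minimal sketch. It assumes equal-weight soft voting and a 0.5 decision threshold; the abstract does not say how the seven models are actually combined, and all scores below are made-up numbers, not AI4AFP outputs.

```python
import math

def soft_vote(prob_lists):
    """Average per-model probabilities for each peptide (equal weights assumed)."""
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical scores from three models for four peptides.
model_probs = [
    [0.9, 0.2, 0.7, 0.4],
    [0.8, 0.1, 0.6, 0.3],
    [0.7, 0.3, 0.5, 0.2],
]
scores = soft_vote(model_probs)
labels = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
```

MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are imbalanced.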

Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing

The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference, July 2026

Arthur Amalvy, Vincent Labatut, Xavier Bost, and Hen-Hsen Huang

Abstract

While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully and publicly share the annotations of copyrighted literary texts. The corpus creator shares the annotations in clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material, and apply the same hash function to their own tokens, in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show that our method is able to correctly align 98.7 to 99.79% of tokens depending on the novel, provided the user version is sufficiently close to the corpus creator's version. We publicly release novelshare, a Python implementation of our method.
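
The hash-and-match idea can be sketched as follows. This is not the novelshare API: the function names, the salt, the lowercasing step, and the use of `difflib` for alignment are all assumptions made for illustration, and the example tokens and tags are invented.

```python
import hashlib
from difflib import SequenceMatcher

def hash_token(token, salt="demo-salt"):
    """One-way hash of a token; salt and lowercasing are hypothetical choices."""
    return hashlib.sha256((salt + token.lower()).encode("utf-8")).hexdigest()

def share(tokens, annotations):
    """Creator side: publish token hashes and annotations, never the tokens."""
    return [hash_token(t) for t in tokens], list(annotations)

def align(shared_hashes, shared_annotations, user_tokens):
    """User side: hash own tokens and copy annotations onto matching runs."""
    user_hashes = [hash_token(t) for t in user_tokens]
    matcher = SequenceMatcher(a=shared_hashes, b=user_hashes, autojunk=False)
    aligned = [None] * len(user_tokens)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            aligned[block.b + k] = shared_annotations[block.a + k]
    return aligned

creator_tokens = ["Mr", "Darcy", "walked", "to", "Pemberley", "."]
creator_tags = ["O", "PER", "O", "O", "LOC", "O"]
# The user's edition differs slightly: "Mr." and an extra word "slowly".
user_tokens = ["Mr.", "Darcy", "walked", "slowly", "to", "Pemberley", "."]

hashes, tags = share(creator_tokens, creator_tags)
result = align(hashes, tags, user_tokens)
```

Tokens that differ between editions simply stay unannotated (`None`), which is one simple way to tolerate small divergences. Hashing single words from a known vocabulary is open to dictionary lookup, and this sketch does not address that.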

Rethinking Forgery Attacks on Semantic Watermarks in Black-Box Settings: A Geometric Distortion Perspective

Forty-third International Conference on Machine Learning (ICML), July 2026

Cheng-Yi Lee, Yichi Zhang, Yuchen Yang, Chun-Shien Lu, and Jun-Cheng Chen

Abstract

Recent studies have shown that semantic watermarks, which embed information into the initial noise of latent diffusion models (LDMs), are vulnerable to black-box forgery attacks. However, existing methods primarily rely on empirical evidence and lack a rigorous theoretical understanding of the conditions under which such attacks succeed or fail. To bridge this gap, we rethink the nature of such attacks through the lens of rate-distortion in the latent space. Our analysis identifies an irreducible distortion floor due to structural mismatches between proxy and target models, which fundamentally limits the fidelity of forged watermarks. We further characterize this distortion as structured geometric deviations on the latent manifold, in the form of global drift and local deformation rather than stochastic noise. Leveraging these insights, we propose a scheme-agnostic detection method that distinguishes forged samples before watermark verification. Extensive experiments demonstrate the effectiveness of our method across diverse black-box scenarios, while preserving robustness to common distortions.

Submodular Optimization for Minimal Augmentation in Robust Language Model Alignment

Forty-third International Conference on Machine Learning (ICML), July 2026

Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu, and Chu-Song Chen

Abstract

Safety alignment of large language models is fragile: even small fine-tuning perturbations elastically revert behaviors toward those of the pretraining, with degradation inversely proportional to the size of the alignment set. We ask how to achieve safety alignment with minimal augmentation. To this end, we model augmentation as a set of group actions on sequences and formalize robustness gains as a normalized, monotone submodular function over transformations. We then leverage submodular optimization to select minimal augmentations that provably improve robustness. Experiments confirm that our approach efficiently restores safety alignment while minimizing the overhead of augmentation.
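
Selecting a small set of augmentations under a monotone submodular objective is commonly done with a greedy rule, sketched below. The coverage objective, the augmentation names, and the numbered "failure modes" are hypothetical stand-ins; the paper's actual robustness function and selection procedure are not given in this abstract.

```python
def coverage_gain(selected, candidate, covers):
    """Marginal gain of adding `candidate` under a set-coverage objective."""
    covered = set()
    for name in selected:
        covered |= covers[name]
    return len(covers[candidate] - covered)

def greedy_select(covers, budget):
    """Pick up to `budget` items, each time taking the largest marginal gain."""
    selected = []
    for _ in range(budget):
        remaining = [c for c in covers if c not in selected]
        if not remaining:
            break
        best = max(remaining, key=lambda c: coverage_gain(selected, c, covers))
        if coverage_gain(selected, best, covers) == 0:
            break  # nothing left to gain, so stop early and keep the set small
        selected.append(best)
    return selected

# Hypothetical augmentations mapped to the failure modes they cover.
covers = {
    "paraphrase": {1, 2, 3},
    "token_swap": {3, 4},
    "case_change": {4},
    "translation": {1, 2},
}
chosen = greedy_select(covers, budget=3)
```

For monotone submodular objectives under a cardinality budget, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimal value, which is why it is a natural baseline for this kind of selection.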

Understanding Audiovisual Deepfake Detection: Techniques, Challenges, Human Factors, and Perceptual Insights

IEEE Computational Intelligence Magazine, May 2026

Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, and Hsin-Min Wang

Abstract

Deep learning has been successfully applied in various fields, and its impact on deepfake detection is no exception. Deepfakes are fake yet realistic synthetic content that can be used deceitfully for political impersonation, phishing, slander, or the spread of misinformation. Despite extensive research on unimodal deepfake detection, the identification of complex deepfakes through joint analysis of audio and visual streams remains relatively unexplored. To fill this gap, this survey first provides an overview of audiovisual deepfake generation techniques, applications, and their consequences, and then provides a comprehensive review of state-of-the-art methods that combine audio and visual modalities to increase detection accuracy, summarizing and critically analyzing their strengths and limitations. Furthermore, we discuss existing open-source datasets for a deeper understanding, which can contribute to the research community and provide necessary information for beginners who want to analyze deep learning-based audiovisual methods for video forensics. By bridging the gap between unimodal and multimodal approaches, this paper aims to improve the effectiveness of deepfake detection strategies and guide future research on cybersecurity and media integrity.

Telomere-to-Telomere, Haplotype-Resolved Chromosome-Level Genome Assembly and Annotation of Taiwan Hard Clam (Meretrix taiwanica)

Scientific Data, May 2026

Ching-Huei Huang, Po-Cheng Hsu, San-Tzu Hsieh, Fu-Shen Tseng, Chung-Yen Lin

Abstract

Taiwan Hard Clam (Meretrix taiwanica) is an economically important aquaculture species in Taiwan, yet genomic resources for this species have remained fragmented. We present a telomere-to-telomere (T2T), haplotype-resolved, chromosome-level genome assembly for M. taiwanica, generated using PacBio HiFi long reads and Hi-C sequencing. The two haploid assemblies (hap1 and hap2) span 1,006.48 Mb and 1,007.28 Mb, comprising 126 and 66 sequences, respectively, and each containing 19 chromosomes. Hap1 and hap2 exhibit sequence N50 values of 53.87 Mb and 51.57 Mb, with average scaffold lengths of 7.99 Mb and 15.26 Mb, and contain 0.0176% and 0.1313% ambiguous bases. Comparative analyses revealed 81.59% and 83.78% syntenic regions between haplotypes and identified 10,175 structural variations. Repetitive elements constitute 47.06% and 47.02% of the hap1 and hap2 genomes. We annotated 23,320 and 23,598 protein-coding gene models, with median gene lengths of 7,721 bp and 7,657.5 bp, respectively. The mitochondrial genome was assembled at 21,164 bp and encodes 13 protein-coding genes, 22 tRNAs, and 2 rRNAs. Functional annotation covered 16.23% and 16.33% of the nuclear and mitochondrial gene sets. BUSCO analysis indicated genome completeness of 92.4% and 92.5%, and proteome completeness of 95.4% and 94.5% for hap1 and hap2. By providing the first T2T-level reference, this dataset enables precise identification of trait-associated markers for marker-assisted selection (MAS), thereby facilitating genetic improvement of growth and stress-resistance traits. Furthermore, it serves as a robust genomic framework for conservation genomics to assess the genetic diversity of both wild and hatchery populations of this economically vital species.
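
For readers unfamiliar with the N50 statistic quoted above, a minimal sketch of its usual definition follows: the length L such that sequences of length at least L together cover at least half of the assembly. The lengths in the example are arbitrary and unrelated to the M. taiwanica assemblies.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the running sum to half the total.
        if running * 2 >= total:
            return length
    return 0

# Arbitrary example lengths (for instance in Mb); total is 290, half is 145.
example_lengths = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_lengths)
```

Here the two longest sequences (80 + 70 = 150) are the first to reach half of the total, so the N50 is 70.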

Regret-Guided Search Control for Efficient Learning in AlphaZero

The Fourteenth International Conference on Learning Representations (ICLR), April 2026

Yun-Jui Tsai, Wei-Yu Chen, Yan-Ru Ju, Yu-Hung Chang, Ti-Rong Wu

Abstract

Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning-efficient than humans. While RL agents require extensive self-play games to extract useful signals, humans often need only a few games, improving rapidly by repeatedly revisiting states where mistakes occurred. This idea, known as search control, aims to restart from valuable states rather than always from the initial state. In AlphaZero, prior work Go-Exploit applies this idea by sampling past states from self-play or search trees, but it treats all states equally, regardless of their learning potential. We propose Regret-Guided Search Control (RGSC), which extends AlphaZero with a regret network that learns to identify high-regret states, where the agent's evaluation diverges most from the actual outcome. These states are collected from both self-play trajectories and MCTS nodes, stored in a prioritized regret buffer, and reused as new starting positions. Across 9x9 Go, 10x10 Othello, and 11x11 Hex, RGSC outperforms AlphaZero and Go-Exploit by an average of 77 and 89 Elo, respectively. When training on a well-trained 9x9 Go model, RGSC further improves the win rate against KataGo from 69.3% to 78.2%, while both baselines show no improvement. These results demonstrate that RGSC provides an effective mechanism for search control, improving both efficiency and robustness of AlphaZero training. Our code is available at https://rlg.iis.sinica.edu.tw/papers/rgsc.
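
The prioritized regret buffer can be pictured with a small sketch. It is a simplification, not the RGSC implementation: regret is approximated here as the absolute gap between a predicted value and the game outcome, eviction removes the lowest-regret entry, and sampling is proportional to regret. All three are assumptions, as are the class and method names.

```python
import random

def regret(predicted_value, outcome):
    """Assumed regret proxy: gap between the evaluation and the final outcome."""
    return abs(predicted_value - outcome)

class RegretBuffer:
    """Fixed-capacity buffer that favors high-regret states (hypothetical design)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.entries = []  # list of (regret, state) pairs
        self.rng = random.Random(seed)

    def add(self, state, regret_value):
        self.entries.append((regret_value, state))
        if len(self.entries) > self.capacity:
            # Evict the entry with the lowest regret.
            lowest = min(self.entries, key=lambda entry: entry[0])
            self.entries.remove(lowest)

    def sample_start_state(self):
        """Sample a restart state with probability proportional to regret.

        Assumes the buffer is not empty.
        """
        weights = [r for r, _ in self.entries]
        if sum(weights) == 0:
            return self.rng.choice(self.entries)[1]
        return self.rng.choices(self.entries, weights=weights, k=1)[0][1]

buffer = RegretBuffer(capacity=2)
buffer.add("opening position", regret(0.1, 0.0))
buffer.add("midgame blunder", regret(0.8, -1.0))
buffer.add("endgame slip", regret(0.4, -1.0))
start = buffer.sample_start_state()
```

With capacity 2, the low-regret opening position is evicted, so new games restart from one of the two states where the evaluation was most wrong.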

HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection

The Fourteenth International Conference on Learning Representations (ICLR), April 2026

Chin-Chia Yang, Yung-Yu Chuang, Hwann-Tzong Chen and Tyng-Luh Liu

Abstract

Synthetic image generators evolve rapidly, challenging detectors to generalize across current methods and adapt to new ones. We study domain-incremental synthetic image detection with a two-phase evaluation. Phase I trains on either diffusion- or GAN-based data and tests on the combined group to quantify bidirectional cross-generator transfer. Phase II sequentially introduces renders from 3D Gaussian Splatting (3DGS) head avatar pipelines, requiring adaptation while preserving earlier performance. We observe that CLIP-based detectors inherit text-image alignment semantics that are irrelevant to authenticity and hinder generalization. We introduce a Hilbert-Schmidt Independence Criterion (HSIC) bottleneck loss on intermediate CLIP ViT features, encouraging representations predictive of real versus synthetic while independent of generator identity and caption alignment. For domain-incremental learning, we propose HSIC-Guided Replay (HGR), which selects per-class exemplars via a hybrid score combining HSIC relevance with k-center coverage, yielding compact memories that mitigate forgetting. Empirically, the HSIC bottleneck improves transfer between diffusion and GAN families, and HGR sustains prior accuracy while adapting to 3DGS renders. These results underscore the value of information-theoretic feature shaping and principled replay for resilient detection under shifting generative regimes.
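
As background for the HSIC terms used above, the standard biased empirical estimator, tr(KHLH) / (n - 1)^2 with centering matrix H, can be sketched in plain Python. The Gaussian kernel, the bandwidth, and the toy data are assumptions; how the paper builds its bottleneck loss or the HGR score from such estimates is not specified in this abstract.

```python
import math

def gaussian_kernel(xs, sigma=1.0):
    """Gram matrix of a Gaussian kernel over a list of feature vectors."""
    return [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))
         for y in xs]
        for x in xs
    ]

def center(K):
    """Double-center a symmetric Gram matrix, i.e. compute H K H."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    grand_mean = sum(row_means) / n
    return [
        [K[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between paired samples xs and ys."""
    n = len(xs)
    Kc = center(gaussian_kernel(xs, sigma))
    Lc = center(gaussian_kernel(ys, sigma))
    total = sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n))
    return total / (n - 1) ** 2

# Toy data: features, a fully dependent variable, and a constant one.
features = [[0.0], [1.0], [2.0], [3.0]]
dependent = [[0.0], [1.0], [2.0], [3.0]]
constant = [[5.0], [5.0], [5.0], [5.0]]
```

A value near zero, as with the constant variable, indicates no detectable dependence; larger values indicate stronger dependence between the two sets of samples.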

Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement

IEEE Transactions on Audio, Speech and Language Processing, February 2026

Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, and Berlin Chen

Abstract

Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, particularly in the presence of unseen noise and channel distortions. In view of this, we present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture that consists of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, facilitating the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To enhance generalization further, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.

Cross-Attention Reprogramming for ASR: Bridging Discrete Speech Units and Pretrained Language Models

IEEE Access, January 2026

Pei-Jun Liao, Hung-Yi Lee, and Hsin-Min Wang

Abstract

In automatic speech recognition (ASR), an emerging trend involves converting continuous speech features into sequences of discrete speech units (DSUs) via quantization. A key advantage of DSU representations is their compatibility with pretrained language models (PLMs), where DSUs are directly mapped to PLM token indices and the embedding layer is fine-tuned. However, this conventional strategy often relies heavily on large-scale training data to mitigate the inherent modality mismatch. In light of this, we explore a more effective way to exploit the PLM embedding dictionary. Drawing inspiration from Time-LLM, a recent time-series forecasting model, we propose a cross-attention reprogramming mechanism that incorporates codebook information from the DSU quantizer to better align the DSUs with the PLM embeddings. Compared to direct fine-tuning of PLM embeddings, our method consistently achieves improvements on the Discrete Audio and Speech Benchmark (DASB), reaching state-of-the-art performance across most DASB-style settings. We also evaluate our method on LibriSpeech-960, LibriLight-10, and Swedish, Czech, and Hungarian data from Common Voice, and observe similar trends. Notably, the proposed reprogramming method demonstrates significant gains over the fine-tuning baseline, particularly in cross-lingual and low-resource scenarios. This study proposes a new approach to using PLM embedding dictionaries in DSU-based ASR, and lays a foundation for combining speech representations with large language models in other discriminative tasks of speech processing such as speech emotion recognition and spoken question answering.
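
The general shape of a cross-attention step between DSU codebook vectors (as queries) and PLM embeddings (as keys and values) can be sketched as below. This is plain scaled dot-product attention with no learned projections, and the tiny vectors are invented; the paper's actual reprogramming layer is not described in enough detail here to reproduce it.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the value vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

# Hypothetical 2-dimensional vectors: one DSU codebook entry, two PLM embeddings.
dsu_codebook_vectors = [[10.0, 0.0]]
plm_embeddings = [[1.0, 0.0], [0.0, 1.0]]
reprogrammed = cross_attention(dsu_codebook_vectors, plm_embeddings, plm_embeddings)
```

Each output is a convex combination of PLM embedding rows, so the speech unit is expressed in the language model's own embedding space. In this toy case the query aligns with the first embedding and the output lies almost entirely on it.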

Can We Formalise Type Theory Intrinsically without Any Compromise? A Case Study in Cubical Agda

Proceedings of the 15th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP '26), January 2026

Liang-Ting Chen, Fredrik Nordvall Forsberg, Tzu-Chun Tsai

Abstract

We present an intrinsic representation of type theory in the proof assistant Cubical Agda, inspired by Awodey’s natural models of type theory. The initial natural model is defined as quotient inductive-inductive-recursive types, leading us to a syntax accepted by Cubical Agda without using any transports, postulates, or custom rewrite rules. We formalise some meta-properties such as the standard model, normalisation by evaluation for typed terms, and strictification constructions. Since our formalisation is carried out using Cubical Agda's native support for quotient inductive types, all our constructions compute at a reasonable speed. When we try to develop more sophisticated metatheory, however, the 'transport hell' problem reappears. Ultimately, it remains a considerable struggle to develop the metatheory of type theory using an intrinsic representation that lacks strict equations. The effort required is about the same whether or not the notion of natural model is used.

Efficient Column-Wise N:M Pruning on RISC-V CPU

Journal of Systems Architecture (JSA), March 2026

Chi-Wei Chu, Ding-Yong Hong, Jan-Jan Wu

Abstract

In deep learning frameworks, weight pruning is a widely used technique for improving computational efficiency by reducing the size of large models. This is especially critical for convolutional operators, which often act as performance bottlenecks in convolutional neural networks (CNNs). However, the effectiveness of pruning heavily depends on how it is implemented, as different methods can significantly impact both computational performance and memory footprint. In this work, we propose a column-wise N:M pruning strategy applied at the tile level and modify XNNPACK to enable efficient execution of pruned models on the RISC-V vector architecture. Additionally, we propose fusing the operations of im2col and data packing to minimize redundant memory accesses and memory overhead. To further optimize performance, we incorporate AITemplate’s profiling technique to identify the optimal implementation for each convolutional operator. Our proposed approach effectively increases ResNet inference throughput by as much as 4×, and preserves ImageNet top-1 accuracy within 2.1% of the dense baseline.
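
One plausible reading of column-wise N:M pruning on a tile is sketched below: within each group of M consecutive columns, keep the N columns with the largest L1 norm over the tile's rows and zero the rest. This interpretation, the L1 criterion, and the function name are assumptions; the sketch does not reflect the XNNPACK kernels or the RISC-V vector code from the paper.

```python
def prune_tile_columns(tile, n, m):
    """Zero all but the n strongest columns in every group of m columns.

    `tile` is a list of rows (lists of floats); a pruned copy is returned.
    """
    rows, cols = len(tile), len(tile[0])
    pruned = [row[:] for row in tile]
    for start in range(0, cols, m):
        group = list(range(start, min(start + m, cols)))
        # Column importance: L1 norm over the rows of this tile (assumed criterion).
        norms = {c: sum(abs(tile[r][c]) for r in range(rows)) for c in group}
        keep = set(sorted(group, key=lambda c: norms[c], reverse=True)[:n])
        for c in group:
            if c not in keep:
                for r in range(rows):
                    pruned[r][c] = 0.0
    return pruned

# A hypothetical 2x4 weight tile pruned with a 2:4 pattern.
tile = [
    [0.1, -2.0, 0.3, 1.5],
    [0.2, 1.0, -0.1, -2.5],
]
pruned_tile = prune_tile_columns(tile, n=2, m=4)
```

Because all rows of a tile drop the same columns, the surviving columns can be packed densely and the matching input rows skipped, which is the kind of regularity that favors vectorized execution.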

AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection

IEEE Transactions on Cognitive and Developmental Systems, December 2025

Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, and Hsin-Min Wang

Abstract

The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous studies on detecting artificial intelligence-generated fake videos utilize only the visual modality or the audio modality. While some methods exploit audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multimodal datasets of deepfake videos involving acoustic and visual manipulations, and are mostly based on convolutional neural networks with low detection accuracy. Considering that human cognition instinctively integrates multisensory information, including audio and visual cues, to perceive and interpret content, and given the success of transformers in various fields, this study introduces the audio-visual transformer-based ensemble network (AVTENet). This innovative framework tackles the complexities of deepfake technology by integrating both acoustic and visual manipulations to enhance the accuracy of video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture video, audio, and audio-visual salient cues to reach a consensus in prediction. For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that the proposed model outperforms all existing methods and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset. We also compare AVTENet against humans in detecting video forgery. The results show that AVTENet significantly outperforms humans.