
Institute of Information Science, Academia Sinica

Research


Recent Research Results


Efficient Column-Wise N:M Pruning on RISC-V CPU

Journal of Systems Architecture (JSA), March 2026

Chi-Wei Chu, Ding-Yong Hong, Jan-Jan Wu

Abstract

In deep learning frameworks, weight pruning is a widely used technique for improving computational efficiency by reducing the size of large models. This is especially critical for convolutional operators, which often act as performance bottlenecks in convolutional neural networks (CNNs). However, the effectiveness of pruning heavily depends on how it is implemented, as different methods can significantly impact both computational performance and memory footprint. In this work, we propose a column-wise N:M pruning strategy applied at the tile level and modify XNNPACK to enable efficient execution of pruned models on the RISC-V vector architecture. Additionally, we propose fusing the operations of im2col and data packing to minimize redundant memory accesses and memory overhead. To further optimize performance, we incorporate AITemplate’s profiling technique to identify the optimal implementation for each convolutional operator. Our proposed approach effectively increases ResNet inference throughput by as much as 4×, and preserves ImageNet top-1 accuracy within 2.1% of the dense baseline.
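To make the pruning pattern concrete, here is a small NumPy sketch of column-wise N:M pruning on a weight tile (an illustrative stand-in only, not the paper's XNNPACK/RISC-V vector implementation; the function name and tile shape are hypothetical): within each column, every group of M consecutive weights keeps its N largest-magnitude entries and zeroes the rest.

# Illustrative sketch of column-wise N:M pruning (not the paper's code).
# For each column of a weight tile, every group of M consecutive rows keeps
# only its N largest-magnitude values; the rest are zeroed.
import numpy as np

def prune_column_wise_nm(tile: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    rows, cols = tile.shape
    assert rows % m == 0, "tile height must be a multiple of M"
    pruned = tile.copy()
    for c in range(cols):
        for r in range(0, rows, m):
            group = pruned[r:r + m, c]
            # indices of the (M - N) smallest-magnitude weights in this group
            drop = np.argsort(np.abs(group))[: m - n]
            group[drop] = 0.0
    return pruned

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 4)).astype(np.float32)
    print(prune_column_wise_nm(w, n=2, m=4))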

AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection

IEEE Transactions on Cognitive and Developmental Systems, December 2025

Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, and Hsin-Min Wang

Abstract

The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous studies on detecting artificial intelligence-generated fake videos utilize only the visual modality or the audio modality. While some methods exploit both audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multimodal datasets of deepfake videos involving acoustic and visual manipulations, and are mostly based on convolutional neural networks with low detection accuracy. Considering that human cognition instinctively integrates multisensory information, including audio and visual cues, to perceive and interpret content, and given the success of Transformers in various fields, this study introduces the audio-visual transformer-based ensemble network (AVTENet). This innovative framework tackles the complexities of deepfake technology by integrating both acoustic and visual manipulations to enhance the accuracy of video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture video, audio, and audio-visual salient cues to reach a consensus in prediction. For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that the proposed model outperforms all existing methods and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset. We also compare AVTENet against humans in detecting video forgery. The results show that AVTENet significantly outperforms humans.
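As a rough illustration of the consensus idea, the following PyTorch sketch averages the predictions of three modality-specific branches (the branch networks here are placeholder linear layers over pre-extracted features; the actual AVTENet branches are transformer-based models described in the paper).

# Hypothetical sketch of prediction-level consensus across audio, visual,
# and audio-visual branches; the real AVTENet branches are transformer
# models and are not reproduced here.
import torch
import torch.nn as nn

class ConsensusEnsemble(nn.Module):
    def __init__(self, audio_net, video_net, av_net):
        super().__init__()
        self.audio_net = audio_net
        self.video_net = video_net
        self.av_net = av_net

    def forward(self, audio_feats, video_feats):
        # Each branch emits real/fake logits; averaging the softmax
        # probabilities acts as the consensus decision.
        probs = torch.stack([
            self.audio_net(audio_feats).softmax(-1),
            self.video_net(video_feats).softmax(-1),
            self.av_net(torch.cat([audio_feats, video_feats], dim=-1)).softmax(-1),
        ], dim=0)
        return probs.mean(dim=0)

# Toy usage with linear branches over 128-d audio and 256-d video features.
ensemble = ConsensusEnsemble(nn.Linear(128, 2), nn.Linear(256, 2), nn.Linear(384, 2))
out = ensemble(torch.randn(4, 128), torch.randn(4, 256))  # (4, 2) class probabilities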

FedSDA: Federated Stain Distribution Alignment for Non-IID Histopathological Image Classification

The 40th Annual AAAI Conference on Artificial Intelligence (AAAI), January 2026

Cheng-Chang Tsai, Kevin Cheng, and Chun-Shien Lu

Abstract

Federated learning (FL) has shown success in collaboratively training a model among decentralized data resources without directly sharing privacy-sensitive training data. Despite recent advances, non-IID (non-independent and identically distributed) data poses an inevitable challenge that hinders the use of FL. In this work, we address the issue of non-IID histopathological images with feature distribution shifts from an intuitive perspective that has only received limited attention. Specifically, we address this issue from the perspective of data distribution by solely adjusting the data distributions of all clients. Building on the success of diffusion models in fitting data distributions and leveraging stain separation to extract the pivotal features that are closely related to the non-IID properties of histopathological images, we propose a Federated Stain Distribution Alignment (FedSDA) method. FedSDA aligns the stain distribution of each client with a target distribution in an FL framework to mitigate distribution shifts among clients. Furthermore, considering that training diffusion models on raw data in FL has been shown to be susceptible to privacy leakage risks, we circumvent this problem while still effectively achieving alignment. Extensive experimental results show that FedSDA is not only effective in improving baselines that focus on mitigating disparities across clients’ model updates but also outperforms baselines that address the non-IID data issues from the perspective of data distribution. We show that FedSDA provides valuable and practical insights for the computational pathology community.
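As a simplified stand-in for the alignment step (FedSDA itself uses stain separation and diffusion models, which are not reproduced here), the sketch below matches the first and second moments of a client's stain-channel concentrations to a shared target distribution; the function name and array shapes are hypothetical.

# Simplified stand-in for client-side distribution alignment: moment matching
# of stain-channel statistics to a shared target. FedSDA's actual alignment is
# diffusion-based and operates within the FL framework.
import numpy as np

def align_to_target(stains: np.ndarray, target_mean: np.ndarray, target_std: np.ndarray) -> np.ndarray:
    # stains: (num_pixels, num_stain_channels) concentrations for one client image
    mean = stains.mean(axis=0)
    std = stains.std(axis=0) + 1e-8
    return (stains - mean) / std * target_std + target_mean

In an FL round, each client would apply such an alignment to its local images before local training, so that all clients effectively train on data drawn from a similar stain distribution.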

Complete end-to-end learning from protein feature representation to protein interactome inference

GigaScience, November 2025

Yu-Hsin Chen, Chien-Fu Liu, Jun-Yi Leu*, and Huai-Kuang Tsai*

Abstract

Co-fractionation coupled with mass spectrometry (CF-MS) is a powerful strategy for mapping protein-protein interactions (PPIs) under near-physiological conditions. Despite recent progress, existing analysis pipelines remain constrained by reliance on handcrafted features, sensitivity to experimental noise, and an inherent focus on pairwise interactions, which limit their scalability and generalizability. To address these difficulties, we introduce FREEPII (Feature Representation Enhancement End-to-End Protein Interaction Inference), a unified deep learning framework that integrates CF-MS data with sequence-derived features to learn biologically meaningful protein-level representations for accurate and efficient inference of PPIs and protein complexes. FREEPII employs a convolutional neural network (CNN) architecture to learn protein-level representations directly from raw data, enabling feature sharing across interaction pairs and reducing computational complexity. To enhance robustness against CF-MS noise, protein sequences are introduced as auxiliary input to enrich the feature space with complementary biological cues. The supervised protein embeddings further encode network-level context derived from complex annotations, allowing the model to capture higher-order interactions and enhance the expressive power of protein representations. Extensive benchmarking demonstrates that FREEPII consistently outperforms state-of-the-art CF-MS analysis tools, capturing more biologically coherent and discriminative protein features. Cross-dataset evaluations further reveal that integrating multi-modal data from diverse experimental contexts substantially improves the generalization and sensitivity of data-driven models, offering a scalable, cross-species strategy for reliable protein interaction inference.
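The feature-sharing idea can be sketched as follows (a hypothetical, heavily simplified PyTorch example, not the FREEPII code): each protein's CF-MS elution profile is embedded once by a small 1-D CNN, and every candidate pair is scored from the two shared embeddings, so per-protein computation is reused across all pairs.

# Illustrative sketch only: shared protein-level embeddings reused for
# pairwise interaction scoring. Layer sizes and names are placeholders.
import torch
import torch.nn as nn

class ProteinEncoder(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(16, emb_dim)

    def forward(self, profile):                      # profile: (batch, num_fractions)
        h = self.conv(profile.unsqueeze(1)).squeeze(-1)   # (batch, 16)
        return self.proj(h)                          # (batch, emb_dim)

class PairScorer(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.encoder = ProteinEncoder(emb_dim)       # embed each protein once
        self.head = nn.Sequential(nn.Linear(2 * emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, profile_a, profile_b):
        za, zb = self.encoder(profile_a), self.encoder(profile_b)
        return self.head(torch.cat([za, zb], dim=-1)).squeeze(-1)  # interaction logit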

GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm

IEEE Transactions on Information Forensics and Security, November 2025

Hanrui Wang, Ching-Chun Chang, Chun-Shien Lu, Christopher Leckie, and Isao Echizen

Abstract

Deep neural networks are highly vulnerable to adversarial examples, which are inputs with small, carefully crafted perturbations that cause misclassification, making adversarial attacks a critical tool for evaluating robustness. Existing black-box methods typically entail a trade-off between precision and flexibility: pixel-sparse attacks (e.g., single- or few-pixel attacks) provide fine-grained control but lack adaptability, whereas patch- or frequency-based attacks improve efficiency or transferability, but at the cost of producing larger and less precise perturbations. We present GreedyPixel, a fine-grained black-box attack method that performs brute-force-style, per-pixel greedy optimization guided by a surrogate-derived priority map and refined by means of query feedback. It evaluates each coordinate directly without any gradient information, guaranteeing monotonic loss reduction and convergence to a coordinate-wise optimum, while also yielding near white-box-level precision, pixel-wise sparsity, and perceptual quality. On the CIFAR-10 and ImageNet datasets, spanning convolutional neural networks (CNNs) and Transformer models, GreedyPixel achieved state-of-the-art success rates with visually imperceptible perturbations, effectively bridging the gap between black-box practicality and white-box performance. The implementation is available at https://github.com/azrealwang/greedypixel.
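A minimal sketch of the greedy per-pixel loop, assuming a priority map from a surrogate model and a query-based attack loss to minimize (the actual implementation is in the linked repository; function and argument names here are hypothetical).

# Hypothetical sketch of greedy per-pixel optimization: pixels are visited in
# priority order and a candidate perturbation is kept only if the queried
# attack loss decreases, so the kept loss is monotonically non-increasing.
import numpy as np

def greedy_pixel_attack(image, loss_fn, priority, epsilon=8/255, steps_per_pixel=(+1, -1)):
    # image: (H, W, C) in [0, 1]; loss_fn queries the black-box model and
    # returns an attack objective to minimize; priority: (H, W) surrogate map.
    adv = image.copy()
    best_loss = loss_fn(adv)
    order = np.dstack(np.unravel_index(np.argsort(-priority, axis=None), priority.shape))[0]
    for y, x in order:
        for sign in steps_per_pixel:
            candidate = adv.copy()
            candidate[y, x] = np.clip(image[y, x] + sign * epsilon, 0.0, 1.0)
            loss = loss_fn(candidate)
            if loss < best_loss:          # keep only improving moves
                adv, best_loss = candidate, loss
    return adv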

AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos

IEEE Transactions on Human-Machine Systems, December 2025

Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao, and Hsin-Min Wang

Abstract

Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. The damage to either modality (i.e., visual or audio) can only be discovered through multimodal models that can exploit both pieces of information simultaneously. However, previous methods mainly adopt unimodal video forensics and use supervised pretraining for forgery detection. This study proposes a new method based on a multimodal self-supervised-learning (SSL) feature extractor to exploit inconsistency between audio and visual modalities for multimodal video forgery detection. We use the transformer-based SSL pretrained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor and a multiscale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT only extracts visual features from the lip region, we also adopt another transformer-based video model to exploit facial features and capture spatial and temporal artifacts caused during the deepfake generation process. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
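The temporal modeling step can be sketched roughly as follows (hypothetical PyTorch code; AV-HuBERT feature extraction is assumed to happen upstream and the layer sizes are placeholders): parallel dilated 1-D convolutions over the fused audio-visual feature sequence capture correlations at multiple temporal scales before a real/fake decision.

# Hypothetical multiscale temporal convolution over fused audio-visual
# feature sequences; not the paper's network configuration.
import torch
import torch.nn as nn

class MultiScaleTemporalBlock(nn.Module):
    def __init__(self, dim, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations
        ])
        self.classifier = nn.Linear(dim, 2)            # real vs. fake

    def forward(self, feats):                          # feats: (batch, time, dim)
        x = feats.transpose(1, 2)                      # (batch, dim, time)
        x = torch.stack([b(x) for b in self.branches]).sum(0).relu()
        return self.classifier(x.mean(dim=-1))         # pool over time, then classify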

Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization

The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2025

Jian-Ting Guo, Yu-Cheng Chen, Ping-Chun Hsieh, Kuo-Hao Ho, Po-Wei Huang, Ti-Rong Wu, I-Chen Wu

Abstract

Human-like agents have long been one of the goals in pursuing artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been focused on designing human-like RL agents. As a result, many reward-driven RL agents often exhibit unnatural behaviors compared to humans, raising concerns for both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as trajectory optimization, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts the classic receding-horizon control to human-like learning as a tractable and efficient implementation. To achieve this, we introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via Vector-Quantized VAE. Experiments on D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores, and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at https://rlg.iis.sinica.edu.tw/papers/MAQ.
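The quantization step at the heart of macro actions can be sketched as a standard VQ-VAE nearest-neighbor lookup with a straight-through gradient (a generic sketch with hypothetical shapes, not the MAQ training pipeline, which is described in the paper and linked code).

# Generic VQ-VAE quantization step: map an encoded action window to its
# nearest macro-action code and pass gradients straight through.
import torch

def quantize_macro_action(z, codebook):
    # z: (batch, dim) encoder outputs for a window of actions
    # codebook: (num_codes, dim) learned macro-action embeddings
    dists = torch.cdist(z, codebook)            # (batch, num_codes)
    idx = dists.argmin(dim=-1)                  # nearest macro-action code
    quantized = codebook[idx]                   # (batch, dim)
    # straight-through estimator so gradients flow to the encoder
    return z + (quantized - z).detach(), idx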

Uncertainty-Guided Exploration for Efficient AlphaZero Training

Annual Conference on Neural Information Processing Systems (NeurIPS), December 2025

Scott Cheng, Meng-Yu Tsai, Ding-Yong Hong, Mahmut Kandemir

Abstract

AlphaZero has achieved remarkable success in complex decision-making problems through self-play and neural network training. However, its self-play process remains inefficient due to limited exploration of high-uncertainty positions, the overlooked runner-up decisions in Monte Carlo Tree Search (MCTS), and high variance in value labels. To address these challenges, we propose and evaluate uncertainty-guided exploration by branching from high-uncertainty positions using our proposed Label Change Rate (LCR) metric, which is further refined by a Bayesian inference framework. Our proposed approach leverages runner-up MCTS decisions to create multiple variations, and ensembles value labels across these variations to reduce variance. We investigate three key design parameters for our branching strategy: where to branch, how many variations to branch, and which move to play in the new branch. Our empirical findings indicate that branching with 10 variations per game provides the best performance-exploration balance. Overall, our end-to-end results show an improved sample efficiency over the baseline by 58.5% on 9x9 Go in the early stage of training and by 47.3% on 19x19 Go in the late stage of training.
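As a rough, hypothetical stand-in (the precise Label Change Rate definition and its Bayesian refinement are given in the paper), the sketch below scores positions by how often their value labels flip across variations and selects the highest-scoring ones as branch points.

# Illustrative stand-in: rank positions by how unstable their value labels
# are across variations, then branch from the most uncertain ones.
import numpy as np

def label_change_rate(value_labels):
    # value_labels: sequence of +/-1 outcomes assigned to the same position
    # across game variations; here LCR = fraction of consecutive flips.
    v = np.asarray(value_labels)
    return float(np.mean(v[1:] != v[:-1])) if len(v) > 1 else 0.0

def pick_branch_points(positions, labels_per_position, num_branches=10):
    scores = [label_change_rate(lbls) for lbls in labels_per_position]
    order = np.argsort(scores)[::-1][:num_branches]     # highest uncertainty first
    return [positions[i] for i in order]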

A Grouping Algorithm for Training Tree-Shaped Models on Multiple GPUs with High Efficiency

IEEE International Conference on Computers, Software, and Applications (COMPSAC), July 2025

Cai-Feng Lin, Ding-Yong Hong, Tzu-Hsien Tsai, Pangfeng Liu, Jan-Jan Wu

Abstract

Graph Neural Network (GNN) is an important tool in deep learning to handle structured data, where graphs with nodes and edges represent entities and their relationships. Various challenges arise when a GNN is tree-shaped, with irregular connectivity patterns and varying depth. It is difficult to distribute and process the dynamic structure for parallel execution on multiple GPUs. In addition, tree data dependency demands the processing of parent nodes before their children, severely limiting execution parallelism. This research aims to improve the training speed of tree-shaped GNNs on multi-GPU systems. First, we introduce a cost model that estimates the running time of training across multiple GPUs. Then, we demonstrate that finding an optimal way to distribute tree-structured data across GPUs is an NP-complete problem under this cost model. We then propose a practical heuristic method for distributing data that improves efficiency while maintaining training quality. The heuristic method first assigns data to batches based on our cost model and then assigns the data in each batch to the devices. We also show that our device assignment algorithm is a 4-approximation algorithm; that is, it guarantees a cost of at most four times the optimal running time in each training batch, ensuring that it performs effectively in practice. We implement the algorithm and conduct experiments. The results show that our algorithm achieves a significant improvement in training speed, with speedups of up to 1.86 for two GPUs, 3.43 for four GPUs, and 7.25 for eight GPUs.
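The per-batch device assignment can be illustrated with a classic greedy least-loaded (longest-processing-time-first) scheduler under the cost model (an illustration of the general scheme only; the paper's own assignment algorithm, the one proved to be a 4-approximation, may differ).

# Illustrative greedy assignment: sort trees by estimated cost and place each
# on the currently least-loaded GPU.
import heapq

def assign_to_gpus(costs, num_gpus):
    # costs: estimated running time of each tree under the cost model
    loads = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(loads)
    assignment = {}
    for item, cost in sorted(enumerate(costs), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(loads)        # least-loaded GPU so far
        assignment[item] = gpu
        heapq.heappush(loads, (load + cost, gpu))
    return assignment                            # tree index -> GPU id

print(assign_to_gpus([5.0, 3.0, 2.0, 2.0, 1.0], num_gpus=2))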

Bridging Local and Global Knowledge via Transformer in Board Games

The Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), August 2025

Yan-Ru Ju, Tai-Lin Wu, Chung-Chin Shih, Ti-Rong Wu

Abstract

Although AlphaZero has achieved superhuman performance in board games, recent studies reveal its limitations in handling scenarios requiring a comprehensive understanding of the entire board, such as recognizing long-sequence patterns in Go. To address this challenge, we propose ResTNet, a network that interleaves residual and Transformer blocks to bridge local and global knowledge. ResTNet improves playing strength across multiple board games, increasing the win rate from 54.6% to 60.8% in 9x9 Go, from 53.6% to 60.9% in 19x19 Go, and from 50.4% to 58.0% in 19x19 Hex. In addition, ResTNet effectively processes global information and tackles two long-sequence patterns in 19x19 Go, namely the circular pattern and the ladder pattern. It reduces the mean square error of circular-pattern recognition from 2.58 to 1.07 and lowers the attack probability against an adversary program from 70.44% to 23.91%. ResTNet also improves ladder-pattern recognition accuracy from 59.15% to 80.01%. By visualizing attention maps, we demonstrate that ResTNet captures critical game concepts in both Go and Hex, offering insights into AlphaZero's decision-making process. Overall, ResTNet provides a promising approach to integrating local and global knowledge, paving the way for more effective AlphaZero-based algorithms in board games. Our code is available at https://rlg.iis.sinica.edu.tw/papers/restnet.
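A minimal sketch of the interleaving idea (hypothetical layer sizes, not the ResTNet configuration): residual convolutional blocks capture local board patterns, while interleaved Transformer encoder layers treat the flattened board positions as tokens to propagate global information.

# Illustrative interleaving of residual (convolutional) blocks and Transformer
# encoder layers over board features; sizes are placeholders.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return torch.relu(x + self.conv2(torch.relu(self.conv1(x))))

class InterleavedTrunk(nn.Module):
    def __init__(self, ch=64, pairs=3):
        super().__init__()
        self.blocks = nn.ModuleList()
        for _ in range(pairs):
            self.blocks.append(ResidualBlock(ch))                            # local patterns
            self.blocks.append(nn.TransformerEncoderLayer(d_model=ch, nhead=4,
                                                          batch_first=True)) # global view

    def forward(self, x):                            # x: (batch, ch, board, board)
        for blk in self.blocks:
            if isinstance(blk, ResidualBlock):
                x = blk(x)
            else:
                b, c, h, w = x.shape
                t = x.flatten(2).transpose(1, 2)     # (batch, h*w, ch) board tokens
                t = blk(t)
                x = t.transpose(1, 2).reshape(b, c, h, w)
        return x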

NeuroAMP: A Novel End-to-end General Purpose Deep Neural Amplifier for Personalized Hearing Aids

IEEE Transactions on Artificial Intelligence, August 2025

Shafique Ahmed, Ryandhimas E. Zezario, Hui-Guan Yuan, Amir Hussain, Hsin-Min Wang, Wei-Ho Chung, and Yu Tsao

Abstract

The prevalence of hearing aids is increasing. However, optimizing their amplification remains challenging due to the complexity of integrating multiple components in traditional methods. To address this, we present NeuroAMP, a novel deep neural network for end-to-end, personalized amplification in hearing aids. NeuroAMP leverages spectral features and the listener's audiogram as inputs, and we explore four architectures: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Convolutional Recurrent Neural Network (CRNN), and Transformer. We also introduce Denoising NeuroAMP, an extension that integrates noise reduction with amplification for improved real-world performance. To enhance generalization, we employed a comprehensive data augmentation strategy during training on diverse speech (TIMIT, TMHINT) and music (Cadenza Challenge MUSIC) datasets. Evaluation using the Hearing Aid Speech Perception Index (HASPI), Hearing Aid Speech Quality Index (HASQI), and Hearing Aid Audio Quality Index (HAAQI) shows that the Transformer-based NeuroAMP achieves the best performance, with SRCC scores of 0.9927 (HASQI) and 0.9905 (HASPI) on TIMIT, and 0.9738 (HAAQI) on the Cadenza dataset. Notably, the augmentation strategy maintains robust performance on unseen datasets (e.g., VoiceBank-DEMAND, MUSDB18-HQ). Furthermore, Denoising NeuroAMP outperforms both the conventional NAL-R+WDRC method and a two-stage baseline on the VoiceBank-DEMAND dataset, achieving a HASPI score of 0.90 and a HASQI score of 0.59. These results highlight the strong potential of NeuroAMP and Denoising NeuroAMP to provide a novel and effective framework for personalized hearing aid amplification.
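The input/output contract can be sketched as follows (a hypothetical toy model; the actual NeuroAMP variants are the CNN, LSTM, CRNN, and Transformer architectures evaluated in the paper): spectral frames are concatenated with the listener's audiogram, and the network predicts frequency-dependent gains applied to each frame.

# Hypothetical toy amplifier: spectral frames plus an audiogram in,
# per-frame, per-frequency gains out. Not the NeuroAMP architecture.
import torch
import torch.nn as nn

class TinyAmplifier(nn.Module):
    def __init__(self, n_freq=257, n_audiogram=8, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_freq + n_audiogram, hidden, batch_first=True)
        self.gain = nn.Linear(hidden, n_freq)

    def forward(self, spec, audiogram):
        # spec: (batch, time, n_freq) magnitude spectrogram
        # audiogram: (batch, n_audiogram) hearing thresholds, repeated per frame
        aud = audiogram.unsqueeze(1).expand(-1, spec.size(1), -1)
        h, _ = self.rnn(torch.cat([spec, aud], dim=-1))
        return spec * torch.sigmoid(self.gain(h)) * 2.0   # frequency-dependent gain in (0, 2)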