
Institute of Information Science, Academia Sinica

Research


Recent Research Results


One-Class Face Anti-spoofing via Spoof Cue Map-Guided Feature Learning

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024

Pei-Kai Huang, Cheng-Hsuan Chiang, Tzu-Hsien Chen, Jun-Xiong Chong, Tyng-Luh Liu and Chiou-ting Hsu


Abstract

Many face anti-spoofing (FAS) methods have focused on learning discriminative features from both live and spoof training data to strengthen the security of face recognition systems. However, since not every possible attack type is available in the training stage, these FAS methods usually fail to detect unseen attacks at inference. In comparison, one-class FAS, where the training data comprise only live faces, aims to detect whether a test face image belongs to the live class or not. In this paper, we propose a novel One-Class Spoof Cue Map estimation Network (OC-SCMNet) to address the one-class FAS detection problem. Our first goal is to learn to extract latent spoof features from live images so that their estimated Spoof Cue Maps (SCMs) have zero responses. To avoid collapsing to a trivial solution, we devise a novel SCM-guided feature learning scheme that combines many SCMs as pseudo ground-truths to guide a conditional generator to create latent spoof features for spoof data. Our second goal is to approximately simulate potential out-of-distribution spoof attacks. To this end, we propose using a memory bank to dynamically preserve a set of sufficiently "independent" latent spoof features, encouraging the generator to probe the latent spoof feature space. Extensive experiments conducted on eight FAS benchmark datasets demonstrate that the proposed OC-SCMNet not only outperforms previous one-class FAS approaches but also achieves performance comparable to state-of-the-art two-class FAS methods.
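The one-class objective on live data can be pictured with a toy sketch: the SCM estimated for a live face is regressed toward an all-zero map, so a nonzero response signals spoof evidence. The function name, map size, and squared-error form below are assumptions of this illustration, not the paper's exact loss.

```python
import numpy as np

def live_scm_loss(scm):
    """Regression loss pushing the estimated Spoof Cue Map (SCM) of a
    live image toward an all-zero map: live faces carry no spoof cues."""
    return float(np.mean(np.asarray(scm, dtype=float) ** 2))

# Hypothetical 8x8 SCMs: a live face should yield (near-)zero responses,
# while a spoof sample produces visible cue activations.
scm_live = np.zeros((8, 8))
scm_spoof = np.full((8, 8), 0.5)

print(live_scm_loss(scm_live), live_scm_loss(scm_spoof))
```

At test time, the mean SCM magnitude can serve directly as a spoofness score to threshold.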

Bridging Actions: Generate 3D Poses and Shapes In-between Photos

IEEE Transactions on Pattern Analysis and Machine Intelligence, April 2024

Wen-Li Wei and Jen-Chun Lin


Abstract

Generating realistic 3D human motion has been a fundamental goal of the game/animation industry. This work presents a novel transition generation technique that can bridge the actions of people in the foreground by generating 3D poses and shapes in-between photos, allowing 3D animators/novice users to easily create/edit 3D motions. To achieve this, we propose an adaptive motion network (ADAM-Net) that effectively learns human motion from masked action sequences to generate kinematically compliant 3D poses and shapes in-between given temporally-sparse photos. Three core learning designs underpin ADAM-Net. First, we introduce a random masking process that randomly masks images from an action sequence and fills masked regions in latent space by interpolation of unmasked images to simulate various transitions under given temporally-sparse photos. Second, we propose a long-range adaptive motion (L-ADAM) attention module that leverages visual cues observed from human motion to adaptively recalibrate the range that needs attention in a sequence, along with a multi-head cross-attention. Third, we develop a short-range adaptive motion (S-ADAM) attention module that weightedly selects and integrates adjacent feature representations at different levels to strengthen temporal correlation. By coupling these designs, ADAM-Net excels not only in generating 3D poses and shapes in-between photos, but also in classic 3D human pose and shape estimation.
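The random-masking setup can be sketched as follows: masked time steps are filled by interpolating between the nearest unmasked latent frames. This is a minimal NumPy illustration; the latent dimensionality and the exact interpolation used by ADAM-Net are assumptions of this sketch.

```python
import numpy as np

def fill_masked(latents, mask):
    """Replace masked time steps with linear interpolation between the
    nearest unmasked latent frames (illustrative stand-in for the
    latent-space fill described in the abstract)."""
    latents = np.asarray(latents, dtype=float)
    t = np.arange(len(latents))
    known = np.flatnonzero(mask)          # indices of unmasked frames
    filled = latents.copy()
    for d in range(latents.shape[1]):     # interpolate each latent dimension
        filled[:, d] = np.interp(t, known, latents[known, d])
    return filled

# Frames 0 and 4 play the role of the given "photos"; frames 1-3 are the
# masked transition to be simulated.
seq = np.array([[0.0], [9.0], [9.0], [9.0], [4.0]])
mask = np.array([True, False, False, False, True])
print(fill_masked(seq, mask))
```

The masked frames become an even ramp between the two anchor photos, giving the network plausible transition targets to learn from.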

Effective Compression of Language Models by Combining Pruning and Knowledge Distillation

IEEE International Conference on Computers, Software, and Applications (COMPSAC), July 2024

Chi-Yu Chiu, Ding-Yong Hong, Pangfeng Liu and Jan-Jan Wu


Abstract

Weight pruning is a prominent model compression technique that removes some weights from a model. However, pruning Transformer models faces a challenge: after pruning, Transformer models require repeating the whole training process, including pre-training on a large general dataset and fine-tuning on a small downstream dataset, to recover the accuracy. The whole training process takes a long time and considerable computation resources. To address this challenge, we propose a pruning method combined with knowledge distillation that avoids a long re-training time while recovering the accuracy. We use 2:4 pruning as our basic pruning method. 2:4 pruning, proposed by NVIDIA, keeps the two largest absolute values in every four consecutive elements in every row of a weight matrix. We generalize 2:4 pruning to N:M pruning, which keeps the N largest absolute values in every M consecutive elements in every row of a weight matrix. Knowledge distillation is another model compression method that makes a small model (the student) learn from a large model (the teacher). With our method, we use N:M pruning to uniformly prune the model into an N:M structure. Next, we apply two-stage fine-tuning on the downstream dataset with knowledge distillation. Using our method, the pruned models can achieve comparable accuracy with only downstream datasets and much less time than traditional retraining. We run our experiments on the SQuAD and GLUE datasets using DistilBERT. The experimental results show that DistilBERT in a 1:4 structure can achieve comparable accuracy on the SQuAD v1.1 and SQuAD v2.0 datasets and a 1.7x inference speedup compared to the original dense model.
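The N:M selection rule described above fits in a few lines. This is an illustrative NumPy sketch of the masking step only; real deployments rely on NVIDIA's sparse tensor core support rather than dense arrays with zeros.

```python
import numpy as np

def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude values in every m consecutive elements
    of each row of a weight matrix; zero out the rest (N:M sparsity)."""
    w = np.asarray(weights, dtype=float)
    rows, cols = w.shape
    assert cols % m == 0, "row length must be divisible by m"
    out = np.zeros_like(w)
    for r in range(rows):
        for c in range(0, cols, m):
            group = w[r, c:c + m]
            keep = np.argsort(np.abs(group))[-n:]  # n largest magnitudes
            out[r, c + keep] = group[keep]
    return out

w = np.array([[0.1, -0.9, 0.3, 0.05, 0.7, -0.2, 0.0, 0.4]])
print(nm_prune(w, n=2, m=4))
```

With n=2, m=4 (i.e., 2:4 pruning), each group of four weights retains exactly its two largest-magnitude entries, including negative ones.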

Contrastive Learning for DeepFake Classification and Localization via Multi-Label Ranking

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024

Cheng-Yao Hong, Yen-Chi Hsu and Tyng-Luh Liu


Abstract

We propose a unified approach to simultaneously addressing the conventional setting of binary deepfake classification and a more challenging scenario of uncovering what facial components have been forged as well as the exact order of the manipulations. To solve the former task, we consider multiple instance learning (MIL), which takes each image as a bag and its patches as instances. A positive bag corresponds to a forged image that includes at least one manipulated patch (i.e., a pixel in the feature map). The formulation allows us to estimate the probability of an input image being a fake one and to establish the corresponding contrastive MIL loss. On the other hand, tackling the component-wise deepfake problem can be reduced to solving multi-label prediction, but the requirement to recover the manipulation order further complicates the learning task into a multi-label ranking problem. We resolve this difficulty by designing a tailor-made loss term to enforce that the rank order of the predicted multi-label probabilities respects the ground-truth order of the sequential modifications of a deepfake image. Through extensive experiments and comparisons with other relevant techniques, we provide detailed results and ablation studies to demonstrate that the proposed method is an overall more comprehensive solution to deepfake detection.
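The MIL view can be made concrete with a toy aggregation: an image (bag) is fake if at least one patch (instance) is manipulated. The noisy-OR pooling below is one standard way to express that "at least one" semantics; it is an illustration, not necessarily the paper's exact pooling or loss.

```python
import numpy as np

def bag_fake_probability(patch_probs):
    """Noisy-OR aggregation over per-patch manipulation probabilities:
    the bag is fake unless every instance is untouched."""
    p = np.asarray(patch_probs, dtype=float)
    return float(1.0 - np.prod(1.0 - p))

# A pristine image: every patch is almost certainly untouched.
real_patches = [0.01, 0.02, 0.01]
# A forged image: one patch carries strong manipulation evidence.
fake_patches = [0.01, 0.95, 0.02]

print(bag_fake_probability(real_patches), bag_fake_probability(fake_patches))
```

A single confident patch is enough to push the bag-level fake probability high, which matches the positive-bag definition in the abstract.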

A Study on Incorporating Whisper for Robust Speech Assessment

IEEE International Conference on Multimedia and Expo (ICME), July 2024

Ryandhimas E. Zezario, Yu-Wen Chen, Szu-Wei Fu, Yu Tsao, Hsin-Min Wang, and Chiou-Shann Fuh


Abstract

This research introduces an enhanced version of the multi-objective speech assessment model, MOSA-Net+, by leveraging acoustic features from Whisper, a large-scale weakly supervised model. We first investigate the effectiveness of Whisper in deploying a more robust speech assessment model. After that, we explore combining representations from Whisper and SSL models. The experimental results reveal that Whisper's embedding features contribute to the more accurate predictions achieved by MOSA-Net+, while combining the embedding features from Whisper and SSL models leads to only marginal further improvement. Compared to intrusive methods, MOSA-Net, and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics on the Taiwan Mandarin Hearing In Noise Test - Quality & Intelligibility (TMHINT-QI) dataset. To further validate its robustness, MOSA-Net+ was tested in the noisy-and-enhanced track of the VoiceMOS Challenge 2023, where it obtained the top-ranked performance among nine systems.

Generating Attractive and Authentic Copywriting from Customer Reviews

NAACL 2024, June 2024

Yu-Xiang Lin, Wei-Yun Ma


Abstract

The goal of product copywriting is to capture the interest of users and enhance their experience by emphasizing the features of products through text descriptions. As e-commerce platforms offer a wide range of services, it is becoming essential to dynamically adjust the styles of these auto-generated descriptions. Traditional approaches to copywriting generation often rely solely on specified product attributes, which may result in dull and repetitive content. To tackle this issue, we propose to generate copywriting based on customer reviews, as they provide firsthand practical experiences with the product, offering a richer source of information than product attributes alone. We have developed a sequence-to-sequence framework, enhanced with reinforcement learning, to produce copywriting that is attractive, authentic, and rich in information. Our framework outperforms all existing baselines and zero-shot large language models, including LLaMA-2-chat-7B and GPT-3.5 (gpt-3.5-turbo-0613 in this work), in terms of both attractiveness and faithfulness. Furthermore, this work features the use of LLMs for aspect-based summary collection and argument allure assessment. Experiments demonstrate the effectiveness of using LLMs for marketing domain corpus construction. The dataset will be made available in the future.

Plug-in Language Model: Controlling Text Generation with a Simple Regression Model

NAACL 2024, June 2024

Nai-Chi Yang, Wei-Yun Ma, Pu-Jen Cheng


Abstract

Large-scale pre-trained language models have displayed unrivaled capacity in generating text that closely resembles human-written text. Nevertheless, generating texts adhering to specific conditions without fine-tuning or adding new parameters can be challenging.
Contemporary approaches commonly rely on either prompts or auxiliary models to avoid modifying the language models. These auxiliary models are designed to assess whether a generated token contributes to meeting the desired requirements. Such approaches adjust the distribution of the next token during the inference phase by leveraging the prediction score of the desired attribute to calculate gradients. However, these auxiliary models typically require access to the language model's latent states, a prerequisite that complicates integrating existing black-box attribute models or tools.
We present the Plug-in Language Model (PiLM) as a solution to these limitations. PiLM leverages reinforcement learning to utilize black-box tools directly, adjusting the latent state to control text generation. However, performing backpropagation during the inference phase is time-consuming for PiLM. By replacing backpropagation with a simple regression model, PiLM can achieve an inference time comparable to that of the original LLM.
Experimental results show that our approach outperforms existing state-of-the-art methods that rely on gradient-based, weighted-decoding, or prompt-based methodologies.

Automatic Construction of a Chinese Review Dataset for Aspect Sentiment Triplet Extraction via Iterative Weak Supervision

LREC-COLING 2024, May 2024

Chia-Wen Lu, Ching-Wen Yang, Wei-Yun Ma


Abstract

Aspect Sentiment Triplet Extraction (ASTE), introduced in 2020, is a task that involves the extraction of three key elements: target aspects, descriptive opinion spans, and their corresponding sentiment polarity. This process, however, faces a significant hurdle when applied to Chinese, due to the lack of sufficient datasets for model training, largely attributable to the arduous manual labeling process. To address this issue, we present an innovative framework that facilitates the automatic construction of ASTE datasets via Iterative Weak Supervision, negating the need for manual labeling, aided by a discriminator to weed out subpar samples. The objective is to successively improve the quality of this raw data and generate supplementary data. The effectiveness of our approach is underscored by our results, which include the creation of a substantial Chinese review dataset. This dataset encompasses over 60,000 Google restaurant reviews in Chinese and features more than 200,000 extracted triplets. Moreover, we have also established a robust baseline model by leveraging a novel method of weak supervision. Both our dataset and model are openly accessible to the public.
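To make the task format concrete, an ASTE system maps a review sentence to a set of (aspect, opinion, sentiment) triplets. The review text and triplets below are invented for illustration, not items drawn from the released dataset.

```python
# One review sentence can yield several (aspect, opinion, sentiment)
# triplets, one per opinionated aspect mention.
review = "The beef noodles were delicious but the service was slow."

triplets = [
    ("beef noodles", "delicious", "positive"),
    ("service", "slow", "negative"),
]

# Sanity checks on the triplet format: spans come from the sentence,
# and polarity is drawn from a small closed label set.
for aspect, opinion, sentiment in triplets:
    assert aspect in review and opinion in review
    assert sentiment in {"positive", "negative", "neutral"}

print(triplets)
```

The weakly supervised pipeline in the abstract produces exactly this structure at scale, with the discriminator filtering out low-quality candidate triplets.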

Decoding the genome of bloodsucking midge Forcipomyia taiwana (diptera: Ceratopogonidae): Insights into odorant receptor expansion

Insect Biochemistry and Molecular Biology, May 2024

Ming-Der Lin, Chia-Hsien Chuang, Chih-Hsin Kao, Shu-Hwa Chen, Szu-Chieh Wang, Ping-Heng Hsieh, Guan-Yu Chen, Chun-Chia Mao, Jeng-Yi Li, Mei-Yeh Jade Lu, Chung-Yen Lin


Abstract

Biting midges, notably those within the Ceratopogonidae family, have long been recognized for their epidemiological significance, both as nuisances and vectors for disease transmission in vertebrates. Despite their impact, genomic insights into these insects, particularly beyond the Culicoides genus, remain limited. In this study, we assembled the Forcipomyia taiwana (Shiraki) genome, comprising 113 scaffolds covering 130.4 Mbps—with the longest scaffold reaching 7.6 Mbps and an N50 value of 2.6 Mbps—marking a pivotal advancement in understanding the genetic architecture of ceratopogonid biting midges. Phylogenomic analyses reveal a shared ancestry between F. taiwana and Culicoides sonorensis Wirth & Jones, dating back approximately 124 million years, and highlight a dynamic history of gene family expansions and contractions within the Ceratopogonidae family. Notably, a substantial expansion of the odorant receptor (OR) gene family was observed, which is crucial for the chemosensory capabilities that govern biting midges' interactions with their environment, including host seeking and oviposition behaviors. The distribution of OR genes across the F. taiwana genome displays notable clusters on scaffolds, indicating localized tandem gene duplication events. Additionally, several collinear regions were identified, hinting at segmental duplications, inversions, and translocations, contributing to the olfactory system's evolutionary complexity. Among the 156 ORs identified in F. taiwana, 134 are biting midge-specific ORs, distributed across three distinct clades, each exhibiting unique motif features that distinguish them from the others. Through weighted gene co-expression network analysis, we correlated distinct gene modules with sex and reproductive status, laying the groundwork for future investigations into the interplay between gene expression and adaptive behaviors in F. taiwana. 
In conclusion, our study not only highlights the unique olfactory repertoire of ceratopogonid biting midges but also sets the stage for future studies into the genetic underpinnings of their unique biological traits and ecological strategies.

DTC: A Drift-Tolerant Coding to Improve the Performance and Energy Efficiency of Multi-Level-Cell Phase-Change Memory

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, October 2023

Yi-Shen Chen, Yuan-Hao Chang, and Tei-Wei Kuo


Abstract

Recently, phase-change memory (PCM) has emerged as a promising memory and storage technology. By storing multiple bits in a PCM cell, multi-level-cell (MLC) PCM further reduces the per-bit cost to improve its competitiveness. However, MLC PCM suffers from high write latency and energy consumption caused by its complex write operations. Different from existing works that attempt to improve the write latency and energy efficiency of the physical program & verify strategy for MLC PCM, we propose DTC, a drift-tolerant coding scheme, to apply fast write operations on MLC PCM without sacrificing data accuracy. By exploiting the resistance drift and asymmetric write characteristics of PCM cells, the proposed DTC can significantly reduce the write latency and energy consumption of MLC PCM. Meanwhile, we propose a segmentation strategy to further improve the write performance with our coding scheme, and an elimination methodology to avoid issuing unnecessary update operations. A series of analyses and experiments was conducted to evaluate the capability of the proposed scheme. The proposed scheme reduces energy consumption by 16.8%–32.1% and write latency by 20.1%–32.6% on representative benchmarks, compared with existing well-known schemes.

TRAIN: A Reinforcement Learning Based Timing-Aware Neural Inference on Intermittent Systems

ACM/IEEE International Conference on Computer-Aided Design (ICCAD), October 2023

Shu-Ting Cheng, Wen Sheng Lim, Chia-Heng Tu and Yuan-Hao Chang


Abstract

Intermittent systems have become popular solutions in various application domains, thanks to the maturation of energy harvesting technology. Environmental monitoring is one such domain, and it is time-sensitive: to report the perceived environmental status in a timely manner, methods have been proposed that consider the freshness of the collected information on such systems with unstable power sources. Nevertheless, these methods cannot be applied to neural network workloads, since they do not consider the delivered model accuracy. On the other hand, while there have been studies on deploying neural network applications on intermittent systems, they depend on branchy network architectures, with each branch representing an energy-accuracy tradeoff, and do not take time constraints into account, which tends to cause system failures because of the frequent generation of expired data. In this work, we propose TRAIN, the first timing-aware framework for deploying neural network models on intermittent systems that jointly considers energy, time constraints, and delivered model accuracy. Compared with prior studies that depend on branchy network architectures, TRAIN offers a broadened solution space representing various energy/time/accuracy tradeoffs. This is achieved by choosing among different implementations of each model layer during model inference at runtime, with the choices made by the proposed reinforcement learning algorithm. Our results demonstrate that TRAIN outperforms the prior study by 65% in delivered model accuracy. We believe that TRAIN paves the way for building complex applications on intermittent systems.

Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model

IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP2024), April 2024

Ryandhimas Zezario, Bo-Ren Brian Bai, Chiou-Shann Fuh, Hsin-Min Wang, and Yu Tsao


Abstract

This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net. MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning. The 3QUEST metrics, namely Speech-MOS (S-MOS), Noise-MOS (N-MOS), and General-MOS (G-MOS), are the assessment targets. The pretrained MOSA-Net model is utilized to estimate three pseudo labels: perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI). Multi-task learning is then employed to train MTQ-Net by combining a supervised loss (derived from the difference between the estimated score and the ground-truth label) and a semi-supervised loss (derived from the difference between the estimated score and the pseudo label), where the Huber loss is employed as the loss function. Experimental results first demonstrate the advantages of MPL compared to training a model from scratch and using a direct knowledge transfer mechanism. Second, the benefit of the Huber loss for improving the predictive ability of MTQ-Net is verified. Finally, MTQ-Net with the MPL approach exhibits higher overall predictive power than other SSL-based speech assessment models.
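The combined training objective can be sketched per prediction: a supervised Huber term against the ground-truth score plus a semi-supervised Huber term against the pseudo label. The weighting factor `alpha` and the delta value are assumptions of this sketch, not values taken from the paper.

```python
def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones,
    which tempers the influence of outlier labels."""
    err = abs(pred - target)
    return 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)

def mpl_loss(pred, ground_truth, pseudo_label, alpha=0.5):
    """Multi-task pseudo-label objective (sketch): supervised term vs. the
    ground-truth label plus a weighted semi-supervised term vs. the
    pseudo label, both under the Huber loss."""
    return huber(pred, ground_truth) + alpha * huber(pred, pseudo_label)

# A prediction close to both targets incurs far less loss than one far away.
print(mpl_loss(3.9, 4.0, 3.8), mpl_loss(2.0, 4.0, 3.8))
```

Because the pseudo-label term is only a regularizer, down-weighting it via `alpha` is one natural design choice when the pretrained model's estimates are noisier than the human labels.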

Is Explanation the Cure? Misinformation Mitigation in the Short-term and Long-term

in Proceedings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), December 2023

Yi-Li Hsu, Shih-Chieh Dai, Aiping Xiong, Lun-Wei Ku


Abstract

With advancements in natural language processing (NLP) models, automatic explanation generation has been proposed to mitigate misinformation on social media platforms, in addition to adding warning labels to identified fake news. While many researchers have focused on generating good explanations, how these explanations can really help humans combat fake news is under-explored. In this study, we compare the effectiveness of a warning label and the state-of-the-art counterfactual explanations generated by GPT-4 in debunking misinformation. In a two-wave, online human-subject study, participants (N = 215) were randomly assigned to a control group in which false contents are shown without any intervention, a warning tag group in which the false claims were labeled, or an explanation group in which the false contents were accompanied by GPT-4 generated explanations. Our results show that both interventions significantly decrease participants' self-reported belief in fake claims in an equivalent manner in both the short term and the long term. We discuss the implications of our findings and directions for future NLP-based misinformation debunking strategies.

LLM-in-the-loop: Leveraging Large Language Model for Thematic Analysis

in Proceedings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), December 2023

Shih-Chieh Dai, Aiping Xiong, Lun-Wei Ku


Abstract

Thematic analysis (TA) has been widely used for analyzing qualitative data in many disciplines and fields. To ensure reliable analysis, the same piece of data is typically assigned to at least two human coders. Moreover, to produce meaningful and useful analysis, human coders develop and deepen their data interpretation and coding over multiple iterations, making TA labor-intensive and time-consuming. Recently, the emerging field of large language model (LLM) research has shown that LLMs have the potential to replicate human-like behavior in various tasks: in particular, LLMs outperform crowd workers on text-annotation tasks, suggesting an opportunity to leverage LLMs for TA. We propose a human-LLM collaboration framework (i.e., LLM-in-the-loop) to conduct TA with in-context learning (ICL). This framework provides the prompt to frame discussions with an LLM (e.g., GPT-3.5) to generate the final codebook for TA. We demonstrate the utility of this framework using survey datasets on aspects of the music listening experience and the usage of a password manager. Results of the two case studies show that the proposed framework yields coding quality similar to that of human coders but reduces TA's labor and time demands.

Location-Aware Visual Question Generation with Lightweight Models

in Proceedings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), December 2023

Nicholas Collin Suwono, Justin Chen, Tun Min Hung, Ting-Hao Kenneth Huang, I-Bin Liao, Yung-Hui Li, Lun-Wei Ku, Shao-Hua Sun


Abstract

This work introduces a novel task, location-aware visual question generation (LocaVQG), which aims to generate engaging questions from data relevant to a particular geographical location. Specifically, we represent such location-aware information with surrounding images and a GPS coordinate. To tackle this task, we present a dataset generation pipeline that leverages GPT-4 to produce diverse and sophisticated questions. Then, we aim to learn a lightweight model that can address the LocaVQG task and fit on an edge device, such as a mobile phone. To this end, we propose a method that can reliably generate engaging questions from location-aware information. Our proposed method outperforms baselines in human evaluation (e.g., engagement, grounding, coherence) and on automatic evaluation metrics (e.g., BERTScore, ROUGE-2). Moreover, we conduct extensive ablation studies to justify our proposed techniques for both generating the dataset and solving the task.

A formal treatment of bidirectional typing

33rd European Symposium on Programming (ESOP 2024), April 2024

Liang-Ting Chen and Hsiang-Shang Ko


Abstract

There has been much progress in designing bidirectional type systems and associated type synthesis algorithms, but mainly on a case-by-case basis. To remedy the situation, this paper develops a general and formal theory of bidirectional typing, and, as a by-product of our formalism, provides a verified generator of proof-relevant type synthesisers for simply typed languages: for every signature that specifies a mode-correct bidirectionally typed language, there exists a proof-relevant type synthesiser that for an input abstract syntax tree constructs a typing derivation if any, gives its refutation if not, or reports that the input does not have enough type annotations. Soundness, completeness, and mode-correctness are studied universally for all signatures, which are sufficient conditions for deriving a type synthesiser. We propose a preprocessing step called mode decoration, which helps the user to deal with missing type annotations in a given abstract syntax tree. The entire development is formalised in Agda and can be further integrated with other language-formalisation frameworks.
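The mode discipline the abstract formalises can be illustrated with a tiny bidirectional checker for the simply typed lambda calculus. The paper's development is in Agda; this Python sketch, with its own invented AST encoding (function types as pairs, terms as tagged tuples), only shows the check/synthesise split and the "not enough annotations" failure mode.

```python
def synth(ctx, term):
    """Synthesis mode: compute a type from the term, or fail."""
    tag = term[0]
    if tag == "var":
        return ctx[term[1]]
    if tag == "app":                      # (app f a): synthesise f, check a
        arg_ty, res_ty = synth(ctx, term[1])
        check(ctx, term[2], arg_ty)
        return res_ty
    if tag == "ann":                      # (ann e T): an annotation lets us
        check(ctx, term[1], term[2])      # switch from synthesis to checking
        return term[2]
    raise TypeError(f"cannot synthesise a type for {tag!r}: annotation needed")

def check(ctx, term, ty):
    """Checking mode: verify the term against an expected type."""
    if term[0] == "lam":                  # lambdas are checked, never synthesised
        arg_ty, res_ty = ty
        check({**ctx, term[1]: arg_ty}, term[2], res_ty)
        return
    if synth(ctx, term) != ty:            # mode switch: synthesise, then compare
        raise TypeError("type mismatch")

# The identity function, annotated as bool -> bool, synthesises its type;
# without the annotation, synthesis fails with a request for annotations.
ident = ("ann", ("lam", "x", ("var", "x")), ("bool", "bool"))
print(synth({}, ident))
```

The failure branch in `synth` mirrors the paper's third outcome for a type synthesiser: reporting that the input does not carry enough type annotations.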