Institute of Information Science, Academia Sinica

Research Overview



Recent Research Results


Scaled-YOLOv4: Scaling Cross Stage Partial Network

IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2021

Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao


We show that the YOLOv4 object detection neural network, based on the CSP approach, scales both up and down and is applicable to small and large networks while maintaining optimal speed and accuracy. We propose a network scaling approach that modifies not only the depth, width, and resolution but also the structure of the network. The YOLOv4-large model achieves state-of-the-art results: 55.4% AP (73.3% AP50) on the MS COCO dataset at a speed of 15 FPS on a Tesla V100, while with test-time augmentation it achieves 55.8% AP (73.2% AP50). To the best of our knowledge, this is currently the highest accuracy on the COCO dataset among any published work. The YOLOv4-tiny model achieves 22.0% AP (42.0% AP50) at a speed of 443 FPS on an RTX 2080Ti, and with TensorRT, batch size 4, and FP16 precision, YOLOv4-tiny achieves 1774 FPS.
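The scaling idea can be pictured as applying depth and width multipliers to a base network specification. The following is only an illustrative sketch of compound scaling; the stage sizes and multipliers are made-up values, not the paper's actual configuration.

```python
def scale_model(base_depths, base_widths, depth_mult, width_mult):
    """Scale a network spec by depth/width multipliers (illustrative only).

    base_depths: block repeats per stage; base_widths: channels per stage.
    """
    depths = [max(1, round(d * depth_mult)) for d in base_depths]
    # Round widths to the nearest multiple of 8, a common hardware-friendly choice.
    widths = [max(8, int(round(w * width_mult / 8)) * 8) for w in base_widths]
    return depths, widths

# Hypothetical base spec scaled down for a "tiny" variant.
depths, widths = scale_model([1, 2, 8, 8, 4], [64, 128, 256, 512, 1024],
                             depth_mult=0.33, width_mult=0.5)
```

In a real scaling study, resolution and structural choices (e.g., which stages use CSP blocks) would be varied jointly with these two multipliers.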

Perceptual Indistinguishability-Net (PI-Net): Facial Image Obfuscation with Manipulable Semantics

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Jia-Wei Chen, Li-Ju Chen, Chia-Mu Yu, and Chun-Shien Lu


With the growing use of camera devices, industry holds many image datasets that offer opportunities for collaboration between the machine learning community and industry. However, the sensitive information in these datasets discourages data owners from releasing them. Despite recent research devoted to removing sensitive information from images, existing methods provide neither a meaningful privacy-utility trade-off nor provable privacy guarantees. In this study, taking perceptual similarity into consideration, we propose perceptual indistinguishability (PI) as a formal privacy notion particularly for images. We also propose PI-Net, a privacy-preserving mechanism that achieves image obfuscation with a PI guarantee. Our study shows that PI-Net achieves a significantly better privacy-utility trade-off through public image data.

Adaptive Image Transformer for One-Shot Object Detection

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021

Ding-Jie Chen, He-Yen Hsieh and Tyng-Luh Liu


One-shot object detection tackles the challenging task of identifying within a target image all object instances of the same class implied by a query image patch. The main difficulty lies in the fact that the class label of the query patch and its respective examples are not available in the training data. Our main idea leverages the concept of language translation to boost metric-learning-based detection methods. Specifically, we emulate the language translation process to adaptively translate the feature of each object proposal to better correlate with the given query feature for discriminating the class similarity among proposal-query pairs. To this end, we propose the Adaptive Image Transformer (AIT) module, which deploys an attention-based encoder-decoder architecture to simultaneously explore intra-coder and inter-coder (i.e., each proposal-query pair) attention. The adaptive nature of our design turns out to be flexible and effective in addressing the one-shot learning scenario. With the informative attention cues, the proposed model excels in predicting the class similarity between the target image proposals and the query image patch. Though conceptually simple, our model significantly outperforms a state-of-the-art technique, improving the unseen-class object classification from 63.8 mAP and 22.0 AP50 to 72.2 mAP and 24.3 AP50 on the PASCAL VOC and MS COCO benchmark datasets, respectively.
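The translation-style re-encoding of proposal features can be pictured as plain scaled dot-product cross-attention from proposal tokens to query tokens. This NumPy sketch is illustrative only and omits the full encoder-decoder machinery of the actual AIT module; the function name and shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def translate_proposal(proposal, query):
    """Re-express ("translate") each proposal token as a mixture of query tokens.

    proposal: (P, d) features of one object proposal
    query:    (Q, d) features of the query image patch
    """
    d = proposal.shape[-1]
    attn = softmax(proposal @ query.T / np.sqrt(d), axis=-1)  # (P, Q), rows sum to 1
    return attn @ query  # (P, d): convex combinations of query features
```

The translated proposal features can then be compared with the query features by any metric-learning head to score class similarity.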

Referring Image Segmentation via Language-Driven Attention

International Conference on Robotics and Automation (ICRA), May 2021

Ding-Jie Chen, He-Yen Hsieh and Tyng-Luh Liu


This paper aims to tackle the problem of referring image segmentation, which targets reasoning about the region of interest referred to by a query natural language sentence. One key issue in referring image segmentation is how to establish a cross-modal representation encoding the two modalities, namely, the query sentence and the input image. Most existing methods are designed to concatenate the features from each modality or to gradually encode the cross-modal representation with respect to each word's effect. In contrast, our approach leverages the correlation between the two modalities to construct the cross-modal representation. To make the resulting cross-modal representation more discriminative for the segmentation task, we propose a novel mechanism of language-driven attention to encode the cross-modal representation, reflecting the attention between every single visual element and the entire query sentence. The proposed mechanism, named Language-Driven Attention (LDA), first decouples the cross-modal correlation into channel attention and spatial attention, and then integrates the two attentions to obtain the cross-modal representation. The channel attention and the spatial attention respectively reveal how sensitive each channel or each pixel of a particular feature map is with respect to the query sentence. With a proper fusion of the two kinds of feature attention, the proposed LDA model can effectively guide the generation of the final cross-modal representation. The resulting representation is further strengthened to capture multi-receptive-field and multi-level semantic information for the intended segmentation. We assess our referring image segmentation model on four public benchmark datasets, and the experimental results show that our model achieves state-of-the-art performance.
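A toy NumPy sketch of the decoupled channel/spatial attention idea follows. The gating functions here are simplified assumptions for illustration, not the paper's exact formulation; only the decomposition into a per-channel gate and a per-pixel gate, both driven by the sentence embedding, reflects the abstract.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def language_driven_attention(feat, sent):
    """Gate a visual feature map by channel and spatial attention from a sentence.

    feat: (C, H, W) visual feature map; sent: (C,) query-sentence embedding.
    Both gating formulas below are illustrative placeholders.
    """
    C, H, W = feat.shape
    # Channel attention: sensitivity of each channel to the sentence.
    chan = sigmoid(feat.reshape(C, -1).mean(axis=1) * sent)          # (C,)
    # Spatial attention: sensitivity of each pixel to the sentence.
    spat = sigmoid(np.einsum('chw,c->hw', feat, sent) / np.sqrt(C))  # (H, W)
    return feat * chan[:, None, None] * spat[None, :, :]
```

The gated map plays the role of the cross-modal representation that the segmentation head would consume.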

ATACgraph: profiling genome wide chromatin accessibility from ATAC-seq

Frontiers in Genetics, January 2021

Rita Jui-Hsein Lu, Yen-Ting Liu, Chih Wei Huang, Ming-Ren Yen, Chung-Yen Lin and Pao-Yang Chen


Assay for transposase-accessible chromatin using sequencing (ATAC-seq) is an efficient and precise method for revealing chromatin accessibility across the genome. Most current ATAC-seq tools follow chromatin immunoprecipitation sequencing (ChIP-seq) strategies that do not consider ATAC-seq-specific properties. To incorporate ATAC-seq-specific quality control and the underlying biology of chromatin accessibility, we developed a bioinformatics software package named ATACgraph for analyzing and visualizing ATAC-seq data. ATACgraph profiles accessible chromatin regions and provides ATAC-seq-specific information, including definitions of nucleosome-free regions (NFRs) and nucleosome-occupied regions. ATACgraph also allows identification of differentially accessible regions between two ATAC-seq datasets. ATACgraph incorporates a Docker image with the Galaxy platform to provide an intuitive user experience via a graphical interface. Without tedious installation on a local machine or cloud, users can analyze data through activated websites using pre-designed workflows or customized pipelines composed of ATACgraph modules. Overall, ATACgraph is an effective tool designed for ATAC-seq that allows biologists with minimal bioinformatics knowledge to analyze chromatin accessibility. ATACgraph can be run on any ATAC-seq data with no limit to specific genomes. As validation, we demonstrated ATACgraph on the human genome to showcase its functions for ATAC-seq interpretation. This software is publicly accessible and can be downloaded at

Adaptive and Generative Zero-Shot Learning

Ninth International Conference on Learning Representations (ICLR), May 2021

Yu-Ying Chou, Hsuan-Tien Lin and Tyng-Luh Liu


We address the problem of generalized zero-shot learning (GZSL) where the task is to predict the class label of a target image whether its label belongs to the seen or unseen category. Similar to ZSL, the learning setting assumes that all class-level semantic features are given, while only the images of seen classes are available for training. By exploring the correlation between image features and the corresponding semantic features, the main idea of the proposed approach is to enrich the semantic-to-visual (S2V) embeddings via a seamless fusion of adaptive and generative learning. To this end, we extend the semantic features of each class by supplementing image-adaptive attention so that the learned S2V embedding can account for not only inter-class but also intra-class variations. In addition, to break the limit of training with images only from seen classes, we design a generative scheme to simultaneously generate virtual class labels and their visual features by sampling and interpolating over seen counterparts. In inference, a testing image will give rise to two different S2V embeddings, seen and virtual. The former is used to decide whether the underlying label is of the unseen category or otherwise a specific seen class; the latter is to predict an unseen class label. To demonstrate the effectiveness of our method, we report state-of-the-art results on four standard GZSL datasets, including an ablation study of the proposed modules. 
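The virtual-class generation step described above can be pictured as convex interpolation over seen-class prototypes. The sketch below is a simplified assumption of that sampling idea (function name, interpolation range, and the use of class prototypes are all illustrative choices, not the paper's exact scheme).

```python
import numpy as np

def make_virtual_classes(seen_protos, n_virtual=5, seed=0):
    """Create virtual class prototypes by interpolating random pairs of
    seen-class prototypes. seen_protos: (n_seen, d) array."""
    rng = np.random.default_rng(seed)
    n = len(seen_protos)
    out = []
    for _ in range(n_virtual):
        i, j = rng.choice(n, size=2, replace=False)  # two distinct seen classes
        lam = rng.uniform(0.2, 0.8)                  # stay away from the endpoints
        out.append(lam * seen_protos[i] + (1.0 - lam) * seen_protos[j])
    return np.stack(out)
```

In the paper's setting, both virtual labels and virtual visual features would be generated this way, letting the model train against classes never seen as images.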

Efficient Video Captioning on Heterogeneous System Architectures

35th IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2021

Horng-Ruey Huang, Ding-Yong Hong, Jan-Jan Wu, Pangfeng Liu, Wei-Chung Hsu


Video captioning is the core technology driving the development of many important multidisciplinary applications, such as AI-assisted medical diagnosis, storytelling through videos, video question answering, and lip-reading, to name a few. Video captioning employs a hybrid CNN+RNN neural network model to translate video scenes into natural language descriptions. For deep learning inference, a typical approach is to run both the CNN and the RNN on a GPU. Such a GPU-only approach often suffers long inference time due to underutilization of the computing power offered by the CPU+GPU heterogeneous system architecture, which is common in modern computers. This work is an early effort to tackle the performance issue of deep learning inference with a hybrid CNN+RNN model on a heterogeneous system with a CPU and a GPU. This is challenging because (1) the CNN and the RNN exhibit very different computing behaviors, which raises the question of how to split the two models into computing tasks and properly assign the tasks to the CPU and the GPU to minimize the inference time for a video frame, and (2) data dependencies exist between the CNN and the RNN within a video frame, as well as between the adjacent RNNs across two video frames; these dependencies prohibit full parallelization of the hybrid model. To solve these two problems, we propose two optimizations: a fine-grained scheduling scheme for mapping computation to devices within a video frame, and a pipeline scheduling scheme to exploit maximum parallelism between the execution of video frames. To facilitate our optimizations, we also develop an accurate regression-based cost model to predict the computation time of CNN/RNN operations and the communication time for moving data between the CPU and the GPU. Experimental results show that our optimizations improve the performance of video captioning by up to 3.24× on the CPU+GPU system, compared with GPU-only execution.
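The dependency pattern described above (frame i's RNN waits on frame i's CNN and on frame i-1's RNN) yields a simple recurrence for the makespan of a two-stage pipeline. This is a hypothetical simplification with the CNN and RNN each pinned to one device; the paper's fine-grained scheduler splits work more flexibly.

```python
def pipeline_makespan(cnn_times, rnn_times):
    """Makespan of a two-stage pipeline over frames.

    Frame i's RNN can start only after frame i's CNN finishes
    AND frame i-1's RNN finishes (the cross-frame RNN dependency).
    """
    cnn_done = 0.0
    rnn_done = 0.0
    for c, r in zip(cnn_times, rnn_times):
        cnn_done += c                           # CNN stages run back-to-back on one device
        rnn_done = max(rnn_done, cnn_done) + r  # RNN waits on both dependencies
    return rnn_done
```

With three frames of 1-unit CNN work and 2-unit RNN work, the pipeline finishes in 7 units versus 9 for fully serial execution, which is the kind of overlap the pipeline scheduling scheme exploits.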

Not by Equations Alone: Reasoning with Extensible Effects

Journal of Functional Programming, January 2021

Oleg Kiselyov, Shin-Cheng Mu and Amr Sabry


The challenge of reasoning about programs with (multiple) effects such as mutation, jumps, or IO dates back to the inception of program semantics in the works of Strachey and Landin. Using monads to represent individual effects and the associated equational laws to reason about them proved exceptionally effective. Even then it is not always clear which laws are to be associated with a monad — for good reason, as we show for non-determinism. Combining expressions using different effects brings challenges not just for monads, which do not compose, but also for equational reasoning: the interaction of effects may invalidate their individual laws, as well as induce emergent properties that are not apparent in the semantics of the individual effects. Overall, the problems are judging the adequacy of a law; determining if or when a law continues to hold upon the addition of new effects; and obtaining and easily verifying emergent laws. We present a solution relying on the framework of (algebraic, extensible) effects, which has already proved itself for writing programs with multiple effects. Equipped with a fairly conventional denotational semantics, this framework turns out, as we demonstrate, to be useful also for reasoning about and optimizing programs with multiple interacting effects. Unlike the conventional approach, equational laws are not imposed on programs/effect handlers, but induced from them: our starting point hence is a program (model), whose denotational semantics, besides being used directly, suggests and justifies equational laws and clarifies side conditions. The main technical result is the introduction of the notion of equivalence modulo handlers ("modulo observation"), or a particular combination of handlers, together with a proof that it is a congruence. It is hence usable for reasoning in any context, not just evaluation contexts, provided particular conditions are met.
Concretely, we describe several realistic handlers for non-determinism and elucidate their laws (some of which hold in the presence of any other effect). We demonstrate appropriate equational laws of non-determinism in the presence of global state, which have been a challenge to state and prove before.

Multi-Q 2 software facilitates isobaric labeling quantitation analysis with improved accuracy and coverage

Scientific Reports, January 2021

Ching-Tai Chen, Jen-Hung Wang, Cheng-Wei Cheng, Wei-Che Hsu, Chu-Ling Ko, Wai-Kok Choong, and Ting-Yi Sung


Mass spectrometry-based proteomics using isobaric labeling for multiplex quantitation has become a popular approach for proteomic studies. We present Multi-Q 2, an isobaric-labeling quantitation tool that yields the largest quantitation coverage and improved quantitation accuracy compared to three state-of-the-art methods. Multi-Q 2 supports identification results from several popular proteomic data analysis platforms for quantitation, offering up to 12% improvement in quantitation coverage by accepting identification results from multiple search engines when compared with MaxQuant and PatternLab. It is equipped with various quantitation algorithms, including a ratio compression correction algorithm, yielding up to 336 algorithmic combinations. Systematic evaluation shows that different algorithmic combinations have different strengths and are suitable for different situations. We also demonstrate that the flexibility of Multi-Q 2 in customizing algorithmic combinations can lead to improved quantitation accuracy over existing tools. Moreover, the use of complementary algorithmic combinations can be an effective strategy to enhance sensitivity when searching for biomarkers from differentially expressed proteins in proteomic experiments. Multi-Q 2 provides interactive graphical interfaces to process quantitation and to display ratios at the protein, peptide, and spectrum levels. It also supports a heatmap module, enabling users to cluster proteins based on their abundance ratios and to visualize the clustering results. Multi-Q 2 executable files, sample data sets, and a user manual are freely available at

Parallel Asynchronous Stochastic Dual Coordinate Descent Algorithms for Efficiency and Convergence

29th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP 2021), March 2021

Yung-Chen Chen, Pangfeng Liu, Jan-Jan Wu


The parallel asynchronous stochastic dual coordinate descent algorithm (PASSCoDe) is an efficient method for training linear models on multi-core shared-memory systems. PASSCoDe enjoys a good speedup when the number of threads is less than 8 on sparse datasets, i.e., when the percentage of nonzero elements in the training data is relatively small. However, due to memory conflicts and the delayed parameter access problem in parallel execution, it often diverges or does not converge to the best accuracy that a serial dual coordinate descent algorithm achieves. In this paper, we propose two algorithms, Adaptive Hybrid and Lazy-Sync, to overcome these convergence issues in parallel execution. Experimental results indicate that both algorithms converge to the same high accuracy as a sequential program on all datasets we tested, except one extremely small dataset. In contrast, PASSCoDe sometimes converges to a less accurate value, or does not converge at all on some datasets. Our methods also outperform PASSCoDe-Fix, an improved version of PASSCoDe, in stable convergence, execution speed, and scalability.
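For reference, the serial dual coordinate descent baseline that PASSCoDe parallelizes can be sketched as follows for a linear L1-SVM. This is the textbook single-threaded formulation, not the paper's code; in the parallel setting, concurrent updates to the shared weight vector `w` are exactly what causes the convergence problems the paper addresses.

```python
import numpy as np

def dcd_svm(X, y, C=1.0, epochs=50, seed=0):
    """Serial dual coordinate descent for a linear L1-SVM (hinge loss).

    X: (n, d) data, y: (n,) labels in {+1, -1}. Maintains the primal
    weight vector w = sum_i alpha_i * y_i * x_i alongside the duals.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):       # one coordinate (example) at a time
            G = y[i] * (X[i] @ w) - 1.0    # gradient of the dual objective
            q = X[i] @ X[i]
            if q == 0.0:
                continue
            new = min(max(alpha[i] - G / q, 0.0), C)  # project onto [0, C]
            w += (new - alpha[i]) * y[i] * X[i]       # keep w consistent
            alpha[i] = new
    return w
```

Asynchronous variants let many threads run the inner loop at once; the stale reads of `w` that result are the "delayed parameter access" issue mentioned above.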

PASSLEAF: A Pool-bAsed Semi-Supervised LEArning Framework for Uncertain Knowledge Graph Embedding

The 35th AAAI Conference on Artificial Intelligence (AAAI 2021), February 2021

Zhu-Mu Chen, Mi-Yen Yeh, and Tei-Wei Kuo


In this paper, we study the problem of embedding uncertain knowledge graphs, where each relation between entities is associated with a confidence score. Observing that existing embedding methods may discard the uncertainty information, only incorporate a specific type of score function, or cause many false-negative samples in training, we propose the PASSLEAF framework to solve these issues. PASSLEAF consists of two parts: a model that can incorporate different types of scoring functions to predict the relation confidence scores, and a semi-supervised learning model that exploits both positive and negative samples associated with the estimated confidence scores. Furthermore, PASSLEAF leverages a sample pool as a relay for generated samples to further augment the semi-supervised learning. Experiment results show that our proposed framework can learn better embeddings, achieving higher accuracy in both confidence score prediction and tail entity prediction.

Accelerating Continuous Normalizing Flow with Trajectory Polynomial Regularization

The 35th AAAI Conference on Artificial Intelligence (AAAI 2021), February 2021

Han-Hsien Huang and Mi-Yen Yeh


In this paper, we propose an approach to effectively accelerate the computation of continuous normalizing flow (CNF), which has been proven to be a powerful tool for tasks such as variational inference and density estimation. The training time of CNF can be extremely high because the required number of function evaluations (NFE) for solving the corresponding ordinary differential equations (ODEs) is very large. We attribute the high NFE to large truncation errors in solving the ODEs. To address the problem, we propose to add a regularization that penalizes the difference between the trajectory of the ODE and its fitted polynomial regression. The trajectory of the ODE will then approximate a polynomial function, and thus the truncation error will be smaller. Furthermore, we provide two proofs and claim that the additional regularization does not harm training quality. Experimental results show that our proposed method can achieve a 42.3% to 71.3% reduction of NFE on the task of density estimation, and a 19.3% to 32.1% reduction of NFE on variational auto-encoders, while the testing losses are not affected at all.
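The regularizer can be pictured as the mean squared gap between sampled trajectory points and their best polynomial fit. The NumPy sketch below is a reading of that idea, not the paper's implementation (in training, this term would be differentiated through and added to the flow's loss).

```python
import numpy as np

def trajectory_poly_penalty(ts, zs, degree=3):
    """Penalty = mean squared residual of a per-dimension polynomial fit.

    ts: (T,) time points along the ODE solve; zs: (T, d) state samples.
    A trajectory that is exactly polynomial of the given degree incurs ~0 penalty.
    """
    penalty = 0.0
    for j in range(zs.shape[1]):
        coeffs = np.polyfit(ts, zs[:, j], degree)
        penalty += np.mean((np.polyval(coeffs, ts) - zs[:, j]) ** 2)
    return penalty / zs.shape[1]
```

Driving this penalty toward zero makes the trajectory nearly polynomial, so an adaptive ODE solver can take larger steps with small truncation error, which is where the NFE reduction comes from.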

Positions, Channels, and Layers: Fully Generalized Non-Local Network for Singer Identification

Thirty-Fifth AAAI Conference on Artificial Intelligence, February 2021

I-Yuan Kuo, Wen-Li Wei, and Jen-Chun Lin


Recently, a non-local (NL) operation has been designed as the central building block for deep-net models to capture long-range dependencies (Wang et al. 2018). Despite its excellent performance, it does not consider the interaction between positions across channels and layers, which is crucial in fine-grained classification tasks. To address this limitation, we target the singer identification (SID) task and present a fully generalized non-local (FGNL) module to help identify fine-grained vocals. Specifically, we first propose an FGNL operation, which extends the NL operation to explore the correlations between positions across channels and layers. Secondly, we further apply a depth-wise convolution with a Gaussian kernel in the FGNL operation to smooth feature maps for better generalization. Moreover, we incorporate the squeeze-and-excitation (SE) scheme into the FGNL module to adaptively emphasize correlated feature channels, helping uncover relevant feature responses and eventually the target singer. Evaluation results on the benchmark artist20 dataset show that the FGNL module significantly improves the accuracy of deep-net models in SID. Codes are available at

A Flexible Template Generation and Matching Method with Applications for Publication Reference Metadata Extraction

Journal of the Association for Information Science and Technology, To Appear

Ting-Hao Yang, Yu-Lun Hsieh, Shih-Hung Liu, Yung-Chun Chang, and Wen-Lian Hsu


Conventional rule-based approaches use exact template matching to capture linguistic information and necessarily need to enumerate all variations. We propose a novel flexible template generation and matching scheme called the principle-based approach (PBA) based on sequence alignment, and employ it for reference metadata extraction (RME) to demonstrate its effectiveness. The main contributions of this research are threefold. First, we propose an automatic template generation method that can capture prominent patterns using the dominating set algorithm. Second, we devise an alignment-based template-matching technique that uses a logistic regression model, which makes it more general and flexible than pure rule-based approaches. Last, we apply PBA to RME on extensive cross-domain corpora and demonstrate its robustness and generality. Experiments reveal that the same set of templates produced by the PBA framework not only delivers consistent performance on various unseen domains, but also surpasses hand-crafted knowledge (templates). We use four independent journal-style test sets and one conference-style test set in the experiments. When compared to renowned machine learning methods, such as conditional random fields (CRF), as well as recent deep learning methods (i.e., bi-directional long short-term memory with a CRF layer, Bi-LSTM-CRF), PBA has the best performance for all datasets.

LBERT: Lexically-aware Transformers based Bidirectional Encoder Representation model for learning Universal Bio-Entity Relations

Bioinformatics, To Appear

Neha Warikoo, Yung-Chun Chang, and Wen-Lian Hsu


Natural language processing techniques are constantly being advanced to accommodate the influx of data as well as to provide exhaustive and structured knowledge dissemination. Within the biomedical domain, relation detection between bio-entities, known as the biomedical relation extraction (BRE) task, has a critical function in knowledge structuring. Although recent advances in deep learning-based biomedical domain embedding have improved BRE predictive analytics, these works are often task-selective or employ external knowledge-based pre/post-processing. In addition, deep learning-based models do not account for local syntactic contexts, which have improved data representation in many kernel classifier-based models. In this study, we propose a universal BRE model, i.e., LBERT, a lexically aware Transformer-based bidirectional encoder representation model, which explores both local and global context representations for sentence-level classification tasks. This paper presents one of the most exhaustive BRE studies ever conducted, covering five different bio-entity relation types. Our model outperforms state-of-the-art deep learning models in protein-protein (PPI), drug-drug (DDI), and protein-bio-entity (REL) relation classification tasks by 0.02%, 11.2%, and 41.4%, respectively. LBERT representations show a statistically significant improvement over BioBERT in detecting true bio-entity relations for large corpora like PPI. Our ablation studies clearly indicate the contribution of the lexical features and distance-adjusted attention in improving prediction performance by learning additional local semantic context along with the bi-directionally learned global context.

Text-guided Graph Neural Networks for Referring 3D Instance Segmentation

Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), February 2021

Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen and Tyng-Luh Liu


This paper addresses a new task called referring 3D instance segmentation, which aims to segment out the target instance in a 3D scene given a query sentence. Previous work on scene understanding has explored visual grounding with natural language guidance, yet the emphasis is mostly constrained to images and videos. We propose a Text-guided Graph Neural Network for referring 3D instance segmentation on point clouds. Given a query sentence and the point cloud of a 3D scene, our method learns to extract per-point features and predicts an offset to shift each point toward its object center. Based on the point features and the offsets, we cluster the points to produce fused features and coordinates for the candidate objects. The resulting clusters are modeled as nodes in a Graph Neural Network (GNN) to learn representations that encompass the relation structure for each candidate object. The GNN layers leverage each object's features and its relations with neighbors to generate an attention heatmap for the input sentence expression. Finally, the attention heatmap is used to "guide" the aggregation of information from neighborhood nodes. Our method achieves state-of-the-art performance on the tasks of referring 3D instance segmentation and 3D localization on the ScanRefer, Nr3D, and Sr3D benchmarks.

Comparison of different variant sequence types coupled with decoy generation methods used in concatenated target-decoy database searches for proteogenomic research

Journal of Proteomics, January 2021

Wai-Kok Choong and Ting-Yi Sung


Concatenated target-decoy database searches are commonly used in proteogenomic research for variant peptide identification. Currently, protein-based and peptide-based sequence databases are used to store variant sequences for database searches. A protein-based database records the full-length wild-type protein sequence with the given variant events replacing the original amino acids, whereas a peptide-based database retains only the in silico digested peptides containing the variants. However, the performance of applying various decoy generation methods to the peptide-based variant sequence database, compared to the protein-based database, is still unclear. In this paper, we conduct a thorough comparison of target-decoy databases constructed from the above two types of databases coupled with various decoy generation methods for proteogenomic analyses. The results show that for the protein-based variant sequence database, the reverse and the pseudo-reverse methods achieve similar performance for variant peptide identification. Furthermore, for the peptide-based database, the pseudo-reverse method is more suitable than the widely used reverse method, as shown by identifying 6% more variant PSMs in a HEK293 cell line data set.
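The two decoy generation methods compared here differ only in whether the C-terminal residue stays in place. A minimal sketch (the peptide string is a hypothetical example):

```python
def reverse_decoy(peptide):
    """Decoy by plain reversal of the whole peptide sequence."""
    return peptide[::-1]

def pseudo_reverse_decoy(peptide):
    """Decoy that reverses all residues except the C-terminal one, so a tryptic
    peptide keeps its K/R terminus and stays target-like at the cleavage site."""
    return peptide[:-1][::-1] + peptide[-1]

# For a hypothetical tryptic peptide "PEPTIDEK":
#   reverse_decoy        gives "KEDITPEP" (terminus moves to the front)
#   pseudo_reverse_decoy gives "EDITPEPK" (K stays at the C-terminus)
```

Preserving the enzymatic terminus matters for peptide-based databases because each entry is itself a digested peptide; a plain reversal changes its digestion characteristics, which is consistent with the pseudo-reverse method performing better in that setting.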

Assessing the Helpfulness of Learning Materials with Inference-Based Learner-Like Agent

Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 2020

Yun-Hsuan Jen, Chieh-Yang Huang, MeiHua Chen, Ting-Hao Huang and Lun-Wei Ku


Many language learners have trouble using near-synonyms (e.g., small vs. little; briefly vs. shortly) correctly, and often look for example sentences to learn how two nearly synonymous terms differ. Prior work uses hand-crafted scores to recommend sentences but has difficulty adapting such scores to all near-synonyms, as near-synonyms differ in various ways. We observe that the helpfulness of a learning material is reflected in the learner's performance. Thus, we propose an inference-based learner-like agent to mimic learner behavior and identify good learning materials by examining the agent's performance. To enable the agent to behave like a learner, we leverage entailment modeling's capability of inferring answers from the provided materials. Experimental results show that the proposed agent is equipped with good learner-like behavior and achieves the best performance in both fill-in-the-blank (FITB) and good example sentence selection tasks. We further conduct a classroom user study with college ESL learners. The results of the user study show that the proposed agent can find example sentences that help students learn more easily and efficiently. Compared to other models, the proposed agent improves the scores of more than 17% of students after learning.

Reactive Supervision: A New Method for Collecting Sarcasm Data

Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 2020

Boaz Shmueli, Lun-Wei Ku and Soumya Ray


Sarcasm detection is an important task in affective computing, requiring large amounts of labeled data. We introduce reactive supervision, a novel data collection method that utilizes the dynamics of online conversations to overcome the limitations of existing data collection techniques. We use the new method to create and release a first-of-its-kind large dataset of tweets with sarcasm perspective labels and new contextual features. The dataset is expected to advance sarcasm detection research. Our method can be adapted to other affective computing domains, thus opening up new research opportunities.

Subspace-based Representation and Learning for Phonotactic Spoken Language Recognition

IEEE/ACM Transactions on Audio, Speech, and Language Processing, November 2020

Hung-Shin Lee, Yu Tsao, Shyh-Kang Jeng, and Hsin-Min Wang


Phonotactic constraints can be employed to distinguish languages by representing a speech utterance as a multinomial distribution over phone events. In the present study, we propose a new learning mechanism based on subspace-based representation, which can extract concealed phonotactic structures from utterances, for language verification and dialect/accent identification. The framework mainly involves two successive parts. The first part involves subspace construction. Specifically, it decodes each utterance into a sequence of vectors filled with phone posteriors and transforms the vector sequence into a linear orthogonal subspace based on low-rank matrix factorization or dynamic linear modeling. The second part involves subspace learning based on kernel machines, such as support vector machines and the newly developed subspace-based neural networks (SNNs). The input layer of SNNs is specifically designed for samples represented by subspaces. The topology ensures that the same output can be derived from identical subspaces by modifying the conventional feed-forward pass to fit the mathematical definition of subspace similarity. Evaluated on the "General LR" test of NIST LRE 2007, the proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions in equal error rates over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC methods and the lattice-based PPR-LM method, respectively. Furthermore, on the dialect/accent identification task of NIST LRE 2009, the SNN-based system performed better than the aforementioned four baseline methods.
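The subspace construction step can be pictured as a low-rank factorization of the phone-posterior sequence, with subspace similarity computed from principal angles. This NumPy sketch is a simplified illustration (SVD of the centered sequence, cosine-squared similarity), not the paper's exact factorization or the SNN input layer.

```python
import numpy as np

def utterance_subspace(posteriors, rank=2):
    """Map a (T, P) phone-posterior sequence to a (P, rank) orthonormal basis.

    The right singular vectors of the centered sequence span the dominant
    directions of phonotactic variation within the utterance.
    """
    centered = posteriors - posteriors.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:rank].T

def subspace_similarity(A, B):
    """Similarity of two subspaces: sum of squared cosines of principal angles.

    The singular values of A^T B are exactly those cosines.
    """
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    return float(np.sum(s ** 2))
```

A kernel built from such a similarity is basis-invariant: any two orthonormal bases of the same subspace score identically, which is the property the SNN topology is designed to preserve.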

Learning From Music to Visual Storytelling of Shots: A Deep Interactive Learning Mechanism

ACM Multimedia Conference, October 2020

Jen-Chun Lin, Wen-Li Wei, Yen-Yu Lin, Tyng-Luh Liu, and Hong-Yuan Mark Liao

Jen-Chun Lin Tyng-Luh Liu Hong-Yuan Mark Liao

Learning from music to visual storytelling of shots is an interesting and emerging task. It produces a coherent visual story in the form of a shot type sequence, which not only expands the storytelling potential for a song but also facilitates the automatic concert video mashup process and storyboard generation. In this study, we present a deep interactive learning (DIL) mechanism for building a compact yet accurate sequence-to-sequence model to accomplish the task. Different from the one-way transfer between a pre-trained teacher network (or ensemble network) and a student network in knowledge distillation (KD), the proposed method enables collaborative learning between an ensemble teacher network and a student network. Namely, the student network also teaches. Specifically, our method first learns a teacher network that is composed of several assistant networks to generate a shot type sequence and produce the soft target (shot types) distribution accordingly through KD. It then constructs the student network that learns from both the ground truth label (hard target) and the soft target distribution to alleviate the difficulty of optimization and improve generalization capability. As the student network gradually advances, it in turn feeds back knowledge to the assistant networks, thereby improving the teacher network in each iteration. Owing to such interactive designs, the DIL mechanism bridges the gap between the teacher and student networks and yields superior capability for both. Objective and subjective experimental results demonstrate that both the teacher and student networks can generate more accurate shot type sequences, improving overall performance.
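The student-side objective can be sketched with a standard KD formulation; the temperature and mixing weight below are hypothetical choices for illustration, not values taken from the paper:

```python
import numpy as np

def softmax(z, t=1.0):
    """Numerically stable softmax with temperature t."""
    e = np.exp(z / t - np.max(z / t))
    return e / e.sum()

def dil_student_loss(student_logits, teacher_logits, hard_label,
                     t=2.0, alpha=0.5):
    """Student objective: cross-entropy on the ground-truth hard
    target plus distillation toward the teacher's softened shot-type
    distribution (generic KD form)."""
    hard = -np.log(softmax(student_logits)[hard_label])
    p_teacher = softmax(teacher_logits, t)            # soft target
    log_p_student = np.log(softmax(student_logits, t))
    soft = -np.sum(p_teacher * log_p_student)         # soft cross-entropy
    return float(alpha * hard + (1.0 - alpha) * soft)
```

In the interactive design described above, the same distillation idea runs in the reverse direction as well, with the advancing student feeding knowledge back to the assistant networks.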

Self-similarity Student for Partial Label Histopathology Image Segmentation

16th European Conference on Computer Vision (ECCV), August 2020

Hsien-Tzu Cheng, Chun-Fu Yeh, Po-Chen Kuo, Andy Wei, Keng-Chi Liu, Mong-Chi Ko, Kuan-Hua Chao, Yu-Ching Peng, and Tyng-Luh Liu

Tyng-Luh Liu

Delineation of cancerous regions in gigapixel whole slide images (WSIs) is a crucial diagnostic procedure in digital pathology. This process is time-consuming because of the large search space in the gigapixel WSIs, increasing the chances of omission and misinterpretation of indistinct tumor lesions. To tackle this, the development of an automated cancerous region segmentation method is imperative. We frame this issue as a modeling problem with partial label WSIs, where some cancerous regions may be misclassified as benign and vice versa, producing patches with noisy labels. To learn from these patches, we propose Self-similarity Student, combining the teacher-student model paradigm with similarity learning. Specifically, for each patch, we first sample its similar and dissimilar patches according to spatial distance. A teacher-student model is then introduced, featuring the exponential moving average on both student model weights and teacher predictions ensemble. While our student model takes patches, the teacher model takes all their corresponding similar and dissimilar patches to learn robust representations against noisy label patches. Following this similarity learning, our similarity ensemble merges similar patches' ensembled predictions as the pseudo-label of a given patch to counteract its noisy label. On the CAMELYON16 dataset, our method substantially outperforms state-of-the-art noise-aware learning methods by 5% and the supervised-trained baseline by 10% at various degrees of noise. Moreover, our method is superior to the baseline on our TVGH TURP dataset with a 2% improvement, demonstrating its generalizability to more clinical histopathology segmentation tasks.
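The two teacher-student ingredients can be illustrated with a toy sketch; the decay value and the simple averaging used for the similarity ensemble are assumptions for illustration only:

```python
import numpy as np

def ema_update(teacher_w, student_w, decay=0.99):
    """Exponential moving average of student weights into the teacher
    (the teacher-student paradigm's slow-moving copy)."""
    return {k: decay * teacher_w[k] + (1.0 - decay) * student_w[k]
            for k in teacher_w}

def similarity_pseudo_label(patch_pred, similar_preds):
    """Merge a patch's prediction with its spatially similar patches'
    ensembled predictions to counteract a possibly noisy label."""
    return np.mean(np.vstack([patch_pred] + list(similar_preds)), axis=0)
```

The EMA keeps the teacher's predictions stable against label noise, while the merged pseudo-label lets consistent neighbouring patches outvote a single mislabeled one.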

An Adaptive Layer Expansion Algorithm for Efficient Training of Deep Neural Networks

IEEE International Conference on Big Data, December 2020

Leo Chen, Pangfeng Liu, and Jan-Jan Wu

Jan-Jan Wu

In this paper, we propose an adaptive layer expansion algorithm to reduce the training time of deep neural networks without noticeable loss of accuracy. Neural networks have become deeper and wider to improve accuracy. The size of such networks makes them time-consuming to train. Hence, we propose an adaptive layer expansion algorithm that reduces training time by dynamically adding nodes where necessary, improving training efficiency without losing accuracy. We start with a smaller model with only a fraction of the parameters of the original model, then train the network and add nodes to specific layers determined by the stability of gradients. The algorithm repeatedly adds nodes until a threshold is reached, and trains the model until the accuracy converges. The experimental results indicate that our algorithm uses only a quarter of the computation time of a full model, and achieves 64.1% accuracy on MobileNet with the CIFAR100 dataset, which is only 2% less than the 66.2% accuracy of a full model. The algorithm stops adding nodes when it has only half of the parameters of the original model. As a result, this new model is well suited for fast inference in environments where both computation power and memory storage are very limited, such as mobile devices.
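A toy illustration of the gradient-stability criterion; the variance-based score and the expand-the-least-stable-layer rule are one hypothetical reading of the idea, not the paper's exact formulation:

```python
import numpy as np

def gradient_stability(grad_history):
    """Score a layer by the variance of its recent gradient snapshots;
    lower variance means more stable (hypothetical criterion)."""
    return float(np.var(np.vstack(grad_history), axis=0).mean())

def pick_layer_to_expand(histories):
    """Add nodes to the layer whose gradients are least stable, i.e.
    the layer that still changes the most during training."""
    return max(histories, key=lambda name: gradient_stability(histories[name]))
```

The expansion loop would alternate between a few epochs of training, scoring each layer's gradient history this way, and widening the winner until the parameter threshold is reached.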

Automated Graph Generation at Sentence Level for Reading Comprehension Based on Conceptual Graphs

The 28th International Conference on Computational Linguistics (COLING), December 2020

Wan-Hsuan Lin and Chun-Shien Lu

Wan-Hsuan Lin Chun-Shien Lu

This paper proposes a novel miscellaneous-context-based method to convert a sentence into a knowledge embedding in the form of a directed graph. We adopt the idea of conceptual graphs to frame miscellaneous textual information into a compact conceptual form. We first empirically observe that this graph representation method can (1) accommodate the slot-filling challenges in typical question answering and (2) provide access to the sentence-level graph structure in order to explicitly capture the neighbouring connections of reference concept nodes. Secondly, we propose a task-agnostic semantics-measured module, which cooperates with the graph representation method, in order to (3) project an edge of a sentence-level graph to the space of semantic relevance with respect to the corresponding concept nodes. In experiments on QA-type relation extraction, the combination of the graph representation and the semantics-measured module achieves high accuracy in answer prediction and offers human-comprehensible graphical interpretation for every well-formed sample. To our knowledge, our approach is the first step towards an interpretable process of learning vocabulary representations with experimental evidence.
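One way a sentence-level conceptual graph might be represented; the triple-based input and adjacency-list layout are illustrative assumptions (the paper's actual graph construction from raw text is not shown):

```python
def sentence_to_graph(triples):
    """Build a directed conceptual graph from (concept, relation,
    concept) triples extracted from a single sentence."""
    graph = {}
    for head, rel, tail in triples:
        graph.setdefault(head, []).append((rel, tail))
        graph.setdefault(tail, [])  # ensure every concept is a node
    return graph

def neighbours(graph, node):
    """Neighbouring connections of a reference concept node."""
    return [tail for _, tail in graph.get(node, [])]
```

Explicit access to a node's neighbours is what lets a slot-filling question be answered by walking the labeled edges around the reference concept.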

Determinizing Crash Behavior with a Verified Snapshot-Consistent Flash Translation Layer

USENIX Symposium on Operating Systems Design and Implementation (OSDI), November 2020

Yun-Sheng Chang, Yao Hsiao, Tzu-Chi Lin, Che-Wei Tsao, Chun-Feng Wu, Yuan-Hao Chang, Hsiang-Shang Ko, and Yu-Fang Chen

Yun-Sheng Chang Che-Wei Tsao Chun-Feng Wu Yuan-Hao Chang Hsiang-Shang Ko Yu-Fang Chen

We introduce the design of a snapshot-consistent flash translation layer (SCFTL) for flash disks, which has a stronger guarantee about the possible behaviors after a crash than conventional designs. More specifically, the flush operation of SCFTL also has the functionality of making a “disk snapshot.” When a crash occurs, the flash disk is guaranteed to recover to the state right before the last flush. The major benefit of SCFTL is that it allows a more efficient design of upper layers in the storage stack. For example, the file system hosted by SCFTL does not require the use of a journal for crash recovery. Instead, it only needs to perform a flush operation of SCFTL at the end of each atomic transaction. We use a two-layer approach, combining a proof assistant, a symbolic executor, and an SMT solver, to formally verify the correctness of our prototype SCFTL implementation. We optimize the xv6 file system by utilizing SCFTL’s stronger crash guarantee. Evaluation results show that the optimized xv6 is 3 to 30 times faster than the original version.
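The crash guarantee can be modeled abstractly; this toy Python model (not the verified implementation) captures only the semantics that flush doubles as a disk snapshot and a crash rolls the visible state back to the last flush:

```python
class SnapshotDisk:
    """Toy model of SCFTL's crash semantics."""

    def __init__(self):
        self._snapshot = {}  # state persisted by the last flush
        self._pending = {}   # writes issued since the last flush

    def write(self, addr, data):
        self._pending[addr] = data

    def flush(self):
        # The flush operation also makes a "disk snapshot".
        self._snapshot.update(self._pending)
        self._pending.clear()

    def crash(self):
        # Everything after the last flush is lost; the disk recovers
        # to exactly the last snapshotted state.
        self._pending.clear()

    def read(self, addr):
        return self._pending.get(addr, self._snapshot.get(addr))
```

This is why a file system hosted on SCFTL needs no journal: ending each atomic transaction with a flush is enough to make the transaction all-or-nothing across crashes.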

Cross-batch Reference Learning for Deep Retrieval

IEEE Transactions on Neural Networks and Learning Systems, September 2020

Huei-Fang Yang, Kevin Lin, Ting-Yen Chen, and Chu-Song Chen

Chu-Song Chen

Learning effective representations that exhibit semantic content is crucial to image retrieval applications. Recent advances in deep learning have made significant improvements in performance on a number of visual recognition tasks. Studies have also revealed that visual features extracted from a deep network learned on a large-scale image data set (e.g., ImageNet) for classification are generic and perform well on new recognition tasks in different domains. Nevertheless, when applied to image retrieval, such deep representations do not attain performance as impressive as that achieved in classification. This is mainly because the deep features are optimized for classification rather than for the desired retrieval task. We introduce the cross-batch reference (CBR), a novel training mechanism that enables the optimization of deep networks with a retrieval criterion. With the CBR, the networks leverage both the samples in a single minibatch and the samples in other minibatches for weight updates, enhancing the stochastic gradient descent (SGD) training by enabling interbatch information passing. This interbatch communication is implemented as a cross-batch retrieval process in which the networks are trained to maximize the mean average precision (mAP), a popular performance measure in retrieval. Maximizing the cross-batch mAP is equivalent to centralizing the samples relevant to each other in the feature space and separating the samples irrelevant to each other. The learned features can discriminate between relevant and irrelevant samples and thus are suitable for retrieval. To circumvent the discrete, nondifferentiable mAP maximization, we derive an approximate, differentiable lower bound that can be easily optimized in deep networks. Furthermore, the mAP loss can be used alone or with a classification loss. Experiments on several data sets demonstrate that our CBR learning provides favorable performance, validating its effectiveness.
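The idea of replacing the discrete mAP with a differentiable surrogate can be sketched as follows; this sigmoid-relaxed soft ranking is a generic illustration of the approach, not the paper's actual lower bound:

```python
import numpy as np

def smooth_ap(scores, relevant, tau=0.1):
    """Generic differentiable surrogate for average precision: the
    hard rank indicator is relaxed with a sigmoid of temperature tau,
    so gradients can flow through the ranking."""
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(relevant, dtype=bool)
    sig = lambda x: 1.0 / (1.0 + np.exp(np.clip(-x / tau, -60.0, 60.0)))
    ap = 0.0
    for i in np.flatnonzero(relevant):
        diff = scores - scores[i]
        soft_rank = 1.0 + sig(diff).sum() - sig(0.0)       # among all items
        soft_rel_rank = 1.0 + sig(diff[relevant]).sum() - sig(0.0)
        ap += soft_rel_rank / soft_rank                    # soft precision@rank
    return float(ap / relevant.sum())
```

As tau shrinks, the sigmoid approaches the hard ranking indicator and the surrogate approaches the true AP; maximizing it pulls relevant samples above irrelevant ones in score, which mirrors the centralize/separate behavior described above.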

EpiMOLAS: An Intuitive Web-based Framework for Genome-wide DNA Methylation Analysis

BMC Genomics, April 2020

Sheng-Yao Su, I-Hsuan Lu, Wen-Chih Cheng, Wei-Chun Chung, Pao-Yang Chen, Jan-Ming Ho, Shu-Hwa Chen, Chung-Yen Lin

Jan-Ming Ho Chung-Yen Lin

DNA methylation is a crucial epigenomic mechanism in various biological processes. Using whole-genome bisulfite sequencing (WGBS) technology, methylated cytosine sites can be revealed at the single nucleotide level. However, the WGBS data analysis process is usually complicated and challenging.
To alleviate the associated difficulties, we integrated the WGBS data processing steps and downstream analysis into a two-phase approach. First, we set up the required tools in Galaxy and developed workflows to calculate the methylation level from raw WGBS data and generate a methylation status summary, the mtable. This computation environment is wrapped into the Docker container image DocMethyl, which allows users to rapidly deploy an executable environment without tedious software installation and library dependency problems. Next, the mtable files were uploaded to the web server EpiMOLAS_web to link with the gene annotation databases that enable rapid data retrieval and analyses.
To our knowledge, the EpiMOLAS framework, consisting of DocMethyl and EpiMOLAS_web, is the first approach to include containerization technology and a web-based system for WGBS data analysis from raw data processing to downstream analysis. EpiMOLAS will help users cope with their WGBS data and also conduct reproducible analyses of publicly available data, thereby gaining insights into the mechanisms underlying complex biological phenomena. The Galaxy Docker image DocMethyl and the web server EpiMOLAS_web are both publicly accessible online.
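The methylation level that such workflows compute from bisulfite read counts is, per cytosine site, the fraction of reads reporting methylation; the mtable row layout below is a hypothetical illustration of the summary format, not DocMethyl's exact schema:

```python
def methylation_level(meth_reads, total_reads):
    """Per-cytosine methylation level: methylated reads divided by
    total covered reads; None when the site has no coverage."""
    return meth_reads / total_reads if total_reads else None

def mtable(sites):
    """Hypothetical mtable rows: (chromosome, position, context, level)
    built from (chromosome, position, context, meth, total) counts."""
    return [(chrom, pos, ctx, methylation_level(m, t))
            for chrom, pos, ctx, m, t in sites]
```

A summary table in this spirit is what EpiMOLAS_web would link against gene annotation databases for downstream retrieval and analysis.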