Recent Research Results
"A Bicameralism Voting Framework for Combining Knowledge from Clients into Better Prediction," IEEE International Conference on Big Data, December 2019.
Authors: Yu-Tung Hsieh, Chuan-Yu Lee, Ching-Chi Lin, Pangfeng Liu, and Jan-Jan Wu

Abstract:
In this paper, we propose a bicameralism voting framework to improve the accuracy of a deep learning network. After we train a deep learning network with existing data, we may want to improve it with some newly collected data. However, it would be time-consuming to retrain the model with all the available data. Instead, we propose a collective framework that trains models on mobile devices with new data (also collected from the mobile devices) via transfer learning. Then we collect the predictions of these new models from the mobile devices, and achieve more accurate predictions by combining their predictions via voting. The proposed bicameralism voting is different from federated learning, since we do not average the weights of models from mobile devices, but let them vote by bicameralism. The proposed bicameralism voting mechanism has three advantages. First, this collective mechanism improves the accuracy of the deep learning model. The accuracy of bicameralism voting (VGG-19 on the Food-101 dataset) is 77.838%, higher than that of a single model (75.517%) with the same amount of training data. Second, bicameralism voting saves computation resources, because it only updates an existing model and can be done in parallel by multiple devices. For example, in our experiments, updating an existing model via transfer learning takes about 10 minutes on a server, but training a model from scratch with both the original and the new data takes more than a week. Finally, bicameralism voting is flexible. Unlike federated learning, bicameralism voting can use any model architecture, any preprocessing of input data, and any model format when the models are trained on different mobile devices.
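The idea of combining per-device predictions by voting, rather than averaging model weights, can be illustrated with a minimal Python sketch. This is plain majority voting and not the paper's bicameralism scheme, whose two-chamber rules are not described in the abstract; the function names `majority_vote` and `combine` and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most models.

    Ties go to the label that appears first in `predictions`
    (an arbitrary rule chosen for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def combine(per_model_predictions):
    """Combine predictions from several models, sample by sample.

    per_model_predictions[m][i] is the label model m assigns to sample i.
    Only predicted labels are needed, so the models may differ in
    architecture, preprocessing, and file format.
    """
    return [majority_vote(list(votes)) for votes in zip(*per_model_predictions)]


if __name__ == "__main__":
    device_predictions = [
        ["pizza", "sushi", "ramen"],   # model trained on device 1
        ["pizza", "ramen", "ramen"],   # model trained on device 2
        ["sushi", "sushi", "ramen"],   # model trained on device 3
    ]
    print(combine(device_predictions))
```

Because the combiner only sees labels, nothing in it depends on how each device's model was built, which is the flexibility the abstract contrasts with federated learning.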
"Handling local state with global state," Mathematics of Program Construction (MPC 2019), October 2019.
Authors: Koen Pauwels, Tom Schrijvers and Shin-Cheng Mu

Abstract:
Equational reasoning is one of the most important tools of functional programming. To facilitate its application to monadic programs, Gibbons and Hinze have proposed a simple axiomatic approach using laws that characterise the computational effects without exposing their implementation details.  At the same time Plotkin and Pretnar have proposed algebraic effects and handlers, a mechanism of layered abstractions by which effects can be implemented in terms of other effects.

This paper performs a case study that connects these two strands of research. We consider two ways in which the nondeterminism and state effects can interact: the high-level semantics where every nondeterministic branch has a local copy of the state, and the low-level semantics where a single sequentially threaded  state is global to all branches.

We give a monadic account of the folklore technique of handling local state in terms of global state, provide a novel axiomatic characterisation of global state and prove that the handler satisfies Gibbons and Hinze's local state axioms by means of a novel combination of free monads and contextual equivalence. We also provide a model for global state that is necessarily
Current Research Results
"CAR: The Clean Air Routing Algorithm for Path Navigation with Minimal PM2.5 Exposure on the Move," IEEE Access, To Appear.
Authors: Sachit Mahajan, Yu-Siou Tang, Dong-Yi Wu, Tzu-Chieh Tsai, and Ling-Jyh Chen

Abstract:
Transport-related pollution is becoming a major issue as it adversely affects human health, and one way to lower personal exposure to air pollutants is to choose a health-optimal route to the destination. Current navigation systems include options for the quickest paths (distance, traffic) and least expensive paths (fuel costs, tolls). In this paper, we propose the CAR (Cleanest Air Routing) algorithm and use it to build a health-optimal route recommendation system between the origin and the destination. We combine the open source PM2.5 (fine particulate matter with diameter less than 2.5 micrometers) concentration data for Taiwan with the road network graph obtained through OpenStreetMap. In addition, spatio-temporal interpolation of PM2.5 is performed to obtain the PM2.5 concentration at the road network intersections. Our algorithm introduces a weight function that assesses how much PM2.5 the user is exposed to at each intersection of the road network and uses it to navigate through intersections with the lowest PM2.5 exposures. The algorithm can help people reduce their overall PM2.5 exposure by offering a healthier alternative route, which may be slightly longer than the shortest path in some cases. We evaluate our algorithm for different travel modes, including driving, cycling, walking and riding motorbikes. An analysis of 1500 real-world travel scenarios shows that routes recommended by our approach tend to have a lower PM2.5 concentration than those recommended by Google Maps.
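A rough Python sketch of exposure-weighted routing follows. It is not the CAR weight function, which the abstract does not specify; it assumes a hypothetical edge cost equal to travel time multiplied by the mean PM2.5 concentration of the two intersections, and runs Dijkstra's algorithm on that cost. The names `lowest_exposure_route`, `graph`, and `pm25` are illustrative.

```python
import heapq


def lowest_exposure_route(graph, pm25, origin, destination):
    """Dijkstra's algorithm with an exposure-based edge cost.

    graph: {node: [(neighbor, travel_minutes), ...]}
    pm25:  {node: interpolated PM2.5 concentration at that intersection}
    Assumed cost of an edge: travel_minutes * mean PM2.5 of its endpoints.
    Returns (path, total_exposure), or (None, inf) if unreachable.
    """
    best = {origin: 0.0}
    previous = {}
    queue = [(0.0, origin)]
    while queue:
        exposure, node = heapq.heappop(queue)
        if node == destination:
            break
        if exposure > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, minutes in graph.get(node, []):
            edge_cost = minutes * (pm25[node] + pm25[neighbor]) / 2.0
            candidate = exposure + edge_cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if destination not in best:
        return None, float("inf")
    path = [destination]
    while path[-1] != origin:
        path.append(previous[path[-1]])
    path.reverse()
    return path, best[destination]


if __name__ == "__main__":
    # Toy network: the route through B is faster but more polluted.
    graph = {"A": [("B", 1), ("C", 2)], "B": [("D", 1)], "C": [("D", 2)]}
    pm25 = {"A": 10, "B": 50, "C": 10, "D": 10}
    print(lowest_exposure_route(graph, pm25, "A", "D"))
```

In the toy network the recommended route takes four minutes instead of two, mirroring the abstract's remark that the healthier route may be slightly longer than the shortest path.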
"Compacting, Picking and Growing for Unforgetting Continual Learning," Thirty-third Conference on Neural Information Processing Systems, NeurIPS 2019, December 2019.
Authors: Steven C. Y. Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen

Abstract:
"mwJFS: A Multi-Write-Mode Journaling File System for MLC NVRAM Storages," IEEE Transactions on Very Large Scale Integration Systems (TVLSI), September 2019.
Authors: Shuo-Han Chen, Yuan-Hao Chang, Yu-Ming Chang, and Wei-Kuan Shih

Abstract:
At present, nonvolatile random access memory (NVRAM) is widely considered as a promising candidate for the next-generation storage medium due to its appealing characteristics, including short read/write latency, byte addressability, and low idle energy consumption. In addition, to provide a higher bit density, multilevel-cell (MLC) NVRAM has also been proposed. Nevertheless, when compared with conventional single-level-cell (SLC) NVRAM, MLC NVRAM has longer write latency and higher energy consumption. Hence, the performance of MLC NVRAM-based storage systems could be degraded due to the lengthened write latency. The performance degradation is further magnified by existing journaling file systems (JFS) on MLC NVRAM-based storage devices due to the JFS's fail-safe policy of writing the same data twice. Such observations motivate us to propose multiwrite-mode JFSs (mwJFSs) to alleviate the drawbacks of MLC NVRAM and boost the performance of MLC NVRAM-based JFS. The proposed mwJFS differentiates the data retention requirement of journaled data and applies different write modes to enhance the access performance with lower energy consumption. A series of experiments was conducted to demonstrate the capability of mwJFS on MLC NVRAM-based storage systems.
Authors: Sheng-Yu Fu, Ding-Yong Hong, Yu-Ping Liu, Jan-Jan Wu, Wei-Chung Hsu

Abstract:
More and more modern processors support non-contiguous SIMD data accesses. However, translating such instructions has been overlooked in the Dynamic Binary Translation (DBT) area. For example, in the popular QEMU dynamic binary translator, guest memory instructions with strides are emulated by a sequence of scalar instructions, leaving significant room for performance improvement when SIMD instructions are available on the host machines. Structured loads/stores, such as VLDn/VSTn in ARM NEON, are one type of strided SIMD data access instructions. They are widely used in signal processing, multimedia, mathematical, and 2D matrix transposition applications. Efficient translation of such structured loads/stores is a critical issue when migrating ARM executables to other ISAs. However, it is quite challenging, since not only is the translation of structured loads/stores non-trivial, but the mapping of SIMD registers between guest and host is also complicated. This paper presents the design of translating structured loads/stores in DBT, including target code generation, efficient SIMD register mapping, and optimizations for reducing data permutations. Our proposed register mapping mechanisms and optimizations are not limited to handling structured loads/stores; they can be extended to deal with normal SIMD instructions. This paper evaluates how different factors affect the translation performance and code size. These factors include guest SIMD register length, strides, and use cases for structured loads. On a set of OpenCV benchmarks, our QEMU-based system has achieved a maximum speedup of 5.03x, with an average improvement of 2.87x. On a set of BLAS benchmarks, our system has obtained a maximum speedup of 2.22x and an average improvement of 1.78x.
Authors: Wai-Kok Choong, Ching-Tai Chen, Jen-Hung Wang, and Ting-Yi Sung

Abstract:
When conducting proteomics experiments to detect missing proteins and protein isoforms in the human proteome, it is desirable to use a protease that can yield more unique peptides with properties amenable for mass spectrometry analysis. Though trypsin is currently the most widely used protease, some proteins can yield only a limited number of unique peptides by trypsin digestion. Other proteases and multiple proteases have been applied in reported studies to increase the number of identified proteins and protein sequence coverage. To facilitate the selection of proteases, we developed a web-based resource, called in silico Human Proteome Digestion Map (iHPDM), which contains a comprehensive proteolytic peptide database constructed from human proteins, including isoforms, in neXtProt digested by 15 protease combinations of one or two proteases. iHPDM provides convenient functions and graphical visualizations for users to examine and compare the digestion results of different proteases. Notably, it also supports users to input filtering criteria on digested peptides, e.g., peptide length and uniqueness, to select suitable proteases. iHPDM can facilitate protease selection for shotgun proteomics experiments to identify missing proteins, protein isoforms, and single amino acid variant peptides.
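To make the notion of in silico digestion and uniqueness filtering concrete, here is a small Python sketch. It is not iHPDM's implementation; it applies only the commonly used trypsin rule (cleave after K or R unless the next residue is P), ignores missed cleavages, and uses hypothetical function names, toy sequences, and length limits.

```python
def digest_trypsin(sequence):
    """Split a protein sequence with a simplified trypsin rule.

    Cleaves on the C-terminal side of K or R, except when the next
    residue is P. Missed cleavages are not modeled in this sketch.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def unique_peptides(proteins, min_len=7, max_len=30):
    """Map each peptide found in exactly one protein to that protein.

    proteins: {protein_name: sequence}
    min_len/max_len: illustrative length filter, not values from iHPDM.
    """
    owners = {}
    for name, sequence in proteins.items():
        for peptide in set(digest_trypsin(sequence)):
            owners.setdefault(peptide, set()).add(name)
    return {
        peptide: next(iter(names))
        for peptide, names in owners.items()
        if len(names) == 1 and min_len <= len(peptide) <= max_len
    }


if __name__ == "__main__":
    print(digest_trypsin("MKRPAAKGG"))
    toy = {"P1": "AAAKCCCCR", "P2": "AAAKDDDDR"}  # toy sequences
    print(unique_peptides(toy, min_len=4))
```

Swapping the cleavage rule for that of another protease and comparing the resulting counts of unique peptides is the kind of comparison the resource is described as supporting.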
"One-Shot Object Detection with Co-Attention and Co-Excitation," Thirty-third Conference on Neural Information Processing Systems, December 2019.
Authors: Ting-I Hsieh, Yi-Chen Lo, Hwann-Tzong Chen, and Tyng-Luh Liu

Abstract:
This paper aims to tackle the challenging problem of one-shot object detection. Given a query image patch whose class label is not included in the training data, the goal of the task is to detect all instances of the same class in a target image. To this end, we develop a novel co-attention and co-excitation (CoAE) framework that makes contributions in three key technical aspects. First, we propose to use the non-local operation to explore the co-attention embodied in each query-target pair and yield region proposals accounting for the one-shot situation. Second, we formulate a squeeze-and-co-excitation scheme that can adaptively emphasize correlated feature channels to help uncover relevant proposals and eventually the target objects. Third, we design a margin-based ranking loss for implicitly learning a metric to predict the similarity of a region proposal to the underlying query, regardless of whether its class label is seen or unseen in training. The resulting model is therefore a two-stage detector that yields a strong baseline on both VOC and MS-COCO under the one-shot setting of detecting objects from both seen and never-seen classes.
"Toward Instantaneous Sanitization through Disturbance-induced Errors and Recycling Programming over 3D Flash Memory," ACM/IEEE International Conference on Computer-Aided Design (ICCAD), November 2019.
Authors: Wei-Chen Wang, Ping-Hsien Lin, Yung-Chun Li, Chien-Chung Ho, Yu-Ming Chang, and Yuan-Hao Chang

Abstract:
As data security has become one of the most crucial issues in modern storage system and application designs, data sanitization techniques are regarded as a promising solution on 3D NAND flash-memory-based devices. Many excellent works have been proposed to exploit in-place reprogramming, erasure and encryption techniques to implement sanitization functionality. However, existing sanitization approaches could incur performance and disturbance overheads, or even leave the data open to being deciphered. Different from existing works, this work aims at exploring an instantaneous data sanitization scheme by taking advantage of programming disturbance properties. Our proposed design can not only achieve instantaneous data sanitization by properly exploiting programming disturbance and error correction codes, but also enhance performance with the recycling programming design. The feasibility and capability of our proposed design are evaluated by a series of experiments on 3D NAND flash memory chips, for which we have very encouraging results. The experimental results show that the proposed design achieves instantaneous data sanitization with low overhead; in addition, it improves the average response time and reduces the block erase count by up to 86.8% and 88.8%, respectively.
Authors: Reuben Wang, Chung-Yen Lin, Shu-Hwa Chen, Kai-Jiun Lo, Chi-Te Liu, Tzu-Ho Chou, Yang-hsin Shih

Abstract:
We discovered one purple photosynthetic bacterium, Rhodopseudomonas palustris YSC3, which has a specific ability to degrade 1,2,5,6,9,10-hexabromocyclododecane (HBCD). The whole transcriptome of R. palustris YSC3 was analyzed using Illumina RNA-based sequencing technology, and was compared and discussed through the Multi-Omics onLine Analysis System (MOLAS, http://molas.iis.sinica.edu.tw/NTUIOBYSC3/ ) platform we built. Using a genome-based mapping approach, we aligned the trimmed reads to the genome of R. palustris and estimated the expression profile of each transcript. A total of 341 differentially expressed genes (DEGs) in HBCD-treated R. palustris (RPH) versus control R. palustris (RPC) were identified by 2-fold changes, among which 305 genes were up-regulated and 36 genes were down-regulated. The regulated genes were mapped to the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, resulting in 78 pathways being identified. Genes annotated to important functions in several metabolic pathways, including two-component system (13.6%), ribosome assembly (10.7%), glyoxylate and dicarboxylate metabolism (5.3%), fatty acid degradation (4.7%), drug metabolism-cytochrome P450 (2.3%), and chlorocyclohexane and chlorobenzene degradation (3.0%), were differentially expressed between the RPH and RPC samples. We also identified one transcript annotated as dehalogenase and other genes involved in HBCD biotransformation in R. palustris. Furthermore, a putative HBCD biotransformation mechanism in R. palustris is proposed.
"Enriching Variety of Layer-wise Learning Information by Gradient Combination," IEEE International Conference on Computer Vision Workshop(ICCVW) Low Power Computer Vision.'', October 2019.
Authors: C. Y. Wang, H. Y. Mark Liao, P. Y. Chen, and J. W. Hsieh

Abstract:
This study proposes to use the concept of gradient combination to enhance the learning capability of Deep Convolutional Networks (DCNs), and four Partial Residual Network-based (PRN-based) architectures are developed to verify this concept. The purpose of designing PRN is to provide as rich information as possible for each single layer. During the training phase, we propose to propagate gradient combinations rather than feature combinations. PRN can be easily applied to many existing network architectures, such as ResNet and feature pyramid networks, and can effectively improve their performance. Nowadays, more advanced DCNs are designed with the hierarchical semantic information of multiple layers, so models will continue to deepen and expand. Due to the neat design of PRN, it can benefit all models, especially lightweight models. In the MSCOCO object detection experiments, YOLO-v3-PRN maintains the same accuracy as YOLO-v3 with a 55% reduction of parameters and a 35% reduction of computation, while doubling the execution speed. For lightweight models, YOLO-v3-tiny-PRN maintains the same accuracy with 37% fewer parameters and 38% less computation than YOLO-v3-tiny, and increases the frame rate by up to 12 fps on the NVIDIA Jetson TX2 platform. Pelee-PRN is 6.7% mAP@0.5 higher than Pelee, achieving state-of-the-art lightweight object detection. The proposed lightweight object detection model has been integrated with technologies such as multi-object tracking and license plate recognition, and is used in a commercial intelligent traffic flow analysis system as its edge computing component. Three countries and more than ten cities have already deployed this technique in their traffic flow analysis systems.
"Identifying expressive semantics in orchestral conducting kinematics," International Society of Music Information Retrieval Conference (ISMIR), November 2019.
Authors: Yu-Fen Huang, Tsung-Ping Chen, Nikki Moran, Simon Coleman, and Li Su

Abstract:
Existing kinematic research on orchestral conducting movement contributes to beat-tracking and the delivery of performance dynamics. Methodologically, such movement cues have been treated as distinct, isolated events. Yet as practicing musicians and music pedagogues know, conductors’ expressive instructions are highly flexible and dependent on the musical context. We seek to demonstrate an approach to search for effective descriptors to express musical features in conducting movement in a valid music context, and to extract complex expressive semantics from elementary conducting kinematic variations. This study therefore proposes a multi-task learning model to jointly identify dynamic, articulation, and phrasing cues from conducting kinematics. A professional conducting movement dataset is compiled using a high-resolution motion capture system. The ReliefF algorithm is applied to select significant features from conducting movement, and recurrent neural network (RNN) is implemented to identify multiple movement cues. The experimental results disclose key elements in conducting movement which communicate musical expressiveness; the results also highlight the advantage of multi-task learning in the complete musical context over single-task learning. To the best of our knowledge, this is the first attempt to use recurrent neural network to explore multiple semantic expressive cuing in conducting movement kinematics.
"Tell Me Where It is Still Blurry: Adversarial Blurred Region Mining and Refining," 27th ACM Multimedia Conference (long paper), October 2019.
Authors: Jen-Chun Lin, Wen-Li Wei, Tyng-Luh Liu, C.-C. Jay Kuo, and H. Y. Mark Liao

Abstract:
Mobile devices such as smartphones are ubiquitously being used to take photos and videos, thus increasing the importance of image deblurring. This study introduces a novel deep learning approach that can automatically and progressively achieve the task via adversarial blurred region mining and refining (adversarial BRMR). Starting with a collaborative mechanism of two coupled conditional generative adversarial networks (CGANs), our method first learns the image-scale CGAN, denoted as iGAN, to globally generate a deblurred image and locally uncover its still blurred regions through an adversarial mining process. Then, we construct the patch-scale CGAN, denoted as pGAN, to further improve sharpness of the most blurred region in each iteration. Owing to such complementary designs, the adversarial BRMR indeed functions as a bridge between iGAN and pGAN, and yields the performance synergy in better solving blind image deblurring. The overall formulation is self-explanatory and effective to globally and locally restore an underlying sharp image. Experimental results on benchmark datasets demonstrate that the proposed method outperforms the current state-of-the-art technique for blind image deblurring both quantitatively and qualitatively.
"See-through-Text Grouping for Referring Image Segmentation," 2019 International Conference on Computer Vision, October 2019.
Authors: Ding-Jie Chen, Songhao Jia, Yi-Chen Lo, Hwann-Tzong Chen, and Tyng-Luh Liu

Abstract:
Motivated by conventional grouping techniques for image segmentation, we develop their DNN counterpart to tackle the referring variant. The proposed method is driven by a convolutional-recurrent neural network (ConvRNN) that iteratively carries out top-down processing of bottom-up segmentation cues. Given a natural language referring expression, our method learns to predict its relevance to each pixel and derives a See-through-Text Embedding Pixelwise (STEP) heatmap, which reveals pixel-level segmentation cues via the learned visual-textual co-embedding. The ConvRNN performs a top-down approximation by converting the STEP heatmap into a refined one, where the improvement is expected to come from training the network with a classification loss against the ground truth. With the refined heatmap, we update the textual representation of the referring expression by re-evaluating its attention distribution and then compute a new STEP heatmap as the next input to the ConvRNN. Boosted by such collaborative learning, the framework can progressively and simultaneously yield the desired referring segmentation and a reasonable attention distribution over the referring sentence. Our method is general and does not rely on, say, the outcomes of object detection from other DNN models, while achieving state-of-the-art performance on all four datasets in the experiments.
"Achieving Lossless Accuracy with Lossy Programming for Efficient Neural-Network Training on NVM-Based Systems," ACM/IEEE International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), October 2019.
Authors: Wei-Chen Wang, Yuan-Hao Chang, Tei-Wei Kuo, Chien-Chung Ho, Yu-Ming Chang, and Hung-Sheng Chang

Abstract:
Neural networks on conventional computing platforms are heavily restricted by data volume and performance concerns. While non-volatile memory offers potential solutions to data volume issues, performance challenges remain, especially given its asymmetric read and write performance. Besides, critical concerns over endurance must also be resolved before non-volatile memory can be used in practice for neural networks. This work addresses the performance and endurance concerns together by proposing a data-aware programming scheme. We propose to consider neural network training jointly from the data-flow and data-content points of view. In particular, methodologies with approximate results over Dual-SET operations are presented. Encouraging results were observed through a series of experiments, where great efficiency and lifetime enhancement are seen without sacrificing the result accuracy.
"Enabling Sequential-write-constrained B+-tree Index Scheme to Upgrade Shingled Magnetic Recording Storage Performance," ACM/IEEE International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), October 2019.
Authors: Yu-Pei Liang, Tseng-Yi Chen, Yuan-Hao Chang, Shuo-Han Chen, Kam-Yiu Lam, Wei-Hsin Li, and Wei-Kuan Shih

Abstract:
As shingled magnetic recording (SMR) drives have been widely applied to modern computer systems (e.g., archive file systems, big data computing systems, and large-scale database systems), storage system developers should thoroughly review whether current designs (e.g., index schemes and data placements) are appropriate for an SMR drive because of its sequential write constraint. Though many prior works excellently manage data in an SMR drive by integrating their proposed solutions into the driver layer, an index scheme over an SMR drive has never been optimized by any previous work, because managing an index over the SMR drive requires jointly considering the properties of the B+-tree and the nature of SMR (e.g., sequential write constraint and zone partitions) in a host storage system. Moreover, poor index management will result in terrible storage performance because an index manager is extensively used in file systems and database applications. To optimize the B+-tree index structure over SMR storage, this work identifies the performance overheads caused by the B+-tree index structure in an SMR drive. Based on this observation, this study proposes a sequential-write-constrained B+-tree index scheme, namely SW-B+tree, which consists of an address redirection data structure, an SMR-aware node allocation mechanism, and a frequency-aware garbage collection strategy. According to our experiments, the SW-B+tree can improve the SMR storage performance by 55% on average.
Authors: Ching-Tai Chen, Chu-Ling Ko, Wai-Kok Choong, Jen-Hung Wang, Wen-Lian Hsu, and Ting-Yi Sung

Abstract:
Protein and peptide identification and quantitation are essential tasks in proteomics research and involve a  series of steps in analyzing mass spectrometry data. Trans-Proteomic Pipeline (TPP) provides a wide range of useful tools through its web interfaces for analyses such as sequence  database search, statistical validation, and quantitation. To utilize the powerful functionality of TPP without the need for manual intervention to launch each step, we developed a software tool, called WinProphet, to create and automatically  execute a pipeline for proteomic analyses. It seamlessly integrates with TPP and other external command-line programs, supporting various functionalities, including database search for protein and peptide identification, spectral library construction and search, data-independent acquisition (DIA) data analysis, and isobaric labeling and label-free quantitation. WinProphet is a standalone, installation-free tool with graphical interfaces for users to configure, manage, and automatically execute pipelines. The constructed pipelines can be exported as XML files with all of the parameter settings for reusability and portability. The executable files, user manual, and sample data sets of WinProphet are freely available at http://ms.iis.sinica.edu.tw/COmics/Software_WinProphet.html.
"Exploiting Vector Processing in Dynamic Binary Translation," the International Conference on Parallel Processing (ICPP), August 2019.
Authors: Chih-Min Lin, Sheng-Yu Fu, Ding-Yong Hong, Yu-Ping Liu, Jan-Jan Wu, Wei-Chung Hsu

Abstract:
Auto vectorization techniques have been adopted by compilers to exploit data-level parallelism in parallel processing for decades. However, since processor architectures have kept enhancing with new features to improve vector/SIMD performance, legacy application binaries failed to fully exploit new vector/SIMD capabilities in modern architectures. For example, legacy ARMv7 binaries cannot benefit from ARMv8 SIMD double precision capability, and legacy x86 binaries cannot enjoy the power of AVX-512 extensions. In this paper, we study the fundamental issues involved in cross-ISA Dynamic Binary Translation (DBT) to convert non-vectorized loops to vector/SIMD forms to achieve greater computation throughput available in newer processor architectures. The key idea is to recover critical loop information from those application binaries in order to carry out vectorization at runtime. Experiment results show that our approach achieves an average speedup of 1.42x compared to ARMv7 native run across various benchmarks in an ARMv7-to-ARMv8 dynamic binary translation system.
Authors: Sheng-Yao Su, I-Hsuan Lu, Wen-Chih Cheng, Wei-Chun Chung, Pao-Yang Chen, Jan-Ming Ho, Shu-Hwa Chen, Chung-Yen Lin

Abstract:
DNA methylation is a crucial epigenomic mechanism in biological systems. Using whole genome bisulfite sequencing (WGBS) technology, the methylation status of cytosine sites can be revealed. However, performing WGBS data analysis is often complicated and challenging. To alleviate such difficulties, we integrated WGBS data processing and downstream analysis into a two-phase approach, namely, EpiMOLAS. First, we packaged a Docker container, DocMethyl, to handle raw data processing, mapping, and methylation calling/scoring, producing a summary, called an mtable, of the whole-genome methylation status by gene. Next, mtables are uploaded to the web server EpiMOLAS_web and linked with gene annotation databases that enable rapid data retrieval and analyses. This two-phase combination of DocMethyl and EpiMOLAS_web supports methylome data analysis from raw read processing through downstream analysis.
"Rethinking Last-level-cache Write-back Strategy for MLC STT-RAM Main Memory with Asymmetric Write Energy," ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), July 2019.
Authors: Yu-Pei Liang, Tseng-Yi Chen, Yuan-Hao Chang, Shuo-Han Chen, Pei-Yu Chen, and Wei-Kuan Shih

Abstract:
To meet the requirement of low power consumption, multi-level-cell STT-RAM (MLC STT-RAM) has been widely regarded as a potential candidate for replacing DRAM-based main memory in next-generation computer architectures because of its high memory cell density, fast read/write performance and zero refresh power consumption. However, MLC STT-RAM has higher power consumption than DRAM when a write operation is performed, because MLC STT-RAM sometimes needs to perform a two-step transition to change the originally stored bits to the bit pattern being written. As a result, MLC STT-RAM consumes different amounts of power when different bit patterns are written to a memory cell. To the best of our knowledge, few, if any, previous studies have rethought the cache replacement policy to overcome the asymmetric write energy issue of MLC STT-RAM-based main memory. Thus, this study proposes an energy-aware cache replacement policy, namely E-cache, which considers asymmetric write-back power consumption on MLC STT-RAM-based main memory when choosing which cached data to evict from the last-level cache, so as to minimize system power consumption. The experimental results show that the proposed solution reduces the energy consumption by 36% on average, compared with LRU.
"The Best of Both Worlds: On Exploiting Bit-Alterable NAND Flash for Lifetime and Read Performance Optimization," ACM/IEEE Design Automation Conference (DAC), June 2019.
Authors: Shuo-Han Chen, Ming-Chang Yang, and Yuan-Hao Chang

Abstract:
With the emergence of bit-alterable 3D NAND flash, programming and erasing a flash cell at bit-level granularity have become a reality. Bit-level operations can benefit the high density, high bit-error-rate 3D NAND flash via realizing the "bit-level rewrite operation," which can refresh error bits at bit-level granularity for reducing the error correction latency and improving the read performance with minimal lifetime expense. Different from existing refresh techniques, bit-level operations can lower the lifetime expense via removing error bits directly without page-based rewrites. However, since bit-level rewrites may induce a similar amount of latency as conventional page-based rewrites and thus lead to low rewrite throughput, the efficiency of bit-level rewrites should be carefully considered. Such observation motivates us to propose a bit-level error removal (BER) scheme to derive the most-efficient way of utilizing the bit-level operations for both lifetime and read performance optimization. A series of experiments was conducted to demonstrate the capability of the BER scheme with encouraging results.
"Enabling File-Oriented Fast Secure Deletion on Shingled Magnetic Recording Drives," ACM/IEEE Design Automation Conference (DAC), June 2019.
Authors: Shuo-Han Chen, Ming-Chang Yang, Yuan-Hao Chang, and Chun-Feng Wu

Abstract:
Existing secure deletion approaches are inefficient in erasing data permanently because file systems have no knowledge of the data layout on the storage device, nor is the storage device aware of file information within the file systems. This inefficiency is exaggerated on the emerging shingled magnetic recording (SMR) drive due to its inherent sequential-write constraint. On SMR drives, secure deletion requests may lead to serious write amplification and performance degradation if the data layout is not properly configured. Such observation motivates us to propose a file-oriented fast secure deletion (FFSD) strategy to alleviate the negative impacts of SMR drives' sequential-write constraint and improve the efficiency of secure deletion operations on SMR drives. A series of experiments were conducted to demonstrate the capability of the proposed strategy on improving the efficiency of secure deletion on SMR drives.
"Enhancing Transactional Memory Execution via Dynamic Binary Translation," ACM Applied Computing Review (ACR), April 2019.
Authors: Ding-Yong Hong, Shih-Kai Lin, Sheng-Yu Fu, Jan-Jan Wu, Wei-Chung Hsu

Abstract:
Transactional Synchronization Extensions (TSX) have been available for hardware transactional memory since the 4th generation Intel Core processors. TSX provides two software programming interfaces: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM). HLE is easy to use and maintains backward compatibility with processors without TSX support, while RTM is more flexible and scalable. Previous research has shown that critical sections protected by RTM with a well-designed retry mechanism as its fallback code path can often achieve better performance than HLE. More parallel programs may be written with HLE; however, using RTM may yield greater performance. To embrace both productivity and high performance of parallel programs with TSX, we present a framework built on QEMU that can dynamically transform HLE instructions in an application binary into fragments of RTM code with adaptive tuning on the fly. Compared to HLE execution, our prototype achieves a 1.56x speedup on average with 8 threads. Due to the scalability of RTM, the speedup will be more significant as the number of threads increases.