Institute of Information Science, Academia Sinica

Deep Reinforcement Learning and Games

PIs: Ti-Rong Wu and I-Chen Wu

Deep reinforcement learning (DRL) has recently made significant progress in various fields, including gaming, robotics, and natural language processing. Within DRL research, computer games serve as an ideal benchmark: the environment is relatively accessible and controllable, and knowledge gained there can be extended to more complex systems. The goal of this project is to study and develop novel DRL algorithms that can be applied to a variety of real-world applications. We introduce our research topics and some preliminary results below.

AlphaGo-Like or AlphaZero-Like Approach

First, we proposed an AlphaGo-like algorithm with a multi-labelled value network. Although AlphaGo has achieved superhuman playing levels in the game of Go, it was designed for a fixed komi of 7.5, the number of points added to the second player's score to balance the game. To support different game rules, we proposed a novel value network architecture, called the multi-labelled (ML) value network, shown in Figure 1. The ML value network has three advantages: (a) it outputs values for different komi, (b) it supports dynamic komi, and (c) it lowers the mean squared error (MSE) while improving playing strength, especially for handicap games.

Fig. 1: Multi-labelled value network architecture.
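
As a rough illustration of advantage (a), the sketch below shows how a value head might output one win-rate estimate per komi setting rather than a single scalar; the komi range, layer sizes, and class name are illustrative assumptions, not the exact CGI architecture.

```python
# A minimal PyTorch-style sketch of a multi-labelled (ML) value head.
# The komi count, layer sizes, and names are illustrative assumptions,
# not the exact architecture used in CGI.
import torch
import torch.nn as nn

class MultiLabelledValueHead(nn.Module):
    def __init__(self, in_channels=256, board_size=19, num_komi=21):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.fc1 = nn.Linear(board_size * board_size, 256)
        self.fc2 = nn.Linear(256, num_komi)   # one output per komi setting

    def forward(self, features):
        x = torch.relu(self.conv(features))
        x = torch.flatten(x, start_dim=1)
        x = torch.relu(self.fc1(x))
        # One win-rate estimate per komi; dynamic komi amounts to reading
        # off the output that matches the komi currently in effect.
        return torch.sigmoid(self.fc2(x))
```

In this sketch, each output would be supervised against the game outcome judged under the corresponding komi, which is what lets a single network serve handicap and non-standard-komi games.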

Then, we proposed using Population Based Training (PBT) to train our Go program, CGI. CGI achieves a 74% win rate against ELF OpenGo, a state-of-the-art program developed by Facebook, while requiring less than one tenth of the computing resources. The methods above, published in IEEE Transactions on Games and at the top-tier conference AAAI 2020, are widely used by computer Go programs.
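
A minimal sketch of a PBT loop in this spirit follows; the Member fields, the exploit/explore rules, and the 20% cutoff are assumptions chosen for illustration rather than the actual CGI training pipeline.

```python
import copy
import random

class Member:
    """One population member: network weights plus its own hyperparameters."""
    def __init__(self, weights, hparams):
        self.weights = weights    # e.g. network parameters
        self.hparams = hparams    # e.g. {"learning_rate": 0.01, "c_puct": 1.5}
        self.score = 0.0          # e.g. win rate from evaluation games

def pbt_step(population, train_one_round, evaluate, exploit_fraction=0.2):
    """One PBT generation: train and evaluate every member, then let the
    worst members copy the weights of the best and perturb their
    hyperparameters."""
    for m in population:
        train_one_round(m)        # self-play plus optimisation
        m.score = evaluate(m)     # evaluation games against a reference
    population.sort(key=lambda m: m.score, reverse=True)
    cutoff = max(1, int(len(population) * exploit_fraction))
    for loser in population[-cutoff:]:
        winner = random.choice(population[:cutoff])
        loser.weights = copy.deepcopy(winner.weights)        # exploit
        loser.hparams = {k: v * random.choice([0.8, 1.2])    # explore: perturb
                         for k, v in winner.hparams.items()}
    return population
```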

Next, we developed strength adjustment for MCTS-based programs. Superhuman game-playing programs have captured the fascination of society at large, offering human players an interesting challenge and opportunities for learning. However, it is important to adjust a program's difficulty to a level appropriate for human players in order to facilitate learning: a program that is too weak may cause human players to lose interest, while excessive difficulty can lead to frustration. To address this issue, we propose a general approach to adjusting the strength of Monte Carlo tree search (MCTS) based game-playing programs by using a softmax policy with a strength index z to control the difficulty. After applying the strength index to our computer Go program, CGI Go Intelligence (CGI), the program can cover a wide range of strengths, from beginner to superhuman levels. Our proposed method was published in IEEE Computational Intelligence Magazine and at AAAI 2019.
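
For concreteness, one plausible way to realize such a policy is to sample moves in proportion to MCTS visit counts raised to the power 1/z, as sketched below; this power-softmax form is an assumption for illustration, and the exact formula used in CGI follows the published papers.

```python
import numpy as np

def select_move(visit_counts, z, rng=np.random.default_rng()):
    """Sample a move with probability proportional to N(a) ** (1 / z),
    where N(a) is the MCTS visit count of action a and z is the strength
    index.  This particular form is an illustrative assumption."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    scaled = (counts / counts.max()) ** (1.0 / z)   # normalise before the power
    probs = scaled / scaled.sum()
    return rng.choice(len(probs), p=probs)

# Small z concentrates the policy on the most-visited move (strongest play);
# larger z flattens the distribution and weakens the program.
```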

Problem Solving

While DRL has demonstrated the ability to learn strong heuristics that surpass human performance in games, these learned heuristics are not exact solutions. Many real-world applications, such as autonomous driving or medical surgery, require exact manipulation rather than merely strong heuristics. Combining learned DRL heuristics with exact methods to achieve perfection is therefore a major challenge. To tackle it, we use 7x7 Killall-Go, a variant of the game of Go, as a benchmark for studying problem solving. We expect these methods to extend to other real-world problems, such as theorem proving, job shop scheduling problems (JSP), and traveling salesman problems (TSP). The research can be divided into two parts, introduced in turn below: designing efficient search methods and developing better DRL algorithms.

First, we design efficient search methods for game solving. Traditionally, researchers have dedicated themselves to hand-crafting knowledge that helps search algorithms solve games. However, this approach requires significant effort and does not generalize easily to other domains. To address this issue, we propose a new search approach called relevance zone-based search (RZS). RZS uses information shared among solution sub-trees to reduce search complexity, forming knowledge from the search tree itself without requiring much human knowledge. An example of relevance zone construction is shown in Figure 2. RZS achieves state-of-the-art results, successfully solving 68 of 106 19x19 life-and-death Go problems. Furthermore, we designed a transposition table based on relevance zone patterns, which increases the number of solved problems from 68 to 83. These results were published at AAAI 2022 and in IEEE Transactions on Games.

Fig. 2: An example of relevance zone construction.
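
Conceptually, a relevance-zone-based transposition table stores, for each solved position, the zone that was sufficient to prove the result together with the stones inside it, so the result can be reused in any later position that matches within that zone. The sketch below is a simplified abstraction of that reuse, with hypothetical data structures, not the actual CGI implementation.

```python
class ZoneTranspositionTable:
    """Simplified sketch: an entry records a relevance zone (a set of board
    points) and the stone pattern inside it; the stored result applies to any
    position whose stones match within that zone, regardless of the rest of
    the board.  A real implementation would index entries by a zone hash
    rather than scanning linearly.  `position` is assumed to be a dict
    mapping board points to stone characters, with '.' meaning empty."""

    def __init__(self):
        self.entries = []   # list of (zone, zone_pattern, result)

    def store(self, position, zone, result):
        pattern = tuple(sorted((p, position.get(p, '.')) for p in zone))
        self.entries.append((frozenset(zone), pattern, result))

    def probe(self, position):
        for zone, pattern, result in self.entries:
            current = tuple(sorted((p, position.get(p, '.')) for p in zone))
            if current == pattern:
                return result   # stones outside the zone are irrelevant
        return None
```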

Second, we develop better DRL algorithms for game solving. While many researchers have tried using game-playing DRL heuristics to help solve problems, these heuristics are not always tailor-made for, or effective at, game solving. To address this issue, we propose a novel DRL training method that modifies the training target of the AlphaZero algorithm so that it prioritizes solving the game quickly rather than winning. We train a Proof Cost Network (PCN), where proof cost is a heuristic estimating the amount of work required to solve a problem. We propose two specific training targets: the first targets the shortest path to a solution, while the second estimates the proof cost. Our experiments show that PCN outperforms traditional DRL methods and solves significantly more problems under the same time limits. This result was published at ICLR 2022.
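
As a rough sketch of the second target, the proof cost can be treated as a regression toward the (log-scaled) amount of search a solver needed to prove a position; the exact target and loss used by PCN follow the ICLR 2022 paper, and the simplified form below is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def proof_cost_target(solved_node_counts):
    """Assumed target: log(1 + n), where n is the number of nodes a solver
    expanded to prove the position.  Smaller means cheaper to solve."""
    return torch.log1p(solved_node_counts.float())

def proof_cost_loss(predicted_cost, solved_node_counts):
    # During search, branches with lower predicted cost would be preferred,
    # steering the solver toward positions that are quick to prove.
    return F.mse_loss(predicted_cost, proof_cost_target(solved_node_counts))
```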

To further improve the efficiency of solving more challenging problems, we develop a novel online learning method that dynamically adjusts DRL heuristics during the solving process. The architecture, shown in Figure 3, includes a manager, a set of workers, and an online learning trainer. The manager maintains a search tree rooted at the position to be solved and distributes jobs to workers based on PCN values. The workers, themselves game solvers, attempt to solve the jobs sent by the manager within a given computing budget. Meanwhile, the online learning trainer fine-tunes the DRL heuristics during the solving process using training samples collected from recently encountered positions in the search tree. With this method, we successfully solve a series of challenging 7x7 Killall-Go openings and achieve a 3.29x speedup compared to a solver that does not use online learning. We expect an even greater speedup on more difficult problems.

Fig. 3: The architecture for the online learning game solver.
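
The sketch below outlines how the manager, workers, and online trainer might interact; the interfaces (tree, worker, trainer) and the node budget are hypothetical placeholders, and the real system runs the workers in parallel rather than in this simplified sequential loop.

```python
def solve_with_online_learning(tree, workers, trainer, node_budget=100_000):
    """High-level sketch of the online-learning solver.  `tree` is the
    manager's search tree, `workers` are solver instances, and `trainer`
    fine-tunes the shared heuristic network; all three interfaces are
    hypothetical placeholders for illustration."""
    while not tree.root_solved():
        # Manager: pick the unsolved leaves with the best PCN values and
        # hand one job to each worker (done in parallel in the real system).
        jobs = tree.select_leaves_by_pcn(limit=len(workers))
        for worker, leaf in zip(workers, jobs):
            solved, visited_positions = worker.try_solve(leaf, node_budget)
            tree.update(leaf, solved)               # prune solved subtrees or expand
            trainer.add_samples(visited_positions)  # recently encountered positions
        # Online trainer: fine-tune the heuristics so that later jobs are
        # guided by an updated network.
        trainer.fine_tune_step()
```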