Research Laboratories

Natural Language and Knowledge Processing Laboratory

Research Faculty

Wen-Lian Hsu
Distinguished Research Fellow
Operations Research, Cornell University

Fu Chang
Associate Research Fellow
Mathematical Statistics, Columbia University

Keh-Jiann Chen
Research Fellow
Computer Science, State University of New York at Buffalo

Chun-Nan Hsu
Research Fellow
Computer Science, University of Southern California

Hsin-Min Wang
Associate Research Fellow
Electrical Engineering, National Taiwan University

Technical Faculty

Der-Ming Juang
Assistant Research Engineer
The Institute of Computer Management, National Tsing Hua University

Group Profile

We focus on problems concerning knowledge-based information processing. This area of research is strongly motivated by the flood of information on the Internet, for which effective and autonomous information processing tools are still lacking. In order to achieve high-level intelligent information processing, many challenging research problems in the areas of knowledge acquisition, knowledge representation, and knowledge utilization must be addressed.

1. Knowledge Acquisition

For the task of acquiring linguistic and common sense knowledge, we will focus on strategies and methodologies for automating knowledge acquisition processes. We expect that in the future, enhancement of knowledge bases will be carried out automatically by using established and yet-to-be-developed processing technologies to extract new knowledge from the Internet and from various text sources, such as XML documents and tagged corpora.

● Construction of linguistic knowledge bases

Over the past twenty-odd years, we have developed an infrastructure for Chinese language processing which includes a part-of-speech tagged corpus, treebanks, a Chinese lexical database, Chinese grammars, InfoMap, Chinese glyph structure databases, word identification systems, sentence parsers, etc. We have also developed some basic techniques for knowledge extraction, such as named-entity recognition (NER), semantic role labeling, and relation extraction in both Chinese and biological literature. We won 1st place in Chinese word segmentation and 2nd place in Chinese NER at the 2006 SIGHAN contests, and 1st place in gene normalization in the 2009 BioCreative II.5 contest. In the future, we plan to utilize our developed infrastructure to extract linguistic and domain knowledge from various corpora and texts on the web, and to enhance current knowledge bases. In particular, we have collected 40 million high-frequency meaningful word pairs in Chinese. Based on these, our future research involves automatically collecting useful event frames in order to better understand natural language texts.

● Machine learning and data mining

We have focused on machine learning and its applications to document image analysis, optical character recognition, and bioinformatics, and we will continue our work in enhancing the applicability of learning machines to large-scale problems. There are three types of scale problems with which we deal: large scale in training samples, large scale in class types, and large scale in (irrelevant) features. For the first problem, we have proposed an extremely efficient tree decomposition approach to train non-linear support vector machines at a speedup factor of hundreds, sometimes even thousands, while achieving comparable test accuracy. This method has been used effectively to deal with a large-scale protein-protein interface prediction problem with a 300-fold speedup. The tree decomposition method can be extended to an equally powerful forest decomposition in order to speed up machine learning on data sets that scale up in both training samples and class types, thereby solving the first and the second problems simultaneously. For the third problem, we are pioneering a new method for ranking and selecting features using multiple feature subsets, and have gained advantages in computing speed, test accuracy, the number of essential features that are ranked above all irrelevant features, and the number of essential features in the selected features. While endeavoring to develop new methods, we also publicize both our implementations and the data sets that were created in our applications, so as to benefit potential users of our methods.

2. Knowledge Utilization

We have designed a Chinese input system, GOING, which automatically translates a phonetic (or Pinyin) sequence into characters with a hit ratio close to 96%. This system is widely used in Taiwan. It received the Distinguished Chinese Information Product Award (中文傑出資訊產品獎) in 1993. In the area of PC Home software downloads, GOING has been downloaded about one million times. It is one of two domestic software programs ranked within the top 20 for downloads. Our knowledge representation kernel, InfoMap, has been applied to a wide variety of application systems in natural language processing, biological knowledge bases, and e-learning. In the future, we will design an event frame, which is the key technology for language understanding and also acts as a major building block of our learning system. We will also develop basic technologies for processing spoken languages, and support various applications. Future major research topics include: knowledge-based language processing, information extraction and retrieval from text, audio, and video, intelligent search, cross-language information retrieval, computer processing for Taiwanese, question answering, dialog, and intelligent tutoring.

● Knowledge-based Chinese language processing

We will focus on the conceptual processing of Chinese documents. Our knowledge-based language processing system will utilize statistical, linguistic, and common sense knowledge derived by our evolving Knowledge Web and E-HowNet to parse the conceptual structures of sentences and interpret sentence meanings. The knowledge-based language processing systems incorporate various knowledge bases to form a learning system. The processing power of these systems increases as the underlying knowledge bases are enhanced. In addition, these knowledge bases are evolving due to automatic knowledge extraction made possible by the language processing systems.

● Audio (speech / music / song) processing & retrieval

Our goal is to develop methods for analyzing, extracting, recognizing, indexing, and retrieving information from audio data. In the area of speech, our research has focused mainly on speech recognition, speaker recognition, and speech information retrieval. We have published several papers in prestigious journals, such as IEEE TASLP and ACM TALIP. In addition, we have successfully implemented several prototype systems, such as a TV news retrieval system and a speaker verification system. Our speaker verification system was ranked 2nd out of 6 participants in the ISCSLP2006 speaker recognition evaluation. Our on-going research includes attribute-detection-based speech recognition, spoken document summarization, speaker diarization, and language modeling. In the area of music, our research has focused mainly on vocal melody extraction, query by singing/humming, and solo vocal modeling. We have successfully implemented several prototype systems, such as a music retrieval system and a singer identification system, and published papers in IEEE TASLP, IEEE TMM, Computer Music Journal, as well as others. We participated in the audio tag classification task of MIREX2009, and our system was ranked 1st out of 12 systems. Future research directions include continuous improvement of our own technologies and systems, feature analysis, vocal separation, and finally automatic music structure analysis and summarization, so as to facilitate the management and retrieval of a large music database.

● Chinese question answering system

In a natural language question answering system, the user can ask a computer questions in an ordinary fashion, such as “Who is the President of the United States?” Such a system would greatly enhance search efficiency. We integrate several Chinese NLP techniques, such as question type classification, passage retrieval, named entity recognition, and answer ranking, to construct a Chinese QA system. Our system won first place in NTCIR-5 (2005) and NTCIR-6 (2007). In the future, we will extend the types of questions asked and add the ability to engage in dialogues.

● Integration of the knowledge about Chinese characters

We have established a platform to integrate knowledge about Chinese characters with the features listed below:
(1) Our platform has various means to retrieve Chinese characters, such as by glyph structures and by pronunciations.
(2) It addresses the issues of un-encoded Chinese characters, including displaying, retrieval, input, registration, and printing.
(3) It organizes information about Chinese characters and allows customization to meet personal preference.

3. Knowledge Representation

We study the logical foundation of ontology as well as fine-grained semantic representation, which gives us a better understanding of meaning representation and composition. We will remodel the current ontology structures of WordNet, HowNet, and FrameNet to achieve a better and more unified representation.

● E-HowNet

Natural language is a means of denoting concepts. However, word sense ambiguities make natural language processing and conceptual processing almost impossible. To bridge the gaps between natural language representations and conceptual representations, we propose a universal concept representational mechanism called E-HowNet, which evolved from HowNet. It extends the word sense definition mechanism of HowNet and uses WordNet synsets as vocabulary to describe concepts. Each word sense (or concept) is defined by some simpler concept. The simple concepts used in the definitions can be further decomposed into even simpler concepts, until primitive or basic concepts are derived. Therefore, definitions can be dynamically decomposed and unified into E-HowNet representations at different levels. E-HowNet is language independent. Thus, any word sense of any language can be defined and a near-canonical representation can be achieved. The semantic distance between any two concepts, as well as their sense similarities and differences, can be determined by checking their definitions. In addition to taxonomy links, concepts are also associated by their shared conceptual features, and fine-grained differences between near-synonyms can be differentiated by adding new features.

● Expression of the knowledge about the glyph of Chinese characters

The Chinese Glyph Structure Database is designed to record the knowledge of Chinese characters, including time-variant shapes, structures, and the relationships across variants in practical usage. The database has the following features:
(1) It reflects the evolution of Chinese characters.
(2) It expresses the cross-era relationship among Chinese character variants.
(3) It demonstrates an exclusive shape-by-definition feature of Chinese characters.
(4) It uses “glyph expressions” and “style codes” to solve the problem of Chinese character encoding.
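
As a rough illustration of the E-HowNet idea described in this profile (each concept is defined by simpler concepts, definitions are decomposed until primitives remain, and similarity is judged by comparing definitions), the following minimal Python sketch may help. It is not the E-HowNet format or the laboratory's similarity measure: the toy lexicon, the function names expand and similarity, and the use of Jaccard overlap between primitive sets are all assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, Optional, Set

# Hypothetical toy lexicon (an assumption, not real E-HowNet data).
# Each concept maps to the simpler concepts that define it; any concept
# without an entry is treated as a primitive.
DEFINITIONS: Dict[str, Set[str]] = {
    "teacher": {"human", "teach"},
    "student": {"human", "study"},
    "professor": {"teacher", "university"},
    "teach": {"transfer", "knowledge"},
    "study": {"acquire", "knowledge"},
}


def expand(concept: str,
           definitions: Dict[str, Set[str]],
           _seen: Optional[FrozenSet[str]] = None) -> FrozenSet[str]:
    """Decompose a concept recursively until only primitives remain."""
    seen = _seen or frozenset()
    if concept in seen:
        # Guard against circular definitions: stop and keep the concept.
        return frozenset({concept})
    parts = definitions.get(concept)
    if not parts:
        # No definition available, so this is a primitive concept.
        return frozenset({concept})
    primitives: Set[str] = set()
    for part in parts:
        primitives |= expand(part, definitions, seen | {concept})
    return frozenset(primitives)


def similarity(a: str, b: str, definitions: Dict[str, Set[str]]) -> float:
    """Jaccard overlap of primitive sets (an assumed stand-in measure)."""
    pa, pb = expand(a, definitions), expand(b, definitions)
    return len(pa & pb) / len(pa | pb)


if __name__ == "__main__":
    print(sorted(expand("professor", DEFINITIONS)))
    print(similarity("teacher", "student", DEFINITIONS))
    print(similarity("teacher", "professor", DEFINITIONS))
```

In this toy setting, "teacher" and "professor" share more primitives than "teacher" and "student", so they score as more similar. Adding a new feature to one definition (for example "university" on "professor") is what separates near-synonyms, which mirrors the fine-grained differentiation mentioned in the E-HowNet description.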
