[Most-ai-contest] zhwiki_20190820 & pbs+cdc+cna+moti(2019.1.1~2019.11.28 broadcast news) corpus for pre-training DNN models

范正忠 jjfan at iis.sinica.edu.tw
Thu Dec 5 07:29:06 CST 2019


For 羅上堡 & 江侑倫, 

Two new datasets have been added, FGC_search.dev and FGC_zhidao.dev. They contain roughly (415+382) 'YesNo' questions and (1345+1480) 'Multi-Spans-Extraction' questions.

https://drive.google.com/drive/folders/1y0UCb-n1YKlKUsQk2GJz6iKJqQFJ25rK?usp=sharing

Best, 
jjfan 

From: "范正忠" <jjfan at iis.sinica.edu.tw> 
To: "Most-ai Contest" <Most-ai-contest at iis.sinica.edu.tw> 
Sent: Wednesday, December 4, 2019 12:37:20 PM 
Subject: Re: [Most-ai-contest] zhwiki_20190820 & pbs+cdc+cna+moti(2019.1.1~2019.11.28 broadcast news) corpus for pre-training DNN models 

This is for 羅上堡 to pre-train NN models and for 吳佳樺 to train N-gram language models.
https://drive.google.com/drive/folders/1y0UCb-n1YKlKUsQk2GJz6iKJqQFJ25rK?usp=sharing

jjfan 


From: "范正忠" <jjfan at iis.sinica.edu.tw> 
To: "Most-ai Contest" <Most-ai-contest at iis.sinica.edu.tw> 
Sent: Thursday, November 28, 2019 4:49:04 PM 
Subject: [Most-ai-contest] zhwiki_20190820 & pbs(2019.1.1~2019.11.28 broadcast news) corpus for pre-training DNN models 

Dear all, 

The following corpus can be used to improve the pre-trained parameters of DNN models and N-gram language models.
It is primarily intended for 羅上堡 to pre-train the XLNet model and for 吳佳樺 to train N-gram models.

https://drive.google.com/drive/folders/1y0UCb-n1YKlKUsQk2GJz6iKJqQFJ25rK?usp=sharing

Please feel free to let me know if you have any questions.

Best, 
jjfan 


From: "范正忠" <jjfan at iis.sinica.edu.tw> 
To: "Most-ai Contest" <Most-ai-contest at iis.sinica.edu.tw> 
Sent: Thursday, November 21, 2019 4:06:45 PM 
Subject: Release dataset 1.2 

Dear all, 

Enclosed please find the FGC_Release_1.1 dataset, which includes:
1. DRCD, ASR, Kaggle, Lee
2. FGC_release_A_train, FGC_release_A_dev, FGC_release_A_test

Please note that all data are in Chinese (cn) and in FGC format.

The following are the answer-type and answer-mode distributions for each dataset (with fewer "Misc" answer types). For All, two sets of counts are listed; the percentages are computed from the second.

All

Answer Type      Count (1) Count (2)  % of (2)
YesNo                   53        53     7.10%
Num-Measure             59        87    11.66%
Kinship                 83        83    11.13%
Person                  73        77    10.32%
Date-Duration          125       137    18.36%
Location                92        99    13.27%
Organization            83        86    11.53%
Object                  71        79    10.59%
Event                   19        19     2.55%
Misc                    88        26     3.49%
Total                  746       746   100.00%

Answer Mode                              Count (1) Count (2)  % of (2)
YesNo (yes/no questions)                        53        53     7.10%
Multi-Spans-Extraction (list questions)        101       101    13.54%
Kinship                                         75        75    10.05%
Single-Span-Extraction (single answer)         442       426    57.10%
Date-Duration                                   57        61     8.18%
Arithmetic-Operations                            3         6     0.80%
Counting                                        15        23     3.08%
Comparing-Members                                0         1     0.13%
Common-Sense                                     0         0     0.00%
Total                                          746       746   100.00%

Train

Answer Type          Count         %
YesNo                   21     5.79%
Num-Measure             43    11.85%
Kinship                 59    16.25%
Person                  39    10.74%
Date-Duration           50    13.77%
Location                44    12.12%
Organization            56    15.43%
Object                  24     6.61%
Event                   11     3.03%
Misc                    16     4.41%
Total                  363   100.00%

Answer Mode                                  Count         %
YesNo (yes/no questions)                        21     5.79%
Multi-Spans-Extraction (list questions)         40    11.02%
Kinship                                         59    16.25%
Single-Span-Extraction (single answer)         208    57.30%
Date-Duration                                   18     4.96%
Arithmetic-Operations                            3     0.83%
Counting                                        14     3.86%
Comparing-Members                                0     0.00%
Common-Sense                                     0     0.00%
Total                                          363   100.00%

Dev

Answer Type          Count         %
YesNo                   17     8.13%
Num-Measure             23    11.00%
Kinship                 15     7.18%
Person                  21    10.05%
Date-Duration           56    26.79%
Location                34    16.27%
Organization            14     6.70%
Object                  18     8.61%
Event                    2     0.96%
Misc                     9     4.31%
Total                  209   100.00%

Answer Mode                                  Count         %
YesNo (yes/no questions)                        17     8.13%
Multi-Spans-Extraction (list questions)         26    12.44%
Kinship                                         12     5.74%
Single-Span-Extraction (single answer)         117    55.98%
Date-Duration                                   29    13.88%
Arithmetic-Operations                            2     0.96%
Counting                                         6     2.87%
Comparing-Members                                0     0.00%
Common-Sense                                     0     0.00%
Total                                          209   100.00%

Test

Answer Type          Count         %
YesNo                   15     8.62%
Num-Measure             21    12.07%
Kinship                  9     5.17%
Person                  17     9.77%
Date-Duration           31    17.82%
Location                21    12.07%
Organization            16     9.20%
Object                  37    21.26%
Event                    6     3.45%
Misc                     1     0.57%
Total                  174   100.00%

Answer Mode                                  Count         %
YesNo (yes/no questions)                        15     8.67%
Multi-Spans-Extraction (list questions)         35    20.23%
Kinship                                          4     2.31%
Single-Span-Extraction (single answer)         101    58.38%
Date-Duration                                   14     8.09%
Arithmetic-Operations                            1     0.58%
Counting                                         3     1.73%
Comparing-Members                                0     0.00%
Common-Sense                                     0     0.00%
Total                                          173   100.00%

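
As a quick sanity check on the tables above, the percentage rows can be recomputed from the counts. The sketch below does this for the Train answer-type counts only; the function name `distribution` is illustrative and is not part of the release.

```python
from typing import Dict


def distribution(counts: Dict[str, int]) -> Dict[str, str]:
    """Return each category's share of the total as a percentage string."""
    total = sum(counts.values())
    return {name: f"{100 * n / total:.2f}%" for name, n in counts.items()}


# Answer-type counts for the Train split, copied from the table above.
train_answer_type = {
    "YesNo": 21, "Num-Measure": 43, "Kinship": 59, "Person": 39,
    "Date-Duration": 50, "Location": 44, "Organization": 56,
    "Object": 24, "Event": 11, "Misc": 16,
}

train_share = distribution(train_answer_type)
print(sum(train_answer_type.values()))  # 363
print(train_share["Kinship"])           # 16.25%
```

The same function can be applied to the Dev and Test counts to check their percentage rows.
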
Best, 
jjfan 





From: "范正忠" <jjfan at iis.sinica.edu.tw> 
To: "Most-ai Contest" <Most-ai-contest at iis.sinica.edu.tw> 
Sent: Tuesday, November 19, 2019 5:04:12 PM 
Subject: Re: [Most-ai-contest] refinement of anstype and ansmode for fgc-2019 dataset 

Dear all, 

Enclosed please find the FGC_Release_1.1 dataset, which includes:
1. DRCD, ASR, Kaggle, Lee
2. FGC_release_A_train_1.1, FGC_release_A_dev_1.1, FGC_release_A_test_1.1
Please use this dataset as the standard benchmark.
Also note that you can use item 1 + FGC_release_A_train_1.1 as your training set, FGC_release_A_dev_1.1 as the development set, and FGC_release_A_test_1.1 as the test set.
Please feel free to let me know if you have any questions.
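
The suggested split could be assembled along these lines. This is only a sketch: the `.json` file names and the assumption that each file holds a JSON list of records are hypothetical, since the actual file layout of the release is not described in this message.

```python
import json
from pathlib import Path
from typing import Dict, List

# Hypothetical file names; adjust them to the files actually in the release.
EXTRA_TRAIN = ["DRCD.json", "ASR.json", "Kaggle.json", "Lee.json"]
SPLITS = {
    "train": EXTRA_TRAIN + ["FGC_release_A_train_1.1.json"],
    "dev": ["FGC_release_A_dev_1.1.json"],
    "test": ["FGC_release_A_test_1.1.json"],
}


def load_split(data_dir: Path, file_names: List[str]) -> List[dict]:
    """Concatenate the records of several files (assumed to be JSON lists)."""
    records: List[dict] = []
    for name in file_names:
        with open(data_dir / name, encoding="utf-8") as handle:
            records.extend(json.load(handle))
    return records


def load_benchmark(data_dir: Path) -> Dict[str, List[dict]]:
    """Return the train, dev, and test record lists keyed by split name."""
    return {split: load_split(data_dir, names) for split, names in SPLITS.items()}
```

Keeping dev and test restricted to the FGC files means reported scores stay comparable across teams, whatever extra data goes into training.
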

Best, 
jjfan 


From: "范正忠" <jjfan at iis.sinica.edu.tw> 
To: "Most-ai Contest" <Most-ai-contest at iis.sinica.edu.tw> 
Sent: Monday, November 18, 2019 8:56:28 AM 
Subject: Re: [Most-ai-contest] refinement of anstype and ansmode for fgc-2019 dataset 

Dear all, 

Please send me your list of errors in the Answer-Type and Answer-Mode annotations by the end of today.
I will then split the FGC release dataset into training, development, and test sets and release them tomorrow for your benchmark.

Thanks. 

Best, 
jjfan 


From: "Chiangyulun0914" <chiangyulun0914 at iis.sinica.edu.tw> 
To: "Most-ai Contest" <Most-ai-contest at iis.sinica.edu.tw> 
Sent: Wednesday, November 13, 2019 5:19:49 PM 
Subject: Re: [Most-ai-contest] refinement of anstype and ansmode for fgc-2019 dataset 

Hi all,

Files should preferably be in .xlsx or .csv format. The attachment is an example. Thanks!
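
For anyone preparing the correction list as a .csv file, a minimal sketch follows. The column names and the question ID `Q001` are hypothetical; the attached example defines the real format and file name. The helper copies an unchanged label into its refined field, so both refined values are always filled in.

```python
import csv
from typing import Dict, List, Optional

# Hypothetical column names; follow the attached example for the real ones.
FIELDS = ["qid", "anstype_original", "ansmode_original",
          "anstype_refined", "ansmode_refined"]


def correction_row(qid: str, anstype: str, ansmode: str,
                   new_anstype: Optional[str] = None,
                   new_ansmode: Optional[str] = None) -> Dict[str, str]:
    """Build one row; a label that needs no fix is repeated as its refined value."""
    return {
        "qid": qid,
        "anstype_original": anstype,
        "ansmode_original": ansmode,
        "anstype_refined": new_anstype or anstype,
        "ansmode_refined": new_ansmode or ansmode,
    }


def write_corrections(path: str, rows: List[Dict[str, str]]) -> None:
    """Write the correction rows to a UTF-8 CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Example: only the answer type changes; the answer mode is carried over.
rows = [correction_row("Q001", "Misc", "Single-Span-Extraction",
                       new_anstype="Person")]
```

Calling `write_corrections("corrections.csv", rows)` would then produce the file to send back.
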

江侑倫 (Yu-Lun Chiang)
Natural Language Understanding Lab
Institute of Information Science, Academia Sinica
Mobile: +886-975279013 (Taiwan)


On Wed, Nov 13, 2019 at 4:57 PM, 江侑倫 <chiangyulun0914 at iis.sinica.edu.tw> wrote:

Hi all,

The fgc-2019 dataset most recently released by Dr. Fan may contain some errors, because anstype and ansmode were labeled with rule-based methods. If you find critical errors while using the dataset, please note them down as you go and, following the format and file name of the attached file, send the anstype and ansmode values before and after correction back to Dr. Fan so that he can update the dataset. Also attached are the latest anstype and ansmode categories, which Dr. Fan released on 2019-11-12.

If only one of anstype and ansmode needs to be corrected, please still enter the other, unchanged value in the refined column of the attached file, so that Dr. Fan can update the dataset directly from the information in the refined column.

Thanks!

江侑倫 (Yu-Lun Chiang)


_______________________________________________ 
Most-ai-contest mailing list 
Most-ai-contest at iis.sinica.edu.tw 
https://www.iis.sinica.edu.tw/mailman/listinfo/most-ai-contest 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: jjfan_anstype&ansmode.PNG
Type: image/png
Size: 54300 bytes
Desc: not available
URL: <http://www.iis.sinica.edu.tw/pipermail/most-ai-contest/attachments/20191205/cdbeb492/attachment-0001.png>

