本語料庫的建立旨在呈現非華語學生在小學階段中文書面語學習的歷程,以便為教育工作者揭示學生的階段性語言面貌,為課程編排、教學規劃和教材編寫提供參考。
語料庫的語料主要取自部分本地小學一年級至六年級的非華語學生所產出的書面語,包括工作紙、作文、測驗、考試等課業中所包含的語料,字數為30萬字。
語料經過篩選、分類,錄入後,以《漢語水平等級標準與語法等級大綱》和《現代漢語詞典》(2005版)作為標準,進行詞性、短語、句式、標點、錯字等標註。漢字偏誤標註中,別字會根據錯誤類型分類,錯字會以截圖顯示。
用戶在使用語料庫時,可通過輸入關鍵詞,搜索字、詞、語法項目、字詞偏誤、字頻與詞頻等信息。部分搜索功能在試用階段暫不提供,日後會逐步開放。
本系統由香港理工大學中文及雙語學系陳瑞端教授所帶領的研究團隊開發,團隊成員包括劉藝、苗傳江、蔡靄琳(前期)、李英男、梁鑫。項目得到語文教育及研究常務委員會資助,特此鳴謝。
This learner corpus aims to capture the written interlanguage of non-Chinese speaking (NCS) students learning Chinese in local primary schools in Hong Kong. By showing educators what learners’ language looks like at each stage, it offers a reference for curriculum planning, the design of teaching and learning activities, and the development of teaching materials.
The corpus draws mainly on written work produced by NCS students in grades 1 to 6 at a number of local primary schools, including worksheets, compositions, quizzes and examinations. It contains 300,000 characters in total.
After the texts were selected, classified and keyed in, they were annotated for part of speech, phrases, sentence patterns, punctuation and erroneous characters, following the HSK Grading Standards and Grammar Outline and the Modern Chinese Dictionary (fifth edition, 2005). In the annotation of character errors, misused characters (別字, a real character written in place of the intended one) are classified by error type, while miswritten characters (錯字, malformed characters) are displayed as screenshots.
Users can enter keywords to search for information on characters, words, grammatical items, punctuation, erroneous characters, and character and word frequency. Some search functions are not available during the trial stage and will be opened gradually.
This system was developed by a research team led by Prof. Chan Shui-duen of the Department of Chinese and Bilingual Studies at the Hong Kong Polytechnic University. Team members include Liu Yi, Miao Chuanjiang, Choi Oi Lam (early phase), Li Yingnan, and Liang Xin. The project is funded by the Standing Committee on Language Education and Research, whose support is gratefully acknowledged.