Chinese Corpus

Chinese monolingual corpus

The Hanku is a monolingual, synchronic corpus of Chinese (in simplified Chinese characters), accessible through a web interface on the website of the Confucius Institute at Comenius University in Bratislava: <konfuciovinstitut.sk>. Construction of the corpus began in spring 2016 and was supported that year by the Chinese National Office for Teaching Chinese as a Foreign Language.

The corpus can be used free of charge, but registration is required. 

The Hanku uses an open-source version of the Sketch Engine corpus manager (NoSketch Engine) as well as open-source tools for tokenization (ZPar) and POS tagging (based on the Penn Chinese Treebank). As of June 2016 the corpus has reached 800 million tokens and carries bibliographic, POS, style and genre, and phonetic annotation. Syntactic annotation is in preparation (autumn 2016). So far, the Hanku corpus distinguishes the following style and genre categories: (1) baokan (journalistic texts from the PRC), (2) falv (legal texts from the PRC: laws and regulations), (3) none (texts from the Internet). Texts from other registers will follow (e.g. professional texts and texts of Modern Chinese literature).