Faculty of Arts, Comenius University Bratislava


2021–2023 Digitized Translation

Digitized Translation – CAT Tools as a Primary Step


The New Chinese Corpus Hanku, supported by: Chinese National Office for Teaching Chinese as a Foreign Language (2016)

Project directors: Ľuboš Gajdoš, PhD., prof. Jana Benická, PhD.

The aim of the project is to build a Chinese monolingual corpus. Construction began in spring 2016 and was supported that year by the Chinese National Office for Teaching Chinese as a Foreign Language.

The corpus uses an open-source version of the Sketch Engine corpus manager (NoSketch Engine) as well as open-source tools for tokenization (ZPar) and POS tagging (following the Penn Chinese Treebank). As of June 2016, the corpus contains 800 million tokens and carries bibliographic, POS, style and genre, and phonetic annotation. Syntactic annotation is in preparation (autumn 2016). So far, the Hanku corpus comprises two subcorpora: (1) zh-law (legal texts from the PRC: laws and regulations) and (2) web-zh (texts from the Internet). Texts from other registers will follow (e.g. professional texts and works of Modern Chinese literature).
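As a rough illustration of how tokenized and POS-tagged text is typically handed to a corpus manager of the Sketch Engine family, the sketch below writes one document in "vertical" format: one token per line, attributes separated by tabs, and structures marked with XML-like tags. It is not the project's actual pipeline. The function name to_vertical, the attribute name subcorpus, and the hand-tagged sample sentence are all assumptions made for this example; only the tag labels (PN, VV, NR, PU) come from the Penn Chinese Treebank tagset.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_attrs, sentences):
    """Render one document as vertical text.

    doc_attrs: dict of document-level metadata (hypothetical attribute names).
    sentences: list of sentences, each a list of (word, pos_tag) pairs.
    """
    # quoteattr escapes the value and wraps it in quotes.
    attrs = " ".join(f"{key}={quoteattr(value)}" for key, value in doc_attrs.items())
    lines = [f"<doc {attrs}>"]
    for sentence in sentences:
        lines.append("<s>")
        for word, tag in sentence:
            # One token per line: word form, then its POS tag.
            lines.append(f"{word}\t{tag}")
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


# Hypothetical, hand-tagged sample; a real pipeline would take this
# from the tokenizer and tagger output instead.
sample = [[("我", "PN"), ("爱", "VV"), ("北京", "NR"), ("。", "PU")]]
vertical = to_vertical({"subcorpus": "zh-law"}, sample)
print(vertical)
```

Keeping the subcorpus name as a document-level attribute is one plausible way to let a corpus manager separate material such as zh-law and web-zh at query time; the real corpus configuration may differ.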



