The Chinese Corpus Hanku
The Hanku is a monolingual, synchronic corpus of Chinese (in simplified Chinese characters). It is accessible through a web interface on the website of the Confucius Institute at Comenius University in Bratislava: <konfuciovinstitut.sk>. Building of the corpus began in spring 2016 and was supported in 2016 by the Chinese National Office for Teaching Chinese as a Foreign Language.
The corpus is available here. The corpus can be used free of charge, but registration is required. Please send us the following information (via email to lubos.gajdos(at)uniba.sk):
- Your Name
- Your Academic Affiliation
- Your E-mail
By registering you agree to use the Hanku corpus and its resources solely for study, research, teaching and other non-commercial purposes.
The Hanku uses an open-source version of the Sketch Engine corpus manager (NoSketch Engine) as well as open-source tools for tokenization (ZPar) and POS tagging (based on the Penn Chinese Treebank). As of June 2016 the corpus had reached a size of 800 million tokens, and it is equipped with bibliographic, POS, style and genre, and phonetic annotation. Syntactic annotation is in preparation (autumn 2016). So far, the Hanku corpus contains two subcorpora: (1) zh-law (legal texts from the PRC: texts of laws and regulations) and (2) web-zh (texts from the Internet). Texts from other registers will follow (e.g. professional texts, texts of modern Chinese literature, etc.).
Structure of the Corpus
The logical structure of the corpus (as presented to the end user) is based on documents. Typically, one document corresponds to one webpage (referenced by a URL), a newspaper article, a book, etc. The set of documents forms the corpus directly; there is no higher hierarchy level. Lower hierarchy levels include text structures (paragraphs, sentences) and tokens with their positional attributes.
The basic unit of the corpus is a token: one single position in the text. Traditionally in corpus linguistics, one token represents one word in the source text, with additional information such as lemma, part of speech or syntactic function. For Chinese, there are two possibilities: to tokenize a text into characters (Hanzi) or into words. Although the division of text into words is often fuzzy and subject to individual interpretation, it was decided to tokenize texts into words. In the Hanku, each token is annotated with its part of speech (POS), its composition into characters and its Hanyu pinyin transcription. The POS annotation and the tokenization are the results of automatic processing.
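The document, sentence and token hierarchy described above can be sketched in Python. This is only an illustration, not the Hanku's actual schema: NoSketch Engine reads corpora in a "vertical" format (one token per line, tab-separated attributes, XML-like tags for structures), but the attribute order used here (word, POS, pinyin), the class and function names, and the sample text with its URL are all assumptions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Token:
    word: str    # the token as produced by word segmentation
    pos: str     # part-of-speech tag
    pinyin: str  # Hanyu pinyin transcription

    @property
    def characters(self) -> list[str]:
        # Composition of the word into single characters (Hanzi).
        return list(self.word)


@dataclass
class Document:
    attrs: dict[str, str]  # bibliographic metadata, e.g. the source URL
    sentences: list[list[Token]] = field(default_factory=list)


def parse_vertical(text: str) -> list[Document]:
    """Parse a small vertical-format sample into documents, sentences, tokens."""
    docs: list[Document] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("<doc"):
            attrs = dict(re.findall(r'(\w+)="([^"]*)"', line))
            docs.append(Document(attrs=attrs))
        elif line == "<s>":
            docs[-1].sentences.append([])
        elif line.startswith("<"):
            continue  # other structural tags (<p>, </s>, </doc>, ...) are skipped
        else:
            word, pos, pinyin = line.split("\t")
            docs[-1].sentences[-1].append(Token(word, pos, pinyin))
    return docs


# Hypothetical sample: one document containing one paragraph and one sentence.
SAMPLE = (
    '<doc url="http://example.org/page1">\n'
    "<p>\n<s>\n"
    "我\tPN\twǒ\n"
    "学习\tVV\txuéxí\n"
    "汉语\tNN\thànyǔ\n"
    "。\tPU\t。\n"
    "</s>\n</p>\n</doc>\n"
)

docs = parse_vertical(SAMPLE)
```

Here `docs[0].sentences[0]` holds four tokens, and the `characters` property of the second token splits 学习 into its two characters, mirroring the word-level tokenization with character-level composition described above.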
The corpus may be used in linguistic research and language teaching. Common usage scenarios of the Hanku in language teaching are as follows:
- basic word usage – KWIC
- collocation preferences of a word
- sentence pattern search
- register-specific usage of a word
- register-specific preferences among synonyms, etc.
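The first two scenarios can be illustrated offline with a few lines of Python. This is a minimal sketch over an invented, already tokenized sentence; it is not how the corpus manager computes its results. The function names are hypothetical, and the second function returns raw co-occurrence counts rather than a statistical association score.

```python
from collections import Counter


def kwic(tokens: list[str], keyword: str, width: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def neighbours(tokens: list[str], keyword: str, width: int = 3) -> Counter:
    """Count the words that occur within `width` tokens of the keyword."""
    counts: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - width):i])
            counts.update(tokens[i + 1:i + 1 + width])
    return counts


# Invented sample, segmented into words.
tokens = ["我", "学习", "汉语", "，", "他", "也", "学习", "汉语", "语法", "。"]

for left, key, right in kwic(tokens, "学习", width=2):
    print(f"{left:>8} [{key}] {right}")

print(neighbours(tokens, "学习", width=2).most_common(1))
```

In this sample the keyword 学习 occurs twice, and 汉语 is its most frequent neighbour, which is the kind of observation a collocation search makes at corpus scale.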
The corpus system (under the query type “lemma”) allows a user to search for Chinese words or characters by typing them in Hanyu pinyin, with or without tones.
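One way such tone-insensitive matching could work is sketched below. How the Hanku actually stores and compares pinyin is not stated here, so this is an assumption throughout: the sketch supposes that transcriptions carry tone diacritics, and the function names are hypothetical. Only the four tone marks are stripped, so the diaeresis of ü is preserved.

```python
import unicodedata

# Combining marks for the four pinyin tones: grave, acute, macron, caron.
TONE_MARKS = {"\u0300", "\u0301", "\u0304", "\u030c"}


def strip_tones(pinyin: str) -> str:
    """Remove tone diacritics but keep other marks (e.g. the dots of ü)."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def has_tones(pinyin: str) -> bool:
    """True if the string contains at least one tone diacritic."""
    return strip_tones(pinyin) != unicodedata.normalize("NFC", pinyin)


def pinyin_matches(query: str, token_pinyin: str) -> bool:
    """Match exactly if the query has tones, otherwise ignore tones."""
    if has_tones(query):
        return (unicodedata.normalize("NFC", query)
                == unicodedata.normalize("NFC", token_pinyin))
    return unicodedata.normalize("NFC", query) == strip_tones(token_pinyin)
```

Under these assumptions, a query typed as `hanyu` matches a token transcribed as `hànyǔ`, while a query with tones such as `hányǔ` matches only tokens carrying exactly those tones.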
KWIC and collocation searches are basic tasks with which an ordinary corpus user is familiar. Using a regular expression, e.g. to search for a sentence pattern, might be regarded as advanced usage. Results may be saved directly from the interface as txt or XML files.
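As an illustration of a sentence-pattern search, the sketch below looks for the 把 (ba) construction in a POS-tagged sentence using Python's `re` module. It only approximates the idea on exported text and is not the query syntax of the web interface. The sentence, its `word/TAG` layout and the function name are assumptions; the tags follow Penn Chinese Treebank conventions (BA for 把, VV for verbs). In the corpus manager itself, a comparable query in CQL might look like the one shown in the comment, assuming the POS attribute is called `tag`.

```python
import re

# A comparable CQL query (attribute name "tag" is an assumption):
#   [word="把"] []{0,5} [tag="VV"]

# Invented sample sentence as space-separated word/TAG pairs.
sentence = "他/PN 把/BA 书/NN 放/VV 在/P 桌子/NN 上/LC 。/PU"

# 把 tagged BA, then up to five arbitrary tokens, then the nearest verb (VV).
BA_PATTERN = re.compile(r"把/BA(?: \S+/\S+){0,5}? (\S+)/VV(?= |$)")


def find_ba_verbs(tagged: str) -> list[str]:
    """Return the verbs that follow 把 within the allowed distance."""
    return BA_PATTERN.findall(tagged)


print(find_ba_verbs(sentence))
```

The lazy quantifier `{0,5}?` makes the pattern stop at the first verb after 把, so for the sample sentence the function returns the verb 放.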
If you use this corpus in your research, please refer to:
If you have any questions, please feel free to contact me via e-mail at lubos.gajdos(at)uniba.sk
Corpora available through a web-based interface: