The HANKU Corpus
In 2016, our academic department began building the monolingual Chinese corpus Hanku. Hanku is a synchronic corpus of the Chinese language, available via a web interface on the website <konfuciovinstitut.sk>. The corpus was created with the support of the Chinese National Office for Teaching Chinese as a Foreign Language (2016).
The corpus comprises approximately 900 million tokens (words) and is used in linguistic research and in language teaching. It is accessible via a web interface with an advanced corpus manager. Texts in the corpus carry style and genre annotation, words carry part-of-speech (POS) annotation, and syntactic annotation is in preparation.
The Hanku is a monolingual, synchronic corpus of Chinese (in simplified Chinese characters), available via a web interface on the website of the Confucius Institute at Comenius University in Bratislava: <konfuciovinstitut.sk>. Building of the corpus began in spring 2016 and was supported by the Chinese National Office for Teaching Chinese as a Foreign Language in 2016.
The corpus is available here (March 2022). The corpus can be used free of charge, but registration is required. Please send us the following information (via email to lubos.gajdos(at)uniba.sk):
- Your Name
- Your Academic Affiliation
- Your E-mail
By registering you agree to use the Hanku corpus and its resources solely for study, research, teaching and other non-commercial purposes.
Learn how to use the corpus and more or see the videos.
The Hanku uses an open-source version of the Sketch Engine corpus manager (NoSketch Engine) as well as open-source tools for tokenization (ZPar) and POS tagging (the Penn Chinese Treebank). The corpus has reached a size of 880 million tokens (June 2016) and is equipped with bibliographic, POS, style and genre, and phonetic annotation. Syntactic annotation is in preparation (autumn 2016). So far, the Hanku corpus carries the following style and genre annotation: (1) web-zh (texts from the Internet), (2) zh-law (legal texts from the PRC; texts of laws and regulations), (3) . Texts from other registers will follow (e.g. professional texts, texts of modern Chinese literature, etc.).
Structure of the Corpus
The logical structure of the corpus (as presented to the end user) is based on documents. Typically, one document corresponds to one webpage (referenced by a URL), a newspaper article, a book, etc. The set of documents forms the corpus directly; there is no higher level in the hierarchy. Lower levels include text structures (paragraphs, sentences) and tokens with their positional attributes.
The basic unit of the corpus is a token: one single position in the text. Traditionally in corpus linguistics, one token represents one word in the source text, together with additional information such as lemma, part of speech or syntactic function. For Chinese, there are two possibilities: to tokenize a text into characters (Hanzi) or into words. Although the division of text into words is often fuzzy and subject to individual interpretation, it was decided to tokenize texts into words. In the Hanku, each token is annotated with its part of speech (POS), its composition into characters and its Hanyu pinyin transcription. Both the POS annotation and the tokenization are the results of automatic processing.
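The hierarchy described above (documents, then paragraphs and sentences, then tokens with positional attributes) can be pictured with a small data model. This is only an illustrative sketch: the class and field names are hypothetical and do not reflect the actual storage format used by the Hanku or by NoSketch Engine.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Token:
    """One corpus position: a word with its positional attributes."""
    word: str    # word form produced by the tokenizer
    pos: str     # part-of-speech tag
    pinyin: str  # Hanyu pinyin transcription

    @property
    def characters(self) -> List[str]:
        # Composition of the word into individual characters (Hanzi).
        return list(self.word)


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)


@dataclass
class Paragraph:
    sentences: List[Sentence] = field(default_factory=list)


@dataclass
class Document:
    """Top-level unit; the corpus is simply a set of documents."""
    source: str  # hypothetical field, e.g. the URL of a webpage
    paragraphs: List[Paragraph] = field(default_factory=list)

    def tokens(self) -> Iterator[Token]:
        # Flatten the structure into a stream of token positions.
        for paragraph in self.paragraphs:
            for sentence in paragraph.sentences:
                yield from sentence.tokens


# Hand-made example document (not taken from the corpus).
doc = Document(
    source="http://example.org/page",
    paragraphs=[Paragraph([Sentence([
        Token("我", "PN", "wǒ"),
        Token("学习", "VV", "xuéxí"),
        Token("汉语", "NN", "hànyǔ"),
    ])])],
)
```

Iterating over `doc.tokens()` yields the three positions in order, and `Token("学习", "VV", "xuéxí").characters` splits the word into its two characters.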
The corpus may be used in linguistic research and language teaching. Common usage scenarios for the Hanku in language teaching are as follows:
- basic word usage – KWIC
- collocation preferences of a word
- sentence pattern search
- register-specific usage of a word
- register preferences among synonyms, etc.
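To make the first two scenarios concrete, the sketch below shows what a KWIC (keyword in context) display and a raw collocate count amount to on an already tokenized text. It is a simplified illustration with hypothetical function names and a hand-made sample sentence, not the corpus manager's implementation; in particular, the corpus manager ranks collocates with statistical association measures rather than raw counts.

```python
from collections import Counter
from typing import List, Tuple


def kwic(tokens: List[str], keyword: str, window: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def collocates(tokens: List[str], keyword: str, window: int = 2) -> Counter:
    """Count the words occurring within `window` positions of the keyword."""
    counts: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Hand-made, pre-tokenized sample (not taken from the corpus).
sample = ["我", "在", "大学", "学习", "汉语", "，", "他", "也", "学习", "汉语", "。"]

hits = kwic(sample, "学习", window=2)
neighbours = collocates(sample, "学习", window=2)
```

Here `hits` contains two concordance lines for 学习, and `neighbours` shows 汉语 as its most frequent neighbour in the sample.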
The corpus system (under the query type “lema”) allows a user to search for Chinese words or characters by typing them in Hanyu pinyin, with or without tones.
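One way to picture a search that works both with and without tones is to compare queries against a tone-stripped form of the transcription. The sketch below is an assumption about how such matching could be done, not a description of how the Hanku actually implements it; the function names are hypothetical. It removes only the four tone diacritics and keeps the diaeresis of ü.

```python
import unicodedata

# Combining marks for the four tones: macron, acute, caron, grave.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tones(pinyin: str) -> str:
    """Remove tone diacritics but keep the diaeresis of ü."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def pinyin_matches(query: str, token_pinyin: str) -> bool:
    """A toneless query matches any tone; a toned query must match exactly."""
    query = unicodedata.normalize("NFC", query)
    token_pinyin = unicodedata.normalize("NFC", token_pinyin)
    if strip_tones(query) == query:  # the query carries no tone marks
        return query == strip_tones(token_pinyin)
    return query == token_pinyin
```

With this sketch, the query `hanyu` matches a token transcribed as `hànyǔ`, while the wrongly toned query `hányǔ` does not.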
KWIC and collocation searches are basic tasks with which an ordinary corpus user is familiar. Using a regular expression, e.g. to search for a sentence pattern, might be regarded as advanced usage. Results may be saved directly from the interface as txt or XML files.
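As an illustration of a sentence-pattern search, the sketch below assembles a query in CQL, the query language of Sketch Engine and NoSketch Engine, for the 把 construction: the word 把, followed by up to five arbitrary tokens, followed by a verb. The attribute names `word` and `tag` and the tag value `VV` (a verb tag in the Penn Chinese Treebank tag set) are assumptions; the attribute names actually exposed by the Hanku may differ. The helper functions are hypothetical and do not escape quotation marks inside values.

```python
def cql_token(**attrs: str) -> str:
    """Render one CQL token position, e.g. [word="把"]; no attributes gives []."""
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def cql_gap(min_len: int, max_len: int) -> str:
    """Render a run of arbitrary tokens of bounded length, e.g. []{0,5}."""
    return f"[]{{{min_len},{max_len}}}"


# 把, then 0 to 5 tokens of any kind, then a verb (assumed attribute names).
ba_pattern = " ".join([
    cql_token(word="把"),
    cql_gap(0, 5),
    cql_token(tag="VV"),
])
# ba_pattern is: [word="把"] []{0,5} [tag="VV"]
```

The resulting string would be entered in the CQL query field of the interface.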
If you use this corpus in your research, please refer to:
Gajdoš, Ľ., Garabík, R., Benická, J. The New Chinese Webcorpus Hanku – Origin, Parameters, Usage. In Studia Orientalia Slovaca, Vol. 15, No. 1 (2016), pp. 21–33.
If you have any questions, please feel free to contact me via e-mail at lubos.gajdos(at)uniba.sk
Available (not only) Chinese corpora
Chinese-language corpora available via web interface:
Corpus of German legal texts