Recently, the LRC has begun working with faculty on a project to develop online activities to support students in reading at the advanced level in Chinese courses. In planning for this project, one topic that has come up is readability analysis for Chinese texts.
As an English language teacher, I am familiar with a number of readability formulae for English text (including their inherent limitations), and I have often made use of a range of online tools that help instructors to analyze English text in different ways, including:
My hunch was that there must be similar tools available for Chinese text. As I do not speak or read Chinese, my ability to find them is limited, but in doing some initial research online in English, I came across this 2009 ACTFL presentation by Jun Da (Middle Tennessee State University). Da argues that a simple formula for readability may be effective: calculating the number of characters per sentence, and the sum of characters per sentence plus unique characters per sentence. The results show a differentiation of texts at the beginning, intermediate, and advanced levels; I'm not sure that this analysis would be fine-grained enough to distinguish among texts at the advanced level, as we may want to do in the current project. Nonetheless, I thought: why not put together an online tool that would allow instructors to input text and run this analysis - and here's a very basic first version:
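Da's two measures are simple enough to sketch in a few lines. Here is a rough version (in Python for brevity, though the actual tool runs in PHP; this is my own illustration, not Da's code - the sentence-splitting rule and the per-sentence averaging are assumptions on my part, since the slides don't spell out the exact procedure):

```python
import re

def readability_measures(text):
    """Sketch of the two measures from Da's presentation:
    (1) characters per sentence, and
    (2) that figure plus unique characters per sentence.
    Function and variable names are my own."""
    # Split on common Chinese end-of-sentence punctuation
    # (an assumption; Da's exact rules are not specified).
    sentences = [s for s in re.split(r"[。！？]", text) if s.strip()]
    chars_per_sentence = sum(len(s) for s in sentences) / len(sentences)
    unique_per_sentence = sum(len(set(s)) for s in sentences) / len(sentences)
    return chars_per_sentence, chars_per_sentence + unique_per_sentence
```

One nice property of counting characters rather than words is that it sidesteps the segmentation problem entirely, which is presumably part of the formula's appeal.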
When I read Da's presentation, I didn't realize that he had also developed an online vocabulary profiler tool, which pretty much renders my basic tool obsolete. (Regardless, it was a good opportunity to learn about working with multi-byte text strings!) However, if I'm interpreting the technical description correctly, the bigram/trigram/quadrigram segmentation in this analyzer is based on proximity (characters appearing next to each other), and thus could result in nonsense words. The tool does allow the instructor to review the "word" lists and select which should be included or excluded from the final analysis, which is a nice feature - but also time-consuming for the instructor. His website was developed some years ago, so this may have been the best available approach at the time, but I know that more work has been done recently in segmentation/tokenization of Chinese text. So perhaps it is worthwhile to continue the development of our own tool.
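To see why proximity-based n-grams can produce nonsense "words", consider this sketch (my own illustration of the approach as I understand it from the technical description, not the profiler's actual code):

```python
def ngrams(text, n):
    """Return every run of n adjacent characters - segmentation by
    proximity alone. Many of these spans cross real word boundaries
    and so are not actual words."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For example, `ngrams("我是美国人", 2)` yields `["我是", "是美", "美国", "国人"]` - "美国" is a real word, but "是美" merely straddles a word boundary, which is exactly the kind of candidate an instructor would have to weed out by hand.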
In that vein, here are some future directions I'd like to explore and some resources I've collected so far that may be helpful:
Expanding the character-level analysis with stroke counts and frequency:
- Jens-Ingo Farley has built an API for retrieving character information from the Unicode Consortium's Unihan database
- Though it may be better to build and query a local copy of Unihan
- Some people have generated character frequency lists using the Internet as a corpus, including:
- Jun Da has also published character frequency lists, using more carefully selected texts as corpora
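For the local-copy approach, the Unihan data files are plain tab-separated text (codepoint, field name, value), so building a lookup table is straightforward. A sketch (the sample lines are illustrative; a real tool would read the downloaded Unihan file that carries the kTotalStrokes field):

```python
def parse_unihan_strokes(lines):
    """Build a {character: total strokes} map from Unihan-format lines.
    Each data line is tab-separated: codepoint, field name, value."""
    strokes = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        codepoint, field, value = line.rstrip("\n").split("\t")
        if field == "kTotalStrokes":
            char = chr(int(codepoint[2:], 16))  # "U+4E2D" -> 中
            # Some entries carry two space-separated counts
            # (simplified vs. traditional); take the first.
            strokes[char] = int(value.split()[0])
    return strokes

# Sample lines in the Unihan file format:
sample = [
    "# sample Unihan data",
    "U+4E2D\tkTotalStrokes\t4",
    "U+8A9E\tkTotalStrokes\t14",
]
```

With the table in memory, per-character stroke counts for a whole text are a simple dictionary lookup, which could feed directly into the character-level analysis above.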
Segmenting the text into words (also called tokenization):
- Since my webserver runs PHP, I've found a couple of open-source resources for Chinese text segmentation, using algorithms that seem to be widely accepted as effective:
- Dictionary look-up, including pinyin/tone and English equivalent
- We could output vocabulary lists in popular formats for import into flashcard sites like Quizlet
- Comparing words in text with HSK lists
- Comparing words in text with word frequency or "usefulness" lists
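As a baseline for what segmentation libraries do, here is a sketch of forward maximum matching, one common dictionary-based algorithm, combined with a check against an HSK-style word list. (The dictionary and word list here are tiny stand-ins I made up for illustration; a real tool would load a full lexicon and the official HSK lists, and the PHP libraries mentioned above may well use different algorithms.)

```python
def segment_fmm(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word that matches; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        match = text[i]  # fallback: treat as a one-character word
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in dictionary:
                match = text[i:i + n]
                break
        words.append(match)
        i += len(match)
    return words

# Tiny stand-in dictionary and HSK-style level list (illustrative only):
DICT = {"我们", "美国", "学生", "是"}
HSK1 = {"我们", "是", "学生"}

tokens = segment_fmm("我们是美国学生", DICT)   # ["我们", "是", "美国", "学生"]
beyond_level = [w for w in tokens if w not in HSK1]  # ["美国"]
```

Once a text is segmented this way, the same comparison works against frequency or "usefulness" lists, and the token list itself could be exported for the flashcard formats mentioned above.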