Since Google Japanese Input Method was released in December 2009, I consider the advantage of using the web as a corpus to be confirmed. I therefore stopped providing the demo page for ChaIME on 16 February 2010. Thank you for your cooperation. We will prepare a standalone version of ChaIME in the near future.
The recent growth of the WWW gives internet users access to massive amounts of text data. Japanese people are becoming more and more familiar with typing Japanese documents. Compared with the good old days of word processors such as Fujitsu OASYS and Toshiba Rupo, the Japanese input methods that ship with Windows and Mac have become sophisticated and make writing Japanese less stressful.
However, Japanese input methods in open source environments such as Linux and FreeBSD have stayed much the same. For example, Canna and Wnn, the best-known input methods on Unix, have been in use for a long time. It was not until 2001 that input methods with new algorithms, such as Anthy, emerged and came into use. Notably, Anthy became the de facto standard open source input method after 2005.
Although Anthy became the most modern input method by introducing a probabilistic language model in 2005 and a discriminative learning framework in 2006, it is based on cannadic, which was originally designed and built for Canna. Anthy also needs hand-tuning of word costs and the connection graph, which makes development even harder.
Thus, the proposed input method ChaIME (pronounced "chime") uses a stochastic language model estimated from a large corpus (Google Japanese N-gram) to overcome the cost of manual annotation. It also alleviates the data sparseness problem by increasing the size of the corpus. It is maintenance-free in that users do not need to determine the part of speech of a word when they register it in the user dictionary.
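To make the idea of estimating a language model from N-gram counts concrete, here is a minimal sketch. It is not ChaIME's actual estimator: the counts are invented toy values standing in for a large N-gram table, and the add-alpha smoothing and the names `unigram_counts`, `bigram_counts`, and `bigram_prob` are assumptions made purely for illustration.

```python
from collections import Counter

# Toy counts standing in for a large N-gram table (hypothetical values).
unigram_counts = Counter({"今日": 120, "は": 900, "晴れ": 40})
bigram_counts = Counter({("今日", "は"): 80, ("は", "晴れ"): 12})


def bigram_prob(prev, word, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | prev) from raw counts."""
    vocab_size = len(unigram_counts)
    # Counter returns 0 for unseen pairs, so unseen bigrams still get alpha.
    numerator = bigram_counts[(prev, word)] + alpha
    denominator = unigram_counts[prev] + alpha * vocab_size
    return numerator / denominator
```

No part of speech appears anywhere in this estimate, only surface word counts. With a larger corpus, fewer bigrams fall back on the smoothing term alone, which is the sense in which more data eases sparseness.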
We compared the systems on sample Kana-Kanji conversion sentences taken from an error analysis of ATOK 2007 (one of the best-known Japanese input methods).
|ChaIME||ATOK 2007||Anthy 9100c||AjaxIME|
AjaxIME is yet another browser-based Japanese input method. The proposed method achieves much better accuracy because it uses more data (about 100 times as much) to build the Kana-Kanji conversion model. It is not fair to compare it with ATOK 2007 on these examples, since they are taken from an error analysis of ATOK 2007. (ATOK 2007 usually gives moderate results.)
Since the language model is constructed from a Web corpus and the Kana-Kanji conversion model is estimated from newspapers, it is easy for ChaIME to convert the first four sentences, while the last four are rather difficult to convert. This could be solved by using a corpus in a more informal style, such as blog data, to build the Kana-Kanji conversion model.
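One common way to combine a language model with a Kana-Kanji conversion model is to search for the word sequence that maximizes P(words) × P(kana | words) using the Viterbi algorithm. The sketch below illustrates that general technique under stated assumptions and is not necessarily how ChaIME is implemented: the tables `CONVERSION` and `BIGRAM`, the `UNSEEN` floor, and all probabilities are hypothetical toy values.

```python
import math

# Hypothetical toy models; real ones would be estimated from corpora.
# Conversion model: reading -> list of (word, P(reading | word)).
CONVERSION = {
    "きょう": [("今日", 0.9), ("京", 0.1)],
    "は": [("は", 1.0), ("葉", 0.3)],
    "はれ": [("晴れ", 0.8), ("腫れ", 0.7)],
}
# Language model: P(word | previous word); "<s>" marks the sentence start.
BIGRAM = {
    ("<s>", "今日"): 0.05, ("<s>", "京"): 0.001,
    ("今日", "は"): 0.4, ("京", "は"): 0.1,
    ("は", "晴れ"): 0.02, ("は", "腫れ"): 0.001,
}
UNSEEN = 1e-6  # crude floor for bigrams missing from the toy table


def convert(kana):
    """Return the word sequence maximizing P(words) * P(kana | words)."""
    # best[i] maps the last word of a path covering kana[:i]
    # to (log score, backpointer (start index, previous word)).
    best = [dict() for _ in range(len(kana) + 1)]
    best[0]["<s>"] = (0.0, None)
    for end in range(1, len(kana) + 1):
        for start in range(end):
            reading = kana[start:end]
            for word, p_conv in CONVERSION.get(reading, []):
                for prev, (prev_score, _) in best[start].items():
                    p_lm = BIGRAM.get((prev, word), UNSEEN)
                    score = prev_score + math.log(p_lm) + math.log(p_conv)
                    if word not in best[end] or score > best[end][word][0]:
                        best[end][word] = (score, (start, prev))
    final = best[len(kana)]
    if not final:
        return []  # no segmentation covers the whole input
    # Trace back from the best-scoring final word.
    word = max(final, key=lambda w: final[w][0])
    pos, words = len(kana), []
    while word != "<s>":
        words.append(word)
        _, (pos, word) = best[pos][word]
    return words[::-1]


print(convert("きょうははれ"))
```

Because the two models are separate tables, each can be re-estimated from a different corpus (for example, a conversion model built from blog text) without touching the search code.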
Google Japanese N-gram can be purchased from Gengo-Shigen-Kyokai (GSK). However, commercial use is prohibited, and it is available only for academic purposes. There is no plan to distribute the data outside this site. We plan to release statistical language models estimated from web text that we crawl ourselves.
This input method is partly supported by the Creative and International Competitiveness Project 2007, Nara Institute of Science and Technology, Japan.
I thank Shinsuke Mori for inviting me to develop free and open source Japanese input methods during the week of the annual meeting of the Association for Natural Language Processing in 2007. I appreciate all his help with the algorithms, implementation, and resources for building Japanese input methods. I learned a lot from his profound knowledge of Japanese input methods. Without him, I would not have developed the software.
Hiroyuki Tokunaga (Preferred Infrastructure), co-developer of the input method, greatly improved the quality of the software. I am always impressed by his work whenever he shows me his ideas, along with some code.
Taku Kudo (Google) and Hideto Kazawa (Google) created Google Japanese N-gram. It literally opened up the possibility of statistical Japanese input methods. I thank both of them for releasing such a valuable resource. Taku also constantly gives me many tips on implementing Japanese input methods.
Masayuki Asahara, as my tutor, let me enjoy the project of developing a statistical input method. I really liked it. He regularly maintains NAIST-jdic (previously known as IPADic), which is used as the word dictionary in ChaIME.
Anthy's founder, Yusuke Tabata, is one of my oldest friends. Our friendship goes back to the time when I was an official developer of Gentoo Linux. He invited me to the Input Method Party and introduced me to many input method developers. I, in turn, would like to contribute recent advances in natural language processing to free and open source Japanese input methods.