ChaIME: Stochastic Input Method Editor
-- Japanese Input Method with Google N-gram Language Model --


As Google Japanese Input was released in December 2009, I believe the advantage of using the web as a corpus has been confirmed. Therefore, I stopped providing a demo page for ChaIME on 16 February 2010. Thank you for your cooperation. We will prepare a standalone version of ChaIME in the near future.


Recent growth of the WWW allows internet users to access massive amounts of text data, and Japanese people have become more and more familiar with typing Japanese documents. Compared to the good old days of word processors like the Fujitsu OASYS and the Toshiba Rupo, the Japanese input methods that come with Windows and Mac have become sophisticated and cause far less stress when writing Japanese.

However, Japanese input methods in open source environments such as Linux and FreeBSD have stayed the same. For example, Canna and Wnn, the most famous input methods on Unix, have been in use for a long time. Not until 2001 did input methods with new algorithms, such as Anthy, emerge and come into use. Notably, Anthy became the de facto standard open source input method after 2005.

Although Anthy became the most modern input method by introducing a probabilistic language model in 2005 and a discriminative learning framework in 2006, it is based on cannadic, which was originally designed and built for Canna. Also, Anthy needs hand-tuning of word costs and the connection graph, which makes development even harder.

Thus, the proposed input method ChaIME (pronounced "chime") uses a stochastic language model estimated from a large corpus (Google Japanese N-gram) to overcome the cost of manual annotation. It also alleviates the data sparseness problem by increasing the size of the corpus. It is maintenance-free in the sense that users do not need to specify the part of speech of a word when registering it in their user dictionary.
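The core of this approach, finding the most probable kanji sentence for a kana string under a word bigram model, can be sketched as a Viterbi search over a word lattice. The tiny dictionary and log-probabilities below are made-up illustrations, not ChaIME's actual data or implementation:

```python
# Toy reading dictionary: kana substring -> candidate surface forms.
# Entries are illustrative, not ChaIME's actual dictionary.
DICT = {
    "きかい": ["機械", "機会", "きかい"],
    "き": ["木", "気"],
    "かい": ["会", "回"],
    "を": ["を"],
    "かう": ["買う", "飼う"],
}

# Toy bigram log-probabilities log P(w2 | w1); "<s>" marks sentence start.
# Unseen bigrams fall back to a flat penalty (crude smoothing).
BIGRAM = {
    ("<s>", "機械"): -1.0, ("<s>", "機会"): -1.5,
    ("機械", "を"): -0.5, ("機会", "を"): -0.5,
    ("を", "買う"): -0.7, ("を", "飼う"): -2.0,
}
UNSEEN = -10.0

def convert(kana):
    """Viterbi search over the word lattice built from the kana string."""
    n = len(kana)
    # best[i] maps a word ending at position i -> (score, best path so far)
    best = {0: {"<s>": (0.0, [])}}
    for i in range(n):
        if i not in best:
            continue
        for j in range(i + 1, n + 1):
            for word in DICT.get(kana[i:j], []):
                for prev, (score, path) in best[i].items():
                    s = score + BIGRAM.get((prev, word), UNSEEN)
                    cur = best.setdefault(j, {})
                    if word not in cur or s > cur[word][0]:
                        cur[word] = (s, path + [word])
    # Return the best-scoring complete segmentation, if any.
    return max(best[n].values())[1] if n in best else None

print("".join(convert("きかいをかう")))  # prints 機械を買う
```

With a real n-gram model the lattice is the same; only the dictionary and probability tables grow, which is exactly why model size and lookup speed matter (see the known issues below).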

Sample converted sentences

We compared sample Kana-Kanji converted sentences taken from an error analysis of ATOK 2007 (one of the best-known Japanese input methods).

ChaIME ATOK 2007 Anthy 9100c AjaxIME
請求書の支払日時 請求書の市は来日時 請求書の支払い日時 請求書の支払いに知事
近く市場調査を行う。 知覚し冗長さを行う。 近く市場調査を行う。 近く市場調査を行う。
その後サイト内で その五歳都内で その後サイト内で その後再都内で
去年に比べ高い水準だ。 去年に比べた海水順だ。 去年に比べたかい水準だ。 去年に比べ高い水準だ。
昼イチまでに書類作っといて。 昼一までに書類津くっといて。 昼一までに書類作っといて。 肥留市までに書類作っといて。
そんな話信じっこないよね。 そんな話心十個内よね。 そんなはな視診時っこないよね。 そんな話神事っ子ないよね。
初めっからもってけばいいのに。 恥メッカら持って毛羽いいのに。 恥メッカ羅持ってケバ飯野に。 始っから持ってけば良いのに。
熱々の肉まんにぱくついた。 熱々の肉まん二泊着いた。 あつあつの肉まん2泊付いた。 熱熱の肉まんにぱくついた。

AjaxIME is yet another browser-based Japanese input method. The proposed method achieves much better accuracy because it uses far more data (roughly 100 times as much) to build the Kana-Kanji conversion model. It is not fair to compare it with ATOK 2007 on these examples, since they are taken from an error analysis of ATOK 2007. (ATOK 2007 usually gives reasonable results.)

Since the language model is constructed from a Web corpus while the Kana-Kanji conversion model is estimated from newspapers, it is easy for ChaIME to convert the first four sentences, while the last four are rather difficult. This could be solved by using a corpus in a more informal style, such as blog data, to build the Kana-Kanji conversion model.
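One standard way to combine a formal (newspaper) model with an informal (blog) model is linear interpolation; this is a general smoothing technique, not necessarily what ChaIME does, and the probabilities and weight below are made-up numbers for illustration:

```python
# Mix two conditional probabilities P_news(w | h) and P_blog(w | h).
# The weight lam would normally be tuned on held-out text.
def interpolate(p_news, p_blog, lam=0.7):
    """P(w | h) = lam * P_news(w | h) + (1 - lam) * P_blog(w | h)."""
    return lam * p_news + (1 - lam) * p_blog

# A colloquial word that is rare in newspapers but common in blogs
# still receives a usable probability from the mixture:
print(interpolate(0.0001, 0.02))  # ~ 0.00607
```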

Known issues

  1. Reduce the size of the bigram dictionary (it is 2GB right now, but could be less than several MB if we use a class bigram instead of a word bigram)
  2. Speed up dictionary lookup by converting the dictionary into a DFA
  3. Support operations at a finer granularity (currently only sentence-level operation is supported; it could be per word or per "bunsetsu" chunk)
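Issue 1 relies on the standard class-based decomposition P(w2 | w1) ≈ P(c2 | c1) × P(w2 | c2): only a small class-transition table plus one emission probability per word need to be stored, instead of a probability for every word pair. The classes and numbers below are made up for illustration:

```python
# Class bigram sketch: instead of storing P(w2 | w1) for every word pair,
# store P(c2 | c1) over a small set of classes plus P(w | c) per word.
# Classes and probabilities here are invented for illustration.
WORD_CLASS = {"機械": "NOUN", "を": "PART", "買う": "VERB"}
CLASS_BIGRAM = {("NOUN", "PART"): 0.4, ("PART", "VERB"): 0.3}
EMISSION = {"機械": 0.01, "を": 0.5, "買う": 0.02}  # P(word | its class)

def class_bigram_prob(w1, w2):
    """Approximate P(w2 | w1) ~= P(c2 | c1) * P(w2 | c2)."""
    c1, c2 = WORD_CLASS[w1], WORD_CLASS[w2]
    return CLASS_BIGRAM.get((c1, c2), 0.0) * EMISSION[w2]

# Storage shrinks from O(V^2) word pairs to O(C^2) class pairs
# plus O(V) emission entries, at some cost in modeling accuracy.
print(class_bigram_prob("機械", "を"))  # 0.4 * 0.5 = 0.2
```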

Google Japanese N-gram can be purchased from Gengo-Shigen-Kyokai (GSK). However, it is prohibited for commercial use and available only for academic purposes. There is no plan to distribute the data from this site. We plan to release statistical language models estimated from web text crawled by ourselves.


This input method was partly supported by the Creative and International Competitiveness Project 2007, Nara Institute of Science and Technology, Japan.

I thank Shinsuke Mori for inviting me to develop free and open source Japanese input methods during the week of the 2007 annual meeting of the Association for Natural Language Processing. I appreciate all his help with the algorithms, implementation, and resources for building Japanese input methods. I learned a lot from his profound knowledge of Japanese input methods. Without him, I would not have developed this software.

Hiroyuki Tokunaga (Preferred Infrastructure), co-developer of the input method, greatly improved the quality of the software. I am always impressed by his work every time he shows me his ideas, along with some code.

Taku Kudo (Google) and Hideto Kazawa (Google) created Google Japanese N-gram. It literally opened up the possibility of statistical Japanese input methods. I thank them both for releasing such a valuable resource. Taku also constantly gives me many tips for implementing Japanese input methods.

Masayuki Asahara, as a tutor, let me enjoy the project of developing a statistical input method. I really liked it. He regularly maintains NAIST-jdic (previously known as IPADic), which is used as the word dictionary in ChaIME.

Anthy's founder, Yusuke Tabata, is one of my oldest friends. Our friendship goes back to the time when I was an official developer of Gentoo Linux. He invited me to the Input Method Party and introduced me to many input method developers. I, in turn, would like to contribute recent advances in natural language processing to free and open source Japanese input methods.

Mamoru Komachi <>
Tokyo Metropolitan University