ChEnPat & LEnChPat: Chinese-English Parallel Patent Corpora

We have been building Chinese-English parallel corpora since 2007 at the Language Information Sciences Research Centre, City University of Hong Kong. Currently there are two corpora, each consisting of parallel sentences extracted from comparable patents.

Note: The corpora are in their preliminary stages, and we are still improving them. Any comments or suggestions are greatly appreciated.
The data collections are copyrighted by the Language Information Sciences Research Centre, City University of Hong Kong.

Parallel Corpora

ChEnPat: A Chinese-English Patent Parallel Corpus
It contains 160K parallel sentences mined from about 7000 comparable patents, which were first filed in Chinese with the State Intellectual Property Office (SIPO), P.R. China, and then manually translated into English and filed with the United States Patent and Trademark Office (USPTO).

Samples: 4000 parallel sentences encoded in UTF-8 (1000 for each section: title, abstract, claims, description).
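
For readers who want to inspect the samples by section, the following is a minimal sketch of how a UTF-8 sample file might be grouped into the four sections listed above. The file layout is not documented on this page, so the tab-separated format (section, Chinese sentence, English sentence), the function name group_by_section and the file name chenpat_sample.txt are all assumptions made for illustration only.

```python
from collections import defaultdict

# The four sections named in the sample description.
SECTIONS = ("title", "abstract", "claims", "description")


def group_by_section(lines):
    """Group (chinese, english) sentence pairs by patent section.

    Assumed (hypothetical) line layout: section<TAB>chinese<TAB>english
    """
    pairs = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Raises ValueError if a line has fewer than three fields.
        section, chinese, english = line.split("\t", 2)
        if section not in SECTIONS:
            raise ValueError(f"unexpected section label: {section!r}")
        pairs[section].append((chinese, english))
    return dict(pairs)


# Hypothetical usage; the file name is illustrative, not a real download.
# with open("chenpat_sample.txt", encoding="utf-8") as f:
#     by_section = group_by_section(f)
#     print({name: len(p) for name, p in by_section.items()})
```

Opening the file with an explicit encoding="utf-8" avoids depending on the platform default, which matters for the Chinese side of each pair. If the actual samples use a different layout (for example, separate files per language), only the line-splitting step would need to change.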

References
- Bin Lu, Benjamin K. Tsou, Jingbo Zhu, Tao Jiang, and Oi Yee Kwong. 2009. The Construction of a Chinese-English Patent Parallel Corpus. Proceedings of the MT Summit XII 3rd Workshop on Patent Translation. pp. 17-24. Ottawa, Canada. August 2009.

LEnChPat: A Large English-Chinese Patent Parallel Corpus
It contains more than 7 million parallel sentences mined from more than 78K comparable patents, which were first filed in countries with English as the original language, and then manually translated into Chinese and filed with SIPO, P.R. China.

Samples: 4000 parallel sentences encoded in UTF-8 (1000 for each section: title, abstract, claims, description).

References
- Bin Lu, Tao Jiang, Kapo Chow, and Benjamin K. Tsou. 2010. Building A Large Chinese-English Parallel Corpus by Harvesting Comparable Patents from the Web. LREC Workshop on Building and Using Comparable Corpora. Malta. May 2010.
- Bin Lu, Benjamin K. Tsou, Tao Jiang, and Oi Yee Kwong. 2010. Mining Large-scale Parallel Corpora from Multilingual Patents - An English-Chinese Example and its Application to SMT. (submitted)

Other References

The main papers relevant to the corpora:

- Tao Jiang, Benjamin K. Tsou, and Bin Lu. 2010. Part-of-speech Model for N-Best List Reranking in Experimental Chinese-English SMT. The 1st International Workshop on Advances in Patent Information Retrieval (collocated with the Annual European Conference on Information Retrieval, ECIR'10). Milton Keynes, UK. March 2010. (to appear)
- Bin Lu and Benjamin K. Tsou. 2009. Towards Bilingual Term Extraction in Comparable Patents. Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation (PACLIC 23). pp. 755-762. Hong Kong. December 2009.
- Bin Lu, Benjamin K. Tsou, Tao Jiang, Jingbo Zhu, and Oi Yee Kwong. 2010. Mining Parallel Knowledge from Comparable Patents. (submitted)

Contact Information

We are planning to make part of the corpora public to the research community.
If you are interested in the corpora or have any questions about them, please contact Professor Benjamin K. Tsou at the Language Information Sciences Research Centre, City University of Hong Kong.
Bin LU (PhD student) : lubin2010 at gmail DOT com
Professor Benjamin K. Tsou : rlbtsou at cityu DOT edu DOT hk

version 0.1
last modified March 16, 2010