Article type: Research article
Authors
1 M.Sc. student, School of Electrical and Computer Engineering, University of Tehran
2 Assistant Professor, School of Electrical and Computer Engineering, University of Tehran
Abstract
Keywords
Article title [English]
Authors [English]
Parallel corpora are regarded as rich linguistic resources for Natural Language Processing and Cross-Language Information Retrieval tasks. Sentences usually need to be aligned before these valuable resources can be used; however, sentence alignment is expensive in both time and cost. With the growth of the World Wide Web and free access to it, automatically building parallel corpora from the Web is desirable. In this paper, we first select bilingual pages with parallel content to extract parallel sentence candidates. Then, by computing several features and training a Maximum Entropy classifier, we extract parallel sentences from the candidates. Our approach does not depend on a specific domain and can cover different domains on the Web.
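The classification step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the features, so the two used here (sentence length ratio and bilingual lexicon overlap), the toy English-French lexicon, the training pairs, and all function names are assumptions chosen for illustration. With two classes, a Maximum Entropy classifier reduces to logistic regression, which is what the sketch trains by plain gradient ascent.

```python
import math

def features(src_tokens, tgt_tokens, lexicon):
    """Hypothetical features for one candidate pair (both sides assumed non-empty)."""
    # Length ratio: shorter sentence length over longer one, in (0, 1].
    len_ratio = min(len(src_tokens), len(tgt_tokens)) / max(len(src_tokens), len(tgt_tokens))
    # Share of source tokens that have a lexicon translation in the target sentence.
    tgt_set = set(tgt_tokens)
    covered = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt_set)
    overlap = covered / len(src_tokens)
    return [1.0, len_ratio, overlap]  # the leading 1.0 acts as the bias term

def predict(weights, x):
    """Probability that the pair is parallel under a binary MaxEnt (logistic) model."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=5000, lr=0.5):
    """Maximize the mean log-likelihood with batch gradient ascent."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(samples, labels):
            err = y - predict(weights, x)
            for j, xj in enumerate(x):
                grad[j] += err * xj
        weights = [w + lr * g / len(samples) for w, g in zip(weights, grad)]
    return weights

# Toy lexicon and labeled candidate pairs (assumed data, for illustration only).
lexicon = {"cat": {"chat"}, "black": {"noir"}, "dog": {"chien"},
           "house": {"maison"}, "red": {"rouge"}}
pairs = [
    (["the", "black", "cat"], ["le", "chat", "noir"], 1),
    (["red", "house"], ["maison", "rouge"], 1),
    (["the", "dog"], ["le", "chien"], 1),
    (["the", "black", "cat"], ["maison"], 0),
    (["red", "house"], ["le", "chien", "noir", "et", "le", "chat"], 0),
    (["the", "dog"], ["maison", "rouge"], 0),
]
samples = [features(s, t, lexicon) for s, t, _ in pairs]
labels = [y for _, _, y in pairs]
weights = train(samples, labels)

# Score two unseen candidate pairs.
p_parallel = predict(weights, features(["black", "dog"], ["chien", "noir"], lexicon))
p_not_parallel = predict(weights, features(["red", "cat"], ["le", "chien"], lexicon))
print(p_parallel, p_not_parallel)
```

A candidate pair would be kept as parallel when its score exceeds a chosen threshold such as 0.5. Because features of this kind depend only on sentence lengths and a general bilingual lexicon, they are not tied to any particular topic, which is consistent with the domain independence the abstract claims.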
Keywords [English]