A new feature to improve Moore’s sentence alignment method
Đăng ngày 4/25/2019 3:42:28 AM
| Thể loại:
| Lần tải: 0
| Lần xem: 13
| Page: 13
| FileSize: 0.42 M
| File type: PDF
A new feature to improve Moore’s sentence alignment method. The sentence alignment approach proposed by Moore, 2002 (M-Align) is an effective method which gets a relatively high performance based on combination of length-based and word correspondences. Nevertheless, despite the high precision, M-Align usually gets a low recall especially when dealing with sparse data problem.
(”(other)”;”(other)”), the error of this estimation
"i 2 f1,2,3geffectstothecorrectestimationby
a total error !:
Brown’sAlgorithm.Word clustering Brown, et
al.  is considered as a method for estimating
! = "i
the probabilities of low frequency events that
are likely unobserved in an unlabeled data.
One of aims of word clustering is the problem
of predicting a word from previous words in
suppose that there are k, (0 k l + 1), word
HƯỚNG DẪN DOWNLOAD TÀI LIỆU
Bước 1:Tại trang tài liệu slideshare.vn bạn muốn tải, click vào nút Download màu xanh lá cây ở phía trên.
Bước 2: Tại liên kết tải về, bạn chọn liên kết để tải File về máy tính. Tại đây sẽ có lựa chọn tải File được lưu trên slideshare.vn
Bước 3: Một thông báo xuất hiện ở phía cuối trình duyệt, hỏi bạn muốn lưu .
- Nếu click vào Save, file sẽ được lưu về máy
(Quá trình tải file nhanh hay chậm phụ thuộc vào đường truyền internet, dung lượng file bạn muốn tải)
Có nhiều phần mềm hỗ trợ việc download file về máy tính với tốc độ tải file nhanh như: Internet Download Manager (IDM), Free Download Manager, ... Tùy vào sở thích của từng người mà người dùng chọn lựa phần mềm hỗ trợ download cho máy tính của mình
A new feature to improve Moore’s sentence alignment method. The sentence alignment approach proposed by Moore, 2002 (M-Align) is an effective method which gets a relatively high performance based on combination of length-based and word correspondences. Nevertheless, despite the high precision, M-Align usually gets a low recall especially when dealing with sparse data problem..
VNU Journal of Science: Comp. Science & Com. Eng. Vol. 31. No. 1 (2015) 32–44
A New Feature to Improve Moore’s Sentence Alignment Method
Hai-Long Trieu1 Phuong-Thai Nguyen2 Le-Minh Nguyen1
1Japan Advanced Institute of Science and Technology, Ishikawa, Japan 2VNU University of Engineering and Technology, Hanoi, Vietnam
The sentence alignment approach proposed by Moore, 2002 (M-Align) is an effective method which gets a rela-tively high performance based on combination of length-based and word correspondences. Nevertheless, despite the high precision, M-Align usually gets a low recall especially when dealing with sparse data problem. We pro-pose an algorithm which not only exploits advantages of M-Align but overcomes the weakness of this baseline method by using a new feature in sentence alignment, word clustering. Experiments shows an improvement on the baseline method up to 30% recall while precision is reasonable.
2015 Published by VNU Journal of Science.
Manuscript communication: received 17 June 2014, revised 4 january 2015, accepted 19 January 2015 Corresponding author: Trieu Hai Long, email@example.com
Keywords: Sentence Alignment, Parallel Corpora, Word Clustering, Natural Language Processing
Online parallel texts are ample and substantial resources today. In order to apply these materials into useful applications like machine translation, these resources need to be aligned at sentence level. This is the task known as sentence alignment which maps sentences in the text of the source language to their corresponding units in the text of the target language. After aligned at sentence level, the bilingual corpora are greatly useful in many important applications. Efficient and powerful sentence alignment algorithms, therefore, become increasingly important.
The sentence alignment approach proposed by Moore, 2002  is an effective method which gets a relatively high performance especially in precision. Nonetheless, this method has a drawback that it usually gets a low recall especially when dealing with sparse data problem. In any real text, sparseness of data is an inherentproperty,anditisaproblemthataligners
encounter in collecting frequency statistics on words. This may lead to an inadequate estimation probabilities of rare but nevertheless possible words. Therefore, reducing unreliable probability estimates in processing sparse data is also a solution to improve the quality of aligners. In this paper, we propose a method which overcomes weaknesses of the Moore’s approach by using a new feature in sentence alignment, word clustering. In the Moore’s method, a bilingual word dictionary is built by using IBM Model 1, which mainly effects on performance of the aligner. However, this dictionary may lack a large number of vocabulary when input corpus contains sparse data. Therefore, in order to deal with this problem, we propose an algorithm which applies monolingual word clustering to enrich the dictionary in such case. Our approach obtains a high recall while the accuracy is still relatively high, which leads to a considerably better overall performance than the baseline
H.L. Trieu, et al. / VNU Journal of Science: Comp. Science & Com. Eng. Vol. 31. No. 1 (2015) 32–44 33
Initial Corpus Processed Corpus
Pairs of Sentences with
2.2. Effect of Bilingual Word Dictionary
Length and Word
Aligning by Length
IBM Model 1
Sentence aligners based on the combination length-based and word correspondences usually use bilingual word dictionary. Moore  uses IBM Model 1 to make a bilingual word
dictionary. Varga, et al.  use an extra dictionary or train IBM Model 1 to make a dictionary in the case of absence such a resource.
Fig 1: Framework of our sentence alignment algorithm.
In the next section, we present our approach and sentence alignment framework. Section 3 indicates experimental results and evaluations on our algorithm compared to the baseline method. Section 4 is a survey of related works. Finally, Section 5 gives conclusions and future works.
2. Our Method
Our method is based on the framework of the Moore’s algorithm , which is presented in section 2.1. Section 2.2 illustrates our analyses and evaluations impacts of dictionary quality to performance of the sentence aligner. We briefly introduce to word clustering (Section 2.3) and using this feature to improve the Moore’s method (Section 2.4). An example is also included in this section to illustrate our algorithm more detail.
2.1. Sentence Alignment Framework
We use the framework of the Moore’s algorithm  with some modifications. This framework consists of two phases. Firstly, input corpus is aligned based on a sentence-length model in order to extract sentence pairs with high probability to train word alignment model
Let (s, t) is a pair of sentences where s is a sentence of source language, t is a sentence of target language.
s=(s1, s2,..., sl),where si iswordsofsentence s.
t = (t1, t2, ..., tm), where tj is words of sentence t.
To estimate alignment probability for this sentence pair, all word pairs (si, tj) are searched in bilingual word dictionary. However, the more input corpus contains sparse data, the more these word pairs are not contained in the dictionary. In the Moore’s method , words which are not included in the dictionary are simply replaced by an only term "(other)".
In the Moore’s method, word translation is applied to evaluate alignment probability as formula below:
P(s;t) = Pl+(l;m)(j=1 i=0 t(tjjsi))(i=1 fu(si)) (1)
Where m is the length of t, and l is the length of s; t(tjjsi) is word translation probability of word pair (tj;si); and fu is the observed relative unigram frequency of the word in the text of corresponding language.
In the below section, we will analyse how the Moore’s method makes errors when word pairs are absent in dictionary, or sparse data problem.
(IBM Model 1). In the second phase, the corpus According to the Moore’s method, when
is aligned again based on a combination of length-basedandbilingualworddictionary.Word clustering is used in the second phrase to improve sentence alignment quality. Our approach is illustrated in the Fig. 1.
si or tj is not included in dictionary, it is replaced by one of pairs: (tj;”(other)”), (”(other)”;si), or (”(other)”;”(other)”). Suppose thatthecorrecttranslationprobabilityoftheword pair (tj;si) is , and the translation probabilities
34 H.L. Trieu, et al. / VNU Journal of Science: Comp. Science & Com. Eng. Vol. 31. No. 1 (2015) 32–44
Algorithm 1: Generating Bilingual Word Dictionary Input : set of sentence pairs (s,t)
Output: translation prob. t(e; f)
2 initialize t(ejf) uniformly 3 while not converged do
5 count(ejf) = 0 for all e; f 6 total(f) = 0 for all f
7 for all sentence pairs (s,t) do 8 //compute normalization 9 for all words e in s do
10 total(e) = 0
11 for all words f in t do 12 total(e)+ = t(ejf)
13 //collect counts
14 for all words e in s do
15 for all words f in t do 16 count(ejf)+ = total(e) 17 total(f)+ = total(e)
18 //estimate probabilities 19 for all words f do
20 for all words e do
21 t(ejf) = fraccount(ejf)total(f)
22 return t(ejf)
of the word pair (tj;”(other)”), (”(other)”;si), (”(other)”;”(other)”) are 1;2;3 respectively. These estimations make errors as follows:
1 = 1;2 = 2;3 = 3; (2)
Therefore, when (tj;si) is replaced by one of these word pairs: (tj;”(other)”), (”(other)”;si), (”(other)”;”(other)”), the error of this estimation "i 2 f1,2,3geffectstothecorrectestimationby a total error !:
pairswhicharenotincludedindictionary,andthe error average is ; then the total error is:
! = (k)m; (4)
The more word pairs which are not included in dictionary, the more the number of word pairs k, or total error !.
2.3. Word Clustering
Brown’s Algorithm. Word clustering Brown, et al.  is considered as a method for estimating
! = "i (3)
the probabilities of low frequency events that are likely unobserved in an unlabeled data. One of aims of word clustering is the problem
If (tj;si) is contained dictionary, "i = 0; suppose that there are k, (0 k l + 1), word
of predicting a word from previous words in a sample of text. This algorithm counts the
H.L. Trieu, et al. / VNU Journal of Science: Comp. Science & Com. Eng. Vol. 31. No. 1 (2015) 32–44 35
Table 1: An English-Vietnamese sentence pair
damodaran ’ s solution is gelatin hydrolysate , a protein known to act as a natural antifreeze .
gi£i_pháp cıa damodaran là ch§t thıy_phân gelatin , mºt lo⁄i protein có chøc_năng như ch§t chŁng đông tü_nhiên .
Table 2: Several word pairs in Dictionary
Fig 2: An example of Brown’s cluster algorithm
similarity of a word based on its relations with words on left and the right of it. Input to the algorithm is a corpus of unlabeled data which consists of a vocabulary of words to be clustered.
damodaran 0.22 cıa 0.12 gi£i_pháp 0.03 là 0.55 mºt 0.73 như 0.46
Initially, each word in the corpus is considered to be in its own distinct cluster. The algorithm then repeatedly merges pairs of clusters that maximizesthequalityoftheclusteringresult,and each word belongs to exactly one cluster until the number of clusters is reduced to a predefined number. Output of the word cluster algorithm is a binary tree as shown in Fig. 2, in which the leaves of the tree are the words in the vocabulary. A word cluster contains a main word and several subordinatewords.Eachsubordinatewordhasthe same bit string and corresponding frequency.
2.4. Proposed Algorithm
We propose using word clustering data to supplementlexicalinformationforbilingualword dictionaryandimprovealignmentquality.Weuse the hypothesis that same cluster have a specific correlation, and in some cases they are able to be replacedtoeachother.Wordsthatdisappearinthe dictionary would be replaced other words of their cluster rather than replacing all of those words to an only term as in method of Moore . We use twowordclusteringdatasetscorrespondingtothe twolanguagesinthecorpus.Thisideaisindicated at the Algorithm 2.
In Algorithm 2, D is bilingual word dictionary created by training IBM Model 1. The dictionary D contains word pairs (e, v) in which each
languages correspondingly, and t(e, v) is their word translation probability.
In addition, Ce and Cv are two data sets clustered by word of texts of source and target languages respectively. Ce is the cluster of the word e, and Cv is the cluster of the word v.
When the word pair (e, v) is absent in the dictionary, e and v are replaced by all words of their cluster. A combined value of probability of new word pairs is counted, and it is treated as alignment probability for the absent word pair (e, v). In this algorithm, we use average function to get this combined value.
Consider an English-Vietnamese sentence pair as indicated in Table 1.
Some word pairs of bilingual word dictionary are listed in Table 2.
Considerawordpairswhichisnotcontainedin the Dictionary: (act, chøc_năng). In the first step, ouralgorithmreturnsclustersofeachwordinthis pair. The result is shown in Table 3 and Table 4.
Table 3: Cluster of act
0110001111 act 0110001111 society 0110001111 show 0110001111 departments 0110001111 helps
word belongs to texts of source and target
36 H.L. Trieu, et al. / VNU Journal of Science: Comp. Science & Com. Eng. Vol. 31. No. 1 (2015) 32–44
Algorithm 2: Sentence Alignment Using Word Clustering
Input : A word pair (e, v), Dictionary D, Clusters Ce and Cv Output: Word translation prob. of (e, v)
2 if (e, v) contained in D then 3 P = t(e;v)
5 if (e contained in D) and (v contained in D) then 6 with all (e1,..., en) in Ce
7 with all (v1,..., vm) in Cv
8 if ((ei, v) contained in D) or ((e, vj) contained in D) then 9 P = 1 ( t(ei;v)+ t(e;vj))
11 P = t(”(other)”;”(other)”)
13 if (e contained in D) or (v contained in D) then 14 if (e contained in D) then
15 with all (v1,..., vm) in Cv
16 if (e, v ) contained in D then 17 P = 1 Xt(e;vj)
19 P = 1 Xt(e;”(other)”)
21 with all (e1,..., en) in Ce
22 if (e , v) contained in D then 23 P = 1 Xt(ei;v)
25 P = 1 Xt(”(other)”;v)
27 P = t(”(other)”;”(other)”)
28 return P
H.L. Trieu, et al. / VNU Journal of Science: Comp. Science & Com. Eng. Vol. 31. No. 1 (2015) 32–44 37
Table 4: Cluster of chøc_năng
11111110 11111110 11111110 11111110 ...
chøc_năng hành_vi ph⁄t ho⁄t_đºng
The bit strings “0110001111" and “11111110" are identification of the clusters. Word pairs of these two clusters are then searched in the
Dictionary as shown in Table 5. Fig 3: Frequencies of Vietnamese Sentence Length
Table 5: Word pairs are searched in Dictionary
act act act
hành_vi ph⁄t ho⁄t_đºng
0.43 7.41E-4 0.01
shown in Table 7. We align this corpus at the sentence level manually and obtain 846 bilingualsentencespairs.WeusedatafromVLSP project available at1including 100,836 English-Vietnamese sentence pairs (En Training Data and Vi Training Data) with 1743040 English words
In the next step, the algorithm returns a translation probability for the initial word pair (act, chøc_năng).
Table 6: Probability of the word pair (act, chøc_năng)
(36149 distinct words) and 1681915 Vietnamese words (25523 distinct words). The VLSP data consists of 80,000 sentence pairs in Economics-Social topics and 20,000 sentence pairs in information technology topic.
Pr(act,chøc_năng) =average of (9.15E-4, Table 7: Bilingual Corpora
7.41E-4, 0.01) = 0.11
En Training Data Vi Training Data En Test Data
Vi Test Data
Sentences 100038 100038 1800 1828
Vocabularies 36149 25523
In this section, we evaluate performance of our algorithm and compare to the baseline method (M-Align).
3.1.1. Bilingual Corpora
The test data of our experiment is English-Vietnamese parallel data extracted from some websites including World Bank, Science, WHO, and Vietnamtourism. The data consist of 1800 English sentences (En Test Data) with 39526 words (6333 distint words) and 1828 Vietnamese sentences (Vi Test Data) with 40491 words
(5721 distinct words). These data sets are
We conduct lowercase, tokenize, word segmentation these data sets using the tool of1.
3.1.2. Sentence Length Frequency
The frequencies of sentence length are describedinFig.3andFig.4.Inthesefigures,the horizontal axis describe sentence lengths, and the vertical axis describe frequencies. The average sentence lengths of English and Vietnamese are 17.3 (English), 16.7 (Vietnamese), respectively.
38 H.L. Trieu, et al. / VNU Journal of Science: Comp. Science & Com. Eng. Vol. 31. No. 1 (2015) 32–44
Table 9: Word clustering data sets.