VNU Journal of Science: Comp. Science & Com. Eng. Vol. 31. No. 1 (2015) 32–44
A New Feature to Improve Moore’s Sentence
Alignment Method
Hai-Long Trieu1 Phuong-Thai Nguyen2 Le-Minh Nguyen1
1Japan Advanced Institute of Science and Technology, Ishikawa, Japan
2VNU University of Engineering and Technology, Hanoi, Vietnam
Abstract
The sentence alignment approach proposed by Moore, 2002 (M-Align) is an effective method which achieves relatively high performance based on a combination of length-based statistics and word correspondences. Nevertheless, despite its high precision, M-Align usually gets a low recall, especially when dealing with the sparse data problem. We propose an algorithm which not only exploits the advantages of M-Align but also overcomes the weakness of this baseline method by using a new feature in sentence alignment: word clustering. Experiments show an improvement over the baseline method of up to 30% in recall while precision remains reasonable.
© 2015 Published by VNU Journal of Science.
Manuscript communication: received 17 June 2014, revised 4 January 2015, accepted 19 January 2015
Corresponding author: Trieu Hai Long, trieulh@jaist.ac.jp
Keywords: Sentence Alignment, Parallel Corpora, Word Clustering, Natural Language Processing
1. Introduction

Online parallel texts are ample and substantial resources today. In order to apply these materials to useful applications like machine translation, these resources need to be aligned at the sentence level. This is the task known as sentence alignment, which maps sentences in the text of the source language to their corresponding units in the text of the target language. Once aligned at the sentence level, bilingual corpora are greatly useful in many important applications. Efficient and powerful sentence alignment algorithms, therefore, become increasingly important.

The sentence alignment approach proposed by Moore, 2002 [14] is an effective method which achieves relatively high performance, especially in precision. Nonetheless, this method has a drawback: it usually gets a low recall, especially when dealing with the sparse data problem. In any real text, sparseness of data is an inherent property, and it is a problem that aligners encounter in collecting frequency statistics on words. This may lead to an inadequate estimation of the probabilities of rare but nevertheless possible words. Therefore, reducing unreliable probability estimates when processing sparse data is also a way to improve the quality of aligners. In this paper, we propose a method which overcomes the weaknesses of Moore's approach by using a new feature in sentence alignment: word clustering. In Moore's method, a bilingual word dictionary is built by using IBM Model 1, which strongly affects the performance of the aligner. However, this dictionary may lack a large part of the vocabulary when the input corpus contains sparse data. Therefore, in order to deal with this problem, we propose an algorithm which applies monolingual word clustering to enrich the dictionary in such cases. Our approach obtains a high recall while the accuracy is still relatively high, which leads to a considerably better overall performance than the baseline
method [14].

In the next section, we present our approach and sentence alignment framework. Section 3 presents experimental results and evaluations of our algorithm compared to the baseline method. Section 4 is a survey of related work. Finally, Section 5 gives conclusions and future work.

2. Our Method

Our method is based on the framework of Moore's algorithm [14], which is presented in Section 2.1. Section 2.2 presents our analyses and evaluations of the impact of dictionary quality on the performance of the sentence aligner. We briefly introduce word clustering (Section 2.3) and the use of this feature to improve Moore's method (Section 2.4). An example is also included in this section to illustrate our algorithm in more detail.

2.1. Sentence Alignment Framework

We use the framework of Moore's algorithm [14] with some modifications. This framework consists of two phases. Firstly, the input corpus is aligned based on a sentence-length model in order to extract sentence pairs with high probability, which are used to train the word alignment model (IBM Model 1). In the second phase, the corpus is aligned again based on a combination of the length-based model and the bilingual word dictionary. Word clustering is used in the second phase to improve sentence alignment quality. Our approach is illustrated in Fig. 1.

Fig. 1: Framework of our sentence alignment algorithm. (Pipeline: Initial Corpus → PreProcessing → Processed Corpus → Aligning by Length → Pairs of Sentences with High Probability → Training IBM Model 1 → Aligning by Length and Word Dictionary, with Word Clustering Data as the new feature → Pairs of Aligned Sentences.)

2.2. Effect of Bilingual Word Dictionary

Sentence aligners based on the combination of length-based statistics and word correspondences usually use a bilingual word dictionary. Moore [14] uses IBM Model 1 to make a bilingual word dictionary. Varga, et al. [20] use an extra dictionary, or train IBM Model 1 to make a dictionary in the absence of such a resource.

Let (s, t) be a pair of sentences, where s is a sentence of the source language and t is a sentence of the target language:

s = (s_1, s_2, ..., s_l), where the s_i are the words of sentence s;
t = (t_1, t_2, ..., t_m), where the t_j are the words of sentence t.

To estimate the alignment probability for this sentence pair, all word pairs (s_i, t_j) are searched in the bilingual word dictionary. However, the more sparse data the input corpus contains, the more of these word pairs are missing from the dictionary. In Moore's method [14], words which are not included in the dictionary are simply replaced by the single term "(other)".

In Moore's method, word translation is applied to evaluate the alignment probability by the formula below:

P(s, t) = \frac{P_{1\text{-}1}(l, m)}{(l+1)^m} \left( \prod_{j=1}^{m} \sum_{i=0}^{l} t(t_j \mid s_i) \right) \left( \prod_{i=1}^{l} f_u(s_i) \right)   (1)

where m is the length of t and l is the length of s; P_{1-1}(l, m) is the length-based probability of a 1-1 sentence pair; t(t_j | s_i) is the word translation probability of the word pair (t_j, s_i); and f_u is the observed relative unigram frequency of the word in the text of the corresponding language.
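As a concrete illustration, formula (1) can be sketched in Python. This is a minimal sketch, not the authors' code: `t_prob` and `f_u` are hypothetical dictionary and unigram-frequency tables, the length-model factor P_{1-1}(l, m) is passed in as a precomputed log value, and `"(null)"` stands for the empty source word s_0 of IBM Model 1.

```python
import math

def alignment_log_prob(s, t, t_prob, f_u, log_p_len=0.0):
    """Log of formula (1): P(s,t) = P_1-1(l,m)/(l+1)^m
    * prod_j sum_i t(t_j|s_i) * prod_i f_u(s_i).
    s, t: lists of words; t_prob: dict mapping (target_word, source_word)
    pairs to translation probabilities; f_u: dict of relative unigram
    frequencies. Pairs absent from the dictionary fall back to the
    "(other)" entry, as in M-Align. Computed in log space to avoid
    underflow on long sentences."""
    l, m = len(s), len(t)
    src = ["(null)"] + list(s)  # s_0: the empty source word
    log_p = log_p_len - m * math.log(l + 1)
    for tj in t:
        # inner sum over all source positions for this target word
        total = sum(t_prob.get((tj, si),
                               t_prob.get(("(other)", "(other)"), 1e-9))
                    for si in src)
        log_p += math.log(total)
    for si in s:
        # unigram-frequency factor for each source word
        log_p += math.log(f_u.get(si, 1e-9))
    return log_p
```

Under the same dictionary, a matched sentence pair should score higher than a mismatched one, which is what the aligner exploits when searching for high-probability pairs.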
Algorithm 1: Generating Bilingual Word Dictionary
Input: set of sentence pairs (s, t)
Output: translation prob. t(e | f)
begin
    initialize t(e | f) uniformly
    while not converged do
        // initialize
        count(e | f) = 0 for all e, f
        total(f) = 0 for all f
        for all sentence pairs (s, t) do
            // compute normalization
            for all words e in s do
                s-total(e) = 0
                for all words f in t do
                    s-total(e) += t(e | f)
            // collect counts
            for all words e in s do
                for all words f in t do
                    count(e | f) += t(e | f) / s-total(e)
                    total(f) += t(e | f) / s-total(e)
        // estimate probabilities
        for all words f do
            for all words e do
                t(e | f) = count(e | f) / total(f)
    return t(e | f)
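Algorithm 1 is the standard EM training procedure of IBM Model 1; it can be transcribed directly into Python. The sketch below is illustrative only and uses a fixed iteration count in place of the convergence test.

```python
from collections import defaultdict
from itertools import product

def train_ibm_model1(sentence_pairs, iterations=10):
    """EM training of IBM Model 1 translation probabilities t(e|f),
    following Algorithm 1. sentence_pairs is a list of (e_words, f_words)
    pairs; the result maps (e, f) word pairs to probabilities."""
    e_vocab = {e for es, _ in sentence_pairs for e in es}
    # uniform initialization of t(e|f)
    t = defaultdict(lambda: 1.0 / len(e_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # count(e|f)
        total = defaultdict(float)  # total(f)
        for es, fs in sentence_pairs:
            # compute normalization: s-total(e) = sum over f of t(e|f)
            s_total = {e: sum(t[(e, f)] for f in fs) for e in es}
            # collect fractional counts
            for e, f in product(es, fs):
                c = t[(e, f)] / s_total[e]
                count[(e, f)] += c
                total[f] += c
        # estimate probabilities: t(e|f) = count(e|f) / total(f)
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t
```

On a toy corpus, EM concentrates probability mass on consistently co-occurring pairs, which is exactly the behavior the bilingual word dictionary relies on.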
In the section below, we analyse how Moore's method makes errors when word pairs are absent from the dictionary, i.e., under the sparse data problem. According to Moore's method, when s_i or t_j is not included in the dictionary, it is replaced by one of the pairs (t_j, "(other)"), ("(other)", s_i), or ("(other)", "(other)"). Suppose that the correct translation probability of the word pair (t_j, s_i) is \alpha, and the translation probabilities of the word pairs (t_j, "(other)"), ("(other)", s_i), ("(other)", "(other)") are \alpha_1, \alpha_2, \alpha_3 respectively. These estimations make errors as follows:

\varepsilon_1 = \alpha - \alpha_1, \quad \varepsilon_2 = \alpha - \alpha_2, \quad \varepsilon_3 = \alpha - \alpha_3   (2)

Therefore, when (t_j, s_i) is replaced by one of the word pairs (t_j, "(other)"), ("(other)", s_i), or ("(other)", "(other)"), the error \varepsilon_i \in \{\varepsilon_1, \varepsilon_2, \varepsilon_3\} of this estimation affects the correct estimation by a total error \omega:

\omega = \prod_{j=1}^{m} \sum_{i=0}^{l} \varepsilon_i   (3)

If (t_j, s_i) is contained in the dictionary, \varepsilon_i = 0. Suppose that there are k, with 0 \le k \le l + 1, word pairs which are not included in the dictionary, and that the average error is \bar{\varepsilon}; then the total error is:

\omega = (k \bar{\varepsilon})^m   (4)

The more word pairs are missing from the dictionary, the larger the number k, and hence the larger the total error \omega.

2.3. Word Clustering

Brown's Algorithm. Word clustering (Brown, et al. [3]) is considered a method for estimating the probabilities of low-frequency events that are likely unobserved in unlabeled data. One of the aims of word clustering is the problem of predicting a word from the previous words in a sample of text.
This algorithm counts the similarity of a word based on its relations with the words on its left and right. The input to the algorithm is a corpus of unlabeled data together with a vocabulary of words to be clustered. Initially, each word in the corpus is considered to be in its own distinct cluster. The algorithm then repeatedly merges the pair of clusters that maximizes the quality of the clustering result, with each word belonging to exactly one cluster, until the number of clusters is reduced to a predefined number. The output of the word clustering algorithm is a binary tree, as shown in Fig. 2, in which the leaves of the tree are the words in the vocabulary. A word cluster contains a main word and several subordinate words. Each subordinate word has the same bit string and a corresponding frequency.

Fig. 2: An example of Brown's cluster algorithm.

2.4. Proposed Algorithm

We propose using word clustering data to supplement lexical information for the bilingual word dictionary and improve alignment quality. We use the hypothesis that words in the same cluster have a specific correlation and, in some cases, can replace each other. Words that are absent from the dictionary are replaced by other words of their cluster, rather than replacing all of those words by a single term as in the method of Moore [14]. We use two word clustering data sets corresponding to the two languages in the corpus. This idea is presented in Algorithm 2.

In Algorithm 2, D is the bilingual word dictionary created by training IBM Model 1. The dictionary D contains word pairs (e, v), in which each word belongs to the text of the source and target language correspondingly, and t(e, v) is their word translation probability. In addition, Ce and Cv are two data sets clustered by word from the texts of the source and target languages, respectively; Ce is the cluster of the word e, and Cv is the cluster of the word v. When the word pair (e, v) is absent from the dictionary, e and v are replaced by the words of their clusters. A combined value of the probabilities of the new word pairs is computed and treated as the alignment probability for the absent word pair (e, v). In this algorithm, we use the average function to obtain this combined value.

Algorithm 2: Sentence Alignment Using Word Clustering
Input: A word pair (e, v), Dictionary D, Clusters Ce and Cv
Output: Word translation prob. of (e, v)
begin
    if (e, v) contained in D then
        P = t(e, v)
    else if (e contained in D) and (v contained in D) then
        with all (e_1, ..., e_n) in Ce and all (v_1, ..., v_m) in Cv:
            if ((e_i, v) contained in D) or ((e, v_j) contained in D) then
                P = average of the t(e_i, v) and t(e, v_j) found in D
            else
                P = t("(other)", "(other)")
    else if (e contained in D) then
        with all (v_1, ..., v_m) in Cv:
            if (e, v_j) contained in D then
                P = average of the t(e, v_j) found in D
            else
                P = t(e, "(other)")
    else if (v contained in D) then
        with all (e_1, ..., e_n) in Ce:
            if (e_i, v) contained in D then
                P = average of the t(e_i, v) found in D
            else
                P = t("(other)", v)
    else
        P = t("(other)", "(other)")
    return P

Consider the English-Vietnamese sentence pair indicated in Table 1. Some word pairs of the bilingual word dictionary are listed in Table 2.

Table 1: An English-Vietnamese sentence pair
damodaran 's solution is gelatin hydrolysate , a protein known to act as a natural antifreeze .
giải_pháp của damodaran là chất thủy_phân gelatin , một loại protein có chức_năng như chất chống đông tự_nhiên .

Table 2: Several word pairs in Dictionary
damodaran   damodaran   0.22
's          của         0.12
solution    giải_pháp   0.03
is          là          0.55
a           một         0.73
as          như         0.46

Consider a word pair which is not contained in the dictionary: (act, chức_năng). In the first step, our algorithm returns the cluster of each word in this pair. The result is shown in Table 3 and Table 4.

Table 3: Cluster of act
0110001111   act
0110001111   society
0110001111   show
0110001111   departments
0110001111   helps

Table 4: Cluster of chức_năng
11111110   chức_năng
11111110   hành_vi
11111110   phạt
11111110   hoạt_động
...

The bit strings "0110001111" and "11111110" are the identifiers of the clusters. Word pairs of these two clusters are then searched in the dictionary, as shown in Table 5.

Table 5: Word pairs searched in Dictionary
departments   chức_năng    9.15E-4
act           hành_vi      0.43
act           phạt         7.41E-4
act           hoạt_động    0.01

In the next step, the algorithm returns a translation probability for the initial word pair (act, chức_năng).

Table 6: Probability of the word pair (act, chức_năng)
Pr(act, chức_năng) = average of (9.15E-4, 0.43, 7.41E-4, 0.01) = 0.11

3. Experiments

In this section, we evaluate the performance of our algorithm and compare it to the baseline method (M-Align).

3.1. Data

3.1.1. Bilingual Corpora
The test data of our experiment is English-Vietnamese parallel data extracted from some websites including World Bank, Science, WHO, and Vietnamtourism. The data consist of 1800 English sentences (En Test Data) with 39526 words (6333 distinct words) and 1828 Vietnamese sentences (Vi Test Data) with 40491 words (5721 distinct words). These data sets are shown in Table 7. We align this corpus at the sentence level manually and obtain 846 bilingual sentence pairs. We use data from the VLSP project, available at (1), including 100,836 English-Vietnamese sentence pairs (En Training Data and Vi Training Data) with 1743040 English words (36149 distinct words) and 1681915 Vietnamese words (25523 distinct words). The VLSP data consist of 80,000 sentence pairs on Economics-Social topics and 20,000 sentence pairs on the information technology topic. We lowercase, tokenize, and word-segment these data sets using the tool of (1).

Table 7: Bilingual Corpora
                   Sentences   Vocabularies
En Training Data   100038      36149
Vi Training Data   100038      25523
En Test Data       1800        6333
Vi Test Data       1828        5721

3.1.2. Sentence Length Frequency
The frequencies of sentence length are described in Fig. 3 and Fig. 4. In these figures, the horizontal axis describes sentence lengths, and the vertical axis describes frequencies. The average sentence lengths of English and Vietnamese are 17.3 and 16.7, respectively.

Fig. 3: Frequencies of Vietnamese Sentence Length

(1) http://vlsp.vietlp.org:8080/demo/?page=resources

Table 9: Word clustering data sets.
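The decision structure of Algorithm 2 can be sketched in Python. This is an illustrative reconstruction rather than the authors' implementation; it averages the dictionary probabilities of all cluster-substituted pairs that are found, which reproduces the value 0.11 of Table 6 for the worked example.

```python
def cluster_translation_prob(e, v, D, Ce, Cv):
    """Translation probability of the word pair (e, v) with cluster-based
    fallback, as in Algorithm 2. D maps (source, target) word pairs to
    probabilities t(e, v); Ce and Cv are the clusters containing e and v."""
    OTHER = "(other)"
    if (e, v) in D:
        return D[(e, v)]
    src_vocab = {ew for ew, _ in D}
    tgt_vocab = {vw for _, vw in D}
    if e in src_vocab and v in tgt_vocab:
        # substitute e and v by their cluster members and average the
        # probabilities of the substituted pairs found in the dictionary
        found = [D[(ei, v)] for ei in Ce if (ei, v) in D]
        found += [D[(e, vj)] for vj in Cv if (e, vj) in D]
        if found:
            return sum(found) / len(found)
        return D.get((OTHER, OTHER), 0.0)
    if e in src_vocab:
        found = [D[(e, vj)] for vj in Cv if (e, vj) in D]
        return sum(found) / len(found) if found else D.get((e, OTHER), 0.0)
    if v in tgt_vocab:
        found = [D[(ei, v)] for ei in Ce if (ei, v) in D]
        return sum(found) / len(found) if found else D.get((OTHER, v), 0.0)
    return D.get((OTHER, OTHER), 0.0)
```

Applied to the example of Tables 3-5, the pair (act, chức_năng) receives the average of the four probabilities found in the dictionary, about 0.11, matching Table 6.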
