I’m looking for an algorithm that accurately maps the words of a source sentence to the corresponding words of a target sentence, taking into account the following challenges:
- Different word order between the two languages
- Missing words in either language
- Many-to-many and one-to-many mappings
- Duplicate words in the same sentence
I started by creating inverted indexes of the words in each sentence and used them to calculate the correlation between every source word and every target word. I then used Dijkstra’s algorithm to draw a path connecting the alignment points, but it doesn’t seem to work because of the changing word order and the missing words. I think the optimal solution will involve something like expanding rectangles that start from the most likely correspondences, span many-to-many correspondences, and skip words with no alignment.
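To make the correlation step concrete, here is a minimal sketch of one way such a value could be computed from inverted indexes. It assumes each index maps a word to the set of sentence-pair IDs in which it occurs, and it uses the Dice coefficient as the correlation measure. Both are assumptions for illustration, not necessarily what the spreadsheet values are based on. The helper names `build_inverted_index` and `dice` and the toy corpus are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(sentences):
    """Map each word to the set of sentence IDs in which it occurs."""
    index = defaultdict(set)
    for sent_id, words in enumerate(sentences):
        for word in words:
            index[word].add(sent_id)
    return index

def dice(src_ids, trg_ids):
    """Dice coefficient between two sets of sentence IDs (0.0 to 1.0)."""
    if not src_ids or not trg_ids:
        return 0.0
    return 2 * len(src_ids & trg_ids) / (len(src_ids) + len(trg_ids))

# Toy parallel corpus (hypothetical data, for illustration only).
# Sentence k of src_corpus is assumed to be the translation of sentence k of trg_corpus.
src_corpus = [["I", "know", "this"], ["I", "see", "this"], ["you", "know"]]
trg_corpus = [["Ich", "kenne", "das"], ["Ich", "sehe", "das"], ["du", "kennst"]]

src_index = build_inverted_index(src_corpus)
trg_index = build_inverted_index(trg_corpus)

def match_indexes(word1, word2):
    """Correlation between a source word and a target word."""
    return dice(src_index.get(word1, set()), trg_index.get(word2, set()))
```

With this toy corpus, "I" and "Ich" occur in exactly the same sentence pairs and score 1.0, while "you" and "das" never co-occur and score 0.0.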
I’m currently using the following code:
import random

src_words = ["I", "know", "this"]
trg_words = ["Ich", "kenne", "das"]

def match_indexes(word1, word2):
    return random.random()  # placeholder: replace this with the actual correlation value

# list of (source index, target index, correlation value) triples
all_pairs_vals = []
for i, src_word in enumerate(src_words):  # iterate over source (src) indexes and words
    for j, trg_word in enumerate(trg_words):  # iterate over target (trg) indexes and words
        # matching value from the inverted indexes of each word (or from the data in the spreadsheet)
        val = match_indexes(src_word, trg_word)
        all_pairs_vals.append((i, j, val))

# sort in descending order, to get the pairs with the highest correlation first
all_pairs_vals.sort(key=lambda x: -x[-1])

selected_alignments = []
used_i, used_j = set(), set()  # source (row) and target (column) indexes already aligned
for i0, j0, val0 in all_pairs_vals:
    if i0 in used_i:  # this source index has been used before, so skip the pair
        continue
    if j0 in used_j:  # same if this target index has been used before
        continue
    selected_alignments.append((i0, j0))  # otherwise, add the pair to the final alignment points
    used_i.add(i0)  # and mark both indexes as used so that they are not used again
    used_j.add(j0)
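The greedy selection above forces a strict one-to-one mapping and aligns every word, so it cannot produce many-to-many links or leave a word unaligned. As a rough sketch of one possible alternative (a simple thresholding heuristic, not an established or optimal method), a cell can be kept whenever its score is high in absolute terms and close to the best score in both its row and its column. The function name `select_alignments` and the `min_score` and `ratio` defaults are hypothetical and would need tuning on real data.

```python
def select_alignments(scores, min_score=0.3, ratio=0.8):
    """Return (src index, trg index) pairs from a score matrix.

    A cell is kept if its score is at least `min_score` and at least
    `ratio` times the best score in both its row and its column.
    Rows or columns with no qualifying cell stay unaligned.
    """
    n_src = len(scores)
    n_trg = len(scores[0]) if scores else 0
    row_max = [max(row, default=0.0) for row in scores]
    col_max = [max((scores[i][j] for i in range(n_src)), default=0.0)
               for j in range(n_trg)]
    selected = []
    for i in range(n_src):
        for j in range(n_trg):
            s = scores[i][j]
            if s >= min_score and s >= ratio * row_max[i] and s >= ratio * col_max[j]:
                selected.append((i, j))
    return selected

# Hypothetical score matrix: rows are source words, columns are target words.
scores = [
    [0.1, 0.9, 0.0, 0.1],    # source 0 matches target 1 (reordered)
    [0.8, 0.1, 0.75, 0.0],   # source 1 matches targets 0 and 2 (one-to-many)
    [0.0, 0.1, 0.1, 0.05],   # source 2 has no good match (unaligned)
]
alignments = select_alignments(scores)
print(alignments)  # [(0, 1), (1, 0), (1, 2)]
```

Because each cell is judged independently of its position, word order does not matter here. Duplicate words are not handled: two occurrences of the same word get identical scores, so an extra tie-break (for example, by relative position in the sentence) would still be needed.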
In the example alignment between an English and a German sentence, which shows the correlations between words, the green cells mark the correct alignment points that the word-alignment algorithm should identify.
Here is some data to try: https://docs.google.com/spreadsheets/d/1-eO47RH6SLwtYxnYygow1mvbqwMWVqSoAhW64aZrubo/edit?usp=sharing