Friday, April 06, 2018

Removing text blocks containing repetition with Unix or Linux Power Tools

Let us illustrate the issue with an example. In the translation industry, a TMX file is an XML representation of a translation memory (TM). The format is useful for exchanging TMs. It contains translation units (tu nodes) with properties (prop nodes) and translation unit variants (tuv nodes), each holding a segment (seg node) with the text in the source language or the target language. Often the same segment is added again and again by the Computer Aided Translation (CAT) Tool. While this helps produce more precise translations, it can become a burden if you try to process such a big TMX with an open source CAT Tool like OmegaT. Since OmegaT is client side only, processing a big TMX can be problematic. In that case you might want to trade some translation precision for the ability to use the free tool. These repetitions are mostly related to the addition of context around the specific segment (the x-context-pre and x-context-post type attributes).

The question is then how to remove every "tu" node that contains a duplicated segment while keeping just one of them (again, we lose some precision in the translation output, but it might be worth it because of the savings from using a free CAT Tool).

The straightforward answer would be to export the TMX from the original tool using whatever options it provides to export less data, specifically ignoring context specific translations. If that is not a possibility, we are left with building a tool to clean it up.

First we can get an idea of which segments are duplicated and how many times each:
cat input.tmx | grep '<seg>' \
| sort | uniq -c | sort -nr \
| grep -v '^ *1 ' > tmx-repetitions.txt
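To get a feel for the report, here is a minimal sketch on a made-up file. The names sample.tmx and sample-repetitions.txt and the file contents are invented for illustration, and the sketch assumes one seg per line, as the grep above does.

```shell
# Hypothetical sample: the English "Hello" segment line appears in two units.
cat > sample.tmx <<'EOF'
<tu tuid="1">
<tuv xml:lang="en"><seg>Hello</seg></tuv>
<tuv xml:lang="es"><seg>Hola</seg></tuv>
</tu>
<tu tuid="2">
<tuv xml:lang="en"><seg>Hello</seg></tuv>
<tuv xml:lang="es"><seg>Buenas</seg></tuv>
</tu>
EOF

# Count identical seg lines and keep only those seen more than once.
grep '<seg>' sample.tmx \
| sort | uniq -c | sort -nr \
| grep -v '^ *1 ' > sample-repetitions.txt

cat sample-repetitions.txt
```

Only the repeated "Hello" line should survive in the report, prefixed by its count.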
Then we can replace every repeated occurrence with a marker string like DUPLICATE_NODE_PLEASE_REMOVE, keeping the first occurrence untouched:
cat input.tmx \
| awk '{if($0 ~ /<seg>/ && !seen[$0]++ || $0 !~ /<seg>/) print $0; \
else print "DUPLICATE_NODE_PLEASE_REMOVE"}' > input-with-marked-duplicates.tmx
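The marking step can be tried on a tiny file first. This is only a sketch: sample.tmx, sample-marked.tmx and their contents are made up, and each seg is assumed to sit on its own line.

```shell
# Made-up input: the "Hello" seg line occurs in both units.
printf '%s\n' \
  '<tu tuid="1">' '<seg>Hello</seg>' '<seg>Hola</seg>' '</tu>' \
  '<tu tuid="2">' '<seg>Hello</seg>' '<seg>Buenas</seg>' '</tu>' \
  > sample.tmx

# Print non-seg lines and first-time seg lines as they are;
# any later identical seg line is replaced by the marker.
awk '{
  if (($0 ~ /<seg>/ && !seen[$0]++) || $0 !~ /<seg>/) {
    print $0
  } else {
    print "DUPLICATE_NODE_PLEASE_REMOVE"
  }
}' sample.tmx > sample-marked.tmx

cat sample-marked.tmx
```

In this sample only the second "Hello" line is turned into the marker, so the second tu node is the one flagged for removal.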
Finally we can try removing the whole translation unit (tu) node with perl:
# \b keeps <tu from matching <tuv; (?:(?!</tu>).)*? keeps each match inside one tu node
cat input-with-marked-duplicates.tmx \
| perl -0pe 's#<tu\b(?:(?!</tu>).)*?DUPLICATE.*?</tu>##gs'
But if the file is big enough this won't work as expected, probably because this particular command makes perl parse the whole file in memory as one multiline string. This is the reason why I built the open source bash-multiline-replace project, which contains a simple bash script (multilineReplace.sh) that eliminates full blocks, from a start pattern to an end pattern, if they contain an inner pattern.
cat input-with-marked-duplicates.tmx \
| ./multilineReplace.sh '<tu ' 'DUPLICATE' '</tu>' 
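The underlying idea, buffering one block at a time instead of the whole file, can also be sketched in awk. This illustrates the technique and is not the code in multilineReplace.sh. The function name drop_marked_tu is made up, and the sketch assumes that the opening '<tu ' and the closing '</tu>' of a unit each appear on some line of the block.

```shell
# Sketch only: stream the file, hold one tu block in memory at a time,
# and print the block only if it does not contain the marker.
drop_marked_tu() {
  awk '
    /<tu / { inblock = 1; buf = ""; marked = 0 }   # a new tu block starts
    inblock {
      buf = buf $0 "\n"                            # collect the current block
      if ($0 ~ /DUPLICATE_NODE_PLEASE_REMOVE/) marked = 1
      if ($0 ~ /<\/tu>/) {                         # block ends: flush or drop
        if (!marked) printf "%s", buf
        inblock = 0
      }
      next
    }
    { print }                                      # lines outside any tu block
  '
}
```

It reads from standard input, so it could be used as `drop_marked_tu < input-with-marked-duplicates.tmx > output.tmx`, where output.tmx is a placeholder name.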
