Aligning texts with Hunalign

By FarkasAndras | Published 12/28/2008 | CAT Tools
Quicklink: http://arm.proz.com/doc/2176
Author: FarkasAndras, Hungary, English to Hungarian translator

*** IMPORTANT NOTE ***

While the information in this article is still technically correct, most of it has been made obsolete by a subsequent project: I wrote an aligner program that makes using Hunalign a lot more user-friendly.
Download the package including a readme from: http://sourceforge.net/projects/aligner


*************************************


I. INTRODUCTION

This article gives practical advice on alignment, that is, pairing up the sentences from two language versions of the same text in order to populate translation memories (TMs) with them. This is needed when one starts using a computer-assisted translation (CAT) tool and wants to feed in their earlier translations, or when one comes across a pair of texts, translated by someone else, that would come in handy as a TM. If you came here for something else, now would be a good time to stop reading.

SDL offers a tool for alignment, Winalign, but in my opinion it is terrible to the point of uselessness. It is fine for two-page documents, but then almost any method is fine for two-page documents... When you have a few hundred or a few thousand pages to align, you'll want something better. And, of course, many people use other CATs and don't have Winalign. (Note: if you happen to just need a few pages aligned and you have Winalign already, you might be better off using Winalign than the method described here, as the learning curve is steeper with the latter. But if you are a translator, you'll probably need longer texts aligned at some point so you might as well learn how to do it right from the start.)

The workflow described here is what I myself developed for alignment; if you have any suggestions for improving it, please drop me a line. I was not involved in any way in creating the tools mentioned here, I just found them through Internet research. Given that they are not widely known and their use is not the most straightforward, I decided to write this guide. It may seem intimidatingly long, but don't despair. I wanted to make it more or less comprehensive and foolproof so it ended up being pretty long. Once you have learned how to do it, the process itself is fast.


II. SOFTWARE NEEDED

I'm going to assume Windows XP as an OS and the presence of MS Office. Things probably work the same under Vista. I know far too little about Linux and OSX to say anything about compatibility apart from the fact that all the tools here should run under Linux. Openoffice.org is perfectly capable of doing everything needed as well, I just reference Office here because it is more widely used and I know it better.
I also reference Notepad a couple of times. Pretty much any plaintext editor should work.

For identifying the sentence boundaries and inserting line breaks: Europarl pre-processing tools from http://www.statmt.org/europarl/
For alignment: Hunalign from http://mokk.bme.hu/resources/Hunalign
The Europarl tools are in perl so you need a perl interpreter like activeperl from http://www.activestate.com/Products/activeperl/index.mhtml

All of these tools are freeware. I uploaded a package to http://www.mediafire.com/?gtbwgjktzjf with all the files needed and some other useful bits and pieces (Activeperl is not in the package). Even if you use my package, you might want to check if a newer version of the software included is available from the original sources.


III. WORKFLOW


1. Sentence splitting

The goal is simple: as Hunalign uses line breaks as sentence delimiters, we want to insert a line break at the end of every sentence before alignment.

First of all, you may not need to do this step at all. Your text may be split into sentences already, or you may be content with paragraph-level alignment, i.e. pairing paragraph to paragraph instead of sentence to sentence. Paragraph-level alignment is easier and faster to do and gives you more context when you do a concordance search, but it makes it harder to identify how a source language term was translated, as the chunks are longer. It also removes pretty much all hope of ever getting matches from the TM during translation. If either of your source texts comes from the enemy of all translators, a pdf file, and has line breaks at the end of each line, you will definitely need to do this. Read the following section even if you don't need to split your text into sentences, as some of the things in it are relevant to all alignment projects.

The sentence splitter I use is a perl script, so you'll need a perl interpreter. Install one if you don't have one installed already. I use Activeperl. (Note: the makers of Hunalign also made a sentence splitter, HunToken. It runs natively in Linux but needs Cygwin or MinGW in Windows. If you run Linux, you might want to try it instead of the tool described here. I never managed to get HunToken to run on my Windows machine.)

The sentence splitter script requires UTF-8 encoded txt files as input. Copy and paste your text to Notepad, choose save as, and pick UTF-8 from the drop-down menu at the bottom. To speed up both this step and alignment, merge all files into one, or, in case of huge texts, create files of about 50000 sentences each. (See below for tips.) Do this with both the source and the target language text, and save them in the folder where split-sentences.perl resides, with simple filenames for convenience.

The script identifies sentence boundaries based on characters like periods or question marks, capitalization and a list of exceptions. Exception lists for English and German are included in the package. You can adapt them to your languages and the texts you are processing. They are UTF-8 encoded txt files with a language-identifying file extension.
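
To make the idea concrete, here is a minimal Python sketch of this kind of splitting. It is not the Europarl script and is far cruder than it: the exception list is a made-up handful of abbreviations, and the only rule is "sentence-final punctuation followed by a capitalized word, unless the word is a listed exception".

```python
import re

# Hypothetical, tiny exception list for illustration only; the real
# lists are the nonbreaking prefix files shipped with the Europarl tools.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "e.g", "i.e", "etc"}


def split_sentences(text, exceptions=NONBREAKING_PREFIXES):
    """Return a list of sentences, each meant to go on its own line."""
    sentences = []
    current = []
    words = text.split()
    for i, word in enumerate(words):
        current.append(word)
        if i == len(words) - 1:
            break  # the last word always closes the final sentence
        ends_with_punct = re.search(r"[.?!]$", word) is not None
        # Drop the final period before checking the exception list.
        stem = word[:-1] if word.endswith(".") else word
        next_starts_upper = words[i + 1][:1].isupper()
        if ends_with_punct and next_starts_upper and stem not in exceptions:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


for sentence in split_sentences("Dr. Smith arrived. He was late, e.g. by an hour. Why?"):
    print(sentence)
```

Adding an abbreviation to the exception set is what stops a false break after it, which is the same reason you would adapt the exception lists to your own languages and texts.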

The script is run from a command line window. To open one, click Run in the Windows Start Menu, type cmd and press enter. Navigate to the folder where split-sentences.perl is located using the cd command. (To go up to the parent directory: type cd.. and press enter; to see the contents of the directory you're in: type dir and press enter. For more help on how to use the command line interface, type help and press enter, or just google around. Command line can also be used for nifty things like merging txt files that you are about to align. One possible command for this is "copy *.txt allfilesmerged.txt". All txt files in the current folder will be merged in the alphabetic order of their file names into allfilesmerged.txt. Obviously, you'll want to make sure that source and target language files are named identically to ensure that the alphabetic order and hence the order of text chunks is the same in both. Otherwise, you will have a giant mess on your hands after alignment.)
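
If you would rather not rely on the copy command, the merging step can be sketched in Python as well. This is only an illustration under a few assumptions: the files are already UTF-8 encoded, the function name and the output filename are my own choices, and the files are joined in the alphabetic order of their names.

```python
from pathlib import Path


def merge_txt_files(folder, output_name="allfilesmerged.txt"):
    """Merge all txt files in a folder, in alphabetic order of file name.

    Returns the list of merged file names so the order can be compared
    between the source language folder and the target language folder.
    """
    folder = Path(folder)
    output_path = folder / output_name
    # Collect the parts first and skip the output file from earlier runs.
    parts = sorted(
        (p for p in folder.glob("*.txt") if p.name != output_name),
        key=lambda p: p.name,
    )
    with output_path.open("w", encoding="utf-8", newline="\n") as out:
        for part in parts:
            text = part.read_text(encoding="utf-8")
            out.write(text)
            # Make sure the next file starts on a new line.
            if text and not text.endswith("\n"):
                out.write("\n")
    return [p.name for p in parts]
```

Running it on both folders and comparing the two returned lists is a quick way to confirm that the source and target chunks were merged in the same order.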

Once you are in the right folder, use the command "perl split-sentences.perl -l [en] (( [input.txt] )) [output.txt]". (Note: Here and everywhere else in this guide, )) and (( are used instead of the angle brackets > and < for fear that they would be misinterpreted as HTML tags by the site software. Substitute angle brackets in the commands and tags, or, even better, read the .doc version of this article contained in the mediafire package, where I didn't have to resort to such tricks.)
Obviously, everything in [] is to be replaced with the appropriate string. [en] stands for the English exception list, i.e. "nonbreaking_prefix.en" located in tools/nonbreaking_prefixes.

The script replaces paragraph breaks (multiple line breaks) with ((P)), so open output.txt with Word and replace all ((P)) with line breaks using the search and replace function (line breaks are symbolized by "^p"). Replace all tab characters ("^t") with a space while you're at it.
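
The same clean-up can be done without Word. The sketch below simply mirrors the two replacements described above; it assumes the paragraph marker appears in the file as P between real angle brackets, and the function name is my own.

```python
def clean_splitter_output(text):
    """Replace paragraph markers with line breaks and tabs with spaces."""
    # Tabs must go because Hunalign's output uses tabs as column separators.
    return text.replace("<P>", "\n").replace("\t", " ")


print(clean_splitter_output("First sentence.\n<P>\nSecond\tsentence.\n"))
```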


2. Alignment with Hunalign

Hunalign, the alignment tool discussed here, is run through command line so if you skipped part one above and don't know your way around command lines, go back and read it. It takes two text files and a dictionary file as input and produces a text file in which each line contains a sentence pair and a number showing the pairing "confidence" i.e. the certainty of the pairing, separated by tab. Hunalign also produces a report in the command line window during and after the process which you might want to check. (Note: apart from this guide, you may want to read the readme file provided with Hunalign for more details. This is intended as a more user-friendly but not comprehensive introduction to the use of Hunalign.)

Copy the two text files to be aligned, the source and the target, both in txt format, into the folder where hunalign.exe resides.
Open a command line window, navigate to the same folder, and type the command "hunalign -text data/hu-en.dic source.txt target.txt )) aligned.txt".
Without the -text parameter, Hunalign's output would be code and not text so don't forget to put it in there.
The "data/hu-en.dic" string is a reference to a dictionary file for the Hungarian-English language pair. For other language pairs, create a similar dictionary file from your own glossaries and word lists found on the Internet and use its filename in the command. Dictionary files are just txt files in the following format:

[target language term 1] @ [source language term 1]
[target language term 2] @ [source language term 2]
[target language term 3] @ [source language term 3]

Note that the order of the languages is reversed compared to the command that runs Hunalign.
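
If your glossary is already available as term pairs, a dictionary file in this format can be generated with a few lines of Python. This is just a sketch: the function name is hypothetical, and it assumes the pairs are given as (source term, target term) and that the result is saved as a UTF-8 txt file.

```python
def format_dictionary(pairs):
    """Turn (source_term, target_term) pairs into Hunalign dictionary lines.

    Each line has the target language term first, then " @ ", then the
    source language term. Pairs with an empty side are skipped.
    """
    lines = []
    for source_term, target_term in pairs:
        source_term = source_term.strip()
        target_term = target_term.strip()
        if source_term and target_term:
            lines.append(f"{target_term} @ {source_term}")
    return "".join(line + "\n" for line in lines)


glossary = [("állás", "job"), ("kérdés", "question")]
print(format_dictionary(glossary), end="")
```
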
If you have no dictionary, use the empty "data/null.dic" file and add the "-realign" parameter. See the Hunalign readme file for details.
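
For batch work, the command can also be assembled and run from a script. The sketch below uses only the two parameters mentioned in this guide, -text and -realign; the function names are mine, and it assumes that hunalign can be started from the current folder under the name "hunalign".

```python
import subprocess


def build_hunalign_command(dictionary, source, target, realign=False, exe="hunalign"):
    """Build the argument list for a Hunalign run with text output."""
    command = [exe, "-text"]
    if realign:
        # Meant for runs with the empty null.dic dictionary.
        command.append("-realign")
    command += [dictionary, source, target]
    return command


def run_hunalign(dictionary, source, target, output, realign=False):
    """Run Hunalign and save what it prints as the aligned file."""
    command = build_hunalign_command(dictionary, source, target, realign)
    # Writing stdout to a file plays the role of the redirection
    # at the end of the command typed in the command line window.
    with open(output, "wb") as out:
        subprocess.run(command, stdout=out, check=True)


print(" ".join(build_hunalign_command("data/null.dic", "source.txt", "target.txt", realign=True)))
```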

After alignment is done, open Excel. Select the C column, choose Format/Cells/Numbers and pick "text" from the options for the display of numbers. This is to make sure that Excel does not convert numbers into dates. Copy the output of Hunalign into the workbook. You should get a three-column table with text in the first two and the confidence numbers in the third. Wherever Hunalign has decided to merge two sentences into one TU (translation unit), it puts three tilde symbols (~~~) between them. Apart from that, the output is self-explanatory.
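
If you want to process the output with a script rather than Excel, each line can be taken apart as sketched below. This assumes exactly the layout described above: source text, target text and confidence separated by tabs, with ~~~ between merged sentences. The function name and the dictionary keys are my own.

```python
def parse_hunalign_line(line):
    """Split one line of Hunalign text output into its three parts."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    source, target, confidence = fields
    return {
        # An empty side gives an empty list; merged sentences give several items.
        "source": [s.strip() for s in source.split("~~~") if s.strip()],
        "target": [s.strip() for s in target.split("~~~") if s.strip()],
        "confidence": float(confidence),
    }


print(parse_hunalign_line("Első. ~~~ Második.\tFirst and second.\t0.75\n"))
```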


3. Reviewing and perfecting the alignment

With closely matching texts and a large dictionary, the output of Hunalign is very good. So good that you may decide to use it as it is, especially if the text is too long to make manual checking feasible. If you decide to manually review the alignment, here are some tips.
First, set the first two columns as wide as your monitor allows, then select the whole table and use the Format/Cells/Align/Wrap text setting to make all text appear on the screen.
If you hold CTRL and press an arrow button, Excel jumps to the next empty cell (or the next cell with content if you're in an empty cell). You can use this to look for empty cells, a common sign of alignment problems.
You can also use the confidence indicator numbers. For example, select the whole C column and use Format/Conditional formatting to change the colour of every cell with a value under a given limit to red. Then you can scroll through the document and stop at the red cells to see what's up. Or number the rows (put 1 in D1, pull it down by the bottom right corner and choose fill with series from the context menu), then sort the whole table by column C, the confidence values. All the problem rows will then end up at one end where you can find them, and you can restore the original order afterwards by sorting the table by the row numbers in column D.
If you enabled Wrap text, just scrolling through the table makes it obvious where the problem areas are. Any empty space you see is suspicious.
Whatever method you choose, make sure you don't shift one column in relation to the other when you make your corrections. With a dictionary, Hunalign is pretty much guaranteed to find the right alignment pretty soon… so even if a couple of consecutive TUs are incorrectly paired, the ones under them are correctly aligned. You don't want to keep messing up the rest of your alignment as you go along.
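
The two warning signs mentioned above, empty cells and low confidence numbers, can also be checked by a script. The following is only a sketch: the rows are assumed to be (source, target, confidence) tuples, and the default threshold of 0.0 is an arbitrary value chosen for illustration, not a recommendation from the Hunalign documentation.

```python
def find_suspect_rows(rows, threshold=0.0):
    """Return the 1-based numbers of rows that deserve a manual check."""
    suspects = []
    for number, (source, target, confidence) in enumerate(rows, start=1):
        has_empty_side = not source.strip() or not target.strip()
        if has_empty_side or confidence < threshold:
            suspects.append(number)
    return suspects


rows = [
    ("Első mondat.", "First sentence.", 1.2),
    ("", "A sentence with no partner.", 0.5),
    ("Harmadik mondat.", "Something unrelated.", -0.3),
]
print(find_suspect_rows(rows))
```

Because the function only reports row numbers and never moves anything, the two columns cannot get shifted in relation to each other.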


4. Creating a TM

There are several options here: you can use Plustools or do it manually in a variety of ways. I prefer the manual method because that way I can control the process more, and I will not go into the use of Plustools here. See its documentation if you decide to go down that road.
Which manual method you choose partly depends on what CAT tool you use. For Trados, you can create a bilingual document or a file for Trados txt import. For other tools, you can create a TMX file. These three methods are described below. Please note that in the case of the second and third, you are probably best off opening an export from your own CAT with Notepad and adapting the methods described here to recreate the structure of the TUs perfectly, and then pasting the TUs in the same file under the original header, replacing the original TUs and saving under a different name. Then you can be certain there will be no encoding, language code or structure troubles when you import.

4A. Bilingual text file

Perhaps the fastest way is to just generate a bilingual text file. Open Excel, and in column A, put "{0))", in column B, the source segments, in C, "((}100{))", in D, the target segments, and in E, "((0}". Obviously, you'll have to copy the symbols to all rows of A, C and E where there is text in B and D. Then select the whole table and copy it to Notepad. Select it there and copy it to Word. (The extra step is inserted because if you just copied straight into Word, you'd get a table as opposed to plain old text, which is what we want here.) Remove all tabs by replacing all "^t" with nothing. The text is ready to be saved as .doc and cleaned with Workbench.
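
The Excel columns above amount to wrapping every segment pair in the same three markers, which is easy to reproduce in a script. In this sketch the markers are written with real angle brackets rather than the stand-ins used in this guide; the function name and the match parameter are my own.

```python
def bilingual_segment(source, target, match=100):
    """Wrap a segment pair in the markers of a bilingual text file."""
    return "{0>" + source + "<}" + str(match) + "{>" + target + "<0}"


pairs = [("Job enquiries", "Álláskeresési kérdések")]
for source, target in pairs:
    print(bilingual_segment(source, target))
```

The resulting lines still need to be pasted into Word and saved as .doc, as described above.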

4B. Trados importable txt

My favourite method involves generating a txt file that can be imported into Trados TMs. These txt files have a header and then the TUs and accompanying data in a specific format. You are best off copying the header from one of your exported memories, so I will not go into what the header contains. At their simplest, TUs themselves look like this:

((TrU))
((CrD))23062008, 10:03:29
((CrU))ANDRAS_FARKAS
((Seg L=EN-GB))Job enquiries
((Seg L=HU))Álláskeresési kérdések
((/TrU))

Pretty straightforward: TrU stands for start of TU, CrD is creation date in "ddmmyyyy, hh:mm:ss" format, CrU is the creator's name (all capitals!), then the two language codes and texts, then a closing tag. You can add what Trados calls a text field using the ((Txt L=textfieldname)) tag. For example, by inserting "((Txt L=Document))Website of client X" after the CrU, the TU gets tagged as coming from the Website of client X, and this information is displayed if the TU comes up in concordance search.

To generate such a file, use the same method as described for bilingual files. A blank xls is provided in the package to make it easier. After filling in all the information (including replacing the language identifiers unless you happen to have a British English – Hungarian text pair on your hands, setting the date etc.), copy the whole thing to Notepad, then Word. Replace "((Seg L=[languagecode]))^t" with "((Seg L=[languagecode]))" for both languages to remove the unwanted tabs, then replace all remaining tabs with line breaks.
Copy the result under the header of a TM export and save as txt, ready for importing.
This method allows you to add all the information you may need to organize your TM. You can have several creation dates, creator names and other information assigned to the various TUs within the same TM.
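
As an alternative to the xls template, the TU structure shown above can be produced by a short script. This is a sketch only: it writes the tags with real angle brackets, the function and parameter names are hypothetical, the default values are taken from the sample TU, and the header still has to be copied from one of your own exports.

```python
from datetime import datetime


def trados_tu(source, target, source_lang="EN-GB", target_lang="HU",
              creator="ANDRAS_FARKAS", created=None, document=None):
    """Build one TU in the Trados txt import layout shown in the sample."""
    created = created or datetime.now()
    lines = [
        "<TrU>",
        # Creation date in "ddmmyyyy, hh:mm:ss" format.
        "<CrD>" + created.strftime("%d%m%Y, %H:%M:%S"),
        # The creator's name has to be in capitals.
        "<CrU>" + creator.upper(),
    ]
    if document:
        # Optional text field, placed after the CrU line.
        lines.append("<Txt L=Document>" + document)
    lines.append(f"<Seg L={source_lang}>{source}")
    lines.append(f"<Seg L={target_lang}>{target}")
    lines.append("</TrU>")
    return "\n".join(lines) + "\n"


print(trados_tu("Job enquiries", "Álláskeresési kérdések",
                created=datetime(2008, 6, 23, 10, 3, 29)), end="")
```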

4C. TMX

TMX is supposed to be a common TM exchange format. The one time I tried to use it I ran into encoding and language code issues so it's obviously not as standardized as we'd like it to be, but these issues are easy to solve and then TMX does work. Here's a link to the full description of TMX: http://www.lisa.org/Translation-Memory-e.34.0.html . The Specification Document contains a sample file as well.

TMX files, like Trados import files, have a header which you should extract from a TMX TM export of your own. Copy the entry structure and language codes from the same export to be sure of compatibility.
There is a ((body)) tag before the first TU, and a ((/body)) tag and a ((/tmx)) tag at the very end.

Typical TUs look like this:

((tu creationdate="20080623T144108Z" creationid="ANDRAS_FARKAS"))
((tuv xml:lang="HU"))
((seg))Álláskeresési kérdések((/seg))
((/tuv))
((tuv xml:lang="EN-GB"))
((seg))Job enquiries((/seg))
((/tuv))
((/tu))

An xls file for matching this structure is included in the package. In Word, replace "((seg))^t" with "((seg))" and "^t((/seg))" with "((/seg))", then replace all "^t" with "^p" and you should get a working TMX.
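
The same TU structure can also be generated by a script instead of Excel and Word. The sketch below writes the tags with real angle brackets and takes its default values from the sample TU; the function and parameter names are my own. One thing it adds that the search-and-replace method does not do: since TMX is XML, characters such as & and the angle brackets inside the segments are escaped.

```python
from xml.sax.saxutils import escape, quoteattr


def tmx_tu(source, target, source_lang="HU", target_lang="EN-GB",
           creation_date="20080623T144108Z", creation_id="ANDRAS_FARKAS"):
    """Build one TMX translation unit following the sample structure."""
    lines = [
        f"<tu creationdate={quoteattr(creation_date)} creationid={quoteattr(creation_id)}>",
        f"<tuv xml:lang={quoteattr(source_lang)}>",
        f"<seg>{escape(source)}</seg>",
        "</tuv>",
        f"<tuv xml:lang={quoteattr(target_lang)}>",
        f"<seg>{escape(target)}</seg>",
        "</tuv>",
        "</tu>",
    ]
    return "\n".join(lines) + "\n"


print(tmx_tu("Álláskeresési kérdések", "Job enquiries"), end="")
```

The units still have to be placed between the body tags, under a header copied from your own TMX export.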

And there you have it, this is the end of my alignment guide. I hope it was of use. If you want to send me any comments, send a PM at proz.com.


Copyright © ProZ.com, 1999-2024. All rights reserved.