How to convert TMX to tab-delimited?
Thread poster: Hans Lenting
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Multi Oct 14, 2022

Stepan Konev wrote:

Hans Lenting wrote:
Replace them with a unique sign?
Err.. what lines do you mean? I don't quite understand.


Multi-line segments


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 17:54
Member (2006)
English to Afrikaans
+ ...
New lines Oct 14, 2022

Some people use the term "new line" or "newline" for characters that are either a carriage return, a line feed, or a combination of both. The TMX specification uses the term "line-break". It is my understanding that TMX allows both characters, and that both characters are assumed to have their actual meaning.

Of course, it's possible that some converters also convert one to the other. For example, a converter might convert line feed (common under Linux) to carriage return + line feed (common under Windows). It can be difficult to decide whether converting them would be good or bad.

What does the \n mean in both of your regular expressions?


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
A totally new line Oct 14, 2022

Samuel Murray wrote:

Some people use the term "new line" or "newline" for characters that are either a carriage return, a line feed, or a combination of both. The TMX specification uses the term "line-break". It is my understanding that TMX allows both characters, and that both characters are assumed to have their actual meaning.

Of course, it's possible that some converters also convert one to the other. For example, a converter might convert line feed (common under Linux) to carriage return + line feed (common under Windows). It can be difficult to decide whether converting them would be good or bad.

What does the \n mean in both of your regular expressions?


Guess what: new line.

What I am referring to are segments that consist of several lines.

When aligning source and target to tab-del, these segments become a problem.

So my question was: should I replace the line-break with a ÿ or similar?


 
Stepan Konev
Stepan Konev  Identity Verified
Russian Federation
Local time: 18:54
English to Russian
Ah, line breaks Oct 14, 2022

Well, the regex (\n|.) means a line break (\n) or (|) any character (.), so it covers both single-line and multi-line segments. However, the multi-line segments may still need manual work (removing the paragraph mark) before the text is converted into a two-column table. A second option is to use a regex that only covers single-line segments, sacrificing the multi-line ones. Not sure which is the lesser evil, though.
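
For anyone who wants to try this outside a text editor, here is a minimal Python sketch of the difference between "." on its own and the (\n|.) alternation. The sample string is made up for illustration, and Python's re module may behave differently from the regex engine in your editor.

```python
import re

# Hypothetical sample: one segment spread over two lines.
text = "<seg>- First line\n- Second line</seg>"

# By default "." does not match a line feed, so this finds nothing.
dot_only = re.search(r"<seg>.+?</seg>", text)

# "(\n|.)" accepts a line feed or any other character, so it matches.
either = re.search(r"<seg>(\n|.)+?</seg>", text)

print(dot_only)                 # None
print(repr(either.group(0)))    # the whole multi-line segment
```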

[Edited at 2022-10-14 23:16 GMT]


 
Stepan Konev
Stepan Konev  Identity Verified
Russian Federation
Local time: 18:54
English to Russian
Replace line breaks with spaces Oct 14, 2022

Hans Lenting wrote:
So my question was: should I replace the line-break with a ÿ or similar?
I don't know about your specific editor for Mac, but as regards Notepad++, it does not use soft line breaks. It simply inserts paragraph marks instead. That is why you have to fix it manually. If your editor uses line breaks, then replace them with spaces. This would make one segment from two.

[Edited at 2022-10-14 19:24 GMT]


 
Milan Condak
Milan Condak  Identity Verified
Local time: 17:54
English to Czech
TMLookUP Oct 14, 2022

Hans Lenting wrote:

I am looking for an easy way to convert TMX files (~100 MB) to tab-delimited files.

What are the other options?


My favorite tool is TMLookUP (for Windows)

https://farkastranslations.com/tmlookup.php

1. I create a database
2. I import the TMX
3. I export to TXT (either everything or only the hits, i.e. sentences with a match in one or both languages)

It works fine with DGT Translation Memories, too:

https://joint-research-centre.ec.europa.eu/language-technology-resources/dgt-translation-memory_en#dgt-memory

Milan


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Smart approach Oct 14, 2022

Stepan Konev wrote:

Well, the regex (\n|.) means a line break (\n) or (|) any character (.), so it covers both single-line and multi-line segments. However, the multi-line segments may still need manual work (removing the paragraph mark) before the text is converted into a two-column table. A second option is to use a regex that only covers single-line segments, sacrificing the multi-line ones. Not sure which is the lesser evil, though.


I’ll try to come up with a regex that replaces all line-breaks with ÿ, except when they are followed by <seg>.


[Edited at 2022-10-14 19:59 GMT]


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
These segments Oct 14, 2022

<seg>- First line
- Second line
- Third line </seg>

Something like:

Find: \n(?!(<seg>))
Replace: ÿ
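
The same find/replace can be tried in Python. This is only a sketch: the function name flatten_segments and the sample text are made up, and it assumes the file has already been prepared so that every segment starts at the beginning of a line with <seg>.

```python
import re

def flatten_segments(text: str, marker: str = "ÿ") -> str:
    """Replace every line feed that is not directly followed by <seg>."""
    return re.sub(r"\n(?!<seg>)", marker, text)

# Hypothetical sample: a three-line segment followed by a one-line segment.
sample = (
    "<seg>- First line\n"
    "- Second line\n"
    "- Third line </seg>\n"
    "<seg>Next segment</seg>"
)

flattened = flatten_segments(sample)
print(flattened)
```

Note that a line feed at the very end of the file is not followed by <seg> either, so it would also be replaced by the marker.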



[Edited at 2022-10-14 19:56 GMT]


 
Stepan Konev
Stepan Konev  Identity Verified
Russian Federation
Local time: 18:54
English to Russian
Checkbox in Notepad++ Oct 14, 2022

Ok, I can't figure out a regex so far, but Notepad++ has a checkbox '. matches newline'. When it is checked, the match covers all new lines. Solution: use Windows

Update
Got it:
(?<=<seg>)(\r|\n|.)+?(?=</seg>)
This regex seems to cover new lines.
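
A quick way to check this regex outside Notepad++ is Python's re module, which accepts the same lookbehind and lookahead syntax. The sample below is invented, and the second pattern (using the DOTALL flag) is my own equivalent of the '. matches newline' checkbox, not something taken from Notepad++.

```python
import re

# The regex from the post, unchanged.
SEG = re.compile(r"(?<=<seg>)(\r|\n|.)+?(?=</seg>)")

# With DOTALL, "." also matches line breaks, so the alternation is not needed.
SEG_DOTALL = re.compile(r"(?<=<seg>).+?(?=</seg>)", re.DOTALL)

# Hypothetical sample with one Windows-style multi-line segment.
sample = (
    "<tuv><seg>- First line\r\n- Second line</seg></tuv>\n"
    "<tuv><seg>One line</seg></tuv>"
)

segments = [m.group(0) for m in SEG.finditer(sample)]
segments_dotall = [m.group(0) for m in SEG_DOTALL.finditer(sample)]
print(segments)
```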

[Edited at 2022-10-14 21:32 GMT]


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Pilcrow Oct 15, 2022

Hans Lenting wrote:

<seg>- First line
- Second line
- Third line </seg>

Something like:

Find: \n(?!(<seg>))
Replace: ÿ


Tested it, used the ¶ for the replacement. Better surround it with spaces, for better term recognition later, perhaps.


Now every multi-line segment will be placed in its own "table cell", so that no lines are left out.


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Further preparation Oct 15, 2022

Further preparation:

Remove in-segment tags:

Find:
<[eb]pt.*?>

Replace:
Nothing

Find:
<ph.*?>

Replace:
Nothing

Convert entities:

Find:
&amp;

Replace:
&

Same for: &lt;, &gt;, &apos; and &quot;

Remove all other entities:

Find:
&.*?;

Replace:
Nothing

Further cleaning:

Remove dashed lines:

Find:
_{2,}

Replace:
Nothing
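
The steps above can be sketched in Python as follows. The function name clean_segment and the sample are made up. The tag and underscore patterns are copied from the list, so they remove only what those patterns match. Two things are done differently, and both are my own assumptions: &amp; is decoded last, and "all other entities" is narrowed to &name; or &#number; patterns. Otherwise a pattern like &.*?; could swallow ordinary text between a freshly decoded & and the next semicolon.

```python
import re

# The predefined XML entities other than &amp; (which is handled last).
PREDEFINED = {"&lt;": "<", "&gt;": ">", "&apos;": "'", "&quot;": '"'}

def clean_segment(seg: str) -> str:
    # Remove in-segment tags (patterns copied from the post).
    seg = re.sub(r"<[eb]pt.*?>", "", seg)
    seg = re.sub(r"<ph.*?>", "", seg)
    # Convert the predefined entities.
    for entity, char in PREDEFINED.items():
        seg = seg.replace(entity, char)
    # Remove all other entities, but leave &amp; alone for now.
    seg = re.sub(r"&(?!amp;)#?\w+;", "", seg)
    # Decode &amp; last, so the resulting "&" is never mistaken for an entity.
    seg = seg.replace("&amp;", "&")
    # Remove dashed lines made of two or more underscores.
    seg = re.sub(r"_{2,}", "", seg)
    return seg

cleaned = clean_segment('Press <ph x="1"/>OK &amp; wait&nbsp;____ &lt;now&gt;')
print(cleaned)
```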


[Edited at 2022-10-15 06:42 GMT]


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
If it cannot be avoided Oct 15, 2022

Stepan Konev wrote:
Solution: use Windows


I use Windows apps whenever I cannot avoid them. E.g. Trados. Or a silly SonicWall component to log in to a client’s Plunet portal (very irritating).


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 17:54
Member (2006)
English to Afrikaans
+ ...
@Hans Oct 15, 2022

Hans Lenting wrote:
When aligning source and target to tab-del, these segments become a problem.


Yes, so you have to first replace all whitespace characters (except spaces, duh) with replacement characters.

I wonder if it would be sufficient (and if it would "work" in your target CAT tool) if you were to replace them with numbered entities:

&#09; = horizontal tab
&#10; = line feed
&#13; = carriage return
&#13;&#10; = carriage return + line feed (i.e. Windows line endings)

So, if your text editor can't distinguish between carriage returns and line feeds, then you'd just replace \n with &#10; inside segments. Windows programs tend to "understand" both line feeds and carriage returns + line feeds, but not carriage returns on their own, but I'm not sure about Mac programs... I'm under the impression that the carriage return by itself is a Mac thing...?
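
As a rough illustration of the numbered-entity idea, here is a small Python sketch. The name encode_whitespace and the sample string are invented, and whether the target CAT tool actually interprets these entities is exactly the open question raised above.

```python
# Map each whitespace character (by code point) to its numbered entity.
WHITESPACE_TO_ENTITY = {
    ord("\t"): "&#09;",   # horizontal tab
    ord("\n"): "&#10;",   # line feed
    ord("\r"): "&#13;",   # carriage return
}

def encode_whitespace(segment: str) -> str:
    """Replace tabs, line feeds and carriage returns inside one segment."""
    return segment.translate(WHITESPACE_TO_ENTITY)

encoded = encode_whitespace("Name:\tvalue\r\nNext line")
print(encoded)
```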

When I convert TMs, and I have a replacement character in either the source or target text (I usually use {{LF}} and {{TAB}}), I add a flagging character such as ". " or "# " to the start of the source text in order to prevent it from being a 100% match to anything (and since the flagging character is clearly visible in the comparison window of the CAT tool, I know instantly that the segment is one that had some work done on it).
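
In Python, that placeholder-plus-flag approach might look like the sketch below. The function name protect_pair is made up, and treating a lone carriage return the same as a line feed is my assumption, not part of the description above.

```python
def protect_pair(source: str, target: str, flag: str = "# ") -> tuple[str, str]:
    """Swap line breaks and tabs for placeholders and flag changed pairs."""

    def substitute(text: str) -> str:
        # Assumption: all three line-ending styles become one {{LF}}.
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        return text.replace("\n", "{{LF}}").replace("\t", "{{TAB}}")

    new_source, new_target = substitute(source), substitute(target)
    # Flag the source if either side needed a replacement, so the pair
    # can no longer be a 100% match.
    if (new_source, new_target) != (source, target):
        new_source = flag + new_source
    return new_source, new_target

print(protect_pair("Line 1\nLine 2", "Regel 1\tRegel 2"))
print(protect_pair("Hello", "Hallo"))
```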

Stepan Konev wrote:
Notepad++ has a checkbox '. matches newline'.

Different text editors do things differently: for some, "." also matches line-break characters, but for others it doesn't (so you have to specify them explicitly or, as with N++, check a box). Some text editors can't even handle regular expressions that span line breaks, i.e. they treat the line break as the end of the matchable content.

Note that the regular expressions in text editors tend to be deliberately limited or customized because text editor users typically have very specific sets of things that they want to do.

Hans Lenting wrote:
Find: \n(?!(<seg>))
Replace: ÿ

Or just replace \n with ① and \t with ② throughout the file. There is no need to restrict it to segments: since you're not going to use the TMX file after this, it doesn't matter if you replace tabs and new lines outside of segments with other stuff (unless your content extraction method relies on the assumption that <seg> elements always start and end on a line break, which is a dodgy assumption).
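
To show why replacing everything is harmless, here is a rough Python sketch that flattens the whole file first and only then pulls out the segments. The function name, the sample and the two-<seg>-per-<tu> rule are all assumptions for illustration. It is regex-based and not a real TMX parser, and it does not depend on where <seg> elements start or end.

```python
import re

def tmx_to_tab_delimited(tmx_text: str) -> list[str]:
    # Replace line breaks and tabs everywhere, inside and outside segments.
    flat = tmx_text.replace("\r\n", "\n").replace("\r", "\n")
    flat = flat.replace("\n", "①").replace("\t", "②")
    rows = []
    # "<tu[ >]" matches <tu> or <tu ...> but not <tuv ...>.
    for unit in re.findall(r"<tu[ >].*?</tu>", flat):
        segs = re.findall(r"<seg>(.*?)</seg>", unit)
        if len(segs) == 2:  # assumption: one source and one target per unit
            rows.append("\t".join(segs))
    return rows

# Hypothetical sample with one translation unit and multi-line segments.
sample = (
    "<tu>\n"
    '\t<tuv xml:lang="de"><seg>- Erste Zeile\n- Zweite Zeile</seg></tuv>\n'
    '\t<tuv xml:lang="nl"><seg>- Eerste regel\n- Tweede regel</seg></tuv>\n'
    "</tu>\n"
)

rows = tmx_to_tab_delimited(sample)
print(rows)
```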

[Edited at 2022-10-15 10:14 GMT]


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 17:54
Member (2006)
English to Afrikaans
+ ...
(ignore) Oct 15, 2022

(ignore... I forgot that you're not trying to *fix* the TMX; you're trying to *convert* it).

[Edited at 2022-10-15 10:53 GMT]


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 17:54
Member (2006)
English to Afrikaans
+ ...
@Hans II Oct 15, 2022

Since it would seem that you're trying to create your own TMX-to-text converter, you have to decide what you want to do with disallowed entities in the file. Ask yourself: if the TMX file contains this:

<seg>Hello&hellip;</seg>

what did the original CAT tool try to accomplish here?

When you convert it to tabbed text, would you then change it to:

Hello…

or to:

Hello&hellip;

Did the CAT tool intend to have three dots in the texts or did it intend to have something that actually looks like an entity? You can only know the answer to this question if you can have a look at the text inside the CAT tool's own editing field.
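
To make the choice concrete, here is a small Python sketch of both outcomes. It uses only the standard library. Using html.unescape to decode &hellip; is my assumption about what "convert" could mean here, since &hellip; is an HTML entity and not one of the five entities that XML predefines.

```python
import html
import xml.etree.ElementTree as ET

raw = "<seg>Hello&hellip;</seg>"

# A strict XML parser rejects the entity because nothing declares it.
try:
    ET.fromstring(raw)
    parsed_ok = True
except ET.ParseError:
    parsed_ok = False

# Option 1: treat it as an HTML entity and decode it to the ellipsis character.
as_character = html.unescape("Hello&hellip;")

# Option 2: keep the text exactly as it appears in the file.
as_literal = "Hello&hellip;"

print(parsed_ok, as_character, as_literal)
```

Which option is right still depends on what the original CAT tool meant, so the code can only show the two candidates, not pick one.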

[Edited at 2022-10-15 10:58 GMT]


 