Showing posts with label translation table. Show all posts

Tuesday, March 1, 2016

Translation: what's the difference?

Day 23

First let’s create some Connect:Direct translation tables using PowerShell:
PS C:\Users\nicke> conv ibm285 iso-8859-15 @(0..255) | Set-Content -Encoding byte ibm285-iso-8859-15.cdx

PS C:\Users\nicke> conv ibm01146 iso-8859-15 @(0..255) | Set-Content -Encoding byte ibm01146-iso-8859-15.cdx
Here we convert from codepage ibm285, IBM EBCDIC (UK), to iso-8859-15, which includes the Euro currency symbol. We convert every byte value from 0 through 255 (that is what the @(0..255) means), so the byte at offset n of the result is the translation of input byte n. The result is saved with the Connect:Direct Windows translation table file extension, .cdx.
We then do the same thing to produce a translation table that converts from ibm01146, IBM EBCDIC (UK-Euro), to iso-8859-15.
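To make the idea of a table concrete, here is a minimal sketch of applying one by hand. It assumes a table is simply the 256-byte array produced above, where entry n is the output for input byte n. The helper name apply_xlt is hypothetical and not part of Connect:Direct; the table is built in memory rather than read from a .cdx file.

```powershell
# Hypothetical helper: translate a buffer through a 256-entry table.
# Assumes $tbl[n] holds the output byte for input byte n.
function apply_xlt ([byte[]]$tbl, [byte[]]$buf) {
    if ($tbl.Length -ne 256) { throw "expected a 256-byte table" }
    [byte[]]($buf | ForEach-Object { $tbl[[int]$_] })
}

# Build the ibm285 -> iso-8859-15 table in memory
$from = [System.Text.Encoding]::GetEncoding("ibm285")
$to   = [System.Text.Encoding]::GetEncoding("iso-8859-15")
$tbl  = [System.Text.Encoding]::Convert($from, $to, [byte[]](0..255))

# EBCDIC bytes c9 c2 d4 are the letters "IBM"
[byte[]]$out = apply_xlt $tbl ([byte[]](0xc9, 0xc2, 0xd4))
[System.Text.Encoding]::ASCII.GetString($out)
```

Running this should print IBM, since each EBCDIC letter is looked up at its own offset in the table.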
Using the PowerShell function below we can see the difference between these two translation tables.
# List difference between translation tables
function xlt_diff ([byte[]]$tbla,[byte[]]$tblb) {
    0..255 | %{
        if($tbla[$_] -ne $tblb[$_]) {
            "{0:x} : {1:x} | {2:x}" -f $_,$tbla[$_],$tblb[$_]
        }
    }
}
The above function can be used as follows:
PS C:\Users\nicke> xlt_diff (cat -Encoding byte .\ibm285-iso-8859-15.cdx) (cat -Encoding byte .\ibm01146-iso-8859-15.cdx)
9f : 3f | a4
The output above shows that the two translation tables differ only in how they map hex byte value 0x9f. The first table maps it to 0x3f (a question mark, substituted because the character has no equivalent in iso-8859-15), and the second maps it to 0xa4 (the Euro symbol in iso-8859-15).
Now we create the translation tables for translating back from iso-8859-15 to ibm01146 and ibm285, and compare them like so:
PS C:\Users\nicke> conv iso-8859-15 ibm01146 @(0..255) | Set-Content -Encoding byte iso-8859-15-ibm01146.cdx

PS C:\Users\nicke> conv iso-8859-15 ibm285 @(0..255) | Set-Content -Encoding byte iso-8859-15-ibm285.cdx

PS C:\Users\nicke> xlt_diff (cat -Encoding byte .\iso-8859-15-ibm01146.cdx) (cat -Encoding byte .\iso-8859-15-ibm285.cdx)
a4 : 9f | 6f
Here the translation tables differ in how they convert the Euro (€) symbol in iso-8859-15 (0xa4) to the two mainframe codepages.
This is not that surprising, as ibm01146 has the Euro (€) at 0x9f and codepage ibm285 does not, so the ibm285 table falls back to 0x6f, the EBCDIC question mark. In fact, if you look up codepage 1146 on Wikipedia you will see that ibm01146 was created to be ibm285 with the addition of the Euro (€) symbol.
I chose these two codepages as a simple example to show how to find the difference between translation tables.
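The claim about byte 0x9f can be checked directly by decoding that one byte under each mainframe codepage. This is a small sketch; codepoint_at is a hypothetical helper name introduced only for this illustration.

```powershell
# Hypothetical helper: Unicode code point that a single byte decodes to
function codepoint_at ($name, [byte]$value) {
    $enc = [System.Text.Encoding]::GetEncoding($name)
    [int]($enc.GetString([byte[]]($value))[0])
}

# Show what 0x9f means in each EBCDIC codepage
foreach ($name in "ibm285", "ibm01146") {
    "{0,-9} 0x9f -> U+{1:X4}" -f $name, (codepoint_at $name 0x9f)
}
```

The first line should report U+00A4 (¤, the generic currency sign) and the second U+20AC (€), which is why only that one table entry differs.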
These last two posts were about creating custom codepage translation tables for Connect:Direct, and spotting the differences between tables.
Next time we will look at displaying what maps to what more easily with these translation tables, and show a generally better way of translating from one codepage to another.

Thursday, February 18, 2016

Power Translation

Day 22

While on assignment some time ago, the only scripting language I had available to me was PowerShell. It turned out to be a very useful tool. Today I will share some of its features that helped me with Connect:Direct custom translation tables.
I needed help in understanding the translation of certain characters between the Mainframe and Windows platforms, and in seeing quickly what was different between a customised translation table and the default table on a Windows machine.
I also wanted to experiment with the translation tables without always having to be on the machine that had Connect:Direct on it.
My solution was a set of PowerShell helper functions that allowed me to examine, compare, generate and test Connect:Direct translation tables without necessarily having Connect:Direct available.
Here are some simple PowerShell functions to illustrate. Please be aware that I have deliberately kept these minimal for brevity.
# List all codepages
function lsenc () {
    [System.Text.Encoding]::GetEncodings()
}

# Get an object representing the codepage
function getenc ($str) {
    [System.Text.Encoding]::GetEncoding($str)
}

# Simple filter that displays bytes as hex values
function hex {
    $input | %{ write-host -NoNewline ("{0:x2} " -f $_)}
    ""
}

# Converts a string or byte buffer from one codepage to another
function conv ($from,$to,$buf) {
    $from_enc=getenc $from
    $to_enc=getenc $to
    if($buf -is [string]) {
        # A string is first encoded to bytes using the source codepage
        $buf=$from_enc.GetBytes($buf)
    }
    [System.Text.Encoding]::Convert($from_enc,$to_enc,[byte[]]$buf)
}
The above functions can be used as follows:
# Test for well known EBCDIC value
PS C:\Users\nicke> conv 1252 37 " " | hex
40 
# Test for well known ASCII value
PS C:\Users\nicke> conv 37 1252 @(0x40) | hex
20 
# hex value for £ in Windows
PS C:\Users\nicke> conv 1252 1252 "£" | hex
a3 
# hex value for £ UTF-8
PS C:\Users\nicke> conv 1252 utf-8 "£" | hex
c2 a3 
# hex value for £ in UTF-16
PS C:\Users\nicke> conv 1252 utf-16 "£" | hex
a3 00 
# hex value of £ in IBM EBCDIC (UK-Euro)
PS C:\Users\nicke> conv 1252 1146 "£" | hex
5b
You can also do the same thing with longer strings and even contents of files.
So if I have a file that is encoded using the mainframe codepage IBM-1146 like so:
PS C:\Users\nicke> (cat -Encoding Byte ibm1146.1146) | hex
c9 c2 d4 60 f1 f1 f4 f6 

PS C:\Users\nicke>
I can translate it to Windows 1252 like so:
PS C:\Users\nicke> conv 1146 1252 (cat -Encoding Byte ibm1146.1146) | Set-Content -Encoding byte ibm1146.1252

PS C:\Users\nicke> type ibm1146.1252

IBM-1146

PS C:\Users\nicke>
So as you can see, the 3rd parameter to the conv function can be a string, a byte array, or the contents of a file read as a byte array.
I wouldn’t use this for large files, only for small files to help understand the codepages and the translation between them.
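One further way to explore a pair of codepages is to check which byte values do not survive a round trip. The sketch below assumes both codepages are single-byte; lossy_bytes is a hypothetical helper name, and the exact list it prints depends on how the .NET runtime substitutes unmappable characters.

```powershell
# Hypothetical helper: list byte values (in hex) that change when converted
# from one single-byte codepage to another and back again.
function lossy_bytes ($from, $to) {
    $a = [System.Text.Encoding]::GetEncoding($from)
    $b = [System.Text.Encoding]::GetEncoding($to)
    $orig  = [byte[]](0..255)
    $there = [System.Text.Encoding]::Convert($a, $b, $orig)
    $back  = [System.Text.Encoding]::Convert($b, $a, $there)
    0..255 | Where-Object { $back[$_] -ne $orig[$_] } |
        ForEach-Object { "{0:x2}" -f $_ }
}

# Which Windows-1252 bytes are lost on a trip through IBM EBCDIC (UK-Euro)?
lossy_bytes 1252 1146
```

Any value listed is a character that exists in the first codepage but has no exact counterpart in the second.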
Next time we will look at actual Connect:Direct translation tables, and how to create custom translation tables easily with some PowerShell functions.

Sunday, December 16, 2012

Mistaken Identity

Day 18 

Most problems with translation tables amount to a case of mistaken identity. Sometimes it is the source of the file that is assumed to be something, but turns out to be something else.

For example someone says they are having problems transferring a file from a VMS system to UNIX and the Excel spreadsheet is not arriving in the correct format.

Well, in this case you cannot just look at the problem from a VMS/UNIX perspective. The Excel spreadsheet probably originated from a Windows machine. So how was it transferred to the VMS machine? Was it transferred in binary mode? Is the spreadsheet file really an Excel file, or just a .csv file?

The answers to those questions have an impact on the problem and its solution. If the file was truly an Excel spreadsheet then you would want to transfer it in binary mode, so that the file ends up at its destination byte for byte the same as at the source of the transfer, no matter how many hops there are in the transfer.

It all depends on what is being used to produce the file to be transferred and what will end up consuming/processing the file at the destination. In the case of an Excel spreadsheet it will be a piece of software expecting a file the same as would be on a Windows machine, hence the binary mode.

If the source of the transfer was a .csv file (comma separated values), i.e. a text file, and it was to be consumed/processed on a UNIX machine by an application, then we would want the file to arrive on the UNIX machine as a UNIX text file, with each line terminated by a newline character as opposed to the carriage-return and newline characters used on a Windows platform.

For this to happen we want Connect:Direct to treat the file as a text file and not binary as in the previous example. So we would not specify DATATYPE binary as before but use the default DATATYPE which is TEXT.
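The line-ending difference described above is easy to see at the byte level. This is only an illustration of the two text conventions using made-up sample data; it does not show what Connect:Direct itself does internally.

```powershell
# Hypothetical helper: render bytes as space-separated hex
function to_hex ([byte[]]$bytes) {
    ($bytes | ForEach-Object { "{0:x2}" -f $_ }) -join " "
}

$ascii   = [System.Text.Encoding]::ASCII
# The same two-line CSV content with Windows and UNIX line endings
$windows = $ascii.GetBytes("a,b`r`nc,d`r`n")
$unix    = $ascii.GetBytes("a,b`nc,d`n")

to_hex $windows   # 61 2c 62 0d 0a 63 2c 64 0d 0a
to_hex $unix      # 61 2c 62 0a 63 2c 64 0a
```

A binary transfer would deliver the 0d bytes unchanged, which is right for a spreadsheet but not for a text file that a UNIX application will read line by line.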

Sometimes you are told, with absolute certainty, which codepages are being used on both ends of the proposed Connect:Direct transfer.

For example, you may be told that a text file transferred from a mainframe was produced using codepage IBM-1140 and that the application on a Windows machine receiving the file is using UTF-8, an encoding for Unicode.

It really does depend on the application that will consume/process the file. It might be assumed that the application can handle UTF-8, or that as ASCII is a subset of UTF-8 there should be no problem.

In this case an international character might be used within the file on the mainframe that is available to it within the IBM-1140 codepage and this will be translated to the corresponding UTF-8 encoding of Unicode.

For characters that map directly to a single-byte ASCII/UTF-8 value there will not be a problem, but international characters can be encoded as 2, 3 or even 4 bytes in UTF-8.

This is because UTF-8 is a variable-length character encoding. If the application is written to use Windows codepage CP-1252, then it will only be expecting single-byte characters and not the multi-byte sequences that UTF-8 can produce. It will then probably choke on the multi-byte encoding, or just not recognise what it is supposed to represent, and not process the file properly.
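A short sketch of the mismatch, using three sample characters chosen for illustration. The characters are built from code points so the example does not depend on how the script file itself is encoded.

```powershell
$utf8   = [System.Text.Encoding]::UTF8
$cp1252 = [System.Text.Encoding]::GetEncoding(1252)

# A (U+0041), pound sign (U+00A3) and Euro sign (U+20AC)
foreach ($code in 0x41, 0xA3, 0x20AC) {
    $bytes = $utf8.GetBytes([string][char]$code)
    $hex   = ($bytes | ForEach-Object { "{0:x2}" -f $_ }) -join " "
    "U+{0:X4} : {1} byte(s) in UTF-8 : {2}" -f $code, $bytes.Length, $hex
}

# What a CP-1252 application sees when handed the UTF-8 bytes for the pound sign
$pound_utf8 = $utf8.GetBytes([string][char]0xA3)
$misread    = $cp1252.GetString($pound_utf8)
$misread.Length
```

The three characters take 1, 2 and 3 bytes respectively, and the pound sign read as CP-1252 becomes two characters instead of one, which is the kind of surprise a single-byte application is not written to handle.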

Imagine that data entered into an application on the mainframe is using one codepage, but the application was programmed to use a field delimiter character from another codepage. The file the application produces on the mainframe will contain data from one codepage and delimiters from another, and is then transferred to another machine with codepage translation specified for the destination.

You may not be surprised to find that the field delimiter characters were not translated correctly for the destination.
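To illustrate, here is a sketch of a record whose fields are encoded with IBM-1140 but whose delimiter byte was taken from a different EBCDIC codepage. Codepage 500 and the "|" delimiter are assumptions chosen for the example; the post does not say which second codepage or delimiter was involved.

```powershell
$cp1140 = [System.Text.Encoding]::GetEncoding(1140)
$cp500  = [System.Text.Encoding]::GetEncoding(500)   # assumed second codepage
$cp1252 = [System.Text.Encoding]::GetEncoding(1252)

# Fields encoded with IBM-1140, delimiter byte taken from codepage 500's "|"
$delim = $cp500.GetBytes("|")
[byte[]]$record = $cp1140.GetBytes("AB") + $delim + $cp1140.GetBytes("CD")

# Translate the whole record as if every byte were IBM-1140
$translated = $cp1252.GetString(
    [System.Text.Encoding]::Convert($cp1140, $cp1252, $record))
$translated

# For comparison, the byte IBM-1140 itself uses for "|"
"{0:x2} vs {1:x2}" -f $cp1140.GetBytes("|")[0], $delim[0]
```

The letters arrive intact, but the delimiter comes out as some other character because the two codepages place "|" at different byte values.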

In this particular case I suggested that the application programmer on the mainframe use a particular hex value character for the field delimiter that was available within the IBM-1140 codepage, and it turned out that the application on the Windows machine was using CP-1252 and not UTF-8.

It turned out for this particular file and applications that there were no special codepage requirements, as the default translation was sufficient.

So next time someone is emphatically, absolutely certain about codepage requirements, it might just be a good idea to check the facts for yourself, as it is easy for people to get this wrong.