Sunday, December 16, 2012

Mistaken Identity

Day 18 

Most problems with translation tables amount to a case of mistaken identity. Sometimes the source file is assumed to be one thing but turns out to be something else.

For example, someone says they are having problems transferring a file from a VMS system to UNIX: the Excel spreadsheet is not arriving in the correct format.

Well, in this case you cannot just look at the problem from a VMS/UNIX perspective. The Excel spreadsheet probably originated from a Windows machine. So how was it transferred to the VMS machine? Was it transferred in binary mode? Is the spreadsheet file really an Excel file, or just a .csv file?

The answers to those questions have an impact on the problem and its solution. If the file was truly an Excel spreadsheet then you would want to transfer it in binary mode, so the file ends up at its destination literally the same as it was at the source of the transfer, no matter how many hops there are along the way.

It all depends on what is being used to produce the file to be transferred and what will end up consuming/processing the file at the destination. In the case of an Excel spreadsheet it will be a piece of software expecting the file to be exactly as it would be on a Windows machine, hence the binary mode.
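One way to convince yourself that a binary transfer really has left the file untouched is to compare a checksum at each end. The following is just an illustrative Python sketch (the file name is made up), not anything Connect:Direct does for you:

    import hashlib

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:       # binary mode: read the bytes exactly as stored
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    # Run this at the source and at the destination; after a binary-mode
    # transfer the two digests should be identical, however many hops
    # the file went through.
    print(sha256_of("spreadsheet.xlsx"))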

If the source of the transfer was a .csv file (comma separated values), i.e. a text file, and it was to be consumed/processed by an application on a UNIX machine, then we would want the file to arrive on the UNIX machine as a UNIX text file, with each line terminated by a newline character as opposed to the carriage-return and newline pair used on a Windows platform.

For this to happen we want Connect:Direct to treat the file as a text file, not binary as in the previous example. So we would not specify DATATYPE binary as before, but use the default DATATYPE, which is TEXT.
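Conceptually, the text-mode translation of line endings amounts to nothing more than the following Python sketch. Connect:Direct does this for you as part of a text transfer; the code is only there to show the effect, and the sample data is made up:

    # Illustrative only: converting Windows line endings to UNIX ones.
    windows_text = b"name,amount\r\nWidget,42\r\n"     # carriage-return + newline per line

    unix_text = windows_text.replace(b"\r\n", b"\n")   # newline only, as UNIX expects

    print(unix_text)                                   # b'name,amount\nWidget,42\n'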

Sometimes you are told, with absolute certainty, which codepages are being used on both ends of the proposed Connect:Direct transfer.

For example, you may be told that a text file transferred from a mainframe was produced using codepage IBM-1140 and that the application on a Windows machine receiving the file is using UTF-8, an encoding for Unicode.

It really does depend on the application that will consume/process the file. It might be assumed that the application can handle UTF-8, or that, as ASCII is a subset of UTF-8, there should be no problem.

In this case an international character that is available within the IBM-1140 codepage might be used within the file on the mainframe, and this will be translated to the corresponding UTF-8 encoding of the Unicode character.
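Python happens to ship codecs for both encodings, so the translation can be sketched directly. The euro sign below is just one example of an international character that IBM-1140 provides (IBM-1140 being the euro-enabled variant of IBM-037); the byte values shown are for illustration:

    # One international character, first as a single IBM-1140 (EBCDIC) byte...
    ebcdic_bytes = "€".encode("cp1140")
    character = ebcdic_bytes.decode("cp1140")

    # ...then translated to UTF-8 for the receiving application.
    utf8_bytes = character.encode("utf-8")

    print(ebcdic_bytes)   # b'\x9f'          - one byte on the mainframe
    print(utf8_bytes)     # b'\xe2\x82\xac'  - three bytes in UTF-8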

For characters that map directly to a single ASCII/UTF-8 byte there will not be a problem, but international characters can be encoded as one, two, three or even four bytes in UTF-8.

This is because UTF-8 is a variable-length character encoding. If the application is written to use the Windows codepage CP-1252, then it will only be expecting single-byte characters, not the multi-byte Unicode characters that UTF-8 can encode. It will then probably choke on the multi-byte encoding, or just not recognise what it is supposed to represent, and not process the file properly.
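The mismatch is easy to reproduce. In this illustrative Python sketch the three UTF-8 bytes for the euro sign are read by something assuming CP-1252, and they come out as three unrelated single-byte characters rather than one:

    utf8_bytes = "€".encode("utf-8")        # b'\xe2\x82\xac' - three bytes for one character
    print(utf8_bytes.decode("cp1252"))      # 'â‚¬' - a CP-1252 reader sees three wrong characters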

Imagine that data entered into an application on the mainframe is using one codepage, but the application was programmed to use a field delimiter character from another codepage. The file the application produces on the mainframe will contain data from one codepage and delimiters from another, and the file is then transferred to another machine with codepage translation specified for the destination.

You may not be surprised to find that the field delimiter characters were not translated correctly for the destination.
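To make that concrete with an illustrative Python sketch (the byte value is an assumption chosen purely to demonstrate the idea): suppose the delimiter byte 0x9F was picked because in codepage IBM-037 it is the currency sign '¤', but the whole file is translated as IBM-1140, where that same byte is the euro sign. The surrounding data translates fine; the delimiter does not arrive as the character the receiving application is looking for.

    delimiter_byte = b"\x9f"
    print(delimiter_byte.decode("cp037"))     # '¤' - what the programmer had in mind
    print(delimiter_byte.decode("cp1140"))    # '€' - what the codepage translation produces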

In this particular case I suggested that the application programmer on the mainframe use a particular hex character value for the field delimiter, one that was available within the IBM-1140 codepage, and it turned out that the application on the Windows machine was using CP-1252 and not UTF-8.

It turned out that, for this particular file and these applications, there were no special codepage requirements, as the default translation was sufficient.

So next time someone is emphatically, absolutely certain about codepage requirements, it might just be a good idea to check the facts for yourself, as it is easy for people to get this wrong.
