Convert EBCDIC to UTF-8 in RPM
RPM Remote Print Manager® ("RPM") can translate data between one code page and another.
RPM has always had EBCDIC-to-ASCII conversion built in, since we work with the AS/400 and other IBM systems. The AS/400 automatically converts for you in many situations. Nonetheless, you can always depend on having native EBCDIC data available for processing.
Customers kept asking us to add code pages to our translations, so RPM incorporates the Unicode code pages plus the Microsoft code pages currently installed in Windows.
What is a code page?
A code page is a set of printable characters plus control codes, each identified by a number. The term "code page" started with IBM, which used it to identify versions of EBCDIC used in various countries.
Each code page has a number and also a text name. For instance, the version of EBCDIC used in the United States is code page 37, with the text name "IBM037".
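To make this concrete, here is a small sketch in Python (not part of RPM, just a convenient way to illustrate) showing that the same byte value stands for different characters depending on which code page you read it with. The names "cp037" and "latin-1" are Python's codec names for IBM037 and ISO-8859-1.

```python
# One byte, interpreted under two different code pages.
raw = bytes([0xC1])

as_ebcdic = raw.decode("cp037")    # IBM037, US EBCDIC
as_latin1 = raw.decode("latin-1")  # ISO-8859-1

print(repr(as_ebcdic))  # 'A'
print(repr(as_latin1))  # 'Á'
```

This is why a translation step has to know the input code page: the bytes alone do not say which characters they represent.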
The code page name can also have one or more aliases. The Unicode definition of the IBM037 code page has these aliases:
- CP037
- EBCDIC-CP-US
- EBCDIC-CP-CA
- EBCDIC-CP-WT
- EBCDIC-CP-NL
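As an illustration of how aliases work, Python's standard codec registry knows the same set of names. This sketch assumes Python's built-in "cp037" codec as a stand-in for IBM037; it says nothing about how RPM stores its own alias table.

```python
import codecs

# The canonical name plus the aliases listed above.
aliases = [
    "IBM037",
    "CP037",
    "EBCDIC-CP-US",
    "EBCDIC-CP-CA",
    "EBCDIC-CP-WT",
    "EBCDIC-CP-NL",
]

# codecs.lookup() normalizes case and punctuation, then resolves aliases.
resolved = {alias: codecs.lookup(alias).name for alias in aliases}

for alias, name in resolved.items():
    print(f"{alias:14} -> {name}")
```

Every entry resolves to the same codec, so it does not matter which of these names you search for.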
The Wikipedia page on code pages includes history and detailed discussion.
Set up RPM to translate
First, in the RPM user interface, create or select a queue.
Open the Queue Settings dialog. Add a transform and select "Codepage Conversion" as its type.
The Codepage Conversion dialog lets you configure an input character set, or code page, and one for output.
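Conceptually, a conversion with an input and an output character set is a decode followed by an encode. The sketch below shows that idea in Python; the function name `convert_code_page` is hypothetical and this is not RPM's implementation, only a way to picture what the two settings mean.

```python
def convert_code_page(data: bytes, input_charset: str, output_charset: str) -> bytes:
    """Reinterpret bytes from one code page and re-encode them in another."""
    text = data.decode(input_charset)   # bytes -> Unicode text
    return text.encode(output_charset)  # Unicode text -> bytes


# "Hello" encoded in IBM037 (US EBCDIC).
ebcdic_hello = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])

utf8_hello = convert_code_page(ebcdic_hello, "cp037", "utf-8")
print(utf8_hello)  # b'Hello'
```

Going through Unicode in the middle means any input code page can be paired with any output code page, as long as the characters exist in both.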
Note that you can select a code page or search for an alias.
Here I have selected the Input Character Set. The default was ISO-8859-1 (Latin-1), an 8-bit superset of "plain" ASCII. I've scrolled down to show some of the EBCDIC character sets.
Using this dialog, I can also look at aliases and pick one of the EBCDIC entries. Or I could look at other names, hoping to find something familiar.
The code page conversion transform is now ready to use.
Results
I used this setup to translate a sample EBCDIC file I found online in a GitHub repo about mainframes. I noticed a special character in the text result, so I explored the original using a hex display:
0000000  \0  \0 037 304 226 205   @   @   @   @   @   @   @   @   @   @
         0000 c41f 8596 4040 4040 4040 4040 4040
The first three bytes are null, null, and octal 037 (hex 1F). These are useless to my application, so I'll use a string translation transform to drop them from the result.
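For readers who want to check the dump by hand, here is a rough Python sketch of the same steps: decode the bytes as IBM037, then drop the leading control characters. The byte values come from the dump above (the third byte is octal 037, which is 0x1F); the cleanup with `lstrip` is my stand-in for illustration and is not how RPM's string translation transform is implemented.

```python
# First 16 bytes from the hex display: 00 00 1F C4 96 85, then ten 0x40 bytes.
sample = bytes([0x00, 0x00, 0x1F, 0xC4, 0x96, 0x85]) + b"\x40" * 10

# Decode as US EBCDIC; 0x40 is the EBCDIC space character.
text = sample.decode("cp037")
print(repr(text[:6]))  # '\x00\x00\x1fDoe'

# Drop the leading NUL and 0x1F control characters, keep everything else.
cleaned = text.lstrip("\x00\x1f")

# Re-encode the cleaned text as UTF-8.
utf8 = cleaned.encode("utf-8")
print(utf8[:3])  # b'Doe'
```

Note that the control characters survive the code page conversion unchanged, which is why they need a separate step to remove.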