Reading Stream Files in RPG

While working on a task to parse IBM MQ logs for a web front end, I needed to solve a couple of problems:

Wrangle the log files into something far more manageable than their raw state; single entries span multiple lines and there is a lot of whitespace.
Parse out the relevant details from each entry to put in a DB2 table that the web app can easily query.

This second item is what I will cover here.

Choosing the Tools

I had no real idea how to go about reading stream files, so I conducted a web search for some pointers. I eventually came across this article written in 2016 which seemed to fit the bill. However, when calling APIs with pointers and the like, there’s really no better way to go than fully modern, freeform RPG IV, so I set about converting it. I also read all of the comments and took on board various suggestions and corrections therein.

I’ll take a similar approach to going through the code with explanations, but if you want to cut to the chase, you can find the complete source in my GitHub repo.

The Code

Header

// C APIs cannot run in the default activation group
ctl-opt option(*srcstmt) dftactgrp(*no);

The only important option here is dftactgrp(*no) which is required because the C APIs that will be used cannot run in the default activation group.

The Stream File APIs

// C APIs for IFS file handling
dcl-pr openFile pointer extproc('_C_IFS_fopen');
  *n pointer value options(*string);  // File name
  *n pointer value options(*string);  // File mode and flags
end-pr;
dcl-pr readFile pointer extproc('_C_IFS_fgets');
  *n pointer value;  // Retrieved data
  *n int(10) value;  // Data size
  *n pointer value;  // Open file
end-pr;
dcl-pr closeFile extproc('_C_IFS_fclose');
  *n pointer value;  // Open file
end-pr;

These are the three prototypes for the APIs required to open, read, and close the stream file. The hallmark of these C-style APIs is the use of pointers. Later we will see liberal use of RPG’s %addr() function in relation to these.

The external names are the documented ones — fopen, fgets, and fclose — all prefixed with _C_IFS_.

For the two (pointer to) string parameters to fopen I have specified options(*string) which tells the compiler to pass the values as zero-terminated strings that C APIs generally require.
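To see options(*string) in isolation, here is a minimal sketch that is not part of the original program. It assumes the C runtime function strlen is available to bind against, as the other C runtime functions are, and the names cStrlen, name and len are made up for illustration. When a character value is passed, the compiler builds a temporary zero-terminated copy; when a pointer is passed, it is handed over unchanged.

```rpgle
**free
ctl-opt dftactgrp(*no);

// Assumption: strlen from the C runtime, which counts the bytes
// up to (not including) the first zero byte
dcl-pr cStrlen uns(10) extproc('strlen');
  *n pointer value options(*string);
end-pr;

dcl-s name varchar(50);   // Hypothetical example variable
dcl-s len uns(10);

name = 'abc';

// A character value is passed, so the compiler creates a temporary
// holding 'abc' followed by X'00' and passes its address
len = cStrlen(name);
dsply ('Length seen by C: ' + %char(len));

*inlr = *on;
return;
```

With a varchar() holding three characters, the C function should report a length of 3, because only the current contents are copied and terminated. A fixed length char(50) would be copied with its trailing blanks included.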

The Error API

// C API for fetching error code
dcl-pr getErrnoPtr pointer extproc('__errno');
end-pr;

Missing from the original code was a means of checking an error code if the file open did not succeed. In creating my program I had several issues to resolve before the file would open and I needed the error code to get through those.

There’s little to see in this prototype — it simply returns a pointer to an integer error code. We’ll see how to extract that later.

Standalone Variables

// Standalones
dcl-s filePath varchar(200);
dcl-s openMode varchar(50);
dcl-s filePtr pointer inz;
dcl-s rtvData char(2000); // ENSURE this is longer than the longest line
dcl-s maxLen int(10) inz(%size(rtvData));
dcl-s rtvLine varchar(%size(rtvData));
dcl-s toTrim char(2) inz(X'0D25');
dcl-s errNo int(10) based(errnoPtr);
dcl-s longest int(10);

The first two, filePath and openMode, are used when calling fopen. I’ve opted for varchar()s because they work well with the options(*string) modifier on the fopen prototype, which takes care of the zero-termination. If these were fixed length strings, the trailing blanks would be passed along too and would upset the API. (This was mentioned in a comment on the original article.)

filePtr is just a straightforward pointer that will refer to our opened file handle.

The next four work together to get us a string that is exactly, and only, the content of one line of the file.

rtvData will hold the raw data fetched from the fgets API. This data includes the line terminators (CRLF, or just LF) and a zero byte string terminator. If this were a varchar() then we’d have a problem, because the API knows nothing about the length prefix: the length would stay at zero and we would not be able to see the contents at all! (Guess how I learned about this!) It is important to make this longer than your longest possible line, allowing for the terminators as well. This is one simple assumption I have made in the code in the name of efficiency, as you will see later.

maxLen is automatically set to the length of rtvData as we will need to reference that in the fgets call.

rtvLine will be our final “one line of text” variable without all the cruft on the end. It’s a varchar() that you can easily test the length of with %len() if required.

toTrim is a convenience value which represents the (CCSID 37) values for CR and LF line terminators.

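The errNo definition is an interesting line, as it effectively defines two names. errNo is a based integer: it has no storage of its own and instead maps onto whatever address its basing pointer holds. errnoPtr is that basing pointer, which the compiler defines implicitly because it is not declared anywhere else. It took me a little while to grok this construct!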
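The based() construct may be easier to follow in a tiny standalone sketch. This is not from the original program, and the names counter, view and viewPtr are hypothetical. It shows that a based variable is simply a typed view of whatever storage its pointer currently addresses.

```rpgle
**free
dcl-s counter int(10) inz(42);
dcl-s view int(10) based(viewPtr);  // Also implicitly defines viewPtr

// Point the basing pointer at counter; view now overlays its storage
viewPtr = %addr(counter);
dsply ('view = ' + %char(view));

// Writing to view goes through the pointer into counter
view = 7;
dsply ('counter = ' + %char(counter));

*inlr = *on;
return;
```

The first message should show 42 and the second 7. In the stream file program the pointer comes from the __errno API rather than from %addr(), but the mechanism is the same.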

Finally, longest is only used to give the program something to actually do. You won’t need this, most likely, in your own code.

Open the File

// Open the file
filePath = '/path/to/my/file.txt';
openMode = 'r, o_ccsid=37'; // Read only, translate to CCSID 37
filePtr = openFile(filePath: openMode); // options(*string) adds the zero terminator
// Deal with an error opening the file
if (filePtr = *null);
  errnoPtr = getErrnoPtr();
  dsply ('File open error ' + %char(errNo));
  return;
endif;

Here we open the file and also deal with a failure of that action.

filePath is simply set to the absolute path of our file. This could be a relative path, which would be resolved against the job’s current directory. By default, that would be the user’s home directory.

openMode sets the file mode, which here I have set to read only, and also any flags we wish to include. The one flag included is o_ccsid=37 which sets the ‘output CCSID’ to 37. This may not be required, but on my system, the job default CCSID is 65535 which indicates no translation should occur and the API does not like this! ‘Output’ refers to the output of the translation. You could also set the input CCSID with the flag ccsid= but in general there is no need as it will default to the CCSID of the file anyway.

The call to the fopen API (using the openFile prototype) passes our two varchar() variables directly. Because the prototype parameters are defined with options(*string), the compiler copies each character value into a temporary, appends a zero byte, and passes the address of that temporary to the API. Do not pass %addr(filePath: *data) here. When a pointer is supplied to an options(*string) parameter it is handed to the API unchanged, nothing is appended, and the API reads on past the end of the data until it happens to find a zero byte.

Phew! The file is now open. We hope! But what if there’s a problem? Unhelpfully, if the file fails to open for any reason, the only direct signal we get is that the pointer returned will be null.

To find out why the file could not be opened we have to do a little extra work. This is what the second block in this section does. If the pointer is null, a call is made to the __errno API which simply returns a pointer to an integer. That integer is our error code. By assigning the return value to our errnoPtr pointer, the actual value will be readable from errNo.

Here I have just done a quick dsply, which isn’t much use other than when debugging, and then leave the program. For production code, you should include the relevant error handling here.

To see what error codes may be returned and what they mean, consult the include member QSYSINC/QRPGLESRC.ERRNO.
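If you would rather show text than look up the number, one option is the C runtime function strerror, which returns a pointer to a zero-terminated message for a given error number. The sketch below is an assumption on my part rather than something from the original program: the getErrorText prototype name, the msg variable and the deliberately missing file path are all made up for illustration.

```rpgle
**free
ctl-opt dftactgrp(*no);

dcl-pr openFile pointer extproc('_C_IFS_fopen');
  *n pointer value options(*string);  // File name
  *n pointer value options(*string);  // File mode and flags
end-pr;
dcl-pr getErrnoPtr pointer extproc('__errno');
end-pr;
// Assumption: strerror from the C runtime (not used in the original)
dcl-pr getErrorText pointer extproc('strerror');
  *n int(10) value;  // Error number
end-pr;

dcl-s filePtr pointer inz;
dcl-s errNo int(10) based(errnoPtr);
dcl-s msg char(52);  // dsply shows at most 52 characters

// Hypothetical path that is expected not to exist
filePtr = openFile('/no/such/file.txt': 'r, o_ccsid=37');
if (filePtr = *null);
  errnoPtr = getErrnoPtr();
  // %str() turns the zero-terminated C string into an RPG value;
  // the assignment truncates anything beyond 52 characters
  msg = %char(errNo) + ': ' + %str(getErrorText(errNo));
  dsply msg;
endif;

*inlr = *on;
return;
```

For production code you would more likely write the number and text to a log or return them to the caller than use dsply.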

Reading the File

// Read through the file contents
dow (readFile(%addr(rtvData): maxLen: filePtr) <> *null);
  rtvLine = %trimr(%str(%addr(rtvData)): toTrim); // Get actual content
  rtvData = ''; // In case the next line is shorter
  if %len(rtvLine) > longest;
    longest = %len(rtvLine);
  endif;
enddo;

This is where the rubber meets the road and we actually get the contents of the file, a line at a time. This would be used much like any other read loop, but bear in mind that this is purely a sequential read of the lines from top to bottom. No positioning occurs.

The dow line does the work of fetching a line from the file. We pass a pointer to our rtvData for the raw data, our maximum length of line to retrieve, and our opened file pointer. If this call returns a null pointer, then the end of file has been reached. If not, we have something in rtvData.

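This is where my assumption comes into play. The fgets API reads at most maxLen - 1 bytes and always adds the zero byte terminator, so a line longer than the buffer is not an error: it is returned in pieces across successive calls, and only the last piece carries the line terminating characters. The loop would treat each piece as a separate line, which is why rtvData must be longer than the longest line.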
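If you cannot be sure of the longest line, one way to notice a line that did not fit is to check whether the raw data ends with a line feed before trimming it. The following is only a sketch under that assumption; the endsWithLF procedure and the sample values are hypothetical and stand in for what %str(%addr(rtvData)) would return inside the loop.

```rpgle
**free
ctl-opt dftactgrp(*no);

dcl-s raw varchar(2000);  // Stands in for %str(%addr(rtvData))

// A line that fitted in the buffer still has its CR and LF
raw = 'complete line' + X'0D25';
if endsWithLF(raw);
  dsply 'First value: complete line';
endif;

// A piece of an over-long line has no terminator yet
raw = 'cut off part';
if not endsWithLF(raw);
  dsply 'Second value: more to come';
endif;

*inlr = *on;
return;

// Returns *on when the text ends with a line feed (CCSID 37)
dcl-proc endsWithLF;
  dcl-pi *n ind;
    text varchar(2000) const;
  end-pi;
  dcl-c LF X'25';

  // The length test comes first so %subst is never run on empty text
  return (%len(text) > 0 and %subst(text: %len(text): 1) = LF);
end-proc;
```

Pieces without a line feed could be appended to a work variable until a piece with one arrives. Note that the very last line of a file may legitimately have no terminator, so end of file needs to flush whatever has been collected.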

The original code had three %xlate() functions to simply wipe out all line terminators and zero bytes with blanks. This has two downsides:

You have to use %trim() to find the actual data length and this would not take into account any trailing blanks actually present in the file. Obviously, if trailing blanks are unimportant, that doesn’t matter, though.
The three %xlate() functions take a very real amount of time to execute. On my candidate file of 160,000+ lines (and 100 MB size), the complete file took 53 seconds to process. When adapted to the above approach, this reduced to less than 4 seconds!

Starting from the inside of the assignment, %addr(rtvData) provides a pointer to the raw data in its (fixed length) variable. This data, remember, has a zero somewhere to mark the end of the actual data, and this is preceded by one or two line terminators.

This address is passed to the %str() function which returns the value of a zero terminated string. The net effect of this is we lose the zero byte and have a proper length string representing the data. It still has the one or two line terminators, only now we know they are on the right hand end of the string.

The %trimr() function can now simply be used to strip any terminators (as defined in the toTrim variable), leaving us with just the line of text from the file — exactly as you would see it with your own eyes and no hidden bytes.
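The three steps can be tried without a file at all by faking what fgets leaves in the buffer. This is a small sketch for experimentation, not part of the original program; the sample text is made up, and it deliberately includes two trailing blanks to show that they survive.

```rpgle
**free
dcl-s rtvData char(2000);
dcl-s rtvLine varchar(2000);
dcl-s toTrim char(2) inz(X'0D25');  // CR and LF in CCSID 37

// Simulate fgets: text with two trailing blanks, CR, LF, zero byte
rtvData = 'Hello  ' + X'0D25' + X'00';

// %str stops at the zero byte, %trimr removes only CR and LF
rtvLine = %trimr(%str(%addr(rtvData)): toTrim);
dsply ('Length: ' + %char(%len(rtvLine)));

*inlr = *on;
return;
```

The length shown should be 7: the five letters plus the two blanks that were really in the data. The blank padding that RPG adds to fill rtvData sits after the zero byte, so %str() never sees it.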

rtvData is then blanked, ready for the next time around the loop.

In this example, I have simply kept track of the longest line in the file. I used this to prove to myself that the program was successfully processing the entire file. I independently verified this value using the bash command wc -L /path/to/my/file.txt.

Closing the File

// Close the file
closeFile(filePtr);
return;

This task is simple — call the fclose API with a pointer to the open file handle.

Conclusion

There is another way to tackle this task which many may find easier. You can simply use CRTPF ... RCDLEN(2000) to create a ‘flat file’ and then CPYFRMIMPF to copy the data across. I did this with no trouble — taking care to specify a target CCSID, as my system default is 65535 — and ended up with a Physical File (PF) which could be simply read in RPG or CL or SQL.

There are several downsides to this approach, however.

It takes extra time. In my example where the above code was able to loop through 160,000+ lines in under 4 seconds, the copy took 17 seconds. For small files that’s probably not an issue, but it is real time spent just getting the data to somewhere you can read it, which is wasted time.
You don’t know how long the lines really are. My code has to make an assumption about the longest line, and the CRTPF makes this same assumption, so they’re even on that score. But once in the PF, every line is 2000 bytes long and any significant trailing space is effectively lost. Again, this probably doesn’t matter in many cases, but it might.
It takes extra space. My stream file was 100 MB. Copying this to a PF takes another 100+ MB. Why more than 100? Because every line is 2000 bytes long!

I decided to tackle the task this way precisely because I now have an approach, documented here in this example, which makes only one minor assumption (and you could just make that a very large number), saves some processing time, and feels like the right way to process data out of a stream file.

Thank you for reading this far. You can find the complete code over on GitHub, where I’ve included some brief comments at the top so that it can act as a standalone resource, including the key things you would need to consider changing for your own purposes.