XML PARSE by ddeclerck · Pull Request #263 · OCamlPro/gnucobol

ddeclerck · 2025-12-04T15:48:01Z

Note: initial commit from Chuck, fixes to come

GitMensch

Just a quick note for the first iteration: Chuck's changes are based on mlio from August 2025, so the "real" version may be easier to get by checking out a previous commit, replacing the file and committing locally, then fetching the newer commit and rebase-merging.

GitMensch · 2025-12-05T07:21:25Z

@chuck-haatvedt passed me the newest file (you may do a diff to add a changelog entry) which looks much better concerning libxml version compat. It is from November 14th: mlio.c with a note

that file was built / tested on
libcob (branches/gnucobol-3.x r5603M) 3.3-dev.5603

GitMensch

Thanks for inspecting / working on necessary changes.
I think we can have those in at least a second commit :-)

This is work by Chuck Haatvedt edited by David Declerck.

* mlio.c: modified to support xml parse with xmlss. Eliminated the xml_event_data structure and moved that data into the xml_event structure. Created a new enum cob_xml_registers and added it to the add_xml_event_data function. This function was modified to update the xml_event structure. All of the context parser callback functions were modified to use the add_xml_event_data function. The cob_xml_parse and xml_parse functions were modified to support the new end_of_input event required by xmlss. A new eof variable was added to the xml_state structure so that the endDocument callback function could be triggered by the parser in the xml_parse function.

TODO ==> logic needs to be added to support returning NATIONAL data; this needs to support the RETURNING NATIONAL phrase.

* common.h: rename COB_XML_PARSE_XMLNSS into COB_XML_PARSE_XMLSS to match the IBM option name
* mlio.c [WITH_XML2]: Fix issues in XML PARSE handling most notably a use after free error if the internal buffer needs to grow during the parsing. Respect the high order half-word for exception XML-CODE. Reduce the number of parsing states by removing useless ones, and encode eof in these states. Handle XML chunks with more than one recoverable error. Trigger ON EXCEPTION code after EXCEPTION XML events.
* parser.y: remove the CB_PENDING warning on XML PARSE but still warn for untested XML PARSE RETURNING NATIONAL and XML PARSE VALIDATING.
* typeck.c: remove invalid call to cob_check_based for XML-* builtin variable length registers (like XML-TEXT)
* codegen.c: remove the uninitialized and unused b_* field for XML-* builtin variable length registers

GBertholon · 2026-04-20T11:57:01Z

I am taking the responsibility for this PR on OCamlPro's behalf. I applied changes according to your comments and fixed several issues. @GitMensch: is this new version more satisfying ?

GitMensch

the output data is not as expected:

not included version/standalone/encoding may not be in the returned values
exceptions should include the exception data

GitMensch · 2026-04-21T19:17:39Z

+EXCEPTION                     +000262345||||
+START-OF-ELEMENT              +000000000|root|pfx0||
+NAMESPACE-DECLARATION         +000000000||pfx1|http://whatever|
+START-OF-ELEMENT              +000000000|localElName1|pfx1|http://whatever|
+EXCEPTION                     +000262345||||
+START-OF-ELEMENT              +000000000|localElName2|pfx2||
+END-OF-ELEMENT                +000000000|localElName2|pfx2||
+EXCEPTION                     +000262345||||
+EXCEPTION                     +000262345||||
+START-OF-ELEMENT              +000000000|localElName3|pfx3||
+ATTRIBUTE-NAME                +000000000|localAtName4|pfx4||
+ATTRIBUTE-CHARACTERS          +000000000||||
+CONTENT-CHARACTERS            +000000000|c1|||
+EXCEPTION                     +000262345||||
+EXCEPTION                     +000262345||||


The exceptions should have the data from the exception in the register - this is the IBM output (with XMLSS):

EXCEPTION 000264193|pfx0:root|||
START-OF-ELEMENT 000000000|root|pfx0||
NAMESPACE-DECLARATION 000000000||pfx1|http://whatever|
START-OF-ELEMENT 000000000|localElName1|pfx1|http://whatever|
EXCEPTION 000264193|pfx2:localElName2|||
START-OF-ELEMENT 000000000|localElName2|pfx2||
END-OF-ELEMENT 000000000|localElName2|pfx2||
EXCEPTION 000264193|pfx3:localElName3|||
START-OF-ELEMENT 000000000|localElName3|pfx3||
EXCEPTION 000264192|pfx4:localAtName4|||
ATTRIBUTE-NAME 000000000|localAtName4|pfx4||
ATTRIBUTE-CHARACTERS 000000000||||
CONTENT-CHARACTERS 000000000|c1|||
EXCEPTION 000264193|pfx5:localElName5|||
START-OF-ELEMENT 000000000|localElName5|pfx5||
EXCEPTION 000264192|pfx6:localAtName6|||

Indeed, but I do not really have time right now to implement the mapping between libxml2 and IBM exception codes, and I cannot imagine a meaningful code that uses the XML-TEXT of an EXCEPTION event without first checking the XML-CODE...
I would say that the support for XML PARSE without exception codes is useful enough to merge this PR first and then take care of those EXCEPTION events another time.

The behavior I have implemented simply lets the COBOL developer choose between ignoring all recoverable errors and failing on the first one.
That said, I think I made a mistake here by trying to pass the libxml2 error code to COBOL while it is not fully stable, and this will be fixed by my next commit (I should simply tell whether the error is recoverable or not).

this is not about matching exception codes but about outputting the part that resulted in an exception in the appropriate register (as done by IBM, MF ... and if I remember correctly also libxml2)

note that we explicitly noted in NEWS that the exception codes are not identical to other implementations (I think MF and IBM differ as well)

Yes but the definition of "the part that resulted in an exception" is very unclear unless you also know which exception is returned.
For me it is currently out of scope to do any kind of exception specific work for EXCEPTION event aside from distinguishing recoverable and non-recoverable.

Besides, IBM documentation says (https://www.ibm.com/docs/en/cobol-zos/6.3.0?topic=registers-xml-event) that for EXCEPTION events, "XML-TEXT or XML-NTEXT contains the document fragment up to the point of the error or anomaly that caused the exception.", but in practice this contradicts the output you mentioned, where only the name of the element or attribute is placed in XML-TEXT.

… libxml2 error codes in COBOL

GitMensch

I consider that my "final" review. There are some things open, but I think we're nearly done to finally get this upstream!

But I'd like to have a review of @chuck-haatvedt as the original author of the code (and the rewrite from my initial event/data handling) before, if possible.

GitMensch · 2026-04-24T16:59:49Z

+		/* IBM doc states that we should store 1 in XML-INFORMATION on events
+		   ATTRIBUTE-CHARACTERS and CONTENT-CHARACTERS if the value in XML-TEXT
+		   is complete. It seems to be always the case with libxml2. */


Is this also true for the push parser (where the COBOL program passes in the data, commonly from a line sequential file) when the attribute is split between multiple lines?
Do we have a testcase for that?

We have a test case with a push parser (currently badly named "XML PARSE complex XML": I will change that).

The issue is that IBM can split the content of ATTRIBUTE-CHARACTERS and CONTENT-CHARACTERS between several events, and in that case it reports that the *-CHARACTERS event is incomplete by writing 2 in XML-INFORMATION.
In libxml2, as far as I know, we never get incomplete events and emulating those seems out of scope for now, as it requires digging into the internal structure of the parser state (and I don't think this structure is supposed to be stable across versions).
Therefore, we always send only one *-CHARACTERS event, even though IBM states it can send more.

In practice, for most COBOL code, and especially code following the IBM example I took for the unit test, this practice of combining incomplete events should not alter the behavior since the only meaningful thing to do with partial *-CHARACTERS events is to concatenate them.
Actually, we can even argue that this behavior should be kept even if we support IBM-style splitting one day because it allows for simpler COBOL code.
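The concatenation argument above can be illustrated with a small C sketch. It is not libcob code: `text_accumulator` and `accumulate_characters` are hypothetical names, the fixed 1024-byte buffer is an arbitrary choice, and the XML-INFORMATION values 1 (complete) and 2 (more to come) are the ones described in this discussion.

```c
#include <string.h>

/* XML-INFORMATION values as described in the discussion above */
#define XML_INFO_COMPLETE 1
#define XML_INFO_PARTIAL  2

/* Hypothetical accumulator for *-CHARACTERS fragments */
typedef struct {
	char	data[1024];
	size_t	len;
} text_accumulator;

/* Append one fragment; return 1 once the value is reported complete. */
static int
accumulate_characters (text_accumulator *acc, const char *text,
		       int xml_information)
{
	size_t n = strlen (text);
	size_t room = sizeof acc->data - 1 - acc->len;

	if (n > room) {
		n = room;	/* this sketch truncates rather than overflow */
	}
	memcpy (acc->data + acc->len, text, n);
	acc->len += n;
	acc->data[acc->len] = '\0';
	return xml_information == XML_INFO_COMPLETE;
}
```

A consumer written this way ends up with the same text whether it receives "Ham + " (2) followed by "turkey" (1), or a single "Ham + turkey" (1), which is why merging fragments should be invisible to it.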

Is https://deepwiki.com/search/i-currently-get-contentcharact_a2ed1cf8-d2e9-445f-8896-7c8bf724ac6b?mode=deep wrong or does our code work around that?

It seems partially wrong: the calling function xmlParseTryOrFinish does not call xmlParseCharDataInternal at pushed chunk boundary, but at internal buffer size boundary instead.

This is an issue here though... At internal buffer boundary we should put 2 in the XML-INFORMATION register.

thanks for adding a test going over the boundary and checking the adjusted code ❤️

I just thought about another potential quirk of XML-INFORMATION: for the XML file

<test>Try <![CDATA[some]]> wierd things</test>

What is the content of XML-INFORMATION of the different CONTENT-CHARACTERS events ?
I don't have an IBM compiler at hand and it is not stated in the documentation whether CDATA text is considered to be a continuation of normal text or not.

with your test data and

display xml-event xml-code '|' xml-text '|' xml-information '|' xml-namespace-prefix '|' xml-namespace '|'

the result on IBM with xmlss is

START-OF-DOCUMENT 000000000||000000000|||
START-OF-ELEMENT 000000000|test|000000000|||
CONTENT-CHARACTERS 000000000||000000001|||
EXCEPTION 000798761|<test>Try <!|000000000|||

and with compat

START-OF-DOCUMENT 000000000|<test>Try <![CDATA[some]]> wierd things</test> |000000000|||
START-OF-ELEMENT 000000000|test|000000000|||
CONTENT-CHARACTERS 000000000|Try |000000000|||
EXCEPTION 000000136|<test>Try <!|000000000|||

<![CDATA[some]]><test>Try valid things</test><![CDATA[more]]> leads to

START-OF-DOCUMENT 000000000||000000000|||
EXCEPTION 000798761|<!|000000000|||

compat:

START-OF-DOCUMENT 000000000|<![CDATA[some]]><test>Try valid things</test><![CDATA[more]]> |000000000|||
EXCEPTION 000000002|<![|000000000|||
EXCEPTION 000000001|<![C|000000000|||
EXCEPTION 000000001|<![CD|000000000|||
EXCEPTION 000000001|<![CDA|000000000|||
EXCEPTION 000000001|<![CDAT|000000000|||
EXCEPTION 000000001|<![CDATA|000000000|||
EXCEPTION 000000001|<![CDATA[|000000000|||
EXCEPTION 000000001|<![CDATA[s|000000000|||
EXCEPTION 000000001|<![CDATA[so|000000000|||
EXCEPTION 000000001|<![CDATA[som|000000000|||
EXCEPTION 000000001|<![CDATA[some|000000000|||
EXCEPTION 000000001|<![CDATA[some]|000000000|||
EXCEPTION 000000001|<![CDATA[some]]|000000000|||
EXCEPTION 000000001|<![CDATA[some]]>|000000000|||
EXCEPTION 000000002|<![CDATA[some]]><test>Try valid things</test><![|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![C|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CD|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CDA|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CDAT|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[m|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[mo|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[mor|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[more|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[more]|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[more]]|000000000|||
EXCEPTION 000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[more]]>|000000000|||

:-)

I'm just confused why parsing

         1 xml-document-data.
          2 pic x(39) value '<?xml version="1.0" encoding="US-ASCII"'.
          2 pic x(19) value ' standalone="yes"?>'.
          2 pic x(39) value '<!--This document is just an example-->'.
          2 pic x(10) value '<sandwich>'.
          2 pic x(33) value '<bread type="baker''s best"/>'.
          2 pic x(36) value '<?spread We''ll use real mayonnaise?>'.
          2 pic x(29) value '<meat>Ham + turkey</meat>'.
          2 pic x(34) value '<filling>Cheese, lettuce, tomato, '.
          2 pic x(32) value 'and that''s all, Folks!</filling>'.
          2 pic x(25) value '<![CDATA[We should add a '.
          2 pic x(20) value '<relish> element!]]>'.
          2 pic x(28) value '<listprice>$4.99</listprice>'.
          2 pic x(25) value '<discount>0.10</discount>'.
          2 pic x(31) value '</sandwich>'.

with XMLSS does not result in START-OF-CDATA and so on but also raises an exception

START-OF-DOCUMENT 000000000||000000000|||
VERSION-INFORMATION 000000000|1.0|000000000|||
ENCODING-DECLARATION 000000000|US-ASCII|000000000|||
STANDALONE-DECLARATION 000000000|yes|000000000|||
COMMENT 000000000|This document is just an example|000000000|||
START-OF-ELEMENT 000000000|sandwich|000000000|||
START-OF-ELEMENT 000000000|bread|000000000|||
ATTRIBUTE-NAME 000000000|type|000000000|||
ATTRIBUTE-CHARACTERS 000000000|baker's best|000000001|||
END-OF-ELEMENT 000000000|bread|000000000|||
CONTENT-CHARACTERS 000000000| |000000001|||
PROCESSING-INSTRUCTION-TARGET 000000000|spread|000000000|||
PROCESSING-INSTRUCTION-DATA 000000000|We'll use real mayonnaise|000000000|||
START-OF-ELEMENT 000000000|meat|000000000|||
CONTENT-CHARACTERS 000000000|Ham + turkey|000000001|||
END-OF-ELEMENT 000000000|meat|000000000|||
CONTENT-CHARACTERS 000000000| |000000001|||
START-OF-ELEMENT 000000000|filling|000000000|||
CONTENT-CHARACTERS 000000000|Cheese, lettuce, tomato, and that's all, Folks!|000000001|||
END-OF-ELEMENT 000000000|filling|000000000|||
EXCEPTION 000798761|<?xml version="1.0" encoding="US-ASCII" standalone="yes"?><sandwich><bread type="baker's best"/> <?spread We'll use real mayonnaise?><meat>Ham + turkey</meat > <filling>Cheese, lettuce, tomato, and that's all, Folks!</filling><!|000000000|||

no matter if I save the file with UTF8 encoding and also mention that in the xml's encoding or not...

The error code 000798761 corresponds to XRSN_MARKUP_INVALID: An incorrect character is found within markup.
It seems that the XML parser you used for tests is unable to recognize CDATA sections (it always stops after <! as if it was expecting a comment <!-- and nothing else)...
Therefore I will not get any information on the expected behaviour from that :(

By the way, I think

<![CDATA[some]]><test>Try valid things</test><![CDATA[more]]>

is supposed to be invalid XML, unlike what its text suggests (you cannot have content outside the root XML element and CDATA is treated as content)...

On the contrary my "weird" example is unusual but supposedly valid.

chuck-haatvedt · 2026-04-24T21:13:38Z

I am a bit confused as to the changes to the version I supplied to Simon as the code appeared to be working fine before the changes.

As for the testsuite, I have attached the sample program I used for testing. I ran it on both GnuCOBOL and MF COBOL.

xmlsmpl-3.txt is the test program rename it to xmlsmpl-3.cbl. This is a much better test program as it exercises more of the complex xml elements.

set infile=sample_test_complex_split.xml this is the input xml document as a line sequential file.

xmlsmpl3-mfcobol.txt is the output from the MF COBOL test

xmlsmpl3-gnucobol.txt
xmlsmpl3-gnucobol-new.txt
xmlsmpl3-mfcobol.txt

xmlsmpl-3.txt

GitMensch · 2026-04-25T07:36:57Z

As for the testsuite, I have attached the sample program I used for testing. I ran it on both GnuCOBOL and MF COBOL.

Can you change that from file based to memory based, please? That way I can easily run it on IBM (files would also work but I'd need to create a dataset, add the data, handle JCL, ... - in-memory is just much easier)

…data check later in the parsing.

Note that, compared to IBM, we may merge short contiguous CONTENT-CHARACTERS events across END-OF-INPUT boundaries. This is due to libxml2 internal details. Also improve some tests to check predefined entities and long content.

GBertholon · 2026-04-29T13:13:14Z

Without forking libxml2, it seems impossible to generate the exact same stream of events as IBM in push parser mode.
This is due to the fact that libxml2 does not cut events on chunk boundaries like IBM does.

That said, I guess I found a reasonable compromise between not depending too much on internal libxml2 details and not breaking COBOL code expecting the IBM behavior: the rule is that we allow ourselves to postpone characters delivered by IBM at chunk boundary but we try to guarantee that we do not generate more events than IBM since COBOL code might rely on the fact that some content is never split.
This new code also relies on internal libxml2 heuristics to never wait for too long before delivering an event (the internal rule seems to be "if the content already contains more than 300 characters at chunk boundary then deliver before next chunk, else wait").

Moreover, my last commit should handle XML-INFORMATION correctly, indicating whether or not more characters may follow later.
Note that I adjusted the long text example to check that.

@chuck-haatvedt: Can you tell me what your test is checking that is not already covered by my additions in run_ml.at ?

@GitMensch: With that done, I think I have taken into account all your comments. Do you have final remarks ?

chuck-haatvedt · 2026-05-04T06:04:30Z

this simple patch to mlio.c will add the xml-text line to the EXCEPTION event

*** F:/gnucobol-xml_parse/libcob/mlio.c	Wed Apr 29 07:52:52 2026
--- R:/msys64/home/spcwh2/x32/gnucobol-trunk/libcob/mlio.c	Sun May  3 19:59:43 2026
***************
*** 1641,1642 ****
--- 1641,1643 ----
  	size_t message_len;
+ 	char	buff[255];
  
***************
*** 1701,1702 ****
--- 1702,1706 ----
  		new_xml_event (state, EVENT_EXCEPTION);
+ 		snprintf(buff, 254, "%s:%s", err->str1, err->str2);
+ 		set_xml_event_text (state, buff, xmlStrlen ((xmlChar *)buff));
+

Note that this is a simple case and should be modified to check all 3 of the str1..3 variables in the err structure.
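One detail worth guarding in a patch like the one above is that passing a NULL pointer to `%s` is undefined behavior in C, so each string should be checked before formatting. A minimal sketch of such a guard follows; `format_exception_text` is a hypothetical helper, not a libcob or libxml2 function, and it deliberately handles only the prefix and local name (how a third string should be used is left open here).

```c
#include <stdio.h>

/* Build "prefix:local" in buf, tolerating NULL for either part.
   Returns the number of characters stored (at most size - 1). */
static size_t
format_exception_text (char *buf, size_t size,
		       const char *prefix, const char *local_name)
{
	int n;

	if (size == 0) {
		return 0;
	}
	if (prefix != NULL && local_name != NULL) {
		n = snprintf (buf, size, "%s:%s", prefix, local_name);
	} else if (local_name != NULL) {
		n = snprintf (buf, size, "%s", local_name);
	} else if (prefix != NULL) {
		n = snprintf (buf, size, "%s", prefix);
	} else {
		buf[0] = '\0';
		return 0;
	}
	if (n < 0) {
		buf[0] = '\0';
		return 0;
	}
	/* snprintf reports the untruncated length; clamp to what fits */
	return (size_t)n < size ? (size_t)n : size - 1;
}
```

Returning the stored length means the caller can pass it on directly instead of measuring the buffer again, and using the real buffer size avoids hard-coding a separate limit.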

here is the output for XMLup with the above change, I can upload these in a text file tomorrow if that would be easier.

F:\AA-minGW32-static\XML>xmlup
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx0 on root is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx2 on localElName2 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx4 for localAtName4 on localElName3 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx3 on localElName3 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx6 for localAtName6 on localElName5 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx5 on localElName5 is not defined
START-OF-DOCUMENT             +000000000||||
EXCEPTION                     +000262144|pfx0:root|||
START-OF-ELEMENT              +000000000|root|pfx0||
NAMESPACE-DECLARATION         +000000000||pfx1|http://whatever|
START-OF-ELEMENT              +000000000|localElName1|pfx1|http://whatever|
EXCEPTION                     +000262144|pfx2:localElName2|||
START-OF-ELEMENT              +000000000|localElName2|pfx2||
END-OF-ELEMENT                +000000000|localElName2|pfx2||
EXCEPTION                     +000262144|pfx4:localAtName4|||
EXCEPTION                     +000262144|pfx3:localElName3|||
START-OF-ELEMENT              +000000000|localElName3|pfx3||
ATTRIBUTE-NAME                +000000000|localAtName4|pfx4||
ATTRIBUTE-CHARACTERS          +000000000||||
CONTENT-CHARACTERS            +000000000|c1|||
EXCEPTION                     +000262144|pfx6:localAtName6|||
EXCEPTION                     +000262144|pfx5:localElName5|||
START-OF-ELEMENT              +000000000|localElName5|pfx5||
ATTRIBUTE-NAME                +000000000|localAtName6|pfx6||
ATTRIBUTE-CHARACTERS          +000000000||||
END-OF-ELEMENT                +000000000|localElName5|pfx5||
CONTENT-CHARACTERS            +000000000|c2|||
END-OF-ELEMENT                +000000000|localElName3|pfx3||
CONTENT-CHARACTERS            +000000000|c3|||
END-OF-ELEMENT                +000000000|localElName1|pfx1|http://whatever|
END-OF-ELEMENT                +000000000|root|pfx0||
END-OF-INPUT                  +000000000||||
END-OF-DOCUMENT               +000000000||||

here is the same when executed on IBM Z/OS ENTERPRISE COBOL

START-OF-DOCUMENT             000000000||||                                
EXCEPTION                     000264193|pfx0:root|||                       
START-OF-ELEMENT              000000000|root|pfx0||                        
NAMESPACE-DECLARATION         000000000||pfx1|http://whatever|             
START-OF-ELEMENT              000000000|localElName1|pfx1|http://whatever| 
EXCEPTION                     000264193|pfx2:localElName2|||               
START-OF-ELEMENT              000000000|localElName2|pfx2||                
END-OF-ELEMENT                000000000|localElName2|pfx2||                
EXCEPTION                     000264193|pfx3:localElName3|||               
START-OF-ELEMENT              000000000|localElName3|pfx3||                
EXCEPTION                     000264192|pfx4:localAtName4|||               
ATTRIBUTE-NAME                000000000|localAtName4|pfx4||                
ATTRIBUTE-CHARACTERS          000000000||||                                
CONTENT-CHARACTERS            000000000|c1|||                              
EXCEPTION                     000264193|pfx5:localElName5|||               
START-OF-ELEMENT              000000000|localElName5|pfx5||                
EXCEPTION                     000264192|pfx6:localAtName6|||               
ATTRIBUTE-NAME                000000000|localAtName6|pfx6||                
ATTRIBUTE-CHARACTERS          000000000||||                                
END-OF-ELEMENT                000000000|localElName5|pfx5||                
CONTENT-CHARACTERS            000000000|c2|||                              
END-OF-ELEMENT                000000000|localElName3|pfx3||                
CONTENT-CHARACTERS            000000000|c3|||                              
END-OF-ELEMENT                000000000|localElName1|pfx1|http://whatever| 
END-OF-ELEMENT                000000000|root|pfx0||                        
END-OF-DOCUMENT               000000000||||

Note that a couple of the EXCEPTION events are in a different order. Also note that the XML-CODE is displayed a bit differently on IBM, perhaps they use an implied PIC +++++++++9 instead of the floating "-" character which would only print the negative sign.

Obviously the XML-CODE values are different as well. Personally I think it would be better to use the code value from the err structure, as it would give COBOL programmers better access to the cause of the error within application code.

Chuck Haatvedt

chuck-haatvedt · 2026-05-04T06:13:41Z

here are the files...

Also revert the undeclared namespace test to the IBM original one since this is fully implemented now.

GBertholon · 2026-05-04T13:05:01Z

Since you seem to really want the correct errors for undeclared namespace, I have spent a couple of hours to implement it correctly.
It should also give us a base to encode more IBM error codes.

I am strongly opposed to exposing libxml2 error codes in COBOL: libxml2 does not guarantee that it will always generate the same errors on the same input across versions, so we will end up having to emulate older libxml2 in order to preserve COBOL code that relies on them.
Instead, we should emulate the existing XMLSS codes that existing COBOL code may be currently using.
Most importantly, when we don't know which error it should be, we should not send any information to COBOL besides the severity (and that includes any additional XML-TEXT).

GitMensch · 2026-05-04T19:43:43Z

Just a note: tested the exception/namespace one with MF vc7 -> sigsegv; tested with vc9

COMPAT:

START-OF-DOCUMENT             +000000000|<pfx0:root xmlns:pfx1="http://whatever"><pfx1:localElName1><pfx2:localElName2/><pfx3:localElName3 pfx4:localAtName4="">c1<pfx5:localElName5 pfx6:localAtName6=""/>c2</pfx3:localElName3>c3</pfx1:localElName1></pfx0:root>|||
START-OF-ELEMENT              +000000000|pfx0:root|||
ATTRIBUTE-NAME                +000000000|xmlns:pfx1|||
ATTRIBUTE-CHARACTERS          +000000000|http://whatever|||
START-OF-ELEMENT              +000000000|pfx1:localElName1|||
START-OF-ELEMENT              +000000000|pfx2:localElName2|||
END-OF-ELEMENT                +000000000|pfx2:localElName2|||
START-OF-ELEMENT              +000000000|pfx3:localElName3|||
ATTRIBUTE-NAME                +000000000|pfx4:localAtName4|||
ATTRIBUTE-CHARACTERS          +000000000||||
CONTENT-CHARACTERS            +000000000|c1|||
START-OF-ELEMENT              +000000000|pfx5:localElName5|||
ATTRIBUTE-NAME                +000000000|pfx6:localAtName6|||
ATTRIBUTE-CHARACTERS          +000000000||||
END-OF-ELEMENT                +000000000|pfx5:localElName5|||
CONTENT-CHARACTERS            +000000000|c2|||
END-OF-ELEMENT                +000000000|pfx3:localElName3|||
CONTENT-CHARACTERS            +000000000|c3|||
END-OF-ELEMENT                +000000000|pfx1:localElName1|||
END-OF-ELEMENT                +000000000|pfx0:root|||
END-OF-DOCUMENT               +000000000||||

--> no exception, namespace as part of the element name

XMLSS:

START-OF-DOCUMENT             +000000000||||
EXCEPTION                     +000000004|<pfx0:root xmlns:pfx1="http://whatever"|||
EXCEPTION                     +000000118|<pfx0:root xmlns:pfx1="http://whatever"><pfx1:localElName1><pfx2:localElName2/><pfx3:localElName3 pfx4:localAtName4=""|||
START-OF-ELEMENT              +000000000|localElName3|pfx3||
ATTRIBUTE-NAME                +000000000|localAtName4|pfx4||
CONTENT-CHARACTERS            +000000000|c1|||
EXCEPTION                     +000000160|<pfx0:root xmlns:pfx1="http://whatever"><pfx1:localElName1><pfx2:localElName2/><pfx3:localElName3 pfx4:localAtName4="">c1<pfx5:localElName5 pfx6:localAtName6=""|||
START-OF-ELEMENT              +000000000|localElName5|pfx5||
ATTRIBUTE-NAME                +000000000|localAtName6|pfx6||
END-OF-ELEMENT                +000000000|localElName5|pfx5||
CONTENT-CHARACTERS            +000000000|c2|||
END-OF-ELEMENT                +000000000|localElName3|pfx3||
CONTENT-CHARACTERS            +000000000|c3|||
END-OF-ELEMENT                +000000000|localElName1|pfx1|http://whatever|
END-OF-ELEMENT                +000000000|root|pfx0||
END-OF-DOCUMENT               +000000000||||

--> exception-text always "a fragment of alphanumeric text" (everything parsed until the exception), and on unclear places, ... and the error code does not make sense to me (compared to their docs https://docs.rocketsoftware.com/bundle/visualcobolvs_ug_100/page/rpb1743378344996.html)

GitMensch · 2026-05-04T19:49:14Z

Instead, we should emulate the existing XMLSS codes that existing COBOL code may be currently using.

I agree that this will be best.

Most importantly, when we don't know which error it should be, we should not send any information to COBOL besides the severity (and that includes any additional XML-TEXT).

I disagree - we can have an XML-TEXT which explicitly starts with "unhandled internal xml error %d: error text" -> that way we can definitely tell people that this goes away (and they ideally should provide us with a reproducer - as we can then cross-check with IBM), while the contained internal libxml error number is still helpful for developers, as they can look it up in the libxml header (or deepwiki) to check what the actual error is.

Having no detail information whatsoever (the runtime warning may be suppressed or is not easily relatable to the current place by being put into a different output file), for example in a nightly batch job that already took 2 hours, is definitely bad.
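To make the two positions concrete, here is a rough C sketch of how a known error could be mapped to the IBM code while everything else falls back to a severity-only XML-CODE plus the proposed marker text. All names are hypothetical and none of this is the PR's code. The numbers are taken from outputs in this thread: 201 is the error number printed in the libcob warnings, 264192 and 264193 are the IBM codes shown for undeclared attribute and element prefixes, and 262144 and 1048576 are the severity-only values seen in the GnuCOBOL outputs.

```c
#include <stdio.h>

/* Values copied from outputs in this thread (assumed, not from headers) */
#define LIBXML_NS_UNDEFINED		201
#define IBM_NS_UNDEFINED_ATTRIBUTE	264192UL
#define IBM_NS_UNDEFINED_ELEMENT	264193UL
#define SEVERITY_RECOVERABLE		262144UL
#define SEVERITY_FATAL			1048576UL

/* Hypothetical result: what would go to XML-CODE and XML-TEXT */
typedef struct {
	unsigned long	xml_code;
	char		text[256];
} exception_info;

static exception_info
describe_exception (int lib_code, int on_attribute, int recoverable,
		    const char *qname, const char *message)
{
	exception_info info;

	if (lib_code == LIBXML_NS_UNDEFINED && qname != NULL) {
		/* known case: emulate the IBM code and text */
		info.xml_code = on_attribute ? IBM_NS_UNDEFINED_ATTRIBUTE
					     : IBM_NS_UNDEFINED_ELEMENT;
		snprintf (info.text, sizeof info.text, "%s", qname);
	} else {
		/* unknown case: severity only, plus the marker text */
		info.xml_code = recoverable ? SEVERITY_RECOVERABLE
					    : SEVERITY_FATAL;
		snprintf (info.text, sizeof info.text,
			  "unhandled internal xml error %d: %s",
			  lib_code, message != NULL ? message : "");
	}
	return info;
}
```

The fixed prefix in the fallback text gives COBOL code something stable to test for, while the embedded number remains explicitly unsupported and free to change.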

GitMensch · 2026-05-04T19:51:43Z

+EXCEPTION                     +000264192|pfx4:localAtName4|||
+EXCEPTION                     +000264193|pfx3:localElName3|||
+START-OF-ELEMENT              +000000000|localElName3|pfx3||
+ATTRIBUTE-NAME                +000000000|localAtName4|pfx4||


just an observation, not a request to change it:

the order of exceptions is different to IBM where we get the namespace error at the place that uses it, not the place where it is parsed. As noted: that's just a difference; it still would be good to have a note about that (for now possibly in the NEWS entry, later moved to gnucobol.texi).

Sure, a note like that should also say that this PR, unlike IBM's parser, does not always immediately send incomplete events before END-OF-INPUT and can buffer them.

chuck-haatvedt · 2026-05-04T20:00:04Z

here is a link to the IBM reference manual for XML System Services User's Guide and
Reference. Check Appendix A & B for the XML-CODE values. The RETURN CODE is stored in the high order 2 bytes and the REASON CODE is in the low order 2 bytes.

https://www.ibm.com/docs/en/SSLTBW_3.2.0/pdf/gxla100_v3r2.pdf
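The split described above (return code in the high-order two bytes, reason code in the low-order two bytes) can be checked against the XML-CODE values quoted in this thread with a short C sketch. `xml_code_parts` and `split_xml_code` are hypothetical names used only for illustration, and the sketch assumes a non-negative value that fits in 32 bits.

```c
#include <stdio.h>

/* Hypothetical helper type: the two halves of an XML-CODE value */
typedef struct {
	unsigned int	return_code;	/* high-order half-word */
	unsigned int	reason_code;	/* low-order half-word */
} xml_code_parts;

static xml_code_parts
split_xml_code (unsigned long xml_code)
{
	xml_code_parts parts;

	parts.return_code = (unsigned int)((xml_code >> 16) & 0xFFFFu);
	parts.reason_code = (unsigned int)(xml_code & 0xFFFFu);
	return parts;
}

/* Print one value in the form used for discussion in this thread */
static void
print_xml_code (unsigned long xml_code)
{
	xml_code_parts p = split_xml_code (xml_code);

	printf ("XML-CODE %09lu -> return code %u, reason code 0x%04X\n",
		xml_code, p.return_code, p.reason_code);
}
```

Applied to the values above, 264193 splits into return code 4 and reason code 0x0801, 264192 into 4 and 0x0800, and 798761 into 12 and 0x3029. The earlier GnuCOBOL value 262345 splits into 4 and 201, which is consistent with the error number 201 shown in the libcob warnings.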

Chuck Haatvedt

chuck-haatvedt · 2026-05-05T03:44:36Z

change xmlup.cbl as follows note the change in the first tag...

        Identification division.
         Program-id. XMLup.
       Data division.
        Working-storage section.
         1 d.
          2 pic x(40) value '<pfxz:root xmlns:pfx1="http://whatever">'.
          2 pic x(19) value '<pfx1:localElName1>'.
          2 pic x(20) value '<pfx2:localElName2/>'.
          2 pic x(40) value '<pfx3:localElName3 pfx4:localAtName4="">'.
          2 pic x(02) value 'c1'.
          2 pic x(41) value '<pfx5:localElName5 pfx6:localAtName6=""/>'.
          2 pic x(24) value 'c2</pfx3:localElName3>c3'.
          2 pic x(32) value '</pfx1:localElName1></pfx0:root>'.
       Procedure division.
         main.
           xml parse d processing procedure h
           goback.
         h.
           display xml-event xml-code '|' xml-text '|'
               xml-namespace-prefix '|'
               xml-namespace '|'
      * In the original IBM example they check specifically the two exceptions
      * codes for undeclared namespaces: 264192 and 264193
      * We do not yet support these IBM code
      * -> ignore all recoverable errors for now
      *    if xml-event = 'EXCEPTION' and xml-code = 264192 or 264193
             move 0 to xml-code
      *    end-if
           .
       End program XMLup.

when compiled and run with this gnucobol pr it generates the following output

F:\AA-minGW32-static\XML>xmlup
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfxz on root is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx2 on localElName2 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx4 for localAtName4 on localElName3 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx3 on localElName3 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx6 for localAtName6 on localElName5 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx5 on localElName5 is not defined
libcob: warning: XML PARSE non-recoverable error (76): Opening and ending tag mismatch: root line 1 and root
START-OF-DOCUMENT             +000000000||||
EXCEPTION                     +000264193|pfxz:root|||
START-OF-ELEMENT              +000000000|root|pfxz||
NAMESPACE-DECLARATION         +000000000||pfx1|http://whatever|
START-OF-ELEMENT              +000000000|localElName1|pfx1|http://whatever|
EXCEPTION                     +000264193|pfx2:localElName2|||
START-OF-ELEMENT              +000000000|localElName2|pfx2||
END-OF-ELEMENT                +000000000|localElName2|pfx2||
EXCEPTION                     +000264192|pfx4:localAtName4|||
EXCEPTION                     +000264193|pfx3:localElName3|||
START-OF-ELEMENT              +000000000|localElName3|pfx3||
ATTRIBUTE-NAME                +000000000|localAtName4|pfx4||
ATTRIBUTE-CHARACTERS          +000000000||||
CONTENT-CHARACTERS            +000000000|c1|||
EXCEPTION                     +000264192|pfx6:localAtName6|||
EXCEPTION                     +000264193|pfx5:localElName5|||
START-OF-ELEMENT              +000000000|localElName5|pfx5||
ATTRIBUTE-NAME                +000000000|localAtName6|pfx6||
ATTRIBUTE-CHARACTERS          +000000000||||
END-OF-ELEMENT                +000000000|localElName5|pfx5||
CONTENT-CHARACTERS            +000000000|c2|||
END-OF-ELEMENT                +000000000|localElName3|pfx3||
CONTENT-CHARACTERS            +000000000|c3|||
END-OF-ELEMENT                +000000000|localElName1|pfx1|http://whatever|
EXCEPTION                     +001048576||||

However, the same code compiled with IBM Enterprise COBOL generates the following output:

START-OF-DOCUMENT             000000000||||                                     
EXCEPTION                     000264193|pfxz:root|||                            
START-OF-ELEMENT              000000000|root|pfxz||                             
NAMESPACE-DECLARATION         000000000||pfx1|http://whatever|                  
START-OF-ELEMENT              000000000|localElName1|pfx1|http://whatever|      
EXCEPTION                     000264193|pfx2:localElName2|||                    
START-OF-ELEMENT              000000000|localElName2|pfx2||                     
END-OF-ELEMENT                000000000|localElName2|pfx2||                     
EXCEPTION                     000264193|pfx3:localElName3|||                    
START-OF-ELEMENT              000000000|localElName3|pfx3||                     
EXCEPTION                     000264192|pfx4:localAtName4|||                    
ATTRIBUTE-NAME                000000000|localAtName4|pfx4||                     
ATTRIBUTE-CHARACTERS          000000000||||                                     
CONTENT-CHARACTERS            000000000|c1|||                                   
EXCEPTION                     000264193|pfx5:localElName5|||                    
START-OF-ELEMENT              000000000|localElName5|pfx5||                     
EXCEPTION                     000264192|pfx6:localAtName6|||                    
ATTRIBUTE-NAME                000000000|localAtName6|pfx6||                     
ATTRIBUTE-CHARACTERS          000000000||||                                     
END-OF-ELEMENT                000000000|localElName5|pfx5||                     
CONTENT-CHARACTERS            000000000|c2|||                                   
END-OF-ELEMENT                000000000|localElName3|pfx3||                     
CONTENT-CHARACTERS            000000000|c3|||                                   
END-OF-ELEMENT                000000000|localElName1|pfx1|http://whatever|      
EXCEPTION                     000798773|<pfxz:root xmlns:pfx1="http://whatever">
pfx3:localElName3 pfx4:localAtName4="">c1<pfx5:localElName5 pfx6:localAtName6=""
Name1></|||

Note that the XML-CODE on the last EXCEPTION event differs from GnuCOBOL's (798773 vs. 1048576).

I think that this demonstrates the difficulty of attempting to map all of the libxml2 err->code values to IBM equivalent values.

Also note that 798773 === x'000C3035' which is a value found in XML System Services User's Guide and Reference in Appendix B. So we would need to create a cross reference mapping of the libxml2 err->code values to those in Appendix B.
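For reference, the decomposition mentioned above can be checked mechanically. The following is only a sketch: the helper names are made up for illustration, and reading the two halves as a "return code" and a "reason code" is an assumption based on the x'000C3035' notation, not something taken from mlio.c.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: the two 16-bit halves of an XML-CODE value.
   The field names are an assumption, not taken from libcob. */
typedef struct {
    uint16_t return_code;  /* high-order halfword */
    uint16_t reason_code;  /* low-order halfword */
} xml_code_parts;

/* Split an XML-CODE into its high-order and low-order halfwords. */
static xml_code_parts split_xml_code(uint32_t xml_code)
{
    xml_code_parts p;
    p.return_code = (uint16_t)(xml_code >> 16);
    p.reason_code = (uint16_t)(xml_code & 0xFFFFu);
    return p;
}

/* Print one XML-CODE in decimal, in x'........' form, and split. */
static void print_xml_code(uint32_t xml_code)
{
    xml_code_parts p = split_xml_code(xml_code);
    printf("%9lu = x'%08lX' (high %04X, low %04X)\n",
           (unsigned long)xml_code, (unsigned long)xml_code,
           (unsigned)p.return_code, (unsigned)p.reason_code);
}
```

Applied to the values in the two outputs: 798773 splits into high 000C / low 3035, 264192 and 264193 split into high 0004 / low 0800 and 0801, and the 1048576 reported by GnuCOBOL splits into high 0010 / low 0000.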

I think that using the err->code and err->msg would be much more useful for programmers to diagnose any xml errors.

Chuck Haatvedt

GBertholon · 2026-05-05T10:55:31Z

As long as it is very clear that we do not guarantee any form of stability on those error messages, I agree we could indeed put the libxml2 error text inside XML-TEXT for unhandled errors.
I simply want to avoid any COBOL code that relies on this text for anything else than propagating error messages, because maintaining such stability around unstable libxml2 errors would be a long term nightmare.

I already noticed subtle changes across libxml2 versions for some of the reported errors. For instance, version 2.12.7+dfsg+really2.9.14-2.1+deb13u2, which you can find on current Debian stable, can behave very oddly on syntax errors in the root element tag.

GBertholon · 2026-05-05T11:10:38Z

> I think that this demonstrates the difficulty of attempting to map all of the libxml2 err->code values to IBM equivalent values..

I don't think we will ever map all of them, just the few useful ones that COBOL programmers may rely on.

> Also note that 798773 === x'000C3035' which is a value found in XML System Services User's Guide and Reference in Appendix B. So we would need to create a cross reference mapping of the libxml2 err->code values to those in Appendix B.

This test suggests that maybe we should use XRC_NOT_WELL_FORMED instead of XRC_FATAL for parsing errors. So probably x'000C0000' would be slightly better for our default unmapped unrecoverable parsing error.
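A partial mapping with a fallback could look roughly like the sketch below. The function name, the `on_attribute` flag and the macro names are hypothetical; the numeric values are the ones visible in this thread (libxml2 code 201 in the warnings, 264192 for the undeclared attribute prefix, 264193 for the undeclared element prefix, x'000C0000' as the proposed default). It is not the code in mlio.c.

```c
#include <stdint.h>

/* XML-CODE values as they appear in the IBM output in this thread. */
#define XMLCODE_UNDECLARED_ATTR_PREFIX 264192u      /* pfx4:localAtName4 */
#define XMLCODE_UNDECLARED_ELEM_PREFIX 264193u      /* pfx2:localElName2 */
/* Proposed default for unmapped unrecoverable parsing errors. */
#define XMLCODE_NOT_WELL_FORMED        0x000C0000u

/* libxml2 err->code as printed in the "recoverable error (201)" warnings. */
#define LIBXML_CODE_UNDEFINED_NS 201

/* Hypothetical helper: translate the few libxml2 codes COBOL programs may
   rely on, and fall back to a generic "not well formed" code otherwise. */
static uint32_t map_xml_code(int libxml_code, int on_attribute)
{
    switch (libxml_code) {
    case LIBXML_CODE_UNDEFINED_NS:
        return on_attribute ? XMLCODE_UNDECLARED_ATTR_PREFIX
                            : XMLCODE_UNDECLARED_ELEM_PREFIX;
    default:
        /* everything else stays unmapped */
        return XMLCODE_NOT_WELL_FORMED;
    }
}
```

With this shape, the tag mismatch reported above as libxml2 error 76 would surface as 786432 (x'000C0000') instead of 1048576 (x'00100000').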

GitMensch · 2026-05-05T19:28:18Z

Two quick notes: the snprintf should use the _MAX define, which is one byte less than its matching _BUFF.

If you rebase then the warnings and errors in CI should be fixed.

GBertholon · 2026-05-06T09:58:53Z

I don't understand either of the two comments:

- snprintf takes the size of the buffer as its argument, not the index of the last writable slot (i.e. it writes the last char at sz-2 and '\0' at sz-1)
- you seem to have repaired the CI, and it now works even without rebasing anything

GitMensch · 2026-05-06T11:34:47Z

CI: correct (it works here in the PR, just not in your own branch).
snprintf - correct, there is a MSVC compat issue, but that's unrelated - and can be fixed by

snprintf(err_text_buf, COB_MINI_BUFF, ...
err_text_buf[COB_MINI_MAX] = 0;

a construct you'll see in many places where snprintf is used. So... maybe use here as well :-)
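A minimal sketch of that construct, for illustration only. `MINI_BUFF` and `MINI_MAX` are local stand-ins for the libcob defines (the size 16 is an assumption chosen to make truncation easy to see), and `format_err_text` is a hypothetical helper. The explicit terminator is presumably there for runtimes whose snprintf variant does not terminate the buffer on truncation.

```c
#include <stdio.h>

/* Stand-ins for the _BUFF / _MAX pair: _MAX is one byte less than _BUFF. */
#define MINI_BUFF 16
#define MINI_MAX  (MINI_BUFF - 1)

/* Format "msg (code)" into buf, truncating if it does not fit. */
static void format_err_text(char buf[MINI_BUFF], const char *msg, int code)
{
    /* snprintf gets the full buffer size; a conforming implementation
       writes at most MINI_BUFF - 1 characters followed by '\0' */
    snprintf(buf, MINI_BUFF, "%s (%d)", msg, code);
    /* force termination in the last slot, in case the runtime's snprintf
       left the buffer unterminated after truncating */
    buf[MINI_MAX] = 0;
}
```

On a conforming snprintf the second statement is redundant but harmless, which is why the size argument stays at the _BUFF value and only the index uses _MAX.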

Note: I want to do a final comparison vs. IBM (may need to wait until Saturday) and maybe a first performance check with some bigger XML (both "in general" and vs. MF) - then merge upstream (not later than next week, currently).

codecov-commenter · 2026-05-06T12:20:59Z

Codecov Report

❌ Patch coverage is 72.13115% with 85 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (gitside-gnucobol-3.x@326ce55). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
libcob/mlio.c	72.78%	51 Missing and 29 partials ⚠️
cobc/parser.y	33.33%	4 Missing ⚠️
cobc/typeck.c	0.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@                   Coverage Diff                   @@
##             gitside-gnucobol-3.x     #263   +/-   ##
=======================================================
  Coverage                        ?   67.82%           
=======================================================
  Files                           ?       34           
  Lines                           ?    61565           
  Branches                        ?    16043           
=======================================================
  Hits                            ?    41756           
  Misses                          ?    13851           
  Partials                        ?     5958

GitMensch · 2026-05-07T15:03:23Z

Just FYI: I've run some production XML (156MB) through a program just DISPLAYing the event/data.
The result is nearly identical to MF - we just generate one END-OF-INPUT event "too many" (it seems the parsers in both MF and IBM stop after END-DOCUMENT).

MF is much slower (factor 5-6). More than 20% of the CPU time came from DISPLAY (putc); if this is taken out of the values, then 3.2% is spent in cob_xml_parse::get_xml_code (the inlined function) and 50.2% in cob_xml_parse::xml_parse.
Of the latter, xmlParseChunk takes 26.4% (including xmlParseTryOrFinish with 22.8%) and xml_process_next_event 22.6%.

Possibly more checks next week.

chuck-haatvedt · 2026-05-07T18:52:53Z

Here is a link to the Wikipedia dumps website, where you can download some huge XML files for testing:

https://dumps.wikimedia.org/enwiki/latest/

this is what I've downloaded for performance testing.

05/07/2026 01:08 PM 114,446,966 enwiki-latest-pages-articles-multistream16.xml-p20460153p20570392
05/07/2026 01:40 PM 454,514,835 enwiki-latest-pages-articles-multistream18.xml-p26716198p27121850
05/07/2026 01:40 PM 297,012,315 enwiki-latest-pages-articles-multistream22.xml-p44496246p44788941
3 File(s) 865,974,116 bytes

There are larger files available on this site if you want to test with files larger than 1 GB.

chuck-haatvedt · 2026-05-08T03:48:21Z

Here are my performance test results using CBL_OPEN_FILE, CBL_READ_FILE and CBL_CLOSE_FILE, reading 128 KB chunks of data. All displays of XML data / events are removed; the code just counts the bytes read and the XML events returned.

Note that I had to fix a bug in fileio.c: cob_sys_read_file does not return the number of bytes fetched into the buffer.
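The reason the fetched byte count matters shows up in the measurement loop itself: the last chunk is normally shorter than 128 KB, so only the bytes actually read may be passed on to the parser. The sketch below uses plain C stdio rather than the COBOL file routines, and `count_chunks` is a hypothetical helper, not part of libcob.

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (128 * 1024)

/* Read a file CHUNK_SIZE bytes at a time and count bytes and chunks.
   Returns 0 on success, -1 on failure. */
static int count_chunks(const char *path,
                        unsigned long long *bytes, unsigned long *chunks)
{
    FILE   *fp;
    char   *buf;
    size_t  got;
    int     rc;

    fp = fopen(path, "rb");
    if (fp == NULL) {
        return -1;
    }
    buf = malloc(CHUNK_SIZE);
    if (buf == NULL) {
        fclose(fp);
        return -1;
    }
    *bytes = 0;
    *chunks = 0;
    while ((got = fread(buf, 1, CHUNK_SIZE, fp)) > 0) {
        /* only buf[0 .. got-1] is valid here; a parser would have to be
           fed exactly `got` bytes, not CHUNK_SIZE */
        *bytes += got;
        ++*chunks;
    }
    rc = ferror(fp) ? -1 : 0;
    free(buf);
    fclose(fp);
    return rc;
}
```

For the 114,446,966-byte file in the run below, this chunk size gives 873 full chunks plus a final chunk of 21,110 bytes.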

XML document size ====> 24 MB
elapsed time =========> 6.15 seconds

F:\AA-minGW32-static\XML>xmlfast
FILE-NAME ==> F:\XML\enwiki-latest-pages-articles-multistream16.xml-p20460153p20570392
STARTING WITH: <mediawiki xmlns="http://www.mediawiki.org/xml/exp
TIME PROCESS XML DOCUMENT ==> +0000006151 MILLISECONDS
*** CPU CLOCK CYCLES ==> 24,002,229,300
FILE SIZE IN BYTES ==> 114,446,966
NUMBER OF EVENTS ====> 4,892,274

GitMensch · 2026-05-08T07:24:28Z

@chuck-haatvedt Have you tested with a different chunk size in mlio.c (the open question was if we should make that configurable)?

chuck-haatvedt · 2026-05-08T07:30:17Z

> @chuck-haatvedt Have you tested with a different chunk size in mlio.c (the open question was if we should make that configurable)?

Simon, from my analysis it appears that this version of xml_parse just parses the chunk passed from the cobol program. Let me know if your inspection of the code is different.

I just did a build using the xml_parse code from this ddeclerck:xml_parse code base.

chuck-haatvedt · 2026-05-08T07:35:18Z

Parsing XML documents as raw data does require the XML to be well formed: my testing of "raw" non-well-formed XML data failed with an EXCEPTION event.

So this should be mentioned in the Programmer's Guide, so that users are aware of this requirement when passing "raw" data to XML PARSE.

GitMensch · 2026-05-08T07:45:57Z

Can you please write a short entry that you'd like to see about that in the documentation? I'm not really sure I understand what you refer to.

Also: shouldn't "non valid data" always return an error (or do you only mean bad line breaks not "visible" when passing lsq chunks, but breaking the parsing when having big "non-lsq" chunks that include those)?

ddeclerck force-pushed the xml_parse branch from 85cd664 to a24e630 Compare December 4, 2025 15:56

GitMensch requested changes Dec 4, 2025

Comment thread tests/testsuite.src/run_ml.at Outdated

Comment thread libcob/mlio.c

Comment thread libcob/mlio.c

ddeclerck force-pushed the xml_parse branch from a24e630 to 5807ac7 Compare December 5, 2025 12:27

GitMensch requested changes Dec 5, 2025

GBertholon force-pushed the xml_parse branch from 5807ac7 to 9222d5d Compare April 20, 2026 08:31

GBertholon force-pushed the xml_parse branch from 9222d5d to 79eb432 Compare April 20, 2026 09:19

GBertholon requested a review from GitMensch April 20, 2026 11:57

Fix reference modifiers on XML-TEXT & co. builtin registers

c39341d

GitMensch requested changes Apr 21, 2026

Remove spurious events when there is no <?xml?> tag and stop exposing libxml2 error codes in COBOL

abd4379

GBertholon requested a review from GitMensch April 24, 2026 13:56

GitMensch reviewed Apr 24, 2026

Comment thread cobc/ChangeLog

Comment thread tests/testsuite.src/run_ml.at Outdated

Comment thread tests/testsuite.src/run_ml.at

Comment thread libcob/mlio.c

Comment thread libcob/mlio.c Outdated

GBertholon added 2 commits April 24, 2026 17:29

ChangeLog & copyright adjustments

cdae640

Merge together tests with and without libxml2

8fed31a

GitMensch requested changes Apr 24, 2026

GBertholon added 3 commits April 27, 2026 16:06

Small code improvement suggested during the code review for XML PARSE

2216928

Use empty XML-* registers, instead of NULL, and move LINKAGE without data check later in the parsing.

00d9295

GBertholon requested a review from GitMensch April 29, 2026 13:13

GitMensch reviewed May 3, 2026

Comment thread tests/testsuite.src/run_ml.at Outdated

Implement IBM XMLSS EXCEPTION code for undeclared prefix

fd1ce2c

Also revert the undeclared namespace test to the IBM original one since this is fully implemented now.

GBertholon requested a review from GitMensch May 4, 2026 14:16

GitMensch reviewed May 4, 2026

Send libxml2 error message as XML-TEXT for unhandled errors

dd3cf10

GBertholon force-pushed the xml_parse branch from 8fccb51 to dd3cf10 Compare May 6, 2026 12:11

Adapt MF dialect to include XML-* registers

6bbabfe

XML-INFORMATION and XML-*NAMESPACE* are not available with MF.
