XML PARSE #263

Open
ddeclerck wants to merge 12 commits into OCamlPro:gitside-gnucobol-3.x from ddeclerck:xml_parse

Conversation

@ddeclerck

Collaborator

Note: initial commit from Chuck, fixes to come

@GitMensch GitMensch left a comment

Collaborator

Just a quick note for the first iteration: Chuck's changes are based on mlio from August 2025, so the "real" version may be easier to get by checking out a previous commit, replacing the file and committing locally, then fetching the newer commit with rebase-merging.

Comment thread tests/testsuite.src/run_ml.at Outdated
Comment thread libcob/mlio.c
Comment thread libcob/mlio.c
@GitMensch

GitMensch commented Dec 5, 2025

Collaborator

@chuck-haatvedt passed me the newest file (you may do a diff to add a changelog entry) which looks much better concerning libxml version compat. It is from November 14th: mlio.c with a note

that file was built / tested on
libcob (branches/gnucobol-3.x r5603M) 3.3-dev.5603

@GitMensch GitMensch left a comment

Collaborator

Thanks for inspecting / working on necessary changes.
I think we can have those in at least a second commit :-)

Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
This is work by Chuck Haatvedt edited by David Declerck.

	* mlio.c: modified to support xml parse with xmlss.
	  eliminated the xml_event_data structure and moved that data
	  into the xml_event structure. Created a new enum cob_xml_registers
	  and added it to the add_xml_event_data function. This function was
	  modified to update the xml_event structure. All of the context parser
	  callback functions were modified to use the add_xml_event_data function.
	  the cob_xml_parse and xml_parse functions were modified to support
	  the new end_of_input event required by xmlss. a new eof variable
	  was added to the xml_state structure so that the endDocument callback
	  function could be triggered by the parser in the xml_parse function.

	  TODO ==> logic needs to be added to support returning NATIONAL data
	  this needs to support the RETURNING NATIONAL phrase.
	* common.h: rename COB_XML_PARSE_XMLNSS into COB_XML_PARSE_XMLSS to match
	  the IBM option name
	* mlio.c [WITH_XML2]: Fix issues in XML PARSE handling most notably a use
	  after free error if the internal buffer needs to grow during the parsing.
	  Respect the high order half-word for exception XML-CODE.
	  Reduce the number of parsing states by removing useless ones,
	  and encode eof in these states.
	  Handle XML chunks with more than one recoverable error.
	  Trigger ON EXCEPTION code after EXCEPTION XML events.
	* parser.y: remove the CB_PENDING warning on XML PARSE but still warn for
	  untested XML PARSE RETURNING NATIONAL and XML PARSE VALIDATING.
	* typeck.c: remove invalid call to cob_check_based for XML-* builtin variable
	  length registers (like XML-TEXT)
	* codegen.c: remove the uninitialized and unused b_* field for XML-* builtin
	  variable length registers
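
The use-after-free named in the mlio.c entry above is the usual hazard of keeping raw pointers into a buffer that may be reallocated while parsing continues. A minimal sketch of one common way to avoid it, storing offsets and resolving them only at the point of use, is shown below. The `grow_buf` structure and both helpers are hypothetical and are not the code in mlio.c; the actual fix in this PR may work differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; not the structure used in mlio.c. */
struct grow_buf {
	char   *data;
	size_t  len;
	size_t  cap;
};

/* Append n bytes and return the offset at which they were stored,
   or (size_t)-1 on allocation failure. The buffer may move in memory. */
static size_t
grow_buf_append (struct grow_buf *b, const char *src, size_t n)
{
	size_t off;

	assert (b != NULL && src != NULL);
	off = b->len;
	if (b->len + n > b->cap) {
		size_t new_cap = b->cap ? b->cap : 16;
		char *p;

		while (new_cap < b->len + n) {
			new_cap *= 2;
		}
		p = realloc (b->data, new_cap);
		if (p == NULL) {
			return (size_t)-1;
		}
		/* any pointer into the old block saved before this
		   point may now dangle; offsets stay valid */
		b->data = p;
		b->cap = new_cap;
	}
	memcpy (b->data + off, src, n);
	b->len += n;
	return off;
}

/* Resolve an offset to a pointer only when it is actually needed. */
static const char *
grow_buf_at (const struct grow_buf *b, size_t off)
{
	assert (b != NULL && off <= b->len);
	return b->data + off;
}
```

Callers keep the returned offsets (for example in an event record) and call `grow_buf_at` right before copying the text into a register, so a later growth of the buffer cannot invalidate them.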
@GBertholon

GBertholon commented Apr 20, 2026

I am taking responsibility for this PR on OCamlPro's behalf. I applied changes according to your comments and fixed several issues. @GitMensch: is this new version more satisfactory?

@GBertholon GBertholon requested a review from GitMensch April 20, 2026 11:57

@GitMensch GitMensch left a comment

Collaborator

The output data is not as expected:

  • version/standalone/encoding information that is not included in the document should not appear in the returned values
  • exceptions should include the exception data

Comment thread tests/testsuite.src/run_ml.at
Comment thread tests/testsuite.src/run_ml.at Outdated
Comment thread tests/testsuite.src/run_ml.at Outdated
Comment thread tests/testsuite.src/run_ml.at
Comment thread tests/testsuite.src/run_ml.at Outdated
Comment thread tests/testsuite.src/run_ml.at Outdated
Comment thread tests/testsuite.src/run_ml.at Outdated
Comment on lines +1221 to +1235
EXCEPTION +000262345||||
START-OF-ELEMENT +000000000|root|pfx0||
NAMESPACE-DECLARATION +000000000||pfx1|http://whatever|
START-OF-ELEMENT +000000000|localElName1|pfx1|http://whatever|
EXCEPTION +000262345||||
START-OF-ELEMENT +000000000|localElName2|pfx2||
END-OF-ELEMENT +000000000|localElName2|pfx2||
EXCEPTION +000262345||||
EXCEPTION +000262345||||
START-OF-ELEMENT +000000000|localElName3|pfx3||
ATTRIBUTE-NAME +000000000|localAtName4|pfx4||
ATTRIBUTE-CHARACTERS +000000000||||
CONTENT-CHARACTERS +000000000|c1|||
EXCEPTION +000262345||||
EXCEPTION +000262345||||

Collaborator

The exceptions should have the data from the exception in the register - this is the IBM output (with XMLSS):


 EXCEPTION                     000264193|pfx0:root|||
 START-OF-ELEMENT              000000000|root|pfx0||
 NAMESPACE-DECLARATION         000000000||pfx1|http://whatever|
 START-OF-ELEMENT              000000000|localElName1|pfx1|http://whatever|
 EXCEPTION                     000264193|pfx2:localElName2|||
 START-OF-ELEMENT              000000000|localElName2|pfx2||
 END-OF-ELEMENT                000000000|localElName2|pfx2||
 EXCEPTION                     000264193|pfx3:localElName3|||
 START-OF-ELEMENT              000000000|localElName3|pfx3||
 EXCEPTION                     000264192|pfx4:localAtName4|||
 ATTRIBUTE-NAME                000000000|localAtName4|pfx4||
 ATTRIBUTE-CHARACTERS          000000000||||
 CONTENT-CHARACTERS            000000000|c1|||
 EXCEPTION                     000264193|pfx5:localElName5|||
 START-OF-ELEMENT              000000000|localElName5|pfx5||
 EXCEPTION                     000264192|pfx6:localAtName6|||

@GBertholon GBertholon Apr 24, 2026

Indeed, but I do not really have time right now to implement the mapping between libxml2 and IBM exception codes, and I cannot imagine a meaningful code that uses the XML-TEXT of an EXCEPTION event without first checking the XML-CODE...
I would say that the support for XML PARSE without exception codes is useful enough to merge this PR first and then take care of those EXCEPTION events another time.

The behavior I have implemented simply lets the COBOL developer choose between ignoring all recoverable errors or failing on the first one.
That said, I think I made a mistake here by trying to pass the libxml2 error code to COBOL while it is not fully stable; this will be fixed in my next commit (I should simply tell whether the error is recoverable or not).

@GitMensch GitMensch Apr 24, 2026

Collaborator

This is not about matching exception codes but about outputting the part that resulted in an exception in the appropriate register (as done by IBM, MF ... and, if I remember correctly, also libxml2).

Collaborator

Note that we explicitly noted in NEWS that the exception codes are not identical to other implementations (I think MF and IBM differ as well).

@GBertholon GBertholon Apr 24, 2026

Yes but the definition of "the part that resulted in an exception" is very unclear unless you also know which exception is returned.
For me it is currently out of scope to do any kind of exception specific work for EXCEPTION event aside from distinguishing recoverable and non-recoverable.

Besides, the IBM documentation says (https://www.ibm.com/docs/en/cobol-zos/6.3.0?topic=registers-xml-event) that for EXCEPTION events, "XML-TEXT or XML-NTEXT contains the document fragment up to the point of the error or anomaly that caused the exception.", but in practice this contradicts the output you mentioned, where only the name of the element or attribute is placed in XML-TEXT.

@GBertholon GBertholon requested a review from GitMensch April 24, 2026 13:56
Comment thread cobc/ChangeLog
Comment thread tests/testsuite.src/run_ml.at Outdated
Comment thread tests/testsuite.src/run_ml.at
Comment thread libcob/mlio.c
Comment thread libcob/mlio.c Outdated

@GitMensch GitMensch left a comment

Collaborator

I consider that my "final" review. There are some things open, but I think we're nearly done to finally get this upstream!

But I'd like to have a review of @chuck-haatvedt as the original author of the code (and the rewrite from my initial event/data handling) before, if possible.

Comment thread libcob/mlio.c
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment thread libcob/mlio.c Outdated
Comment on lines +2081 to +2083
/* IBM doc states that we should store 1 in XML-INFORMATION on events
ATTRIBUTE-CHARACTERS and CONTENT-CHARACTERS if the value in XML-TEXT
is complete. It seems to be always the case with libxml2. */

Collaborator

Is this also true for the push parser (where the COBOL program feeds in data, commonly from a line sequential file) when the attribute is split across multiple lines?
Do we have a testcase for that?

We have a test case with a push parser (currently badly named "XML PARSE complex XML": I will change that).

The issue is that IBM can split the content of ATTRIBUTE-CHARACTERS and CONTENT-CHARACTERS between several events, and in that case it reports that the *-CHARACTERS event is incomplete by writing 2 in XML-INFORMATION.
In libxml2, as far as I know, we never get incomplete events and emulating those seems out of scope for now, as it requires digging into the internal structure of the parser state (and I don't think this structure is supposed to be stable across versions).
Therefore, we always send only one *-CHARACTERS event, even though IBM states it can send more.

In practice, for most COBOL code, and especially code following the IBM example I took for the unit test, this practice of combining incomplete events should not alter the behavior, since the only meaningful thing to do with partial *-CHARACTERS events is to concatenate them.
Actually, we can even argue that this behavior should be kept even if we support IBM split one day because it allows for simpler COBOL code.
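
To make the concatenation argument concrete, here is a minimal sketch in C (rather than COBOL) of what a consumer of *-CHARACTERS events has to do when fragments can arrive. It assumes, as described in this thread, that XML-INFORMATION is 1 for a complete value and 2 for an incomplete one; the `char_accum` type, its size and the function name are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* XML-INFORMATION values for *-CHARACTERS events as described in this thread */
#define INFO_COMPLETE   1
#define INFO_INCOMPLETE 2

/* Hypothetical accumulator; the size is illustrative only. */
struct char_accum {
	char   text[256];
	size_t len;
};

/* Append the text of one event. Returns 1 once the value is complete
   (anything other than INFO_INCOMPLETE), 0 if more fragments are expected.
   Text that does not fit is silently truncated in this sketch. */
static int
accum_feed (struct char_accum *a, const char *frag, size_t frag_len,
	    int xml_information)
{
	size_t room;

	assert (a != NULL && frag != NULL);
	room = sizeof (a->text) - 1 - a->len;
	if (frag_len > room) {
		frag_len = room;
	}
	memcpy (a->text + a->len, frag, frag_len);
	a->len += frag_len;
	a->text[a->len] = '\0';
	return xml_information != INFO_INCOMPLETE;
}
```

A consumer written this way behaves identically whether the parser sends one complete event or several partial ones, which is why delivering a single combined event should be harmless for such code. The caller is expected to reset `len` to 0 after processing a complete value.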

Collaborator

@GBertholon GBertholon Apr 27, 2026

It seems partially wrong: the calling function xmlParseTryOrFinish does not call xmlParseCharDataInternal at pushed chunk boundary, but at internal buffer size boundary instead.

This is an issue here though... At internal buffer boundary we should put 2 in the XML-INFORMATION register.

Collaborator

thanks for adding a test going over the boundary and checking the adjusted code ❤️

@GBertholon GBertholon Apr 27, 2026

I just thought about another potential quirk of XML-INFORMATION: for the XML file

<test>Try <![CDATA[some]]> wierd things</test>

What is the content of XML-INFORMATION for the different CONTENT-CHARACTERS events?
I don't have an IBM compiler at hand, and the documentation does not state whether CDATA text is considered a continuation of normal text or not.

Collaborator

with your test data and

           display xml-event xml-code '|' xml-text '|' xml-information
                '|' xml-namespace-prefix '|' xml-namespace '|'

the result on IBM with xmlss is

 START-OF-DOCUMENT             000000000||000000000|||
 START-OF-ELEMENT              000000000|test|000000000|||
 CONTENT-CHARACTERS            000000000||000000001|||
 EXCEPTION                     000798761|<test>Try <!|000000000|||

and with compat

 START-OF-DOCUMENT             000000000|<test>Try <![CDATA[some]]> wierd things</test>                    |000000000|||
 START-OF-ELEMENT              000000000|test|000000000|||
 CONTENT-CHARACTERS            000000000|Try |000000000|||
 EXCEPTION                     000000136|<test>Try <!|000000000|||

@GitMensch GitMensch Apr 27, 2026

Collaborator

<![CDATA[some]]><test>Try valid things</test><![CDATA[more]]> leads to

 START-OF-DOCUMENT             000000000||000000000|||
 EXCEPTION                     000798761|<!|000000000|||

compat:

 START-OF-DOCUMENT             000000000|<![CDATA[some]]><test>Try valid things</test><![CDATA[more]]>
 |000000000|||
 EXCEPTION                     000000002|<![|000000000|||
 EXCEPTION                     000000001|<![C|000000000|||
 EXCEPTION                     000000001|<![CD|000000000|||
 EXCEPTION                     000000001|<![CDA|000000000|||
 EXCEPTION                     000000001|<![CDAT|000000000|||
 EXCEPTION                     000000001|<![CDATA|000000000|||
 EXCEPTION                     000000001|<![CDATA[|000000000|||
 EXCEPTION                     000000001|<![CDATA[s|000000000|||
 EXCEPTION                     000000001|<![CDATA[so|000000000|||
 EXCEPTION                     000000001|<![CDATA[som|000000000|||
 EXCEPTION                     000000001|<![CDATA[some|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]>|000000000|||
 EXCEPTION                     000000002|<![CDATA[some]]><test>Try valid things</test><![|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![C|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CD|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CDA|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CDAT|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[m|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[mo|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[mor|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[more|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[more]|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[more]]|000000000|||
 EXCEPTION                     000000001|<![CDATA[some]]><test>Try valid things</test><![CDATA[more]]>|000000000|||

:-)

I'm just confused why parsing

         1 xml-document-data.
          2 pic x(39) value '<?xml version="1.0" encoding="US-ASCII"'.
          2 pic x(19) value ' standalone="yes"?>'.
          2 pic x(39) value '<!--This document is just an example-->'.
          2 pic x(10) value '<sandwich>'.
          2 pic x(33) value '<bread type="baker''s best"/>'.
          2 pic x(36) value '<?spread We''ll use real mayonnaise?>'.
          2 pic x(29) value '<meat>Ham + turkey</meat>'.
          2 pic x(34) value '<filling>Cheese, lettuce, tomato, '.
          2 pic x(32) value 'and that''s all, Folks!</filling>'.
          2 pic x(25) value '<![CDATA[We should add a '.
          2 pic x(20) value '<relish> element!]]>'.
          2 pic x(28) value '<listprice>$4.99</listprice>'.
          2 pic x(25) value '<discount>0.10</discount>'.
          2 pic x(31) value '</sandwich>'.

with XMLSS does not result in START-OF-CDATA and so on but also raises an exception

 START-OF-DOCUMENT             000000000||000000000|||
 VERSION-INFORMATION           000000000|1.0|000000000|||
 ENCODING-DECLARATION          000000000|US-ASCII|000000000|||
 STANDALONE-DECLARATION        000000000|yes|000000000|||
 COMMENT                       000000000|This document is just an example|000000000|||
 START-OF-ELEMENT              000000000|sandwich|000000000|||
 START-OF-ELEMENT              000000000|bread|000000000|||
 ATTRIBUTE-NAME                000000000|type|000000000|||
 ATTRIBUTE-CHARACTERS          000000000|baker's best|000000001|||
 END-OF-ELEMENT                000000000|bread|000000000|||
 CONTENT-CHARACTERS            000000000|     |000000001|||
 PROCESSING-INSTRUCTION-TARGET 000000000|spread|000000000|||
 PROCESSING-INSTRUCTION-DATA   000000000|We'll use real mayonnaise|000000000|||
 START-OF-ELEMENT              000000000|meat|000000000|||
 CONTENT-CHARACTERS            000000000|Ham + turkey|000000001|||
 END-OF-ELEMENT                000000000|meat|000000000|||
 CONTENT-CHARACTERS            000000000|    |000000001|||
 START-OF-ELEMENT              000000000|filling|000000000|||
 CONTENT-CHARACTERS            000000000|Cheese, lettuce, tomato, and that's all, Folks!|000000001|||
 END-OF-ELEMENT                000000000|filling|000000000|||
 EXCEPTION                     000798761|<?xml version="1.0" encoding="US-ASCII" standalone="yes"?><!--This document is j
 ust an example--><sandwich><bread type="baker's best"/>     <?spread We'll use real mayonnaise?><meat>Ham + turkey</meat
 >    <filling>Cheese, lettuce, tomato, and that's all, Folks!</filling><!|000000000|||

no matter if I save the file with UTF8 encoding and also mention that in the xml's encoding or not...

The error code 000798761 corresponds to XRSN_MARKUP_INVALID: An incorrect character is found within markup.
It seems that the XML parser you used for tests is unable to recognize CDATA sections (it always stops after <! as if it were expecting a comment <!-- and nothing else)...
Therefore I will not get any information on the expected behaviour from that :(
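
As a reading aid for the XML-CODE values quoted in this thread, the sketch below splits a code into its high-order and low-order half-words, following the "high order half-word" wording of the ChangeLog entry. How each half should be interpreted is left to the IBM documentation; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* High-order half-word of an XML-CODE value (bits 16..31). */
static uint16_t
xml_code_high (int32_t code)
{
	assert (code >= 0);	/* only non-negative codes appear in this thread */
	return (uint16_t)((uint32_t)code >> 16);
}

/* Low-order half-word of an XML-CODE value (bits 0..15). */
static uint16_t
xml_code_low (int32_t code)
{
	assert (code >= 0);
	return (uint16_t)((uint32_t)code & 0xFFFFu);
}
```

Applied to the values above, 798761 splits into 0x000C and 0x3029, 264193 into 0x0004 and 0x0801, and 262345 into 0x0004 and 201 (decimal).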

@GBertholon GBertholon Apr 28, 2026

By the way, I think

<![CDATA[some]]><test>Try valid things</test><![CDATA[more]]>

is supposed to be invalid XML, unlike what its text suggests (you cannot have content outside the root XML element, and CDATA is treated as content)...

On the contrary my "weird" example is unusual but supposedly valid.

@chuck-haatvedt

I am a bit confused as to the changes to the version I supplied to Simon as the code appeared to be working fine before the changes.

As for the testsuite, I have attached the sample program I used for testing. I ran it on both GnuCOBOL and MF COBOL.

xmlsmpl-3.txt is the test program; rename it to xmlsmpl-3.cbl. This is a much better test program as it exercises more of the complex XML elements.

set infile=sample_test_complex_split.xml (this is the input XML document, as a line sequential file)

xmlsmpl3-mfcobol.txt is the output from the MF COBOL test

xmlsmpl3-gnucobol.txt
xmlsmpl3-gnucobol-new.txt
xmlsmpl3-mfcobol.txt

xmlsmpl-3.txt

@GitMensch

Collaborator

As for the testsuite, I have attached the sample program I used for testing. I ran it on both GnuCOBOL and MF COBOL.

Can you change that from file-based to memory-based, please? That way I can easily run it on IBM (files would also work, but I'd need to create a dataset, add the data, handle JCL, ... - in-memory is just much easier).

Note that, compared to IBM, we may merge short contiguous CONTENT-CHARACTERS events across END-OF-INPUT boundaries.
This is due to libxml2 internal details.

Also improve some tests to check predefined entities and long content.
@GBertholon

GBertholon commented Apr 29, 2026

Without forking libxml2, it seems impossible to generate the exact same stream of events as IBM in push parser mode.
This is due to the fact that libxml2 does not cut events on chunk boundaries like IBM does.

That said, I guess I found a reasonable compromise between not depending too much on internal libxml2 details and not breaking COBOL code expecting the IBM behavior: the rule is that we allow ourselves to postpone characters delivered by IBM at chunk boundary but we try to guarantee that we do not generate more events than IBM since COBOL code might rely on the fact that some content is never split.
This new code also relies on internal libxml2 heuristics to never wait for too long before delivering an event (the internal rule seems to be "if the content already contains more than 300 characters at chunk boundary then deliver before next chunk, else wait").

Moreover, my last commit should handle XML-INFORMATION correctly, indicating whether more characters may follow or not.
Note that I adjusted the long text example to check that.
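
The boundary rule described above can be written down as a small decision function. This is only a sketch of the rule as stated in this comment: the threshold of 300 is the observed libxml2 heuristic quoted here, not a documented constant, and the function name and return convention are hypothetical rather than taken from mlio.c.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold quoted in this thread as an observed libxml2 heuristic;
   it is an assumption here, not a documented libxml2 constant. */
#define ASSUMED_FLUSH_THRESHOLD 300

/* Decide what happens to buffered character data at a chunk boundary.
   Returns 1 if the pending text is delivered now, 0 if it is held until
   the next chunk. On delivery *xml_information is set to 2 (more text may
   follow); otherwise it is set to 0 because no event is sent yet. */
static int
deliver_at_chunk_boundary (size_t pending_len, int *xml_information)
{
	assert (xml_information != NULL);
	if (pending_len > ASSUMED_FLUSH_THRESHOLD) {
		*xml_information = 2;
		return 1;
	}
	*xml_information = 0;
	return 0;
}
```

Holding short content back means fewer events than IBM would send at that boundary, never more, which is the compromise the comment describes.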

@chuck-haatvedt: Can you tell me what your test checks that is not already covered by my additions in run_ml.at?

@GitMensch: With that done, I think I have taken all your comments into account. Do you have any final remarks?

@GBertholon GBertholon requested a review from GitMensch April 29, 2026 13:13
Comment thread tests/testsuite.src/run_ml.at Outdated
@chuck-haatvedt

chuck-haatvedt commented May 4, 2026

this simple patch to mlio.c will add the xml-text line to the EXCEPTION event

*** F:/gnucobol-xml_parse/libcob/mlio.c	Wed Apr 29 07:52:52 2026
--- R:/msys64/home/spcwh2/x32/gnucobol-trunk/libcob/mlio.c	Sun May  3 19:59:43 2026
***************
*** 1641,1642 ****
--- 1641,1643 ----
  	size_t message_len;
+ 	char	buff[255];
  
***************
*** 1701,1702 ****
--- 1702,1706 ----
  		new_xml_event (state, EVENT_EXCEPTION);
+ 		snprintf(buff, 254, "%s:%s", err->str1, err->str2);
+ 		set_xml_event_text (state, buff, xmlStrlen ((xmlChar *)buff));
+ 

Note that this is a simple case and should be modified to check all 3 of the str1..3 variables in the err structure.
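
One possible shape for the NULL checks mentioned in the note above is sketched below. It uses a stand-in structure instead of the libxml2 error type so that it compiles on its own; the structure, the function name and the choice to ignore str3 are assumptions for illustration, not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the three string fields of the libxml2 error structure;
   any of them may be NULL. */
struct err_strings {
	const char *str1;
	const char *str2;
	const char *str3;	/* ignored in this sketch */
};

/* Write "str1:str2", or whichever of the two is present, into out.
   Returns the number of characters stored (output is truncated to fit). */
static size_t
format_exception_text (char *out, size_t out_size, const struct err_strings *e)
{
	int n;

	assert (out != NULL && out_size > 0 && e != NULL);
	if (e->str1 != NULL && e->str2 != NULL) {
		n = snprintf (out, out_size, "%s:%s", e->str1, e->str2);
	} else if (e->str1 != NULL) {
		n = snprintf (out, out_size, "%s", e->str1);
	} else if (e->str2 != NULL) {
		n = snprintf (out, out_size, "%s", e->str2);
	} else {
		out[0] = '\0';
		n = 0;
	}
	if (n < 0) {
		out[0] = '\0';
		return 0;
	}
	return (size_t)n < out_size ? (size_t)n : out_size - 1;
}
```

Passing `sizeof buff` as the size, instead of a hand-written 254, keeps the bound in sync with the declaration, and the returned length can be handed to the text setter directly.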

here is the output for XMLup with the above change, I can upload these in a text file tomorrow if that would be easier.

F:\AA-minGW32-static\XML>xmlup
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx0 on root is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx2 on localElName2 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx4 for localAtName4 on localElName3 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx3 on localElName3 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx6 for localAtName6 on localElName5 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx5 on localElName5 is not defined
START-OF-DOCUMENT             +000000000||||
EXCEPTION                     +000262144|pfx0:root|||
START-OF-ELEMENT              +000000000|root|pfx0||
NAMESPACE-DECLARATION         +000000000||pfx1|http://whatever|
START-OF-ELEMENT              +000000000|localElName1|pfx1|http://whatever|
EXCEPTION                     +000262144|pfx2:localElName2|||
START-OF-ELEMENT              +000000000|localElName2|pfx2||
END-OF-ELEMENT                +000000000|localElName2|pfx2||
EXCEPTION                     +000262144|pfx4:localAtName4|||
EXCEPTION                     +000262144|pfx3:localElName3|||
START-OF-ELEMENT              +000000000|localElName3|pfx3||
ATTRIBUTE-NAME                +000000000|localAtName4|pfx4||
ATTRIBUTE-CHARACTERS          +000000000||||
CONTENT-CHARACTERS            +000000000|c1|||
EXCEPTION                     +000262144|pfx6:localAtName6|||
EXCEPTION                     +000262144|pfx5:localElName5|||
START-OF-ELEMENT              +000000000|localElName5|pfx5||
ATTRIBUTE-NAME                +000000000|localAtName6|pfx6||
ATTRIBUTE-CHARACTERS          +000000000||||
END-OF-ELEMENT                +000000000|localElName5|pfx5||
CONTENT-CHARACTERS            +000000000|c2|||
END-OF-ELEMENT                +000000000|localElName3|pfx3||
CONTENT-CHARACTERS            +000000000|c3|||
END-OF-ELEMENT                +000000000|localElName1|pfx1|http://whatever|
END-OF-ELEMENT                +000000000|root|pfx0||
END-OF-INPUT                  +000000000||||
END-OF-DOCUMENT               +000000000||||

here is the same when executed on IBM Z/OS ENTERPRISE COBOL

START-OF-DOCUMENT             000000000||||                                
EXCEPTION                     000264193|pfx0:root|||                       
START-OF-ELEMENT              000000000|root|pfx0||                        
NAMESPACE-DECLARATION         000000000||pfx1|http://whatever|             
START-OF-ELEMENT              000000000|localElName1|pfx1|http://whatever| 
EXCEPTION                     000264193|pfx2:localElName2|||               
START-OF-ELEMENT              000000000|localElName2|pfx2||                
END-OF-ELEMENT                000000000|localElName2|pfx2||                
EXCEPTION                     000264193|pfx3:localElName3|||               
START-OF-ELEMENT              000000000|localElName3|pfx3||                
EXCEPTION                     000264192|pfx4:localAtName4|||               
ATTRIBUTE-NAME                000000000|localAtName4|pfx4||                
ATTRIBUTE-CHARACTERS          000000000||||                                
CONTENT-CHARACTERS            000000000|c1|||                              
EXCEPTION                     000264193|pfx5:localElName5|||               
START-OF-ELEMENT              000000000|localElName5|pfx5||                
EXCEPTION                     000264192|pfx6:localAtName6|||               
ATTRIBUTE-NAME                000000000|localAtName6|pfx6||                
ATTRIBUTE-CHARACTERS          000000000||||                                
END-OF-ELEMENT                000000000|localElName5|pfx5||                
CONTENT-CHARACTERS            000000000|c2|||                              
END-OF-ELEMENT                000000000|localElName3|pfx3||                
CONTENT-CHARACTERS            000000000|c3|||                              
END-OF-ELEMENT                000000000|localElName1|pfx1|http://whatever| 
END-OF-ELEMENT                000000000|root|pfx0||                        
END-OF-DOCUMENT               000000000||||                                

Note that a couple of the EXCEPTION events are in a different order. Also note that the XML-CODE is displayed a bit differently on IBM, perhaps they use an implied PIC +++++++++9 instead of the floating "-" character which would only print the negative sign.

Obviously the XML-CODE values are different as well. Personally I think it would be better to use the code value from the err structure, as it would give all COBOL programmers better access to the cause of the error within application code.

Chuck Haatvedt

@chuck-haatvedt

here are the files...

Also revert the undeclared namespace test to the IBM original one since this is fully implemented now.
@GBertholon

GBertholon commented May 4, 2026

Since you seem to really want the correct errors for undeclared namespace, I have spent a couple of hours to implement it correctly.
It should also give us a base to encode more IBM error codes.

I am strongly opposed to exposing libxml2 error codes in COBOL: libxml2 does not guarantee that it will always generate the same errors on the same input across versions, so we will end up having to emulate older libxml2 in order to preserve COBOL code that relies on them.
Instead, we should emulate the existing XMLSS codes that existing COBOL code may be currently using.
Most importantly, when we don't know which error it should be, we should not send any information to COBOL besides the severity (and that includes any additional XML-TEXT).

@GBertholon GBertholon requested a review from GitMensch May 4, 2026 14:16
@GitMensch

Collaborator

Just a note: tested the exception/namespace one with MF vc7 -> sigsegv; tested with vc9

COMPAT:

START-OF-DOCUMENT             +000000000|<pfx0:root xmlns:pfx1="http://whatever"><pfx1:localElName1><pfx2:localElName2/><pfx3:localElName3 pfx4:localAtName4="">c1<pfx5:localElName5 pfx6:localAtName6=""/>c2</pfx3:localElName3>c3</pfx1:localElName1></pfx0:root>|||
START-OF-ELEMENT              +000000000|pfx0:root|||
ATTRIBUTE-NAME                +000000000|xmlns:pfx1|||
ATTRIBUTE-CHARACTERS          +000000000|http://whatever|||
START-OF-ELEMENT              +000000000|pfx1:localElName1|||
START-OF-ELEMENT              +000000000|pfx2:localElName2|||
END-OF-ELEMENT                +000000000|pfx2:localElName2|||
START-OF-ELEMENT              +000000000|pfx3:localElName3|||
ATTRIBUTE-NAME                +000000000|pfx4:localAtName4|||
ATTRIBUTE-CHARACTERS          +000000000||||
CONTENT-CHARACTERS            +000000000|c1|||
START-OF-ELEMENT              +000000000|pfx5:localElName5|||
ATTRIBUTE-NAME                +000000000|pfx6:localAtName6|||
ATTRIBUTE-CHARACTERS          +000000000||||
END-OF-ELEMENT                +000000000|pfx5:localElName5|||
CONTENT-CHARACTERS            +000000000|c2|||
END-OF-ELEMENT                +000000000|pfx3:localElName3|||
CONTENT-CHARACTERS            +000000000|c3|||
END-OF-ELEMENT                +000000000|pfx1:localElName1|||
END-OF-ELEMENT                +000000000|pfx0:root|||
END-OF-DOCUMENT               +000000000||||

--> no exception, namespace as part of the element name

XMLSS:

START-OF-DOCUMENT             +000000000||||
EXCEPTION                     +000000004|<pfx0:root xmlns:pfx1="http://whatever"|||
EXCEPTION                     +000000118|<pfx0:root xmlns:pfx1="http://whatever"><pfx1:localElName1><pfx2:localElName2/><pfx3:localElName3 pfx4:localAtName4=""|||
START-OF-ELEMENT              +000000000|localElName3|pfx3||
ATTRIBUTE-NAME                +000000000|localAtName4|pfx4||
CONTENT-CHARACTERS            +000000000|c1|||
EXCEPTION                     +000000160|<pfx0:root xmlns:pfx1="http://whatever"><pfx1:localElName1><pfx2:localElName2/><pfx3:localElName3 pfx4:localAtName4="">c1<pfx5:localElName5 pfx6:localAtName6=""|||
START-OF-ELEMENT              +000000000|localElName5|pfx5||
ATTRIBUTE-NAME                +000000000|localAtName6|pfx6||
END-OF-ELEMENT                +000000000|localElName5|pfx5||
CONTENT-CHARACTERS            +000000000|c2|||
END-OF-ELEMENT                +000000000|localElName3|pfx3||
CONTENT-CHARACTERS            +000000000|c3|||
END-OF-ELEMENT                +000000000|localElName1|pfx1|http://whatever|
END-OF-ELEMENT                +000000000|root|pfx0||
END-OF-DOCUMENT               +000000000||||

--> exception-text always "a fragment of alphanumeric text" (everything parsed until the exception), and on unclear places, ... and the error code does not make sense to me (compared to their docs https://docs.rocketsoftware.com/bundle/visualcobolvs_ug_100/page/rpb1743378344996.html)

@GitMensch

Collaborator

Instead, we should emulate the existing XMLSS codes that existing COBOL code may be currently using.

I agree that this will be best.

Most importantly, when we don't know which error it should be, we should not send any information to COBOL besides the severity (and that includes any additional XML-TEXT).

I disagree - we can have an XML-TEXT which explicitly starts with "unhandled internal xml error %d: error text" -> that way we can definitely tell people that this goes away (and they ideally should provide us with a reproducer - as we can then cross-check with IBM) while the contained internal libxml error number is still helpful for developers, as they can look it up in the libxml header (or deepwiki) to check what the actual error is.

Having no detail information whatsoever (the runtime warning may be suppressed, or is not easily relatable to the current place because it is written to a different output file), for example in a nightly batch job that already took 2 hours, is definitely bad.
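
A minimal sketch of the fallback XML-TEXT proposed above. The helper name `format_unhandled_xml_error` is hypothetical and not part of libcob; `code` and `msg` stand for the libxml2 error number and message text. Only the prefix string is taken from this comment, the rest is illustrative.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of libcob): builds the fallback XML-TEXT
   for a libxml2 error that has no XMLSS equivalent yet. */
static void
format_unhandled_xml_error (char *buf, size_t size, int code, const char *msg)
{
	if (size == 0) {
		return;
	}
	/* the fixed prefix marks the text as unstable and subject to change;
	   a C99 snprintf truncates to size - 1 characters and adds the NUL */
	snprintf (buf, size, "unhandled internal xml error %d: %s",
		  code, msg != NULL ? msg : "");
}
```

With the tag mismatch reported later in this thread (libxml2 error 76), XML-TEXT would then start with "unhandled internal xml error 76: ", which COBOL code can log but should not branch on.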

Comment on lines +1286 to +1289
EXCEPTION +000264192|pfx4:localAtName4|||
EXCEPTION +000264193|pfx3:localElName3|||
START-OF-ELEMENT +000000000|localElName3|pfx3||
ATTRIBUTE-NAME +000000000|localAtName4|pfx4||

Collaborator

just an observation, not a request to change it:

the order of exceptions is different to IBM where we get the namespace error at the place that uses it, not the place where it is parsed. As noted: that's just a difference; it still would be good to have a note about that (for now possibly in the NEWS entry, later moved to gnucobol.texi).

Sure, a note like that should also state that this PR, unlike IBM's parser, does not always immediately send incomplete events before END-OF-INPUT and can buffer them.

@chuck-haatvedt

Here is a link to the IBM XML System Services User's Guide and Reference. Check Appendices A and B for the XML-CODE values. The RETURN CODE is stored in the high-order 2 bytes and the REASON CODE in the low-order 2 bytes.

https://www.ibm.com/docs/en/SSLTBW_3.2.0/pdf/gxla100_v3r2.pdf

Chuck Haatvedt
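
The XML-CODE layout described above can be sketched as follows. The function names are made up for this illustration and do not exist in libcob; the only thing taken from the comment is the split into a high-order return code and a low-order reason code.

```c
#include <stdint.h>

/* XMLSS return code: high-order 2 bytes of the 4-byte XML-CODE */
static uint16_t
xml_return_code (uint32_t xml_code)
{
	return (uint16_t)(xml_code >> 16);
}

/* XMLSS reason code: low-order 2 bytes of the XML-CODE */
static uint16_t
xml_reason_code (uint32_t xml_code)
{
	return (uint16_t)(xml_code & 0xFFFFu);
}

/* inverse operation: combine both halves into one XML-CODE */
static uint32_t
xml_code_from_parts (uint16_t return_code, uint16_t reason_code)
{
	return ((uint32_t)return_code << 16) | reason_code;
}
```

Applied to the two namespace codes from the test, 264192 is x'00040800' (return code x'0004', reason code x'0800') and 264193 is x'00040801'.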

@chuck-haatvedt

chuck-haatvedt commented May 5, 2026

Change xmlup.cbl as follows; note the change in the first tag (pfxz instead of pfx0).

        Identification division.
         Program-id. XMLup.
       Data division.
        Working-storage section.
         1 d.
          2 pic x(40) value '<pfxz:root xmlns:pfx1="http://whatever">'.
          2 pic x(19) value '<pfx1:localElName1>'.
          2 pic x(20) value '<pfx2:localElName2/>'.
          2 pic x(40) value '<pfx3:localElName3 pfx4:localAtName4="">'.
          2 pic x(02) value 'c1'.
          2 pic x(41) value '<pfx5:localElName5 pfx6:localAtName6=""/>'.
          2 pic x(24) value 'c2</pfx3:localElName3>c3'.
          2 pic x(32) value '</pfx1:localElName1></pfx0:root>'.
       Procedure division.
         main.
           xml parse d processing procedure h
           goback.
         h.
           display xml-event xml-code '|' xml-text '|'
               xml-namespace-prefix '|'
               xml-namespace '|'
      * In the original IBM example they check specifically the two exception
      * codes for undeclared namespaces: 264192 and 264193
      * We do not yet support these IBM codes
      * -> ignore all recoverable errors for now
      *    if xml-event = 'EXCEPTION' and xml-code = 264192 or 264193
             move 0 to xml-code
      *    end-if
           .
       End program XMLup.

when compiled and run with this gnucobol pr it generates the following output

F:\AA-minGW32-static\XML>xmlup
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfxz on root is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx2 on localElName2 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx4 for localAtName4 on localElName3 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx3 on localElName3 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx6 for localAtName6 on localElName5 is not defined
libcob: warning: XML PARSE recoverable error (201): Namespace prefix pfx5 on localElName5 is not defined
libcob: warning: XML PARSE non-recoverable error (76): Opening and ending tag mismatch: root line 1 and root
START-OF-DOCUMENT             +000000000||||
EXCEPTION                     +000264193|pfxz:root|||
START-OF-ELEMENT              +000000000|root|pfxz||
NAMESPACE-DECLARATION         +000000000||pfx1|http://whatever|
START-OF-ELEMENT              +000000000|localElName1|pfx1|http://whatever|
EXCEPTION                     +000264193|pfx2:localElName2|||
START-OF-ELEMENT              +000000000|localElName2|pfx2||
END-OF-ELEMENT                +000000000|localElName2|pfx2||
EXCEPTION                     +000264192|pfx4:localAtName4|||
EXCEPTION                     +000264193|pfx3:localElName3|||
START-OF-ELEMENT              +000000000|localElName3|pfx3||
ATTRIBUTE-NAME                +000000000|localAtName4|pfx4||
ATTRIBUTE-CHARACTERS          +000000000||||
CONTENT-CHARACTERS            +000000000|c1|||
EXCEPTION                     +000264192|pfx6:localAtName6|||
EXCEPTION                     +000264193|pfx5:localElName5|||
START-OF-ELEMENT              +000000000|localElName5|pfx5||
ATTRIBUTE-NAME                +000000000|localAtName6|pfx6||
ATTRIBUTE-CHARACTERS          +000000000||||
END-OF-ELEMENT                +000000000|localElName5|pfx5||
CONTENT-CHARACTERS            +000000000|c2|||
END-OF-ELEMENT                +000000000|localElName3|pfx3||
CONTENT-CHARACTERS            +000000000|c3|||
END-OF-ELEMENT                +000000000|localElName1|pfx1|http://whatever|
EXCEPTION                     +001048576||||

however that same code compiled on IBM Enterprise COBOL generates the following output

START-OF-DOCUMENT             000000000||||                                     
EXCEPTION                     000264193|pfxz:root|||                            
START-OF-ELEMENT              000000000|root|pfxz||                             
NAMESPACE-DECLARATION         000000000||pfx1|http://whatever|                  
START-OF-ELEMENT              000000000|localElName1|pfx1|http://whatever|      
EXCEPTION                     000264193|pfx2:localElName2|||                    
START-OF-ELEMENT              000000000|localElName2|pfx2||                     
END-OF-ELEMENT                000000000|localElName2|pfx2||                     
EXCEPTION                     000264193|pfx3:localElName3|||                    
START-OF-ELEMENT              000000000|localElName3|pfx3||                     
EXCEPTION                     000264192|pfx4:localAtName4|||                    
ATTRIBUTE-NAME                000000000|localAtName4|pfx4||                     
ATTRIBUTE-CHARACTERS          000000000||||                                     
CONTENT-CHARACTERS            000000000|c1|||                                   
EXCEPTION                     000264193|pfx5:localElName5|||                    
START-OF-ELEMENT              000000000|localElName5|pfx5||                     
EXCEPTION                     000264192|pfx6:localAtName6|||                    
ATTRIBUTE-NAME                000000000|localAtName6|pfx6||                     
ATTRIBUTE-CHARACTERS          000000000||||                                     
END-OF-ELEMENT                000000000|localElName5|pfx5||                     
CONTENT-CHARACTERS            000000000|c2|||                                   
END-OF-ELEMENT                000000000|localElName3|pfx3||                     
CONTENT-CHARACTERS            000000000|c3|||                                   
END-OF-ELEMENT                000000000|localElName1|pfx1|http://whatever|      
EXCEPTION                     000798773|<pfxz:root xmlns:pfx1="http://whatever">
pfx3:localElName3 pfx4:localAtName4="">c1<pfx5:localElName5 pfx6:localAtName6=""
Name1></|||                                                                 

note that the XML-CODE is different from GnuCOBOL on the last EXCEPTION event

I think that this demonstrates the difficulty of attempting to map all of the libxml2 err->code values to IBM equivalent values.

Also note that 798773 === x'000C3035' which is a value found in XML System Services User's Guide and Reference in Appendix B. So we would need to create a cross reference mapping of the libxml2 err->code values to those in Appendix B.

I think that using the err->code and err->msg would be much more useful for programmers to diagnose any xml errors.

Chuck Haatvedt

@GBertholon

As long as it is very clear that we do not guarantee any form of stability on those error messages, I agree we could indeed put the libxml2 error text inside XML-TEXT for unhandled errors.
I simply want to avoid any COBOL code that relies on this text for anything else than propagating error messages, because maintaining such stability around unstable libxml2 errors would be a long term nightmare.

I already noticed subtle changes across libxml2 versions for some of the reported errors. For instance, version 2.12.7+dfsg+really2.9.14-2.1+deb13u2, which you can find on current Debian stable, can behave very oddly for syntax errors on the root element tag.

@GBertholon

GBertholon commented May 5, 2026

I think that this demonstrates the difficulty of attempting to map all of the libxml2 err->code values to IBM equivalent values.

I don't think we will ever map all of them, just the few useful ones that COBOL programmers may rely on.

Also note that 798773 === x'000C3035' which is a value found in XML System Services User's Guide and Reference in Appendix B. So we would need to create a cross reference mapping of the libxml2 err->code values to those in Appendix B.

This test suggests that maybe we should use XRC_NOT_WELL_FORMED instead of XRC_FATAL for parsing errors. So probably x'000C0000' would be slightly better for our default unmapped unrecoverable parsing error.
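
A sparse mapping of that kind could look like the sketch below. This is an illustration under stated assumptions, not the PR's implementation: `map_libxml_error` and its parameters are hypothetical, the macro name XRC_WARNING is assumed, and returning a bare x'00040000' for unmapped recoverable errors is an assumption as well. The values 201, 264192, 264193 and x'000C0000' are the ones quoted in this thread.

```c
/* XML-CODE layout: return code in the high-order 2 bytes,
   reason code in the low-order 2 bytes */
#define XML_CODE(rc, reason)	(((long)(rc) << 16) | (long)(reason))

/* return codes implied by the XML-CODEs in this thread;
   the name XRC_WARNING is an assumption */
#define XRC_WARNING		0x0004
#define XRC_NOT_WELL_FORMED	0x000C

/* libxml2 error number printed in the runtime warnings above:
   "Namespace prefix ... is not defined" */
#define LIBXML_NS_UNDEFINED	201

/* Hypothetical mapping function: only errors that COBOL code is known
   to check get a reason code, everything else reports severity only. */
static long
map_libxml_error (int libxml_code, int on_attribute, int recoverable)
{
	switch (libxml_code) {
	case LIBXML_NS_UNDEFINED:
		/* 264192 for an attribute name, 264193 for an element name */
		return on_attribute ? XML_CODE (XRC_WARNING, 0x0800)
				    : XML_CODE (XRC_WARNING, 0x0801);
	default:
		/* unmapped: no reason code, so nothing unstable leaks out */
		return recoverable ? XML_CODE (XRC_WARNING, 0)
				   : XML_CODE (XRC_NOT_WELL_FORMED, 0);
	}
}
```

The tag mismatch (libxml2 error 76) is deliberately left unmapped here: IBM reported x'000C3035' for it in the run above, but a single observation is not enough to fix that mapping.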

@GitMensch

Collaborator

Two quick notes: the snprintf should use the _MAX define, which is one byte less than its matching _BUFF.

If you rebase then the warnings and errors in CI should be fixed.

@GBertholon

I don't understand either of the two comments:

  • snprintf takes the size of the buffer as argument not the index of the last writable slot (e.g. writes last char at sz-2 and '\0' at sz-1)
  • you seem to have repaired the CI, and it now works even without rebasing anything

@GitMensch

Collaborator

CI: correct (it works here in the PR, just not in your own branch).
snprintf - correct, there is a MSVC compat issue, but that's unrelated - and can be fixed by

snprintf(err_text_buf, COB_MINI_BUFF, ...
err_text_buf[COB_MINI_MAX] = 0;

a construct you'll see in many places where snprintf is used. So... maybe use here as well :-)
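
To make the two points concrete, here is a small self-contained sketch of that construct. DEMO_BUFF and DEMO_MAX are local stand-ins for a _BUFF/_MAX pair (sized small on purpose), and `copy_with_limit` is a hypothetical helper. The compat issue is presumably the older, non-conforming MSVC behaviour where the NUL is not written on truncation.

```c
#include <stdio.h>

/* local stand-ins for a libcob _BUFF / _MAX pair, small for the demo */
#define DEMO_BUFF	8
#define DEMO_MAX	(DEMO_BUFF - 1)

/* Copies 'text' into a buffer of DEMO_BUFF bytes.
   Returns 1 if the text had to be truncated, 0 otherwise. */
static int
copy_with_limit (char *buf, const char *text)
{
	/* the size argument is the full buffer size: at most DEMO_MAX
	   characters are stored, followed by the terminating NUL */
	const int wanted = snprintf (buf, DEMO_BUFF, "%s", text);
	/* redundant with a C99 snprintf, but protects against older
	   implementations that skip the NUL when truncating */
	buf[DEMO_MAX] = 0;
	/* snprintf returns the length the full text would have needed */
	return wanted < 0 || wanted >= DEMO_BUFF;
}
```

So both statements hold: the size passed is the _BUFF value, and the explicit store uses the _MAX index.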

Note: I want to do a final comparison vs. IBM (may need to wait until Saturday) and maybe a first performance check with some bigger XML (both "in general" and vs. MF) - then merge upstream (not later than next week, currently).

@codecov-commenter

Codecov Report

❌ Patch coverage is 72.13115% with 85 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (gitside-gnucobol-3.x@326ce55). Learn more about missing BASE report.

Files with missing lines Patch % Lines
libcob/mlio.c 72.78% 51 Missing and 29 partials ⚠️
cobc/parser.y 33.33% 4 Missing ⚠️
cobc/typeck.c 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@                   Coverage Diff                   @@
##             gitside-gnucobol-3.x     #263   +/-   ##
=======================================================
  Coverage                        ?   67.82%           
=======================================================
  Files                           ?       34           
  Lines                           ?    61565           
  Branches                        ?    16043           
=======================================================
  Hits                            ?    41756           
  Misses                          ?    13851           
  Partials                        ?     5958           


XML-INFORMATION and XML-*NAMESPACE* are not available with MF.
@GitMensch

GitMensch commented May 7, 2026

Collaborator

Just FYI: I've run some production XML (156MB) through a program just DISPLAYing the event/data.
The result is nearly identical to MF - we just generate one END-OF-INPUT event "too many" (seems like the parsers in both MF and IBM stop after END-OF-DOCUMENT).

MF is much slower (factor 5-6). More than 20% of the CPU time came from DISPLAY (putc); with that taken out of the values, 3.2% is spent in cob_xml_parse::get_xml_code (the inlined function) and 50.2% in cob_xml_parse::xml_parse.
Of the latter, xmlParseChunk takes 26.4% (including xmlParseTryOrFinish at 22.8%) and xml_process_next_event 22.6%.

Possibly more checks next week.

@chuck-haatvedt

Here is a link to the Wikipedia dumps website, where you can download some huge XML files for testing:

https://dumps.wikimedia.org/enwiki/latest/

this is what I've downloaded for performance testing.

5/07/2026 01:41 PM

.
05/07/2026 01:08 PM 114,446,966 enwiki-latest-pages-articles-multistream16.xml-p20460153p20570392
05/07/2026 01:40 PM 454,514,835 enwiki-latest-pages-articles-multistream18.xml-p26716198p27121850
05/07/2026 01:40 PM 297,012,315 enwiki-latest-pages-articles-multistream22.xml-p44496246p44788941
3 File(s) 865,974,116 bytes

There are larger files available on this site if you want to test with files of more than 1 GB.

@chuck-haatvedt

my performance test results using COB_OPEN_FILE, COB_READ_FILE, COB_CLOSE_FILE reading 128KB chunks of data. All displays of xml data / events are removed. The code is just counting bytes read, xml-events returned.

Note that I had to fix a bug in fileio.c: cob_sys_read_file does not return the number of bytes fetched into the buffer.

XML document size ====> 24 MB
elapsed time =========> 6.15 seconds

F:\AA-minGW32-static\XML>xmlfast
FILE-NAME ==> F:\XML\enwiki-latest-pages-articles-multistream16.xml-p20460153p20570392
STARTING WITH: <mediawiki xmlns="http://www.mediawiki.org/xml/exp
TIME PROCESS XML DOCUMENT ==> +0000006151 MILLISECONDS
*** CPU CLOCK CYCLES ==> 24,002,229,300
FILE SIZE IN BYTES ==> 114,446,966
NUMBER OF EVENTS ====> 4,892,274
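
For comparing runs, the figures printed above can be turned into rates. The sketch below only does that arithmetic; the macro and function names are made up for this illustration, and the inputs are the numbers from the xmlfast output.

```c
/* figures reported by the xmlfast run above */
#define RUN_BYTES	114446966.0	/* FILE SIZE IN BYTES */
#define RUN_EVENTS	4892274.0	/* NUMBER OF EVENTS */
#define RUN_MILLIS	6151.0		/* TIME PROCESS XML DOCUMENT */
#define RUN_CYCLES	24002229300.0	/* CPU CLOCK CYCLES */

/* rate per second for a count measured over 'millis' milliseconds */
static double
per_second (double count, double millis)
{
	return count / (millis / 1000.0);
}

/* average CPU cycles spent per input byte */
static double
cycles_per_byte (double cycles, double bytes)
{
	return cycles / bytes;
}
```

For this run that gives roughly 18.6 million bytes per second, roughly 795 thousand events per second, and about 210 cycles per byte.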

@GitMensch

Collaborator

@chuck-haatvedt Have you tested with a different chunk size in mlio.c (the open question was if we should make that configurable)?

@chuck-haatvedt

@chuck-haatvedt Have you tested with a different chunk size in mlio.c (the open question was if we should make that configurable)?

Simon, from my analysis it appears that this version of xml_parse just parses the chunk passed from the cobol program. Let me know if your inspection of the code is different.

I just did a build using the xml_parse code from this ddeclerck:xml_parse code base.

@chuck-haatvedt

Parsing XML documents as raw data does require the XML to be well formed: my testing of "raw" non-well-formed XML data failed with an EXCEPTION event.

So this should be mentioned in the Programmer's Guide, so that users are aware of this requirement when passing "raw" data to XML PARSE.

@GitMensch

Collaborator

Can you please write a short entry that you'd like to see about that in the documentation? I'm not really sure I understand what you are referring to.

Also: shouldn't "non valid data" always return an error (or do you only mean bad line breaks not "visible" when passing lsq chunks, but breaking the parsing when having big "non-lsq" chunks that include those)?

Labels

None yet

Projects

None yet

5 participants