[OF-92] UTF-8 message decoding within XMLLightweightParser Created: 22/May/08  Updated: 12/Feb/12  Resolved: 12/Feb/12

Status: Closed
Project: Openfire
Components: Core
Affects versions: 3.6.4
Fix versions: 3.7.1

Type: Bug Priority: Critical
Reporter: LG Assignee: Gaston Dombiak
Resolution: Duplicate Votes: 6
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified
Environment: any

Attachments: Text File XMLLightweightParser.java     Java Source File XMLLightweightParser.java     File XMLLightweightParser.java     Text File XMLLightweightParserTest.java    

 Description   

See http://www.igniterealtime.org/community/message/171554 for a discussion.
"... After checking XMLLightweightParser, it seems to me that the following code: if (lastChar >= 0xfff0) {..."



 Comments   
Comment by LG [ 15/Jan/09 ]

store XMLLightweightParserTest.java in \openfire_src\src\test\java\org\jivesoftware\openfire\nio\
store XMLLightweightParser.java in \openfire_src\src\java\org\jivesoftware\openfire\nio\
For a "diff" you may need to re-format the 2nd file.

The test case shows that the current implementation has a bug and that the attached implementation fixes it.
It no longer needs a CharBuffer; it uses a "decoder" instead of an "encoder". The ByteBuffer byteBuffer is decoded directly into the StringBuilder buffer, so it could be much faster, as it allocates less memory.

Comment by Daniel Haigh [ 15/Feb/10 ]

I am trying out this patch to possibly fix an issue I am having:
http://www.igniterealtime.org/community/message/200831

I updated XMLLightweightParser.java to the one above and it compiled ok.

Unfortunately when I send through a sentence of Chinese characters I get the following error and disconnected:

2010.02.15 13:17:38 Closing session due to exception: (SOCKET, R: /192.168.0.1:61535, L: /192.168.0.220:5222, S: 0.0.0.0/0.0.0.0:5222)
org.apache.mina.filter.codec.ProtocolDecoderException: java.nio.charset.MalformedInputException: Input length = 1 (Hexdump: 88 B6 E7 9A 84 E5 85 8D E8 B4 B9 E4 B8 8E E5 85 B6 E4 BB 96 E4 BC 9A E5 91 98 E8 81 94 E7 B3 BB EF BC 8C E7 AE 80 E5 8D 95 E5 BF AB E6 8D B7 E7 9A 84 E5 8A A0 E5 85 A5 E8 BF 87 E7 A8 8B EF BC 81 E6 97 A0 E9 99 90 E5 88 B6 E7 9A 84 E5 85 8D E8 B4 B9 E4 B8 8E E5 85 B6 E4 BB 96 E4 BC 9A E5 91 98 E8 81 94 E7 B3 BB EF BC 8C E7 AE 80 E5 8D 95 E5 BF AB E6 8D B7 E7 9A 84 E5 8A A0 E5 85 A5 E8 BF 87 E7 A8 8B EF BC 81 E6 97 A0 E9 99 90 E5 88 B6 E7 9A 84 E5 85 8D E8 B4 B9 E4 B8 8E E5 85 B6 E4 BB 96 E4 BC 9A E5 91 98 E8 81 94 E7 B3 BB EF BC 8C E7 AE 80 E5 8D 95 E5 BF AB E6 8D B7 E7 9A 84 E5 8A A0 E5 85 A5 E8 BF 87 E7 A8 8B EF BC 81 3C 2F 62 6F 64 79 3E 3C 2F 6D 65 73 73 61 67 65 3E)
at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:170)
at org.apache.mina.common.support.AbstractIoFilterChain.callNextMessageReceived(AbstractIoFilterChain.java:299)
at org.apache.mina.common.support.AbstractIoFilterChain.access$1100(AbstractIoFilterChain.java:53)
at org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.messageReceived(AbstractIoFilterChain.java:648)
at org.apache.mina.filter.executor.ExecutorFilter.processEvent(ExecutorFilter.java:239)
at org.apache.mina.filter.executor.ExecutorFilter$ProcessEventsRunnable.run(ExecutorFilter.java:283)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:260)
at org.apache.mina.common.ByteBuffer.getString(ByteBuffer.java:1098)
at org.jivesoftware.openfire.nio.XMLLightweightParser.read(XMLLightweightParser.java:206)
at org.jivesoftware.openfire.nio.XMPPDecoder.doDecode(XMPPDecoder.java:41)
at org.apache.mina.filter.codec.CumulativeProtocolDecoder.decode(CumulativeProtocolDecoder.java:133)
at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:163)

Let me know if I can provide more information.
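
A note on reading the trace above: the hexdump starts with 88 B6, and 0x88 has the bit pattern 10xxxxxx, which marks a UTF-8 continuation byte. A strict decoder that is handed a buffer starting in the middle of a character reports exactly "Input length = 1". The following is a minimal sketch using only the standard java.nio classes; the class and method names are made up for illustration and are not part of Openfire or MINA.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class StrictDecodeSketch {
    /** True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * Strictly decodes the bytes as UTF-8. Returns the malformed-input length
     * reported by the decoder, 0 if the input decodes cleanly, or -1 for any
     * other coding problem.
     */
    static int malformedLength(byte[] bytes) {
        try {
            // A new decoder reports malformed input by default.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return 0;
        } catch (MalformedInputException e) {
            return e.getInputLength();
        } catch (CharacterCodingException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // First five bytes of the hexdump above: 88 B6 E7 9A 84
        byte[] head = {(byte) 0x88, (byte) 0xB6, (byte) 0xE7, (byte) 0x9A, (byte) 0x84};
        System.out.println(isContinuation(head[0]));
        System.out.println(malformedLength(head));
    }
}
```

If the same check is run on E7 9A 84 alone (a complete 3-byte character), the decoder reports no error, which suggests the lead byte of the first character was consumed by an earlier read.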

Comment by LG [ 02/Apr/10 ]

Can you reproduce this and modify XMLLightweightParserTest.java so we have a JUnit test?
I wonder whether it helps to insert the following before line 206:
if ((missingUTF8bytes > 0) || (UTF8Buffer.remaining() < incompleteUTF8bytes)) { return; }
Anyhow, this should never be called, and inserting it could cause a loop (100% CPU usage).

Comment by LG [ 03/Apr/10 ]

Does it work better when you add "-Dfile.encoding=UTF-8" as a JVM start parameter? Actually it shouldn't change a thing but it's still worth a try.
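
As background for why the flag should not matter: when code names its charset explicitly, the platform default (which is what file.encoding influences) is never consulted. A minimal sketch, assuming the decoding path passes UTF-8 explicitly; the class name is hypothetical and this is not Openfire code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ExplicitCharsetSketch {
    /** Decodes with an explicitly named charset; the platform default is not used. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] ren = {(byte) 0xE4, (byte) 0xBA, (byte) 0xBA}; // UTF-8 bytes of U+4EBA
        System.out.println(decodeUtf8(ren).equals("\u4EBA"));
    }
}
```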

Comment by LG [ 04/Apr/10 ]

Some fixes for XMLLightweightParser.java to make sure that
*) UTF8Buffer is filled correctly
*) the loop exits when UTF8Buffer is full
*) UTF8Buffer is never longer than 4 bytes

Comment by LG [ 06/Apr/10 ]

Zero fixes, two improvements:
*) optimized loop
*) return if byteBuffer is empty after completing UTF-8 char

Comment by Daniel Haigh [ 15/Apr/10 ]

Hi LG, I just tried the latest code and am still getting exactly the same error I mentioned above.

It happens when I try to send any string of more than about 10 Chinese characters. Not every time though.

To test it I am simply sending 人人人人人人人人人人人人人.

Before I applied the patch I got the debug message "Waiting to get complete char:" and it inserted blank characters.

After applying the patch it just crashes with the error message above.

Thanks

Comment by Daniel Haigh [ 15/Apr/10 ]

Also - I seem to be getting this issue when sending the characters through XIFF. Is it possible that the Flash socket connections could be causing issues? Thanks.

Comment by LG [ 23/May/10 ]

Daniel, do you get these errors when using the Spark client? I just want to make sure that this is not a problem with your client.

Comment by Daniel Haigh [ 27/May/10 ]

I have just tried to replicate it and it seems the issue only occurs when sending messages using the XIFF Flash client. Could it be related to the way that Flash handles sockets? I might do some packet analysis. I thought I had seen it with other clients previously but can't seem to replicate that, so it could be very rare. The issue is very obvious when using XIFF.

Comment by LG [ 27/May/10 ]

I tried xiffian and can reproduce the problem. It's not a xiffian problem; my tcpdump looks fine.
I could fix this problem locally by simply discarding any incomplete UTF-8 character at the end. Anyhow, this is quite scary: if I send 人人, aka "E4 BA BA e4 ba ba", Openfire may call the decoder with "E4 BA BA e4 ba" the first time and with "e4 ba ba" the second time. So skipping "e4 ba" does solve this, but I really wonder what's going on in MINA. Maybe the MINA developers did think about the handling of incomplete UTF-8 chars at the end but did not document it ...
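
For comparison, here is a minimal sketch of how a split such as "E4 BA BA E4 BA" followed by "BA" can be handled with the standard java.nio CharsetDecoder, which leaves an incomplete trailing sequence unconsumed when told that more input may follow. It assumes every byte is handed over exactly once (not re-delivered as in the second call described above). The class ChunkedUtf8Decoder and its feed method are hypothetical names; this is neither the attached patch nor MINA's API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ChunkedUtf8Decoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    // Bytes of a character whose remaining bytes have not arrived yet.
    private ByteBuffer pending = ByteBuffer.allocate(0);

    /** Appends every complete character to out and keeps any incomplete tail. */
    void feed(byte[] chunk, StringBuilder out) {
        // Prepend the bytes held back from the previous call.
        ByteBuffer in = ByteBuffer.allocate(pending.remaining() + chunk.length);
        in.put(pending).put(chunk).flip();

        // UTF-8 never produces more chars than input bytes.
        CharBuffer chars = CharBuffer.allocate(in.remaining());

        // endOfInput = false: an incomplete trailing sequence is not an error,
        // it simply stays unconsumed in the input buffer.
        CoderResult result = decoder.decode(in, chars, false);
        if (result.isError()) {
            throw new IllegalArgumentException("invalid UTF-8: " + result);
        }
        chars.flip();
        out.append(chars);

        // Keep the undecoded tail (0 to 3 bytes) for the next call.
        pending = in.slice();
    }
}
```

Feeding E4 BA BA E4 BA appends one 人 and holds back two bytes; feeding BA afterwards completes the second 人. Nothing is discarded, so a stray continuation byte at the start of a read still surfaces as an error instead of being hidden.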

Comment by Daniel Haigh [ 02/Oct/10 ]

I have some time again to work on this project as it is still an ongoing issue for us. Can you let me know how to discard the incomplete characters? I assume the change is in XMLLightweightParser. I haven't looked at the code in detail yet. Because this issue is also randomly corrupting our Unicode JID nodes, it is causing a lot of random errors on our system.

Comment by Daniel Haigh [ 02/Oct/10 ]

I have been playing around with this a lot and unfortunately it seems that the second half of the character never comes through. Just the first half twice.

READ: <message type="chat" id="m_9" to="test1@vtx-xps" from="test1@vtx-xps/OASIS_WEB"><body>人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日�

Waiting to get complete char... leaves the remaining incomplete � character in the buffer...

READ: ��日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日人日</body></message>

So then the next read has another incomplete character at the beginning. This is the issue and it just happens sometimes - most of the time the existing code works as the character that comes through in the next read is the missing character but not in this case.

So all I get is two incomplete characters but not the second half of the character so it seems like it can't be easily corrected? The issue must be in MINA.

Comment by Daniel Haigh [ 03/Oct/10 ]

I just spent a day updating all the NIO classes to support Apache Mina 2.0 just on the off chance it would fix this issue. Unfortunately it didn't help at all. It was a good exercise though and Openfire seems to work fine with that version of Mina but I can't think of any compelling reason to update it in the trunk.

Comment by Daniel Haigh [ 03/Oct/10 ]

Good news! I think I have fixed this issue and it wasn't related to Mina.

The existing code would go back one byte in the buffer if the last character wasn't a complete Unicode character.

The problem is that these Unicode characters are encoded as 3 bytes in UTF-8, so 50% of the time it actually has to go back two bytes in the buffer and keep them for the next read. The trickiest problem I had was trying to work out when it has to go back two bytes instead of one, and my code is rather messy but works! I hope you guys can improve on how I have done it.

I will post the XMLLightweightParser.java file with my changes.

Comment by Daniel Haigh [ 03/Oct/10 ]

I couldn't submit the file here, so I posted it to http://community.igniterealtime.org/message/206738

I know UTF-8 characters can be a variable number of bytes, so if the character was made up of 4 bytes then this solution still won't work, but it should be easy to modify so that it does.
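
The "go back one byte or two" decision can be generalised to 2-, 3- and 4-byte sequences by looking at the lead byte, whose high bits state how long the sequence is. The sketch below is one way to do that, assuming otherwise well-formed input; Utf8Tail, sequenceLength and incompleteTailLength are hypothetical names and this is not the code posted in the forum thread.

```java
// Hypothetical helper, for illustration only.
final class Utf8Tail {
    /** Sequence length announced by a lead byte, or 0 for ASCII and continuation bytes. */
    static int sequenceLength(byte b) {
        int v = b & 0xFF;
        if ((v & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((v & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0;
    }

    /** Number of bytes at the end of buf[0..len) that belong to an incomplete character. */
    static int incompleteTailLength(byte[] buf, int len) {
        // An incomplete sequence has at most 3 bytes, so look back no further.
        for (int back = 1; back <= 3 && back <= len; back++) {
            byte b = buf[len - back];
            if ((b & 0xC0) == 0x80) {
                continue; // continuation byte, keep looking for the lead byte
            }
            int need = sequenceLength(b);
            // Lead byte (or ASCII) found: incomplete only if bytes are still missing.
            return need > back ? back : 0;
        }
        // Only continuation bytes within reach: either a complete 4-byte
        // character or malformed input; nothing to hold back.
        return 0;
    }
}
```

For E4 BA BA E4 BA this returns 2, for E4 BA BA E4 it returns 1, and for E4 BA BA it returns 0, which matches the one-byte and two-byte cases described above.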

Comment by Konstantin Satunin [ 13/Feb/11 ]

Tried the solution from Daniel; it didn't help, even modified for 3 bytes.
Also tried the attached fix by LG (10935), with the same result: the client gets disconnected when it sends a long UTF-8 sequence with variable-size chars.

Comment by Daniel Haigh [ 13/Feb/11 ]

Yes, as you can see in my last comment, some languages use 4-byte characters and the solution I provided doesn't correct that issue (only 3 bytes), but the same theory should apply. I am not the most experienced Java programmer so was hoping someone else could check my code and get it working for 4-byte characters. We have been using my 3-byte fix in a very busy environment for some time without an issue.

Just to confirm - what language is causing the issue?

Comment by Konstantin Satunin [ 15/Feb/11 ]

Russian. It only happens when a person tries to send a LOOOOOONG message. On short messages below 100 chars, the fix for 3 bytes works fine.

Comment by Daniel Haigh [ 16/Feb/11 ]

Yes I am pretty certain Russian is 4 byte UTF-8. The fix I put in only addresses 3 byte but the theory is simple to get this working for 4 byte as well just by following the same logic in my code. Only problem is I am not the best Java programmer so it would be great if someone else could pick this up, fix this issue and close this one off. I expect it wouldn't be difficult - my fix was only a few lines of extra code.
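
The encoded length per character can be checked directly. In standard UTF-8, Cyrillic letters (U+0400 to U+04FF) take 2 bytes each, the CJK character 人 takes 3, and only code points above U+FFFF take 4. A Russian message therefore mixes 1-byte ASCII with 2-byte letters, which fits the "variable size of chars" description above. A minimal sketch; the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8Lengths {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("\u4EBA"));       // CJK character, prints 3
        System.out.println(utf8Length("\u044F"));       // Cyrillic small letter ya, prints 2
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, outside the BMP, prints 4
    }
}
```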

Comment by Daniel Haigh [ 09/Aug/11 ]

One of our staff has developed a complete fix for this issue which looks very solid.
We are doing some final testing on the latest release then will provide the update.

Comment by Daniel Haigh [ 09/Aug/11 ]

I just noticed OF-458 (resolved), which looks like the same issue as the one being discussed here, so our fix for this may be redundant.

However - is the initial issue reported by LG still relevant?

Comment by LG [ 14/Jan/12 ]

Should be closed as a duplicate of OF-458 (resolved) and is thus also fixed with 3.7.1.

Comment by Daryl Herzmann [ 12/Feb/12 ]

LG notes this should be closed as a dup of OF-458 (resolved).

Generated at Fri Mar 29 08:40:33 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100248-rev:6a03a54452e975225e04dfda06fdac6fd9e95b00.