Bug 23771

Summary: UTF8 Decoder's Convert does not keep internal state between calls when 'flush' parameter is false
Product: [Mono] Class Libraries
Reporter: Yusuke Fujiwara <yf.akari.ya>
Component: mscorlib
Assignee: Bugzilla <bugzilla>
Status: VERIFIED FIXED
Severity: normal
CC: atsushi, kyle.white, mono-bugs+mono, peter.collins, shrutis
Priority: ---
Version: unspecified
Target Milestone: 4.2.0 (C6)
Hardware: Other
OS: Other
Tags:
Is this bug a regression?: ---
Last known good build:

Description Yusuke Fujiwara 2014-10-13 01:10:34 UTC
Version: 3.10.0 (not listed in the dropdown)
Hardware: iPad, Genymotion
OS: Android, iOS

There are 3 problems in the UTF-8 Decoder.Convert in Mono 3.10.0 (as shipped with the latest Xamarin.Android/Xamarin.iOS) that prevent streaming decoding and break non-ASCII character support.

1) The bytesUsed and charsUsed out parameters should return the counts actually used, not the counts that could possibly be used. Callers use them to advance the byte/char offsets for the next call.
2) The completed out parameter should report whether the contents of the bytes parameter were fully consumed.
3) Decoding state must be preserved in the decoder between calls to enable streaming decoding. This is essential for decoding multi-byte characters, i.e. anything outside the ASCII range in UTF-8.

The following test code reproduces the issue (note that it passes on the desktop CLR and on previous Mono versions):

[Test]
public void ReproDecoderIssue()
{
    var input = "\u733F"; // 'mono' (monkey) in Japanese; 3 bytes in UTF-8.
    var encoded = Encoding.UTF8.GetBytes(input);
    var decoder = Encoding.UTF8.GetDecoder();
    var chars = new char[ 10 ]; // Just enough space to decode.
    var result = new StringBuilder();
    var bytes = new byte[ 1 ]; // Simulates chunked input bytes.
    // Feed the encoded bytes to the decoder one at a time.
    foreach ( var b in encoded )
    {
        bytes[ 0 ] = b;
        int bytesUsed, charsUsed;
        bool completed;
        decoder.Convert( bytes, 0, bytes.Length, chars, 0, chars.Length, false, out bytesUsed, out charsUsed, out completed );
        result.Append( chars, 0, charsUsed );
        // The expected output is listed at the bottom.
        Debug.Print( "bytesUsed:{0}, charsUsed:{1}, completed:{2}, result:'{3}'", bytesUsed, charsUsed, completed, result );
    }

    // Expected: NO assertion error.
    Assert.That( result.ToString(), Is.EqualTo( input ) );

    /*
     * Expected Debug output:
     * bytesUsed:1, charsUsed:0, completed:True, result:''
     * bytesUsed:1, charsUsed:0, completed:True, result:''
     * bytesUsed:1, charsUsed:1, completed:True, result:'猿'
     * 
     * -- Note: '猿' is U+733F (1 char in UTF-16)
     * 
     * Actual Debug output:
     * bytesUsed:3, charsUsed:1, completed:False, result:'�'
     * bytesUsed:3, charsUsed:1, completed:False, result:'��'
     * bytesUsed:3, charsUsed:1, completed:False, result:'���'
     * 
     * None of the output parameters match.
     * -- Note: '�' is the decoder fallback char (U+FFFD)
     */
}
// end of test code
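
As a complement to the repro, here is a minimal sketch of the caller-side loop that points 1) and 2) are about: bytesUsed advances the input offset, and completed decides whether the same chunk needs another Convert call. The ChunkedUtf8 class, its Decode method, and the chunkSize and charBufferSize parameters are hypothetical names made up for this illustration; only Decoder.Convert and the other System.Text calls are real API. It assumes a decoder that behaves as the desktop CLR does.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
static class ChunkedUtf8
{
    // Decodes 'data' in chunks of 'chunkSize' bytes through one shared Decoder,
    // using a deliberately small char buffer so that a single chunk may need
    // more than one Convert call.
    public static string Decode(byte[] data, int chunkSize, int charBufferSize = 2)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException("chunkSize");
        // Two chars are needed to hold a surrogate pair.
        if (charBufferSize < 2) throw new ArgumentOutOfRangeException("charBufferSize");

        var decoder = Encoding.UTF8.GetDecoder();
        var chars = new char[charBufferSize];
        var result = new StringBuilder();

        for (int offset = 0; offset < data.Length; offset += chunkSize)
        {
            int index = offset;
            int remaining = Math.Min(chunkSize, data.Length - offset);
            // Flush only on the last chunk; earlier chunks may end mid-character.
            bool flush = offset + remaining == data.Length;
            bool completed;
            do
            {
                int bytesUsed, charsUsed;
                decoder.Convert(data, index, remaining, chars, 0, chars.Length, flush,
                                out bytesUsed, out charsUsed, out completed);
                result.Append(chars, 0, charsUsed);
                // Point 1: advance by the bytes actually consumed.
                index += bytesUsed;
                remaining -= bytesUsed;
            } while (!completed); // Point 2: repeat until the chunk is fully consumed.
        }
        return result.ToString();
    }
}
```

With a decoder that keeps its state between calls (point 3), ChunkedUtf8.Decode(Encoding.UTF8.GetBytes("\u733F"), 1) should return the original string. With the behaviour reported above, where bytesUsed exceeds the number of bytes passed in, the offset arithmetic in this loop breaks.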

This issue might be related to bug #10692, but I'm not sure.
Comment 1 Atsushi Eno 2015-02-10 14:57:14 UTC
Once we bring in the referencesource UTF8Encoding, this will be fixed. I just verified that with my ongoing attempt to do so:
https://github.com/atsushieno/mono/tree/import-text-encoding

(Still several fixes are needed to get it working.)
Comment 2 Atsushi Eno 2015-02-16 02:45:32 UTC
UTF8Encoding and co. are now based on referencesource, and this issue is fixed. Thanks for the report.

[master 90b11244]
Comment 3 Shruti 2016-04-13 08:16:01 UTC
I have checked this issue with the C7 SKU alignment builds and I am getting the expected behaviour given in comment 0:

bytesUsed:1, charsUsed:0, completed:True, result:''
bytesUsed:1, charsUsed:0, completed:True, result:''
bytesUsed:1, charsUsed:1, completed:True, result:'猿'

Screencast: http://www.screencast.com/t/jOx6BBfxGZ
Environment Info: https://gist.github.com/Shruti360/950884068499e51af12288616ad6c43c

Hence, closing this issue.