#136 ✓resolved
sds

RedCloth changes encoding in 1.9.1

Reported by sds | March 24th, 2009 @ 06:18 AM | in 4.2.0

In Ruby 1.9.1, the output of to_html always seems to be in UTF-8, regardless of the encoding of the input string. Sometimes this results in an exception.

For example:

irb(main):030:0> s.bytes.each {|c| puts c}
163
=> "?"
irb(main):031:0> s.encoding
=> #<Encoding:ISO-8859-10>
irb(main):032:0> t = RedCloth.new(s).to_html
ArgumentError: invalid byte sequence in UTF-8

from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `strip'
from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to'
from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to_html'
from (irb):34
from /usr/local/bin/irb19:12:in `<main>'

Is it possible for RedCloth to return output in the same encoding as the original string, without going via UTF-8? Those conversions are not always possible.
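
One way to picture the request is a wrapper that treats the input as raw bytes and re-tags the result with the input's encoding. This is only a rough sketch under stated assumptions: `convert_preserving_encoding` is a hypothetical helper, not part of RedCloth, the block stands in for the real Textile-to-HTML step, and the approach only makes sense for ASCII-compatible encodings such as ISO-8859-1.

```ruby
# Hypothetical helper (NOT RedCloth API): run a byte-oriented converter and
# re-tag its output with the encoding the input string carried.
# Assumption: the encoding is ASCII-compatible, so ASCII markup can be mixed in.
def convert_preserving_encoding(text)
  original = text.encoding
  bytes = text.dup.force_encoding(Encoding::BINARY)  # same bytes, binary label
  result = yield(bytes)                              # converter works on raw bytes
  result.dup.force_encoding(original)                # relabel only, no transcoding
end

# "£ 5" as ISO-8859-1: the pound sign is the single byte 0xA3.
input = "\xA3 5".b.force_encoding(Encoding::ISO_8859_1)

output = convert_preserving_encoding(input) do |bytes|
  "<p>".b + bytes + "</p>".b   # stand-in for the real Textile-to-HTML step
end

puts output.encoding           # ISO-8859-1
puts output.valid_encoding?    # true
```

Nothing is transcoded here, so no conversion can fail; the trade-off is that the converter must never split or inspect multibyte sequences.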

Comments and changes to this ticket

  • Jason Garber

    Jason Garber March 24th, 2009 @ 02:06 PM

    • State changed from “new” to “open”

    I thought about doing this when I originally coded that part, but the shortest path was UTF-8 and I figured someone would tell me if they wanted it differently. Seems that time has come.

    Can you give me some more examples of strings in other encodings? Basically, I wish I knew what your previous 30 lines in IRB (above) were that set up s.

    Then I can make some progress toward preserving encoding.

  • sds

    sds March 24th, 2009 @ 05:07 PM

    The code was just some playing around with it, but essentially I did as follows:

    $ irb19
    irb(main):001:0> s = "\xa3"
    => "\xA3"
    irb(main):002:0> s.encoding
    => #
    irb(main):003:0> s.force_encoding 'iso-8859-1'
    => "?"
    irb(main):004:0> require 'RedCloth'
    => true
    irb(main):005:0> t = RedCloth.new(s).to_html
    ArgumentError: invalid byte sequence in UTF-8
        from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `strip'
        from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to'
        from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to_html'
        from (irb):5
        from /usr/local/bin/irb19:12:in `<main>'
    irb(main):006:0> 
    

    Character 163 is a (sterling) pound sign in iso-8859-1.

    Without forcing the encoding, Ruby doesn't know what to do with that character either:

    $ irb19
    irb(main):001:0> s = "\xa3"
    => "\xA3"
    irb(main):002:0> require 'RedCloth'
    => true
    irb(main):003:0> t = RedCloth.new(s).to_html
    ArgumentError: invalid byte sequence in UTF-8
        from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `strip'
        from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to'
        from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to_html'
        from (irb):3
        from /usr/local/bin/irb19:12:in `<main>'
    

    It is a character that frequently pops up in strings we process.

    Let me know if you need any more examples.
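
For readers following along, the difference between relabelling and transcoding that byte can be sketched with plain Ruby, without RedCloth. This is an illustration only; the variable names are made up for the example, and `String#b` assumes Ruby 2.0 or later rather than the 1.9.1 used in the sessions above.

```ruby
# Sketch only: the same byte 0xA3 under three different encoding labels.
raw = "\xA3".b   # one byte, tagged ASCII-8BIT (binary)

# Relabel as ISO-8859-1: the bytes are unchanged, and 0xA3 is the pound sign there.
latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)

# Transcode to UTF-8: same character, different bytes.
as_utf8 = latin1.encode(Encoding::UTF_8)

# Relabel the raw byte as UTF-8: bytes unchanged, but 0xA3 alone is not valid UTF-8.
mislabeled = raw.dup.force_encoding(Encoding::UTF_8)

puts latin1.valid_encoding?      # true
puts as_utf8.bytes.inspect       # [194, 163]
puts mislabeled.valid_encoding?  # false
```

The last case is the kind of string that produces "invalid byte sequence in UTF-8" once a method has to interpret its characters.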

  • Jason Garber

    Jason Garber June 10th, 2009 @ 11:15 AM

    • Tag set to encodings, multibyte, ruby1.9
    • State changed from “open” to “resolved”

    Your wish has been granted (and it was surprisingly easy).

RedCloth is a Ruby library for converting Textile into HTML