
RedCloth changes encoding in 1.9.1
Reported by sds | March 24th, 2009 @ 06:18 AM | in 4.2.0
In Ruby 1.9.1, the output of to_html always seems to be in UTF-8, no matter what encoding the input string is in. Sometimes this results in an exception.
For example:
irb(main):030:0> s.bytes.each {|c| puts c}
163
=> "?"
irb(main):031:0> s.encoding
=> #<Encoding:ISO-8859-10>
irb(main):032:0> t = RedCloth.new(s).to_html
ArgumentError: invalid byte sequence in UTF-8
from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `strip'
from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to'
from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to_html'
from (irb):34
from /usr/local/bin/irb19:12:in `<main>'
Is it possible for RedCloth to return a string in the same encoding as the original, without going via UTF-8? Those conversions are not always possible.
Comments and changes to this ticket
-
Jason Garber March 24th, 2009 @ 02:06 PM
- State changed from new to open
I thought about doing this when I originally coded that part, but the shortest path was UTF-8 and I figured someone would tell me if they wanted it differently. Seems that time has come.
Can you give me some more examples of strings in other encodings? Basically, I wish I knew what your previous 30 lines in IRB (above) were that set up s. Then I can make some progress toward preserving encoding.
-
sds March 24th, 2009 @ 05:07 PM
The code was just some playing around with it, but essentially I did as follows:
$ irb19
irb(main):001:0> s = "\xa3"
=> "\xA3"
irb(main):002:0> s.encoding
=> #
irb(main):003:0> s.force_encoding 'iso-8859-1'
=> "?"
irb(main):004:0> require 'RedCloth'
=> true
irb(main):005:0> t = RedCloth.new(s).to_html
ArgumentError: invalid byte sequence in UTF-8
from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `strip'
from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to'
from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to_html'
from (irb):5
from /usr/local/bin/irb19:12:in `<main>'
irb(main):006:0>
Character 163 is a pound (sterling) sign in ISO-8859-1.
Without forcing the encoding, Ruby doesn't know what to do with that character either:
$ irb19
irb(main):001:0> s = "\xa3"
=> "\xA3"
irb(main):002:0> require 'RedCloth'
=> true
irb(main):003:0> t = RedCloth.new(s).to_html
ArgumentError: invalid byte sequence in UTF-8
from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `strip'
from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to'
from /usr/local/lib/ruby19/gems/1.9.1/gems/RedCloth-4.1.9/lib/redcloth/textile_doc.rb:81:in `to_html'
from (irb):3
from /usr/local/bin/irb19:12:in `<main>'
It is a character that frequently pops up in strings we process.
Let me know if you need any more examples.
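The underlying failure can be reproduced without RedCloth. The following is a minimal sketch, assuming only core String methods available in Ruby 1.9 and later; the variable names are illustrative. It shows that byte 0xA3 is not valid when labeled as UTF-8, is valid when labeled as ISO-8859-1, and becomes a two-byte sequence when actually transcoded to UTF-8. Note that force_encoding only relabels the bytes, whereas encode converts them.

```ruby
# Build the single byte 0xA3 as a binary string, so the result does not
# depend on the encoding of this source file.
raw = [0xA3].pack("C")

# Labeled as UTF-8, the lone byte is not a valid sequence.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
puts as_utf8.valid_encoding?   # false

# Labeled as ISO-8859-1, the same byte is a valid character (the pound sign).
latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)
puts latin1.valid_encoding?    # true

# Transcoding (rather than relabeling) produces the two-byte UTF-8 form.
utf8 = latin1.encode(Encoding::UTF_8)
puts utf8.bytes.to_a.inspect   # [194, 163]
```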
-
Jason Garber June 10th, 2009 @ 11:15 AM
- Tag set to encodings, multibyte, ruby1.9
- State changed from open to resolved
Your wish has been granted (and it was surprisingly easy).
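The ticket does not say how the fix was implemented. For callers who need similar behavior around a UTF-8-only transformation, one possible approach is sketched below. The helper preserve_encoding is hypothetical and not part of RedCloth, and the block is a stand-in for a real Textile conversion. This sketch still round-trips through UTF-8, so it raises when a character cannot be converted in either direction.

```ruby
# Hypothetical helper (not part of RedCloth): run a UTF-8-only transformation
# and return the result in the caller's original encoding.
def preserve_encoding(text)
  original = text.encoding
  utf8 = text.encode(Encoding::UTF_8)   # may raise if the input cannot be transcoded
  result = yield(utf8)
  result.encode(original)               # may raise if a character has no mapping back
end

# The pound sign as a single ISO-8859-1 byte.
latin1 = [0xA3].pack("C").force_encoding(Encoding::ISO_8859_1)

# Stand-in for a Textile-to-HTML conversion.
html = preserve_encoding(latin1) { |s| "<p>#{s}</p>" }

puts html.encoding   # ISO-8859-1
```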
RedCloth is a Ruby library for converting Textile into HTML