#149 open
iamtin (at gmail)

Redcloth 4 in JRuby doesn't support multi-bytes content

Reported by iamtin (at gmail) | May 19th, 2009 @ 05:53 AM

Given this test case, run it in a JRuby environment:

    def test_mutibytes_chars
      assert_equal "牛", RedCloth.new("  牛").to_html
    end

RedCloth 4 returns "".

This test passes in the C Ruby environment.
We found it may be caused by the "when" operator in redcloth_inline.rl. After commenting out every semantic condition, the test passes. We tried Ragel 6.3 and 6.5. The Ragel reference guide says the "when" operator may not support non-alphabet characters.

Comments and changes to this ticket

  • Jason Garber

    Jason Garber May 26th, 2009 @ 10:46 PM

    • Tag set to multibyte
    • State changed from “new” to “open”

    I'm aware of the problem from the test that includes "En français." I just don't know how to fix it. Do you have suggestions or a patch?

  • iamtin (at gmail)

    iamtin (at gmail) May 31st, 2009 @ 01:26 AM

    I'm not familiar with Ragel, but after debugging I think it's a bug in Ragel's Java code generator. However, Ragel's user guide says it doesn't support multi-byte characters: "The semantic condition feature works only with alphabet types that are smaller in width than the long type." Changing the inline scanner's declaration to avoid semantic conditions may fix it, but that's expensive for our project. If I have spare time, I will look into this and see if there is a cheaper way to fix it.

  • Jason Garber

    Jason Garber June 2nd, 2009 @ 02:47 PM

    • Title changed from “Redcloth 4 doesn't support multi-bytes content” to “Redcloth 4 in JRuby doesn't support multi-bytes content”

    I still don't have a solution. I tried making all the conditionals just return true/false, but it didn't work.

    Resources:
    A unicode script in ragel contrib

  • Jason Garber

    Jason Garber June 7th, 2009 @ 06:27 AM

    • Milestone cleared.
    • Tag changed from multibyte to difficult, multibyte

    This one's tough. Not going to happen in this release.

  • valters

    valters September 4th, 2009 @ 05:52 AM

    This is unfortunate. I need to HTML-ize some Unicode text articles (using Baltic characters), and I was looking forward to using RedCloth for this, because the pure-Ruby library I use right now is too slow.
    Unfortunately, this is a showstopper: on the first Unicode char (say, ā) RedCloth stops and returns only the part of the article up to that first Unicode char.

  • valters

    valters September 4th, 2009 @ 06:05 AM

    (Yes, I am hosting my application on JRuby, looking forward to using Google AppEngine/J, and am trying to use Java-backed text-manipulation libraries, because they're actually pretty fast.)

  • Jason Garber

    Jason Garber September 6th, 2009 @ 09:35 AM

    Look for multibyte support in the rewrite of RedCloth (Treetop or a divide-and-conquer parser). Ain't gonna happen in RedCloth w/ Ragel.

  • Benjamin Bock

    Benjamin Bock December 4th, 2009 @ 07:47 AM

    I'm using this workaround code in an initializer file for my JRuby projects:

    if RedCloth::EXTENSION_LANGUAGE == "Java"
      module RedCloth
        class TextileDoc
          def initialize( string, restrictions = [] )
            restrictions.each { |r| method("#{r}=").call( true ) }
            super( string.chars.map{|x| x.size > 1 ? "&##{x.unpack("U*")};" : x}.join )
          end
        end
      end
    end
    
  • Tommy Li

    Tommy Li February 6th, 2010 @ 03:13 AM

    Benjamin's solution is a good workaround. I'm calling this from the JVM, though, so I did the character replacement in the calling language (Scala). It may or may not be a speedup, but it's probably faster than doing it in JRuby.

    
      def textile_render(textile_input: String) : Node = {
        // RedCloth under JRuby does not support multi-byte characters, so replace ahead of time
        // http://jgarber.lighthouseapp.com/projects/13054/tickets/149-redcloth-4-doesnt-support-multi-bytes-content
        
        val amended_input = textile_input.map(c => {
          val c_code = c.toLong
          
          // return if ascii otherwise give html entity
          if(c_code < 128) c else "&#" + c_code.toString + ";"
        }).mkString
        
        Unparsed(RedClothParser.makeTextile(amended_input))
      }
    
  • Marek Kowalski

    Marek Kowalski May 26th, 2010 @ 05:44 AM

    I made a lot of progress debugging this issue, but I'm stuck and need help. My work in progress can be seen in my GitHub fork:

    http://github.com/kowalski/redcloth.

    The reason for the problem is that Ruby doesn't care about encoding. A String is just an array of bytes. If it is encoded, so be it. If not, who cares. So to make RubyString work with Java you have to make an assumption about the encoding of the input.
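
    To illustrate the bytes-versus-characters point above, here is a minimal sketch in plain Ruby (1.9 or later, no RedCloth needed). It is only an illustration of the idea, not code from the fork; the choice of UTF-8 is exactly the kind of assumption being described.

```ruby
# The character 牛 (U+725B) as three raw UTF-8 bytes, with no encoding attached.
raw = [0xE7, 0x89, 0x9B].pack("C*")      # binary (ASCII-8BIT) string

# Seen as bytes, this is three separate units.
byte_count = raw.length

# Assumption: the input is UTF-8. Only after stating that do the
# three bytes become a single character with a single code point.
decoded     = raw.dup.force_encoding("UTF-8")
char_count  = decoded.length
code_points = decoded.unpack("U*")

puts "bytes: #{byte_count}, chars: #{char_count}, code points: #{code_points.inspect}"
```

    A scanner that walks the input byte by byte sees three units here, while one that walks it character by character sees one.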

    Second step for fixing this is to switch Ragel into char mode with:

    alphtype char;

    and to store input data in char[] array instead of byte[].

    When I did all that I managed to run a simple test:
    puts RedCloth.new("Zażółć gęślą jaźń").to_html
    "

    Zażółć gęślą jaźńZażółć gęślą jaźńZażółć gęślą jaźńZażółć gęślą jaźńZażółć gęślą jaźń

    "

    I can see the UTF characters, but wtf!? Every input line is repeated 4 times. This is where I'm stuck.

    I did a lot of debugging and learned that the problems begin in the RedclothInline.inline method, which is generated by Ragel. Unfortunately I don't know Ragel well enough to deal with this problem. Help would be very much appreciated, let's solve this together!

  • Marek Kowalski

    Marek Kowalski May 26th, 2010 @ 06:39 AM

    Update:
    The problem is solved. It was an obvious bug in my code. However, after running rake spec I have 37 failures. I still need to track them down. Help would still be appreciated.

  • Marek Kowalski

    Marek Kowalski May 27th, 2010 @ 05:47 AM

    Update:
    32 failures to go, but they are all connected with the html_esc method.
    I'm very close :)

  • glebm

    glebm December 10th, 2012 @ 04:20 PM

    Any update on this? Benjamin's solution did not work for me on jruby 1.7.1

  • glebm

    glebm December 10th, 2012 @ 04:56 PM

    My workaround for jruby >= 1.7.1 or jruby 1.6 in 1.9 compat mode

    require 'redcloth/textile_doc'
    module RedCloth
      class TextileDoc
        def initialize(string, restrictions = [])
          restrictions.each { |r| method("#{r}=").call(true) }
          super(string.chars.map { |x| x.bytesize > 1 ? "&##{x.unpack("U*").first};" : x }.join)
        end
      end
    end
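
    The escaping step in the workarounds above can also be tried on its own, without RedCloth loaded. This is a minimal sketch for Ruby 1.9 or later; escape_non_ascii is a hypothetical helper name, not part of RedCloth, and it assumes the input string is valid UTF-8.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of RedCloth): replace every non-ASCII
# character with a decimal HTML numeric entity, leaving ASCII untouched.
# Assumes the input string is valid UTF-8.
def escape_non_ascii(string)
  string.each_char.map { |c| c.ascii_only? ? c : "&##{c.ord};" }.join
end

puts escape_non_ascii("  牛")   # the string from the original test case
puts escape_non_ascii("ā")
```

    The result contains only ASCII, so it can be handed to RedCloth.new in place of the original string, at the cost of entities appearing in the generated HTML.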
    


RedCloth is a Ruby library for converting Textile into HTML
