Dealing with a non-ASCII character in RSpec testing

11 years ago

admin

I’m using the Docsplit gem with Ruby 1.9.3 to create Unicode UTF-8 versions of Word documents. To my surprise, while running a test today on a particular piece of one of these documents, I started running into character encoding inconsistencies.

I have tried a number of different methods to resolve the issue, which I will list below, but the best success I’ve had so far is to remove all non-ASCII characters. This is far from ideal, as I don’t think the characters are really going to be all that problematic in the DB.

gsub(/[^[:ascii:]]/, "")
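To make the cost of this approach concrete, here is a minimal sketch of what the substitution does to the sample title. The helper name `strip_non_ascii` is made up for illustration, and the apostrophe is written as the escape `\u2019` so the snippet itself stays pure ASCII.

```ruby
# Hypothetical helper wrapping the gsub call shown above.
def strip_non_ascii(text)
  # [[:ascii:]] is a POSIX bracket class; negating it matches every
  # character outside the ASCII range.
  text.gsub(/[^[:ascii:]]/, "")
end

title = "My CODES\u2019S APOSTROPHE"
stripped = strip_non_ascii(title)

puts stripped              # the apostrophe is removed, not replaced
puts stripped.ascii_only?
```

The result contains no apostrophe at all, so it matches neither the straight-quote nor the curly-quote version of the string.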

This is a sample of what my output looks like vs. what I’m expecting:

My CODES'S APOSTROPHE

My CODES’S APOSTROPHE

The second apostrophe should look curly. If you paste it into irb, you get the following: \U+FFE2
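Rather than relying on what irb echoes back, it may help to ask Ruby directly which non-ASCII characters a string contains. The sketch below assumes the character is the one printed above, U+2019 (RIGHT SINGLE QUOTATION MARK); `non_ascii_report` is a hypothetical helper name.

```ruby
# List each non-ASCII character with its code point and UTF-8 bytes.
def non_ascii_report(text)
  text.each_char.reject(&:ascii_only?).map do |ch|
    hex_bytes = ch.bytes.map { |b| format("%02X", b) }.join(" ")
    format("U+%04X bytes=%s", ch.ord, hex_bytes)
  end
end

puts non_ascii_report("My CODES\u2019S APOSTROPHE")
# prints: U+2019 bytes=E2 80 99
```

The first UTF-8 byte is E2, which plausibly explains the E2 in the irb output, though that is a guess about how irb handled the pasted input.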

I tried matching this character specifically with a regex, and it appears to work in Rubular. As soon as I put it in my model, however, I got a syntax error.

syntax error, unexpected $end, expecting ')'
raw_title = raw_title.gsub(/’/, "")
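One plausible cause, though it is an assumption since the top of the model file is not shown: Ruby 1.9 reads source files as US-ASCII unless the first line carries a magic comment such as `# encoding: utf-8`, so a literal ’ inside a regex can break parsing. A minimal sketch of two ways around that, using a hypothetical helper `remove_right_single_quote`:

```ruby
# encoding: utf-8
# The magic comment above has to be the first line of the file (or the
# second, after a shebang). Ruby 2.0+ already defaults to UTF-8 source.

# Hypothetical helper: the \u2019 escape keeps the regex literal pure
# ASCII, so it parses whatever the source encoding is.
def remove_right_single_quote(raw_title)
  raw_title.gsub(/\u2019/, "")
end

puts remove_right_single_quote("My CODES\u2019S APOSTROPHE")
```

Either the magic comment or the escape should be enough on its own; using the escape has the side benefit that the character is visible in diffs and editors that mangle non-ASCII text.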

I also tried forcing the encoding to UTF-8, but everything is already in UTF-8, so this had no apparent effect. I tried forcing the output to US-ASCII, but I got a byte sequence error.
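The difference between relabeling and converting may explain these results. `force_encoding` only changes the label on the existing bytes, while `encode` actually transcodes and raises when a character has no equivalent in the target encoding. A rough sketch, with `encoding_probe` as a hypothetical helper:

```ruby
def encoding_probe(text)
  # Relabel only: the UTF-8 bytes stay as they are, so the string is no
  # longer valid under its new US-ASCII label.
  forced = text.dup.force_encoding("US-ASCII")

  # Real conversion: fails because U+2019 does not exist in US-ASCII.
  converted =
    begin
      text.encode("US-ASCII")
    rescue Encoding::UndefinedConversionError => e
      e.class.name
    end

  # Conversion with an explicit substitute for unmappable characters.
  replaced = text.encode("US-ASCII", :undef => :replace, :replace => "'")

  { :forced_valid => forced.valid_encoding?,
    :converted    => converted,
    :replaced     => replaced }
end

p encoding_probe("CODES\u2019S")
```

Note that the `:replace` option substitutes the same string for every unmappable character, so it is a blunt tool if the documents contain more than apostrophes.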

I also tried a few of the encoding options found in the Ruby library. These basically did the same thing as the regex.

It all comes down to this: I’m trying to match output for testing purposes. Should I even be concerned about these special characters? Is there a better way to match these characters without blindly removing them?
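Two less destructive options come to mind, sketched below under the assumption that only typographic quotes are involved. One is to keep the character and write the expected value with a `\u2019` escape; the other is to map curly quotes to their ASCII counterparts before comparing. `normalize_quotes` is a hypothetical helper, not part of Docsplit or RSpec.

```ruby
# Hypothetical helper: map common typographic quotes to ASCII equivalents.
def normalize_quotes(text)
  # From: left/right single quotes, left/right double quotes.
  # To:   straight single quote (x2), straight double quote (x2).
  text.tr("\u2018\u2019\u201C\u201D", "''\"\"")
end

actual = "My CODES\u2019S APOSTROPHE"

# Option 1: compare against the exact character, written as an escape.
puts actual == "My CODES\u2019S APOSTROPHE"

# Option 2: normalize first, then compare against plain ASCII.
puts normalize_quotes(actual) == "My CODES'S APOSTROPHE"
```

In a spec, either comparison could sit inside an ordinary equality expectation. Option 1 leaves the stored data untouched, which seems closer to the goal of not altering what goes into the DB.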