String encoding is something that we don't really think about until we see
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
Or when users complain about missing special characters like "’" (an apostrophe copied from Microsoft Word), or when "菜医生" becomes "иЏњеЊ»з”џ".
Before we go into encoding problems, let's understand what encoding is.
A string can be considered as an array of bytes:
irb(main):001:0> "world".bytes
=> [119, 111, 114, 108, 100]
Here 119 stands for w, 111 for o, and so on. This relationship between bytes and characters is defined by the encoding.
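As a rough illustration of that mapping, the sketch below rebuilds a string from its bytes and shows that one character can occupy more than one byte. The variable names are only for illustration, and the accented character is written as an escape so the example does not depend on the source file's encoding.

```ruby
word = "world"
p word.bytes                      # => [119, 111, 114, 108, 100]

# Going the other way: pack the byte values back into a string,
# then tell Ruby to read those bytes as UTF-8.
rebuilt = word.bytes.pack("C*").force_encoding("UTF-8")
p rebuilt == word                 # => true

# In UTF-8 a single character may need several bytes.
accented = "\u00E9"               # e with an acute accent
p accented.length                 # => 1
p accented.bytesize               # => 2
p accented.bytes                  # => [195, 169]
```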
Let's see what happens when we change the encoding:
irb(main):001:0> str = "Café"
=> "Café"
irb(main):002:0> str.bytes
=> [67, 97, 102, 195, 169]
irb(main):003:0> str.force_encoding("windows-1251"); str.encode("utf-8");
=> "CafГ©"
irb(main):004:0> str.bytes
=> [67, 97, 102, 195, 169]
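Changing the encoding with force_encoding changes how the string is interpreted and printed, without changing the bytes. Converting with encode is different: it translates each character into the target encoding and can fail with Encoding::UndefinedConversionError. You'll see that error when a character in one encoding doesn't exist in another, or when Ruby can't figure out how to translate a character between two encodings.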
irb(main):001:0> str = "Café"
=> "Café"
irb(main):002:0> str.encode("windows-1251")
Encoding::UndefinedConversionError: U+00E9 to WINDOWS-1251 in conversion from UTF-8 to WINDOWS-1251
        from (irb):8:in `encode'
        from (irb):8
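To put the two methods side by side, here is a small sketch (not the exact session above) that relabels a copy of the string with force_encoding, converts it with encode, and captures the conversion error instead of letting it propagate. The variable names are illustrative.

```ruby
str = "Caf\u00E9"   # "Cafe" with an acute accent on the e, as UTF-8

# force_encoding only swaps the encoding label; dup leaves str untouched.
relabeled = str.dup.force_encoding("Windows-1251")
p relabeled.bytes == str.bytes   # => true

# encode converts character by character, so the bytes do change.
transcoded = relabeled.encode("UTF-8")
p transcoded.bytes               # => [67, 97, 102, 208, 147, 194, 169]

# encode raises when the target encoding has no such character.
conversion_error = begin
  str.encode("Windows-1251")
  nil
rescue Encoding::UndefinedConversionError => e
  e
end
p conversion_error.class         # => Encoding::UndefinedConversionError
```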
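To prevent the error we can pass extra arguments to encode.
The invalid and undef options replace bytes that are invalid in the source encoding, or characters that have no equivalent in the target encoding, with
? (U+FFFD when the target is a Unicode encoding) or with whatever string is passed in the replace option.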
irb(main):016:0> str = "Café"
=> "Café"
irb(main):017:0> str.encode("windows-1251", invalid: :replace, undef: :replace, replace: " ")
=> "Caf "
Unfortunately, we lose information when encode replaces characters: afterwards we have no idea which characters were replaced. But losing data can be better than things breaking in the new encoding.
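One way to soften that loss is to check, before converting, which characters have no equivalent in the target encoding. The helper below is a hypothetical sketch (unconvertible_chars is not a Ruby built-in); it simply tries to encode each character on its own and collects the ones that fail.

```ruby
# Hypothetical helper: returns the distinct characters of `text`
# that cannot be represented in the `target` encoding.
def unconvertible_chars(text, target)
  text.each_char.select do |ch|
    begin
      ch.encode(target)
      false                       # converted fine, not a problem character
    rescue Encoding::UndefinedConversionError
      true                        # no equivalent in the target encoding
    end
  end.uniq
end

# U+00E9 (accented e) is missing from Windows-1251; U+0416 (Cyrillic Zhe) is present.
lost = unconvertible_chars("Caf\u00E9 \u0416", "Windows-1251")
p lost.map { |ch| format("U+%04X", ch.ord) }   # => ["U+00E9"]
```

Logging this list before calling encode with the replace options at least records what was dropped.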
Encoding problems we faced
Recently we stumbled upon string encoding while implementing a CSV import feature. We first open the CSV at a remote location and read it. While reading one of the CSV files we got Encoding::CompatibilityError. This error is raised when an operation, such as concatenation, combines strings whose encodings are incompatible. So we need to encode the CSV string to UTF-8.
open(file_url).read.encode('UTF-8', invalid: :replace, undef: :replace, replace: ' ' )
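For context, this is a minimal sketch of how Encoding::CompatibilityError can arise. It assumes the remote data arrived as a binary (ASCII-8BIT) string containing a non-ASCII byte; the actual import code may have hit the error in a different operation.

```ruby
utf8   = "Caf\u00E9"   # UTF-8 string with a non-ASCII character
binary = "\x92".b      # binary (ASCII-8BIT) string with a high byte

# Ruby can tell in advance that these two cannot be combined.
p Encoding.compatible?(utf8, binary)   # => nil

error = begin
  utf8 + binary
  nil
rescue Encoding::CompatibilityError => e
  e
end
p error.class                          # => Encoding::CompatibilityError
```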
Because of the replace option in the encode call above, the apostrophe ("’") copied from Microsoft Word was being replaced by a blank. To fix this, we first had to mark the string as windows-1251 with force_encoding and then encode it back to UTF-8.
irb(main):001:0> str = "Ruby\x92s string encoding"
=> "Ruby\x92s string encoding"
irb(main):002:0> str.force_encoding("windows-1251").encode("utf-8", invalid: :replace, undef: :replace, replace: ' ')
=> "Ruby’s string encoding"
Then we encountered Chinese characters. They were converted to garbage after forcing the encoding to windows-1251 and encoding back to UTF-8:
irb(main):001:0> str = "菜医生"
=> "菜医生"
irb(main):002:0> str.force_encoding("windows-1251").encode("utf-8")
=> "иЏњеЊ»з”џ"
To fix this we had no option but to replace \x92 separately using gsub and then process the CSV files. While replacing, make sure both strings (the original and the substitution) use the same encoding; otherwise Ruby raises an error.
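As an alternative sketch of the same idea, String#scrub can be used instead of gsub: it only touches bytes that are invalid in the string's encoding, so valid multi-byte characters such as the Chinese text are left alone. The helper name and the sample data are hypothetical, and any stray byte other than 0x92 is replaced with "?" purely for illustration.

```ruby
# Hypothetical CSV cell: valid UTF-8 text plus one stray 0x92 byte.
# Built from binary pieces so the example does not depend on source encoding.
raw = ("Ruby".b + "\x92".b + "s \u83DC\u533B\u751F".b).force_encoding("UTF-8")
p raw.valid_encoding?   # => false

# Hypothetical helper: turn stray 0x92 bytes into U+2019 (right single quote).
def repair_stray_apostrophes(text)
  text.scrub do |invalid|
    # `invalid` holds the offending byte(s); handle them one at a time.
    invalid.bytes.map { |byte| byte == 0x92 ? "\u2019" : "?" }.join
  end
end

fixed = repair_stray_apostrophes(raw)
p fixed.valid_encoding?                          # => true
p fixed == "Ruby\u2019s \u83DC\u533B\u751F"      # => true
```

Because the replacement string is UTF-8 like the receiver, both sides share one encoding, which is the condition the paragraph above warns about.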
It's a brain-consuming task to fix encoding issues. To become comfortable with encodings, just play around with the encode and force_encoding methods in irb.
Do let me know if there is a better solution for fixing encoding problems.