String encoding is something we don't really think about until we see
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
Or when users complain about missing special characters like "’" (an apostrophe copied from Microsoft Word), or when "菜医生" becomes "иЏњеЊ»з”џ".
Before we go into encoding problems, let's understand what encoding is.
A string can be considered as an array of bytes:
irb(main):001:0> "world".bytes
=> [119, 111, 114, 108, 100]
Here 119 means w, 111 means o, and so on. This relationship between bytes and characters is defined by the encoding.
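To make the byte-to-character relationship concrete, here is a small sketch. The variable names are only for illustration, and it assumes the default UTF-8 source encoding. It shows that a string carries an encoding along with its bytes, and that one character may need more than one byte:

```ruby
str = "world"
str.encoding                 # the encoding Ruby uses to interpret the bytes
codes = str.bytes            # [119, 111, 114, 108, 100]
letters = codes.map(&:chr)   # ["w", "o", "r", "l", "d"]

# Rebuild the string from raw bytes, then label it with the original encoding.
rebuilt = codes.pack("C*").force_encoding(str.encoding)

# A character outside ASCII takes more than one byte in UTF-8.
accented = "é"
accented.length              # 1 character
accented.bytesize            # 2 bytes
```

The same five bytes always give back "world" here because every byte is plain ASCII; the accented character is where the choice of encoding starts to matter.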
Let's see what happens when we change the encoding:
irb(main):001:0> str = "Café"
=> "Café"
irb(main):002:0> str.bytes
=> [67, 97, 102, 195, 169]
irb(main):003:0> str.force_encoding("windows-1251"); str.encode("utf-8")
=> "CafГ©"
irb(main):004:0> str.bytes
=> [67, 97, 102, 195, 169]
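A compact way to see the difference between the two methods used in this session: force_encoding only relabels the existing bytes, while encode produces new bytes for the target encoding. This is a sketch with illustrative variable names, assuming a UTF-8 source file:

```ruby
original = "Café"   # UTF-8: é is stored as the two bytes 195, 169

# force_encoding changes only the label; the bytes stay exactly the same.
relabeled = original.dup.force_encoding("windows-1251")
same_bytes = (relabeled.bytes == original.bytes)

# encode converts: it writes new bytes that express, in UTF-8, the
# characters the windows-1251 label says those bytes stand for.
reencoded = relabeled.encode("utf-8")
reencoded_bytes = reencoded.bytes
```

In windows-1251, byte 195 is "Г" and byte 169 is "©", which is why the re-encoded string reads "CafГ©" and is two bytes longer than the original.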
Changing the encoding with force_encoding changes how the bytes are interpreted and printed, without changing the bytes themselves. Converting with encode, on the other hand, can fail: you'll see an error when a character in one encoding doesn't exist in the other, or when Ruby can't figure out how to translate a character between the two encodings.
irb(main):001:0> str = "Café"
=> "Café"
irb(main):002:0> str.encode("windows-1251")
Encoding::UndefinedConversionError: U+00E9 to WINDOWS-1251 in conversion from UTF-8 to WINDOWS-1251
from (irb):8:in `encode'
from (irb):8
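For orientation, here is a sketch of the three encoding errors that tend to show up in this kind of work. The helper encoding_error_for is hypothetical, written only for this illustration; the error classes are Ruby's own, and all of them inherit from EncodingError:

```ruby
# Hypothetical helper: runs the block and returns the class of the
# encoding error it raised, or nil when nothing went wrong.
def encoding_error_for
  yield
  nil
rescue EncodingError => e
  e.class
end

# The character exists in the source but has no equivalent in the target.
undefined_char = encoding_error_for { "Café".encode("windows-1251") }

# The source bytes are not valid in the encoding the string claims to have.
invalid_bytes = encoding_error_for { "Ruby\x92s".encode("windows-1251") }

# Two strings with incompatible encodings are combined.
incompatible = encoding_error_for { "Café" + "Ruby\x92s".b }
```

The first two come from converting a single string, while the third appears only when two strings meet, for example during concatenation.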
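To prevent the error, we can pass the extra options invalid and undef to encode. With invalid: :replace, byte sequences that are invalid in the source encoding are replaced; with undef: :replace, characters that have no equivalent in the target encoding are replaced. The replacement is ? by default (U+FFFD when the target is a Unicode encoding), or whatever string is passed in the replace option.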
irb(main):016:0> str = "Café"
=> "Café"
irb(main):017:0> str.encode("windows-1251", invalid: :replace, undef: :replace, replace: " ")
=> "Caf "
Unfortunately, we lose information when replacing characters with encode: we have no idea which characters were replaced. But losing some data can be better than having things break in the new encoding.
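If we do want to know what is about to be lost, one option is to check each character before replacing. A rough sketch, where unconvertible_chars and convertible? are hypothetical helpers and not part of Ruby:

```ruby
# Hypothetical helper: true when a single character can be converted
# to target_encoding without raising.
def convertible?(char, target_encoding)
  char.encode(target_encoding)
  true
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  false
end

# Hypothetical helper: the distinct characters of str that encode would
# have to replace when converting to target_encoding.
def unconvertible_chars(str, target_encoding)
  str.each_char.reject { |ch| convertible?(ch, target_encoding) }.uniq
end

lost = unconvertible_chars("Café", "windows-1251")
safe = "Café".encode("windows-1251", undef: :replace, replace: "?")
```

Logging lost before calling encode with the replace option at least leaves a record of what was dropped. Checking character by character is slow on large inputs, so this is better suited to debugging than to a hot path.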
Encoding problems we faced
Recently we stumbled upon string encoding while implementing a CSV import feature. We first open the CSV file at a remote location and read it. While reading one of the CSV files, we got Encoding::CompatibilityError. This error is raised when Ruby has to combine two strings whose encodings are incompatible, such as UTF-8 and ASCII-8BIT. So we needed to encode the CSV string to UTF-8:
require 'open-uri' # needed for URI.open; Kernel#open no longer opens URLs in Ruby 3.0+
URI.open(file_url).read.encode('UTF-8', invalid: :replace, undef: :replace, replace: ' ')
Because of the replace option, the apostrophe ("’") copied from Microsoft Word was being replaced by a blank space. To fix this, we first had to mark the string as windows-1251 with force_encoding and then encode it to UTF-8.
irb(main):001:0> str = "Ruby\x92s string encoding"
=> "Ruby\x92s string encoding"
irb(main):002:0> str.force_encoding("windows-1251").encode("utf-8", invalid: :replace, undef: :replace, replace: ' ' )
=> "Ruby’s string encoding"
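Then we encountered Chinese characters. They were converted to garbled characters when the string was treated as windows-1251 and encoded to UTF-8, because their bytes were already valid UTF-8 and were being reinterpreted one byte at a time.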
irb(main):001:0> str = "菜医生"
=> "菜医生"
irb(main):002:0> str.force_encoding("windows-1251").encode("utf-8")
=> "иЏњеЊ»з”џ"
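One way to tell the two cases apart is valid_encoding?, which reports whether the bytes make sense in the string's current encoding. The sketch below uses a hypothetical helper, to_utf8, that reinterprets a string as windows-1251 only when it is not already valid UTF-8. It assumes each string is entirely one or the other, so it does not help when the two kinds of bytes are mixed within the same string:

```ruby
# Hypothetical helper: keep strings that are already valid UTF-8 as they
# are, and treat everything else as legacy_encoding (an assumed default).
def to_utf8(raw, legacy_encoding = "windows-1251")
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  raw.dup
     .force_encoding(legacy_encoding)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: " ")
end

chinese = "菜医生"                      # bytes are valid UTF-8
word    = "Ruby\x92s string encoding"   # \x92 alone is not valid UTF-8

chinese_ok = chinese.valid_encoding?
word_ok    = word.valid_encoding?

fixed_chinese = to_utf8(chinese)
fixed_word    = to_utf8(word)
```

Here the Chinese string passes through untouched, while the string with the Word apostrophe goes through the windows-1251 conversion.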
To fix this, we had no option but to replace \x92 separately using gsub and then process the CSV files. While replacing, make sure both strings (the original and the substitution) use the same encoding; otherwise Ruby raises an error.
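A minimal sketch of that replacement, not our exact code: the helper name replace_word_apostrophe is hypothetical. It treats everything as binary so that the original, the pattern, and the substitution all share one encoding, and it assumes the byte 0x92 never occurs inside a valid multi-byte UTF-8 character in the data, which is not guaranteed in general:

```ruby
# Hypothetical helper: swap the lone windows-125x apostrophe byte for the
# UTF-8 bytes of "’" (U+2019), then label the result as UTF-8.
def replace_word_apostrophe(raw)
  binary = raw.b                                  # binary (ASCII-8BIT) copy
  fixed  = binary.gsub("\x92".b, "\u2019".b)      # pattern and substitution are binary too
  fixed.force_encoding(Encoding::UTF_8)
end

mixed  = "Ruby\x92s 菜医生"
result = replace_word_apostrophe(mixed)
```

Because the Chinese characters are never re-encoded, they survive, and only the stray byte is rewritten.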
Fixing encoding issues is a brain-consuming task. To become comfortable with encodings, just play around with the encode and force_encoding methods in the irb console.
Do let me know if there is a better solution for fixing encoding problems.