Bug 151845 - [smbfs] [patch] smbfs should be upgraded to support Unicode
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 8.1-RELEASE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-fs (Nobody)
Reported: 2010-10-31 13:40 UTC by Michael Meelis
Modified: 2011-11-27 15:50 UTC

Attachments
file.diff (6.55 KB, patch)
2010-10-31 13:40 UTC, Michael Meelis

Description Michael Meelis 2010-10-31 13:40:09 UTC
Windows stores all file names in UTF-16. When you copy files from Windows to FreeBSD through a Samba server, Samba converts the file names from UTF-16 to UTF-8. When you then fetch the files back through Samba, the reverse conversion occurs. This is a correct, lossless, bidirectional conversion. It is possible because the Samba server uses a modern protocol with UTF-16 support. In this direction everything works.
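
The lossless round trip described above can be checked in a few lines of Python. This is only a sketch of the encodings involved, not of Samba itself; the sample name and the helper names are made up for illustration.

```python
def windows_to_unix(name_utf16: bytes) -> bytes:
    """Re-encode a UTF-16LE file name as UTF-8 (Windows -> FreeBSD direction)."""
    return name_utf16.decode("utf-16-le").encode("utf-8")


def unix_to_windows(name_utf8: bytes) -> bytes:
    """Re-encode a UTF-8 file name as UTF-16LE (FreeBSD -> Windows direction)."""
    return name_utf8.decode("utf-8").encode("utf-16-le")


# Hypothetical sample name with characters outside ASCII.
original = "FrØya €.html".encode("utf-16-le")
round_trip = unix_to_windows(windows_to_unix(original))

# True when the two conversions are exact inverses for this name.
print(round_trip == original)
```

Both encodings cover the same set of characters, which is why nothing is lost in either direction.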

When you want to copy files from FreeBSD to Windows, you first mount the Windows share using mount_smbfs and smbfs.ko. But smbfs.ko (which does the main work) supports only the old DOS-style protocol without Unicode; it uses a simple single-byte encoding. On the Windows side, the server component converts the byte-coded characters into UTF-16 using a conversion table. By default Windows (a beautiful "I know better" solution) uses CP437. But in most cases ISO8859-1 is the table used to represent a wide range of file names. I checked this by analyzing your test archive. And that is not all: I found many characters that do not fit into ISO8859-1 because they come from the WINDOWS-1252 table (I checked this too).
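
A quick way to see how the three tables differ is to ask which of them can represent a given character. The sketch below uses Python's built-in codecs, which are assumed here to match the tables Windows and libiconv use; the sample characters are chosen for illustration and are not taken from the test archive.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if the single-byte table `codec` has a slot for `ch`."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


samples = ["Ø", "€", "é"]
tables = ["cp437", "iso8859-1", "cp1252"]

# For each character, list the tables that can represent it.
coverage = {ch: [t for t in tables if encodable(ch, t)] for ch in samples}

for ch, ok in coverage.items():
    print(ch, ok)
```

A character such as "€" exists only in WINDOWS-1252 among these three, which is the kind of case the check on the archive turned up.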

So even if we use UTF-8 to CP437 conversion on FreeBSD, we lose most of the additional characters on the FreeBSD side. If we use UTF-8 to WINDOWS-1252 conversion on FreeBSD, we lose nothing on the FreeBSD side, but lose the same characters as in the previous case on the Windows side.
We MUST change the conversion on the Windows side to the correct one: the WINDOWS-1252 table must be used.
After this we can use UTF-8 to WINDOWS-1252 conversion on FreeBSD and get a perfect result.
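
The requirement that both sides agree on the table can be modelled as one encode step on the client and one decode step on the server. This is a simplified sketch: `send_name` is a hypothetical helper, and the real SMB exchange involves more than a single byte string.

```python
def send_name(name: str, client_table: str, server_table: str) -> str:
    """Model one hop: the client encodes to single bytes, the server decodes them."""
    wire_bytes = name.encode(client_table)
    return wire_bytes.decode(server_table)


name = "FrØya.html"

# Both sides use WINDOWS-1252: the name survives.
matched = send_name(name, "cp1252", "cp1252")

# Client uses WINDOWS-1252 but the server still assumes CP437:
# the byte for "Ø" is read back as a different character.
mismatched = send_name(name, "cp1252", "cp437")

print(matched == name, mismatched == name)
```

The mismatched case does not raise an error; the name is silently changed, which is why the server-side table has to be corrected as well.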

Additionally, I found that smbfs has an erroneous implementation of the conversion from variable-length characters (UTF-8) to single-byte characters (like WINDOWS-1252). This can't be fixed without significant effort and would take a long time to debug. But this is no problem: we can use the "iconv" option in rsync. Libiconv with rsync works perfectly.
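
The report does not say what exactly is wrong inside smbfs, so the following only illustrates why this kind of conversion is easy to get wrong: the byte length of a name changes, and a converter that treats each UTF-8 byte as one character produces a different name.

```python
name = "FrØya.html"

utf8 = name.encode("utf-8")      # "Ø" takes two bytes here
single = name.encode("cp1252")   # "Ø" takes one byte here

# A naive converter that maps every UTF-8 byte to one single-byte
# character (modelled by decoding the UTF-8 bytes as WINDOWS-1252).
naive = utf8.decode("cp1252")

print(len(name), len(utf8), len(single), len(naive))
```

A correct converter has to consume whole multi-byte sequences and size its output buffer for the converted length, not the input length.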

Continuing. All looks fine. But Windows can put (and does, I checked) several control characters into file names that are not defined in WINDOWS-1252. These characters come from UTF-16 and convert into UTF-8 correctly, but the conversion from UTF-8 to WINDOWS-1252 fails. So we need to patch iconv and libiconv so that the conversion in libiconv works without errors (otherwise rsync fails with "can't convert name" or a similar error).
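
The failure can be reproduced with Python's cp1252 codec, which leaves a few byte values undefined. The control character U+0081 used below is an assumption picked for illustration; the report does not list the characters that were actually found.

```python
# Hypothetical file name containing the C1 control character U+0081.
name = "report\u0081.txt"

utf8 = name.encode("utf-8")  # UTF-8 can represent it without trouble

try:
    name.encode("cp1252")    # strict conversion, as a plain iconv call would do
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A tolerant conversion substitutes a placeholder instead of failing.
fallback = name.encode("cp1252", errors="replace")

print(strict_ok, fallback)
```

A strict converter stops at the first such character, which matches the "can't convert name" behaviour described above.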

I nearly broke my mind with smbfs and rsync. I made a new smbfs patch that replaces unconvertible characters with "_". And rsync goes crazy and copies the same files to the Windows share again and again when run several times with the same parameters. Funny. But bad.
This problem is connected with the whole conversion sequences:
on the first and later runs, while copying files to the Windows share: localfs->rsync->smbfs->iconv->patch->windowsfs
on later runs, while finding files that need to be rsynced: windowsfs->smbfs->iconv->rsync->localfs
Here a file named e.g. "FrØya.html" is converted to "Fr_ya.html" on the Windows share, and the first rsync run finishes without errors. But when rsync runs a second time it looks on the Windows share for "FrØya.html" and finds only "Fr_ya.html" (rsync doesn't know about this lossy conversion inside smbfs), so it copies the file again and again. Bug.
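
The repeated copying can be reproduced with a toy model in which the sync tool compares original names while the file system silently rewrites them. Everything here is a stand-in: `store_on_share` plays the role of the patched smbfs, `sync_once` plays the role of rsync, and the "underscore" error handler name is made up.

```python
import codecs


def _underscore(err):
    """Replace each unencodable character with '_' instead of failing."""
    if isinstance(err, UnicodeEncodeError):
        return ("_" * (err.end - err.start), err.end)
    raise err


codecs.register_error("underscore", _underscore)


def store_on_share(name: str) -> str:
    """Model the patched smbfs: lossy encode to CP437, name as later listed."""
    return name.encode("cp437", errors="underscore").decode("cp437")


local_files = {"FrØya.html", "index.html"}
share = set()  # names as they appear on the Windows share


def sync_once() -> set:
    """Copy every local file whose name is not found on the share."""
    to_copy = {n for n in local_files if n not in share}
    for n in to_copy:
        share.add(store_on_share(n))
    return to_copy


first = sync_once()
second = sync_once()
print(sorted(second))
```

The second run still reports "FrØya.html" as missing, because the share only ever contains "Fr_ya.html".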

To fix this we need to leave the smbfs module untouched, add a new conversion table to libiconv (so that the "_" replacement happens inside rsync), and use rsync with the "iconv" option.

I added a new encoding, "CP437FIXED", which always converts successfully, mapping unconvertible symbols to '_'.
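
The actual CP437FIXED table lives in the attached libiconv patch and is not reproduced here. The sketch below only guesses at its behaviour from the description: encode to CP437 and emit "_" for anything CP437 lacks. It also shows why doing the substitution inside the sync tool stops the repeated copying. `to_cp437_fixed` and `sync_once` are hypothetical names.

```python
def to_cp437_fixed(name: str) -> bytes:
    """Encode to CP437, substituting '_' for every character CP437 lacks."""
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode("cp437")
        except UnicodeEncodeError:
            out += b"_"
    return bytes(out)


local_files = ["FrØya.html", "index.html"]
share = set()  # converted names present on the Windows share


def sync_once() -> set:
    """The sync tool converts names itself, so it looks up the converted name."""
    wanted = {to_cp437_fixed(n) for n in local_files}
    to_copy = wanted - share
    share.update(to_copy)
    return to_copy


first = sync_once()
second = sync_once()
print(sorted(first), sorted(second))
```

Because the comparison now happens on converted names, the second run has nothing to copy. The trade-off is that two different local names can collapse to the same converted name, for example "FrØya.html" and "Fr_ya.html".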

Fix: smbfs should be upgraded to support Unicode. Until then, work with the attached libiconv patch and the new CP437FIXED encoding. (The full patch & test doesn't fix the 100kb & txt extension.)

Patch attached with submission follows:
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2010-11-05 07:32:24 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 2 lists 2010-11-11 19:55:35 UTC
There were some patches floating around to add Unicode support to
smbfs as long as 5 years ago, apparently inspired by work done on Mac
OS X. These are still available here:

http://people.freebsd.org/~imura/kiconv/

The smbfs code hasn't changed too much in that time (it's pretty much
unmaintained), so I don't think it would be too much work to dust them
off and get them to apply against 8.x or -CURRENT.

If someone out there with some knowledge of this area were able to
spare a few hours to look at this, it would be a huge step in bringing
SMBFS up to a modern usable level - at the moment it is largely
useless as soon as you hit files with non-ASCII characters in their names.
Comment 3 Kevin Lo freebsd_committer freebsd_triage 2011-11-27 15:49:55 UTC
State Changed
From-To: open->closed

Fixed. Committed to HEAD (r227650 and r228023).