Unicode Paths

Sebastian Spaeth Sebastian at SSpaeth.de
Fri Sep 16 03:58:49 PDT 2011


On Thu, 15 Sep 2011 13:52:12 -0400, Austin Clements <amdragon at mit.edu> wrote:
> On Tue, Sep 13, 2011 at 11:55 PM, Martin Owens <doctormo at gmail.com> wrote:
> > Hello Again,
> >
> > I notice in the lib code notmuch_database_open(),
> > notmuch_database_create() these functions use const char *path for the
> > directory path input. Is this unicode safe?
> >
> > The python bindings (and ctype docs) seem to suggest using something
> > called 'wchar_t *' for accepting unicode but that's for C not C++.
> >
> > Is this something that should be patched?
> 
> char* is the correct type for paths on POSIX systems.  The *meaning*
> of those bytes is a more complicated matter and depends on your locale
> settings.  On old systems it was generally ASCII, on modern systems
> it's generally UTF-8, and it can be many other things.  However, as a
> consequence of UNIX's C heritage, it is *always* terminated with a
> NUL byte and cannot contain embedded NULs.

Right, that's what we are doing: passing UTF-8 encoded Unicode
strings in as char*, which should be just fine as long as UTF-8 is what
the underlying OS uses.
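
As a minimal sketch of that approach (not the actual bindings code), the
conversion could look roughly like this. The helper name to_c_path and the
example path are made up for illustration, and UTF-8 is an assumption about
the system; os.fsencode() would follow the locale instead.

```python
import ctypes


def to_c_path(path):
    """Hypothetical helper: turn a path into bytes usable as a C char*."""
    if isinstance(path, bytes):
        raw = path
    else:
        # Assumption: the OS expects UTF-8 encoded path names.
        raw = path.encode("utf-8")
    # A C string ends at the first NUL byte, so an embedded one would
    # silently truncate the path on the C side.
    if b"\x00" in raw:
        raise ValueError("path contains an embedded NUL byte")
    return raw


raw = to_c_path("/home/user/M\u00e4il")  # illustrative path only
# This is the kind of object a ctypes binding would hand to a
# `const char *path` parameter.
c_path = ctypes.c_char_p(raw)
print(raw)
```

The C side only ever sees the encoded bytes; whether they name the
intended directory depends on the encoding the filesystem expects.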

> wchar_t is another matter entirely.  wchar_t is the type used by C to
> represent wide strings internally, which generally (but not
> necessarily!) means it stores a Unicode code point.  However, this
> isn't an encoding, and different compilers can give wchar_t different
> meanings, so wchar_t strings aren't generally appropriate for storing
> or sharing between processes or with the kernel.

Mmh, I remember that I attempted to use wchar_t to pass in unicode objects
directly, and it failed miserably.
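
One plausible way such an attempt goes wrong (a sketch, not necessarily
what happened in the bindings) is that a wchar_t buffer does not hold
UTF-8 bytes at all, so a function expecting char* misreads it. The sample
string below is made up for illustration.

```python
import ctypes

text = "M\u00e4il"  # illustrative string with one non-ASCII character

# wchar_t array holding the text plus a terminating zero element.
wide = ctypes.create_unicode_buffer(text)
# Raw memory of that array, as C code would see it.
wide_bytes = ctypes.string_at(ctypes.addressof(wide), ctypes.sizeof(wide))
utf8_bytes = text.encode("utf-8")

# Width of wchar_t differs between platforms and compilers.
print("sizeof(wchar_t):", ctypes.sizeof(ctypes.c_wchar))
print("wide memory:    ", wide_bytes)
print("UTF-8 bytes:    ", utf8_bytes)

# Reading the same memory as a NUL-terminated char* stops at the first
# zero byte, so the path gets cut short.
seen_as_char_p = ctypes.string_at(ctypes.addressof(wide))
print("seen as char*:  ", seen_as_char_p)
```

The wide buffer and the UTF-8 encoding are different byte sequences, and
the padding bytes inside each wide character act as string terminators
for anything that treats the buffer as char*.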

Sebastian