/////////////////////////////////////////////////////////////////////////////
// Name:        convauto.h
// Purpose:     interface of wxConvAuto
// Author:      wxWidgets team
// RCS-ID:      $Id$
// Licence:     wxWindows licence
/////////////////////////////////////////////////////////////////////////////

/**
    Constants representing various BOM types.

    BOM is an abbreviation for "Byte Order Mark", a special Unicode character
    which may be inserted into the beginning of a text stream to indicate its
    encoding.

    @since 2.9.3
 */
enum wxBOM
{
    /**
        Unknown BOM.

        This is returned if BOM presence couldn't be determined and normally
        happens because not enough bytes of input have been analysed.
    */
    wxBOM_Unknown = -1,

    /**
        No BOM.

        The stream doesn't contain a BOM character at all.
    */
    wxBOM_None,

    /**
        UTF-32 big endian BOM.

        The stream is encoded in the big endian variant of UTF-32.
    */
    wxBOM_UTF32BE,

    /**
        UTF-32 little endian BOM.

        The stream is encoded in the little endian variant of UTF-32.
    */
    wxBOM_UTF32LE,

    /**
        UTF-16 big endian BOM.

        The stream is encoded in the big endian variant of UTF-16.
    */
    wxBOM_UTF16BE,

    /**
        UTF-16 little endian BOM.

        The stream is encoded in the little endian variant of UTF-16.
    */
    wxBOM_UTF16LE,

    /**
        UTF-8 BOM.

        The stream is encoded in UTF-8.

        Notice that, contrary to popular belief, it's perfectly possible and,
        in fact, common under Microsoft Windows systems, to have a BOM in a
        UTF-8 stream: while it's not used to indicate the endianness of a UTF-8
        stream (as it's byte-oriented), the BOM can still be useful just as an
        unambiguous indicator of UTF-8 being used.
    */
    wxBOM_UTF8
};

/**
    @class wxConvAuto

    This class implements a Unicode to/from multibyte converter capable of
    automatically recognizing the encoding of the multibyte text on input. The
    logic used is very simple: the class uses the BOM (byte order mark) if it's
    present and tries to interpret the input as UTF-8 otherwise. If this fails,
    the input is interpreted as being in the default multibyte encoding which
    can be specified in the constructor of a wxConvAuto instance and, in turn,
    defaults to the value of GetFallbackEncoding() if not explicitly given.
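
    As a rough illustration of this detection order (not the actual wxWidgets
    implementation), the sketch below picks a decoding strategy in standard
    C++. All names in it (SketchDecision, SketchLooksLikeUTF8, SketchChoose)
    are hypothetical, the BOM check is assumed to have been done elsewhere and
    the UTF-8 check is deliberately simplified.

    ```cpp
    #include <cstddef>

    // All names below are hypothetical and not part of wxWidgets.
    enum SketchDecision
    {
        Sketch_UseBOM,       // decode according to the BOM that was found
        Sketch_UseUTF8,      // no BOM, but the bytes look like UTF-8
        Sketch_UseFallback,  // neither: use the fall-back multibyte encoding
        Sketch_Fail          // neither, and the fall-back is disabled
    };

    // Simplified UTF-8 check: verifies lead bytes and continuation counts
    // only. It does not reject overlong forms or surrogates.
    static bool SketchLooksLikeUTF8(const unsigned char* p, std::size_t len)
    {
        std::size_t i = 0;
        while ( i < len )
        {
            std::size_t extra;
            if ( p[i] < 0x80 )                extra = 0;
            else if ( (p[i] & 0xE0) == 0xC0 ) extra = 1;
            else if ( (p[i] & 0xF0) == 0xE0 ) extra = 2;
            else if ( (p[i] & 0xF8) == 0xF0 ) extra = 3;
            else return false;   // stray continuation byte or invalid lead

            if ( extra > len - i - 1 )
                return false;    // sequence is cut off by the end of input

            for ( std::size_t k = 1; k <= extra; ++k )
            {
                if ( (p[i + k] & 0xC0) != 0x80 )
                    return false;
            }

            i += extra + 1;
        }

        return true;
    }

    // hasBOM says whether a BOM was found; fallbackAllowed is false when the
    // use of the fall-back encoding has been disabled.
    static SketchDecision SketchChoose(bool hasBOM,
                                       const unsigned char* p,
                                       std::size_t len,
                                       bool fallbackAllowed)
    {
        if ( hasBOM )
            return Sketch_UseBOM;

        if ( SketchLooksLikeUTF8(p, len) )
            return Sketch_UseUTF8;

        return fallbackAllowed ? Sketch_UseFallback : Sketch_Fail;
    }
    ```

    In this sketch, the two bytes 0xE9 0x61 carry no BOM and are not valid
    UTF-8, so they end up in the fall-back branch.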

    For the conversion from Unicode to multibyte, the same encoding as was
    previously used for multibyte to Unicode conversion is reused. If there had
    been no previous multibyte to Unicode conversion, UTF-8 is used by default.
    Notice that once the multibyte encoding is automatically detected, it
    doesn't change any more, i.e. it is entirely determined by the first use of
    a wxConvAuto object in the multibyte-to-Unicode direction. However, creating
    a copy of a wxConvAuto object, either via the usual copy constructor or
    assignment operator, or using wxMBConv::Clone(), resets the automatically
    detected encoding so that the new copy will try to detect the encoding of
    the input on first use.

    This class is used by default in wxWidgets classes and functions reading
    text from files such as wxFile, wxFFile, wxTextFile, wxFileConfig and
    various stream classes so the encoding set with its SetFallbackEncoding()
    method will affect how these classes treat input files. In particular, use
    this method to change the fall-back multibyte encoding used to interpret
    the contents of files whose contents aren't valid UTF-8 or to disallow
    it completely.

    @library{wxbase}
    @category{data}

    @see @ref overview_mbconv
*/
class wxConvAuto : public wxMBConv
{
public:
    /**
        Constructs a new wxConvAuto instance. The object will try to detect the
        encoding of the multibyte text given to its wxMBConv::ToWChar() method
        automatically but if the automatic detection of Unicode encodings
        fails, the fall-back encoding @a enc will be used to interpret it as
        multibyte text.

        The default value of @a enc, @c wxFONTENCODING_DEFAULT, means that the
        global default value (which can be set using SetFallbackEncoding())
        should be used. As with that method, passing @c wxFONTENCODING_MAX
        inhibits using this encoding completely so the input multibyte text
        will always be interpreted as UTF-8 in the absence of a BOM and the
        conversion will fail if the input doesn't form a valid UTF-8 sequence.

        Another special value is @c wxFONTENCODING_SYSTEM which means to use
        the encoding currently used on the user system, i.e. the encoding
        returned by wxLocale::GetSystemEncoding(). Any other encoding will be
        used as is, e.g. passing @c wxFONTENCODING_ISO8859_1 ensures that
        non-UTF-8 input will be treated as latin1.
    */
    wxConvAuto(wxFontEncoding enc = wxFONTENCODING_DEFAULT);

    /**
        Return the detected BOM type.

        The BOM type is detected after sufficiently many initial bytes have
        passed through this conversion object so it will always return
        wxBOM_Unknown immediately after the object creation but may return a
        different value later.

        @since 2.9.3
    */
    wxBOM GetBOM() const;

    /**
        Return a pointer to the characters that make up this BOM.

        The returned character count is 2, 3 or 4, or undefined if the return
        value is NULL.
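
        For orientation, the standard Unicode byte order marks are listed in
        the standalone sketch below. It is written in plain C++ with
        hypothetical names (SketchBOM, SketchBOMChars, SketchBOMToString) and
        is not the wxWidgets implementation. It also shows why the count has to
        be used: the UTF-32 marks contain NUL bytes.

        ```cpp
        #include <cstddef>
        #include <string>

        // Hypothetical stand-ins for the wxBOM values; not wxWidgets code.
        enum SketchBOM
        {
            SketchBOM_None,
            SketchBOM_UTF32BE,
            SketchBOM_UTF32LE,
            SketchBOM_UTF16BE,
            SketchBOM_UTF16LE,
            SketchBOM_UTF8
        };

        // Returns the bytes of the given mark and stores their number in
        // *count, or returns a null pointer if there is no such mark.
        static const char* SketchBOMChars(SketchBOM bom, std::size_t* count)
        {
            static const char utf32be[] = { '\x00', '\x00', '\xFE', '\xFF' };
            static const char utf32le[] = { '\xFF', '\xFE', '\x00', '\x00' };
            static const char utf16be[] = { '\xFE', '\xFF' };
            static const char utf16le[] = { '\xFF', '\xFE' };
            static const char utf8[]    = { '\xEF', '\xBB', '\xBF' };

            switch ( bom )
            {
                case SketchBOM_UTF32BE: *count = 4; return utf32be;
                case SketchBOM_UTF32LE: *count = 4; return utf32le;
                case SketchBOM_UTF16BE: *count = 2; return utf16be;
                case SketchBOM_UTF16LE: *count = 2; return utf16le;
                case SketchBOM_UTF8:    *count = 3; return utf8;
                case SketchBOM_None:    break;
            }

            return nullptr;
        }

        // Copies the mark using the count: strlen() would stop at the
        // embedded NULs and the arrays are not NUL-terminated anyhow.
        static std::string SketchBOMToString(SketchBOM bom)
        {
            std::size_t count = 0;
            const char* chars = SketchBOMChars(bom, &count);
            return chars ? std::string(chars, count) : std::string();
        }
        ```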

        @param bom
            A valid BOM type, i.e. not wxBOM_Unknown or wxBOM_None.
        @param count
            A non-@NULL pointer receiving the number of characters in this BOM.
        @return
            Pointer to characters composing the BOM or @NULL if BOM is unknown
            or invalid. Notice that the returned string is not NUL-terminated
            and may contain embedded NULs so @a count must be used to handle it
            correctly.

        @since 2.9.3
    */
    const char* GetBOMChars(wxBOM bom, size_t* count);

    /**
        Disable the use of the fall-back encoding: if the input doesn't have a
        BOM and is not valid UTF-8, the conversion will fail.
    */
    static void DisableFallbackEncoding();

    /**
        Returns the encoding used by default by wxConvAuto if no other encoding
        is explicitly specified in the constructor. By default, returns
        @c wxFONTENCODING_ISO8859_1 but can be changed using
        SetFallbackEncoding().
    */
    static wxFontEncoding GetFallbackEncoding();

    /**
        Changes the encoding used by default by wxConvAuto if no other encoding
        is explicitly specified in the constructor. The default value, which can
        be retrieved using GetFallbackEncoding(), is @c wxFONTENCODING_ISO8859_1.

        Special values of @c wxFONTENCODING_SYSTEM or @c wxFONTENCODING_MAX can
        be used for the @a enc parameter to use the encoding of the current
        user locale as fall back or not use any encoding for fall back at all,
        respectively (just as with the similar constructor parameter). However,
        @c wxFONTENCODING_DEFAULT can't be used here.
    */
    static void SetFallbackEncoding(wxFontEncoding enc);

    /**
        Return the BOM type of this buffer.

        This is a helper function which is normally only used internally by
        wxConvAuto but provided for convenience of the code that wants to
        detect the encoding of a stream by checking it for BOM presence on its
        own.
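
        To make the idea concrete, here is a minimal standalone sketch of
        prefix-based BOM detection in plain C++. The names DemoBOM and
        DemoDetectBOM are hypothetical and the logic is only an assumption
        about how such a check can be written, not the code used by
        wxWidgets. In particular, it reports "unknown" whenever the input is
        too short to rule out a longer mark.

        ```cpp
        #include <cstddef>
        #include <cstring>

        // Hypothetical stand-ins mirroring the wxBOM constants.
        enum DemoBOM
        {
            DemoBOM_Unknown = -1,
            DemoBOM_None,
            DemoBOM_UTF32BE,
            DemoBOM_UTF32LE,
            DemoBOM_UTF16BE,
            DemoBOM_UTF16LE,
            DemoBOM_UTF8
        };

        static DemoBOM DemoDetectBOM(const char* src, std::size_t srcLen)
        {
            struct Entry { DemoBOM bom; const char* bytes; std::size_t len; };

            // Longest marks first so that FF FE 00 00 is reported as
            // UTF-32LE rather than UTF-16LE.
            static const Entry table[] =
            {
                { DemoBOM_UTF32BE, "\x00\x00\xFE\xFF", 4 },
                { DemoBOM_UTF32LE, "\xFF\xFE\x00\x00", 4 },
                { DemoBOM_UTF8,    "\xEF\xBB\xBF",     3 },
                { DemoBOM_UTF16BE, "\xFE\xFF",         2 },
                { DemoBOM_UTF16LE, "\xFF\xFE",         2 },
            };

            if ( srcLen == 0 )
                return DemoBOM_Unknown;

            // If the input is a proper prefix of some mark, more bytes are
            // needed before anything can be decided.
            for ( const Entry& e : table )
            {
                if ( srcLen < e.len && std::memcmp(src, e.bytes, srcLen) == 0 )
                    return DemoBOM_Unknown;
            }

            // Otherwise the first mark that the input starts with wins.
            for ( const Entry& e : table )
            {
                if ( srcLen >= e.len && std::memcmp(src, e.bytes, e.len) == 0 )
                    return e.bom;
            }

            return DemoBOM_None;
        }
        ```

        Under these assumptions the two bytes FF FE alone are ambiguous, as
        they could start either a UTF-16LE or a UTF-32LE mark, while FF FE
        followed by a non-zero byte is taken for UTF-16LE.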

        @since 2.9.3
    */
    static wxBOM DetectBOM(const char *src, size_t srcLen);
};