How encoding detection works in HippoEDIT

Started by alex, March 15, 2009, 01:20:23 PM

alex

Preconditions:

1) A file has exactly one encoding (the same thing as a code page). The encoding can be Unicode (UTF-16 LE or BE (1200, 1201), UTF-8 (65001)) or non-Unicode (for example, 1252 (Western European)).
2) Encoding conversion can be applied to a document in several places: Open, Save As, New, and Search and Replace.
3) The encoding can be selected or changed in the File Open/Save dialog, via the context menu or status bar, in Project Settings, in Tools->Options->Document settings, and in the syntax specification (where it can be set either as the preferred or as the forced encoding). In addition, HippoEDIT auto-detects the encoding using several algorithms: a BOM check, a statistical test for UTF-16 LE/BE, a statistical test for UTF-8, a check of encoding strings, and the same checks that IE uses.
4) Once the user has changed the encoding of a document, that preference has priority over all other settings. Preferences are machine-specific, and they are reset if the HippoEDIT temp files are deleted or their format changes in a new version.
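
To illustrate the BOM check and the UTF-8 test from point 3, here is a minimal Python sketch. It is not HippoEDIT's actual code: the names `detect_by_bom` and `looks_like_utf8` are made up for illustration, and the UTF-8 function is only a crude stand-in for a real statistical test (it just checks that the bytes decode strictly and are not plain ASCII).

```python
import codecs

# (BOM bytes, Windows code page) pairs for the Unicode encodings listed above.
BOMS = [
    (codecs.BOM_UTF8, 65001),      # EF BB BF
    (codecs.BOM_UTF16_LE, 1200),   # FF FE
    (codecs.BOM_UTF16_BE, 1201),   # FE FF
]


def detect_by_bom(data: bytes):
    """Return the code page implied by a leading BOM, or None if there is none."""
    for bom, code_page in BOMS:
        if data.startswith(bom):
            return code_page
    return None


def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical stand-in for a UTF-8 test: valid UTF-8 with non-ASCII bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Pure ASCII decodes as UTF-8 too, but tells us nothing about the encoding.
    return any(byte >= 0x80 for byte in data)
```

A file without a BOM returns `None` from `detect_by_bom`, so a caller would fall through to the other checks.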

So, how does all this work together (or how is it designed to work :) ):

1) New File encoding selection (if a setting is not defined, or is set to Automatic, the next one is taken)

  • Syntax forced encoding
  • Current Project settings encoding
  • Document settings encoding
  • Syntax preferred encoding
  • System locale encoding

2) Open File encoding selection (if a setting is not defined, or is set to Automatic, the next one is taken)

  • Syntax forced encoding
  • Encoding selected in the File Open dialog
  • Auto-detected encoding, using all the algorithms mentioned above. The set of applied algorithms can be changed in settings.xml
  • Syntax preferred encoding
  • Project settings encoding
  • Document settings encoding
  • System locale encoding
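
The "next one is taken" rule for Open File can be sketched as a simple fallback chain. This is only an illustration in Python under my own assumptions: the function and parameter names are hypothetical, `None` stands for "not defined or Automatic", and the default system locale of 1252 is just an example value. The per-document user preference from precondition 4 is not modelled here.

```python
AUTOMATIC = None  # stands for "not defined" or "set to Automatic"


def first_defined(candidates):
    """Return (source, code_page) for the first candidate that is actually set."""
    for source, code_page in candidates:
        if code_page is not AUTOMATIC:
            return source, code_page
    raise LookupError("no encoding defined")


def open_file_encoding(forced=AUTOMATIC, dialog=AUTOMATIC, detected=AUTOMATIC,
                       preferred=AUTOMATIC, project=AUTOMATIC,
                       document=AUTOMATIC, system_locale=1252):
    """Walk the Open File list from top to bottom (hypothetical names)."""
    return first_defined([
        ("syntax forced", forced),
        ("file open dialog", dialog),
        ("auto-detected", detected),
        ("syntax preferred", preferred),
        ("project settings", project),
        ("document settings", document),
        ("system locale", system_locale),
    ])
```

For example, `open_file_encoding(detected=65001, preferred=852)` picks the auto-detected value, because it sits above the syntax preferred encoding in the list. The New File list works the same way, only with fewer entries.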

3) Save File encoding selection (if a setting is not defined, or is set to Automatic, the next one is taken)

  • Encoding selected in the File Save dialog
  • Current document encoding
  • During save, HippoEDIT checks the consistency of the current document encoding with the encoding found in encoding strings (XML, HTML, etc.). If they do not match, the user is asked to select which encoding to use
  • Because HippoEDIT internally works with a Unicode representation of the text (UTF-16 LE), it can happen on save that the current text cannot be saved in the currently selected encoding without loss of information. In this case HippoEDIT should pop up a warning informing the user about possible data loss and suggesting to save the document as Unicode or in another encoding. This behaviour is controlled by the flag Check encoding accuracy in Tools->Options->Formatting
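
The two save-time checks can be sketched roughly like this in Python. Again, this is an illustration and not how HippoEDIT implements it: all function names are hypothetical, the regular expression only handles an XML declaration (not HTML meta tags), and encodings are given by their Python codec names (for example `cp1252`) rather than by code page numbers.

```python
import codecs
import re

# Matches encoding="..." inside an XML declaration such as <?xml ... ?>.
_XML_DECL = re.compile(r'<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(text: str):
    """Return the encoding named in a leading XML declaration, or None."""
    match = _XML_DECL.search(text[:200])
    return match.group(1) if match else None


def encodings_match(name_a: str, name_b: str) -> bool:
    """Compare two encoding names after normalising aliases (utf8 vs UTF-8)."""
    try:
        return codecs.lookup(name_a).name == codecs.lookup(name_b).name
    except LookupError:
        return False  # unknown name: treat as a mismatch


def can_save_losslessly(text: str, encoding: str) -> bool:
    """True if the text survives an encode/decode round trip unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False
```

If `can_save_losslessly` returns False, that is the situation where the warning about possible data loss would be shown.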

4) Search and Replace uses the same encoding logic as Open/Save file, except that interactive selection of the encoding via the Open/Save dialog is not available.

So, if you see that documents are opened with the wrong encoding, you have several ways to solve this:
1) Explicitly select the correct encoding in the File Open dialog.
2) Set a forced encoding for the syntax you are using:
<Encoding default="852" force="true"/>
in the SPECIFICATION section of the schema spec file.
3) Disable extended auto-detection (the IE algorithms). It can return a wrong result if there is not enough data for analysis.
This can be done with an XML flag in settings.xml, section General:
<EncodingDetection extended="false"/>

Also, from now on, extended encoding detection is enabled by default only for syntaxes inherited from deftext (such as Plain Text, XML and HTML).

Best regards,
Alex
HippoEDIT team
http://www.hippoedit.com/