Author Topic: font and encoding issues  (Read 2164 times)

Offline mlwang

  • Newbie
  • *
  • Posts: 9
  • Karma: +0/-0
    • View Profile
font and encoding issues
« on: January 12, 2010, 03:43:42 am »
I'm not sure whether this should be posted here or in the feature request forum. I'm putting it here because 1) these issues are really bugging me, and 2) other editors I use (EmEditor and Notepad) don't behave that way.

I'm using 1.48 beta 755 on Windows 7 x64, with my system locale (language for non-Unicode programs) set to Chinese (Traditional, Taiwan). The issues mentioned below aren't specific to build 755, however; they have been there since I started using HippoEDIT 3 months ago.

Issue 1: font settings not working

HippoEDIT defaults to Courier New, which is what I want, but regular text files aren't really displayed in Courier New (image attached below). The English text is jagged. Judging from its appearance, it's rendered in a Chinese font (MingLiU). It's not only ugly, but also non-monospace.

Changing HippoEDIT's default font to Consolas doesn't work. The funny thing is, apart from Asian fonts, changing the font only works with Fixedsys, another jagged font. Changing to some proportional fonts (Segoe UI, e.g.) does work, but not to others. And I want monospace anyway.

Another workaround is to change encoding to something like ASCII or Windows 1252, but then Chinese (or other Asian text) is garbled.

I guess it has something to do with my system locale setting, but it shouldn't. Both Notepad and EmEditor do this perfectly. They use Courier New to display English (Latin) text, using associated Asian fonts only when necessary.

Issue 2: saving encoding changes

Due to issue 1, I often resort to changing the encoding when there's no Asian text involved. When I try to close a file, however, HippoEDIT always prompts me to save the changes, even though the file was never touched. That's strange, because the encoding isn't stored in a plain text file, aside from Unicode files with a BOM. If I let HippoEDIT save the file anyway and do a binary comparison afterward, the saved file is exactly the same as the original one, except for a different modified time, which is annoying.

I understand that sometimes encoding changes do result in actual changes due to encoding conversion, especially between 8-bit text and Unicode text, but it would be great if HippoEDIT could be smart enough to know whether such changes are needed.

Personally I prefer the way EmEditor does it: changing the encoding only changes the way text is displayed, and the change is saved only when you choose to "Save As" in a different encoding.


Issue 3: encoding cache list renewal

Recently used code pages are listed at the top of the encoding sub-menu, which is great. Thanks. The list is nonetheless not updated until you quit and reopen HippoEDIT. So when you change encoding to, say, Windows 1252, for a file, and then want to do the same for another file, you have to go through the large cascade menu again unless the code page has been used in a previous session.

Issue 4: language setting for code page names

Is there a way to change how code page names are displayed? HippoEDIT lists all of them (except Unicode and US-ASCII) in Chinese, which looks weird to me since I'm used to their English names. E.g., I would prefer Cyrillic rather than, ehh, I couldn't find it in HippoEDIT, for I don't know what Cyrillic is called in Chinese.

Alex once mentioned that HippoEDIT uses the "same code pages that you see in Internet Explorer and installed in Windows," but my IE displays code pages in English, not Chinese.

Due to the above issues, HippoEDIT has been my "specialty" editor from day 1. I use it to edit html files when I need to work on tables or other nested elements. For everything else, I keep going back to my original editor of choice, despite the fact that I've been trying to use HippoEDIT more often, since it's such a great tool. Hope this will change someday.

Thanks for bearing with me with this long post.

Offline alex

  • Developer
  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1666
  • Karma: +29/-2
    • View Profile
    • HippoEDIT
Re: font and encoding issues
« Reply #1 on: January 12, 2010, 08:19:44 pm »
Hi Ming-Li,

Let us go from bottom to top, because Issues 2-4 are much easier than 1.

Issue 4: language setting for code page names
The set of code pages is the same, but their names can be requested in different languages. Because HE queries them in the user default language, they appear in Chinese. The user default language corresponds to the value you have selected in Regional Settings for the locale format (it should be the same as the language of the date shown when hovering over the system clock). But I agree, this is different from what IE shows, and also a little confusing, because it differs from the HippoEDIT UI language as well.
So: I will change the query language to match the HippoEDIT UI language (for now this means English only; for the localization done by samuel it will not work).

Issue 3: encoding cache list renewal
This seems to me to be a bug, but I cannot reproduce it. Can you please provide step-by-step instructions? Be aware that you have this menu in 3 different places (status bar, document context menu, and main menu View->Encoding). ... Just found the issue. It is only relevant for the main menu. This will be fixed in the next 1.48 beta. Until then you can use the two other ways.

Issue 2: saving encoding changes
... this is a long and not easy story :)
1) There are two possible cases when the encoding can be changed: when opening the initial file, and when converting the encoding of a changed file. If you switch the encoding of a non-modified file, generally nothing can be lost and the change is lossless, because HE just rereads the original file with the new encoding. But if you convert a modified document, the conversion may no longer be lossless, because HE converts from the source encoding to Unicode and then back to the target encoding, and something can be lost on the way.
2) An encoding change is an undoable action (but not always... I do not store the original source, so if something is lost, then bad luck :/), and this was one of the reasons to mark the document as modified: to be consistent with other actions.
3) When you open a file, HE tries to find the best matching encoding and shows the document with it. BUT if you have changed it at some point, HE remembers that and shows the file in the previously used encoding. So the encoding change is saved, but separately from the file. I do not know whether EmEditor saves the change of encoding or not (I mean for non-Unicode files).
4) Previously, changing the encoding of a non-modified document in HE also did not change the modified state (like in a browser), but we (I think it was Stefan who asked me ;) ) decided to make it consistent :).

So: ... maybe changing the modified state on an encoding switch (when no changes were made) was really not a good idea. I will evaluate this. And maybe I will add a check for an actual source change after changing the encoding of a modified document.
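The difference between rereading untouched bytes and converting edited text can be sketched in a few lines of Python. This is only an illustration of the idea discussed above, not HippoEDIT's code; the function names reread, convert and save_needed are made up for this example.

```python
from typing import Tuple


def reread(raw: bytes, encoding: str) -> str:
    """Reinterpret the untouched file bytes under another encoding.

    Nothing is written back, so this step on its own cannot lose data.
    """
    return raw.decode(encoding, errors="replace")


def convert(text: str, target: str) -> Tuple[bytes, bool]:
    """Convert (possibly edited) text to a target encoding.

    Returns the encoded bytes and a flag telling whether the
    conversion was lossless, i.e. decodes back to the same text.
    """
    data = text.encode(target, errors="replace")
    lossless = data.decode(target, errors="replace") == text
    return data, lossless


def save_needed(raw_on_disk: bytes, text: str, encoding: str) -> bool:
    """True only if saving would actually change the bytes on disk."""
    return text.encode(encoding, errors="replace") != raw_on_disk
```

With a check like save_needed, an editor could skip the save prompt when a file was only viewed under a different encoding and the bytes would come out identical.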

Issue 1: font settings not working
First check this post.
Generally, the reason is font substitution done by Windows and HE. Based on the code page, HE tries to find the correct charset and then creates a font for the complete document, passing the preferred charset together with the font name to Windows. If Windows sees that the font does not have a complete symbol set to cover the desired charset, it silently selects another font which has all the symbols.
The problem in HE here is that the font is selected for the complete document, not for every text block, so you get either monospaced Courier New and no Chinese, or a non-monospaced (substituted) font and all symbols. :/ This is currently a limitation of HippoEDIT.
EmEditor and Windows Notepad (from Vista) use the Uniscribe engine, which takes all these problems away. And I think EmEditor is the only one which uses it (and maybe it is the best of the popular text editors for Unicode texts).
It is a rather big change to HE (and a huge effort) to switch to Uniscribe, but I have this in my plans. But maybe after 1.50 is released. So, I should say that until then EmEditor will be the better choice for Unicode editing.
Maybe I will spend some time this week checking what can be done in this area with less effort, but I do not promise.

And the result: issues 2, 3 and 4 will probably be fixed in the next beta; 1 (complete fix) maybe some time later...

Wow... big post. Next time please ask one by one :)

Offline alex

Re: font and encoding issues
« Reply #2 on: January 13, 2010, 12:25:39 pm »
About Issue 1, I was wrong...

I have just checked Chinese text in Notepad on XP, and it works.
With Courier New and with the WESTERN charset...
Surprisingly, it also works in HE with the default charset when pasted from Notepad.

So the auto-detection of the charset done by HE is not really a good solution. I think I will make it optional in 1.48 beta, and then I will ask you to test the result.

Offline mlwang

Re: font and encoding issues
« Reply #3 on: January 13, 2010, 06:36:58 pm »
Thanks for the prompt replies and detailed explanation.

So the auto-detection of the charset done by HE is not really a good solution.

Thanks to this hint I changed the default document encoding to Western European (instead of autodetect). So now, at least for the bulk of my text viewing and editing work where there's no Asian text involved, the display is fine. Files with Asian text remain a problem, but that's ok. I'll wait.

Offline mlwang

Re: font and encoding issues
« Reply #4 on: January 20, 2010, 02:59:50 pm »
From the 1.48 beta Version 756 changelog (thought it's better to put follow-up in this thread, sorry if you prefer otherwise):

  • Fixed. Encoding names are not in HippoEDIT UI language. Details...
  • Fixed. Encoding Recent are not updated in main menu. Details...
  • Fixed. Document IS NOT any more to modified state after change of the encoding and user is not asked for save later. Details...

All confirmed. Thanks.

  • Fixed. Changed default settings for charset auto-detection to false.

I'm a little curious about this one. So now one can no longer set [Document Default] [Encoding] to "Automatic" in the UI, right? I tried it just now, and indeed setting it to "Automatic" has no effect; HippoEDIT always goes back to the last specific encoding setting.

The good thing is, I can now set the default encoding to Chinese Traditional (Big5), and HippoEDIT will display text with mixed Chinese and English properly (i.e., showing English in Courier New and Chinese in the proper Chinese font). Pure English text is displayed properly, too. That solves most of my problem. I now need to set the encoding manually only when I open Japanese text or UTF-8 text without BOM.

What I don't really understand is why encoding autodetection can mess things up when HE in fact detects the encoding correctly. IOW, with v.756, mixed text (Chinese and English) is displayed properly if I set the encoding manually, but not when HE auto-detects it. That's really strange. I know that's basically what you were talking about in the previous post, but I'm just curious about why.

No matter what, the current situation is already acceptable to me. I'm happy and will surely use HE more often. Thanks a lot!

Offline alex

Re: font and encoding issues
« Reply #5 on: January 20, 2010, 03:32:55 pm »
Quote
I'm a little curious about this one. So now one can no longer set [Document Default] [Encoding] to "Automatic" in the UI, right? I tried it just now, and indeed setting it to "Automatic" has no effect; HippoEDIT always goes back to the last specific encoding setting.

No, this is not right, and it has no connection to the fix I have done.
The misunderstanding comes from mixing up the terms charset and encoding.

An encoding (or code page) is the rule for how symbols are encoded as bytes and how to convert those bytes back to symbols. It has nothing to do with fonts or with representation; it concerns data only.
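A tiny Python illustration of "data only": the same bytes give different symbols depending on which encoding rule is applied, and no font is involved at any point. The helper name interpretations is made up for this example; big5 and cp1252 are the standard Python codec names for Chinese Traditional (Big5) and Windows 1252.

```python
def interpretations(raw: bytes, encodings):
    """Decode the same byte sequence under several encodings."""
    return {name: raw.decode(name, errors="replace") for name in encodings}


# Four bytes on disk, produced by encoding two Chinese characters as Big5.
raw = "中文".encode("big5")

for name, text in interpretations(raw, ["big5", "cp1252"]).items():
    # The bytes never change; only the rule used to read them does.
    print(name, raw.hex(), repr(text))
```

Read as Big5 the bytes are the original two characters; read as Windows 1252 they become four unrelated Latin symbols, which is the garbling described in the first post.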

Charset (at least as the term is used in the HE context) is a font property that controls which set of characters from the font should be used. For non-Unicode fonts (or incomplete Unicode fonts) you need to specify which charset (native language family) you want to use. Based on this, Windows will select the corresponding font file (sometimes there is a different font file for every charset) or will substitute the font, if the font you want to use does not have the desired character set. In this case Windows will try to find a similar font (by some parameters) which contains the full character set, and will silently use it instead of the one you selected. If you do not specify a character set, and your text has characters which are not in the default charset, and the font lacks glyphs for some characters, you will see rectangles for the missing glyphs.

The function which is now disabled by default was responsible for finding the corresponding character set value based on the encoding (but not the real text); based on that value, Windows finds a font with a complete glyph set for this character set. It may be that Courier New has glyphs for Simplified Chinese but not all characters of Traditional Chinese, so you do not see the problem now, but maybe for some other text you will not see some symbols. In that case you need to go to the settings and try to find a font which can show all characters, that is, do by hand the work that was done automatically before.
Maybe such an automatic way is preferable for general text editing, but not for source code files, where keeping the font that you have selected is better. In 1.5x I will probably add the possibility to change the document font from the menu without changing the settings; then I can also suggest such an automatically selected font there.

BTW: HE should also auto-detect UTF-8 without BOM and Japanese in most cases. But this depends on the amount of non-English text in the document (also, only the first 2000 bytes are checked).

Offline mlwang

Re: font and encoding issues
« Reply #6 on: January 22, 2010, 05:22:54 am »
No, this is not right, and it has no connection to the fix I have done.
The misunderstanding comes from mixing up the terms charset and encoding.

Thanks for the clarification, and sorry about the mix-up. I admit I've been using charset and encoding interchangeably. Upon reading your post, I went and read up on "Character encoding" on wikipedia, but still don't comprehend 100%, unfortunately. I'm not a programmer (despite using "HE - Programmers text editor"); my knowledge about computing and software has mostly been self-taught, with many holes.

An encoding (or code page) is the rule for how symbols are encoded as bytes and how to convert those bytes back to symbols. It has nothing to do with fonts or with representation; it concerns data only.

Charset (at least as the term is used in the HE context) is a font property that controls which set of characters from the font should be used.

I understand in the context of QP or UUEncode that an encoding (they're "encoding", right?) can be used to encode anything, independent of the language of the text. But what about "utf-8", "big-5" or "iso-8859-1"? HE, Firefox and IE call them encoding, but Apache configuration files and many web pages (html files) call them charset.

If they are encoding, can they really have nothing to do with "representation" (I understand the "font" part)? Perhaps I'm misunderstanding, again, "representation". By saying "charset is font property", do you mean it is the same as the "script" in Notepad's "Font" dialog (that's the best I could understand charset as a font property)?

Leaving aside my pathetic lack of grasp on the terms, what I found is this: in HE, once you have changed [Document Defaults] [Encoding] to something other than "Automatic", you can never change it back to "Automatic". The setting won't stick if you do; HE always reverts it to the last chosen encoding.

I thought it's due to the new beta (756). I was wrong. Tried it on 755 and the official 1.47 release, and it's all the same. Deleting the <CodePage> setting from the settings.xml file does work, though. Is this what you intend, or a bug?

Offline mlwang

Re: font and encoding issues
« Reply #7 on: January 22, 2010, 08:50:11 am »
I did some tests in a more systematic way, using a set of text files, each containing a subset of the same text. The complete text has three short lines, one each of English, Chinese (Traditional) and Japanese (Hiragana). (An empty line is inserted between the lines of text.) The complete text is saved in two different encodings: UTF-8 with BOM (ecj-u8) and UTF-8 without BOM (ecj-u8nb).

The three text lines used in my test are:
Some English first
中文測試 多幾個字
ひらがな すすめ

Then I took out the Japanese text, and saved the rest (English and Chinese) in 3 different files: Chinese big5 (ec-b5), UTF-8 with BOM (ec-u8) and UTF-8 without BOM (ec-u8nb).

Another set of 3 files was prepared for English and Japanese text (no Chinese), saving in the following encodings: JIS (ej-jis), UTF-8 with BOM (ej-u8) and UTF-8 without BOM (ej-u8nb).

All files were prepared with EmEditor, not HE.
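For anyone who wants to regenerate a comparable file set without EmEditor, here is a rough Python sketch. It makes some assumptions that are not in the post: CRLF line endings, and that "JIS" (code page 50220) corresponds to Python's iso2022_jp codec. The function names are made up for this example.

```python
import codecs
import os

ENGLISH = "Some English first"
CHINESE = "中文測試 多幾個字"
JAPANESE = "ひらがな すすめ"


def make_text(*lines):
    # One empty line between the text lines, as described in the post.
    # CRLF line endings are an assumption.
    return "\r\n\r\n".join(lines) + "\r\n"


def build_samples():
    """Return a mapping of sample name to encoded file content."""
    ecj = make_text(ENGLISH, CHINESE, JAPANESE)
    ec = make_text(ENGLISH, CHINESE)
    ej = make_text(ENGLISH, JAPANESE)
    return {
        "ecj-u8": ecj.encode("utf-8-sig"),   # UTF-8 with BOM
        "ecj-u8nb": ecj.encode("utf-8"),     # UTF-8 without BOM
        "ec-b5": ec.encode("big5"),
        "ec-u8": ec.encode("utf-8-sig"),
        "ec-u8nb": ec.encode("utf-8"),
        "ej-jis": ej.encode("iso2022_jp"),   # assumed equivalent of "JIS"
        "ej-u8": ej.encode("utf-8-sig"),
        "ej-u8nb": ej.encode("utf-8"),
    }


def write_samples(directory):
    """Write every sample as <name>.txt into an existing directory."""
    for name, data in build_samples().items():
        with open(os.path.join(directory, name + ".txt"), "wb") as handle:
            handle.write(data)
```

Calling write_samples(".") would create the eight files in the current directory; files produced this way may differ byte-for-byte from the EmEditor originals.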

I then tested all possible combinations of two HE settings: Codepage and FontDetection. Codepage settings tested include: Auto-detect, 950 (Chinese big5), 50220 (Japanese JIS), 65001 (UTF-8). All four possible FontDetection settings (i.e., true/false with charset/name) are tested.

The version of HE tested is the newest beta: v.756.

As it turns out, all three non-default FontDetection settings (i.e., with either or both of charset/name set to true) give identical results, so they'll be consolidated in my presentation below. In addition, the results for codepage 950 and 50220 are analogous, trading results for the two local-encoding files (big5 and jis) and identical for the rest. For simplicity I'll omit the results for codepage 50220 below.

So, here are the results (hope I could do the tables right, fingers crossed):

HE settings  A-D    B5-D   U8-D   A-ND   B5-ND  U8-ND
ec-b5        R-G    R-G    U-G    R-B    R-B    U-G
ec-u8        R-G    R-G    R-G    R-G    R-G    R-G
ec-u8nb      R-G    R-G    R-G    R-G    R-G    R-G
ej-jis       U*-G   U-G    U*-G   U-B    U-B    U-G
ej-u8        R-G    R-G    R-G    R-G    R-G    R-G
ej-u8nb      R-G    R-G    R-G    R-G    R-G    R-G
ecj-u8       R-G    R-G    R-G    R-G    R-G    R-G
ecj-u8nb     R-G    R-G    R-G    R-G    R-G    R-G
Notes:
* Changes to "R" if the Japanese text is long enough.

1. ecj-u8nb means the file contains (E)nglish, (C)hinese and (J)apanese text, and saved in UTF-8 (No BOM). Same naming scheme for the others.

2. For HE setting pairs, the first part refers to "Codepage" setting: A-Autodetect, B5-big5 (950), U8-UTF8 (65001). The second part refers to FontDetection setting: D-default, ND-non-default (either or both of charset/name detection set to true).

3. For results pairs, the first part indicates if the text is rendered correctly: R-recognizable, U-unrecognizable. The second part indicates how the English text (always recognizable) in the file is displayed: G-good (meaning displayed in Courier New, hence good looking), B-bad (displayed in Asian font, hence bad looking).

4. Files were renamed between test runs because HE seemed to cache manual encoding settings between sessions.

5. HE failed to detect both ec-u8nb and ej-u8nb properly (as UTF-8) when the Asian text they contained was too short. After I lengthened it a bit, HE detected both files fine.

I guess the results are what Alex has known all along. With these tests, however, I finally have a better understanding of how the two HE settings work. I'm also certain now that for my purpose, the new default setting for FontDetection indeed works better. Many thanks!

[Edit] Sample text added per Alex's request, plus some typos cleanup.

[Edit 2] A star note added to the table.
« Last Edit: January 23, 2010, 06:30:20 am by mlwang »

Offline alex

Re: font and encoding issues
« Reply #8 on: January 22, 2010, 06:14:03 pm »
Yes, there is a problem with all these terms: encoding, code page, charset, etc. In my case, when I wrote about charset I meant script (as you have correctly mentioned). The Script in the Font dialog is the value that is later passed to the Windows font creation routine under the name charset :)
So I will use the term script from now on, to be clearer :)

The terms encoding and charset are often mixed, even in XML and HTML headers. But I think the better name here is encoding, or code page. utf-8, plain "unicode" (which is UTF-16) and big-5 are all encodings, and to each of them corresponds some code page value.

To summarize: you have a byte sequence; using the rules defined by the encoding (I think the term charset does not fit well here), you convert the byte sequence to a sequence of symbols. To display (represent) these symbols you need a font with a set of glyphs which matches all your symbols. If glyphs for some symbols are not found, you see rectangles (or some other image representing a missing glyph). The font has a characteristic which describes which set of glyphs it contains, called script or charset. Based on it, Windows can select the font which corresponds best to your texts.
As you have seen, there are not too many script values, definitely fewer than the number of encodings. So a script is something like a family of encodings, and it is possible to select the script value to pass if you know the encoding. But only for single-byte encodings; for Unicode encodings this is not possible, because they can contain symbols from any script (charset).
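The "family of encodings" idea can be sketched as a lookup table from a Windows code page number to a font charset (script) value. The numeric constants below are the charset values defined in the Windows SDK header wingdi.h; the table itself and the function name charset_for_code_page are only an illustration and are not taken from HippoEDIT.

```python
# Font charset ("script") values as defined in wingdi.h.
ANSI_CHARSET = 0
DEFAULT_CHARSET = 1
SHIFTJIS_CHARSET = 128
HANGUL_CHARSET = 129
GB2312_CHARSET = 134
CHINESEBIG5_CHARSET = 136
GREEK_CHARSET = 161
TURKISH_CHARSET = 162
RUSSIAN_CHARSET = 204
EASTEUROPE_CHARSET = 238

# Several legacy code pages can share one script; only a few are listed.
CODE_PAGE_TO_CHARSET = {
    1252: ANSI_CHARSET,         # Western European
    932: SHIFTJIS_CHARSET,      # Japanese (Shift-JIS)
    949: HANGUL_CHARSET,        # Korean
    936: GB2312_CHARSET,        # Chinese Simplified
    950: CHINESEBIG5_CHARSET,   # Chinese Traditional (Big5)
    1253: GREEK_CHARSET,
    1254: TURKISH_CHARSET,
    1251: RUSSIAN_CHARSET,      # Cyrillic
    1250: EASTEUROPE_CHARSET,
}


def charset_for_code_page(code_page: int) -> int:
    """Pick a font charset for a code page.

    Unicode code pages such as 65001 (UTF-8) or 1200 (UTF-16 LE) are
    deliberately absent from the table: their text may contain symbols
    from any script, so only DEFAULT_CHARSET can be returned for them.
    """
    return CODE_PAGE_TO_CHARSET.get(code_page, DEFAULT_CHARSET)
```

For example, charset_for_code_page(950) yields CHINESEBIG5_CHARSET, while charset_for_code_page(65001) falls back to DEFAULT_CHARSET because no single script can be inferred from UTF-8 alone.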

Windows can substitute the font when the charset is available. So with FontDetection charset, HE tries to determine the charset and passes it to Windows to help with substitution. In addition to this, HE can itself determine a font name which fits better (actually, IE libraries are used), and this is controlled by FontDetection name.

About the problem with the non-revertible Automatic value for the Encoding setting in Options: this is a bug. It will be fixed in the new beta.

And thanks a lot for your detailed tests! You have done a huge amount of work. Can you also add the sample texts you used to the post? Then I can use them later to check whether the logic still works after new changes.

From what I see, HE has big problems with Japanese (in your tests it was never recognizable) and with Chinese with an explicit encoding (as I said, for Unicode encodings neither the charset nor the font name can be determined).

Offline mlwang

Re: font and encoding issues
« Reply #9 on: January 23, 2010, 05:56:28 am »
Can you also add the sample texts you used to the post?

Done.

From what I see, HE has big problems with Japanese (in your tests it was never recognizable) and with Chinese with an explicit encoding (as I said, for Unicode encodings neither the charset nor the font name can be determined).

Most editors do. EmEditor is superb in this regard, and that's why I paid for it years ago. But EmEditor's author is Japanese, so it's not a fair "fight". :)

I don't think HE does worse for Japanese than for Chinese. HE detects ec-b5.txt fine because, I believe, my System Locale (language for non-unicode programs) is set to Chinese (Traditional, Taiwan). HE detects ej-jis.txt as Chinese Traditional (Big5) as well.

Well, let me try .... After setting my System Locale to Japanese, HE indeed "detects" both ej-jis and ec-b5 as Shift-JIS (932), which I believe is what Windows uses for Japanese as System Locale (can't tell from the Language applet).

Another test after coming back to Chinese as System Locale: this time I saved ec-b5 as ec-et, using ETEN encoding (an alternative but rarely used encoding for Traditional Chinese). Renamed it and then opened it in HE, and sure enough HE still "detects" it as Big5 (950).

So, more accurately, I think HE couldn't decide what encoding a file is in when a non-unicode Asian encoding is used, and falls back to system default.

Offline mlwang

Re: font and encoding issues
« Reply #10 on: January 23, 2010, 06:26:25 am »
So, more accurately, I think HE couldn't decide what encoding a file is in when a non-unicode Asian encoding is used, and falls back to system default.

One thing just hit me. As I mentioned in my earlier post, two of the three test files saved in UTF-8 without BOM were not detected properly by HE in the beginning. The one with both Chinese and Japanese (ecj-u8nb) was detected fine, but not ec-u8nb or ej-u8nb.

I then remembered Alex's words:

BTW: HE should also auto-detect UTF-8 without BOM and Japanese in most cases. But this depends on the amount of non-English text in the document (also, only the first 2000 bytes are checked).

So I added a few more characters (4 for Chinese and 3 for Japanese) into the test files, then HE detected both without problem. So I thought that's enough for HE and went on with the tests.

It didn't occur to me until just now that maybe HE couldn't detect ej-jis (or ec-et, for that matter) also because the text was too short. So I copied a page from Japanese Wikipedia (just a random page) and saved the text as jj-jis. Voila! This time HE detected it properly.

It's all my fault, then. I took care of the length issue the first time, but not the second time. Sorry about that. Will add a note to the earlier post in a minute.

HE apparently needs more bytes to work with in order to properly detect Asian text in a non-Unicode encoding. I played with the text a little just now by cutting the wiki text to various lengths, and found that HE succeeded in detecting the JIS encoding when the text was 300 characters long, but failed when it was only 200 characters long.

Since most real-world text files are longer than that, I think HE does a good enough job in this regard already.

Offline alex

Re: font and encoding issues
« Reply #11 on: January 23, 2010, 04:40:25 pm »
Thanks a lot for extended tests.

Yes, that is right: to determine the encoding correctly, HE needs enough data to analyze.
Normally it takes the first <= 2000 bytes (this is configurable in the XML settings).

And there are five types of encoding detection, done one by one:
- determine by BOM.
- detect Unicode (UTF-16/UTF-16 LE), statistics based.
- detect UTF-8, statistics based.
- detect by encoding strings (like encoding attributes in XML or HTML).
- statistical detection with the Windows Multi Language libraries (used by IE and also by EmEditor). For this method, a minimum of 512 bytes is checked, because otherwise the results are not very precise.

If none of the detection methods returns a positive result, HE uses the system default code page.
This logic can explain your results.
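The cascade above can be sketched roughly in Python. This is not HippoEDIT's implementation: the heuristics are deliberately crude stand-ins (a NUL-byte count for UTF-16, a strict validity check for UTF-8), the fifth step (the Windows Multi Language libraries) is left out entirely, and all names, including the system_default parameter, are made up for this example.

```python
import codecs
import re

SAMPLE_SIZE = 2000  # the thread says the first <= 2000 bytes are checked

BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)
DECLARED = re.compile(
    rb"""(?:encoding|charset)\s*=\s*["']?([A-Za-z0-9._-]+)""", re.IGNORECASE
)


def detect_by_bom(sample):
    for bom, name in BOMS:
        if sample.startswith(bom):
            return name
    return None


def detect_utf16(sample):
    # Crude statistic: mostly-ASCII UTF-16 has NUL in every other byte.
    half = len(sample) // 2
    if half == 0:
        return None
    even_nuls = sample[0::2].count(0)
    odd_nuls = sample[1::2].count(0)
    if odd_nuls > 0.4 * half and even_nuls == 0:
        return "utf-16-le"
    if even_nuls > 0.4 * half and odd_nuls == 0:
        return "utf-16-be"
    return None


def detect_utf8(sample, complete):
    if all(byte < 0x80 for byte in sample):
        return None  # pure ASCII gives no evidence either way
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # A cut-off sample may end in the middle of a character.
        decoder.decode(sample, final=complete)
    except UnicodeDecodeError:
        return None
    return "utf-8"


def detect_declared(sample):
    match = DECLARED.search(sample)
    if match is None:
        return None
    try:
        return codecs.lookup(match.group(1).decode("ascii")).name
    except LookupError:
        return None


def detect_encoding(data, system_default="cp1252"):
    sample = data[:SAMPLE_SIZE]
    found = (
        detect_by_bom(sample)
        or detect_utf16(sample)
        or detect_utf8(sample, complete=len(data) <= SAMPLE_SIZE)
        or detect_declared(sample)
    )
    # No detector was confident: fall back to the system default.
    return found or system_default
```

Under this sketch a short Big5 file with no declaration falls through every step and ends up with the system default, which mirrors the fallback behaviour reported in the tests above.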

There are also some special cases. Because normally source code is analyzed, the amount of native text is rather small (only comments and strings), and the rest is English. So this should also be taken into account.

 
