Talk:UTF-8

This is the talk page for discussing improvements to the UTF-8 article.
This is not a forum for general discussion of the article's subject.


Computing Mid‑importance

This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Mid-importance on the project's importance scale.

Computer science Mid‑importance

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Mid-importance on the project's importance scale.


Typography Mid‑importance

This article is within the scope of WikiProject Typography, a collaborative effort to improve the coverage of articles related to Typography on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Mid-importance on the importance scale.


Table should not use only color to encode information (but also formatting like bold and underline)

As noted in a previous comment (https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_1#Colour_in_example_table?), this has been done before, and it is *better* because everyone can then clearly see the different parts of the code. Relying on color alone is not good, given color vision deficiencies and varying color rendition across devices.

Microsoft script dead link

   and Microsoft has a script for Windows 10, to enable it by default for its program Microsoft Notepad

   "Script How to set default encoding to UTF-8 for notepad by PowerShell". gallery.technet.microsoft.com. Retrieved 2018-01-30.

   https://gallery.technet.microsoft.com/scriptcenter/How-to-set-default-2d9669ae?ranMID=24542&ranEAID=TnL5HPStwNw&ranSiteID=TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w&tduid=(1f29517b2ebdfe80772bf649d4c144b1)(256380)(2459594)(TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w)()

This link is dead. How to fix it? — Preceding unsigned comment added by Un1Gfn (talk • contribs) 02:58, 5 April 2021 (UTC)[reply]

That text and that link appear to have been removed, so there's no longer anything to fix. Guy Harris (talk) 23:43, 21 December 2023 (UTC)[reply]

utf8 octal conversion

I think this section should be rewritten. It makes no sense to talk about bytes if you have triplets of octal digits, which make 9 bits in total, not 8. The grouping shown in the section is ambiguous (and wrong). --84.167.187.209 (talk) 02:24, 29 May 2021 (UTC)[reply]

The table is correct: the results are 1 to 4 bytes, each displayed as 3 octal digits, and the left-most digit cannot be greater than 3. If the bytes were somehow appended into a single octal number, you would first have an endianness question and, more importantly, you would lose the alignment between the output octal digits and the input octal digits.

I have made this modification of the octal table. Do you understand it? x, y, z and w are octal digits. --BIL (talk) 22:11, 29 May 2021 (UTC)[reply]

Octal code point <-> Octal UTF-8 conversion
First code point	Last code point	Code point	Byte 1	Byte 2	Byte 3	Byte 4
000	177	xxx	xxx
0200	3777	xxyy	3xx	2yy
04000	77777	xyyzz	34x	2yy	2zz
100000	177777	1xyyzz	35x	2yy	2zz
0200000	4177777	xyyzzww	36x	2yy	2zz	2ww

Yes, that is a lot clearer. Spitzak (talk) 23:44, 29 May 2021 (UTC)[reply]
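
For anyone who wants to check the rows of the octal table mechanically, here is a minimal sketch in Python. The helper name `utf8_octal` is hypothetical and not taken from the article or any library; it simply follows the table row by row, using the fact that a 6-bit continuation payload is exactly two octal digits. It is an illustration under those assumptions, not a reference encoder.

```python
def utf8_octal(code_point: int) -> list[str]:
    """Encode one code point as UTF-8 and show each byte as 3 octal digits.

    Hypothetical helper that follows the rows of the octal table.
    """
    if not 0 <= code_point <= 0o4177777:
        raise ValueError("outside the Unicode range")
    if 0o154000 <= code_point <= 0o157777:
        raise ValueError("surrogates (U+D800..U+DFFF) are not encoded in UTF-8")

    def cont(shift: int) -> int:
        # Continuation byte 2yy: 0o200 plus two octal digits (6 bits).
        return 0o200 | ((code_point >> shift) & 0o77)

    if code_point <= 0o177:        # xxx -> xxx
        data = [code_point]
    elif code_point <= 0o3777:     # xxyy -> 3xx 2yy
        data = [0o300 | (code_point >> 6), cont(0)]
    elif code_point <= 0o177777:   # xyyzz or 1xyyzz -> 34x or 35x, 2yy, 2zz
        data = [0o340 | (code_point >> 12), cont(6), cont(0)]
    else:                          # xyyzzww -> 36x 2yy 2zz 2ww
        data = [0o360 | (code_point >> 18), cont(12), cont(6), cont(0)]
    return [f"{byte:03o}" for byte in data]


for cp in (0o177, 0o200, 0o3777, 0o4000, 0o177777, 0o4177777):
    print(f"{cp:o} -> {' '.join(utf8_octal(cp))}")
```

For example, code point 3777 comes out as 337 277 and 4177777 as 364 217 277 277, so the input digits line up with the trailing digits of each output byte, as the table claims.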

I do agree there is a huge amount of bloat in this article. Conversion from/to UTF-8 is actually really simple, and I would love to see the majority of this text spew deleted. Spitzak (talk) 20:24, 29 May 2021 (UTC)[reply]

Suggest/recommend throwing out the whole UTF-8#Octal section. I’m sure the intellectual exercise must have been “neat” or “kind of cool” to whoever took the time and effort to type it up and add it to the article, but IMHO it’s cruft like this that explains how this article got to be so long and bloated. I haven’t seen this appear in _any_ of the Unicode standards documents, and even the single reference cited admits that the API library just compares the binary, even if it might conceivably, theoretically be more convenient for a human with a scientific calculator converting hexadecimal to octal to compare bits manually. This article would IMHO be much more concise and more “encyclopedic” if the half of it comprising personal commentary/observations such as this section (which might be more appropriate, say, as a post on a personal blog, for example) were trimmed. —PowerPCG5 (talk) 08:35, 10 November 2021 (UTC)[reply]

Excellent idea. This section does not add useful information. −Woodstone (talk) 13:52, 10 November 2021 (UTC)[reply]

Absolutely agree. About 3/4 of this article is bloated with trivial observations and/or redundant rewording of the same information over and over again. I did edit this table last, not because I liked it, but because it was even larger and more intrusive before (it had been added as extra columns in the other tables), and attempts to just remove it got reverted... Spitzak (talk) 15:10, 10 November 2021 (UTC)[reply]

Such is the unfortunate nature of a community-built wiki: editors contribute to their own niches and hobbies. See Criticism of Wikipedia#Systemic bias in coverage.

Criticism of Wikipedia#Quality of writing is funny too. Wqwt (talk) 07:21, 4 September 2022 (UTC)[reply]

Unfortunately, there is an error in it. If the first code point is 0200, then the last code point cannot be 3777, for example. Please consider the description in the main article of how the encoding process works; you will find that the first Unicode code point is always 0. Apparently that is how it is really done. — Preceding unsigned comment added by SiwardDeGroot (talk • contribs) 14:48, 21 July 2023 (UTC)[reply]

You are talking about "overlong encodings". The code point 0 should be done by the one-byte entry in the first line of the table. Encoding code point 0 using the second line of the table is an error. Spitzak (talk) 15:45, 21 July 2023 (UTC)[reply]
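
To make the overlong-encoding point concrete, here is a small sketch in Python. It assumes Python's built-in strict `utf-8` codec as the validator, and the helper name `is_valid_utf8` is made up for this illustration. It compares code point 0 encoded with the first table row against the same code point forced through the second row (xxyy = 0000 giving 300 200).

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


shortest_nul = bytes([0o000])          # first table row: xxx -> xxx
overlong_nul = bytes([0o300, 0o200])   # second row misapplied to xxyy = 0000

print(is_valid_utf8(shortest_nul))  # True
print(is_valid_utf8(overlong_nul))  # False
```

The two-byte form is rejected, which is why each row of the table starts where the previous row ends rather than at 0.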

US-ASCII

@Comp.arch: With respect to Special:Diff/1105781113, it's better to use just "ASCII" unless it could be misinterpreted as some other variant of ISO 646 instead of ANSI X3.4-1986. It is not the case here, but I think current usage in the article is okay. IANA preference for "US-ASCII" only matters for use in the charset parameter or similar where "ASCII" is not even a valid label at all. Please don't link it halfway as US-ASCII because that makes absolutely no sense in any context and looks like a formatting mistake. Link the whole US-ASCII, piping it if you don't like the redirect. – MwGamera (talk) 12:49, 22 August 2022 (UTC)[reply]

The article contains "{{efn", which looks like a mistake.

I would've fixed it myself but I don't know how to transform the remaining sentence to make sense. 2A01:C23:8D8D:BF00:C070:85C1:B1B8:4094 (talk) 16:17, 2 April 2024 (UTC)[reply]

I fixed it, I think. I'm not 100% sure it's how the previous editors intended. I invite them to review and confirm. Indefatigable (talk) 19:03, 2 April 2024 (UTC)[reply]

Should "The Manifesto" be mentioned somewhere?

More specifically, this one: https://utf8everywhere.org -- Preceding unsigned comment added by Rudxain (talk • contribs) 21:52, 12 July 2024 (UTC)[reply]

Only if it's got significant coverage in reliable sources. Remsense 22:10, 12 July 2024 (UTC)[reply]

It's kind of ahistorical, since the Microsoft decisions that they deplore were made while developing Windows NT 3.1, and UTF-8 wasn't even a standard until Windows NT 3.1 was close to being released. There was more money to be made from East Asian customized computer systems than Unicode computer systems in 1993, so Unicode was probably not their main focus at that time... AnonMoos (talk) 20:30, 15 July 2024 (UTC)[reply]