[topo] MySQL charsets and collations
Posted by: Otomatic (Moderator)
Date: June 16, 2023 12:02PM

MySQL, databases, tables, columns, charsets and collations

Note : This is a translation of an article I posted on the French forum:
MySQL and charsets
I hope the translation is understandable.

A charset is a set of letters, punctuation marks and other symbols to which a code number has been assigned.
For example, the "standard ASCII" set (American Standard Code for Information Interchange) gives the number 65 to the capital A, 44 to the comma and 13 to the carriage return. This set was created for the English language, and comprises just 128 characters. It therefore ignores accented characters, ligatures, the euro sign and, of course, Greek, Arabic or Japanese characters.

Note: a "character set" is not a "character font": the latter associates a "glyph", i.e. a graphic representation, with each character. Thus, the letters A, A and A are three glyphs of the same ASCII character 65.

The information is encoded by computer in the form of bits, each worth zero or one. An ordered series of 8 bits is a byte, which can therefore have 2^8 different combinations of 1 and 0, numbered from 0 to 255. With the ASCII code, and later its extensions, one byte is enough to record one character.
This "one byte = one character" principle prevailed for a very long time. It has been declined into numerous 256-character sets (some of which use ASCII for the first 128 positions and complete the following 128). The ISO 8859 series of standardized character sets operates in this way. ISO 8859-1, also known as Latin-1 or Western, includes (almost) all the characters needed to write some twenty Western European languages such as French and English, while ISO 8859-6 provides Arabic characters and ISO 8859-15 (Latin-9) is an updated version of Latin-1, including the euro character and the œ ligature.

Unicode aims to provide a single character set to represent all languages. This makes it possible to manage multi-alphabet texts, such as those in a French-Arabic dictionary. As this requires many more than 256 characters, Unicode sets use several bytes. To remain compatible with single-byte sets, Unicode allows the use of variable character sizes, from 1 to 4 bytes for the UTF-8 set, potentially more than 4 billion different characters.

Note: If, due to incorrect settings, a multi-byte Unicode character is interpreted as single-byte, it will be represented by several characters, hence the transformation of é into A©, for example.

MySQL lets you use different character sets, both single-byte and multi-byte.

Some commands for querying the list of character sets installed on your MySQL server, to be used, for example, in a PhpMyAdmin SQL window:
- List of all character sets available on your MySQL
SHOW CHARACTER SET ;
- List of character sets based on the Latin alphabet
SHOW CHARACTER SET LIKE 'latin%' ;
- List of multi-byte character sets
SELECT *
FROM Information Schema.Character_Sets
WHERE MaxLen > 1;

In addition to the character set, MySQL allows you to choose how the data will be ordered (ORDER BY) by what is called a "collation" (COLLATE). This allows you to address, for example, the classic problem of case sensitivity:
- Should uppercase precede lowercase, or should A and a be considered of equal value?
- Accent sensitivity: do they count in sorting? Do they make a difference when searching?
- The possibility that one character (œ ligature) may correspond to several (o followed by e): this is what Unicode calls "expansion".
To formalize these choices, collations were invented (the English term means both "gathering" and "comparison"; collation is sometimes translated as "interclassification"winking smiley. A collation is linked to a character set. It gives both the order in which to classify the characters, and whether any of them should be considered equivalent. Each character set has several collations, including a default one.
Here are a few commands for querying the list of collations installed on your MySQL server, to be used, for example, in a PhpMyAdmin SQL window:
- List of snacks for the latini game
SHOW COLLATION LIKE 'latinl%' ;
- List of utf8 set collations
SHOW COLLATION LIKE 'utf8%' ;

All collations have a name that begins with the character set to which they are linked, and ends with one of three abbreviations:
- _bin as in binary: characters are sorted in the order of their code numbers (i.e. first all uppercase, then all lowercase, then all accented letters, in bulk).
- _cs for case sensitive: characters are sorted according to reference language(s), but in a case-sensitive manner.
- ci for case insensitive: ditto, but case insensitive.

To enable you to make the appropriate choices, you must explicitly declare the character set and collation used up to the column level (Field), otherwise the default values of the higher level apply: The data uses the character set and collation of its column. If the character set and collation have not been specified for the column, the table's character set and collation are used. If the table's character set and collation have not been specified, the database's character set and collation are used, and if the database's character set and collation have not been specified, the server's default character set and collation are used. This is how we end up with Latin1_swedish_ci as collation almost everywhere, MySQL AB being a Swedish company.

- Changing the character set and collation at higher levels (server, base or table) has no immediate consequence: only newly created objects will use these default values.
- Changing the collation of a column, while remaining in the same character set, is done without difficulty. This does not affect the data itself, but the way it is processed. With the next Select, sorting and searching behave differently.
- Changing the character set of a column converts the data to the new set. This can be a problem if some characters in the old set have no equivalent in the new set.

Character set and query collation

These values depend on the MySQL client settings.
Each time a session is opened, MySQL fills in four system variables:

- The character set used by the client for input, stored in the @@character_set_client variable.
- The character set used for communication between the client and MySQL (@@character_set_connection): the default collation of this character set determines the @@collation_connection.
- The character set used to display query results in the client (@@character_set_results).

The query text is interpreted according to the client set, then converted into the connection set (@@character_set_connection and @@collation_connection). MySQL then sends the result, again using @@characterset_connection, and then converting it into @@character_set_results.
To find out the value of system variables linked to games and snacks, run the following queries:
- Display all system variables linked to character sets and collations :
SHOW VARIABLES LIKE 'char%';
SHOW VARIABLES LIKE 'colla%'';

In addition to the four variables listed above, these include the default character sets and collations for the current server and database. As for the @@character_set_system, this is the one MySQL uses to store metadata, i.e. the names of databases, tables, columns, etc. It's always a utensil that's used by MySQL. It's always utf8.

On a local server, e.g. Windows with Wampserver, the default values COULD be modified in the my.ini file:
# CLIENT SECTION
[client]
[mysql]
default-character-set=latin1
# SERVER SECTION
[mysqld]
#Default character set used when creating tables
#when not explicitly defined.
default-character-set=latin1
But this is not the case: With Wampserver, no character set is defined for all MySQL versions.

Under Windows 10 with NO default character set definition in my.ini, here are the results of commands showing the contents of character set and collation system variables in the MySQL console:
- 1 - PHP 8.2.7 and MySQL 8.0.33
mysql> SHOW VARIABLES LIKE 'CHAR%';
+--------------------------+-------------------------------------------------+
| Variable_name            | Value                                           |
+--------------------------+-------------------------------------------------+
| character_set_client     | cp850                                           |
| character_set_connection | cp850                                           |
| character_set_database   | utf8mb4                                         |
| character_set_filesystem | binary                                          |
| character_set_results    | cp850                                           |
| character_set_server     | utf8mb4                                         |
| character_set_system     | utf8mb3                                         |
| character_sets_dir       | E:\wamp64\bin\mysql\mysql8.0.33\share\charsets\ |
+--------------------------+-------------------------------------------------+
8 rows in set, 1 warning (0.01 sec)

mysql> SHOW VARIABLES LIKE 'colla%';
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | cp850_general_ci   |
| collation_database   | utf8mb4_0900_ai_ci |
| collation_server     | utf8mb4_0900_ai_ci |
+----------------------+--------------------+
3 rows in set, 1 warning (0.00 sec)
- 2 - PHP 8.2.7 and MySQL 5.7.42
mysql> SHOW VARIABLES LIKE 'CHAR%';
+--------------------------+-------------------------------------------------+
| Variable_name            | Value                                           |
+--------------------------+-------------------------------------------------+
| character_set_client     | cp850                                           |
| character_set_connection | cp850                                           |
| character_set_database   | latin1                                          |
| character_set_filesystem | binary                                          |
| character_set_results    | cp850                                           |
| character_set_server     | latin1                                          |
| character_set_system     | utf8                                            |
| character_sets_dir       | E:\wamp64\bin\mysql\mysql5.7.42\share\charsets\ |
+--------------------------+-------------------------------------------------+
8 rows in set, 1 warning (0.00 sec)

mysql> SHOW VARIABLES LIKE 'colla%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | cp850_general_ci  |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set, 1 warning (0.00 sec)
There are already differences in the default character sets between MySQL 5.7 and MySQL 8.0.

Note: utf8mb4, i.e. Unicode V6 support for four-byte utf-8 characters, has only been possible since MySQL 5.5.3.

--- Possible problems with the MySQL console
Using the MySQL text console under Windows can be "vicious" and throw up some surprises!
The text client character set, while declared Latin1, is in fact cp850 (well, not always, depending on the API).

To see for yourself, just run :
SHOW VARIABLES LIKE 'char%';
USE test;
Then select a field from a CHARSET=latin1 table with accented characters.
To do this, I'm using a test table with a test field (latin1) containing "éèàù ÉÈÀÙ ç Ç" and filled in with PhpMyAdmin, so in "true" latin1.
SELECT test FROM test;
The "é" characters are transformed into "ù" and the other characters into ÚÞÓ¨ ++++ þ Ã
If the console's character set was indeed Latin1, this shouldn't happen.
SET NAMES cp850;
And redo the select. Accented characters are fine.

A good way to check the console's character set when launching a query is to tell MySQL the character set of a literal string by preceding it with an introducer :

SELECT _latin1'été', _utf8'été', _cp850'été';

A literal string preceded by an introducer is sent as is to MySQL, regardless of the @@character_set_client and @@character_set_connexion variables. MySQl interprets it according to the set indicated by the introducer, then returns it.

--- Rules for MySQL "object" names: bases, tables, columns, etc.
- Maximum length 64 characters.
- Certain characters (particularly spaces) are either forbidden or not recommended; to avoid all problems, don't use diacritical characters, use letters from a to z, numbers and the underscore (_); don't start with a number and adopt a consistent case policy common to all your objects. (In my humble opinion, the simplest is "all lowercase", but you can adopt the "proper name" type, i.e. the first letter in uppercase).
- They must not be identical to SQL terms such as Create, Use, Select, Join, etc.

--- Upper and lower case ---
Case rules (i.e. the choice between upper and lower case) change with the operating system:
- SQL keywords and column names are case insensitive, i.e. they can be written in either upper or lower case.
- Table and database names are :
-- case-sensitive if MySQL is installed on Linux
-- case-insensitive if MySQL is installed on Windows
This difference is due to the fact that MySQL stores databases and tables in the form of files; Windows is insensitive to the case of file names, whereas Linux is.

As you may change host and therefore operating system, it's best to always treat MySQL object names as sensitive, so that queries work regardless of operating system.

By using the same rules for variable and PHP file names, you can avoid potential errors.

MySQL allows you to explicitly declare the character sets used for databases, tables and columns. Different character sets can easily coexist within the same database or table:
DROP TABLE IF EXISTS table_test;
CREATE TABLE IF NOT EXISTS table_test (
  champ1 varchar(50) COLLATE latin1_general_ci NOT NULL,
  champ2 varchar(50) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  champ3 varchar(50) CHARACTER SET latin2 NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;

If you have databases and tables declared as Latin1 (iso-8859-1), it's perfectly possible and even recommended to use files with the utf-8 character set for exports and imports; in fact, this is (almost) always the default character set for MySQL connections.

Note 1: Four-byte UTF-8 characters

The character set currently named utf8 uses a maximum of three bytes per character and contains only BMP: Basic Multilingual Plane characters, i.e. the first 65534 codes (0x0000 to 0xFFFD) of the UCS standard (ISO 10646), which defines the Universal Character Set (UCS).

[UCS[/i] contains all the characters of all the other standard character sets. It also guarantees circular compatibility, which means that conversion tables ensure that no information is lost when a character string is converted to another encoding, and then converted back again.

UCS contains the characters needed to represent almost all known languages. In addition to languages using an extension of the Latin alphabet, it includes: Greek, Cyrillic, Hebrew, Arabic, Armenian, Gregorian, Japanese, Chinese, Hiragana, Katakana, Korean, Hangul, Devangari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayam, Thai, Lao, Bopomofo, and others. Since MySQL 5.5.3, Unicode extensions to BMP characters have been supported: utf8 characters consisting of four bytes, i.e. UCS (ISO 10646) standard codes greater than 65534 (> 0xFFFF), and this support is called utf8mb4.
- the utf8mb4 set uses a maximum of four bytes per character and supports complementary characters, such as those of Unicode version 6, provided that a font (Fonte) has the associated glyphs.
- the utf8mb3 set is an alias for utf8 and enables a more visible differentiation between three-byte and four-byte encoding.

However, for commands such as SHOW, CREATE TABLE or SELECT, the utf8mb3 alias is systematically converted to utf8.

For a BMP character, utf8 and utf8mb4 work identically: same code, same encoding, same length (number of bytes), same way of storing in the database.

For additional characters, e.g. Unicode version 6, utf8 cannot store the character and will generate an error because utf8mb4 requires four bytes. So, if you want to support four-byte uft8 characters, it's imperative to explicitly declare this for databases, tables and columns requiring it.
If a table had previously been created to support utf8, for example :
CREATE TABLE t1 (
  col1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  col2 CHAR(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
) CHARACTER SET utf8;

The following query will transcode the table to accept four-byte utf8 characters, without any loss of datasad smileyIt is essential to specify the names of the columns to be transcribed)
ALTER TABLE t1
  DEFAULT CHARACTER SET utf8mb4,
  MODIFY col1 CHAR(10)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  MODIFY col2 CHAR(10)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;
More simply, you can use :
ALTER TABLE t1
  CONVERT TO CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci (ou utf8mb4_bin);
You don't have to give the column names; only those columns requiring transcoding will be transcoded automatically.

For MySQL to accept queries containing data with four-byte utf8 characters, it must be explicitly declared, i.e. send the command :
SET NAMES utf8mb4;
instead of
SET NAMES utf8;
before queries.

---------------------------------------------------------------
Documentation Apache - Documentation PHP - Documentation MySQL - Wampserver install files & addons



Edited 1 time(s). Last edit at 06/16/2023 04:37PM by Otomatic.

Options: ReplyQuote


Sorry, only registered users may post in this forum.