Some time ago a colleague of mine struggled with an error "Incorrect string value: '\xE2\x86\x92)</...' for column 'description' at row 1" and asked me for a help. He tried to look into google and he found out that the error is connected to the database charset or collation settings but he couldn't find out why does it happen to him even though the settings seemed correct.
Let's find out why is that in this short article.
First we should understand what is the difference between these two to have better understanding of databases.
Charset is a set of characters that a database can store and manipulate. Each character in a charset is represented by a unique code point, which is a numerical value that corresponds to a specific character. The charset determines which characters can be stored and used in a database. Different charsets support different sets of characters, and they can be used to support different languages and writing systems. For example, the ASCII charset includes the basic letters, numbers, and symbols used in English, while the Unicode charset includes a much wider range of characters from many different languages and writing systems.
On the other hand, database collation is a set of rules that determine how the data in a database is sorted and compared. This includes rules for how character strings are ordered and how case, accent marks, and other language-specific features are treated. Different collation settings can be used to support different languages and regional conventions, and they can affect the way that queries and other operations are performed on a database. The difference between collations can be seen on example - letter Ł in polish is after L and before M and some collations will support that (such as utf8_unicode_ci) while others won't and will treat these two letters equally or they will just put the Ł after the Z. This could of course lead to wrong results of your queries.
As you can see these two topics are closely related, but they are not the same. Of course the way both are set depends on the particular database.
Few examples of charsets and collations could be:
- Charset: latin1, collation: latin1_swedish_ci - default charset/collation for MySQL up to version 5.7
- Charset: utf8mb4, collation: utf8mb4_0900_ai_ci - default charset/collation for MySQL 8.0
- Charset: utfmb4, collation: utf8mb4_unicode_ci
It's important to note that different databases have different default charset and collation settings, and these settings may need to be changed depending on the specific requirements of your application. Consult the documentation for your database to learn more about how to configure the charset and collation settings for your database.
It's important to know that charset and collation can be set on different "levels". For example:
- Server level
- Database level
- Table level
- Column level
This of course can slightly differ for each database (MySQL, PostgreSQL, ...) so please check your database documentation for more details. Anyway what is important for us at this stage is that even though your database can have a proper settings the specific column can have wrong settings. How can this happen? Well for example if following situation happened:
- Schema was created with wrong charset/collation
- Liferay was started
- Someone fixed the schema settings
When Liferay starts it creates a lot of tables. By default, it uses the database level charset/collation settings for all the tables. Changing the database settings won't fix the wrong settings for specific columns. And this is a possible case when database seems to have correct settings but Liferay still doesn't start up properly.
If you read the explanation I bet the solution might already be clear for you - you just have to make sure that:
- Your database has correct charset and collation settings
- Your existing tables have correct charset and collation settings
Please note it's not enough to just fix already created tables as the new ones could still get the default, wrong settings. That's why you always it's better to always change settings on both levels.
Consult the documentation for your database to learn how to properly configure the charset and collation settings. For MySQL examples see commands below.
I will only describe MySQL commands here as an example but you should be easily transfer them to your database or find a solution online.
Checking DB default settings
SELECT * FROM information_schema.SCHEMATA WHERE schema_name = 'liferay';
Checking specific table settings
SELECT T.table_name, CCSA.character_set_name as column_character_set, CCSA.COLLATION_NAME as collation_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = 'liferay' AND T.TABLE_NAME='AccountEntryOrganizationRel';
Changing all the tables settings
It's a little bit more complicated - there is no out of the box command to change settings for all the tables at once. It can be done with a little workaround though.
SELECT CONCAT('ALTER TABLE ', T.TABLE_SCHEMA, '.', T.TABLE_NAME, ' CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;') AS SQL_UPDATE, T.table_name, CCSA.character_set_name as column_character_set, CCSA.COLLATION_NAME as collation_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = 'liferay' AND CCSA.CHARACTER_SET_NAME='latin1';
And then in first column you have a list of ALTER TABLE which you have to run. Just copy & paste the list and run it. Each table collation will be changed one by one.
As you can see this error is not too complicated. It's good to have the basic understanding of charsets and collations though that's why I tried to explain it a little bit as well.
I hope you liked it and found it useful.