As developers, our world is made of text, but computers only understand numbers. What gives? A process called character encoding maps numbers to letters. Easy enough, until you realize that we need to encode every character from every language while keeping file sizes small and staying compatible with legacy systems. In this article, José shows us everything we need to know about character encoding for PHP developers – including what to do when it breaks!

May 17, 2021

Your PHP project probably involves dealing with lots of data coming from different places, such as the database or an API, and every time you need to process it, you may run into an encoding issue.

This article will help you prepare for when that happens and better understand what's going on behind the scenes.

An introduction to encoding

Encoding is at the core of any programming language, and usually, we take it for granted. Everything works until it doesn't, and we get an ugly error, such as “Malformed UTF-8 characters, possibly incorrectly encoded”.

To find out why something in the encoding might not work, we first need to understand what we mean by encoding and how it works.

Morse code

Morse code is a great way to explain what encoding is about. When it was developed, it was one of the first times in history that a message could be encoded, sent, and then decoded and understood by the receiver.

If we used Morse code to transmit a message, we'd first need to transform our message into dots and dashes (also called short and long marks), the only two signals available in this method. Once the message reaches its destination, the receiver needs to transform it from Morse code to English. It looks something like this:

“Hi” -> Encode(“Hi”) -> Send(“…. ..”) -> Decode(“…. ..”) -> “Hi”

This system was invented around 1837, and people manually encoded and decoded the messages. For example:

S is encoded as … (three short marks)
T as – (one long mark)
U as ..- (two short marks and one long mark)
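
To make the idea concrete, here's a minimal PHP sketch of that lookup-and-translate step, using the mappings above plus the ones for “H” and “I”. The $morse table and the encodeMorse() helper are only illustrations for this article, not part of any library:

<?php

// A tiny lookup table: each character maps to its Morse marks.
// Only a handful of letters are included for illustration.
$morse = [
    'H' => '....',
    'I' => '..',
    'S' => '...',
    'T' => '-',
    'U' => '..-',
];

// Encode a message by looking up each character in the table.
function encodeMorse(string $message, array $table): string
{
    $marks = [];

    foreach (str_split(strtoupper($message)) as $character) {
        $marks[] = $table[$character];
    }

    return implode(' ', $marks);
}

echo encodeMorse('Hi', $morse); // .... ..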

Here's a radio operator encoding in Morse code:

[Photo of a radio operator sending a message in Morse code]

On the Titanic, Morse code was used to send and receive messages, including the last one where they were asking for help (“CQD” is a distress call).

SOS SOS CQD CQD Titanic. We are sinking fast. Passengers are being put into boats. Titanic

Computers encode and decode characters in a very similar way. The only difference is that instead of dots and dashes, we have ones and zeros in binary code.

Binary and characters

As you probably know, computers only understand binary code, 1s and 0s, so at that level there's no such thing as a character; a character is an interpretation applied by the software you use.

To encode and decode characters into 1s and 0s, we need a standard way to do it so that if I send you a bunch of 1s and 0s, you will interpret them (decode them) in the same way I've encoded them.

Imagine what would happen if each computer translated binary code into characters and vice-versa in its own way. If you sent a message to a friend, they couldn't see your real message because, for their computer, your 1s and 0s would mean something else. This is why we need to agree on how we transform the characters into binary code and vice-versa; we need a standard.

Standards

Encoding standards have a long history. We don't need to fully explore the history here, but it's essential to know two significant milestones that defined how computers can use encoding, especially with the birth of the Internet.

ASCII

ASCII, developed in 1963, is one of the first and most important standards, and it is still in use (we'll explain this later). ASCII stands for American Standard Code for Information Interchange. The “American” part is very relevant, since its first version could only encode 128 characters, covering the English alphabet and some basic symbols, such as “?” and “;”.

Here's the full table:

[The full ASCII table, mapping each value from 0 to 127 to its character]
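
You don't have to memorize the table: PHP's built-in ord() and chr() functions map a single-byte character to its ASCII value and back (ord() only looks at the first byte of the string, so it is only meaningful for ASCII). For example:

<?php

// From a character to its ASCII value...
echo ord('K') . PHP_EOL; // 75
echo ord('?') . PHP_EOL; // 63

// ...and from an ASCII value back to the character.
echo chr(75) . PHP_EOL;  // K
echo chr(63) . PHP_EOL;  // ?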

Computers can't use these numbers directly. As we already know, computers only understand binary code, 1s and 0s, so these values are then encoded into binary.

For example, “K” is 75 in ASCII. We can transform 75 into binary by repeatedly dividing it by 2 until we reach 0, writing down the remainder of each division (1 if the division is not exact, 0 if it is):

75 / 2 = 37 + 1
37 / 2 = 18 + 1
18 / 2 = 9 + 0
9 / 2 = 4 + 1
4 / 2 = 2 + 0
2 / 2 = 1 + 0
1 / 2 = 0 + 1

Now, we extract the “remainders” and put them in reverse order:

1101001 => 1001011

So, in ASCII, “K” is encoded as 1001011 in binary.
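
You can check this in PHP with the built-in decbin() and ord() functions, or reproduce the division-by-2 method by hand. The toBinary() helper below is just an illustrative sketch of that manual method:

<?php

// The built-in way: take the ASCII value, then its binary representation.
echo decbin(ord('K')) . PHP_EOL; // 1001011

// The same result using the repeated-division method described above.
function toBinary(int $number): string
{
    $bits = '';

    while ($number > 0) {
        // Prepend the remainder, which reverses the order for us.
        $bits = ($number % 2) . $bits;
        $number = intdiv($number, 2);
    }

    return $bits === '' ? '0' : $bits;
}

echo toBinary(75) . PHP_EOL; // 1001011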

The main problem with ASCII was that it didn't cover other languages. If you wanted to use your computer in Russian or Japanese, you needed a different encoding standard, which would not be compatible with ASCII.

Have you ever seen symbols like “???” or “Ã,” in your text? They’re caused by an encoding problem: a program tries to interpret the characters using one encoding method, but they don't represent anything meaningful because they were created with another one. This is why we needed our second big breakthrough: Unicode and UTF-8.
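
You can reproduce this kind of garbling in PHP. The snippet below is a small illustration (it assumes the mbstring extension is enabled and that the source file is saved as UTF-8): it takes the UTF-8 bytes of “é” and reinterprets them as ISO-8859-1, which is exactly how “Ã©”-style garbage appears:

<?php

// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$utf8 = 'é';

// Treat those bytes as if they were ISO-8859-1 and convert them to UTF-8:
// each byte becomes its own character, and we end up with mojibake.
echo mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1'); // Ã©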

Unicode

The goal in developing Unicode was to have a unique way to transform any character or symbol in any language in the world into a unique number, nothing more.

If you go to unicode.org, you can look up the number for any character, including emojis!

For example, “A” is 65 and “y” is 121.
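
In PHP (again assuming the mbstring extension is enabled), mb_ord() and mb_chr() work with these Unicode code points rather than with single bytes:

<?php

// mb_ord() returns the Unicode code point of a character...
echo mb_ord('A', 'UTF-8') . PHP_EOL;  // 65
echo mb_ord('y', 'UTF-8') . PHP_EOL;  // 121
echo mb_ord('😀', 'UTF-8') . PHP_EOL; // 128512

// ...and mb_chr() turns a code point back into a character.
echo mb_chr(128512, 'UTF-8') . PHP_EOL; // 😀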
