Monday, May 21, 2018

Base64 Encoding in Linux and Its Use Cases


Hi there. In this article, I will cover a topic that is quite simple but can be quite useful from time to time: base64 encoding. One use case is transferring relatively small files. In practice, easy solutions can sometimes save time.

The logic behind base64 encoding is simple, but I will not begin by explaining it directly. Instead, I will start with a more general problem and arrive at base64 encoding by deduction.

ASCII code table (from Wikipedia)

First of all, I assume everyone has an idea of what ASCII is. Memorizing ASCII is actually not a big deal. You really only need to memorize two characters. The first is '0', whose code is 30h. It is easy to remember: if you have written Assembly or C code that had to deal with BCD numbers, you may still recall that to print a BCD digit on the screen, you add 30h to its value. The second character to memorize is 'A', whose code is 41h. If the first letter of the alphabet is treated as element number 1 of an array (40h + 1), it is clear why its code does not end with zero.

There are 26 letters in the English alphabet. The smallest power of 2 greater than this is 32. Instead of starting the lowercase letters immediately after 'Z', six more characters are placed in between, which keeps the distance between each uppercase letter and its lowercase counterpart at 32 (20h). This keeps things simple: add 20h to an uppercase letter's code to obtain the lowercase letter, and subtract 20h to go the other way. If you memorize just these two characters and the rules above, you can translate ASCII codes to readable text in your head. Roughly, 3Xh codes are digits, 4Xh and 5Xh codes are uppercase letters, and 6Xh and 7Xh codes are lowercase letters. Once the letters and digits are decoded, at least a significant portion of the text becomes readable. Periods and commas can be skipped; they do not really matter.
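
The rules above can be tried out with a few lines of Python. This is only a small sketch; the helper names (bcd_digit_to_char, swap_case, decode_hex_codes) are made up for illustration and are not part of any library.

```python
# Two anchor points from the text: '0' is 30h and 'A' is 41h.
assert ord("0") == 0x30
assert ord("A") == 0x41


def bcd_digit_to_char(digit: int) -> str:
    """Turn a single BCD digit (0-9) into its ASCII character by adding 30h."""
    if not 0 <= digit <= 9:
        raise ValueError("a BCD digit must be between 0 and 9")
    return chr(digit + 0x30)


def swap_case(letter: str) -> str:
    """Flip the case of one ASCII letter by toggling the 20h bit."""
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError("expected a single ASCII letter")
    return chr(ord(letter) ^ 0x20)


def decode_hex_codes(codes: str) -> str:
    """Translate space-separated hexadecimal ASCII codes into text."""
    return "".join(chr(int(code, 16)) for code in codes.split())


print(bcd_digit_to_char(7))                   # 7
print(swap_case("A"), swap_case("z"))         # a Z
print(decode_hex_codes("42 61 73 65 36 34"))  # Base64
```

Toggling the 20h bit with XOR is the same as adding 20h to an uppercase letter or subtracting it from a lowercase one, because that bit is the only difference between the two cases.
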
Characters between 128 and 255 are called the extended ASCII character set. These characters differ from one code page to another and they are replaceable. For example, the Turkish character set is loaded there. If you look at the ASCII code table, you can easily notice that almost all the characters in the upper right quarter of the table are printable characters.

Let's assume a communication channel where binary transfer is not possible, for example a Facebook chat window without file transfer, or an SSH session in PuTTY. If I want to send a configuration file or a Python file, where indentation is really important, what should I do?

The answer is simple: the whole file is treated as a bit stream, and this stream is divided into 6-bit chunks. Then I add 40h to each chunk (or rather, OR it with 40h instead of adding) and get (mostly) printable characters. What is lost? Every 6 units of data now take 8 units, so the overhead is 33%. Algorithms of this kind are called binary-to-text encodings and there are many of them in the literature. The algorithm I just described is similar to uuencode, but it is not the same: in uuencode, 20h is added instead of 40h, so its character set begins with the space character.
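
The scheme described above can be sketched in Python as follows. This is an illustration of the idea, not an existing tool: the function names are hypothetical, and passing the original length to the decoder is my own assumption to keep the sketch short (a real format would have to store it somewhere). Note that the 6-bit value 3Fh becomes 7Fh (DEL), which is not printable, so this toy encoding is not fully safe as it stands.

```python
def encode_6bit_or40(data: bytes) -> str:
    """Split data into 6-bit chunks and OR each chunk with 40h."""
    bits = "".join(f"{byte:08b}" for byte in data)
    # Pad the tail with zero bits so the length is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    chunks = (int(bits[i:i + 6], 2) for i in range(0, len(bits), 6))
    return "".join(chr(chunk | 0x40) for chunk in chunks)


def decode_6bit_or40(text: str, length: int) -> bytes:
    """Reverse the encoding; `length` is the original byte count."""
    bits = "".join(f"{ord(ch) & 0x3F:06b}" for ch in text)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, length * 8, 8))


sample = b"    indent = True\n"
encoded = encode_6bit_or40(sample)
print(encoded)
print(len(sample), "bytes ->", len(encoded), "characters")  # 18 bytes -> 24 characters
assert decode_6bit_or40(encoded, len(sample)) == sample
```

The leading spaces of the sample survive the round trip, which is exactly what is needed for indentation-sensitive files.
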

Base64 uses a similar algorithm, but the stream is translated to text using a character table. The line below is taken from the source code of base64:

static const char cb64[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

According to this table, 0 corresponds to 'A' and 63 corresponds to '/'. This table can also be found in the Wikipedia article on Base64. The '=' character has a special meaning: it is used for padding. Base64 turns every 3 input bytes (24 bits) into 4 characters, so if the input length is not a multiple of 3 bytes, the last group is padded with one or two '=' characters.
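
To make the table lookup and the padding rule concrete, here is a minimal Python sketch that reuses the character table quoted above. The function name b64_encode is my own; the result is cross-checked against Python's standard base64 module, so it is meant as an illustration rather than a replacement for it.

```python
import base64

# Same table as the cb64[] array quoted above.
CB64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_encode(data: bytes) -> str:
    """Encode bytes by looking up each 6-bit value in CB64."""
    out = []
    for i in range(0, len(data), 3):
        group = data[i:i + 3]
        # Pack up to three bytes into a 24-bit integer, zero-filled on the right.
        value = int.from_bytes(group.ljust(3, b"\0"), "big")
        symbols = [CB64[(value >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # 1 byte yields 2 symbols + "==", 2 bytes yield 3 symbols + "=".
        keep = len(group) + 1
        out.append("".join(symbols[:keep]) + "=" * (4 - keep))
    return "".join(out)


for text in (b"abc", b"ab", b"a"):
    encoded = b64_encode(text)
    # Cross-check against the standard library implementation.
    assert encoded == base64.b64encode(text).decode("ascii")
    print(text, "->", encoded)
# b'abc' -> YWJj
# b'ab' -> YWI=
# b'a' -> YQ==
```
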

Obviously, this approach is not practical for files bigger than a few kilobytes. In some cases, it is more convenient to compress the file first and then encode it. As I explained above, printable characters cluster in a relatively small range. Therefore text files have low entropy and compress well. Compressing a text file and encoding it with base64 is usually the way to go when the indentation in the file cannot be preserved during the transfer.
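
The effect of compressing before encoding can be seen with a short Python sketch. The sample text is a made-up, repetitive configuration snippet, so the exact numbers will differ for real files; gzip is used here as one possible compressor, not as the only choice.

```python
import base64
import gzip

# Hypothetical sample: an indented, repetitive text such as a config file.
text = ("server:\n    listen: 8080\n    workers: 4\n" * 50).encode("ascii")

plain = base64.b64encode(text)
packed = base64.b64encode(gzip.compress(text))

print("original      :", len(text), "bytes")
print("base64 only   :", len(plain), "characters")
print("gzip + base64 :", len(packed), "characters")

# The receiving side reverses both steps and gets the indentation back intact.
restored = gzip.decompress(base64.b64decode(packed))
assert restored == text
```

On the command line, roughly the same thing can be done by piping the output of gzip -c into base64, and base64 -d into gunzip on the other side.
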

I chose /bin/dmesg as an example for base64 encoding:
base64 < /bin/dmesg
base64 /bin/dmesg


Both commands above give the same result. dmesg has a file size of 6736 bytes. After it is encoded with base64, the output is 9103 bytes.


[root@something ~]# ls -la /bin/dmesg
-rwxr-xr-x 1 root root 6736 Oct 15  2014 /bin/dmesg

[root@something ~]# base64  /bin/dmesg | wc -c
9103

9103 / 6736 = 1.351. I wrote above that the overhead is 33%, but here it is slightly more, because I have intentionally ignored one detail so far: base64 splits its output into 76-character lines so that it fits in a standard 80-column terminal. This means there is a '\n' (LF, the line feed character in UNIX) after every 76 characters. The wrapping behavior is controlled with the -w parameter of base64. Let's join the output into a single line:

[root@something ~]# base64 -w 0  /bin/dmesg > dmesg.base64
[root@something ~]# cat dmesg.base64 | wc -c
8984

8984 / 6736 = 1.334. Now the calculation is correct. The distribution of the characters in the output can be analyzed with the following command:
[root@something ~]# base64 -w 1  /bin/dmesg | sort -b | uniq -c

There are two '=' characters in the output.
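
The sizes and the padding count seen above follow from simple arithmetic, which can be checked with a small Python sketch. The function name base64_sizes is hypothetical, and the model assumes one LF per output line, as in the wrapped output of base64.

```python
def base64_sizes(n_bytes: int, wrap: int = 76) -> tuple:
    """Return (encoded characters, '=' padding count, total size with newlines)."""
    groups = -(-n_bytes // 3)          # ceiling division: 3 input bytes per group
    chars = groups * 4                 # each group becomes 4 output characters
    padding = (3 - n_bytes % 3) % 3    # '=' characters in the last group
    newlines = -(-chars // wrap) if wrap else 0   # one LF per output line
    return chars, padding, chars + newlines


chars, padding, wrapped = base64_sizes(6736)
print(chars, padding, wrapped)     # 8984 2 9103
print(round(chars / 6736, 3))      # 1.334
print(round(wrapped / 6736, 3))    # 1.351
```

Since 6736 leaves a remainder of 1 when divided by 3, the last group holds a single byte and is padded with two '=' characters.
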
So, how can I decode the data? The decoded data will be a binary file, so I need to redirect the output to a file. The -d parameter of base64 decodes the encoded data. The command accepts input from a file or from stdin. The encoded file can be dumped to the console; from there it can either be redirected to the base64 command, or be copied and pasted into the command as keyboard input and terminated with Ctrl+D. The latter only works for relatively small data. The commands below all do the same job. By the way, I checked the md5sum of the original data and of the data after the encode/decode operation to verify the file content.

[root@something ~]# md5sum /bin/dmesg
e638a28f1d13b71fdcb13500fedcf00d  /bin/dmesg
[root@something ~]# cat dmesg.base64 | base64 -d > dmesg
[root@something ~]# md5sum dmesg
e638a28f1d13b71fdcb13500fedcf00d  dmesg
[root@something ~]# base64 -d dmesg.base64 > dmesg
[root@something ~]# md5sum dmesg
e638a28f1d13b71fdcb13500fedcf00d  dmesg
[root@something ~]# base64 -d < dmesg.base64 > dmesg
[root@something ~]# md5sum dmesg
e638a28f1d13b71fdcb13500fedcf00d  dmesg
[root@something ~]# base64 -d > dmesg
<the output is pasted here>
Ctrl+D
[root@something ~]# md5sum dmesg
e638a28f1d13b71fdcb13500fedcf00d  dmesg
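
The same encode, decode and verify round trip can be sketched in Python. This is only an illustration under assumptions: random bytes stand in for the binary file, and md5_hex is a hypothetical helper that mimics the first column of md5sum.

```python
import base64
import hashlib
import os


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest as hex, like the first column of md5sum."""
    return hashlib.md5(data).hexdigest()


# Stand-in for the binary file; 6736 random bytes mirror the size of /bin/dmesg.
original = os.urandom(6736)

# encodebytes wraps the output at 76 characters, like base64 without -w 0.
encoded = base64.encodebytes(original)
decoded = base64.b64decode(encoded)   # newlines are ignored while decoding

print(len(encoded))                   # 9103
print(md5_hex(original))
print(md5_hex(decoded))
assert md5_hex(original) == md5_hex(decoded)
```
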

The same encoded data can be decoded on Windows with the following command*:
certutil -decode data.b64 data.txt


Note1: Text split across multiple lines can also be joined into a single line with the paste -sd "" command.
Note2: The histogram of the characters can also be created with the following command*:
od -cvAnone -w1 | sort -b | uniq -c
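
For completeness, a similar character histogram can be produced in Python. This is a rough equivalent of the sort and uniq -c pipeline, not a reproduction of the od command; the function name and the sample input are made up for illustration.

```python
import base64
from collections import Counter


def base64_histogram(data: bytes) -> Counter:
    """Count how often each character occurs in the base64 form of data."""
    return Counter(base64.b64encode(data).decode("ascii"))


# Hypothetical input; any bytes object works, e.g. the contents of a file.
histogram = base64_histogram(b"Base64 Encoding in Linux and Its Use Cases")
for char, count in sorted(histogram.items()):
    print(f"{count:7d} {char}")
```
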