The Code Segment

Thursday, July 10, 2025

Add Captions or Labels to Images Using Python

Hi there. In this blog post, I'll be addressing a need-based problem: adding labels or any kind of text to images in a directory in bulk, for example a copyright note or an address. I think this could be solved by creating a stencil in GIMP. But what if the text to be added is not constant? My actual problem was adding sequence numbers to images: consecutive numbers, starting from one, in the corner of each photo. In this case, a stencil wouldn't be a solution. And even if it could be done by editing each and every image in GIMP, that wouldn't be very practical for several hundred images. The most practical solution is to write a simple script. By the way, I could have done this sequencing with the filenames, but I didn't want to touch them, because I also wanted to preserve the timestamps of the files. And let's assume I want to use these photos on a web page, where their filenames won't be visible at first glance. Finally, with such a script, you can also generate timestamps like the cameras from the 90s did.

When I talk about scripting, bash is the very first thing that comes to (my) mind; unfortunately, it's not the best tool for this job. AFAIK, there is no image library usable with bash. Of course, nothing is impossible. It can be done, but just as you don't use a hammer to knock down a wall when you can have a sledgehammer, choosing the appropriate tool for the job is the first step of any solution. Python lovers may get angry at this, but Python, the second best scripting language after bash, has a library for exactly this purpose, called the Python Imaging Library (PIL). I used a fork of PIL called Pillow in my script. This library is easily installed with the pip install pillow command.

Once again, I uploaded the code to my GitHub account to keep the article short. It's a small script with just a few tricks in it. First of all, like in any other Python script, there are import statements at the very beginning. The Image module of the library contains the core image functions, including the rotation I need in my script. ImageDraw provides simple 2D graphics, which is needed to draw the text onto the image, and finally ImageFont handles fonts.

Exif Data and Orientation

The first challenge of this project is that the images cannot be viewed on the computer as easily as we see them on the phone. Let's take the image below, which I took with my cell phone, as an example:

Above, it appears vertically in the browser. In Gwenview, it also appears in vertical format. But when I open it with the following code

>>> from PIL import Image
>>> img = Image.open("image.jpg")
>>> img.show()

it appears horizontally.

And when I open it with GIMP, a strange dialog box says that the image contains "Exif orientation data" and asks if I want to rotate it. But why?

When you hold a cell phone horizontally* and take a photo, the camera doesn't actually rotate the pixel data. When I open photos taken with a cell phone or a digital camera in Python, those not taken in the camera's default position are shown on the screen exactly as the sensor recorded them. The camera saves the orientation info (thanks to its gravity sensor) inside the picture, and viewers rotate the picture accordingly while displaying it. If the orientation were not saved, we would have to turn the phone to the exact orientation at which the photo was originally taken, each time.

*: The default position of some cameras is horizontal, for others it is vertical.

From the above, it's clear that an image doesn't consist only of pixel data. There is a metadata block called Exif, where information about the image is stored; today, common photo formats such as JPEG and TIFF, as well as practically all cameras, support it. There is a table on Wikipedia about the data stored in Exif. Typical fields are the manufacturer and model of the camera, the image orientation, the date and time the photo was taken, the image resolution etc. For example, saving the timestamp of the photo allows you to find out when the photo was actually taken and to sort photos by date, even if the file is renamed afterwards. On the other hand, some phones also store the coordinates from the phone's GPS, which reveals where the photo was taken; some platforms can then automatically tag the location of the photo when it is shared. These are things that will send shivers down the spine of those who are sensitive about the privacy of personal data. According to legend, Ukrainians asked Russian soldiers online for their photos and used the coordinate data of these photos to launch attacks.

On Linux, a tool called exiftool can be used to review this information (exiftool <filename>), to manipulate it, or to wipe it completely (exiftool -all= <filename>). For the example image above, there is a difference of around 77 KB between the original image and the image with all Exif data wiped.

After a long (and unnecessary) explanation of Exif, let's go back to image orientation. In Pillow, there is an Image.getexif() function [1] to read and parse Exif data. When I print the output of this function to the screen (lines 22, 23 and 24, commented out), I can see the orientation info of the .jpg files in the directory. The values 2, 4, 5 and 7 mentioned in [1] (the mirrored orientations) did not occur in my images. Neither did 8, so I didn't implement 8 in my code either, although it would be easy. For the value 1, I used "else" in the if structure (line 31), so for 1, 8 and all other values, the image is not rotated.
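A minimal sketch of reading the Exif fields with Pillow (not the script from the repo) might look like this. The helper names readable_exif and make_sample_jpeg are made up for illustration, and the sample image is generated in memory so that the snippet doesn't depend on any particular photo.

```python
from io import BytesIO

from PIL import ExifTags, Image

ORIENTATION_TAG = 0x0112  # numeric Exif tag id of the "Orientation" field


def readable_exif(img):
    """Return the top-level Exif fields of img as {tag name: value}."""
    exif = img.getexif()
    # ExifTags.TAGS maps numeric ids to names; unknown ids stay numeric.
    return {ExifTags.TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}


def make_sample_jpeg(orientation):
    """Build a small in-memory JPEG that carries the given orientation value."""
    exif = Image.Exif()
    exif[ORIENTATION_TAG] = orientation
    buffer = BytesIO()
    Image.new("RGB", (40, 20), "white").save(buffer, format="JPEG", exif=exif)
    buffer.seek(0)
    return buffer


# Demo: a landscape sample tagged as "rotate me" (orientation 6)
sample = Image.open(make_sample_jpeg(6))
print(readable_exif(sample))
```

For a real photo, Image.open("image.jpg") would take the place of the generated sample.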

For rotation, Pillow has the Image.rotate() function [2]. Using it, I rotated the image 180 degrees for Orientation=3 and 270 degrees for Orientation=6. A side note on line 27: if there is no orientation field in the Exif data, or if the image has no Exif information at all, the code will throw an error. For this reason, best practice is to check the return value of getexif() first and rotate only if the orientation field is present. Since all my images have Exif, I did not run into this problem and kicked the can down the road. That doesn't mean you won't have any issues either.
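The defensive variant mentioned above could be sketched as follows. This is an illustration and not the code in the repo: the function name upright and the lookup table are my own choices here, value 8 is included although the script skips it, and the mirrored values (2, 4, 5, 7) are still ignored. Note that Image.rotate() turns counter-clockwise.

```python
from PIL import Image

ORIENTATION_TAG = 0x0112  # numeric Exif tag id of the "Orientation" field

# Counter-clockwise angles for Image.rotate(), keyed by Exif orientation value.
# The mirrored orientations (2, 4, 5, 7) are deliberately left out.
ROTATION_FOR_ORIENTATION = {3: 180, 6: 270, 8: 90}


def upright(img):
    """Return img rotated according to its Exif orientation, if it has one."""
    # .get() falls back to 1 ("already upright") when the tag or the whole
    # Exif block is missing, so no exception is raised for such images.
    orientation = img.getexif().get(ORIENTATION_TAG, 1)
    angle = ROTATION_FOR_ORIENTATION.get(orientation)
    if angle is None:
        return img
    # expand=True resizes the canvas, so a quarter turn swaps width and height
    # instead of cropping the picture.
    return img.rotate(angle, expand=True)
```

Pillow also ships ImageOps.exif_transpose(), which covers all eight orientation values, if you would rather not maintain such a table yourself.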

So far, I've corrected the image orientation. To solve the main problem, I used [3], which demonstrates how to add text in a simple way; I only adjusted the parameters for my system. In line 35, an ImageDraw object is created to add text. The next line creates a font object with the ImageFont.truetype() function. As the font used in [3] isn't on my machine (and as I don't check the function call for errors), I chose one of my own fonts under /usr/share/fonts/ . The second parameter is the font size. I found its value by trial and error. My photos were relatively large (8 MP), so even a size of 128 produces a barely visible caption. In the next line, at the given coordinates of the image (25, 25 - top left), I add the sequence number "sirano" in red, using the font created in the previous line. At this step, the date and time of the photo could also be added to the image automatically, either from its filename or from its Exif data. An example of an enumerated photo is below:

Finally, line 40 can be uncommented to show the image on the screen, and/or line 43 can be uncommented to save the tagged image with an "_enum" suffix. While working on this script, I ran it in a directory with several hundred images and wanted neither to display nor to save that many images on each run, so I commented both lines out.
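The text-drawing step could be sketched roughly like this. It is not the script from the repo: the font path below is only an assumption (pick any font from your own /usr/share/fonts/), and the helper names load_font and add_label are made up. Unlike my script, this version falls back to Pillow's built-in font if the font file can't be loaded.

```python
from PIL import Image, ImageDraw, ImageFont

# Hypothetical font path; replace it with a font that exists on your system.
FONT_PATH = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"


def load_font(size, path=FONT_PATH):
    """Load a TrueType font, falling back to Pillow's built-in font."""
    try:
        return ImageFont.truetype(path, size)
    except OSError:
        # Raised when the font file cannot be found or read
        return ImageFont.load_default()


def add_label(img, text, position=(25, 25), size=128, color=(255, 0, 0)):
    """Draw text onto img (in place) near the top left corner and return img."""
    draw = ImageDraw.Draw(img)
    draw.text(position, text, font=load_font(size), fill=color)
    return img


# Demo on a blank canvas; a real run would pass an opened photo instead.
canvas = add_label(Image.new("RGB", (400, 300), "white"), "1")
```

Calling canvas.show() or canvas.save(...) afterwards corresponds to the two commented-out lines of the script.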

Note: Alternatively, it's possible to do the same thing in OpenCV with the cv2.putText() function, but I'm keeping this for another post.
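For completeness, the sequence numbering and the "_enum" naming described earlier could be sketched with pathlib alone. The function names are hypothetical, and sorting by filename is an assumption; sorting by the Exif timestamp would work just as well.

```python
from pathlib import Path


def enum_name(path):
    """photo.jpg -> photo_enum.jpg, in the same directory."""
    path = Path(path)
    return path.with_name(f"{path.stem}_enum{path.suffix}")


def numbered_images(directory, pattern="*.jpg"):
    """Yield (sequence number, source path, output path), counting from 1."""
    sources = sorted(
        p for p in Path(directory).glob(pattern)
        # Skip files produced by an earlier run
        if not p.stem.endswith("_enum")
    )
    for number, source in enumerate(sources, start=1):
        yield number, source, enum_name(source)
```

Since the originals are never renamed or overwritten, their filenames and timestamps stay untouched.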

[1]: https://jdhao.github.io/2019/07/31/image_rotation_exif_info/
[2]: https://note.nkmk.me/en/python-pillow-rotate/
[3]: https://www.geeksforgeeks.org/python/adding-text-on-image-using-python-pil/

Sunday, January 5, 2025

ETag Calculation on Amazon S3 Objects

Hi there. In this article, I'm going to discuss how the hash of files uploaded to S3, called the ETag, is calculated. In the simplest case, it is nothing more than an MD5 hash, calculated for everything uploaded to an S3 bucket. We know that files in S3 are not really "files" as we understand them. S3 is an "Object Storage" service, so in the correct terminology, the files are stored as objects.

Above a certain file size, uploads with aws s3 cp or aws s3 sync are automatically split into equal-sized chunks for (possibly) easier storage. Such objects are called multipart objects. So how big is this certain size? Although there is no general rule, my files are currently stored in chunks of 8M and 16M. The sources I used for this article say that files up to 5G are not fragmented [1][2], but my observation is that this is no longer true, and this value is also not very important for our problem.

If a file is smaller than this so-called multipart threshold, it is stored as a single part and the ETag of the object is equal to its MD5 hash. It's that simple. If the file is larger than this threshold, it is stored as a multipart object and things get a bit complicated. You can easily tell if an object is multipart or not by looking at its ETag. A normal MD5 hash consists only of hexadecimal digits, so a hyphen ( - ) does not belong in an MD5 hash. If a file in S3 has a hyphen in its ETag, then it is a multipart object and the number of parts is given after the hyphen. I will give a concrete example later in the article.
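In code, this check boils down to a few string operations. A small sketch, with parse_etag being a name I made up for illustration; the sample values are the ETags shown later in this article:

```python
def parse_etag(raw):
    """Split an S3 ETag into (hex digest, number of parts or None)."""
    # The ETag comes wrapped in literal double quotes
    etag = raw.strip('"')
    digest, hyphen, parts = etag.partition("-")
    # No hyphen means a plain MD5 hash, i.e. a single-part object
    return digest, (int(parts) if hyphen else None)


print(parse_etag('"fe78f69cb9d41a23ba23b4783e542a7b"'))
print(parse_etag('"360f5e8babf8cd28673eaafd32eb405f-489"'))
```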

ETag calculation on multipart objects works like this: each part is hashed separately, the resulting hashes are concatenated (as raw bytes, not as hex strings) and hashed again. This hash is the part of the ETag before the hyphen. The number of parts is simply appended after the hyphen [3].
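As an illustration, the same calculation can be written in a few lines of Python with hashlib. This is a sketch working on in-memory bytes, not my actual script; the function names and the tiny chunk size in the demo are chosen only for the example.

```python
import hashlib


def split_bytes(data, chunk_size):
    """Cut data into fixed-size parts; only the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def multipart_etag(parts):
    """Hash every part, then hash the concatenated binary digests."""
    # .digest() yields the raw 16 bytes of each MD5, not the hex string
    digests = [hashlib.md5(part).digest() for part in parts]
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


# Demo: 25 bytes in chunks of 10 give three parts, so the ETag ends in "-3"
print(multipart_etag(split_bytes(b"x" * 25, 10)))
```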

I regularly back up my disks with Clonezilla. Backups get written to an external disk and then copied to S3. I usually keep the most recent copy on my external disk and the last three copies on S3. For backwards (FAT32) compatibility, I split backup files into 4G chunks (even though I don't back up to FAT32 media). The need for ETag comparison arose, because I wanted to verify my copies on S3.

At this point, I assume that the aws CLI tool is installed and configured. The settings are made in the .aws/config file, but I won't go into its details, to avoid lengthening this article. Let's take the small file example first:

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/Info-lshw.txt
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-10-15T18:28:31+00:00",
    "ContentLength": 40960,
    "ETag": "\"fe78f69cb9d41a23ba23b4783e542a7b\"",
    "ContentType": "text/plain",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

As I mentioned before, this is not a multipart object, so the ETag is simply the MD5 hash of the file. Below is an example of a large file:

$ aws s3api head-object --bucket mybucket --key image_backup/2024-12-01-13-img/sda5.ntfs-ptcl-img.xz.ac
{
    "AcceptRanges": "bytes",
    "LastModified": "2024-12-03T17:00:58+00:00",
    "ContentLength": 4096008192,
    "ETag": "\"360f5e8babf8cd28673eaafd32eb405f-489\"",
    "ContentType": "application/vnd.nokia.n-gage.ac+xml",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

This file is about 4096 MB in size and, as you can see from its ETag, it consists of 489 parts. The main thing here is to find the size of the parts. ContentLength divided by 489 is very close to 8M. From this, it's safe to assume that the file is divided into 8M chunks, but it would be better to find the exact value, to use it in a script. To do this, I'll add the --part-number parameter to the command and check a single part. Since files are split into fixed-size chunks, only the size of the last part is different. Note that the ETag value returned for each part is the same, namely the ETag of the whole object. In other words, --part-number will not give the MD5 hash of an individual part.

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 1
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-10-15T18:28:31+00:00",
    "ContentLength": 16777216,
    "ETag": "\"aba379cb0d00f21f53da5136fc5b0366-299\"",
    "ContentType": "audio/aac",
    "ServerSideEncryption": "AES256",
    "Metadata": {},
    "PartsCount": 299
}

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 299
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-10-15T18:28:31+00:00",
    "ContentLength": 401408,
    "ETag": "\"aba379cb0d00f21f53da5136fc5b0366-299\"",
    "ContentType": "audio/aac",
    "ServerSideEncryption": "AES256",
    "Metadata": {},
    "PartsCount": 299
}
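The part-size reasoning can also be double-checked with a bit of arithmetic: a chunk size fits if the total size divided by it, rounded up, equals the number of parts. The sketch below tries a few candidate sizes; the candidate list and the function name are assumptions for illustration. The total size of the second object isn't shown in the outputs above, so it is reconstructed here as 298 full parts plus the last part.

```python
MIB = 1024 * 1024


def matching_chunk_sizes(content_length, parts, candidates_mib=(8, 16, 32, 64)):
    """Return the candidate chunk sizes (in MiB) that yield exactly `parts` parts."""
    matches = []
    for mib in candidates_mib:
        chunk = mib * MIB
        # Ceiling division, because the last part may be shorter than the rest
        if -(-content_length // chunk) == parts:
            matches.append(mib)
    return matches


# First object: ContentLength 4096008192, ETag suffix -489
print(matching_chunk_sizes(4096008192, 489))
# Second object: 298 parts of 16777216 bytes plus a last part of 401408 bytes
print(matching_chunk_sizes(298 * 16777216 + 401408, 299))
```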

According to the official AWS documentation (as of December 2024) [4], the default chunk size is 8 MB, yet as seen above, a file uploaded in October 2023 has 16 MB chunks. So it makes more sense to get this value from the ContentLength field instead of assuming it is a constant. It seems that the folks at Amazon change this default when they get bored. By the way, the aws command produces JSON output. When working in a bash script, it is more elegant to parse the output with jq instead of grep:

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 1 | jq -r '.ETag' 
"aba379cb0d00f21f53da5136fc5b0366-299"

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 1 | jq -r '.ContentLength' 
16777216
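If you prefer Python over bash, the same fields can be pulled out with the json module. A sketch, using the head-object output from above as sample input; summarize_head_object is a made-up name, and in a real script the text would come from running the aws command.

```python
import json

# Output of the head-object call with --part-number 1, copied from above
SAMPLE = r'''
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-10-15T18:28:31+00:00",
    "ContentLength": 16777216,
    "ETag": "\"aba379cb0d00f21f53da5136fc5b0366-299\"",
    "ContentType": "audio/aac",
    "ServerSideEncryption": "AES256",
    "Metadata": {},
    "PartsCount": 299
}
'''


def summarize_head_object(json_text):
    """Extract the fields needed for the ETag comparison."""
    info = json.loads(json_text)
    return {
        # The ETag value itself contains literal double quotes; drop them
        "etag": info["ETag"].strip('"'),
        "size": info["ContentLength"],
        # PartsCount only shows up when --part-number is given
        "parts": info.get("PartsCount"),
    }


print(summarize_head_object(SAMPLE))
```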

I wrote a script to compare all files in the backup directory one by one. It's kinda long to paste here, so it's available via the repo link. The script simply asks for the bucket name and the name of the directory where the backups are copied. I keep my backups in subdirectories named in the format <YYYY-MM-DD-HH-img>, under a directory called image_backup. This part (line 12) can be changed when needed. If a file is a single-part file, I hash it directly (line 26). If it's multipart, the file is split with dd (line 36) and the individual hashes of the parts are written to a temporary file. When all parts are processed, the resulting file is hashed again and the temp file is deleted (lines 41-42). The rest of the script compares the hashes with bash string operations: if they are the same, OK is printed; if not, FAIL is printed.
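The core of this comparison can be sketched in Python as well. To be clear, this is not the bash script from the repo, only an illustration of the same idea; the function names are made up, and the chunk size has to be the one found via --part-number.

```python
import hashlib


def local_etag(path, chunk_size, multipart):
    """Compute the ETag that S3 is expected to report for a local file."""
    if not multipart:
        whole = hashlib.md5()
        with open(path, "rb") as f:
            # Read in 1 MiB blocks to keep memory usage low
            for block in iter(lambda: f.read(1024 * 1024), b""):
                whole.update(block)
        return whole.hexdigest()
    digests = []
    with open(path, "rb") as f:
        # One read per part; the last part may be shorter
        for part in iter(lambda: f.read(chunk_size), b""):
            digests.append(hashlib.md5(part).digest())
    return f"{hashlib.md5(b''.join(digests)).hexdigest()}-{len(digests)}"


def check(path, remote_etag, chunk_size):
    """Return "OK" if the local file matches the remote ETag, else "FAIL"."""
    expected = remote_etag.strip('"')
    actual = local_etag(path, chunk_size, multipart="-" in expected)
    return "OK" if actual == expected else "FAIL"
```

A call like check("sda5.ntfs-ptcl-img.gz.aac", etag, 16777216) would then play the role of one loop iteration of the script.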

[1]: https://stackoverflow.com/questions/45421156
[2]: https://stackoverflow.com/questions/6591047
[3]: https://stackoverflow.com/questions/12186993
[4]: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-chunksize