Sunday, January 5, 2025

ETag Calculation on Amazon S3 Objects


Hi there. In this article, I'm going to discuss how the ETag, the hash of files uploaded to S3, is calculated. At its core it is nothing more than a simple MD5 hash, calculated for what is uploaded to an S3 bucket. Files in S3 are not really "files" as we usually understand them: S3 is an "Object Storage" service, so in the correct terminology the files are stored as objects.

After a certain file size, uploads with aws s3 cp or aws s3 sync are automatically split into equal-sized chunks for (possibly) easier storage. Such objects are called multipart objects. So how big is this certain size? Although there is no general rule, my files are currently stored in chunks of 8M and 16M. The sources I used for this article say that files up to 5G are not split [1][2], but my observation is that this is no longer true, and this value is not very important for our problem anyway.

If a file is smaller than this so-called multipart threshold, it is stored as a single part and the ETag of the object is equal to its MD5 hash. It's that simple. If the file is larger than this threshold, it is stored as a multipart object and things get a bit complicated. You can easily tell whether an object is multipart by looking at its ETag. A normal MD5 hash consists only of hexadecimal digits, so a hyphen ( - ) cannot be part of it. If an object in S3 has a hyphen in its ETag, then it is a multipart object and the number of parts is given after the hyphen. I will give a concrete example later in the article.
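As a quick illustration of the hyphen rule, here is a minimal shell sketch. The function name etag_part_count is my own placeholder and not part of the aws CLI; it only looks at the ETag string and assumes the value may still carry the double quotes S3 puts around it.

```shell
# etag_part_count is a hypothetical helper, not an aws CLI feature.
# It prints the number after the hyphen for a multipart ETag, or 1 otherwise.
etag_part_count() {
    local etag
    # S3 returns the ETag wrapped in double quotes; drop them first.
    etag=$(printf '%s' "$1" | tr -d '"')
    case $etag in
        *-*) printf '%s\n' "${etag##*-}" ;;  # multipart: part count follows the hyphen
        *)   printf '1\n' ;;                 # plain MD5: stored as a single part
    esac
}

etag_part_count '"fe78f69cb9d41a23ba23b4783e542a7b"'      # prints 1
etag_part_count '"360f5e8babf8cd28673eaafd32eb405f-489"'  # prints 489
```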

ETag calculation on multipart objects works like this: each part is hashed separately, the resulting hashes are concatenated (as raw 16-byte digests, not as hexadecimal text) and the result is hashed again. This hash is the part of the ETag before the hyphen. The number of parts is simply added after the hyphen [3].
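The calculation described above can be sketched in shell roughly as follows. This is only a sketch under a few assumptions: the part size is known and fixed, md5sum and dd are available, and both function names (hex_to_bin, multipart_etag) are placeholders of mine. It is not necessarily how my backup script does it.

```shell
# Both functions are hypothetical helpers for this sketch.

# Turn a string of hex digits into raw bytes using only shell built-ins
# (where xxd is installed, `xxd -r -p` does the same job).
hex_to_bin() {
    local hex="$1" rest n fmt=""
    while [ -n "$hex" ]; do
        rest=${hex#??}                   # everything after the first two digits
        n=$(( 0x${hex%"$rest"} ))        # numeric value of the first two digits
        # Append the byte as a three-digit octal escape for printf.
        fmt="$fmt\\$(( n >> 6 ))$(( (n >> 3) & 7 ))$(( n & 7 ))"
        hex=$rest
    done
    printf "$fmt"
}

# Print the ETag expected for FILE when it is uploaded in parts of PART_SIZE bytes.
# Usage: multipart_etag FILE PART_SIZE
multipart_etag() {
    local file="$1" part_size="$2" size parts i=0 hashes="" h
    size=$(wc -c < "$file")
    parts=$(( (size + part_size - 1) / part_size ))   # round up
    while [ "$i" -lt "$parts" ]; do
        # Hash part i on its own; dd reads one block of part_size bytes
        # (the last block is simply shorter).
        h=$(dd if="$file" bs="$part_size" skip="$i" count=1 2>/dev/null | md5sum | cut -d' ' -f1)
        hashes="$hashes$h"
        i=$(( i + 1 ))
    done
    # Hash the concatenated raw digests and append the part count.
    printf '%s-%s\n' "$(hex_to_bin "$hashes" | md5sum | cut -d' ' -f1)" "$parts"
}

# Example call for a file uploaded in 16 MiB parts:
#   multipart_etag sda5.ntfs-ptcl-img.gz.aac 16777216
```

The important detail is the hex_to_bin step: hashing the hexadecimal text of the part hashes instead of their raw bytes gives a different result that will never match the ETag.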

I regularly back up my disks with Clonezilla. Backups get written to an external disk and then copied to S3. I usually keep the most recent copy on my external disk and the last three copies on S3. For backwards (FAT32) compatibility, I split backup files into 4G chunks (even though I don't back up to FAT32 media). The need for ETag comparison arose because I wanted to verify my copies on S3.

At this point, I assume that the aws CLI tool is installed and configured. The settings are kept in the .aws/config file, but I won't go into the details to keep this article short. Let's take the small file example first:

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/Info-lshw.txt
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-10-15T18:28:31+00:00",
    "ContentLength": 40960,
    "ETag": "\"fe78f69cb9d41a23ba23b4783e542a7b\"",
    "ContentType": "text/plain",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

As I mentioned before, this is not a multipart object (there is no hyphen in the ETag), so the ETag is simply the MD5 hash of the file. Below is an example of a large file:

$ aws s3api head-object --bucket mybucket --key image_backup/2024-12-01-13-img/sda5.ntfs-ptcl-img.xz.ac
{
    "AcceptRanges": "bytes",
    "LastModified": "2024-12-03T17:00:58+00:00",
    "ContentLength": 4096008192,
    "ETag": "\"360f5e8babf8cd28673eaafd32eb405f-489\"",
    "ContentType": "application/vnd.nokia.n-gage.ac+xml",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

This file is about 4096 MB in size and, as you can see from its ETag, it consists of 489 parts. The main thing here is to find the size of the parts. ContentLength divided by 489 is very close to 8M, so it's safe to assume that the file is divided into 8M chunks, but it is better to find the exact value in order to use it in a script. To do this, I'll add the --part-number parameter to the command and check a single part. Since files are split into fixed-size chunks, only the size of the last part is different. Note that the ETag returned for each part is the same: it is the ETag of the whole object. In other words, --part-number will not give the MD5 hash of each individual part. The two calls below query the first and the last part of an older backup file from October 2023:

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 1
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-10-15T18:28:31+00:00",
    "ContentLength": 16777216,
    "ETag": "\"aba379cb0d00f21f53da5136fc5b0366-299\"",
    "ContentType": "audio/aac",
    "ServerSideEncryption": "AES256",
    "Metadata": {},
    "PartsCount": 299
}

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 299
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-10-15T18:28:31+00:00",
    "ContentLength": 401408,
    "ETag": "\"aba379cb0d00f21f53da5136fc5b0366-299\"",
    "ContentType": "audio/aac",
    "ServerSideEncryption": "AES256",
    "Metadata": {},
    "PartsCount": 299
}
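The numbers in these responses are consistent with fixed-size parts, which can be checked with a bit of shell arithmetic. This is only a sanity check; part_count and last_part_size are hypothetical helper names of mine, and the sizes are taken from the outputs above.

```shell
# Hypothetical helpers: number of parts and size of the last part,
# given an object size and a fixed part size (both in bytes).
part_count()     { echo $(( ($1 + $2 - 1) / $2 )); }
last_part_size() { echo $(( $1 - ($(part_count "$1" "$2") - 1) * $2 )); }

MiB=$(( 1024 * 1024 ))

# 2024 file: ContentLength 4096008192, ETag suffix -489
part_count 4096008192 $(( 8 * MiB ))       # prints 489
part_count 4096008192 $(( 16 * MiB ))      # prints 245, so 16 MiB parts do not fit
last_part_size 4096008192 $(( 8 * MiB ))   # prints 2367488

# 2023 file: 298 full parts of 16777216 bytes plus a last part of 401408 bytes
total=$(( 298 * 16 * MiB + 401408 ))
echo "$total"                              # prints 5000011776
part_count "$total" $(( 16 * MiB ))        # prints 299
```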

According to the official AWS documentation (as of December 2024) [4], the default chunk size is 8 MB, yet as seen above, a file uploaded in October 2023 has 16 MB chunks. So it makes more sense to read this value from the ContentLength field of the first part instead of treating it as a constant. It seems that the folks at Amazon change this default when they get bored. By the way, the aws command produces JSON output. When working with a bash script, it is more elegant to parse the output with jq instead of grep:

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 1 | jq -r '.ETag'
"aba379cb0d00f21f53da5136fc5b0366-299"

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 1 | jq -r '.ContentLength'
16777216
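Note that the double quotes are part of the ETag value itself, so jq -r still prints them. A possible way to get both fields with a single request, with the quotes removed, is sketched below. The function name etag_and_length is a placeholder of mine; ltrimstr, rtrimstr, tostring and join are standard jq built-ins.

```shell
# etag_and_length is a hypothetical helper. It reads a head-object response
# on stdin and prints "<ETag without quotes> <ContentLength>" on one line.
etag_and_length() {
    jq -r '[(.ETag | ltrimstr("\"") | rtrimstr("\"")), (.ContentLength | tostring)] | join(" ")'
}

# Intended use (bucket and key as in the commands above):
#   aws s3api head-object --bucket mybucket --key "$key" --part-number 1 | etag_and_length
# For the 2023 object queried above this prints:
#   aba379cb0d00f21f53da5136fc5b0366-299 16777216
```

The two values can then be read into variables, for example with read -r etag part_size.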

I wrote a script to compare all files in the backup directory one by one. It's kinda long to paste here, so it's available via the repo link. The script simply asks for the bucket name and the name of the directory where the backups are copied. I keep my backups in subdirectories named in the format <YYYY-MM-DD-HH-img>, under a directory called image_backup. This part (line 12) can be changed when needed. If a file is a single-part file, I hash it directly (line 26). If it's multipart, the file is split with dd (line 36) and the individual hashes of each part are written to a temporary file. When all parts are processed, the resulting file is hashed again and the temp file is deleted (lines 41-42). The rest of the script compares the hashes with bash string operations: if they are the same, OK is printed, and if not, FAIL is printed.
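To give an idea of the single-part branch without pasting the whole script, here is a minimal sketch. It is not the script from the repo; verify_single_part is a hypothetical name, and it assumes md5sum is available and that the remote ETag is passed in as returned by head-object (with or without the surrounding quotes).

```shell
# verify_single_part is a hypothetical helper, not the script from the repo.
# Usage: verify_single_part FILE REMOTE_ETAG
verify_single_part() {
    local file="$1" remote="$2" local_md5
    remote=${remote#\"}        # strip the leading quote S3 adds, if present
    remote=${remote%\"}        # strip the trailing quote, if present
    # Hash from stdin so that md5sum prints only the hash and a dash.
    local_md5=$(md5sum < "$file" | cut -d' ' -f1)
    if [ "$local_md5" = "$remote" ]; then
        echo OK
    else
        echo FAIL
    fi
}

# Example call:
#   verify_single_part Info-lshw.txt '"fe78f69cb9d41a23ba23b4783e542a7b"'
```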


[1]: https://stackoverflow.com/questions/45421156
[2]: https://stackoverflow.com/questions/6591047
[3]: https://stackoverflow.com/questions/12186993
[4]: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-chunksize