Sunday, January 5, 2025

ETag Calculation on Amazon S3 Objects


Hi there. In this article, I'm going to discuss, how the hash of files uploaded to S3, called ETag, is calculated. It is actually nothing more than a simple MD5 hash, calculated for everything uploaded to an S3 bucket. We know, that files in S3 are not really "files", as we understand them. S3 is called as “Object Storage”, thus the files are stored as objects in the correct terminology. 

After a certain file size, uploads with aws s3 cp or aws s3 sync are automatically split into equal sized chunks for (possibly) easier storage. Such objects are called multipart objects. So how big is this certain size? Although, there is no general measure, my files are currently stored in chunks of 8M and 16M. The source I used for this article says that files up to 5G are not fragmented [1][2], but my observation is, that this is no longer true, and this value is also not very important for our problem.

If a file is smaller than this so called multipart threshold, it is stored as a single part and the ETag of the object is equal to its MD5 hash. So simple is that. If the file is larger than this threshold, it is stored as a multipart object and things get a bit complicated. You can easily tell, if an object is multipart or not, by looking at its ETag. A normal MD5 hash consists only of hexadecimal digits. Therefore a hyphen ( - ) does not belong to an MD5 hash. If a file in S3 has a hyphen in its ETag, then it is a multipart object and the number of parts of this file is given after the hyphen. I will give a concrete example later in the article.

ETag calculation on multipart objects works like this: Each part is hashed separately, the resulting hashes are concatenated and hashed again. This hash is the part of the ETag before the hyphen. The number of fragments is simply added after the hyphen [3].

I regularly back up my disks with Clonezilla. Backups get written to an external disk and then copied to S3. I usually keep the most recent copy on my external disk and the last three copies on S3. For backwards (FAT32) compatibility, I split backup files into 4G chunks (even though I don't back up to FAT32 media). The need for ETag comparison arose, because I wanted to verify my copies on S3.

At this point, I assume that the aws CLI tool is installed and configured. The settings are made in the .aws/config file, but I won't go into its details, to avoid lengthening this article. Let's take the small file example first:

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/Info-lshw.txt
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-10-15T18:28:31+00:00",
    "ContentLength": 40960,
    "ETag": "\"fe78f69cb9d41a23ba23b4783e542a7b\"",
    "ContentType": "text/plain",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

As I mentioned before, this is not a multipart object. So the MD5 hash, i.e. the ETag, can be simply found. Below is an example of a large file:

$ aws s3api head-object --bucket mybucket --key image_backup/2024-12-01-13-img/sda5.ntfs-ptcl-img.xz.ac
{
    "AcceptRanges": "bytes",
    "LastModified": "2024-12-03T17:00:58+00:00",
    "ContentLength": 4096008192,
    "ETag": "\"360f5e8babf8cd28673eaafd32eb405f-489\"",
    "ContentType": "application/vnd.nokia.n-gage.ac+xml",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

This file is 4096 MB in size and, as you can see from its ETag, it consists of 489 parts.  The main thing here is to find the size of the parts. ContentLength divided by 489 is actually very close to 8M. From this, it's safe to assume that the file is actually divided into 8M chunks, but it would be better to find the exact value, to use it in a script. To do this, I'll add --part-number parameter to the command and check a single part. Since files are splitted into fixed size chunks, only the size of the last fragment is different. And the ETag value for each part is the same. In other words, --part-number will not give the MD5 hash of each individual part.

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 1
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-10-15T18:28:31+00:00",
    "ContentLength": 16777216,
    "ETag": "\"aba379cb0d00f21f53da5136fc5b0366-299\"",
    "ContentType": "audio/aac",
    "ServerSideEncryption": "AES256",
    "Metadata": {},
    "PartsCount": 299
}

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 299
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-10-15T18:28:31+00:00",
    "ContentLength": 401408,
    "ETag": "\"aba379cb0d00f21f53da5136fc5b0366-299\"",
    "ContentType": "audio/aac",
    "ServerSideEncryption": "AES256",
    "Metadata": {},
    "PartsCount": 299
}

According to the official AWS documentation (as of December 2024) [4] the default chunk size is 8 MB, yet as seen above, in October 2023 a file was uploaded with 16 MB chunks. So it makes more sense to get this value from the ContentLength field instead of assuming it as a constant. It seems, that the folks at Amazon change this default, when they get bored. By the way, aws command produces json output. When working with bash script, it is more elegant to parse the output with jq instead of grep:

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 1 | jq -r '.ETag'
"aba379cb0d00f21f53da5136fc5b0366-299"

$ aws s3api head-object --bucket mybucket --key image_backup/2023-10-15-10-img/sda5.ntfs-ptcl-img.gz.aac --part-number 1 | jq -r '.ContentLength'
16777216

I wrote a script to compare all files in the backup directory one by one. It's kinda long to paste it here, so it's available via repo link.  The script simply asks the bucket name and the name of the directory, where the backups are copied. I keep my backups in subdirectories with the format <YYYY-MM-DD-HH-img>, under a directory called image_backup. This part (line 12) can be changed when needed. If a file is a single part file, I hash it directly (line 26). If it's multipart, the file is split with dd (line 36) and individual hashes of each part are written to a temporary file. When the parts are fully processed, the resulting file is hashed again and the temp file is deleted (lines 41-42). The rest of the file is compared with bash string operations and if the hashes are the same, OK is printed and if not, FAIL is printed.


[1]: https://stackoverflow.com/questions/45421156
[2]: https://stackoverflow.com/questions/6591047
[3]: https://stackoverflow.com/questions/12186993
[4]: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-chunksize

Sunday, November 24, 2024

On the Invariance of the Distances Between Points in a Plane under Rotations of the Coordinate System

 

Hi there. As the title already suggests, this post is going to be a little academic. I found the following question among the challenge questions of the "Vectors" section of an “University Physics” book [1] and wanted to publish its solution, just because I liked it. Actually, high school math should be enough to follow the solution. I will try to simplify it anyways, as much as I can (so that I can also understand it afterwards).

Question: Distances between points in a plane do not change when a coordinate system is rotated. In other words, the magnitude of a vector is invariant under rotations of the coordinate system. Suppose a coordinate system S (red) is rotated about its origin by angle φ to become a new coordinate system S' (blue), as shown in the following figure.

Image Source: University Physics Vol. 1 [1]

A point in a plane has coordinates (x, y) in S and coordinates (x', y') in S'.

(a) Show that, during the transformation of rotation, the coordinates in S' are expressed in terms of the coordinates in S by the following relations:

Remark: The above formula is only correct, if the angle φ is negative (clockwise). In the image, the angle shows a positive direction. I have still tried to stick to the original formula given in the book.

At this point, a good mathematician would express the above equations with a 2D rotation matrix.


This is a much deeper subject, when generalized to 3D. I'm going to publish another post on this in the future.

(b) Show that the distance of point P to the origin is invariant under rotations of the coordinate system. Here, you have to show that: 

(c) Show that the distance between points P and Q is invariant under rotations of the coordinate system. Here, you have to show that:


Preliminary Information

First of all, since there is a rotation in the question, it is possible to simplify the operations by switching to polar coordinate system. The coordinate system S in the image is called Cartesian coordinate system (CCS), where the perpendicular distances of a point in the plane are given in (x, y) pairs. In addition, the axes can be rotated, as shown in the image, and they also don't need to be always perpendicular to each other (also see: Orthogonal Coordinates). Skewed coordinate systems are also possible. The only condition of existence of such coordinate systems here is, that the axes are not overlapping, i.e. the angle between them is not zero. In case the angle is zero, two dimensions collapse to one (N dimensions are likewise reduced to N-1), and we get the number line, which we know from the elementary school: The number line is a one-dimensional coordinate system. The term "Cartesian" is meaningless, as there is orthogonality of a single vector. 

In polar coordinate system (PCS), a point is expressed by the pairs (r, θ), where r is the distance from the origin and θ is the positive (counterclockwise) angle between the line segment and the horizontal axis.

Note: In general, to express any point in an N-dimensional space, N terms are required regardless of the coordinate system chosen. E.g., in 3D, the triples (x, y, z) are used in the cartesian coordinate system, (r, θ, φ) in the spherical coordinate system or (ρ, φ, z) in the cylindrical coordinate system.


Therefore, I can express the problem in PCS as follows.

Representation of the problem in PCS

As it can be seen, the position of the points P or Q has not changed. We just express them differently. As an analogy, the concept of 'book' exists independently of language, but it's obviously being expressed in different words in different languages. If I'm talking to a Portuguese person, it would be easier to use the Portuguese word for book. Likewise, for problems involving rotation, it is more convenient to switch to polar coordinates. If the problem had involved some sort of displacement, Cartesian would be more advantageous. So let's switch to PCS first:

Transformation of P to PCS

In the given problem, I know that the coordinates of P are (xP, yP). For simplicity, I will take them as (x, y) and do the transformation. r is the distance here:

Since P0x is a right triangle, simple trigonometric equations for θ can be easily written. E.g. sin(θ) = y/r and cos(θ) = x/r . If I divide these two side by side tan(θ) = y/x (r cannot be zero, btw. If it was, P would already be on the origin and the angle would be undefined, since there is no line segment, no triangle. Therefore the division is valid). Hence:

The equations for the inverse transformation are shown in the image below.

Equation (1)

Now, we can proceed to the solutions.


Solution

a) Rotating the coordinate system S by φ means, that we're expressing the point P(x’, y’) as P(r, θ - φ) in the new coordinate system. In this case, the following equations are written from the sum-difference identities for trigonometric functions:

If I open the parentheses on the right hand side, and substitute the terms by x and y from the equation (1), the following equations are found:

If I express this with a matrix:

Equation (2)

 

b)  

To show the relationship between x' and x, as well as y' and y, equations (1) will be used again. Since x and y are real numbers, the expressions in the roots can never be negative. Therefore, by squaring both sides, we can get rid of the roots without doing any mistake. The equation is shown below:

 
In the above equation, I assumed that the equation  cos2(x) + sin2(x) = 1 is already known. Considering the right triangle P0x in the image where P is shown in CCS, the hypotenuse is r and other edge lengths are r*cos(θ) and r*sin(θ), which are obviously perpendicular to each other. The equality is easily proved by writing down the Pythagorean theorem and simplifying all terms by  r2. The geometric proof of the equality for the case r = 1 can also be easily given in an unit circle.


c)

We now want to show the above equality. Once again, the equations in (1) will be used here. The expressions inside the roots are not negative, so we get rid of the roots by squaring both sides. From the rotation transformation given in (2), the following equations are written:

As these are relatively complicated, I first opened the terms with x, and then the terms with y.

 


To get the expression (x'P - x'Q)2 + (y'P - y'Q)2 , I will add the above terms side by side. The terms marked in same color above cancel each other out. For the sake of simplicity, they are excluded in the summation below.

Finally, if the right hand side of the above equation is grouped by x's and y's, and bracketed, and finally the resulting equation is simplified based on  cos2(x) + sin2(x) = 1 , a square bracket expansion is found:


[1]: Moebs, W., Sanny, J., & Ling, S. J. (2016). University Physics Volume 1. Houston, Texas: OpenStax. Link