Saturday, August 6, 2022

Linear Algebra and Its Relationship between Video Subtitles

In memory of Prof. Dr. Metin DEMÄ°RALP, who helped me understand the linear algebra topics and the relationship between them, that I benefit even today.

Hi there. I was recently syncing subtitles to some of my content I have, and I realized that this is actually a linear algebra problem and it is a quite simple one: Linear equations in two variables. In this post, I will first define the problem and then discuss how I solved it.


First of All, Introduction and Basic Definitions
The most important thing to know when syncing subtitles is the frame rate of a video. Each video is actually made up of consecutive frames, which is also called frame per seconds (FPS). This is chosen by its creator from among some predefined values, like 23.976, 24, 25 etc. These values are given with a subtitle. You need to choose the correct one when downloading.

Example from planetdp.org


There are dozens of different formats for subtitles files, but I will focus on two. The first one is MicroDVD format with .sub file extension. These are pretty simple: the frame number between two curly braces, where the text appears, another frame number between two curly braces right next to first one, where the text disappears and the subtitle text. Example:

{0}{25}Hello

This will display "Hello" in the first second of a 25 FPS video. This format was especially popular in the early 2000s. It can be easily noticed that this subtitle won't fit to same video with different FPS and it needs to be edited. If a 24 FPS subtitle is used for a 25 FPS video, the texts move forward one frame for each second and after 25 seconds, the difference becomes 1 sec. Quite annoying!

SubRip is the second format with .srt file extension. This one is also quite simple:
1- A sequence number at the first line
2- Two timestamps in HH:MM:SS,TTT format, separated by " --> ". The first timestamp is when the text appears and the second when it disappears.
3- Subtitle text
4- An empty line defines the end of the text.

Example:
1
00:00:00,000 --> 00:00:01,000
Hello


This is the same as the first example. Since this format is FPS independent, it can be used with any FPS. Even though this format is available since the 2000s, it outstripped MicroDVD format in prevalence. This format is more human readable as there are no obfuscated frame numbers.

Of course, subtitle formats are not limited to these. For instance, Aegisub format, used extensively with anime, is way more complex than these two. As far as I know, it stores the text as vectors and the text can be printed anywhere on screen in any font, size and color. But it is beyond the scope of this post, as it is more difficult to edit than text-based formats.

Even though Subrip format is FPS independent, there are some subtitles on the internet that are somehow out of sync. These are probably subtitles converted from .sub to .srt.


Description of the Problem
As I've given the definitions, I can now describe the problem. In general, there can be two problematic situations with subtitles: (a) Subtitle is constantly ahead or behind. e.g. The subtitled video has an intro at the beginning and somebody else has cut it out while editing the video. So, the subtitle appears later or earlier than it should, but the time difference between the audio and the subtitles is always constant. Or (b) the subtitle is in sync with the audio at some point but it goes fast or stays slow and after a while sync disappears. As I mentioned above, this is seen especially in .sub files with wrong FPS and/or .srt files derived from them. And more frustrating is that, these two coexist. The subtitle is never in sync and the difference is changing constantly.

I've visualized both situations for a better understanding. For simplicity, I assume FPS rate is always 25. Let the X-axis be the time and Y-axis be the frame counter. The frame counter increments by 25 every second and an appropriate subtitle line must be "overlapping" with the video line. If the subtitle is ahead or behind, the subtitle line is parallel to video, either above or below it.


In case of FPS incompatibility, subtitle is in sync at the beginning but it disappears in time. Since they are in sync, both lines intersect at the beginning, and they diverge because the slope of the subtitle line is not equal to the slope of video.


And situations, where both problems coexist:


After formulating the problem with slope, I can now express everything mathematically.


Linear Equations in Two Variables and Solving Systems of Equations
An equation of the form y = mx + c, where m and c are constants, is called an equation with two variables. For instance, let m = 3 and c = 4. It gives y = 3x + 4. For each value of x, I get a y. i.e. if x = 1 then y = 7. If I find all y's that correspond to all x's and mark them as points on the plane, a line is formed. This line intersects the vertical axis at 4 (x = 0 case) and goes up (increases) 3 units for each unit where it goes to the right (where x increases). Therefore, m is called the slope parameter or just slope of the line. As m decreases, slope gets less steep. If m = 0, the line becomes a straight line and if m is negative, the slope is downwards. I can also drag the line up or down, by changing c parameter.

In previous section, I expressed a video with a line. The slope of the video line will be equal to its FPS and c = 0  because the audio starts obviously with the video. For example y = 25x. All thick blue lines are like this. And if the subtitle is compatible with the video, its equation must be the same. If not, my aim is to make them equal. Nice and easy.

As we saw on the graph, if the subtitle is ahead or behind, the lines are parallel. Parallel lines have equal slopes. And having the subtitle behind or ahead is actually the case where at frame zero (x = 0), the subtitle line doesn't intersect vertical axis at zero. It implies that the subtitle equation is actually y = 25x + c, where c is not equal to zero. What I need to do, is to find c and subtract (or sum) this from all timestamps in the subtitle.

Constantly widening gap between a subtitle and the video means that the slope of the subtitle line is different from that of the video. I need to change the slope parameter i.e. the value of 'm' here. Since 'm' is a multiplier, I need to find the value that gives the correct time, when multiplied by the subtitle's timestamps. E.g. if the subtitle is 24 FPS and the video is 25, I multiply all timestamps by 25/24 to rearrange. If the subtitle gets synced as soon as I do it, I can assume that c = 0 in subtitle equation, in other words the subtitle equation is y = 24x. In case c is non-zero, I also need to sum the difference after equalizing the slope value, like I explained in previous paragraph.

This is all about equations in two variables. How about systems of equations? They contain more than one equation, and they represent that many lines on a plane. Example:


Solving them is easy. Because y's are equal, I can equate both of them and add 9 to both sides to eliminate the nine on left hand side (LHS) of the equation. (I could have eliminated 3 on right hand side (RHS), then there would be -6 on LHS but I don't want to deal with negative values). If I subtract x from both sides, I find x = 6. The operation I applied is given after // chars.


When I substitute x with 6 in the second equation, it gives y = 6 - 3 = 3. (x = 6, y = 3) denotes a point, where two lines intersect. Because they are lines, they only intersect at one and only one point. This can be seen on the next figure.

Because the line slopes are not equal, one will increase less, the other will increase more and they always catch each other, unless they have same slope (parallel). They may also be parallel and never intersect. An example of this is, when the subtitle is constantly behind the audio.


Finally, the lines may be overlapping. In this case, all x values satisfy the system, but whether this is a meaningful solution or not depends on the problem.



Matrix Multiplication and Matrix Solutions of Systems
I had explained matrix multiplication in the BLAS and LAPACK article in detail. In summary, first row of the first matrix is multiplied by the first column of second matrix element by element. The values are summed up and the first element of the product matrix is found. Then first row and second column multiplied and summed, gives the second element (first row second column) etc... p. row * q. column = p x q element. This operation is better understandable with an example, but I'll first rearrange our system:


Now, I can easily write them in matrix notation:


First row elements, 2 and -1, multiplied by the first and the only column of the second matrix and added together, gives 2*x + (-1)y = 9 and the same operation with second row elements gives x + (-1)y = 3. The matrix multiplication above is exactly the same as the system of the equations. It is moreover both tidy and stylish. Since the first matrix consists of the coefficients of the equations, it is called the coefficient matrix. The matrix to the right of the equation, which is actually a vector because it is 2-by-1 matrix, is called right hand side vector or simply RHS Vector.

In case, there is no solution of the system, this corresponds to the case where the rows or columns of the matrix are proportional to eachother. In linear algebra, this means that the determinant of the matrix is zero, in other words at least one of its eigenvalues is zero but if I get into that stuff, I really get away from the real topic.

Now, I have a video, whose equation I know and I have subtitle whose equation I don't know but I want to synchronize to the video. Let's call the subtitles equation y = m1x + c1 and the video equation y = m2x + c2. I need to multiply m1 by such a number (let's call it m'), so that the result gives m2. I mean: m1m' = m2 and c1m' + c' = c2. This part could be a bit confusing but it's OK, because I'll explain it later in a simpler way:

Note: The term "subtitle" is refering to all the texts from the video. To distinguish the individual texts appearing on the screen at a given moment, I will refer it as "subtext" throughout the rest of this article. The term "subtitle" geometrically corresponds to the y = mx + c line, while the "subtext" term will be the value of y for a given x.

To adapt the subtitle to a video, I first need two subtexts, of which I know their exact timing (when they are spoken in the video). I can find their timings by listening or if I can for example find a fully synchronous subtitle from another source in any other language, that's even better. Of course, to understand if it is synchronous, it has to be in a language, that I can more or less understand (or check on google translate) or I can choose proper names from the audio and try to understand them approximately.

Okay, but why two? Let's say, I have a timestamp from the video and a corresponding subtext with wrong timing and I want to sync. A concrete example; I heard a "Hello" at the 33rd minute of the video and the timestamp of the "Hello" subtext is at the 35th minute. Should I just subtract by two minutes of should I multiply by 35/33? I cannot know because maybe the difference between the audio and subtext becomes four minutes at the 60th minute (slope is different). To solve m and c precisely, I need two subtexts. This is exactly the same as the Euclidean axiom, "one and only one line passes through two distinct points", we are stumbling upon.

So, I have two subtexts. For instance, "Hello" is spoken at 33rd minute and "Good bye" at 48th minute of the video. The subtitle I have contains corresponding subtexts at 35th and 51st minutes. And I will give my asyncronous subtitle to a function as an input, such that it will produce syncronized subtitle as its output. General expression of the function is y = mx + c. The x's are the timestamps of the asynchronous subtitle (input) and y's are the synched timestamps (output). In this case:


I could find the unknowns by multipling and subtracting side by side, but I'll do it aesthetically:


Since the coefficient of c is always 1, so is the values of the second column. At this point, I will ignore the method to solve this and interpret the results. I left the solution to wolfram alpha and found c = 3/16 and m = 15/16. This means, that the FPS ratio between the video and the subtitle is 15/16 (maybe the subtitle is 24 and the video is 22.5 FPS) and after this skew is corrected, there is still a difference of 3/16 minutes between audio and subtitle, which I need to subtract.

But how accurate is this function? Is the subtitle synced with a perfect accuracy for each second of the video or are there any errors (and if yes, how much)? Of course, I am not talking about the case that the subtitle itself is inaccurate (maybe some subtext is missing). How reliable is this function I found?


Robustness
The reliability of the function is a measure of the accurately it can sync the subtitle throughout the video. In math literature, it is expressed as "robustness" which means how much a system can tolerate errors.

The timestamps I found by listening are quite prone to errors, as they depend on how well I listen and how quickly I react. Furthermore, I don't know, how much error do the selected subtext contain initially. Maybe the subtext, I chose, had originally a skew with the audio. Even a skew of 0.2 seconds can lead to serious discrepancies. Let's examine this graphically:


I assume, that the subtext at minute 4 is perfectly correct, but I made a mistake of 0.2 seconds at minute 5 (4.8 frames for a 24 FPS video). This causes a skew of 12 seconds at first hour of the video. The lower part of the above graph shows the impact of my error on the entire video. When I diverge the second point from the first, for example, first subtext at minute 4 without error (again) and this time second subtext is at the 60th minute with same error of 0.2 sec:


The discrepancy on the both ends has shrunk to an inconspicous extent. The second graph looks way better than the first one. The error of 0.2 seconds at the 60th minute just became 0.343 seconds at the 100th minute. If I had synched with a subtext at the very end, it would be much less. So, it makes sense to get one subtext from the beginning and another one towards the end of the video for least error.

If I may go a bit back to linear algebra, the robustness of a linear system is related to how big the absolute value of the determinant of the coeff. matrix is. (or how big the differences between eigenvalues .... Anyway, I am calming down). I don't need to go to the definition of determinant in detail or its calculation. Because of the special form of the matrix above (all ones in second column), the determinant is equal to the difference between two timestamps. Therefore, the mathematical background of "having a larger gap between two subtexts to reduce the error", has also been proven.


Python Code
The coolest thing is that, in order to sync subtitles, the knowlegde of linear equation systems is not necessarily needed. Given the coeff. matrix and RHS vector, the linalg.solve function in Python's numpy library returns the result. This code snippet below reads two timestamps, that are in the subtitle and should be, converts them to seconds using time2num(). It creates the system and calculates m and c.

#!/usr/bin/python3

# input should be like:
# 00:03:29,632     00:02:13,532
# 00:43:56,890     00:44:00,486
# -------------    --------------
# is in subtitle   needs to be
#
# Formula:
# m * 00:03:29,632 + c*1 = 00:02:13,532
# m * 00:43:56,890 + c*1 = 00:44:00.486
#
# [ 209.632    1 ]   [ m ] = [ 133.532 ]
# [ 2636.890   1 ] * [ c ] = [ 2640.486 ]
# Calculate m, c params with this script then apply
# this transformation with subtitle.py script.

import numpy
import re

def time2num(time):
    timei = re.sub(r",", ".", time);
    (h, m, s) = timei.split(':');
    return round(int(h) * 3600 + int(m) * 60 + float(s), 3);

if __name__ == '__main__':
    print("is in subtitle    needs to be\n--------------    -------------");

    (OL1, OG1) = input().split();
    (OL2, OG2) = input().split();

    A = numpy.array( [[time2num(OL1), 1.0], [time2num(OL2), 1.0]], float );
    b = numpy.array( [time2num(OG1), time2num(OG2)], float);

    x = numpy.linalg.solve(A, b)

    print(x);

Output values are given as input to the second code snippet below, but first input is the subtitle file as command line parameter. The code reads all the timestamps in .srt file. Here, a regex search is performed with re.search(). The string "-->" is present in all timestamp lines are highly unlikely to be found in the subtext.  The appearance and disappearance times of subtexts are captured from these lines, converted to seconds using time2num() and the fitting operation is applied with m and c values. Newly found values converted to timestamps back with a formula and written to a copy of the file.

#!/usr/bin/python3

import argparse
import os
import re

# Using linsolver.py script, calculate first, what the m and c
# parameters are. Then enter them as input to this script.
# These will be used at lines 41-42

def time2num(time):
    timei = re.sub(r",", ".", time);
    (h, m, s) = timei.split(':');
    return round(int(h) * 3600 + int(m) * 60 + float(s), 3);

def num2time(numtime):
    s = round(numtime % 60, 3);
    m = int((numtime / 60) % 60);
    h = int(numtime / 3600);
    return re.sub(r"\.", ",", ("%02d:%02d:%06.3f" % (h, m, s)))

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Modify subtitle timings');
    parser.add_argument('arg_filename', type=str,
                        help='Subtitle file name to open');
    args = parser.parse_args();

    hFileIn  = open(args.arg_filename, "r", encoding="ISO-8859-9");
    hFileOut = open(''.join( os.path.splitext(args.arg_filename)[0] + ".mod.srt" ), "w", newline = "\r\n");

    print("Input m and c separated by space");
    (m, c) = input().split();

    for line in hFileIn.readlines():
        if(re.search(r"-->", line)):
            substart = time2num(line.split()[0]);
            subfinal = time2num(line.split()[2]);

            #####  Calculations here  #####
            substart = substart * m + c;
            subfinal = subfinal * m + c;

            hFileOut.write(num2time(substart) + ' --> ' + num2time(subfinal) + '\n');
        else:
            hFileOut.write(line);

    hFileIn.close();
    hFileOut.close();


Long story short, even linear algebra can be useful to learn without complaining about what it is used for in real life. 

Note: This article was actually ready to publish in April 2022, however it has delayed four months, first because of house moving, then due to a big problem caused by the worst mobile operator ever, that I couldn't receive MFA codes to login and finally because of the summer vacation.

No comments:

Post a Comment