[Math][Python] Trimming backups with logarithmic scale, part II

Previously. This post is a sequel of sorts.

This is a real story. Each day, a server dumps a MySQL DB into a file:

...
-rw-rw-r-- 1 i i 6697136 Dec  1 00:00 mysql_dump_1669849201.sql.xz
-rw-rw-r-- 1 i i 6730080 Dec  2 00:00 mysql_dump_1669935601.sql.xz
-rw-rw-r-- 1 i i 6762716 Dec  3 00:00 mysql_dump_1670022001.sql.xz
-rw-rw-r-- 1 i i 6603604 Dec  4 00:00 mysql_dump_1670108401.sql.xz
-rw-rw-r-- 1 i i 6590036 Dec  5 00:00 mysql_dump_1670194801.sql.xz
-rw-rw-r-- 1 i i 6639448 Dec  6 00:00 mysql_dump_1670281201.sql.xz
-rw-rw-r-- 1 i i 6673608 Dec  7 00:00 mysql_dump_1670367601.sql.xz
-rw-rw-r-- 1 i i 6701520 Dec  8 00:00 mysql_dump_1670454001.sql.xz
...

But again, I don't need to keep all of them (they are only there in case of disaster). A logarithmic scale can help here as well, as it did with ZFS snapshots: keep recent dumps densely and older ones more and more sparsely.

Here is a general-purpose utility, written in Python, for logarithmic trimming.

#!/usr/bin/env python3

import sys, os
import math, time, datetime

dry_run=True

def get_files_list():
    global dry_run
    rt={}
    for f in sys.argv[1:]:
        if f=="--commit":
            dry_run=False
        else:
            TS=os.path.getmtime(f)
            if TS not in rt:
                rt[TS]=[f]
            else:
                rt[TS].append(f)
    return rt

files=get_files_list()

if len(files)==0:
    print ("Usage: ./logtrim.py [--commit] filemask")
    print ("By default, it runs in dry-run mode. No files get deleted.")
    print ("Add --commit to actually delete files.")
    sys.exit(1)

# These parameters are to be tuned if you want a different logarithmic 'curve':
# 1.09 is the growth factor, 120 is the number of steps
points=sorted(set(math.floor(1.09**x) for x in range(1,120+1)))

# points are in hours, counted back from now
#print (points)

now=math.floor(time.time())

# points in UNIX timestamps
SECONDS_IN_HOUR=60*60
points_TS=sorted(list(map(lambda x: now-x*SECONDS_IN_HOUR, points)), reverse=True)
points_TS.append(0) # catch-all range for files older than the last point: all but one of them get removed

prev=now

# we are going to keep only one file in each range
# it could be picked randomly, or just be the first/last; here the first one listed is kept
# if there is only one file in the range, leave it
for p in points_TS:
    print ("range", prev, p, datetime.datetime.fromtimestamp(prev), datetime.datetime.fromtimestamp(p))
    range_hi=prev
    range_lo=p
    print ("files between:")
    files_between={}
    for s in files:
        # half-closed interval:
        if s>range_lo and s<=range_hi:
            print (s, files[s])
            files_between[s]=files[s]
    print ("files_between total:", len(files_between))
    if len(files_between)>1:
        files_between_vals=list(files_between.values())
        # going to kill all files except the first
        print ("keeping this file(s):", files_between_vals[0])
        for to_kill in files_between_vals[1:]:
            print ("removing this file(s):", to_kill)
            if not dry_run:
                for f in to_kill:
                    os.unlink(f)
    prev=p

if dry_run:
    print ("No files deleted.")
    print ("Add --commit to actually delete files.")
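To get a feel for the curve before touching the parameters, it may help to look at the range boundaries alone. This is a minimal sketch and not part of logtrim.py: log_points() is a hypothetical helper that wraps the same formula as the points line in the script, with the growth factor and the number of steps as arguments.

```python
import math

def log_points(base=1.09, steps=120):
    """Distinct range boundaries, in hours before 'now' (same formula as in the script)."""
    return sorted({math.floor(base ** x) for x in range(1, steps + 1)})

points = log_points()
# distance (in hours) between each pair of neighbouring boundaries
gaps = [b - a for a, b in zip(points, points[1:])]

print("number of boundaries:", len(points))
print("first boundaries (hours):", points[:10])
print("oldest boundary (days): %.0f" % (points[-1] / 24))
print("smallest gap (hours):", min(gaps), "largest gap (hours):", max(gaps))
```

With the default values, the oldest boundary should land roughly 3.5 years back, and the gaps should grow from one hour to something like three and a half months. A larger base stretches the curve further into the past with fewer ranges; more steps add more ranges at the old end.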

Let's run it on my list of mysql files:

 % ./logtrim.py testdata/*
...
range 1668748740 1668478740 2022-11-18 07:19:00 2022-11-15 04:19:00
files between:
1668549600.0 ['testdata/mysql_dump_1668553201.sql.xz']
1668636000.0 ['testdata/mysql_dump_1668639601.sql.xz']
1668722400.0 ['testdata/mysql_dump_1668726001.sql.xz']
files_between total: 3
keeping this file(s): ['testdata/mysql_dump_1668553201.sql.xz']
removing this file(s): ['testdata/mysql_dump_1668639601.sql.xz']
removing this file(s): ['testdata/mysql_dump_1668726001.sql.xz']
range 1668478740 1668187140 2022-11-15 04:19:00 2022-11-11 19:19:00
files between:
1668204000.0 ['testdata/mysql_dump_1668207601.sql.xz']
1668290400.0 ['testdata/mysql_dump_1668294001.sql.xz']
1668376800.0 ['testdata/mysql_dump_1668380401.sql.xz']
1668463200.0 ['testdata/mysql_dump_1668466801.sql.xz']
files_between total: 4
keeping this file(s): ['testdata/mysql_dump_1668207601.sql.xz']
removing this file(s): ['testdata/mysql_dump_1668294001.sql.xz']
removing this file(s): ['testdata/mysql_dump_1668380401.sql.xz']
removing this file(s): ['testdata/mysql_dump_1668466801.sql.xz']
...
No files deleted.
Add --commit to actually delete files.

...

 % ./logtrim.py --commit testdata/*

... 

Here is the list of files after trimming. The recent dumps are still daily, while the oldest ones are about two weeks apart. Isn't it neat?

-rw-rw-r-- 1 i i  584695 Jun  7  2022 mysql_dump_1654552801.sql
-rw-rw-r-- 1 i i  319376 Jun 13  2022 mysql_dump_1655071201.sql.xz
-rw-rw-r-- 1 i i  742012 Jun 29 00:00 mysql_dump_1656453601.sql.xz
-rw-rw-r-- 1 i i 1063884 Jul 13 00:00 mysql_dump_1657663201.sql.xz
-rw-rw-r-- 1 i i 1929164 Jul 27 00:00 mysql_dump_1658872802.sql.xz
-rw-rw-r-- 1 i i 2401192 Aug  8 00:00 mysql_dump_1659909601.sql.xz
-rw-rw-r-- 1 i i 2311372 Aug 19 00:00 mysql_dump_1660860001.sql.xz
-rw-rw-r-- 1 i i 2860008 Aug 30 00:00 mysql_dump_1661810402.sql.xz
-rw-rw-r-- 1 i i 3294004 Sep  8 00:00 mysql_dump_1662588001.sql.xz
-rw-rw-r-- 1 i i 3366360 Sep 17 00:00 mysql_dump_1663365601.sql.xz
-rw-rw-r-- 1 i i 3914516 Sep 25 00:00 mysql_dump_1664056801.sql.xz
-rw-rw-r-- 1 i i 3986248 Oct  3 00:00 mysql_dump_1664748001.sql.xz
-rw-rw-r-- 1 i i 4183152 Oct  9 00:00 mysql_dump_1665266401.sql.xz
-rw-rw-r-- 1 i i 4466500 Oct 16 00:00 mysql_dump_1665871201.sql.xz
-rw-rw-r-- 1 i i 4380092 Oct 21 00:00 mysql_dump_1666303201.sql.xz
-rw-rw-r-- 1 i i 4906184 Oct 26 00:00 mysql_dump_1666735201.sql.xz
-rw-rw-r-- 1 i i 4877932 Oct 31 00:00 mysql_dump_1667170801.sql.xz
-rw-rw-r-- 1 i i 5012264 Nov  5 00:00 mysql_dump_1667602801.sql.xz
-rw-rw-r-- 1 i i 5151808 Nov  9 00:00 mysql_dump_1667948401.sql.xz
-rw-rw-r-- 1 i i 5088692 Nov 12 00:00 mysql_dump_1668207601.sql.xz
-rw-rw-r-- 1 i i 5286184 Nov 16 00:00 mysql_dump_1668553201.sql.xz
-rw-rw-r-- 1 i i 5196168 Nov 19 00:00 mysql_dump_1668812401.sql.xz
-rw-rw-r-- 1 i i 5290272 Nov 22 00:00 mysql_dump_1669071601.sql.xz
-rw-rw-r-- 1 i i 5340424 Nov 24 00:00 mysql_dump_1669244401.sql.xz
-rw-rw-r-- 1 i i 5692236 Nov 27 00:00 mysql_dump_1669503601.sql.xz
-rw-rw-r-- 1 i i 6463064 Nov 29 00:00 mysql_dump_1669676401.sql.xz
-rw-rw-r-- 1 i i 6697136 Dec  1 00:00 mysql_dump_1669849201.sql.xz
-rw-rw-r-- 1 i i 6762716 Dec  3 00:00 mysql_dump_1670022001.sql.xz
-rw-rw-r-- 1 i i 6603604 Dec  4 00:00 mysql_dump_1670108401.sql.xz
-rw-rw-r-- 1 i i 6639448 Dec  6 00:00 mysql_dump_1670281201.sql.xz
-rw-rw-r-- 1 i i 6673608 Dec  7 00:00 mysql_dump_1670367601.sql.xz
-rw-rw-r-- 1 i i 6729428 Dec  9 00:00 mysql_dump_1670540401.sql.xz
-rw-rw-r-- 1 i i 6755784 Dec 10 00:00 mysql_dump_1670626801.sql.xz
-rw-rw-r-- 1 i i 6786088 Dec 11 00:00 mysql_dump_1670713201.sql.xz
-rw-rw-r-- 1 i i 6838348 Dec 12 00:00 mysql_dump_1670799601.sql.xz
-rw-rw-r-- 1 i i 6873036 Dec 13 00:00 mysql_dump_1670886001.sql.xz
-rw-rw-r-- 1 i i 6801340 Dec 14 00:00 mysql_dump_1670972401.sql.xz
-rw-rw-r-- 1 i i 6832060 Dec 15 00:00 mysql_dump_1671058802.sql.xz
-rw-rw-r-- 1 i i 6944328 Dec 16 00:00 mysql_dump_1671145201.sql.xz
-rw-rw-r-- 1 i i 7102432 Dec 17 00:00 mysql_dump_1671231601.sql.xz
-rw-rw-r-- 1 i i 6967316 Dec 18 00:00 mysql_dump_1671318001.sql.xz
-rw-rw-r-- 1 i i 6992008 Dec 19 00:00 mysql_dump_1671404401.sql.xz
-rw-rw-r-- 1 i i 7018544 Dec 20 00:00 mysql_dump_1671490801.sql.xz
-rw-rw-r-- 1 i i 7047548 Dec 21 00:00 mysql_dump_1671577201.sql.xz
-rw-rw-r-- 1 i i 7272416 Dec 22 00:00 mysql_dump_1671663601.sql.xz

You can run logtrim.py as a cron job.

BUG: Several files with the same modification timestamp are treated as a single entry, so they are kept or removed together.
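One possible way around this bug, as a rough sketch only: flatten the timestamp-to-files dictionary into individual (timestamp, file) pairs before walking the ranges. pick_deletions() is a hypothetical helper, not part of logtrim.py, and it assumes that the oldest file in each range is the one to keep (the script itself keeps whichever came first on the command line). It only returns the list of files to delete and never touches the disk.

```python
def pick_deletions(files_by_mtime, now, points_ts):
    """Return paths to delete, keeping exactly one file per range.

    files_by_mtime: dict {mtime: [paths]}, as built by get_files_list()
    now:            upper edge of the first range (UNIX timestamp)
    points_ts:      lower edges of the ranges, newest first, ending with 0
    """
    # Flatten into (mtime, path) pairs, oldest first, so that files
    # sharing a timestamp are handled one by one.
    entries = sorted((ts, path)
                     for ts, paths in files_by_mtime.items()
                     for path in paths)
    to_delete = []
    hi = now
    for lo in points_ts:
        # Same half-closed interval as in the script: lo < mtime <= hi
        in_range = [path for ts, path in entries if lo < ts <= hi]
        # Keep the first (oldest) file, mark all the others for deletion.
        to_delete.extend(in_range[1:])
        hi = lo
    return to_delete

# Made-up data: two pairs of files share a timestamp.
demo = {100.0: ["a.sql.xz", "b.sql.xz"],
        50.0: ["c.sql.xz"],
        10.0: ["d.sql.xz", "e.sql.xz"]}
print(pick_deletions(demo, now=120, points_ts=[40, 0]))
# prints ['a.sql.xz', 'b.sql.xz', 'e.sql.xz']
```

On the same made-up data, the script as posted would delete only c.sql.xz and keep the other four files, since each pair travels together.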

All the files.

(This post was first published on 20221223.)


