[Math][Python] Trimming backups with logarithmic scale, part II

Previously. This post is like a prequel.

This is a real story. Each day, a server dumps a MySQL database into a file:

...
-rw-rw-r-- 1 i i 6697136 Dec  1 00:00 mysql_dump_1669849201.sql.xz
-rw-rw-r-- 1 i i 6730080 Dec  2 00:00 mysql_dump_1669935601.sql.xz
-rw-rw-r-- 1 i i 6762716 Dec  3 00:00 mysql_dump_1670022001.sql.xz
-rw-rw-r-- 1 i i 6603604 Dec  4 00:00 mysql_dump_1670108401.sql.xz
-rw-rw-r-- 1 i i 6590036 Dec  5 00:00 mysql_dump_1670194801.sql.xz
-rw-rw-r-- 1 i i 6639448 Dec  6 00:00 mysql_dump_1670281201.sql.xz
-rw-rw-r-- 1 i i 6673608 Dec  7 00:00 mysql_dump_1670367601.sql.xz
-rw-rw-r-- 1 i i 6701520 Dec  8 00:00 mysql_dump_1670454001.sql.xz
...

But again, I don't need all of them (they are only there in case of disaster). A logarithmic scale can help here as well, as it did with ZFS snapshots.

Here is a general-purpose utility, written in Python, for logarithmic trimming:

#!/usr/bin/env python3

import sys, os
import math, time, datetime

dry_run=True

def get_files_list():
    global dry_run
    rt={}
    for f in sys.argv[1:]:
        if f=="--commit":
            dry_run=False
        else:
            TS=os.path.getmtime(f)
            if TS not in rt:
                rt[TS]=[f]
            else:
                rt[TS].append(f)
    return rt

files=get_files_list()

if len(files)==0:
    print ("Usage: ./logtrim.py [--commit] filemask")
    print ("By default, it runs in dry-run mode. No files get deleted.")
    print ("Add --commit to actually delete files.")
    sys.exit(1)

# These parameters are to be tuned if you want different logarithmic 'curve'...
points=sorted(list(set([math.floor(1.09**x) for x in range(1,120+1)])))

# points in hours
#print (points)

now=math.floor(time.time())

# points in UNIX timestamps
SECONDS_IN_HOUR=60*60
points_TS=sorted(list(map(lambda x: now-x*SECONDS_IN_HOUR, points)), reverse=True)
points_TS.append(0) # catch-all lower bound: files older than the oldest point fall into one last range

prev=now

# we are going to keep only one file in each range
# the file could be picked randomly, or just be the first/last one
# if there is only one file in the range, leave it
for p in points_TS:
    print ("range", prev, p, datetime.datetime.fromtimestamp(prev), datetime.datetime.fromtimestamp(p))
    range_hi=prev
    range_lo=p
    print ("files between:")
    files_between={}
    for s in files:
        # half-closed interval:
        if s>range_lo and s<=range_hi:
            print (s, files[s])
            files_between[s]=files[s]
    print ("files_between total:", len(files_between))
    if len(files_between)>1:
        # sort by timestamp, so that the kept entry doesn't depend on argument order
        files_between_vals=[files_between[s] for s in sorted(files_between)]
        # going to remove all files except the first (oldest) one
        print ("keeping this file(s):", files_between_vals[0])
        for to_kill in files_between_vals[1:]:
            print ("removing this file(s):", to_kill)
            if not dry_run:
                for f in to_kill:
                    os.unlink(f)
    prev=p

if dry_run:
    print ("No files deleted.")
    print ("Add --commit to actually delete files.")
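
To get a feel for the curve that the `points` line produces, the snippet below recomputes it in isolation. It is only an illustrative sketch: the helper name `log_points` and its `base`/`steps` parameters are hypothetical and not part of logtrim.py; the constants 1.09 and 120 are taken from the script.

```python
import math

def log_points(base=1.09, steps=120):
    """Distinct hour offsets floor(base**x) for x = 1..steps, ascending.

    Hypothetical helper mirroring the 'points' line of the script.
    """
    return sorted({math.floor(base ** x) for x in range(1, steps + 1)})

points = log_points()
# width of each range in hours: distance between neighbouring boundaries
widths = [b - a for a, b in zip(points, points[1:])]

print("number of boundaries:", len(points))
print("first boundaries (hours):", points[:10])
print("narrowest range (hours):", min(widths))
print("widest range (days):", max(widths) / 24)
print("oldest boundary (days):", points[-1] / 24)
```

With the script's constants, the first boundaries are one hour apart and the ranges then widen with age, reaching roughly three and a half years back. A larger base should give wider ranges, and so fewer kept files over the same period.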

Let's run it on my list of MySQL dump files:

 % ./logtrim.py testdata/*
...
range 1668748740 1668478740 2022-11-18 07:19:00 2022-11-15 04:19:00
files between:
1668549600.0 ['testdata/mysql_dump_1668553201.sql.xz']
1668636000.0 ['testdata/mysql_dump_1668639601.sql.xz']
1668722400.0 ['testdata/mysql_dump_1668726001.sql.xz']
files_between total: 3
keeping this file(s): ['testdata/mysql_dump_1668553201.sql.xz']
removing this file(s): ['testdata/mysql_dump_1668639601.sql.xz']
removing this file(s): ['testdata/mysql_dump_1668726001.sql.xz']
range 1668478740 1668187140 2022-11-15 04:19:00 2022-11-11 19:19:00
files between:
1668204000.0 ['testdata/mysql_dump_1668207601.sql.xz']
1668290400.0 ['testdata/mysql_dump_1668294001.sql.xz']
1668376800.0 ['testdata/mysql_dump_1668380401.sql.xz']
1668463200.0 ['testdata/mysql_dump_1668466801.sql.xz']
files_between total: 4
keeping this file(s): ['testdata/mysql_dump_1668207601.sql.xz']
removing this file(s): ['testdata/mysql_dump_1668294001.sql.xz']
removing this file(s): ['testdata/mysql_dump_1668380401.sql.xz']
removing this file(s): ['testdata/mysql_dump_1668466801.sql.xz']
...
No files deleted.
Add --commit to actually delete files.

...

 % ./logtrim.py --commit testdata/*

... 

Here is the list of files after trimming. Isn't it neat?

-rw-rw-r-- 1 i i  584695 Jun  7  2022 mysql_dump_1654552801.sql
-rw-rw-r-- 1 i i  319376 Jun 13  2022 mysql_dump_1655071201.sql.xz
-rw-rw-r-- 1 i i  742012 Jun 29 00:00 mysql_dump_1656453601.sql.xz
-rw-rw-r-- 1 i i 1063884 Jul 13 00:00 mysql_dump_1657663201.sql.xz
-rw-rw-r-- 1 i i 1929164 Jul 27 00:00 mysql_dump_1658872802.sql.xz
-rw-rw-r-- 1 i i 2401192 Aug  8 00:00 mysql_dump_1659909601.sql.xz
-rw-rw-r-- 1 i i 2311372 Aug 19 00:00 mysql_dump_1660860001.sql.xz
-rw-rw-r-- 1 i i 2860008 Aug 30 00:00 mysql_dump_1661810402.sql.xz
-rw-rw-r-- 1 i i 3294004 Sep  8 00:00 mysql_dump_1662588001.sql.xz
-rw-rw-r-- 1 i i 3366360 Sep 17 00:00 mysql_dump_1663365601.sql.xz
-rw-rw-r-- 1 i i 3914516 Sep 25 00:00 mysql_dump_1664056801.sql.xz
-rw-rw-r-- 1 i i 3986248 Oct  3 00:00 mysql_dump_1664748001.sql.xz
-rw-rw-r-- 1 i i 4183152 Oct  9 00:00 mysql_dump_1665266401.sql.xz
-rw-rw-r-- 1 i i 4466500 Oct 16 00:00 mysql_dump_1665871201.sql.xz
-rw-rw-r-- 1 i i 4380092 Oct 21 00:00 mysql_dump_1666303201.sql.xz
-rw-rw-r-- 1 i i 4906184 Oct 26 00:00 mysql_dump_1666735201.sql.xz
-rw-rw-r-- 1 i i 4877932 Oct 31 00:00 mysql_dump_1667170801.sql.xz
-rw-rw-r-- 1 i i 5012264 Nov  5 00:00 mysql_dump_1667602801.sql.xz
-rw-rw-r-- 1 i i 5151808 Nov  9 00:00 mysql_dump_1667948401.sql.xz
-rw-rw-r-- 1 i i 5088692 Nov 12 00:00 mysql_dump_1668207601.sql.xz
-rw-rw-r-- 1 i i 5286184 Nov 16 00:00 mysql_dump_1668553201.sql.xz
-rw-rw-r-- 1 i i 5196168 Nov 19 00:00 mysql_dump_1668812401.sql.xz
-rw-rw-r-- 1 i i 5290272 Nov 22 00:00 mysql_dump_1669071601.sql.xz
-rw-rw-r-- 1 i i 5340424 Nov 24 00:00 mysql_dump_1669244401.sql.xz
-rw-rw-r-- 1 i i 5692236 Nov 27 00:00 mysql_dump_1669503601.sql.xz
-rw-rw-r-- 1 i i 6463064 Nov 29 00:00 mysql_dump_1669676401.sql.xz
-rw-rw-r-- 1 i i 6697136 Dec  1 00:00 mysql_dump_1669849201.sql.xz
-rw-rw-r-- 1 i i 6762716 Dec  3 00:00 mysql_dump_1670022001.sql.xz
-rw-rw-r-- 1 i i 6603604 Dec  4 00:00 mysql_dump_1670108401.sql.xz
-rw-rw-r-- 1 i i 6639448 Dec  6 00:00 mysql_dump_1670281201.sql.xz
-rw-rw-r-- 1 i i 6673608 Dec  7 00:00 mysql_dump_1670367601.sql.xz
-rw-rw-r-- 1 i i 6729428 Dec  9 00:00 mysql_dump_1670540401.sql.xz
-rw-rw-r-- 1 i i 6755784 Dec 10 00:00 mysql_dump_1670626801.sql.xz
-rw-rw-r-- 1 i i 6786088 Dec 11 00:00 mysql_dump_1670713201.sql.xz
-rw-rw-r-- 1 i i 6838348 Dec 12 00:00 mysql_dump_1670799601.sql.xz
-rw-rw-r-- 1 i i 6873036 Dec 13 00:00 mysql_dump_1670886001.sql.xz
-rw-rw-r-- 1 i i 6801340 Dec 14 00:00 mysql_dump_1670972401.sql.xz
-rw-rw-r-- 1 i i 6832060 Dec 15 00:00 mysql_dump_1671058802.sql.xz
-rw-rw-r-- 1 i i 6944328 Dec 16 00:00 mysql_dump_1671145201.sql.xz
-rw-rw-r-- 1 i i 7102432 Dec 17 00:00 mysql_dump_1671231601.sql.xz
-rw-rw-r-- 1 i i 6967316 Dec 18 00:00 mysql_dump_1671318001.sql.xz
-rw-rw-r-- 1 i i 6992008 Dec 19 00:00 mysql_dump_1671404401.sql.xz
-rw-rw-r-- 1 i i 7018544 Dec 20 00:00 mysql_dump_1671490801.sql.xz
-rw-rw-r-- 1 i i 7047548 Dec 21 00:00 mysql_dump_1671577201.sql.xz
-rw-rw-r-- 1 i i 7272416 Dec 22 00:00 mysql_dump_1671663601.sql.xz
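
The thinning can be checked with a bit of arithmetic on the listing above. The sketch below assumes that the number in each file name is the UNIX time of the dump (an assumption on my side; the script itself looks at the modification time, not the name). The helper `gaps_in_days` is hypothetical. It takes the five oldest and the five newest names from the listing and prints the spacing between neighbours.

```python
SECONDS_IN_DAY = 24 * 60 * 60

def gaps_in_days(timestamps):
    """Spacing between consecutive timestamps, in days, rounded to 0.1."""
    ordered = sorted(timestamps)
    return [round((b - a) / SECONDS_IN_DAY, 1)
            for a, b in zip(ordered, ordered[1:])]

# Timestamps copied from the five oldest and five newest file names above.
oldest = [1654552801, 1655071201, 1656453601, 1657663201, 1658872802]
newest = [1671318001, 1671404401, 1671490801, 1671577201, 1671663601]

print("oldest gaps (days):", gaps_in_days(oldest))
print("newest gaps (days):", gaps_in_days(newest))
```

The newest dumps are one day apart, while the oldest ones are 6 to 16 days apart.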

You can run logtrim.py as a cron job.

BUG: Several files with the same modification timestamp are treated as a single entry.
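
One possible way around this bug is to treat every file as its own (mtime, path) entry instead of grouping paths by timestamp. The following is a minimal sketch under that assumption, not the fix used in logtrim.py: the names `range_boundaries` and `select_for_removal` are hypothetical, and the function only returns the paths to delete, without touching the disk.

```python
import math

SECONDS_IN_HOUR = 60 * 60

def range_boundaries(now, base=1.09, steps=120):
    """Descending UNIX-time boundaries, ending with 0 as a catch-all."""
    hours = sorted({math.floor(base ** x) for x in range(1, steps + 1)})
    return [now - h * SECONDS_IN_HOUR for h in hours] + [0]

def select_for_removal(entries, now):
    """entries: iterable of (mtime, path) pairs. Returns paths to delete.

    Each file is an individual entry, so files sharing an mtime are no
    longer counted as one. In every half-closed range (lo, hi] the oldest
    file is kept (ties broken by path) and the rest are returned.
    """
    ordered = sorted(entries)  # oldest first, ties ordered by path
    doomed = []
    hi = now
    for lo in range_boundaries(now):
        in_range = [e for e in ordered if lo < e[0] <= hi]
        doomed.extend(path for _, path in in_range[1:])
        hi = lo
    return doomed

# Two files with an identical mtime: exactly one of them survives.
now = 1_000_000_000  # arbitrary reference time for the example
same_time = now - 10 * SECONDS_IN_HOUR
print(select_for_removal([(same_time, "a.sql.xz"), (same_time, "b.sql.xz")], now))
```

In a real run, `entries` could be built as `[(os.path.getmtime(f), f) for f in paths]`, and the returned paths passed to `os.unlink()` only when `--commit` is given.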

All the files.

UPD: I use this utility to trim the list of old versions of my books. 1, 2, 3.

(This post was first published on 20221223.)

