Optimized Archiving of Historical Time-Series Data for Analytics

The (forthcoming) Aptise platform features a web-indexer that tabulates a lot of statistical data each day.

While we don’t currently need this data, keeping it in the “live” database is not really a good idea – it’s never used and just becomes a growing stack of general crap that automatically gets backed-up and archived every day as part of standard operations. We will eventually need/want this data, but it’s best to keep things lean and move it out-of-sight and out-of-mind.

An example of this type of data is the daily calculation on the relevance of each given topic to each domain that is tracked. We’re primarily concerned with conveying the current relevance, but in the future we will want to address historical trends.

Let’s look at the basic storage requirements:

Column      Format    Size
date        Date      8 bytes
domain_id   Integer   4 bytes
topic_id    Integer   4 bytes
count       Integer   4 bytes

Each row takes 20 bytes, 8 of which are due to the date alone.

Storing this in PostgreSQL across thousands of domains and topics adds up to a significant amount of data. And this is only one of many similar tables that hold primarily archival data.

An easy way to save some space for archiving purposes is to segment the data by date, and move the storage of date information from a row and into the organizational structure.

If we were to keep everything in Postgres, we would create an explicit table for the date. For example:

CREATE TABLE report_table_a__20160101 (
    domain_id INT NOT NULL,
    topic_id INT NOT NULL,
    count INT NOT NULL DEFAULT 0
);

This stores the same data, but instead of repeating the date in every row, we have it once in the table name. Each row now needs only 12 bytes instead of 20, a nearly 40% reduction in size already.

In our use-case, we don’t want this data in PostgreSQL at all, and even if we did, having hundreds of date-specific tables would be overkill. So we’re going to export the data into a new file.

SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01';

And we’re going to save that into a comma delimited file:

table_a-20160101.csv

I skipped a lot of steps because I do this in Python — for reasons I’m about to explain.
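
Roughly, the skipped steps look like this (a minimal sketch; psycopg2 and the connection details are my assumptions here, and any DB-API driver would work the same way):

import csv
import psycopg2  # assumed driver; any DB-API adapter would do

# placeholder connection settings
conn = psycopg2.connect("dbname=aptise")

with conn.cursor() as cursor, open("table_a-20160101.csv", "w", newline="") as f:
    cursor.execute(
        "SELECT domain_id, topic_id, count FROM table_a WHERE date = %s",
        ("2016-01-01",),
    )
    csv.writer(f).writerows(cursor.fetchall())

conn.close()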

As a raw csv, my date-specific table is still pretty large at 7556804 bytes.

Let’s consider ways to compress it:

Using standard zip compression, we can drop that down to 2985257 bytes. That’s not very good.
If we use xz compression, it drops to 2362719 bytes, which is slightly better.
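
Both of those numbers are easy to reproduce with Python’s standard library if you want to follow along (a rough sketch, operating on the csv file from above):

import lzma
import zipfile

# standard zip (deflate) compression
with zipfile.ZipFile("table_a-20160101.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("table_a-20160101.csv")

# xz compression via the lzma module
with open("table_a-20160101.csv", "rb") as f_in:
    with lzma.open("table_a-20160101.csv.xz", "wb") as f_out:
        f_out.write(f_in.read())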

We’ve already compressed the data to roughly 40% of its original size with a plain zip, but these numbers are just not very good. We’ve got to do better.

And we can.

We can do much better, and it’s actually pretty easy. All we need to do is understand compression algorithms a bit.

Generally speaking, compression algorithms look for repeating patterns. When we pulled the data out of PostgreSQL, the rows came back in an essentially arbitrary order, which gives the compressor very little to work with.

Let’s try to help the compression algorithm do its job, and try to give it better input through sorting:

SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01' ORDER BY domain_id ASC, topic_id ASC, count ASC;

As a raw csv, this sorted date-specific table is still exactly the same 7556804 bytes.

Look what happens when we try to compress it:

Using standard zip compression, we can drop that down to 1867502 bytes. That’s pretty good: we’re at 24.7% of the size of the raw file AND about 63% of the size of the non-sorted zip. That is a huge difference!
If we use xz compression, we drop down to 1280996 bytes. That’s even better, at just under 17%.

Getting down to 17% of the original size is honestly great, and remember: this is compressing data that is already 40% smaller because we shifted the date column out of the rows. If we measure against the estimated file size with the date column included, we’re actually at about 10% of the original. Wonderful.

I’m pretty happy with these numbers, but we can still do better — without much work at all.

As I said above, compression software looks for patterns. Although the sorting helped, we’re still a bit hindered because our data is in a “row” storage format:

1000,1000,1
1000,1000,2
1000,1000,3
1001,1000,1
1001,1000,2
1001,1000,3

There are lots of repeating patterns there, but not as many as if we used a “column” storage format:

1000,1000,1000,1001,1001,1001
1000,1000,1000,1000,1000,1000
1,2,3,1,2,3

This is the same data, but as you can see, identical and near-identical values now sit next to each other in long runs, which is much friendlier to the compressor.

Transposing an array of data from row to column (and back) is incredibly easy with Python and standard functions.

Let’s see what happens when I use transpose_to_csv (below) on the data from my csv file:

def transpose_to_csv(listed_data):
    """given a list of rows (`listed_data`), turn a row-store into a col-store
    (or vice versa) and return it as a csv-formatted string;
    reverse with `transpose_from_csv`"""
    # zip(*rows) groups the nth element of every row together, i.e. it transposes
    transposed = zip(*listed_data)
    lines = [','.join(str(item) for item in row) for row in transposed]
    return '\n'.join(lines)

def transpose_from_csv(string_data):
    """given a string of csv data, revert the output of `transpose_to_csv`"""
    destringed = [line.split(',') for line in string_data.split('\n')]
    # transpose back; wrap in `list` because `zip` returns an iterator on Python 3
    return [list(row) for row in zip(*destringed)]
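
A round trip through these helpers looks something like this (a sketch; the filename is the sorted export from above):

with open("table_a-20160101.csv") as f:
    rows = [line.split(",") for line in f.read().splitlines()]

# row-store -> column-store string, ready to be written out and compressed
col_store = transpose_to_csv(rows)
with open("table_a-20160101.transposed.csv", "w") as f:
    f.write(col_store)

# and back again when we eventually need the original rows
assert transpose_from_csv(col_store) == rows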

As a raw csv, my transposed file is still the exact same size at 7556804 bytes.

However, if I zip the file – it drops down to 1425585 bytes.

And if I use xz compression… I’m now down to 804960 bytes.

This is a HUGE savings without much work.

The raw data in PostgreSQL was probably about 12594673 bytes (estimated from the savings above).
Stripping out the date information and storing it in the filename produced a 7556804 byte csv file, roughly 60% of that size (a 40% savings).
Without thinking about compression, lazily “zipping” that file got us down to 2985257 bytes.
But when we thought about compression (sorting the input, transposing the data into a column store, and applying xz compression), we ended up with a file of 804960 bytes: 10.7% of the csv size and about 6.4% of the estimated size in PostgreSQL.

This considerably smaller amount of data can now be archived onto something like Amazon’s Glacier and worried about at a later date.
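
For example, pushing the finished archive into Glacier-class storage is only a few lines with boto3 (a sketch; the bucket and key names are placeholders, and routing through S3’s GLACIER storage class is just one of several ways to do it):

import boto3  # assumes AWS credentials are already configured

s3 = boto3.client("s3")
with open("table_a-20160101.csv.xz", "rb") as f:
    s3.put_object(
        Bucket="aptise-archive",          # placeholder bucket
        Key="table_a/20160101.csv.xz",    # placeholder key
        Body=f,
        StorageClass="GLACIER",
    )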

This may seem like a trivial amount of data to worry about, but keep in mind that this is a DAILY snapshot, and only one of several similar tables. At 12MB a day in PostgreSQL, one year of data takes over 4GB of space on a system that is treated as high-priority for data backups. This strategy turns that year of snapshots into under 300MB of information that can be warehoused on 3rd party systems. This saves us a lot of time and even more money! In our situation, this strategy is applied to multiple tables. Most importantly, the benefits cascade across our entire system as we free up space and resources.

Note: I purposefully wrote out the ASC on the sql sorting above, because the sort order (and column order) does actually factor into the compression ratio. On my dataset, this particular column and sort order worked the best — but that could change based on the underlying data.
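
If you want to check which ordering wins on your own data, a brute-force pass over the column permutations is cheap (a rough sketch that reuses transpose_to_csv and the rows list from above; note that it sorts the values as strings, which is close enough for comparing compressed sizes):

import lzma
from itertools import permutations

def xz_size(rows, column_order):
    """compressed size of `rows` when reordered, sorted, and transposed"""
    reordered = sorted([row[i] for i in column_order] for row in rows)
    transposed = transpose_to_csv(reordered)
    return len(lzma.compress(transposed.encode("utf-8")))

# try every column order; the smallest compressed output wins
for order in permutations(range(3)):
    print(order, xz_size(rows, order))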
