The (forthcoming) Aptise platform features a web-indexer that tabulates a lot of statistical data each day for domains across the global internet.
While we don’t currently need this data, we may need it in the future. Keeping it in the “live” database is not really a good idea – it’s never used and just becomes a growing stack of general crap that automatically gets backed up and archived every day as part of standard operations. It’s best to keep production lean and move this unnecessary data out of sight and out of mind.
An example of this type of data is a daily calculation of the relevance of each topic to each tracked domain. We’re primarily concerned with conveying the current relevance, but in the future we will address historical trends.
Let’s look at the basic storage requirements of this as a PostgreSQL table:
| Column | Format | Size |
| --------- | ------- | ------- |
| date | Date | 8 bytes |
| domain_id | Integer | 4 bytes |
| topic_id | Integer | 4 bytes |
| count | Integer | 4 bytes |
Each row takes 20 bytes, 8 of which go to the date alone (and that’s before PostgreSQL’s per-row overhead).
Storing this in PostgreSQL across thousands of domains and tags every day takes a significant chunk of storage space – and this is only one of many similar tables that primarily hold archival data.
An easy way to save some space for archiving purposes is to segment the data by date, moving the date out of each row and into the organizational structure itself.
If we were to keep everything in Postgres, we would create an explicit table for each date. For example:
CREATE TABLE report_table_a__20160101 (domain_id INT NOT NULL, topic_id INT NOT NULL, count INT NOT NULL DEFAULT 0);
| Column | Format | Size |
| --------- | ------- | ------- |
| domain_id | Integer | 4 bytes |
| topic_id | Integer | 4 bytes |
| count | Integer | 4 bytes |
This structure conceptually stores the same data, but instead of repeating the date in every row, we record it only once in the table’s name. This simple shift brings each row from 20 bytes down to 12 – a 40% reduction in size.
In our use-case, we don’t want to keep this in PostgreSQL because the extra data complicates automated backups and storage. Even if we wanted this data live, spreading it across hundreds of tables is overkill. So for now, we’re going to export the data from a single date into a new file.
SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01';
And we’re going to save that into a comma-delimited file:
table_a-20160101.csv
I skipped a lot of steps above because I do this in Python — for reasons I’m about to explain.
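The export itself looks roughly like the sketch below (a minimal version assuming a `psycopg2` connection and the table above; connection details, batching, and error handling are omitted):

```python
# Minimal export sketch - psycopg2 and the connection string are assumptions.
import csv

import psycopg2

conn = psycopg2.connect("dbname=aptise")  # hypothetical connection string

with conn, conn.cursor() as cursor:
    cursor.execute(
        "SELECT domain_id, topic_id, count FROM table_a WHERE date = %s",
        ("2016-01-01",),
    )
    rows = cursor.fetchall()

with open("table_a-20160101.csv", "w", newline="") as fh:
    csv.writer(fh).writerows(rows)
```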
As a raw csv file, my date-specific table is still pretty large at 7556804 bytes — so let’s consider ways to compress it.

Using standard zip compression, we can drop that down to 2985257 bytes. That’s ok, but not very good. If we use xz compression, it drops to a slightly better 2362719 bytes.
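For reference, these sizes can be reproduced from Python with the standard-library `zipfile` and `lzma` modules; a rough sketch follows (exact byte counts will vary with compressor versions and settings):

```python
# Compress the exported csv two ways and print the resulting sizes.
import lzma
import os
import zipfile

SOURCE = "table_a-20160101.csv"

# zip (DEFLATE)
with zipfile.ZipFile(SOURCE + ".zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(SOURCE)

# xz (LZMA)
with open(SOURCE, "rb") as fh, lzma.open(SOURCE + ".xz", "wb") as xz:
    xz.write(fh.read())

for path in (SOURCE, SOURCE + ".zip", SOURCE + ".xz"):
    print(path, os.path.getsize(path), "bytes")
```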
We’ve already shrunk the data by 40% just by eliminating the date column, so these numbers are a pretty decent overall improvement — but considering the type of data we’re storing, the compression is just not very good. We’ve got to do better — and we can.
We can do much better and actually it’s pretty easy (!). All we need to do is understand compression algorithms a bit.
Generally speaking, compression algorithms look for repeating patterns. When we pulled the data out of PostgreSQL, the rows came back in no particular order – to the compressor, it was essentially a stream of random numbers.
We can help the compression algorithm do its job by giving it better input. One way is through sorting:
SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01' ORDER BY domain_id ASC, topic_id ASC, count ASC;
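(If the rows have already been exported, the same ordering can also be applied in Python before writing the csv – a tiny sketch with made-up rows, not part of the pipeline above:)

```python
# Sort rows by (domain_id, topic_id, count), mirroring the ORDER BY clause.
from operator import itemgetter

rows = [(1001, 1000, 2), (1000, 1000, 3), (1000, 1000, 1)]  # toy data
rows.sort(key=itemgetter(0, 1, 2))
# rows == [(1000, 1000, 1), (1000, 1000, 3), (1001, 1000, 2)]
```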
As a raw csv, this sorted date-specific table is still the same exact 7556804 bytes.
Look what happens when we try to compress it:
Using standard zip compression, we can drop that down to 1867502 bytes. That’s pretty good – we’re at roughly 24.7% the size of the raw file AND about 63% the size of the non-sorted zip. That is a huge difference! If we use xz compression, we drop down to 1280996 bytes. That’s even better at 17%.

17% is honestly great, and remember — this is compressing data that is already 40% smaller because we shifted the date column out. If we consider what the filesize with the date column would be, we’re actually at about 10% of the original size. Wonderful.
I’m pretty happy with these numbers, but we can still do better — and without much more work.
As I said above, compression software looks for patterns. Although the sorting helped, we’re still a bit hindered because our data is in a “row” storage format. Consider this example:
1000,1000,1
1000,1000,2
1000,1000,3
1001,1000,1
1001,1000,2
1001,1000,3
There are lots of repeating patterns there, but not as many as if we represented the same information in a “column” storage format:
1000,1000,1000,1001,1001,1001
1000,1000,1000,1000,1000,1000
1,2,3,1,2,3
This is the same data, but as you can see it’s much more “machine” friendly – there are larger repeating patterns.
This transformation from row to column is an example of “transposing” an array of data; performing it (and reversing it) is incredibly easy with Python’s standard functions.
Let’s see what happens when I use `transpose_to_csv` below on the data from my csv file:
def transpose_to_csv(listed_data):
    """Given a list of rows `listed_data`, turn a row-store into a col-store
    (or vice versa) and return it as a csv string.
    Reverse with `transpose_from_csv`."""
    zipped = zip(*listed_data)
    lines = [','.join(str(i) for i in zipped_line) for zipped_line in zipped]
    return '\n'.join(lines)


def transpose_from_csv(string_data):
    """Given a string of csv data, revert the output of `transpose_to_csv`."""
    destringed = [line.split(',') for line in string_data.split('\n')]
    return list(zip(*destringed))
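For illustration, here’s a quick round trip through those two functions, using the toy rows from the row-store example above (not the real dataset):

```python
rows = [
    [1000, 1000, 1],
    [1000, 1000, 2],
    [1000, 1000, 3],
]

col_csv = transpose_to_csv(rows)
# col_csv == '1000,1000,1000\n1000,1000,1000\n1,2,3'

restored = transpose_from_csv(col_csv)
# restored == [('1000', '1000', '1'), ('1000', '1000', '2'), ('1000', '1000', '3')]
# (values come back as strings; the real pipeline would cast them as needed)
```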
As a raw csv, my transposed file is still the exact same size at 7556804 bytes.

However, if I zip the file – it drops down to 1425585 bytes.

And if I use xz compression… I’m now down to 804960 bytes.
This is a HUGE savings without much work.
The raw data in PostgreSQL was probably about 12594673 bytes (an estimate based on the savings – the original export was deleted).

Stripping out the date information and storing it in the filename generated a 7556804-byte csv file – a 40% savings.

Without thinking about compression, just lazily “zipping” the file created a 2985257-byte file.

But when we thought about compression – sorting the input, transposing that data into a column store, and applying xz compression – we ended up with a file of 804960 bytes: 10.7% of the csv size and 6.4% of the estimated size in PostgreSQL.
This considerably smaller amount of data can now be archived onto something like Amazon’s Glacier and worried about at a later date.
This may seem like a trivial amount of data to worry about, but keep in mind that this is a DAILY snapshot, and one of several tables. At roughly 12MB a day in PostgreSQL, one year of data takes over 4GB of space (12594673 bytes × 365 ≈ 4.6GB) on a system that is treated as high-priority for backups. This strategy turns that year of snapshots into under 300MB of information (804960 bytes × 365 ≈ 294MB) that can be warehoused on third-party systems. This saves us a lot of time and even more money! In our situation, this strategy is applied to multiple tables. Most importantly, the benefits cascade across our entire system as we free up space and resources.
These numbers could be improved further by finding an optimal sort order, or even by using custom compression algorithms (such as storing the deltas between columns, then compressing). This was written to illustrate a quick way to optimize archived data storage.
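As a sketch of that delta idea – here interpreted as delta-encoding consecutive values within each sorted column, which is my assumption rather than anything the pipeline above actually does – the transformation is tiny:

```python
# Delta-encode a sorted column: keep the first value, then the differences.
def delta_encode(values):
    previous = 0
    encoded = []
    for value in values:
        encoded.append(value - previous)
        previous = value
    return encoded


def delta_decode(deltas):
    """Reverse `delta_encode` by accumulating the differences."""
    total = 0
    decoded = []
    for delta in deltas:
        total += delta
        decoded.append(total)
    return decoded


# A sorted domain_id column like [1000, 1000, 1000, 1001, 1001, 1001]
# becomes [1000, 0, 0, 1, 0, 0] - mostly zeroes, which compress extremely well.
```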
The results:
| Format | Compression | Sorted? | Size (bytes) | % of csv+date |
| --- | --- | --- | --- | --- |
| csv+date | | | 12594673 | 100 |
| row csv | | | 7556804 | 60 |
| row csv | zip | | 2985257 | 23.7 |
| row csv | xz | | 2362719 | 18.8 |
| row csv | | Yes | 7556804 | 60 |
| row csv | zip | Yes | 1867502 | 14.8 |
| row csv | xz | Yes | 1280996 | 10.2 |
| col csv | | Yes | 7556804 | 60 |
| col csv | zip | Yes | 1425585 | 11.3 |
| col csv | xz | Yes | 804960 | 6.4 |
Note: I purposefully wrote out the ASC on the SQL sorting above, because the sort order (and column order) does factor into the compression ratio. On my dataset, this particular column and sort order worked best — but that could change based on the underlying data.
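If you want to hunt for a better ordering empirically, one crude approach (an illustration only, not part of the pipeline above) is to compress a sample of rows under each column permutation and keep the smallest result:

```python
# Try every column permutation on a sample of rows and measure the xz size.
import lzma
from itertools import permutations


def compressed_size(rows, column_order):
    reordered = sorted(tuple(row[i] for i in column_order) for row in rows)
    payload = '\n'.join(','.join(str(v) for v in row) for row in reordered)
    return len(lzma.compress(payload.encode('utf-8')))


rows = [(1000, 1000, 1), (1000, 1001, 2), (1001, 1000, 3)]  # toy sample
best_order = min(permutations(range(3)), key=lambda o: compressed_size(rows, o))
print(best_order)
```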