deleted dogpile forks

I previously maintained some forks of Mike Bayer’s excellent dogpile.cache package that enabled some aggressive caching techniques, additional logging functions, and fixes to unit tests on Redis.

dogpile.cache has since had some upstream changes, so I’ve decided to delete the forks. Most of the ‘problems’ I had have been solved, and compatibility will be better maintained via plugins.

Optimized Archiving of Historical Time-Series Data for Analytics

The (forthcoming) Aptise platform features a web-indexer that tabulates a lot of statistical data each day.

While we don’t currently need this data, keeping it in the “live” database is not really a good idea – it’s never used and just becomes a growing stack of general crap that automatically gets backed-up and archived every day as part of standard operations. We will eventually need/want this data, but it’s best to keep things lean and move it out-of-sight and out-of-mind.

An example of this type of data is the daily calculation of how relevant each topic is to each domain that is tracked. We’re primarily concerned with conveying the current relevance, but in the future we will want to address historical trends.

Let’s look at the basic storage requirements:

Column     Format   Size
date       Date     8 bytes
domain_id  Integer  4 bytes
topic_id   Integer  4 bytes
count      Integer  4 bytes

Each row is taking 20 bytes, 8 of which are due to date alone.

Storing this in PostgreSQL across thousands of domains and tags takes a significant chunk of data. This is only one of many similar tables that primarily have archival data.

An easy way to save some space for archiving purposes is to segment the data by date, and move the date out of each row and into the organizational structure.

If we were to keep everything in Postgres, we would create an explicit table for the date. For example:

CREATE TABLE report_table_a__20160101 (domain_id INT NOT NULL, topic_id INT NOT NULL, count INT NOT NULL DEFAULT 0);

This stores the same data, but instead of repeating the date in every row, we have it once on the table name. This will lead to a nearly 40% reduction in size already.

In our use-case, we don’t want this in PostgreSQL at all; and even if we did, having hundreds of date-specific tables would be overkill. So we’re going to export the data into a new file.

SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01';

And we’re going to save that into a comma delimited file:

table_a-20160101.csv

I skipped a lot of steps because I do this in Python — for reasons I’m about to explain.
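For reference, here’s a minimal sketch of that export step. It assumes psycopg2, a hypothetical connection string, and the table/column names used above; the real code does batching and error handling.

import csv
import psycopg2

# hypothetical DSN; the real connection settings live elsewhere
conn = psycopg2.connect("dbname=analytics")
with conn.cursor() as cur:
    cur.execute(
        "SELECT domain_id, topic_id, count FROM table_a WHERE date = %s",
        ("2016-01-01",),
    )
    # the cursor iterates as tuples, which csv.writer accepts directly
    with open("table_a-20160101.csv", "w", newline="") as f:
        csv.writer(f).writerows(cur)
conn.close()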

As a raw csv, my date-specific table is still pretty large at 7556804 bytes.

Let’s consider ways to compress it:

Using standard zip compression, we can drop that down to 2985257 bytes. That’s not very good.
If we use xz compression, it drops to 2362719 bytes, which is slightly better.
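The comparisons are easy to reproduce from Python. This is just a rough sketch: the sizes quoted here presumably reflect the command-line zip and xz tools, so the standard-library zlib and lzma modules will give slightly different byte counts, but the shape of the comparison is the same.

import lzma
import zlib

with open("table_a-20160101.csv", "rb") as f:
    raw = f.read()

print("raw: %d bytes" % len(raw))
print("zip: %d bytes" % len(zlib.compress(raw, 9)))         # deflate at max level
print("xz:  %d bytes" % len(lzma.compress(raw, preset=9)))  # xz container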

We’ve already compressed the data to about 40% of the original size, but these numbers are just not very good. We’ve got to do better.

And we can.

We can do much better, and it’s actually pretty easy. All we need to do is understand a bit about how compression algorithms work.

Generally speaking, compression algorithms look for repeating patterns. When we pulled the data out of PostgreSQL, the rows came back in an effectively random order.

Let’s try to help the compression algorithm do its job, and try to give it better input through sorting:

SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01' ORDER BY domain_id ASC, topic_id ASC, count ASC;

As a raw csv, this sorted date-specific table is still the same exact 7556804 bytes.

Look what happens when we try to compress it:

Using standard zip compression, we can drop that down to 1867502 bytes. That’s pretty good: we’re at roughly 25% of the size of the raw file AND about 60% of the size of the non-sorted zip. That is a huge difference!
If we use xz compression, we drop down to 1280996 bytes. That’s even better, at about 17%.

Compressing to 17% of the original size is honestly great, and remember: this is on data that is already 40% smaller because we shifted the date column out. If we consider what the filesize with the date column would be, we’re actually at about 10%. Wonderful.

I’m pretty happy with these numbers, but we can still do better — without much work at all.

As I said above, compression software looks for patterns. Although the sorting helped, we’re still a bit hindered because our data is in a “row” storage format:

1000,1000,1
1000,1000,2
1000,1000,3
1001,1000,1
1001,1000,2
1001,1000,3

There are lots of repeating patterns there, but not as many as if we used a “column” storage format:

1000,1000,1000,1001,1001,1001
1000,1000,1000,1000,1000,1000
1,2,3,1,2,3

This is the same data, but as you can see it’s much more “machine” friendly.

Transposing an array of data from row to column (and back) is incredibly easy with Python and standard functions.

Let’s see what happens when I use `transpose_to_csv` (below) on the data from my csv file:

def transpose_to_csv(listed_data):
    """given an array of `listed_data`, will turn a row-store into a
    col-store (or vice versa); reverse with `transpose_from_csv`"""
    zipped = zip(*listed_data)
    lines = [','.join(str(i) for i in zipped_line) for zipped_line in zipped]
    return '\n'.join(lines)

def transpose_from_csv(string_data):
    """given a string of csv data, will revert the output of `transpose_to_csv`"""
    destringed = [line.split(',') for line in string_data.split('\n')]
    # materialize the zip iterator so the result prints/reuses cleanly on Python 3
    return list(zip(*destringed))
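For example, a quick round trip on a tiny hypothetical sample:

rows = [
    (1000, 1000, 1),
    (1000, 1000, 2),
    (1001, 1000, 3),
]

col_store = transpose_to_csv(rows)
print(col_store)
# 1000,1000,1001
# 1000,1000,1000
# 1,2,3

# and back again; note the values come back as strings
print(transpose_from_csv(col_store))
# [('1000', '1000', '1'), ('1000', '1000', '2'), ('1001', '1000', '3')]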

As a raw csv, my transposed file is still the exact same size at 7556804 bytes.

However, if I zip the file – it drops down to 1425585 bytes.

And if I use xz compression… I’m now down to 804960 bytes.

This is a HUGE savings without much work.

* The raw data in Postgres was probably about 12594673 bytes (based on the savings).
* Stripping out the date information and storing it in the filename generated a csv file of 7556804 bytes, roughly 60% of that size (a 40% savings).
* Without thinking about compression, just lazily “zipping” the file created a 2985257 byte file.
* But when we thought about compression (sorting the input, transposing the data into a column store, and applying xz compression), we ended up with 804960 bytes: 10.7% of the csv size and 6.4% of the estimated size in PostgreSQL. The whole pipeline is sketched below.
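Roughly, the end-to-end version looks like this. It reuses transpose_to_csv from above; the connection string, table, and filenames are the same hypothetical placeholders as before.

import lzma
import psycopg2

day = "2016-01-01"
conn = psycopg2.connect("dbname=analytics")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute(
        "SELECT domain_id, topic_id, count FROM table_a "
        "WHERE date = %s "
        "ORDER BY domain_id ASC, topic_id ASC, count ASC",
        (day,),
    )
    rows = cur.fetchall()
conn.close()

# transpose to a column store, then write straight to an .xz file
col_store = transpose_to_csv(rows)
with lzma.open("table_a-%s.csv.xz" % day.replace("-", ""), "wb") as f:
    f.write(col_store.encode("utf-8"))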

This considerably smaller amount of data can now be archived onto something like Amazon’s Glacier and worried about at a later date.

This may seem like a trivial amount of data to worry about, but keep in mind that this is a DAILY snapshot, and only one of several similar tables. At 12MB a day in PostgreSQL, one year of data takes over 4GB of space on a system that gets high-priority backups. This strategy turns that year of snapshots into under 300MB of information that can be warehoused on third-party systems. That saves us a lot of time and even more money! In our situation, this strategy is applied to multiple tables, and the benefits cascade across our entire system as we free up space and resources.

Note: I purposefully wrote out the ASC on the SQL sort above, because the sort order (and column order) does factor into the compression ratio. On my dataset, this particular column and sort order worked best, but that could change based on the underlying data.

The Dangers of URL Shorteners

In preparation for the next release, I just did an audit of all the errors that Aptise/Cliquedin encountered while indexing content.

Out of a few million URLs, there were 35k “well formed” urls that couldn’t be parsed due to “critical errors”.

Most of these 35k errors are due to URL shorteners. A small percentage of them are from shorteners that have dead/broken links. A much larger percentage of them are from shorteners that just do not seem to perform well at all.

I hate saying bad things about any company, but speaking as a former publisher — I feel the need to talk bluntly about this type of stuff. After pulling out some initial findings, I ran more tests on the bad urls from multiple unrelated IP addresses to make sure I wasn’t being singled out for “suspicious” activity. Unfortunately, the behavior was consistent.

The worst offenders are:

* wp.me from WordPress
* ctx.ly from Adobe Social Marketing
* trib.al from SocialFlow when used with custom domains from Bitly

The SocialFlow + Bitly system was overrepresented because of the sheer number of clients and urls they handle, and I understand that may skew things, but they have some interesting architecture decisions that seem to be reliably bad for older content. While I would strongly recommend that people NOT use any of these url shortening services, I even more strongly recommend that people do not use SocialFlow’s system with a custom domain through Bitly. I really hate saying this, but the performance is beyond terrible; it’s totally unacceptable.

The basic issue across the board is this: these systems perform very well in redirecting people to newer content (they most likely have the mapping of newer unmasked urls in a cache), but they all start to stall when dealing with older content that probably needs a database query. I’m not talking about waiting a second or two for a redirect to happen; I’m consistently experiencing wait times from these systems that are reliably over 20 seconds long, and often longer. When using a desktop/mobile web browser, the requests reliably time out and the browser just gives up.

While wp.me and ctx.ly simply have a lag as they look up a url and redirect users to the target url, the SocialFlow + Bitly combo has a design element that works differently ( a quick way to time these hops is sketched after the list ):

1. Request a custom shortened url `http://m.example.com/12345`
2. The shortened url is actually served by Bitly, and you will wait x seconds for lookup + redirect
3. Redirected to a different short url on trib.al (SocialFlow) `http://trib.al/abcde`
4. Wait x seconds for second lookup + redirect
5. Redirected to end url `http://example.com/VERY+LONG+URL`
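You can see this for yourself by walking the redirect chain by hand and timing each hop. This is just a sketch; it assumes the `requests` library, and the short url is a placeholder:

import time
import requests

url = "http://m.example.com/12345"  # placeholder short url
while url:
    start = time.time()
    resp = requests.get(url, allow_redirects=False, timeout=60)
    elapsed = time.time() - start
    print("%s -> HTTP %s in %0.1fs" % (url, resp.status_code, elapsed))
    # follow the Location header manually, if any
    url = resp.headers.get("Location")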

All the shorteners could do a better job of speeding up access to older and “less popular” content. But if I were SocialFlow, I would remove Bitly from the equation and just offer custom domains myself. Their service isn’t cheap, and from the viewpoint of a publisher, wait times like this are totally unacceptable.

WARNING: Dreamhost's One-Click Install for Trac still has a security flaw

I first noticed this two years ago (http://www.destructuring.net/2012/03/23/dreamhost-ux-security-flaw/) and contacted Dreamhost. It has yet to be fixed.

If you create a “Private” subversion repository on Dreamhost ( username + password are required to view ) and then add a “One Click Install” of Trac to that private repository ( which is marked as “Private” in their installer ), the Trac instance does not have any security permissions. The entirety of your source code is readable through the Trac browser.

Here’s an illustration:

• Private SVN Repository – http://svn.2xlp.com/ExampleTracSvn/svn
• Default Trac Install – http://svn.2xlp.com/ExampleTracSvn/trac/browser/README.txt

While many people may want to have a publicly readable repo for ticketing, I think it’s safe to say that most people who use a “One Click Install” are not familiar enough with the intricacies of Trac to know about its permissions system.

If you’re affected, the easiest fix is to add a .htaccess file to your Trac directory.

A better fix is to get off Dreamhost’s One-Click Install entirely. The Trac One-Click-Install is a half-assed and terrible approach.

Dreamhost did something smart with their Subversion install. Your home directory has an `svn` subdirectory which contains some specific files for each subversion repo:

* RepoName/ ( the actual repo )
* RepoName.access (a .htaccess file for your repo )
* RepoName.passwd ( an htpasswd file for the repo’s access file )

It’s a very smart and elegant solution. The Trac install, however, is anything but.

1. Dreamhost installs one version of the Trac library in your home directory for each trac instance. If you have 5 tracs, you have 5 directories like `~/webroot_svn_2xlp_com_ExampleTracSvn_trac_trac`

2. Dreamhost installs the entire Trac environment and database in a web directory. The configuration files, the database, everything is available in an Apache-served directory, protected only by .htaccess files.

A better way to manage Trac is to create a `~/trac` subdirectory in your home folder and centralize all your Trac projects there. You can then use .htaccess files and symlinks to expose the relevant directories to the internet at large.

This will guard you against situations where an erroneous .htaccess file renders your contents as raw-data ( it happens ).

If you have more than one Trac installation, you would probably benefit from installing Trac as a Python library and having multiple projects reference it.

Removing phantom/zombie shortcuts from an OSX sidebar.

Every so often a file accidentally gets dropped onto my OSX sidebar. Usually I can command-click and `remove from sidebar` via the contextual menu. If the file has since been deleted, no contextual menu pops up.

I tried every trick posted on Apple’s forums to fix this; nothing worked.

Then I had an idea: what if I figured out what the file was, and then recreated that file? Would OSX let me delete it from the sidebar manually?

It did! FINALLY!

So here’s what did work on my 10.6.8 MacBook:

After a bit of `grep`ping, I found the offending entry in “~/Library/Preferences/com.apple.sidebarlists.plist”

It’s a binary plist, so it’s a pain to edit directly.

Looking at the contents via `more` or `vi`, I found the offending entry along with what looked to be a file path. There was a pointer to the file at “/Users/jvanasco/.Trash/NONEXISTANT_FILENAME”.
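( If you have Python 3 handy, its plistlib reads binary plists directly, which makes the hunting a bit easier. A rough sketch: )

import os
import plistlib
import pprint

path = os.path.expanduser("~/Library/Preferences/com.apple.sidebarlists.plist")
with open(path, "rb") as f:
    sidebar = plistlib.load(f)

# dump the whole structure and search the output for the offending filename
pprint.pprint(sidebar)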

This worked on first try:

1. `touch /Users/jvanasco/.Trash/NONEXISTANT_FILENAME`
2. restart Finder.app ( “Force Quit” via Command+Option+Esc, then select Finder )
3. select the bad item on the sidebar, command-click, and delete via the contextual menu item

A few dumb email marketing tips and tricks

A few months ago I got an email from my friend’s startup. It looked something like this:

[ Image: Apptentive email preview ]

I immediately saw a problem – the text to the right of the Subject line was the standard “crap” text that people dismiss. He wasn’t optimizing the email for increased open rates.

When I was at TDB, the legacy email systems had the same issues. When I upgraded vendors, I also had the templates updated to allow for a smarter email marketing strategy.

Instead of using the “If you can’t see this email” message at the top, we changed the text to sell in some of the other stories. Below you can see the difference during the changeover. The difference in the value proposition for opening the email is quite clear, and the updated format was very compelling.

[ Image: Daily Beast email previews ]

How to accomplish this? You just need to understand that most email programs ( Gmail, Apple’s iPhone/Mac Mail, etc. ) will pull the first few lines of text as a preview. You want to control that and use it.

Originally it looked like this:

[ Image: original header text ]

An influential person in the company didn’t like that text, so it had to be removed.

No worries; the fix was simple. I just had everything wrapped in a custom style that renders it essentially hidden:

Hidden Text Here

The open rates, which were already pretty high, increased noticeably. Several readers consciously noticed the format change, and wrote in with gratitude — it helped them not miss stories that mattered.

As a side note, the “nav bar” at the top of the email [Home…Books] disappeared in the second image because an influential person required that it be killed too. I strongly disagreed with that decision: it was a great source of click-throughs and a “best practice” taught to me by several of the top email marketers in the industry ( backed up with analytics, and reiterated each time ). The click-through rate dipped when those links left ( people did not click elsewhere instead ), but the overall numbers still increased because the open-rate improvements more than offset it.

A possible approach to NSA-proof Secure Email

Nobody likes the NSA, or anyone else, spying on their email activities.

Many secure email systems have shuttered because they were compelled by courts to invalidate their security, or feared that possibility.

Here are two popular models and their weaknesses:

1. *Send/Receive unencrypted email, encrypt on server.* This is what Lavabit did. The weakness is that the government is able to compel the provider to copy/clone/divert the unencrypted mail.

2. *Web based mail, encrypted in browser.* Hushmail did this. The weakness is that the government is able to compel them to deliver a compromised applet/library to the client-side browser.

In order for email to be reasonably secure, it requires a few key components:

1. *End-to-End Encryption.* Data must be encrypted by the Sender and decrypted by the receiver. Service providers should never be able to access a plaintext version or have the means to decrypt a message.
2. *Full Transparency of Code.* Email clients must publish their source code and libraries, so that any sort of backdoors can be identified. Without open-source and full transparency, Service Providers and Software Vendors can be compelled by courts to offer insecure products.
3. *Required Encryption.* Plaintext emails should not be deliverable at all. This would force widespread adoption of secure email, and keep ‘snoopable’ email off of email servers.

In order to achieve this, I suggest starting with some novel improvements to current email, as follows:

1. All emails should be encrypted via public-key cryptography within the email client.
– Outbound email is encrypted for the Recipient
– Drafts are encrypted for the Sender
– An email header contains a digest of the key used for encryption
2. SMTP should be extended with new commands and responses ( SSMTP ). Generally speaking:
– SMTP servers should reject unencrypted email
– SMTP servers should reject email with an unknown or expired Public Key

It would be trivial to extend an SMTP server like Exim with these commands. Many mail servers already connect to SQL or LDAP backends to find delivery information; extending that information with a list of VALID and INVALID public keys for each user is trivial.
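To make that concrete, here is a rough sketch of the two lookups involved: the client computing the key digest that travels with the message, and the server checking that digest against its per-user key table. The table contents and the response code for an unknown key are hypothetical placeholders; the point is the lookup, not the crypto library.

import hashlib

def public_key_digest(public_key_bytes):
    # the digest the client would send with the PUBLIC KEY: command
    return hashlib.sha256(public_key_bytes).hexdigest()

# hypothetical per-user key table; in practice this would live in the
# same SQL or LDAP backend that already holds delivery information
KEYS = {
    "alice@example.com": {
        "ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb": "valid",
    },
}

def check_public_key(rcpt, digest):
    # what the extended SMTP server would answer to PUBLIC KEY:
    status = KEYS.get(rcpt, {}).get(digest)
    if status == "valid":
        return "250 Ok"
    if status == "expired":
        return "541 EXPIRED KEY"
    return "541 UNKNOWN KEY"  # placeholder response for a key we have no record of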

A current SMTP example looks like this ( from http://en.wikipedia.org/wiki/SMTP )

S: 220 smtp.example.com ESMTP Postfix
C: HELO relay.example.org
S: 250 Hello relay.example.org, I am glad to meet you
C: MAIL FROM:<bob@example.org>
S: 250 Ok
C: RCPT TO:<alice@example.com>
S: 250 Ok
C: RCPT TO:<theboss@example.com>
S: 250 Ok
C: DATA
S: 354 End data with <CR><LF>.<CR><LF>
C: From: "Bob Example" <bob@example.org>
C: To: "Alice Example" <alice@example.com>
C: Cc: theboss@example.com
C: Date: Tue, 15 January 2008 16:02:43 -0500
C: Subject: Test message
C:
C: Hello Alice.
C: This is a test message with 5 header fields and 4 lines in the message body.
C: Your friend,
C: Bob
C: .
S: 250 Ok: queued as 12345
C: QUIT
S: 221 Bye
{The server closes the connection}

A more secure version would be:

S: 220 smtp.example.com SSMTP Postfix

note that Secure SMTP is required

C: HELO relay.example.org
S: 250 Hello relay.example.org, I am glad to meet you
C: MAIL FROM:<bob@example.org>
S: 250 Ok
C: RCPT TO:<alice@example.com>
S: 250 Ok
C: PUBLIC KEY: ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb
S: 250 Ok

A hash of the public key is provided. It is allowed.

Perhaps the key has expired?

C: PUBLIC KEY: ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb
S: 541 EXPIRED KEY

Assuming we have the key, send the message…

C: RCPT TO:<theboss@example.com>
S: 250 Ok
C: DATA
S: 354 End data with <CR><LF>.<CR><LF>
C: From: "Bob Example" <bob@example.org>
C: To: "Alice Example" <alice@example.com>
C: Date: Tue, 15 January 2008 16:02:43 -0500
C: Subject: Test message
C: <<ENCRYPTED PAYLOAD>>
C: .
S: 250 Ok: queued as 12345
C: QUIT
S: 221 Bye
{The server closes the connection}

The message itself is carried as an encrypted payload.

There are some caveats with this approach.

1- This greatly complicates multiple recipients, i.e. CC and BCC.

2- A new mail format might be needed. Currently, email headers contain a lot of sender/recipient information, as well as the subject line. In order to be fully secure, the Subject line needs to move into the encrypted payload. Other headers would need to either move or be stripped. At the least, something like this should exist:

MESSAGE
+ unencrypted headers
  - to
  - from
+ encrypted payload
  - headers
    . subject
    . etc
  - body

3- The distribution of Public Keys is troublesome. At first I thought servers could advertise public keys as a directory service, and that delivery failures could contain a “New Key” attachment. However, this creates the potential for a man-in-the-middle attack. For example, the NSA could compel an ISP to work with them by advertising an NSA key, decrypting and storing the message, then re-encrypting the message with the proper key. It would be possible to identify this with message signing, but one or more messages would already have been stolen.

Nevertheless, a secure method of distributing and verifying public keys would be needed.

There are complications and weaknesses in this approach, but I feel they are significantly smaller than other approaches I’ve seen lately.

In any event, I don’t think it’s possible to make the current SMTP system reasonably secure — I believe we need to change/extend it.

trouble installing psycopg2 on OSX?

I had some trouble with a new virtualenv — psycopg2 wouldn’t install.

I remembered going through this before, but couldn’t find my notes. I ended up fixing it before finding my notes ( which point to this StackOverflow question http://stackoverflow.com/questions/2088569/how-do-i-force-python-to-be-32-bit-on-snow-leopard-and-other-32-bit-64-bit-quest ), but I want to share this with others.

psycopg2 was showing compilation errors related to the PostgreSQL libraries:

> ld: warning: in /Library/PostgreSQL/8.4.5/lib/libpq.dylib, missing required architecture x86_64 in file

So I checked how that library was built:

$ file /Library/PostgreSQL/8.4.5/lib/

> /Library/PostgreSQL/8.4.5/lib/libpq.dylib: Mach-O universal binary with 2 architectures
> /Library/PostgreSQL/8.4.5/lib/libpq.dylib (for architecture ppc): Mach-O dynamically linked shared library
> /Library/PostgreSQL/8.4.5/lib/libpq.dylib (for architecture i386): Mach-O dynamically linked shared library i386

Crap. There’s no x86_64 slice; it’s only built for ppc and i386. The fix is easy, right? We just need to export ARCHFLAGS and build.

$ export ARCHFLAGS="-arch i386"
$ pip install --upgrade psycopg2

That works perfectly, right?

Wrong.

> File "/environments/example-2.7.5/lib/python2.7/site-packages/psycopg2/__init__.py", line 50, in <module>
>     from psycopg2._psycopg import BINARY, NUMBER, STRING, DATETIME, ROWID
> ImportError: dlopen(/environments/example-2.7.5/lib/python2.7/site-packages/psycopg2/_psycopg.so, 2): no suitable image found. Did find:
>     /environments/example-2.7.5/lib/python2.7/site-packages/psycopg2/_psycopg.so: mach-o, but wrong architecture

I was dumbfounded for a few seconds, then I realized: Python was trying to run 64-bit (x86_64), but I only have a 32-bit library.
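( A quick sanity check for which architecture the interpreter itself is running: )

import struct
import sys

# 64 on an x86_64 build of Python, 32 on an i386 build
print(struct.calcsize("P") * 8)
print(sys.maxsize > 2 ** 32)  # True on 64-bit builds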

The right fix? Rebuild PostgreSQL to support 64-bit.

My PostgreSQL was a prebuilt package, and I didn’t have time to fix that, so I needed to do a few hacky/janky things.

Basically, we’re going to force Python to run as i386 ( and not 64-bit ).

go to our virtualenv…

cd /environments/example-2.7.5/bin

back it up

cp python python-original

strip it…

# note that the last arg is the input, and the 2nd to last is the output
lipo -thin i386 -output python-i386 python

replace it

rm python
mv python-i386 python

now install psycopg2

export ARCHFLAGS="-arch i386"
pip install --upgrade psycopg2

Yay, this works!

Now get some work done, and save some time so you can build a 64-bit PostgreSQL.