Optimized Archiving of Historical Time-Series Data for Analytics

The (forthcoming) Aptise platform features a web-indexer that tabulates a lot of statistical data each day.

While we don’t currently need this data, keeping it in the “live” database is not really a good idea – it’s never used and just becomes a growing stack of general crap that automatically gets backed-up and archived every day as part of standard operations. We will eventually need/want this data, but it’s best to keep things lean and move it out-of-sight and out-of-mind.

An example of this type of data is the daily calculation on the relevance of each given topic to each domain that is tracked. We’re primarily concerned with conveying the current relevance, but in the future we will want to address historical trends.

Let’s look at the basic storage requirements:

| Column | Format | Size |
| --- | --- | --- |
| date | Date | 8 bytes |
| domain_id | Integer | 4 bytes |
| topic_id | Integer | 4 bytes |
| count | Integer | 4 bytes |

Each row takes 20 bytes, 8 of which are for the date alone.

Storing this in PostgreSQL across thousands of domains and tags takes a significant chunk of data. This is only one of many similar tables that primarily have archival data.

An easy way to save some space for archiving purposes is to segment the data by date, and move the storage of date information from a row and into the organizational structure.

If we were to keep everything in Postgres, we would create an explicit table for the date. For example:

CREATE TABLE report_table_a__20160101 (domain_id INT NOT NULL, topic_id INT NOT NULL, count INT NOT NULL DEFAULT 0);

This stores the same data, but instead of repeating the date in every row, we record it once in the table name. That alone leads to a nearly 40% reduction in size.

In our use-case, we don’t want this in PostgreSQL — and even if we did, having hundreds of tables would be overkill. So we’re going to export the data into a new file.

SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01';

And we’re going to save that into a comma delimited file:

table_a-20160101.csv

I skipped a lot of steps because I do this in Python — for reasons I’m about to explain.
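The skipped steps look roughly like this — a minimal sketch, assuming the rows have already been fetched with a database driver (the query/fetch itself is omitted, and the row values are hypothetical):

```python
import csv
import datetime

def export_day(rows, date, table="table_a"):
    """Write one day's rows to a date-stamped csv; the date lives in
    the filename rather than in the rows themselves."""
    filename = "%s-%s.csv" % (table, date.strftime("%Y%m%d"))
    with open(filename, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return filename

# hypothetical rows, as fetched by:
#   SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01';
rows = [(1000, 2000, 5), (1001, 2000, 3)]
export_day(rows, datetime.date(2016, 1, 1))  # writes table_a-20160101.csv
```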

As a raw csv, my date-specific table is still pretty large at 7556804 bytes.

Let’s consider ways to compress it:

– Using standard zip compression, we can drop that down to 2985257 bytes. That’s not very good.
– If we use xz compression, it drops to 2362719 bytes – slightly better.

We’ve already compressed the data to about 40% of its original size, but these numbers are just not very good. We’ve got to do better.
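The same comparison can be scripted from Python’s standard library — zlib for zip-style DEFLATE and lzma for xz. A quick sketch on made-up, highly repetitive data (not the real table):

```python
import lzma
import zlib

def compression_report(data):
    """Return (raw, deflate, xz) sizes in bytes for a blob."""
    return len(data), len(zlib.compress(data, 9)), len(lzma.compress(data))

# hypothetical sample: 5000 identical csv rows
sample = ("1000,1000,1\n" * 5000).encode()
raw, deflated, xzed = compression_report(sample)
```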

And we can.

We can do much better and actually it’s pretty easy. All we need to do is understand compression algorithms a bit.

Generally speaking, compression algorithms look for repeating patterns. When we pulled the data out of PostgreSQL, the rows came out in no particular order.

Let’s try to help the compression algorithm do its job, and try to give it better input through sorting:

SELECT domain_id, topic_id, count FROM table_a WHERE date = '2016-01-01' ORDER BY domain_id ASC, topic_id ASC, count ASC;

As a raw csv, this sorted date-specific table is still exactly the same 7556804 bytes.

Look what happens when we try to compress it:

– Using standard zip compression, we can drop that down to 1867502 bytes. That’s pretty good – we’re at 24.7% of the size of the raw file AND about 62% of the size of the non-sorted zip. That is a huge difference!
– If we use xz compression, we drop down to 1280996 bytes. That’s even better, at 17%.

A 17% compression ratio is honestly great, and remember — this is compressing data that is already 40% smaller because we shifted the date column out. If we consider what the filesize with the date column would be, we’re actually at about 10%. Wonderful.
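You can watch the sort do its work with a few lines of Python — synthetic rows with hypothetical id ranges, compressed with and without sorting:

```python
import lzma
import random

def to_csv_bytes(rows):
    """Render rows as csv text and encode to bytes for compression."""
    return "\n".join(",".join(str(v) for v in row) for row in rows).encode()

rng = random.Random(42)  # fixed seed so the demo is repeatable
rows = [(rng.randrange(200), rng.randrange(50), rng.randrange(10))
        for _ in range(20000)]

unsorted_size = len(lzma.compress(to_csv_bytes(rows)))
sorted_size = len(lzma.compress(to_csv_bytes(sorted(rows))))
# sorting puts identical prefixes on adjacent lines, so sorted_size
# comes out substantially smaller than unsorted_size
```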

I’m pretty happy with these numbers, but we can still do better — without much work at all.

As I said above, compression software looks for patterns. Although the sorting helped, we’re still a bit hindered because our data is in a “row” storage format:

1000,1000,1
1000,1000,2
1000,1000,3
1001,1000,1
1001,1000,2
1001,1000,3

There are lots of repeating patterns there, but not as many as if we used a “column” storage format:

1000,1000,1000,1001,1001,1001
1000,1000,1000,1000,1000,1000
1,2,3,1,2,3

This is the same data, but as you can see it’s much more “machine” friendly.

Transposing an array of data from row to column (and back) is incredibly easy with Python and standard functions.

Let’s see what happens when I use transpose_to_csv below on the data from my csv file

def transpose_to_csv(listed_data):
    """Given an array of rows in `listed_data`, turn a row-store into a
    col-store (or vice versa) and return it as a csv-formatted string.
    Reverse with `transpose_from_csv`."""
    # zip(*rows) transposes the data; each output line is one former column
    zipped = zip(*listed_data)
    lines = [','.join(str(i) for i in zipped_line) for zipped_line in zipped]
    return '\n'.join(lines)

def transpose_from_csv(string_data):
    """Given a string of csv data, revert the output of `transpose_to_csv`.
    Note: the values come back as a list of tuples of strings."""
    destringed = [line.split(',') for line in string_data.split('\n')]
    # zip() is lazy in Python 3, so materialize the transposed rows
    return list(zip(*destringed))

As a raw csv, my transposed file is still the exact same size at 7556804 bytes.

However, if I zip the file – it drops down to 1425585 bytes.

And if I use xz compression… I’m now down to 804960 bytes.

This is a HUGE savings without much work.

The raw data in PostgreSQL was probably about 12594673 bytes (based on the savings).
Stripping out the date information and storing it in the filename generated a 7556804-byte csv file – a 40% savings.
Without thinking about compression, just lazily “zipping” the file created a 2985257-byte file.
But when we thought about compression – sorting the input, transposing the data into a column store, and applying xz compression – we ended up with a filesize of 804960 bytes: 10.7% of the csv size and 6.4% of the estimated size in PostgreSQL.
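The whole pipeline — sort, transpose to a column store, xz — fits in a few lines. Here is a self-contained sketch with its own transpose step and hypothetical rows:

```python
import lzma

def pack(rows):
    """sort -> column store -> csv text -> xz bytes"""
    columns = zip(*sorted(rows))
    text = "\n".join(",".join(str(v) for v in col) for col in columns)
    return lzma.compress(text.encode())

def unpack(blob):
    """xz bytes -> csv text -> rows of ints (in sorted order)"""
    lines = lzma.decompress(blob).decode().split("\n")
    columns = [line.split(",") for line in lines]
    return [tuple(int(v) for v in row) for row in zip(*columns)]

rows = [(1001, 1000, 2), (1000, 1000, 1), (1000, 1000, 3)]
assert unpack(pack(rows)) == sorted(rows)
```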

This considerably smaller amount of data can now be archived onto something like Amazon’s Glacier and worried about at a later date.

This may seem like a trivial amount of data to worry about, but keep in mind that this is a DAILY snapshot, and one of several tables. At 12MB a day in PostgreSQL, one year of data takes over 4GB of space on a system that is earmarked for high-priority data backups. This strategy turns that year of snapshots into under 300MB of information that can be warehoused on 3rd-party systems. This saves us a lot of time and even more money! In our situation, this strategy is applied to multiple tables. Most importantly, the benefits cascade across our entire system as we free up space and resources.

Note: I purposefully wrote out the ASC on the sql sorting above, because the sort order (and column order) does actually factor into the compression ratio. On my dataset, this particular column and sort order worked the best — but that could change based on the underlying data.

A Primer on Web-based Business P&L Operations & Optimization

# Overview

A lot of what I have been doing professionally over the past few years has involved the management of P&L sheets (Profit & Loss). For those unfamiliar with the term, P&L sheets are basically a management guidebook for charting how your organization, department, or product spends and generates revenue. Not only do they provide you with the figures that you need for annual reports, for budgeting, and for investment opportunities, but they also facilitate your understanding of your financial strengths and weaknesses, and suggest where you may be able to improve.

The first time that I had to deal with P&L sheets was during the infancy of FindMeOn. I was attempting to forecast corporate overhead and monthly / yearly expenses during our investment pitching, and this stuff was *completely* foreign to me. I resorted to recruiting a specialist onto the team to help me handle this P&L stuff. Unfortunately, this person was only a self-described ‘specialist’, and didn’t have the expertise that they had claimed — so the work fell on me to compute all of the Cost of Sales for overhead and fixed costs. The ‘specialist’ would then review and standardize the data, which was nothing more than a sign-off as ‘correct’. Potential investors were not impressed. Our spreadsheets were ultimately chicken scratches made by a madman (me!) with math that barely made sense to the untrained … and I didn’t know any better until a VC friend was kind enough to sit me down and give me a stern “talking to”.

For some insane reason, there isn’t much information out there for budding startup folks to reference on this topic.

Before continuing, I’d like to thank some friends for giving feedback on an early draft of this document:

– Avi Deitcher, founder of Atomic Inc, an operations / management consulting service. Aside from standard consulting, Avi routinely functions as an interim CIO, COO, or CTO for startups. He’s one of the few people that I know who recognizes that both Microsoft and Open Source products have their places, and who maintains a brilliant blog on operational strategy at http://blog.atomicinc.com/.
– Rick Webb, Partner and COO of The Barbarian Group, recently named Digital Agency of the Year … again. Brands, agencies, startups, and even Internet Week and the Webby’s turn to Rick for sage advice as the guru of all things online. Rick often blogs about smart stuff over at http://www.barbariangroup.com/employees/rick_webb.

A full P&L sheet for a web oriented company is a huge document, taking into account labor, taxes, overhead, infrastructure, vendors, multiple departments etc etc etc… to give a full summation of the company’s activities. To handle this data, I use an Excel document that has a half-dozen sheets with hundreds of columns and rows to account for everything a company spends / earns / generates. The document is never quite complete, nor is it easy to describe. Explaining a full P&L sheet in one article would be a daunting task, and likely overwhelming.

Instead, this article is designed as a primer — very limited and with a focus on the Cost of Goods Sold and Sales with relation to scalability issues on fixed cost items like bandwidth.

If you really are interested in knowing about P&Ls for tech companies in-depth, in the near future you’ll be treated to some articles on this subject by Avi. You might get a series of articles co-written by us as well — we’re currently working out the details.

Before I move forward, I want to admit to an ‘error’, as my friend Avi has pointed out to me: I conflate several concepts in this overview in my attempts to make this more understandable to the non-Business School crowd. I also gloss over certain important items to make this article more digestible. Specifically, those items are:

– Cost of Sales
– I focus on the cost of serving up Web content of any type in regards to bandwidth.
– I do not cover hosting related costs: power, software / cpu licenses, hardware, colocation fees, management overhead, engineering staff, etc.
– I do not cover corporate infrastructure costs: technologists who build / maintain your product, content creators for your product, a sales force, other employees, office overhead, etc.
– Optimization of Cost of Sales
– I limit myself to the sections that I cover above
– Revenues from Advertising business
– I introduce this topic because of how it and COS can affect product strategy; however, this may not be applicable to your particular business

Avi also mentioned a very important differentiation that people should keep in mind, which I’ll quote below:

> You should be discussing COS (cost of sales), which is what you use in service, and not COGS (cost of goods sold), which is what you use in products. Yes, I know, many people mix them up, but not the real experts. Allow it to be part of your education of the readers.

So … Avi nailed it, and I admit it — I constantly conflate COS and COGS. In my defense, it’s because the web projects that I work on tend to straddle the two – and we often might build verticals off the same infrastructure as a Service to institutional clients and a Product to individuals (or vice versa).

# Core Concepts

The approach that I will take with this article is to provide a quick and simple primer for people with little or no business background so that they are better able to understand the costs of operating a web based business. Understanding this stuff is *extremely* important to both startups and to established businesses alike:

– Costs are a function of scalability in userbase and in usage. It obviously costs more money to deliver your product to more consumers more times.
– Costs are a function of product design. Your product, at its heart, is really a User Experience. Depending on the way you have designed the product, the architecture, caching, etc, this User Experience that you want to deliver may be either cost effective or cost prohibitive.

Here’s an important rule for you to remember:
> _Your online property – whether it is a service, publication, network, or marketing exercise – is a product._

If you are able to better understand the costs related to your product, you may be able to alter the User Experience or the Product Design in order to make it more profitable. If your web-property is a service, you need to calculate how much you must charge clients in order to break even. In one of my proposed B2B projects at FindMeOn, we realized that a campaign that we wanted to launch with a non-profit was prohibitively expensive — the client couldn’t even cover the projected bandwidth costs, much less any break even or profit for our labor. In another B2C project I worked on, we realized that we would have to charge users a high monthly fee to be able to provide the service level that we wanted, which led us back to the drawing board to fine tune the product — in terms of web engineering, application design, and user experience / capabilities.

I want everyone reading this article to walk away with at least one more concept ingrained in their minds:

> _Understanding scaling costs is important, because you may have to rethink your Product Design and User Experience to meet Business Goals._

It does not matter if you don’t understand the technical intricacies of scaling costs for your product — you can always hire a specialist like me to assist you with that. You simply need to understand that beyond labor and servers and overhead required to produce your product, it costs money to deliver the final digital product to consumers — and those costs vary.

# Understanding Costs Through Real World Examples

I’ve always found that the most efficient methods to both learn and to teach are by example, so to illustrate these concepts I’ll utilize a few real world examples that I have encountered over recent months.

In the following examples, I *mostly* focus on making products more cost-effective on a per-page basis based on ad support. Let me repeat this – two of the examples focus on ad supported business models, and in both I refrain from discussing types of ad optimization / targeting strategies. One could easily offset high costs with very efficient advertising or charge-models that allow for a product with a more resource-intensive user experience. Similarly, these examples are based on ad supported revenue as a concern — and note the word concern. This does not mean that the ads are the only revenue, nor that they are profitable or even cover the associated costs — this simply means that it costs money to deliver content, and this cost is at least partially offset through the display of ads.

## Example 1 – A Publishing Oriented Website

A publishing oriented website wanted to deliver a rich user experience through an innovative and rewarding site design that featured lots of high quality images. I was brought in after the prototype was built to their original specs, and I took the prototype for a test drive to derive potential usage patterns and formulate their COGS. What I discovered was not good for their bottom line.

The original version of the site followed a fairly standard paradigm in web publishing: there were 10 articles on the front page, with each one consisting of a block of text and one or more images. What wasn’t standard was the pageload — the articles’ aggregate content and images ended up at around 1.3MB of bandwidth, and the site’s HTML, CSS, JavaScript and associated images were an additional 600k.

The site’s current revenue stream was advertising based, and there were 2 IAB standard ad blocks — that meant that every unique visit to their website cost them an initial 1.9MB in bandwidth, but only generated revenue from displaying two ad units.

Similarly, the performance for repeat users was equally disappointing. Repeat visits usually receive a discount for cached files – html pages, support files (js / css), images or flash media – however, in this instance, the site integrated new content in such a way that caching was not optimized. I calculated that someone who visits their website after an update would incur 700kb of bandwidth and only be exposed to 2 ads on their front page.

By understanding the associated costs and business interests, I was able to work with their product design team to reshape the User Experience (UX) and the site-design into something that was more cost-effective – all while maintaining a positive user experience.

– The initial pageload was dropped down to a total of 400k
– Repeat pageloads were dropped down to approximately 75k (after content update)
– ‘Full Story’ content was replaced by a mix of ‘Full Story’ with Teaser / Excerpt content — designed to entice people to ‘click for more’ and create new ad impressions through focus pages.
– Images were tiered to include previews on the front page, medium resolution photos within the focus pages, and a new class of image-focus pages that were ad supported and included the hi-res photo.
– The site was slightly redesigned and optimized to lower the size of support files to 250k

## Example 2 – A Niche Social Network with Rich Media

A pre-beta niche Social Network included the usual mix of social networking fare (profiles, friends, messages) along with some premium content (high resolution photos & videos, etc.) for their target demographic.

Social Networking projects can be difficult to balance because of the nature of the experience and advertising that they provide — different classifications of pages generate different usage patterns, bandwidth needs, and advertising rates. As a very basic overview: Home pages, Profile pages, and published content garner significantly higher rates, as they demand more of a user’s attention; “messaging” functionality, modal pages, and gaming/application pages simply don’t monetize well, as they don’t foster user attention – people tend to tune out the ads.

To address this company’s needs, I worked with the product team to achieve the following goals:

– Allowed for higher quality images and a richer user experience on the ‘Tier 1’ advertising pages (profiles and published content)
– Dropped the modal pages to the simplest html possible, while maintaining good design / UX. Migrated as much of the css / js / etc. into external files; aggressively cached and recycled this content onto the function / modal pages.
– Limited the amount of AJAX used. While it affords an amazing UX, AJAX also (typically) means “interactions without ads”. By reviewing and selectively limiting AJAX usage, we were able to drive more Ad impressions without hurting the User Experience.
– Determined ideal video sizes (in terms of bandwidth) to balance revenue goals against. Used interstitial ads and segments to handle videos in excess of this number.
– Decided which products and functionalities could be used as loss-leaders to maintain brand loyalty, and which could not be; set parameters for performance review.

Through simple techniques like those mentioned above, we were able to make the network’s operations incredibly more cost-effective — making their P&L sheets more profitable.

## Example 3 – A Social Media Campaign for a Brand

Not all web properties are business entities — many are produced as Marketing / ROI campaigns for brands. Because (micro)sites like these rarely have any sort of revenue stream and exist mostly as an advertising expense for brand loyalty, they’re a bit of an anomaly in terms of streamlining costs.

In instances such as this, I work with the Creative Team (Digital / Production Agency & Advertising Agency) and the Brand’s Internal Marketing Team to clearly define a User Story that best meets the brand’s goals (ie: what is the user’s takeaway / brand impact), to list out any ROI indicators, and to define a User Story that best meets the Creative Team’s user experience goals.

Many people assume that the brand’s Creative Team and Internal Marketing Team want the same thing — that which is best for the brand. This is really a half truth — while they are both indeed focused on what is best for the brand, the Creative Team almost always thinks about this within the scope of the current Campaign / Project, while the brand’s internal marketing team is also thinking about long term growth and how the current campaign or medium ties into other campaigns or mediums that are simultaneously being marketed to users.

When streamlining a branded social media campaign, you need to first know what the brand’s own goals are (how many people they want to reach, what is the messaging) and then what the Creative Team’s user experience goals are. You are then able to create a technical user story, map it to a clickpath, calculate the cost of each impression, and suggest optimizations that won’t derail the Creative Team’s UX and Campaign while simultaneously meeting the brand’s goals.

# Calculating Costs

To calculate costs for a web-based product, I break things down into 3 phases:

– Define Typical User Paths
– Profile User Path Elements
– Input everything into generic spreadsheet software that can handle multiple worksheets which reference one another.

## Define Typical User Paths

As a start, let’s clearly define what I mean by the term user path:

> A user path (in this context) refers to how a typical user will interact with your web property during an engagement; it is the finite list of pages they encounter in a single visit.

The word / phrase userpath can refer to many things in different contexts – this is what it refers to within the scope of this article.

Note how I said typical user. If you’re a startup founder, I may be about to break your heart…

The internet isn’t a Field of Dreams. While it’s nice to be hopeful that “If you build it, they will come” – you need to realistically know that people won’t be flocking to you in droves like you’re Kevin Costner with a mystical ballpark and the Children of the Corn. Users are going to trickle in, and more importantly they’re not going to have the usage patterns that you either want or expect.

If you’re reading this article, here’s a surprise — you’re not a typical internet user. (What a great self-selecting audience I have here!). You may be a “Power User”, a “Proficient” user, or some other marketing-speak styled distinction about someone who works in digital media or uses the internet as their profession. Regardless, it doesn’t matter what you *are* … it matters what you are *not*, and you are not typical nor emblematic of internet users as a whole. You think of, use and interact with websites very differently than most – and may have problems with grasping typical user patterns. Let’s try to fix that.

A realistic goal for user experience is this:
– A user will visit your website 3 times a month
– Each one of these visits includes viewing 3 pages

(it is realistic, with a good product, to achieve 5 visits of 9 pages each, but that depends entirely upon the product – these numbers are a safer starting point)

Unless you’re Facebook, Twitter, or some other “Crack App” of the moment, these numbers will apply to more than 90% of your users. I’ve rarely found more than 6% of site visitors to be ‘active’ users for the measured period — and keep in mind that even the most successful web services tout their active memberships to be only about 25% of their registration base.

Fanatical users are a limited bunch — and while your application may be designed for their experience, most of your web traffic will come from casual users.

As a rule of thumb, I aim for numbers that represent the 80th percentile of users — ie, the common experience shared by 80% of your users.

It’s really hard believing numbers like this. I still don’t entirely… but I’m “in the industry” and think of technology differently than most others. I cannot imagine not checking into any of my major apps a few times a day — and I’m a “member” of dozens more websites than non-industry friends. But I also admit that I too have that old Friendster account that I use for 10 minutes a few times a year, a Perlmonks account that gets me for a few 3-page visits when troubleshooting code each month, and I even jump onto MySpace for a few minutes every now and then. At times I’ve been a fanatical user of each of those services — now I barely think of them in passing.

Truth be told, that’s how most people are… and for every person that reloads your website 100 times in a single day, there are dozens that don’t get past the first page.

If you want hard data on this, I suggest the following:

1. Look at your own server logs
2. Look at any Quantcast / Compete / Alexa score

You’ll see that the numbers come out the same. You’ll likely also see another characterization — I’ve found that there is a sharp transition between the casual user and the diehard user. It’s fairly amazing … people either love your product or they are ambivalent.

Getting back to the userpath…

With realistic numbers in mind, you need to create some sample clickpaths that entice those 80% of your users to interact with your site. If you’re feeling fancy, make another userpath for the powerusers that you hope to attract. Generating the userpaths is dead-simple — you just click around your site, assuming that you visited the index page, or got to it from a search engine / ad — and try to figure out what users are likely to click on.

Even better — if you’ve already launched (or just don’t want to estimate / forecast these numbers), you can install some tracking & analytics packages then generate a report based on real data. This is my preferred method, however, it cannot be done before there is a beta — and this sort of work is really beneficial when you’re developing an initial product. For pre-launch products, I use information from sites like Alexa, Compete and Quantcast to determine usage patterns of similar-in-functionality or competitive landscape websites.

Userpaths aren’t very complex — here’s one from my publishing example above:

Publishing site userpath:

1. User visits homepage
2. Clicks on Article Focus
3. Clicks on Video Focus
4. Clicks on Related Content button

## Profile User Path Elements

You can profile User Path elements relatively simply with a wordprocessing / spreadsheet program and a browser like Safari or Firefox. Note: Firebug in Firefox and “developer mode” in Safari make this much easier.

For every page in your userpath, you should create a new page / worksheet to include the page elements. I’ll break them down into 5 categories:

– The actual Page HTML
– Included Images
– External JS Files
– External CSS Files
– Rich Media (flash, video, etc.)

I’ll also break down each category into usage classifications:

– is only used on page
– is used on some pages (ie: section pages)
– is used throughout the site

The reason why we classify the elements is that it allows us to understand and to forecast caching for re-usability.

## The Spreadsheet

I use the same basic setup for every spreadsheet that I build: one sheet for the userpath, which references a sheet for each page in the userpath.

The userpath sheet really just contains a listing of all the different pages in a userpath, along with some functions / calculations. Each page in the userpath gets its own dedicated sheet that breaks down all of the content contained within, and the userpath sheet references the data in those page sheets.

### Sample Sheet: Userpath

This is what a Userpath worksheet in my spreadsheet software usually looks like:

| Page Type | Delivery Bandwidth (per page) | Impressions (per Gigabyte) | CPM @ 50¢ | # of pages in visit | Delivery bandwidth per visit |
| --- | --- | --- | --- | --- | --- |
| Index Page | (ref) | (func) | (func) | ? | (func) |
| Article Page | (ref) | (func) | (func) | ? | (func) |
| Focus Image Page | (ref) | (func) | (func) | ? | (func) |
| Focus Video Page | (ref) | (func) | (func) | ? | (func) |

This userpath sheet lists the different ‘classes’ of pages that the users will encounter, along with their calculated bandwidth and the number of times each page is encountered — per-visit. This turns an almost intangible user experience into a graphed representation.

Here is what you need to know about the specific columns:

– we reference the ‘Delivery Bandwidth’ per page type to data in its own sheet (see below)
– we calculate the ‘Impressions per Gigabyte’ – or how many impressions of this page fit in 1 gigabyte of data
– we calculate the CPM for delivery with bandwidth priced at 50¢ / gigabyte – how many impressions can I serve if bandwidth costs 50¢ / gigabyte
– we enter the # of pages per visit as a number. We can quickly change this when trying to optimize new paths. This is traffic forecasting. Some pages might only get 1 view, while others receive 20 views.
– we calculate the amount of bandwidth used for this pagetype per visit

I typically compute a few more values on this page too:

– Subtotal for all bandwidth per visit
– how many site visits you can get per gigabyte
– how many site visits you can get at different CPM levels
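The (func) cells above boil down to simple arithmetic. A sketch of what they compute, in Python, using hypothetical page sizes and the 50¢ / gigabyte rate from above:

```python
GB = 1024 ** 3  # bytes per gigabyte

def impressions_per_gb(page_kb):
    """How many impressions of a page fit in one gigabyte of transfer."""
    return GB / (page_kb * 1024)

def delivery_cpm(page_kb, dollars_per_gb=0.50):
    """Bandwidth cost of serving 1,000 impressions of this page."""
    return 1000 * (page_kb * 1024) / GB * dollars_per_gb

def bandwidth_per_visit(page_kb, pages_per_visit):
    """Delivery bandwidth (in kilobytes) for one visit to this page type."""
    return page_kb * pages_per_visit
```

For example, a hypothetical 1024K page yields 1024 impressions per gigabyte, and costs roughly 49¢ in bandwidth per thousand impressions at that rate.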

The userpath page acts as both an index and as a conclusion — mapping out many referenced sheets and summarizing them as well. The only data entered / manipulated on this page is the number of pages per visit — everything else is calculated based on data referenced from other sheets.

The use of many references makes this a live document. As the site progresses or as I start optimizing (by altering planned site traffic and directing users onto different types of pages to consume the premium content, or by adjusting the bandwidth on each page), the underlying costs and associated income are quickly recalculated. This approach makes it easy not just to gauge performance, but also to set goals and to devise new ways to meet them.

### Sample Sheet: Page Type

For every classification of page, I create a worksheet that reads a little something like this:

| Page Type | kilobytes | re-use discount | total cost (in K) |
| --- | --- | --- | --- |
| HTML Size | ? | 1 | (func) |
| Images Used Across Site | ? | .35 | (func) |
| Images Used On Parts Of Site | ? | .7 | (func) |
| Images Specific To Page | ? | .95 | (func) |
| JS Used Across Site | ? | .35 | (func) |
| JS Used On Parts Of Site | ? | .7 | (func) |
| JS Specific To Page | ? | .95 | (func) |
| CSS Used Across Site | ? | .35 | (func) |
| CSS Used On Parts Of Site | ? | .7 | (func) |
| CSS Specific To Page | ? | .95 | (func) |
| Flash Used Across Site | ? | .35 | (func) |
| Flash Used On Parts Of Site | ? | .7 | (func) |
| Flash Specific To Page | ? | .95 | (func) |
| Size of Video File / Stream | ? | 1 | (func) |
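The ‘total cost (in K)’ column is just the asset size times its re-use discount, and the page total is their sum. Sketched in Python with hypothetical asset sizes:

```python
def effective_page_kb(assets):
    """assets: (kilobytes, reuse_discount) pairs for one page class.
    The discount approximates how often an asset is actually downloaded
    rather than served from the browser cache."""
    return sum(kb * discount for kb, discount in assets)

# hypothetical 'Article' page class
article_page = [
    (40, 1),     # HTML, always downloaded
    (200, .35),  # images used across the site, usually cached
    (80, .7),    # section-specific js / css
    (120, .95),  # photos specific to this page
]
# the effective weight comes out well under the naive 440K total
```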

Huh?!? You ask. This looks confusing! Well, that’s because it is! Unless, of course, it isn’t! Or maybe it is? N’est-ce pas?

The above chart is really just a template that I use for every classification of a webpage in a userpath.

Classifications are pages that are ‘similar’ – either built off of the same general template, or that have similar size & function on a sitemap. In publishing sites, you would have classifications such as: Index, Archives, Entries, Entry-Photos, Entry-Videos, Ancillary; in social networking sites, classifications would be: Dashboards, Profiles, Modal / Messaging, Ancillary, etc.

For each page classification, I break down the client-downloaded-assets into specific components: the actual HTML, external images, js, css, and rich media. I then classify each of those components by their usage type: items that are loaded only for that page, ones that are used throughout the page-class, or ones used site-wide.

By tallying up each type of site content, and classifying its usage type, we are able to get an idea of what is actually sent to the browser each time. Not to mention we get a good reference on how to better optimize the site further down the road.

Let’s think of a real-world example: imagine a publishing site where you read an article about a celebrity and view a couple of photos of said celebrity at a press function. On this ‘Article’ classification, we can think of page assets as such:

– Used Across Site

The main css / js files; the site design template images, etc.

– Used on Parts of Site

‘Section’ images on the site design – like ‘active’ status on the nav, in-site promotions, category specific design files, any special js / css used for this template, etc.

– Page Specific

The HTML, the photos of the celebrity appearing only on this page

For those unfamiliar with caching, here’s a quick overview: every time a user visits this website, they download the css / js files unless there is an active cache. Once that file has been cached, it remains until some sort of expiry (closing the browser, waiting 10 months, waiting until the file has changed on the server, etc.) and will not download again. If a user clicks on the index page, then a category page, and then two discrete articles, they will only download the uniquely different files one time each.

In order to get these numbers you can do one of two things:

1. For an active site, you can use server logs and analysis software (or Regular Expressions and some clever scripting) to get accurate numbers for your 80th percentile
2. Safari and Firefox both have features that report the media on a page, along with their size and URI (uniform resource identifier). These reports can be coerced into a spreadsheet page with some light scripting or a bit of “copy and find / replace” fun. Using these tools, it becomes fairly simple to figure out what the file loads are, and where the current and potential re-usabilities lie.

To account for caching, I use a discount rate — shown in the above example as .35, .7 and .95. This means that *on average* these items are served from our server on only 35%, 70% and 95% of the requests. Note how I use the concept of served — the numbers we care about are how many times the content is fetched from our servers, not how many times it is needed. Users have their own caches running locally in their browsers or on their machines, and networks / ISPs and offices often have shared caches too — so there are many times when the browser may request a file, but that request is served by a server / service that we don’t pay for.
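The discount math above is just a weighted sum, and is easy to sanity-check outside of a spreadsheet. Here’s a minimal Python sketch — the component names, sizes, and rates below are hypothetical placeholders, not measurements from any real page:

```python
# A minimal sketch of one page-class worksheet as code.
# Component names, sizes (KB), and discount rates below are
# hypothetical placeholders -- substitute your own measurements.

ARTICLE_PAGE = [
    # (component, kilobytes, re-use discount)
    ("HTML", 45, 1.0),                    # always served
    ("site-wide CSS/JS", 120, 0.35),      # usually cached already
    ("section images/JS", 80, 0.7),
    ("page-specific photos", 400, 0.95),
]

def effective_kilobytes(components):
    """Average kilobytes actually served from our servers per view."""
    return sum(kb * discount for _name, kb, discount in components)

print(round(effective_kilobytes(ARTICLE_PAGE), 1))  # -> 523.0
```

The `(func)` cells in the worksheet are exactly this multiplication: raw kilobytes times the fraction of requests our servers actually answer.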

# A Fun Note On: Bandwidth

A lot of people wonder “How much should bandwidth cost?”

While the answer is always a joking “as little as possible”, the actual question is a bit more difficult to answer.

Every service provider charges for bandwidth using a different price, and a different method.

It’s not uncommon to see one hosting company charge 10¢ for a gigabyte of bandwidth, while another charges $3 or more per gigabyte.

Bandwidth is also charged by different units.

Managed hosting companies (they provide hardware and you pay monthly for exclusive or shared usage) tend to charge in (giga)bytes — for each actual bit of data transferred. Co-location facilities (you provide hardware and they provide a place to store it) tend to charge in (mega)bits — you pay for the size of the ‘pipe’ needed to deliver your content 95% of the time. Content Delivery Networks (CDNs) use a mix of the two; it’s a crap shoot to guess. Some firms also charge connection or per-request fees for clients accessing the data.

I personally like to price things out at the exact vendor costs I’m using at the time, but when projecting ‘expected’ expenses, I like to use 50¢/GB as a good ballpark. It’s the midpoint of most companies that I work with — across data-centers, hosting firms, and CDNs — so I can easily obtain a figure in the range my expenses are likely to run.

I also like to convert bandwidth to gigabytes for simplicity. That’s how I serve content, think about storage, and handle CPMs. Call me lazy — but it’s easier to standardize all of my bandwidth-oriented concerns using a single unit of measurement.
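As a quick sketch of how that ballpark plays out, here’s the 50¢/GB figure applied to a hypothetical month of traffic — the view count and per-view payload are made-up examples, not data from any real site:

```python
# Applying the 50-cents-per-GB ballpark from the text.
# The traffic figures below are hypothetical.

COST_PER_GB = 0.50  # USD -- ballpark midpoint, per the text

def monthly_cost(page_views, avg_kb_per_view, cost_per_gb=COST_PER_GB):
    """Estimated monthly bandwidth bill in dollars."""
    gigabytes = page_views * avg_kb_per_view / (1024 * 1024)  # KB -> GB
    return gigabytes * cost_per_gb

# e.g. 1,000,000 views/month at ~500 KB served per view:
print(round(monthly_cost(1_000_000, 500), 2))  # -> 238.42
```

Swapping in your actual vendor’s rate is a one-line change, which is the whole point of keeping everything in a single unit.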

So how do you standardize units?

## Quick Reminder: Bits and Bytes

– there are 8 bits in a byte
– when abbreviated, a lowercase b means bit and an uppercase B means byte.

## Quick analogies for pricing models:

GigaByte based pricing – this is like the water company charging you each month based on the total number of gallons you consume.

Connection-Fee charges – In addition to the per-gallon charges, you’re also charged a fee every time you turn the faucet on.

Throughput charges – Instead of billing you for the water, you’re billed for the size of the pipe your house needs to receive all its water.

## About: Throughput-based pricing

Many companies charge bandwidth based on throughput — the rate of bits delivered. Your throughput is measured at the datacenter every 5 minutes or so, and logged to a database. At the end of the billing cycle, they use something called 95th-percentile billing — they drop the top 5% of usage spikes and charge you for the size of the pipe needed to deliver the rest.

If you do a bit of math, you’ll find that a 1 Megabit connection has the theoretical potential to deliver about 320 gigabytes a month. That’s at full speed, nonstop — which is unrealistic. In practice, you’re likely to see a 1 Megabit connection drive somewhere between 150 and 200 gigabytes a month between standard usage and periodic bursts. So when you need to convert the monthly cost of a Megabit/second pipe to gigabytes, use 175.
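That conversion is simple enough to express directly. A small Python sketch, using a 30-day month and the 175 GB rule of thumb from above:

```python
# Converting a megabits-per-second pipe into gigabytes per month.
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 30-day month

def mbps_theoretical_gb(mbps):
    """Full-speed, nonstop ceiling for a pipe of `mbps` megabits/sec."""
    megabytes = mbps / 8 * SECONDS_PER_MONTH  # 8 bits per byte
    return megabytes / 1024  # MB -> GB

def mbps_practical_gb(mbps):
    """The realistic rule of thumb from the text: ~175 GB per megabit."""
    return mbps * 175

print(round(mbps_theoretical_gb(1)))  # -> 316, i.e. "about 320"
print(mbps_practical_gb(1))           # -> 175
```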

# A Note On: Optimizing

Once you are able to estimate your initial page bandwidth, and use a spreadsheet full of related cells, you’ll have a good tool to use when setting goals and working to achieve them:

– Your page class sheets will give you a good idea of what needs to be optimized, and what does not.
– Your userpath sheet can have page-visits for each item altered to show different ways of herding users
– You can make multiple workbooks, userpath sheets or page-class sheets to test out different concepts.

My personal tips:

– I often clone a page and experiment with how I can better leverage cached images on the site to try and lower the bandwidth cost
– On my userpath sheet, I’ll often have multiple user paths that each look like that first table. They’ll reference different pages, possibly / often alternatives to existing pages, and also have different numbers for pages-per-visit, as I alter different potential user experiences.
– On ad supported websites, on the main P&L sheet, I group these bandwidth costs with the costs for the server infrastructure that supports them AND the ad revenue generated since they all affect one another. I generally leave the labor out of this equation and keep it with the corporate labor costs.

# A Note On: Advertising

I use the phrase “Ad supported” quite a bit, so I feel the need to remind you about the types of advertising revenue that you can realistically expect to generate. Realistically is the key word.

If you’re a niche site and you have a good team driving your company, you could be aiming for a $25 effective page-CPM (the aggregate of all ads on the page) when you are functioning at your peak. While you may command $75-100 as an ideal page-CPM rate, after you deal with sales and inventory concerns (you’re not selling 100% of your adspace at the numbers you command) your average will be around $25 — if you’re lucky.

There are plenty of other costs associated with running your company too – you have to pay for content production, licensing fees, your office space, lawyers, accountants, marketing, etc etc etc. Making your ad supported operations as efficient as possible is really important, because you not only have to cover the costs of delivery — but of production, overhead, and then hopefully profit as well.

Here’s another startup reality check:

For all other sites — and especially for niche sites on non-publisher content pages — you should aim for a $2 CPM and feel happy if you get it. You should be ecstatic if you make anything more. Even with the most advanced targeting and tracking, you might bump up to $4-8 if you’re very lucky. The industry standard right now is well under $2, for ads that very few people notice on websites that very few people care about. Unless you have a massive PR initiative behind you, that is going to be your web project.

Many large social networks are selling overflow / non-targeted CPMs in the 10¢ to 75¢ range, and only commanding more money from their less-used advanced targeting buys or site sponsorships. The more targeting that you either have or push for, the more infrastructure you need on your side — either specialists to handle sales, or technologists to implement 3rd party tools. So while it is possible to make more, you will absolutely need to spend more in order to do so.

In essence, you will not get a $25 CPM out of the gate for premium content pages, and you will likely be making under a $2 CPM for everything else.
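To tie the advertising and bandwidth threads together, a hedged sketch — the figures are illustrative, and this ignores servers, labor, and every other cost discussed in this article:

```python
# Delivery cost per 1000 page views vs. ad revenue per 1000 views (CPM).
# All figures below are illustrative, not measurements.

def bandwidth_cost_per_thousand(kb_served_per_view, cost_per_gb=0.50):
    """Bandwidth cost, in dollars, to serve 1000 page views."""
    return 1000 * kb_served_per_view / (1024 * 1024) * cost_per_gb

delivery = bandwidth_cost_per_thousand(500)  # ~500 KB served per view
cpm_revenue = 2.00                           # the realistic $2 CPM
print(round(delivery, 4))                    # -> 0.2384
print(cpm_revenue > delivery)                # -> True
```

Bandwidth alone rarely sinks an ad-supported page at a $2 CPM; it’s the production, overhead, and sales costs layered on top that decide profitability.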

# A Note On: Application Priorities

I noted before that 90% or more of your users will likely be casual — accounting for about 3 visits of 3 pages each per month. What I should also add is that it is not uncommon to see 70% or more of a site’s traffic driven by one or two URLs — such as the homepage, a profile/dashboard page, or a latest-news page.

It’s important to use analytics tools to figure out exactly what these pages are, and be extremely aggressive in your optimization and monetization strategies — performance on these pages can make or break your business.

It’s also important to identify these pages for prioritization in your Product Management and Design. These pages can drive most of your revenue, and set the tone for your user experience – leading to stronger connections with your userbase. You will often want to prioritize the integration of new user and management features into these pages.

You’ll also want to keep constant metrics on your application’s usage — your userbase demographics will change as you scale, and users often shift their own patterns as your competitive landscape changes.

A company I recently worked with realized that 72% of their traffic was coming from 3 pages that shared the same admin tool — despite earlier patterns of heavy usage on another part of the site. This resulted both from a 20x scale-up in users between their startup and production phases, and from the introductions and failures of other players in their space. By keeping tabs on their metrics, they were able to fine-tune their product. Instead of spending their finite cash and resources to enrich the overall user experience throughout the site, they decided to focus on streamlining the management operations for the new high-traffic pages. The result? Those pages became even more popular with users and cheaper to maintain — providing increased revenue to overhaul the rest of the site, and a larger, more loyal userbase to advertise the new features to.

# A Note On: The Numbers

This is not an exact science.

Nothing in this article is exact. Many of the figures are based on forecasting usage patterns. When actual server logs are used to show actual user / download patterns, their numbers vary — and are accurate for their site and their site alone.

The point of this exercise and skill is to determine the likely *average* usage patterns, and to optimize a web service to meet them.

These figures also do not take into account a lot of VERY important ideas:

– You may need to scale servers — add additional servers to offset CPU and memory stress. Each additional server can cost between $20 to $1000 a month to operate
– This does not include any sort of labor costs — no engineers, content producers, management, or anything of the like. You will often need to scale your labor to accommodate more consumers.
– There are tens, hundreds, thousands of obvious things that I’ve omitted from this overview

# A Note On: Startups – Being Scared (or Not) and Entrepreneurial Spirit; or How Much Should You Budget?

I wrote this article to talk about concerns for companies that are focused on generating revenue through web applications. I did not write it to talk about making cool and kick-ass webapps.

If you’re trying to build a business – one with employees and healthcare and offices and a water cooler – you need to be thinking operationally and about income streams. You can have the world’s greatest webapp, but if it can’t support itself or its staff — you just don’t have a business, you have a hobby.

There’s nothing wrong with that. Not everyone is trying to build a company at the outset. If you’re a startup person, there are some good — and proven — ideas in this article that can help you cut costs and be more efficient. Are you trying to build the next kick-ass webapp for fun or in your spare time? If so, you can forget about 90% of the things that I’ve written about in this article — your application can probably scale to ~100,000 users without you having to worry about bills.

My friend Rick Webb recently blogged about Dodgeball.com’s demise — their product was great, the users were loyal, and the costs were low — he could even cover them himself. But the company wasn’t able to be financially self-sufficient, so the company that acquired them shut it down. This is probably a bad example, and has more to do with the parent company being inept in its handling of the acquisition, its technologies / capabilities — and most importantly its team.

If you’re building something like Dodgeball just to build it, then do it. Don’t look back. Have fun, learn some things, and maybe turn it into a company down the road when you meet the right Sales and Operations team. But if you are a company, or you have a technology that you’re trying to turn into one — think for a moment on how you can be profitable, and if you can even achieve that.

I lost a lot of money with my startup FindMeOn, and I learned some tough lessons: We focused on product development when we should have focused on business development; and we focused on business development when we should have focused on product development. It sounds like a Catch-22, but it’s a simple idea — much like a seesaw on a playground, you need to find balance between what you want your product to become and what you need it to be as a source of actual income. So instead of capitalizing on the technology we invented to integrate social networks and port data, I found myself pulled every which way — trying to make the product better and to create revenue streams to keep things moving forward, all at once. Small teams can’t handle that successfully; you need focus. The result was us stopping work on the product, and just amassing an army of lawyers to handle a portfolio of patents.

If I knew in 2005 what I know now, things would have worked out very differently.

When Rick first gave me feedback on this, he recommended that I address the ballpark costs involved in covering a startup – so I wanted to conclude on that note.

I thought about the issue for some time, and came to my final thought: Getting a startup off the ground will cost you everything you’re willing to put into it; nothing less, nothing more. If you care about your business, you’ll find a way to make things work — or fail trying.

For example: If you are a small firm, or a beta product, you’ll easily find a host that gives you ‘unmetered’ bandwidth until you serve 500GB for well under $50 a month, and not much more after that. If you’re a large firm with an inept IT department, you’ll get the same package for more than $500/month from a ‘premier’ host — one that also charges you $5/GB for bandwidth over your allocation.

Technology and service options have grown so much in the past 4 years that you can find ways to make anything work. In 2005, we spent nine weeks coding a scalable image and document serving / storage system from scratch; in 2006, that was rendered completely useless by Amazon’s S3 — which handled not only serving and storage, but backup as well. Their solution wasn’t just better than ours — it was far cheaper, even before adding in all the engineering, maintenance, and hardware that we no longer needed.

Of course, many people think that Amazon charged “too much” to serve images and was unreliable in general. Panther Express came out of nowhere — founded by the original DoubleClick team — and offered CDN services for the masses, targeting disgruntled Amazon customers as one of their demographics. Panther quickly rose to prominence — bringing an enterprise-level CDN like Akamai to the masses at a much better price point and integration cost. Today there are also companies like Slicehost that offer virtual servers for $20/month — versus $200/month for similarly specced dedicated hardware not too long ago.

So if you’re a startup — where there’s a will, there’s a way. Covering your fixed costs will almost never be an issue. If you do ever have this problem, you’re probably wildly successful and in prime position to quickly receive funding or a selloff based on your userbase alone — and you can worry about becoming profitable then. ( On the flip side, you might also have an app that wastes bandwidth or is not realistically profitable, and should rethink your product and approach ).

# A Final Note

Just to illustrate how crazy numbers can be: In a single month, friends who had a design firm spent $7,000 on their web hosting charges; other friends who run a global advertising network spent under $2,000. I’ve seen amazing amounts of traffic served on $20 virtual servers and on $150 dedicated servers.

If you are a startup and you want to build an application — just find a way to do it. It exists. I assure you. The costs of scaling your product are negligible — the only substantial costs will be your own labor and time. If you get to the point that you need to worry about cost optimization, you already have a VC check in your hands.

If you are a startup trying to be profitable / sufficient from day 1 — and there are a lot of you out there now — your tech costs should be well under $1,000 a month if you have under 100,000 active users. If you have more than that, optimize your application and get an advertising specialist or new VP of business development in your office — because you should be making money. You need to think about these numbers not for daily operations, but for making yourself look profitable for investment or acquisition.

And finally – if you are a brand thinking about running a digital project, think of your scaling costs in terms of CPMs — how much are you willing to pay per user impression for the visit? You’ll probably have to use some sort of ‘premier’ web host and CDN, which will cost between $1,000 and $3,000 on setup / monthly fees alone. Beyond that, you’ll be paying significantly less per CPM to maintain a good relation with your customers than you would on advertising CPMs to attract new ones.

# New Category – Scalability & Performance

I’ve decided to create a new category on Scalability & Performance

I recently learned that not enough people talk about this — and there are a lot of independent efforts that don’t cross-communicate.

Since I’m plugged into a lot of different communities, I decided to start talking about the different efforts in each — and where there is overlap.