Thursday, November 3, 2011



The US Cell Phone Market

or how I learned to play whack-a-mole


Looking to buy a new phone?

  • What carrier should I choose?
  • What phone should I choose?
  • How future proof are these choices?
  • Can I switch carriers?
  • Will my phone work overseas?
If you're a techie like me, it can be maddening to look at the history and current state of affairs of world-wide mobile phones.

A brief overview

Cell phones have gone through 4 generations, numbered 1G through 4G, roughly one per decade since 1980.
  • 1G was the original analog generation, which used circuit switched networks. This had low latency (quick response) and connection reliability, but very little spectrum efficiency (i.e. it carried very little traffic per MHz). The dominant US standard was AMPS (Advanced Mobile Phone System).
  • 2G added digital downloads and uploads, which leveraged packet switched networks.  Europe made a concerted effort to standardize radio frequencies and protocols via GSM (Global System for Mobile Communications), while most of the US carriers went their own ways. Common GSM protocol names were GPRS (General Packet Radio Service) and EDGE (Enhanced Data Rates for GSM Evolution).
  • 3G started in 2000 in Europe as a general specification, UMTS (Universal Mobile Telecommunications System), which defined 3G as anything all-digital running at 200 kbps or faster.  This family included HSPA (High Speed Packet Access), with faster downloads via HSDPA (High Speed Downlink Packet Access), still later faster uploads via HSUPA (High Speed Uplink Packet Access), and finally HSPA+ (between 5 Mbps and 48 Mbps).  A major competing standard was CDMA2000, which included 1xRTT and EV-DO (Evolution-Data Only, later rebranded as Evolution-Data Optimized).  And finally WiMAX (Worldwide Interoperability for Microwave Access).
  • 4G was guided by IMT-Advanced (International Mobile Telecommunications-Advanced) in 2008 to define the most recent generation.  This included a target peak data rate of 1 Gbps and a requirement to be purely IP based.  It basically leveraged OFDM (Orthogonal Frequency-Division Multiplexing), a technique used in short-range WiFi since 802.11a, along with the latest WiFi technique (802.11n) of multi-antenna / multi-path, often called MIMO (multiple-input and multiple-output).  However, there were major challenges, given existing radio frequency spectrum allocations, costs, and handset requirements.  So a practical evolutionary approach was decided on, called LTE (Long Term Evolution) and subsequently LTE Advanced; separately, higher speed WiMAX solutions have been adopted by some vendors.  LTE peak speeds are currently on the order of 50 Mbps - far short of the 4G requirement.  More importantly, there is tremendous overlap of LTE and HSPA+ deployments.
The above is certainly not an authoritative reference - it represents the culmination of my frustrated attempts at learning bit-by-bit from various articles, including Wikipedia, AnandTech, and carrier marketing sites.

The basic problem

Radio frequencies have the following characteristics:
  • A frequency is measured in Hz (after Heinrich Hertz), and one Hz represents one full rotation per second (similar to a rotating tire - but more specifically a complete sinusoidal oscillation between two orthogonal states).
  • Radio frequencies are split up in a continuous spectrum much like color shades from red through violet (as can be seen through a prism).
    • This means there are a very large number of frequencies between even 1 Hz and 2 Hz (though quantum physics quickly gets in the way).
  • Radio Frequencies are categorized as those between 3Hz and 300GHz.
  • The frequencies commonly attributed to cell phones are UHF (Ultra High Frequency): 300 MHz - 3 GHz.
  • What is interesting about radio frequencies is that they can radiate off copper wires and be captured again by a different copper wire miles away.
  • Some radio frequencies can pass through walls, while others have a hard time.  Still others are subject to natural interference.
  • For a given power level, some frequencies can travel great distances with little signal loss; others can only travel a few inches.  For those shorter range frequencies, you can usually increase the power of the transmitter to increase the range, but sometimes there are consequences - an Arc-welder, for example, is a high powered radio transmitter.
  • Radio transmissions are electromagnetic waves (carried by photons) whose energy oscillates back and forth between the electric and magnetic fields while simultaneously traveling through space at the speed of light.
    • The process of transmitting radio frequencies wirelessly through space is very similar to how they are transmitted through a guided medium such as a copper wire.
      • As a consequence, wired ethernet and fiber-optics have limitations very similar to wireless - but with the advantage that multiple parallel cables have very little interference with one another, so you can achieve massive bandwidth with very little power consumption.
  • Antennas are basically echo-chambers that can absorb 50% of the power in a given radio photon. The remaining 50% is echoed/reflected back out.
    • This means radio transmission is never more than 50% power efficient.
    • Given a long-lasting coherent frequency, the echoes resonate within the antenna's echo-chamber and can be coherently amplified (in the case of analog music radios like FM/AM) or measured / sampled (in digital systems).
  • Minor obstructions in the wires of an antenna have massive effects on which frequencies the echo chamber can resonate effectively.  This is more-or-less how we can tune a wire to accept ONLY a small frequency-range.
  • We call a pre-defined frequency-range a channel (VHF channel 13 is 156.65 MHz +/- 25 kHz)
    • A GSM cell tower may define 124 channels centered about 850MHz in 200 KHz increments.
  • There is bleeding of one frequency into the next due to noise, photon-scattering (e.g. through air), and the basic physics involved in antennas.  So while a given photon is precisely 1 frequency (for example 30,153,112.212 Hz), the resonance-amplified signal will not be determinable beyond an accuracy of, say, 5 kHz at the 30 MHz level.  Thus an entire contiguous band needs to be allocated/reserved.
  • Using mathematical techniques on digitally sampled measurements of the antenna, we can effectively encode between 0.1 and 20 bits per Hz of a given contiguous frequency-band.
    • Thus if a cell phone were allocated a 1 MHz band somewhere in the UHF range of 300 MHz to 3 GHz, it might be able to encode between 100 Kbps and 20 Mbps.  Note, there would only be 2,700 such bands in the entire UHF range, so for a given region you could not support more than 2,700 cell phones with such a partitioning scheme (see the sketch after this list).
  • The measurement of bits per preciously scarce Hz is called spectral efficiency.
  • Given the ever increasing cost of frequency-ranges in given geographic regions (such as high population areas), there is an ever growing need to increase spectral efficiency - all else being equal.
  • When you sell a device that uses a frequency, you have to wait until all such devices have been retired before you can effectively re-purpose that frequency for a different device or protocol.
    • Thus, historically the frequency spectrum is full of ranges that you are not allowed to use because of spectrum squatters.
  • In the US, the FCC (Federal Communications Commission) defines what devices are allowed to use which frequency-ranges; what protocols to use in those ranges, and at what max-power-levels (so as to define a max-range/distance of interference)
  • You can generally separate two-way sends/receives in frequency or time (or some combination thereof).
    • TDMA - Time Division Multiple Access
    • FDMA - Frequency Division Multiple Access
    • CDMA - Code Division Multiple Access (spreads each transmission across a shared range of frequencies over time using a unique code, so that many senders can overlap and still be separated)
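To make the band-partitioning arithmetic above concrete, here is a minimal back-of-envelope sketch in Python. The numbers (1 MHz slices, 0.1 to 20 bits/Hz, 124 channels at 200 kHz spacing) are the illustrative ones from the bullets above; the channel grid is purely hypothetical and not the real GSM ARFCN plan.

```python
# Back-of-envelope sketch of the band-partitioning arithmetic above.
# All numbers are illustrative, taken from the bullets, not an actual
# FCC / 3GPP channel plan.

UHF_LOW_HZ = 300e6     # 300 MHz
UHF_HIGH_HZ = 3e9      # 3 GHz
BAND_HZ = 1e6          # give each phone its own 1 MHz slice

bands = int((UHF_HIGH_HZ - UHF_LOW_HZ) / BAND_HZ)
print(f"1 MHz bands available in all of UHF: {bands}")        # 2700

# A spectral efficiency of 0.1 .. 20 bits/Hz turns that 1 MHz slice into:
for bits_per_hz in (0.1, 20):
    print(f"{bits_per_hz:>4} bits/Hz -> {bits_per_hz * BAND_HZ / 1e6:.1f} Mbps")

# A GSM-style carrier grid: 124 channels spaced every 200 kHz around 850 MHz
# (spacing only -- real GSM-850 numbers its channels via ARFCNs and uses
# paired uplink/downlink bands).
center_hz = 850e6
spacing_hz = 200e3
channels = [center_hz + (n - 62) * spacing_hz for n in range(124)]
print(f"channel grid: {channels[0] / 1e6:.1f} MHz .. {channels[-1] / 1e6:.1f} MHz")
```

The point of the 2,700 figure is simply that naive one-band-per-phone partitioning cannot work; real systems share each channel among many users (the TDMA / FDMA / CDMA schemes above) and reuse the same channels from cell to cell.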
So to sum it up, we've got a limited supply of useful over-the-air spectrum - though it can all be reused in each city.  Much of it is full of legacy devices (such as analog radio (AM / FM), satellite, and analog TV).  The dollar value of each MHz of spectrum skyrockets yearly, and thus legacy systems are quickly being retired so as to allow their spectrum to be re-distributed to the highest dollar-value market.

Today that market is the cellphone industry.

BUT, much of that spectrum is already allocated for cell-phone use.  And there are over 1 billion cell phone devices world-wide. There is a very high cost in upgrading older devices that have less spectral efficiency, or in consolidating which frequencies are used for which protocols so that a given phone can be useful in different cities around the world.  So instead, most carriers opt to continue to fragment the world-wide market for short-term cost savings.

So let's say we have 2 cities, A and B, that each implemented their own 3G network with Phone 1 and Phone 2 respectively.  Let's say they used frequencies of 1 GHz and 2 GHz respectively, and that some legacy satellite was blocking 1 GHz in city B.
Now let's say they both want to create a 4G network.  In City A, the only available frequency is 3 GHz, so they create a Phone 3 (backwardly compatible with Phone 1's frequencies).  But in City B, it was cheaper to decommission that satellite and reuse the 1 GHz space, while 3 GHz is NOT currently cost effective to re-purpose.  City B KNOWS that if it chooses 1 GHz it will be in conflict with City A, but it would take many years and a lot of money to do anything else - so instead they create a Phone 4.  So now Phones 1 and 3 work in City A ONLY, and Phones 2 and 4 ONLY work in City B.

Now take this situation and scale it to 20 frequencies and 15 protocols, many of which require special dedicated hardware to work properly.  Depending on the details, there could be upwards of 100 specialized pieces of hardware required to actually work on all possible frequencies and protocols.  While technically possible, making a LOW-POWER portable cell phone that does this cost-effectively is challenging. Now throw in patents / royalties for a given protocol, and a handset manufacturer has to think seriously about whether it's worthwhile making a true world phone (one that can operate in any major city around the world with at least voice connectivity).


The breakdown

Europe:
While Europe is full of conflicting standards, it did champion the GSM standard. This allocated the following frequencies:
  • 900 MHz (890-915 uplink, 935-960 downlink) GSM / EDGE / GPRS / 2G with 124 channel-pairs.
  • 1.8 GHz (1710-1785 uplink, 1805-1880 downlink) GSM / EDGE / GPRS / 2G with 374 channel-pairs.
  • 1.9 GHz / 2.1 GHz IMT (1920-1980 uplink, 2110-2170 downlink) UMTS / 3G / W-CDMA (2004 - )
  • 1.8 GHz DCS (1710-1785 uplink, 1805-1880 downlink) UMTS / 3G / W-CDMA (alternative / migration)
  • 900 MHz GSM (880-915 uplink, 925-960 downlink) UMTS / 3G / W-CDMA (alternative / migration)
US:
Due to conflicts, the US allocated these (and other) frequencies:
  • 800 MHz (825-894) for 1G AMPS (FDMA), then incrementally upgraded to 2G D-AMPS (TDMA) by ATT / Verizon / On-Star (started in 1982 and discontinued in 2008).
  • 850 MHz (824-849 uplink, 869-894 downlink) GSM / EDGE / GPRS / 2G with 124 channel-pairs.
  • 850 MHz T-Mobile for GSM (1996-) ROAMING ONLY
  • 850 MHz (824-849 uplink, 869-894 downlink) ATT UMTS / HSPA / 3G (HSDPA 2005) (HSUPA in 2009) - slowly replacing GSM
  • 850 MHz Verizon CDMA / 3G
  • 1.9 GHz Verizon CDMA / 3G
  • 1.9 GHz T-Mobile GSM / 2G (1994-) (GPRS 2002) (EDGE 2004)
  • 1.9 GHz ATT GSM / 2G (GPRS 2002) (EDGE 2004)
  • 1.9 GHz (1.85-1.99) Sprint 2G CDMA / GSM custom (1995 - 2000)
  • 1.9 GHz PCS Sprint CDMA / EV-DO
  • 1.7 GHz / 2.1 GHz AWS (1.71-1.755 uplink, 2.11-2.155 downlink) T-Mobile UMTS / 3G (W-CDMA 3.6 Mbps in 2006) (HSPA 7.2 Mbps in 2010) (HSPA+ 42 Mbps in 2011)
  • 1.9 GHz PCS (1.85-1.91 uplink, 1.93-1.99 downlink) ATT UMTS / HSPA / 3G (HSDPA 2005) (HSUPA in 2009) - slowly replacing GSM
  • 700 MHz ATT LTE / 4G (2011 - )
  • 700 MHz (777-787 uplink, 746-756 downlink) Verizon LTE / 4G
  • 2.5-2.7 GHz Sprint XOHM WiMAX / 4G
Spectral efficiency and bandwidth
  • AMPS  (0.03 bits/Hz)
  • D-AMPS (1.62 bits/Hz) Each channel-pair is 30 kHz wide in 3 time slots (TDMA). Supports 94 channel-pairs.
  • GSM Each channel is 200 kHz wide.
  • GSM / GPRS (??) (56 Kbps to 154 Kbps)
  • GSM / EDGE (1.92 bits/Hz) (400Kbps to 1Mbps)
  • CDMA2000 / EV-DO (2.5 bits/Hz) (2.4Mbps to 3.1Mbps) 1.25MHz 
  • UMTS / WCDMA / HSDPA (8.4 bits/Hz) (1.4Mbps to 14Mbps) 5MHz channel-pairs
  • UMTS / HSPA+ (42Mbps)
  • LTE Advanced (16 bits/Hz) (6 Mbps normal, 300 Mbps peak) 1.25 MHz .. 20 MHz channel-bundles
  • WiMAX (1.75 to 20 bits/Hz)
  • V.92 modem (18.1 bits/Hz) 
  • 802.11g (20 bits/Hz)
  • 802.11n (20 bits/Hz)
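These peak numbers are roughly just spectral efficiency multiplied by channel width. A quick sanity-check sketch in Python, using only the approximate figures quoted in the list above (overhead, coding, MIMO configuration, and multiple time slots are all ignored):

```python
# Peak throughput ~ spectral efficiency (bits/Hz) x channel width (Hz).
# Figures are the rough ones quoted in the list above; real-world rates
# depend on overhead, coding, MIMO, and how many channels/slots you get.
techs = {
    # name:              (bits/Hz, channel width in Hz)
    "GSM / EDGE":        (1.92, 200e3),    # one 200 kHz channel
    "CDMA2000 / EV-DO":  (2.5,  1.25e6),
    "UMTS / HSPA+":      (8.4,  5e6),      # one 5 MHz channel-pair
    "LTE Advanced":      (16,   20e6),     # widest channel-bundle
}

for name, (eff, width) in techs.items():
    print(f"{name:18s} ~{eff * width / 1e6:6.1f} Mbps peak")
```

The products line up reasonably well with the quoted peaks (roughly 0.4, 3.1, 42, and 320 Mbps respectively), which is a handy way to spot typos in carrier marketing numbers.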
Phones:
  • Apple's iPhone 4 contains a quadband chipset operating on 850/900/1900/2100 MHz, allowing usage in the majority of countries where UMTS-FDD is deployed. Note, this doesn't support the 1.7GHz uplink for T-Mobile.
  • Samsung Galaxy S Vibrant (SGH-T959) T-Mobile GSM / 2G  850, 900, 1800, and 1900. UMTS 1700/2100 (US, Tmobile only) and UMTS 1900/2100 (Europe). It does NOT support the 850 band as used by AT&T 3G.
  • Samsung Galaxy S Captivate (SGH-i897) ATT - GSM / 2G 850, 900, 1800 and 1900. UMTS / 3G 850/1900 (US, ATT) and UMTS / 3G 1900/2100 (Europe).
  • Samsung Galaxy S II (SGH-I777) ATT - GSM / 2G 850/1900 (US), 900/1800 (Europe).
    UMTS / 3G 850/1900 (US, ATT) and UMTS / 3G 1900/2100 (Europe) [1.2 GHz, Dual Core Exynos 4210 + Mali-400 MP GPU]
  • Samsung Galaxy S II (SGH-T989) T-Mobile - HSPA+ [1.5 GHz, Dual Core Qualcomm Snapdragon S3]
  • Samsung Galaxy S II Skyrocket (SGH-I727) ATT HSDPA / 3G 850 / 1900 / LTE 700 MHz [1.5 GHz dual-core Snapdragon S3]
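The whack-a-mole in practice is just set arithmetic: does the handset's radio cover every band label a given network is built on? A minimal sketch (Python), using simplified 3G band lists pulled from the notes above - real compatibility also depends on uplink/downlink pairing, band classes, and separate CDMA/LTE radios, so treat this as illustrative only:

```python
# Simplified UMTS / 3G band support (MHz labels), per the notes above.
# Illustrative only: real spec sheets involve uplink/downlink pairs,
# band classes, and separate CDMA / LTE radios.
phones = {
    "iPhone 4":           {850, 900, 1900, 2100},
    "Galaxy S Vibrant":   {1700, 1900, 2100},   # T-Mobile variant
    "Galaxy S Captivate": {850, 1900, 2100},    # ATT variant
}

# Band labels each 3G network is commonly described with.
networks = {
    "ATT 3G (US)":      {850, 1900},
    "T-Mobile 3G (US)": {1700, 2100},   # AWS uplink/downlink pair
    "Europe UMTS":      {1900, 2100},   # IMT uplink/downlink pair
}

for phone, supported in phones.items():
    for network, needed in networks.items():
        ok = needed <= supported        # must cover the whole band set
        print(f"{phone:18s} on {network:16s}: "
              f"{'3G' if ok else 'no 3G (2G roaming at best)'}")
```

Run against the lists above, this reproduces the notes: the iPhone 4 misses T-Mobile's 1700 MHz uplink, and the Vibrant misses ATT's 850 MHz 3G band.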
Why am I missing Verizon/Sprint?
Frankly, the information for Verizon/Sprint was less abundant - and this being a personal handset research project, I just gave up looking.

The VAST majority of the data was garnered from wikipedia and general google searches. If I come across more detailed information in my spare time, I'd love to update the data.


Saturday, June 11, 2011

Cloud Computing

Intro
I recently had the chance to go to a cloud computing expo in NYC. I didn't think there would be much to learn, and indeed, most of the presentations were very high level. But if you paid attention, there were many little gems.

Where we've been
Virtualization has been around for a while. A LONG while. IBM has been doing this since the 70's. The idea was that a business needed a really really really reliable system, so it needed 3 way redundancy for every aspect of the computing framework. The devices were self-healing; on failure, the hardware would reroute to working chips, drives, etc.

Now that you've got this multi-million dollar system, it would be kind of nice to not let it sit idle all the time. So we invent time-sharing. This basically allows multiple users to perform their tasks (possibly in isolated operating system slices). This has the effect of making the whole system slower, but the goal was data reliability and price - not performance.

Then came the cheap commodity DOS and Windows hardware. This meant smaller businesses could solve the same basic problems without 'big iron'. And with a manual backup process, you could have as much fault tolerance as you could eat. Big corporations now represented dis-economies of scale: by being so large that they couldn't survive down-time, or so large that their data couldn't fit on floppies, their ONLY option was expensive hardware.

Over time various UNIX flavors started replacing big-iron. Now 'cheap' $20,000 servers could handle thousands of users in time-sharing environments, and could utilize 'cheap' SCSI disks in a newfangled Berkeley-described RAID configuration. Five 200 MB SCSI disks could get you a whopping 800 MB of reliable disk storage! A new era of mid-size companies was taking over, and the 90's tech bubble was fueled by incremental yet massively scaling compute capabilities. Herein the IT staff was king. The business models were predicated on bleeding edge data or compute capabilities, and thus it was up to the IT division to make it happen; the company was made or broken on its ability to meet the business challenges... Of course, as we all know, not every business model made any sense at all, but that's another story.

The next phase was the introduction of free operating systems, namely Linux. The key was that now we could make cost effective corporate appliances that were single-function. And in the world of network security, there was a constant need to isolate services and regularly apply security patches.

Enter VMware..

While a mostly academic endeavor, it had long been possible to 'simulate' a computer inside a computer. I personally worked on one such project. Many companies, like Apple and DEC, had strong needs to 'migrate' or carry over applications written for different operating systems, and they sometimes found that it was easier to just emulate the hardware. IBM once tried software service emulation in OS/2 to support Windows 3.0 style software, and likewise Linux had a Windows software-stack emulator called WINE. But both of these endeavors were HIGHLY unreliable, in that if they missed something, it wouldn't be visible until somebody crashed. And further, it couldn't be as efficient as running on the native OS - raising the question of whether full hardware emulation might still be better.

So now VMware had found some techniques to not emulate the CPU itself, but instead only emulate the privileged instructions that the OS relies on. This is a much smaller percentage of the instruction stream, and thus faster than full CPU emulation. You can then act as a proxy, routing those privileged operations to a virtualized OS. So the application runs natively, the transition to the OS is proxied, and the OS runs in some combination of native execution and expensive trapping. Namely, if it's just some resource allocation management algorithm that requires no special CPU instructions, then that'll run natively in the OS; but if the OS call is actually manipulating IO ports, virtual memory mappings, etc, then each of those instructions will be heavily trapped (proxied). Later, VMware-type solutions found they could outright replace specific sections of OS code with proxied code, minimizing the expensive back-and-forth between the VM and the target OS.

Thus VMware was notoriously slow for IO operations, but decently fast for most other operations, and, like full CPU emulation, it was 100% compatible with a target OS. You could run Linux, Windows, OS/2, and theoretically Mac OS all side by side on the same machine. This meant if you had software that only ran on one OS, you could still have access to it.

Eventually people realized that this could solve more than just software access... What if I knew that FTP sites around the world were being hacked? Sure, I'll patch mine, but what if they discover a new flaw before I can patch? They'll get to my sensitive data... So, let's put an FTP server inside its own OS. But that means I'd have to buy a new machine, a new UPS, a new network switch (if my current one has limited ports). Why not just create a VMware instance and run a very small Linux OS (with 64 MB of RAM) which just runs the FTP service? Likewise with sendmail, or ssh servers, or telnet servers, or NFS shares, etc, etc, etc. Note these services are NOT fast, and you have to have redundant copies of the OS and memory caches of common OS resources. But that's still cheaper than allocating dedicated hardware.

So technically VMware didn't save you any money.. You're running more RAM, more CPU and slower apps than if you ran a single machine. But the 'fear' of viruses, exploitation, infiltration, etc made you WANT to spend more money. So assuming that your CTO gave you that extra money, you've saved it by instead going to VMware. See free MONEY!!

If you were concerned about exploitation, then there was one glaring hole here: the VM ran on some OS... that might have its own flaws... Further, the path is from the guest app trapping into VMware (running as a guest app on a host OS), which then delegates the CPU instruction to the guest OS. And the host OS may or may not give VMware the most efficient way to make this transition. Obviously VMware needed special proprietary hooks into the host OS anyway.. Sooo.

VMware eventually wrote their own OS, called ESXi. This is what's called a hypervisor: a 'bare metal' OS that does nothing but support the launching of real OSes in a virtualized environment.

In theory this gave the OS near native speeds if it was the only thing running on the machine, since there was only a single extra instruction proxy call needed in many cases.

So now we can start innovating and finding more problems to solve (and thus more services to charge the customer). So we come up with things like:

  • Shared storage allows shut-down on machine 1 and boot up on machine 2. This allows hardware maintenance with only the wasted time to shutdown/boot.
  • vMotion allows live-migrating a running OS: its memory pages are copied over to an alternate machine's RAM (much like pretending you're out of RAM and 'swapping', except the writes land on the other machine). When the last page is copied, the new OS instance takes over and the original OS is killed. This is near-instant migration (dependent on the size of RAM).
  • toggling RAM/num-CPUs per node.
  • Using storage solutions which give 'snapshot' based backups.
  • Launching new OS images based on snapshots.
  • Rollback to previous snapshots within an OS.
  • This all allows you to try new versions of the software before you decide to migrate over to it. And allows undoing installations (though with periodic down-time).

The reason Linux was valuable during this period was that making these OS snapshots, mirrored images, etc, had no licensing restrictions. You were free to make as many instances / images as you saw fit. With a Windows environment, you had to be VERY careful not to violate your licensing agreement; in some cases the OS would detect duplicate use of the license key and deactivate/cripple that instance. Something that can be overcome, but not by the casual user/developer.

Note that VMware was by far, not the only player in this market. Citrix's XEN, Red Hat's KVM, VirtualBox, and others had their own directions.

Hosted Solutions:

In parallel with all of this came the hosted web-site solutions. Build-a-web-site with WYSIWYG such that grandma can build/publish online with ease.

Next to that were leased hardware solutions.. Server Beach, Rack Space, etc. Also were the good old time-sharing solutions, where you'd just be granted a login; literally next to hundreds of concurrent users.

Whatever the need, the idea was: 'why manage your own hardware and datacenter and internet connection?' That's an economy-of-scale sort of thing. Someone will buy a 1 Gbps connection and 500 machines and associated large scale backup / switches / storage. They then lease it out to you for more than cost. You avoid up-front costs, and they make a decent business out of it. It's a win-win... Sometimes.

The problem is that installation on these solutions leaves you with very little to build a business on. You can download and install free software; but commercial software is largely difficult to deploy (especially if the hardware has no guarantees and from month to month you may be forced to switch physical hardware, which would auto-invalidate your licenses).

Free software was still fledgling. Databases were somewhat slow / unreliable. Authentication systems were primitive. Load balancing techniques were unreliable/unscalable. And, of course, the network bandwidth that most providers gave was irregular at best.

Enter Amazon AWS

So some companies decided that it would make sense to try and solve ALL the building block needs.. Networking, load balancing, storage, data-store, relational data-store, messaging, emailing, monitoring, fast fault tolerance, fast scaling...

But more importantly, these building blocks happen without a phone call, without a scale-negotiated pricing agreement, without an SLA. It is a la carte: credit-card, end-of-month 'charge-back'. You pay for what you use on an hourly or per-click basis. This means if you are small, you're cheap. If you have bursts, you pay only for that burst (which is presumably profitable and thus worth it). And if you need to grow, you can. And you can always switch providers at the drop of a hat.

It's this instant provisioning and a la carte solution pricing that's innovative here - but provisioning that gives you commercial grade reliability (e.g. alternatives to Oracle RAC, Cisco or F5 load balancers, NetApp).

Along with this came the proliferation of online add-on service stacks. The now classic example is salesforce.com: an end solution that naturally can be extended with custom needs. This extensibility opens up secondary markets and business partnership opportunities.

So today, apparently we use the following buzz words to categorize all of the above.

SaaS - Software as a Service
This is an eBay or a salesforce.com: some end software solution (typically a website / web-service) that can be built upon (the service aspect). The key is charging per usage volume, and pluggability. A cnn.com is not pluggable and thus not categorized as SaaS.

PaaS - Platform as a Service
This is the Amazon AWS (language neutral), the Google App Engine (Java/Python/Go), the Microsoft Azure (.NET). These are a-la-carte charge-back micro-services. The end company makes money (or, as in Google's case, recoups costs) and provides highly-scalable solutions, so long as you stay within their sandbox. Currently there is vendor lock-in, insofar as you can't swap the techniques of vendor A with vendor B. And this isn't likely to change... if for no other reason than that the languages themselves are differentiated between these various platforms. Thus even a common SDK is unlikely to provide abstraction layers.

IaaS - Infrastructure as a Service
This is the classic hosted hardware with only the ability to provision/decommission hardware. Amazon EC2, Rackspace, Terremark, Server Beach, etc are all in this model.
You are charged per use.. You have no visibility into the hardware itself.. Only RAM-size, number of CPUs, some benchmark representing minimum quality-of-CPU-compute-capability, and disk-space-size.

In some instances you can mix and match. In others, you are given natural divisions of a base super-unit of resources - namely you can half/quarter/split 8 ways/16 ways/32 ways the basic compute unit (as with Rackspace). The needs of your business may dictate one vendor's solution vs. another, including whether they properly amortize valid Windows licenses.

The reaction:

So, of course, the corporate buzz became: leased hardware, leased software stacks, less efficient but more scalable solutions. Every CTO is being asked by their board, "What about cloud computing?". And thus there was a push-back. In the analysis, there were some major concerns.

  • Vendor-lock-in
  • Sensitive data
  • Security
  • Uncontrolled Outages
  • Lack of SLAs
  • Lack of realized cost savings
  • Latency
So vendors started stepping in, and charging fees on top of fees, purporting to solve the various issues. New phrases were coined:

Virtual Private Cloud - The idea that on-premises or in a shared data-center, you could guarantee hardware and network isolation from peer clients. This guards against government search warrants that have nothing to do with you. It also guards against potential exploits at the VM layer (client A is exploited, and the hacker finds a way to hack the VM and then gets in-memory access to all other OSes running on that hardware, including yours). Companies like Terremark advertise this.

Hierarchical Resource allocation - The IT staff is responsible for serving all divisions of a corporation. But they don't necessarily know the true resource needs - only the apparent requested needs (via official request tickets).

Thus with cloud-in-a-box on-premises appliances, the IT staff can purchase a compute-farm (say 60 CPUs and 20TB of redundant disk space). It then divides this 3 ways into 3 divisions based on preliminary projections of resource consumption. It then grants each division head "ownership" of a subset of the resources. This allocation is not physical, and in fact can allow over-allocation (meaning you could have allocated 200 CPUs even though you only had 60). Those divisions then start allocating OS instances with desired RAM/CPU/disk based on their perceived needs. They could have a custom FTP site, a custom compiler-server, a custom SharePoint, or some shared Windows machine with Visio, utilizing remote desktop for one-at-a-time access. The key is that the turnaround between the end user and the division head is faster than going through joe-sys-admin, who's over-tasked and under-budgeted. There is 'no phone call necessary' for provisioning of a service.

Now the central IT staff monitors the load of the over-committed box. There might be 50 terabytes of block-devices allocated on only 20TB of physical disk-arrays, BUT 90% of all those OS images are untouched. Note that defrag would be a VERY bad thing to run on such machines, because it would commit all those logical blocks into physical ones. Likewise, each OS might report that it has 2GB of RAM, but in reality a 'balloon' driver is leaching 1,200MB back to the host hypervisor (so running graphically rich screen-savers is a bad bad bad thing).
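A tiny sketch of that over-commit bookkeeping (Python; the 50TB / 20TB, 90% untouched, and balloon figures are just the hypothetical numbers from the paragraph above):

```python
# Thin-provisioning / over-commit arithmetic, using the hypothetical
# numbers from the paragraph above.

# Storage: logical block devices promised to guests vs. physical disk.
allocated_tb = 50      # block devices handed out to OS images
physical_tb = 20       # actual disk array capacity
touched = 0.10         # ~90% of those images sit untouched
print(f"storage over-commit:   {allocated_tb / physical_tb:.1f}x")
print(f"physically used today: {allocated_tb * touched:.0f} TB of {physical_tb} TB")

# RAM: each guest believes it has 2 GB, but a balloon driver leases some back.
guest_ram_mb = 2048
balloon_mb = 1200
print(f"RAM really backing each guest: {guest_ram_mb - balloon_mb} MB")
```

This is also why a defrag (or a flashy screen-saver) is so harmful: it converts 'promised but untouched' resources into physically committed ones, collapsing the over-commit ratio.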

As the load of the cloud appliance reaches 70%, they can purchase a second node and vMotion or cold-restart individual heavy VMs over to the new hardware. They then re-portion any remaining unused reserve for given sub-divisions to come instead from the new cloud farm. Rinse and repeat.

At the end of each quarter, divisions pay back their actual resource consumption to the central IT budget.

The stated goal is that, previously, you'd need 1 IT staff member per 50 machines.. Now you can have 1 central IT staff member per 5,000 "virtual" machines. Note, they're still dealing with desktop/laptop/ipad integration issues left and right. They're still pulling bad hard disks left and right. They're still negotiating purchase order agreements and dealing with network outages. But the turn-around for provisioning is practically eliminated, and the cost of provisioning amortized.

Solutions like abiquo provide corporate multi-level hierarchical resource sub-division: from division, to department, to office-floor, to 3-man development team. The only request upstream is for more generic resources (any combination of CPU, RAM, disk). You manage the specific OS/environment needs - and by 'manage', I mean a majority of OS deployments are stock services: for example, a fully licensed Windows 7 OS with Visio installed, a fully configured Hudson continuous integration build machine, a fully configured shared file-system.

Availability Zones

Of course, network issues, DDoS attacks, power issues, and geographic network/power failures all exist. The remaining issue for client-facing services is whether to host your own data-centers or go to a virtual private cloud or even a public cloud for a subset of your business needs/offerings. Here it becomes too expensive to host Tokyo, Ireland, California, and NY data center presences yourself, so it may be more cost effective to just leverage existing hosted solutions. Moreover, some vendors (Akamai through Rackspace, CloudFront through Amazon AWS, etc) offer a thousand data-center 'edge' points which host static content. These massively decrease world-wide average load latencies - something that is nearly impossible to achieve on your own.

Many hosted solutions offer explicit Availability Zones (AZs), even within a given data-center. Obviously it is pointless to have a MySQL slave node on the same physical machine as the master node: if one drive goes down, you lose both data nodes. Of course, with private cloud products, a given vendor will give you assurances that they only report 0.001% return rates, meaning 'just trust their hardware appliance'. You don't need to buy 2 $50,000 NetApp appliances; it's like its own big-iron. But let me assure you: Ethernet ports go bad. Motherboard connectors go bad. Drives have gone bad such that they push high voltage onto a common bus, obliterating all other shared connectors. ANY electrical coupling is a common failure point. I personally see little value in scaling vertically at cost, and instead find greater comfort in scaling horizontally. Some solutions do focus on this electrically isolated fault tolerance, but most focus (for performance reasons) on vertical integration; and most, stupidly, happen to have shared electrical connections with no surge isolation (which makes them sub-mainframe quality).

Business Needs

The conference did help me shift my perception in this one important way. My excitement about solving 'web scale' problems is just that - excitement. It does not directly translate into a business need. Over and over, speakers expressed that at the end of the day, all of this cloud 'stuff' tries to solve just one problem: 'lower cost'. I was initially offended by this assertion, but then I put on my economics 101 cap. All short-term fixed costs eventually become long term variable costs. Any short term problem can be designed for economies of scale in the long run. If I can't 'scale' today, I can over-purchase then rent out the excess tomorrow in 1,000 data centers. I can partner with Akamai directly with a sufficiently high volume tomorrow for cheaper than I can purchase or even lease today. If my software doesn't scale today, I can engineer an assembly language low-latency, custom FPGA design that fixes my bottlenecks tomorrow. ALL problems are solvable in the long run. So the idea that clouds uniquely solve ANY problem is fundamentally flawed. I can do it better, cheaper, faster... tomorrow. The cloud only solves a small subset of problems today. And so the question any board of directors needs to consider when investing in private vs. public infrastructure is time-to-market, and the risk of capital costs. If I know a product will have 3 years of return, and will need a particular ramp-up in deferred capital costs, then I'm pretty sure I can engineer a purchase plan that is cheaper than an associated Amazon AWS solution. It'll run faster, cheaper, and more reliably. BUT, what if those projections are wrong? What if the project is a failure? What if we have spikes earlier than projected? Cloud computing provides SOME degree of risk mitigation, presumably in reducing those costs. Really, it just changes the equation radically enough that old problems are replaced by new ones.

The single biggest beneficiaries of cloud computing are inexperienced divisions or startups: those who don't have talented, seasoned IT staff members (DBAs, certified Cisco engineers, telco contractors on call, etc); those that can't afford the up-front costs; those whose business risks are massive; those whose challenge is raising money until they start showing revenue - yet they can't earn revenue until some hardware/software stack is in place.

To this category, there simply is no alternative. You have zero up-take until you go viral, and then no amount of hardware is sufficient to meet your needs. That exponential growth period is highly unpredictable, and also the one-shot make-or-break moment.

To this category, cloud computing is a no-brainer - at least in the beginning. But even here, growth stories have shown 'cloud' doesn't solve their problems once they achieve a sustainable business model. GitHub, for example, uses Rackspace - BUT only in a traditional datacenter model. They have massive storage needs, which is in contrast to Rackspace's cloud business model. So GitHub doesn't leverage any public managed cloud services; they own their own hardware/software stack. The only thing Rackspace provides is a subset of 'IaaS', which includes Akamai's edge CDN network. And ultimately this 'hybrid' approach is probably where most medium+ businesses will be forced to live.

Security
It was somewhat good to hear that best practices at the conference included single-sign-on services. We're already familiar with Facebook Connect, OpenID and Google's single sign-on, but these are obviously highly proprietary. The more interesting solutions for me were SAML-based, standards-compliant trusted authority solutions: those where your corporate environment can leverage an Active Directory / LDAP metadata store of user+password+roles, then transmit trusted tokens to a Google Apps, Salesforce, etc to access their services with little fear of password hacking; and, of course, you get the value of single sign-on. Here each tier is layered, so the 'password' part can be replaced with a finger-print scanner, or an RFID card+finger+pin combo, etc. I personally like these hybrids, as the Sony exploit has shown that people are dumber than dumb - associating simple dictionary words of 6 characters or less with credit card info. People simply can't be trusted to remember complex pass phrases that aren't biographically linked to metadata easily discoverable about them.

Separately there are all the best practices that SHOULD be honored. Don't ever pass sensitive data that you don't directly need through your network. Don't allow relay of public information through you (e.g. don't attach generic blogs unless necessary and unless monitored). Don't use encryption where assertion can solve the same problem (stolen assertion data is useless to a hacker, whereas stolen encrypted data can be hacked for a single master key).
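To make that last point concrete, here's a toy sketch of the 'assertion' idea - not SAML itself (real deployments use SAML/OAuth libraries and an identity provider), just a hypothetical HMAC-signed, expiring claim built from the Python standard library. A stolen token reveals no password and can't be forged or extended without the signing key, which is the property being argued for above.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical signing key; in practice this lives only with the identity provider.
SIGNING_KEY = b"keep-this-with-the-identity-provider"

def make_assertion(user, roles, ttl_s=300):
    """Issue a short-lived signed claim ('this user has these roles')
    instead of shipping passwords or encrypted sensitive data around."""
    claim = json.dumps({"sub": user, "roles": roles,
                        "exp": int(time.time()) + ttl_s})
    body = base64.urlsafe_b64encode(claim.encode()).decode()
    sig = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def verify_assertion(token):
    body, _, sig = token.rpartition(".")
    expected = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None                                  # forged or tampered
    claim = json.loads(base64.urlsafe_b64decode(body.encode()))
    if claim["exp"] < time.time():
        return None                                  # expired; stolen copies go stale
    return claim

token = make_assertion("alice", ["billing", "reports"])
print(verify_assertion(token))      # the relying service never sees a password
```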

Conclusions
I think that businesses HAVE to follow the Joneses. But they should do so pragmatically, and in a venture-capital sort of way: fund initiatives to see if they have practical ROI. Do they solve more problems than they cause? Keep your company and team agile (fast turn-around time and an ability to shift directions). Keep them apprised of possible solutions that solve problems more quickly, efficiently, cheaply. But remember that in 5 years, we will lament the whole 'cloud' era and laugh at people that still use centralized data, much like we did in the early 90s. iPad/Android peer-to-peer apps hiding data from 1984-style oppressive government eyes may matter more than consistent, up-to-date data. Who knows what tomorrow's critical challenges will be. So I wouldn't put too much stock (literally) in moving existing solutions over to public clouds. But high-risk projects with large hardware needs and potentially short lifetimes make a lot of sense for getting your corporate feet wet.
