Intro
I recently had the chance to go to a cloud computing expo in NYC. I didn't think there would be much to learn, and indeed, most of the presentations were very high level. But if you paid attention, there were many little gems.
Where we've been
Virtualization has been around for a while. A LONG while. IBM has been doing this since the 70's. The idea was that a business needed a really, really, really reliable system, so it needed 3-way redundancy for every aspect of the computing framework. The devices were self-healing; on failure, the hardware would reroute to working chips, drives, etc. Now that you've got this multi-million dollar system, it would be kind of nice to not let it sit idle all the time. So we invent time-sharing. This basically allows multiple users to perform their tasks at the same time (possibly in isolated operating system slices). This has the effect of making the whole system slower, but the goal was data reliability and price, not performance.
Then came the cheap commodity DOS and Windows hardware. This meant smaller businesses could solve the same basic problems without 'big iron'. And with the manual backup process, you could have as much fault tolerance as you could eat. Big corporations now represented dis-economies of scale. They were either so large that they couldn't survive down-time, or so large that their data couldn't fit on floppies, which meant their ONLY option was expensive hardware.
Over time, various UNIX flavors started replacing big iron. Now 'cheap' $20,000 servers could handle thousands of users in time-sharing environments, and could utilize 'cheap' SCSI disks in a new-fangled RAID configuration, as described by researchers at UC Berkeley. Five 200 MB SCSI disks could get you a whopping 800 MB of reliable disk storage! A new era of mid-size companies was taking over. And the 90's tech bubble was fueled by incremental yet massively scaling compute capabilities. Here the IT staff was king. The business models were predicated on bleeding-edge data or compute capabilities, and thus it was up to the IT division to make it happen, and the company was made or broken on its ability to meet the business challenges... Of course, as we all know, not every business model made any sense at all, but that's another story.
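The disk arithmetic above works out if one disk's worth of space goes to redundancy. Here is a minimal sketch, assuming a single-parity layout such as RAID 5 (the text doesn't say which RAID level was used, and the function name is mine):

```python
def raid5_usable_mb(disk_count: int, disk_size_mb: int) -> int:
    """Usable space in a single-parity array: one disk's worth holds parity."""
    if disk_count < 3:
        raise ValueError("a single-parity array needs at least three disks")
    return (disk_count - 1) * disk_size_mb


# Five 200 MB disks, as in the example above.
print(raid5_usable_mb(5, 200))  # 800
```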
The next phase was the introduction of free operating systems, namely Linux. The key was that now we could make cost-effective corporate appliances that were single-functioned. And in the world of network security, there was a constant need to isolate services and regularly apply security patches.
Enter VMware
While a mostly academic endeavor, it had long been possible to 'simulate' a computer inside a computer. I personally worked on one such project. Many companies, like Apple and DEC, had strong needs to 'migrate' or carry over applications written for different operating systems, and they sometimes found that it was easier to just emulate the hardware. IBM once tried software service emulation with their OS/2 to support Windows 3.0 style software, and likewise Linux had a Windows 3.0 software-stack emulator called WINE. But both of these endeavors were HIGHLY unreliable, in that if they missed something, it wouldn't be visible until something crashed. And further, they couldn't be as efficient as running on the native OS, raising the question of whether full hardware emulation might still be better.
So now VMware had found some techniques to not emulate the CPU itself, but instead only emulate the privileged instructions that represent OS calls. This is presumably a much smaller percentage of emulation, and thus faster than full CPU emulation. You can then act as a proxy, routing those OS calls to a virtualized OS. So the application runs natively, the transition to the OS is proxied, and the OS runs in some combination of native execution and expensive CPU emulation. Namely, if it's just some resource allocation management algorithm that requires no special CPU instructions, then that'll run natively in the OS, but if the OS call is actually manipulating IO ports, virtual memory mappings, etc., then each of those instructions will be heavily trapped (proxied). Later, VMware-type solutions found they could outright replace specific sections of OS code with proxied code, minimizing the expensive back-and-forth between the VM and the target OS.
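To make the 'run natively, trap the privileged bits' idea concrete, here is a toy model. It is only a sketch of the general trap-and-emulate idea, not how VMware is actually implemented; the instruction names, the ToyHypervisor class and the port number are all made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical "privileged" instruction names for this toy machine.
PRIVILEGED = {"out", "load_page_table"}


@dataclass
class ToyHypervisor:
    trapped: list = field(default_factory=list)       # every op that trapped
    virtual_ports: dict = field(default_factory=dict)  # emulated IO ports

    def emulate(self, op, *args):
        """Handle a privileged op against virtual state, never real hardware."""
        self.trapped.append(op)
        if op == "out":
            port, value = args
            self.virtual_ports[port] = value


def run_guest(program, hypervisor):
    """Run plain arithmetic directly; hand privileged ops to the hypervisor."""
    acc = 0
    for op, *args in program:
        if op in PRIVILEGED:
            hypervisor.emulate(op, *args)  # the expensive, proxied path
        elif op == "add":
            acc += args[0]                 # the cheap, "native" path
        elif op == "mul":
            acc *= args[0]
        else:
            raise ValueError(f"unknown op {op!r}")
    return acc


program = [("add", 2), ("mul", 21), ("out", 0x3F8, 42), ("add", 0)]
hv = ToyHypervisor()
result = run_guest(program, hv)
print(result, hv.trapped, hv.virtual_ports)
```

Only one of the four instructions takes the slow path here, which is the whole point: the more IO-heavy the guest, the more time it spends trapped.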
Thus VMware was notoriously slow for IO operations, but decently fast for most other operations, and, like virtual CPU emulation, it was 100% compatible with a target OS. You could run Linux, Windows, OS/2, and theoretically Mac OS all side by side on the same machine. This meant if you had software that only ran on one OS, you could have access to it.
Eventually people realized that this could solve more than just software access... What if I knew that FTP sites around the world were being hacked? Sure, I'll patch it, but what if they discover a new flaw before I can patch? They'll get to my sensitive data... So, let's put an FTP server inside its own OS. But that means I'd have to buy a new machine, a new UPS, a new network switch (if my current one has limited ports). Why not just create a VMware instance and run a very small Linux OS (with 64 MB of RAM) which just runs the FTP service? Likewise with sendmail, or ssh servers, or telnet servers, or NFS shares, etc. Note these services are NOT fast, and you have to have redundant copies of the OS and memory caches of common OS resources. But that's still cheaper than allocating dedicated hardware.
So technically VMware didn't save you any money. You're running more RAM, more CPU and slower apps than if you ran a single machine. But the 'fear' of viruses, exploitation, infiltration, etc. made you WANT to spend more money. So assuming that your CTO gave you that extra money, you've saved it by instead going to VMware. See, free MONEY!!
If you were concerned about exploitation, then there was one glaring hole here. The VM ran on some OS that might have its own flaws... Further, the path is from the guest app trapping to VMware (running as a guest app on a host OS), which then delegates the CPU instruction to the guest OS. And the host OS may or may not allow VMware the most efficient way to make this transition. Obviously VMware needed special proprietary hooks into the host OS anyway. Sooo.
VMware eventually wrote their own OS, called ESXi. This was called a hypervisor: a 'bare metal' OS that did nothing but support the launching of real OSes in a virtualized environment.
In theory this gave the OS near native speeds if it was the only thing running on the machine, since there was only a single extra instruction proxy call needed in many cases.
So now we can start innovating and finding more problems to solve (and thus more services to charge the customer for). So we come up with things like:
- Shared storage allows shut-down on machine 1 and boot up on machine 2. This allows hardware maintenance with only the wasted time to shutdown/boot.
- vMotion allows pretending that you're out of RAM, forcing the OS to swap to disk, which in reality forces those disk writes into an alternate machine's RAM. When the last page is swapped, the new OS takes over and the original OS is killed. This is near-instant fail-over (dependent on the size of RAM).
- Toggling RAM/num-CPUs per node.
- Using storage solutions which give 'snapshot' based backups.
- Launching new OS images based on snapshots.
- Rollback to previous snapshots within an OS.
- This all allows you to try new versions of the software before you decide to migrate over to it. And allows undoing installations (though with periodic down-time).
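To get a feel for why moving a running machine can be 'near instant', here is a toy model of iterative pre-copy, a commonly described approach to live migration. I'm not claiming this is what vMotion does internally; the function name, the fixed dirty percentage and the stop threshold are assumptions for illustration.

```python
def precopy_rounds(total_pages, dirty_percent, stop_threshold, max_rounds=30):
    """Return the number of pages sent in each round.

    Every round copies the outstanding pages while the guest keeps running,
    so some percentage of them is dirtied again and must be re-sent. The
    last entry is the final batch, copied while the guest is paused.
    """
    to_send = total_pages
    sent = []
    for _ in range(max_rounds):
        if to_send <= stop_threshold:
            break
        sent.append(to_send)
        # Integer math: pages dirtied again while this round was copying.
        to_send = to_send * dirty_percent // 100
    sent.append(to_send)  # the only batch that costs real down-time
    return sent


# Hypothetical guest: 1,000,000 pages, 10% re-dirtied per round,
# pause the guest once 1,000 or fewer pages remain.
print(precopy_rounds(1_000_000, 10, 1_000))
# [1000000, 100000, 10000, 1000]
```

The down-time depends on the last number, not the first, which is why a busy guest that dirties RAM quickly is harder to move than an idle one.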
The reason Linux was valuable during this period was that making these OS snapshots, mirrored images, etc. had no licensing restrictions. You were free to make as many instances / images as you saw fit. With a Windows environment, you had to be VERY careful not to violate your licensing agreement. In some cases the OS would detect duplicate use of the license key and deactivate/cripple that instance. That is something that can be overcome, but not by the casual user/developer.
Note that VMware was by far not the only player in this market. Citrix's Xen, Red Hat's KVM, VirtualBox, and others had their own directions.
Hosted Solutions:
In parallel with all of this came the hosted web-site solutions. Build-a-web-site with WYSIWYG such that grandma can build/publish online with ease.
Next to that were leased hardware solutions: Server Beach, Rackspace, etc. There were also the good old time-sharing solutions, where you'd just be granted a login, literally next to hundreds of concurrent users.
Whatever the need, the idea was: why manage your own hardware, datacenter and internet connection? That's an economy of scale sort of thing. Someone will buy a 1 Gbps connection, 500 machines and the associated large-scale backup / switches / storage. They then lease it out to you for more than cost. You avoid up-front costs, and they get a decent business out of it. It's a win-win... Sometimes.
The problem is that installation on these solutions leaves you with very little to build a business on. You can download and install free software, but commercial software is largely difficult to deploy, especially if the hardware has no guarantees and from month to month you may be forced to switch physical hardware (which would auto-invalidate your licenses).
Free software was still fledgling. Databases were somewhat slow/unreliable. Authentication systems were primitive. Load balancing techniques were unreliable/unscalable. And, of course, the network bandwidth that most providers gave was irregular at best.
Enter Amazon AWS
So some companies decided that it would make sense to try and solve ALL the building block needs.. Networking, load balancing, storage, data-store, relational data-store, messaging, emailing, monitoring, fast fault tolerance, fast scaling...
But more importantly, these building blocks happen without a phone call. Without a scale-negotiated pricing agreement. Without an SLA. It is a la carte. Credit-card end-of-month 'charge-back'. You pay for what you use on an hourly or per-click basis. This means if you are small, you're cheap. If you have bursts, you pay only for that burst (which is presumably profitable and thus worth it). And if you need to grow, you can. And you can always switch providers at the drop of a hat.
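The 'pay only for the burst' point is easy to see with a little arithmetic. This is a sketch with made-up numbers; the flat rate of 10 cents per instance-hour and the demand curve are assumptions, and real owned hardware would not be priced per hour at the same rate.

```python
def pay_per_use_cents(instances_per_hour, cents_per_instance_hour):
    """Bill only the instance-hours that were actually running."""
    return sum(instances_per_hour) * cents_per_instance_hour


def sized_for_peak_cents(instances_per_hour, cents_per_instance_hour):
    """Bill a fixed fleet big enough for the busiest hour, for every hour."""
    peak = max(instances_per_hour)
    return peak * len(instances_per_hour) * cents_per_instance_hour


# Hypothetical day: 2 instances for 20 hours, then a 4-hour burst needing 20.
day = [2] * 20 + [20] * 4
print(pay_per_use_cents(day, 10))     # 1200
print(sized_for_peak_cents(day, 10))  # 4800
```

The gap between the two numbers is what you'd otherwise pay for machines that sit idle 20 hours a day.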
It's this instant provisioning and a la carte solution pricing that's innovative here. And it is provisioning that gives you commercial-grade reliability (e.g. alternatives to Oracle RAC, Cisco/F5 load balancers, NetApp).
Along with this came the proliferation of online add-on service stacks. The now classic example is salesforce.com: an end solution that can naturally be extended for custom needs. This extensibility opens up secondary markets and business partnership opportunities.
So today, apparently, we use the following buzzwords to categorize all of the above.
SaaS - Software as a Service
This is an eBay or a salesforce.com: some end software solution (typically a website / web service) that can also be built upon (the service aspect). The key is charging per usage volume, and pluggability. A cnn.com is not pluggable and thus not categorized as SaaS.
PaaS - Platform as a Service
This is Amazon AWS (language neutral), Google App Engine (Java/Python/Go), and Microsoft Azure (.NET). These are a la carte charge-back micro-services. The end company makes money (or, as in Google's case, recoups costs) and provides highly scalable solutions, so long as you stay within their sandbox. Currently there is vendor lock-in, insofar as you can't swap the techniques of vendor A with those of vendor B. And this isn't likely to change, if for no other reason than that the languages themselves differ between these various platforms. Thus even a common SDK is unlikely to provide abstraction layers.
IaaS - Infrastructure as a Service
This is the classic hosted hardware with only the ability to provision/decommission hardware. Amazon EC2, Rackspace, Terremark, Server Beach, etc. are all in this model.
You are charged per use. You have no visibility into the hardware itself, only RAM size, number of CPUs, some benchmark representing minimum quality-of-CPU-compute-capability, and disk size.
In some instances you can mix and match. In others, you are given natural divisions of a base super-unit of resources. Namely, you can split the basic compute unit in half, in quarters, or 8, 16 or 32 ways (as with Rackspace). The needs of your business may dictate one vendor's solution vs. another, including whether they properly amortize valid Windows licenses.
The reaction:
So, of course, the corporate buzz became: leased hardware and leased software stacks make for less efficient but more scalable solutions. Every CTO is being asked by their board, "What about cloud computing?". And thus there was a push-back. In the analysis, there were some major concerns.
- Vendor-lock-in
- Sensitive data
- Security
- Uncontrolled Outages
- Lack of SLAs
- Lack of realized cost savings
- Latency
Virtual Private Cloud - The idea that on-premises or in a shared data-center, you could guarantee hardware and network isolation from peer clients. This guards against government search warrants that have nothing to do with you. This also guards against potential exploits at the VM layer (client A is exploited, and the hacker finds a way to hack the VM and then gets in-memory access to all other OSes running on that hardware, including yours). Companies like Terremark advertise this.
Hierarchical Resource allocation - The IT staff is responsible for serving all divisions of a corporation. But they don't necessarily know the true resource needs - only the apparent requested needs (via official request tickets).
Thus with cloud-in-a-box on-premises appliances, the IT staff can purchase a compute farm (say 60 CPUs and 20TB of redundant disk space). It then divides this 3 ways among 3 divisions based on preliminary projections of resource consumption. It then grants each division head "ownership" of a subset of the resources. This allocation is not physical, and in fact can allow over-allocation (meaning you could have allocated 200 CPUs even though you only had 60). Those divisions then start allocating OS instances with desired RAM/CPU/disk based on their perceived needs. They could have a custom FTP site, a custom compiler server, a custom SharePoint, or some shared Windows machine with Visio, using remote desktop for one-at-a-time access. The key is that the latency between the end user and the division head is lower than the latency to joe-sys-admin, who's over-tasked and under-budgeted. There is 'no phone call necessary' for provisioning of a service.
Now the central IT staff monitors the load of the over-committed box. There might be 50 terabytes of block devices allocated on only 20TB of physical disk arrays, BUT 90% of all those OS images are untouched. Note that defrag would be a VERY bad thing to run on such machines, because it would commit all those logical blocks into physical ones. Likewise, each OS might report that it has 2GB of RAM, but in reality a 'balloon' app is leeching 1,200 MB back to the host hypervisor (so running graphically rich screen-savers is a bad bad bad thing).
As the load of the cloud appliance reaches 70%, they can purchase a second node and vMotion or cold-restart individual heavy VMs over to the new hardware. They then re-portion any remaining unused reserve for given sub-divisions to come instead from the new cloud farm. Rinse and repeat.
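The disk over-commit numbers above can be checked with a small report. This is a sketch under my own assumptions: the ThinVolume class, the field names, and the split into 50 equal 1,000 GB volumes are invented to match the 50TB-on-20TB example, not taken from any product.

```python
from dataclasses import dataclass


@dataclass
class ThinVolume:
    name: str
    allocated_gb: int  # the size the guest OS believes it has
    written_gb: int    # blocks actually backed by physical disk


def overcommit_report(volumes, physical_gb):
    """Summarize how far a thin-provisioned pool is over-committed."""
    allocated = sum(v.allocated_gb for v in volumes)
    written = sum(v.written_gb for v in volumes)
    return {
        "allocated_gb": allocated,
        "written_gb": written,
        "overcommit_ratio": allocated / physical_gb,
        "physical_used_fraction": written / physical_gb,
        # What a defrag-style "touch every block" workload would need.
        "fits_if_fully_written": allocated <= physical_gb,
    }


# 50 guests with 1,000 GB each, 90% of every image untouched, on 20 TB.
volumes = [ThinVolume(f"vm{i}", 1000, 100) for i in range(50)]
report = overcommit_report(volumes, 20_000)
print(report)
```

With these numbers the pool is 2.5x over-committed but only a quarter full, and it stays healthy only as long as the guests don't all decide to touch every block.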
At the end of each quarter, divisions pay back their actual resource consumption to the central IT budget.
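The quarterly pay-back is just metered usage times a rate. A minimal sketch follows; the resource names, the rates in cents, and the two divisions are hypothetical.

```python
def quarterly_chargeback(usage_by_division, cents_per_unit):
    """Return {division: cents owed} from metered usage and per-unit rates."""
    bills = {}
    for division, usage in usage_by_division.items():
        bills[division] = sum(
            quantity * cents_per_unit[resource]
            for resource, quantity in usage.items()
        )
    return bills


# Hypothetical internal rates, in cents per unit.
rates = {"cpu_hours": 5, "gb_months": 10}
usage = {
    "engineering": {"cpu_hours": 40_000, "gb_months": 6_000},
    "marketing": {"cpu_hours": 2_000, "gb_months": 500},
}
print(quarterly_chargeback(usage, rates))
# {'engineering': 260000, 'marketing': 15000}
```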
The stated goal is that, previously, you'd need 1 IT staff member per 50 machines. Now you can have 1 central IT staff member per 5,000 "virtual" machines. Note, they're still dealing with desktop/laptop/iPad integration issues left and right. They're still pulling bad hard disks left and right. They're still negotiating purchase order agreements and dealing with network outages. But the turn-around for provisioning is practically eliminated, and the cost of provisioning is amortized.
Solutions like Abiquo provide corporate multi-level hierarchical resource sub-divisioning: from division, to department, to office floor, to 3-man development team. The only request upstream is for more generic resources (any combination of CPU, RAM, disk). You manage the specific OS/environment needs. And by 'manage', I mean that a majority of OS deployments are stock services. For example, a fully licensed Windows 7 OS with Visio installed. A fully configured Hudson continuous integration build machine. A fully configured shared file-system.
Availability Zones
Of course, network issues, DDoS issues, power issues and geographic network/power failures still exist. The remaining issue for client-facing services is whether to host your own data-centers or go to a virtual private cloud or even a public cloud for a subset of your business needs/offerings. Here, it becomes too expensive to host Tokyo, Ireland, California and NY data center presences yourself, so it may be more cost effective to just leverage existing hosted solutions. Moreover, some vendors (Akamai through Rackspace, CloudFront through Amazon AWS, etc.) offer a thousand data-center 'edge' points which host static content. These massively decrease world-wide average load latencies. This is nearly impossible to achieve on your own.
Many hosted solutions offer explicit Availability Zones (AZs), even within a given data-center. Obviously it is pointless to have a MySQL slave node on the same physical machine as the master node. If one drive goes down, you lose both data nodes. Of course, with private cloud products, a given vendor will give you assurances that they only report 0.001% return rates. Meaning, 'just trust their hardware appliance'. You don't need to buy two $50,000 NetApp appliances. It's like its own big iron. But let me assure you: Ethernet ports go bad. Motherboard connectors go bad. Drives have gone bad such that they push high voltage onto a common bus, obliterating all other shared connectors. ANY electrical coupling is a common failure point. I personally see little value in scaling vertically at cost, and instead find greater comfort in scaling horizontally. Some solutions do focus on this electrically isolated fault tolerance, but most focus (for performance reasons) on vertical integration; and most, stupidly, happen to have shared electrical connections with no surge isolation (which makes them sub-mainframe quality).
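The 'don't put the slave next to the master' rule is easy to check mechanically. Here is a small sketch that flags nodes sharing a failure domain; the placement dictionary layout, the host names and the zone names are all hypothetical.

```python
from collections import defaultdict


def shared_failure_domains(placements, domain_key):
    """Return {domain: [nodes]} for every domain that holds more than one node.

    placements maps a node name to where it runs, e.g.
    {"mysql-master": {"host": "h1", "zone": "az-a"}}.
    """
    groups = defaultdict(list)
    for node, where in placements.items():
        groups[where[domain_key]].append(node)
    return {d: sorted(nodes) for d, nodes in groups.items() if len(nodes) > 1}


bad = {
    "mysql-master": {"host": "h1", "zone": "az-a"},
    "mysql-slave": {"host": "h1", "zone": "az-a"},
}
good = {
    "mysql-master": {"host": "h1", "zone": "az-a"},
    "mysql-slave": {"host": "h7", "zone": "az-b"},
}
print(shared_failure_domains(bad, "host"))   # {'h1': ['mysql-master', 'mysql-slave']}
print(shared_failure_domains(good, "zone"))  # {}
```

The same check works for any coupling you care about (rack, power feed, switch), as long as you record it in the placement data.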
Business Needs
The conference did help me shift my perception in this one important way. My excitement about solving 'web scale' problems is just that: excitement. It does not directly translate into a business need. Over and over, speakers expressed that at the end of the day, all of this cloud 'stuff' tries to solve just one problem: 'lower cost'. I was initially offended by this assertion. But then I put on my economics 101 cap: all short-term fixed costs eventually become long-term variable costs. Any short-term problem can be designed for economies of scale in the long run. If I can't 'scale' today, I can over-purchase, then rent out the excess tomorrow in 1,000 data centers. With a sufficiently high volume, I can partner with Akamai directly tomorrow for cheaper than I can purchase or even lease today. If my software doesn't scale today, I can engineer a low-latency assembly-language or custom FPGA design that fixes my bottlenecks tomorrow. ALL problems are solvable in the long run. So the idea that clouds uniquely solve ANY problem is fundamentally flawed. I can do it better, cheaper, faster... tomorrow. The cloud only solves a small subset of problems today. And so the question any board of directors needs to consider when investing in private vs. public infrastructure is time-to-market, and the risk of capital costs. If I know a product will have 3 years of return, and will need a particular ramp-up in deferred capital costs, then I'm pretty sure I can engineer a purchase plan that is cheaper than an associated Amazon AWS solution. It'll run faster, cheaper, and more reliably. BUT, what if those projections are wrong? What if the project is a failure? What if we have spikes earlier than projected? Cloud computing provides SOME degree of risk mitigation, presumably by reducing the costs. Really, it just changes the equation radically enough so that old problems are replaced by new ones.
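The 'what if those projections are wrong?' question can be put into numbers. This is a sketch with invented figures (a $100,000 purchase, $2,000/month to run it, $6,000/month to rent the equivalent, and a coin-flip chance that the project dies after 6 months); none of them come from the conference.

```python
def owned_cost(upfront, monthly_running, months_alive):
    """Buy hardware up front, then pay to run it while the project lives."""
    return upfront + monthly_running * months_alive


def rented_cost(monthly_rent, months_alive):
    """Pay only for the months the project actually exists."""
    return monthly_rent * months_alive


def expected_cost(cost_for_months, scenarios):
    """scenarios is a list of (probability, months_alive) pairs."""
    return sum(p * cost_for_months(months) for p, months in scenarios)


def own(months):
    return owned_cost(100_000, 2_000, months)


def rent(months):
    return rented_cost(6_000, months)


# If the 3-year projection holds, owning wins.
print(own(36), rent(36))  # 172000 216000

# If there is a 50% chance the project is cancelled after 6 months,
# renting wins on expected cost.
scenarios = [(0.5, 36), (0.5, 6)]
print(expected_cost(own, scenarios), expected_cost(rent, scenarios))
# 142000.0 126000.0
```

Owning is cheaper when the projection holds, and renting is cheaper once you price in the chance of being wrong, which is the time-to-market and capital-risk trade-off in miniature.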
The single biggest beneficiaries of cloud computing are inexperienced divisions or startups. Those who don't have talented, seasoned IT staff members (DBAs, certified Cisco engineers, telco contractors on call, etc.). Those that can't afford the up-front costs. Those whose business risks are massive. Those whose challenge is raising money until they start showing revenue. Yet they can't earn revenue until some hardware/software stack is in place.
To this category, there simply is no alternative. You have zero up-take until you go viral. Then no amount of hardware is sufficient to meet your needs. That exponential growth period is highly nondeterministic, and also the one-shot make-or-break moment.
To this category, cloud computing is a no-brainer. At least in the beginning. But even here, growth stories have shown 'cloud' doesn't solve their problems once they achieve a sustainable business model. 'github', for example, uses Rackspace, BUT only in a traditional datacenter model. They have massive storage needs, which is at odds with Rackspace's business model. So github doesn't leverage any public managed cloud services. They own their own hardware/software stack. The only thing Rackspace provides is a subset of 'IaaS', which includes Akamai's edge CDN network. And ultimately this 'hybrid' approach is probably where most medium+ businesses will be forced to live.
Security
It was somewhat good to hear that some best practices at the conference included single-sign-on services. We're already familiar with Facebook Connect, OpenID and Google's single sign-on. But these are obviously highly proprietary. The more interesting solutions for me were SAML-based, standards-compliant trusted authority solutions. Those where your corporate environment can leverage an Active Directory / LDAP metadata store of user+password+roles, then transmit trusted tokens to a Google Apps, salesforce, etc. to access their services with little fear of password hacking; and, of course, there is the value of single sign-on. Here each tier is layered, so the 'password' part can be replaced with a fingerprint scanner, or an RFID card+finger+pin combo, etc. I personally like these hybrids, as the Sony exploit has shown that people are dumber than dumb: associating simple dictionary words of 6 characters or fewer with credit card info. People simply can't be trusted to remember complex pass phrases that aren't biographically linked to metadata easily discoverable about them.
Separately there are all the best practices that SHOULD be honored. Don't ever pass sensitive data that you don't directly need through your network. Don't allow relay of public information through you (e.g. don't attach generic blogs unless necessary and unless monitored). Don't use encryption where assertion can solve the same problem (stolen assertion data is useless to a hacker, whereas stolen encrypted data can be hacked for a single master key).
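To illustrate the 'assertion instead of encryption' idea, here is a sketch of a signed, expiring token that carries who you are and what roles you have, but no password. Note the swap: real SAML assertions are XML documents signed with XML signatures, whereas this stand-in uses a shared-key HMAC from the Python standard library purely to show the shape of the idea. The function names, claim names and key are all made up.

```python
import base64
import hashlib
import hmac
import json


def issue_assertion(key: bytes, subject: str, roles: list, expires_at: int) -> str:
    """Sign a statement about a user. No password ever goes into the token."""
    payload = json.dumps(
        {"sub": subject, "roles": roles, "exp": expires_at}, sort_keys=True
    ).encode()
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(tag).decode()
    )


def verify_assertion(key: bytes, token: str, now: int):
    """Return the claims if the signature is valid and unexpired, else None."""
    payload_b64, tag_b64 = token.split(".")
    payload = base64.urlsafe_b64decode(payload_b64)
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, base64.urlsafe_b64decode(tag_b64)):
        return None  # forged or tampered
    claims = json.loads(payload)
    if now >= claims["exp"]:
        return None  # a stolen token stops working here
    return claims


key = b"hypothetical-shared-secret"
token = issue_assertion(key, "alice", ["dev"], expires_at=1_000)
print(verify_assertion(key, token, now=500))    # {'exp': 1000, 'roles': ['dev'], 'sub': 'alice'}
print(verify_assertion(key, token, now=1_500))  # None
```

A thief who grabs this token learns no password and gets nothing once it expires, though until then it could be replayed, so short lifetimes matter.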
Conclusions
I think that businesses HAVE to keep up with the Joneses. But they should do so pragmatically, and in a venture-capital sort of way. Fund initiatives to see if they have practical ROI. Do they solve more problems than they cause? Keep your company and team agile (fast turn-around time and with an ability to shift directions). Keep them apprised of possible solutions that solve problems more quickly, efficiently, cheaply. But remember that in 5 years, we will lament the whole 'cloud' era and laugh at people that still use centralized data. Much like we did in the early 90s. Maybe iPad/Android peer-to-peer apps hiding data from 1984-style oppressive government eyes will be more important than consistent up-to-date data. Who knows what tomorrow's critical challenges will be. So I wouldn't put too much stock (literally) in moving existing solutions over to public clouds. But for high-risk projects with large hardware needs and potentially short lifetimes, it does make a lot of sense to get your corporate feet wet.