Saturday, February 23, 2013

Git and Mac OS 'brew' case study.

Summary

The Git Distributed Version Control System (DVCS) is one of my current favorite computer science models.  And I am ecstatic to see it starting to find its way into the everyday tools I use.

This morning I was reminded that the Mac OS 'brew' package manager (a third-party open-source library manager with a Linux feel) needs only a 'git pull' command to update its local database.  From this, I was reminded of all the complexities of configuration ownership, and in an instant my mind was free of frustration (albeit only briefly).
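For the curious, that update is (as of this writing) roughly equivalent to the following; the path is brew's default location and may differ on your machine:

    cd /usr/local    # brew's repository root
    git pull         # fetch and merge the latest package formulas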

What is Git?


My summary of Git is that it solves 3 problems:
  1. How do I represent an arbitrary file in an arbitrary state? (e.g. maintain changes)
  2. How do I share these files and associated changes with someone else?
  3. How do I integrate files and changes from someone else into one or more files?
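A minimal sketch of all three in everyday commands (the file name and remote URLs here are hypothetical):

    # 1. represent a file in an arbitrary state (maintain changes)
    git add shopping-list.txt
    git commit -m "add eggs"
    # 2. share those files and changes with someone else
    git push https://example.com/me/lists.git master
    # 3. integrate files and changes from someone else
    git pull https://example.com/bob/lists.git master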
Now, you might think.. Oh, but X, Y, Z already does this... Perhaps.. And perhaps very well (the Adobe Photoshop suite has an excellent integrated collaboration system).  Microsoft SharePoint integrates office documents pretty decently.  Subversion + Apache is an excellent centralized poor-man's publication system.  Then of course there are formal publication tools.

But all the above are special-case, proprietary, and require central control (and thus present bureaucratic delays in mental work-flows).  In order to 'save' a file, I have to publish the file... In order to save the file, I have to get a sys-admin to create a repository.  In order to get a file, I have to 'ask' my peer to publish their file.  In order to reconcile, there can be only 1 canonical representation.  If a casual user were to go looking for my file, they'd probably pick the canonical one (even though multiple tags, branches, and versions exist), and thus everybody has to agree on what that is (which means you are FORCED to use a bureaucracy)...

Now don't get me wrong, bureaucracy is a good thing.. But it's not applicable to every situation.  And, as I mentioned, it interrupts the creative process, imposing possibly weeks of delay (Company A and Company B spend 2 weeks agreeing on a file exchange protocol like ftp, Kerberos + subversion, etc).  Or, what is the absolute worst: to avoid the bureaucracy, they just email each other files.  I hope I don't have to explain the horrors of this process and its likely outcome over time.

How is Git Different?

  • Git uses cryptographic SHA-1 hashes of 'Objects' (really just files, but some file formats are special, like directories, tags, commits) - see the sketch after this list
    • This means all objects can be independently verified by comparing them to their hashes
    • All differing objects are universally unique (a vanishingly low probability that two objects will EVER have the same hash - though with enough objects in a working-set, say billions, it is theoretically possible)
  • Git uses the SHA-1 as the object-ID
    • All IDs are the same size 
      • allows efficient database indexing
    • IDs are relatively small (20 bytes binary / 40 bytes hex)
      • allows efficient storage / network transmission
    • IDs from different locations can safely be merged together (thanks to global uniqueness)
  • Objects can stand alone
    • You can inject any object into, or delete any object from, any Git system
    • You can create, split, merge, archive, and delete objects independently of git repositories
    • They are unique, eternal representations of an object (file)
    • They are independent of their storage format (raw, prefixed, gzip'd, xdelta'd, future)
  • Objects themselves are NOT versioned (versioning is externalized)
    • versioning is external to a file (unlike RCS, subversion, office-documents, etc)
    • Allows SHA-1 to never need changing
    • Allows alternate version-control systems to be applied to the same logical change-sets
    • Allows multiple parallel histories of the same file (below)
  • Non-trivial files have a canonical representation
    • Directories, tags, and change-sets are all sorted with a well-formed structure
    • This allows two independent authors to produce identical object-SHAs for coincidentally identical content bundles.
  • Efficient work-space representation
    • Entire history is stored locally for rapid analysis and version-mutation
    • Entire history is generally gzip + xdelta'd and naturally de-dupped (due to SHA-IDs)
    • Fully fault-tolerant (due to SHA-IDs, and an unlimited number of work-space copies)
    • A single checked-out work-space representation
      • Rapid local delta-ing when switching between points in the version history (switching to a tag, branch, or arbitrary change-set point)
    • 'hard link' based cloned work-spaces minimize (though do not eliminate) overhead
      • Works on most modern file-systems: ntfs, ufs, ext, hfs, xfs, zfs, btrfs, etc
    • network-copies are rsync deltas of object-bundles.
  • Minimal "central" server load (a central server is not even required)
    • Due to laptop proliferation (transient up-time), central 'push' servers are valuable for collaboration.
    • The central server is nothing but a dumb file-system with whatever convenient copy-in, copy-out protocols are available (DAV, 'git', rsync, SMB, NFS, ssh+scp, etc).
      • Since you're just transferring 'compacted' object bundles and an index of ID to bundle-offset, virtually anything can work; you just want to avoid concurrent over-writing of some index files
  • Change-set Parallel history universes (below)
  • Promotes Social collaboration
    • Does not enforce a methodology on others
    • Is "pull" centric instead of "push" centric.
      • You look at everyone else's changes and choose what you want from them
        • Fetching is getting remote changes without integration (merging)
      • You only ever push to YOUR repository; separate from everyone else
    • Encourages 'asking' people to publish their incomplete changes
      • Empowers content authors
      • Emboldens communication by content browsers / contributors
    • Disagreements can be resolved via publication namespaces (URLs)
      • Owner of project owns canonical namespace for publication
      • Others can use different [temporary] name-spaces
      • Until resolved, participants can choose to use whichever namespace is most appropriate for continued productivity
      • Volatile / politically dangerous changes can be hidden in private namespaces until/unless a time is appropriate.
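Here's a quick feel for the object model from the command line (a sketch; the ID you see depends only on the content, which is the whole point):

    echo 'eggs' > list.txt
    git hash-object list.txt     # prints the SHA-1 that identifies this blob
    git add list.txt             # actually stores the blob under .git/objects
    git cat-file -t <that-sha>   # 'blob': the object's type
    git cat-file -p <that-sha>   # the stored content, byte for byte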

What is a file?

What is a file?  If I rename it, is it the same file?  What if I wanted to swap two file names?  What if I deleted a file, then later created a new file that happened to share the same name?  What if a file's name and contents stayed similar, but a subtle change in the contents meant that a human obviously would not consider it to be the same file anymore (e.g. president-accommodations.doc)?

Systems like Subversion, SharePoint, etc, know what a file is... The file-name... If you wanted to rename the file, you'd use 'mv' type commands, and it's Subversion's job to keep track of that.  If you wanted to replace a file, you'd delete it, then create a new one.  Easy, right?.....

But how do I communicate those 'verbs' across a federated universe of human actors?  What if the bureaucracy breaks down and lines of communication are lost?  If we had to restore our VCS from backup and annotate it back to life, was the history of those VERBs maintained?  If Joe 'moves' shopping-list.txt to 'joe-shopping-list.txt' and Bob overwrites it with his own needs, what happens?
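Git sidesteps the verb problem entirely: a rename is never recorded as a verb at all; it is inferred after the fact from content similarity.  A sketch (file names hypothetical):

    git mv shopping-list.txt joe-shopping-list.txt   # recorded as a delete + add of the same blob
    git commit -m "joe claims the list"
    git log --follow joe-shopping-list.txt           # the rename is inferred, not stored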

The issue is that while you CAN get clever, you're going to go wrong eventually.. And with federation in particular, that wrong is going to be the norm.   Consider the following:

Husband: Honey, I need eggs.
Wife: OK, I'll make a note.  *edits android app to include eggs in shopping list; publishes to cloud*
Husband: {thinks} Hmm, she's going to forget; let me edit my shopping list to add it.  *publishes to cloud*
Wife: Honey, while I'm shopping can you publish your list of things you needed?  I already have eggs.
Husband: *adds unrelated shopping list fragments to the file and publishes*
Wife: *updates local shopping list*


Now what's the expectation when the wife reconciles?  There are LOTS of changes, but more importantly, the same change was made twice: 'eggs'.  What is the logical human expectation?

Well, one could argue that they should be alerted to the fact that there is a conflict.  One could argue that any modification to the list that needs reconciliation should alert the user.  But in this case, there is no conflict... Someone didn't say "I need eggs" and someone else said "Eggs are bought"... They're both saying "I need eggs".
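Notably, plain git already behaves this way: when both sides make the identical change, a three-way merge resolves it without a conflict.  A sketch with hypothetical branch names (assumes shopping-list.txt is already tracked on master):

    git checkout -b wife master
    echo eggs >> shopping-list.txt
    git commit -am "need eggs"
    git checkout -b husband master
    echo eggs >> shopping-list.txt
    git commit -am "need eggs too"
    git checkout wife
    git merge husband    # identical additions on both sides merge cleanly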

The point is not that git happens to solve this particular problem better than any of the other SCMs (except perhaps real-time systems like google-docs; which are not always practical, and solve a different problem entirely), but that there are potentially DIFFERENT ways that this reconciliation can be applied.

Git's strength is that it doesn't enforce versioning patterns on the user.  It provides a default, which can sometimes be confusing.  Not only can you reconcile in different ways for different situations; YOU can CHANGE the history of a file.  An administrator is not needed.  And if someone doesn't like that history, they can apply an alternate history in a different namespace (e.g. repository).
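For instance, rewriting your own history is a stock operation; no administrator is involved (a sketch; only rewrite history you haven't yet published):

    git commit --amend     # re-word or re-stage the most recent commit
    git rebase -i HEAD~3   # reorder, squash, or drop the last three commits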

The KEY is that the ONLY thing that mattered was the contents of the file.  If, at the end of the day, the wife is able to see what she needs to buy, then how it got there (its history and merge process) is irrelevant.

This isn't always a true statement.  Government audit records are not histories to be trifled with.  A single canonical representation is critical.  Even customer-presented feature-change-lists should be immutable - they want to see version 1.2.3 followed by 1.3.5 followed by 2.0.9.  If that history got rearranged in the change-list of the next release, they'd be unhappy... And that's what canonical lists are for: lists that are intended to never have their histories changed, and where bureaucracy is mandatory for submission of changes.

BUT, this is only the final publication.. There are intermediate publications that go on all the time in life.. We chat verbally with our peers.. We exchange emails.. We email files.. We ftp/share files.  All outside of the canonical publication.. The only real question is whether a publication system can track those histories as well - since tracking things brings order to the chaos.

Thus, I'd argue that Git is a file representation that logically exists in a multi-verse of potential states-of-being, with different pasts and different futures.  And I'd apply Occam's Razor: if two descriptions of a system produce identical measurable outcomes, then the simpler of the two is more likely correct.  At the end of the day, you get the same published PDF, word-doc, HTML-file, source-code.  The challenge is in the process that gets from thoughts to final publication.

Basic Git 'things'

  • Directory tree of objects
    • Every object goes in a file named by its SHA-1 (with the first 2 hex chars as a directory level)
    • Object files are zlib-compressed, with a small header giving the object's type and size
  • Compacted object bundles ('pack files')
    • 'Compacting' takes 99% of the free-form file-objects (above) and throws them into a single file
      • The file is SHA-1'd and its name is that of the SHA
    • A second 'index' file is a simple fixed DB which maps SHA IDs to offsets within the compacted bundle
    • This is the basis for all network transfers.
      • Any push operation first compacts the objects and transfers a delta against objects the remote is already known to have
  • ASCII text starting points
    • .git/refs/heads/master - file with 40 bytes; the hex representation of the SHA-1 of the head (tip) of the main-line tree
    • .git/refs/heads/feature1_branch - file with 40 hex bytes; representing the SHA-1 of the head of a branch (which might temporarily be the same as master)
    • .git/refs/tags/release_1_0_0 - file with 40 hex bytes; representing the SHA-1 of an arbitrary local tag
    • .git/refs/remotes/bobs_computer/master - file with 40 hex bytes; the starting point pulled in from bob's computer, representing HIS current head.  Note, YOU might already have had this object, and thus rsync didn't copy it.. YOU might even have created the same set of files, and it happened to match his representation (independently authored)
    • .git/HEAD - ascii file with the name of the current 'head', e.g. 'master' or 'feature1_branch' or 'release_1_0_0' or 'remotes/bobs_computer/master'
    • .git/config - simple 10+ line win.ini type config-file (arguably should have been XML).  Denotes remote URLs and any special settings.
While not as elegant as it could be, it's idiot-simple and easy for 3rd party tools (like the ruby-based brew) to extend, as you can see below.
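You can poke at all of these starting points in any clone (a sketch; newer git versions may consolidate refs into a packed '.git/packed-refs' file, so the individual ref files aren't always present):

    cat .git/HEAD                            # e.g. 'ref: refs/heads/master'
    cat .git/refs/heads/master               # 40 hex chars: the head commit's SHA-1
    git cat-file -p $(git rev-parse HEAD)    # pretty-print that commit object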

What's the point of github?

So, given that every work-space is a full backup.
Given that anybody can pull from anyone else's repository directly (drop-box, google-drive, sky-drive, SMB, http).
Given that you can email patches directly from git commands "git format-patch -M origin/master -o outgoing/; git send-email outgoing/*".
Given that there are thousands of collaboration tools.

What's the point of github?

Github represents a popular canonical name-space for a multitude of projects.  Like an always-on closet server (fronting your transiently-online laptop), it represents a universally accessible collaboration point.

Most any git project will be on github - EVEN IF it's already on Atlassian's 'Stash'... Why?  Because people that collaborate on open-source already have a github account, and have local tools which expedite working on multiple platforms (windows, linux, mac).  They have passwords memorized.  They understand the work-flow because they've contributed to a project or two.  It isn't necessarily the be-all-end-all... But they're familiar with it.

The same COULD have been said about code.google.com or sourceforge.net, had they been 'git' based.  github was just first and sufficient (and pretty darn pretty).

github also provides a useful online editing tool, so you can edit from any 'chrome-book' or possibly even an android device (obviously only micro-edits on such devices).

github exemplifies the social-hierarchy and culture of:
  • Give credit to original authors
  • While you can't wrest ownership, you can 'fork'
    • Someone else forking doesn't mean they hate you; it's the ONLY way for them to contribute because they don't have write-access (so you're less likely to take offense)
  • You 'respectfully submit' a 'pull request' to the owning author from your fork's change-list.
  • Authors can accept with a single button-click (visually seeing the diff), reject, or request amendments.
    • This fosters collaboration, enforces consistent formatting rules, and reduces the bureaucratic review process while FORCING some minimal review
  • Contributors in no way need the original author to comply or accept.. If the author rejects or delays, you still have a completely valid ALTERNATE canonical representation of the original project.
  • Due to alternate version-histories, pull requests are required to be trivial-merges... (see the sketch after this list)
    • This promotes a linear, simple canonical version-tree (like subversion)
    • This avoids conflict resolution (allows reliable 1-button UI merges)
    • This forces the submitter to continuously vet their changes against the master/trunk (least surprise)
    • This parallelizes the work.. The project owner is not constantly pestered with maintenance tasks.. He can delegate work to peers, and each is naturally responsible for keeping their merge-history current with the central fork (otherwise no clean merge is possible).
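The contributor's side of that loop looks roughly like this ('upstream' is the conventional name for the original project's remote; the URLs are hypothetical):

    git clone https://github.com/you/project.git
    git remote add upstream https://github.com/owner/project.git
    git fetch upstream
    git rebase upstream/master   # keep your change a trivial, conflict-free merge
    git push origin master       # then open the pull request (force-push if you'd already published)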

What is Brew?


  • To me, brew represents a replacement for 'apt' and 'yum' and 'rpm'.  It is a package manager which gives a linuxy feel to the mac.
  • It owns the '/usr/local' directory space (though technically it could go anywhere).
  • It gives non-root ownership of said repository (something rpm doesn't)
  • It makes '/usr/local' a git-repository.
    • The equivalent of updating the apt-cache or yum-listings is simply 'git pull'
  • All libraries are downloaded and locally compiled
    • Minimizes dependence on central compiler-farms (though I do miss apt/rpm pre-compilation)
  • All libraries are stored in localized directory structures (most libraries support this, but package managers hate it for some reason)
    • /usr/local/Cellar/ffmpeg/1.1/  bin , lib , share , etc 
    • Can install multiple simultaneous versions of same library
  • All libraries are sym-linked to central location
    • /usr/local/bin/ffmpeg -> ../Cellar/ffmpeg/1.1/bin/ffmpeg
    • /usr/local/lib/libavcodec.54.86.100.dylib -> ../Cellar/ffmpeg/1.1/lib/libavcodec.54.86.100.dylib
    • This allows you to swap out a library AND trace back every artifact to its owning project and version (slightly more reliably than rpm/apt) - see the sketch after this list
  • A single /usr/local/bin/brew ruby-script
  • The brew hosting is on github.
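In practice (as of this writing), the whole model is visible from the shell:

    brew update                   # under the hood: a 'git pull' in brew's repository
    brew install ffmpeg           # download and compile into /usr/local/Cellar/ffmpeg/<version>
    ls -l /usr/local/bin/ffmpeg   # a symlink pointing back into the Cellar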
So the exciting thing for me is to see a community-driven trivial-to-extend database of source-code-projects.

Conclusion

There is nothing special about git, github, or brew... All are somewhat hackish toys... Initiatives whose authors were at the right place at the right time.  But they, for me, represent a trend that I hope continues... Open-Collaboration, Open-Contribution, Open-Attribution.  And from a science perspective, ever more elegant models of federation, data-distribution, and de-centralization (or at least limited dependence on central choke points).

In our 'cloud' era, this is something we are starting to forget... We cite google, apple, android, iOS as enabling tool providers... But they're really monolithic appliances (like a car).  They're monumental achievements, to be sure.  But ultimately they will last only so long as some venture capitalist (or centrally planned government agency) is willing to subsidize the cost (as most such services are not intrinsically/directly funded).  The cloud-model (including app-stores and rpm-repositories) is very main-frame era in nature.  Even with a distributed cloud, you are still reliant on a single vendor, a single chief administrator, a single attack-point for a virus.

Distribution, on the other hand, exemplifies a cell-network.. Something which is resilient to faults.. Resilient to adverse interests.  It is something that can scale, because you have both big-iron vendors with their fully engineered (and funded) projects, sitting next to a transient laptop (providing content 'seeds' to trusted peers).

