Saturday, August 8, 2009

Alternate mem-store for ehcache

Architecture
Ehcache is a 4-tier architecture:
1) A read-friendly concurrent hash-map (writes are semi-single-threaded: 16-way segmented, with one writer per segment)
2) A configurable expiration architecture which allows semi-random selection for eviction (using policies to reduce randomness)
3) An optional disk-store, used to serialize values to disk, along with a periodic expiration detector and buffered/backgrounded serialization
4) An event notification strategy which supports optional cluster replication (delete-on-update or replicate-changed-data)
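For orientation, here is roughly how those tiers surface in the classic net.sf.ehcache API (a minimal sketch from memory of the 1.x-era constructors; treat the exact signatures as illustrative):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EhcacheTiersDemo {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.create();
        // name, maxElementsInMemory, overflowToDisk (tier 3), eternal, ttl, tti (tier 2)
        Cache cache = new Cache("demo", 10000, true, false, 300, 60);
        manager.addCache(cache);

        cache.put(new Element("key", "value")); // lands in the in-memory store (tier 1)
        Element hit = cache.get("key");         // may fault back in from the disk-store
        System.out.println(hit == null ? null : hit.getObjectValue());
        manager.shutdown();
    }
}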

Makeup
The data-structures involved for the first 2 tiers are:

The key-value pair is stored in:
Element:
K key
V value
long date_created
long date_accessed
int num_accesses

This is then stored in a java.util.concurrent.ConcurrentHashMap, which contains:

Map<K,Element>
Map.Entry[] table

Map.Entry<K,Element>
K key
Element value
int hash
Map.Entry next

The expiration-policy then maintains:
Object[] keyTestArray
AtomicInteger nextKeyTestIndex

ConcurrentHashMap
The CHM (ConcurrentHashMap) is the standard java-distribution implementation. Its strength is fast, near constant-time searching (almost as fast as a raw HashMap's, lagging mostly due to one extra indirection and hash-computation). We're talking millions of lookups per second on half-meg-sized heaps.

The concurrency is produced via two techniques.

First, by splitting the hash into two address-spaces. For a 32-bit hash (the only type java supports via obj.hashCode()):
0xSX XX XX XH

Here S is the 4-bit segment-address and H is the 4-bit hash-address. As the map grows, H expands one bit at a time towards the left, while S stays static - the number of segment bits is set at constructor time (defaulting to 4, i.e. 16 possible segments). Each segment is 100% isolated from the others. They grow independently, and most importantly, they lock independently.

Meaning each segment can have an unlimited number of readers and exactly 1 writer.

The second feature is that readers are independent of writers. All write operations are required to maintain a 'snapshot' of the read-data. This is accomplished by using 'root' changes: all changes happen in temporary memory, and the last operation is an atomic swap of the root pointer. This is the same for insert/delete and, more interestingly, for resize/growth. The growth path in particular has several optimizations to avoid duplicating the entire pointer mapping, probably by as much as 90% (given the high probability of 0/1-sized buckets).
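A rough sketch of the two-level addressing, assuming 16 segments and power-of-two tables (names hypothetical; the real JDK code also re-spreads the hash before using it):

class SegmentAddressing {
    static final int SEGMENT_SHIFT = 28; // top 4 bits -> 16 segments (the 'S' nibble)
    static final int SEGMENT_MASK = 0x0F;

    static int segmentFor(int hash) {
        return (hash >>> SEGMENT_SHIFT) & SEGMENT_MASK;
    }

    static int bucketFor(int hash, int tableLength) {
        // tableLength is a power of two, so this keeps the low 'H' bits
        return hash & (tableLength - 1);
    }
}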

Memory/GC overhead
The allocative efficiency of CHM is about as good as one is used to from java best-practices.

An object consumes about 4..16 bytes of overhead. At a minimum, the class-pointer (4B), which theoretically can contain all other relevant info. But often there is other meta-data stored with the object for efficiency (like the redundant hash-value): object-size (3..8B) (since rarely does an object reserve an exact amount of memory, this sometimes includes the slack-space which isn't present in the class-data), and flags (1..4B). At a minimum, every object has to have a garbage-collector status (copied, in-use, etc). Every java object can also optionally be synchronized on, and the JVM can either reserve 4B in the object header for this or use an external hash-map; it's implementation dependent.
Finally, you have the field-data (the whole reason for having an object-instance). Every field takes a fixed amount of memory: byte=1B, char/short=2B, int/ptr/float=4B, long/double=8B. booleans can take either 1B each or 1B per 1..8-bool group, depending on the implementation.
Note that 64-bit JVMs MAY use 8B per pointer. Technically, since java is an abstraction, it could implement a pointer as an offset plus the ref-value, and thus 32 bits are still feasible. If -Xmx4G or less is used, this may be in place.

Thus for the simple class
Foo { int x, y, z; }
Foo foo = new Foo();
foo will likely consume between 20B (8B overhead) and 28B (16B overhead)

whereas:

Bar { Foo foo; }
Bar bar = new Bar();
bar will likely consume between 12B (8B overhead) and 24B (16B overhead and 64bit ptrs)

We'll assume 12B of object overhead for the rest of this document.

Integer objects are similar to our bar, and we'll assume 14B
String objects contain the following:
java.lang.String:
int hash
int offset
int size
char[] data;

For the string "tester", a typical 6-char string, we have:
12B (obj-overhead), 3x4=12B (ints), 4B (ptr), and 2x6=12B (6 chars) = 40B

Note, you can call
String s = "test".intern();
which produces a global hash of strings, so having 500 duplicate strings only really consumes one 36B chunk. BUT, this is obviously a potential memory leak. So while your public static void main() could do this, you don't want your servlet-engine, etc doing it. It is not defined whether the interned strings are ever garbage collected, so it's probably not a good idea for third-party libraries to ever use it.

So given the above assumptions (which certainly change from JVM to JVM, or even with different JVM startup-options), we'll assume 40B for an average string.
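Encoded as a quick helper, the estimate under this document's assumptions looks like this (real JVMs differ, and the char[]'s own header is ignored, as above):

class StringFootprint {
    static int estimatedBytes(String s) {
        int header = 12;            // assumed String object overhead
        int fields = 3 * 4 + 4;     // hash, offset, size ints + char[] pointer
        int chars = 2 * s.length(); // UTF-16 payload
        return header + fields + chars; // "tester" -> 40
    }
}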

Hash Example
So for 1000 entry hash-map of Strings to Strings, we have:

4x2048 = 8,192B (table pointers, with 0.75 load factor, rounded up to the nearest power of 2 - here 11 bits)
12x1000 = 12,000B (object overheads, not including containing and helper objects)
16x1000 = 16,000B (4 4-byte fields in each Entry)
2x40x1000 = 80,000B (key and value of 40B strings)
total ~ 116K memory

But while this represents the minimum allocation, a full garbage collection has to take the work-load into account: we will have to physically copy all of the 116K at least once, and for compaction/defragmentation we may have to copy it again, possibly several times. But EVERY time we do a full garbage collection (which here, for train collectors, I mean once it's traversed all cars), we'll have to recursively traverse every pointer. So:

Map: 1
Segments[]: 1 + 16 * ptrCnt(Segment)
Segment:
  Map: 1 (non-static parent ptr)
  Map.Entry[]: 1 + 2048/16
Map.Entry: 1000 * ptrCnt(Map.Entry)
  K,V,next: 3 + ptrCnt(K) + ptrCnt(V)
String:
  char[] data: 1

So for our Map<String,String>, we have:
1 + 1 + 16 * (1 + 1 + 128) + 1000 * (3 + 1 + 1) ~ 7k references

So we have a ~7:1 overhead of garbage-collection pointer-traversals to the entries we actually want to store. And that's just for simple strings/integers. Caches tend to store highly complex data-models with potentially thousands of objects.

~7:1 in object-references and roughly 10:1 in memory consumption, for what seemed like just a pair of 6-char strings.


This is most evident if we serialize the data. Doing so merely stores key/value pairs as co-serialized objects. This is VERY compact - no pointers, no multi-byte chars. We can get something close to a 16:1 reduction versus the in-memory footprint (object headers are still stored).
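To check this for a concrete entry, serialized size is easy to measure directly (a small helper, nothing ehcache-specific):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializedSize {
    static int sizeOf(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(obj); // key/value pairs would be written back-to-back
        out.close();
        return bytes.size();
    }
}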

Cache expiration Models

The thing that differentiates a Cache from a Map is that there is an implicit 'loss of data'. Namely, we set a max-capacity on the map, and once that capacity is reached, we identify a target for expiration. Expiring may or may not emit external notifications (which allow sub-systems to compensate or reconcile cluster-coherency).

The two main reasons you'd want to expire data are:
A) data gets stale
B) memory would get too full

The first reason is application-specific, the second is system-limitation-specific. If you 'cached' the domain names of every IP that hit or might hit your web-site, you'd need gigabytes of RAM. Not necessarily the best use of this expensive and limited resource.

CPU Direct Mapped
The simplest is the CPU-cache-style direct-mapped cache. Hash-codes that collide on a slot automatically expire the pre-existing record on put.

0xAA AA AA HH
For a cache of size 256, we have 24 'address' bits and 8 hash-bits.

Map
Entry[HH]
Entry:
int hash
Object data

Here, the hash gives us the 'slot', and if the address matches an existing entry we have a hit; otherwise we have a miss. Unlike traditional hash-maps, which guarantee that conflicts don't hurt us (not even duplicate hash-codes), direct-maps have an implicit and FAST expiration policy. It relies exclusively on probabilities.
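A minimal sketch of such a direct-mapped cache (class and field names hypothetical), storing the full key so a hit can be verified:

class DirectMappedCache<K, V> {
    private static final int SIZE = 256; // 8 'H' bits
    private final Object[] keys = new Object[SIZE];
    private final Object[] values = new Object[SIZE];

    public void put(K key, V value) {
        int slot = key.hashCode() & (SIZE - 1);
        keys[slot] = key; // implicit, FAST expiration of whatever was here
        values[slot] = value;
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        int slot = key.hashCode() & (SIZE - 1);
        Object k = keys[slot];
        return (k != null && k.equals(key)) ? (V) values[slot] : null; // hit or miss
    }
}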

CPU Associatively Mapped
Here we use set-associative mapping: each bucket has a finite number of slots, say 8 (an 8-way associative cache).

Map
Entry[HH][AA] - AA is the associativity, such as 8
Entry:
int hash
Object data

Here we must check all 8 sub-array items for a match. An LRU may or may not be applied to determine which to expire when putting a new element where 8 already exist. Otherwise we might just use a globally rotating integer (say the next expiration will be on slot 2, then slot 3, no matter which bucket we are expiring from).

In the direct implementation, this is very memory wasteful, because not all buckets will have 8 collisions: some will have thousands of collisions and some none. You could alternatively use a linked-list:

Map
Entry[HH]
Entry:
int hash
Object data
Entry next;

So now we have re-invented the HashMap, but we 'delete' when lists reach size 8. Note, this has an implicit FIFO, since new items are added to the head of the list. If we were true to the above configuration, we would almost never store the full max-contents, because we couldn't FIND 8 items for every hash. An alternative might be to allow a 'soft' 16-entry depth while the global max hasn't been reached, then once it has, begin truncation.

The main negative is that we like our hash-tables large enough that 99% of buckets are only 1 level deep; here we're trading that efficiency away. Essentially we're running with a load factor of 8 (instead of the normal 0.75), such that for a 1,000-element max Map, we have a hashtable of 128 entries. Statistically speaking, we're thus roughly 10x slower.

Another issue is that expiration is always a function of the input item. If the most desired items happen to have very similar hash codes, then we've limited how many of those we can hold, and further, we reward undesired items that have unique hash codes. Basically, we are dependent on probabilities.

LRU access linked-list
java.util.LinkedHashMap (LHM) maintains an ordering over its entries. This ordering can be used to identify the oldest or least-recently-used items.

Map.Entry<K,V>
K key
V val
int hash
Map.Entry<K,V> next;
Map.Entry<K,V> first, last;

The first/last pair can either be set only at insertion time (producing a FIFO ordering), or every map.get(key) can unlink and re-insert the entry, making it the head of the list. This facilitates an LRU ordering, which is often ideal for selecting a 'best candidate' for expiration.
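In stock java this is nearly a one-liner: construct the LinkedHashMap in access-order mode and override removeEldestEntry (this is the real JDK API):

import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = access-order (LRU); false = insertion-order (FIFO)
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the least-recently-used entry on put
    }
}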

While CHM could support FIFO, it doesn't currently. But more importantly, it CAN'T support LRU, because that would require mutating operations inside get.


Random sampling
Another approach is to randomly pick say 32 entries, and compare some metric to find a 'best' candidate.
This requires either meta-data on the Map.Entry or for the value to be a wrapper with meta-data.

Element<K,V>
K key
V value
long date_last_put
long date_last_get
int num_gets
Map<K,Element>

Object[] randomData

Here, we can randomly pick items from randomData, identify a target, then delete it from the map. The size of randomData may be of concern; perhaps copy-on-growth, or just pre-allocate the max-size.
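A sketch of the sampling pass itself (32 probes, LRU-style metric; Element here matches the pseudo-record above, with hypothetical field names):

import java.util.Random;

class RandomSampler {
    static class Element {
        Object key, value;
        long date_last_get;
    }

    static Element pickEvictionCandidate(Element[] randomData, Random rnd) {
        Element best = null;
        for (int i = 0; i < 32; i++) {
            Element e = randomData[rnd.nextInt(randomData.length)];
            if (e == null) continue; // freed/empty slot
            if (best == null || e.date_last_get < best.date_last_get) {
                best = e; // oldest access wins
            }
        }
        return best; // may be null if the sample only hit empty slots
    }
}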

The random data can contain either the keys or the Elements themselves. If they are Elements, then we have to be careful to null them out when deleted or expired, so we don't produce potential memory leaks (when elements contain very large data). For this, the Element needs to carry its own slot index:

Element<K,V>
K key
V data
int key_idx
long date_last_put
long date_last_get
int num_gets

Here, adding an Element to the randomData array assigns it its index; thus on expiration, we null out that slot as well.

The problem that remains is how to allocate a slot. For slots freed on expiration, we can maintain a linked free-list:

Element[] randomData
int[] nextFreeSlot
int nextFreeIdx
int topFreeSlot;

Now, on startup, we increment topFreeSlot until the max-size is reached; by then, we can't allocate a new slot until an old one has been freed. The size of the two arrays must correspond to the (maximum) number of elements. On each free, nextFreeIdx points to the most recently freed index, and nextFreeSlot at that index points to the old nextFreeIdx. Thus a linked list is produced.
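A sketch of that allocator (hypothetical names; -1 marks an empty free-list):

class SlotAllocator {
    final Object[] randomData;
    final int[] nextFreeSlot;
    int nextFreeIdx = -1; // head of the free-list
    int topFreeSlot = 0;  // high-water mark

    SlotAllocator(int maxSize) {
        randomData = new Object[maxSize];
        nextFreeSlot = new int[maxSize];
    }

    int allocate() {
        if (topFreeSlot < randomData.length) return topFreeSlot++; // untouched slot
        if (nextFreeIdx < 0) return -1; // full: caller must expire something first
        int slot = nextFreeIdx;
        nextFreeIdx = nextFreeSlot[slot]; // pop the free-list
        return slot;
    }

    void free(int slot) {
        randomData[slot] = null; // avoid leaking a large Element
        nextFreeSlot[slot] = nextFreeIdx;
        nextFreeIdx = slot; // push onto the free-list
    }
}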

To save memory, we could instead use:

Object[] randomElOrNextFree
int nextFreeIdx
int topFreeSlot

Then the Object either represents an Element or an Integer object which contains the next free idx. This only really saves memory when we're mostly full all the time and have very large arrays.

For very large maxes, a grow-on-full model should be used.

By using random sampling, we have a low probability of finding the BEST candidate (meaning the least-frequently-used, the least-recently-used, or even the oldest item), but we definitely remove any biases or group-thrashing.

Embeded Random Sampling
If, like LHM, our CHM embedded a few more parameters, then we could avoid the parallel randomData array. We could also do away with the extra Element record (which affects overall mem-size and pointer counts for VERY large caches).

Obviously we could put the dates and counts together, but there is a limitation with random access: namely, neither CHM nor LHM has any sort of random access, as they're navigating linked-lists.

But if we DID have a flat array of all the Map.Entry records, then we could accomplish this feat. More to come...

C-style Records
Many programming languages like C, Basic and even C# support the concept of ordered records. Essentially a rectangular memory-layout of fixed-sized records.

struct _R {
    int x, y, z;
};
typedef struct _R R;
R data[1000];

here sizeof(R) is exactly 12B, and thus sizeof(data) is EXACTLY 12000B. If you want data[idx].y, that's compiled into the following CPU-assembly pseudo-code:
rIdx = idx            // pre-determined
rData = data          // load the 32-bit address of the base pointer
rTmp = rIdx * 12      // calculate the record-offset
rAddr = rData + rTmp  // calculate the record's physical start address
rY = [rAddr + 4]      // load the value of 'y'

You can't actually do this in java: what you have is an Object-array that IS statically mapped, but it contains pointers to dynamically located object-data. Consequently, in an emulated record-array it's not possible to 'null out' an entry; you can try to assign zeros to all its values, but the 'null' concept doesn't apply.

However, there's nothing that says that you can't EMULATE this in java. For our trivial R record, we can just allocate a large array of symmetric data-types and perform the address computation manually.

int[] recordData;
final int RECORD_SIZE = 3; // x, y, z ints

int y = recordData[idx * RECORD_SIZE + 1];

If you have mismatched data-types, then you'll need to break them out into parallel arrays:

struct Entry {
    int hash, next_ptr, first_ptr, last_ptr, num_hits;
    long date_last_get, date_last_put;
    Object key, value;
};

int[] entry_ints;
final int ENTRY_INTS_SIZE = 5;
long[] entry_longs;
final int ENTRY_LONGS_SIZE = 2;
Object[] entry_objs;
final int ENTRY_OBJS_SIZE = 2;

Now, if we want entry[idx].date_last_put, we'd run:
long date_last_put = entry_longs[ENTRY_LONGS_SIZE * idx + 1];

Let's consider the memory and ref-count overhead:


Map:
int[] entry_ints
final int I_S
long[] entry_longs
final int L_S
Object[] entry_objs
final int O_S
int next_free_entry_id
int top_entry_id
int[] table;

Here we have 4x12=48B (obj-overhead) + 7x4=28B (map-fields) + 1000x4x5=20KB (entry_ints) + 1000x8x2=16KB (entry_longs) + 1000x4x2=8KB (key/val obj-refs) + 2048x4=8KB (table) =~ 52KB memory overhead

When we add in the 40B string-pairs (80K), we're at ~132KB total memory.
This is roughly comparable to our existing CHM, and probably slightly better than CHM<String,Element<String,String>> (which we didn't model out). The point being: it's about the same for mem-size.

BUT, the ONLY GC-able references are the keys and values themselves. Here that's 2K references - a near 1:1 ratio to the entries.

Incidentally:
For FIFO, we can use the embedded first/last fields to produce true FIFO.
For LRU/LFU, we can use 32-random-element selection on the direct element-map.

Data-type compaction
Once you've gone down the path of squeezing your own data-types into compacted C-structures, why stop there? Most keys and values are java primitives. A pointer to a string of size 2 takes 4 bytes for the data and 4 bytes for the pointer. Why not just store the data itself INSIDE the pointer?

0xFD DD DD DD

Where F (4 bits) is the data-type, and the remaining bits are either the data itself or an index into an appropriate array (of objects, large ints/longs/floats/doubles/strings).

So for strings of 3 chars whose Unicode mappings are at or below \u00FF, we can use 1-byte mappings.

For strings of 4 chars whose Unicode mappings are at or below \u007F, we can use 7-bit mappings.

Note, for strings of size 0,1,2,3 we'd need either different flags, or use of the second 4-bit field:
0xFS DD DD DD
Where S is the size.

Likewise, for many other field-types we can shrink the core data down to 24 bits, leaving the 28-bit data-blocks for those that truly need them.

In this situation a 6-char string would still need its 40B of overhead with 2 pointers, but most 0..4-char strings would take effectively ZERO extra bytes and, more interestingly, zero pointers.

Useful types are:
NULL
BYTE
SHORT
SMALL_INT
NEG_SMALL_INT
LARGE_INT
SMALL_LONG
NEG_SMALL_LONG
LARGE_LONG
DOUBLE
OBJECT
STR7_4
STR8_0_3
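As a sketch of how one of these encodings might work, here is STR8_0_3: a 4-bit type tag, a 4-bit size, and up to three chars at or below \u00FF packed into a single int (the tag value is hypothetical):

class PackedStr {
    static final int TYPE_STR8_0_3 = 0xC; // hypothetical value for the 'F' nibble

    static int encode(String s) {
        if (s.length() > 3) throw new IllegalArgumentException("use STR8_n instead");
        int packed = (TYPE_STR8_0_3 << 28) | (s.length() << 24);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) throw new IllegalArgumentException("needs a real String");
            packed |= c << (16 - 8 * i); // first char in bits 16..23, and so on
        }
        return packed;
    }

    static String decode(int packed) {
        int size = (packed >>> 24) & 0xF; // the 'S' nibble
        char[] chars = new char[size];
        for (int i = 0; i < size; i++) {
            chars[i] = (char) ((packed >>> (16 - 8 * i)) & 0xFF);
        }
        return new String(chars);
    }
}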

String optimizations
But we can expand this for strings even further. The JVM defines Strings as read-only black-boxes; there is source-code which suggests how the strings are stored, but JIT operations MAY make sub-optimizations, such as only storing 1 byte per char.

Consider our compacted Object above. If we knew that a string had only 7-bit chars, we COULD copy the char-array into a byte-array. If those byte-arrays were a contiguous chunk of memory with the mechanisms of a copying-collector (instead of free-lists), then we could save 50% of the memory and again have zero pointers.

So we could have
STR7_4
STR8_0_3
STR8_n
STR_n

where STR8_n is an index into another compacted data-structure:

int[] str_byte_fields // compacted structure of str-size, str-starting-address, str-hash (which is different than the map-hash)
byte[] str_byte_data


For keys this is ideal, because we avoid any pointers and save a small amount of memory.
For values, there is an extra cost: namely, that we regularly intend to retrieve the item as a String Object.

We can use

str = new String(str_byte_data, str_byte_fields[idx + 1], str_byte_fields[idx]);
or
str = new String(str_char_data, str_char_fields[idx + 1], str_char_fields[idx]);

The byte[] form even has a deprecated constructor variant (the old hibyte one) precisely because bytes can't represent arbitrary Unicode chars, BUT we've already asserted that we can only use this arrangement when all chars are at or below \u00FF.

Either way, there is overhead in copying the values into a traditional String. For excessively large strings this is probably not acceptable, especially if they contain any high-order chars.

When we know that the data will likely repeat with high frequency, we can produce a Set of JUST the strings, and have our generic map hold key/value indexes relative to this set.

Namely, if values are often one of a small subset (say stringified enumerations or stringified numbers), then defining our own version of String.intern() may be of use: all strings are cached in a unique-string set, made up of the above data-structure with the addition of one more hash-table and next-index.
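A sketch of the idea, using a plain HashMap instead of the compacted slab for brevity - the point is that, unlike String.intern(), the pool's lifetime is owned by the cache and can be dropped:

import java.util.HashMap;
import java.util.Map;

class StringPool {
    private final Map<String, String> pool = new HashMap<String, String>();

    String intern(String s) {
        String existing = pool.get(s);
        if (existing != null) return existing; // reuse the canonical copy
        pool.put(s, s);
        return s;
    }

    void clear() { pool.clear(); } // owner-controlled lifetime, no global leak
}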

Concurrency issues with compacted elements
In ConcurrentHashMap, read-concurrency is achieved by allocating new entries and allowing java garbage collection to reclaim the old entries once NO other thread still holds a reference - this includes Iterators: one pointing at node 2 of 5 will keep nodes 2,3,4,5 alive but allow node 1 to be GC'd safely.

When we perform data-compaction, the GC has no ability to reclaim sub-sections of our element-data, and, more importantly, compaction REQUIRES that a next-free-xxx be updated at some point in the future: the most recently freed item becomes the next allocated item. If a reader thread takes a VERY long time to complete (because of some stall), it might be in the middle of reading when we overwrite its contents. String rebuilding is the most likely case, since there are 6 non-atomic parts:
1) read hash of element - compare
2) read key-pointer of element - compare (might be a multi-step process)
3) read value-index to string
4) read size of string
5) read offset of string
6) get pointer to byte-data

Once we have the byte[] pointer, that won't change; a copy-on-write would let the old array be garbage-collected.

Other data-types have 3 non-atomic operations that aren't present in a traditional physical Entry object.

One possible approach to resolve this issue is a reader lock: all gets open a reader-lock which prevents the intended writer from mutating known contentious data. Note, we don't need a point where there are NEVER any readers; we just need to determine when all readers present at a single point in time have 'exited'. So we set a flag, and when the last reader out resets that flag, we know that no matter what else happens, nobody can still be pointing at the target freeable data.

We can utilize an AtomicInteger numReaders, where all get(key) operations increment on entry. Then on exit, if your decrement produced the value 0, you reset any 'pendingTransaction' flag (whether it was set or not).

So the read-overhead involves two atomic spin-loop operations. Meaning: pretty fast.
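A sketch of that last-reader-out gate (names hypothetical):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class ReaderGate {
    final AtomicInteger numReaders = new AtomicInteger();
    final AtomicBoolean pendingTransaction = new AtomicBoolean();

    void enterRead() {
        numReaders.incrementAndGet();
    }

    void exitRead() {
        if (numReaders.decrementAndGet() == 0) {
            pendingTransaction.set(false); // last reader out resets the flag
        }
    }
}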

All puts have the following constraints:

1) if, after mutating, the read-count is 0, we can just free the data directly
1b) else, enqueue the data onto a LinkedList and set the pendingTransaction flag
2) on ENTRY of a mutating operation, first check if the LinkedList shows pending data. If so AND pendingTransaction is false, first free any prior data.

Freeing data is:
A) identifying the category of data (ints, longs, doubles, objects, strings)
B) taking the index of that data and assigning to a co-record (or the record itself if possible) the current value of next_free_xxx
C) set next_free_xxx = freeingIndex

This means that indexes will remain 'floating' for potentially several puts. If a reader-thread were to block indefinitely inside get() (which shouldn't be possible directly in this algorithm, but an out-of-memory error during String rebuilding or a ThreadDeath exception could do it), then nobody could free those transient items again - an unlikely situation that would likely have broken other parts of the code worse.

For iterators, we would store the table-idx and list-depth, then restart from the root list-item and traverse to an ever-increasing depth (holding a read-lock for each hasNext/next operation), since an element-idx is in no way stable and we wouldn't want to retain a read-lock for the life of the iterator.

JSON Data and Nested-Maps: common allocators
An increasingly useful data-structure is JSON nested maps. You can consider XML data to be essentially the same, except that JSON allows exactly one 'element' per key; successively repeated elements use arrays, so the mapping isn't straightforward.

There are 5 data-types:
  1. null
  2. map
  3. array
  4. string
  5. number
JavaScript allows functions as well, but in terms of data-exchange those are a non-starter.

Many Java-based JSON translators use what is effectively a map of maps/lists or a set of maps/lists.

For large web-datasets, we may see thousands of 'records' stored in this nested-map structure - ultimately a far cry from our optimized C-style record-array.

As JSON transport becomes more prevalent, non-transient JSON storage will grow in desirability, if for no other reason than to reduce the impedance mismatch of converting java objects to/from JSON.

actionType = jsonmap.get("HEADER").get("action_type");
or
firstName = jsonmap.get("DATA").get("records").get(idx).get("first_name");

Currently java doesn't make this mixing and matching of Map/List/String/Integer easy. With specialized JSON data-trees, you could at least say:
jsonmap.getMap("DATA").getList("records").getMap(idx).getString("first_name");

Ideally, nullified data would return empty Map/List JSON wrappers so we'd never have to check for nullity (except for primitive int/string values).
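A sketch of such typed accessors (class and method names hypothetical): missing maps come back empty so chained navigation never NPEs, and only leaf values still need a null check:

import java.util.Collections;
import java.util.List;
import java.util.Map;

class JsonMap {
    private final Map<String, Object> data;

    JsonMap(Map<String, Object> data) { this.data = data; }

    @SuppressWarnings("unchecked")
    JsonMap getMap(String key) {
        Object v = data.get(key);
        return new JsonMap(v instanceof Map
                ? (Map<String, Object>) v
                : Collections.<String, Object>emptyMap());
    }

    @SuppressWarnings("unchecked")
    List<Object> getList(String key) {
        Object v = data.get(key);
        return v instanceof List ? (List<Object>) v : Collections.<Object>emptyList();
    }

    String getString(String key) {
        Object v = data.get(key);
        return v instanceof String ? (String) v : null; // leaf: may be null
    }
}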

Ignoring this niceness for a moment: the relevant aspect here is the overhead of having potentially tens of thousands of nested Maps.

While it's certainly possible to utilize the compacted map structure, and the overhead of a few tens-of-thousands of object-refs is nothing compared to what normal maps utilize, there is a lot of 'slack space': all our xxx[] arrays tend to double in size when they grow, so each map will carry between 5% and 50% slack-space. And we're multiplying that by tens of thousands of maps.

If, instead, we had a common int[]-slab from which to pull, we could minimize the slack-space.
There are two limitations with this:
A) allocation/freeing needs to be synchronized
B) the address space for our compacted data was limited (24 or 28 bits), so having 100k maps would push a typical 1k-item map's indexes near 23 bits, especially with consolidated strings that rarely compact.

Use Cases
So with the above available techniques. What are some end-use-cases we can work towards?

  • Single-threaded
  • Multi-threaded
  • read-only
  • read-more-than-update (factor of 10+)
  • update-more-than-read (factor of 10+)
  • mixed reads/write
  • Mostly duplicate values
  • Mostly unique values
  • Small sizes (10K)
  • Medium sizes (100K)
  • Large sizes (10 million)
  • Massive sizes (1 billion+)
  • Time-local data (gets are near-in-time to the put)
  • hot-data (some data is never used, others are heavily used by say 100:1)
Particularly tough but useful case-situations include caching relational data (with millions of transient rows).

Ideally we have read-mostly data of massive tables in a time-local fashion.

LFU's are good for hot-data, LRU's are good for time-local data.
1M+ data-sets are good for compacted data-structures (otherwise GC-load becomes noticeable on what is mostly unused data).
For a Cache, we have to assume multi-threaded.
For write-mostly or mixed read-write, compacted data-structures are practical. But for read-mostly, compaction generates significant contention/overhead, since every get must pass the reader-gate.
Read-only data greatly desires compacted data because anything else is just 'wasted memory', especially with respect to duplicate strings. Truly read-only data could even use a transient StringSet, then throw the hashing data away after building.

Recommendation
It seems plausible that many of these modes of operation are inter-compatible.
  • Single-threaded (for writers) can utilize a single segment (this is what ConcurrentHashMap suggests)
  • Segments, and indeed Maps themselves, can utilize a common StringSet and OpaquePool, which facilitates string-reuse and efficient Nested-Maps
  • reader-optimized structures can be told to use Objects instead of StringSets
  • LRU, FIFO, LFU strategies can be swapped out on a per-map basis
  • Unlimited vs. limited maps are trivially differentiated
All of this can be configured in the constructor.
