[Q] Default SLAB allocator
Ezequiel Garcia <elezegarcia@...>
Hello,
While I've always thought SLUB was the default and recommended allocator,
I'm surprised to find that it's not always the case:

$ find arch/*/configs -name "*defconfig" | wc -l
452
$ grep -r "SLOB=y" arch/*/configs/ | wc -l
11
$ grep -r "SLAB=y" arch/*/configs/ | wc -l
245

This shows that, SLUB being the default, there are actually more defconfigs
that explicitly choose SLAB. I wonder...

* Is SLAB a proper choice, or is it just historical and never been re-evaluated?
* Does the average embedded guy know which allocator to choose and what
  the impact is on his platform?

Thanks,
Ezequiel
Andi Kleen <andi@...>
Ezequiel Garcia <elezegarcia@...> writes:
> Hello,
>
> While I've always thought SLUB was the default and recommended allocator,
> I'm surprised to find that it's not always the case:

iirc the main performance reasons for slab over slub have mostly
disappeared, so in theory slab could be finally deprecated now.

-Andi

--
ak@... -- Speaking for myself only
David Rientjes <rientjes@...>
On Thu, 11 Oct 2012, Andi Kleen wrote:
> > While I've always thought SLUB was the default and recommended allocator,
> > I'm surprised to find that it's not always the case:
>
> iirc the main performance reasons for slab over slub have mostly
> disappeared, so in theory slab could be finally deprecated now.

SLUB is a non-starter for us and incurs a >10% performance degradation in
netperf TCP_RR.
Andi Kleen <andi@...>
David Rientjes <rientjes@...> writes:
> On Thu, 11 Oct 2012, Andi Kleen wrote:
> > iirc the main performance reasons for slab over slub have mostly
> > disappeared, so in theory slab could be finally deprecated now.
>
> SLUB is a non-starter for us and incurs a >10% performance degradation in
> netperf TCP_RR.

When did you last test? Our regressions had disappeared a few kernels
ago.

-Andi

--
ak@... -- Speaking for myself only
Ezequiel Garcia <elezegarcia@...>
Hi,
On Thu, Oct 11, 2012 at 8:10 PM, Andi Kleen <andi@...> wrote:
> David Rientjes <rientjes@...> writes:
> > SLUB is a non-starter for us and incurs a >10% performance degradation in
> > netperf TCP_RR.

Where are you seeing that?

Notice that many defconfigs are for embedded devices, and many of them
say "use SLAB"; I wonder if that's right.

Is there any intention to replace SLAB by SLUB? In that case it could
make sense to change defconfigs, although it wouldn't be based on any
actual tests.

Ezequiel
David Rientjes <rientjes@...>
On Thu, 11 Oct 2012, Andi Kleen wrote:
> When did you last test? Our regressions had disappeared a few kernels
> ago.

This was in August when preparing for LinuxCon, I tested netperf TCP_RR on
two 64GB machines (one client, one server), four nodes each, with thread
counts in multiples of the number of cores. SLUB does a comparable job,
but once we have the number of threads equal to three times the number of
cores, it degrades almost linearly. I'll run it again next week and get
some numbers on 3.6.
David Rientjes <rientjes@...>
On Fri, 12 Oct 2012, Ezequiel Garcia wrote:
> > SLUB is a non-starter for us and incurs a >10% performance degradation in
> > netperf TCP_RR.
>
> Where are you seeing that?

In my benchmarking results.

> Notice that many defconfigs are for embedded devices, and many of them
> say "use SLAB"; I wonder if that's right.

If a device doesn't require the smallest memory footprint possible (SLOB)
then SLAB is the right choice when there's a limited amount of memory;
SLUB requires higher order pages for the best performance (on my desktop
system running with CONFIG_SLUB, over 50% of the slab caches default to
be high order).

> Is there any intention to replace SLAB by SLUB?

There may be an intent, but it'll be nacked as long as there's a
performance degradation.

> In that case it could make sense to change defconfigs, although it
> wouldn't be based on any actual tests.

Um, you can't just go changing defconfigs without doing some due diligence
in ensuring it won't be detrimental for those users.
Ezequiel Garcia <elezegarcia@...>
Hi David,
On Sat, Oct 13, 2012 at 6:54 AM, David Rientjes <rientjes@...> wrote:
> On Fri, 12 Oct 2012, Ezequiel Garcia wrote:
> > Where are you seeing that?
>
> In my benchmarking results.

But SLAB suffers from a lot more internal fragmentation than SLUB,
which I guess is a known fact. So memory-constrained devices would
waste more memory by using SLAB. I must admit I didn't look at page
order (but I will now).

> > Is there any intention to replace SLAB by SLUB?
>
> There may be an intent, but it'll be nacked as long as there's a
> performance degradation.

Yeah, it would be very interesting to compare SLABs on at least
some of those platforms.

Ezequiel
Eric Dumazet <eric.dumazet@...>
On Sat, 2012-10-13 at 02:51 -0700, David Rientjes wrote:
> This was in August when preparing for LinuxCon, I tested netperf TCP_RR on
> two 64GB machines (one client, one server), four nodes each, with thread
> counts in multiples of the number of cores.

In latest kernels, skb->head no longer uses kmalloc()/kfree(), so SLAB vs
SLUB is less of a concern for network loads.

In 3.7 (commit 69b08f62e17) we use fragments of order-3 pages to populate
skb->head.

SLUB was really bad in the common workload you describe (allocations done
by one cpu, freeing done by other cpus), because all kfree() hit the slow
path and cpus contend in __slab_free() in the loop guarded by
cmpxchg_double_slab().

SLAB has a cache for this, while SLUB directly hits the main "struct page"
to add the freed object to the freelist.

I played some months ago with adding a percpu associative cache to SLUB,
then just moved on to another strategy.

(The idea for this per cpu cache was to build a temporary free list of
objects to batch accesses to struct page)
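A stripped-down user-space model of that slow-path contention (illustrative
only, not the kernel code; remote_free() and page_freelist are invented
names): every freeing thread pushes its object onto a single shared
freelist head with a compare-and-swap retry loop, so they all bounce the
same cache line, much like cpus contending on page->freelist/counters in
__slab_free().

/*
 * User-space sketch: N threads "remote free" onto one shared freelist
 * head, retrying the CAS whenever another thread got there first.
 * Build with: cc -O2 -pthread remote_free_model.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NOBJS 1000000

struct object {
        struct object *next;
};

/* stand-in for one slab page's freelist head */
static _Atomic(struct object *) page_freelist;

static void remote_free(struct object *obj)
{
        struct object *old = atomic_load_explicit(&page_freelist,
                                                   memory_order_relaxed);
        do {
                obj->next = old;   /* redone each time another thread wins the race */
        } while (!atomic_compare_exchange_weak_explicit(&page_freelist, &old, obj,
                                                        memory_order_release,
                                                        memory_order_relaxed));
}

static void *freeing_thread(void *arg)
{
        struct object *objs = arg;
        int i;

        for (i = 0; i < NOBJS; i++)
                remote_free(&objs[i]);
        return NULL;
}

int main(void)
{
        enum { NTHREADS = 8 };          /* the "other cpus" doing the frees */
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++) {
                struct object *objs = calloc(NOBJS, sizeof(*objs));

                if (!objs)
                        return 1;
                pthread_create(&tid[i], NULL, freeing_thread, objs);
        }
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}

Profiling this shows the time going into the CAS retries on the shared
head, which is roughly the effect a per-cpu cache of freed objects (as in
SLAB's array caches) is meant to avoid.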
David Rientjes <rientjes@...>
On Sat, 13 Oct 2012, David Rientjes wrote:
> This was in August when preparing for LinuxCon, I tested netperf TCP_RR on
> two 64GB machines (one client, one server), four nodes each, with thread
> counts in multiples of the number of cores.

On 3.6, I tested CONFIG_SLAB (no CONFIG_DEBUG_SLAB) vs. CONFIG_SLUB and
CONFIG_SLUB_DEBUG (no CONFIG_SLUB_DEBUG_ON or CONFIG_SLUB_STATS), which are
the defconfigs for both allocators. Using netperf-2.4.5 and two machines
both with 16 cores (4 cores/node) and 32GB of memory each (one client, one
netserver), here are the results:

        threads         SLAB            SLUB
         16             115408          114477 (-0.8%)
         32             214664          209582 (-2.4%)
         48             297414          290552 (-2.3%)
         64             372207          360177 (-3.2%)
         80             435872          421674 (-3.3%)
         96             490927          472547 (-3.7%)
        112             543685          522593 (-3.9%)
        128             586026          564078 (-3.7%)
        144             630320          604681 (-4.1%)
        160             671953          639643 (-4.8%)

It seems that slub has improved because of the per-cpu partial lists, which
truly makes the "unqueued" allocator queued, by significantly increasing
the amount of memory that the allocator uses. However, the netperf
benchmark still regresses significantly and is still a non-starter for us.

This type of workload that really exhibits the problem with remote freeing
would suggest that the design of slub itself is the problem here.
David Rientjes <rientjes@...>
On Sat, 13 Oct 2012, Ezequiel Garcia wrote:
> But SLAB suffers from a lot more internal fragmentation than SLUB,
> which I guess is a known fact.

Even with slub's per-cpu partial lists?
JoonSoo Kim <js1304@...>
Hello, Eric.
2012/10/14 Eric Dumazet <eric.dumazet@...>:
> SLUB was really bad in the common workload you describe (allocations
> done by one cpu, freeing done by other cpus), because all kfree() hit
> the slow path and cpus contend in __slab_free().

Could you elaborate more on how 'netperf RR' makes the kernel do
"allocations done by one cpu, freeing done by other cpus", please?
I don't have enough background in the network subsystem, so I'm just
curious.

> I played some months ago with adding a percpu associative cache to SLUB,
> then just moved on to another strategy.

Is this implemented and submitted? If it is, could you tell me the link
for the patches?

Thanks!
Eric Dumazet <eric.dumazet@...>
On Tue, 2012-10-16 at 10:28 +0900, JoonSoo Kim wrote:
> Hello, Eric.
>
> Could you elaborate more on how 'netperf RR' makes the kernel do
> "allocations done by one cpu, freeing done by other cpus", please?

Common network load is to have one cpu A handling device interrupts, doing
the memory allocations to hold incoming frames, and queueing skbs to
various sockets. These sockets are read by other cpus (if cpu A is fully
used to service softirqs under high load), so the kfree() are done by
other cpus.

Each incoming frame uses one sk_buff, allocated from the skbuff_head_cache
kmem cache (256 bytes on x86_64).

# ls -l /sys/kernel/slab/skbuff_head_cache
lrwxrwxrwx 1 root root 0 oct. 16 08:50 /sys/kernel/slab/skbuff_head_cache -> :t-0000256
# cat /sys/kernel/slab/skbuff_head_cache/objs_per_slab
32

On a configuration with 24 cpus and one cpu servicing network, we may have
23 cpus doing the frees roughly at the same time, all competing in
__slab_free() on the same page. This increases if we increase slub page
order (as recommended by SLUB hackers).

To reproduce this kind of workload without a real NIC, we probably need
some test module, using one thread doing allocations, and other threads
doing the frees.

> > I played some months ago with adding a percpu associative cache to SLUB,
> > then just moved on to another strategy.
>
> Is this implemented and submitted?

It was implemented in February and not submitted at that time.

The following rebase probably has some issues with slab debug, but seems
to work.

 include/linux/slub_def.h |   22 ++++++
 mm/slub.c                |  127 +++++++++++++++++++++++++++++++------
 2 files changed, 131 insertions(+), 18 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index df448ad..9e5b91c 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -41,8 +41,30 @@ enum stat_item {
         CPU_PARTIAL_FREE,       /* Refill cpu partial on free */
         CPU_PARTIAL_NODE,       /* Refill cpu partial from node partial */
         CPU_PARTIAL_DRAIN,      /* Drain cpu partial to node partial */
+        FREE_CACHED,            /* free delayed in secondary freelist, cumulative counter */
+        FREE_CACHED_ITEMS,      /* items in victim cache */
         NR_SLUB_STAT_ITEMS };
 
+/**
+ * struct slub_cache_desc - victim cache descriptor
+ * @page: slab page
+ * @objects_head: head of freed objects list
+ * @objects_tail: tail of freed objects list
+ * @count: number of objects in list
+ *
+ * freed objects in slow path are managed into an associative cache,
+ * to reduce contention on @page->freelist
+ */
+struct slub_cache_desc {
+        struct page     *page;
+        void            **objects_head;
+        void            **objects_tail;
+        int             count;
+};
+
+#define NR_SLUB_PCPU_CACHE_SHIFT 6
+#define NR_SLUB_PCPU_CACHE (1 << NR_SLUB_PCPU_CACHE_SHIFT)
+
 struct kmem_cache_cpu {
         void **freelist;        /* Pointer to next available object */
         unsigned long tid;      /* Globally unique transaction id */
diff --git a/mm/slub.c b/mm/slub.c
index a0d6984..30a6d72 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -31,6 +31,7 @@
 #include <linux/fault-inject.h>
 #include <linux/stacktrace.h>
 #include <linux/prefetch.h>
+#include <linux/hash.h>
 
 #include <trace/events/kmem.h>
 
@@ -221,6 +222,14 @@ static inline void stat(const struct kmem_cache *s, enum stat_item si)
 #endif
 }
 
+static inline void stat_add(const struct kmem_cache *s, enum stat_item si,
+                            int cnt)
+{
+#ifdef CONFIG_SLUB_STATS
+        __this_cpu_add(s->cpu_slab->stat[si], cnt);
+#endif
+}
+
 /********************************************************************
  *                      Core slab cache functions
  *******************************************************************/
@@ -1993,6 +2002,8 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
         c->freelist = NULL;
 }
 
+static void victim_cache_flush(struct kmem_cache *s, int cpu);
+
 /*
  * Flush cpu slab.
  *
@@ -2006,6 +2017,7 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
                 if (c->page)
                         flush_slab(s, c);
 
+                victim_cache_flush(s, cpu);
                 unfreeze_partials(s);
         }
 }
@@ -2446,38 +2458,34 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
 #endif
 
 /*
- * Slow patch handling. This may still be called frequently since objects
+ * Slow path handling. This may still be called frequently since objects
  * have a longer lifetime than the cpu slabs in most processing loads.
  *
  * So we still attempt to reduce cache line usage. Just take the slab
- * lock and free the item. If there is no additional partial page
+ * lock and free the items. If there is no additional partial page
  * handling required then we can return immediately.
  */
-static void __slab_free(struct kmem_cache *s, struct page *page,
-                        void *x, unsigned long addr)
+static void slub_cache_flush(const struct slub_cache_desc *cache)
 {
         void *prior;
-        void **object = (void *)x;
         int was_frozen;
         int inuse;
         struct page new;
         unsigned long counters;
         struct kmem_cache_node *n = NULL;
-        unsigned long uninitialized_var(flags);
-
-        stat(s, FREE_SLOWPATH);
+        struct page *page = cache->page;
+        struct kmem_cache *s = page->slab;
 
-        if (kmem_cache_debug(s) &&
-                !(n = free_debug_processing(s, page, x, addr, &flags)))
-                return;
+        stat_add(s, FREE_CACHED, cache->count - 1);
+        stat_add(s, FREE_CACHED_ITEMS, -cache->count);
 
         do {
                 prior = page->freelist;
                 counters = page->counters;
-                set_freepointer(s, object, prior);
+                set_freepointer(s, cache->objects_tail, prior);
                 new.counters = counters;
                 was_frozen = new.frozen;
-                new.inuse--;
+                new.inuse -= cache->count;
                 if ((!new.inuse || !prior) && !was_frozen && !n) {
 
                         if (!kmem_cache_debug(s) && !prior)
@@ -2499,7 +2507,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
                                  * Otherwise the list_lock will synchronize with
                                  * other processors updating the list of slabs.
                                  */
-                                spin_lock_irqsave(&n->list_lock, flags);
+                                spin_lock(&n->list_lock);
                         }
                 }
@@ -2507,8 +2515,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 
         } while (!cmpxchg_double_slab(s, page,
                 prior, counters,
-                object, new.counters,
-                "__slab_free"));
+                cache->objects_head, new.counters,
+                "slab_free_objects"));
 
         if (likely(!n)) {
@@ -2549,7 +2557,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
                         stat(s, FREE_ADD_PARTIAL);
                 }
         }
-        spin_unlock_irqrestore(&n->list_lock, flags);
+        spin_unlock(&n->list_lock);
         return;
 
 slab_empty:
@@ -2563,11 +2571,90 @@ slab_empty:
                 /* Slab must be on the full list */
                 remove_full(s, page);
 
-        spin_unlock_irqrestore(&n->list_lock, flags);
+        spin_unlock(&n->list_lock);
         stat(s, FREE_SLAB);
         discard_slab(s, page);
 }
 
+DEFINE_PER_CPU_ALIGNED(struct slub_cache_desc, victim_cache[NR_SLUB_PCPU_CACHE]);
+
+static void victim_cache_flush(struct kmem_cache *s, int cpu)
+{
+        int i;
+        struct slub_cache_desc *cache = per_cpu(victim_cache, cpu);
+
+        for (i = 0; i < NR_SLUB_PCPU_CACHE; i++, cache++) {
+                if (cache->page && cache->page->slab == s) {
+                        slub_cache_flush(cache);
+                        cache->page = NULL;
+                }
+
+        }
+}
+
+static unsigned int slub_page_hash(const struct page *page)
+{
+        u32 val = hash32_ptr(page);
+
+        /* ID : add coloring, so that cpus dont flush a slab at same time ?
+         * val += raw_smp_processor_id();
+         */
+        return hash_32(val, NR_SLUB_PCPU_CACHE_SHIFT);
+}
+
+/*
+ * Instead of pushing individual objects into page freelist,
+ * dirtying page->freelist/counters for each object, we build percpu private
+ * lists of objects belonging to same slab.
+ */
+static void __slab_free(struct kmem_cache *s, struct page *page,
+                        void *x, unsigned long addr)
+{
+        void **object = (void *)x;
+        struct slub_cache_desc *cache;
+        unsigned int hash;
+        struct kmem_cache_node *n = NULL;
+        unsigned long flags;
+
+        stat(s, FREE_SLOWPATH);
+
+        if (kmem_cache_debug(s)) {
+                n = free_debug_processing(s, page, x, addr, &flags);
+                if (!n)
+                        return;
+                spin_unlock_irqrestore(&n->list_lock, flags);
+        }
+
+        hash = slub_page_hash(page);
+
+        local_irq_save(flags);
+
+        cache = __this_cpu_ptr(&victim_cache[hash]);
+        if (cache->page == page) {
+                /*
+                 * Nice, we have a private freelist for this page,
+                 * add this object in it. Since we are in slow path,
+                 * we add this 'hot' object at tail, to let a chance old
+                 * objects being evicted from our cache before another cpu
+                 * need them later. This also helps the final
+                 * slab_free_objects() call to access objects_tail
+                 * without a cache miss (object_tail being hot)
+                 */
+                set_freepointer(s, cache->objects_tail, object);
+                cache->count++;
+        } else {
+                if (likely(cache->page))
+                        slub_cache_flush(cache);
+
+                cache->page = page;
+                cache->objects_head = object;
+                cache->count = 1;
+        }
+        cache->objects_tail = object;
+        stat(s, FREE_CACHED_ITEMS);
+        local_irq_restore(flags);
+}
+
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
  * can perform fastpath freeing without additional function calls.
@@ -5084,6 +5171,8 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
 STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
 STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
 STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
+STAT_ATTR(FREE_CACHED, free_cached);
+STAT_ATTR(FREE_CACHED_ITEMS, free_cached_items);
 #endif
 
 static struct attribute *slab_attrs[] = {
@@ -5151,6 +5240,8 @@ static struct attribute *slab_attrs[] = {
         &cpu_partial_free_attr.attr,
         &cpu_partial_node_attr.attr,
         &cpu_partial_drain_attr.attr,
+        &free_cached_attr.attr,
+        &free_cached_items_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
         &failslab_attr.attr,
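The test module Eric mentions could look roughly like the sketch below
(hypothetical and untested; the remote_free_test cache, the rf_alloc and
rf_free thread names and the 256-byte object size are made up, and error
handling and CPU pinning are omitted): one kthread allocates from a
dedicated kmem_cache and queues the objects, while several other kthreads
dequeue and free them, so every kmem_cache_free() is remote and takes the
__slab_free() slow path.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/atomic.h>
#include <linux/delay.h>

struct test_obj {
        struct list_head node;
        char payload[256 - sizeof(struct list_head)];   /* roughly sk_buff sized */
};

static struct kmem_cache *test_cache;
static LIST_HEAD(pending);
static DEFINE_SPINLOCK(pending_lock);
static atomic_t nr_pending = ATOMIC_INIT(0);
static struct task_struct *alloc_task;
static struct task_struct *free_tasks[4];

/* one thread doing the allocations (the "cpu A" of the scenario above) */
static int alloc_fn(void *unused)
{
        while (!kthread_should_stop()) {
                struct test_obj *obj;

                if (atomic_read(&nr_pending) > 10000) { /* crude throttle */
                        msleep(1);
                        continue;
                }
                obj = kmem_cache_alloc(test_cache, GFP_KERNEL);
                if (!obj) {
                        msleep(1);
                        continue;
                }
                spin_lock(&pending_lock);
                list_add_tail(&obj->node, &pending);
                spin_unlock(&pending_lock);
                atomic_inc(&nr_pending);
                cond_resched();
        }
        return 0;
}

/* several threads doing the frees, so every free is a remote free */
static int free_fn(void *unused)
{
        while (!kthread_should_stop()) {
                struct test_obj *obj = NULL;

                spin_lock(&pending_lock);
                if (!list_empty(&pending)) {
                        obj = list_first_entry(&pending, struct test_obj, node);
                        list_del(&obj->node);
                }
                spin_unlock(&pending_lock);
                if (obj) {
                        atomic_dec(&nr_pending);
                        kmem_cache_free(test_cache, obj);
                } else {
                        cond_resched();
                }
        }
        return 0;
}

static int __init remote_free_init(void)
{
        int i;

        test_cache = kmem_cache_create("remote_free_test",
                                       sizeof(struct test_obj), 0, 0, NULL);
        if (!test_cache)
                return -ENOMEM;
        alloc_task = kthread_run(alloc_fn, NULL, "rf_alloc");
        for (i = 0; i < ARRAY_SIZE(free_tasks); i++)
                free_tasks[i] = kthread_run(free_fn, NULL, "rf_free/%d", i);
        return 0;
}

static void __exit remote_free_exit(void)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(free_tasks); i++)
                kthread_stop(free_tasks[i]);
        kthread_stop(alloc_task);
        while (!list_empty(&pending)) {         /* drain leftovers */
                struct test_obj *obj = list_first_entry(&pending,
                                                        struct test_obj, node);
                list_del(&obj->node);
                kmem_cache_free(test_cache, obj);
        }
        kmem_cache_destroy(test_cache);
}

module_init(remote_free_init);
module_exit(remote_free_exit);
MODULE_LICENSE("GPL");

With slub stats enabled, comparing free_fastpath/free_slowpath (and the
free_cached counters from the patch above) while this runs would show how
much of the freeing traffic hits the contended path.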
Ezequiel Garcia <elezegarcia@...>
David,
On Mon, Oct 15, 2012 at 9:46 PM, David Rientjes <rientjes@...> wrote:
> On Sat, 13 Oct 2012, Ezequiel Garcia wrote:
> > But SLAB suffers from a lot more internal fragmentation than SLUB,
> > which I guess is a known fact.
>
> Even with slub's per-cpu partial lists?

I'm not considering that, but rather plain fragmentation: the difference
between requested and allocated, per object. Admittedly, perhaps this is
a naive analysis.

However, devices where this matters would have only one cpu, right?
So the overhead imposed by per-cpu data shouldn't impact so much.
Studying the difference in overhead imposed by the allocators is
something that's still on my TODO.

Now, returning to the fragmentation: the problem with SLAB is that its
smallest cache available for kmalloced objects is 32 bytes, while SLUB
allows 8, 16, 24...

Perhaps adding smaller caches to SLAB might make sense?
Is there any strong reason for NOT doing this?

Thanks,
Ezequiel
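A rough user-space model of that 32-byte floor (simplified; it ignores
per-object alignment and debug overhead, and the size classes are only
modeled for requests up to 32 bytes):

#include <stdio.h>

int main(void)
{
        /* Simplified model: SLAB's smallest kmalloc cache is 32 bytes,
         * while SLUB also provides 8/16/24-byte caches. */
        unsigned int sizes[] = { 4, 8, 12, 16, 24 };    /* e.g. short name strings */
        unsigned int i;

        for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
                unsigned int req = sizes[i];
                unsigned int slab_alloc = 32;                   /* 32-byte floor */
                unsigned int slub_alloc = (req + 7) & ~7u;      /* next multiple of 8 */

                printf("request %2u bytes: SLAB wastes %2u, SLUB wastes %u\n",
                       req, slab_alloc - req, slub_alloc - req);
        }
        return 0;
}

In this model a 4-byte string wastes 28 bytes under SLAB and 4 under SLUB,
which is the kind of per-object gap the sysfs-name measurements later in
the thread are about.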
Eric Dumazet <eric.dumazet@...>
On Tue, 2012-10-16 at 09:35 -0300, Ezequiel Garcia wrote:
> Now, returning to the fragmentation: the problem with SLAB is that its
> smallest cache available for kmalloced objects is 32 bytes, while SLUB
> allows 8, 16, 24...

I would remove the small kmalloc-XX caches, as sharing a cache line is
sometimes dangerous for performance, because of false sharing.

They make sense only for very small hosts.
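A small user-space sketch of the false-sharing cost being referred to
(illustrative only; the 64-byte line size and the structure layout are
assumptions): two threads update adjacent counters, first packed as two
neighbouring small objects sharing a cache line would be, then padded onto
separate lines.

/* Build with: cc -O2 -pthread false_sharing.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

/* two counters packed together, as two small kmalloc objects sharing a line */
static struct { atomic_ulong a, b; } packed;

/* the same two counters padded onto separate 64-byte cache lines */
static struct {
        atomic_ulong a;
        char pad[64 - sizeof(atomic_ulong)];
        atomic_ulong b;
} padded;

static void *bump(void *arg)
{
        atomic_ulong *ctr = arg;
        unsigned long i;

        for (i = 0; i < ITERS; i++)
                atomic_fetch_add_explicit(ctr, 1, memory_order_relaxed);
        return NULL;
}

static void run(atomic_ulong *a, atomic_ulong *b, const char *label)
{
        pthread_t t1, t2;
        struct timespec s, e;

        clock_gettime(CLOCK_MONOTONIC, &s);
        pthread_create(&t1, NULL, bump, a);
        pthread_create(&t2, NULL, bump, b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        clock_gettime(CLOCK_MONOTONIC, &e);
        printf("%-20s %.2fs\n", label,
               (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9);
}

int main(void)
{
        run(&packed.a, &packed.b, "same cache line:");
        run(&padded.a, &padded.b, "separate lines:");
        return 0;
}

On a typical multicore machine the packed case runs noticeably slower;
that is the risk of handing out 8- or 16-byte objects that put unrelated
hot data in the same cache line.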
Tim Bird <tim.bird@...>
On 10/16/2012 05:56 AM, Eric Dumazet wrote:
> I would remove the small kmalloc-XX caches, as sharing a cache line is
> sometimes dangerous for performance, because of false sharing.

That's interesting...

It would be good to measure the performance/size tradeoff here. I'm
interested in very small systems, and it might be worth the tradeoff,
depending on how bad the performance is. Maybe a new config option would
be useful (I can hear the groans now... :-)

Ezequiel - do you have any measurements of how much memory is wasted by
32-byte kmalloc allocations for smaller objects, in the tests you've been
doing?
 -- Tim

=============================
Tim Bird
Architecture Group Chair, CE Workgroup of the Linux Foundation
Senior Staff Engineer, Sony Network Entertainment
=============================
Ezequiel Garcia <elezegarcia@...>
On Tue, Oct 16, 2012 at 3:07 PM, Tim Bird <tim.bird@...> wrote:
> Ezequiel - do you have any measurements of how much memory is wasted by
> 32-byte kmalloc allocations for smaller objects, in the tests you've been
> doing?

Yes, we have some numbers:

http://elinux.org/Kernel_dynamic_memory_analysis#Kmalloc_objects

Are they too informal? I can add some details...

They've been measured on a **very** minimal setup, where almost every
option is stripped out, except for initramfs, sysfs, and trace.

In this scenario, strings allocated for file names and directories created
by sysfs are quite noticeable, being 4-16 bytes, and produce a lot of
fragmentation from that 32-byte cache in SLAB.

Is an option to enable small caches on SLUB and SLAB worth it?

Ezequiel
Ezequiel Garcia <elezegarcia@...>
On Tue, Oct 16, 2012 at 3:07 PM, Tim Bird <tim.bird@...> wrote:
> It would be good to measure the performance/size tradeoff here. I'm
> interested in very small systems, and it might be worth the tradeoff,
> depending on how bad the performance is.

It might be worth reminding that very small systems can use the SLOB
allocator, which does not suffer from this kind of fragmentation.

Ezequiel
Tim Bird <tim.bird@...>
On 10/16/2012 11:27 AM, Ezequiel Garcia wrote:
> Yes, we have some numbers:
>
> http://elinux.org/Kernel_dynamic_memory_analysis#Kmalloc_objects
>
> They've been measured on a **very** minimal setup, where almost every
> option is stripped out, except for initramfs, sysfs, and trace.

The detail I'm interested in is the amount of wastage for a "common"
workload, for each of the SLxB systems. Are we talking a few K, or 10's
or 100's of K?

It sounds like it's all from short strings. Are there other things using
the 32-byte kmalloc cache that waste a lot of memory (in aggregate) as
well?

Does your tool indicate a specific callsite (or small set of callsites)
where these small allocations are made? It sounds like it's in the
filesystem and would be content-driven (by the length of filenames)?
This might be an issue particularly for cameras, where all the generated
filenames are 8.3 (and will be for the foreseeable future).

> Is an option to enable small caches on SLUB and SLAB worth it?

I'll have to do some measurements to see. I'm guessing the option itself
would be pretty trivial to implement?
 -- Tim

=============================
Tim Bird
Architecture Group Chair, CE Workgroup of the Linux Foundation
Senior Staff Engineer, Sony Network Entertainment
=============================
Ezequiel Garcia <elezegarcia@...>
On Tue, Oct 16, 2012 at 3:44 PM, Tim Bird <tim.bird@...> wrote:
> The detail I'm interested in is the amount of wastage for a "common"
> workload, for each of the SLxB systems.

A more "common" workload is one of the next items on my queue.

> Does your tool indicate a specific callsite (or small set of callsites)
> where these small allocations are made? It sounds like it's in the
> filesystem and would be content-driven (by the length of filenames)?

That's right. And, IMHO, the problem is precisely that the allocation
size is content-driven.

Ezequiel