Fast sorting with limit in Elixir
If you have a large list of things in Elixir and then want to sort them in order
to get the top n
for something, chances are that your existing sort using
Enum.sort
may be relatively slow, as it rebuilds the entire list. We can do it
more efficiently by simply accumulating the top n
in an optimized data
structure.
Erlang provides this optimized data structure for us, namely
:gb_trees
. We roughly need to
do the following:
- Reduce over our enumerable.
- Figure out the value of the current item, and check whether it’s bigger than
the smallest value.1
- If it is, insert it into the tree.
- If not, insert it only if we’re not at the limit yet.
Let’s look at the function in detail:
defp sort_by_limit(stream, key_fn, limit) do
stream
|> Enum.reduce({:gb_trees.empty(), 0, 0}, fn item, {tree, smallest, size} ->
case key_fn.(item) do
key when key > smallest and size >= limit ->
{_smallest, _value, new_tree} = :gb_trees.take_smallest(tree)
{:gb_trees.insert(key, item, new_tree), key, size}
key when size < limit ->
{:gb_trees.insert(key, item, tree), key, size + 1}
_key ->
{tree, smallest, size}
end
end)
|> elem(0)
|> :gb_trees.to_list()
|> Enum.reverse()
|> Enum.map(&elem(&1, 1))
end
Carrying the smallest
and size
parameters myself is a bit of an
optimization: we could call :gb_trees.smallest
and :gb_trees.size
ourselves
but we can track this relatively simply. The following pipes are mainly used to
extract the tree, reverse the order from ascending to descending sort, and then
map to the actual items we cared about.
If this proved useful to you, or you found an issue with it, please let me know :-)
-
Since I needed it this way, the function is hardcoded to do this. I would be happy for any suggestions to make it more generic, although I assume in that case we would need to change the function a bit due to the guards in
case
having to be known at compile time. ↩︎