Efficient Counter that uses a limited (bounded) amount of memory regardless of data size

Bounter is a Python library, written in C, for extremely fast probabilistic counting of item frequencies in massive datasets, using only a small fixed memory footprint.

Why Bounter?

Bounter lets you count how many times an item appears, similar to Python’s built-in dict or Counter:

from bounter import bounter

counts = bounter(size_mb=1024)  # use at most 1 GB of RAM
counts.update([u'a', u'few', u'words', u'a', u'few', u'times'])  # count item frequencies

print(counts[u'few'])  # query the counts
2

However, unlike dict or Counter, Bounter can process huge collections where the items would not even fit in RAM. This commonly happens in Machine Learning and NLP, with tasks like dictionary building or collocation detection that need to estimate counts of billions of items (token ngrams) for their statistical scoring and subsequent filtering.

Bounter implements approximate counting algorithms on top of optimized low-level C structures, which avoids the overhead of Python objects. It lets you specify the maximum amount of RAM you want to use. In the project's Wikipedia example, Bounter uses 31x less memory than Counter.
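To make the idea of approximate counting in fixed memory concrete, here is a minimal pure-Python sketch of a Count-Min Sketch, one well-known algorithm of this kind. It is an illustration only, not Bounter's C implementation: the class name, the width and depth parameters, and the choice of hash function are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch (illustrative only, not Bounter's implementation)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth  # must stay below 256 because of the one-byte salt
        # depth rows of width counters; memory is fixed here, whatever the input size
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        data = item.encode("utf-8")
        for row in range(self.depth):
            # salting with the row index gives a different hash function per row
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([row])).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def update(self, items):
        for item in items:
            for row, col in self._buckets(item):
                self.rows[row][col] += 1

    def __getitem__(self, item):
        # collisions can only inflate a counter, so the smallest one is the best estimate
        return min(self.rows[row][col] for row, col in self._buckets(item))


sketch = CountMinSketch()
sketch.update(["a", "few", "words", "a", "few", "times"])
print(sketch["few"])  # an estimate that is never below the true count
```

The table never grows: adding a billion more items changes the counter values, not the number of counters. The price is that unrelated items can share a counter, so an estimate may be too high but never too low.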

Bounter is also marginally faster than the built-in dict and Counter, so wherever you can represent your items as strings (both byte strings and unicode are fine, and Bounter works in both Python 2 and Python 3), there’s no reason not to use Bounter instead.
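As a rough illustration of representing items as strings, token ngrams can be joined into a single key before counting. The helper name ngram_keys and the space separator are hypothetical choices for this sketch, and collections.Counter stands in for the counter here so that the example runs without Bounter installed.

```python
from collections import Counter


def ngram_keys(tokens, n=2, sep=" "):
    """Yield each run of n consecutive tokens joined into one string key."""
    for i in range(len(tokens) - n + 1):
        yield sep.join(tokens[i:i + n])


tokens = "new york is not new jersey".split()

counts = Counter()  # stand-in; any counter with the same update() interface would do
counts.update(ngram_keys(tokens))
print(counts["new york"])
```

Any separator works as long as it cannot appear inside a token; otherwise two different ngrams could collapse into the same key.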

https://github.com/RaRe-Technologies/bounter
