Bounter is a Python library, written in C, for extremely fast probabilistic counting of item frequencies in massive datasets, using only a small fixed memory footprint.
Why Bounter?
Bounter lets you count how many times an item appears, similar to Python's built-in `dict` or `Counter`:
```python
from bounter import bounter

counts = bounter(size_mb=1024)  # use at most 1 GB of RAM
counts.update([u'a', 'few', u'words', u'a', u'few', u'times'])  # count item frequencies

print(counts[u'few'])  # query the counts
2
```
However, unlike `dict` or `Counter`, Bounter can process huge collections where the items would not even fit in RAM. This commonly happens in Machine Learning and NLP, with tasks like dictionary building or collocation detection that need to estimate the counts of billions of items (token ngrams) for their statistical scoring and subsequent filtering.
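For a concrete picture of that workflow, here is a minimal sketch that streams a tokenised corpus from disk and counts unigrams and bigrams with the same `bounter(size_mb=...)` calls shown above; the file name, the whitespace tokenisation and the `'_'` bigram separator are illustrative assumptions, not part of Bounter itself:

```python
from bounter import bounter

counts = bounter(size_mb=1024)  # hard memory cap, regardless of corpus size

# Hypothetical corpus: a plain-text file with one tokenised sentence per line.
with open('corpus.txt', encoding='utf8') as corpus:
    for line in corpus:
        tokens = line.split()  # naive whitespace tokenisation (assumption)
        bigrams = [a + '_' + b for a, b in zip(tokens, tokens[1:])]
        counts.update(tokens)   # unigram frequencies
        counts.update(bigrams)  # bigram frequencies, encoded as plain strings

# The (approximate) frequencies can then feed a collocation score such as PMI.
print(counts['machine_learning'])
```

Because each sentence is counted and then discarded, memory use stays within the configured budget no matter how large the corpus grows.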
Bounter implements approximate algorithms using optimized low-level C structures, to avoid the overhead of Python objects. It lets you specify the maximum amount of RAM you want to use. In the Wikipedia example below, Bounter uses 31x less memory compared to `Counter`.
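As a hedged illustration of that trade-off, the snippet below counts the same small token list with `collections.Counter` and with a memory-bounded Bounter instance; on toy data the two agree exactly, while on data that outgrows the budget Bounter returns approximate counts instead of exhausting RAM:

```python
from collections import Counter
from bounter import bounter

tokens = ['spam', 'ham', 'spam', 'eggs', 'spam']

exact = Counter(tokens)        # unbounded memory, exact counts
approx = bounter(size_mb=64)   # memory budget fixed up front
approx.update(tokens)

print(exact['spam'], approx['spam'])  # 3 3 on data this small
```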
Bounter is also marginally faster than the built-in `dict` and `Counter`, so wherever you can represent your items as strings (both byte-strings and unicode are fine, and Bounter works in both Python 2 and Python 3), there's no reason not to use Bounter instead.
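A minimal sketch of such a drop-in swap (the helper name and the toy sentence stream are illustrative, not part of Bounter's API):

```python
from bounter import bounter

def count_tokens(sentences, size_mb=256):
    """Count token frequencies under a fixed memory budget.

    Drop-in style replacement for collections.Counter in a counting loop;
    keys must be strings (byte-strings or unicode).
    """
    counts = bounter(size_mb=size_mb)
    for tokens in sentences:
        counts.update(tokens)
    return counts

# Tiny in-memory stream for illustration; in practice the sentences would
# be read lazily from disk so the full collection never has to fit in RAM.
sentences = [[u'a', u'few', u'words'], [u'a', u'few', u'times']]
print(count_tokens(sentences)[u'few'])  # 2
```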