Bounter is a Python library, written in C, for extremely fast probabilistic counting of item frequencies in massive datasets, using only a small fixed memory footprint.
Bounter lets you count how many times an item appears, similar to Python’s built-in Counter:

```python
from bounter import bounter

counts = bounter(size_mb=1024)  # use at most 1 GB of RAM
counts.update([u'a', 'few', u'words', u'a', u'few', u'times'])  # count item frequencies

print(counts[u'few'])  # query the counts; prints 2
```

Unlike Counter, Bounter can process huge collections where the items would not even fit in RAM. This commonly happens in machine learning and NLP, in tasks such as dictionary building or collocation detection, which need to estimate counts of billions of items (token n-grams) for statistical scoring and subsequent filtering.
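To make the n-gram use case concrete, here is a minimal sketch of feeding token bigrams into a counter. It uses the standard library's `Counter` as a stand-in so that it runs without Bounter installed. The helper names `bigrams` and `count_bigrams` and the toy corpus are made up for illustration and are not part of Bounter.

```python
from collections import Counter


def bigrams(tokens):
    """Yield adjacent token pairs as single space-joined strings."""
    tokens = iter(tokens)
    prev = next(tokens, None)
    for tok in tokens:
        yield prev + u' ' + tok
        prev = tok


def count_bigrams(sentences, counts=None):
    """Feed the bigrams of every sentence into a counter-like object."""
    # Hypothetical helper: any object with an update(iterable) method works here.
    counts = Counter() if counts is None else counts
    for sentence in sentences:
        counts.update(bigrams(sentence.split()))
    return counts


corpus = [u'new york is big', u'i love new york']  # made-up toy data
bigram_counts = count_bigrams(corpus)
print(bigram_counts[u'new york'])  # the pair occurs once in each sentence
```

Because the bigrams are plain strings passed to `update`, the same loop should in principle accept a `bounter(size_mb=...)` object as the `counts` argument, which is where the fixed memory budget would pay off on a real corpus.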
Bounter implements approximate counting algorithms on top of optimized low-level C structures, which avoids the overhead of Python objects. It lets you specify the maximum amount of RAM you want to use. In the Wikipedia example below, Bounter uses 31x less memory than Counter.
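To illustrate how approximate counting can stay within a fixed memory budget, here is a toy Count-min Sketch in pure Python. It is only a sketch of one well-known technique, not Bounter's C implementation. The class name, the `width` and `depth` parameters, and the choice of SHA-256 as the hash are all assumptions made for this example.

```python
import hashlib


class CountMinSketch(object):
    """Toy Count-min Sketch: fixed memory, counts may be overestimated."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        # depth * width integers in total, no matter how many items are seen
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        data = item if isinstance(item, bytes) else item.encode('utf-8')
        for row in range(self.depth):
            # salt the hash with the row number to get one hash function per row
            salt = str(row).encode('ascii') + b':'
            digest = hashlib.sha256(salt + data).hexdigest()
            yield row, int(digest, 16) % self.width

    def update(self, items):
        for item in items:
            for row, col in self._buckets(item):
                self.table[row][col] += 1

    def __getitem__(self, item):
        # collisions only ever add to a cell, so the minimum is the best estimate
        return min(self.table[row][col] for row, col in self._buckets(item))


sketch = CountMinSketch(width=1024, depth=4)
sketch.update([u'a', u'few', u'words', u'a', u'few', u'times'])
estimate = sketch[u'few']  # never below the true count of 2
```

The table size is fixed when the sketch is created, so memory use does not grow with the number of distinct items. The price is that hash collisions can inflate an estimate, although they can never push it below the true count.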
Bounter is also marginally faster than the built-in Counter, so wherever you can represent your items as strings (both byte strings and unicode are fine, and Bounter works in both Python 2 and Python 3), there’s no reason not to use Bounter instead.