Exploiting New Properties of String Net Frequency for Efficient Computation
arXiv (2024)
Abstract
Knowing which strings in a massive text are significant – that is, which
strings are common and distinct from other strings – is valuable for several
applications, including text compression and tokenization. Frequency in itself
is not helpful for significance, because the commonest strings are the shortest
strings. A compelling alternative is net frequency, which has the property that
strings with positive net frequency are of maximal length. However, net
frequency remains relatively unexplored, and there is no prior art showing how
to compute it efficiently. We first introduce a characteristic of net frequency
that simplifies the original definition. With this, we study strings with
positive net frequency in Fibonacci words. We then use our characteristic to
solve two key problems related to net frequency. The first, single-nf, is how
to compute the net frequency of a given string of length m in an input text
of length n over an alphabet of size σ. The second, all-nf, is how to report
every string of positive net frequency in an input text of length n.
Our methods leverage suffix arrays, components of the Burrows-Wheeler
transform, and a solution to the coloured range listing problem. We show that,
for both problems, our data structure has O(n) construction cost: with this
structure, we solve single-nf in O(m + σ) time and
all-nf in O(n) time. Experimentally, we find our method to be around
100 times faster than reasonable baselines for single-nf. For
all-nf, our results show that, even with prior knowledge of the set of
strings with positive net frequency, simply confirming that their net frequency
is positive takes longer than with our purpose-designed method.
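
To make the single-nf problem concrete, the following is a minimal brute-force sketch in Python. The abstract does not state the definition of net frequency, so the sketch rests on an assumption: that for a repeated string S, the net frequency equals the number of occurrences of S whose one-symbol left extension and one-symbol right extension each occur exactly once in the text, and that unique or absent strings have net frequency 0. The function names and the two sentinel characters used to pad the text are hypothetical. This is a naive baseline for illustration, not the suffix-array method the abstract describes, and it runs far slower than the stated O(m + σ) query time.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def net_frequency_bruteforce(text: str, s: str) -> int:
    """Naive net frequency under the ASSUMED characterisation:
    for a repeated string s, count the occurrences of s whose left
    extension (x + s) and right extension (s + y) are both unique in
    the text. Unique or absent strings get 0.
    """
    # Assumption: pad with two sentinels that do not occur in text, so
    # that occurrences at the text boundaries also have extensions.
    padded = "\x00" + text + "\x01"
    if not s or count_occurrences(padded, s) < 2:
        return 0
    total = 0
    start = padded.find(s)
    while start != -1:
        end = start + len(s)
        left_ext = padded[start - 1:end]    # x + s
        right_ext = padded[start:end + 1]   # s + y
        if (count_occurrences(padded, left_ext) == 1
                and count_occurrences(padded, right_ext) == 1):
            total += 1
        start = padded.find(s, start + 1)
    return total
```

Under these assumptions, in the text "abcabd" the string "ab" gets net frequency 2, while its shorter substrings "a" and "b" get 0 despite being equally frequent, which matches the abstract's point that strings with positive net frequency are of maximal length.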