New Idea for Probabilistic Upper Bounds on PHFN

One famous result of Stinson, van Trung, and Wei is that PHFN(k,v,t) <= c*(log(2k choose t) - log(k)), where c is a constant independent of k but dependent on v and t. The proof essentially shows that if we start with 2k columns and the expected number of unseparated t-sets is strictly smaller than k, then we can delete one column from each unseparated t-set and are still guaranteed to have at least k columns remaining. The math continues almost identically to the basic method in Perfect Hash Families (a brief overview) below.

But note that this method does not rely on having exactly 2k columns: if we start with any number of columns m > k and the expected number of unseparated t-sets is strictly smaller than m-k, we can delete one column from each and still retain at least k columns. Therefore PHFN(k,v,t) <= c*(log(m choose t) - log(m-k)), where c is the same constant as before.

The question then is to determine the "best" choice of m, which is a discrete variable. One could use Stirling's approximation and minimize the resulting function by taking its derivative with respect to m.

The function (up to the constant c, after applying Stirling's approximation to log(m choose t)) is m log(m) - t log(t) - (m-t)log(m-t) - log(m-k), and its derivative with respect to m is log(m) - log(m-t) - 1/(m-k). WolframAlpha cannot find a value of m that makes this derivative equal to 0. We could note that t log(t) is much smaller than m log(m) and approximate t log(t) by 0, but this does not help.

If someone can identify how to find the minimum of this function, that would be of great use!
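In the absence of a closed form, one can at least locate the minimizing m numerically. Below is a minimal sketch; the function names `log_binom`, `f`, and `best_m`, and the search cap `m_max = 10 * k`, are my own choices, not from any library or from the result being discussed.

```python
import math

def log_binom(m, t):
    # log of (m choose t), via lgamma to avoid overflow for large m
    return math.lgamma(m + 1) - math.lgamma(t + 1) - math.lgamma(m - t + 1)

def f(m, k, t):
    # the quantity to minimize: log C(m, t) - log(m - k), valid for m > k
    return log_binom(m, t) - math.log(m - k)

def best_m(k, t, m_max=None):
    # brute-force search over integers m in (k, m_max]
    if m_max is None:
        m_max = 10 * k  # arbitrary cap; widen if the minimum lands on it
    return min(range(k + 1, m_max + 1), key=lambda m: f(m, k, t))
```

Comparing f(best_m(k, t), k, t) against f(2k, k, t) then shows how much the free choice of m improves on the original choice of 2k columns.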


Perfect Hash Families (a brief overview)

My research deals with Perfect Hash Families, and this article covers some of the background needed for (possible) future posts. No background is necessary beyond general mathematical knowledge.

A perfect hash family {PHF(N; k, v, t)} is a set of {N} functions {f : X \rightarrow Y} with {|X| = k} and {|Y| = v}, such that for any {t}-subset {C} of {X}, there is at least one function {g} such that {g} restricted to {C} is one-to-one. In matrix form, this represents an {N \times k} matrix over {v} symbols such that for any {t} columns, there is at least 1 row whose entries in those columns are all distinct.

If a given set of {t} columns has such a condition (i.e., some row restricted to these columns has all distinct entries), then we say that these columns are covering. Also, we say the PHF is covering if all {t}-subsets of columns are covering.

We want to find the smallest possible PHF for the given parameters, in terms of the number of rows. We denote this question (the smallest number of rows) by {PHFN(k, v, t)}.

These objects were introduced by Mehlhorn in 1984 for fast retrieval of frequently used information. Currently, they are used in the construction of Covering Arrays as well as in combinatorial cryptography.

Theorem 1 {PHFN(k, v, t) = 1} if {k = t}, and {\ge 2} otherwise (assuming {v < k}, so that a single row cannot have all distinct entries).

Proof: If {k=t}, the array consists of 1 row of all distinct symbols. If {k > v} and there is only 1 row, then by pigeonhole some two columns share a symbol in that row, so any set of {t} columns containing both is not covering. \Box

Theorem 2 {PHFN(k, v, 2) = \lceil \frac{\log(k)}{\log(v)} \rceil}.

Proof: If {t=2}, the PHF is covering if and only if all columns are pairwise distinct as vectors. There are {v^N} possible distinct columns of length {N} over {v} symbols, so we need {v^N \ge k}; solving for {N} gives the desired result. \Box
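The proof of Theorem 2 is constructive: writing each column index in base {v} produces the required array. A quick sketch (the function name `phf_t2` is my own):

```python
def phf_t2(k, v):
    # find the smallest N with v**N >= k, so k distinct columns of length N exist
    N = 1
    while v ** N < k:
        N += 1
    # column c is the base-v expansion of c; distinct indices give distinct
    # columns, so every pair of columns differs in at least one row
    cols = [[(c // v ** r) % v for r in range(N)] for c in range(k)]
    return [[col[r] for col in cols] for r in range(N)]  # transpose to N x k rows
```

The integer loop for {N} sidesteps floating-point issues with computing {\lceil \log(k)/\log(v) \rceil} directly.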

Theorem 3 (Symbol Increase) {PHFN(k, v+d, t) \le PHFN(k, v, t)} for any {d \ge 1}.

Proof: A PHF over {v} symbols is also a PHF over {v+d} symbols; the extra symbols are simply unused. \Box

Theorem 4 (Column Decrease) {PHFN(k-d, v, t) \le PHFN(k, v, t)} for any {d \ge 1}.

Proof: Every {t}-set of the remaining columns was already covering in the original PHF. \Box

Theorem 5 {PHFN(k+d, v+d, t) \le PHFN(k, v, t)} for any {d \ge 1}.

Proof: Add {d} new columns to the PHF, the {i}th of which consists entirely of its own new symbol. Any {t}-set using only the previous {k} columns is still covering. For a {t}-set involving new columns, extend its old columns to a full {t}-set of old columns and take a row covering that set; in this row, the old columns of our {t}-set have distinct entries, and the new columns are distinct from everything, since each uses a fresh symbol. \Box

1. Recursive Constructions

The power in generating large combinatorial objects lies in being able to create them using smaller “ingredients” – we describe a few of them here. The first two lead to the best bounds, in general, for very large PHFs.

Theorem 6 (Blackburn Composition) {PHFN(k, v, t) \le PHFN(k, x, t) \times PHFN(x, v, t)}.

Proof: Call the first PHF {A} and the second {B}. Replace every occurrence of the {i}th symbol of {A} with the {i}th column of {B} (written vertically) to form {C}, so each row of {A} expands into {PHFN(x, v, t)} rows of {C}. {C} has the required size, but why is it a PHF? Select any {t} columns of {C}. In {A}, the same {t} columns have at least 1 row with distinct entries. In {C}, that row expands into the rows of {t} distinct columns of {B}; since {B} is a PHF, at least 1 of these rows has distinct entries. Therefore, {C} is a PHF. \Box
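The composition of Theorem 6 is easy to carry out on explicit arrays. A minimal sketch, with both PHFs stored as lists of rows (the name `compose` is mine):

```python
def compose(A, B):
    # A is a PHF(N_A; k, x, t); B is a PHF(N_B; x, v, t).
    # Replacing symbol s of A with column s of B means: each row of A expands
    # into len(B) rows, whose entries are row-of-B evaluated at A's symbols.
    C = []
    for rowA in A:
        for rowB in B:
            C.append([rowB[s] for s in rowA])
    return C
```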

Theorem 7 (Atici-Magliveras-Stinson-Wei Product)

\displaystyle PHFN(k_0 k_2, v, t) \le PHFN(k_0 k_1, v, t) \times PHFN(k_2, k_1, t-1) + PHFN(k_2, v, t).

Theorem 8 (Martirosyan Row Decrease)

\displaystyle PHFN(\lceil \frac{k(t-1)}{v} \rceil, v, t) \le PHFN(k, v, t)-1.

Proof: Let {A} be the PHF from which we want to remove a row. Pick any row {j}, and denote by {x_i} the number of times symbol {i} appears in row {j}. Without loss of generality, order the symbols so that {x_0 \ge x_1 \ge \cdots \ge x_{v-1}}.

Delete all columns of {A} that have symbols {t-1, t, \cdots, v-1} in row {j}. Now remove row {j} completely. We get an {(N-1) \times k'} array {C} (with potentially fewer symbols, but this is not guaranteed). We can guarantee {C} is a PHF of strength {t}: among the kept columns, row {j} contains at most {t-1} distinct symbols, so it could never have separated any {t}-set of them; every such {t}-set must therefore be covered by one of the other {N-1} rows.

What is the value of {k'}? Since {x_0 \ge x_1 \ge \cdots \ge x_{v-1}}, the average of the {t-1} largest counts is at least the average of all {v} counts, so

\displaystyle \sum_{i=0}^{v-1} x_i \le v \frac{\sum_{i=0}^{t-2} x_i}{t-1}.

Therefore, {k' = \sum_{i=0}^{t-2} x_i \ge \frac{k(t-1)}{v}}, and the result is proved. \Box
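The deletion step in Theorem 8 can be sketched in a few lines (the name `row_decrease` is mine; `Counter.most_common` picks the {t-1} most frequent symbols, playing the role of the "without loss of generality" ordering):

```python
from collections import Counter

def row_decrease(A, j, t):
    # keep only the columns whose symbol in row j is among the t-1 most
    # frequent symbols of that row, then delete row j entirely
    counts = Counter(A[j])
    keep_syms = {s for s, _ in counts.most_common(t - 1)}
    keep = [c for c, s in enumerate(A[j]) if s in keep_syms]
    return [[row[c] for c in keep] for i, row in enumerate(A) if i != j]
```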

2. Probabilistic Constructions

The previous section discussed actually constructing a PHF; here we take a different approach: we show that a PHF with given parameters must exist, but the proofs are (unfortunately) nonconstructive.

Let {A} be a random {N \times k} array over {v} symbols. Choose a set of {t} columns {C}. What is the probability that {C} is not covering? It is the probability that no row restricted to {C} has all distinct entries. What is the probability that a given row is all distinct? This probability is

\displaystyle \frac{v(v-1)(v-2)\cdots(v-t+1)}{v^t}.

Let this probability be {p}. Therefore, the probability that a given row is not all distinct is {1-p}, and since the rows are independent, the probability that no row is all distinct is {(1-p)^N}.

So what is the expected number of {t}-sets of columns that are not covering? By linearity of expectation, since there are {{k \choose t}} such {t}-sets, this expected number is

\displaystyle {k \choose t}(1-p)^N.

All we need to do is to solve for {N} making this number strictly less than 1. If the expected number of non-covering {t}-sets is less than 1, then some array must have zero of them, i.e., there must be some PHF with those parameters. Taking logarithms, and upper bounds, we have that:
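Solving {{k \choose t}(1-p)^N < 1} for {N} is direct to do numerically. A sketch (the name `n_bound` is mine):

```python
import math

def n_bound(k, v, t):
    # p = probability a single random row separates a fixed t-set of columns
    p = math.prod(v - i for i in range(t)) / v ** t
    # smallest integer N with C(k, t) * (1 - p)^N < 1, i.e.
    # N > log C(k, t) / (-log(1 - p))
    log_bad = math.lgamma(k + 1) - math.lgamma(t + 1) - math.lgamma(k - t + 1)
    return math.floor(log_bad / -math.log(1 - p)) + 1
```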

Theorem 9 {PHFN(k, v, t) \le \frac{t(\log(k) - \log(t) + 1)}{t\log(v) - \log(v^t-t!{v \choose t})}}.

Deng, Stinson, and Wei showed an improved bound using the Lovász Local Lemma:

Theorem 10 {PHFN(k, v, t) \le \frac{\log({k \choose t} - {{k-t} \choose t})+1}{t\log(v) - \log(v^t - t! {v \choose t})}}.

Stinson, van Trung, and Wei improved on this (called “expurgation”):

Theorem 11 {PHFN(k, v, t) \le \frac{\log({2k \choose t}) - \log(k)}{t\log(v) - \log(v^t-t!{v \choose t})}}.

Note that in all cases the improvements are only in the numerator.