# Perfect Hash Families (a brief overview)

My research deals with Perfect Hash Families, and this article covers some of the background for (possible) future posts. No background is strictly necessary for this post other than general mathematical knowledge.

A perfect hash family ${PHF(N; k, v, t)}$ is a set of ${N}$ functions ${f : X \rightarrow Y}$ with ${|X| = k}$ and ${|Y| = v}$, such that for any ${t}$-subset ${C}$ of ${X}$, there is at least one function ${g}$ in the set such that ${g}$ restricted to ${C}$ is one-to-one. In matrix form, this represents an ${N \times k}$ matrix over ${v}$ symbols such that for any ${t}$ columns, there is at least 1 row in which those columns have all distinct entries.

If a given set of ${t}$ columns satisfies this condition (i.e., some row is all distinct on those columns), then we say these columns are covering. We say the PHF itself is covering if all ${t}$-subsets of columns are covering.
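To make these definitions concrete, here is a brute-force checker (a Python sketch; the function name `is_phf` is my own, not standard notation):

```python
from itertools import combinations

def is_phf(A, t):
    """Return True if the rows of A (an N x k array, as a list of lists
    over any symbol set) form a perfect hash family of strength t:
    every t-subset of columns has some row with all-distinct entries."""
    k = len(A[0])
    for cols in combinations(range(k), t):
        # this t-set of columns is covering iff some row separates it
        if not any(len({row[c] for c in cols}) == t for row in A):
            return False
    return True
```

For example, `is_phf([[0, 1, 2]], 3)` is true (one row of distinct symbols), while `is_phf([[0, 0, 1]], 3)` is not.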

We want to find the smallest possible PHF for the given parameters, in terms of the number of rows; we denote this smallest number of rows by ${PHFN(k, v, t)}$.

These objects were introduced by Mehlhorn in 1984 for fast retrieval of frequently used information. Currently, they are used in the construction of Covering Arrays as well as in combinatorial cryptography.

Theorem 1 ${PHFN(k, v, t) = 1}$ if ${k \le v}$, and ${\ge 2}$ otherwise.

Proof: If ${k \le v}$, a single row of ${k}$ distinct symbols covers every ${t}$-subset of columns. If ${k > v}$ and there is only 1 row, then by pigeonhole some two columns share a symbol in that row, and any set of ${t}$ columns containing these two has a duplicate entry. $\Box$

Theorem 2 ${PHFN(k, v, 2) = \lceil \frac{\log(k)}{\log(v)} \rceil}$.

Proof: If ${t=2}$, then the PHF is covering if and only if every pair of columns differs in some row, i.e., all ${k}$ columns are distinct. There are ${v^N}$ possible columns of length ${N}$ over ${v}$ symbols, so we need ${v^N \ge k}$, and solving for ${N}$ gives the desired result. $\Box$
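The proof is constructive: the columns can be taken to be the base-${v}$ expansions of ${0, 1, \ldots, k-1}$. A sketch (the helper name is mine):

```python
def phf_strength2(k, v):
    """Build an optimal strength-2 PHF: column j is the base-v expansion
    of j, padded to N = ceil(log_v k) digits, so all k columns are
    distinct and any two of them differ in some row."""
    N = 1
    while v**N < k:          # smallest N with v^N >= k
        N += 1
    cols = [[(j // v**r) % v for r in range(N)] for j in range(k)]
    return [list(row) for row in zip(*cols)]  # transpose to an N x k array
```

For instance, `phf_strength2(9, 3)` has 2 rows, matching the theorem's ${\lceil \log 9 / \log 3 \rceil = 2}$.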

Theorem 3 (Symbol Increase) ${PHFN(k, v+d, t) \le PHFN(k, v, t)}$ for any ${d \ge 1}$.

Proof: A PHF over ${v}$ symbols is in particular an array over ${v+d}$ symbols, and adding unused symbols does not change whether or not it covers. $\Box$

Theorem 4 (Column Decrease) ${PHFN(k-d, v, t) \le PHFN(k, v, t)}$ for any ${d \ge 1}$.

Proof: Every ${t}$-subset of the remaining columns was already covering in the original PHF, so removing columns preserves the property. $\Box$

Theorem 5 ${PHFN(k+d, v+d, t) \le PHFN(k, v, t)}$ for any ${d \ge 1}$.

Proof: Add ${d}$ new columns to the PHF, the ${i}$th consisting entirely of the new symbol ${v+i}$. Any ${t}$-set of the original ${k}$ columns is certainly still covering, and a ${t}$-set involving new columns is covering in any row that covers its old columns, since there the new columns contribute distinct fresh symbols that also differ from every old symbol. $\Box$
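This construction is short enough to write down directly; a sketch (the function name is mine), assuming the input PHF uses symbols ${0}$ through ${v-1}$:

```python
def add_columns_and_symbols(A, d, v):
    """Theorem 5 construction: append d new columns to a PHF over
    symbols 0..v-1, the i-th new column constant at the fresh symbol
    v + i. New columns collide with nothing, so covering rows still cover."""
    return [row + [v + i for i in range(d)] for row in A]
```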

1. Recursive Constructions

The power in generating large combinatorial objects lies in being able to create them from smaller “ingredients”; we describe a few such constructions here. The first two lead, in general, to the best bounds for very large PHFs.

Theorem 6 (Blackburn Composition) ${PHFN(k, v, t) \le PHFN(k, x, t) \times PHFN(x, v, t)}$.

Proof: Call the first PHF ${A}$ (an ${N_1 \times k}$ array over ${x}$ symbols) and the second ${B}$ (an ${N_2 \times x}$ array over ${v}$ symbols). Replace every occurrence of the ${i}$th symbol of ${A}$ with the ${i}$th column of ${B}$, so each row of ${A}$ expands into ${N_2}$ rows; call the resulting ${N_1 N_2 \times k}$ array ${C}$. ${C}$ has the required size, but why is it a PHF? Select any ${t}$ columns of ${C}$. In ${A}$, at least 1 row is distinct on those columns, so in ${C}$ the corresponding block of ${N_2}$ rows consists of ${t}$ distinct columns of ${B}$. Since ${B}$ is a PHF, at least 1 row of that block has all distinct entries. Therefore, ${C}$ is a PHF. $\Box$
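The symbol-to-column substitution in the proof is mechanical; a sketch (names mine), where `A` uses symbols `0..x-1` indexing the columns of `B`:

```python
def blackburn_compose(A, B):
    """Blackburn composition: A is a PHF(N_A; k, x, t) over symbols
    0..x-1 and B is a PHF(N_B; x, v, t). Replacing the i-th symbol of A
    by the i-th column of B gives a PHF(N_A * N_B; k, v, t)."""
    B_cols = list(zip(*B))           # B_cols[i] is column i of B
    C = []
    for row in A:
        for r in range(len(B)):      # each row of A expands into N_B rows
            C.append([B_cols[sym][r] for sym in row])
    return C
```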

Theorem 7 (Atici-Magliveras-Stinson-Wei Product)

$\displaystyle PHFN(k_0 k_2, v, t) \le PHFN(k_0 k_1, v, t) \times PHFN(k_2, k_1, t-1) + PHFN(k_2, v, t).$

Theorem 8 (Martirosyan Row Decrease)

$\displaystyle PHFN(\lceil \frac{k(t-1)}{v} \rceil, v, t) \le PHFN(k, v, t)-1.$

Proof: Let ${A}$ be the PHF from which we want to remove a row. Pick any row ${j}$. Let ${x_i}$ denote the number of times symbol ${i}$ appears in row ${j}$, and without loss of generality order the symbols so that ${x_0 \ge x_1 \ge \cdots \ge x_{v-1}}$.

Delete all columns of ${A}$ whose entry in row ${j}$ is one of the symbols ${t-1, t, \ldots, v-1}$, then remove row ${j}$ completely. We get an ${(N-1) \times k'}$ array ${C}$ (with potentially fewer symbols, but this is not guaranteed), where ${k' = \sum_{i=0}^{t-2} x_i}$. Moreover, ${C}$ is still a PHF of strength ${t}$: the surviving columns carry only ${t-1}$ distinct symbols in row ${j}$, so any ${t}$ of them had a repeated entry there, meaning row ${j}$ was never the row covering them, and some other row still does.

What is the value of ${k'}$? Since ${x_0 \ge \cdots \ge x_{v-1}}$, the average of the ${t-1}$ largest counts is at least the average of all ${v}$ counts, so

$\displaystyle k = \sum_{i=0}^{v-1} x_i \le v \, \frac{\sum_{i=0}^{t-2} x_i}{t-1} = \frac{v k'}{t-1}.$

Therefore, ${k' \ge \frac{k(t-1)}{v}}$, and the result is proved. $\Box$
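The row-removal step of this proof can be sketched directly (a Python sketch; `remove_row` is my own name):

```python
from collections import Counter

def remove_row(A, j, t):
    """Martirosyan row decrease: keep only the columns whose entry in
    row j is one of the t-1 most frequent symbols of that row, then
    delete row j. Any t kept columns collide in row j, so row j was
    never their covering row; the result is still a strength-t PHF."""
    keep_symbols = {s for s, _ in Counter(A[j]).most_common(t - 1)}
    keep = [c for c, s in enumerate(A[j]) if s in keep_symbols]
    return [[row[c] for c in keep] for i, row in enumerate(A) if i != j]
```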

2. Probabilistic Constructions

The previous section discussed actually constructing a PHF; here we take a different approach and only show that a PHF with the given parameters must exist. The proofs are (unfortunately) nonconstructive.

Let ${A}$ be a random ${N \times k}$ array over ${v}$ symbols. Choose a set of ${t}$ columns ${C}$. What is the probability that ${C}$ is not covered? It is the probability that no rows in ${C}$ have all distinct entries. What is the probability that a given row is all distinct? This probability is

$\displaystyle \frac{v(v-1)(v-2)\cdots(v-t+1)}{v^t}.$

Let this probability be ${p}$. Then the probability that a given row is not all distinct is ${1-p}$, and the probability that no row is all distinct is ${(1-p)^N}$.

So what is the expected number of ${t}$-sets of columns that are not covering? By linearity of expectation, since there are ${{k \choose t}}$ such ${t}$-sets, this expected number is

$\displaystyle {k \choose t}(1-p)^N.$

All we need to do is find ${N}$ so that this number is strictly less than 1; if we do, then some PHF with those parameters must exist. Taking logarithms, and upper bounds, we have that:

Theorem 9 ${PHFN(k, v, t) \le \frac{t(\log(k) - \log(t) + 1)}{t\log(v) - \log(v^t-t!{v \choose t})}}$.
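The calculation behind this bound is easy to carry out numerically without the logarithmic upper bounds: just find the smallest ${N}$ with ${{k \choose t}(1-p)^N < 1}$. A sketch (the function name is mine; requires Python 3.8+ for `math.perm` and `math.comb`):

```python
import math

def phf_existence_bound(k, v, t):
    """Smallest N with C(k,t) * (1-p)^N < 1, where
    p = v(v-1)...(v-t+1) / v^t is the probability that a uniformly
    random row separates a fixed t-set of columns. Such an N
    guarantees a PHF(N; k, v, t) exists (first-moment method)."""
    p = math.perm(v, t) / v**t
    N = 1
    while math.comb(k, t) * (1 - p)**N >= 1:
        N += 1
    return N
```

For example, for ${k=10, v=3, t=2}$ the bound gives 4 rows, slightly above the exact value 3 from Theorem 2.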

Deng, Stinson, and Wei showed an improved bound using the Lovász Local Lemma:

Theorem 10 ${PHFN(k, v, t) \le \frac{\log({k \choose t} - {{k-t} \choose t})+1}{t\log(v) - \log(v^t - t! {v \choose t})}}$.

Stinson, van Trung, and Wei improved on this with a technique called “expurgation”:

Theorem 11 ${PHFN(k, v, t) \le \frac{\log{2k \choose t} - \log(k)}{t\log(v) - \log(v^t-t!{v \choose t})}}$.

Note that improvements are only in the numerator in all cases.

# New Bounds for Covering Arrays of Strength 7

Happy Halloween!

I want to share some cool results from a project I’m working on in one of my classes. A covering array ${CA(N; t, k, v)}$ is an ${N \times k}$ array with entries from an alphabet of size ${v}$ such that, for every ${t}$ of the ${k}$ columns, every one of the ${v^t}$ possible ${t}$-tuples appears in at least one row when restricted to those columns. The covering array number, ${CAN(t, k, v)}$, is the smallest ${N}$ for which a ${CA(N; t, k, v)}$ exists. Kleitman and Spencer, and independently Katona, determined ${CAN(2, k, 2)}$ for all ${k}$; no other case is known for all ${k}$, and otherwise only bounds from heuristic search are known. My advisor keeps the best-known covering array numbers here.
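As with PHFs, the defining property is easy to verify directly (a brute-force Python sketch, fine for small arrays though exponential in ${t}$; the name is mine):

```python
from itertools import combinations

def is_covering_array(A, t, v):
    """Check the CA(N; t, k, v) property: for every choice of t columns,
    all v**t possible t-tuples appear among the rows restricted there."""
    k = len(A[0])
    for cols in combinations(range(k), t):
        seen = {tuple(row[c] for c in cols) for row in A}
        if len(seen) < v**t:   # some t-tuple never appears
            return False
    return True
```

For instance, the classic 4-row array `[[0,0,0],[0,1,1],[1,0,1],[1,1,0]]` is a ${CA(4; 2, 3, 2)}$.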

I was able to show the following:

1. ${CAN(7, 10, 3) \le 4371 (-2184)}$,
2. ${CAN(7, 11, 3) \le 6555 (-2184)}$,
3. ${CAN(7, 13, 3) \le 9225 (-1698)}$,
4. ${CAN(7, 14, 3) \le 10923 (-2184)}$,
5. ${CAN(7, 15, 3) \le 13107 (-2184)}$,
6. ${CAN(7, 17, 5) \le 312485 (-78120)}$.

The numbers in parentheses are the row reductions from the previously best known bounds. I won’t share how I did this yet, but it is a cool computational technique that works very well for high ${t}$!