Friday, February 17, 2017

Reasoning in the presence of NULLs

One of the defining characteristics of SQL is that it is a declarative query language. That is, the user does not specify how the result should be computed, but instead specifies what conditions the result should satisfy. Within these constraints the database is free to choose between execution alternatives. This has some interesting consequences:

Consider for example the query

select x=a and x=b

clearly, it is equivalent to the following query

select x=a and a=b

after all, x=a, and thus we can substitute a with x in the second term.

And of course the same is true is we replace a and b with constants:

select x=1 and x=2

is equivalent to

select x=1 and 1=2

Now the really interesting question is: Is a database system allowed to infer that this is equivalent to

select false

? After all, x cannot be equal 1 and equal to 2 simultaneously, and the second formulation is x=1 and false, which is clearly false. But what looks intuitive is not necessarily true, if we consider the result of the following two queries (with PostgreSQL results):
postgres=> select x,x=1 and x=2 from (values(1),(null)) s(x);
  x   | ?column? 
    1 | f
(2 rows)

postgres=> select x,x=1 and 1=2 from (values(1),(null)) s(x);
  x   | ?column? 
    1 | f
 NULL | f
(2 rows)

The reason for that difference is the tricky behavior of NULL: 1) A comparison of value with NULL is NULL, and 2) NULL AND NULL is NULL, but 3) NULL AND FALSE is FALSE.

Both 1) and 3) make sense, 1) because we cannot say anything about the result of the comparison, and 3) because it doesn't matter for which value NULL really stands (either true or false), the and with FALSE will always produce a FALSE value. But in combination, they lead to very surprising behavior, namely that some databases return NULL and other databases return FALSE for the same query, depending on whether they noticed that the query contained a contradiction or not.

And this kind of reasoning occurs in many other circumstances, too, for example in this query

select cast(x as integer)=1.5

obviously, this condition can never be true, regardless of the value for x, as 1.5 is not an integer number. But what about NULL? Are we allowed to statically infer that the result is false, even if x might be a NULL value?

I tried to find an answer to that question in the SQL:2011 standard, but unfortunately it is not specified clearly. But after looking at the definition of AND, I have convinced myself that returning false here is ok. After all, the argument for NULL AND FALSE being FALSE is that NULL is an unknown value from within the domain, and any value AND FALSE is FALSE. If we use the same argument here, we can statically decide that the result is false, regardless of the value of x.

But even though the argument makes sense (and it is good for the query optimizer, which exploits that fact), it is still unfortunate that the query result changes depending on how smart the database system is. A smart system can infer that the result is false, not matter what, while a more simple system might return NULL. But perhaps that is just the consequence of having a declarative system, we cannot (and should not!) control in which order expressions are evaluated.

No comments:

Post a Comment