Thursday, January 8, 2015

Fun with CHAR

The CHAR(n) data type is one of the more bizarre features of SQL. It is supposed to represent a fixed-length string, but as we will see in a moment, it behaves odd in all kinds of ways. IMHO it should never be used for anything. There might be use cases for CHAR(1), i.e., for storing a single character, but as a string data type its semantics are so odd and broken that it should never be used.

At a first glance, CHAR is simple. CHAR(n) means that the corresponding value is a string with a fixed length of n, padding with space as needed. Thus

select cast('A' as CHAR(5))
'A    '

An easy concept, except that people apparently noticed that such a data types would not be very useful in practice when comparing values like this:

create table foo(x char(5));
create table bar(y char(10));
select * from foo,bar where x=y;

If we would take the "fixed-length string" semantic literally, we would never find join partners, as x and y will always differ due to their different length.

To fix that, the SQL standard mandated that for comparison purposes CHAR values are implicitly padded with <space> until they reach the length of the other string. Which means that

   'A  '   = 'A    ' (char(3) and char(5))
=> 'A    ' = 'A    ' 
=> true

Which "fixed" the comparison problem, but caused all kinds of other problems. Note that the padding behavior is always active, even when comparing with non-CHAR strings like VARCHAR. This results in absurd semantic issues:

create table foo(x varchar(10),y char(5),z varchar(10));
insert into foo values ('A ','A','A  ');
select x=y, y=z, x=z from foo;

Equality is no longer transitive! The CHAR/VARCHAR comparisons do add the padding to the shorter CHAR string, and thus compare as equal. The VARCHAR comparisons use the literal values, and thus compare as not equal. This kills transitivity, ruins all kinds of query optimizer reasoning, and is a pain in general.

This implicit padding behavior also ruins hash-based approaches, as we cannot know beforehand how many spaces we would have to add, which makes computing hash-values impossible. A common remedy for that is to do the exact opposite of what CHAR is supposed to mean, namely always strip all trailing spaces from CHAR values, and then for comparisons also strip trailing spaces from the comparison partner. We always have a unique representation for CHAR values then, and can use hashing. The only problem is that this affects observable behavior, as there are a few characters that should be sorted before space (depending on the collation, of course):

create table foo(x char(3));
insert into foo values (''),(chr(3));
select * from foo order by x;
'   '
'\x03  '

This is the output order that PostgreSQL and HyPer are producing, but strictly speaking the tuples should be in the other order, as chr(3)<chr(32). Fortuntately most users do not care about characters below space, so we can get away with the trimming trick.

After all these problems, what might be reasons that speak in favor of CHAR(n) (with n>1)? To be honest, I have no idea. In the past one argument in favor of CHAR might have been that fixed-length strings are easier to handle than variable length strings. But that argument is largely moot today. Note that, at least by default, the "n" specifies the number of characters, not the number of bytes. When using Unicode this is not very helpful for determining the length of the string, unless one wants to waste a lot of space. Thus CHAR strings are not really fixed length, anyway. They cause a lot of problems, but have virtually no benefit over VARCHAR.

Fun data point that I noticed when comparing HyPer with other database systems: Even well established systems like PostgreSQL get the CHAR semantics wrong. PostgreSQL 9.4 does it correct when comparing with VARCHAR, but forgets to pad when comparing to TEXT (but trims instead, which completely ruins the comparison):

select 'A'::char(2) ='A '::varchar(10) => T
select 'A'::char(2) ='A '::text        => F
select 'A '::char(2)='A '::text        => F

Don't use CHAR(n). Use VARCHAR(n) or TEXT.