Thursday, January 8, 2015

Fun with CHAR

The CHAR(n) data type is one of the more bizarre features of SQL. It is supposed to represent a fixed-length string, but as we will see in a moment, it behaves odd in all kinds of ways. IMHO it should never be used for anything. There might be use cases for CHAR(1), i.e., for storing a single character, but as a string data type its semantics are so odd and broken that it should never be used.

At a first glance, CHAR is simple. CHAR(n) means that the corresponding value is a string with a fixed length of n, padding with space as needed. Thus

select cast('A' as CHAR(5))
----
'A    '

An easy concept, except that people apparently noticed that such a data types would not be very useful in practice when comparing values like this:

create table foo(x char(5));
create table bar(y char(10));
select * from foo,bar where x=y;

If we would take the "fixed-length string" semantic literally, we would never find join partners, as x and y will always differ due to their different length.

To fix that, the SQL standard mandated that for comparison purposes CHAR values are implicitly padded with <space> until they reach the length of the other string. Which means that

   'A  '   = 'A    ' (char(3) and char(5))
=> 'A    ' = 'A    ' 
=> true

Which "fixed" the comparison problem, but caused all kinds of other problems. Note that the padding behavior is always active, even when comparing with non-CHAR strings like VARCHAR. This results in absurd semantic issues:

create table foo(x varchar(10),y char(5),z varchar(10));
insert into foo values ('A ','A','A  ');
select x=y, y=z, x=z from foo;
----
T T F

Equality is no longer transitive! The CHAR/VARCHAR comparisons do add the padding to the shorter CHAR string, and thus compare as equal. The VARCHAR comparisons use the literal values, and thus compare as not equal. This kills transitivity, ruins all kinds of query optimizer reasoning, and is a pain in general.

This implicit padding behavior also ruins hash-based approaches, as we cannot know beforehand how many spaces we would have to add, which makes computing hash-values impossible. A common remedy for that is to do the exact opposite of what CHAR is supposed to mean, namely always strip all trailing spaces from CHAR values, and then for comparisons also strip trailing spaces from the comparison partner. We always have a unique representation for CHAR values then, and can use hashing. The only problem is that this affects observable behavior, as there are a few characters that should be sorted before space (depending on the collation, of course):

create table foo(x char(3));
insert into foo values (''),(chr(3));
select * from foo order by x;
----
'   '
'\x03  '

This is the output order that PostgreSQL and HyPer are producing, but strictly speaking the tuples should be in the other order, as chr(3)<chr(32). Fortuntately most users do not care about characters below space, so we can get away with the trimming trick.


After all these problems, what might be reasons that speak in favor of CHAR(n) (with n>1)? To be honest, I have no idea. In the past one argument in favor of CHAR might have been that fixed-length strings are easier to handle than variable length strings. But that argument is largely moot today. Note that, at least by default, the "n" specifies the number of characters, not the number of bytes. When using Unicode this is not very helpful for determining the length of the string, unless one wants to waste a lot of space. Thus CHAR strings are not really fixed length, anyway. They cause a lot of problems, but have virtually no benefit over VARCHAR.

Fun data point that I noticed when comparing HyPer with other database systems: Even well established systems like PostgreSQL get the CHAR semantics wrong. PostgreSQL 9.4 does it correct when comparing with VARCHAR, but forgets to pad when comparing to TEXT (but trims instead, which completely ruins the comparison):

select 'A'::char(2) ='A '::varchar(10) => T
select 'A'::char(2) ='A '::text        => F
select 'A '::char(2)='A '::text        => F


Don't use CHAR(n). Use VARCHAR(n) or TEXT.

7 comments:

  1. Posting from the type providers article (mean to be here):

    >> The VARCHAR comparisons use the literal values, and thus compare as not equal.
    >
    > As far as I understand it, this is wrong. The ANSI standard is that trailing spaces are *not* significant in comparison of VARCHAR values.


    Note that the query example is a bit more subtle: The "x=y" comparison is a "VARCHAR=CHAR" comparison, and thus ignores trailing spaces. The "y=z" comparison is a "CHAR=VARCHAR" comparison, and thus also ignores trailing spaces.
    But the "x=z" comparison is a "VARCHAR=VARCHAR" comparison, ans thus does _not_ ignore trailing spaces, which breaks transitivity. (As x=y, and y=z, but x is not = to z).

    Which is completely weird and counter-intuitive, but that is what the standard says. And if you add the fact the CHAR is basically broken in popular database systems like PostgreSQL (they have a lot of bugs there, for example ''::char(1) like ''::char(1) returns false), you see that CHAR(n) is not a very useful data type in practice.

    ReplyDelete
    Replies
    1. > comparison is a "VARCHAR=VARCHAR" comparison, ans thus does _not_ ignore trailing spaces

      No, that is the bit I was objecting to - I didn;t think that was true. For example this is culled from documentation about SQL/Serve:

      >> SQL Server follows the ANSI/ISO SQL-92 specification (Section 8.2, , General rules #3) on how to compare strings with spaces. The ANSI standard requires padding for the character strings used in comparisons so that their lengths match before comparing them. The padding directly affects the semantics of WHERE and HAVING clause predicates and other Transact-SQL string comparisons. For example, Transact-SQL considers the strings 'abc' and 'abc ' to be equivalent for most comparison operations.

      The only exception to this rule is the LIKE predicate. When the right side of a LIKE predicate expression features a value with a trailing space, SQL Server does not pad the two values to the same length before the comparison occurs. Because the purpose of the LIKE predicate, by definition, is to facilitate pattern searches rather than simple string equality tests, this does not violate the section of the ANSI SQL-92 specification mentioned earlier. <<

      Delete
    2. More info FWIW:

      If I recall, DB2, SQL/Server and MySQL follow the standard in treating trailing spaces as not significant, Oracle is different (legitimately as they call their type VARCHAR2 not VARCHAR) and Postgres is broken in the limited sense that it does not follow the standard in this respect and of course can't be changed now.

      Delete
    3. I have just re-read the SQL standard (Subclause 8.2). And it says:

      ""
      If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the
      pad character is chosen based on CS. If CS has the NO PAD characteristic, then the pad character is an implementation-dependent character different from any character in the character set of X and Y that collates less than any string under CS. Otherwise, the pad character is a .
      ""

      The only problem is that it is implementation defined if we have to use NO PAD here or not.
      Clearly, CHAR has to be padded with spaces, at least up to the declared length. If VARCHAR has to be padded, too, is implementation defined.

      I ran the following SQL query on different systems to see how they behave

      select case when cast('a' as varchar(10))=cast('a ' as varchar(10)) then 'T' else 'F' end from (table with one tuple)

      Result:

      Oracle: not equal
      SQL Server: equal
      PostgreSQL: not equal
      DB2: equal
      HyPer: not equal
      SQLIte: not equal
      MonetDB: not equal

      So there is no consensus between systems, some do ignore the trailing spaces, some do not.

      edit: blogger at my less and greater symbols

      Delete
    4. >Otherwise, the pad character is a .

      Blogger ate the the word "space" at the end of that sentence I think for anyone else reading :)


      Fair enough,

      So, in effect, the question becomes whether or not you are implictly using a collation sequence (CS) that has the NO_PAD characteristic or the PAD_SPACE characteristic. And this differs between databases.

      I don't know how an implicit CS is determined, I find one comment elsewhere referring to Subclause 9.13 but I don't have a copy of the standard.

      So, it would seem, if you want transitive equality across the character types you do need a collation sequence with the PAD_SPACE character..

      But I simply don't know the extent to which the various databases allow you to control the collation sequence applicable and/or create a collation sequence with the PAD_SPACE characteristic so you can resolve this problem when you want to. Maybe it's just a messy area.

      Delete
    5. I think it is messy. Way too many things in the SQL standard are implementation defined, which in practice means that different systems will implement it differently.

      By the way, if you want to look at the SQL standard (without paying CHF 198) you can look at a current draft version for free. Not identical to the published standard, of course, but for all practical purposes close enough. The comparison rules are on page 505.

      Delete
  2. Thanks for the link, much appreciated.

    ReplyDelete