EAV and DataTypes-Collection of common programming errors

We are going to use the EAV Pattern for a szenario, where we will have various different entities with very different attributes.

The “basic” EAV-Pattern is made out of 3 tables. As different attributes would have different DataTypes (date, long, boolean,….) im currently thinking about how to resolve this.

First way would be to store everything as a string. This requires “parsing” and having a number like ‘1’ would not show directly, if this is a double, boolean or anything. Attribute Values Table would look like

id|attribute_id|entity_id|value
1  2            3         17.0
2  4            2         Foobar

Second way would be to split up the different types into different columns, having the value columns nullable, like:

id|attribute_id|entity_id|value_string|value_long|value_float|value_date
1  2            3         NULL         NULL       17.0        NULL
2  4            2         Foobar       NULL       NULL        NULL

However this would produce a lot of NULL-Values which is basically the reason WHY the decision lead to the EAV Pattern (reducing NULL values in unused columns)

So, this leads to the 3rd possible solution, to create typed attribute tables:

attribute_values_string
id|attribute_id|entity_id|value
2  4            2         Foobar

attribute_values_float
id|attribute_id|entity_id|value
1  2            3         17.0

But, this would make queries more complex, as always n tables have to bee checked for the presence of an attribute_value. Also using it like this would lead to equal attribute_ids for different value types, if all are using their own auto_increment. So, having a value to be converted to another type could be somewhat tricky, as it’s id could not be maintained. Of course this could be avoided by adding another, non typed attribute_values table that serves the auto_increment value and maybe contains some type information and/or metadata. (Since we need to use versioning/revisioning we cant use autogenerated attribute_values_ids anyway, as two revisions still need to share the same id)

So, the first decision would be the layout. Does anybody have experiences with EAV in use? What are the bottlenecks of each attempt?

  1. I understand your point of view about trying to reduce Null values, however I believe the The 5 Normal forms can always sort out this issue. If you have customers that demand real-time or on-demand addition of attributes than I agree with the EAV model.

    Among many other issues I see:

    1. it is difficult to control the attribute names, and to make them consistent;
    2. it is difficult to enforce data type integrity and referential integrity;
    3. it is difficult (and slow) to pivot and/or self-join the values to make a single row;
    4. it is difficult to make certain attributes mandatory;

    I’ve been working with Magento and it requests a huge amount of cache and auxiliar tables to work ‘fine’, besides of course the server to keep it running has to be a robust server. Maybe your application is much smaller.

    Anyway, It’s just my opnion, you, dognose, can also check this other point of view and make your own conclusions

  2. Use the NULL approach. Else queries in general require more resources.

    SQL Server 2008 provides some sparse columns which can be of help.

    Also I can see that you missed the Instance_Id column for the Entity Entity has attributes, and then there are many instances of that entity. You need to link it with a GUID

  3. Fwiw, I’ve had decent experience with your first schema and a variation of your second schema. (A type column, a single value column, and partial indexes on the appropriate expressions, i.e. the casted data. The closest you’ll get using MySQL is your second schema: a column per type you care about.)

    If you plan to add indexes on second schema, don’t forget to distinguish between value_string (for short strings that you care to index, such as enums and the like) and value_text (for long pieces of text that should not get indexed).

    Long term, though, I’d actually advise the first option, with the usual warnings and caveats: the only reasonable long term use-case for an EAV table is if the structure is undefined and the actual data stored within it is cosmetic stuff. The operating words here are undefined and cosmetic. Any time you actually want or need to query against a piece of data within your EAV table, you should be asking yourself how you should modify your schema to accommodate a new column. Not doing so is a recipe for slow queries.

Originally posted 2013-11-09 19:41:18.