Topic: Knowing the Limitations of the Shapefile format

gregory
Knowing the Limitations of the Shapefile format
« on: 07 November 2014, 17:04:55 »
Much is misunderstood about the limitations of the shapefile format.  These misunderstandings are usually based on a partial truth but the limitation is always stated for the wrong reason.

Data Size Limitations

It's commonly stated that the maximum shapefile and DBF file sizes are 2 GB.  This is true for a lot of readers, but it has nothing to do with the format and everything to do with a seek/tell limitation on the data stream.  The IMC uses a 64-bit reader, so this 2 GB restriction is not imposed.

If the limit isn't 2 GB on both, what is it?

The shapefile format uses a fixed-size reference file (.shx) that stores position and length information as signed 32-bit integers counting 16-bit words.  That gives a byte range of 0 to 4,294,967,294 (2^32 - 2) in steps of 2.  The shp and shx files also have a length variable in their header, which is stored the same way.  While you can continue to write out to a shapefile ad infinitum, keeping a reference to the location of the data requires that the data stay within reach of the reference variables, which means the last record needs to start, at the latest, at the maximum signed position.  Shapefiles do not require sequential records or specify any restrictions on gaps in the data, so parsing the file and constructing a 64-bit record table isn't a possibility unless you know the records are sequential and free of gaps.
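A minimal sketch of how a reader might turn .shx entries into byte positions, assuming the layout from the published shapefile specification (100-byte header, file length at byte 24, 8-byte index entries stored big-endian).  build_sample_shx is a hypothetical helper that fabricates a tiny index in memory so the example runs without a file; it is not how the IMC does it.

```python
import struct

HEADER_SIZE = 100  # bytes; the .shp and .shx files share the same header layout


def read_shx_index(shx_bytes):
    """Return (byte_offset, byte_length) pairs for each entry in a .shx file."""
    # The file length sits at byte 24 as a big-endian signed 32-bit integer,
    # counted in 16-bit words, so it is doubled to get bytes.
    (file_length_words,) = struct.unpack_from(">i", shx_bytes, 24)
    file_length_bytes = file_length_words * 2

    entries = []
    for pos in range(HEADER_SIZE, file_length_bytes, 8):
        # Each entry holds the record offset and content length, both in words.
        offset_words, length_words = struct.unpack_from(">ii", shx_bytes, pos)
        entries.append((offset_words * 2, length_words * 2))
    return entries


def build_sample_shx(records):
    """Hypothetical helper: build a minimal in-memory .shx for demonstration."""
    header = bytearray(HEADER_SIZE)
    struct.pack_into(">i", header, 0, 9994)  # file code
    total_bytes = HEADER_SIZE + 8 * len(records)
    struct.pack_into(">i", header, 24, total_bytes // 2)  # length in words
    struct.pack_into("<i", header, 28, 1000)  # version
    body = b"".join(struct.pack(">ii", off // 2, ln // 2) for off, ln in records)
    return bytes(header) + body


# Two point records: 8-byte record header plus 20 bytes of content each.
sample = build_sample_shx([(100, 20), (128, 20)])
index = read_shx_index(sample)

# Largest byte offset a signed 32-bit word count can express.
MAX_SIGNED_OFFSET_BYTES = (2**31 - 1) * 2
```

Because every value is a word count, every byte offset is even, which is where the "steps of 2" comes from.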

While, technically, the variable is signed, nothing is ever encoded as a negative value, which means that the variable can be interpreted as unsigned while still maintaining reference positions.  This allows for an effective data size of 0 to 8,589,934,590 (2^33 - 2) bytes.  This is a deviation from the format that the IMC uses to get the most from the format.
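The difference between the two interpretations can be shown on a single 4-byte value.  This is only an illustration using Python's struct module, not the IMC's code: the same bytes are decoded once as signed and once as unsigned, then doubled to convert words to bytes.

```python
import struct

# A word count with the high bit set: 2**31 words, one past the signed maximum.
raw = struct.pack(">I", 0x80000000)

(as_signed,) = struct.unpack(">i", raw)    # strict reading of the format
(as_unsigned,) = struct.unpack(">I", raw)  # the deviation described above

signed_byte_offset = as_signed * 2      # negative, so unusable as a position
unsigned_byte_offset = as_unsigned * 2  # a valid position past 4 GB

# Largest byte offset reachable with the unsigned reading.
MAX_UNSIGNED_OFFSET_BYTES = (2**32 - 1) * 2
```

A reader that follows the signed reading will see a negative offset for any record past the 4 GB mark, so files written this way are not portable to such readers.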

While concessions could've been made to improve the data size when the format was created, it was created at a time when the use of 8 GB of data was, generally, unfathomable.

The DBF file doesn't use positions; instead, it uses fixed-size records, which means that the file size is capped by the maximum record length times the maximum record count.  There's a negligible data size devoted to the header.  The maximum length of a field is 255 bytes.  There can be 255 fields in a record (plus 1 extra byte for the delete flag).  There can be 4,294,967,295 (2^32 - 1) records.  All in all, the maximum data size is around 279 TB, just under 2^48 bytes.
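The arithmetic behind that ceiling, using only the limits quoted above (the constant names are made up for this sketch):

```python
MAX_FIELD_LENGTH = 255     # bytes per field
MAX_FIELDS = 255           # fields per record
DELETE_FLAG_BYTES = 1      # one extra byte per record
MAX_RECORDS = 2**32 - 1    # record count limit

# Largest possible record, then the largest possible data area.
max_record_length = MAX_FIELD_LENGTH * MAX_FIELDS + DELETE_FLAG_BYTES
max_data_bytes = max_record_length * MAX_RECORDS

print(max_record_length)  # bytes per record
print(max_data_bytes)     # bytes of record data, header excluded
```

The product comes out a little below 2^48 bytes, and the record length still fits in a 16-bit value.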

The DBF field names are limited to 11 ASCII characters but, for historic reasons, the 11th byte is usually only used to null-terminate the string, effectively capping the size at 10 bytes in most applications.  The IMC allows the 11th character to be specified, while most GIS applications will only use 10 characters.
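A rough sketch of the two readings of the name slot, assuming the usual 32-byte field descriptor with the name in the first 11 bytes.  decode_field_name and its allow_11th flag are hypothetical names for illustration only.

```python
def decode_field_name(descriptor, allow_11th=False):
    """Decode the field name from a 32-byte DBF field descriptor.

    allow_11th=False mimics readers that treat byte 11 as a terminator;
    allow_11th=True keeps it as a possible name character.
    """
    name_bytes = descriptor[:11]
    if not allow_11th:
        name_bytes = name_bytes[:10]
    # Names shorter than the slot are padded with null bytes.
    name_bytes = name_bytes.split(b"\x00", 1)[0]
    return name_bytes.decode("ascii")


# Sample descriptor: 11-byte name, type 'N', 4 reserved bytes,
# length 10, decimal count 2, then 14 bytes of padding (32 bytes total).
long_name = b"ELEVATION_M" + b"N" + bytes(4) + bytes([10, 2]) + bytes(14)
short_name = b"NAME" + bytes(7) + b"C" + bytes(4) + bytes([50, 0]) + bytes(14)

ten_char = decode_field_name(long_name)
eleven_char = decode_field_name(long_name, allow_11th=True)
```

A name that uses all 11 bytes will be silently shortened by a 10-character reader, which can produce duplicate field names.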

In any case, since most readers understand these limitations as 2 GB, it's best to keep under these data sizes when working with GIS applications.

Encodings

The DBF format has limited support for codepages.  There are 2 variables that affect this. 

The LDID is located within the DBF file and in most cases takes precedence.  LDID value 0x57 (OEM ANSI) is often interpreted as Windows-1252.  The IMC treats this value as Windows-1252 because of inconsistencies when copying shapefiles between devices using different locales.  Note that most LDID values are not standardized by FoxPro (an application by Microsoft) and are added by 3rd parties.

When a valid LDID is not specified, the IMC looks to the CPG file and attempts to use that as the codepage for the DBF.

When neither is specified, the IMC assumes the file is in Windows-1252.

There is no official standard regarding the order of the LDID vs the CPG.  Checking the LDID first is the more common approach.
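The lookup order described here (LDID, then CPG, then Windows-1252) could be sketched as follows.  Assumptions: the LDID is the byte at offset 29 of the DBF header, and the LDID table is a deliberately partial list of a few widely cited values.  resolve_codec and codec_from_cpg are hypothetical names, not IMC functions.

```python
import codecs

# Partial LDID table (assumption: only a handful of common values are listed).
LDID_TO_CODEC = {
    0x01: "cp437",
    0x02: "cp850",
    0x03: "cp1252",
    0x57: "cp1252",
}
DEFAULT_CODEC = "cp1252"


def codec_from_cpg(cpg_text):
    """Return a Python codec name for the contents of a .cpg file, or None."""
    if not cpg_text:
        return None
    label = cpg_text.strip()
    if not label:
        return None
    # Try the label as written, then with a "cp" prefix for bare numbers.
    for candidate in (label, "cp" + label):
        try:
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return None


def resolve_codec(dbf_header, cpg_text=None):
    """Pick a codec: valid LDID first, then the CPG file, then the default."""
    ldid = dbf_header[29]
    if ldid in LDID_TO_CODEC:
        return LDID_TO_CODEC[ldid]
    return codec_from_cpg(cpg_text) or DEFAULT_CODEC


header = bytearray(32)
header[29] = 0x57
chosen = resolve_codec(header, cpg_text="UTF-8")  # the LDID wins over the CPG
```

Swapping the two checks inside resolve_codec gives the CPG-first behaviour that some other readers use.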

When using a multibyte encoding, the number of characters that fit is usually smaller than the field size in bytes.  For example, if you're storing Chinese characters in a UTF-8 representation, you will not be able to store 255 characters in a field.  Most Chinese characters take 3 bytes in UTF-8, which limits you to 85 characters, and characters that need 4 bytes bring the worst case down to 63 characters.
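A small sketch of fitting text into a byte budget without cutting a character in half.  truncate_utf8 is a made-up helper name; the 255-byte budget is the field limit quoted earlier.

```python
def truncate_utf8(text, max_bytes=255):
    """Shorten text so its UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing bytes may leave an incomplete sequence at the end;
    # errors="ignore" drops that partial character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


three_byte = "\u4e2d" * 255       # a common Chinese character, 3 bytes in UTF-8
four_byte = "\U00020000" * 255    # a CJK Extension B character, 4 bytes in UTF-8

fits_three = len(truncate_utf8(three_byte))
fits_four = len(truncate_utf8(four_byte))
```

The usable character count depends on the text itself, so a writer has to measure encoded bytes rather than characters.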

Null values

The use of null values in the DBF is not standardized, but the most common approach is filling the field with asterisks or spaces where the field expects a valid value.  Numeric null fields are sometimes written as E-316, which falls outside the normal range of doubles.  Character strings cannot be null, only empty.

DBF fields have expected formats.

Character fields are generally unregulated.
Date fields are usually exactly 8 bytes and in yyyyMMdd format.
Number fields have a separate length and precision (though the precision is usually ignored).  The maximum precision of a double is around 17 significant digits, so there's usually no reason to make these fields any larger than about 20 characters.
Logical fields are basically unregulated.  True values are "T"/"t"/"Y"/"y"/"1" and any other value is usually interpreted as false.
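The conventions above could be turned into a parser along these lines.  This is a sketch under the assumptions stated in the text (asterisk or space fill means null, yyyyMMdd dates, the listed true values); parse_field and is_null are hypothetical names, and real files may deviate.

```python
from datetime import date

TRUE_VALUES = {"T", "t", "Y", "y", "1"}


def is_null(raw):
    """Treat a field filled with spaces or asterisks as null."""
    text = raw.strip()
    return text == "" or set(text) == {"*"}


def parse_field(raw, field_type):
    """Parse a decoded field string; field_type is 'C', 'D', 'N' or 'L'."""
    if field_type == "C":
        # Character fields are never null, only empty after padding is removed.
        return raw.rstrip()
    if field_type == "L":
        # Anything outside the true set counts as false.
        return raw.strip() in TRUE_VALUES
    if is_null(raw):
        return None
    if field_type == "D":
        return date(int(raw[0:4]), int(raw[4:6]), int(raw[6:8]))
    if field_type == "N":
        return float(raw)
    raise ValueError("unsupported field type: " + field_type)


parsed_date = parse_field("20141107", "D")
parsed_number = parse_field("     12.50", "N")
parsed_null = parse_field("**********", "N")
```

Returning None for null is a choice made for this example; a caller could just as well substitute a sentinel value.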

Record Aggregation

The shapefile format specifies the geometry type of the data in 2 places: once in the header and once at the start of every record.

There is a restriction that the individual records must either match the header's type or be a null record (no mixing and matching).

This means that a point and a line geometry cannot be stored in the same shapefile.
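A sketch of checking that rule, assuming the published .shp layout (shape type at byte 32 of the header, 8-byte big-endian record headers, shape type as the first 4 bytes of each record's content) and assuming the records are packed one after another with no gaps.  The function and helper names are made up, and the sample polyline payload is dummy filler rather than a valid geometry.

```python
import struct

HEADER_SIZE = 100
NULL_SHAPE = 0


def find_mismatched_records(shp_bytes):
    """Return (header_type, [(record_number, record_type), ...]) for bad records."""
    (header_type,) = struct.unpack_from("<i", shp_bytes, 32)
    mismatches = []
    pos = HEADER_SIZE
    while pos + 12 <= len(shp_bytes):
        record_number, content_words = struct.unpack_from(">ii", shp_bytes, pos)
        if content_words < 2:
            break  # content too short to hold a shape type; stop scanning
        (record_type,) = struct.unpack_from("<i", shp_bytes, pos + 8)
        if record_type not in (NULL_SHAPE, header_type):
            mismatches.append((record_number, record_type))
        pos += 8 + content_words * 2
    return header_type, mismatches


def make_record(number, shape_type, payload=b""):
    """Hypothetical helper: one record with the given type and raw payload."""
    content = struct.pack("<i", shape_type) + payload
    return struct.pack(">ii", number, len(content) // 2) + content


header = bytearray(HEADER_SIZE)
struct.pack_into("<i", header, 32, 1)  # header declares type 1 (point)

sample = (
    bytes(header)
    + make_record(1, 1, struct.pack("<dd", 1.0, 2.0))  # point
    + make_record(2, 0)                                 # null record, allowed
    + make_record(3, 3, bytes(40))                      # type 3 (polyline), not allowed
)
result = find_mismatched_records(sample)
```

Only the third record is reported, since null records are permitted alongside any header type.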

Projection

Projections are specified in an accompanying file but, for simplicity, the projection files are ignored by the IMC, which assumes that the data is coming in with a fixed projection.

Conclusions

Working with shapefiles is full of caveats but it is the de facto standard so it's important to support it as best as possible.

When working with data, I opt to bypass it completely unless I need to do visualization or manual editing because of the limitations outlined here.

DannyGeysen
Re: Knowing the Limitations of the Shapefile format
« Reply #1 on: 09 November 2014, 21:12:08 »
Interesting information, Gregory, thank you.