Data Formats
This page explains the types of data Honeycomb supports so that you can successfully load your own data into the tool.
Honeycomb supports three main data formats:
- Comma-Separated Values files (`.csv`)
- Parquet files (`.parquet`)
- GeoJSON files (`.geojson`)
Considerations when choosing a data format
Most of the time, the data format you load into Honeycomb is determined by the data you already have. However, if you are transforming data or building a pipeline and can choose a file format, it helps to understand the differences between the formats.
| Format | Benefits | Drawbacks |
|---|---|---|
| CSV | Very common file format which can be exported by common tools like Excel and opened in a text editor. Better for small amounts of data. | No compression, so file sizes may become large. Takes longer to load. |
| Parquet | Optimized data storage format which is self-describing and can be compressed and read in segments. Best for large amounts of data. | More difficult to generate and view; you will likely need a tool like DuckDB to process it. |
| GeoJSON | Geospatial-specific version of the common JSON format. Useful for ensuring geographic data is formatted correctly. | No compression, so file sizes may become large. Hard to generate with traditional tabular data tools like Excel. |
Comma-Separated Values (.csv)
CSV files must contain a column with spatial information. This column must be one of the following types and must be named correctly so that Honeycomb can recognize it. Additional columns with numeric data will each be added to the map as a layer. Columns which contain text data will not be read.
| Data Type | Description | Allowed Column Names | Column Data Types | Example Values |
|---|---|---|---|---|
| H3 Index | Data which has already been aggregated to H3, for example in a data warehouse or Python script. Each row should contain a unique H3 cell. | `hexbin`, `h3`, `h3_index` | Either a string column with the hexadecimal H3 id, or an integer column with the equivalent integer. | `871f1d48affffff` or `608533319839121407` |
| Geo Object | Geospatial data with a column containing a Polygon or Point. Polygons will be interpolated to H3 using areal-weighted interpolation; Points will be converted to their corresponding H3 resolution 10 index. | `geo`, `geom` | A string column which contains either a single GeoJSON object or a WKT-encoded spatial object. | `{"type": "Feature", "properties": {}, "geometry": {"type": "Point", "coordinates": [13.350108460430476, 52.51451793635954]}, "id": 1}` or `POINT (13.350108460430476 52.51451793635954)` |
| Latitude/Longitude Columns | Points will be converted to their corresponding H3 resolution 10 index. | `lat`, `lon`, `lng`, `latitude`, `longitude` | Two numeric columns containing latitude and longitude using the WGS 84 CRS. | `13.350108460430476` and `52.51451793635954` |
TIP
The coordinates of all spatial data must use the WGS 84 (EPSG:4326) coordinate system.
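As a minimal sketch using only Python's standard library, here is one way to produce a Honeycomb-ready CSV with latitude/longitude columns. The `population` column name is illustrative and stands in for any numeric metric column of your own:

```python
# Sketch: write a CSV with recognized "lat"/"lon" columns plus one metric column.
# "population" is a made-up metric name; any numeric column becomes a map layer.
import csv

rows = [
    {"lat": 52.51451793635954, "lon": 13.350108460430476, "population": 1200},
    {"lat": 52.5200, "lon": 13.4050, "population": 980},
]

with open("points.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["lat", "lon", "population"])
    writer.writeheader()
    writer.writerows(rows)
```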
Parquet (.parquet)
Like CSV files, Parquet files must contain a column with spatial information. This column must be one of the following types and must be named correctly so that Honeycomb can recognize it. Additional columns with numeric data will each be added to the map as a layer. Columns which contain text data will not be read.
| Data Type | Description | Allowed Column Names | Column Data Types | Example Values |
|---|---|---|---|---|
| H3 Index | Data which has already been aggregated to H3, for example in a data warehouse or Python script. Each row should contain a unique H3 cell. | `hexbin`, `h3`, `h3_index` | Either a string column with the hexadecimal H3 id, or an integer column with the equivalent integer. | `871f1d48affffff` or `608533319839121407` |
| Geo Object | Geospatial data with a column containing a Polygon or Point. Polygons will be interpolated to H3 using areal-weighted interpolation; Points will be converted to their corresponding H3 resolution 10 index. | `geo`, `geom` | A string column which contains either a single GeoJSON object or a WKT-encoded spatial object. | `{"type": "Feature", "properties": {}, "geometry": {"type": "Point", "coordinates": [13.350108460430476, 52.51451793635954]}, "id": 1}` or `POINT (13.350108460430476 52.51451793635954)` |
| Latitude/Longitude Columns | Points will be converted to their corresponding H3 resolution 10 index. | `lat`, `lon`, `lng`, `latitude`, `longitude` | Two numeric columns containing latitude and longitude using the WGS 84 CRS. | `13.350108460430476` and `52.51451793635954` |
Metric columns
Additional columns which contain numbers will each be added to the map as a layer. These columns should contain a number representing some type of property or observation which relates to the spatial column.
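Below is a minimal sketch, assuming the `pyarrow` and `h3` (v4 API) Python packages, of writing a Parquet file with an integer H3 column and one metric column. The `avg_speed` column name is illustrative:

```python
# Sketch: build a Parquet file with an integer H3 column plus a metric column.
# Assumes the h3 (v4 API) and pyarrow packages; "avg_speed" is a made-up metric.
import h3
import pyarrow as pa
import pyarrow.parquet as pq

points = [(52.5145, 13.3501, 42.0), (52.5200, 13.4050, 38.5)]  # (lat, lon, metric)

# Convert each lat/lon pair to its H3 resolution 10 cell, stored as an integer.
h3_ids = [h3.str_to_int(h3.latlng_to_cell(lat, lon, 10)) for lat, lon, _ in points]

table = pa.table({
    "h3": pa.array(h3_ids, type=pa.uint64()),
    "avg_speed": pa.array([m for _, _, m in points], type=pa.float64()),
})
pq.write_table(table, "metrics.parquet")
```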
Optimizing Parquet files for remote range queries
Parquet files differ from CSV files in an important way: within the file, the dataset is broken into chunks called row groups, and the file also contains metadata which describes how the data is distributed across those row groups.
Using this metadata, Honeycomb can read only the parts of the file which are needed for the current map view. This is especially important when querying remote files: Honeycomb only needs to download a small part of the remote file to render a map, which greatly improves performance. It allows Honeycomb to efficiently query planet-level data contained in a single large file in cloud object storage.
To leverage this functionality, it's critical that you format your data in the correct way. Otherwise, Honeycomb will fall back to reading the entire data file, which will drastically reduce performance. Here is how to format your Parquet files for remote range queries:
- Your Parquet file must contain a column named `hexbin`, `h3`, or `h3_index` which contains the integer representation of an H3 index.
- The Parquet file must be sorted in ascending order by this H3 column.
- The server hosting your file must support HTTP Range Requests. AWS S3, GCP Cloud Storage, Azure Blob Storage, and most CDNs support this.
- The recommended row group size is `122880`.
- The recommended compression scheme is `ZSTD`.
TIP
DuckDB is a great option for sorting and exporting data in Parquet format. You can use the `ROW_GROUP_SIZE` and `COMPRESSION` parameters with the `COPY` function to optimize the exported file.
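For example, here is a minimal sketch using DuckDB's Python package; the `metrics.parquet` input and `metrics_sorted.parquet` output file names are assumptions:

```python
# Sketch: sort by the integer H3 column and export with the recommended
# row group size and ZSTD compression so Honeycomb can range-query the file.
import duckdb

duckdb.execute("""
    COPY (SELECT * FROM 'metrics.parquet' ORDER BY h3 ASC)
    TO 'metrics_sorted.parquet'
    (FORMAT PARQUET, ROW_GROUP_SIZE 122880, COMPRESSION ZSTD);
""")
```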
GeoJSON (.geojson)
GeoJSON, as a geospatial file format, does not require any special formatting to correctly include spatial information. Any numeric data included in each feature's `properties` member will be imported as data layers.
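As a minimal sketch using Python's `json` module, here is a Honeycomb-ready GeoJSON file; the `tree_count` property is illustrative and stands in for any numeric property you want imported as a layer:

```python
# Sketch: write a GeoJSON FeatureCollection whose numeric properties become
# data layers. Note GeoJSON coordinates are ordered [longitude, latitude].
import json

feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point",
                         "coordinates": [13.350108460430476, 52.51451793635954]},
            "properties": {"tree_count": 17},  # made-up metric name
        }
    ],
}

with open("trees.geojson", "w") as f:
    json.dump(feature_collection, f)
```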
Importing bulk places with GeoJSON
In addition to importing GeoJSON data as data layers, you can drag a GeoJSON file onto the 'Places' sidebar on the right to add each GeoJSON feature as a place on the map. This is helpful for importing multiple pre-defined analysis areas, such as neighborhood outlines or areas of interest.