
Data Formats

This page explains the types of data Honeycomb supports so that you can successfully load your own data into the tool.

Honeycomb supports three main data formats:

  • Comma-Separated Values (.csv)
  • Parquet (.parquet)
  • GeoJSON (.geojson)

Considerations when choosing a data format

In most cases, the data format you load into Honeycomb is determined by the data you already have. However, if you are transforming data or building a pipeline and can choose the file format, it helps to understand the differences between the formats.

| Format | Benefits | Drawbacks |
| --- | --- | --- |
| CSV | Very common file format which can be exported by common tools like Excel and opened in a text editor. Better for small amounts of data. | No compression, so file sizes may become large. Takes longer to load. |
| Parquet | Optimized data storage format which is self-describing and can be compressed and read in segments. Best for large amounts of data. | More difficult to generate and view; you will likely need a tool like DuckDB to process it. |
| GeoJSON | Geospatial-specific version of the common JSON format. Useful for ensuring geographic data is formatted correctly. | No compression, so file sizes may become large. Hard to generate with traditional tabular data tools like Excel. |

Comma-Separated Values (.csv)

CSV files must contain a column with spatial information. This column must be one of the following types and must use one of the allowed names so that Honeycomb can recognize it. Additional columns with numeric data will each be added to the map as a layer. Columns which contain text data will not be read.

| Data Type | Description | Allowed Column Names | Column Data Types | Example values |
| --- | --- | --- | --- | --- |
| H3 Index | Data which has already been aggregated to H3, for example in a data warehouse or Python script. Each row should contain a unique H3 cell. | hexbin, h3, h3_index | Either a string column with the hexadecimal H3 id, or an integer column with the equivalent integer. | 871f1d48affffff OR 608533319839121407 |
| Geo Object | Geospatial data with a column containing a Polygon or Point. Polygons are interpolated to H3 using areal weighted interpolation. Points are converted to their corresponding H3 res 10 index. | geo, geom | A string column which contains either a single GeoJSON object or a WKT-encoded spatial object. | { "type": "Feature", "properties": {}, "geometry": { "coordinates": [ 13.350108460430476, 52.51451793635954 ], "type": "Point" }, "id": 1 } OR POINT (13.350108460430476 52.51451793635954) |
| Latitude/Longitude Columns | Points are converted to their corresponding H3 res 10 index. | lat, lon, lng, latitude, longitude | Two numeric columns containing latitude and longitude using the WGS 84 CRS. | 52.51451793635954 (latitude) AND 13.350108460430476 (longitude) |

TIP

The coordinates of all spatial data must use the WGS 84 (EPSG:4326) coordinate system.
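
As a quick illustration, here is a minimal sketch that writes a Honeycomb-compatible CSV using Python's standard library, following the latitude/longitude convention above. The file name, coordinates, and the visit_count metric column are hypothetical examples, not values from this documentation.

```python
import csv

# Hypothetical point observations: (latitude, longitude, metric value).
# Coordinates must use WGS 84 (EPSG:4326), as noted in the tip above.
rows = [
    (52.51451793635954, 13.350108460430476, 42.0),
    (52.52000659929242, 13.404953984129255, 17.5),
]

# "lat" and "lon" are among the recognized spatial column names;
# "visit_count" is an arbitrary numeric column that would become a map layer.
with open("observations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["lat", "lon", "visit_count"])
    writer.writerows(rows)
```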

Parquet (.parquet)

Similar to CSV files, Parquet files must contain a column with spatial information. This column must be one of the following types and must use one of the allowed names so that Honeycomb can recognize it. Additional columns with numeric data will each be added to the map as a layer. Columns which contain text data will not be read.

| Data Type | Description | Allowed Column Names | Column Data Types | Example values |
| --- | --- | --- | --- | --- |
| H3 Index | Data which has already been aggregated to H3, for example in a data warehouse or Python script. Each row should contain a unique H3 cell. | hexbin, h3, h3_index | Either a string column with the hexadecimal H3 id, or an integer column with the equivalent integer. | 871f1d48affffff OR 608533319839121407 |
| Geo Object | Geospatial data with a column containing a Polygon or Point. Polygons are interpolated to H3 using areal weighted interpolation. Points are converted to their corresponding H3 res 10 index. | geo, geom | A string column which contains either a single GeoJSON object or a WKT-encoded spatial object. | { "type": "Feature", "properties": {}, "geometry": { "coordinates": [ 13.350108460430476, 52.51451793635954 ], "type": "Point" }, "id": 1 } OR POINT (13.350108460430476 52.51451793635954) |
| Latitude/Longitude Columns | Points are converted to their corresponding H3 res 10 index. | lat, lon, lng, latitude, longitude | Two numeric columns containing latitude and longitude using the WGS 84 CRS. | 52.51451793635954 (latitude) AND 13.350108460430476 (longitude) |

Metric columns

Additional columns which contain numbers will each be added to the map as a layer. Each of these columns should contain a number representing a property or observation that relates to the spatial column.

Optimizing parquet files for remote range queries

Parquet files are structured differently from CSV files: within the file, the dataset is broken into chunks called row groups, and the file also contains metadata describing how the data is distributed across those row groups.

Using this metadata, Honeycomb can read only the parts of the file that are needed for the current map view. This is especially important when querying remote files: Honeycomb only needs to download a small portion of the remote file to render a map, which greatly improves performance and makes it possible to efficiently query planet-scale data stored in a single large file in cloud object storage.
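
To see this metadata for yourself, one option (assuming you have pyarrow installed and a local copy of the file, which is named hypothetically here) is to inspect the per-row-group statistics. When the file is sorted by its H3 column, each row group covers a narrow, non-overlapping range of H3 values, which is what lets a reader skip most of the file.

```python
import pyarrow.parquet as pq

# "cells.parquet" is a hypothetical file with an integer "h3" column.
pf = pq.ParquetFile("cells.parquet")
meta = pf.metadata
print(f"{meta.num_rows} rows in {meta.num_row_groups} row groups")

# Position of the h3 column in the schema.
h3_idx = pf.schema_arrow.get_field_index("h3")

# With the file sorted by h3, these min/max ranges should be narrow and
# non-overlapping, so a reader can skip row groups whose range does not
# intersect the cells needed for the current map view.
for i in range(meta.num_row_groups):
    stats = meta.row_group(i).column(h3_idx).statistics
    print(f"row group {i}: h3 min={stats.min} max={stats.max}")
```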

To leverage this functionality, it's critical that you format your data in the correct way. Otherwise, Honeycomb will fall back to reading the entire data file, which will drastically reduce performance. Here is how to format your Parquet files for remote range queries:

  • Your Parquet file must contain a column named hexbin, h3, or h3_index which contains the integer representation of an H3 index.
  • The Parquet file must be sorted in ascending order by this H3 column.
  • The server hosting your file must support HTTP Range Requests. AWS S3, Google Cloud Storage, Azure Blob Storage, and most CDNs support this.
  • The recommended row group size is 122880 rows.
  • The recommended compression scheme is ZSTD.

TIP

DuckDB is a great option for sorting data and exporting it in Parquet format. You can use the ROW_GROUP_SIZE and COMPRESSION parameters of the COPY statement to optimize the exported file.
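
For example, here is a minimal sketch using DuckDB's Python client. The input and output file names are hypothetical, and the sketch assumes the source data already has an integer h3 column; it sorts by that column and exports with the recommended row group size and compression.

```python
import duckdb

# Hypothetical input: a Parquet file with an integer "h3" column
# plus one or more numeric metric columns.
con = duckdb.connect()
con.execute("""
    COPY (
        SELECT *
        FROM read_parquet('raw_cells.parquet')
        ORDER BY h3  -- ascending sort by the H3 column
    )
    TO 'cells_optimized.parquet'
    (FORMAT PARQUET, ROW_GROUP_SIZE 122880, COMPRESSION ZSTD);
""")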

GeoJSON (.geojson)

Because GeoJSON is itself a geospatial file format, it does not require any special formatting to include spatial information. Any numeric data included in each feature's 'properties' object will be imported as data layers.
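
For instance, here is a minimal sketch of a GeoJSON FeatureCollection written with Python's standard library. The file name and the numeric visit_count property are hypothetical; the numeric property is the kind of value that would be imported as a data layer.

```python
import json

# One point feature with a hypothetical numeric property.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"visit_count": 42.0},
            "geometry": {
                "type": "Point",
                "coordinates": [13.350108460430476, 52.51451793635954],
            },
        }
    ],
}

with open("places.geojson", "w") as f:
    json.dump(feature_collection, f)
```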

Importing bulk places with GeoJSON

In addition to importing GeoJSON data as data layers, you can drag a GeoJSON file onto the 'Places' sidebar on the right to add each GeoJSON feature as a place on the map. This is helpful for importing multiple pre-defined analysis areas, such as neighborhood outlines or other areas of interest.