Introduction
At Waterplan, our goal is to help manage water resources efficiently and sustainably. To do this effectively, we rely on detailed geospatial data such as aquifers, river basins, and hydrological boundaries. However, handling these datasets, particularly in the form of shapefiles, brings significant technical challenges due to their size and complexity.
In this blog post, we'll explore how we tackled these challenges by integrating shapefiles with DynamoDB, enabling efficient indexing, storage, and rapid retrieval of geospatial data.
Technical Challenges
Working with shapefiles poses several challenges.
Point-in-polygon queries, a common operation in geospatial applications, require specialized spatial indexing techniques and optimized search algorithms to quickly identify the containing polygon; this is especially critical for real-time or near-real-time applications.
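To make that concrete, here is a minimal sketch of a point-in-polygon lookup backed by a spatial index, assuming Shapely 2.x; the polygons and query point are placeholders, not our actual data.

```python
from shapely import STRtree
from shapely.geometry import Point, box

# Stand-in "basins"; in practice these come from a shapefile.
polygons = [box(0, 0, 10, 10), box(10, 0, 20, 10), box(0, 10, 10, 20)]

# An STR-packed R-tree prunes the search to polygons whose bounding
# boxes could plausibly contain the query point.
tree = STRtree(polygons)

point = Point(12.5, 3.7)

# predicate="within" keeps only the polygons the point actually falls inside.
matches = tree.query(point, predicate="within")
containing = [polygons[i] for i in matches]  # -> the second box in this toy example
```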
Storing large polygons directly in databases with item size restrictions, like DynamoDB, can be problematic. Strategies like polygon simplification, compression, or partitioning may be necessary to fit the data within the storage constraints.
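As an illustration of the first of those strategies, here is a small sketch of polygon simplification using Shapely's Douglas-Peucker-based simplify; the geometry and tolerance are illustrative only.

```python
from shapely.geometry import Point

# A dense stand-in geometry: a circle approximated with ~1,000 vertices.
dense = Point(0, 0).buffer(1.0, 256)

# Drop vertices that deviate from the outline by less than `tolerance`
# (in the geometry's units) while keeping the polygon valid.
simplified = dense.simplify(tolerance=0.01, preserve_topology=True)

print(len(dense.wkb), "->", len(simplified.wkb), "bytes of WKB")
```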
Furthermore, shapefiles store spatial and attribute data separately, which can lead to data redundancy and potential inconsistencies. Maintaining data integrity and synchronization between these components can be challenging.
In addition, shapefiles may use different coordinate systems and projections, which can cause misalignment and incorrect spatial relationships when data from multiple sources is combined.
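For example, with GeoPandas it is straightforward to normalize every source to a single CRS on ingest; the file path and target EPSG code below are placeholders.

```python
import geopandas as gpd

# GeoPandas reads the CRS from the shapefile's .prj sidecar when present.
gdf = gpd.read_file("aquifers.shp")  # placeholder path

if gdf.crs is None:
    raise ValueError("Shapefile has no CRS; assign one before reprojecting")

# Reproject everything to WGS84 lon/lat so sources line up before indexing.
gdf = gdf.to_crs(epsg=4326)
```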
Another challenge is that shapefiles do not inherently enforce topological relationships between features. Ensuring data quality and topological consistency often requires additional validation and cleaning steps.
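One possible validation pass, sketched with Shapely (1.8+ for make_valid); the bowtie polygon is a contrived example of a common digitization error.

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity, make_valid

# A self-intersecting "bowtie" ring, a classic invalid geometry.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])

if not bowtie.is_valid:
    print(explain_validity(bowtie))  # e.g. "Self-intersection at or near ..."
    bowtie = make_valid(bowtie)      # returns a valid (multi)polygon
```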
Finally, the shapefile format is not always well-suited for modern data processing frameworks and cloud-based environments. Conversion to other formats or the use of specialized geospatial databases may be necessary for optimal performance and scalability.
Our Approach
To address these challenges, we adopted a strategy involving geohashes and DynamoDB.
Step 1: Calculate the geohashes
The first step was to calculate the geohashes that intersected the polygon we wanted to index. If you're new to geohashes, here's a website that can help you understand how they work. The map below shows the result of obtaining the level 3 geohashes for a polygon:

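Here is a minimal sketch of how such a covering can be computed. The post used GeoPandas for its geospatial operations; this sketch instead uses plain Shapely plus the pygeohash package (an assumption on my part) and approximates each geohash cell by its decoded bounding box. The example polygon is illustrative.

```python
import pygeohash as pgh
from shapely.geometry import Polygon, box

def covering_geohashes(polygon, precision=3):
    """Return the set of geohashes at `precision` whose cells intersect `polygon`."""
    minx, miny, maxx, maxy = polygon.bounds  # lon/lat bounds

    # Geohash cells at a fixed precision all have the same angular size;
    # decode one hash to learn that size.
    _, _, lat_err, lon_err = pgh.decode_exactly(pgh.encode(miny, minx, precision))
    cell_h, cell_w = 2 * lat_err, 2 * lon_err

    hashes = set()
    lat = miny
    while lat <= maxy + cell_h:          # overshoot by one row to cover the edge
        lon = minx
        while lon <= maxx + cell_w:      # overshoot by one column as well
            gh = pgh.encode(lat, lon, precision)
            lat_c, lon_c, dlat, dlon = pgh.decode_exactly(gh)
            cell = box(lon_c - dlon, lat_c - dlat, lon_c + dlon, lat_c + dlat)
            if cell.intersects(polygon):
                hashes.add(gh)
            lon += cell_w
        lat += cell_h
    return hashes

# Illustrative polygon (roughly a rectangle in the central US).
poly = Polygon([(-100, 35), (-95, 35), (-95, 40), (-100, 40)])
print(sorted(covering_geohashes(poly)))
```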
Step 2: Indexing in DynamoDB
Once the intersecting geohashes were obtained, we grouped all polygons by geohash, which produced a list of polygons for each geohash.
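Continuing the sketch from Step 1 (same assumptions), the grouping is essentially an inverted index from geohash to polygons:

```python
from collections import defaultdict

def group_by_geohash(polygons, precision=3):
    """Map each level-`precision` geohash to the polygons that intersect it."""
    groups = defaultdict(list)
    for poly in polygons:
        for gh in covering_geohashes(poly, precision):  # helper sketched above
            groups[gh].append(poly)
    return groups
```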
In order to reduce the size of the item to be stored in DynamoDB, we took a few extra steps:
For each polygon that falls into a geohash, we keep only the intersection between the polygon and the geohash cell. For example, in the image below there is no need to store the entire polygon that spans two US states; we only need to keep the intersection:

Once each polygon had been reduced to its intersection with the geohash, we compressed the entire list of polygons to minimize its size and keep it under the 400 KB that DynamoDB allows per item.
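Putting those two steps together, here is a hedged sketch of the write path: clip each polygon to its geohash cell, serialize the clipped list as GeoJSON, gzip it, and only write the item if it fits under the 400 KB limit. The table name, attribute names, and helpers are illustrative, not our production code; a more compact encoding such as WKB would shrink the payload further.

```python
import gzip
import json

import boto3
import pygeohash as pgh
from shapely.geometry import box, mapping

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("geohash-polygons")  # placeholder table name

DYNAMO_ITEM_LIMIT = 400 * 1024  # DynamoDB's 400 KB per-item limit

def geohash_cell(gh):
    """Bounding box of a geohash cell as a Shapely polygon."""
    lat, lon, dlat, dlon = pgh.decode_exactly(gh)
    return box(lon - dlon, lat - dlat, lon + dlon, lat + dlat)

def store_group(gh, polygons):
    cell = geohash_cell(gh)

    # Keep only the part of each polygon that falls inside this cell.
    clipped = [p.intersection(cell) for p in polygons]
    clipped = [p for p in clipped if not p.is_empty]

    # Serialize the clipped geometries and gzip the whole list.
    payload = gzip.compress(json.dumps([mapping(p) for p in clipped]).encode())
    if len(payload) > DYNAMO_ITEM_LIMIT:
        raise ValueError(f"{gh}: compressed payload still exceeds 400 KB")

    table.put_item(Item={"geohash": gh, "polygons": payload})
```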
Step 3: Building a scalable pipeline
In our system architecture, we leveraged the scalability and flexibility of Amazon S3 to dynamically store incoming shapefiles. To manage the processing of new shapefile uploads, we implemented AWS Lambda functions, ensuring serverless and event-driven operation. These Lambda functions were designed to generate geohashes from the spatial data within the shapefiles, employing the GeoPandas library for efficient geospatial operations. Geohashes, serving as a spatial indexing system, were then grouped and organized based on their geographic proximity.
The generated geohashes played a crucial role in optimizing data storage and retrieval within DynamoDB. By using geohashes as partition keys, we achieved significant improvements in query performance, particularly for spatial lookups. This strategy ensured that data with similar spatial characteristics was co-located within the database, minimizing the amount of data that needed to be scanned for each query. The combination of S3 for storage, Lambda for processing, geohashes for spatial indexing, and DynamoDB for fast lookups resulted in a robust and scalable solution for managing and querying geospatial data.
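To make the read path concrete, here is a sketch of a lookup under the same assumptions (table name, attribute names, and precision are illustrative): hash the query point, fetch the single item for that geohash, decompress it, and run the exact point-in-polygon test against the short candidate list.

```python
import gzip
import json

import boto3
import pygeohash as pgh
from shapely.geometry import Point, shape

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("geohash-polygons")  # placeholder table name

def polygons_containing(lat, lon, precision=3):
    gh = pgh.encode(lat, lon, precision)

    # One key-value lookup per query: the geohash is the partition key.
    item = table.get_item(Key={"geohash": gh}).get("Item")
    if item is None:
        return []

    # boto3 wraps binary attributes in a Binary object; .value is the raw bytes.
    candidates = [shape(g) for g in json.loads(gzip.decompress(item["polygons"].value))]

    # Exact containment test on the handful of clipped candidates.
    point = Point(lon, lat)
    return [p for p in candidates if p.contains(point)]
```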
Lessons Learned
An important insight from this approach was realizing DynamoDB's strengths, even though it's not typically considered a geospatial database. For relatively static datasets like ours, DynamoDB's scalability and efficient key-value lookup capabilities were ideal, significantly reducing operational complexity and cost compared to alternatives like Postgres or Elasticsearch.
By compressing data and indexing with geohashes, we achieved rapid geospatial queries at a fraction of the cost and complexity of running a dedicated geospatial database.