S3 and GCS Event Triggers for Shapefiles

Triggering a cloud function on a shapefile upload requires a completion gate that aggregates at least three object events before dispatching — fire on the first .shp event and you will hit OGRERR_NOT_ENOUGH_DATA before the .dbf has even arrived. The minimum reliable pattern uses a key-value store (DynamoDB, Firestore, or Azure Table Storage) to track observed file extensions per base filename, transitions to a validated state only when the required set is complete, and emits a single idempotency-keyed dispatch message downstream.

This page covers the exact trigger configuration, state aggregation logic, validation probe, and failure modes across AWS, GCP, and Azure — building on the broader Event-Driven Geospatial Processing Patterns that govern how cloud-native GIS pipelines compose.

Why This Constraint Matters for Geospatial Workloads

The ESRI Shapefile specification mandates a minimum of three companion files — .shp (geometry), .shx (shape index), and .dbf (attribute table) — with common additions including .prj (coordinate reference system), .cpg (encoding declaration), and .sbn/.sbx (spatial index). Cloud object storage emits events at the object level, not the dataset level. HTTP PUT requests for each component file arrive sequentially across separate TCP connections; the resulting event stream is unordered relative to dataset completeness.
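The component roles listed above can be written down as a small lookup. This is a minimal sketch; the names `REQUIRED`, `OPTIONAL`, and `missing_components` are illustrative and do not come from any library.

```python
from pathlib import PurePosixPath

# Component roles as described in the paragraph above
REQUIRED = {".shp", ".shx", ".dbf"}
OPTIONAL = {".prj", ".cpg", ".sbn", ".sbx"}


def missing_components(keys: list) -> set:
    """Return the required extensions not yet present among the given object keys."""
    seen = {PurePosixPath(key).suffix.lower() for key in keys}
    return REQUIRED - seen


# Only the geometry and the CRS file have arrived so far
still_missing = missing_components(["staging/parcels.shp", "staging/parcels.prj"])
print(sorted(still_missing))  # ['.dbf', '.shx']
```

Lower-casing the suffix means `.SHP` and `.shp` count as the same component, which matters when uploads come from case-insensitive desktop file systems.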

If a serverless function triggers on the .shp event before the .shx has landed, GDAL cannot open the dataset and pyogrio.read_info() raises an error (typically a DataSourceError) instead of returning metadata. That failure is at least loud. A missing .dbf is worse: GDAL opens the geometry without complaint, so fiona.open() returns a schema with zero fields and silently omits all attribute data. The second case is a silent data-quality failure rather than a hard crash, which makes it the most expensive class of bug to detect in a production GIS pipeline.

The atomicity mismatch also interacts with ephemeral storage limits in AWS Lambda: a function that fires prematurely, writes a partial dataset to /tmp (512 MB by default, configurable up to 10 GB), and then retries on the next event can exhaust ephemeral storage before any complete dataset is processed.

Platform-by-Platform Limits Table

Dimension	AWS	GCP	Azure
Event source	S3 → EventBridge or direct Lambda invocation	Cloud Storage → Pub/Sub → Cloud Functions (2nd gen)	Blob Storage → Event Grid → Azure Functions
Trigger latency (p50)	~200 ms	~300 ms	~400 ms
Function timeout	15 min (Lambda)	9 min for event-driven functions; 60 min for HTTP functions (Cloud Functions 2nd gen)	10 min max, 5 min default (Consumption plan)
Max function memory	10 GB	32 GB	1.5 GB (Consumption); 14 GB (Premium)
Ephemeral storage	512 MB default, configurable up to 10 GB (`/tmp`)	8 GB (in-memory + tmpfs)	500 MB (`D:\local\Temp`)
State backend options	DynamoDB (conditional writes, TTL)	Firestore (merge writes, TTL via scheduled delete)	Azure Table Storage (optimistic concurrency via ETag)
Delivery guarantee	At-least-once	At-least-once	At-least-once
Event filtering	S3 prefix and suffix filters (e.g. `staging/`, `.shp`)	Object name prefix filter only	Event Grid subject filter (begins with / ends with)
DLQ support	SQS DLQ or on-failure destination for asynchronous Lambda invocations	Pub/Sub dead-letter topic	Event Grid dead-letter to a Blob Storage container
Cost per 1 M trigger events	~$0.20 (Lambda requests; S3 notifications carry no separate charge)	~$0.40 (Cloud Functions invocations)	~$0.60 (Event Grid) + ~$0.20 (Functions)

Lambda has the lowest per-request price, and S3 supports both prefix and suffix filters. On GCP, the 60-minute timeout applies only to HTTP-triggered functions; event-driven functions are limited to 9 minutes, so long-running reprojection of very large shapefile sets has to be handed off to an HTTP-invoked function. Azure’s 500 MB ephemeral storage on the Consumption plan is a hard constraint: write intermediate datasets directly to Blob Storage rather than to the local temp directory. See ephemeral storage limits in AWS Lambda for the analogous constraint analysis on AWS.

Step-by-Step Implementation

Step 1 — Configure Event Notifications on the Staging Bucket

AWS (boto3):

python

# Equivalent boto3 setup for notification configuration
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

notification_config = {
    "LambdaFunctionConfigurations": [
        {
            # Placeholder ARN; AWS account IDs are 12 digits
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:shp-aggregator",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {
                "Key": {
                    "FilterRules": [
                        {"Name": "prefix", "Value": "staging/"},
                        # No suffix filter here — catch .shp, .shx, .dbf, .prj, .cpg
                    ]
                }
            },
        }
    ]
}

s3.put_bucket_notification_configuration(
    Bucket="raw-shp-ingest",
    NotificationConfiguration=notification_config,
)
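S3 validates the destination when the notification configuration is saved, so the function's resource policy has to allow S3 to invoke it first. The sketch below only builds the arguments for the Lambda `add_permission` call; the helper name `build_s3_invoke_permission`, the statement ID, and the account ID are placeholders chosen for illustration.

```python
def build_s3_invoke_permission(function_name: str, bucket: str, account_id: str) -> dict:
    """Keyword arguments for lambda_client.add_permission()."""
    return {
        "FunctionName": function_name,
        "StatementId": f"allow-s3-{bucket}",      # hypothetical statement ID
        "Action": "lambda:InvokeFunction",
        "Principal": "s3.amazonaws.com",
        "SourceArn": f"arn:aws:s3:::{bucket}",    # restrict to the staging bucket
        "SourceAccount": account_id,              # guard against bucket-name reuse
    }


params = build_s3_invoke_permission("shp-aggregator", "raw-shp-ingest", "123456789012")

# With credentials configured, apply it before saving the notification config:
#   import boto3
#   boto3.client("lambda").add_permission(**params)
```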

GCP (Python SDK):

python

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("gcs-shp-landing")

# topic_name takes the bare topic ID; the project goes in topic_project
notification = bucket.notification(
    topic_name="shp-events",
    topic_project="my-project",
    event_types=["OBJECT_FINALIZE"],
    payload_format="JSON_API_V1",
)
notification.create()
print(f"Notification created: {notification.notification_id}")
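On the consuming side, each Pub/Sub message from a Cloud Storage notification carries the event type, bucket, and object name as message attributes (`eventType`, `bucketId`, `objectId`). A minimal sketch of turning those attributes into the same dataset identity used on AWS follows; `parse_gcs_attributes` is a hypothetical helper name.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple


def parse_gcs_attributes(attributes: dict) -> Optional[Tuple[str, str, str]]:
    """Return (bucket, dataset_id, extension) for finalized objects, else None."""
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        return None
    path = PurePosixPath(attributes["objectId"])
    if not path.suffix:
        return None                              # folder markers, extensionless objects
    dataset_id = str(path.with_suffix(""))       # e.g. staging/parcels_2024
    return attributes["bucketId"], dataset_id, path.suffix.lower()


example = {
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "gcs-shp-landing",
    "objectId": "staging/parcels_2024.shp",
}
print(parse_gcs_attributes(example))
# ('gcs-shp-landing', 'staging/parcels_2024', '.shp')
```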

Step 2 — Aggregate Events by Base Filename (AWS Lambda + DynamoDB)

Parse the S3 event, extract the shapefile base name, and perform an unconditional upsert. The ADD files_seen :ext expression is idempotent: adding an already-present value to a string set is a no-op, so a duplicate delivery of the same event leaves the state unchanged.

python

import hashlib
import json
import os
import time
from pathlib import PurePosixPath
from urllib.parse import unquote_plus

import boto3

# Environment variables, set explicitly in the Lambda configuration
DYNAMO_TABLE = os.environ["DYNAMO_TABLE"]          # e.g. shp-completion-state
STATE_TTL_SECONDS = int(os.environ.get("STATE_TTL_SECONDS", "86400"))  # 24 h

dynamodb = boto3.resource("dynamodb", region_name=os.environ["AWS_REGION"])
table = dynamodb.Table(DYNAMO_TABLE)
sqs = boto3.client("sqs", region_name=os.environ["AWS_REGION"])

REQUIRED_EXTENSIONS = {".shp", ".shx", ".dbf"}


def lambda_handler(event, context):
    for record in event.get("Records", []):
        # S3 URL-encodes keys in event payloads, so decode before parsing
        key = unquote_plus(record["s3"]["object"]["key"])   # e.g. staging/parcels_2024.shp
        bucket = record["s3"]["bucket"]["name"]
        suffix = PurePosixPath(key).suffix.lower()
        base = PurePosixPath(key).stem              # parcels_2024
        prefix = str(PurePosixPath(key).parent)    # staging

        if not suffix:
            continue   # folder markers and extensionless objects are not components

        expiry = int(time.time()) + STATE_TTL_SECONDS

        # "bucket" and "ttl" are DynamoDB reserved words, so they need aliases
        response = table.update_item(
            Key={"base_name": f"{prefix}/{base}"},
            UpdateExpression=(
                "ADD files_seen :ext "
                "SET #bucket = :bucket, #ttl = :ttl, "
                "    upload_prefix = :prefix"
            ),
            ExpressionAttributeNames={"#bucket": "bucket", "#ttl": "ttl"},
            ExpressionAttributeValues={
                ":ext": {suffix},
                ":bucket": bucket,
                ":ttl": expiry,
                ":prefix": prefix,
            },
            ReturnValues="ALL_NEW",
        )

        item = response["Attributes"]
        observed = item.get("files_seen", set())

        if REQUIRED_EXTENSIONS.issubset(observed):
            _trigger_validation(bucket, prefix, base)

    return {"statusCode": 200}


def _trigger_validation(bucket: str, prefix: str, base: str) -> None:
    """Publish a validation task; the downstream function checks idempotency."""
    # Hash the dataset path: deduplication IDs are limited to 128 characters
    # and cannot contain spaces
    dedup_id = hashlib.sha256(f"{bucket}/{prefix}/{base}".encode()).hexdigest()
    sqs.send_message(
        QueueUrl=os.environ["VALIDATION_QUEUE_URL"],
        MessageBody=json.dumps({
            "bucket": bucket,
            "prefix": prefix,
            "base": base,
        }),
        MessageDeduplicationId=dedup_id,              # FIFO queue dedup
        MessageGroupId="shapefile-validation",
    )
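One detail the handler above leaves to the FIFO deduplication window: once the trio is complete, every later event for the same base name (a late .prj, a redelivered .shp) passes the subset check again. The sketch below shows transition detection, where the gate reports completion only for the event that completes the set. It uses an in-memory dict as a stand-in for the DynamoDB item, and `CompletionGate` is a hypothetical name, so treat it as an illustration of the logic rather than a drop-in replacement.

```python
REQUIRED = frozenset({".shp", ".shx", ".dbf"})


class CompletionGate:
    """In-memory stand-in for the per-dataset state item."""

    def __init__(self):
        self._seen = {}

    def observe(self, base_name: str, ext: str) -> bool:
        """Record one event; return True only when this event completes the set."""
        seen = self._seen.setdefault(base_name, set())
        was_complete = REQUIRED <= seen
        seen.add(ext)                      # adding to a set is idempotent, like ADD
        return not was_complete and REQUIRED <= seen


gate = CompletionGate()
# Duplicates and out-of-order arrivals, as at-least-once delivery can produce
events = [".shp", ".shp", ".dbf", ".prj", ".shx", ".shx", ".cpg"]
fired = [gate.observe("staging/parcels_2024", ext) for ext in events]
print(fired)  # [False, False, False, False, True, False, False]
```

Against DynamoDB, a comparable check could compare the set before and after the update (for example by requesting the old item alongside the write), at the cost of slightly more handler logic.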

Step 3 — Validate the Complete Dataset

The validation function uses pyogrio for a lightweight schema probe that avoids loading geometry into memory. It confirms geometry type, CRS readability, and feature count before marking the record READY.

python

import hashlib
import json
import os
import time

# Set GDAL paths explicitly and before importing pyogrio; never assume the
# execution environment provides them. LD_LIBRARY_PATH belongs in the Lambda
# configuration instead, because the loader reads it only at process start.
os.environ["GDAL_DATA"] = "/opt/python/share/gdal"
os.environ["PROJ_LIB"] = "/opt/python/share/proj"

import boto3
import pyogrio
from botocore.exceptions import ClientError

S3_CLIENT = boto3.client("s3")
SQS_CLIENT = boto3.client("sqs", region_name=os.environ["AWS_REGION"])
DYNAMO = boto3.resource("dynamodb").Table(os.environ["DYNAMO_TABLE"])
DISPATCH_QUEUE = os.environ["DISPATCH_QUEUE_URL"]

REQUIRED = {".shp", ".shx", ".dbf"}


def lambda_handler(event, context):
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        bucket = body["bucket"]
        prefix = body["prefix"]
        base = body["base"]

        local_dir = f"/tmp/{base}"
        os.makedirs(local_dir, exist_ok=True)

        # Download the required trio plus the optional .prj and .cpg
        for ext in [".shp", ".shx", ".dbf", ".prj", ".cpg"]:
            remote_key = f"{prefix}/{base}{ext}"
            local_path = f"{local_dir}/{base}{ext}"
            try:
                S3_CLIENT.download_file(bucket, remote_key, local_path)
            except ClientError as exc:
                # download_file reports a missing object as a ClientError (code 404)
                if exc.response["Error"]["Code"] not in ("404", "NoSuchKey"):
                    raise
                if ext in REQUIRED:
                    raise RuntimeError(f"Required file missing: {remote_key}") from exc

        # Lightweight schema probe: no geometry is loaded into memory
        local_shp = f"{local_dir}/{base}.shp"
        info = pyogrio.read_info(local_shp)
        feature_count = int(info["features"])
        crs = info.get("crs") or "UNKNOWN"   # a missing CRS is reported as None

        if feature_count == 0:
            raise ValueError(f"Shapefile {base} contains zero features after validation")

        # Idempotency key: SHA-256 of the dataset path plus the current hour bucket,
        # so a re-upload in a later hour produces a new key
        idempotency_key = hashlib.sha256(
            f"{bucket}/{prefix}/{base}:{int(time.time() // 3600)}".encode()
        ).hexdigest()

        # Update state to READY
        DYNAMO.update_item(
            Key={"base_name": f"{prefix}/{base}"},
            UpdateExpression="SET #s = :ready, geometry_type = :gt, feature_count = :fc, crs = :crs",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":ready": "READY",
                ":gt": info["geometry_type"],
                ":fc": feature_count,
                ":crs": crs,
            },
        )

        # Dispatch to downstream processors
        SQS_CLIENT.send_message(
            QueueUrl=DISPATCH_QUEUE,
            MessageBody=json.dumps({
                "bucket": bucket,
                "prefix": prefix,
                "base": base,
                "geometry_type": info["geometry_type"],
                "feature_count": feature_count,
                "crs": crs,
                "idempotency_key": idempotency_key,
            }),
            MessageDeduplicationId=idempotency_key,
            MessageGroupId="shapefile-dispatch",
        )

    return {"statusCode": 200}

For the GCP equivalent using Cloud Functions and Firestore, see Triggering GCP Cloud Functions on New Shapefile Uploads.
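The hour-bucketed idempotency key in the validator changes if a retry crosses an hour boundary. An alternative, sketched here as an assumption rather than as the method used above, derives the key from the identity of the uploaded content, for example the ETag that `head_object` returns for each component. The helper name `content_idempotency_key` is hypothetical.

```python
import hashlib


def content_idempotency_key(bucket: str, etags: dict) -> str:
    """Hash the bucket plus sorted (key, etag) pairs into a stable hex digest."""
    parts = [bucket] + [f"{key}={etags[key]}" for key in sorted(etags)]
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()


# Illustrative ETag values; real ones come from the object metadata
first = content_idempotency_key(
    "raw-shp-ingest",
    {"staging/parcels.shp": "aaa", "staging/parcels.shx": "bbb", "staging/parcels.dbf": "ccc"},
)
retry = content_idempotency_key(
    "raw-shp-ingest",
    {"staging/parcels.dbf": "ccc", "staging/parcels.shp": "aaa", "staging/parcels.shx": "bbb"},
)
print(first == retry)  # True
```

With this scheme a retry of the same upload yields the same key regardless of timing, while a re-upload with changed content yields a new one.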

Step 4 — Route to Processing via a Message Queue

The dispatch message feeds directly into the SQS and Pub/Sub Queue Routing Strategies layer. Apply content-based routing at this stage: large feature counts (> 500,000) route to a dedicated high-memory queue; small datasets fan out to standard workers. The idempotency_key field guards against double processing when the queue delivers the same message twice.
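The routing rule described above reduces to a threshold check on the dispatch message. A minimal sketch, where the function name `choose_queue` and the queue URLs are placeholders:

```python
HIGH_MEMORY_THRESHOLD = 500_000   # feature count from the routing rule above


def choose_queue(message: dict, standard_url: str, high_memory_url: str) -> str:
    """Pick the target queue for a dispatch message based on its feature count."""
    if message["feature_count"] > HIGH_MEMORY_THRESHOLD:
        return high_memory_url
    return standard_url


# Placeholder queue URLs
STANDARD = "https://sqs.example/standard-workers.fifo"
HIGH_MEMORY = "https://sqs.example/high-memory-workers.fifo"

print(choose_queue({"feature_count": 1_200_000}, STANDARD, HIGH_MEMORY) == HIGH_MEMORY)  # True
print(choose_queue({"feature_count": 42_000}, STANDARD, HIGH_MEMORY) == STANDARD)        # True
```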

Measurement and Verification

Use this benchmark function to measure end-to-end latency from the first object upload to the dispatch message appearing in SQS:

python

import json
import os
import time

import boto3

S3 = boto3.client("s3")
SQS = boto3.client("sqs")


def measure_pipeline_latency(
    bucket: str,
    prefix: str,
    local_shp_dir: str,
    dispatch_queue_url: str,
    base: str,
    timeout_seconds: int = 60,
) -> dict:
    """
    Upload a shapefile set and measure how long until the dispatch message arrives.
    Returns latency_ms and message metadata.
    """
    start = time.monotonic()

    # Upload in typical desktop-client order: .shp first
    for ext in [".shp", ".shx", ".dbf", ".prj"]:
        local = f"{local_shp_dir}/{base}{ext}"
        if not os.path.exists(local):
            continue
        S3.upload_file(local, bucket, f"{prefix}/{base}{ext}")

    # Poll the dispatch queue until the validated message appears
    deadline = start + timeout_seconds
    while time.monotonic() < deadline:
        msgs = SQS.receive_message(
            QueueUrl=dispatch_queue_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=5,
        ).get("Messages", [])
        for msg in msgs:
            body = json.loads(msg["Body"])
            if body.get("base") == base:
                latency_ms = (time.monotonic() - start) * 1000
                SQS.delete_message(
                    QueueUrl=dispatch_queue_url,
                    ReceiptHandle=msg["ReceiptHandle"],
                )
                return {
                    "latency_ms": round(latency_ms, 1),
                    "geometry_type": body["geometry_type"],
                    "feature_count": body["feature_count"],
                }
    raise TimeoutError(f"Dispatch message for {base} not received within {timeout_seconds}s")

Expected ranges on a warm Lambda pair:

upload_to_dispatch_latency: 800 ms – 2,500 ms for a 5 MB shapefile set
validation_failure_rate (CloudWatch metric): < 0.5% in production with schema-clean inputs
DynamoDB SuccessfulRequestLatency for the upsert: < 5 ms p99

CloudWatch metric names to monitor:

Metric	Namespace	Expected value
`upload_to_dispatch_latency`	`ShapefileIngestion/Custom`	< 3,000 ms p95
`validation_failure_rate`	`ShapefileIngestion/Custom`	< 1%
`ApproximateNumberOfMessagesVisible` (on the DLQ)	`AWS/SQS`	0 in steady state
`SuccessfulRequestLatency` (`UpdateItem`)	`AWS/DynamoDB`	< 5 ms p99
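The two custom metrics in the table have to be published by the pipeline itself. The sketch below builds the arguments for CloudWatch's `put_metric_data` call for the latency metric; the helper name `latency_metric` and the `Environment` dimension are assumptions made for the example.

```python
def latency_metric(latency_ms: float, environment: str = "prod") -> dict:
    """Keyword arguments for cloudwatch_client.put_metric_data()."""
    return {
        "Namespace": "ShapefileIngestion/Custom",
        "MetricData": [
            {
                "MetricName": "upload_to_dispatch_latency",
                "Dimensions": [{"Name": "Environment", "Value": environment}],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            }
        ],
    }


payload = latency_metric(1240.5)

# With credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**payload)
```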

Failure Modes and Debugging

1. OGRERR_NOT_ENOUGH_DATA on pyogrio.read_info()

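Root cause: the validation function opened the dataset before all three required files were complete in /tmp. S3 has provided strong read-after-write consistency since December 2020, so a consistency window on newly written objects is not the cause. The usual culprits are a component that was truncated at the source, a client re-uploading a file while the validator is downloading it, or a required extension missing from the completion check. Fix: compare each downloaded file's size with the ContentLength reported by head_object, and when download_file raises ClientError with error code 404, retry up to three times with exponential delay starting at 500 ms.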
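The retry policy described above (three attempts, delays starting at 500 ms and doubling) can be sketched as a small helper. The name `retry_with_backoff` is hypothetical, and the demo uses `FileNotFoundError` as a stand-in for the S3 client error so that it runs without AWS access.

```python
import time


def retry_with_backoff(fn, is_retryable, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on a retryable exception wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1 or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** attempt)


# Demo: a download that fails twice before succeeding
delays = []
state = {"calls": 0}


def flaky_download():
    state["calls"] += 1
    if state["calls"] < 3:
        raise FileNotFoundError("object not available yet")
    return "ok"


result = retry_with_backoff(
    flaky_download,
    lambda exc: isinstance(exc, FileNotFoundError),
    sleep=delays.append,            # record delays instead of sleeping
)
print(result, delays)  # ok [0.5, 1.0]
```

In the validator, `is_retryable` would inspect the error code of the client exception instead.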

2. DynamoDB ConditionalCheckFailedException on the upsert

Root cause: this error looks like two Lambda invocations racing on the same item, but update_item with ADD on a string set is commutative and unconditional, so it cannot raise ConditionalCheckFailedException. If you see this error, the write path has been changed to put_item or update_item with a ConditionExpression. Revert to update_item with ADD and no condition.

3. Completion gate never fires despite all files being present

Root cause: the .shp and .shx arrived under a slightly different prefix (e.g. a trailing slash difference or URL-encoded space in the filename). Add logging to print the raw key from every incoming S3 record and compare the derived base_name partition key values. Normalize keys with urllib.parse.unquote_plus before constructing the partition key.
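A minimal sketch of the normalization step: S3 event payloads URL-encode object keys, so a space may arrive as `+` or `%20`. The helper name `partition_key` is illustrative.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote_plus


def partition_key(raw_key: str) -> str:
    """Decode an S3 event key and derive the base_name partition key."""
    path = PurePosixPath(unquote_plus(raw_key))
    return str(path.with_suffix(""))


print(partition_key("staging/county+parcels+2024.shp"))      # staging/county parcels 2024
print(partition_key("staging/county%20parcels%202024.dbf"))  # staging/county parcels 2024
```

Both encodings now map to the same partition key, so the components of one dataset land on the same state record.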

4. Validation function exceeds the 10 GB /tmp limit on AWS

Root cause: Lambda runs one invocation at a time per execution environment, but a warm environment keeps /tmp between invocations, so files left behind by earlier validations accumulate. Scope each dataset to a UUID subdirectory (/tmp/<uuid>/<base>.shp) and delete it after validation. For datasets exceeding 2 GB total size, stream the pyogrio.read_info() probe directly from S3 using GDAL’s /vsis3/ virtual filesystem driver:

python

import os

# GDAL configuration must be in place before pyogrio is imported
os.environ["GDAL_DATA"] = "/opt/python/share/gdal"
os.environ["PROJ_LIB"] = "/opt/python/share/proj"
os.environ["AWS_DEFAULT_REGION"] = os.environ["AWS_REGION"]

import pyogrio


def probe_from_s3(bucket: str, prefix: str, base: str) -> dict:
    """Read layer metadata directly from S3 without downloading to /tmp."""
    return pyogrio.read_info(f"/vsis3/{bucket}/{prefix}/{base}.shp")

This bypasses /tmp entirely and makes the ephemeral storage limit irrelevant for the validation step.
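For datasets that are still downloaded, the UUID-scoped directory with guaranteed cleanup can be sketched with the standard library alone. `scoped_workdir` is a hypothetical helper name, and the empty file stands in for a downloaded component.

```python
import os
import shutil
import tempfile
import uuid
from contextlib import contextmanager


@contextmanager
def scoped_workdir(root: str = tempfile.gettempdir()):
    """Create <root>/<uuid>/ for one dataset and always remove it afterwards."""
    path = os.path.join(root, uuid.uuid4().hex)
    os.makedirs(path)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)


with scoped_workdir() as workdir:
    target = os.path.join(workdir, "parcels_2024.shp")
    open(target, "wb").close()          # stand-in for the downloaded component
    existed = os.path.exists(target)

print(existed, os.path.exists(workdir))  # True False
```

Because the cleanup sits in a `finally` block, the directory is removed even when validation raises.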

5. State records accumulate indefinitely for abandoned uploads

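Root cause: a GIS analyst uploaded only the .shp and abandoned the transfer. The DynamoDB record never completes and adds storage and read costs to any polling query. Fix: set a TTL of 86,400 seconds (24 hours) on every record at write time by assigning the ttl attribute in the SET clause of the initial upsert, as shown in Step 2, and enable TTL on that attribute in the table settings. DynamoDB removes expired items in the background, typically within a few days of expiry, so treat any record whose ttl lies in the past as expired.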
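The expiry value is an epoch timestamp in seconds, and the table must be told which attribute holds it. A minimal sketch of both pieces follows; the helper names are illustrative, and the commented call shows where the specification would be passed to DynamoDB's `update_time_to_live`.

```python
import time

STATE_TTL_SECONDS = 86_400   # 24 hours, as in Step 2


def expiry_epoch(now=None, ttl_seconds: int = STATE_TTL_SECONDS) -> int:
    """Epoch second after which the state record counts as expired."""
    current = time.time() if now is None else now
    return int(current) + ttl_seconds


def ttl_specification(attribute: str = "ttl") -> dict:
    """Value for the TimeToLiveSpecification argument of update_time_to_live()."""
    return {"Enabled": True, "AttributeName": attribute}


def is_expired(item_ttl: int, now=None) -> bool:
    """Expired items may linger until the background delete runs, so check explicitly."""
    current = time.time() if now is None else now
    return item_ttl <= current


print(expiry_epoch(now=1_700_000_000))  # 1700086400

# With credentials configured (table name from Step 2's example comment):
#   import boto3
#   boto3.client("dynamodb").update_time_to_live(
#       TableName="shp-completion-state",
#       TimeToLiveSpecification=ttl_specification(),
#   )
```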

Cost and Scaling Considerations

Cost per 1,000 shapefile ingestions (AWS, on-demand):

Component	Unit cost	Per 1,000 uploads
S3 `PUT` notifications	$0.005 per 1,000 requests	$0.015 (3 files × 1,000)
Lambda aggregator (128 MB, 200 ms, 3 invocations per upload)	$0.0000166667 per GB-s	$0.0013
DynamoDB upserts (1 WCU each, 3 per upload)	$0.00000125 per WCU	$0.0038
Lambda validator (1 GB, 2 s)	$0.0000166667 per GB-s	$0.033
SQS FIFO dispatch	$0.0000005 per message	$0.0005
Total		~$0.054 per 1,000 uploads

Scale-out behaviour: the aggregation Lambda is stateless and scales up to the account concurrency limit (1,000 by default on AWS). At 10,000 dataset uploads per second the upsert rate reaches 30,000 writes per second (3 files × 10,000), at which point DynamoDB write throughput becomes a bottleneck. Use on-demand billing mode to avoid provisioned-capacity planning, but note that an on-demand table absorbs a sudden spike only up to roughly double its previous peak without throttling, so pre-warm the table or ramp traffic gradually before a bulk load.
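The arithmetic behind those figures is short enough to check directly. The sketch below also applies Little's law (in-flight work = arrival rate × duration) to estimate how many concurrent aggregator invocations that write rate implies, assuming the 200 ms aggregator duration used in the cost table; the function name is illustrative.

```python
FILES_PER_DATASET = 3          # .shp, .shx, .dbf


def writes_per_second(datasets_per_second: int) -> int:
    """Each component file produces one event and one upsert."""
    return FILES_PER_DATASET * datasets_per_second


def required_concurrency(events_per_second: int, duration_ms: int) -> float:
    """Little's law: concurrent invocations = arrival rate x average duration."""
    return events_per_second * duration_ms / 1000


rate = writes_per_second(10_000)
print(rate)                              # 30000
print(required_concurrency(rate, 200))   # 6000.0
```

Under these assumptions the aggregator would need about 6,000 concurrent executions, which is above the default account limit of 1,000, so Lambda concurrency deserves the same attention as table throughput at this scale.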

When to prefer an alternative: if your upstream clients can be modified to zip the shapefile components into a single archive before upload, you eliminate the completion gate entirely. A ZIP-triggered pipeline is simpler, cheaper, and immune to partial-upload races. The completion-gate pattern is only necessary when you do not control the upload client — a common constraint when ingesting data from partner agencies or legacy desktop GIS workflows.
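With a zipped upload, completeness becomes a property of a single object that can be checked in one pass. A minimal sketch using only the standard library follows; `zip_has_complete_shapefile` is a hypothetical helper, and the demo archive contains empty placeholder files.

```python
import io
import zipfile
from pathlib import PurePosixPath

REQUIRED = {".shp", ".shx", ".dbf"}


def zip_has_complete_shapefile(data: bytes) -> bool:
    """True if some base name in the archive has all required components."""
    by_base = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            path = PurePosixPath(name)
            if not path.suffix:
                continue                      # directories and extensionless entries
            by_base.setdefault(str(path.with_suffix("")), set()).add(path.suffix.lower())
    return any(REQUIRED <= extensions for extensions in by_base.values())


# Build a small archive in memory with placeholder components
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for ext in (".shp", ".shx", ".dbf", ".prj"):
        archive.writestr(f"parcels_2024{ext}", b"")

print(zip_has_complete_shapefile(buffer.getvalue()))  # True
```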

Connecting chunked I/O patterns for large satellite imagery with this trigger design: once the dispatch message lands in the queue, downstream raster-processing workers can consume it and apply windowed reads without re-triggering the ingestion layer, keeping ingestion and computation independently scalable.

Frequently Asked Questions

Why does a shapefile upload trigger multiple events?

Object storage emits one event per object write. A minimal shapefile requires three companion files (.shp, .shx, .dbf), so a single dataset upload produces at least three separate events. Without a completion gate, any function triggered by the first event operates on an incomplete dataset.

Can duplicate S3 events cause double processing?

Yes. S3 event notifications have at-least-once delivery semantics. Use idempotent state updates (a DynamoDB string-set ADD, or a Firestore set with merge) so that repeated events for the same object leave the completion state unchanged, and include an idempotency key in every dispatch message.

What is the cheapest way to track upload completion state?

DynamoDB with on-demand billing costs roughly $0.00000125 per write for typical shapefile metadata records, which are well under 1 KB. A 24-hour TTL on pending records auto-expires abandoned uploads with no manual cleanup. Firestore is the closest equivalent on GCP, and Azure Table Storage on Azure.

Should I filter S3 events to only .shp files?

No. Filter by the staging prefix, not by suffix. You need an event for every file in the set to maintain the completion state. Filtering to .shp only means the gate never sees the .shx or .dbf arrivals and can never transition to READY.

How do I handle .prj-less shapefiles?

Require the .prj file in your validation gate if your downstream pipeline requires coordinate reference system information. If CRS is optional for your use case, add .prj to the observed set passively: the gate fires on {.shp, .shx, .dbf} and the validator logs a warning if .prj is absent but does not fail. Document this policy in your pipeline’s data contract.
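Both policies fit in one function with a flag. This is a sketch under the assumptions in the answer above; `evaluate_gate` and the warning text are illustrative.

```python
BASE_REQUIRED = {".shp", ".shx", ".dbf"}


def evaluate_gate(observed: set, require_prj: bool = False):
    """Return (ready, warnings) for the set of extensions seen so far."""
    required = BASE_REQUIRED | ({".prj"} if require_prj else set())
    ready = required <= observed
    warnings = []
    if ready and ".prj" not in observed:
        warnings.append("no .prj present: CRS is unknown")
    return ready, warnings


trio = {".shp", ".shx", ".dbf"}
print(evaluate_gate(trio))                    # (True, ['no .prj present: CRS is unknown'])
print(evaluate_gate(trio, require_prj=True))  # (False, [])
```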
