CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Spring Boot 3.5.7 CLI application that converts PostgreSQL PostGIS spatial data to ESRI shapefiles and GeoJSON formats. The application uses Spring Batch for memory-efficient processing of large datasets (1M+ records) and supports automatic GeoServer layer registration via REST API.

Key Features:

  • Memory-optimized batch processing (90-95% reduction: 2-13GB → 150-200MB)
  • Chunk-based streaming with cursor pagination (fetch-size: 1000)
  • Automatic geometry validation and type conversion (MultiPolygon → Polygon)
  • Coordinate system validation (EPSG:5186 Korean 2000 / Central Belt)
  • Dual execution modes: Spring Batch (recommended) and Legacy mode

Build and Run Commands

Build

./gradlew build                  # Full build with tests
./gradlew clean build -x test   # Skip tests
./gradlew spotlessApply         # Apply Google Java Format (2-space indentation)
./gradlew spotlessCheck         # Verify formatting without applying

Output: build/libs/shp-exporter.jar (fixed name, no version suffix)

Run Application

# Generate shapefile + GeoJSON
./gradlew bootRun --args="--batch --converter.batch-ids[0]=252"

# With GeoServer registration
export GEOSERVER_USERNAME=admin
export GEOSERVER_PASSWORD=geoserver
./gradlew bootRun --args="--batch --geoserver.enabled=true --converter.batch-ids[0]=252"

# Using JAR (production)
java -jar build/libs/shp-exporter.jar \
  --batch \
  --converter.inference-id=D5E46F60FC40B1A8BE0CD1F3547AA6 \
  --converter.batch-ids[0]=252 \
  --converter.batch-ids[1]=253

Legacy Mode (Small Datasets Only)

./gradlew bootRun  # No --batch flag
# Warning: May OOM on large datasets

Upload Shapefile to GeoServer

Set environment variables first:

export GEOSERVER_USERNAME=admin
export GEOSERVER_PASSWORD=geoserver

Then upload:

./gradlew bootRun --args="--upload-shp /path/to/file.shp --layer layer_name"

Or using JAR:

java -jar build/libs/shp-exporter.jar --upload-shp /path/to/file.shp --layer layer_name

Override Configuration via Command Line

Using Gradle (recommended - no quoting issues):

./gradlew bootRun --args="--converter.inference-id=ABC123 --converter.map-ids[0]=35813030 --converter.batch-ids[0]=252 --converter.mode=MERGED"

Using JAR with zsh (quote arguments with brackets):

java -jar build/libs/shp-exporter.jar '--converter.inference-id=ABC123' '--converter.map-ids[0]=35813030'

Code Formatting

Apply Google Java Format (2-space indentation) before committing:

./gradlew spotlessApply

Check formatting without applying:

./gradlew spotlessCheck

Active Profile

By default, the application runs with spring.profiles.active=prod (set in application.yml). Profile-specific configurations are in application-{profile}.yml files.

Architecture

Dual Execution Modes

The application supports two execution modes with distinct processing pipelines:

Spring Batch Mode (Recommended)

Trigger: --batch flag
Use Case: Large datasets (100K+ records), production workloads
Memory: 150-200MB constant (chunk-based streaming)

Pipeline Flow:

ConverterCommandLineRunner
  → JobLauncher.run(mergedModeJob)
    → Step 1: GeometryTypeValidationTasklet (validates geometry homogeneity)
    → Step 2: generateShapefileStep (chunk-oriented)
        → JdbcCursorItemReader (fetch-size: 1000)
        → FeatureConversionProcessor (InferenceResult → SimpleFeature)
        → StreamingShapefileWriter (chunk-based append)
    → Step 3: generateGeoJsonStep (chunk-oriented, same pattern)
    → Step 4: CreateZipTasklet (creates .zip for GeoServer)
    → Step 5: GeoServerRegistrationTasklet (conditional, if --geoserver.enabled=true)
    → Step 6: generateMapIdFilesStep (partitioned, sequential map_id processing)

Key Components:

  • JdbcCursorItemReader: Cursor-based streaming (no full result set loading)
  • StreamingShapefileWriter: Opens GeoTools transaction, writes chunks incrementally, commits at end
  • GeometryTypeValidationTasklet: Pre-validates with SQL DISTINCT ST_GeometryType(), auto-converts MultiPolygon
  • CompositeItemWriter: Simultaneously writes shapefile and GeoJSON in map_id worker step
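The cursor-based reader setup can be sketched as a @StepScope bean along these lines (a simplified fragment; the query is abbreviated and batchIdsSetter is assumed in scope — see InferenceResultItemReaderConfig.java for the real configuration):

```java
// Sketch only — simplified from InferenceResultItemReaderConfig.java.
@Bean
@StepScope
public JdbcCursorItemReader<InferenceResult> inferenceResultReader(DataSource dataSource) {
  return new JdbcCursorItemReaderBuilder<InferenceResult>()
      .name("inferenceResultReader")
      .dataSource(dataSource)
      .sql("SELECT uid, map_id, ST_AsText(geometry) AS geometry_wkt "
          + "FROM inference_results_testing WHERE batch_id = ANY(?) ORDER BY map_id, uid")
      .preparedStatementSetter(batchIdsSetter)       // binds the bigint[] batch-id parameter
      .rowMapper(new GeometryConvertingRowMapper())  // ResultSet row → InferenceResult
      .fetchSize(1000)                               // server-side cursor fetch, not a full load
      .build();
}
```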

Legacy Mode

Trigger: No --batch flag (deprecated)
Use Case: Small datasets (<10K records)
Memory: 1.4-9GB (loads entire result set)

Pipeline Flow:

ConverterCommandLineRunner
  → ShapefileConverterService.convertAll()
    → InferenceResultRepository.findByBatchIds() (full List<InferenceResult>)
    → validateGeometries() (in-memory validation)
    → ShapefileWriter.write() (DefaultFeatureCollection accumulation)
    → GeoJsonWriter.write()

Key Design Patterns

Geometry Type Validation & Auto-Conversion:

  • Pre-validation step runs SQL SELECT DISTINCT ST_GeometryType(geometry) to detect mixed types
  • Supports automatic conversion: ST_MultiPolygon → ST_Polygon (extracts first polygon only)
  • Fails fast on unsupported mixed types (e.g., Polygon + LineString)
  • Validates EPSG:5186 coordinate bounds (X: 125 to 530 km, Y: -600 to 988 km) and ST_IsValid()
  • See GeometryTypeValidationTasklet (batch/tasklet/GeometryTypeValidationTasklet.java:1-290)

WKT to JTS Conversion Pipeline:

  1. PostGIS query returns ST_AsText(geometry) as WKT string
  2. GeometryConvertingRowMapper converts ResultSet row to InferenceResult with WKT string (batch/reader/GeometryConvertingRowMapper.java:1-74)
  3. FeatureConversionProcessor uses GeometryConverter.parseGeometry() to convert WKT → JTS Geometry (service/GeometryConverter.java:1-92)
  4. StreamingShapefileWriter wraps JTS geometry in GeoTools SimpleFeature and writes to shapefile
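Step 3 presumably boils down to JTS's WKTReader; a minimal self-contained sketch of what GeometryConverter.parseGeometry() likely does (error handling assumed to live in the real class):

```java
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.MultiPolygon;
import org.locationtech.jts.geom.Polygon;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKTReader;

// Sketch of the WKT → JTS conversion, including the documented
// MultiPolygon → Polygon handling (first polygon only).
public final class WktParseSketch {

  static Geometry parse(String wkt) throws ParseException {
    Geometry g = new WKTReader().read(wkt);
    if (g instanceof MultiPolygon mp && mp.getNumGeometries() >= 1) {
      return (Polygon) mp.getGeometryN(0);  // keep the first polygon only
    }
    return g;
  }
}
```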

Chunk-Based Transaction Management (Spring Batch only):

// StreamingShapefileWriter
@BeforeStep
public void open() {
    transaction = new DefaultTransaction("create");
    featureStore.setTransaction(transaction);  // Long-running transaction
}

@Override
public void write(Chunk<SimpleFeature> chunk) {
    ListFeatureCollection collection = new ListFeatureCollection(featureType, chunk.getItems());
    featureStore.addFeatures(collection);  // Append chunk to shapefile
    // chunk goes out of scope → GC eligible
}

@AfterStep
public void afterStep() {
    transaction.commit();  // Commit all chunks at once
    transaction.close();
}

PostgreSQL Array Parameter Handling:

// InferenceResultItemReaderConfig uses PreparedStatementSetter
ps -> {
    Array batchIdsArray = ps.getConnection().createArrayOf("bigint", batchIds.toArray());
    ps.setArray(1, batchIdsArray);  // WHERE batch_id = ANY(?)
    ps.setString(2, mapId);
}

Output Directory Strategy:

  • Batch mode (MERGED): {output-base-dir}/{inference-id}/merge/ → Single merged shapefile + GeoJSON
  • Batch mode (map_id partitioning): {output-base-dir}/{inference-id}/{map-id}/ → Per-map_id files
  • Legacy mode: {output-base-dir}/{inference-id}/{map-id}/ (no merge folder)
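The strategy above is plain path composition; a self-contained sketch (helper and class names are illustrative, not from the codebase):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class OutputDirs {
  // Hypothetical helpers mirroring the directory strategy above.
  static Path mergedDir(String baseDir, String inferenceId) {
    return Paths.get(baseDir, inferenceId, "merge");   // batch MERGED mode
  }

  static Path mapIdDir(String baseDir, String inferenceId, String mapId) {
    return Paths.get(baseDir, inferenceId, mapId);     // per-map_id output
  }

  public static void main(String[] args) {
    System.out.println(mergedDir("/data/model_output/export", "ABC123"));
    // → /data/model_output/export/ABC123/merge
  }
}
```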

GeoServer Registration:

  • Only shapefile ZIP is uploaded (GeoJSON not registered)
  • Requires pre-created workspace 'cd' and environment variables for auth
  • Conditional execution via JobParameter geoserver.enabled
  • Non-blocking: failures logged but don't stop batch job
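The upload itself is a single call against GeoServer's documented file-upload endpoint; a hedged fragment (baseUrl, layerName, and restTemplate assumed in scope, credentials from the environment variables above):

```java
// Sketch — endpoint shape follows GeoServer's REST shapefile-upload API.
byte[] zipBytes = Files.readAllBytes(Path.of("/path/to/layer.zip"));

HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.valueOf("application/zip"));
headers.setBasicAuth(System.getenv("GEOSERVER_USERNAME"), System.getenv("GEOSERVER_PASSWORD"));

String url = baseUrl + "/rest/workspaces/cd/datastores/" + layerName + "/file.shp";
restTemplate.exchange(url, HttpMethod.PUT, new HttpEntity<>(zipBytes, headers), String.class);
```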

Configuration

Profile System

  • Default profile: prod (set in application.yml)
  • Configuration hierarchy: application.yml → application-{profile}.yml
  • Override via: --spring.profiles.active=dev

Key Configuration Properties

Converter Settings (ConverterProperties.java):

converter:
  inference-id: 'D5E46F60FC40B1A8BE0CD1F3547AA6'  # Output folder name
  batch-ids: [252, 253, 257]  # PostgreSQL batch_id filter (required)
  map-ids: []                 # Legacy mode only (ignored in batch mode)
  mode: 'MERGED'              # Legacy mode only: MERGED, MAP_IDS, or RESOLVE
  output-base-dir: '/data/model_output/export/'
  crs: 'EPSG:5186'            # Korean 2000 / Central Belt

  batch:
    chunk-size: 1000          # Records per chunk (affects memory usage)
    fetch-size: 1000          # JDBC cursor fetch size
    skip-limit: 100           # Max skippable records per chunk
    enable-partitioning: false  # Future: parallel map_id processing

GeoServer Settings (GeoServerProperties.java):

geoserver:
  base-url: 'https://kamco.geo-dev.gs.dabeeo.com/geoserver'
  workspace: 'cd'              # Must be pre-created in GeoServer
  overwrite-existing: true     # Delete existing layer before registration
  connection-timeout: 30000    # 30 seconds
  read-timeout: 60000          # 60 seconds
  # Credentials from environment variables (preferred):
  # GEOSERVER_USERNAME, GEOSERVER_PASSWORD

Spring Batch Metadata:

spring:
  batch:
    job:
      enabled: false           # Prevent auto-run on startup
    jdbc:
      initialize-schema: always  # Auto-create BATCH_* tables

Database Integration

Query Strategies

Spring Batch Mode (streaming):

-- InferenceResultItemReaderConfig.java
SELECT uid, map_id, probability, before_year, after_year,
       before_c, before_p, after_c, after_p,
       ST_AsText(geometry) as geometry_wkt
FROM inference_results_testing
WHERE batch_id = ANY(?)
  AND ST_GeometryType(geometry) IN ('ST_Polygon', 'ST_MultiPolygon')
  AND ST_SRID(geometry) = 5186
  AND ST_X(ST_Centroid(geometry)) BETWEEN 125000 AND 530000
  AND ST_Y(ST_Centroid(geometry)) BETWEEN -600000 AND 988000
  AND ST_IsValid(geometry) = true
ORDER BY map_id, uid
-- Uses server-side cursor with fetch-size=1000

Legacy Mode (full load):

-- InferenceResultRepository.java
SELECT uid, map_id, probability, before_year, after_year,
       before_c, before_p, after_c, after_p,
       ST_AsText(geometry) as geometry_wkt
FROM inference_results_testing
WHERE batch_id = ANY(?) AND map_id = ?
-- Returns full List<InferenceResult> in memory

Geometry Type Validation:

-- GeometryTypeValidationTasklet.java
SELECT DISTINCT ST_GeometryType(geometry)
FROM inference_results_testing
WHERE batch_id = ANY(?) AND geometry IS NOT NULL
-- Pre-validates homogeneous geometry requirement

Field Mapping

Database columns map to shapefile fields (10-character limit):

| Database Column | DB Type | Shapefile Field | Shapefile Type | Notes |
|-----------------|---------|-----------------|----------------|-------|
| uid             | uuid    | chnDtctId       | String         | Change detection ID |
| map_id          | text    | mpqd_no         | String         | Map quadrant number |
| probability     | float8  | chn_dtct_p      | Double         | Change detection probability |
| before_year     | bigint  | cprs_yr         | Long           | Comparison year |
| after_year      | bigint  | crtr_yr         | Long           | Criteria year |
| before_c        | text    | bf_cls_cd       | String         | Before classification code |
| before_p        | float8  | bf_cls_pro      | Double         | Before classification probability |
| after_c         | text    | af_cls_cd       | String         | After classification code |
| after_p         | float8  | af_cls_pro      | Double         | After classification probability |
| geometry        | geom    | the_geom        | Polygon        | Geometry in EPSG:5186 |

Field name source: See FeatureTypeFactory.java (batch/util/FeatureTypeFactory.java:1-104)
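Building that schema with GeoTools looks roughly like the following (a sketch of what FeatureTypeFactory.java presumably does; note the geometry attribute comes first, as shapefiles require):

```java
// Sketch — see FeatureTypeFactory.java for the authoritative version.
SimpleFeatureTypeBuilder builder = new SimpleFeatureTypeBuilder();
builder.setName("inference_results");
builder.setCRS(CRS.decode("EPSG:5186"));  // may throw FactoryException
builder.add("the_geom", Polygon.class);   // geometry attribute first for shapefiles
builder.add("chnDtctId", String.class);
builder.add("mpqd_no", String.class);
builder.add("chn_dtct_p", Double.class);
builder.add("cprs_yr", Long.class);
builder.add("crtr_yr", Long.class);
builder.add("bf_cls_cd", String.class);
builder.add("bf_cls_pro", Double.class);
builder.add("af_cls_cd", String.class);
builder.add("af_cls_pro", Double.class);
SimpleFeatureType featureType = builder.buildFeatureType();
```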

Coordinate Reference System

  • CRS: EPSG:5186 (Korean 2000 / Central Belt)
  • Valid Coordinate Bounds: X ∈ [125km, 530km], Y ∈ [-600km, 988km]
  • Encoding: WKT in SQL → JTS Geometry → GeoTools SimpleFeature → .prj file
  • Validation: Automatic in batch mode via ST_X(ST_Centroid()) range check

Dependencies

Core Framework:

  • Spring Boot 3.5.7
    • spring-boot-starter: DI container, logging
    • spring-boot-starter-jdbc: JDBC template, HikariCP
    • spring-boot-starter-batch: Spring Batch framework, job repository
    • spring-boot-starter-web: RestTemplate for GeoServer API calls
    • spring-boot-starter-validation: @NotBlank annotations

Spatial Libraries:

  • GeoTools 30.0 (via OSGeo repository)
    • gt-shapefile: Shapefile I/O (DataStore, FeatureStore, Transaction)
    • gt-geojson: GeoJSON encoding/decoding
    • gt-referencing: CRS transformations
    • gt-epsg-hsql: EPSG database for CRS lookups
  • JTS 1.19.0: Geometry primitives (Polygon, MultiPolygon, GeometryFactory)
  • PostGIS JDBC 2.5.1: PostGIS geometry type support

Database:

  • PostgreSQL JDBC Driver (latest)
  • HikariCP (bundled with Spring Boot)

Build Configuration:

// build.gradle
configurations.all {
  exclude group: 'javax.media', module: 'jai_core'  // Conflicts with GeoTools
}

bootJar {
  archiveFileName = "shp-exporter.jar"  // Fixed JAR name
}

spotless {
  java {
    googleJavaFormat('1.19.2')  // 2-space indentation
  }
}

Development Patterns

Adding a New Step to Spring Batch Job

When adding steps to mergedModeJob, follow this pattern:

  1. Create Tasklet or ItemWriter in batch/tasklet/ or batch/writer/
  2. Define Step Bean in MergedModeJobConfig.java:
@Bean
public Step myNewStep(JobRepository jobRepository,
                      PlatformTransactionManager transactionManager,
                      MyTasklet tasklet,
                      BatchExecutionHistoryListener historyListener) {
  return new StepBuilder("myNewStep", jobRepository)
      .tasklet(tasklet, transactionManager)
      .listener(historyListener)  // REQUIRED for history tracking
      .build();
}
  3. Add to Job Flow in mergedModeJob():
.next(myNewStep)
  4. Always include BatchExecutionHistoryListener to track execution metrics

Modifying ItemReader Configuration

ItemReaders are not thread-safe. Each step requires its own instance:

// WRONG: Sharing reader between steps
@Bean
public JdbcCursorItemReader<InferenceResult> reader() { ... }

// RIGHT: Separate readers with @StepScope
@Bean
@StepScope  // Creates new instance per step
public JdbcCursorItemReader<InferenceResult> shapefileReader() { ... }

@Bean
@StepScope
public JdbcCursorItemReader<InferenceResult> geoJsonReader() { ... }

See InferenceResultItemReaderConfig.java for working examples.

Streaming Writers Pattern

When writing custom streaming writers, follow StreamingShapefileWriter pattern:

@Component
@StepScope
public class MyStreamingWriter implements ItemStreamWriter<MyType> {
    private Transaction transaction;

    @Override
    public void open(ExecutionContext context) {
        // ItemStream lifecycle callback: open resources, start transaction
        transaction = new DefaultTransaction("create");
    }

    @Override
    public void write(Chunk<? extends MyType> chunk) {
        // Write chunk incrementally
        // Do NOT accumulate in memory
    }

    @AfterStep
    public ExitStatus afterStep(StepExecution stepExecution) {
        transaction.commit();  // Commit all chunks
        transaction.close();
        return ExitStatus.COMPLETED;
    }
}

JobParameters and StepExecutionContext

Pass data between steps via the job-level ExecutionContext (a step's own context is not visible to later steps):

// Step 1: Store data in the job ExecutionContext
stepExecution.getJobExecution()
    .getExecutionContext()
    .putString("geometryType", "ST_Polygon");

// Step 2: Retrieve data
@BeforeStep
public void beforeStep(StepExecution stepExecution) {
    String geomType = stepExecution.getJobExecution()
        .getExecutionContext()
        .getString("geometryType");
}
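If a step writes to its own step-level context instead, Spring Batch's ExecutionContextPromotionListener can promote selected keys to the job context when the step completes (a sketch; the key name matches the example above):

```java
// Register this listener on the producing step; it copies "geometryType"
// from the step ExecutionContext into the job ExecutionContext.
@Bean
public ExecutionContextPromotionListener promotionListener() {
  ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
  listener.setKeys(new String[] {"geometryType"});
  return listener;
}
```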

Job-level parameters from command line:

// ConverterCommandLineRunner.buildJobParameters()
JobParametersBuilder builder = new JobParametersBuilder();
builder.addString("inferenceId", converterProperties.getInferenceId());
builder.addLong("timestamp", System.currentTimeMillis());  // Ensures uniqueness

Partitioning Pattern (Map ID Processing)

The generateMapIdFilesStep uses partitioning but runs sequentially to avoid DB connection pool exhaustion:

@Bean
public Step generateMapIdFilesStep(...) {
    return new StepBuilder("generateMapIdFilesStep", jobRepository)
        .partitioner("mapIdWorker", partitioner)
        .step(mapIdWorkerStep)
        .taskExecutor(new SyncTaskExecutor())  // SEQUENTIAL execution
        .build();
}

For parallel execution in future (requires connection pool tuning):

.taskExecutor(new SimpleAsyncTaskExecutor())
.gridSize(4)  // 4 concurrent workers
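The partitioner behind this step presumably emits one ExecutionContext per map_id, along these lines (a sketch; the class name is an assumption):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Hypothetical partitioner: one partition per map_id, consumed by mapIdWorkerStep.
public class MapIdPartitioner implements Partitioner {
  private final List<String> mapIds;

  public MapIdPartitioner(List<String> mapIds) {
    this.mapIds = mapIds;
  }

  @Override
  public Map<String, ExecutionContext> partition(int gridSize) {
    Map<String, ExecutionContext> partitions = new HashMap<>();
    for (String mapId : mapIds) {
      ExecutionContext ctx = new ExecutionContext();
      // Worker reads this via @Value("#{stepExecutionContext['mapId']}")
      ctx.putString("mapId", mapId);
      partitions.put("partition-" + mapId, ctx);
    }
    return partitions;
  }
}
```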

GeoServer REST API Integration

GeoServer operations use RestTemplate with custom error handling:

// GeoServerRegistrationService.java
try {
    restTemplate.exchange(url, HttpMethod.PUT, entity, String.class);
} catch (HttpClientErrorException e) {
    if (e.getStatusCode() == HttpStatus.NOT_FOUND) {
        // Handle workspace not found
    }
}

Always check workspace existence before layer registration.
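That existence check is a plain GET against the REST API; a sketch (a 404 means the workspace must be created before registration can proceed):

```java
// Sketch — returns true if the workspace exists on the GeoServer instance.
boolean workspaceExists(RestTemplate restTemplate, String baseUrl, String workspace) {
  try {
    restTemplate.getForEntity(baseUrl + "/rest/workspaces/" + workspace + ".json", String.class);
    return true;
  } catch (HttpClientErrorException.NotFound e) {
    return false;
  }
}
```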

Testing Considerations

  • Unit tests: Mock JdbcTemplate, DataSource for repository tests
  • Integration tests: Use @SpringBatchTest with embedded H2 database
  • GeoTools: Use MemoryDataStore for shapefile writer tests
  • Current state: Limited test coverage (focus on critical path validation)
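An integration test for the job can build on Spring Batch's test utilities (a sketch, assuming @SpringBatchTest wiring, an embedded datasource, and JUnit 5):

```java
@SpringBatchTest
@SpringBootTest
class MergedModeJobTest {

  @Autowired private JobLauncherTestUtils jobLauncherTestUtils;

  @Test
  void jobCompletes() throws Exception {
    JobParameters params = new JobParametersBuilder()
        .addString("inferenceId", "TEST")
        .addLong("timestamp", System.currentTimeMillis())  // uniqueness, as in production
        .toJobParameters();

    JobExecution execution = jobLauncherTestUtils.launchJob(params);

    assertEquals(BatchStatus.COMPLETED, execution.getStatus());
  }
}
```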

Refer to claudedocs/SPRING_BATCH_MIGRATION.md for detailed batch architecture documentation.