# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Spring Boot 3.5.7 CLI application that converts PostgreSQL PostGIS spatial data to ESRI shapefiles and GeoJSON formats. The application uses **Spring Batch** for memory-efficient processing of large datasets (1M+ records) and supports automatic GeoServer layer registration via REST API.
**Key Features**:
- Memory-optimized batch processing (90-95% reduction: 2-13GB → 150-200MB)
- Chunk-based streaming with cursor pagination (fetch-size: 1000)
- Automatic geometry validation and type conversion (MultiPolygon → Polygon)
- Coordinate system validation (EPSG:5186 Korean 2000 / Central Belt)
- Dual execution modes: Spring Batch (recommended) and Legacy mode
## Build and Run Commands
### Build
```bash
./gradlew build # Full build with tests
./gradlew clean build -x test # Skip tests
./gradlew spotlessApply # Apply Google Java Format (2-space indentation)
./gradlew spotlessCheck # Verify formatting without applying
```
Output: `build/libs/shp-exporter.jar` (fixed name, no version suffix)
### Run Application
#### Spring Batch Mode (Recommended)
```bash
# Generate shapefile + GeoJSON
./gradlew bootRun --args="--batch --converter.batch-ids[0]=252"
# With GeoServer registration
export GEOSERVER_USERNAME=admin
export GEOSERVER_PASSWORD=geoserver
./gradlew bootRun --args="--batch --geoserver.enabled=true --converter.batch-ids[0]=252"
# Using JAR (production)
java -jar build/libs/shp-exporter.jar \
  --batch \
  --converter.inference-id=D5E46F60FC40B1A8BE0CD1F3547AA6 \
  --converter.batch-ids[0]=252 \
  --converter.batch-ids[1]=253
```
#### Legacy Mode (Small Datasets Only)
```bash
./gradlew bootRun # No --batch flag
# Warning: May OOM on large datasets
```
#### Upload Shapefile to GeoServer
Set environment variables first:
```bash
export GEOSERVER_USERNAME=admin
export GEOSERVER_PASSWORD=geoserver
```
Then upload:
```bash
./gradlew bootRun --args="--upload-shp /path/to/file.shp --layer layer_name"
```
Or using JAR:
```bash
java -jar build/libs/shp-exporter.jar --upload-shp /path/to/file.shp --layer layer_name
```
#### Override Configuration via Command Line
Using Gradle (recommended; avoids shell quoting issues):
```bash
./gradlew bootRun --args="--converter.inference-id=ABC123 --converter.map-ids[0]=35813030 --converter.batch-ids[0]=252 --converter.mode=MERGED"
```
Using JAR with zsh (quote arguments with brackets):
```bash
java -jar build/libs/shp-exporter.jar '--converter.inference-id=ABC123' '--converter.map-ids[0]=35813030'
```
### Code Formatting
Apply Google Java Format (2-space indentation) before committing:
```bash
./gradlew spotlessApply
```
Check formatting without applying:
```bash
./gradlew spotlessCheck
```
### Active Profile
By default, the application runs with `spring.profiles.active=prod` (set in `application.yml`). Profile-specific configurations are in `application-{profile}.yml` files.
## Architecture
### Dual Execution Modes
The application supports two execution modes with distinct processing pipelines:
#### Spring Batch Mode (Recommended)
**Trigger**: `--batch` flag
**Use Case**: Large datasets (100K+ records), production workloads
**Memory**: 150-200MB constant (chunk-based streaming)
**Pipeline Flow**:
```
ConverterCommandLineRunner
→ JobLauncher.run(mergedModeJob)
→ Step 1: GeometryTypeValidationTasklet (validates geometry homogeneity)
→ Step 2: generateShapefileStep (chunk-oriented)
→ JdbcCursorItemReader (fetch-size: 1000)
→ FeatureConversionProcessor (InferenceResult → SimpleFeature)
→ StreamingShapefileWriter (chunk-based append)
→ Step 3: generateGeoJsonStep (chunk-oriented, same pattern)
→ Step 4: CreateZipTasklet (creates .zip for GeoServer)
→ Step 5: GeoServerRegistrationTasklet (conditional, if --geoserver.enabled=true)
→ Step 6: generateMapIdFilesStep (partitioned, sequential map_id processing)
```
**Key Components**:
- `JdbcCursorItemReader`: Cursor-based streaming (no full result set loading)
- `StreamingShapefileWriter`: Opens GeoTools transaction, writes chunks incrementally, commits at end
- `GeometryTypeValidationTasklet`: Pre-validates with SQL `DISTINCT ST_GeometryType()`, auto-converts MultiPolygon
- `CompositeItemWriter`: Simultaneously writes shapefile and GeoJSON in map_id worker step
#### Legacy Mode
**Trigger**: No `--batch` flag (deprecated)
**Use Case**: Small datasets (<10K records)
**Memory**: 1.4-9GB (loads entire result set)
**Pipeline Flow**:
```
ConverterCommandLineRunner
→ ShapefileConverterService.convertAll()
→ InferenceResultRepository.findByBatchIds() (full List<InferenceResult>)
→ validateGeometries() (in-memory validation)
→ ShapefileWriter.write() (DefaultFeatureCollection accumulation)
→ GeoJsonWriter.write()
```
### Key Design Patterns
**Geometry Type Validation & Auto-Conversion**:
- Pre-validation step runs SQL `SELECT DISTINCT ST_GeometryType(geometry)` to detect mixed types
- Supports automatic conversion: `ST_MultiPolygon` → `ST_Polygon` (extracts the first polygon only)
- Fails fast on unsupported mixed types (e.g., Polygon + LineString)
- Validates EPSG:5186 coordinate bounds (X ∈ [125 km, 530 km], Y ∈ [−600 km, 988 km]) and `ST_IsValid()`
- See `GeometryTypeValidationTasklet` (batch/tasklet/GeometryTypeValidationTasklet.java:1-290)
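The pass/convert/fail decision described above can be sketched in plain Java. This is an illustrative model of the logic, not the actual `GeometryTypeValidationTasklet`; the class, enum, and method names are hypothetical:

```java
import java.util.Set;

// Hypothetical sketch of the homogeneity decision on the DISTINCT
// ST_GeometryType() result set; not the real tasklet implementation.
public class GeometryTypeCheck {

  /** Outcome of inspecting the distinct geometry types in a batch. */
  public enum Decision { PASS, CONVERT_MULTIPOLYGON, FAIL }

  public static Decision decide(Set<String> distinctTypes) {
    if (distinctTypes.equals(Set.of("ST_Polygon"))) {
      return Decision.PASS; // Already homogeneous polygons
    }
    if (Set.of("ST_Polygon", "ST_MultiPolygon").containsAll(distinctTypes)
        && distinctTypes.contains("ST_MultiPolygon")) {
      return Decision.CONVERT_MULTIPOLYGON; // Auto-convert to Polygon
    }
    return Decision.FAIL; // e.g. Polygon + LineString: fail fast
  }
}
```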
**WKT to JTS Conversion Pipeline**:
1. PostGIS query returns `ST_AsText(geometry)` as WKT string
2. `GeometryConvertingRowMapper` converts ResultSet row to `InferenceResult` with WKT string (batch/reader/GeometryConvertingRowMapper.java:1-74)
3. `FeatureConversionProcessor` uses `GeometryConverter.parseGeometry()` to convert WKT → JTS Geometry (service/GeometryConverter.java:1-92)
4. `StreamingShapefileWriter` wraps JTS geometry in GeoTools `SimpleFeature` and writes to shapefile
**Chunk-Based Transaction Management** (Spring Batch only):
```java
// StreamingShapefileWriter
@BeforeStep
public void open() {
  transaction = new DefaultTransaction("create");
  featureStore.setTransaction(transaction); // Long-running transaction
}

@Override
public void write(Chunk<SimpleFeature> chunk) {
  ListFeatureCollection collection = new ListFeatureCollection(featureType, chunk.getItems());
  featureStore.addFeatures(collection); // Append chunk to shapefile
  // chunk goes out of scope → GC eligible
}

@AfterStep
public void afterStep() {
  transaction.commit(); // Commit all chunks at once
  transaction.close();
}
```
**PostgreSQL Array Parameter Handling**:
```java
// InferenceResultItemReaderConfig uses PreparedStatementSetter
ps -> {
  Array batchIdsArray = ps.getConnection().createArrayOf("bigint", batchIds.toArray());
  ps.setArray(1, batchIdsArray); // WHERE batch_id = ANY(?)
  ps.setString(2, mapId);
}
```
**Output Directory Strategy**:
- Batch mode (MERGED): `{output-base-dir}/{inference-id}/merge/` → Single merged shapefile + GeoJSON
- Batch mode (map_id partitioning): `{output-base-dir}/{inference-id}/{map-id}/` → Per-map_id files
- Legacy mode: `{output-base-dir}/{inference-id}/{map-id}/` (no merge folder)
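The layout above can be expressed as simple path construction. A minimal sketch using `java.nio.file` (the class and method names are illustrative, not the project's actual path-building code):

```java
import java.nio.file.Path;

// Illustrative sketch of the output directory layout; not the project's code.
public class OutputDirs {

  /** Batch MERGED mode: {output-base-dir}/{inference-id}/merge/ */
  public static Path mergedDir(String baseDir, String inferenceId) {
    return Path.of(baseDir, inferenceId, "merge");
  }

  /** Per-map_id output: {output-base-dir}/{inference-id}/{map-id}/ */
  public static Path mapIdDir(String baseDir, String inferenceId, String mapId) {
    return Path.of(baseDir, inferenceId, mapId);
  }
}
```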
**GeoServer Registration**:
- Only shapefile ZIP is uploaded (GeoJSON not registered)
- Requires pre-created workspace 'cd' and environment variables for auth
- Conditional execution via JobParameter `geoserver.enabled`
- Non-blocking: failures logged but don't stop batch job
## Configuration
### Profile System
- Default profile: `prod` (set in application.yml)
- Configuration hierarchy: `application.yml` → `application-{profile}.yml`
- Override via: `--spring.profiles.active=dev`
### Key Configuration Properties
**Converter Settings** (`ConverterProperties.java`):
```yaml
converter:
  inference-id: 'D5E46F60FC40B1A8BE0CD1F3547AA6' # Output folder name
  batch-ids: [252, 253, 257] # PostgreSQL batch_id filter (required)
  map-ids: [] # Legacy mode only (ignored in batch mode)
  mode: 'MERGED' # Legacy mode only: MERGED, MAP_IDS, or RESOLVE
  output-base-dir: '/data/model_output/export/'
  crs: 'EPSG:5186' # Korean 2000 / Central Belt
  batch:
    chunk-size: 1000 # Records per chunk (affects memory usage)
    fetch-size: 1000 # JDBC cursor fetch size
    skip-limit: 100 # Max skippable records per chunk
    enable-partitioning: false # Future: parallel map_id processing
```
**GeoServer Settings** (`GeoServerProperties.java`):
```yaml
geoserver:
  base-url: 'https://kamco.geo-dev.gs.dabeeo.com/geoserver'
  workspace: 'cd' # Must be pre-created in GeoServer
  overwrite-existing: true # Delete existing layer before registration
  connection-timeout: 30000 # 30 seconds
  read-timeout: 60000 # 60 seconds
  # Credentials from environment variables (preferred):
  #   GEOSERVER_USERNAME, GEOSERVER_PASSWORD
```
**Spring Batch Metadata**:
```yaml
spring:
  batch:
    job:
      enabled: false # Prevent auto-run on startup
    jdbc:
      initialize-schema: always # Auto-create BATCH_* tables
```
## Database Integration
### Query Strategies
**Spring Batch Mode** (streaming):
```sql
-- InferenceResultItemReaderConfig.java
SELECT uid, map_id, probability, before_year, after_year,
       before_c, before_p, after_c, after_p,
       ST_AsText(geometry) AS geometry_wkt
FROM inference_results_testing
WHERE batch_id = ANY(?)
  AND ST_GeometryType(geometry) IN ('ST_Polygon', 'ST_MultiPolygon')
  AND ST_SRID(geometry) = 5186
  AND ST_X(ST_Centroid(geometry)) BETWEEN 125000 AND 530000
  AND ST_Y(ST_Centroid(geometry)) BETWEEN -600000 AND 988000
  AND ST_IsValid(geometry) = true
ORDER BY map_id, uid
-- Uses server-side cursor with fetch-size=1000
```
**Legacy Mode** (full load):
```sql
-- InferenceResultRepository.java
SELECT uid, map_id, probability, before_year, after_year,
       before_c, before_p, after_c, after_p,
       ST_AsText(geometry) AS geometry_wkt
FROM inference_results_testing
WHERE batch_id = ANY(?) AND map_id = ?
-- Returns the full List<InferenceResult> in memory
```
**Geometry Type Validation**:
```sql
-- GeometryTypeValidationTasklet.java
SELECT DISTINCT ST_GeometryType(geometry)
FROM inference_results_testing
WHERE batch_id = ANY(?) AND geometry IS NOT NULL
-- Pre-validates homogeneous geometry requirement
```
### Field Mapping
Database columns map to shapefile fields (10-character limit):
| Database Column | DB Type | Shapefile Field | Shapefile Type | Notes |
|-----------------|---------|-----------------|----------------|-------|
| uid | uuid | chnDtctId | String | Change detection ID |
| map_id | text | mpqd_no | String | Map quadrant number |
| probability | float8 | chn_dtct_p | Double | Change detection probability |
| before_year | bigint | cprs_yr | Long | Comparison year |
| after_year | bigint | crtr_yr | Long | Criteria year |
| before_c | text | bf_cls_cd | String | Before classification code |
| before_p | float8 | bf_cls_pro | Double | Before classification probability |
| after_c | text | af_cls_cd | String | After classification code |
| after_p | float8 | af_cls_pro | Double | After classification probability |
| geometry | geom | the_geom | Polygon | Geometry in EPSG:5186 |
**Field name source**: See `FeatureTypeFactory.java` (batch/util/FeatureTypeFactory.java:1-104)
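The mapping in the table can be captured as a lookup, with the DBF 10-character limit checked programmatically. A sketch for reference (field names are taken from the table above; `FeatureTypeFactory.java` remains the authoritative source, and the class here is hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Reference sketch of the column-to-field mapping; names copied from the
// field-mapping table. Shapefile DBF field names are limited to 10 characters.
public class FieldMapping {

  public static final Map<String, String> DB_TO_SHP = new LinkedHashMap<>();
  static {
    DB_TO_SHP.put("uid", "chnDtctId");
    DB_TO_SHP.put("map_id", "mpqd_no");
    DB_TO_SHP.put("probability", "chn_dtct_p");
    DB_TO_SHP.put("before_year", "cprs_yr");
    DB_TO_SHP.put("after_year", "crtr_yr");
    DB_TO_SHP.put("before_c", "bf_cls_cd");
    DB_TO_SHP.put("before_p", "bf_cls_pro");
    DB_TO_SHP.put("after_c", "af_cls_cd");
    DB_TO_SHP.put("after_p", "af_cls_pro");
    DB_TO_SHP.put("geometry", "the_geom");
  }

  /** True if every shapefile field name fits the 10-character DBF limit. */
  public static boolean allWithinDbfLimit() {
    return DB_TO_SHP.values().stream().allMatch(n -> n.length() <= 10);
  }
}
```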
### Coordinate Reference System
- **CRS**: EPSG:5186 (Korean 2000 / Central Belt)
- **Valid Coordinate Bounds**: X ∈ [125km, 530km], Y ∈ [-600km, 988km]
- **Encoding**: WKT in SQL → JTS Geometry → GeoTools SimpleFeature → `.prj` file
- **Validation**: Automatic in batch mode via `ST_X(ST_Centroid())` range check
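The bounds check above is performed in SQL via `ST_X(ST_Centroid())`; the equivalent predicate in Java looks like this (class and method names are illustrative only):

```java
// Sketch of the EPSG:5186 bounds predicate described above; in batch mode the
// equivalent check runs in SQL. Class and method names are illustrative.
public class CoordinateBounds {

  // Valid EPSG:5186 range in metres: X in [125000, 530000], Y in [-600000, 988000]
  public static boolean withinEpsg5186(double x, double y) {
    return x >= 125_000 && x <= 530_000 && y >= -600_000 && y <= 988_000;
  }
}
```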
## Dependencies
**Core Framework**:
- Spring Boot 3.5.7
- `spring-boot-starter`: DI container, logging
- `spring-boot-starter-jdbc`: JDBC template, HikariCP
- `spring-boot-starter-batch`: Spring Batch framework, job repository
- `spring-boot-starter-web`: RestTemplate for GeoServer API calls
- `spring-boot-starter-validation`: @NotBlank annotations
**Spatial Libraries**:
- GeoTools 30.0 (via OSGeo repository)
- `gt-shapefile`: Shapefile I/O (DataStore, FeatureStore, Transaction)
- `gt-geojson`: GeoJSON encoding/decoding
- `gt-referencing`: CRS transformations
- `gt-epsg-hsql`: EPSG database for CRS lookups
- JTS 1.19.0: Geometry primitives (Polygon, MultiPolygon, GeometryFactory)
- PostGIS JDBC 2.5.1: PostGIS geometry type support
**Database**:
- PostgreSQL JDBC Driver (latest)
- HikariCP (bundled with Spring Boot)
**Build Configuration**:
```gradle
// build.gradle
configurations.all {
  exclude group: 'javax.media', module: 'jai_core' // Conflicts with GeoTools
}

bootJar {
  archiveFileName = "shp-exporter.jar" // Fixed JAR name
}

spotless {
  java {
    googleJavaFormat('1.19.2') // 2-space indentation
  }
}
```
## Development Patterns
### Adding a New Step to Spring Batch Job
When adding steps to `mergedModeJob`, follow this pattern:
1. **Create Tasklet or ItemWriter** in `batch/tasklet/` or `batch/writer/`
2. **Define Step Bean** in `MergedModeJobConfig.java`:
```java
@Bean
public Step myNewStep(
    JobRepository jobRepository,
    PlatformTransactionManager transactionManager,
    MyTasklet tasklet,
    BatchExecutionHistoryListener historyListener) {
  return new StepBuilder("myNewStep", jobRepository)
      .tasklet(tasklet, transactionManager)
      .listener(historyListener) // REQUIRED for history tracking
      .build();
}
```
3. **Add to Job Flow** in `mergedModeJob()`:
```java
.next(myNewStep)
```
4. **Always include `BatchExecutionHistoryListener`** to track execution metrics
### Modifying ItemReader Configuration
ItemReaders are **not thread-safe**. Each step requires its own instance:
```java
// WRONG: Sharing one reader between steps
@Bean
public JdbcCursorItemReader<InferenceResult> reader() { ... }

// RIGHT: Separate readers with @StepScope
@Bean
@StepScope // Creates a new instance per step
public JdbcCursorItemReader<InferenceResult> shapefileReader() { ... }

@Bean
@StepScope
public JdbcCursorItemReader<InferenceResult> geoJsonReader() { ... }
```
See `InferenceResultItemReaderConfig.java` for working examples.
### Streaming Writers Pattern
When writing custom streaming writers, follow `StreamingShapefileWriter` pattern:
```java
@Component
@StepScope
public class MyStreamingWriter implements ItemStreamWriter<MyType> {

  private Transaction transaction;

  @Override
  public void open(ExecutionContext context) {
    // ItemStream callback (not @BeforeStep): open resources, start transaction
    transaction = new DefaultTransaction("create");
  }

  @Override
  public void write(Chunk<? extends MyType> chunk) {
    // Write the chunk incrementally -- do NOT accumulate items in memory
  }

  @AfterStep
  public ExitStatus afterStep(StepExecution stepExecution) {
    transaction.commit(); // Commit all chunks
    transaction.close();
    return ExitStatus.COMPLETED;
  }
}
```
### JobParameters and StepExecutionContext
**Pass data between steps** using `StepExecutionContext`:
```java
// Step 1: Store data in the step's ExecutionContext
stepExecution.getExecutionContext().putString("geometryType", "ST_Polygon");
// (promote the key to the job ExecutionContext with
// ExecutionContextPromotionListener so later steps can read it)

// Step 2: Retrieve data from the job's ExecutionContext
@BeforeStep
public void beforeStep(StepExecution stepExecution) {
  String geomType = stepExecution.getJobExecution()
      .getExecutionContext()
      .getString("geometryType");
}
```
**Job-level parameters** from command line:
```java
// ConverterCommandLineRunner.buildJobParameters()
JobParametersBuilder builder = new JobParametersBuilder();
builder.addString("inferenceId", converterProperties.getInferenceId());
builder.addLong("timestamp", System.currentTimeMillis()); // Ensures uniqueness
```
### Partitioning Pattern (Map ID Processing)
The `generateMapIdFilesStep` uses partitioning but runs **sequentially** to avoid DB connection pool exhaustion:
```java
@Bean
public Step generateMapIdFilesStep(...) {
  return new StepBuilder("generateMapIdFilesStep", jobRepository)
      .partitioner("mapIdWorker", partitioner)
      .step(mapIdWorkerStep)
      .taskExecutor(new SyncTaskExecutor()) // SEQUENTIAL execution
      .build();
}
```
For parallel execution in future (requires connection pool tuning):
```java
.taskExecutor(new SimpleAsyncTaskExecutor())
.gridSize(4) // 4 concurrent workers
```
### GeoServer REST API Integration
GeoServer operations use `RestTemplate` with custom error handling:
```java
// GeoServerRegistrationService.java
try {
  restTemplate.exchange(url, HttpMethod.PUT, entity, String.class);
} catch (HttpClientErrorException e) {
  if (e.getStatusCode() == HttpStatus.NOT_FOUND) {
    // Handle workspace not found
  }
}
```
Always check workspace existence before layer registration.
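For illustration only, the existence check can be expressed with the JDK's `HttpClient` types (the project itself uses `RestTemplate`). The request targets GeoServer's REST path `/rest/workspaces/{name}` with HTTP Basic auth; a 200 response means the workspace exists. Everything below is a hypothetical sketch:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Base64;

// Stdlib sketch of building the workspace-existence request; the project
// uses RestTemplate instead. Class and method names are illustrative.
public class WorkspaceCheck {

  public static HttpRequest buildRequest(
      String baseUrl, String workspace, String user, String password) {
    // HTTP Basic auth header from the GEOSERVER_USERNAME/GEOSERVER_PASSWORD pair
    String auth = Base64.getEncoder().encodeToString((user + ":" + password).getBytes());
    return HttpRequest.newBuilder()
        .uri(URI.create(baseUrl + "/rest/workspaces/" + workspace))
        .header("Authorization", "Basic " + auth)
        .GET() // send with HttpClient; 200 = exists, 404 = create it first
        .build();
  }
}
```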
### Testing Considerations
- **Unit tests**: Mock `JdbcTemplate`, `DataSource` for repository tests
- **Integration tests**: Use `@SpringBatchTest` with embedded H2 database
- **GeoTools**: Use `MemoryDataStore` for shapefile writer tests
- **Current state**: Limited test coverage (focus on critical path validation)
Refer to `claudedocs/SPRING_BATCH_MIGRATION.md` for detailed batch architecture documentation.