TechMentor - Hands-on know-how from a senior architect with global tech company experience

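1. Monitoring Concepts and Why They Matter

Why Application Monitoring Is Needed

The Complexity of Modern Systems

As microservice architectures, cloud environments, and distributed systems have become the norm, system complexity has grown sharply. In such environments, traditional monitoring approaches are not enough to understand the state of the system.

Distributed systems: complex dependencies created by interactions between many services
Dynamic environments: infrastructure that keeps changing because of containers and autoscaling
Diverse technology stacks: combinations of technologies with different characteristics
High availability requirements: the expectation that services run 24/7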

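Core Goals of Monitoring

Operations perspective

Real-time visibility into system state
Early detection and prevention of failures
Identification of performance bottlenecks
Optimization of resource usage
Verification of SLA/SLO compliance

Business perspective

Improved user experience
Tracking of business metrics
Data to support decision-making
Cost optimization
Regulatory compliance and auditing

Problems That Arise Without Monitoring

Delayed incident response

Problems are discovered only through user reports, so the response comes late.

Difficulty finding the root cause

Logs alone are rarely enough to trace the cause of a problem in a complex system.

Unnoticed performance degradation

Gradual slowdowns go unnoticed and the user experience deteriorates.

Wasted resources

Over-provisioned resources drive up cost.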

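Types and Classification of Metrics

Four Golden Signals (Google SRE)

These are the core signals for system monitoring proposed by Google SRE.

1. Latency

The time it takes to service a request

Response time
Processing time
Network latency

2. Traffic

The amount of load placed on the system

Requests per second (RPS)
Concurrent users
Network I/O

3. Errors

The rate of requests that fail

HTTP 4xx and 5xx errors
Exception rate
Timeout rate

4. Saturation

Resource utilization and queued work

CPU and memory utilization
Disk I/O utilization
Queue length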
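
The four signals can be made concrete with a small calculation. The following is a minimal sketch, assuming requests have already been collected as in-memory samples; the RequestSample and GoldenSignals names are hypothetical and exist only for illustration. In a real system a metrics library would compute these values.

```java
import java.util.List;

/** Hypothetical request sample: latency in milliseconds and whether the request failed. */
final class RequestSample {
    final long latencyMs;
    final boolean failed;

    RequestSample(long latencyMs, boolean failed) {
        this.latencyMs = latencyMs;
        this.failed = failed;
    }
}

final class GoldenSignals {

    /** Traffic: requests per second over the observation window. */
    static double requestsPerSecond(List<RequestSample> samples, double windowSeconds) {
        return samples.size() / windowSeconds;
    }

    /** Errors: percentage of failed requests (0 when there were no requests). */
    static double errorRatePercent(List<RequestSample> samples) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        long failed = samples.stream().filter(s -> s.failed).count();
        return 100.0 * failed / samples.size();
    }

    /** Latency: nearest-rank percentile, for example p = 0.95 for P95. */
    static long latencyPercentileMs(List<RequestSample> samples, double p) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.stream().mapToLong(s -> s.latencyMs).sorted().toArray();
        int rank = (int) Math.ceil(p * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Saturation: fraction of a bounded queue that is currently occupied. */
    static double queueSaturation(int queued, int capacity) {
        return (double) queued / capacity;
    }
}
```

Note how a single slow request dominates P95 in a small sample while barely moving the average, which is why latency is usually tracked as percentiles.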

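Metric Classification

1. Infrastructure metrics

System resources

CPU utilization (%)
Memory usage (MB/GB)
Disk utilization (%)
Network bandwidth (Mbps)

I/O performance

Disk IOPS
Network packet count
File system utilization
Swap usage

Processes

Process count
Thread count
File descriptors
Load average

2. Application metrics

Performance indicators

Response time (ms)
Throughput (TPS/RPS)
Concurrent connections
Queue wait time

Reliability indicators

Error rate (%)
Availability (%)
Timeout rate
Retry count

JVM indicators

Heap memory usage
GC count and time
Thread pool state
Loaded class count

3. Business metrics

User behavior

Active users (DAU/MAU)
Session duration
Page views
Conversion rate

Business KPIs

Revenue
Order count
Customer satisfaction
Churn rate

Metric Collection Models

Push model

The application sends its metrics directly to the monitoring system.

Pros: near-real-time delivery, efficient use of the network
Cons: extra load on the application, metrics are lost when delivery fails
Examples: StatsD, CloudWatch, DataDog

Pull model

The monitoring system periodically collects metrics from the application.

Pros: centralized control, stability
Cons: network overhead, possible delay (values are only as fresh as the last scrape)
Examples: Prometheus, Nagios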
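
The difference between the two models shows up in who initiates the transfer. The following is a minimal sketch, assuming a single in-process counter; RequestCounter and its pushSink parameter are hypothetical names, the pushed line uses the StatsD counter format, and the scraped line uses the Prometheus text format.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical in-process counter that can be read through either collection model. */
final class RequestCounter {
    private final AtomicLong total = new AtomicLong();
    private final Consumer<String> pushSink; // in a real push setup, for example a UDP sender

    RequestCounter(Consumer<String> pushSink) {
        this.pushSink = pushSink;
    }

    /** Push model: each event is sent out immediately as a StatsD-style counter line. */
    void onRequest() {
        total.incrementAndGet();
        pushSink.accept("requests:1|c");
    }

    /** Pull model: the collector calls this on its own schedule and reads the running total. */
    String scrape() {
        return "requests_total " + total.get() + "\n";
    }
}
```

With push, a failed send loses that event; with pull, the running total survives a missed scrape and is simply read on the next one.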

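Monitoring Strategy and Best Practices

The Three Pillars of Observability

Metrics

Numeric time-series data

• Summary of system state
• Trend analysis
• Basis for alerting

Logs

Records of events

• Detailed context
• Debugging information
• Audit trail

Traces

Tracking of a request's path

• Visibility into distributed systems
• Identification of performance bottlenecks
• Dependency analysis

Monitoring Design Principles

1. Purpose-driven design

Define clearly what you are monitoring and why, and collect only the metrics you need.

2. Layered monitoring

Collect appropriate metrics at each layer: infrastructure, platform, application, and business.

3. Appropriate granularity

Too fine-grained and the data turns into noise; too coarse and important information is missed.

4. Automated alerting

Monitor proactively with threshold-based alerts and anomaly detection.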

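2. Spring Boot Actuator

Spring Boot Actuator Overview

What Is Actuator?

Spring Boot Actuator is a library that provides features for monitoring and managing an application in production. It exposes the application's health, metrics, environment information, and more through HTTP endpoints or JMX.

Key features

Application health checks
Metric collection and exposure
Environment information lookup
Changing log levels at runtime
Thread dump generation
Heap dump generation

Advantages

Usable through auto-configuration as soon as the dependency is added (over HTTP only a minimal set of endpoints such as health is exposed by default; the rest must be opted in)
Standardized endpoints
Extensible structure
Security configuration support
Integration with a variety of monitoring tools
Live information while the application is running

Actuator Configuration

Adding the Dependency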

Maven

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<!-- Micrometer Prometheus integration (optional) -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Gradle

implementation 'org.springframework.boot:spring-boot-starter-actuator'

// Micrometer Prometheus integration (optional)
implementation 'io.micrometer:micrometer-registry-prometheus'

Basic Configuration

application.yml

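# Basic Actuator configuration
management:
  # Endpoint exposure over HTTP
  endpoints:
    web:
      base-path: /actuator
      exposure:
        # Endpoints to expose over the web
        include: health,info,metrics,prometheus
        # Endpoints to exclude (exclude takes precedence over include)
        exclude: shutdown

  # Per-endpoint settings
  endpoint:
    health:
      # When to show details
      show-details: when-authorized
      show-components: always
    info:
      enabled: true
    metrics:
      enabled: true
    prometheus:
      enabled: true

  # Note: management.security.enabled was removed in Spring Boot 2.0.
  # Endpoint access is secured through Spring Security configuration instead.

  # Separate management port (optional)
  server:
    port: 8081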

application.properties

# Endpoint exposure
management.endpoints.web.exposure.include=health,info,metrics,prometheus
management.endpoints.web.base-path=/actuator

# Health endpoint detail settings
management.endpoint.health.show-details=when-authorized
management.endpoint.health.show-components=always

# Enable the info endpoint
management.endpoint.info.enabled=true

# Enable the metrics endpoint
management.endpoint.metrics.enabled=true

# Enable the Prometheus endpoint
management.endpoint.prometheus.enabled=true

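Security Configuration

Spring Security configuration

// Assumes Spring Security 5.8+ / 6.x; older versions named securityMatcher(...) requestMatcher(...).
import org.springframework.boot.actuate.autoconfigure.security.servlet.EndpointRequest;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.boot.actuate.info.InfoEndpoint;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.core.userdetails.User;
import org.springframework.security.core.userdetails.UserDetails;
import org.springframework.security.core.userdetails.UserDetailsService;
import org.springframework.security.provisioning.InMemoryUserDetailsManager;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
@EnableWebSecurity
public class ActuatorSecurityConfig {

    @Bean
    public SecurityFilterChain actuatorSecurityFilterChain(HttpSecurity http) throws Exception {
        return http
            // Apply this filter chain to Actuator endpoints only
            .securityMatcher(EndpointRequest.toAnyEndpoint())
            .authorizeHttpRequests(auth -> auth
                // The health endpoint is accessible without authentication
                .requestMatchers(EndpointRequest.to(HealthEndpoint.class)).permitAll()
                // The info endpoint is accessible without authentication
                .requestMatchers(EndpointRequest.to(InfoEndpoint.class)).permitAll()
                // All other endpoints require the ADMIN role
                .anyRequest().hasRole("ADMIN")
            )
            .httpBasic(Customizer.withDefaults())
            .build();
    }

    @Bean
    public UserDetailsService userDetailsService() {
        // Demo only: a plain-text in-memory password must not be used in production
        UserDetails admin = User.builder()
            .username("admin")
            .password("{noop}admin123")
            .roles("ADMIN")
            .build();
        return new InMemoryUserDetailsManager(admin);
    }
}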

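Separate port configuration

# Separate the application port from the management port
server.port=8080
management.server.port=8081

# management.security.enabled no longer exists (removed in Spring Boot 2.0);
# restrict access to the management port with Spring Security and network rules.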

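Key Endpoints

Health Endpoint

This endpoint reports the application's health. It aggregates the status of the database, disk space, external services, and other components into one overall status.

Basic usage

# Basic health check
GET /actuator/health

# Example response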
{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "MySQL",
        "validationQuery": "isValid()"
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 499963174912,
        "free": 91943821312,
        "threshold": 10485760,
        "exists": true
      }
    }
  }
}
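
The top-level status in the response above is derived from the component statuses. The following is a minimal sketch of that aggregation, assuming Spring Boot's documented default severity order (DOWN, OUT_OF_SERVICE, UP, UNKNOWN); the HealthAggregation class is a hypothetical stand-in written with plain strings, not the framework's own aggregator.

```java
import java.util.List;
import java.util.Map;

final class HealthAggregation {

    // Assumed default severity order, most severe first.
    static final List<String> ORDER = List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    /** Returns the most severe known status among the components, or UNKNOWN if there is none. */
    static String aggregate(Map<String, String> componentStatuses) {
        String worst = "UNKNOWN";
        for (String status : componentStatuses.values()) {
            if (rank(status) < rank(worst)) {
                worst = status;
            }
        }
        return worst;
    }

    // Statuses outside the order are ranked last, so they never win.
    private static int rank(String status) {
        int index = ORDER.indexOf(status);
        return index < 0 ? ORDER.size() : index;
    }
}
```

One failing component is therefore enough to turn the whole application DOWN, which matters when a load balancer or orchestrator acts on this endpoint.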

Custom Health Indicator

@Component
public class CustomHealthIndicator implements HealthIndicator {
    
    @Override
    public Health health() {
        // Check the status of the external service
        boolean externalServiceUp = checkExternalService();
        
        if (externalServiceUp) {
            return Health.up()
                .withDetail("external-service", "Available")
                .withDetail("response-time", "150ms")
                .build();
        } else {
            return Health.down()
                .withDetail("external-service", "Unavailable")
                .withDetail("error", "Connection timeout")
                .build();
        }
    }
    
    private boolean checkExternalService() {
        // Actual external service check
        try {
            // HTTP call or other check logic goes here
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}

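Info Endpoint

This endpoint provides information about the application. It can include build information, Git information, environment information, and more.

Configuration

# application.yml
management:
  info:
    # Expose properties under the info.* prefix (needed for the custom info block below)
    env:
      enabled: true
    # Include Java runtime information
    java:
      enabled: true
    # Include OS information
    os:
      enabled: true

# Custom information
info:
  app:
    name: My Spring Boot Application
    description: Monitoring demo application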
    version: 1.0.0
  team:
    name: Development Team
    email: dev@company.com

Adding information programmatically

@Component
public class CustomInfoContributor implements InfoContributor {
    
    @Override
    public void contribute(Info.Builder builder) {
        builder.withDetail("custom", Map.of(
            "feature-flags", getFeatureFlags(),
            "database-version", getDatabaseVersion(),
            "last-deployment", getLastDeploymentTime()
        ));
    }
    
    private Map<String, Boolean> getFeatureFlags() {
        return Map.of(
            "new-ui", true,
            "beta-feature", false
        );
    }
    
    private String getDatabaseVersion() {
        // Database version lookup logic
        return "MySQL 8.0.25";
    }
    
    private String getLastDeploymentTime() {
        return LocalDateTime.now().minusHours(2).toString();
    }
}

Metrics Endpoint

This endpoint provides a variety of application metrics, such as JVM, HTTP request, and database connection metrics.

Basic usage

# List all metric names
GET /actuator/metrics

# Example response
{
  "names": [
    "jvm.memory.used",
    "jvm.memory.max",
    "jvm.gc.pause",
    "http.server.requests",
    "system.cpu.usage",
    "process.uptime"
  ]
}

# Details of a specific metric
GET /actuator/metrics/jvm.memory.used

# Example response
{
  "name": "jvm.memory.used",
  "description": "The amount of used memory",
  "baseUnit": "bytes",
  "measurements": [
    {
      "statistic": "VALUE",
      "value": 123456789
    }
  ],
  "availableTags": [
    {
      "tag": "area",
      "values": ["heap", "nonheap"]
    },
    {
      "tag": "id",
      "values": ["PS Eden Space", "PS Old Gen"]
    }
  ]
}

Filtering by tag

# Heap memory only
GET /actuator/metrics/jvm.memory.used?tag=area:heap

# HTTP requests for a specific URI only
GET /actuator/metrics/http.server.requests?tag=uri:/api/users

# Combining multiple tags
GET /actuator/metrics/http.server.requests?tag=uri:/api/users&tag=status:200

Other Useful Endpoints

Environment information

# View environment variables and properties
GET /actuator/env

# View a specific property
GET /actuator/env/server.port

Logger configuration

# List loggers
GET /actuator/loggers

# View the level of a specific logger
GET /actuator/loggers/com.example

# Change the log level (POST with a JSON body)
POST /actuator/loggers/com.example
{
  "configuredLevel": "DEBUG"
}

Thread dump

# Generate a thread dump
GET /actuator/threaddump

Heap dump

# Generate a heap dump (binary)
GET /actuator/heapdump

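Operational Considerations

Security considerations

Prevent exposure of sensitive information

Make sure passwords, API keys, and similar secrets contained in environment variables and configuration are not exposed.

Access control

In production, always restrict access through authentication and authorization.

Network isolation

Run the management endpoints on a separate port and make it reachable only from the internal network.

Performance considerations

Metric collection overhead

Collecting unnecessary metrics can affect performance, so enable only what you need.

Use caching

Cache frequently requested information to keep endpoint calls cheap.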

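3. Custom Metrics

MeterRegistry and Micrometer

Micrometer Overview

Micrometer is a metrics instrumentation library for JVM-based applications. It provides a single facade that can publish to many monitoring systems (Prometheus, DataDog, New Relic, and others).

Key characteristics

Vendor-neutral metrics API
Support for many monitoring systems
Dimensional (tag-based) metrics
Spring Boot auto-configuration support
Low overhead

Supported monitoring systems

Prometheus
Grafana (as the visualization layer on top of a backend such as Prometheus, rather than a metrics store of its own)
DataDog
New Relic
CloudWatch
InfluxDB

MeterRegistry Configuration

Basic configuration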

@Configuration
public class MetricsConfig {
    
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
        return registry -> registry.config()
            .commonTags("application", "my-app")
            .commonTags("environment", "production")
            .commonTags("version", "1.0.0");
    }
    
    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }
}

Implementing Each Metric Type

Counter

A counter measures a monotonically increasing value. Use it for request counts, error counts, completed task counts, and the like.

Basic usage

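import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final MeterRegistry meterRegistry;
    private final Counter orderCounter;

    public OrderService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.orderCounter = Counter.builder("orders.created")
            .description("Number of orders created")
            .tag("type", "online")
            .register(meterRegistry);
    }

    public void createOrder(Order order) {
        try {
            // Order processing logic
            processOrder(order);

            // Increment the counter on success
            orderCounter.increment();

        } catch (Exception e) {
            // Counter.increment() takes no tags, so the counter for this error type
            // is looked up (or created) through the registry.
            Counter.builder("orders.errors")
                .description("Number of order processing errors")
                .tag("error.type", e.getClass().getSimpleName())
                .register(meterRegistry)
                .increment();
            throw e;
        }
    }

    // Adding tags dynamically: register() returns the existing counter when the
    // same name and tags were registered before.
    // Note: Prometheus expects every series of one metric name to use the same tag keys.
    public void createOrderWithTags(Order order) {
        Counter.builder("orders.created")
            .tag("payment.method", order.getPaymentMethod())
            .tag("customer.type", order.getCustomerType())
            .register(meterRegistry)
            .increment();
    }
}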

Gauge

A gauge measures a current value. Use it for values that can go up and down, such as memory usage, queue size, or the number of active connections.

Basic usage

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.stereotype.Component;

@Component
public class SystemMetrics {

    private final AtomicInteger activeConnections = new AtomicInteger(0);
    private final Queue<String> messageQueue = new ConcurrentLinkedQueue<>();

    public SystemMetrics(MeterRegistry meterRegistry) {
        // Gauge.builder takes the observed object and the value function;
        // register() takes only the registry.
        Gauge.builder("connections.active", activeConnections, AtomicInteger::get)
            .description("Number of active connections")
            .register(meterRegistry);

        // Collection size as a gauge
        Gauge.builder("queue.size", messageQueue, Queue::size)
            .description("Size of message queue")
            .register(meterRegistry);

        // Gauge backed by a method reference
        Gauge.builder("memory.used", this, SystemMetrics::getUsedMemory)
            .description("Used memory in bytes")
            .register(meterRegistry);
    }

    public void addConnection() {
        activeConnections.incrementAndGet();
    }

    public void removeConnection() {
        activeConnections.decrementAndGet();
    }

    private double getUsedMemory() {
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }
}

Using MultiGauge

@Component
public class DatabaseMetrics {
    
    private final MultiGauge connectionPoolGauge;
    
    public DatabaseMetrics(MeterRegistry meterRegistry) {
        this.connectionPoolGauge = MultiGauge.builder("db.connections")
            .description("Database connection pool metrics")
            .register(meterRegistry);
    }
    
    @Scheduled(fixedRate = 30000) // update every 30 seconds
    public void updateConnectionMetrics() {
        List<MultiGauge.Row<?>> rows = Arrays.asList(
            MultiGauge.Row.of(Tags.of("pool", "primary", "state", "active"), 
                             getActiveConnections("primary")),
            MultiGauge.Row.of(Tags.of("pool", "primary", "state", "idle"), 
                             getIdleConnections("primary")),
            MultiGauge.Row.of(Tags.of("pool", "secondary", "state", "active"), 
                             getActiveConnections("secondary")),
            MultiGauge.Row.of(Tags.of("pool", "secondary", "state", "idle"), 
                             getIdleConnections("secondary"))
        );
        
        connectionPoolGauge.register(rows, true);
    }
}

Timer

A timer measures both how long something takes and how often it happens. Use it for method execution time, HTTP request handling time, and so on.

Basic usage

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    private final Timer paymentTimer;
    private final MeterRegistry meterRegistry;

    public PaymentService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.paymentTimer = Timer.builder("payment.processing.time")
            .description("Payment processing time")
            .register(meterRegistry);
    }

    // Measuring with Timer.Sample
    // Note: Prometheus expects every series of one metric name to use the same tag keys.
    public PaymentResult processPayment(PaymentRequest request) {
        Timer.Sample sample = Timer.start(meterRegistry);

        try {
            PaymentResult result = doProcessPayment(request);

            // Tag the measurement on success
            sample.stop(Timer.builder("payment.processing.time")
                .tag("status", "success")
                .tag("method", request.getPaymentMethod())
                .register(meterRegistry));

            return result;

        } catch (Exception e) {
            // Tag the measurement on failure
            sample.stop(Timer.builder("payment.processing.time")
                .tag("status", "error")
                .tag("error.type", e.getClass().getSimpleName())
                .register(meterRegistry));
            throw e;
        }
    }

    // Simpler measurement with record(Supplier); recordCallable would force
    // the method to declare a checked Exception.
    public PaymentResult processPaymentSimple(PaymentRequest request) {
        return paymentTimer.record(() -> doProcessPayment(request));
    }
}

Using the @Timed annotation

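@Service
public class UserService {

    private final UserRepository userRepository;

    public UserService(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    // @Timed takes the metric name in "value"; the TimedAspect bean must be registered.
    @Timed(value = "user.creation.time",
           description = "Time taken to create user")
    public User createUser(UserRequest request) {
        // User creation logic
        return new User(request);
    }

    @Timed(value = "user.search.time",
           description = "Time taken to search users",
           extraTags = {"operation", "search"})
    public List<User> searchUsers(String query) {
        // User search logic
        return userRepository.findByNameContaining(query);
    }

    // Long-running task: measured with a LongTaskTimer instead of a Timer
    @Timed(value = "user.update.time",
           description = "Time taken to update user",
           longTask = true)
    public User updateUser(Long id, UserRequest request) {
        // User update logic (applying the request to the entity is left out here)
        User user = userRepository.findById(id).orElseThrow();
        return userRepository.save(user);
    }
}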

Distribution Summary

A distribution summary measures the distribution of values. Use it for request sizes, response sizes, batch sizes, and the like.

Basic usage

@Service
public class FileService {
    
    private final DistributionSummary fileSizeSummary;
    private final DistributionSummary batchSizeSummary;
    
    public FileService(MeterRegistry meterRegistry) {
        this.fileSizeSummary = DistributionSummary.builder("file.size")
            .description("Size of uploaded files")
            .baseUnit("bytes")
            .register(meterRegistry);
            
        this.batchSizeSummary = DistributionSummary.builder("batch.size")
            .description("Number of items in batch")
            .register(meterRegistry);
    }
    
    public void uploadFile(MultipartFile file) {
        // Record the file size
        fileSizeSummary.record(file.getSize());

        // Handle the upload
        processFile(file);
    }

    public void processBatch(List<Item> items) {
        // Record the batch size
        batchSizeSummary.record(items.size());

        // Process the batch
        items.forEach(this::processItem);
    }
}

Implementing Business Metrics

Real-world business scenarios

E-commerce metrics

@Service
public class EcommerceMetrics {

    private final MeterRegistry meterRegistry;
    private final DistributionSummary orderValueSummary;

    public EcommerceMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.orderValueSummary = DistributionSummary.builder("ecommerce.order.value")
            .description("Distribution of order values")
            .baseUnit("currency")
            .register(meterRegistry);
    }

    public void recordOrder(Order order) {
        // Order count. Counter.increment() takes no tags, so the tagged counter is
        // resolved through the registry (same name and tags return the same meter).
        Counter.builder("ecommerce.orders.total")
            .description("Total number of orders")
            .tag("category", order.getCategory())
            .tag("payment.method", order.getPaymentMethod())
            .tag("customer.segment", order.getCustomerSegment())
            .register(meterRegistry)
            .increment();

        // Revenue. No baseUnit here: the Prometheus naming convention would append
        // it to the exported name, which the dashboard queries do not expect.
        Counter.builder("ecommerce.revenue.total")
            .description("Total revenue")
            .tag("category", order.getCategory())
            .register(meterRegistry)
            .increment(order.getTotalAmount());

        // Order value distribution
        orderValueSummary.record(order.getTotalAmount());
    }

    @EventListener
    public void handleCheckoutEvent(CheckoutEvent event) {
        Timer.Sample sample = Timer.start(meterRegistry);
        // Checkout timing logic goes here
        sample.stop(Timer.builder("ecommerce.checkout.duration")
            .description("Time taken for checkout process")
            .tag("result", event.isSuccess() ? "success" : "failure")
            .register(meterRegistry));
    }
}

User behavior metrics

@Component
public class UserBehaviorMetrics {

    private final MeterRegistry meterRegistry;
    private final Timer sessionDurationTimer;

    private final Set<String> activeUsers = ConcurrentHashMap.newKeySet();

    public UserBehaviorMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;

        this.sessionDurationTimer = Timer.builder("user.session.duration")
            .description("User session duration")
            .register(meterRegistry);

        // Gauge.builder takes the observed object and the value function
        Gauge.builder("user.active.count", activeUsers, Set::size)
            .description("Number of active users")
            .register(meterRegistry);
    }

    public void recordPageView(String userId, String page) {
        // Tagged counters are resolved through the registry; increment() takes no tags
        Counter.builder("user.pageviews")
            .description("Number of page views")
            .tag("page", page)
            .tag("user.type", getUserType(userId))
            .register(meterRegistry)
            .increment();

        activeUsers.add(userId);
    }

    public void recordUserAction(String userId, String action) {
        Counter.builder("user.actions")
            .description("User actions performed")
            .tag("action", action)
            .tag("user.type", getUserType(userId))
            .register(meterRegistry)
            .increment();
    }

    public void recordSessionEnd(String userId, Duration sessionDuration) {
        sessionDurationTimer.record(sessionDuration);
        activeUsers.remove(userId);
    }

    // Placeholder: the original does not show how the user type is resolved
    private String getUserType(String userId) {
        return "registered";
    }
}

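Metric Best Practices

Naming conventions

Use lowercase words separated by dots (.)
Use meaningful names
State the unit (bytes, seconds, and so on)
Keep the naming convention consistent

Using tags

Watch cardinality (avoid too many combinations)
Classify along meaningful dimensions
Use common tags
Limit dynamic tag values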
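
Cardinality grows multiplicatively: every distinct combination of tag values becomes its own time series. The following is a minimal sketch of that arithmetic, assuming the number of distinct values per tag key is known; the TagCardinality class and the tag names in the usage note are hypothetical.

```java
import java.util.Map;

final class TagCardinality {

    /** Upper bound on time series for one metric: the product of distinct values per tag key. */
    static long maxSeries(Map<String, Integer> distinctValuesPerTag) {
        long series = 1;
        for (int distinct : distinctValuesPerTag.values()) {
            series = Math.multiplyExact(series, (long) distinct);
        }
        return series;
    }
}
```

For example, 5 methods, 10 status codes, and 50 URIs give at most 2,500 series, while adding a user ID tag with 100,000 values raises the bound to 250,000,000. This is why unbounded values such as user IDs do not belong in tags.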

Performance considerations

// Good: reuse meter instances
@Service
public class OptimizedMetrics {
    
    private final Counter successCounter;
    private final Counter errorCounter;
    
    public OptimizedMetrics(MeterRegistry registry) {
        this.successCounter = Counter.builder("api.requests")
            .tag("result", "success")
            .register(registry);
        this.errorCounter = Counter.builder("api.requests")
            .tag("result", "error")
            .register(registry);
    }
    
    public void recordSuccess() {
        successCounter.increment(); // fast: no registry lookup
    }
}

// Bad: go through the builder and the registry on every call
public void recordRequest(String result) {
    Counter.builder("api.requests")
        .tag("result", result)
        .register(registry)
        .increment(); // slower (lookup per call); unbounded tag values also grow memory
}

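4. Prometheus Integration

Prometheus Overview

What Is Prometheus?

Prometheus is an open-source monitoring and alerting system. It has a built-in time-series database, collects metrics with a pull model, and provides a powerful query language (PromQL).

Key characteristics

Multi-dimensional data model (label-based)
Powerful query language (PromQL)
Pull-based metric collection
Service discovery support
Alerting rules
Built-in web UI

Architecture components

Prometheus Server (scrapes and stores metrics)
Client Libraries (expose metrics)
Pushgateway (for batch jobs)
Exporters (third-party metrics)
Alertmanager (alert handling)
Grafana (visualization; a separate project commonly paired with Prometheus)

Integrating Spring Boot with Prometheus

Dependency setup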

Maven

<dependencies>
    <!-- Spring Boot Actuator -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    
    <!-- Micrometer Prometheus Registry -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>
</dependencies>

Gradle

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-actuator'
    implementation 'io.micrometer:micrometer-registry-prometheus'
}

Application configuration

application.yml

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    prometheus:
      enabled: true
    metrics:
      enabled: true
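  metrics:
    # Spring Boot 2.x property path; Spring Boot 3.x uses management.prometheus.metrics.export.*
    export:
      prometheus:
        enabled: true
        # Step size used when computing windowed statistics (default: 1m)
        step: 1m
        # Publish metric descriptions in the scrape payload
        descriptions: true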
    distribution:
      percentiles-histogram:
        http.server.requests: true
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99
      slo:
        http.server.requests: 10ms,50ms,100ms,200ms,500ms
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:default}

# Application information
spring:
  application:
    name: monitoring-demo
info:
  app:
    name: ${spring.application.name}
    version: 1.0.0
    description: Spring Boot Monitoring Demo

Customizing the Prometheus Configuration

Customizing the MeterRegistry

@Configuration
public class PrometheusConfig {

    @Bean
    public MeterRegistryCustomizer<PrometheusMeterRegistry> prometheusCustomizer() {
        return registry -> {
            registry.config()
                // Common tags
                .commonTags(
                    "application", "monitoring-demo",
                    "version", "1.0.0",
                    "environment", getEnvironment()
                )
                // Rename a tag: uri becomes endpoint (queries must then filter on endpoint)
                .meterFilter(MeterFilter.renameTag("http.server.requests", "uri", "endpoint"))
                // Exclude specific metrics
                .meterFilter(MeterFilter.deny(id -> id.getName().startsWith("jvm.gc.overhead")))
                // Cap the expected maximum value, which bounds the histogram buckets
                .meterFilter(MeterFilter.maxExpected("http.server.requests",
                    Duration.ofSeconds(5)));
        };
    }

    // Declare TimedAspect only once per application (MetricsConfig above already defines it)
    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }

    private String getEnvironment() {
        // Only sees the profile when it is passed as a JVM system property
        return System.getProperty("spring.profiles.active", "default");
    }
}

Prometheus Server Configuration

prometheus.yml

Basic configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
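  # Spring Boot application (when Prometheus runs inside the Docker Compose network, target app:8080 instead of localhost:8080)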
  - job_name: 'spring-boot-app'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8080']
        labels:
          application: 'monitoring-demo'
          environment: 'production'
  
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (system metrics)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

Service discovery configuration

scrape_configs:
  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: ${1}:${2}
        target_label: __address__

  # File-based service discovery (for example, target files written for Docker Compose services)
  - job_name: 'docker-compose'
    file_sd_configs:
      - files:
        - '/etc/prometheus/targets/*.json'
        refresh_interval: 30s

Docker Compose Configuration

docker-compose.yml

version: '3.8'

services:
  # Spring Boot application
  app:
    build: .
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=docker
    labels:
      - "prometheus.io/scrape=true"
      - "prometheus.io/port=8080"
      - "prometheus.io/path=/actuator/prometheus"

  # Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'

  # Grafana
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

  # Node Exporter
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

volumes:
  grafana-storage:

5. Grafana Dashboards

Grafana Overview

What Is Grafana?

Grafana is an open-source platform for visualizing and analyzing metric data. It supports many data sources and provides interactive dashboards.

Key features

A variety of visualization panels (graphs, tables, heatmaps, and more)
Real-time data monitoring
Alerting and notifications
Dashboard templates and variables
User permission management
Plugin ecosystem

Supported data sources

Prometheus
InfluxDB
Elasticsearch
MySQL/PostgreSQL
CloudWatch
DataDog

Grafana Setup and Integration

Prometheus data source setup

Manual setup

1. Open Grafana: http://localhost:3000 (admin/admin)

2. Configuration → Data Sources → Add data source

3. Select Prometheus

4. Set the URL: http://prometheus:9090 (Docker) or http://localhost:9090

5. Click Save & Test

Automatic provisioning

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      httpMethod: POST
      queryTimeout: 60s
      timeInterval: 15s

Dashboard provisioning

Dashboard provider configuration

# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards

Dashboard Layout

Spring Boot application dashboard

Main panels

1. Application status

• Application uptime
• Health check status
• Instance count

2. HTTP request metrics

• Request rate (RPS)
• Response time (average, P95, P99)
• Error rate

3. JVM metrics

• Heap memory usage
• GC time and count
• Thread count

4. System resources

• CPU utilization
• Memory utilization
• Disk I/O

Key PromQL queries

# HTTP request rate (RPS)
rate(http_server_requests_seconds_count[5m])

# Average response time
rate(http_server_requests_seconds_sum[5m]) / rate(http_server_requests_seconds_count[5m])

# P95 response time (aggregated across series; requires histogram buckets to be published)
histogram_quantile(0.95, sum by (le) (rate(http_server_requests_seconds_bucket[5m])))

# Error rate (4xx, 5xx)
sum(rate(http_server_requests_seconds_count{status=~"4..|5.."}[5m])) / 
sum(rate(http_server_requests_seconds_count[5m])) * 100

# JVM heap memory usage (summed over heap pools)
sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"}) * 100

# GC time
rate(jvm_gc_pause_seconds_sum[5m])

# CPU utilization
system_cpu_usage * 100

# Live thread count
jvm_threads_live_threads
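
The P95 query estimates a percentile from cumulative bucket counts rather than from raw samples. The following is a simplified sketch of that estimation using linear interpolation inside the target bucket, assuming 0 < q < 1, ascending bounds, and a final +Infinity bucket; the HistogramQuantile class is hypothetical and omits several edge cases that Prometheus handles.

```java
final class HistogramQuantile {

    /**
     * @param q                quantile, for example 0.95
     * @param upperBounds      bucket upper bounds (le), ascending; the last one is +Infinity
     * @param cumulativeCounts number of observations <= the matching upper bound
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) {
            b++;
        }

        // The rank falls into the +Inf bucket: report the highest finite bound.
        if (b == n - 1) {
            return upperBounds[n - 2];
        }

        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[b] - lower) * (rank - below) / inBucket;
    }
}
```

With buckets at 0.1, 0.5, 1.0, and +Inf holding cumulative counts 50, 90, 100, and 100, the P95 estimate is 0.75 seconds. The result is only as precise as the bucket boundaries, which is why the SLO buckets in the application configuration matter.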

Business metrics dashboard

E-commerce dashboard example

# Orders per hour
sum(rate(ecommerce_orders_total[1h])) * 3600

# Revenue per hour
sum(rate(ecommerce_revenue_total[1h])) * 3600

# Average order value
sum(rate(ecommerce_revenue_total[5m])) / sum(rate(ecommerce_orders_total[5m]))

# Order distribution by payment method
sum by (payment_method) (rate(ecommerce_orders_total[5m]))

# Checkout success rate
sum(rate(ecommerce_checkout_duration_seconds_count{result="success"}[5m])) /
sum(rate(ecommerce_checkout_duration_seconds_count[5m])) * 100

# Active users
user_active_count

# Page view rate
sum(rate(user_pageviews_total[5m]))
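
The hourly queries multiply by 3600 because rate() returns a per-second value. The following is a simplified sketch of how a per-second rate can be derived from counter samples, including the case where the counter resets after a restart; the CounterRate class is hypothetical, and the real rate() function additionally extrapolates to the window boundaries.

```java
final class CounterRate {

    /**
     * Simplified per-second rate over samples ordered by time.
     *
     * @param timestampsSeconds sample times in seconds
     * @param values            counter values at those times
     */
    static double ratePerSecond(double[] timestampsSeconds, double[] values) {
        int n = values.length;
        if (n < 2) {
            return Double.NaN;
        }
        double increase = 0.0;
        for (int i = 1; i < n; i++) {
            double delta = values[i] - values[i - 1];
            // A drop means the counter was reset (for example by a restart):
            // count the new value as the increase since the reset.
            increase += (delta >= 0) ? delta : values[i];
        }
        return increase / (timestampsSeconds[n - 1] - timestampsSeconds[0]);
    }
}
```

A counter that goes from 100 to 3700 over 3600 seconds has a rate of 1.0 per second, so multiplying by 3600 gives 3600 orders per hour.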

Using dashboard variables

Example variable definitions

# Application variable
label_values(http_server_requests_seconds_count, application)

# Environment variable
label_values(http_server_requests_seconds_count, environment)

# Instance variable
label_values(http_server_requests_seconds_count{application="$application"}, instance)

# URI variable
label_values(http_server_requests_seconds_count{application="$application"}, uri)

# Time range variable (Custom)
5m,15m,30m,1h,6h,12h,1d,7d

Queries that use the variables

# Request rate for the selected application
sum(rate(http_server_requests_seconds_count{application="$application"}[$interval]))

# Memory usage for the selected instance
jvm_memory_used_bytes{application="$application", instance="$instance", area="heap"}

# Response time for the selected URI
histogram_quantile(0.95, 
  rate(http_server_requests_seconds_bucket{
    application="$application", 
    uri="$uri"
  }[$interval])
)

Alerting

Grafana alert rules

Example alert rules

High error rate

When the 5xx error rate over 5 minutes exceeds 5%

sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / 
sum(rate(http_server_requests_seconds_count[5m])) * 100 > 5

High response time

When the P95 response time exceeds 1 second

histogram_quantile(0.95, 
  rate(http_server_requests_seconds_bucket[5m])
) > 1

높은 메모리 사용률

힙 메모리 사용률이 85%를 초과할 때

jvm_memory_used_bytes{area="heap"} / 
jvm_memory_max_bytes{area="heap"} * 100 > 85
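To make the error rate rule more concrete, here is a small sketch of the arithmetic behind it, assuming two cumulative counter samples taken one window apart. The class and method names are hypothetical, and the sketch ignores the counter reset handling and extrapolation that the real rate() function performs.

```java
/** Simplified arithmetic behind a "5xx error rate > 5%" alert rule. */
class ErrorRateAlertSketch {

    /** Per-second rate derived from two cumulative counter samples taken windowSeconds apart. */
    static double perSecondRate(double firstSample, double lastSample, double windowSeconds) {
        return (lastSample - firstSample) / windowSeconds;
    }

    /** Error percentage; yields NaN when there is no traffic at all (0 / 0). */
    static double errorRatePercent(double errorsPerSecond, double requestsPerSecond) {
        return errorsPerSecond / requestsPerSecond * 100;
    }

    /** A comparison against NaN is false, so the alert stays silent when there is no traffic. */
    static boolean shouldFire(double errorRatePercent, double thresholdPercent) {
        return errorRatePercent > thresholdPercent;
    }
}
```

With 300 errors and 3000 requests counted over a 300 second window, the rates are 1 and 10 per second, the error rate is 10%, and the 5% threshold is exceeded.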

Notification channels

Slack integration

{
  "name": "slack-alerts",
  "type": "slack",
  "settings": {
    "url": "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
    "channel": "#alerts",
    "username": "Grafana",
    "title": "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}",
    "text": "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
  }
}

Email integration

{
  "name": "email-alerts",
  "type": "email",
  "settings": {
    "addresses": "admin@company.com;ops@company.com",
    "subject": "Grafana Alert: {{ .GroupLabels.alertname }}",
    "body": "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
  }
}

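6. Health Check

Health check overview

What is a health check?

A health check is a mechanism for verifying the state of an application and its dependencies. Load balancers, orchestration tools, and monitoring systems use it to decide whether the application is available.

Health check types

Liveness: checks whether the application is alive
Readiness: checks whether the application is ready to receive traffic
Startup: checks whether the application has finished starting

What to check

Database connections
External API services
Message queues
Cache systems
File systems
Disk space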
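When several dependencies like the ones listed above are checked, their results have to be combined into one overall status. A minimal sketch of that aggregation follows, assuming the severity order DOWN, OUT_OF_SERVICE, UP, UNKNOWN that Spring Boot uses by default. The enum and class here are hypothetical stand-ins, not the Actuator types.

```java
import java.util.Collection;

/** Combines per-dependency statuses into one overall status (most severe wins). */
class HealthAggregationSketch {

    /** Declared from most severe to least severe. */
    enum Status { DOWN, OUT_OF_SERVICE, UP, UNKNOWN }

    static Status aggregate(Collection<Status> components) {
        Status worst = Status.UNKNOWN; // result when nothing was checked
        for (Status status : components) {
            if (status.ordinal() < worst.ordinal()) {
                worst = status;
            }
        }
        return worst;
    }
}
```

A single DOWN dependency therefore turns the whole application DOWN, which is why it matters which checks are attached to which probe.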

Custom HealthIndicator

Basic HealthIndicator implementation

Database health check

import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Map;

@Component
public class DatabaseHealthIndicator implements HealthIndicator {
    
    private final DataSource dataSource;
    
    public DatabaseHealthIndicator(DataSource dataSource) {
        this.dataSource = dataSource;
    }
    
    @Override
    public Health health() {
        long startTime = System.currentTimeMillis();
        // Run a trivial query to verify the connection; all JDBC resources are
        // closed by try-with-resources, and the same query is used for timing
        // so that no second connection has to be opened.
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement("SELECT 1");
             ResultSet resultSet = statement.executeQuery()) {
            if (resultSet.next()) {
                return Health.up()
                    .withDetail("database", "Available")
                    .withDetail("connection-pool", getConnectionPoolInfo())
                    .withDetail("query-time", (System.currentTimeMillis() - startTime) + "ms")
                    .build();
            }
        } catch (SQLException e) {
            // withException() also records the message under the "error" detail
            return Health.down()
                .withDetail("database", "Unavailable")
                .withException(e)
                .build();
        }
        
        return Health.down()
            .withDetail("database", "Unknown state")
            .build();
    }
    
    private Map<String, Object> getConnectionPoolInfo() {
        // Example of reading HikariCP pool statistics
        if (dataSource instanceof HikariDataSource) {
            HikariPoolMXBean poolBean = ((HikariDataSource) dataSource).getHikariPoolMXBean();
            // The MXBean is null until the pool has been started
            if (poolBean != null) {
                return Map.of(
                    "active", poolBean.getActiveConnections(),
                    "idle", poolBean.getIdleConnections(),
                    "total", poolBean.getTotalConnections(),
                    "waiting", poolBean.getThreadsAwaitingConnection()
                );
            }
        }
        return Map.of("type", "unknown");
    }
}

External API health check

@Component
public class ExternalApiHealthIndicator implements HealthIndicator {
    
    private final RestTemplate restTemplate;
    private final String apiUrl;
    
    public ExternalApiHealthIndicator(RestTemplate restTemplate, 
                                    @Value("${external.api.url}") String apiUrl) {
        this.restTemplate = restTemplate;
        this.apiUrl = apiUrl;
    }
    
    @Override
    public Health health() {
        try {
            long startTime = System.currentTimeMillis();
            
            ResponseEntity<String> response = restTemplate.exchange(
                apiUrl + "/health",
                HttpMethod.GET,
                null,
                String.class
            );
            
            long responseTime = System.currentTimeMillis() - startTime;
            
            if (response.getStatusCode().is2xxSuccessful()) {
                return Health.up()
                    .withDetail("external-api", "Available")
                    .withDetail("url", apiUrl)
                    .withDetail("status", response.getStatusCode().value())
                    .withDetail("response-time", responseTime + "ms")
                    .build();
            } else {
                return Health.down()
                    .withDetail("external-api", "Unhealthy response")
                    .withDetail("status", response.getStatusCode().value())
                    .withDetail("response-time", responseTime + "ms")
                    .build();
            }
            
        } catch (ResourceAccessException e) {
            return Health.down()
                .withDetail("external-api", "Connection timeout")
                .withDetail("error", e.getMessage())
                .build();
        } catch (Exception e) {
            return Health.down()
                .withDetail("external-api", "Unavailable")
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}

Composite health check

Redis cluster health check

@Component
public class RedisClusterHealthIndicator implements HealthIndicator {
    
    private final RedisTemplate<String, String> redisTemplate;
    private final List<String> redisNodes;
    
    public RedisClusterHealthIndicator(RedisTemplate<String, String> redisTemplate,
                                     @Value("${redis.nodes}") List<String> redisNodes) {
        this.redisTemplate = redisTemplate;
        this.redisNodes = redisNodes;
    }
    
    @Override
    public Health health() {
        Health.Builder builder = Health.up();
        Map<String, Object> details = new HashMap<>();
        
        try {
            // Test the Redis connection
            String testKey = "health-check-" + System.currentTimeMillis();
            String testValue = "test";
            
            // Write test
            redisTemplate.opsForValue().set(testKey, testValue, Duration.ofSeconds(10));
            
            // Read test
            String retrievedValue = redisTemplate.opsForValue().get(testKey);
            
            if (testValue.equals(retrievedValue)) {
                details.put("redis", "Available");
                details.put("operation", "Read/Write successful");
                
                // Check cluster node status
                details.put("cluster-info", getClusterInfo());
                
            } else {
                builder = Health.down();
                details.put("redis", "Data inconsistency");
            }
            
            // Cleanup
            redisTemplate.delete(testKey);
            
        } catch (Exception e) {
            builder = Health.down();
            details.put("redis", "Unavailable");
            details.put("error", e.getMessage());
        }
        
        return builder.withDetails(details).build();
    }
    
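    private Map<String, Object> getClusterInfo() {
        try {
            // execute() borrows a connection and releases it afterwards, so the
            // health check does not leak a connection on every call.
            Properties info = redisTemplate.execute(
                (RedisCallback<Properties>) connection -> connection.info("replication"));
            if (info == null) {
                return Map.of("error", "Unable to get cluster info");
            }
                
            return Map.of(
                "role", info.getProperty("role", "unknown"),
                "connected_slaves", info.getProperty("connected_slaves", "0"),
                "nodes", redisNodes.size()
            );
        } catch (Exception e) {
            return Map.of("error", "Unable to get cluster info");
        }
    }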
}

Kubernetes health checks

Liveness vs Readiness

Liveness probe configuration

Checks whether the application is alive. If the probe fails, the container is restarted.

# application.yml
management:
  endpoint:
    health:
      probes:
        enabled: true
  health:
    livenessstate:
      enabled: true

---
# Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spring-boot-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: spring-boot-app:latest
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

Readiness probe configuration

Checks whether the application is ready to receive traffic. If the probe fails, the pod is removed from the Service endpoints.

# application.yml
management:
  health:
    readinessstate:
      enabled: true

---
# Kubernetes Deployment (fragment; place under the same container entry as livenessProbe)
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1
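The probe settings above determine how long an unhealthy container can go unnoticed. As a rough worked example, the sketch below estimates a worst-case detection time as failureThreshold × periodSeconds + timeoutSeconds. This formula is a simplifying assumption (it ignores kubelet scheduling jitter), and the class name is hypothetical.

```java
/** Rough worst-case estimate of how long a probe needs to reach its failure threshold. */
class ProbeTimingSketch {

    /**
     * Assumes the failure starts right after a successful probe, so up to one full
     * period passes before the first failing probe, and the last probe runs into its timeout.
     */
    static int secondsToDetectFailure(int periodSeconds, int timeoutSeconds, int failureThreshold) {
        return failureThreshold * periodSeconds + timeoutSeconds;
    }
}
```

With the values used in this section, the liveness probe (period 10, timeout 5, threshold 3) reacts after roughly 35 seconds and the readiness probe (period 5, timeout 3, threshold 3) after roughly 18 seconds. Readiness reacting faster than liveness is usually what you want, since removing a pod from rotation is cheaper than restarting it.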

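Custom Liveness/Readiness implementation

Managing application state

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;

@Component
public class ApplicationStateManager {
    
    private static final Logger log = LoggerFactory.getLogger(ApplicationStateManager.class);
    
    private volatile boolean ready = false;
    private volatile boolean live = true;
    
    @EventListener
    public void onApplicationReady(ApplicationReadyEvent event) {
        // Switch to ready once application initialization has completed
        this.ready = true;
        log.info("Application is ready to serve traffic");
    }
    
    @EventListener
    public void onContextClosed(ContextClosedEvent event) {
        // Clear the ready flag when the application shuts down
        this.ready = false;
        log.info("Application is shutting down");
    }
    
    public boolean isReady() {
        return ready && live && checkDependencies();
    }
    
    public boolean isLive() {
        return live && checkCriticalResources();
    }
    
    private boolean checkDependencies() {
        // Check external dependencies (DB, external APIs, ...)
        try {
            // Check the non-critical dependencies
            return checkNonCriticalServices();
        } catch (Exception e) {
            log.warn("Non-critical service check failed", e);
            return false;
        }
    }
    
    // Placeholder (assumption): the original does not show this method.
    // Replace the body with the real dependency checks of your application.
    private boolean checkNonCriticalServices() {
        return true;
    }
    
    private boolean checkCriticalResources() {
        // Check core resources (memory, threads, ...)
        try {
            // Check heap usage relative to the maximum heap size
            long maxMemory = Runtime.getRuntime().maxMemory();
            long usedMemory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
            double memoryUsage = (double) usedMemory / maxMemory;
            
            if (memoryUsage > 0.95) {
                log.error("Memory usage too high: {}%", memoryUsage * 100);
                return false;
            }
            
            return true;
        } catch (Exception e) {
            log.error("Critical resource check failed", e);
            return false;
        }
    }
    
    public void setLive(boolean live) {
        this.live = live;
    }
}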

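Custom health indicators

Note that custom indicators only affect the /actuator/health/liveness and /actuator/health/readiness probes if they are added to the corresponding health group (management.endpoint.health.group.liveness.include and management.endpoint.health.group.readiness.include). Otherwise they appear only under /actuator/health.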

@Component
public class CustomLivenessHealthIndicator implements HealthIndicator {
    
    private final ApplicationStateManager stateManager;
    
    public CustomLivenessHealthIndicator(ApplicationStateManager stateManager) {
        this.stateManager = stateManager;
    }
    
    @Override
    public Health health() {
        if (stateManager.isLive()) {
            return Health.up()
                .withDetail("status", "Application is alive")
                .withDetail("timestamp", Instant.now())
                .build();
        } else {
            return Health.down()
                .withDetail("status", "Application is not responding")
                .withDetail("timestamp", Instant.now())
                .build();
        }
    }
}

@Component
public class CustomReadinessHealthIndicator implements HealthIndicator {
    
    private final ApplicationStateManager stateManager;
    
    public CustomReadinessHealthIndicator(ApplicationStateManager stateManager) {
        this.stateManager = stateManager;
    }
    
    @Override
    public Health health() {
        if (stateManager.isReady()) {
            return Health.up()
                .withDetail("status", "Application is ready")
                .withDetail("dependencies", "All systems operational")
                .build();
        } else {
            return Health.down()
                .withDetail("status", "Application is not ready")
                .withDetail("dependencies", "Some dependencies unavailable")
                .build();
        }
    }
}

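Health check best practices

Design principles

Liveness checks

Respond quickly (within 1-2 seconds)
Check core functionality only
Minimize external dependencies
Detect memory leaks
Detect deadlocks

Readiness checks

Check all dependencies
Verify database connectivity
Verify external service availability
Confirm initialization has completed
Confirm cache warm-up has completed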
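Deadlock detection from the liveness list above can be done with the JDK alone. The following is a minimal sketch using ThreadMXBean; the class name and the idea of wiring it into a liveness indicator are assumptions for illustration, not part of the original material.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Liveness-style check that reports whether the JVM currently has deadlocked threads. */
class DeadlockCheckSketch {

    static int deadlockedThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Returns null when no deadlock is found
        long[] ids = threads.findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    static boolean isLive() {
        return deadlockedThreadCount() == 0;
    }
}
```

The check involves no external dependency, which keeps it in line with the "minimize external dependencies" guideline for liveness. A deadlock does not resolve on its own, so restarting the container is a reasonable reaction.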

Health check optimization

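@Configuration
public class HealthCheckConfig {
    
    // No registry bean is declared here: HealthIndicatorRegistry and
    // DefaultHealthIndicatorRegistry were deprecated in Spring Boot 2.2 and later
    // removed, and Actuator auto-configures a HealthContributorRegistry itself.
    
    // Health check caching.
    // The delegate is selected by bean name (an example choice); injecting a bare
    // HealthIndicator would be ambiguous because several indicators exist.
    @Bean
    public HealthIndicator cachingHealthIndicator(
            @Qualifier("databaseHealthIndicator") HealthIndicator delegate) {
        return new CachingHealthIndicator(delegate, Duration.ofSeconds(30));
    }
    
    // Timeout.
    // TimeoutHealthIndicator is a custom wrapper assumed to exist elsewhere in the
    // project; it is not a Spring Boot class.
    @Bean
    public HealthIndicator timeoutHealthIndicator(
            @Qualifier("externalApiHealthIndicator") HealthIndicator delegate) {
        return new TimeoutHealthIndicator(delegate, Duration.ofSeconds(5));
    }
}

// Caching health indicator implementation
public class CachingHealthIndicator implements HealthIndicator {
    
    private final HealthIndicator delegate;
    private final Duration cacheDuration;
    private Health cachedHealth;
    private Instant lastCheck;
    
    public CachingHealthIndicator(HealthIndicator delegate, Duration cacheDuration) {
        this.delegate = delegate;
        this.cacheDuration = cacheDuration;
    }
    
    // synchronized: the cached value and its timestamp are updated together, and
    // concurrent callers do not trigger the delegate more than once per refresh.
    @Override
    public synchronized Health health() {
        Instant now = Instant.now();
        
        if (cachedHealth == null || 
            lastCheck == null || 
            Duration.between(lastCheck, now).compareTo(cacheDuration) > 0) {
            
            cachedHealth = delegate.health();
            lastCheck = now;
        }
        
        return cachedHealth;
    }
}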
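The configuration above refers to a timeout wrapper without showing one. The sketch below illustrates the underlying idea in plain Java, assuming the check is modelled as a Supplier<String> that returns a status. The names are hypothetical, and a real HealthIndicator wrapper would return Health objects instead of strings.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Runs a check asynchronously and falls back to a default when it is too slow or throws. */
class TimeoutCheckSketch {

    static String runWithTimeout(Supplier<String> check, Duration timeout, String fallback) {
        return CompletableFuture.supplyAsync(check)
            // Complete with the fallback if the check has not finished in time (Java 9+)
            .completeOnTimeout(fallback, timeout.toMillis(), TimeUnit.MILLISECONDS)
            // Treat an exception thrown by the check like a failed check
            .exceptionally(ex -> fallback)
            .join();
    }
}
```

Note that the slow check keeps running in the background after the timeout; the caller simply stops waiting for it. For checks that hold scarce resources, a timeout on the underlying client (JDBC, HTTP) is still needed.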

Monitoring and alerting

Health check metrics

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.boot.actuate.health.Health;
import org.springframework.stereotype.Component;

import java.util.function.Supplier;

@Component
public class HealthMetrics {
    
    private final MeterRegistry meterRegistry;
    
    public HealthMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }
    
    public Health executeHealthCheck(String component, Supplier<Health> healthCheck) {
        Timer.Sample sample = Timer.start(meterRegistry);
        String status = "ERROR";
        
        try {
            Health health = healthCheck.get();
            status = health.getStatus().getCode();
            return health;
            
        } catch (Exception e) {
            return Health.down(e).build();
            
        } finally {
            // Tags are part of a meter's identity, so the tagged meters are looked up
            // (or created on first use) through the registry on each call.
            Counter.builder("health.check.total")
                .description("Total health check executions")
                .tags("component", component, "status", status)
                .register(meterRegistry)
                .increment();
            
            sample.stop(Timer.builder("health.check.duration")
                .description("Health check execution time")
                .tags("component", component)
                .register(meterRegistry));
        }
    }
}

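7. Summary and Best Practices

Monitoring design guide

Establishing a monitoring strategy

1. Set goals

Define SLAs/SLOs
Identify core business metrics
Set alert priorities
Tailor dashboards to their users

2. Select metrics

Start with the Four Golden Signals
Link metrics to business KPIs
Manage cardinality
Minimize performance impact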
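Cardinality management from the list above is easier to reason about with a quick calculation: in the worst case, the number of time series for one metric is the product of the distinct values of each tag. A small sketch of that arithmetic follows, with a hypothetical class name and made-up tag counts.

```java
import java.util.Map;

/** Worst-case number of time series for one metric, given distinct values per tag. */
class CardinalitySketch {

    static long worstCaseSeries(Map<String, Integer> distinctValuesPerTag) {
        long series = 1;
        for (int count : distinctValuesPerTag.values()) {
            // multiplyExact fails loudly instead of silently overflowing
            series = Math.multiplyExact(series, (long) count);
        }
        return series;
    }
}
```

For example, 5 HTTP methods, 10 status codes and 50 URI templates allow up to 2,500 series for a single metric. Adding an unbounded tag such as a user ID multiplies that number by the number of users, which is why such tags should be avoided.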

Monitoring layers

┌─────────────────────────────────────────────────────────┐
│                    Business metrics                     │
│  • Revenue, order count, user count                     │
│  • Conversion rate, churn rate                          │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                   Application metrics                   │
│  • HTTP request/response time                           │
│  • Error rate, throughput                               │
│  • JVM metrics                                          │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                 Infrastructure metrics                  │
│  • CPU, memory, disk                                    │
│  • Network I/O                                          │
│  • Container/Pod status                                 │
└─────────────────────────────────────────────────────────┘

Metric naming conventions

Naming rules

// Good examples
http.server.requests.duration     // clear and hierarchical
database.connection.pool.active   // specific and meaningful
business.orders.created.total     // states the business domain
cache.hits.ratio                  // the unit is clear

// Bad examples
requests                          // too generic
db_conn                          // uses abbreviations
OrdersCreated                    // uses camel case
http_requests_per_second         // embeds the unit in the name

// Tag usage examples
http.server.requests{
  method="GET",
  status="200",
  uri="/api/users",
  environment="production"
}

business.orders{
  payment_method="credit_card",
  customer_segment="premium",
  region="us-east-1"
}
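The dotted names above differ from the underscore names used in the PromQL queries earlier in this material (for example http_server_requests_seconds_count) because names are converted when exported to Prometheus. The sketch below shows only the character-level part of such a conversion, under the assumption that every character outside [a-zA-Z0-9_:] becomes an underscore. The class is hypothetical, and the real Micrometer naming convention additionally appends suffixes such as units and _total.

```java
/** Character-level conversion of a dotted metric name into a Prometheus-compatible name. */
class MetricNameSketch {

    static String toPrometheusName(String dottedName) {
        // Prometheus metric names may only contain letters, digits, '_' and ':'
        return dottedName.replaceAll("[^a-zA-Z0-9_:]", "_");
    }
}
```

Keeping the dotted, lowercase form in application code and leaving the conversion to the registry lets the same instrumentation work with other monitoring backends as well.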

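Alert design principles

Critical (respond immediately)

Complete service outage
Risk of data loss
Security breach
SLA violation

Warning (respond during business hours)

Performance degradation
Rising error rate
Resource shortage
Dependency failure

Info (monitor)

Trend changes
Capacity planning
Business events
Routine checks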

Operational best practices

Performance optimization

Optimizing metric collection

@Configuration
public class MetricsOptimizationConfig {
    
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsOptimization() {
        return registry -> registry.config()
            // Filter out metrics that are not needed
            .meterFilter(MeterFilter.deny(id -> 
                id.getName().startsWith("jvm.gc.overhead")))
            
            // Bound histogram buckets: cap the expected maximum for HTTP timers at 30s
            .meterFilter(MeterFilter.maxExpected(
                "http.server.requests", Duration.ofSeconds(30)))
            
            // Limit tag cardinality: collapse very long URIs into a single value
            .meterFilter(MeterFilter.replaceTagValues(
                "uri", uri -> uri.length() > 100 ? "long-uri" : uri))
            
            // Drop debug-only metrics
            .meterFilter(MeterFilter.denyNameStartsWith("debug"));
    }
    
    // Asynchronous metric collection.
    // AsyncMeterRegistry is not provided by Micrometer; it stands for a custom
    // wrapper that would have to be implemented separately.
    @Bean
    @ConditionalOnProperty(name = "metrics.async.enabled", havingValue = "true")
    public AsyncMeterRegistry asyncMeterRegistry(MeterRegistry registry) {
        return new AsyncMeterRegistry(registry, Executors.newFixedThreadPool(2));
    }
}

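Optimizing memory usage

// Reuse metric instances
@Service
public class OptimizedMetricsService {
    
    private final MeterRegistry meterRegistry;
    private final Map<String, Counter> counterCache = new ConcurrentHashMap<>();
    private final Map<String, Timer> timerCache = new ConcurrentHashMap<>();
    
    public OptimizedMetricsService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }
    
    public void incrementCounter(String name, Tags tags) {
        String key = name + tags.toString();
        counterCache.computeIfAbsent(key, k -> 
            Counter.builder(name).tags(tags).register(meterRegistry))
            .increment();
    }
    
    // A sample is started against the registry's clock and bound to a Timer only
    // when it is stopped.
    public Timer.Sample startTimer() {
        return Timer.start(meterRegistry);
    }
    
    public void stopTimer(Timer.Sample sample, String name, Tags tags) {
        String key = name + tags.toString();
        Timer timer = timerCache.computeIfAbsent(key, k ->
            Timer.builder(name).tags(tags).register(meterRegistry));
        sample.stop(timer);
    }
    
    // Periodically clean up metrics that are no longer used
    @Scheduled(fixedRate = 300000) // every 5 minutes
    public void cleanupUnusedMetrics() {
        // Logic for removing unused metrics
    }
}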

Security and access control

Protecting sensitive information

@Configuration
public class ActuatorSecurityConfig {
    
    @Bean
    public SecurityFilterChain actuatorSecurity(HttpSecurity http) throws Exception {
        return http
            // Apply this chain to actuator endpoints only (securityMatcher is the
            // replacement for requestMatcher, which was removed in Spring Security 6)
            .securityMatcher(EndpointRequest.toAnyEndpoint())
            .authorizeHttpRequests(auth -> auth
                // Public endpoints
                .requestMatchers(EndpointRequest.to(HealthEndpoint.class))
                    .permitAll()
                .requestMatchers(EndpointRequest.to(InfoEndpoint.class))
                    .permitAll()
                
                // Restricted endpoints
                .requestMatchers(EndpointRequest.to(MetricsEndpoint.class))
                    .hasRole("METRICS_READER")
                .requestMatchers(EndpointRequest.to(PrometheusScrapeEndpoint.class))
                    .hasRole("METRICS_READER")
                
                // Admin-only endpoints
                .requestMatchers(EndpointRequest.to(
                    EnvironmentEndpoint.class,
                    ConfigurationPropertiesReportEndpoint.class,
                    LoggersEndpoint.class))
                    .hasRole("ADMIN")
                
                .anyRequest().denyAll()
            )
            .httpBasic(Customizer.withDefaults())
            .build();
    }
    
    // Mask sensitive values.
    // Uses SanitizingFunction (Spring Boot 2.6+), which is applied by the env and
    // configprops endpoints; older versions use management.endpoint.env.keys-to-sanitize.
    @Bean
    public SanitizingFunction sensitiveValueSanitizer() {
        List<String> sensitiveKeys = Arrays.asList("password", "secret", "key", "token");
        return data -> {
            String key = data.getKey().toLowerCase();
            boolean sensitive = sensitiveKeys.stream().anyMatch(key::contains);
            return sensitive ? data.withValue("******") : data;
        };
    }
}

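Incident response process

Runbook example

# Response guide for high response time alerts

## 1. Immediate checks
- [ ] Check the current traffic pattern
- [ ] Check whether the error rate is rising at the same time
- [ ] Check infrastructure resource usage
- [ ] Check the status of external dependencies

## 2. Investigation
### Check Grafana dashboards
- Application Overview Dashboard
- JVM Metrics Dashboard
- Infrastructure Dashboard

### Check key metrics
- P95 response time: > 1 second
- Error rate: > 5%
- CPU usage: > 80%
- Memory usage: > 85%

## 3. Response actions
### Immediate actions
1. Limit traffic (rate limiting)
2. Invalidate caches
3. Scale out (auto scaling)

### Root cause analysis
1. Log analysis
2. Profiling
3. Database query analysis

## 4. Follow-up
- [ ] Write an incident report
- [ ] Identify monitoring improvements
- [ ] Plan preventive measures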
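The "key metrics" step of the runbook can be partly automated. The sketch below checks a snapshot of the four values against the thresholds listed in the runbook and reports which ones are breached. The class, method and label names are hypothetical, and the snapshot values are assumed to be fetched elsewhere (for example from Prometheus).

```java
import java.util.ArrayList;
import java.util.List;

/** Compares a metric snapshot against the runbook thresholds. */
class RunbookTriageSketch {

    static List<String> breachedThresholds(double p95Seconds, double errorRatePercent,
                                           double cpuPercent, double memoryPercent) {
        List<String> breached = new ArrayList<>();
        if (p95Seconds > 1.0) {
            breached.add("p95-latency");
        }
        if (errorRatePercent > 5.0) {
            breached.add("error-rate");
        }
        if (cpuPercent > 80.0) {
            breached.add("cpu");
        }
        if (memoryPercent > 85.0) {
            breached.add("memory");
        }
        return breached;
    }
}
```

A snapshot with a P95 of 1.4 seconds, 2% errors, 91% CPU and 60% memory would report the latency and CPU thresholds, which points toward a capacity problem rather than a failing dependency.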

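Wrap-up

Building a monitoring system with Spring Boot Actuator, Prometheus, and Grafana is essential for operating modern applications. Use the material covered in this session to build your monitoring system step by step and keep improving it over time.

Key points

Design monitoring around its purpose and evolve it incrementally
Implement the core metrics first, based on the Four Golden Signals
Collect business metrics and technical metrics in balance
Configure only alerts that are actionable and meaningful
Always take security and performance into account when implementing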
