
Python for Automation and Data Processing

December 2025 · 6 min read

Python has become my go-to tool for automation and data processing tasks. At work, I use it for data validation, transformation, and small automation scripts that support our cloud-based workflows.

Data Validation Scripts

One common task is validating data before it gets processed or stored. Here's a pattern I use for structured data validation:

from typing import Dict, List, Any
from dataclasses import dataclass

@dataclass
class ValidationResult:
    is_valid: bool
    errors: List[str]

def validate_user_data(data: Dict[str, Any]) -> ValidationResult:
    """Validate user data before database insertion."""
    errors = []
    
    # Required field checks
    required_fields = ['email', 'name', 'user_id']
    for field in required_fields:
        if field not in data or not data[field]:
            errors.append(f"Missing required field: {field}")
    
    # Email format validation
    if 'email' in data and '@' not in str(data['email']):
        errors.append("Invalid email format")
    
    # Type validation
    if 'user_id' in data and not isinstance(data['user_id'], int):
        errors.append("user_id must be an integer")
    
    return ValidationResult(
        is_valid=len(errors) == 0,
        errors=errors
    )

Using dataclasses and type hints makes the code more maintainable and helps catch issues during development. This approach aligns with OOP principles I've learned and applied in my projects.
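
The single '@' check above is deliberately minimal. When I need something stricter, a small regex goes a long way; here's a sketch (the pattern is an assumption on my part and not fully RFC-compliant):

```python
import re

# A pragmatic pattern: non-empty local part, '@', domain with at least one dot.
# Intentionally simple; full RFC 5322 validation is rarely worth the complexity.
EMAIL_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

def is_valid_email(value: str) -> bool:
    """Return True if value looks like a plausible email address."""
    return bool(EMAIL_RE.match(value))
```

A helper like this slots straight into the email branch of validate_user_data.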

Data Transformation

Transforming data between different formats is essential when working with APIs and databases. Here's how I approach it:

import json
from datetime import datetime
from typing import List, Dict

def transform_api_response(raw_data: List[Dict]) -> List[Dict]:
    """Transform API response to database-ready format."""
    transformed = []
    
    for record in raw_data:
        transformed.append({
            'id': record.get('externalId'),
            'created_at': parse_timestamp(record.get('timestamp')),
            'status': normalize_status(record.get('state')),
            'metadata': json.dumps(record.get('extra', {}))
        })
    
    return transformed

def parse_timestamp(ts: str) -> datetime:
    """Parse a timestamp string, trying several known formats."""
    formats = [
        '%Y-%m-%dT%H:%M:%SZ',
        '%Y-%m-%d %H:%M:%S',
        '%Y/%m/%d'
    ]
    for fmt in formats:
        try:
            return datetime.strptime(ts, fmt)
        except ValueError:
            continue
    # Fail loudly rather than silently substituting the current time,
    # which would write wrong data into the database
    raise ValueError(f"Unrecognized timestamp format: {ts!r}")

def normalize_status(status: str) -> str:
    """Normalize status values from different sources."""
    status_map = {
        'active': 'active',
        'enabled': 'active',
        'inactive': 'inactive',
        'disabled': 'inactive'
    }
    # Case-insensitive lookup also covers mixed-case inputs like 'Active'
    return status_map.get((status or '').lower(), 'unknown')
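
For strictly ISO-8601 inputs, datetime.fromisoformat can replace the format list entirely; the only wrinkle is the trailing 'Z', which fromisoformat rejects before Python 3.11. A sketch:

```python
from datetime import datetime

def parse_iso_z(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp, tolerating a trailing 'Z' suffix."""
    # Python < 3.11 rejects 'Z'; rewrite it as an explicit UTC offset
    if ts.endswith('Z'):
        ts = ts[:-1] + '+00:00'
    return datetime.fromisoformat(ts)
```

Unlike the strptime loop, this also preserves timezone information when an offset is present.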

Automation with Clean Code Principles

From my E-Commerce API project and production work, I've learned the importance of writing modular, testable code. Here's an automation script that follows these principles:

class DataProcessor:
    """Process and validate data from multiple sources."""
    
    def __init__(self, config: dict):
        self.config = config
        self.processed_count = 0
        self.error_count = 0
    
    def process_batch(self, records: List[Dict]) -> Dict:
        """Process a batch of records with error handling."""
        results = {
            'successful': [],
            'failed': []
        }
        
        for record in records:
            try:
                validated = self._validate(record)
                transformed = self._transform(validated)
                results['successful'].append(transformed)
                self.processed_count += 1
            except Exception as e:
                results['failed'].append({
                    'record': record,
                    'error': str(e)
                })
                self.error_count += 1
        
        return results
    
    def _validate(self, record: Dict) -> Dict:
        """Validate a single record."""
        # Validation logic here
        return record
    
    def _transform(self, record: Dict) -> Dict:
        """Transform a single record."""
        # Transformation logic here
        return record
    
    def get_stats(self) -> Dict:
        """Return processing statistics."""
        total = self.processed_count + self.error_count
        return {
            'processed': self.processed_count,
            'errors': self.error_count,
            'success_rate': (self.processed_count / total * 100) if total else 0
        }
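
Stripped of the bookkeeping, process_batch is a generic "partition successes from failures" loop, which I sometimes factor out into a standalone helper; a sketch (partition_results is a hypothetical name, not part of the class above):

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar('T')
R = TypeVar('R')

def partition_results(
    records: List[T],
    fn: Callable[[T], R]
) -> Tuple[List[R], List[Tuple[T, str]]]:
    """Apply fn to each record, collecting successes and (record, error) pairs."""
    successful: List[R] = []
    failed: List[Tuple[T, str]] = []
    for record in records:
        try:
            successful.append(fn(record))
        except Exception as exc:
            # One bad record should never abort the whole batch
            failed.append((record, str(exc)))
    return successful, failed
```

The same idea underpins the class: errors are data to report, not reasons to stop.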

Connecting to Cloud Services

At Gorkhali Agents, I work with Google Cloud services. Here's a pattern for securely accessing data using Secret Manager:

from google.cloud import secretmanager
from typing import Optional
import os

def get_secret(secret_id: str, project_id: Optional[str] = None) -> str:
    """Retrieve a secret from Google Cloud Secret Manager."""
    project_id = project_id or os.environ.get('GCP_PROJECT')
    
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

# Usage
db_password = get_secret("database-password")

Key Takeaways

  • Type hints matter: They catch bugs early and make code self-documenting
  • OOP for organization: Classes help organize related functionality
  • Error handling is crucial: In production, always expect and handle failures
  • Security first: Never hardcode credentials, use secret managers
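
On that last point: locally, where Secret Manager isn't wired up, the same "no hardcoded credentials" rule can still hold by reading configuration from environment variables. A minimal sketch (the variable names are illustrative):

```python
import os
from typing import Optional

def get_config_value(name: str, default: Optional[str] = None) -> str:
    """Read a config value from the environment, failing loudly if absent."""
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"Missing required environment variable: {name}")
    return value

# Usage (demo value set inline for illustration)
os.environ.setdefault('APP_DB_HOST', 'localhost')
db_host = get_config_value('APP_DB_HOST')
```

Failing loudly on a missing variable beats a silent empty-string default that only surfaces later as a confusing connection error.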

These Python skills form the foundation for ETL development, where data validation, transformation, and reliable processing are essential.