Python for Automation and Data Processing
Python has become my go-to tool for automation and data processing tasks. At work, I use it for data validation, transformation, and small automation scripts that support our cloud-based workflows.
Data Validation Scripts
One common task is validating data before it gets processed or stored. Here's a pattern I use for structured data validation:

    from dataclasses import dataclass
    from typing import Any, Dict, List


    @dataclass
    class ValidationResult:
        is_valid: bool
        errors: List[str]


    def validate_user_data(data: Dict[str, Any]) -> ValidationResult:
        """Validate user data before database insertion."""
        errors = []

        # Required field checks: None and empty strings count as missing,
        # but falsy values such as user_id == 0 are allowed through
        required_fields = ['email', 'name', 'user_id']
        for field in required_fields:
            if data.get(field) in (None, ''):
                errors.append(f"Missing required field: {field}")

        # Email format validation (a minimal check, only run when an email is present)
        email = data.get('email')
        if email and '@' not in str(email):
            errors.append("Invalid email format")

        # Type validation; bool is rejected explicitly because it subclasses int
        user_id = data.get('user_id')
        if user_id not in (None, ''):
            if isinstance(user_id, bool) or not isinstance(user_id, int):
                errors.append("user_id must be an integer")

        return ValidationResult(is_valid=not errors, errors=errors)

Dataclasses and type hints make the code easier to maintain, and a static type checker can use the hints to catch issues during development. This approach aligns with the OOP principles I've learned and applied in my projects.
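The `'@'` membership test in the validator is deliberately loose: it accepts strings such as `"@"` or `"user@"`. If a slightly stricter check is wanted, one option is a small regular expression. This is only a sketch; the names `EMAIL_PATTERN` and `is_plausible_email` are hypothetical, and the pattern checks plausibility rather than full RFC compliance.

```python
import re

# Hypothetical pattern: exactly one "@", a non-empty local part,
# and at least one dot in the domain. Plausibility check only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_plausible_email(value: object) -> bool:
    """Return True if value is a string that looks like an email address."""
    return isinstance(value, str) and EMAIL_PATTERN.match(value) is not None
```

Swapping this in for the `'@'` test would reject the obviously broken inputs while still leaving real verification (for example, a confirmation email) to a later step.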
Data Transformation
Transforming data between different formats is essential when working with APIs and databases. Here's how I approach it:

    import json
    from datetime import datetime
    from typing import Dict, List, Optional


    def transform_api_response(raw_data: List[Dict]) -> List[Dict]:
        """Transform API response to database-ready format."""
        transformed = []
        for record in raw_data:
            transformed.append({
                'id': record.get('externalId'),
                'created_at': parse_timestamp(record.get('timestamp')),
                'status': normalize_status(record.get('state')),
                'metadata': json.dumps(record.get('extra', {})),
            })
        return transformed


    def parse_timestamp(ts: Optional[str]) -> datetime:
        """Parse various timestamp formats into a naive datetime."""
        formats = [
            '%Y-%m-%dT%H:%M:%SZ',  # the trailing 'Z' is matched literally
            '%Y-%m-%d %H:%M:%S',
            '%Y/%m/%d',
        ]
        if ts:  # skip parsing for None or empty input instead of raising TypeError
            for fmt in formats:
                try:
                    return datetime.strptime(ts, fmt)
                except ValueError:
                    continue
        # Fallback: missing or unparseable timestamps default to the current time.
        # Log or reject here instead if silent defaults are not acceptable.
        return datetime.now()


    def normalize_status(status: Optional[str]) -> str:
        """Normalize status values from different sources."""
        status_map = {
            'active': 'active',
            'enabled': 'active',
            'inactive': 'inactive',
            'disabled': 'inactive',
        }
        if not isinstance(status, str):
            return 'unknown'
        # Lowercasing covers 'ACTIVE', 'Inactive', and similar variants
        return status_map.get(status.strip().lower(), 'unknown')

Automation with Clean Code Principles
From my E-Commerce API project and production work, I've learned the importance of writing modular, testable code. Here's an automation script that follows these principles:

    from typing import Dict, List


    class DataProcessor:
        """Process and validate data from multiple sources."""

        def __init__(self, config: dict):
            self.config = config
            self.processed_count = 0
            self.error_count = 0

        def process_batch(self, records: List[Dict]) -> Dict:
            """Process a batch of records with error handling."""
            results = {'successful': [], 'failed': []}
            for record in records:
                try:
                    validated = self._validate(record)
                    transformed = self._transform(validated)
                    results['successful'].append(transformed)
                    self.processed_count += 1
                except Exception as e:
                    # One bad record should not abort the whole batch
                    results['failed'].append({'record': record, 'error': str(e)})
                    self.error_count += 1
            return results

        def _validate(self, record: Dict) -> Dict:
            """Validate a single record."""
            # Validation logic here
            return record

        def _transform(self, record: Dict) -> Dict:
            """Transform a single record."""
            # Transformation logic here
            return record

        def get_stats(self) -> Dict:
            """Return processing statistics."""
            total = self.processed_count + self.error_count
            return {
                'processed': self.processed_count,
                'errors': self.error_count,
                # Guard against division by zero before any record is processed
                'success_rate': self.processed_count / total * 100 if total else 0,
            }

Connecting to Cloud Services
At Gorkhali Agents, I work with Google Cloud services. Here's a pattern for securely accessing data using Secret Manager:

    import os
    from typing import Optional

    from google.cloud import secretmanager


    def get_secret(secret_id: str, project_id: Optional[str] = None) -> str:
        """Retrieve a secret from Google Cloud Secret Manager."""
        project_id = project_id or os.environ.get('GCP_PROJECT')
        if not project_id:
            # Fail early rather than requesting "projects/None/..."
            raise ValueError("Pass project_id or set the GCP_PROJECT environment variable")
        client = secretmanager.SecretManagerServiceClient()
        name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
        response = client.access_secret_version(request={"name": name})
        return response.payload.data.decode("UTF-8")


    # Usage
    db_password = get_secret("database-password")

Key Takeaways
- Type hints matter: They catch bugs early and make code self-documenting
- OOP for organization: Classes help organize related functionality
- Error handling is crucial: In production, always expect and handle failures
- Security first: Never hardcode credentials; use a secret manager instead
These Python skills form the foundation for ETL development, where data validation, transformation, and reliable processing are essential.
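As a rough illustration of how validation, transformation, and reliable processing fit together in an ETL-style flow, here is a minimal sketch. The record shape and the names `is_valid`, `to_row`, and `run_pipeline` are hypothetical and not taken from a real project; the "load" step is represented only by the returned list.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def is_valid(record: Record) -> bool:
    """Hypothetical rule: an integer id and a non-blank string name."""
    name = record.get("name")
    return isinstance(record.get("id"), int) and isinstance(name, str) and bool(name.strip())


def to_row(record: Record) -> Record:
    """Hypothetical transformation: trim and title-case the name."""
    return {"id": record["id"], "name": record["name"].strip().title()}


def run_pipeline(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Validate, then transform; return (rows ready to load, rejected records)."""
    loaded: List[Record] = []
    rejected: List[Record] = []
    for record in records:
        if is_valid(record):
            loaded.append(to_row(record))
        else:
            # Keep rejected records so failures stay visible instead of vanishing
            rejected.append(record)
    return loaded, rejected


rows, rejects = run_pipeline([
    {"id": 1, "name": " ada lovelace "},
    {"id": "2", "name": "bad id"},
])
```

Returning the rejected records alongside the loaded rows keeps failures inspectable, which mirrors the successful/failed split used in the batch processor.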