Comprehensive documentation for TZZR system v5, including:

- 00_VISION: Glossary and foundational philosophy
- 01_ARQUITECTURA: System overview and server specs
- 02_MODELO_DATOS: Entity definitions and data planes (T0, MST, BCK)
- 03_COMPONENTES: Agent docs (CLARA, MARGARET, FELDMAN, GRACE)
- 04_SEGURIDAD: Threat model and secrets management
- 05_OPERACIONES: Infrastructure and backup/recovery
- 06_INTEGRACIONES: GPU services (RunPod status: blocked)
- 99_ANEXOS: Repository inventory (24 repos)

Key findings documented:

- CRITICAL: UFW inactive on CORP/HST
- CRITICAL: PostgreSQL 5432 exposed
- CRITICAL: .env files with 644 permissions
- RunPod workers not starting (code ready in R2)
- Infisical designated as single source of secrets (D-001)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

# GPU Services - RunPod

**Version:** 5.0
**Date:** 2024-12-24

---

## Critical Status

**ALERT: RunPod is NOT reliable for production.**

### Incident 2024-12-24

RunPod failed to provision workers despite:
- Correct configuration
- Sufficient balance ($72+)
- Jobs in the queue

**Result:** 0 active workers. GPU service down.

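The failure mode above (jobs queued, zero workers) can be detected mechanically. The following is a minimal sketch, assuming a health payload shaped like the JSON that RunPod's serverless `/health` route returns, with a `jobs.inQueue` counter and per-state worker counts. The field names, the sample payloads, and the `is_starved` helper are assumptions for illustration, not taken from the TZZR codebase.

```python
from typing import Any, Mapping


def is_starved(health: Mapping[str, Any]) -> bool:
    """Return True when jobs are waiting but no worker can take them."""
    jobs = health.get("jobs", {})
    workers = health.get("workers", {})
    queued = int(jobs.get("inQueue", 0))
    # Count workers in any state that could pick up work soon (assumed state names).
    available = sum(int(workers.get(k, 0)) for k in ("idle", "running", "initializing"))
    return queued > 0 and available == 0


# Hypothetical payloads: the first mirrors the incident, the second a healthy endpoint.
incident = {"jobs": {"inQueue": 3, "inProgress": 0}, "workers": {"idle": 0, "running": 0}}
healthy = {"jobs": {"inQueue": 3, "inProgress": 1}, "workers": {"idle": 1, "running": 1}}

print(is_starved(incident))  # True
print(is_starved(healthy))   # False
```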
---

## Configured Endpoints

| Service | Endpoint ID | Status |
|----------|-------------|--------|
| GRACE | r00x4g3rrwkbyh | Inactive |
| PENNY | 0mxhaokgsmgee3 | Inactive |
| FACTORY | ddnuk6y35zi56a | Inactive |

---

## GRACE (AI Extraction)

### Implemented Modules (6 of 18)

| Module | Description |
|--------|-------------|
| ASR | Speech-to-text |
| OCR | Text recognition |
| TTS | Text-to-speech |
| Face | Facial recognition |
| Embeddings | Semantic vectors |
| Avatar | Avatar generation |

### Pending Modules (12)

- Document parsing
- Image classification
- Object detection
- Sentiment analysis
- Named entity recognition
- Translation
- Summarization
- Question answering
- Code generation
- Audio classification
- Video analysis
- Multimodal fusion

---

## Code Available in R2

The handlers are ready and portable:

```
s3://architect/gpu-services/
├── base/
│   └── bootstrap.sh
├── grace/
│   └── code/handler.py
├── penny/
│   └── code/handler.py
└── factory/
    └── code/handler.py
```

### Download the Code

```bash
# Load the R2 credentials from the orchestrator .env
source /home/orchestrator/orchestrator/.env
export AWS_ACCESS_KEY_ID="$R2_ACCESS_KEY"
export AWS_SECRET_ACCESS_KEY="$R2_SECRET_KEY"

# Mirror the bucket prefix into ./gpu-services/
aws s3 sync s3://architect/gpu-services/ ./gpu-services/ \
  --endpoint-url https://7dedae6030f5554d99d37e98a5232996.r2.cloudflarestorage.com
```

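The same download can be scripted in Python when the AWS CLI is not available. This is a rough sketch, assuming `boto3` is installed and that `R2_ACCESS_KEY` and `R2_SECRET_KEY` are present in the environment, as in the shell snippet. `r2_client_kwargs` and `download_prefix` are hypothetical helper names, and `region_name="auto"` is assumed from Cloudflare's usual guidance for R2. Unlike `aws s3 sync`, this sketch re-downloads every object on each run instead of skipping unchanged files.

```python
import os
from typing import Mapping

R2_ENDPOINT = "https://7dedae6030f5554d99d37e98a5232996.r2.cloudflarestorage.com"


def r2_client_kwargs(env: Mapping[str, str]) -> dict:
    """Map the R2_* variables onto the keyword arguments boto3 expects."""
    return {
        "endpoint_url": R2_ENDPOINT,
        "region_name": "auto",  # assumption: the region value commonly used for R2
        "aws_access_key_id": env["R2_ACCESS_KEY"],
        "aws_secret_access_key": env["R2_SECRET_KEY"],
    }


def download_prefix(bucket: str, prefix: str, dest: str) -> int:
    """Download every object under prefix into dest and return the file count."""
    import boto3  # imported lazily so r2_client_kwargs works without boto3

    s3 = boto3.client("s3", **r2_client_kwargs(os.environ))
    count = 0
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip directory placeholder objects
            target = os.path.join(dest, os.path.relpath(key, prefix))
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            s3.download_file(bucket, key, target)
            count += 1
    return count


# Usage (requires network access and valid credentials):
# download_prefix("architect", "gpu-services/", "./gpu-services")
```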
---

## Alternatives to Evaluate

### 1. Modal (Recommended)

- Pricing: Pay-per-use
- Pros: True serverless, good DX, Python-native
- Cons: Fewer GPUs available than RunPod

```python
import modal

# modal.App replaces modal.Stub, which Modal has deprecated
app = modal.App("grace")

@app.function(gpu="A10G")
def process_asr(audio_bytes: bytes):
    # Whisper inference (placeholder, not yet implemented)
    pass
```

### 2. Replicate

- Pricing: Per-inference
- Pros: Pre-trained models, simple API
- Cons: Less control, more expensive at scale

### 3. Lambda Labs

- Pricing: Hourly
- Pros: Dedicated hardware
- Cons: Less flexible, manual reservation

### 4. Self-Hosted

- Pricing: Upfront + hosting
- Pros: Full control, no external dependencies
- Cons: High CapEx, maintenance

---

## Proposed Migration

### Phase 1: Evaluation (1 week)
- [ ] Test Modal with the ASR handler
- [ ] Compare latency and cost

### Phase 2: Migration (2 weeks)
- [ ] Port the 6 handlers to Modal
- [ ] Update endpoints in MASON

### Phase 3: Production
- [ ] Deploy on Modal
- [ ] Deprecate the RunPod endpoints

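For the Phase 1 comparison, one workable approach is to reduce each provider trial to two numbers: a latency percentile and a cost per job. The sketch below assumes billing is proportional to wall-clock job time. The provider labels, sample latencies, and per-second prices are made-up placeholders rather than measured or quoted figures, and `Trial`, `p95`, and `summarize` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Trial:
    provider: str
    latencies_s: list[float]  # end-to-end seconds per job
    price_per_gpu_s: float    # billed USD per GPU-second (placeholder value)


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(trial: Trial) -> dict:
    """Reduce a trial to its p95 latency and mean cost per job."""
    mean_s = sum(trial.latencies_s) / len(trial.latencies_s)
    return {
        "provider": trial.provider,
        "p95_s": p95(trial.latencies_s),
        # Assumes cost scales linearly with job duration.
        "cost_per_job_usd": round(mean_s * trial.price_per_gpu_s, 6),
    }


# Placeholder trials; replace with latencies measured from the ASR handler.
trials = [
    Trial("provider_a", [2.0, 2.5, 3.0, 2.2, 9.0], 0.0003),
    Trial("provider_b", [1.8, 2.1, 2.4, 2.0, 2.3], 0.0004),
]
for trial in trials:
    print(summarize(trial))
```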
---

## Current RunPod Configuration

### API Key

```
Stored in: creds_runpod (PostgreSQL ARCHITECT)
Field: api_key_runpod
```

### SDK

```python
import runpod

# Load the key from creds_runpod; do not hardcode it
runpod.api_key = "..."

# Create a job on the GRACE endpoint
endpoint = runpod.Endpoint("r00x4g3rrwkbyh")
job = endpoint.run({"input": {"audio": "base64..."}})

# Check status
status = job.status()
```

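Because a job can sit in the queue indefinitely when no worker starts, callers may want to bound the wait. The following is a minimal sketch of a polling loop with a deadline. `wait_for_job` is a hypothetical helper that accepts any zero-argument callable returning the job's current status string, and the status names (`IN_QUEUE`, `IN_PROGRESS`, `COMPLETED`, `FAILED`, `CANCELLED`, `TIMED_OUT`) are assumed to match those reported by RunPod's serverless API. The `sleep` parameter exists only so the loop can be exercised without real delays.

```python
import time
from typing import Callable

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def wait_for_job(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            # Give up instead of waiting forever on a queue nobody is serving.
            raise TimeoutError(f"job still {status} after {timeout_s}s")
        sleep(poll_s)


# Simulated job that leaves the queue and completes on the third poll.
states = iter(["IN_QUEUE", "IN_PROGRESS", "COMPLETED"])
print(wait_for_job(lambda: next(states), poll_s=0.0))  # COMPLETED
```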
---

## Recommendation

**Do NOT rely on RunPod for production.**

Migrate to Modal as a high priority once the critical security issues are resolved.