sachin sharma committed · 4f88f85
1 parent: d136e15
added test case generation

Files changed:
- README.md +177 -2
- requirements.in +6 -0
- requirements.txt +27 -8
README.md CHANGED

@@ -43,9 +43,12 @@ ml-inference-service/
 ├── models/
 │   └── resnet-18/                 # Sample HF-style model folder
 ├── scripts/
-│   …
+│   ├── model_download.bash        # One-liner to snapshot HF weights locally
+│   ├── generate_test_datasets.py  # Generate PyArrow datasets for testing
+│   ├── test_datasets.py           # Test generated datasets against API
+│   └── test_datasets/             # Generated PyArrow test datasets (100 files)
 ├── requirements.in / requirements.txt
-└── test_main.http
+└── test_main.http                 # Example request you can run from IDEs
 ```
 
 ---

@@ -262,3 +265,175 @@ Then set `MODEL_NAME=your-org/your-model` in your environment (Pydantic will map
 - **Prod**: Use a process manager (e.g., `gunicorn -k uvicorn.workers.UvicornWorker`) and add health checks.
 - **Containerize**: Copy only `requirements.txt` and source, install wheels, and bake the `models/` folder into the image or mount it as a volume.
 - **CPU vs GPU**: This example uses CPU by default. If you have CUDA, install a CUDA-enabled PyTorch build and set device placement in your service.
|
| 268 |
+
|
| 269 |
+
---
|
| 270 |
+
|
| 271 |
+
## π§ͺ PyArrow Test Datasets
|
| 272 |
+
|
| 273 |
+
This project includes a comprehensive **PyArrow-based dataset generation system** designed specifically for academic challenges and ML model validation. The system generates **100 standardized test datasets** that allow participants to validate their models against consistent, reproducible test cases.

### 🗂️ Why Both? `.parquet` + `_metadata.json`

```
standard_test_001.parquet          # Actual test data (images, requests, responses)
standard_test_001_metadata.json    # Human-readable description and stats
```
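
The sidecar convention above can be illustrated with a short sketch. The metadata field names used here (`dataset`, `description`, `num_samples`) are illustrative assumptions, not the actual schema produced by `generate_test_datasets.py`:

```python
import json
from pathlib import Path

def write_metadata(parquet_path: str, description: str, num_samples: int) -> Path:
    """Write the human-readable JSON sidecar next to a .parquet file."""
    p = Path(parquet_path)
    meta_path = p.with_name(p.stem + "_metadata.json")
    meta_path.write_text(json.dumps({
        "dataset": p.name,            # hypothetical fields -- for illustration only
        "description": description,
        "num_samples": num_samples,
    }, indent=2))
    return meta_path

path = write_metadata("standard_test_001.parquet", "baseline checks", 10)
loaded = json.loads(path.read_text())
```

The `_metadata.json` suffix keeps each Parquet file's description greppable without opening the binary file itself.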

### 📊 Dataset Categories (25 each = 100 total)

#### 1. **Standard Test Cases** (`standard_test_*.parquet`)
**Purpose**: Baseline functionality validation

**Content**: Normal images with expected successful predictions

- **Image Types**: Random patterns, geometric shapes, gradients, text overlays, solid colors
- **Formats**: JPEG and PNG with proper MIME types
- **Sizes**: 224x224, 256x256, 299x299, 384x384 (common ML input sizes)
- **Expected Behavior**: HTTP 200 responses with a valid prediction structure

#### 2. **Edge Case Tests** (`edge_case_*.parquet`)
**Purpose**: Robustness and error-handling validation

**Content**: Challenging scenarios that test model resilience

- **Tiny Images**: 32x32 and 1x1 pixels (tests preprocessing robustness)
- **Huge Images**: 2048x2048 (tests memory management and resizing)
- **Extreme Aspect Ratios**: 1000x50 (tests preprocessing assumptions)
- **Corrupted Data**: Invalid base64 and malformed requests (tests error handling)
- **Expected Behavior**: Graceful degradation and proper error responses
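
As an illustration of the corrupted-data cases, strict base64 validation can reject a malformed payload before it ever reaches the model. This is a hedged sketch, not the service's actual error-handling code:

```python
import base64
import binascii

def decode_image_payload(data: str) -> bytes:
    """Strictly decode the request's base64 image field."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        return base64.b64decode(data, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"invalid base64 image data: {exc}") from exc

valid = base64.b64encode(b"\xff\xd8\xff").decode()  # JPEG magic bytes, encoded
ok = decode_image_payload(valid)

try:
    decode_image_payload("not!!valid@@base64")       # a corrupted edge-case input
    rejected = False
except ValueError:
    rejected = True
```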

#### 3. **Performance Benchmarks** (`performance_test_*.parquet`)
**Purpose**: Latency and throughput measurement

**Content**: Varying batch sizes for performance profiling

- **Batch Sizes**: 1, 5, 10, 25, 50, 100 images per test
- **Latency Tracking**: Expected maximum response times based on batch size
- **Throughput Metrics**: Requests per second under different loads
- **Expected Behavior**: Consistent performance within acceptable bounds
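
Latency and throughput summaries like these can be aggregated from per-request timings with the standard library; the metric names below are illustrative, not the project's exact output keys:

```python
import statistics

def summarize_latencies(latencies_ms: list) -> dict:
    """Aggregate per-request latencies (ms) into summary metrics."""
    total_s = sum(latencies_ms) / 1000.0
    return {
        "avg_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
        # Throughput if the requests were issued back to back
        "requests_per_sec": len(latencies_ms) / total_s,
    }

stats = summarize_latencies([100.0, 120.0, 80.0, 100.0])
```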

#### 4. **Model Comparison** (`model_comparison_*.parquet`)
**Purpose**: Cross-model validation and benchmarking

**Content**: Identical inputs tested across different model architectures

- **Model Types**: ResNet-18/50, ViT, ConvNeXt, Swin Transformer
- **Consistent Inputs**: Same 10 base images per dataset
- **Comparative Analysis**: Enables direct performance comparison between models
- **Expected Behavior**: Architecture-specific but structurally consistent responses

### 🛠️ Generation Process

The dataset generation follows a **deterministic, reproducible approach**:

#### Step 1: Synthetic Image Creation
```python
import numpy as np

# Why synthetic images instead of real photos?
# 1. Copyright-free for academic distribution
# 2. Programmatically generated edge cases

def create_synthetic_image(width, height, image_type):
    if image_type == "random":
        # RGB noise - tests model noise robustness
        array = np.random.randint(0, 256, (height, width, 3), dtype=np.uint8)
    elif image_type == "geometric":
        # Shapes and patterns - tests feature detection
        # ... geometric pattern generation
        ...
    # ... other synthetic types
```
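
The `random` branch can be fleshed out into a small runnable function. The seeded `default_rng` here is an assumption added to demonstrate the reproducibility claim; the actual script may seed differently:

```python
import numpy as np

def make_random_image(width: int, height: int, seed: int = 0) -> np.ndarray:
    """Deterministic RGB noise image with shape (height, width, 3), dtype uint8."""
    rng = np.random.default_rng(seed)  # fixed seed -> reproducible pixels
    return rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)

img_a = make_random_image(224, 224, seed=42)
img_b = make_random_image(224, 224, seed=42)  # same seed -> identical image
```

Reproducibility matters here because every participant must receive byte-identical test datasets.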

#### Step 2: API Request Structure Generation
```python
# Matches the exact API format for drop-in testing
{
    "image": {
        "mediaType": "image/jpeg",        # Proper MIME type
        "data": "<base64-encoded-image>"  # Standard base64 encoding
    }
}
```
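
Assembling such a request from raw image bytes takes only the standard library; `build_inference_request` is a hypothetical helper shown for illustration, not part of the project code:

```python
import base64
import json

def build_inference_request(image_bytes: bytes, media_type: str = "image/jpeg") -> str:
    """Serialize raw image bytes into the API's JSON request body."""
    return json.dumps({
        "image": {
            "mediaType": media_type,
            "data": base64.b64encode(image_bytes).decode("ascii"),
        }
    })

body = json.loads(build_inference_request(b"\xff\xd8\xff\xe0"))
```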

#### Step 3: Expected Response Generation
```python
# Realistic prediction responses with the proper structure
{
    "prediction": "tiger_cat",        # ImageNet-style label
    "confidence": 0.8742,             # Realistic confidence score
    "predicted_label": 282,           # Numeric label index
    "model": "microsoft/resnet-18",   # Model identification
    "mediaType": "image/jpeg"         # Echoes the input format
}
```
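
Because confidence values vary from run to run, one plausible way to validate a response is a structural check on keys and types rather than exact numbers. `response_matches_schema` is an illustrative helper, not the project's validator:

```python
# Expected key -> type pairs, taken from the response fields shown above
EXPECTED_TYPES = {
    "prediction": str,
    "confidence": float,
    "predicted_label": int,
    "model": str,
    "mediaType": str,
}

def response_matches_schema(response: dict) -> bool:
    """True if every expected key is present with the expected type."""
    return all(
        key in response and isinstance(response[key], expected)
        for key, expected in EXPECTED_TYPES.items()
    )

good = response_matches_schema({
    "prediction": "tiger_cat",
    "confidence": 0.8742,
    "predicted_label": 282,
    "model": "microsoft/resnet-18",
    "mediaType": "image/jpeg",
})
bad = response_matches_schema({"prediction": "tiger_cat"})  # missing keys
```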

#### Step 4: PyArrow Table Creation
```python
# Columnar storage for efficient querying
table = pa.table({
    "dataset_id": [...],         # Unique dataset identifier
    "image_id": [...],           # Individual image identifier
    "api_request": [...],        # JSON-serialized requests
    "expected_response": [...],  # JSON-serialized expected responses
    "test_category": [...],      # Category classification
    "difficulty": [...],         # Complexity indicator
    # ... additional metadata columns
})
```

### 🚀 Usage Guide

**1. Generate Test Datasets**
```bash
# Create all 100 datasets (~2-5 minutes depending on hardware)
python scripts/generate_test_datasets.py

# What this creates:
# - scripts/test_datasets/*.parquet (actual test data)
# - scripts/test_datasets/*_metadata.json (human-readable info)
# - scripts/test_datasets/datasets_summary.json (overview)
```

**2. Validate API**
```bash
# Start your ML service
uvicorn main:app --reload

# Quick test (5 samples per dataset)
python scripts/test_datasets.py --quick

# Full validation (all samples)
python scripts/test_datasets.py

# Category-specific testing
python scripts/test_datasets.py --category edge_case
python scripts/test_datasets.py --category performance
```

### 📈 Testing Output and Metrics

The test runner provides comprehensive validation metrics:

```
📊 DATASET TESTING SUMMARY
============================================================
Datasets tested: 100
Successful datasets: 95
Failed datasets: 5
Total samples: 1,247
Overall success rate: 87.3%
Test duration: 45.2s

Performance:
  Avg latency: 123.4ms
  Median latency: 98.7ms
  Min latency: 45.2ms
  Max latency: 2,341.0ms
  Requests/sec: 27.6

Category breakdown:
  standard: 25 datasets, 94.2% avg success
  edge_case: 25 datasets, 76.8% avg success
  performance: 25 datasets, 91.1% avg success
  model_comparison: 25 datasets, 89.3% avg success

Failed datasets: edge_case_023, edge_case_019, performance_012
```
requirements.in CHANGED

@@ -14,3 +14,9 @@ python-multipart==0.0.6
 transformers>=4.35.0
 torch>=2.4.0  # Newer PyTorch with NumPy 2.x support
 pillow>=10.0.0
+
+# Dataset generation and testing
+pyarrow>=14.0.0
+numpy>=1.24.0
+pandas>=2.0.0
+requests>=2.25.0
requirements.txt CHANGED

@@ -1,5 +1,9 @@
-#
-#
+#
+# This file is autogenerated by pip-compile with Python 3.12
+# by the following command:
+#
+#    pip-compile requirements.in
+#
 annotated-types==0.7.0
     # via pydantic
 anyio==3.7.1

@@ -47,7 +51,10 @@ mpmath==1.3.0
 networkx==3.5
     # via torch
 numpy==2.3.2
-    # via
+    # via
+    #   -r requirements.in
+    #   pandas
+    #   transformers
 nvidia-cublas-cu12==12.8.4.1
     # via
     #   nvidia-cudnn-cu12

@@ -89,8 +96,12 @@ packaging==25.0
     # via
     #   huggingface-hub
     #   transformers
+pandas==2.3.2
+    # via -r requirements.in
 pillow==10.1.0
     # via -r requirements.in
+pyarrow==21.0.0
+    # via -r requirements.in
 pydantic==2.5.0
     # via
     #   -r requirements.in

@@ -100,6 +111,8 @@ pydantic-core==2.14.1
     # via pydantic
 pydantic-settings==2.0.3
     # via -r requirements.in
+python-dateutil==2.9.0.post0
+    # via pandas
 python-dotenv==0.21.0
     # via
     #   -r requirements.in

@@ -107,6 +120,8 @@ python-dotenv==0.21.0
     #   uvicorn
 python-multipart==0.0.6
     # via -r requirements.in
+pytz==2025.2
+    # via pandas
 pyyaml==6.0.2
     # via
     #   huggingface-hub

@@ -116,14 +131,13 @@ regex==2025.7.34
     # via transformers
 requests==2.32.5
     # via
+    #   -r requirements.in
     #   huggingface-hub
     #   transformers
 safetensors==0.6.2
     # via transformers
-
-    # via
-    #   torch
-    #   triton
+six==1.17.0
+    # via python-dateutil
 sniffio==1.3.1
     # via anyio
 starlette==0.27.0

@@ -149,9 +163,11 @@ typing-extensions==4.15.0
     #   pydantic
     #   pydantic-core
     #   torch
+tzdata==2025.2
+    # via pandas
 urllib3==2.5.0
     # via requests
-uvicorn==0.24.0
+uvicorn[standard]==0.24.0
     # via -r requirements.in
 uvloop==0.21.0
     # via uvicorn

@@ -159,3 +175,6 @@ watchfiles==1.1.0
     # via uvicorn
 websockets==15.0.1
     # via uvicorn
+
+# The following packages are considered to be unsafe in a requirements file:
+# setuptools
|