{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}
CLIP
OpenAI's model connecting vision and language. Supports zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on image-text pairs. Use it for image search, content moderation, or similarity tasks. Best suited for general-purpose image understanding.
Technical metadata

| Field | Value |
|-------|-------|
| Source | Optional - install with `hermes skills install official/mlops/clip` |
| Path | optional-skills/mlops/clip |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Platforms | linux, macos, windows |
| Tags | Multimodal, CLIP, Vision-Language, Zero-Shot, Image Classification, OpenAI, Image Search, Cross-Modal Retrieval, Content Moderation |
Reference: full SKILL.md

Below is the original SKILL.md definition that Hermes loads when it activates this skill. This reference block preserves the original text so that commands, code, and identifiers remain exact.
# CLIP - Contrastive Language-Image Pre-Training
OpenAI's model that understands images from natural language.
## When to use CLIP
**Use when:**
- Zero-shot image classification (no training data needed)
- Image-text similarity/matching
- Semantic image search
- Content moderation (detect NSFW, violence)
- Visual question answering
- Cross-modal retrieval (image→text, text→image)
**Metrics**:
- **25,300+ GitHub stars**
- Trained on 400M image-text pairs
- Matches ResNet-50 on ImageNet (zero-shot)
- MIT License
**Use alternatives instead**:
- **BLIP-2**: Better captioning
- **LLaVA**: Vision-language chat
- **Segment Anything**: Image segmentation
## Quick start
### Installation
```bash
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
```
### Zero-shot classification
```python
import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
labels = ["a dog", "a cat", "a bird", "a car"]
text = clip.tokenize(labels).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity (scaled logits)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```
## Available models
```python
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")
```
| Model | Parameters | Speed | Quality |
|-------|------------|-------|---------|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |
## Image-text similarity
```python
# Compute embeddings
image_features = model.encode_image(image)
text_features = model.encode_text(text)
# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
# Cosine similarity (for a single image and a single text;
# with multiple texts, the result is a matrix rather than a scalar)
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
```
## Semantic image search
```python
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []
for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)
image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)
for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
```
## Content moderation
```python
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content",
]
text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()
print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
```
## Batch processing
```python
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)
with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # torch.Size([10, 3])
```
## Integration with vector databases
```python
# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings (image_embeddings as computed in the search example)
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path],
    )

# Query with text
query = "a sunset"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)
results = collection.query(
    query_embeddings=[text_embedding.squeeze(0).cpu().numpy().tolist()],
    n_results=5,
)
```
## Best practices
1. **Use ViT-B/32 for most cases** - Good balance
2. **Normalize embeddings** - Required for cosine similarity
3. **Batch processing** - More efficient
4. **Cache embeddings** - Expensive to recompute
5. **Use descriptive labels** - Better zero-shot performance
6. **GPU recommended** - 10-50× faster
7. **Preprocess images** - Use provided preprocess function
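Best practice 5 can be made concrete with a small helper. This is a sketch, not part of the CLIP API: `build_prompts` is a hypothetical name, and the template follows the common "a photo of a {label}" pattern used for CLIP zero-shot prompts.

```python
# Hypothetical helper: wrap raw class names in a descriptive caption template.
# CLIP was trained on natural-language captions, so "a photo of a dog"
# typically scores better than the bare label "dog".
def build_prompts(labels, template="a photo of a {}"):
    return [template.format(label) for label in labels]

prompts = build_prompts(["dog", "cat", "bird"])
print(prompts)  # ['a photo of a dog', 'a photo of a cat', 'a photo of a bird']
```

The resulting list feeds directly into `clip.tokenize(prompts)` in place of the bare labels.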
## Performance
| Operation | CPU | GPU (V100) |
|-----------|-----|------------|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |
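Given the per-image encoding cost above, caching embeddings (best practice 4) pays off quickly. A minimal disk-cache sketch using NumPy; `get_cached_embedding` and the cache layout are illustrative, not part of CLIP:

```python
import os
import numpy as np

def get_cached_embedding(img_path, encode_fn, cache_dir="emb_cache"):
    """Return the embedding for img_path, computing it only on a cache miss.

    encode_fn is any callable mapping an image path to a 1-D numpy array,
    e.g. a wrapper around model.encode_image plus normalization.
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, os.path.basename(img_path) + ".npy")
    if os.path.exists(cache_file):
        return np.load(cache_file)       # cache hit: skip the model entirely
    embedding = encode_fn(img_path)      # cache miss: run the encoder once
    np.save(cache_file, embedding)
    return embedding
```

Subsequent lookups for the same image read the `.npy` file instead of re-running the encoder, which matters once a collection grows past a few hundred images.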
## Limitations
1. **Not for fine-grained tasks** - Best for broad categories
2. **Requires descriptive text** - Vague labels perform poorly
3. **Biased on web data** - May have dataset biases
4. **No bounding boxes** - Whole image only
5. **Limited spatial understanding** - Position/counting weak
## Resources
- **GitHub**: https://github.com/openai/CLIP ⭐ 25,300+
- **Paper**: https://arxiv.org/abs/2103.00020
- **Colab**: https://colab.research.google.com/github/openai/clip/
- **License**: MIT