# Handling Long Documents in Sonamu
You're building a feature to upload blog posts or manuals to your Sonamu app:

- Long documents (10,000+ words)
- The text exceeds the embedding API's token limit
- The entire document cannot be embedded at once
## What is Chunking?

Chunking splits a long document into smaller pieces (chunks), each small enough to embed on its own. Key points:

- Long document -> multiple chunks
- Each chunk -> its own embedding
- At search time -> return the most relevant chunks
## Why is it Necessary?

1. Token limits
   - Voyage AI: 32,000 tokens
   - OpenAI: 8,191 tokens
   - Long documents exceed these limits
2. Search accuracy
   - Shorter chunks yield more accurate matches
   - Searching "refund method" returns only the refund section
3. Semantic coherence
   - Keep related information together
   - Split without breaking sentences
## Sonamu's Chunking Class
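The class itself is not reproduced here. Below is a minimal sketch of what a recursive character chunker with the options discussed in this guide might look like. The class and option names (`TextChunker`, `chunkSize`, `chunkOverlap`, `skipThreshold`, `separators`) mirror this guide's terminology but are assumptions, not Sonamu's verified API:

```typescript
// Minimal sketch of a recursive character chunker.
// Names mirror this guide's terminology; they are NOT Sonamu's verified API.
interface ChunkingOptions {
  chunkSize: number;     // target maximum characters per chunk
  chunkOverlap: number;  // characters carried over between adjacent chunks
  skipThreshold: number; // documents at or below this length are not split
  separators: string[];  // split points, tried in priority order
}

class TextChunker {
  constructor(private opts: ChunkingOptions) {}

  chunk(text: string): string[] {
    // Short documents are returned whole (see skipThreshold).
    if (text.length <= this.opts.skipThreshold) return [text];
    return this.merge(this.split(text, 0));
  }

  // Recursively split on the highest-priority separator until pieces fit.
  private split(text: string, sepIndex: number): string[] {
    if (text.length <= this.opts.chunkSize) return [text];
    if (sepIndex >= this.opts.separators.length) {
      // No separators left: hard-cut at chunkSize.
      const out: string[] = [];
      for (let i = 0; i < text.length; i += this.opts.chunkSize) {
        out.push(text.slice(i, i + this.opts.chunkSize));
      }
      return out;
    }
    return text
      .split(this.opts.separators[sepIndex])
      .filter((p) => p.length > 0)
      .flatMap((p) => this.split(p, sepIndex + 1));
  }

  // Greedily pack pieces into chunks, carrying an overlap tail forward.
  private merge(pieces: string[]): string[] {
    const chunks: string[] = [];
    let current = "";
    for (const piece of pieces) {
      if (current && current.length + piece.length + 1 > this.opts.chunkSize) {
        chunks.push(current.trim());
        current =
          this.opts.chunkOverlap > 0
            ? current.slice(-this.opts.chunkOverlap)
            : "";
      }
      current = current ? `${current} ${piece}` : piece;
    }
    if (current) chunks.push(current.trim());
    return chunks;
  }
}
```

With `separators: ["\n\n", "\n", ". ", " "]`, paragraph structure survives whenever possible, and hard cuts happen only as a last resort.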
## Using in a Sonamu Model

### Long Document Upload + Chunking
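Sonamu's actual model code is not reproduced here. The sketch below shows the general flow: chunk the document, embed each chunk, and persist one row per chunk. The function and type names (`uploadDocument`, `Embedder`, `save`) are stand-ins, not Sonamu's API:

```typescript
// Hypothetical upload flow: chunk -> embed each chunk -> save one row each.
// All names are illustrative stand-ins, not Sonamu's actual API.
type Embedder = (texts: string[]) => Promise<number[][]>;
type ChunkRow = {
  document_id: number;
  chunk_index: number;
  content: string;
  embedding: number[];
};

// Naive fixed-size chunker, used only to keep this sketch self-contained.
function chunk(text: string, size: number): string[] {
  const out: string[] = [];
  for (let i = 0; i < text.length; i += size) out.push(text.slice(i, i + size));
  return out;
}

async function uploadDocument(
  documentId: number,
  content: string,
  embed: Embedder,
  save: (rows: ChunkRow[]) => Promise<void>
): Promise<number> {
  const chunks = chunk(content, 500);
  const embeddings = await embed(chunks); // one embedding per chunk
  const rows = chunks.map((c, i) => ({
    document_id: documentId,
    chunk_index: i, // preserves the chunk's position in the document
    content: c,
    embedding: embeddings[i],
  }));
  await save(rows);
  return rows.length;
}
```

Embedding all chunks in one batch call (rather than one request per chunk) keeps the upload fast and stays within API rate limits.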
### Table Structure
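The schema itself is not shown here. A plausible row shape, with one row per chunk linked back to its parent document, might look like the following. Table and column names are illustrative, not Sonamu's generated schema:

```typescript
// Hypothetical row shape of a document_chunks table.
// Column names are illustrative, not Sonamu's generated schema.
interface DocumentChunkRow {
  id: number;          // primary key
  document_id: number; // FK to the parent documents table
  chunk_index: number; // position of this chunk within the document
  content: string;     // the chunk text itself
  embedding: number[]; // embedding vector (JSON or a native vector column)
  created_at: string;  // timestamp
}
```

A unique constraint on `(document_id, chunk_index)` keeps re-uploads from duplicating chunks.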
## Understanding Configuration Options
### chunkSize: Chunk Size

Recommended sizes by use case:

- Short-query search: 200-300 characters
- General use: 400-600 characters
- Long context: 800-1000 characters

Rough character-to-token ratios:

- Korean: 1 character ≈ 1 token
- English: 1 character ≈ 0.7 tokens
### chunkOverlap: Overlap Size
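The original example for this option is not shown here. The idea: the tail of each chunk is repeated at the head of the next, so a sentence cut at a boundary still appears intact in at least one chunk; overlap is commonly set to around 10-20% of `chunkSize`. A minimal sliding-window sketch, not Sonamu's implementation:

```typescript
// Sliding-window illustration of chunkOverlap (not Sonamu's implementation):
// each chunk repeats the last `chunkOverlap` characters of the previous one.
function slidingChunks(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const out: string[] = [];
  for (let i = 0; i < text.length; i += step) {
    out.push(text.slice(i, i + chunkSize));
    if (i + chunkSize >= text.length) break; // last window reached the end
  }
  return out;
}
```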
### skipThreshold: Skip Splitting
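The original example for this option is not shown here. The behavior it controls is simple: documents at or below the threshold are returned as a single chunk, avoiding pointless splits (and extra embedding rows) for short texts like FAQ items. A sketch, with illustrative names:

```typescript
// skipThreshold illustration: short documents become a single chunk.
// Function name and fixed-size fallback are illustrative only.
function maybeChunk(
  text: string,
  skipThreshold: number,
  chunkSize: number
): string[] {
  if (text.length <= skipThreshold) return [text]; // no split needed
  const out: string[] = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    out.push(text.slice(i, i + chunkSize));
  }
  return out;
}
```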
### separators: Separator Priority
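The original example for this option is not shown here. The principle: try to split on paragraph breaks first, then line breaks, then sentence ends, then spaces, so that higher-level structure is preserved whenever possible. A sketch of picking the first separator that actually occurs in the text (the separator list is a common default, not a confirmed Sonamu default):

```typescript
// Separator priority illustration: higher-priority separators preserve
// more document structure. List and function are illustrative only.
const SEPARATORS = ["\n\n", "\n", ". ", " "];

// Returns the highest-priority separator present in the text,
// or null if the text contains none (forcing a hard cut).
function firstSeparator(text: string): string | null {
  for (const sep of SEPARATORS) {
    if (text.includes(sep)) return sep;
  }
  return null;
}
```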
## Practical Scenarios

### Scenario: Technical Documentation Knowledge Base
You're building a development documentation search system with Sonamu.

#### Step 1: Conditional Chunking

#### Optimizing for Markdown Documents
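For Markdown documents, splitting on heading boundaries before paragraph breaks keeps each chunk inside a single section. The separator list below is an assumption about a sensible configuration, not a documented Sonamu default:

```typescript
// Markdown-aware separator priority: headings first, then paragraphs.
// The list and helper are illustrative, not Sonamu defaults.
const MARKDOWN_SEPARATORS = ["\n## ", "\n### ", "\n\n", "\n", " "];

// First pass: split on H2 boundaries so chunks stay within one section.
// Deeper splitting would continue down MARKDOWN_SEPARATORS.
function splitSections(markdown: string): string[] {
  return markdown
    .split("\n## ")
    .map((s, i) => (i === 0 ? s : "## " + s)) // restore the heading prefix
    .filter((s) => s.trim().length > 0);
}
```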
## Chunking vs Full Document

### When is Chunking Necessary?
| Document Type | Average Length | Chunking Needed | Reason |
|---|---|---|---|
| FAQ items | < 200 chars | No | Short |
| Blog posts | 1,000 chars | Optional | Medium |
| Technical docs | 5,000 chars | Yes | Long |
| Manuals | 20,000 chars | Required | Very long |
| Chat messages | < 100 chars | No | Short |
### Decision in Sonamu
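A decision helper following the table above might look like this. The thresholds mirror the table's length brackets; the function itself is an illustrative sketch, not Sonamu's API:

```typescript
// Chunk only when the document is long enough to benefit.
// Thresholds follow the table above; the function is illustrative.
function shouldChunk(contentLength: number): "no" | "optional" | "yes" {
  if (contentLength < 1000) return "no";       // FAQ items, chat messages
  if (contentLength < 5000) return "optional"; // blog posts
  return "yes";                                // technical docs, manuals
}
```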
## Cautions
## Next Steps

- Vector Search: implementing a chunk-based search API
- Embeddings: generating batch embeddings