SDK Overview
Data Intelligence SDK
Knowledge graph + retrieval-augmented generation for context-aware AI. A multi-tenant SDK that combines large language models with graph-based retrieval, so your applications can reason over private data - not just pattern-match against it.
Why Data Intelligence
LLMs are powerful, but they're limited to what they saw in training. The moment you point them at private documents - your contracts, your patient records, your internal research - they start guessing. Standard RAG helps, but only partially: chunks get retrieved in isolation, context gets fragmented, and the model still can't connect dots that live in different documents.
Data Intelligence fixes the connection problem. Instead of retrieving disconnected chunks, it builds a knowledge graph from your data first - entities, relationships, themes - and uses that graph at query time to deliver answers that are coherent, grounded, and traceable, with less hallucination and fewer fragmented responses.
What makes it different
Reasons across documents, not just within them
Standard RAG retrieves chunks that match keywords. Data Intelligence traverses relationships - chunks from different documents stay connected through the graph, and the model sees how concepts relate, not just what words appeared near each other. When a question requires connecting facts across multiple documents - “which clauses in contract A reference obligations defined in contract B?” - the graph makes those links first-class, and the SDK stops re-retrieving the same idea in different phrasings.
Domain-adaptive
The graph adapts to your terminology. Legal, healthcare, scientific, financial - wherever the relationships between entities matter as much as the entities themselves, the graph reflects your domain's vocabulary instead of forcing it into a generic schema.
Built-in provenance
Every claim in a response is traceable to the source document and the path through the graph that produced it. Auditability isn’t a feature you bolt on later - it’s how the system works.
How it works
Data Intelligence runs four stages, three at ingestion time and one at query time.
1. Dataset ingestion
Data enters through batch upload, live connectors, or MCP-compliant SaaS sources - see Programmatic ingestion, Live connectors, and MCP integrations below for the full surface. From any path, the system reads the content and identifies the entities that matter - people, places, organizations, concepts - along with the relationships between them.
2. Knowledge graph construction
The extracted entities and relationships are assembled into a knowledge graph. Each node is an entity. Each edge is a relationship. The graph is generated and refined by the LLM itself, then stored alongside your documents.
3. Semantic clustering
The graph is analyzed bottom-up to form clusters of related entities. These clusters represent themes - distinct semantic regions of your data. They enable hierarchical understanding and pre-summarization, so the system can reason about whole topics, not just individual facts.
4. Query-time augmentation
At inference, the user’s query triggers retrieval across both the graph structure and the original source content. Retrieval spans every connected source in scope - uploaded files and live connectors alike - concurrently, and the LLM receives the relevant subgraph plus source passages to generate an answer grounded in both.
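The four stages can be sketched end-to-end in a few dozen lines. Everything below is illustrative, not the SDK's actual API: the hand-written extraction tuples stand in for what the LLM would produce at ingestion, and the cluster and retrieval functions are deliberately naive stand-ins for the real graph machinery.

```python
from collections import defaultdict

# 1. Ingestion: entities and relationships "extracted" from two documents.
#    (doc, source entity, relationship, target entity)
extracted = [
    ("contract_a", "Acme Corp", "obligated_to", "Delivery by Q3"),
    ("contract_b", "Delivery by Q3", "defined_in", "Clause 4.2"),
    ("contract_b", "Acme Corp", "party_to", "Clause 4.2"),
]

# 2. Graph construction: nodes are entities, edges are relationships,
#    each edge remembering which document it came from.
graph = defaultdict(list)
for doc, src, rel, dst in extracted:
    graph[src].append((rel, dst, doc))

# 3. Semantic clustering: here, a trivial grouping by connected component.
def cluster(graph):
    seen, clusters = set(), []
    for node in graph:
        if node in seen:
            continue
        component, stack = set(), [node]
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(dst for _, dst, _ in graph.get(n, []))
        seen |= component
        clusters.append(component)
    return clusters

# 4. Query-time augmentation: walk outward from the query's entity and
#    collect the subgraph plus the source document backing each edge.
def retrieve(graph, entity, depth=2):
    facts, frontier = [], [entity]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for rel, dst, doc in graph.get(node, []):
                facts.append((node, rel, dst, doc))
                next_frontier.append(dst)
        frontier = next_frontier
    return facts

themes = cluster(graph)
facts = retrieve(graph, "Acme Corp")
```

Note that the retrieved facts span both documents: the cross-document link between Acme Corp's obligation and the clause that defines it is exactly the kind of connection chunk-only retrieval loses.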
Multi-tenant by design
Data Intelligence is built to serve multiple isolated tenants from a single deployment. Every document, graph, and query is scoped to an organization, and within that, to a project. Users belong to organizations and have roles that determine what they can read, ingest, or modify. Tokens carry the tenant context, so a query issued by one tenant can never traverse another tenant’s graph - isolation happens at the data layer, not just the application layer.
This means:
- A single SDK deployment can serve dozens of customers without cross-contamination
- Inside a customer’s account, separate projects keep different data domains cleanly partitioned
- Role-based access control determines who can ingest, query, or administer each project
- Audit trails are scoped per tenant, so compliance reviews stay clean
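Data-layer isolation can be illustrated with a minimal sketch. The field and type names here are hypothetical, not the SDK's schema; the point is that the tenant filter runs unconditionally inside the query path, before anything is retrieved, so application code never gets the chance to leak across organizations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    user: str
    org: str
    project: str
    role: str  # e.g. "reader", "ingestor", "admin"

# Records from three tenant/project scopes living in one deployment.
DOCUMENTS = [
    {"id": 1, "org": "acme",   "project": "contracts", "text": "NDA v2"},
    {"id": 2, "org": "acme",   "project": "hr",        "text": "Handbook"},
    {"id": 3, "org": "globex", "project": "contracts", "text": "MSA"},
]

def query(token: Token, docs=DOCUMENTS):
    # Isolation at the data layer: the tenant filter is applied here,
    # unconditionally, rather than being left to application code.
    return [d for d in docs
            if d["org"] == token.org and d["project"] == token.project]

alice = Token("alice", "acme", "contracts", "reader")
visible = query(alice)
```

Alice's token scopes her to acme/contracts, so she never sees globex's data or acme's HR project, regardless of what the calling application asks for.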
Built for production
Identity that fits your stack
Data Intelligence integrates with enterprise identity providers through OIDC and SAML, so user identity, roles, and access rules flow through from your existing auth - no parallel user system to maintain.
Sessions are JWT-based with access and refresh tokens. Refresh rotates both tokens on every call, so a stolen token has a short useful life. Users can hold multiple active sessions across devices, list them from their profile, and revoke any session individually - useful for “I forgot to log out on my laptop” or for security teams cutting off compromised devices.
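The rotation property can be modeled with a toy in-memory store (this is a sketch of the mechanism, not the SDK's implementation): every refresh invalidates the presented refresh token and issues a fresh pair, so replaying a stolen refresh token fails as soon as the legitimate client rotates.

```python
import secrets

class SessionStore:
    """Toy model of rotating refresh tokens for one session store."""

    def __init__(self):
        self.active_refresh = {}  # refresh_token -> session_id

    def login(self, session_id):
        refresh = secrets.token_hex(16)
        self.active_refresh[refresh] = session_id
        return {"access": secrets.token_hex(16), "refresh": refresh}

    def refresh(self, refresh_token):
        # Pop, don't read: presenting a refresh token consumes it.
        session_id = self.active_refresh.pop(refresh_token, None)
        if session_id is None:
            raise PermissionError("refresh token revoked or already rotated")
        new_refresh = secrets.token_hex(16)
        self.active_refresh[new_refresh] = session_id
        return {"access": secrets.token_hex(16), "refresh": new_refresh}

store = SessionStore()
pair = store.login("session-1")
rotated = store.refresh(pair["refresh"])   # old refresh token is now dead
try:
    store.refresh(pair["refresh"])         # replaying the old token fails
    replay_ok = True
except PermissionError:
    replay_ok = False
```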
Programmatic ingestion
Upload files in bulk through the SDK or API across the supported document, spreadsheet, and image formats. Attach metadata, assign ownership, and let the system handle parsing, OCR, entity extraction, and graph construction in the background. Track ingestion status - pending, processing, done, failed - through the same interface, with structured failure reasons surfaced for the rows that don’t make it through.
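A batch status report might look like the sketch below. The state names mirror the lifecycle described above; the record shape and error fields are hypothetical, chosen only to show how structured failure reasons attach to individual rows rather than to the batch as a whole.

```python
# One row per uploaded file, each carrying its own status and, on
# failure, a structured reason identifying the stage that broke.
BATCH = [
    {"file": "q3-report.pdf", "status": "done", "error": None},
    {"file": "scan-0042.png", "status": "failed",
     "error": {"stage": "ocr", "reason": "unreadable page 3"}},
    {"file": "ledger.xlsx", "status": "processing", "error": None},
]

def summarize(batch):
    """Roll a batch up into status counts plus per-file failure reasons."""
    counts = {}
    for row in batch:
        counts[row["status"]] = counts.get(row["status"], 0) + 1
    failures = [(r["file"], r["error"])
                for r in batch if r["status"] == "failed"]
    return counts, failures

counts, failures = summarize(BATCH)
```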
Live connectors
Beyond batch upload, Data Intelligence connects directly to operational data sources - PostgreSQL, MySQL, Slack, and arbitrary REST endpoints - and keeps the graph current as records change. Each connector carries its own scoped configuration and credentials, runs a continuous health-check loop, and reports degradation through the same status interface as document ingestion, so the platform knows when a source is unreliable before queries start returning stale answers.
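The health-check pattern is sketched below with a hypothetical connector record and probe function; the real connectors and their configuration surface will differ. What matters is that a failed probe downgrades the source's status through the same reporting shape ingestion uses, instead of failing silently at query time.

```python
import time

def check(connector, probe):
    """Probe a source and record its health, with a timestamp."""
    try:
        probe(connector["config"])
        connector.update(status="healthy", error=None)
    except Exception as exc:
        # Degradation is recorded, not swallowed: queries can now know
        # this source is unreliable before they return stale answers.
        connector.update(status="degraded", error=str(exc))
    connector["checked_at"] = time.time()
    return connector

pg = {"kind": "postgres",
      "config": {"dsn": "postgres://..."},
      "status": "unknown"}

def failing_probe(config):
    raise ConnectionError("connection refused")

check(pg, failing_probe)
```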
MCP integrations
For SaaS content sources - Google Drive, Notion, and any MCP-compliant service - the SDK speaks Model Context Protocol directly, so files and pages from those systems flow into the same ingestion pipeline as batch uploads. The integration runs entirely server-side: there is no desktop or IDE component to install, no per-user agent to keep alive. Source-side permissions are honored at ingestion and at query time, so a user only ever sees content they were already entitled to in the underlying service.
Query with context
Applications query the knowledge base using user-scoped tokens. Results are filtered to what the requesting user is allowed to see - enforced by the platform, not your application code. Every query reaches every authorised source - files, spreadsheets, images, live connectors, and MCP-integrated content - through a single interface; the graph is the join, so the application doesn’t have to route per-source-type.
Responses stream from the SDK token-by-token, and clients can resume a stream that was interrupted mid-flight - a dropped network or a backgrounded tab doesn’t lose the answer in progress.
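Resumable streaming reduces to the client tracking the offset of the last token it received and asking the server to continue from there. The sketch below shows that shape with a plain generator; it is an illustration of the idea, not the SDK's wire protocol.

```python
ANSWER = ["The", " graph", " links", " both", " contracts", "."]

def stream(answer, offset=0):
    # Server side: yield (offset, token) pairs starting at `offset`.
    for i in range(offset, len(answer)):
        yield i, answer[i]

# Client receives part of the stream, then the connection drops.
received = []
for i, tok in stream(ANSWER):
    received.append(tok)
    if i == 2:
        break  # simulated disconnect mid-flight

# Reconnect and resume from the next offset: no token is lost
# or duplicated, and the answer in progress survives the drop.
for i, tok in stream(ANSWER, offset=len(received)):
    received.append(tok)
```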
Audit trails per tenant
Every action - logins, document uploads, queries, configuration changes - is logged and scoped to the organization that produced it. Logs can be retrieved per organization, per user, per project, or filtered to security-relevant events. Compliance reviews stay clean because the audit boundary matches the tenancy boundary.
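The retrieval surface described above can be sketched as a filter chain where the tenancy boundary is applied first and every other filter narrows within it. Field names here are illustrative, not the SDK's log schema.

```python
LOG = [
    {"org": "acme", "user": "alice", "project": "contracts",
     "event": "login", "security": True},
    {"org": "acme", "user": "bob", "project": "hr",
     "event": "document.upload", "security": False},
    {"org": "globex", "user": "carol", "project": "contracts",
     "event": "config.change", "security": True},
]

def audit(org, log=LOG, user=None, project=None, security_only=False):
    # The tenancy boundary comes first: nothing outside `org` is visible,
    # so every narrower filter operates inside one tenant's records.
    rows = [r for r in log if r["org"] == org]
    if user:
        rows = [r for r in rows if r["user"] == user]
    if project:
        rows = [r for r in rows if r["project"] == project]
    if security_only:
        rows = [r for r in rows if r["security"]]
    return rows

acme_all = audit("acme")
acme_security = audit("acme", security_only=True)
```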
Where it fits
- Financial services - real-time transaction analysis, fraud detection, and regulatory reporting where tracing every claim back to source data is non-negotiable.
- Healthcare - HIPAA-bound document processing, clinical reference retrieval, and patient record analysis where terminology precision and access control define correctness.
- Legal and compliance - contract review, regulatory tracking, and case research across document corpora where the relationships between clauses, parties, and obligations carry the meaning.
- Research and scientific work - literature synthesis, citation traversal, and hypothesis exploration where multi-hop reasoning across thousands of papers is the work itself.
What’s next
Data Intelligence ships today with the capabilities above. On the near-term roadmap:
- Audio and video ingestion - extend beyond text and OCR’d images to transcripts, speaker turns, and visual scenes, so the graph spans every modality your team produces
- Federated document stores - query across systems and clouds while keeping data where it lives
- Continued retrieval improvements - sharper targeting, faster traversal, better handling of ambiguous queries
Get started
Two paths from here:
- Quickstart - install the SDK, authenticate, ingest your first document, and run a query in under ten minutes
- API Reference - full endpoint documentation, request and response shapes, and authentication flows