Using Social Media Data for AI: RAG Pipelines with Instagram and TikTok
Large language models are powerful, but they don't know what happened on Instagram yesterday. They can't tell you which TikTok trends are gaining traction this week or what your competitors posted this morning. That's not a model limitation — it's a data problem.
Retrieval-Augmented Generation (RAG) solves this by giving LLMs access to external, up-to-date information at query time. Instead of relying solely on training data, the model retrieves relevant context from a knowledge base before generating a response.
Social media data is one of the most valuable — and underused — sources for RAG pipelines. Profiles, posts, comments, engagement metrics, and hashtags contain real-time signal about brands, trends, and audience sentiment that no training dataset captures.
This guide walks through building a RAG pipeline that ingests social media data from Instagram and TikTok via the EternalSocial API, stores it in a vector database, and uses it to power AI applications with real-time social context.
Architecture Overview
The pipeline has four stages:
- Fetch — Pull social media data from the EternalSocial API
- Chunk — Break data into meaningful segments for embedding
- Embed & Store — Convert chunks to vectors and store in a vector database
- Query — Retrieve relevant context and pass it to an LLM
EternalSocial API → Chunking → Embeddings → Vector DB → LLM Query
Each stage is independent and can be swapped. Use whatever vector store and LLM you prefer — the patterns are the same.
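To make that swappability concrete, here is a minimal sketch of the four stages as plain async functions composed by a generic driver. None of this is an SDK; the stage names and toy data are illustrative only.

```typescript
// A sketch of the four-stage pipeline as swappable async functions.
// The generics let you replace any one stage (say, a different vector DB)
// without touching the others.
type Stage<In, Out> = (input: In) => Promise<Out>;

function composePipeline<A, B, C, D, E>(
  fetchStage: Stage<A, B>,
  chunkStage: Stage<B, C>,
  embedStage: Stage<C, D>,
  storeStage: Stage<D, E>
): Stage<A, E> {
  return async (input) =>
    storeStage(await embedStage(await chunkStage(await fetchStage(input))));
}

// Toy stages wired together to show the data flow end to end.
const pipeline = composePipeline(
  async (username: string) => [`post by ${username}`],
  async (posts: string[]) => posts.map((p) => ({ text: p })),
  async (chunks: { text: string }[]) =>
    chunks.map((c) => ({ ...c, vector: [0.1, 0.2] })),
  async (records: { text: string; vector: number[] }[]) => records.length
);
```

Calling pipeline("yourbrand") runs fetch, chunk, embed, and store in order and resolves to the number of records stored.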
Stage 1: Fetching Social Media Data
Start by pulling structured data from the EternalSocial API. The key data types for RAG are:
- Profiles — Brand descriptions, bios, follower counts
- Posts — Captions, hashtags, engagement metrics, timestamps
- Comments — Audience sentiment and feedback
- Reels/Videos — Descriptions and performance metrics
Fetching Instagram Posts
const API_BASE = "https://api.eternalsocial.dev/v1";
interface Post {
id: string;
caption: string;
like_count: number;
comment_count: number;
timestamp: string;
media_type: string;
hashtags: string[];
}
async function fetchInstagramPosts(
username: string,
limit: number = 50
): Promise<Post[]> {
const response = await fetch(
`${API_BASE}/instagram/posts?username=${username}&limit=${limit}`,
{
headers: { Authorization: "Bearer YOUR_API_KEY" },
}
);
if (!response.ok) {
throw new Error(`Failed to fetch posts: ${response.status}`);
}
const data = await response.json();
return data.posts;
}
Fetching TikTok Content
async function fetchTikTokPosts(
username: string,
limit: number = 50
): Promise<Post[]> {
const response = await fetch(
`${API_BASE}/tiktok/posts?username=${username}&limit=${limit}`,
{
headers: { Authorization: "Bearer YOUR_API_KEY" },
}
);
if (!response.ok) {
throw new Error(`Failed to fetch TikTok posts: ${response.status}`);
}
const data = await response.json();
return data.posts;
}
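Comments, listed as a key data type above, follow the same fetch pattern. The endpoint path and response shape below are assumptions modeled on the post endpoints; check the EternalSocial API docs for the exact route and fields.

```typescript
// API_BASE is declared earlier; repeated here so this snippet stands alone.
const API_BASE = "https://api.eternalsocial.dev/v1";

// Assumed response shape; verify against the API docs.
interface Comment {
  id: string;
  text: string;
  like_count: number;
  timestamp: string;
}

// URL construction is pulled into a pure helper so it can be tested alone.
function commentsUrl(
  platform: "instagram" | "tiktok",
  postId: string,
  limit = 100
): string {
  const params = new URLSearchParams({ post_id: postId, limit: String(limit) });
  return `${API_BASE}/${platform}/comments?${params}`;
}

// Hypothetical endpoint: GET /:platform/comments?post_id=...&limit=...
async function fetchComments(
  platform: "instagram" | "tiktok",
  postId: string
): Promise<Comment[]> {
  const response = await fetch(commentsUrl(platform, postId), {
    headers: { Authorization: "Bearer YOUR_API_KEY" },
  });
  if (!response.ok) {
    throw new Error(`Failed to fetch comments: ${response.status}`);
  }
  const data = await response.json();
  return data.comments;
}
```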
Fetching Profiles for Context
Profile data provides important context for the LLM. Include it in your knowledge base so the model understands who these accounts are:
async function fetchProfile(
platform: "instagram" | "tiktok",
username: string
) {
const response = await fetch(
`${API_BASE}/${platform}/profile?username=${username}`,
{
headers: { Authorization: "Bearer YOUR_API_KEY" },
}
);
return response.json();
}
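When you track many accounts, rate limits become a practical concern. A generic retry wrapper with exponential backoff, sketched here as a suggestion rather than an official client feature, keeps transient 429s and 5xx responses from failing a whole ingestion run:

```typescript
// Generic retry with exponential backoff for transient failures.
// A sketch: tune retries and baseDelayMs to your actual rate limits.
async function fetchWithRetry(
  url: string,
  init: RequestInit,
  retries = 3,
  baseDelayMs = 1000
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url, init);
    // Retry only on rate limiting (429) and server errors (5xx);
    // anything else (including 4xx client errors) is returned as-is.
    const retryable = response.status === 429 || response.status >= 500;
    if (!retryable || attempt >= retries) {
      return response;
    }
    // Backoff: baseDelayMs, then 2x, 4x, ...
    await new Promise((resolve) =>
      setTimeout(resolve, baseDelayMs * 2 ** attempt)
    );
  }
}
```

Swap fetchWithRetry in wherever the fetch calls above are made.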
Stage 2: Chunking Social Media Data
RAG chunking for social media data is different from document chunking. You're not splitting long text — you're structuring discrete data points into meaningful, self-contained chunks that an embedding model can represent well.
The Chunk Format
Each chunk should be a natural-language description of a single data point, enriched with metadata:
interface Chunk {
id: string;
text: string;
metadata: {
source: "instagram" | "tiktok";
type: "post" | "profile" | "comment";
username: string;
timestamp: string;
engagement_rate?: number;
};
}
function chunkPost(post: Post, username: string, platform: string): Chunk {
// Raw engagement (likes + comments), used as a proxy; a true engagement
// rate would divide by follower count, which isn't part of the post payload.
const engagementRate = post.like_count + post.comment_count;
const hashtags =
post.hashtags.length > 0 ? `Hashtags: ${post.hashtags.join(", ")}.` : "";
const text = [
`${platform} post by @${username} on ${post.timestamp}.`,
`Caption: "${post.caption}"`,
`${post.like_count} likes, ${post.comment_count} comments.`,
`Media type: ${post.media_type}.`,
hashtags,
]
.filter(Boolean)
.join(" ");
return {
id: `${platform}-post-${post.id}`,
text,
metadata: {
source: platform as "instagram" | "tiktok",
type: "post",
username,
timestamp: post.timestamp,
engagement_rate: engagementRate,
},
};
}
Chunking Profiles
function chunkProfile(profile: any, platform: string): Chunk {
const text = [
`${platform} profile: @${profile.username}.`,
`Name: ${profile.full_name}.`,
`Bio: "${profile.biography}"`,
`${profile.follower_count} followers, ${profile.following_count} following.`,
`${profile.post_count} posts.`,
profile.is_verified ? "Verified account." : "",
profile.category ? `Category: ${profile.category}.` : "",
]
.filter(Boolean)
.join(" ");
return {
id: `${platform}-profile-${profile.username}`,
text,
metadata: {
source: platform as "instagram" | "tiktok",
type: "profile",
username: profile.username,
timestamp: new Date().toISOString(),
},
};
}
Why This Chunk Format Works
Embedding models work best with natural-language text, not raw JSON. By converting structured data into readable sentences, you get better semantic similarity matching at query time. The metadata fields enable filtered searches — "show me only TikTok posts from the last week" — without polluting the embedding space.
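The Chunk metadata already allows type: "comment", though no comment chunker is shown above. Here is one possible version, assuming a comment object with id, text, like_count, and timestamp fields (adjust to whatever the API actually returns):

```typescript
// Minimal copy of the Chunk interface from above so this snippet stands alone.
interface Chunk {
  id: string;
  text: string;
  metadata: {
    source: "instagram" | "tiktok";
    type: "post" | "profile" | "comment";
    username: string;
    timestamp: string;
  };
}

// Assumed comment shape: adjust to the fields the API actually returns.
interface CommentData {
  id: string;
  text: string;
  like_count: number;
  timestamp: string;
}

function chunkComment(
  comment: CommentData,
  postId: string,
  username: string,
  platform: "instagram" | "tiktok"
): Chunk {
  // Natural-language rendering, mirroring chunkPost and chunkProfile above.
  const text = [
    `${platform} comment on a post by @${username} (post ${postId}), ${comment.timestamp}.`,
    `Comment: "${comment.text}"`,
    `${comment.like_count} likes.`,
  ].join(" ");
  return {
    id: `${platform}-comment-${comment.id}`,
    text,
    metadata: {
      source: platform,
      type: "comment",
      username,
      timestamp: comment.timestamp,
    },
  };
}
```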
Stage 3: Embedding and Storage
Convert chunks to vector embeddings and store them in a vector database. This example uses a generic pattern compatible with Pinecone, Weaviate, Qdrant, Chroma, or pgvector.
Generating Embeddings
import { OpenAI } from "openai";
const openai = new OpenAI();
async function embedChunks(
chunks: Chunk[]
): Promise<{ chunk: Chunk; embedding: number[] }[]> {
const texts = chunks.map((c) => c.text);
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
return chunks.map((chunk, i) => ({
chunk,
embedding: response.data[i].embedding,
}));
}
Upserting to a Vector Store
async function upsertToVectorStore(
embeddedChunks: { chunk: Chunk; embedding: number[] }[]
) {
// Generic upsert — adapt to your vector DB client
const records = embeddedChunks.map(({ chunk, embedding }) => ({
id: chunk.id,
values: embedding,
metadata: {
...chunk.metadata,
text: chunk.text,
},
}));
// Example: Pinecone
// await index.upsert(records);
// Example: Chroma
// await collection.upsert({ ids, embeddings, metadatas, documents });
// Example: pgvector with Drizzle
// await db.insert(embeddings).values(records).onConflictDoUpdate(...)
return records.length;
}
Incremental Updates
Social media data changes constantly. New posts appear daily. Don't re-embed everything on each run — track what you've already processed:
async function ingestNewPosts(username: string, platform: "instagram" | "tiktok") {
// Fetch from whichever platform this account lives on
const posts = platform === "instagram" ? await fetchInstagramPosts(username, 50) : await fetchTikTokPosts(username, 50);
// Filter to only new posts since last ingestion
const lastIngested = await getLastIngestedTimestamp(username, platform);
const newPosts = posts.filter((p) => new Date(p.timestamp) > lastIngested);
if (newPosts.length === 0) {
return 0;
}
const chunks = newPosts.map((p) => chunkPost(p, username, platform));
const embedded = await embedChunks(chunks);
const count = await upsertToVectorStore(embedded);
// ISO timestamps sort lexicographically, so the last one after sorting is the newest
const newestTimestamp = newPosts.map((p) => p.timestamp).sort().pop()!;
await updateLastIngestedTimestamp(username, platform, newestTimestamp);
return count;
}
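ingestNewPosts leans on two checkpoint helpers that aren't defined above. A minimal in-memory version looks like this; in production you'd persist the checkpoints to a database so they survive restarts:

```typescript
// In-memory checkpoint store keyed by platform + username.
// A sketch: in production, persist these so checkpoints survive restarts.
const checkpoints = new Map<string, Date>();

const checkpointKey = (username: string, platform: string) =>
  `${platform}:${username}`;

async function getLastIngestedTimestamp(
  username: string,
  platform: string
): Promise<Date> {
  // Default to the epoch so the first run ingests everything.
  return checkpoints.get(checkpointKey(username, platform)) ?? new Date(0);
}

async function updateLastIngestedTimestamp(
  username: string,
  platform: string,
  timestamp: string = new Date().toISOString()
): Promise<void> {
  checkpoints.set(checkpointKey(username, platform), new Date(timestamp));
}
```

The optional timestamp parameter lets callers record the newest post they actually saw rather than the wall-clock time of the run.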
Stage 4: Querying with RAG
Now for the valuable part: using the stored social media context to answer questions.
Basic RAG Query
async function queryWithSocialContext(userQuestion: string): Promise<string> {
// 1. Embed the question
const questionEmbedding = await openai.embeddings.create({
model: "text-embedding-3-small",
input: userQuestion,
});
// 2. Search the vector store for relevant context
// (vectorStore is your vector DB client from Stage 3: Pinecone, Qdrant, etc.)
const results = await vectorStore.query({
vector: questionEmbedding.data[0].embedding,
topK: 10,
includeMetadata: true,
});
// 3. Build context string from results
const context = results.matches
.map((match: any) => match.metadata.text)
.join("\n\n");
// 4. Generate answer with context
const completion = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `You are a social media analyst. Answer questions using the following social media data as context. Cite specific posts and metrics when relevant. If the context doesn't contain enough information to answer, say so.
Context:
${context}`,
},
{ role: "user", content: userQuestion },
],
});
return completion.choices[0].message.content ?? "";
}
Filtered Queries
Use metadata filters to scope queries to specific platforms, accounts, or time ranges:
// Only search TikTok content
const tiktokResults = await vectorStore.query({
vector: embedding,
topK: 10,
filter: { source: "tiktok" },
});
// Only search a specific competitor
const competitorResults = await vectorStore.query({
vector: embedding,
topK: 10,
filter: { username: "competitor_brand" },
});
// Only recent content
const recentResults = await vectorStore.query({
vector: embedding,
topK: 10,
filter: {
timestamp: { $gte: "2026-01-01T00:00:00Z" },
},
});
Example Queries This Pipeline Can Answer
Once your pipeline is running with data from a few accounts, you can ask questions like:
- "What content themes are driving the most engagement for @competitor this month?" — The LLM retrieves their top-performing posts and identifies patterns.
- "How does our TikTok strategy compare to @competitor's?" — Pulls data from both accounts and generates a comparative analysis.
- "What hashtags are trending in our niche this week?" — Aggregates hashtag usage across multiple tracked accounts.
- "Write a social media report for @ourbrand's Instagram performance." — Retrieves all recent post data and generates a structured report.
- "What posting schedule works best for accounts in our category?" — Analyzes posting times correlated with engagement across tracked accounts.
These aren't hypothetical — with real data in the vector store, the LLM generates specific, data-backed answers.
Running the Full Pipeline
Here's the orchestration that ties everything together:
async function runIngestionPipeline(
accounts: { username: string; platform: "instagram" | "tiktok" }[]
) {
console.log(`Ingesting data for ${accounts.length} accounts...`);
for (const account of accounts) {
const { username, platform } = account;
// Ingest profile
const profile = await fetchProfile(platform, username);
const profileChunk = chunkProfile(profile, platform);
const embeddedProfile = await embedChunks([profileChunk]);
await upsertToVectorStore(embeddedProfile);
// Ingest new posts
const newPostCount = await ingestNewPosts(username, platform);
console.log(
` @${username} (${platform}): ${newPostCount} new posts ingested`
);
}
console.log("Ingestion complete.");
}
// Run daily
const trackedAccounts = [
{ username: "yourbrand", platform: "instagram" as const },
{ username: "competitor1", platform: "instagram" as const },
{ username: "competitor2", platform: "tiktok" as const },
];
await runIngestionPipeline(trackedAccounts);
Scaling Considerations
A few things to keep in mind as your pipeline grows:
- Embedding costs. text-embedding-3-small is cheap (~$0.02 per million tokens), but if you're tracking hundreds of accounts, batch your embedding calls and monitor usage.
- Vector store size. Each post creates one vector. 100 accounts × 50 posts each = 5,000 vectors — well within free-tier limits for most vector databases.
- Freshness. Run ingestion daily for most use cases. For real-time monitoring, you can run it hourly, but the API costs and vector store writes add up.
- Deduplication. Use deterministic chunk IDs (like instagram-post-{id}) so re-ingestion upserts instead of creating duplicates.
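The batching advice above can be sketched as a small helper: split chunks into fixed-size batches and send one embeddings request per batch. The batch size of 100 is an arbitrary illustration; the real ceiling depends on token counts and the provider's limits. The client is passed in through a minimal interface so the function isn't tied to a specific SDK instance:

```typescript
// Minimal interface matching the embeddings client used in Stage 3,
// so this helper isn't tied to a specific SDK instance.
interface EmbeddingsClient {
  embeddings: {
    create(args: {
      model: string;
      input: string[];
    }): Promise<{ data: { embedding: number[] }[] }>;
  };
}

// Split an array into fixed-size batches; pure, so it's easy to test.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// One embeddings request per batch instead of one per chunk.
async function embedChunksBatched(
  client: EmbeddingsClient,
  chunks: { text: string }[],
  batchSize = 100
): Promise<number[][]> {
  const embeddings: number[][] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const response = await client.embeddings.create({
      model: "text-embedding-3-small",
      input: batch.map((c) => c.text),
    });
    embeddings.push(...response.data.map((d) => d.embedding));
  }
  return embeddings;
}
```

Pass the OpenAI client from Stage 3 as the first argument; it already satisfies this interface.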
Beyond RAG: AI Agents with Social Data
RAG is the foundation, but the same data pipeline supports more advanced patterns:
- AI agents that autonomously monitor social media and trigger alerts based on criteria you define
- Automated reporting that generates weekly competitive intelligence briefings
- Content suggestion engines that analyze what's performing well and recommend topics for your own strategy
- Sentiment analysis pipelines that process comments at scale and surface trends
EternalSocial provides the data layer. What you build on top of it is up to you.
Get Started
Sign up for an EternalSocial API key to start feeding social media data into your AI applications. The API documentation covers all available endpoints, response formats, and rate limits. If you're building RAG pipelines or AI agents, the structured data from EternalSocial is designed to be AI-ready — clean JSON, consistent schemas, and reliable delivery.