
Using Social Media Data for AI: RAG Pipelines with Instagram and TikTok

By EternalSocial Team · 4 min read

Large language models are powerful, but they don't know what happened on Instagram yesterday. They can't tell you which TikTok trends are gaining traction this week or what your competitors posted this morning. That's not a model limitation — it's a data problem.

Retrieval-Augmented Generation (RAG) solves this by giving LLMs access to external, up-to-date information at query time. Instead of relying solely on training data, the model retrieves relevant context from a knowledge base before generating a response.

Social media data is one of the most valuable — and underused — sources for RAG pipelines. Profiles, posts, comments, engagement metrics, and hashtags contain real-time signal about brands, trends, and audience sentiment that no training dataset captures.

This guide walks through building a RAG pipeline that ingests social media data from Instagram and TikTok via the EternalSocial API, stores it in a vector database, and uses it to power AI applications with real-time social context.

Architecture Overview

The pipeline has four stages:

  1. Fetch — Pull social media data from the EternalSocial API
  2. Chunk — Break data into meaningful segments for embedding
  3. Embed & Store — Convert chunks to vectors and store in a vector database
  4. Query — Retrieve relevant context and pass it to an LLM

EternalSocial API → Chunking → Embeddings → Vector DB → LLM Query

Each stage is independent and can be swapped. Use whatever vector store and LLM you prefer — the patterns are the same.
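Because each stage is swappable, it helps to pin down the contracts between stages before writing any platform-specific code. The interfaces below are illustrative, not part of any SDK; they just show how the four stages compose into one ingestion pass:

```typescript
// Illustrative stage contracts -- names are ours, not from any library.
interface Fetcher<T> {
  fetch(username: string): Promise<T[]>;
}

interface Chunker<T> {
  chunk(item: T): { id: string; text: string };
}

interface Embedder {
  embed(texts: string[]): Promise<number[][]>;
}

interface VectorStore {
  upsert(records: { id: string; values: number[]; text: string }[]): Promise<void>;
}

// Compose the stages into a single ingestion pass.
async function ingest<T>(
  fetcher: Fetcher<T>,
  chunker: Chunker<T>,
  embedder: Embedder,
  store: VectorStore,
  username: string
): Promise<number> {
  const items = await fetcher.fetch(username);
  const chunks = items.map((item) => chunker.chunk(item));
  const vectors = await embedder.embed(chunks.map((c) => c.text));
  await store.upsert(
    chunks.map((c, i) => ({ id: c.id, values: vectors[i], text: c.text }))
  );
  return chunks.length;
}
```

Any concrete fetcher, embedding model, or vector DB client that satisfies these shapes can be dropped in without touching the others.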

Stage 1: Fetching Social Media Data

Start by pulling structured data from the EternalSocial API. The key data types for RAG are:

  • Profiles — Brand descriptions, bios, follower counts
  • Posts — Captions, hashtags, engagement metrics, timestamps
  • Comments — Audience sentiment and feedback
  • Reels/Videos — Descriptions and performance metrics

Fetching Instagram Posts

const API_BASE = "https://api.eternalsocial.dev/v1";

interface Post {
  id: string;
  caption: string;
  like_count: number;
  comment_count: number;
  timestamp: string;
  media_type: string;
  hashtags: string[];
}

async function fetchInstagramPosts(
  username: string,
  limit: number = 50
): Promise<Post[]> {
  const response = await fetch(
    `${API_BASE}/instagram/posts?username=${username}&limit=${limit}`,
    {
      headers: { Authorization: "Bearer YOUR_API_KEY" },
    }
  );

  if (!response.ok) {
    throw new Error(`Failed to fetch posts: ${response.status}`);
  }

  const data = await response.json();
  return data.posts;
}

Fetching TikTok Content

async function fetchTikTokPosts(
  username: string,
  limit: number = 50
): Promise<Post[]> {
  const response = await fetch(
    `${API_BASE}/tiktok/posts?username=${username}&limit=${limit}`,
    {
      headers: { Authorization: "Bearer YOUR_API_KEY" },
    }
  );

  if (!response.ok) {
    throw new Error(`Failed to fetch TikTok posts: ${response.status}`);
  }

  const data = await response.json();
  return data.posts;
}

Fetching Profiles for Context

Profile data provides important context for the LLM. Include it in your knowledge base so the model understands who these accounts are:

async function fetchProfile(
  platform: "instagram" | "tiktok",
  username: string
) {
  const response = await fetch(
    `${API_BASE}/${platform}/profile?username=${username}`,
    {
      headers: { Authorization: "Bearer YOUR_API_KEY" },
    }
  );

  if (!response.ok) {
    throw new Error(`Failed to fetch profile: ${response.status}`);
  }

  return response.json();
}

Stage 2: Chunking Social Media Data

RAG chunking for social media data is different from document chunking. You're not splitting long text — you're structuring discrete data points into meaningful, self-contained chunks that an embedding model can represent well.

The Chunk Format

Each chunk should be a natural-language description of a single data point, enriched with metadata:

interface Chunk {
  id: string;
  text: string;
  metadata: {
    source: "instagram" | "tiktok";
    type: "post" | "profile" | "comment";
    username: string;
    timestamp: string;
    engagement_rate?: number;
  };
}

function chunkPost(post: Post, username: string, platform: string): Chunk {
  // Raw engagement count (likes + comments). A true engagement *rate*
  // would divide by follower count, which lives on the profile object.
  const engagementRate = post.like_count + post.comment_count;
  const hashtags =
    post.hashtags.length > 0 ? `Hashtags: ${post.hashtags.join(", ")}.` : "";

  const text = [
    `${platform} post by @${username} on ${post.timestamp}.`,
    `Caption: "${post.caption}"`,
    `${post.like_count} likes, ${post.comment_count} comments.`,
    `Media type: ${post.media_type}.`,
    hashtags,
  ]
    .filter(Boolean)
    .join(" ");

  return {
    id: `${platform}-post-${post.id}`,
    text,
    metadata: {
      source: platform as "instagram" | "tiktok",
      type: "post",
      username,
      timestamp: post.timestamp,
      engagement_rate: engagementRate,
    },
  };
}

Chunking Profiles

function chunkProfile(profile: any, platform: string): Chunk {
  const text = [
    `${platform} profile: @${profile.username}.`,
    `Name: ${profile.full_name}.`,
    `Bio: "${profile.biography}"`,
    `${profile.follower_count} followers, ${profile.following_count} following.`,
    `${profile.post_count} posts.`,
    profile.is_verified ? "Verified account." : "",
    profile.category ? `Category: ${profile.category}.` : "",
  ]
    .filter(Boolean)
    .join(" ");

  return {
    id: `${platform}-profile-${profile.username}`,
    text,
    metadata: {
      source: platform as "instagram" | "tiktok",
      type: "profile",
      username: profile.username,
      timestamp: new Date().toISOString(),
    },
  };
}

Why This Chunk Format Works

Embedding models work best with natural-language text, not raw JSON. By converting structured data into readable sentences, you get better semantic similarity matching at query time. The metadata fields enable filtered searches — "show me only TikTok posts from the last week" — without polluting the embedding space.
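To make the difference concrete, here is a side-by-side of the same (made-up) post as raw JSON versus the sentence form the embedding model actually sees:

```typescript
// Made-up example record.
const raw = {
  id: "18051234567",
  caption: "Launch day! Our new espresso blend is live.",
  like_count: 4210,
  comment_count: 312,
};

// What NOT to embed: raw JSON spends tokens on keys and punctuation,
// and reads poorly to an embedding model.
const jsonChunk = JSON.stringify(raw);

// What to embed: a self-contained natural-language sentence.
const sentenceChunk =
  `Instagram post by @espresso_lab. Caption: "${raw.caption}" ` +
  `${raw.like_count} likes, ${raw.comment_count} comments.`;
```

The sentence form lands close to natural-language queries like "what did the espresso brand post at launch?" in embedding space; the JSON form generally does not.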

Stage 3: Embedding and Storage

Convert chunks to vector embeddings and store them in a vector database. This example uses a generic pattern compatible with Pinecone, Weaviate, Qdrant, Chroma, or pgvector.

Generating Embeddings

import { OpenAI } from "openai";

const openai = new OpenAI();

async function embedChunks(
  chunks: Chunk[]
): Promise<{ chunk: Chunk; embedding: number[] }[]> {
  const texts = chunks.map((c) => c.text);

  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });

  return chunks.map((chunk, i) => ({
    chunk,
    embedding: response.data[i].embedding,
  }));
}

Upserting to a Vector Store

async function upsertToVectorStore(
  embeddedChunks: { chunk: Chunk; embedding: number[] }[]
) {
  // Generic upsert — adapt to your vector DB client
  const records = embeddedChunks.map(({ chunk, embedding }) => ({
    id: chunk.id,
    values: embedding,
    metadata: {
      ...chunk.metadata,
      text: chunk.text,
    },
  }));

  // Example: Pinecone
  // await index.upsert(records);

  // Example: Chroma
  // await collection.upsert({ ids, embeddings, metadatas, documents });

  // Example: pgvector with Drizzle
  // await db.insert(embeddings).values(records).onConflictDoUpdate(...)

  return records.length;
}

Incremental Updates

Social media data changes constantly. New posts appear daily. Don't re-embed everything on each run — track what you've already processed:

async function ingestNewPosts(username: string, platform: string) {
  const posts =
    platform === "tiktok"
      ? await fetchTikTokPosts(username, 50)
      : await fetchInstagramPosts(username, 50);

  // Filter to only new posts since last ingestion
  const lastIngested = await getLastIngestedTimestamp(username, platform);
  const newPosts = posts.filter((p) => new Date(p.timestamp) > lastIngested);

  if (newPosts.length === 0) {
    return 0;
  }

  const chunks = newPosts.map((p) => chunkPost(p, username, platform));
  const embedded = await embedChunks(chunks);
  const count = await upsertToVectorStore(embedded);

  await updateLastIngestedTimestamp(username, platform);
  return count;
}
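The `getLastIngestedTimestamp` and `updateLastIngestedTimestamp` helpers above are placeholders. In production you'd back them with a database row per (username, platform) pair; this minimal sketch uses an in-memory Map, which resets when the process restarts:

```typescript
// Minimal stand-ins for the ingestion-cursor helpers. In production,
// persist these in a real database; an in-memory Map loses state on restart.
const ingestionCursors = new Map<string, string>();

async function getLastIngestedTimestamp(
  username: string,
  platform: string
): Promise<Date> {
  const iso = ingestionCursors.get(`${platform}:${username}`);
  // If we've never ingested this account, treat everything as new.
  return iso ? new Date(iso) : new Date(0);
}

async function updateLastIngestedTimestamp(username: string, platform: string) {
  ingestionCursors.set(`${platform}:${username}`, new Date().toISOString());
}
```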

Stage 4: Querying with RAG

Now the valuable part — using the stored social media context to answer questions.

Basic RAG Query

async function queryWithSocialContext(userQuestion: string): Promise<string> {
  // 1. Embed the question
  const questionEmbedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: userQuestion,
  });

  // 2. Search the vector store for relevant context
  // (`vectorStore` stands in for your vector DB client from Stage 3)
  const results = await vectorStore.query({
    vector: questionEmbedding.data[0].embedding,
    topK: 10,
    includeMetadata: true,
  });

  // 3. Build context string from results
  const context = results.matches
    .map((match: any) => match.metadata.text)
    .join("\n\n");

  // 4. Generate answer with context
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `You are a social media analyst. Answer questions using the following social media data as context. Cite specific posts and metrics when relevant. If the context doesn't contain enough information to answer, say so.

Context:
${context}`,
      },
      { role: "user", content: userQuestion },
    ],
  });

  return completion.choices[0].message.content ?? "";
}

Filtered Queries

Use metadata filters to scope queries to specific platforms, accounts, or time ranges:

// Only search TikTok content
const tiktokResults = await vectorStore.query({
  vector: embedding,
  topK: 10,
  filter: { source: "tiktok" },
});

// Only search a specific competitor
const competitorResults = await vectorStore.query({
  vector: embedding,
  topK: 10,
  filter: { username: "competitor_brand" },
});

// Only recent content
const recentResults = await vectorStore.query({
  vector: embedding,
  topK: 10,
  filter: {
    timestamp: { $gte: "2026-01-01T00:00:00Z" },
  },
});
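These filters compose, so a small builder keeps query code tidy. This sketch uses the Pinecone-style `$gte` operator shown above; other stores spell range filters differently:

```typescript
// Build a metadata filter scoping results to one platform plus a recency
// window. Uses Pinecone-style operators ($gte); adapt for your store.
function buildRecencyFilter(
  source: "instagram" | "tiktok",
  days: number,
  now: Date = new Date()
) {
  const cutoff = new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
  return {
    source,
    timestamp: { $gte: cutoff.toISOString() },
  };
}

// e.g. last week of TikTok content:
// await vectorStore.query({ vector, topK: 10, filter: buildRecencyFilter("tiktok", 7) });
```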

Example Queries This Pipeline Can Answer

Once your pipeline is running with data from a few accounts, you can ask questions like:

  • "What content themes are driving the most engagement for @competitor this month?" — The LLM retrieves their top-performing posts and identifies patterns.
  • "How does our TikTok strategy compare to @competitor's?" — Pulls data from both accounts and generates a comparative analysis.
  • "What hashtags are trending in our niche this week?" — Aggregates hashtag usage across multiple tracked accounts.
  • "Write a social media report for @ourbrand's Instagram performance." — Retrieves all recent post data and generates a structured report.
  • "What posting schedule works best for accounts in our category?" — Analyzes posting times correlated with engagement across tracked accounts.

These aren't hypothetical — with real data in the vector store, the LLM generates specific, data-backed answers.

Running the Full Pipeline

Here's the orchestration that ties everything together:

async function runIngestionPipeline(
  accounts: { username: string; platform: "instagram" | "tiktok" }[]
) {
  console.log(`Ingesting data for ${accounts.length} accounts...`);

  for (const account of accounts) {
    const { username, platform } = account;

    // Ingest profile
    const profile = await fetchProfile(platform, username);
    const profileChunk = chunkProfile(profile, platform);
    const embeddedProfile = await embedChunks([profileChunk]);
    await upsertToVectorStore(embeddedProfile);

    // Ingest new posts
    const newPostCount = await ingestNewPosts(username, platform);
    console.log(
      `  @${username} (${platform}): ${newPostCount} new posts ingested`
    );
  }

  console.log("Ingestion complete.");
}

// Run daily
const trackedAccounts = [
  { username: "yourbrand", platform: "instagram" as const },
  { username: "competitor1", platform: "instagram" as const },
  { username: "competitor2", platform: "tiktok" as const },
];

await runIngestionPipeline(trackedAccounts);

Scaling Considerations

A few things to keep in mind as your pipeline grows:

  • Embedding costs. text-embedding-3-small is cheap (~$0.02 per million tokens), but if you're tracking hundreds of accounts, batch your embedding calls and monitor usage.
  • Vector store size. Each post creates one vector. 100 accounts × 50 posts each = 5,000 vectors — well within free-tier limits for most vector databases.
  • Freshness. Run ingestion daily for most use cases. For real-time monitoring, you can run it hourly, but the API costs and vector store writes add up.
  • Deduplication. Use deterministic chunk IDs (like instagram-post-{id}) so re-ingestion upserts instead of creating duplicates.
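On the embedding-cost point: the OpenAI embeddings endpoint accepts batched input, but requests have size limits, so split large chunk sets into fixed-size batches. A sketch (the batch size of 100 is an arbitrary, conservative choice):

```typescript
// Split an array into fixed-size batches so each embeddings request stays
// well under per-request input limits. 100 is a conservative, arbitrary
// default -- tune it for your chunk sizes.
function toBatches<T>(items: T[], batchSize = 100): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Usage: embed and upsert batch by batch instead of one giant request.
// for (const batch of toBatches(chunks)) {
//   const embedded = await embedChunks(batch);
//   await upsertToVectorStore(embedded);
// }
```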

Beyond RAG: AI Agents with Social Data

RAG is the foundation, but the same data pipeline supports more advanced patterns:

  • AI agents that autonomously monitor social media and trigger alerts based on criteria you define
  • Automated reporting that generates weekly competitive intelligence briefings
  • Content suggestion engines that analyze what's performing well and recommend topics for your own strategy
  • Sentiment analysis pipelines that process comments at scale and surface trends
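As a taste of the agent pattern, alerting can start as a plain threshold check over freshly ingested posts, with no LLM involved until you want the alert explained. A hypothetical sketch (shapes and thresholds are illustrative):

```typescript
// Hypothetical alert check: flag posts whose engagement is far above the
// average of the account's other recent posts. All names here are ours.
interface IngestedPost {
  id: string;
  username: string;
  engagement: number; // likes + comments
}

function detectEngagementSpikes(
  posts: IngestedPost[],
  multiplier = 3
): IngestedPost[] {
  if (posts.length < 2) return [];
  return posts.filter((p) => {
    // Compare each post against the average of the *other* posts, so a
    // single outlier doesn't inflate its own baseline.
    const others = posts.filter((o) => o.id !== p.id);
    const avg = others.reduce((s, o) => s + o.engagement, 0) / others.length;
    return p.engagement > avg * multiplier;
  });
}
```

Feed any flagged posts back into the RAG query step to generate a written explanation of why the post is outperforming.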

EternalSocial provides the data layer. What you build on top of it is up to you.

Get Started

Sign up for an EternalSocial API key to start feeding social media data into your AI applications. The API documentation covers all available endpoints, response formats, and rate limits. If you're building RAG pipelines or AI agents, the structured data from EternalSocial is designed to be AI-ready — clean JSON, consistent schemas, and reliable delivery.