Chunks after batch_size treated as non-existent #32612

@shkarupa-alex

Checked other resources

  • This is a bug, not a usage question. For questions, please use the LangChain Forum (https://forum.langchain.com/).
  • I added a clear and descriptive title that summarizes this issue.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
  • I read what a minimal reproducible example is (https://stackoverflow.com/help/minimal-reproducible-example).
  • I posted a self-contained, minimal, reproducible example. A maintainer can copy it and run it AS IS.

Example Code

Install deps:

pip install -U aiosqlite greenlet langchain langchain_community langchain_core langchain_text_splitters langchain_qdrant qdrant_client

Run code:

from langchain import indexes
from langchain_community.embeddings import FakeEmbeddings
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams


client = QdrantClient(":memory:")
client.create_collection(
    collection_name="some_collection",
    vectors_config=VectorParams(size=256, distance=Distance.COSINE),
)

store = QdrantVectorStore(
    client=client,
    collection_name="some_collection",
    embedding=FakeEmbeddings(size=256),
)

manager = indexes.SQLRecordManager("index", db_url="sqlite+aiosqlite:///db.sqlite", async_mode=True)
await manager.acreate_schema()

splitter = RecursiveCharacterTextSplitter(chunk_size=10, chunk_overlap=0)


document = Document(
    page_content="\n".join(map(str, range(100))),
    metadata={
        "source": "some_url",
        "title": "some_title",
    },
)

chunks = await splitter.atransform_documents([document])


for _ in range(5):
    stats = await indexes.aindex(
        chunks,
        manager,
        store,
        batch_size=8,  # <--- smaller than the number of chunks (32)
        cleanup="incremental",
        source_id_key="source",
    )
    print(stats)

Error Message and Stack Trace (if applicable)

The previous example will print:

{'num_added': 24, 'num_updated': 0, 'num_skipped': 8, 'num_deleted': 24}
{'num_added': 24, 'num_updated': 0, 'num_skipped': 8, 'num_deleted': 24}
{'num_added': 24, 'num_updated': 0, 'num_skipped': 8, 'num_deleted': 24}
{'num_added': 24, 'num_updated': 0, 'num_skipped': 8, 'num_deleted': 24}
{'num_added': 24, 'num_updated': 0, 'num_skipped': 8, 'num_deleted': 24}
{'num_added': 24, 'num_updated': 0, 'num_skipped': 8, 'num_deleted': 24}

Description

Look at these numbers. The first 8 chunks (exactly batch_size) are always skipped, as they should be.
But every chunk after the first batch is treated as "not found": on each run it is deleted, embedded again (CPU/GPU heavy) and inserted back.
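
One possible reading of these numbers, sketched as a toy model in plain Python. This is an assumption about the mechanism, not a copy of LangChain's code: it supposes that incremental cleanup runs after every batch and deletes all records of the batch's source that have not yet been touched in the current run. The function name `simulate_incremental_index`, the `records` dict and the integer timestamps are all hypothetical.

```python
from itertools import islice


def simulate_incremental_index(keys, source, records, batch_size, now):
    """Toy model of one indexing run with per-batch incremental cleanup.

    `records` maps chunk key -> (source, updated_at) and stands in for the
    record manager plus vector store. `now` is the start time of this run.
    """
    stats = {"num_added": 0, "num_updated": 0, "num_skipped": 0, "num_deleted": 0}
    it = iter(keys)
    while batch := list(islice(it, batch_size)):
        for key in batch:
            if key in records:
                stats["num_skipped"] += 1  # already indexed, only refresh timestamp
            else:
                stats["num_added"] += 1  # would be embedded and inserted
            records[key] = (source, now)
        # Assumed cleanup step: drop every record of this source that was
        # last touched before the current run started.
        stale = [k for k, (src, ts) in records.items() if src == source and ts < now]
        for k in stale:
            del records[k]
        stats["num_deleted"] += len(stale)
    return stats


keys = [f"chunk-{i}" for i in range(32)]  # 32 chunks, all from one source

records = {}
first_run = simulate_incremental_index(keys, "some_url", records, batch_size=8, now=1)
second_run = simulate_incremental_index(keys, "some_url", records, batch_size=8, now=2)

records_big = {}
simulate_incremental_index(keys, "some_url", records_big, batch_size=32, now=1)
big_batch_rerun = simulate_incremental_index(keys, "some_url", records_big, batch_size=32, now=2)

print(second_run)
print(big_batch_rerun)
```

Under this assumption, a re-run with batch_size=8 gives the same counts as reported above (24 added, 8 skipped, 24 deleted), while in the model a batch that covers all 32 chunks of the source skips everything. Whether a larger batch_size avoids the problem in the real `aindex` was not checked here.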

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6041
Python Version: 3.12.10 (main, Apr 9 2025, 03:49:38) [Clang 20.1.0 ]

Package Information

langchain_core: 0.3.74
langchain: 0.3.27
langchain_community: 0.3.27
langsmith: 0.4.14
langchain_qdrant: 0.2.0
langchain_text_splitters: 0.3.9

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
fastembed: Installed. No version info available.
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
httpx<1,>=0.23.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-azure-ai;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.66: Installed. No version info available.
langchain-core<1.0.0,>=0.3.72: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-perplexity;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.9: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.26: Installed. No version info available.
langsmith-pyo3>=0.1.0rc2;: Installed. No version info available.
langsmith>=0.1.125: Installed. No version info available.
langsmith>=0.1.17: Installed. No version info available.
langsmith>=0.3.45: Installed. No version info available.
numpy>=1.26.2;: Installed. No version info available.
numpy>=2.1.0;: Installed. No version info available.
openai-agents>=0.0.3;: Installed. No version info available.
opentelemetry-api>=1.30.0;: Installed. No version info available.
opentelemetry-exporter-otlp-proto-http>=1.30.0;: Installed. No version info available.
opentelemetry-sdk>=1.30.0;: Installed. No version info available.
orjson>=3.9.14;: Installed. No version info available.
packaging>=23.2: Installed. No version info available.
pydantic: 2.11.7
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3,>=1: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic>=2.7.4: Installed. No version info available.
pytest>=7.0.0;: Installed. No version info available.
PyYAML>=5.3: Installed. No version info available.
qdrant-client: 1.15.1
requests-toolbelt>=1.0.0: Installed. No version info available.
requests<3,>=2: Installed. No version info available.
requests>=2.0.0: Installed. No version info available.
rich>=13.9.4;: Installed. No version info available.
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
vcrpy>=7.0.0;: Installed. No version info available.
zstandard>=0.23.0: Installed. No version info available.

Labels: bug, external, investigate, text-splitters
