Skip to content

Offset by 1 bug on RecursiveJsonSplitter::split_json() function #29153

@blupants

Description

@blupants

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_text_splitters import RecursiveJsonSplitter


input_data = {
  "projects": {
    "AS": {
      "AS-1": {}
    },
    "DLP": {
      "DLP-7": {},
      "DLP-6": {},
      "DLP-5": {},
      "DLP-4": {},
      "DLP-3": {},
      "DLP-2": {},
      "DLP-1": {}
    },
    "GTMS": {
      "GTMS-22": {},
      "GTMS-21": {},
      "GTMS-20": {},
      "GTMS-19": {},
      "GTMS-18": {},
      "GTMS-17": {},
      "GTMS-16": {},
      "GTMS-15": {},
      "GTMS-14": {},
      "GTMS-13": {},
      "GTMS-12": {},
      "GTMS-11": {},
      "GTMS-10": {},
      "GTMS-9": {},
      "GTMS-8": {},
      "GTMS-7": {},
      "GTMS-6": {},
      "GTMS-5": {},
      "GTMS-4": {},
      "GTMS-3": {},
      "GTMS-2": {},
      "GTMS-1": {}
    },
    "IT": {
      "IT-3": {},
      "IT-2": {},
      "IT-1": {}
    },
    "ITSAMPLE": {
      "ITSAMPLE-12": {},
      "ITSAMPLE-11": {},
      "ITSAMPLE-10": {},
      "ITSAMPLE-9": {},
      "ITSAMPLE-8": {},
      "ITSAMPLE-7": {},
      "ITSAMPLE-6": {},
      "ITSAMPLE-5": {},
      "ITSAMPLE-4": {},
      "ITSAMPLE-3": {},
      "ITSAMPLE-2": {},
      "ITSAMPLE-1": {}
    },
    "MAR": {
      "MAR-2": {},
      "MAR-1": {}
    }
  }
}

splitter = RecursiveJsonSplitter(max_chunk_size=216)
json_chunks = splitter.split_json(json_data=input_data)

input_data_DLP_5 = input_data.get("projects", {}).get("DLP", {}).get("DLP-5", None)
input_data_GTMS_10 = input_data.get("projects", {}).get("GTMS", {}).get("GTMS-10", None)
input_data_ITSAMPLE_2 = input_data.get("projects", {}).get("ITSAMPLE", {}).get("ITSAMPLE-2", None)

chunk_DLP_5 = None
chunk_GTMS_10 = None
chunk_ITSAMPLE_2 = None

for chunk in json_chunks:
    print(chunk)
    node = chunk.get("projects", {}).get("DLP", {}).get("DLP-5", None)
    if isinstance(node, dict):
        chunk_DLP_5 = node
    node = chunk.get("projects", {}).get("GTMS", {}).get("GTMS-10", None)
    if isinstance(node, dict):
        chunk_GTMS_10 = node
    node = chunk.get("projects", {}).get("ITSAMPLE", {}).get("ITSAMPLE-2", None)
    if isinstance(node, dict):
        chunk_ITSAMPLE_2 = node

print("\nRESULTS:")
if isinstance(chunk_DLP_5, dict):
    print(f"[PASS] - Node DLP-5 was found both in input_data and json_chunks")
else:
    print(f"[TEST FAILED] - Node DLP-5 from input_data was NOT FOUND in json_chunks")

if isinstance(chunk_GTMS_10, dict):
    print(f"[PASS] - Node GTMS-10 was found both in input_data and json_chunks")
else:
    print(f"[TEST FAILED] - Node GTMS-10 from input_data was NOT FOUND in json_chunks")

if isinstance(chunk_ITSAMPLE_2, dict):
    print(f"[PASS] - Node ITSAMPLE-2 was found both in input_data and json_chunks")
else:
    print(f"[TEST FAILED] - Node ITSAMPLE-2 from input_data was NOT FOUND in json_chunks")

Error Message and Stack Trace (if applicable)

No response

Description

I am trying to use langchain_text_splitters library to split JSON content using the function RecursiveJsonSplitter::split_json()

For most cases it works, however I am experiencing some data being lost depending on the input JSON and the chunk size I am using.

I was able to consistently replicate the issue for the input JSON provided on my sample code. I always get the nodes "GTMS-10" and "ITSAMPLE-2" discarded when I split the JSON using max_chunk_size=216.

I noticed this issue always occurs with nodes that would be on the edge of the chunks. When you run my sample code, it will print all the 5 chunks generated:

python split_json_bug.py 

{'projects': {'AS': {'AS-1': {}}, 'DLP': {'DLP-7': {}, 'DLP-6': {}, 'DLP-5': {}, 'DLP-4': {}, 'DLP-3': {}, 'DLP-2': {}, 'DLP-1': {}}}}
{'projects': {'GTMS': {'GTMS-22': {}, 'GTMS-21': {}, 'GTMS-20': {}, 'GTMS-19': {}, 'GTMS-18': {}, 'GTMS-17': {}, 'GTMS-16': {}, 'GTMS-15': {}, 'GTMS-14': {}, 'GTMS-13': {}, 'GTMS-12': {}, 'GTMS-11': {}}}}
{'projects': {'GTMS': {'GTMS-9': {}, 'GTMS-8': {}, 'GTMS-7': {}, 'GTMS-6': {}, 'GTMS-5': {}, 'GTMS-4': {}, 'GTMS-3': {}, 'GTMS-2': {}, 'GTMS-1': {}}, 'IT': {'IT-3': {}, 'IT-2': {}, 'IT-1': {}}}}
{'projects': {'ITSAMPLE': {'ITSAMPLE-12': {}, 'ITSAMPLE-11': {}, 'ITSAMPLE-10': {}, 'ITSAMPLE-9': {}, 'ITSAMPLE-8': {}, 'ITSAMPLE-7': {}, 'ITSAMPLE-6': {}, 'ITSAMPLE-5': {}, 'ITSAMPLE-4': {}, 'ITSAMPLE-3': {}}}}
{'projects': {'ITSAMPLE': {'ITSAMPLE-1': {}}, 'MAR': {'MAR-2': {}, 'MAR-1': {}}}}

RESULTS:
[PASS] - Node DLP-5 was found both in input_data and json_chunks
[TEST FAILED] - Node GTMS-10 from input_data was NOT FOUND in json_chunks
[TEST FAILED] - Node ITSAMPLE-2 from input_data was NOT FOUND in json_chunks

Please, noticed that the 2nd chunk ends with node "GTMS-11" and the 3rd chunk starts with "GTMS-9". Same thing for chunks number 4 (ends with "ITSAMPLE-3") and chunk number 5 (starts with "ITSAMPLE-1")

Because the chunks "GTMS-10" and "ITSAMPLE-2" were lost on the edges of chunks, I believe that might a case of an "offset by 1 bug" on the RecursiveJsonSplitter::split_json() Python function.

Since I am calling it exactly how it is described in the documentation and I couldn't find any bug and discussion mentioning it, I thought I should file a bug for it.

System Info

(.venv) user@User-MacBook-Air split_json_bug % python -m langchain_core.sys_info

System Information
------------------
> OS:  Darwin
> OS Version:  Darwin Kernel Version 23.6.0: Thu Sep 12 23:34:49 PDT 2024; root:xnu-10063.141.1.701.1~1/RELEASE_X86_64
> Python Version:  3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

Package Information
-------------------
> langchain_core: 0.3.29
> langsmith: 0.2.10
> langchain_text_splitters: 0.3.5

Optional packages not installed
-------------------------------
> langserve

Other Dependencies
------------------
> httpx: 0.28.1
> jsonpatch: 1.33
> langsmith-pyo3: Installed. No version info available.
> orjson: 3.10.14
> packaging: 24.2
> pydantic: 2.10.5
> PyYAML: 6.0.2
> requests: 2.32.3
> requests-toolbelt: 1.0.0
> tenacity: 9.0.0
> typing-extensions: 4.12.2
> zstandard: Installed. No version info available.
(.venv) user@User-MacBook-Air split_json_bug % pip freeze
annotated-types==0.7.0
anyio==4.8.0
certifi==2024.12.14
charset-normalizer==3.4.1
h11==0.14.0
httpcore==1.0.7
httpx==0.28.1
idna==3.10
jsonpatch==1.33
jsonpointer==3.0.0
langchain-core==0.3.29
langchain-text-splitters==0.3.5
langsmith==0.2.10
orjson==3.10.14
packaging==24.2
pydantic==2.10.5
pydantic_core==2.27.2
PyYAML==6.0.2
requests==2.32.3
requests-toolbelt==1.0.0
sniffio==1.3.1
tenacity==9.0.0
typing_extensions==4.12.2
urllib3==2.3.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugRelated to a bug, vulnerability, unexpected error with an existing featureexternaltext-splittersRelated to the package `text-splitters`

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions