MongoDB $substrCP Operator

Last Updated : 10 Mar, 2025

The $substrCP operator in MongoDB is an aggregation expression that extracts a substring by Unicode code point position rather than by byte position. MongoDB stores strings as UTF-8, where a single character can occupy one to four bytes, so counting code points keeps multibyte characters intact. This makes $substrCP the right choice for multilingual text and for text containing special characters.

What is $substrCP Operator in MongoDB?

The $substrCP operator is used in the aggregation pipeline to extract a substring from a given string expression. It takes a zero-based Unicode code point index and a code point count to determine the substring. This makes it useful for strings that contain non-ASCII characters: each code point counts as one position, no matter how many bytes it occupies in UTF-8.

For example, { $substrCP: [ "geeksforgeeks", 0, 5 ] } returns "geeks": extraction starts at index 0 (the first code point) and takes 5 code points from there.
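The code-point semantics described above can be mimicked outside the database. The following is a minimal sketch in plain JavaScript, not MongoDB's implementation: `substrCP` is a hypothetical helper, and it assumes that `Array.from` on a string yields one element per Unicode code point.

```javascript
// Hypothetical helper that mimics the code-point semantics of $substrCP.
// Illustration only; this is not how MongoDB implements the operator.
function substrCP(str, index, count) {
  if (!Number.isInteger(index) || index < 0 ||
      !Number.isInteger(count) || count < 0) {
    throw new RangeError("index and count must be non-negative integers");
  }
  // Array.from iterates a string by Unicode code point, not by UTF-16 unit.
  return Array.from(str).slice(index, index + count).join("");
}

console.log(substrCP("geeksforgeeks", 0, 5)); // geeks
console.log(substrCP("geeksforgeeks", 5, 3)); // for
```

The second call starts at index 5, the sixth code point, which shows that the index is zero-based.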

Why Use `$substrCP`?

Works with non-ASCII text (e.g., Chinese, Arabic, emoji).
Counts Unicode code points instead of byte positions, so a multibyte character is never cut in half.
Treats single-byte and multibyte characters the same way: each code point counts as one position.
Enhances data processing in multilingual applications.
Helps extract portions of text fields for analysis, filtering, and transformations.

Syntax:

{ $substrCP: [ <your string expression>, <code point index>, <code point count> ] }

Key Terms:

string expression: Any valid expression that resolves to a string; the substring is extracted from it.
code point index: A non-negative integer giving the zero-based starting position of the substring, counted in code points.
code point count: A non-negative integer giving the number of code points to take, starting from the code point index.

Examples of MongoDB $substrCP Operator

To understand MongoDB $substrCP Operator we need a collection on which we will perform various operations and queries.

Database: geeksforgeeks
Collection: articles
Documents: three documents that contain the details of the articles in the form of field-value pairs.

Example 1: Using $substrCP operator

We have an articles collection with a publishedon field storing publication dates as strings in YYYYMMDD format. We need to extract the publication year and the remaining month-and-day part separately.

Query:

db.articles.aggregate([
  {
    $project: {
      articlename: 1,
      // First 4 code points: the year (YYYY)
      publicationyear: { $substrCP: [ "$publishedon", 0, 4 ] },
      // Everything after the year: month and day (MMDD)
      publicationmonthday: {
        $substrCP: [
          "$publishedon",
          4,
          // Remaining length = total code points minus the 4 already taken
          { $subtract: [ { $strLenCP: "$publishedon" }, 4 ] }
        ]
      }
    }
  }
])

Output: each result document contains _id, articlename, publicationyear, and publicationmonthday.

Explanation:

"publicationyear" takes the first 4 code points of publishedon, which hold the year.
"publicationmonthday" takes everything after the year; $subtract computes the count dynamically as the string length ($strLenCP) minus 4.
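The arithmetic behind this query can be traced by hand. The sketch below replays it in plain JavaScript on a hypothetical sample value (the actual documents are not shown in this article); `firstFour` and `rest` are illustrative names, not fields from the collection.

```javascript
// Hypothetical sample value in YYYYMMDD format.
const publishedon = "20250310";

// Split into code points, the unit that $substrCP and $strLenCP count.
const codePoints = Array.from(publishedon);

// Mirrors { $substrCP: [ "$publishedon", 0, 4 ] }
const firstFour = codePoints.slice(0, 4).join("");

// Mirrors { $subtract: [ { $strLenCP: "$publishedon" }, 4 ] }
const remaining = codePoints.length - 4;

// Mirrors { $substrCP: [ "$publishedon", 4, <remaining> ] }
const rest = codePoints.slice(4, 4 + remaining).join("");

console.log(firstFour); // 2025
console.log(remaining); // 4
console.log(rest);      // 0310
```

For a YYYYMMDD string, the first four code points are the year and the remainder is the month followed by the day.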

Example 2: Single-Byte Character Set

Suppose we have a collection articles in the geeksforgeeks database with documents containing an articlename field. We want to create a new field shortName with only the first 10 characters of each article’s name. This is useful for displaying short previews of article titles.

Query:

db.articles.aggregate([
  {
    $project: {
      articlename: 1,
      shortName: {
        $substrCP: ["$articlename", 0, 10]
      }
    }
  }
]);

Output:

{
  "articlename": "Deep learning in R Programming",
  "shortName": "Deep learn"
}

Explanation:

$substrCP extracts a substring starting from index 0 (first character) and taking 10 characters from articlename.
The resulting shortName contains the first 10 characters, which can be used as a preview or snippet of the full title.
This approach ensures correct handling of Unicode characters, preventing any corruption in case of multibyte characters.

Example 3: Handling Multibyte Character Set

Suppose another document in the articles collection has an articlename that may contain multibyte characters, and we want its first 15 characters.

Query:

db.articles.aggregate([
  {
    $project: {
      shortName: {
        $substrCP: ["$articlename", 0, 15]
      }
    }
  }
]);

Output:

{ "shortName": "Social Media AP" }

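Explanation: $substrCP counts code points rather than bytes, so the cut always falls on a character boundary. The 15 code points returned here happen to be ASCII, but a multibyte character in the title would also count as a single position and would never be split.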

Important Points About MongoDB $substrCP Operator

The $substrCP operator is used in the aggregation pipeline to extract a substring from a given string expression.
It uses the Unicode code point index and count to determine the substring.
The $substrCP operator is useful when working with strings that contain non-ASCII characters, as it handles the Unicode code points correctly.
Both the index and the count can be any expression that resolves to a non-negative integer, so they can be computed with other operators such as $strLenCP and $subtract.

Conclusion

In MongoDB, the $substrCP operator extracts substrings accurately by counting Unicode code points. It handles single-byte and multibyte characters alike, which makes it well suited to applications that manage diverse textual data. Whether we are dealing with date parsing, text truncation, or multilingual support, $substrCP is the operator to reach for when manipulating Unicode strings within the aggregation framework.

FAQs

Can `$substrCP` handle both single-byte and multibyte character sets?

Yes. MongoDB strings are UTF-8, where ASCII characters take one byte and other characters take two to four. $substrCP counts code points, so both kinds are extracted accurately.

How does `$substrCP` differ from `$substr` in MongoDB?

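$substrCP indexes by Unicode code point, making it suitable for strings with non-ASCII characters. $substr, which has been deprecated since MongoDB 3.4 and is now an alias for $substrBytes, indexes by UTF-8 byte, so its index and count no longer line up with characters once multibyte characters are present.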

In which MongoDB version was `$substrCP` introduced?

$substrCP was introduced in MongoDB version 3.4 as part of enhancements to support internationalization and multilingual applications.

Can `$substrCP` handle substrings that cross multibyte character boundaries?

Yes. Because the index and count are measured in code points, a substring can start or end next to a multibyte character without ever splitting it.

priyarajtt