
MongoDB $substrCP Operator

Last Updated : 10 Mar, 2025

The $substrCP operator in MongoDB is an aggregation expression operator that extracts a substring from a string by Unicode code point position. Unlike byte-based operators such as $substrBytes, it counts each character as one unit no matter how many UTF-8 bytes it occupies, so it gives predictable results for both ASCII and non-ASCII text. That makes it a good fit for multilingual data and text that contains special characters.

What is $substrCP Operator in MongoDB?

The $substrCP operator is used in aggregation pipeline stages such as $project to extract a substring from a string expression. It takes a zero-based starting code point index and a code point count. MongoDB stores strings as UTF-8, where a single character can occupy one to four bytes; counting code points instead of bytes means the start and end of the substring always fall on character boundaries.

For example, { $substrCP: [ "geeksforgeeks", 0, 5 ] } returns "geeks": the starting index is 0, and 5 code points are taken from that position.
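To make the index-and-count behaviour concrete, here is a minimal plain-JavaScript sketch that mimics what $substrCP does to a string. The helper name substrCP is hypothetical, and this is only an illustration that runs outside the database, not MongoDB's implementation. It relies on Array.from, which splits a string into Unicode code points.

```javascript
// Hypothetical helper that imitates $substrCP outside MongoDB.
// Array.from iterates a string by Unicode code point, not by UTF-16 unit.
function substrCP(str, index, count) {
  const codePoints = Array.from(str);
  return codePoints.slice(index, index + count).join("");
}

console.log(substrCP("geeksforgeeks", 0, 5)); // geeks
console.log(substrCP("geeksforgeeks", 5, 3)); // for
```

The second call shows that the index is zero-based: position 5 is the sixth character of the string.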

Why Use $substrCP?

  • Works seamlessly with non-ASCII characters (e.g., Chinese, Arabic, emojis, etc.).
  • Uses Unicode code points instead of byte positions, ensuring accuracy.
  • Supports multibyte character sets effectively.
  • Enhances data processing in multilingual applications.
  • Helps extract portions of text fields for analysis, filtering, and transformations.

Syntax:

{ $substrCP: [ <your string expression>, <code point index>, <code point count> ] }

Key Terms: 

  • string expression: Any valid expression that resolves to a string. This is the string from which the substring is extracted, and it may contain letters, digits, and special characters.
  • code point index: A non-negative integer, or an expression that resolves to one, giving the zero-based position where the substring starts.
  • code point count: A non-negative integer, or an expression that resolves to one, giving the number of code points to take starting at the code point index.
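The argument rules above can be checked on the client before a pipeline is sent. The following is a small sketch under stated assumptions: buildSubstrCP is a hypothetical helper name, not part of MongoDB or its drivers, and it only validates literal numbers, passing expression objects through untouched.

```javascript
// Hypothetical helper: builds a $substrCP expression and checks literal
// numeric arguments on the client before the pipeline is sent.
function buildSubstrCP(stringExpr, index, count) {
  for (const [name, value] of [["index", index], ["count", count]]) {
    // Only literal numbers are checked; expression objects such as
    // { $strLenCP: "$field" } are passed through unchanged.
    if (typeof value === "number" && (!Number.isInteger(value) || value < 0)) {
      throw new RangeError(`${name} must be a non-negative integer, got ${value}`);
    }
  }
  return { $substrCP: [stringExpr, index, count] };
}

const expr = buildSubstrCP("$articlename", 0, 10);
console.log(JSON.stringify(expr)); // {"$substrCP":["$articlename",0,10]}
```

The returned object can be placed wherever an aggregation expression is accepted, for example as a field value inside a $project stage.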

Examples of MongoDB $substrCP Operator

To understand MongoDB $substrCP Operator we need a collection on which we will perform various operations and queries.

  • Database: geeksforgeeks
  • Collection: articles
  • Documents: three documents that contain the details of the articles in the form of field-value pairs.

demo database and collection

Example 1: Using $substrCP operator

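We have an articles collection with a publishedon field that stores publication dates as strings in YYYYMMDD format. We want to extract the publication year, month, and day as separate fields.

Query:

db.articles.aggregate([
  {
    $project: {
      articlename: 1,
      // First 4 code points: YYYY
      publicationyear: { $substrCP: [ "$publishedon", 0, 4 ] },
      // Next 2 code points: MM
      publicationmonth: { $substrCP: [ "$publishedon", 4, 2 ] },
      // Everything from index 6 onward: DD
      publicationday: {
        $substrCP: [
          "$publishedon",
          6,
          { $subtract: [ { $strLenCP: "$publishedon" }, 6 ] }
        ]
      }
    }
  }
])

Explanation:

  • "publicationyear" takes 4 code points starting at index 0, which is the YYYY part.
  • "publicationmonth" takes 2 code points starting at index 4, which is the MM part.
  • "publicationday" takes everything from index 6 onward. The count is computed dynamically by subtracting 6 from the string length reported by $strLenCP.
  • For a hypothetical document whose publishedon value is "20200115", the three fields would be "2020", "01", and "15".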
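The offsets for a YYYYMMDD string can be sanity-checked outside the database with a short sketch. The helper takeCodePoints and the sample value are hypothetical, and the helper only imitates $substrCP for illustration. The length-based count follows the same idea as combining $strLenCP with $subtract.

```javascript
// Hypothetical stand-in for $substrCP, used only to check the offsets.
function takeCodePoints(str, index, count) {
  return Array.from(str).slice(index, index + count).join("");
}

// Hypothetical sample value in YYYYMMDD form.
const publishedon = "20200115";
const length = Array.from(publishedon).length; // what $strLenCP would report

const year = takeCodePoints(publishedon, 0, 4);
const month = takeCodePoints(publishedon, 4, 2);
// Everything after the year: start at 4, take (length - 4) code points.
const afterYear = takeCodePoints(publishedon, 4, length - 4);

console.log({ year, month, afterYear });
```

Here year is "2020", month is "01", and afterYear is "0115", which shows that taking the remainder after index 4 yields month and day together rather than the year.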

Example 2: Single-Byte Character Set

Suppose we have a collection articles in the geeksforgeeks database with documents containing an articlename field. We want to create a new field shortName with only the first 10 characters of each article’s name. This is useful for displaying short previews of article titles.

Query:

db.articles.aggregate([
  {
    $project: {
      articlename: 1,
      shortName: {
        $substrCP: ["$articlename", 0, 10]
      }
    }
  }
]);

Output:

{
"articlename": "Deep learning in R Programming",
"shortName": "Deep learn"
}

Explanation:

  • $substrCP extracts a substring starting from index 0 (first character) and taking 10 characters from articlename.
  • The resulting shortName contains the first 10 characters, which can be used as a preview or snippet of the full title.
  • This approach ensures correct handling of Unicode characters, preventing any corruption in case of multibyte characters.

Example 3: Handling Multibyte Character Set

Suppose another document in the articles collection has an articlename that may contain multibyte characters, and we want the first 15 characters of it.

Query:

db.articles.aggregate([
  {
    $project: {
      shortName: {
        $substrCP: ["$articlename", 0, 15]
      }
    }
  }
]);

Output:

{ "shortName": "Social Media AP" }

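Explanation: $substrCP counts code points rather than bytes, so the cut after 15 characters can never fall in the middle of a multibyte character. In the output shown, the first 15 code points all happen to be ASCII, so the result looks the same as a byte-based cut; the difference appears only when multibyte characters fall inside the extracted range.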
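The difference between counting code points and counting bytes is easier to see with a title that really does contain multibyte characters. The sketch below uses a hypothetical title and plain JavaScript (Array.from and TextEncoder) to imitate what $strLenCP, $strLenBytes, and $substrCP would report; it does not run against MongoDB.

```javascript
// Hypothetical article title; escapes keep the exact code points explicit.
const articlename = "Caf\u00E9 \u2615 review"; // "Café ☕ review"

// Number of code points, which is what $strLenCP counts.
const codePointCount = Array.from(articlename).length;
// Number of UTF-8 bytes, which is what $strLenBytes counts.
const utf8ByteCount = new TextEncoder().encode(articlename).length;

// First 6 code points, as { $substrCP: ["$articlename", 0, 6] } would take.
const shortName = Array.from(articlename).slice(0, 6).join("");

console.log(codePointCount, utf8ByteCount, shortName); // 13 16 Café ☕
```

The string has 13 code points but 16 bytes, because "é" takes 2 bytes and "☕" takes 3 in UTF-8. A count of 6 code points therefore covers 9 bytes here.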

Important Points About MongoDB $substrCP Operator

  1. The $substrCP operator is used in the aggregation pipeline to extract a substring from a given string expression.
  2. It uses the Unicode code point index and count to determine the substring.
  3. The $substrCP operator is useful when working with strings that contain non-ASCII characters, as it handles the Unicode code points correctly.
  4. Both the index and the count can be literals or expressions, so they can be computed at query time with operators such as $strLenCP and $subtract.

Conclusion

In MongoDB, the $substrCP operator extracts substrings by Unicode code point, so it handles single-byte and multibyte characters alike, which makes it well suited to applications that manage diverse textual data. Whether we are parsing date strings, truncating text, or supporting multiple languages, $substrCP is the operator to reach for when positions in a string should be counted in characters rather than bytes.

FAQs

Can $substrCP handle both single-byte and multibyte character sets?

Yes, $substrCP is designed to handle both single-byte and multibyte character sets effectively, ensuring accurate substring extraction regardless of character encoding.

How does $substrCP differ from $substr in MongoDB?

$substrCP uses Unicode code points for indexing, making it suitable for strings with non-ASCII characters. $substr, which has been deprecated since MongoDB 3.4 and is an alias for $substrBytes, uses UTF-8 byte offsets, so a start or end position can land in the middle of a multibyte character.
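To see why byte offsets are fragile, the following sketch checks whether a given byte offset lands on the start of a UTF-8 character. The function name isUtf8Boundary and the sample string are hypothetical; the check itself uses the standard UTF-8 rule that continuation bytes have the bit pattern 10xxxxxx.

```javascript
// Returns true when byteOffset falls on the start of a UTF-8 character
// (or at the very end of the data), false when it lands inside one.
function isUtf8Boundary(bytes, byteOffset) {
  if (byteOffset === bytes.length) return true;
  // Continuation bytes have the bit pattern 10xxxxxx.
  return (bytes[byteOffset] & 0xc0) !== 0x80;
}

const bytes = new TextEncoder().encode("caf\u00E9s"); // "cafés", é is 2 bytes
console.log(bytes.length);             // 6
console.log(isUtf8Boundary(bytes, 3)); // true: é starts at byte 3
console.log(isUtf8Boundary(bytes, 4)); // false: byte 4 is the second byte of é
console.log(isUtf8Boundary(bytes, 5)); // true: "s" starts at byte 5
```

With code point indexing there is no equivalent of offset 4: index 3 is "é" and index 4 is "s", so every index is a valid character position.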

In which MongoDB version was $substrCP introduced?

$substrCP was introduced in MongoDB version 3.4 as part of enhancements to support internationalization and multilingual applications.

Can $substrCP handle substrings that cross multibyte character boundaries?

Yes, $substrCP correctly handles substrings that cross multibyte character boundaries by using Unicode code point indexing.


