MongoDB $substrCP Operator
The $substrCP operator in MongoDB is used within the aggregation framework to extract substrings based on Unicode code points. Unlike byte-based substring operators such as $substrBytes, $substrCP counts whole code points, so it extracts substrings accurately from both ASCII and non-ASCII text. This makes it well suited to multilingual data and text containing special characters.
What is $substrCP Operator in MongoDB?
The $substrCP operator in MongoDB is used in the aggregation pipeline to extract a substring from a given string expression. It takes a zero-based Unicode code point index and a code point count to determine the substring. This makes it useful for strings that contain non-ASCII characters: because positions are counted in code points rather than bytes, a multibyte character is never split.
For example, { $substrCP: [ "geeksforgeeks", 0, 5 ] } returns "geeks": the starting index is 0, and 5 code points are taken from there.
Why Use $substrCP?
- Works seamlessly with non-ASCII characters (e.g., Chinese, Arabic, emojis, etc.).
- Uses Unicode code points instead of byte positions, ensuring accuracy.
- Supports multibyte character sets effectively.
- Enhances data processing in multilingual applications.
- Helps extract portions of text fields for analysis, filtering, and transformations.
Syntax:
{ $substrCP: [ <your string expression>, <code point index>, <code point count> ] }
Key Terms:
- string expression: Any valid expression that resolves to a string. This is the string from which the substring is extracted.
- code point index: A non-negative integer (or an expression that resolves to one) giving the zero-based starting position of the substring.
- code point count: A non-negative integer (or an expression that resolves to one) specifying the number of code points to take, starting from the code point index.
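The index-and-count rule can be imitated outside MongoDB to build intuition. The following is a minimal sketch in plain JavaScript, not MongoDB's implementation; the helper name substrCPSketch is hypothetical. It assumes that Array.from splits a string into Unicode code points (rather than UTF-16 code units), which is how JavaScript string iteration behaves.

```javascript
// Hypothetical helper that mimics the indexing rules of $substrCP in plain JavaScript.
// It is an illustration only and does not call MongoDB.
function substrCPSketch(str, index, count) {
  if (!Number.isInteger(index) || index < 0 || !Number.isInteger(count) || count < 0) {
    throw new RangeError("index and count must be non-negative integers");
  }
  // Array.from iterates by Unicode code point, so each array element is one code point.
  const codePoints = Array.from(str);
  return codePoints.slice(index, index + count).join("");
}

console.log(substrCPSketch("geeksforgeeks", 0, 5)); // geeks
console.log(substrCPSketch("geeksforgeeks", 5, 3)); // for
```

In this sketch, a count that runs past the end of the string simply returns the remaining characters.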
Examples of MongoDB $substrCP Operator
To understand the $substrCP operator, we need a collection on which to run the example queries.
- Database: geeksforgeeks
- Collection: articles
- Documents: three documents that contain the details of the articles in the form of field-value pairs.
Example 1: Using $substrCP operator
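We have an articles collection with a publishedon field storing publication dates in YYYYMMDD format. We need to extract the publication year and the remaining month-and-day portion separately.
Query:
db.articles.aggregate([
  {
    $project: {
      articlename: 1,
      // First 4 code points: the year (YYYY)
      publicationyear: { $substrCP: [ "$publishedon", 0, 4 ] },
      // Everything after the year: month and day (MMDD)
      publicationmonthday: {
        $substrCP: [
          "$publishedon",
          4,
          { $subtract: [ { $strLenCP: "$publishedon" }, 4 ] }
        ]
      }
    }
  }
])
Output: each returned document contains the _id and articlename fields plus the two computed fields, publicationyear and publicationmonthday.
Explanation:
- "publicationyear" extracts the first 4 characters of publishedon, representing the year.
- "publicationmonthday" extracts the remaining characters, using $strLenCP and $subtract to calculate the remaining length dynamically. To extract only the month, a fixed count of 2 can be used instead.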
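The index arithmetic in this example can be checked with a small plain-JavaScript sketch. It is an illustration under stated assumptions, not MongoDB code: the sample value "20200115" and the helper name splitPublishedOn are hypothetical, since the collection's actual documents are not shown here.

```javascript
// Hypothetical sample value; the real documents in the articles collection are not shown.
const publishedOnSample = "20200115";

// Split a YYYYMMDD string by code point positions, mirroring index/count pairs
// that could be passed to $substrCP: [0, 4] for the year, [4, 2] for the month,
// and [6, 2] for the day.
function splitPublishedOn(value) {
  const cps = Array.from(value);
  const take = (index, count) => cps.slice(index, index + count).join("");
  return {
    year: take(0, 4),
    month: take(4, 2),
    day: take(6, 2),
    // Same arithmetic as { $subtract: [ { $strLenCP: "$publishedon" }, 4 ] }.
    monthAndDay: take(4, cps.length - 4),
  };
}

console.log(splitPublishedOn(publishedOnSample));
```

Computing the count from the string length is useful when the tail has a variable length; for a fixed-width format such as YYYYMMDD, fixed counts work equally well.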
Example 2: Single-Byte Character Set
Suppose we have a collection articles in the geeksforgeeks database with documents containing an articlename field. We want to create a new field shortName with only the first 10 characters of each article's name. This is useful for displaying short previews of article titles.
Query:
db.articles.aggregate([
  {
    $project: {
      articlename: 1,
      shortName: {
        $substrCP: ["$articlename", 0, 10]
      }
    }
  }
]);
Output:
{
  "articlename": "Deep learning in R Programming",
  "shortName": "Deep learn"
}
Explanation:
- $substrCP extracts a substring starting from index 0 (the first character) and taking 10 characters from articlename.
- The resulting shortName contains the first 10 characters, which can be used as a preview or snippet of the full title.
- Because positions are counted in code points, a title containing multibyte characters would not be cut in the middle of a character.
Example 3: Handling Multibyte Character Set
Suppose another document in the articles collection has an articlename in a multibyte character set.
Query:
db.articles.aggregate([
  {
    $project: {
      shortName: {
        $substrCP: ["$articlename", 0, 15]
      }
    }
  }
]);
Output:
{ "shortName": "Social Media AP" }
Explanation: $substrCP counts code points, so each character is extracted whole even when it occupies several bytes in UTF-8; no character is split partway.
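To make the multibyte case concrete, here is a minimal plain-JavaScript sketch of code point slicing on a string that actually contains multibyte characters. The title "寿司 sushi guide" and the helper name firstCodePoints are hypothetical and are not taken from the articles collection; the sketch only imitates the counting rule and does not call MongoDB.

```javascript
// Hypothetical title containing multibyte characters (each kanji takes 3 bytes in UTF-8).
const sampleTitle = "\u5BFF\u53F8 sushi guide"; // "寿司 sushi guide"

// Take the first `count` code points, as $substrCP does with a starting index of 0.
function firstCodePoints(str, count) {
  return Array.from(str).slice(0, count).join("");
}

const preview = firstCodePoints(sampleTitle, 4);
console.log(preview);                                   // 寿司 s
console.log(Array.from(preview).length);                // 4 code points
console.log(new TextEncoder().encode(preview).length);  // 8 bytes in UTF-8
```

Four code points here correspond to eight bytes, which is why a byte count and a code point count cannot be used interchangeably for such text.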
Important Points About MongoDB $substrCP Operator
- The $substrCP operator is used in the aggregation pipeline to extract a substring from a given string expression.
- It uses the Unicode code point index and count to determine the substring.
- The $substrCP operator is useful when working with strings that contain non-ASCII characters, as it handles the Unicode code points correctly.
- Both the code point index and the code point count can be expressions, so the substring bounds can be computed within the pipeline, for example with $strLenCP and $subtract.
Conclusion
In MongoDB, the $substrCP operator extracts substrings based on Unicode code points. It handles both single-byte and multibyte characters, which makes it a good fit for applications managing diverse textual data. Whether we're dealing with date parsing, text truncation, or multilingual support, $substrCP lets us manipulate string data within the aggregation framework without splitting characters.
FAQs
Can $substrCP handle both single-byte and multibyte character sets?
Yes, $substrCP is designed to handle both single-byte and multibyte characters, ensuring accurate substring extraction regardless of how many bytes each character occupies.
How does $substrCP differ from $substr in MongoDB?
$substrCP uses Unicode code points for indexing, making it suitable for strings with non-ASCII characters, whereas $substr uses byte-based indexing, which may not accurately handle multibyte characters. Since MongoDB 3.4, $substr is a deprecated alias for $substrBytes.
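The difference comes down to whether a position can land inside a character. MongoDB's documentation describes $substrBytes as returning an error when the starting index or the count falls in the middle of a UTF-8 character. The following plain-JavaScript sketch shows how such a position can be detected from the raw bytes; the helper name isUtf8Boundary is hypothetical and the code is an illustration, not MongoDB's implementation.

```javascript
// Returns true when `offset` falls on the first byte of a UTF-8 character
// (or at the very end of the data), false when it lands on a continuation byte.
function isUtf8Boundary(bytes, offset) {
  if (offset === bytes.length) return true;
  // Continuation bytes in UTF-8 always have the bit pattern 10xxxxxx.
  return (bytes[offset] & 0xc0) !== 0x80;
}

// "\u00e9" is the letter e with an acute accent: 2 bytes in UTF-8; "a" is 1 byte.
const utf8Bytes = new TextEncoder().encode("\u00e9a");
console.log(utf8Bytes.length);              // 3
console.log(isUtf8Boundary(utf8Bytes, 0));  // true  (start of the accented e)
console.log(isUtf8Boundary(utf8Bytes, 1));  // false (middle of the accented e)
console.log(isUtf8Boundary(utf8Bytes, 2));  // true  (start of a)
```

A code point index never has this problem, because every index refers to a whole character by construction.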
In which MongoDB version was $substrCP introduced?
$substrCP was introduced in MongoDB version 3.4 as part of enhancements to support internationalization and multilingual applications.
Can $substrCP handle substrings that cross multibyte character boundaries?
Yes, $substrCP indexes by Unicode code point, so a substring always begins and ends on a character boundary, even when the characters involved are multibyte.