Tokenize
Listed from least expensive to most expensive at run-time:
- `std::strtok` is the cheapest standard-provided tokenization method; it also allows the delimiter to be modified between tokens, but it incurs three difficulties with modern C++:
  - `std::strtok` cannot be used on multiple `string`s at the same time (though some implementations do extend to support this, such as `strtok_s`; a sketch of a reentrant variant appears after this list).
  - For the same reason, `std::strtok` cannot be used on multiple threads simultaneously (this may, however, be implementation defined; for example, Visual Studio's implementation is thread safe).
  - Calling `std::strtok` modifies the `std::string` it is operating on, so it cannot be used on `const string`s, `const char*`s, or literal strings. To tokenize any of these with `std::strtok`, or to operate on a `std::string` whose contents need to be preserved, the input would have to be copied, and the copy operated on.

  Generally, the cost of any of these options will be hidden in the allocation cost of the tokens, but if the cheapest algorithm is required and `std::strtok`'s difficulties cannot be overcome, consider a hand-spun solution (a minimal sketch also appears after this list).

  ```cpp
  // String to tokenize
  std::string str{ "The quick brown fox" };
  // Vector to store tokens
  std::vector<std::string> tokens;

  for (auto i = strtok(&str[0], " "); i != NULL; i = strtok(NULL, " ")) {
      tokens.push_back(i);
  }
  ```
- The `std::istream_iterator` uses the stream's extraction operator iteratively. If the input `std::string` is white-space delimited, this is able to expand on the `std::strtok` option by eliminating its difficulties, allowing inline tokenization (thereby supporting the generation of a `const vector<string>`), and by adding support for multiple delimiting white-space characters:

  ```cpp
  // String to tokenize
  const std::string str("The quick \tbrown \nfox");
  std::istringstream is(str);
  // Vector to store tokens
  const std::vector<std::string> tokens = std::vector<std::string>(
      std::istream_iterator<std::string>(is),
      std::istream_iterator<std::string>());
  ```
- The `std::regex_token_iterator` uses a `std::regex` to iteratively tokenize. It provides for a more flexible delimiter definition, for example delimiting on commas while allowing escaped (non-delimiting) commas and trimming surrounding white-space:

  ```cpp
  // String to tokenize
  const std::string str{ "The ,qu\\,ick ,\tbrown, fox" };
  const std::regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
  // Vector to store tokens
  const std::vector<std::string> tokens{
      std::sregex_token_iterator(str.begin(), str.end(), re, 1),
      std::sregex_token_iterator()
  };
  ```
See the regex_token_iterator example for more details.
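Regarding the first two `std::strtok` difficulties above: they stem from `std::strtok` keeping its parsing state in a single hidden location. The sketch below illustrates a reentrant variant that takes an explicit state pointer, allowing two strings to be tokenized in an interleaved fashion. It assumes a POSIX environment providing `strtok_r`; Visual Studio's `strtok_s`, mentioned above, has the same shape.

```cpp
#include <cstring>
#include <iostream>

int main()
{
    // Two independent buffers tokenized in an interleaved fashion, which
    // plain std::strtok cannot do because it keeps a single hidden state.
    char first[]  = "The quick brown fox";
    char second[] = "jumps over the lazy dog";

    char* state1 = nullptr; // explicit parsing state for `first`
    char* state2 = nullptr; // explicit parsing state for `second`

    char* t1 = strtok_r(first,  " ", &state1);
    char* t2 = strtok_r(second, " ", &state2);

    while (t1 != nullptr || t2 != nullptr) {
        if (t1 != nullptr) {
            std::cout << t1 << '\n';
            t1 = strtok_r(nullptr, " ", &state1);
        }
        if (t2 != nullptr) {
            std::cout << t2 << '\n';
            t2 = strtok_r(nullptr, " ", &state2);
        }
    }
}
```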
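And here is one possible shape of the hand-spun solution suggested for the `std::strtok` option: a minimal, non-mutating sketch built on `std::string::find_first_not_of` / `find_first_of`, so it also works on `const` and literal strings. The `split` name and the default delimiter set are illustrative choices, not part of any standard API.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hand-spun tokenizer sketch: splits on any character in `delims`
// without modifying the input string.
std::vector<std::string> split(const std::string& input,
                               const std::string& delims = " \t\n")
{
    std::vector<std::string> tokens;
    std::size_t begin = input.find_first_not_of(delims);
    while (begin != std::string::npos) {
        // Find the end of the current token (next delimiter or end of string).
        const std::size_t end = input.find_first_of(delims, begin);
        tokens.push_back(input.substr(begin, end - begin));
        // Skip any run of delimiters to the start of the next token.
        begin = input.find_first_not_of(delims, end);
    }
    return tokens;
}

int main()
{
    const std::string str{ "The quick \tbrown \nfox" };
    for (const auto& token : split(str)) {
        std::cout << token << '\n';
    }
}
```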