Tokenize
Listed from least expensive to most expensive at run-time:

- `std::strtok` is the cheapest tokenization method the standard provides, and it allows the delimiter to be changed between tokens, but it presents 3 difficulties in modern C++:
  - `std::strtok` cannot be used on multiple strings at the same time (though some implementations provide extensions that support this, such as `strtok_s`)
  - For the same reason, `std::strtok` cannot be used on multiple threads simultaneously (this may however be implementation defined; for example, Visual Studio's implementation is thread safe)
  - Calling `std::strtok` modifies the `std::string` it is operating on, so it cannot be used on `const string`s, `const char*`s, or string literals. To tokenize any of these with `std::strtok`, or to operate on a `std::string` whose contents need to be preserved, the input has to be copied first and the copy tokenized

  Generally, the cost of any of these options will be hidden in the allocation cost of the tokens, but if the cheapest algorithm is required and `std::strtok`'s difficulties cannot be overcome, consider a hand-spun solution.

      // String to tokenize (modified in place by std::strtok)
      std::string str{ "The quick brown fox" };
      // Vector to store tokens
      std::vector<std::string> tokens;

      for (auto i = std::strtok(&str[0], " "); i != nullptr; i = std::strtok(nullptr, " ")) {
          tokens.push_back(i);
      }

- The `std::istream_iterator` uses the stream's extraction operator iteratively. If the input `std::string` is white-space delimited, this improves on the `std::strtok` option by eliminating its difficulties, by allowing inline tokenization (and thereby the generation of a `const vector<string>`), and by supporting multiple delimiting white-space characters:

      // String to tokenize
      const std::string str("The quick \tbrown \nfox");
      std::istringstream is(str);
      // Vector to store tokens
      const std::vector<std::string> tokens = std::vector<std::string>(
          std::istream_iterator<std::string>(is),
          std::istream_iterator<std::string>());

- The `std::regex_token_iterator` uses a `std::regex` to iteratively tokenize. It allows a more flexible delimiter definition. For example, delimiting on commas that are not escaped with a backslash, while trimming the surrounding white-space:

      // String to tokenize
      const std::string str{ "The ,qu\\,ick ,\tbrown, fox" };
      const std::regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
      // Vector to store tokens
      const std::vector<std::string> tokens{
          std::sregex_token_iterator(str.begin(), str.end(), re, 1),
          std::sregex_token_iterator()
      };

  See the regex_token_iterator example for more details.
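The "hand-spun solution" mentioned for the `std::strtok` case is not spelled out in the text. One possible sketch, assuming the goal is `std::strtok`-like behavior (any character in a delimiter set separates tokens, empty tokens are skipped) without modifying the input, is shown here. The function name `split` is hypothetical and not part of the standard library.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not from the original text):
// splits `input` on any character found in `delims`, skipping empty
// tokens the way std::strtok does. The input is never modified, so it
// works on const strings and keeps no hidden state between calls.
std::vector<std::string> split(const std::string& input,
                               const std::string& delims) {
    std::vector<std::string> tokens;
    // Skip any leading delimiters to find the start of the first token.
    std::string::size_type start = input.find_first_not_of(delims);
    while (start != std::string::npos) {
        // The token ends at the next delimiter, or at the end of the input.
        const std::string::size_type end = input.find_first_of(delims, start);
        // substr clamps the count, so an `end` of npos takes the remainder.
        tokens.push_back(input.substr(start, end - start));
        // Skip the run of delimiters that follows the token.
        start = input.find_first_not_of(delims, end);
    }
    return tokens;
}
```

Called as `split("The quick brown fox", " ")`, this yields the four words as separate tokens. Because it relies only on its arguments, it can be used on several strings, or from several threads, at once.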
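The third `std::strtok` difficulty says that a `const` input has to be copied before it can be tokenized. A minimal sketch of that workaround follows; the function name `tokenize_preserving` is an assumption made for illustration, and the code still inherits `std::strtok`'s other two limitations (one string at a time, and thread safety that depends on the implementation).

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not from the original text):
// tokenizes a copy of `input` with std::strtok so that the caller's
// string is left untouched.
std::vector<std::string> tokenize_preserving(const std::string& input,
                                             const char* delims) {
    // std::strtok writes '\0' over delimiters, so operate on a scratch copy.
    std::string scratch(input);
    std::vector<std::string> tokens;
    for (char* tok = std::strtok(&scratch[0], delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        // Each token is copied out before the scratch buffer is destroyed.
        tokens.emplace_back(tok);
    }
    return tokens;
}
```

The extra copy is one more allocation on top of those made for the tokens themselves, which is the cost the text refers to.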
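The text notes that `std::strtok` allows the delimiter to be changed between tokens. As an illustration, assuming an input format of `key=value` pairs separated by `;` (a format chosen here for the example, not taken from the original), alternating the delimiter between calls might look like the following. The function name `parse_pairs` is hypothetical.

```cpp
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (an assumption, not from the original text):
// parses "key=value;key=value" by switching std::strtok's delimiter
// between "=" (to end a key) and ";" (to end a value).
// `text` is taken by value because std::strtok modifies its input.
std::vector<std::pair<std::string, std::string>> parse_pairs(std::string text) {
    std::vector<std::pair<std::string, std::string>> pairs;
    char* key = std::strtok(&text[0], "=");
    while (key != nullptr) {
        // Continue in the same buffer, but stop at ';' this time.
        char* value = std::strtok(nullptr, ";");
        if (value == nullptr) {
            break;  // a trailing key without a value is ignored
        }
        pairs.emplace_back(key, value);
        // Switch back to '=' to read the next key.
        key = std::strtok(nullptr, "=");
    }
    return pairs;
}
```

For the input `"name=fox;color=brown"`, this produces the pairs (`name`, `fox`) and (`color`, `brown`).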