C++ String Toolkit (StrTk) Tokenizer
Introduction
This article presents the tokenizing and splitting functionality of a simple C++ library called the String Toolkit. Tokenization, in the context of string processing, is the method by which a sequence of elements is broken up, or fragmented, into sub-sequences called tokens. The indices in the original sequence that determine such breaks are known as delimiters.
There are two types of delimiters: normal or "thin" delimiters, which are one element in length, and "thick" delimiters, which are two or more elements in length. Even though tokenization is primarily used in conjunction with strings, any sequence of types that can be iterated in a linear fashion can be tokenized; examples include a list of integers, a vector of person classes or a map of strings.
Another Tokenizer?
To date there have been many attempts to define a "standard" tokenizer implementation in C++. Of them all, the most likely candidate is the implementation in the Boost library. Regardless, proposed implementations should to some extent consider one or more of the following areas: overall usage patterns, construction, generality (or is it genericity these days?) of design, and performance/efficiency.
1. Over-all usage patterns
This requirement is concerned with how easy it is to instantiate the tokenizer and integrate it into existing processing patterns, which more often than not requires integration with C++ STL algorithms and containers. A tokenizer by definition would be classed as a producer, so the question becomes: how easy is it for others to consume its output? Another issue is consistency in the definition of a token in the situation where one has consecutive delimiters but is not compressing them - can or should there be such a thing as an empty token? What do preceding and trailing delimiters mean? And when should they be included as part of the token?
2. Construction
In the context of string tokenizing, the majority of implementations return each token as a new instance of a string. This process requires a string to be created on the heap, populated with the sub-string in question from the original sequence, then returned to the caller (some of this may be alleviated by Return Value Optimization (RVO)). In the case of iterators this is essentially another copy for the caller. Furthermore, two kinds of input can make this situation worse: a large sequence made up of many very short tokens, or a large sequence made up of a number of very large tokens. The solution is not to return the token as a string but rather as a range, and to allow the caller to decide how they wish to inspect and further manipulate the token.
This minor change in interface design provides a great deal of flexibility and performance gain.
3. Generality (Genericity) of design
Most tokenizer implementations concern themselves only with strings, which for the most part is acceptable, because most things that need tokenizing are strings. However, there will be times when one has a sequence of types that needs to be tokenized but isn't a string, hence a tokenizer should be designed in such a way as to enable such uses; moreover, it becomes clear that the most basic of tokenizing algorithms are invariant to the type of the delimiter.
4. Performance and Efficiency
Tokenizing strings ranges from low-frequency inputs, such as user input or the parsing of simple configuration files, to more complex situations such as the tokenizing of HTML pages that a web crawler/indexer might perform, or the parsing of large market-data streams in FIX format. Performance is generally important and can usually be helped along by good usage patterns that encourage reuse of instances, minimal preprocessing and user-supplied predicates for the nastier areas of the process. It should be noted that everything in the following article can be done purely using the STL - that said, C++'s ability to let one skin the proverbial cat in numerous ways gives rise to novel solutions that are for the most part not of any practical use other than to showcase one's abilities in using the STL.
Getting started
The String Toolkit Library (StrTk) provides two common tokenization concepts, a split function and a token iterator. Both these concepts require the user to provide a delimiter predicate and an iterator range over which the process will be carried out.
The tokenization process can be further parametrized by switching between "compressed delimiters" and "no compressed delimiters" modes. This essentially means that consecutive delimiters will be compressed down to one and treated as such. Below are two tables depicting the expected tokens from various types of input; the tables represent the non-compressed and compressed tokenization processes respectively. The delimiter in this instance is the pipe symbol | and <> denotes an empty token.
No Compressed Delimiters
Input | Token List |
a | a |
a|b | a,b |
a||b | a,<>,b |
|a | <>,a |
a| | a,<> |
|a||b | <>,a,<>,b |
||a||b|| | <>,<>,a,<>,b,<>,<> |
| | <>,<> |
|| | <>,<>,<> |
||| | <>,<>,<>,<> |
Compressed Delimiters
Input | Token List |
a | a |
a||b | a,b |
|a||b | <>,a,b |
||a||b|| | <>,a,b,<> |
| | <>,<> |
|| | <>,<> |
||| | <>,<> |
Delimiters
Two forms of delimiters are supported: the single delimiter predicate and the multiple delimiter predicate, otherwise known as SDP and MDP respectively. Essentially an SDP is where there is only one type that can break or fragment the sequence, whereas with an MDP there is more than one unique type that can break the sequence. It is possible to represent an SDP using an MDP; however, from a performance point of view having separate predicates is far more efficient. Additionally, for strings based on char or unsigned char (8-bit versions) there is an MDP that has a look-up complexity of O(1), making it greatly more efficient than the basic MDP.
Single Delimiter Predicate
Instantiation requires specialization of type and construction requires an instance of the type.
strtk::single_delimiter_predicate<typename T>(const T& t)
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
Multiple Delimiter Predicate
Instantiation requires specialization of type and construction requires a sequence of potential delimiters through a range described by iterators.
strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end);
std::string str_delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
strtk::multiple_delimiter_predicate<char> mdp1(str_delimiters.begin(),str_delimiters.end());
unsigned int uint_delimiters[5] = {1,10,101,1010,10101};
strtk::multiple_delimiter_predicate<unsigned int> mdp2(uint_delimiters,uint_delimiters + 5);
As previously mentioned tokenization of data need not be limited to strings comprised of chars, but can also be extended to other PODs or complex types. In the above example a predicate used for tokenizing a sequence of unsigned ints is being defined.
Multiple Char Delimiter Predicate
Instantiation requires an iterator range based on either unsigned char or char. This delimiter predicate is more efficient than the simple MDP as it has a predicate evaluation of O(1), due to the use of a lookup table, as opposed to O(n) where n is the number of delimiters. The performance advantage grows as more delimiters are used.
strtk::multiple_char_delimiter_predicate(const std::string&)
strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin,const std::string::const_iterator end)
strtk::multiple_char_delimiter_predicate predicate(" .;?");
The delimiter concept can be extended to the point where the predicate itself acts as a state machine, transitioning from state to state based on conditions and rules related to the current symbol being processed. This simple extension can lead to some very interesting parsing capabilities.
Split
This is a function that performs tokenization over an entire sequence
in one go. strtk::split
takes a sequence through
a range of iterators or in the case of a string through
a reference to its instance, a delimiter predicate and an output
iterator or otherwise known as a type sink. It populates the output iterator with the tokens it
extracts. The tokens in this case are std::pairs of iterators for
the sequence.
Split can be used in a "simple - no frills" manner by simply passing the necessary parameters:
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::split(" |.;?",str,std::back_inserter(token_list));
strtk::split can also be used in a more explicit manner whereby the exact type of delimiter predicate is specified by the user:
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::split(predicate,str,std::back_inserter(token_list));
strtk::split provides an additional usage option that allows the user to specify whether they would like to compress the delimiters and whether or not they would like to include the delimiter as part of the token range. This enum parameter is called strtk::split_options and has the following values:
Split Option | Definition |
strtk::split_options::default_mode | Default options |
strtk::split_options::compress_delimiters | Consecutive delimiters are treated as one |
strtk::split_options::include_1st_delimiter | The first delimiter is included in the resulting token range |
strtk::split_options::include_delimiters | All delimiters are included in the resulting token range |
The simple example below demonstrates a split over a string, given a predicate, where the provided split options indicate that consecutive delimiters will be treated as one and that all delimiters encountered after each token will be included in the token, up until the next token begins.
strtk::split(predicate,
str,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters | strtk::split_options::include_delimiters);
Another way of using split may be to use the strtk::multiple_char_delimiter_predicate as follows:
std::string str = "abc?123;xyz.789";
strtk::std_string::token_list_type token_list;
strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::split(predicate,str,std::back_inserter(token_list));
The contents of the token_list can be printed out as follows:
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << (*itr) << "\t";
++itr;
}
Split N-Tokens
A natural extension of strtk::split
is strtk::split_n
.
This function provides the ability to tokenize a sequence up until a specific
number of tokens have been encountered or until there are no more tokens left.
The return value of the strtk::split_n
would be the number of tokens
encountered.
std::string data = "token1?token2,token3;token4,token5";
strtk::std_string::token_list_type token_list;
const std::size_t token_count = 4;
const std::string delimiters (" ,.;?");
strtk::split_n(delimiters,
data,
token_count,
std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Offset Splitter
Another simple variant is the strtk::offset_splitter
.
This form of split takes a series of offsets and an iterator range or
string and determines the tokens based on the offsets. This function
can be set to perform a single pass of the offsets or to rotate
them until the range has been completely traversed. The example
below demonstrates how a string representing date and time can be
tokenized into its constituent parts (year, month, day, hour, minutes
,seconds,milliseconds)
std::string time_data = "20000101091011345"; //2000-01-01 9:10:11sec 345ms
const std::size_t offset_list_size = 7;
const int offset_list[offset_list_size] = {
4, // year
2, // month
2, // day
2, // hour
2, // minute
2, // second
3 // ms
};
const strtk::offset_predicate<offset_list_size> predicate(offset_list);
strtk::std_string::token_list_type token_list;
strtk::offset_splitter(time_data,predicate,std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "] ";
++itr;
}
std::cout << std::endl;
Split Regex
Another form of splitter is based on the concept of using regular expressions as the delimiter predicate. Below is a simple example of splitting tokens wrapped in round-brackets.
std::string str = "(12)(345)(6789)(0ijkx)(yz)";
std::list<std::string> token_list;
strtk::split_regex("\\((.*?)\\)",
str,
std::back_inserter(token_list),
strtk::regex_match_mode::match_1);
std::list<std::string>::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Note: The parameter regex_match_mode represents the capture of the marked sub-expression in the current match. By default it is strtk::regex_match_mode::match_all, which in this case would provide the entire match including the bounding pattern, e.g. Token3 would be (0ijkx). However, in the above example we are only interested in the sub-expression between the round brackets, hence strtk::regex_match_mode::match_1 is used, resulting in Token3 being 0ijkx.
The following examples demonstrate the use of the strtk::split_regex and strtk::split_regex_n routines in extracting specific types of data - in this case the POD types int and double.
int main()
{
{
// Extract ints from data string
std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
std::deque<int> int_list;
strtk::split_regex("([+-]?([\\d]+))",
data,
strtk::range_to_type_back_inserter(int_list),
strtk::regex_match_mode::match_1);
}
{
// Extract doubles from data string
std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
std::deque<double> double_list;
strtk::split_regex(strtk::ieee754_expression,
data,
strtk::range_to_type_back_inserter(double_list),
strtk::regex_match_mode::match_1);
}
{
// Extract the first 3 ints from data string
std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
std::deque<int> int_list;
strtk::split_regex_n("([+-]?([\\d]+))",
data,
3,
strtk::range_to_type_back_inserter(int_list),
strtk::regex_match_mode::match_1);
}
{
// Extract the first 4 doubles from data string
std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
std::deque<double> double_list;
strtk::split_regex_n(strtk::ieee754_expression,
data,
4,
strtk::range_to_type_back_inserter(double_list),
strtk::regex_match_mode::match_1);
}
return 0;
}
The following table describes the various regex patterns built into StrTk, which can be used seamlessly with the strtk::split_regex and strtk::split_regex_n routines.
Regex | Definition |
strtk::uri_expression | URI, URL address e.g.: http://www.example.com, domain.example.net/index.html |
strtk::email_expression | E-mail address e.g.: some.one@example.com |
strtk::ip_expression | IPv4 address e.g.: 192.168.0.1, 127.0.0.1 |
strtk::ieee754_expression | Floating point value e.g.: 1.1, 1.234e-123, -1.0001E+10, 0.1234 |
Tokenizer
The tokenizer is specialized on a sequence iterator and predicate. It is constructed with a range of iterators for the sequence and a reference to the desired predicate. Definitions exist for std::string tokenizers. The tokenizer provides an iteration pattern as a means for accessing the tokens in the sequence.
const unsigned int data_size = 12;
unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9};
strtk::single_delimiter_predicate<unsigned int> predicate(0);
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type;
tokenizer_type tokenizer(data,data + data_size,predicate);
Similar to strtk::split, strtk::tokenizer provides tokenizing options that are passed in during construction. Below is a table depicting said options:
Tokenize Option | Definition |
strtk::tokenize_options::default_mode | Default options |
strtk::tokenize_options::compress_delimiters | Consecutive delimiters are treated as one |
strtk::tokenize_options::include_1st_delimiter | The first delimiter is included in the resulting token range |
strtk::tokenize_options::include_delimiters | All delimiters are included in the resulting token range |
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type;
tokenizer_type tokenizer(data, data + data_size,
predicate,
strtk::tokenize_options::compress_delimiters |
strtk::tokenize_options::include_1st_delimiter);
Furthermore, iteration over the tokens of strtk::tokenizer is performed as follows:
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::copy((*itr).first,(*itr).second,std::ostream_iterator<unsigned int>(std::cout," "));
std::cout << std::endl;
++itr;
}
A typical std::string can be tokenized in the following manner:
std::string str = "abc|123|xyz|789";
strtk::std_string::tokenizer<>::type tokenizer(str,"|");
strtk::std_string::tokenizer<>::type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Another common situation may be tokenizing a sequence of strings, such as the following:
const std::string str_list[] = { "abc" , "delimiter" , "ijk" , "delimiter" ,
"mno" , "delimiter" , "rst" , "uvw" ,
"delimiter" , "xyz" };
const std::size_t str_list_size = sizeof(str_list) / sizeof(std::string);
strtk::range_adapter<std::string> range(str_list,str_list_size);
strtk::single_delimiter_predicate<std::string> predicate("delimiter");
typedef strtk::tokenizer<std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer_type;
tokenizer_type tokenizer(range.begin(),range.end(),predicate);
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::copy((*itr).first,(*itr).second,std::ostream_iterator<std::string>(std::cout," "));
std::cout << std::endl;
++itr;
}
The Parse Routine
Till now the mentioned routines worked specifically with tokens, or in
other words ranges of characters. The responsibility of managing the
tokens and converting the tokens to user specified types was done
manually via "range to type" oriented back inserters and converters.
This can be a bit cumbersome and as such StrTk provides a series of
helper routines called strtk::parse
. Parse takes an
std::string
representing a tuple of delimited values as
input data, a delimiter set, and a series of references to variables
that are to be populated with the values from the parsed tokens. The
following diagram demonstrates the flow of data, tokens and the
corresponding relationships and conversions between each of the
parameters.
Note: strtk::parse returns a boolean value of true upon successful parsing and false for all other results. Situations that cause strtk::parse to fail include:
- Insufficient number of tokens for the given number of variables
- Conversion failure from token range to variable type
- Empty or null token(s)
Some Simple Parse Examples
strtk::parse can take an arbitrary number of variable references. The code below demonstrates the basic usage of strtk::parse taking various numbers of parameters.
std::string data = "abcde,-1234|567.890#1.1f";
std::string delimiters = ",|#";
std::string var0;
int var1;
double var2;
float var3;
strtk::parse(data,delimiters,var0);
strtk::parse(data,delimiters,var0,var1);
strtk::parse(data,delimiters,var0,var1,var2);
strtk::parse(data,delimiters,var0,var1,var2,var3);
The following examples demonstrate parsing of PODs such as int and double into STL-compatible sequences (std::vector, std::deque, std::list, std::set, std::queue, std::stack and std::priority_queue).
// Insert into std::vector
std::string int_string = "1 +2 -3 4 +5 -6 7 +8 -9 10 +11 -12 13 +14 -15";
std::vector<int> int_vector;
strtk::parse(int_string," ",int_vector);
// Insert into std::deque
std::string double_string = "-123.456,789.012,-345.678,901.234,+567.890";
std::deque<double> double_deque;
strtk::parse(double_string,",",double_deque);
// Insert into std::list
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse(data_string,",",string_list);
// Insert into std::set
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse(set_string,"#|",string_set);
// Insert into std::queue
std::string queue_string = "value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse(queue_string,",|",string_queue);
// Insert into std::stack
std::string stack_string = "value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse(stack_string,",|",string_stack);
// Insert into std::priority_queue
std::string priority_queue_string = "value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse(priority_queue_string,",|#",string_priority_queue);
Similar to what is described above, the following demonstrates parsing of up to "N" elements into an STL-compatible sequence.
// Insert N elements into std::vector
std::string int_string = "100,-200,+300,400,-500,+600,700,-800,+900";
std::vector<int> int_vector;
strtk::parse_n(int_string,",",5,int_vector);
// Insert N elements into std::deque
std::string double_string = "123.456,+789.012,345.678,-901.234,567.890";
std::deque<double> double_deque;
strtk::parse_n(double_string,",",3,double_deque);
// Insert N elements into std::list
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse_n(data_string,",",6,string_list);
// Insert N elements into std::set
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse_n(set_string,"#|",6,string_set);
// Insert N elements into std::queue
std::string queue_string = "value0,value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse_n(queue_string,",|",4,string_queue);
// Insert N elements into std::stack
std::string stack_string = "value0|value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse_n(stack_string,",|",4,string_stack);
// Insert N elements into std::priority_queue
std::string priority_queue_string = "value0#value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse_n(priority_queue_string,",|#",4,string_priority_queue);
Note: The return value of the routine strtk::parse_n indicates how many elements were parsed and placed into the specified sequence.
A Practical Example
Let's assume you have been given an English text file to process, with the intention of extracting a lexicon from the file.
One solution would be to break the problem down into a line-by-line tokenization problem. In this case you would define a function object such as the following, which takes the container in which you plan on storing your tokens (words) and a predicate, and inserts the tokens as strings into your container.
template <typename Container, typename Predicate>
struct parse_line
{
public:
   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate)
   {}

   inline void operator() (const std::string& str)
   {
      strtk::split(predicate_,
                   str,
                   strtk::range_to_type_back_inserter(container_),
                   strtk::split_options::compress_delimiters);
   }

private:
   Container& container_;
   const Predicate& predicate_;
};
Putting the whole thing together - opening the file and reading it line by line, each time invoking parse_line - would be as follows:
template <typename Container>
void parse_text(const std::string& file_name, Container& c)
{
   static const std::string delimiters = " ,.;:<>'[]{}()_?/"
                                         "`~!@#$%^&*|-_\"=+\t\r\n"
                                         "0123456789";
   strtk::multiple_char_delimiter_predicate predicate(delimiters);
   strtk::for_each_line(file_name,
                        parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}

int main()
{
   std::string text_file_name = "text.txt";
   std::deque<std::string> word_list;
   parse_text(text_file_name,word_list);
   std::cout << "Token Count: " << word_list.size() << std::endl;
   return 0;
}
Now coming back to the original problem, that being the construction of a lexicon. In this case the set of "words" should only contain words of interest. For the sake of simplicity, let's define words of interest as being anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. This type of list is commonly known as a stop-word list. In this example the stop-word list definition will be as follows:
const std::string stop_word_list [] =
{
"as", "at", "but", "by", "for",
"in", "like", "next", "of", "on",
"opposite", "out", "past", "to",
"up", "via", ""
};
const std::size_t stop_word_list_size = sizeof(stop_word_list) / sizeof(std::string);
Some minor updates to the parse_line processor include using the filter_on_match predicate for determining if the currently processed token is a preposition, and also the invocation of the range_to_type back-inserter to convert the tokens from their range iterator representation to a type representation compatible with the user-defined container. For the new implementation to provide unique words of interest, the simplest change that can be made is to replace the deque used as the container for the word list with some kind of 1-to-1 associative container such as a set. The following is the improved version of the parse_line processor:
template <typename Container, typename Predicate>
struct parse_line
{
private:
   typedef strtk::range_to_string_back_inserter_iterator<Container> inserter_type;
   typedef strtk::filter_on_match<inserter_type> filter_type;

public:
   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate),
     tmp_(" "),
     tokenizer_(tmp_,predicate_,true),
     filter_(stop_word_list,stop_word_list + stop_word_list_size,
             inserter_type(container_),
             true,false)
   {}

   inline void operator() (const std::string& s)
   {
      const filter_type& filter = filter_;
      strtk::for_each_token(s,tokenizer_,filter);
   }

private:
   Container& container_;
   const Predicate& predicate_;
   std::string tmp_;
   typename strtk::std_string::tokenizer<Predicate>::type tokenizer_;
   filter_type filter_;
};
The above-described predicate can be greatly simplified by using binders and various lambda expressions.
Another Example
When performing serialization or deserialization of an object instance such as a class or struct, a simple approach one could take would be to convert each of the members into its string representation and, from those strings, construct a larger string, delimiting each member with a special character guaranteed not to exist in any of the string representations.
In this example we will assume that there exists a struct which represents the properties of a person, a person struct if you will:
struct person
{
unsigned int id;
std::string name;
unsigned int age;
double height;
float weight;
};
The process of populating a person struct would entail having an instance of a person and the necessary data string. The following is an example of how this would be done using the strtk::parse function.
Person Tuple Format
Token0 | Delimiter0 | Token1 | Delimiter1 | Token2 | Delimiter2 | Token3 | Delimiter3 | Token4 |
Unique ID (hex) | "|" | Name | "|" | Age | "|" | Height (m) | "|" | Weight (kg) |
std::string data = "0xFA37ED12|Rumpelstiltskin|397|1.31|58.7";
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink(p.id); // register id with the hex sink
strtk::parse(data,"|",hex_sink,p.name,p.age,p.height,p.weight);
Batch processing of a text file comprised of one person tuple per line is somewhat similar to the previous example. A predicate is defined that takes a container specialized on the person struct, and a delimiter predicate with which the strtk::parse function will be invoked. This predicate is then instantiated and, coupled with the text file name, is fed to the strtk::for_each_line processor.
template <typename Container>
struct parse_person_tuple
{
public:
   parse_person_tuple(Container& container)
   : container_(container),
     hex_sink(p_.id)
   {}

   inline void operator() (const std::string& s)
   {
      if (strtk::parse(s,"|",hex_sink,p_.name,p_.age,p_.height,p_.weight))
         container_.push_back(p_);
      else
         std::cerr << "Failed to parse: " << s << std::endl;
   }

private:
   Container& container_;
   person p_;
   strtk::hex_to_number_sink<unsigned int> hex_sink;
};
Bringing the above pieces together to process a file results in the following:
int main()
{
   std::string text_file_name = "person_records.txt";
   std::deque<person> person_list;
   typedef parse_person_tuple<std::deque<person> > predicate_type;
   strtk::for_each_line(text_file_name,predicate_type(person_list));
   return 0;
}
To make things easier one could adapt a struct (made up entirely of PODs) to the parser. This makes the usage syntax a little easier to follow. An example of this adaptation is as follows:
struct type
{
std::string s;
double d;
int i;
char c;
bool b;
};
strtk_parse_begin(type)
strtk_parse_type(s)
strtk_parse_type(d)
strtk_parse_type(i)
strtk_parse_type(c)
strtk_parse_type(b)
strtk_parse_end()
int main()
{
type t;
std::string s = "abcdefghijklmnop|123.456|987654321|A|1";
strtk::parse(s,"|",t);
return 0;
}
Another example, similar to the previous, would be parsing a text file of 3D coordinates into a sequence. This can be done easily and cleanly using lambdas and StrTk as follows:
struct point
{
double x,y,z;
};
int main()
{
std::string point_data = "point_data.txt";
std::deque<point> points;
point p;
strtk::for_each_line(point_data,
[&points,&p](const std::string& str)
{
if (strtk::parse(str,",",p.x,p.y,p.z))
points.push_back(p);
});
return 0;
}
Simple Date-Time Parser
Assuming the datetime struct defined below, and a string representation of a combined date and time in the form YYYY-MM-DD HH:MM:SS.MS, e.g. 2005-06-26 15:57:03.678:
struct datetime
{
unsigned int year;
unsigned int month;
unsigned int day;
unsigned int hour;
unsigned int minute;
unsigned int second;
unsigned int msecond;
};
The following assumes an input of date-time values separated by a pipe. To facilitate parsing of a date-time by the strtk::parse routine into an STL-compatible sequence, an implementation of string_to_type_converter_impl specific to the datetime type is required. The following demonstrates how such a routine can be generated and used within the strtk::parse context:
strtk_string_to_type_begin(datetime)
static const std::string delimiters ("-:. ");
return strtk::parse(begin, end, delimiters,
t.year, t.month, t.day,
t.hour, t.minute, t.second, t.msecond);
strtk_string_to_type_end()
A simple example of using strtk::parse in conjunction with the newly integrated datetime variable, mixed with variables of other types, is as follows:
int main()
{
const std::string data = "abc 123 xyz,2000-01-10 03:01:16.123|+98765.43210";
std::string var0;
datetime var1;
double var2;
strtk::parse(data,",|",var0,var1,var2);
return 0;
}
Bringing the above pieces together, the following parses a sequence of date-time strings delimited by the pipe "|" into a deque of type datetime.
int main()
{
static const std::string data = "2000-01-10 03:01:16.123|2001-02-22 05:12:24.234|"
"2002-03-13 07:23:32.345|2003-04-24 09:34:47.456|"
"2004-05-15 11:46:51.767|2005-06-26 15:57:03.678|"
"2006-07-17 17:45:31.561|2007-08-26 19:06:02.809|"
"2008-09-18 21:16:23.267|2009-10-26 23:12:03.798|"
"2010-11-23 13:47:11.963|2011-12-26 15:35:08.168";
std::deque<datetime> datetime_list;
strtk::parse(data,"|",datetime_list);
for (std::size_t i = 0; i < datetime_list.size(); ++i)
{
datetime& dt = datetime_list[i];
std::cout << dt.year << "-"
<< strtk::text::right_align(2,'0', dt.month) << "-"
<< strtk::text::right_align(2,'0', dt.day) << " "
<< strtk::text::right_align(2,'0', dt.hour) << ":"
<< strtk::text::right_align(2,'0', dt.minute) << ":"
<< strtk::text::right_align(2,'0', dt.second) << "."
<< strtk::text::right_align(3,'0',dt.msecond) << std::endl;
}
return 0;
}
Parsing Sub-Lists
So far the demonstrated capabilities of the strtk::parse function have been based on passing a series of parameters that are populated in a linear fashion as the parser processes the tokens it encounters. That said, some formats have their own sub-structures; a simple example would be a list of values (such as integers) that needs to be loaded into a deque or stack. StrTk provides a series of sink facilities that consume a range and an STL container, and which can be forwarded on to strtk::parse.
In the following example, the data string is comprised of three separate lists delimited by a pipe "|": an integer list, a string list and a double list. Each list is to be parsed into an STL container of the appropriate type, in this case a vector, a deque and a list respectively. StrTk provides the ability to instantiate a sink for the specific container type that is compatible with strtk::parse.
int main()
{
std::string data = "1,+2,-3,4|abc,ijk,rst,xyz|123.456,+234.567,-345.678,456.789,567.890";
//define containers
std::vector<int> int_vector;
std::deque<std::string> string_deque;
std::list<double> double_list;
std::set<int> int_set;
std::queue<std::string> string_queue;
std::stack<double> double_stack;
std::priority_queue<int> int_priority_queue;
//define sinks
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<std::string>::type deq_sink(",");
strtk::list_sink<double>::type lst_sink(",");
strtk::set_sink<int>::type set_sink(",");
strtk::queue_sink<std::string>::type que_sink(",");
strtk::stack_sink<double>::type stk_sink(",");
strtk::priority_queue_sink<int>::type prq_sink(",");
strtk::parse(data,"|",vec_sink( int_vector),
deq_sink(string_deque),
lst_sink( double_list));
strtk::parse(data,"|",set_sink( int_set),
que_sink( string_queue),
stk_sink( double_stack),
prq_sink(int_priority_queue));
return 0;
}
If only a certain number of elements in the list are required, for example only the first 3, then the element count on the sink can be set appropriately. The above example could be modified as follows:
int main()
{
std::string data = "1,+2,-3,4|string0|abc,ijk,rst,xyz|string1|123.456,+234.567,-345.678,456.789,567.890";
std::vector<int> int_vector;
std::deque<std::string> string_deque;
std::list<double> double_list;
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<std::string>::type deq_sink(",");
strtk::list_sink<double>::type lst_sink(",");
std::string string_0;
std::string string_1;
strtk::parse(data,"|",vec_sink( int_vector).count(2), // consume first 2 values
string_0,
deq_sink(string_deque).count(3), // consume first 3 values
string_1,
lst_sink( double_list).count(4)); // consume first 4 values
return 0;
}
Note: If there aren't enough elements in a particular list, then parsing of that list fails and consequently the whole strtk::parse call fails as well.
Extending The Date-Time Parser Example
Building upon the previous datetime example, we are presented with a tuple of data that represents an astronomical event. The event defines a name, a location and a series of date-times in UTC at which the event was observed. In order to construct the necessary sink(s) that will be used for parsing the required type into a container, the macro strtk_register_userdef_type_sink is invoked with the specified type. The following is a definition of the struct one might construct:
struct event_information
{
std::size_t id;
std::string name;
std::string location;
std::deque<datetime> observation_datetimes;
};
strtk_register_userdef_type_sink(datetime)
Bringing the above together with a call to strtk::parse results in the following code, which parses the event data tuple into the allocated event_information instance.
int main()
{
std::string event_data = "172493|Lunar Impact|Mare Tranquillitatis|"
"2010-01-19 00:28:45.357,2010-02-18 00:57:07.109,"
"2010-03-20 01:15:11.261,2010-04-21 01:07:27.972";
strtk::deque_sink<datetime>::type deq_sink(",");
event_information evt_info;
strtk::parse(event_data,"|",evt_info.id,
evt_info.name,
evt_info.location,
deq_sink(evt_info.observation_datetimes));
return 0;
}
Ignoring Tokens During Parsing
There may be scenarios where, given a delimited tuple of data, one or more of the tokens need to be ignored or skipped. StrTk provides a mechanism called strtk::ignore_token that allows the parser to consume specific tokens easily without affecting overall performance. Below is an example of how strtk::ignore_token can be used in conjunction with strtk::parse to skip the 2nd and 4th tokens in the tuple:
int main()
{
static const std::string data = "+123,ignore0,123.456,ignore1,abcdef,ignore2";
int i;
double d;
std::string s;
strtk::ignore_token ignore;
strtk::parse(data,",",i,ignore,d,ignore,s);
std::cout << "i=" << i << " d=" << d << " s=" << s << std::endl;
return 0;
}
Simple 3D Mesh File Format Parser
Let's assume there is a file format for expressing a mesh. The format consists of a section that defines the unique vertices in the mesh, and another section that defines the triangles in the mesh as tuples of three indices indicating the vertices used. Each section is preceded by an integer that denotes the number of subsequent elements in that section. An example of such a file is the following:
5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4
Code to parse such a file format is as follows:
struct point
{
double x,y,z;
};
struct triangle
{
std::size_t i0,i1,i2;
};
int main()
{
std::string mesh_file = "mesh.txt";
std::ifstream stream(mesh_file.c_str());
std::string s;
// Process points section
std::deque<point> points;
point p;
std::size_t point_count = 0;
strtk::parse_line(stream," ",point_count);
strtk::for_each_line_n(stream,
point_count,
[&points,&p](const std::string& line)
{
if (strtk::parse(line,",",p.x,p.y,p.z))
points.push_back(p);
});
// Process triangles section
std::deque<triangle> triangles;
triangle t;
std::size_t triangle_count = 0;
strtk::parse_line(stream," ",triangle_count);
strtk::for_each_line_n(stream,
triangle_count,
[&triangles,&t](const std::string& line)
{
if (strtk::parse(line,",",t.i0,t.i1,t.i2))
triangles.push_back(t);
});
return 0;
}
Pick A Random Line From A Text File
A random line of text is to be selected from a user-provided text file comprised of lines of varying length, such that the probability of any line being selected is 1/N, where N is the number of lines in the text file. There are many trivial solutions to this problem. However, if one further constrains the problem by indicating that the file is very large (TBs) and that the system the solution will run on has very limited memory, most if not all trivial solutions, such as storing indexes of all line offsets or reading the entire file into memory, are eliminated.
However, there exists a very elegant solution to this problem with O(n) time and O(1) space complexity. It involves scanning the entire file once, line by line, and at every ith line choosing whether or not to replace the final result line with the current line, by sampling a uniform distribution over the range [0,1) and performing the replacement if and only if the random value is less than 1/i.
The logic behind this solution revolves around the fact that the probability of selecting the ith line is 1/i, and as such the total probability of selecting any of the previous i-1 lines is 1 - (1/i), or (i-1)/i. Because there are i-1 lines before the ith line, we divide the previous sum of probabilities by i-1, resulting in a selection probability of 1/i for any one of the lines up to and including the ith line. If the ith line is the last line of the text file, each of the lines ends up with a selection probability of 1/N. Simple and sweet, and so too is the following implementation of said solution:
int main(int argc, char* argv[])
{
std::string file_name = argv[1];
std::string line;
std::size_t i = 0;
strtk::uniform_real_rng rng(static_cast<std::size_t>(::time(0)));
strtk::for_each_line(file_name,
[&i,&line,&rng](const std::string& s)
{ if (rng() < (1.0 / ++i)) line = s; }
);
std::cout << line << std::endl;
return 0;
}
Note: TAOCP Volume II, section 3.4.2, has an in-depth discussion of this problem and other similar problems relating to random distributions. One should also note that the above example has an inefficiency: upon every replacement, an actual string is copied. Normally all one needs to do is maintain a file offset to the beginning of the selected line; not doing so causes slow-downs due to continuous memory allocations and reallocations, made all the worse when large lines are encountered.
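The offset-based improvement suggested in the note can be sketched as follows, using only the standard library (pick_random_line_by_offset is a hypothetical helper for illustration; StrTk itself is not used here):

```cpp
#include <istream>
#include <random>
#include <sstream>
#include <string>

// Record only the stream offset of the currently chosen line and re-read
// it once at the end, avoiding a string copy on every replacement.
std::string pick_random_line_by_offset(std::istream& stream,
                                       std::mt19937& engine)
{
   std::uniform_real_distribution<double> dist(0.0, 1.0);
   std::streampos chosen  = 0;
   std::streampos current = stream.tellg();
   std::string line;
   std::size_t i = 0;
   while (std::getline(stream, line))
   {
      if (dist(engine) < (1.0 / ++i))
         chosen = current;          // remember where the line begins
      current = stream.tellg();
   }
   stream.clear();                  // clear EOF state so we can seek back
   stream.seekg(chosen);
   std::getline(stream, line);      // single copy, made once at the end
   return line;
}
```

The trade-off is one extra seek and read at the end in exchange for eliminating a potentially large copy per replacement.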
Token Grid
StrTk provides a means to easily parse and consume 2D grids of tokens in an efficient and simple manner. A grid is simply defined as a series of rows comprised of tokens, otherwise known as Delimiter Separated Values (DSV). The ith token of a row is grouped with the ith token of every other row to produce a column. Tokens can be processed as either rows or columns.
An example of a simple token grid, where each numeric value is deemed to be a token:
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
A token grid can be passed in either via a file or a string buffer. The delimiters to be used for parsing the columns and rows can also be provided; if not, common defaults will be used.
The following demonstrates how each cell in the grid can be accessed and cast to a specific type.
std::string data = "1,2,3,4,5,6\n"
"7,8,9,0,1,2\n"
"3,4,5,6,7,8\n"
"9,0,1,2,3,4\n"
"5,6,7,8,9,0\n";
strtk::token_grid grid(data,data.size(),",");
for (std::size_t r = 0; r < grid.row_count(); ++r)
{
strtk::token_grid::row_type row = grid.row(r);
for (std::size_t c = 0; c < row.size(); ++c)
{
std::cout << grid.get<int>(r,c) << "\t";
}
std::cout << std::endl;
}
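For comparison, the row/column grouping that token_grid performs can be approximated, copies and all, in plain standard C++ (make_grid is a hypothetical helper for illustration only; the real token_grid works over ranges without copying):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a buffer into rows on '\n' and each row into tokens on ',',
// producing a grid addressable as grid[row][column].
std::vector<std::vector<std::string>> make_grid(const std::string& buffer)
{
   std::vector<std::vector<std::string>> grid;
   std::istringstream rows(buffer);
   std::string row;
   while (std::getline(rows, row))
   {
      std::vector<std::string> tokens;
      std::istringstream cells(row);
      std::string cell;
      while (std::getline(cells, cell, ','))
         tokens.push_back(cell);
      grid.push_back(tokens);
   }
   return grid;
}
```

Column c is then simply the collection of grid[r][c] over all rows r, which is exactly the grouping token_grid exposes directly.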
The following example demonstrates how averages over rows and columns of a token grid can be computed:
std::string data = "1.1,1.1,1.1,1.1,1.1,1.1\n"
"2.2,2.2,2.2,2.2,2.2,2.2\n"
"3.3,3.3,3.3,3.3,3.3,3.3\n"
"4.4,4.4,4.4,4.4,4.4,4.4\n"
"5.5,5.5,5.5,5.5,5.5,5.5\n"
"6.6,6.6,6.6,6.6,6.6,6.6\n";
strtk::token_grid grid(data,data.size(),",");
std::vector<double> avg_c(grid.row(0).size(),0.0);
std::vector<double> avg_r(grid.row_count(),0.0);
std::vector<double> tmp(grid.row(0).size(),0.0);
std::fill(avg_c.begin(),avg_c.end(),0.0);
for (std::size_t i = 0; i < grid.row_count(); ++i)
{
grid.row(i).parse<double>(tmp.begin());
std::transform(avg_c.begin(),avg_c.end(),tmp.begin(),avg_c.begin(),std::plus<double>());
avg_r[i] = std::accumulate(tmp.begin(),tmp.end(),0.0) / tmp.size();
}
for (std::size_t i = 0; i < avg_c.size(); avg_c[i++] /= grid.row_count());
std::cout << "Column Averages:\t";
std::copy(avg_c.begin(),avg_c.end(),std::ostream_iterator<double>(std::cout,"\t"));
std::cout << "\n";
std::cout << "Row Averages:\t";
std::copy(avg_r.begin(),avg_r.end(),std::ostream_iterator<double>(std::cout,"\t"));
std::cout << "\n";
Processing Of Comma Separated Values Data
The original intent of the token grid was to support fast and efficient processing of simple data tuples, such as comma separated values (CSV) formats et al. The following example demonstrates a simple summation of traded floor volume and average daily volume based on NASDAQ OHLC (Open-High-Low-Close) data.
//Date,Symbol,Open,Close,High,Low,Volume
std::string market_data = "20090701,GOOG,424.2000,418.9900,426.4000,418.1500,2310768\n"
"20090701,MSFT,24.0500,24.0400,24.3000,23.9600,54915127\n"
"20090702,GOOG,415.4100,408.4900,415.4100,406.8100,2517630\n"
"20090702,MSFT,23.7600,23.3700,24.0400,23.2100,65427699\n"
"20090703,GOOG,408.4900,408.4900,408.4900,408.4900,0\n"
"20090703,MSFT,23.3700,23.3700,23.3700,23.3700,0\n"
"20090706,GOOG,406.5000,409.6100,410.6400,401.6600,2262557\n"
"20090706,MSFT,23.2100,23.2000,23.2800,22.8700,49207638\n"
"20090707,GOOG,408.2400,396.6300,409.1900,395.9801,3260307\n"
"20090707,MSFT,23.0800,22.5300,23.1400,22.4600,52842412\n"
"20090708,GOOG,400.0000,402.4900,406.0000,398.0600,3441854\n"
"20090708,MSFT,22.3100,22.5600,22.6900,22.0000,73023306\n"
"20090709,GOOG,406.1200,410.3900,414.4500,405.8000,3275816\n"
"20090709,MSFT,22.6500,22.4400,22.8100,22.3700,46981174\n"
"20090710,GOOG,409.5700,414.4000,417.3700,408.7000,2929559\n"
"20090710,MSFT,22.1900,22.3900,22.5400,22.1500,43238698\n";
strtk::token_grid grid(market_data,market_data.size(),",");
struct stock_info
{
stock_info(const std::string& s = " ")
: symbol(s),
total_volume(0),
day_count(0),
average_daily_volume(0.0)
{}
std::string symbol;
unsigned long long total_volume;
std::size_t day_count;
double average_daily_volume;
};
stock_info goog("GOOG");
stock_info msft("MSFT");
static const std::size_t volume_column = 6;
static const std::size_t symbol_column = 1;
goog.day_count = grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
static const std::string google_symbol("GOOG");
return google_symbol == row.get<std::string>(symbol_column);
},
goog.total_volume);
msft.day_count = grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
static const std::string microsoft_symbol("MSFT");
return microsoft_symbol == row.get<std::string>(symbol_column);
},
msft.total_volume);
goog.average_daily_volume = (1.0 * goog.total_volume) / goog.day_count;
msft.average_daily_volume = (1.0 * msft.total_volume) / msft.day_count;
std::cout << "[GOOG] Total Volume: " << goog.total_volume << std::endl;
std::cout << "[MSFT] Total Volume: " << msft.total_volume << std::endl;
std::cout << "[GOOG] ADV: " << goog.average_daily_volume << std::endl;
std::cout << "[MSFT] ADV: " << msft.average_daily_volume << std::endl;
The strtk::token_grid is thread-safe so long as only read operations are in play. As such the above calls to accumulate_column et al. can all be safely and easily executed concurrently using threads, allowing for a far more efficient data processing methodology.
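As an illustration of the read-only concurrency idea, the following standard C++ sketch (sum_column and concurrent_sums are hypothetical helpers, not StrTk APIs) runs two column aggregations over shared data in parallel; since neither thread mutates the grid, no locking is needed:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Read-only aggregation over one column of a 2D grid.
long long sum_column(const std::vector<std::vector<int>>& grid,
                     std::size_t column)
{
   long long total = 0;
   for (const auto& row : grid)
      total += row[column];
   return total;
}

// Run two independent column sums concurrently over the same data.
void concurrent_sums(const std::vector<std::vector<int>>& grid,
                     long long& sum0, long long& sum1)
{
   std::thread t0([&] { sum0 = sum_column(grid, 0); });
   std::thread t1([&] { sum1 = sum_column(grid, 1); });
   t0.join();
   t1.join();
}
```

The same pattern applies to the GOOG/MSFT accumulations above: each accumulate_column call only reads the grid, so the two can run on separate threads.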
TIMTOWTDI
Playing devil's advocate: assuming only the specific values for computing the ADV are required and no further processing of the CSV data is needed, another way of performing the above task is to solve the problem efficiently in a single pass over the data, as follows:
std::string file_name = "market_data.csv";
std::unordered_map<std::string,stock_info> stock_map;
stock_map.insert(std::make_pair("GOOG", stock_info("GOOG")));
stock_map.insert(std::make_pair("MSFT", stock_info("MSFT")));
strtk::for_each_line(file_name,
[&stock_map](const std::string& line)
{
strtk::ignore_token ignore;
stock_info temp;
const bool result = strtk::parse(line,
",",
ignore,
temp.symbol,
ignore,
ignore,
ignore,
ignore,
temp.total_volume);
if (!result) return;
auto iter = stock_map.find(temp.symbol);
if (stock_map.end() == iter) return;
(*iter).second.total_volume += temp.total_volume;
(*iter).second.day_count++;
});
auto itr = stock_map.begin();
auto end = stock_map.end();
while (end != itr)
{
stock_info& stock = (*itr++).second;
stock.average_daily_volume = (1.0 * stock.total_volume) / stock.day_count;
}
Sequential Partitions
A typical operation carried out upon time-series data is to group tuples into buckets (or bins) based upon the time index value. For example grouping data into 2-minute buckets and then performing some kind of operation upon the grouped tuples such as a summation or an average etc. This process is sometimes also called: "discretization"
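Before looking at StrTk's support for this, the bucketing operation itself can be sketched in plain standard C++ (bucket_sums is a hypothetical helper written for illustration):

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Group (time, value) tuples into fixed-width time buckets and sum each
// bucket: a new bucket starts whenever a tick's time reaches the next
// bucket boundary.
std::deque<double> bucket_sums(
   const std::vector<std::pair<std::size_t,double>>& ticks,
   std::size_t period)
{
   std::deque<double> sums;
   std::size_t next_time = 0;
   for (const auto& tick : ticks)
   {
      if (tick.first >= next_time)   // transition to a new bucket
      {
         next_time = tick.first + period;
         sums.push_back(0.0);
      }
      sums.back() += tick.second;    // accumulate into current bucket
   }
   return sums;
}
```

The transition test and the per-bucket accumulation are the two roles that sequential_partition factors out into a Transition Predicate and a Function.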
The strtk::token_grid class provides a method known as sequential_partition. The sequential_partition method requires a Transition Predicate, a Function and optionally a row-range. The Transition Predicate consumes a row and returns true only if the row is in the next partition. All subsequent consecutive rows, until the transition predicate next returns true, are said to be in the current partition. Prior to transitioning to a new partition, the function predicate is invoked and provided with the range of rows in the current partition.
The following example takes a simple time-series (time and value), partitions the tuples into Time-Buckets of period length 3, and then proceeds to compute the total sum of each group. The summarizer class below provides both the Transition Predicate and the Function.
class summarizer
{
public:
enum column_index
{
tick_time_column = 0,
tick_value_column = 1
};
summarizer(std::deque<double>& sum_value)
: next_tick_time_(0),
sum_value_(sum_value)
{}
// Transition Predicate
bool operator()(const strtk::token_grid::row_type& row)
{
if (row.get<std::size_t>(tick_time_column) >= next_tick_time_)
{
next_tick_time_ = row.get<std::size_t>(tick_time_column) + 3;
return true;
}
else
return false;
}
// Function
bool operator()(const strtk::token_grid& grid,
const strtk::token_grid::row_range_type& range)
{
double bucket_sum = 0.0;
if (!grid.accumulate_column(tick_value_column,range,bucket_sum))
{
std::cout << "failed to accumulate!" << std::endl;
return false;
}
else
sum_value_.push_back(bucket_sum);
return true;
}
private:
summarizer& operator=(const summarizer&);
std::size_t next_tick_time_;
std::deque<double>& sum_value_;
};
int main()
{
//time index, value
std::string data = "10000,123.456\n"
"10001,612.345\n"
"10002,561.234\n"
"10003,456.123\n"
"10004,345.612\n"
"10005,234.561\n"
"10006,123.456\n";
strtk::token_grid grid(data,data.size(),",");
std::deque<double> sum_value;
summarizer s(sum_value);
grid.sequential_partition(s,s);
for (std::size_t i = 0; i < sum_value.size(); ++i)
{
std::cout << "bucket[" << i << "] = " << sum_value[i] << std::endl;
}
return 0;
}
The expected output is as follows:
bucket[0] = 1297.035
bucket[1] = 1036.296
bucket[2] = 123.456
Parsing CSV Files With Embedded Double-Quotes
One of the simple extensions to the CSV format is the concept of double-quoted tokens. Such tokens may contain column or row delimiters; when this occurs, all delimiters encountered are treated as part of the token, until the corresponding closing double quote is reached. The StrTk token_grid supports the parsing of such tokens. This parsing mode can easily be activated via the token_grid option set. Below is an example of a token_grid loading a CSV data set representing various airports from around the world, along with their specific codes and locations, in which some of the cells are double-quoted:
ICAO | IATA | Airport | City | Country |
AYGA | GKA | "Goroka Gatue" | Goroka | Papua New Guinea |
BGCO | GCO | "Nerlerit Inaat Constable Pynt" | "Nerlerit Inaat" | Greenland |
BZGD | ZGD | Godley | Auckland | New Zealand |
CYQM | YQM | "Greater Moncton International" | Moncton | Canada |
EDRK | ZNV | "Koblenz Winningen" | Koblenz | Germany |
FAHU | AHU | "HMS Bastard Memorial" | Kwazulu-Natal | South Africa |
FQMP | MZB | "Mocimboa Da Praia" | "Mocimboa Da Praia" | Mozambique |
KINS | INS | "Indian Springs AF AUX" | Indian Springs | USA |
UHNN | HNN | Nikolaevsk | "Nikolaevsk Na Amure" | Russia |
WBKK | BKI | "Kota Kinabalu International" | Kota Kinabalu | Malaysia |
ZSJD | JDZ | "Jingdezhen Airport" | Jingdezhen | China |
The following is the StrTk code example using token_grid
to parse the above CSV data set:
int main()
{
std::string airport_data_file_name = "airport_data.csv";
strtk::token_grid::options options;
options.column_delimiters = "| ,";
options.support_dquotes = true;
strtk::token_grid airport_grid(airport_data_file_name,options);
// for each row r, for each column c, print cell[r,c]
for (std::size_t r = 0; r < airport_grid.row_count(); ++r)
{
strtk::token_grid::row_type row = airport_grid.row(r);
for (std::size_t c = 0; c < row.size(); ++c)
{
std::cout << "[" << row.get<std::string>(c) << "] ";
}
std::cout << std::endl;
}
return 0;
}
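The quote-handling rule itself can be sketched independently of StrTk. The following simplified splitter (split_dquoted, a hypothetical helper) handles a single-character delimiter and drops the quote characters themselves; the real token_grid option set handles multiple delimiters and row splitting as well:

```cpp
#include <string>
#include <vector>

// Split on a delimiter, but treat everything between double quotes as
// part of the current token (the quotes themselves are dropped).
std::vector<std::string> split_dquoted(const std::string& line,
                                       char delimiter)
{
   std::vector<std::string> tokens(1);   // start with one empty token
   bool in_quotes = false;
   for (char c : line)
   {
      if ('"' == c)
         in_quotes = !in_quotes;         // toggle quoted region
      else if ((c == delimiter) && !in_quotes)
         tokens.push_back(std::string()); // delimiter outside quotes
      else
         tokens.back() += c;             // ordinary character
   }
   return tokens;
}
```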
Extending Delimiter Predicates
As previously mentioned, the concept of a delimiter-based predicate can lead to some very interesting solutions. A predicate, as defined so far (with the exception of the offset predicate), has been a stateless entity. Adding the ability to maintain state based on what the predicate has encountered so far allows it to behave differently from the simple single and multiple delimiter predicates.
For this example, let's assume a typical command-line parsing problem consisting of double quotation mark groupings and escapable special characters, which can be considered dual-use as either delimiters or data. An example input and output is as follows:
Inputs
Data | abc;"123, mno xyz",i\,jk |
Delimiters | <space>;,. |
Output Tokens
Token0 | abc |
Token1 | 123, mno xyz |
Token2 | i\,jk |
In order to tokenize the above described string, one can create a composite predicate using a multiple char delimiter predicate and some simple state rules. The following is an example of such an extended predicate:
class extended_predicate
{
public:
extended_predicate(const std::string& delimiters)
: escape_(false),
in_bracket_range_(false),
mdp_(delimiters)
{}
inline bool operator()(const unsigned char c) const
{
if (escape_)
{
escape_ = false;
return false;
}
else if ('\\' == c)
{
escape_ = true;
return false;
}
else if ('"' == c)
{
in_bracket_range_ = !in_bracket_range_;
return true;
}
else if (in_bracket_range_)
return false;
else
return mdp_(c);
}
inline void reset()
{
escape_ = false;
in_bracket_range_ = false;
}
private:
mutable bool escape_;
mutable bool in_bracket_range_;
mutable strtk::multiple_char_delimiter_predicate mdp_;
};
Usage of the newly defined extended predicate is as follows:
int main()
{
std::string str = "abc;\"123, mno xyz\",i\\,jk";
strtk::std_string::token_list_type token_list;
strtk::split(extended_predicate(".,; "),
str,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters);
return 0;
}
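For readers who want to experiment without the library, the same state machine can be expressed as a self-contained standard C++ function (split_extended is a hypothetical helper mirroring the predicate's rules, with delimiter compression applied as in the strtk::split call above):

```cpp
#include <string>
#include <vector>

// '\\' escapes the next character, '"' toggles a region in which
// delimiters are ignored, and characters in `delims` split tokens
// outside quoted regions. Consecutive delimiters are compressed.
std::vector<std::string> split_extended(const std::string& input,
                                        const std::string& delims)
{
   std::vector<std::string> tokens;
   std::string current;
   bool escape = false;
   bool quoted = false;
   for (char c : input)
   {
      bool is_delim;
      if (escape)         { escape = false;   is_delim = false; }
      else if ('\\' == c) { escape = true;    is_delim = false; }
      else if ('"'  == c) { quoted = !quoted; is_delim = true;  }
      else if (quoted)    is_delim = false;
      else                is_delim = (delims.find(c) != std::string::npos);

      if (is_delim)
      {
         if (!current.empty())
         {
            tokens.push_back(current);
            current.clear();
         }
      }
      else
         current += c;   // escaped chars and the '\\' itself are kept
   }
   if (!current.empty())
      tokens.push_back(current);
   return tokens;
}
```

Run against the example input abc;"123, mno xyz",i\,jk it yields the three tokens shown in the output table above.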
Performance Comparisons
The following are tables of results generated by running the strtk_tokenizer_cmp test. Currently it covers simple comparisons between Boost String Algorithms, Boost lexical_cast, the Standard Library, Spirit (Karma, Qi) and StrTk in the following areas:
- Tokenization
- Splitting
- Integer To String
- String To Integer
- String To Double
Scenario 0 - MSVC 2010 (64-bit, O2, Ot, GL and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 8.5857sec | 2795359.4074tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.5019sec | 6853393.1186tks/sec | 40.7%, 245.1% |
Boost | Split | 9600000 | 5.5414sec | 1732414.5137tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 0.8218sec | 11681814.9167tks/sec | 14.8%, 674.3% |
sprintf | Integer To String | 80000000 | 35.8128sec | 2233840.0564nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 19.3994sec | 4123832.0477nums/sec | 54.1%, 184.6% |
Karma | Integer To String | 80000000 | 6.2528sec | 12794349.6524nums/sec | 17.4%, 572.7% |
StrTk | Integer To String | 80000000 | 1.5664sec | 51071439.9822nums/sec | 4.3%, 2286.2% |
atoi | String To Integer | 88500000 | 5.1802sec | 17084370.4936nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 119.6261sec | 739805.3712nums/sec | 2309.2%, 4.3% |
Qi | String To Integer | 88500000 | 2.1951sec | 40317238.6629nums/sec | 42.3%, 235.9% |
StrTk | String To Integer | 88500000 | 1.8181sec | 48677773.5466nums/sec | 35.0%, 284.9% |
atof | String To Double | 30650000 | 15.2306sec | 2012396.7122nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 52.9244sec | 579127.8866nums/sec | 347.4%, 28.7% |
Qi | String To Double | 30650000 | 2.8665sec | 10692313.5853nums/sec | 18.8%, 531.3% |
StrTk | String To Double | 30650000 | 1.6069sec | 19073975.7679nums/sec | 10.5%, 947.8% |
Scenario 1 - MSVC 2010 (O2, Ot, GL and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.4715sec | 2533910.4769tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 2.8889sec | 8307786.9292tks/sec | 30.5%, 327.8% |
Boost | Split | 9600000 | 7.2291sec | 1327965.9706tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.1301sec | 8494610.9664tks/sec | 15.6%, 639.6% |
sprintf | Integer To String | 80000000 | 38.2576sec | 2091088.8038nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 28.9931sec | 2759277.4769nums/sec | 75.7%, 131.9% |
Karma | Integer To String | 80000000 | 4.9173sec | 16269254.0190nums/sec | 12.8%, 778.0% |
StrTk | Integer To String | 80000000 | 1.8270sec | 43786838.0279nums/sec | 4.7%, 2093.9% |
atoi | String To Integer | 88500000 | 6.0076sec | 14731435.8942nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 185.4955sec | 477100.6474nums/sec | 3087.0%, 3.2% |
Qi | String To Integer | 88500000 | 2.5060sec | 35314785.8370nums/sec | 41.7%, 239.7% |
StrTk | String To Integer | 88500000 | 2.2095sec | 40054213.0736nums/sec | 36.7%, 271.8% |
atof | String To Double | 30650000 | 17.6435sec | 1737179.9302nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 78.6528sec | 389687.3997nums/sec | 445.7%, 22.4% |
Qi | String To Double | 30650000 | 3.8034sec | 8058494.1994nums/sec | 21.5%, 463.8% |
StrTk | String To Double | 30650000 | 2.0450sec | 14987780.2310nums/sec | 11.5%, 862.7% |
Scenario 2 - MSVC 2008 SP1 (O2, Ot, GL and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.6533sec | 2486184.8282tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.4748sec | 6906943.9529tks/sec | 35.9%, 277.8% |
Boost | Split | 9600000 | 10.2600sec | 935674.7490tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.3793sec | 6959830.0652tks/sec | 13.4%, 743.8% |
sprintf | Integer To String | 80000000 | 24.6427sec | 3246397.8287nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 27.5865sec | 2899968.5753nums/sec | 111.9%, 89.3% |
Karma | Integer To String | 80000000 | 5.4864sec | 14581630.6963nums/sec | 22.2%, 449.1% |
StrTk | Integer To String | 80000000 | 2.4224sec | 33025441.1256nums/sec | 9.8%, 1017.2% |
atoi | String To Integer | 88500000 | 5.9297sec | 14924814.8683nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 186.1372sec | 475455.6660nums/sec | 3139.0%, 3.1% |
Qi | String To Integer | 88500000 | 2.0874sec | 42397446.1804nums/sec | 35.2%, 284.0% |
StrTk | String To Integer | 88500000 | 2.0485sec | 43202160.1371nums/sec | 34.5%, 289.4% |
atof | String To Double | 30650000 | 18.0458sec | 1698455.0767nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 77.4527sec | 395725.4111nums/sec | 429.2%, 23.2% |
Qi | String To Double | 30650000 | 3.9631sec | 7733881.1294nums/sec | 21.9%, 455.3% |
StrTk | String To Double | 30650000 | 2.0723sec | 14790236.0804nums/sec | 11.4%, 870.8% |
Scenario 3 - Intel C++ v11.1.060 IA-32 (O2, Ot, Qipo, QxHost and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 10.0096sec | 2397697.7836tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.1837sec | 7538416.8541tks/sec | 31.8%, 314.4% |
Boost | Split | 9600000 | 9.5450sec | 1005760.0310tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.4292sec | 6716893.1359tks/sec | 14.9%, 667.8% |
sprintf | Integer To String | 80000000 | 23.8979sec | 3347577.5824nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 27.5618sec | 2902565.2045nums/sec | 115.3%, 86.7% |
Karma | Integer To String | 80000000 | 4.6600sec | 17167208.7654nums/sec | 19.4%, 512.8% |
StrTk | Integer To String | 80000000 | 2.8450sec | 28119857.2736nums/sec | 11.9%, 840.0% |
atoi | String To Integer | 88500000 | 5.9386sec | 14902610.8922nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 180.5856sec | 490072.4001nums/sec | 3040.8%, 3.2% |
Qi | String To Integer | 88500000 | 2.5273sec | 35017073.8639nums/sec | 42.5%, 234.9% |
StrTk | String To Integer | 88500000 | 1.8718sec | 47281492.1287nums/sec | 31.5%, 317.2% |
atof | String To Double | 30650000 | 18.4357sec | 1662538.0810nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 78.1543sec | 392172.9598nums/sec | 423.9%, 23.5% |
Qi | String To Double | 30650000 | 2.8321sec | 10822353.0510nums/sec | 15.3%, 650.9% |
StrTk | String To Double | 30650000 | 2.2930sec | 13366541.5515nums/sec | 12.4%, 803.9% |
Scenario 4 - GCC 4.5 (O3, PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.2510sec | 2594305.4347tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.9717sec | 6042688.5734tks/sec | 42.9%, 232.9% |
Boost | Split | 9600000 | 5.0640sec | 1895728.2331tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.5411sec | 6229231.8384tks/sec | 30.4%, 328.5% |
sprintf | Integer To String | 80000000 | 14.7807sec | 5412477.0993nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 19.1131sec | 4185620.7707nums/sec | 129.3%, 77.3% |
Karma | Integer To String | 80000000 | 6.4455sec | 12411808.2841nums/sec | 43.6%, 229.3% |
StrTk | Integer To String | 80000000 | 4.5174sec | 17709364.5349nums/sec | 30.5%, 327.1% |
atoi | String To Integer | 88500000 | 5.2139sec | 16973721.6103nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 50.5326sec | 1751344.8498nums/sec | 969.1%, 10.3% |
Qi | String To Integer | 88500000 | 1.9694sec | 44937612.8835nums/sec | 37.7%, 264.7% |
StrTk | String To Integer | 88500000 | 1.9008sec | 46558706.5833nums/sec | 36.4%, 274.2% |
atof | String To Double | 30650000 | 6.6975sec | 4576328.3036nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 29.6375sec | 1034162.2422nums/sec | 442.5%, 22.5% |
Qi | String To Double | 30650000 | 2.9852sec | 10267435.7138nums/sec | 44.5%, 224.3% |
StrTk | String To Double | 30650000 | 1.5961sec | 19202937.1409nums/sec | 23.8%, 419.6% |
Scenario 5 - GCC 4.5 (O3, PGO) Intel Atom N450
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 29.1370sec | 823695.4389tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 12.3607sec | 1941644.0499tks/sec | 42.4%, 235.7% |
Boost | Split | 9600000 | 16.5261sec | 580899.9726tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 4.9102sec | 1955110.2611tks/sec | 29.7%, 336.5% |
sprintf | Integer To String | 80000000 | 50.3456sec | 1589015.6118nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 91.1475sec | 877698.1401nums/sec | 181.0%, 55.2% |
Karma | Integer To String | 80000000 | 21.8904sec | 3654568.8712nums/sec | 43.4%, 229.9% |
StrTk | Integer To String | 80000000 | 12.1877sec | 6564009.9274nums/sec | 24.2%, 413.0% |
atoi | String To Integer | 88500000 | 17.6615sec | 5010896.5768nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 191.9446sec | 461070.5357nums/sec | 1086.7%, 9.2% |
Qi | String To Integer | 88500000 | 6.2808sec | 14090561.7119nums/sec | 35.5%, 281.1% |
StrTk | String To Integer | 88500000 | 6.1552sec | 14378086.8208nums/sec | 34.8%, 286.9% |
atof | String To Double | 30650000 | 21.4865sec | 1426474.1027nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 139.8166sec | 219215.7409nums/sec | 650.7%, 15.3% |
Qi | String To Double | 30650000 | 11.3916sec | 2690567.9223nums/sec | 53.0%, 188.6% |
StrTk | String To Double | 30650000 | 6.4396sec | 4759608.7027nums/sec | 29.9%, 333.6% |
Scenario 6 - GCC 4.5 (O3, PGO) Intel Xeon
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 7.5657sec | 3172216.8787tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 2.7379sec | 8765832.8290tks/sec | 36.1%, 276.3% |
Boost | Split | 9600000 | 3.0706sec | 3126386.1126tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.1279sec | 8511136.2899tks/sec | 36.7%, 272.2% |
sprintf | Integer To String | 80000000 | 10.9012sec | 7338642.9638nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 12.3317sec | 6487328.7872nums/sec | 113.1%, 88.3% |
Karma | Integer To String | 80000000 | 3.7202sec | 21504260.6660nums/sec | 34.1%, 293.0% |
StrTk | Integer To String | 80000000 | 2.5183sec | 31768042.4612nums/sec | 23.1%, 432.8% |
atoi | String To Integer | 88500000 | 4.0087sec | 22077037.6357nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 30.3659sec | 2914454.4393nums/sec | 757.4%, 13.2% |
Qi | String To Integer | 88500000 | 1.7976sec | 49231871.5454nums/sec | 43.3%, 223.0% |
StrTk | String To Integer | 88500000 | 1.7384sec | 50908881.7303nums/sec | 43.3%, 230.5% |
atof | String To Double | 30650000 | 5.2118sec | 5880843.9328nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 21.5546sec | 1421966.9538nums/sec | 413.5%, 24.1% |
Qi | String To Double | 30650000 | 3.2149sec | 9533840.3118nums/sec | 61.6%, 162.1% |
StrTk | String To Double | 30650000 | 1.3929sec | 22003661.2944nums/sec | 26.7%, 374.1% |
Note 1: The tests were compiled with optimisation flags chosen to produce the best possible results for the respective compilers and architectures. The tests were run natively (no virtualization) on an almost completely idle machine, so as to reduce interference from background processes. The Boost version used was 1.45. For the Linux-based tests, the standard libraries (including libc) were rebuilt using architecture-specific flags and optimisations. The following table maps each scenario to its architecture:
Scenario | Architecture |
0 | ThinkPad W510 (64-Bit Intel Quad Core Extreme i7-920XM 2.0GHz, 16GB RAM, Windows 7) |
1-3 | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Windows 7) |
4 | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Ubuntu 10.10) |
5 | Acer Aspire One (32-Bit Intel Atom N450 1.6GHz, 1GB RAM, Ubuntu 10.10) |
6 | HP Proliant DL380G6 (64-Bit Intel Xeon E5540 2.5GHz, 8GB RAM, Ubuntu 10.10) |
Note 2: The percentages in the final column express the current row as a percentage of the baseline, in total running time and rate respectively. For the first percentage lower is better; for the second percentage higher is better. The baseline used for a given combination of tests is defined in the following table:
Test Combination | Baseline |
Boost, StrTk | Boost |
Boost, StdLib/STL, Spirit, StrTk | StdLib/STL |
StdLib/STL, Spirit, StrTk | StdLib/STL |
Note 3: The test sizes are set such that no single run takes less than one second. This ensures that the runs-per-second results are measured rather than extrapolated. These sizes may need to be revisited once 3.5+GHz CPU speeds become more commonplace. Furthermore the charts represent the rate of operations over a one second interval - in short, the higher the rate the better.
Note 4: The binaries used for the above performance tests can be downloaded from here.
Note 5: It would be great to have comparisons for other architectures. If you can provide access to shell accounts with GCC 4.5+ or Clang/LLVM 2.0+ for the following architectures: UltraSPARC T2 Plus, SPARC64 VII, POWER6/7, please feel free to get in contact.
StrTk Library Dependency
Originally StrTk was compatible with C++ compilers ranging from GCC v2.95 to Visual Studio v6.0; however, due to upgrades it now requires at least GCC v4.4, Intel C++ Compiler v10 or Visual Studio v9.0 to compile cleanly. StrTk also makes use of the Boost library, for its boost::lexical_cast routine when converting types other than PODs, and for its TR1-compliant Random and Regex libraries. These dependencies are not crucial and can easily be removed, simply by defining the preprocessor macro strtk_no_tr1_or_boost. That said, Boost is an integral part of modern C++ programming, and having it around is as beneficial as having access to the STL, hence it is recommended that it be installed. For Visual Studio users, BoostPro provides a free and easy-to-use installer for the latest Boost libraries. For Linux users, mainstream distributions such as Ubuntu and Red Hat (Fedora) provide easy installation of the Boost libraries via their respective package management systems. For more information please consult the readme.txt found in the StrTk distribution.
Compiler Support
The following is a listing of the compilers with which StrTk can be built error and warning free.
- GCC - versions 4.4+
- Intel C++ Compiler - versions 10+
- Clang/LLVM - versions 1.0+
- MSVC - versions 9.0+
- Comeau C/C++ - versions 4.3+
Note: Versions of compilers prior to those denoted above "should" compile, however they may require a very lenient warning/error level to be set during compilation.
Conclusion
StrTk was designed with performance and efficiency as its primary principles, and as such some of the available interfaces may not be as user-friendly as they could be - that said, the gains made in other areas will hopefully compensate for any perceived difficulties. As with most things, there is a trade-off between performance and usability in the tokenizers and parsing methods discussed above. The original aim was to provide an interface similar to that of the Boost Tokenizer and Split routines, but also to offer the developer simplifications that will hopefully provide more flexibility and efficiency in the long run. Tokenizing a string isn't the most fascinating problem one could tackle, but it does have its interesting points when one has a few TB of data to process - doing it properly could mean the difference between finishing a simple data processing job today or next month.