C++ String Toolkit (StrTk) Tokenizer
Introduction
This article presents the tokenizing and splitting functionality of a simple C++ library called the String Toolkit. Tokenization, in the context of string processing, is the method by which a sequence of elements is broken up, or fragmented, into sub-sequences called tokens. The indices in the original sequence that determine such breaks are known as delimiters.
There are two types of delimiters: normal or "thin" delimiters, which are one element in length, and "thick" delimiters, which are two or more elements in length. Even though tokenization is primarily used in conjunction with strings, any sequence of types that can be iterated in a linear fashion can be tokenized; examples include a list of integers, a vector of person classes or a map of strings.
Another Tokenizer?
To date there have been many attempts to define a "standard" tokenizer implementation in C++. Of them all, the most likely candidate is the implementation in the Boost library. Regardless, proposed implementations should to some extent consider one or more of the following areas: overall usage patterns, construction, generality (or is it genericity these days?) of design, and performance/efficiency.
1. Over-all usage patterns
This requirement is concerned with how easy it is to instantiate the tokenizer and integrate it into existing processing patterns, which more often than not requires integration with C++ STL algorithms and containers. A tokenizer by definition would be classed as a producer, so the question becomes: how easy is it for others to consume its output? Another issue is consistency in the definition of a token in the situation where one has consecutive delimiters but is not compressing them - can or should there be such a thing as an empty token? What do preceding and trailing delimiters mean? And when should they be included as part of the token?
2. Construction
In the context of string tokenizing, the majority of implementations return each token as a new instance of a string. This process requires a string to be created on the heap, populated with the sub-string in question from the original sequence, then returned to the caller (some of this may be alleviated by Return Value Optimization (RVO)). In the case of iterators this is essentially another copy for the caller. Furthermore, two kinds of input can make this situation worse: a large sequence made up of many very short tokens, or a large sequence made up of a number of very large tokens. The solution is not to return the token as a string but rather as a range, and to allow the caller to decide how they wish to inspect and further manipulate the token.
This minor change in interface design provides a great deal of flexibility and performance gain.
3. Generality (Genericity) of design
Most tokenizer implementations concern themselves only with strings, which for the most part is acceptable, because most things that need tokenizing are strings. However, there will be times when one has a sequence of types that needs to be tokenized but isn't a string, hence a tokenizer should be designed in such a way as to enable such uses; moreover, it becomes clear that the most basic of tokenizing algorithms are invariant to the type of the delimiter.
4. Performance and Efficiency
Tokenizing strings ranges from low-frequency inputs, such as user input or the parsing of simple configuration files, to more complex situations such as the tokenizing of HTML pages that a web crawler/indexer might perform, or the parsing of large market-data streams in FIX format. Performance is generally important and can usually be helped along by good usage patterns that encourage reuse of instances, minimal preprocessing and user-supplied predicates for the nastier areas of the process. It should be noted that everything in the following article can be done purely using the STL - that said, C++'s ability to let one skin the proverbial cat in numerous ways gives rise to novel solutions that are for the most part not of any practical use other than to showcase one's abilities in using the STL.
Getting started
The String Toolkit Library (StrTk) provides two common tokenization concepts, a split function and a token iterator. Both these concepts require the user to provide a delimiter predicate and an iterator range over which the process will be carried out.
The tokenization process can be further parametrized by switching between "compressed delimiters" and "no compressed delimiters" modes. This essentially means that consecutive delimiters will be compressed down to one and treated as such. Below are two tables depicting the expected tokens from various types of input; the tables represent the non-compressed and compressed tokenization processes respectively. The delimiter in this instance is the pipe symbol | and <> denotes an empty token.
No Compressed Delimiters
Input | Token List |
a | a |
a|b | a,b |
a||b | a,<>,b |
|a | <>,a |
a| | a,<> |
|a||b | <>,a,<>,b |
||a||b|| | <>,<>,a,<>,b,<>,<> |
| | <>,<> |
|| | <>,<>,<> |
||| | <>,<>,<>,<> |
Compressed Delimiters
Input | Token List |
a | a |
a||b | a,b |
|a||b | <>,a,b |
||a||b|| | <>,a,b,<> |
| | <>,<> |
|| | <>,<> |
||| | <>,<> |
Delimiters
Two forms of delimiters are supported: the single delimiter predicate and the multiple delimiter predicate, otherwise known as SDP and MDP respectively. Essentially an SDP is where there is only one type that can break or fragment the sequence, whereas with an MDP there is more than one unique type that can break the sequence. It is possible to represent an SDP using an MDP; however, from a performance point of view having separate predicates is far more efficient. Additionally, for strings based on char or unsigned char (8-bit versions) there is an MDP that has a look-up complexity of O(1), making it greatly more efficient than the basic MDP.
Single Delimiter Predicate
Instantiation requires specialization of type and construction requires an instance of the type.
strtk::single_delimiter_predicate<typename T>(const T& t)
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
Multiple Delimiter Predicate
Instantiation requires specialization of type and construction requires a sequence of potential delimiters through a range described by iterators.
strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end);
std::string str_delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
strtk::multiple_delimiter_predicate<char> mdp1(str_delimiters.begin(),str_delimiters.end());
unsigned int uint_delimiters[5] = {1,10,101,1010,10101};
strtk::multiple_delimiter_predicate<unsigned int> mdp2(uint_delimiters,uint_delimiters + 5);
As previously mentioned tokenization of data need not be limited to strings comprised of chars, but can also be extended to other PODs or complex types. In the above example a predicate used for tokenizing a sequence of unsigned ints is being defined.
Multiple Char Delimiter Predicate
Instantiation requires an iterator range based on either unsigned char or char. This delimiter predicate is more efficient than the simple MDP as it has a predicate evaluation of O(1), due to the use of a lookup table, as opposed to O(n) where n is the number of delimiters. The performance advantage grows as more delimiters are used.
strtk::multiple_char_delimiter_predicate(const std::string&)
strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin,const std::string::const_iterator end)
strtk::multiple_char_delimiter_predicate predicate(" .;?");
The delimiter concept can be extended to the point where the predicate itself acts as a state machine, transitioning from state to state based on conditions and rules related to the current symbol being processed. This simple extension can lead to some very interesting parsing capabilities.
Split
This is a function that performs tokenization over an entire sequence
in one go. strtk::split
takes a sequence through
a range of iterators or in the case of a string through
a reference to its instance, a delimiter predicate and an output
iterator or otherwise known as a type sink. It populates the output iterator with the tokens it
extracts. The tokens in this case are std::pairs of iterators for
the sequence.
Split can be used in a "simple - no frills" manner by simply passing the necessary parameters:
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::split(" |.;?",str,std::back_inserter(token_list));
strtk::split can also be used in a more explicit manner whereby the exact type of delimiter predicate is specified by the user:
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::split(predicate,str,std::back_inserter(token_list));
strtk::split provides an additional usage option that allows the user to specify whether they would like to compress the delimiters and whether or not they would like to include the delimiter as part of the token range. This enum parameter is called strtk::split_options and has the following values:
Split Option | Definition |
strtk::split_options::default_mode | Default options |
strtk::split_options::compress_delimiters | Consecutive delimiters are treated as one |
strtk::split_options::include_1st_delimiter | The first delimiter is included in the resulting token range |
strtk::split_options::include_delimiters | All delimiters are included in the resulting token range |
The simple example below demonstrates a split over a string, given a predicate, where the provided split options indicate that consecutive delimiters will be treated as one and that all delimiters encountered after each token will be included in the token, up until the next token begins.
strtk::split(predicate,
str,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters | strtk::split_options::include_delimiters);
Another way of using split may be to use the strtk::multiple_char_delimiter_predicate as follows:
std::string str = "abc?123;xyz.789";
strtk::std_string::token_list_type token_list;
strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::split(predicate,str,std::back_inserter(token_list));
The contents of the token_list can be printed out as follows:
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << (*itr) << "\t";
++itr;
}
Split N-Tokens
A natural extension of strtk::split
is strtk::split_n
.
This function provides the ability to tokenize a sequence up until a specific
number of tokens have been encountered or until there are no more tokens left.
The return value of the strtk::split_n
would be the number of tokens
encountered.
std::string data = "token1?token2,token3;token4,token5";
strtk::std_string::token_list_type token_list;
const std::size_t token_count = 4;
const std::string delimiters (" ,.;?");
strtk::split_n(delimiters,
data,
token_count,
std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Offset Splitter
Another simple variant is the strtk::offset_splitter
.
This form of split takes a series of offsets and an iterator range or
string and determines the tokens based on the offsets. This function
can be set to perform a single pass of the offsets or to rotate
them until the range has been completely traversed. The example
below demonstrates how a string representing date and time can be
tokenized into its constituent parts (year, month, day, hour, minutes
,seconds,milliseconds)
std::string time_data = "20000101091011345"; //2000-01-01 9:10:11sec 345ms
const std::size_t offset_list_size = 7;
const int offset_list[offset_list_size] = {
4, // year
2, // month
2, // day
2, // hour
2, // minute
2, // second
3 // ms
};
const strtk::offset_predicate<offset_list_size> predicate(offset_list);
strtk::std_string::token_list_type token_list;
strtk::offset_splitter(time_data,predicate,std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "] ";
++itr;
}
std::cout << std::endl;
Split Regex
Another form of splitter is based on the concept of using regular expressions as the delimiter predicate. Below is a simple example of splitting tokens wrapped in round-brackets.
std::string str = "(12)(345)(6789)(0ijkx)(yz)";
std::list<std::string> token_list;
strtk::split_regex("\\((.*?)\\)",
str,
std::back_inserter(token_list),
strtk::regex_match_mode::match_1);
std::list<std::string>::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Note: The parameter regex_match_mode represents the capture of the marked sub-expression in the current match. By default it is strtk::regex_match_mode::match_all, which in this case would provide the entire match including the bounding pattern, e.g. Token3 would be (0ijkx). However, in the above example we are only interested in the sub-expression between the round brackets, hence strtk::regex_match_mode::match_1 is used, resulting in Token3 being 0ijkx.
The following examples demonstrate the use of the strtk::split_regex and strtk::split_regex_n routines in extracting specific types of data - in this case the POD types int and double.
int main()
{
{
// Extract ints from data string
std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
std::deque<int> int_list;
strtk::split_regex("([+-]?([\\d]+))",
data,
strtk::range_to_type_back_inserter(int_list),
strtk::regex_match_mode::match_1);
}
{
// Extract doubles from data string
std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
std::deque<double> double_list;
strtk::split_regex(strtk::ieee754_expression,
data,
strtk::range_to_type_back_inserter(double_list),
strtk::regex_match_mode::match_1);
}
{
// Extract the first 3 ints from data string
std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
std::deque<int> int_list;
strtk::split_regex_n("([+-]?([\\d]+))",
data,
3,
strtk::range_to_type_back_inserter(int_list),
strtk::regex_match_mode::match_1);
}
{
// Extract the first 4 doubles from data string
std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
std::deque<double> double_list;
strtk::split_regex_n(strtk::ieee754_expression,
data,
4,
strtk::range_to_type_back_inserter(double_list),
strtk::regex_match_mode::match_1);
}
return 0;
}
The following table describes the various regex patterns built into StrTk, which can be used seamlessly with the strtk::split_regex and strtk::split_regex_n routines.
Regex | Definition |
strtk::uri_expression | URI, URL address e.g.: http://www.example.com, domain.example.net/index.html |
strtk::email_expression | E-mail address e.g.: some.one@example.com |
strtk::ip_expression | IPv4 address e.g.: 192.168.0.1, 127.0.0.1 |
strtk::ieee754_expression | Floating point value e.g.: 1.1, 1.234e-123, -1.0001E+10, 0.1234 |
Tokenizer
The tokenizer is specialized on a sequence iterator and predicate. It is constructed with a range of iterators for the sequence and a reference to the desired predicate. Definitions exist for std::string tokenizers. The tokenizer provides an iteration pattern as a means for accessing the tokens in the sequence.
const unsigned int data_size = 12;
unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9};
strtk::single_delimiter_predicate<unsigned int> predicate(0);
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type;
tokenizer_type tokenizer(data,data + data_size,predicate);
Similar to strtk::split, strtk::tokenizer provides tokenizing options that are passed in during construction. Below is a table depicting said options:
Tokenize Option | Definition |
strtk::tokenize_options::default_mode | Default options |
strtk::tokenize_options::compress_delimiters | Consecutive delimiters are treated as one |
strtk::tokenize_options::include_1st_delimiter | The first delimiter is included in the resulting token range |
strtk::tokenize_options::include_delimiters | All delimiters are included in the resulting token range |
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type;
tokenizer_type tokenizer(data, data + data_size,
predicate,
strtk::tokenize_options::compress_delimiters |
strtk::tokenize_options::include_1st_delimiter);
Furthermore, iteration over the tokens of strtk::tokenizer is performed as follows:
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::copy((*itr).first,(*itr).second,std::ostream_iterator<unsigned int>(std::cout," "));
std::cout << std::endl;
++itr;
}
A typical std::string can be tokenized in the following manner:
std::string str = "abc|123|xyz|789";
strtk::std_string::tokenizer<>::type tokenizer(str,"|");
strtk::std_string::tokenizer<>::type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Another common situation may be tokenizing a sequence of strings, such as the following:
const std::string str_list[] = { "abc" , "delimiter" , "ijk" , "delimiter" ,
"mno" , "delimiter" , "rst" , "uvw" ,
"delimiter" , "xyz" };
const std::size_t str_list_size = sizeof(str_list) / sizeof(std::string);
strtk::range_adapter<std::string> range(str_list,str_list_size);
strtk::single_delimiter_predicate<std::string> predicate("delimiter");
typedef strtk::tokenizer<std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer_type;
tokenizer_type tokenizer(range.begin(),range.end(),predicate);
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::copy((*itr).first,(*itr).second,std::ostream_iterator<std::string>(std::cout," "));
std::cout << std::endl;
++itr;
}
The Parse Routine
Till now the mentioned routines worked specifically with tokens, or in
other words ranges of characters. The responsibility of managing the
tokens and converting the tokens to user specified types was done
manually via "range to type" oriented back inserters and converters.
This can be a bit cumbersome and as such StrTk provides a series of
helper routines called strtk::parse
. Parse takes an
std::string
representing a tuple of delimited values as
input data, a delimiter set, and a series of references to variables
that are to be populated with the values from the parsed tokens. The
following diagram demonstrates the flow of data, tokens and the
corresponding relationships and conversions between each of the
parameters.
Note: strtk::parse returns a boolean value of true upon successful parsing and false for all other results. Situations that cause strtk::parse to fail include:
- Insufficient number of tokens for the given number of variables
- Conversion failure from token range to variable type
- Empty or null token(s)
Some Simple Parse Examples
strtk::parse can take an arbitrary number of variable references. The code below demonstrates the basic usage of strtk::parse taking various numbers of parameters.
std::string data = "abcde,-1234|567.890#1.1f";
std::string delimiters = ",|#";
std::string var0;
int var1;
double var2;
float var3;
strtk::parse(data,delimiters,var0);
strtk::parse(data,delimiters,var0,var1);
strtk::parse(data,delimiters,var0,var1,var2);
strtk::parse(data,delimiters,var0,var1,var2,var3);
The following examples demonstrate parsing of PODs such as int and double into STL-compatible sequences (std::vector, std::deque, std::list, std::set, std::queue, std::stack and std::priority_queue).
// Insert into std::vector
std::string int_string = "1 +2 -3 4 +5 -6 7 +8 -9 10 +11 -12 13 +14 -15";
std::vector<int> int_vector;
strtk::parse(int_string," ",int_vector);
// Insert into std::deque
std::string double_string = "-123.456,789.012,-345.678,901.234,+567.890";
std::deque<double> double_deque;
strtk::parse(double_string,",",double_deque);
// Insert into std::list
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse(data_string,",",string_list);
// Insert into std::set
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse(set_string,"#|",string_set);
// Insert into std::queue
std::string queue_string = "value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse(queue_string,",|",string_queue);
// Insert into std::stack
std::string stack_string = "value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse(stack_string,",|",string_stack);
// Insert into std::priority_queue
std::string priority_queue_string = "value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse(priority_queue_string,",|#",string_priority_queue);
Similar to what is described above, the following demonstrates parsing of up to "N" elements into an STL-compatible sequence.
// Insert N elements into std::vector
std::string int_string = "100,-200,+300,400,-500,+600,700,-800,+900";
std::vector<int> int_vector;
strtk::parse_n(int_string,",",5,int_vector);
// Insert N elements into std::deque
std::string double_string = "123.456,+789.012,345.678,-901.234,567.890";
std::deque<double> double_deque;
strtk::parse_n(double_string,",",3,double_deque);
// Insert N elements into std::list
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse_n(data_string,",",6,string_list);
// Insert N elements into std::set
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse_n(set_string,"#|",6,string_set);
// Insert N elements into std::queue
std::string queue_string = "value0,value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse_n(queue_string,",|",4,string_queue);
// Insert N elements into std::stack
std::string stack_string = "value0|value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse_n(stack_string,",|",4,string_stack);
// Insert N elements into std::priority_queue
std::string priority_queue_string = "value0#value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse_n(priority_queue_string,",|#",4,string_priority_queue);
Note: The return value of the routine strtk::parse_n indicates how many elements were parsed and placed into the specified sequence.
A Practical Example
Let's assume you have been given an English text file to process, with the intention of extracting a lexicon from the file.
One solution would be to break the problem down into a line-by-line tokenization problem. In this case you would define a function object such as the following, which takes the container in which you plan on storing your tokens (words) and a predicate, and inserts the tokens as strings into your container.
template <typename Container, typename Predicate>
struct parse_line
{
public:
   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate)
   {}

   inline void operator() (const std::string& str)
   {
      strtk::split(predicate_,
                   str,
                   strtk::range_to_type_back_inserter(container_),
                   strtk::split_options::compress_delimiters);
   }

private:
   Container& container_;
   const Predicate& predicate_;
};
Putting the whole thing together - opening the file and reading it line by line, each time invoking parse_line - would be as follows:
template <typename Container>
void parse_text(const std::string& file_name, Container& c)
{
   static const std::string delimiters = " ,.;:<>'[]{}()_?/"
                                         "`~!@#$%^&*|-_\"=+\t\r\n"
                                         "0123456789";
   strtk::multiple_char_delimiter_predicate predicate(delimiters);
   strtk::for_each_line(file_name,
                        parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}

int main()
{
   std::string text_file_name = "text.txt";
   std::deque<std::string> word_list;
   parse_text(text_file_name,word_list);
   std::cout << "Token Count: " << word_list.size() << std::endl;
   return 0;
}
Now coming back to the original problem, that being the construction of a lexicon. In this case the set of "words" should only contain words of interest. For the sake of simplicity, let's define words of interest as being anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. This type of list is commonly known as a stop-word list. In this example the stop-word list definition will be as follows:
const std::string stop_word_list [] =
{
"as", "at", "but", "by", "for",
"in", "like", "next", "of", "on",
"opposite", "out", "past", "to",
"up", "via", ""
};
const std::size_t stop_word_list_size = sizeof(stop_word_list) / sizeof(std::string);
Some minor updates to the parse_line processor include using the filter_on_match predicate for determining if the currently processed token is a preposition, and also the invocation of the range_to_type back-inserter to convert the tokens from their range iterator representation to a type representation compatible with the user-defined container. For the new implementation to provide unique words of interest, the simplest change that can be made is to replace the deque used as the container for the word list with some kind of 1-to-1 associative container such as a set. The following is the improved version of the parse_line processor:
template <typename Container, typename Predicate>
struct parse_line
{
private:
   typedef strtk::range_to_string_back_inserter_iterator<Container> inserter_type;
   typedef strtk::filter_on_match<inserter_type> filter_type;

public:
   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate),
     tmp_(" "),
     tokenizer_(tmp_,predicate_,true),
     filter_(stop_word_list,stop_word_list + stop_word_list_size,
             inserter_type(container_),
             true,false)
   {}

   inline void operator() (const std::string& s)
   {
      const filter_type& filter = filter_;
      strtk::for_each_token(s,tokenizer_,filter);
   }

private:
   Container& container_;
   const Predicate& predicate_;
   std::string tmp_;
   typename strtk::std_string::tokenizer<Predicate>::type tokenizer_;
   filter_type filter_;
};
The above-described predicate can be greatly simplified by using binders and various lambda expressions.
Another Example
When performing serialization or deserialization of an object instance such as a class or struct, a simple approach one could take would be to convert each of the members into its string representation and, from those strings, construct a larger string, delimiting each member with a special character guaranteed not to exist in any of the string representations.
In this example we will assume that there exists a struct which represents the properties of a person, a person struct if you will:
struct person
{
unsigned int id;
std::string name;
unsigned int age;
double height;
float weight;
};
The process of populating a person struct would entail having an instance of a person and the necessary data string. The following is an example of how this would be done using the strtk::parse function.
Person Tuple Format
Token0 | Delimiter0 | Token1 | Delimiter1 | Token2 | Delimiter2 | Token3 | Delimiter3 | Token4 |
Unique ID (hex) | "|" | Name | "|" | Age | "|" | Height (m) | "|" | Weight (kg) |
std::string data = "0xFA37ED12|Rumpelstiltskin|397|1.31|58.7";
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink(p.id); // register id with the hex sink
strtk::parse(data,"|",hex_sink,p.name,p.age,p.height,p.weight);
Batch processing of a text file comprised of one person tuple per line is somewhat similar to the previous example. A predicate is defined that takes a container specialized on the person struct, and a delimiter predicate with which the strtk::parse function will be invoked. This predicate is then instantiated and, coupled with the text file name, is fed to the strtk::for_each_line processor.
template <typename Container>
struct parse_person_tuple
{
public:
   parse_person_tuple(Container& container)
   : container_(container),
     hex_sink(p_.id)
   {}

   inline void operator() (const std::string& s)
   {
      if (strtk::parse(s,"|",hex_sink,p_.name,p_.age,p_.height,p_.weight))
         container_.push_back(p_);
      else
         std::cerr << "Failed to parse: " << s << std::endl;
   }

private:
   Container& container_;
   person p_;
   strtk::hex_to_number_sink<unsigned int> hex_sink;
};
Bringing the above pieces together to process a file results in the following:
int main()
{
   std::string text_file_name = "person_records.txt";
   std::deque<person> person_list;
   typedef parse_person_tuple<std::deque<person> > predicate_type;
   strtk::for_each_line(text_file_name,predicate_type(person_list));
   return 0;
}
To make things easier one could adapt a struct (made up entirely of PODs) to the parser. This makes the usage syntax a little easier to follow. An example of this adaptation is as follows:
struct type
{
std::string s;
double d;
int i;
char c;
bool b;
};
strtk_parse_begin(type)
strtk_parse_type(s)
strtk_parse_type(d)
strtk_parse_type(i)
strtk_parse_type(c)
strtk_parse_type(b)
strtk_parse_end()
int main()
{
type t;
std::string s = "abcdefghijklmnop|123.456|987654321|A|1";
strtk::parse(s,"|",t);
return 0;
}
Another example, similar to the previous, would be parsing a text file of 3D coordinates into a sequence. This can be done easily and cleanly using lambdas and StrTk as follows:
struct point
{
double x,y,z;
};
int main()
{
std::string point_data = "point_data.txt";
std::deque<point> points;
point p;
strtk::for_each_line(point_data,
[&points,&p](const std::string& str)
{
if (strtk::parse(str,",",p.x,p.y,p.z))
points.push_back(p);
});
return 0;
}
Simple Date-Time Parser
Assuming the datetime struct defined below, and a string representation of a combined date and time in the form YYYY-MM-DD HH:MM:SS.MS, e.g. 2005-06-26 15:57:03.678:
struct datetime
{
unsigned int year;
unsigned int month;
unsigned int day;
unsigned int hour;
unsigned int minute;
unsigned int second;
unsigned int msecond;
};
The following assumes an input of date-time values separated by a pipe. To facilitate parsing of a date-time by the strtk::parse routine into an STL-compatible sequence, an implementation of string_to_type_converter_impl specific to the datetime type is required. The following demonstrates how such a routine can be generated and used within the strtk::parse context:
strtk_string_to_type_begin(datetime)
static const std::string delimiters ("-:. ");
return strtk::parse(begin, end, delimiters,
t.year, t.month, t.day,
t.hour, t.minute, t.second, t.msecond);
strtk_string_to_type_end()
A simple example of using strtk::parse in conjunction with the newly integrated datetime variable, mixed with variables of other types, is as follows:
int main()
{
const std::string data = "abc 123 xyz,2000-01-10 03:01:16.123|+98765.43210";
std::string var0;
datetime var1;
double var2;
strtk::parse(data,",|",var0,var1,var2);
return 0;
}
Bringing the above pieces together, the following parses a sequence of date-time strings delimited by the pipe "|" into a deque of type datetime.
int main()
{
static const std::string data = "2000-01-10 03:01:16.123|2001-02-22 05:12:24.234|"
"2002-03-13 07:23:32.345|2003-04-24 09:34:47.456|"
"2004-05-15 11:46:51.767|2005-06-26 15:57:03.678|"
"2006-07-17 17:45:31.561|2007-08-26 19:06:02.809|"
"2008-09-18 21:16:23.267|2009-10-26 23:12:03.798|"
"2010-11-23 13:47:11.963|2011-12-26 15:35:08.168";
std::deque<datetime> datetime_list;
strtk::parse(data,"|",datetime_list);
for (std::size_t i = 0; i < datetime_list.size(); ++i)
{
datetime& dt = datetime_list[i];
std::cout << dt.year << "-"
<< strtk::text::right_align(2,'0', dt.month) << "-"
<< strtk::text::right_align(2,'0', dt.day) << " "
<< strtk::text::right_align(2,'0', dt.hour) << ":"
<< strtk::text::right_align(2,'0', dt.minute) << ":"
<< strtk::text::right_align(2,'0', dt.second) << "."
<< strtk::text::right_align(3,'0',dt.msecond) << std::endl;
}
return 0;
}
Parsing Sub-Lists
So far the demonstrated capabilities of the strtk::parse function have been based on passing a series of parameters that are populated in a linear fashion as the parser processes the tokens it encounters. That said, some formats have their own sub-structures; a simple example would be a list of values (such as integers) that needs to be loaded into a deque or stack. StrTk provides a series of sink facilities that consume a range and an STL container, and which can be forwarded on to strtk::parse.
In the following example, the data string is comprised of three separate lists delimited by a pipe "|": an integer list, a string list and a double list. Each list is to be parsed into an STL container of the appropriate type, in this case a vector, a deque and a list respectively. StrTk provides the ability to instantiate a sink for the specific container type that is compatible with strtk::parse.
int main()
{
std::string data = "1,+2,-3,4|abc,ijk,rst,xyz|123.456,+234.567,-345.678,456.789,567.890";
//define containers
std::vector<int> int_vector;
std::deque<std::string> string_deque;
std::list<double> double_list;
std::set<int> int_set;
std::queue<std::string> string_queue;
std::stack<double> double_stack;
std::priority_queue<int> int_priority_queue;
//define sinks
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<std::string>::type deq_sink(",");
strtk::list_sink<double>::type lst_sink(",");
strtk::set_sink<int>::type set_sink(",");
strtk::queue_sink<std::string>::type que_sink(",");
strtk::stack_sink<double>::type stk_sink(",");
strtk::priority_queue_sink<int>::type prq_sink(",");
strtk::parse(data,"|",vec_sink( int_vector),
deq_sink(string_deque),
lst_sink( double_list));
strtk::parse(data,"|",set_sink( int_set),
que_sink( string_queue),
stk_sink( double_stack),
prq_sink(int_priority_queue));
return 0;
}
If only a certain number of elements in the list are required, for example only the first 3, then the element count on the sink can be set appropriately. The above example could be modified as follows:
int main()
{
std::string data = "1,+2,-3,4|string0|abc,ijk,rst,xyz|string1|123.456,+234.567,-345.678,456.789,567.890";
std::vector<int> int_vector;
std::deque<std::string> string_deque;
std::list<double> double_list;
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<std::string>::type deq_sink(",");
strtk::list_sink<double>::type lst_sink(",");
std::string string_0;
std::string string_1;
strtk::parse(data,"|",vec_sink( int_vector).count(2), // consume first 2 values
string_0,
deq_sink(string_deque).count(3), // consume first 3 values
string_1,
lst_sink( double_list).count(4)); // consume first 4 values
return 0;
}
Note: If there aren't enough elements in a particular list, then parsing of that list fails and consequently the whole strtk::parse call fails as well.
Extending The Date-Time Parser Example
Building upon the previous datetime example, we are presented with a tuple of data that represents an astronomical event. The event defines a name, a location and a series of date-times in UTC at which the event was observed. In order to construct the necessary sink(s) that will be used for parsing the required type into a container, the macro strtk_register_userdef_type_sink is invoked with the specified type. The following is a definition of the struct one might construct:
struct event_information
{
std::size_t id;
std::string name;
std::string location;
std::deque<datetime> observation_datetimes;
};
strtk_register_userdef_type_sink(datetime)
Bringing the above together with a call to strtk::parse results in the following code, which parses the event data tuple into the allocated event_information instance.
int main()
{
std::string event_data = "172493|Lunar Impact|Mare Tranquillitatis|"
"2010-01-19 00:28:45.357,2010-02-18 00:57:07.109,"
"2010-03-20 01:15:11.261,2010-04-21 01:07:27.972";
strtk::deque_sink<datetime>::type deq_sink(",");
event_information evt_info;
strtk::parse(event_data,"|",evt_info.id,
evt_info.name,
evt_info.location,
deq_sink(evt_info.observation_datetimes));
return 0;
}
Ignoring Tokens During Parsing
There may be scenarios where, given a delimited tuple of data, one or more of the tokens need to be ignored or skipped. StrTk provides a mechanism called strtk::ignore_token that allows the parser to consume specific tokens easily without affecting overall performance. Below is an example of how strtk::ignore_token can be used in conjunction with strtk::parse to skip the 2nd and 4th tokens in the tuple:
int main()
{
static const std::string data = "+123,ignore0,123.456,ignore1,abcdef,ignore2";
int i;
double d;
std::string s;
strtk::ignore_token ignore;
strtk::parse(data,",",i,ignore,d,ignore,s);
std::cout << "i=" << i << " d=" << d << " s=" << s << std::endl;
return 0;
}
Simple 3D Mesh File Format Parser
Let's assume there is a file format for expressing a mesh. The format consists of a section that defines the unique vertices in the mesh, and another section that defines the triangles in the mesh as tuples of three indices indicating the vertices used. Each section is preceded by an integer that denotes the number of subsequent elements in that section. An example of such a file is the following:
5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4
Code to parse such a file format is as follows:
struct point
{
double x,y,z;
};
struct triangle
{
std::size_t i0,i1,i2;
};
int main()
{
std::string mesh_file = "mesh.txt";
std::ifstream stream(mesh_file.c_str());
std::string s;
// Process points section
std::deque<point> points;
point p;
std::size_t point_count = 0;
strtk::parse_line(stream," ",point_count);
strtk::for_each_line_n(stream,
point_count,
[&points,&p](const std::string& line)
{
if (strtk::parse(line,",",p.x,p.y,p.z))
points.push_back(p);
});
// Process triangles section
std::deque<triangle> triangles;
triangle t;
std::size_t triangle_count = 0;
strtk::parse_line(stream," ",triangle_count);
strtk::for_each_line_n(stream,
triangle_count,
[&triangles,&t](const std::string& line)
{
if (strtk::parse(line,",",t.i0,t.i1,t.i2))
triangles.push_back(t);
});
return 0;
}
Pick A Random Line From A Text File
A random line of text is to be selected from a user-provided text file comprised of lines of varying length, such that the probability of any line being selected is 1/N, where N is the number of lines in the text file. There are many trivial solutions to this problem. However, if one further constrains the problem by indicating that the file is very large (TBs) and that the system the solution will run on has very limited memory, most if not all trivial solutions, such as storing indexes of all line offsets or reading the entire file into memory, are eliminated.
However, there exists a very elegant solution to this problem with O(n) time and O(1) space complexity. It involves scanning the entire file once, line by line, and at every ith line choosing whether or not to replace the final result line with the current line, by sampling a uniform distribution over the range [0,1) and performing the replacement if and only if the random value is less than 1/i.
The logic behind this solution revolves around the fact that the probability of selecting the ith line is 1/i, and as such the total probability of selecting any of the previous i-1 lines is 1 - (1/i), or (i-1)/i. Because there are i-1 lines before the ith line, we divide the previous sum of probabilities by i-1, resulting in a selection probability of 1/i for any one of the lines up to and including the ith line. If the ith line is the last line of the text file, each of the lines ends up with a selection probability of 1/N. Simple and sweet, and so too is the following implementation of said solution:
int main(int argc, char* argv[])
{
std::string file_name = argv[1];
std::string line;
std::size_t i = 0;
strtk::uniform_real_rng rng(static_cast<std::size_t>(::time(0)));
strtk::for_each_line(file_name,
[&i,&line,&rng](const std::string& s)
{ if (rng() < (1.0 / ++i)) line = s; }
);
std::cout << line << std::endl;
return 0;
}
Note: TAOCP Volume II, section 3.4.2, has an in-depth discussion of this problem and other similar problems relating to random distributions. One should also note that the above example has an inefficiency: upon every replacement, an actual string is copied. Normally all one needs to do is maintain a file offset to the beginning of the selected line; not doing so causes slow-downs due to continuous memory allocations and reallocations, made all the worse when large lines are encountered.
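The offset-based improvement suggested in the note can be sketched as follows, using only the standard library (pick_random_line_by_offset is a hypothetical helper for illustration; StrTk itself is not used here):

```cpp
#include <istream>
#include <random>
#include <sstream>
#include <string>

// Record only the stream offset of the currently chosen line and re-read
// it once at the end, avoiding a string copy on every replacement.
std::string pick_random_line_by_offset(std::istream& stream,
                                       std::mt19937& engine)
{
   std::uniform_real_distribution<double> dist(0.0, 1.0);
   std::streampos chosen  = 0;
   std::streampos current = stream.tellg();
   std::string line;
   std::size_t i = 0;
   while (std::getline(stream, line))
   {
      if (dist(engine) < (1.0 / ++i))
         chosen = current;          // remember where the line begins
      current = stream.tellg();
   }
   stream.clear();                  // clear EOF state so we can seek back
   stream.seekg(chosen);
   std::getline(stream, line);      // single copy, made once at the end
   return line;
}
```

The trade-off is one extra seek and read at the end in exchange for eliminating a potentially large copy per replacement.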
Token Grid
StrTk provides a means to easily parse and consume 2D grids of tokens in an efficient and simple manner. A grid is simply defined as a series of rows comprised of tokens, otherwise known as Delimiter Separated Values (DSV). The ith token of a row is grouped with the ith token of every other row to produce a column. Tokens can be processed as either rows or columns.
An example of a simple token grid, where each numeric value is deemed to be a token:
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
A token grid can be passed in either via a file or a string buffer. The delimiters to be used for parsing the columns and rows can also be provided; if not, common defaults will be used.
The following demonstrates how each cell in the grid can be accessed and cast to a specific type.
std::string data = "1,2,3,4,5,6\n"
"7,8,9,0,1,2\n"
"3,4,5,6,7,8\n"
"9,0,1,2,3,4\n"
"5,6,7,8,9,0\n";
strtk::token_grid grid(data,data.size(),",");
for (std::size_t r = 0; r < grid.row_count(); ++r)
{
strtk::token_grid::row_type row = grid.row(r);
for (std::size_t c = 0; c < row.size(); ++c)
{
std::cout << grid.get<int>(r,c) << "\t";
}
std::cout << std::endl;
}
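For comparison, the row/column grouping that token_grid performs can be approximated, copies and all, in plain standard C++ (make_grid is a hypothetical helper for illustration only; the real token_grid works over ranges without copying):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a buffer into rows on '\n' and each row into tokens on ',',
// producing a grid addressable as grid[row][column].
std::vector<std::vector<std::string>> make_grid(const std::string& buffer)
{
   std::vector<std::vector<std::string>> grid;
   std::istringstream rows(buffer);
   std::string row;
   while (std::getline(rows, row))
   {
      std::vector<std::string> tokens;
      std::istringstream cells(row);
      std::string cell;
      while (std::getline(cells, cell, ','))
         tokens.push_back(cell);
      grid.push_back(tokens);
   }
   return grid;
}
```

Column c is then simply the collection of grid[r][c] over all rows r, which is exactly the grouping token_grid exposes directly.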
The following example demonstrates how averages over rows and columns of a token grid can be computed:
std::string data = "1.1,1.1,1.1,1.1,1.1,1.1\n"
"2.2,2.2,2.2,2.2,2.2,2.2\n"
"3.3,3.3,3.3,3.3,3.3,3.3\n"
"4.4,4.4,4.4,4.4,4.4,4.4\n"
"5.5,5.5,5.5,5.5,5.5,5.5\n"
"6.6,6.6,6.6,6.6,6.6,6.6\n";
strtk::token_grid grid(data,data.size(),",");
std::vector<double> avg_c(grid.row(0).size(),0.0);
std::vector<double> avg_r(grid.row_count(),0.0);
std::vector<double> tmp(grid.row(0).size(),0.0);
std::fill(avg_c.begin(),avg_c.end(),0.0);
for (std::size_t i = 0; i < grid.row_count(); ++i)
{
grid.row(i).parse<double>(tmp.begin());
std::transform(avg_c.begin(),avg_c.end(),tmp.begin(),avg_c.begin(),std::plus<double>());
avg_r[i] = std::accumulate(tmp.begin(),tmp.end(),0.0) / tmp.size();
}
for (std::size_t i = 0; i < avg_c.size(); avg_c[i++] /= grid.row_count());
std::cout << "Column Averages:\t";
std::copy(avg_c.begin(),avg_c.end(),std::ostream_iterator<double>(std::cout,"\t"));
std::cout << "\n";
std::cout << "Row Averages:\t";
std::copy(avg_r.begin(),avg_r.end(),std::ostream_iterator<double>(std::cout,"\t"));
std::cout << "\n";
Processing Of Comma Separated Values Data
The original intent of the token grid was to support fast and efficient processing of simple data tuples, such as comma separated values (CSV) formats et al. The following example demonstrates a simple summation of traded floor volume and average daily volume based on NASDAQ OHLC (Open-High-Low-Close) data.
//Date,Symbol,Open,Close,High,Low,Volume
std::string market_data = "20090701,GOOG,424.2000,418.9900,426.4000,418.1500,2310768\n"
"20090701,MSFT,24.0500,24.0400,24.3000,23.9600,54915127\n"
"20090702,GOOG,415.4100,408.4900,415.4100,406.8100,2517630\n"
"20090702,MSFT,23.7600,23.3700,24.0400,23.2100,65427699\n"
"20090703,GOOG,408.4900,408.4900,408.4900,408.4900,0\n"
"20090703,MSFT,23.3700,23.3700,23.3700,23.3700,0\n"
"20090706,GOOG,406.5000,409.6100,410.6400,401.6600,2262557\n"
"20090706,MSFT,23.2100,23.2000,23.2800,22.8700,49207638\n"
"20090707,GOOG,408.2400,396.6300,409.1900,395.9801,3260307\n"
"20090707,MSFT,23.0800,22.5300,23.1400,22.4600,52842412\n"
"20090708,GOOG,400.0000,402.4900,406.0000,398.0600,3441854\n"
"20090708,MSFT,22.3100,22.5600,22.6900,22.0000,73023306\n"
"20090709,GOOG,406.1200,410.3900,414.4500,405.8000,3275816\n"
"20090709,MSFT,22.6500,22.4400,22.8100,22.3700,46981174\n"
"20090710,GOOG,409.5700,414.4000,417.3700,408.7000,2929559\n"
"20090710,MSFT,22.1900,22.3900,22.5400,22.1500,43238698\n";
strtk::token_grid grid(market_data,market_data.size(),",");
struct stock_info
{
stock_info(const std::string& s = " ")
: symbol(s),
total_volume(0),
day_count(0),
average_daily_volume(0.0)
{}
std::string symbol;
unsigned long long total_volume;
std::size_t day_count;
double average_daily_volume;
};
stock_info goog("GOOG");
stock_info msft("MSFT");
static const std::size_t volume_column = 6;
static const std::size_t symbol_column = 1;
goog.day_count = grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
static const std::string google_symbol("GOOG");
return google_symbol == row.get<std::string>(symbol_column);
},
goog.total_volume);
msft.day_count = grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
static const std::string microsoft_symbol("MSFT");
return microsoft_symbol == row.get<std::string>(symbol_column);
},
msft.total_volume);
goog.average_daily_volume = (1.0 * goog.total_volume) / goog.day_count;
msft.average_daily_volume = (1.0 * msft.total_volume) / msft.day_count;
std::cout << "[GOOG] Total Volume: " << goog.total_volume << std::endl;
std::cout << "[MSFT] Total Volume: " << msft.total_volume << std::endl;
std::cout << "[GOOG] ADV: " << goog.average_daily_volume << std::endl;
std::cout << "[MSFT] ADV: " << msft.average_daily_volume << std::endl;
The strtk::token_grid is thread-safe so long as only read operations are in play. As such the above calls to accumulate_column et al. can all be safely and easily executed concurrently using threads, allowing for a far more efficient data processing methodology.
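As an illustration of the read-only concurrency idea, the following standard C++ sketch (sum_column and concurrent_sums are hypothetical helpers, not StrTk APIs) runs two column aggregations over shared data in parallel; since neither thread mutates the grid, no locking is needed:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Read-only aggregation over one column of a 2D grid.
long long sum_column(const std::vector<std::vector<int>>& grid,
                     std::size_t column)
{
   long long total = 0;
   for (const auto& row : grid)
      total += row[column];
   return total;
}

// Run two independent column sums concurrently over the same data.
void concurrent_sums(const std::vector<std::vector<int>>& grid,
                     long long& sum0, long long& sum1)
{
   std::thread t0([&] { sum0 = sum_column(grid, 0); });
   std::thread t1([&] { sum1 = sum_column(grid, 1); });
   t0.join();
   t1.join();
}
```

The same pattern applies to the GOOG/MSFT accumulations above: each accumulate_column call only reads the grid, so the two can run on separate threads.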
TIMTOWTDI
Playing devil's advocate: assuming only the specific values for computing the ADV are required and no further processing of the CSV data is needed, another way of performing the above task is to solve the problem efficiently in a single pass over the data, as follows:
std::string file_name = "market_data.csv";
std::unordered_map<std::string,stock_info> stock_map;
stock_map.insert(std::make_pair("GOOG", stock_info("GOOG")));
stock_map.insert(std::make_pair("MSFT", stock_info("MSFT")));
strtk::for_each_line(file_name,
[&stock_map](const std::string& line)
{
strtk::ignore_token ignore;
stock_info temp;
const bool result = strtk::parse(line,
",",
ignore,
temp.symbol,
ignore,
ignore,
ignore,
ignore,
temp.total_volume);
if (!result) return;
auto iter = stock_map.find(temp.symbol);
if (stock_map.end() == iter) return;
(*iter).second.total_volume += temp.total_volume;
(*iter).second.day_count++;
});
auto itr = stock_map.begin();
auto end = stock_map.end();
while (end != itr)
{
stock_info& stock = (*itr++).second;
stock.average_daily_volume = (1.0 * stock.total_volume) / stock.day_count;
}
Sequential Partitions
A typical operation carried out upon time-series data is to group tuples into buckets (or bins) based upon the time index value. For example grouping data into 2-minute buckets and then performing some kind of operation upon the grouped tuples such as a summation or an average etc. This process is sometimes also called: "discretization"
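Before looking at StrTk's support for this, the bucketing operation itself can be sketched in plain standard C++ (bucket_sums is a hypothetical helper written for illustration):

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Group (time, value) tuples into fixed-width time buckets and sum each
// bucket: a new bucket starts whenever a tick's time reaches the next
// bucket boundary.
std::deque<double> bucket_sums(
   const std::vector<std::pair<std::size_t,double>>& ticks,
   std::size_t period)
{
   std::deque<double> sums;
   std::size_t next_time = 0;
   for (const auto& tick : ticks)
   {
      if (tick.first >= next_time)   // transition to a new bucket
      {
         next_time = tick.first + period;
         sums.push_back(0.0);
      }
      sums.back() += tick.second;    // accumulate into current bucket
   }
   return sums;
}
```

The transition test and the per-bucket accumulation are the two roles that sequential_partition factors out into a Transition Predicate and a Function.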
The strtk::token_grid class provides a method known as sequential_partition. The sequential_partition method requires a Transition Predicate, a Function and optionally a row-range. The Transition Predicate consumes a row and returns true only if the row is in the next partition. All subsequent consecutive rows, until the transition predicate next returns true, are said to be in the current partition. Prior to transitioning to a new partition, the function predicate is invoked and provided with the range of rows in the current partition.
The following example takes a simple time-series (time and value), partitions the tuples into Time-Buckets of period length 3, and then proceeds to compute the total sum of each group. The summarizer class below provides both the Transition Predicate and the Function.
class summarizer
{
public:
enum column_index
{
tick_time_column = 0,
tick_value_column = 1
};
summarizer(std::deque<double>& sum_value)
: next_tick_time_(0),
sum_value_(sum_value)
{}
// Transition Predicate
bool operator()(const strtk::token_grid::row_type& row)
{
if (row.get<std::size_t>(tick_time_column) >= next_tick_time_)
{
next_tick_time_ = row.get<std::size_t>(tick_time_column) + 3;
return true;
}
else
return false;
}
// Function
bool operator()(const strtk::token_grid& grid,
const strtk::token_grid::row_range_type& range)
{
double bucket_sum = 0.0;
if (!grid.accumulate_column(tick_value_column,range,bucket_sum))
{
std::cout << "failed to accumulate!" << std::endl;
return false;
}
else
sum_value_.push_back(bucket_sum);
return true;
}
private:
summarizer& operator=(const summarizer&);
std::size_t next_tick_time_;
std::deque<double>& sum_value_;
};
int main()
{
//time index, value
std::string data = "10000,123.456\n"
"10001,612.345\n"
"10002,561.234\n"
"10003,456.123\n"
"10004,345.612\n"
"10005,234.561\n"
"10006,123.456\n";
strtk::token_grid grid(data,data.size(),",");
std::deque<double> sum_value;
summarizer s(sum_value);
grid.sequential_partition(s,s);
for (std::size_t i = 0; i < sum_value.size(); ++i)
{
std::cout << "bucket[" << i << "] = " << sum_value[i] << std::endl;
}
return 0;
}
The expected output is as follows:
bucket[0] = 1297.035
bucket[1] = 1036.296
bucket[2] = 123.456
Parsing CSV Files With Embedded Double-Quotes
One of the simple extensions to the CSV format is the concept of double-quoted tokens. Such tokens may contain column or row delimiters; when this occurs, all delimiters encountered are treated as part of the token, until the corresponding closing double quote is reached. The StrTk token_grid supports the parsing of such tokens. This parsing mode can easily be activated via the token_grid option set. Below is an example of a token_grid loading a CSV data set representing various airports from around the world, along with their specific codes and locations, in which some of the cells are double-quoted:
ICAO | IATA | Airport | City | Country |
AYGA | GKA | "Goroka Gatue" | Goroka | Papua New Guinea |
BGCO | GCO | "Nerlerit Inaat Constable Pynt" | "Nerlerit Inaat" | Greenland |
BZGD | ZGD | Godley | Auckland | New Zealand |
CYQM | YQM | "Greater Moncton International" | Moncton | Canada |
EDRK | ZNV | "Koblenz Winningen" | Koblenz | Germany |
FAHU | AHU | "HMS Bastard Memorial" | Kwazulu-Natal | South Africa |
FQMP | MZB | "Mocimboa Da Praia" | "Mocimboa Da Praia" | Mozambique |
KINS | INS | "Indian Springs AF AUX" | Indian Springs | USA |
UHNN | HNN | Nikolaevsk | "Nikolaevsk Na Amure" | Russia |
WBKK | BKI | "Kota Kinabalu International" | Kota Kinabalu | Malaysia |
ZSJD | JDZ | "Jingdezhen Airport" | Jingdezhen | China |
The following is the StrTk code example using token_grid
to parse the above CSV data set:
int main()
{
std::string airport_data_file_name = "airport_data.csv";
strtk::token_grid::options options;
options.column_delimiters = "| ,";
options.support_dquotes = true;
strtk::token_grid airport_grid(airport_data_file_name,options);
// for each row r, for each column c, print cell[r,c]
for (std::size_t r = 0; r < airport_grid.row_count(); ++r)
{
strtk::token_grid::row_type row = airport_grid.row(r);
for (std::size_t c = 0; c < row.size(); ++c)
{
std::cout << "[" << row.get<std::string>(c) << "] ";
}
std::cout << std::endl;
}
return 0;
}
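The quote-handling rule itself can be sketched independently of StrTk. The following simplified splitter (split_dquoted, a hypothetical helper) handles a single-character delimiter and drops the quote characters themselves; the real token_grid option set handles multiple delimiters and row splitting as well:

```cpp
#include <string>
#include <vector>

// Split on a delimiter, but treat everything between double quotes as
// part of the current token (the quotes themselves are dropped).
std::vector<std::string> split_dquoted(const std::string& line,
                                       char delimiter)
{
   std::vector<std::string> tokens(1);   // start with one empty token
   bool in_quotes = false;
   for (char c : line)
   {
      if ('"' == c)
         in_quotes = !in_quotes;         // toggle quoted region
      else if ((c == delimiter) && !in_quotes)
         tokens.push_back(std::string()); // delimiter outside quotes
      else
         tokens.back() += c;             // ordinary character
   }
   return tokens;
}
```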
Extending Delimiter Predicates
As previously mentioned, the concept of a delimiter-based predicate can lead to some very interesting solutions. A predicate, as defined so far (with the exception of the offset predicate), has been a stateless entity. Adding the ability to maintain state based on what the predicate has encountered so far allows it to behave differently from the simple single and multiple delimiter predicates.
For this example, let's assume a typical command-line parsing problem consisting of double quotation mark groupings and escapable special characters, which can be considered dual-use as either delimiters or data. An example input and output is as follows:
Inputs
Data | abc;"123, mno xyz",i\,jk |
Delimiters | <space>;,. |
Output Tokens
Token0 | abc |
Token1 | 123, mno xyz |
Token2 | i\,jk |
In order to tokenize the above described string, one can create a composite predicate using a multiple char delimiter predicate and some simple state rules. The following is an example of such an extended predicate:
class extended_predicate
{
public:
extended_predicate(const std::string& delimiters)
: escape_(false),
in_bracket_range_(false),
mdp_(delimiters)
{}
inline bool operator()(const unsigned char c) const
{
if (escape_)
{
escape_ = false;
return false;
}
else if ('\\' == c)
{
escape_ = true;
return false;
}
else if ('"' == c)
{
in_bracket_range_ = !in_bracket_range_;
return true;
}
else if (in_bracket_range_)
return false;
else
return mdp_(c);
}
inline void reset()
{
escape_ = false;
in_bracket_range_ = false;
}
private:
mutable bool escape_;
mutable bool in_bracket_range_;
mutable strtk::multiple_char_delimiter_predicate mdp_;
};
Usage of the newly defined extended predicate is as follows:
int main()
{
std::string str = "abc;\"123, mno xyz\",i\\,jk";
strtk::std_string::token_list_type token_list;
strtk::split(extended_predicate(".,; "),
str,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters);
return 0;
}
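For readers who want to experiment without the library, the same state machine can be expressed as a self-contained standard C++ function (split_extended is a hypothetical helper mirroring the predicate's rules, with delimiter compression applied as in the strtk::split call above):

```cpp
#include <string>
#include <vector>

// '\\' escapes the next character, '"' toggles a region in which
// delimiters are ignored, and characters in `delims` split tokens
// outside quoted regions. Consecutive delimiters are compressed.
std::vector<std::string> split_extended(const std::string& input,
                                        const std::string& delims)
{
   std::vector<std::string> tokens;
   std::string current;
   bool escape = false;
   bool quoted = false;
   for (char c : input)
   {
      bool is_delim;
      if (escape)         { escape = false;   is_delim = false; }
      else if ('\\' == c) { escape = true;    is_delim = false; }
      else if ('"'  == c) { quoted = !quoted; is_delim = true;  }
      else if (quoted)    is_delim = false;
      else                is_delim = (delims.find(c) != std::string::npos);

      if (is_delim)
      {
         if (!current.empty())
         {
            tokens.push_back(current);
            current.clear();
         }
      }
      else
         current += c;   // escaped chars and the '\\' itself are kept
   }
   if (!current.empty())
      tokens.push_back(current);
   return tokens;
}
```

Run against the example input abc;"123, mno xyz",i\,jk it yields the three tokens shown in the output table above.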
Performance Comparisons
The following are tables of results generated by running the strtk_tokenizer_cmp test. Currently it covers simple comparisons between Boost String Algorithms, Boost lexical_cast, the Standard Library, Spirit (Karma, Qi) and StrTk in the following areas:
- Tokenization
- Splitting
- Integer To String
- String To Integer
- String To Double
Scenario 0 - MSVC 2010 (64-bit, O2, Ot, GL and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 8.5857sec | 2795359.4074tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.5019sec | 6853393.1186tks/sec | 40.7%, 245.1% |
Boost | Split | 9600000 | 5.5414sec | 1732414.5137tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 0.8218sec | 11681814.9167tks/sec | 14.8%, 674.3% |
sprintf | Integer To String | 80000000 | 35.8128sec | 2233840.0564nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 19.3994sec | 4123832.0477nums/sec | 54.1%, 184.6% |
Karma | Integer To String | 80000000 | 6.2528sec | 12794349.6524nums/sec | 17.4%, 572.7% |
StrTk | Integer To String | 80000000 | 1.5664sec | 51071439.9822nums/sec | 4.3%, 2286.2% |
atoi | String To Integer | 88500000 | 5.1802sec | 17084370.4936nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 119.6261sec | 739805.3712nums/sec | 2309.2%, 4.3% |
Qi | String To Integer | 88500000 | 2.1951sec | 40317238.6629nums/sec | 42.3%, 235.9% |
StrTk | String To Integer | 88500000 | 1.8181sec | 48677773.5466nums/sec | 35.0%, 284.9% |
atof | String To Double | 30650000 | 15.2306sec | 2012396.7122nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 52.9244sec | 579127.8866nums/sec | 347.4%, 28.7% |
Qi | String To Double | 30650000 | 2.8665sec | 10692313.5853nums/sec | 18.8%, 531.3% |
StrTk | String To Double | 30650000 | 1.6069sec | 19073975.7679nums/sec | 10.5%, 947.8% |
Scenario 1 - MSVC 2010 (O2, Ot, GL and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.4715sec | 2533910.4769tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 2.8889sec | 8307786.9292tks/sec | 30.5%, 327.8% |
Boost | Split | 9600000 | 7.2291sec | 1327965.9706tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.1301sec | 8494610.9664tks/sec | 15.6%, 639.6% |
sprintf | Integer To String | 80000000 | 38.2576sec | 2091088.8038nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 28.9931sec | 2759277.4769nums/sec | 75.7%, 131.9% |
Karma | Integer To String | 80000000 | 4.9173sec | 16269254.0190nums/sec | 12.8%, 778.0% |
StrTk | Integer To String | 80000000 | 1.8270sec | 43786838.0279nums/sec | 4.7%, 2093.9% |
atoi | String To Integer | 88500000 | 6.0076sec | 14731435.8942nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 185.4955sec | 477100.6474nums/sec | 3087.0%, 3.2% |
Qi | String To Integer | 88500000 | 2.5060sec | 35314785.8370nums/sec | 41.7%, 239.7% |
StrTk | String To Integer | 88500000 | 2.2095sec | 40054213.0736nums/sec | 36.7%, 271.8% |
atof | String To Double | 30650000 | 17.6435sec | 1737179.9302nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 78.6528sec | 389687.3997nums/sec | 445.7%, 22.4% |
Qi | String To Double | 30650000 | 3.8034sec | 8058494.1994nums/sec | 21.5%, 463.8% |
StrTk | String To Double | 30650000 | 2.0450sec | 14987780.2310nums/sec | 11.5%, 862.7% |
Scenario 2 - MSVC 2008 SP1 (O2, Ot, GL and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.6533sec | 2486184.8282tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.4748sec | 6906943.9529tks/sec | 35.9%, 277.8% |
Boost | Split | 9600000 | 10.2600sec | 935674.7490tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.3793sec | 6959830.0652tks/sec | 13.4%, 743.8% |
sprintf | Integer To String | 80000000 | 24.6427sec | 3246397.8287nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 27.5865sec | 2899968.5753nums/sec | 111.9%, 89.3% |
Karma | Integer To String | 80000000 | 5.4864sec | 14581630.6963nums/sec | 22.2%, 449.1% |
StrTk | Integer To String | 80000000 | 2.4224sec | 33025441.1256nums/sec | 9.8%, 1017.2% |
atoi | String To Integer | 88500000 | 5.9297sec | 14924814.8683nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 186.1372sec | 475455.6660nums/sec | 3139.0%, 3.1% |
Qi | String To Integer | 88500000 | 2.0874sec | 42397446.1804nums/sec | 35.2%, 284.0% |
StrTk | String To Integer | 88500000 | 2.0485sec | 43202160.1371nums/sec | 34.5%, 289.4% |
atof | String To Double | 30650000 | 18.0458sec | 1698455.0767nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 77.4527sec | 395725.4111nums/sec | 429.2%, 23.2% |
Qi | String To Double | 30650000 | 3.9631sec | 7733881.1294nums/sec | 21.9%, 455.3% |
StrTk | String To Double | 30650000 | 2.0723sec | 14790236.0804nums/sec | 11.4%, 870.8% |
Scenario 3 - Intel C++ v11.1.060 IA-32 (O2, Ot, Qipo, QxHost and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 10.0096sec | 2397697.7836tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.1837sec | 7538416.8541tks/sec | 31.8%, 314.4% |
Boost | Split | 9600000 | 9.5450sec | 1005760.0310tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.4292sec | 6716893.1359tks/sec | 14.9%, 667.8% |
sprintf | Integer To String | 80000000 | 23.8979sec | 3347577.5824nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 27.5618sec | 2902565.2045nums/sec | 115.3%, 86.7% |
Karma | Integer To String | 80000000 | 4.6600sec | 17167208.7654nums/sec | 19.4%, 512.8% |
StrTk | Integer To String | 80000000 | 2.8450sec | 28119857.2736nums/sec | 11.9%, 840.0% |
atoi | String To Integer | 88500000 | 5.9386sec | 14902610.8922nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 180.5856sec | 490072.4001nums/sec | 3040.8%, 3.2% |
Qi | String To Integer | 88500000 | 2.5273sec | 35017073.8639nums/sec | 42.5%, 234.9% |
StrTk | String To Integer | 88500000 | 1.8718sec | 47281492.1287nums/sec | 31.5%, 317.2% |
atof | String To Double | 30650000 | 18.4357sec | 1662538.0810nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 78.1543sec | 392172.9598nums/sec | 423.9%, 23.5% |
Qi | String To Double | 30650000 | 2.8321sec | 10822353.0510nums/sec | 15.3%, 650.9% |
StrTk | String To Double | 30650000 | 2.2930sec | 13366541.5515nums/sec | 12.4%, 803.9% |
Scenario 4 - GCC 4.5 (O3, PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.2510sec | 2594305.4347tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.9717sec | 6042688.5734tks/sec | 42.9%, 232.9% |
Boost | Split | 9600000 | 5.0640sec | 1895728.2331tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.5411sec | 6229231.8384tks/sec | 30.4%, 328.5% |
sprintf | Integer To String | 80000000 | 14.7807sec | 5412477.0993nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 19.1131sec | 4185620.7707nums/sec | 129.3%, 77.3% |
Karma | Integer To String | 80000000 | 6.4455sec | 12411808.2841nums/sec | 43.6%, 229.3% |
StrTk | Integer To String | 80000000 | 4.5174sec | 17709364.5349nums/sec | 30.5%, 327.1% |
atoi | String To Integer | 88500000 | 5.2139sec | 16973721.6103nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 50.5326sec | 1751344.8498nums/sec | 969.1%, 10.3% |
Qi | String To Integer | 88500000 | 1.9694sec | 44937612.8835nums/sec | 37.7%, 264.7% |
StrTk | String To Integer | 88500000 | 1.9008sec | 46558706.5833nums/sec | 36.4%, 274.2% |
atof | String To Double | 30650000 | 6.6975sec | 4576328.3036nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 29.6375sec | 1034162.2422nums/sec | 442.5%, 22.5% |
Qi | String To Double | 30650000 | 2.9852sec | 10267435.7138nums/sec | 44.5%, 224.3% |
StrTk | String To Double | 30650000 | 1.5961sec | 19202937.1409nums/sec | 23.8%, 419.6% |
Scenario 5 - GCC 4.5 (O3, PGO) Intel Atom N450
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 29.1370sec | 823695.4389tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 12.3607sec | 1941644.0499tks/sec | 42.4%, 235.7% |
Boost | Split | 9600000 | 16.5261sec | 580899.9726tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 4.9102sec | 1955110.2611tks/sec | 29.7%, 336.5% |
sprintf | Integer To String | 80000000 | 50.3456sec | 1589015.6118nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 91.1475sec | 877698.1401nums/sec | 181.0%, 55.2% |
Karma | Integer To String | 80000000 | 21.8904sec | 3654568.8712nums/sec | 43.4%, 229.9% |
StrTk | Integer To String | 80000000 | 12.1877sec | 6564009.9274nums/sec | 24.2%, 413.0% |
atoi | String To Integer | 88500000 | 17.6615sec | 5010896.5768nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 191.9446sec | 461070.5357nums/sec | 1086.7%, 9.2% |
Qi | String To Integer | 88500000 | 6.2808sec | 14090561.7119nums/sec | 35.5%, 281.1% |
StrTk | String To Integer | 88500000 | 6.1552sec | 14378086.8208nums/sec | 34.8%, 286.9% |
atof | String To Double | 30650000 | 21.4865sec | 1426474.1027nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 139.8166sec | 219215.7409nums/sec | 650.7%, 15.3% |
Qi | String To Double | 30650000 | 11.3916sec | 2690567.9223nums/sec | 53.0%, 188.6% |
StrTk | String To Double | 30650000 | 6.4396sec | 4759608.7027nums/sec | 29.9%, 333.6% |
Scenario 6 - GCC 4.5 (O3, PGO) Intel Xeon
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 7.5657sec | 3172216.8787tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 2.7379sec | 8765832.8290tks/sec | 36.1%, 276.3% |
Boost | Split | 9600000 | 3.0706sec | 3126386.1126tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.1279sec | 8511136.2899tks/sec | 36.7%, 272.2% |
sprintf | Integer To String | 80000000 | 10.9012sec | 7338642.9638nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 12.3317sec | 6487328.7872nums/sec | 113.1%, 88.3% |
Karma | Integer To String | 80000000 | 3.7202sec | 21504260.6660nums/sec | 34.1%, 293.0% |
StrTk | Integer To String | 80000000 | 2.5183sec | 31768042.4612nums/sec | 23.1%, 432.8% |
atoi | String To Integer | 88500000 | 4.0087sec | 22077037.6357nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 30.3659sec | 2914454.4393nums/sec | 757.4%, 13.2% |
Qi | String To Integer | 88500000 | 1.7976sec | 49231871.5454nums/sec | 43.3%, 223.0% |
StrTk | String To Integer | 88500000 | 1.7384sec | 50908881.7303nums/sec | 43.3%, 230.5% |
atof | String To Double | 30650000 | 5.2118sec | 5880843.9328nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 21.5546sec | 1421966.9538nums/sec | 413.5%, 24.1% |
Qi | String To Double | 30650000 | 3.2149sec | 9533840.3118nums/sec | 61.6%, 162.1% |
StrTk | String To Double | 30650000 | 1.3929sec | 22003661.2944nums/sec | 26.7%, 374.1% |
Note 1: The tests were compiled with optimisation flags chosen to produce the best possible results for the respective compilers and architectures. The tests were run natively (no virtualization) on an almost completely idle machine, so as to reduce interference from background processes. The Boost version used was 1.45. For the Linux-based tests, the standard libraries (including libc) were rebuilt using architecture-specific flags and optimisations. The following table maps each scenario to its architecture:
Scenario | Architecture |
0 | ThinkPad W510 (64-Bit Intel Quad Core Extreme i7-920XM 2.0GHz, 16GB RAM, Windows 7) |
1-3 | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Windows 7) |
4 | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Ubuntu 10.10) |
5 | Acer Aspire One (32-Bit Intel Atom N450 1.6GHz, 1GB RAM, Ubuntu 10.10) |
6 | HP Proliant DL380G6 (64-Bit Intel Xeon E5540 2.5GHz, 8GB RAM, Ubuntu 10.10) |
Note 2: The percentages in the final column express the current row as a percentage of the baseline, in total running time and rate respectively. For the first percentage lower is better; for the second percentage higher is better. The baseline used for a given combination of tests is defined in the following table:
Test Combination | Baseline |
Boost, StrTk | Boost |
Boost, StdLib/STL, Spirit, StrTk | StdLib/STL |
StdLib/STL, Spirit, StrTk | StdLib/STL |
Note 3: The test sizes are set such that no single run takes less than one second. This ensures that the runs-per-second results are measured rather than extrapolated. These sizes may need to be revisited once 3.5+GHz CPU speeds become more commonplace. Furthermore the charts represent the rate of operations over a one second interval - in short, the higher the rate the better.
Note 4: The binaries used for the above performance tests can be downloaded from here.
Note 5: It would be great to have comparisons for other architectures. If you can provide access to shell accounts with GCC 4.5+ or Clang/LLVM 2.0+ for the following architectures: UltraSPARC T2 Plus, SPARC64 VII, POWER6/7, please feel free to get in contact.
StrTk Library Dependency
Originally StrTk was compatible with C++ compilers ranging from GCC v2.95 to Visual Studio v6.0; however, due to upgrades it now requires at least GCC v4.4, Intel C++ Compiler v10 or Visual Studio v9.0 to compile cleanly. StrTk also makes use of the Boost library, for its boost::lexical_cast routine when converting types other than PODs, and for its TR1-compliant Random and Regex libraries. These dependencies are not crucial and can easily be removed, simply by defining the preprocessor macro strtk_no_tr1_or_boost. That said, Boost is an integral part of modern C++ programming, and having it around is as beneficial as having access to the STL, hence it is recommended that it be installed. For Visual Studio users, BoostPro provides a free and easy-to-use installer for the latest Boost libraries. For Linux users, mainstream distributions such as Ubuntu and Red Hat (Fedora) provide easy installation of the Boost libraries via their respective package management systems. For more information please consult the readme.txt found in the StrTk distribution.
Compiler Support
The following is a listing of the compilers with which StrTk can be built error and warning free.
- GCC - versions 4.4+
- Intel C++ Compiler - versions 10+
- Clang/LLVM - versions 1.0+
- MSVC - versions 9.0+
- Comeau C/C++ - versions 4.3+
Note: Versions of compilers prior to those denoted above "should" compile, however they may require a very lenient warning/error level to be set during compilation.
Conclusion
StrTk was designed with performance and efficiency as its primary principles, and as such some of the available interfaces may not be as user-friendly as they could be - that said, the gains made in other areas will hopefully compensate for any perceived difficulties. As with most things, there is a trade-off between performance and usability in the tokenizers and parsing methods discussed above. The original aim was to provide an interface similar to that of the Boost Tokenizer and Split routines, but also to offer the developer simplifications that will hopefully provide more flexibility and efficiency in the long run. Tokenizing a string isn't the most fascinating problem one could tackle, but it does have its interesting points when one has a few TB of data to process - doing it properly could mean the difference between finishing a simple data processing job today or next month.