Monday, 18 May 2015

Simple AWK Commands To Understand

awk - 10 examples to read files with multiple delimiters

 

In this article of the awk series, we will see how to use awk to read or parse text or CSV files containing multiple delimiters or repeating delimiters. We will also discuss some peculiar delimiters and how to handle them using awk.

Let us consider a sample file. This colon-separated file contains an item, a purchase year and a set of prices separated by semi-colons.
 

$ cat file

Item1:2010:10;20;30

Item2:2012:12;29;19

Item3:2014:15;50;61


1. To print the 3rd column, which contains the prices:

$ awk -F: '{print $3}' file

10;20;30

12;29;19

15;50;61

This is straightforward. By specifying the colon (:) as the delimiter with the -F option, the 3rd column can be retrieved using the $3 variable.

2. To print the 2nd component of the prices alone:

$ awk -F '[:;]' '{print $4}' file

20

29

50

What did we do here? We specified two delimiters: one is : and the other is ; . How does awk parse the file? It's simple. awk treats both the colon (:) and the semi-colon (;) as delimiters. While reading a line, the text up to the first : or ; is stored in $1, the text up to the next delimiter is stored in $2, and so on until the end of the line is reached. The three prices therefore land in $3, $4 and $5, and $4 holds the second price.

Note: To split on any one of several single-character delimiters, list them inside square brackets, for example [;:]. The order of : and ; inside the brackets does not matter.
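To see how the fields are numbered when two delimiters are in play, the following small sketch prints every field of one sample line together with its index. The variable names line and fields are only illustrative.

```shell
# One line copied from the sample file above.
line='Item1:2010:10;20;30'

# Split on ':' or ';' and print each field with its position.
# NF is the number of fields awk found in the current line.
fields=$(printf '%s\n' "$line" |
  awk -F '[:;]' '{ for (i = 1; i <= NF; i++) print "$" i " = " $i }')

printf '%s\n' "$fields"
```

This prints five lines, from "$1 = Item1" to "$5 = 30", which shows that the three prices occupy $3, $4 and $5.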


3. To sum the individual components of the 3rd column and print it:

$ awk -F '[;:]' '{$3=$3+$4+$5;print $1,$2,$3}' OFS=: file

Item1:2010:60

Item2:2012:60

Item3:2014:126

After splitting on both delimiters, the individual prices are available in $3, $4 and $5. We simply sum them up, store the result in $3, and print the first three fields. OFS (output field separator) specifies the delimiter used while printing the output.

Note: If OFS is not set, awk prints the fields using the default output delimiter, which is a space.
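The effect of OFS can be checked with a short sketch that runs the same sum twice, once without OFS and once with OFS set in a BEGIN block (which has the same effect here as the trailing OFS=: assignment). The variable names are only illustrative.

```shell
line='Item1:2010:10;20;30'

# No OFS: the commas in print join the values with a single space.
without_ofs=$(printf '%s\n' "$line" |
  awk -F '[;:]' '{ print $1, $2, $3 + $4 + $5 }')

# OFS set to ':' before any input is read.
with_ofs=$(printf '%s\n' "$line" |
  awk -F '[;:]' 'BEGIN { OFS = ":" } { print $1, $2, $3 + $4 + $5 }')

printf '%s\n%s\n' "$without_ofs" "$with_ofs"
```

The first line printed is "Item1 2010 60" and the second is "Item1:2010:60".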


4. Un-group or re-group every record depending on the price column:

$ awk -F '[;:]' '{for(i=3;i<=5;i++){print $1,$2,$i;}}' OFS=":" file

Item1:2010:10

Item1:2010:20

Item1:2010:30

Item2:2012:12

Item2:2012:29

Item2:2012:19

Item3:2014:15

Item3:2014:50

Item3:2014:61

The requirement here is that a new record has to be created for every component of the price column. A loop runs over columns 3 to 5, and each iteration prints a record built from the item, the year and one price component.
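The loop above assumes exactly three prices per record. If the number of prices varies, one possible variation is to loop up to NF (the number of fields in the current line) instead of the fixed 5. The input below is hypothetical and is not part of the sample file.

```shell
# Hypothetical records with two and four prices respectively.
input='Item1:2010:10;20
Item2:2012:12;29;19;40'

# Loop from the first price column to the last field of each line.
ungrouped=$(printf '%s\n' "$input" |
  awk -F '[;:]' -v OFS=':' '{ for (i = 3; i <= NF; i++) print $1, $2, $i }')

printf '%s\n' "$ungrouped"
```

This produces two records for Item1 and four records for Item2, one per price.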

5-6. Read a file in which the delimiters are square brackets:

$ cat file

123;abc[202];124

125;abc[203];124

127;abc[204];124

  5.  To print the value present within the brackets:

$ awk -F '[][]' '{print $2}' file

202

203

204

At first sight, the delimiter used in the above command might be confusing. It's simple. Two delimiters are used in this case: one is [ and the other is ]. Since the delimiters themselves are square brackets, and they have to be placed within square brackets, it looks tricky at first.

 

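Note: When square brackets are the delimiters, they should be written in this order only, meaning ] first, followed by [. A ] placed immediately after the opening bracket is taken literally, whereas in -F '[[]]' the first ] closes the bracket expression, so the delimiter becomes a [ immediately followed by a ], which is a different interpretation altogether.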
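The difference between the two spellings can be checked by counting the fields awk finds with each of them. This is a small sketch on one line of the sample file; the variable names are only illustrative.

```shell
line='123;abc[202];124'

# '[][]' is a bracket expression containing ']' and '[',
# so the line is split at each bracket.
nf_correct=$(printf '%s\n' "$line" | awk -F '[][]' '{ print NF }')

# '[[]]' is the bracket expression '[[]' followed by a literal ']',
# so it only matches the two-character text "[]".
# That text never occurs in the line, so nothing is split.
nf_wrong=$(printf '%s\n' "$line" | awk -F '[[]]' '{ print NF }')

printf '%s %s\n' "$nf_correct" "$nf_wrong"
```

The first command reports 3 fields, while the second reports 1 field because the whole line stays in $1.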


 
 6.  To print the first value, the value within brackets, and the last value:

$ awk -F '[][;]' '{print $1,$3,$5}' OFS=";" file

123;202;124

125;203;124

127;204;124

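Three delimiters are used in this case, with the semi-colon also included. Since ] is immediately followed by ; in the file, there is an empty field ($4) between them, which is why the last value is in $5.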
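To see where each value ends up with these three delimiters, the sketch below prints every field of one sample line between angle brackets, so that empty fields are visible too. The variable names are only illustrative.

```shell
line='123;abc[202];124'

# Split on ']', '[' or ';' and show each field, including empty ones.
fields=$(printf '%s\n' "$line" |
  awk -F '[][;]' '{ for (i = 1; i <= NF; i++) print "$" i " = <" $i ">" }')

printf '%s\n' "$fields"
```

The output lists five fields, and "$4 = <>" is the empty field between ] and ; .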

7-8. Read or parse a file containing a series of delimiters:

$ cat file

123;;;202;;;203

124;;;213;;;203

125;;;222;;;203

      The above file contains a series of 3 semi-colons between every 2 values.

    7. Using the multiple delimiter method:

$ awk -F'[;;;]' '{print $2}' file

 

 

 

Blank output! The above delimiter, though specified as 3 semi-colons, is as good as a single semi-colon (;) delimiter since the characters inside the brackets are all the same. Due to this, $2 is the value between the first and the second semi-colon, which in our case is blank, and hence no output.
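That '[;;;]' behaves like a single semi-colon can be verified by counting fields with both forms. This is a quick sketch on one line of the file; the variable names are only illustrative.

```shell
line='123;;;202;;;203'

# Repeating a character inside the brackets changes nothing.
nf_bracket=$(printf '%s\n' "$line" | awk -F '[;;;]' '{ print NF }')
nf_single=$(printf '%s\n' "$line" | awk -F ';' '{ print NF }')

# With single-character splitting, the second value lands in $4.
fourth=$(printf '%s\n' "$line" | awk -F '[;;;]' '{ print $4 }')

printf '%s %s %s\n' "$nf_bracket" "$nf_single" "$fourth"
```

Both forms report 7 fields, and 202 is found in $4 because $2 and $3 are the empty strings between consecutive semi-colons.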


     8. Using the delimiter without square brackets:

$ awk -F';;;' '{print $2}' file

202

213

222

The expected output! No square brackets are used, and we got the output we wanted.

Difference between using square brackets and not using them: When a set of delimiters is specified using square brackets, it means an OR condition of the delimiters. For example, -F '[;:]' means to separate the contents on encountering either ':' or ';'. However, when a set of delimiters is specified without square brackets, awk treats them as a sequence that must appear in that order. For example, -F ':;' means to separate the contents only on encountering a colon immediately followed by a semi-colon. Hence, in the last example, the file contents are separated only when 3 consecutive semi-colons are encountered.
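The OR versus sequence behaviour can be seen side by side in the sketch below. The input line is hypothetical: it contains ':' and ';' both on their own and next to each other.

```shell
# Hypothetical line, not taken from the sample files.
line='a:;b:c;d'

# Bracket expression: split at every ':' or ';'.
nf_either=$(printf '%s\n' "$line" | awk -F '[;:]' '{ print NF }')

# Plain sequence: split only where ':' is immediately followed by ';'.
nf_sequence=$(printf '%s\n' "$line" | awk -F ':;' '{ print NF }')
second=$(printf '%s\n' "$line" | awk -F ':;' '{ print $2 }')

printf '%s %s %s\n' "$nf_either" "$nf_sequence" "$second"
```

With the brackets awk finds 5 fields (one of them empty, between ':' and ';'). Without the brackets it finds only 2, and $2 is "b:c;d" since the lone ':' and ';' are no longer delimiters.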


9. Read or parse a file containing a series of delimiters of varying lengths:
In the file below, the 1st and 2nd columns are separated by 3 semi-colons, whereas the 2nd and 3rd are separated by 4 semi-colons.
 

$ cat file

123;;;202;;;;203

124;;;213;;;;203

125;;;222;;;;203

$ awk -F';+' '{print $2,$3}' file

202 203

213 203

222 203

The '+' is a regular expression operator. It indicates one or more occurrences of the preceding character. ';+' indicates one or more semi-colons, and hence both the 3 semi-colons and the 4 semi-colons get matched.
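To see why a fixed-length delimiter is not enough for this file, the sketch below prints $3 for one line using first the 3 semi-colon delimiter from the earlier example and then the one-or-more form. The variable names are only illustrative.

```shell
line='123;;;202;;;;203'

# Exactly three semi-colons: the fourth one is left over in the data.
fixed=$(printf '%s\n' "$line" | awk -F ';;;' '{ print $3 }')

# One or more semi-colons: the whole run is consumed as one delimiter.
flexible=$(printf '%s\n' "$line" | awk -F ';+' '{ print $3 }')

printf '%s\n%s\n' "$fixed" "$flexible"
```

The fixed delimiter gives ";203", with a stray semi-colon in front, while the one-or-more delimiter gives the clean value "203".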


10.  Using a word as a delimiter:

$ cat file

123Unix203

124Unix203

125Unix203

Retrieve the numbers before and after the word "Unix":

$ awk -F'Unix' '{print $1, $2}' file

123 203

124 203

125 203

In this case, we use the word "Unix" as the delimiter, and hence $1 and $2 contain the appropriate values. Keep in mind that it is not just special characters that can be used as delimiters. Letters and whole words can be used as delimiters as well.
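A word delimiter is matched exactly as typed, so "unix" in lower case would not split the line. Since a multi-character delimiter is treated as a regular expression, one possible way to accept both spellings is a bracket expression for the first letter. The mixed-case input below is hypothetical.

```shell
# Hypothetical input with the word in two different cases.
input='123Unix203
124unix203'

# '[Uu]nix' matches both "Unix" and "unix".
pairs=$(printf '%s\n' "$input" | awk -F '[Uu]nix' '{ print $1, $2 }')

printf '%s\n' "$pairs"
```

Both lines are split correctly, giving "123 203" and "124 203".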

 

