Tuesday, December 27, 2011

scary awk magic

AWK is quite a powerful program that can do all sorts of gymnastics with structured text files.  Here's one example.

Say you have a file with four columns (depicted below) delimited by tabs:
-an id field
-a JSON string of dates
-another JSON string of values associated to those dates
-a last column that has the total of those values in column 3

Source File
6237    [20111102,20111103,20111104,20111105,20111106,20111107,20111108,20111109,20111110,20111111,20111112,20111113,20111114,20111115,20111116,20111117,20111118,20111119,20111120,20111121,20111122,20111123,20111124,20111125,20111126,20111127,20111128,20111129,20111130,20111201,20111202,20111203,20111204,20111205,20111206,20111207,20111208,20111209,20111210,20111211,20111212,20111213,20111214,20111215,20111216]  [286,291,276,274,339,424,428,408,458,401,354,398,543,571,631,581,515,516,571,721,768,943,960,917,899,941,1168,1274,1256,1527,1472,1315,1323,1741,1826,1958,1988,1760,1662,1770,2088,2117,2325,2588,2214]    47786
6237    [20111102,20111103,20111104,20111105,20111106,20111107,20111108,20111109,20111110,20111111,20111112,20111113,20111114,20111115,20111116,20111117,20111118,20111119,20111120,20111121,20111122,20111123,20111124,20111125,20111126,20111127,20111128,20111129,20111130,20111201,20111202,20111203,20111204,20111205,20111206,20111207,20111208,20111209,20111210,20111211,20111212,20111213,20111214,20111215,20111216]  [286,291,276,274,339,424,428,408,458,401,354,398,543,571,631,581,515,516,571,721,768,943,960,917,899,941,1168,1274,1256,1527,1472,1315,1323,1741,1826,1958,1988,1760,1662,1770,2088,2117,2325,2588,2214]    47786
6237    [20111102,20111103,20111104,20111105,20111106,20111107,20111109,20111110,20111111,20111112,20111113,20111114,20111115,20111116,20111117,20111118,20111119,20111120,20111121,20111122,20111123,20111124,20111125,20111126,20111127,20111128,20111129,20111130]       [286,291,276,274,339,424,408,458,401,354,398,543,571,631,581,515,516,571,721,768,943,960,917,899,941,1168,1274,1256]        17684
6237    [20111102,20111103,20111104,20111105,20111106,20111107,20111108,20111109,20111110,20111111,20111112,20111113,20111114,20111115,20111116,20111117,20111118,20111119,20111120,20111121,20111122,20111123,20111124,20111125,20111126,20111127,20111128,20111129,20111130,20111201,20111202,20111203,20111204,20111205,20111206]        [286,291,276,274,339,424,428,408,458,401,354,398,543,571,631,581,515,516,571,721,768,943,960,917,899,941,1168,1274,1256,1527,1472,1315,1323,1741,1826]      27316
10778   [20111102,20111103,20111104,20111105,20111106,20111107,20111108,20111109,20111110,20111111,20111112,20111113,20111114,20111115,20111116,20111117,20111118,20111119,20111120,20111121,20111122,20111123,20111124,20111125,20111126,20111127,20111128,20111129,20111130,20111201,20111202,20111203,20111204,20111205,20111206]        [48,47,37,41,49,35,49,47,59,46,37,41,39,67,60,59,45,39,44,58,62,47,62,56,70,55,57,53,35,59,57,42,45,61,63]  1771
10778   [20111102,20111103,20111104,20111105,20111106,20111107,20111109,20111110,20111111,20111112,20111113,20111114,20111115,20111116,20111117,20111118,20111119,20111120,20111121,20111122,20111123,20111124,20111125,20111126,20111127,20111128,20111129,20111130]       [48,47,37,41,49,35,47,59,46,37,41,39,67,60,59,45,39,44,58,62,47,62,56,70,55,57,53,35]   1395
10778   [20111102,20111103,20111104,20111105,20111106,20111107,20111108,20111109,20111110,20111111,20111112,20111113,20111114,20111115,20111116,20111117,20111118,20111119,20111120,20111121,20111122,20111123,20111124,20111125,20111126,20111127,20111128,20111129,20111130,20111201,20111202,20111203,20111204,20111205,20111206,20111207,20111208,20111209,20111210,20111211,20111212,20111213,20111214,20111215,20111216]  [48,47,37,41,49,35,49,47,59,46,37,41,39,67,60,59,45,39,44,58,62,47,62,56,70,55,57,53,35,59,57,42,45,61,63,55,10633,51,41,36,37,62,41,53,45] 12825

The Goal
Notice that the row id's repeat a variable number of times.  My task was to use the first column as an index, and only take the last line in the file related to that index.


The Solution
The solution looks something hideous like this:
awk -F"\t" 'NR!=1 {a[$1]; b[$1]=$2 ; c[$1]=$3; d[$1]=$4} \
END{OFS="\t";for (i in a) printf("%s\t%s\t%s\t%s\n", i, b[i], c[i], d[i])}' \
file.txt

Breaking it apart into separate lines helps a bit (the backslash allows you to enter the pieces of the command on multiple lines in a Linux terminal window):
awk -F"\t"\
'NR!=1 {a[$1]; b[$1]=$2 ; c[$1]=$3; d[$1]=$4} \
END\
{OFS="\t";\
for (i in a)\ 
printf("%s\t%s\t%s\t%s\n", i, b[i], c[i], d[i])}'|\
file.txt

Let's take this apart piece by piece to see what it means.

awk -F"\t" start awk and use the tab delimiter to separate out the columns
NR!=1 read in all but the first record
a an array of values called a
a[$1] assign values to array named a, the values are to be found in the first column ($1) of the data file
b[$1]=$2 for the array b, at the index of $1, assign value of the second column ($2) of the data file
c[$1]=$3 for the array c, at the index of $1, assign value of the third column ($3) of the data file
d[$1]=$4 for the array d, at the index of $1, assign value of the fourth column ($4) of the data file
END do the following at the last
OFS="\t" output field separator is a tab
for (i in a) for each value in the array a, do the following
print.. print the current value for i (the index, the first column)
print the last value for the second column at index i: b[i]
print the last value for the third column at index i: c[i]
print the last value for the fourth column at index i: d[i]

Final Output
Now when I run the program, I get this for the output:
6237    [20111102,20111103,20111104,20111105,20111106,20111107,20111108,20111109,20111110,20111111,20111112,20111113,20111114,20111115,20111116,20111117,20111118,20111119,20111120,20111121,20111122,20111123,20111124,20111125,20111126,20111127,20111128,20111129,20111130,20111201,20111202,20111203,20111204,20111205,20111206]        [286,291,276,274,339,424,428,408,458,401,354,398,543,571,631,581,515,516,571,721,768,943,960,917,899,941,1168,1274,1256,1527,1472,1315,1323,1741,1826]      27316
10778   [20111102,20111103,20111104,20111105,20111106,20111107,20111108,20111109,20111110,20111111,20111112,20111113,20111114,20111115,20111116,20111117,20111118,20111119,20111120,20111121,20111122,20111123,20111124,20111125,20111126,20111127,20111128,20111129,20111130,20111201,20111202,20111203,20111204,20111205,20111206,20111207,20111208,20111209,20111210,20111211,20111212,20111213,20111214,20111215,20111216]  [48,47,37,41,49,35,49,47,59,46,37,41,39,67,60,59,45,39,44,58,62,47,62,56,70,55,57,53,35,59,57,42,45,61,63,55,10633,51,41,36,37,62,41,53,45] 12825

Note there are only two lines now instead of seven and that I only have the last two instances of the value indexed in column one.

And that's some funky AWK magic.

Reference
http://en.wikipedia.org/wiki/AWK

No comments:

Feel free to drop me a line or ask me a question.