Last edited by yinyuemi on 2016-07-26 10:23
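It's been a while since I last posted. Today let's go over the upgraded arrays in awk, based on https://www.gnu.org/software/gawk/manual/gawk.html#Arrays
I won't repeat basic awk array usage here (the main array usage for version 3.0+ is covered at http://www.72891.cn/thread-2312439-1-1.html). This post is about two new array features in gawk 4.0+, so if you haven't upgraded yet, now is the time.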
1. Predefined array traversal order
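Normally, when you output an array with for (item in array), the order is undefined, i.e. "unordered". But very often we want the elements in a particular order, such as ascending or descending by numeric value. The usual way to get that was indirect, through asort or asorti.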
The good news is that gawk 4.0+ now offers a very convenient way to control the order in which an array is traversed.
This involves PROCINFO, one of gawk's built-in arrays. You can run this to see everything in it:
- awk 'BEGIN{for(i in PROCINFO){if(isarray(PROCINFO[i])){for( j in PROCINFO[i])print i,j,PROCINFO[i][j]}else{print i,PROCINFO[i]}}}'
The element that controls the traversal order is "sorted_in"; its predefined values are listed below:
PROCINFO["sorted_in"] | Description
@unsorted | Array indexes are processed in arbitrary order (default awk behavior).
@ind_str_asc | The array is sorted with indexes compared as strings in ascending order.
@ind_num_asc | The array is sorted with indexes compared as numbers in ascending order. Non-numeric indexes are treated as zero.
@val_type_asc | The array is sorted based on values as per its type in ascending order. All numbers come before the strings. The sub-arrays come after the strings.
@val_str_asc | The array is sorted based on values of elements, treating the values as strings, in ascending order.
@val_num_asc | The array is sorted based on values of elements, treating values as numbers, in ascending order.
@ind_str_desc | The array is sorted based on index, treated as strings, in descending order.
@ind_num_desc | The array is sorted based on index, treated as numbers, in descending order.
@val_type_desc | The array is sorted based on the value of the element as per its type in descending order. Subarrays come first, then the strings and lastly, the numbers.
@val_str_desc | The array is sorted based on element values, treated as strings, in descending order.
@val_num_desc | The array is sorted based on values, treated as numbers, in descending order.
Straight to the examples:
- # Default mode, i.e. unordered
- awk '
- BEGIN {
-     PROCINFO["sorted_in"] = "@unsorted"
-     fruit["apple"] = 4
-     fruit["mango"] = 12
-     fruit["guava"] = 8
-     fruit["banana"] = 16
-     for (j in fruit)
-         printf("%s: %d numbers\n", j, fruit[j])
- } '
- guava: 8 numbers
- mango: 12 numbers
- apple: 4 numbers
- banana: 16 numbers
- # Ascending by numeric value
- awk '
- BEGIN {
-     PROCINFO["sorted_in"] = "@val_num_asc"
-     fruit["apple"] = 4
-     fruit["mango"] = 12
-     fruit["guava"] = 8
-     fruit["banana"] = 16
-     for (j in fruit)
-         printf("%s: %d numbers\n", j, fruit[j])
- } '
- apple: 4 numbers
- guava: 8 numbers
- mango: 12 numbers
- banana: 16 numbers
- # Descending by index, compared as strings
- awk '
- BEGIN {
-     PROCINFO["sorted_in"] = "@ind_str_desc"
-     fruit["apple"] = 4
-     fruit["mango"] = 12
-     fruit["guava"] = 8
-     fruit["banana"] = 16
-     for (j in fruit)
-         printf("%s: %d numbers\n", j, fruit[j])
- } '
- mango: 12 numbers
- guava: 8 numbers
- banana: 16 numbers
- apple: 4 numbers
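As the saying goes, "three examples is plenty", so I'll stop there for now.
Don't asort/asorti look pretty weak next to this sorted_in "control valve"?!
Friendly reminder: PROCINFO["sorted_in"] is a global setting. Once it is set, it changes how every for-in loop in the awk program traverses arrays, so if you only want it in a limited scope, you can do it like this.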
- …
- if ("sorted_in" in PROCINFO)
-     save_sorted = PROCINFO["sorted_in"]   # remember the current order, if any
- PROCINFO["sorted_in"] = "@val_str_desc" # or whatever
- …
- if (save_sorted)
-     PROCINFO["sorted_in"] = save_sorted   # put the saved order back
In fact, besides the built-in traversal orders, sorted_in can also be set to the name of a user-defined comparison function.
Such a function follows this general skeleton:
- function comp_func(i1, v1, i2, v2) # takes at least 4 parameters
- {
-     compare elements 1 and 2 in some fashion
-     return < 0; 0; or > 0
- }
An example:
- awk '
- BEGIN {
-     arr[1] = 10
-     arr[2] = 2
-     arr[3] = 100
-     arr["one"] = 10
-     arr["two"] = 1
-     arr["three"] = 100
-     PROCINFO["sorted_in"] = "cmp_num_val_desc"
-     print "#exactly the same as @val_num_desc"
-     for (i in arr)
-         print "arr["i"] = " arr[i]
-     print "Now change the rule to: 1. index: strings first, numbers after 2. within the same kind of index, value descending"
-     PROCINFO["sorted_in"] = "cmp_smart_desc"
-     print "#sort in a smarter way"
-     for (i in arr)
-         print "arr["i"] = " arr[i]
- }
- function cmp_num_val_desc(i1, v1, i2, v2)
- {
-     # numerical value comparison, descending order
-     return (v2 - v1)
- }
- function cmp_smart_desc(i1, v1, i2, v2,    n1, n2)
- {
-     # string indexes before numeric indexes; within each group, value descending
-     n1 = i1 + 0
-     n2 = i2 + 0
-     if (n1 != i1)                           # i1 is not a number
-         return (n2 != i2) ? (v2 - v1) : -1
-     else if (n2 != i2)                      # i1 is a number, i2 is not
-         return 1
-     return v2 - v1
- }
- '
- #exactly the same as @val_num_desc
- arr[three] = 100
- arr[3] = 100
- arr[one] = 10
- arr[1] = 10
- arr[2] = 2
- arr[two] = 1
- Now change the rule to: 1. index: strings first, numbers after 2. within the same kind of index, value descending
- #sort in a smarter way
- arr[three] = 100
- arr[one] = 10
- arr[two] = 1
- arr[3] = 100
- arr[1] = 10
- arr[2] = 2
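2. Arrays of Arrays
With this feature, awk can create true multidimensional arrays instead of simulating them with a one-dimensional array as earlier versions did.
If you are familiar with Perl hashes, you can think of it as a hash of hashes.
First, let's see what an array of arrays looks like in the flesh: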
- a[1][1]=1
- a[1][2]=2
- a[1][3]=3
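Looks familiar, doesn't it? Some other languages use exactly the same syntax.
That's right, this is a typical two-dimensional array: the first-dimension index is 1, and the second-dimension indexes are 1, 2 and 3.
In fact, to keep index usage flexible in every dimension, the following form is still supported as well: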
- a[1][1,"a"]=1
- a[1][2,"a"]=2
- a[1][3,"a"]=3
Also, the value of an element at any level can be either a scalar or a subarray:
- a[1][1,"a"]=1
- a[2]=2
- a[3][3][4]=3
So, after all that, how do you print an array of arrays? It's actually very simple:
- for (i in array)
-     for (j in array[i])
-         print array[i][j]
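When you don't know whether the value at some level is a scalar or a subarray, you can add a check.
How? Also very simple: newer gawk already provides the function for you, and it is called isarray.
The official documentation even ships a ruthless walk_array that reaches every corner of the array.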
- function walk_array(arr, name,    i)
- {
-     for (i in arr) {
-         if (isarray(arr[i]))
-             walk_array(arr[i], (name "[" i "]"))
-         else
-             printf("%s[%s] = %s\n", name, i, arr[i])
-     }
- }
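I haven't written documentation in ages, and writing this much in one go has left me drained. Enough talk: one last "combo" using both new features and I'll wrap up the post!
Emulating sort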
- cat file
- abc 123 100
- abc 456 100
- abc 456 10
- def 123 10
- def 123 100
- abc 123 1
- xzy 789 0
- # sort: column 1 ascending alphabetically, column 2 ascending numerically, column 3 descending numerically
- sort -k1,1 -k2,2n -k3,3nr file
- abc 123 100
- abc 123 1
- abc 456 100
- abc 456 10
- def 123 100
- def 123 10
- xzy 789 0
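- # awk 3.0+ version
- awk '
- {
-     a[$1" "$2" "$3]
-     b[$1] = $1
-     c[$2]
-     d[$3]
- }
- END {
-     # sort once up front; asorti compares indexes as strings, which is fine for this data
-     nb = asort(b); nc = asorti(c, e); nd = asorti(d, f)
-     for (i = 1; i <= nb; i++)
-         for (j = 1; j <= nc; j++)
-             for (k = nd; k >= 1; k--)
-                 if ((b[i]" "e[j]" "f[k]) in a)
-                     print b[i], e[j], f[k]
- }
- ' file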
- abc 123 100
- abc 123 1
- abc 456 100
- abc 456 10
- def 123 100
- def 123 10
- xzy 789 0
- # gawk 4.0+ version
- awk '
- {
-     arr[$1][$2][$3]
- }
- END {
-     PROCINFO["sorted_in"] = "@ind_str_asc"            # column 1: string ascending
-     for (i in arr) {
-         PROCINFO["sorted_in"] = "@ind_num_asc"        # column 2: numeric ascending
-         for (j in arr[i]) {
-             PROCINFO["sorted_in"] = "@ind_num_desc"   # column 3: numeric descending
-             for (k in arr[i][j])
-                 print i, j, k
-         }
-     }
- }
- ' file
- abc 123 100
- abc 123 1
- abc 456 100
- abc 456 10
- def 123 100
- def 123 10
- xzy 789 0
Comparing the two awk versions, isn't the gawk one much clearer and easier to follow?
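Phew, finally done. I hope this gives you some ideas and help. Take it as a starting point, and if you spot any mistakes, please don't hesitate to point them out!
One last thing: gawk 4.0 has plenty of other fancy features. If you're interested, have a look at http://www.72891.cn/thread-3559813-1-1.html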