Editing distance clusters: a greedy string classification algorithm

栏目: IT技术 · 发布时间: 5年前

Editing Distance Clusters

Matthew
Mattias
Luke
John
James
Thomas
Nick
Zack
Mark

A greedy string classification algorithm

This algorithm attempts to classify strings by their editing distance. It first computes the case-insensitiveediting distances between all input strings in O( n ² ab ), where a and b the maximum string lengths. Normalized editing distances are stored in a square matrix, M. Normalized, in this context, means that the editing distance is divided by the length of the longer of the two strings. The value M i,j = d means that the normalized editing distance between elements i and j is equal to d . The diagonal of this matrix should consist of all zeros.

Next, the algorithm sorts the columns of this distance matrix by the simple invariant that the element immediately to the right of the diagonal is the least element on that row right of the diagonal. Symbolically, the invariant states that for matrix M of size m×n , M i , i +1 ≤ M i,j for all i +1 < jn . This "sorting" is basically an insertion sort and occurs in Θ( n ²).

Finally, the algorithm assigns clusters by the normalized editing distance. This is a simple comparison. The threshold for differentiating clusters configurable by the user. If M i , i +1 > threshold (more different) then the elements i and i +1 are classified in different clusters.

My goal with all of this is to automatically classify Syslog messages. This application looks promising! Try pasting the below test inputs into the text area above to see what interesting clusters you can come up with.

Test inputs and recommended thresholds

"Nice" clusters (0.50)

Thomas
Zack
John
Matthew
Luke
Mattias
Nick
Mark
James

Even nicer clusters (.40)

space
spouse
cats
dot
house
cat
dig
dog
mouse
horse

Low variation (.45)

Sunday
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday

High variation (.75)

3505220008376247808
4148830745941488640
4385269805240825344
3925546062580223488
6963321067607564288
1081202811808172672
6074028453849956352
3620542548649959936
5167561983569581056
2577125115152971776

Very high variation (.80)

white
green
blue
purple
magenta
orange
cyan
pink
effervescent

Long input (notice Grover Cleveland) (.50)

George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe
John Quincy Adams
Andrew Jackson
Martin Van Buren
William Henry Harrison
John Tyler
James K. Polk
Zachary Taylor
Millard Fillmore
Franklin Pierce
James Buchanan
Abraham Lincoln
Andrew Johnson
Ulysses S. Grant
Rutherford B. Hayes
James Garfield
Chester A. Arthur
Grover Cleveland
Benjamin Harrison
Grover Cleveland
William McKinley
Theodore Roosevelt
William Howard Taft
Woodrow Wilson
Warren G. Harding
Calvin Coolidge
Herbert Hoover
Franklin D. Roosevelt
Harry S. Truman
Dwight D. Eisenhower
John F. Kennedy
Lyndon B. Johnson
Richard M. Nixon
Gerald R. Ford
James Carter
Ronald Reagan
George H. W. Bush
William J. Clinton
George W. Bush
Barack Obama
Donald J. Trump

Interface state change Syslog messages (.20)

00:00:46: %LINK-3-UPDOWN: Interface Port-channel1, changed state to up
00:00:47: %LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to up
00:00:47: %LINK-3-UPDOWN: Interface GigabitEthernet0/2, changed state to up
00:00:48: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan1, changed state to down
00:00:48: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/1, changed state to down
*Mar  1 18:46:11: %SYS-5-CONFIG_I: Configured from console by vty2 (10.34.195.36)
18:47:02: %SYS-5-CONFIG_I: Configured from console by vty2 (10.34.195.36)
*Mar  1 18:48:50.483 UTC: %SYS-5-CONFIG_I: Configured from console by vty2 (10.34.195.36)

Router adjacency change Syslog messages (.20)

*Mar 15 20:30:22.475: %SPANTREE-2-PVSTSIM_FAIL: Blocking designated port Fa0/1: Inconsitent superior PVST BPDU received on VLAN 10, claiming root 24586:0009.b752.d780
*Mar 15 20:30:22.479: %SPANTREE-2-PVSTSIM_FAIL: Blocking designated port Fa0/2: Inconsitent superior PVST BPDU received on VLAN 10, claiming root 24586:0009.b752.d780
*Mar 15 20:30:22.479: %SPANTREE-2-PVSTSIM_FAIL: Blocking designated port Fa0/3: Inconsitent superior PVST BPDU received on VLAN 10, claiming root 24586:0009.b752.d780
*Mar 15 20:30:22.479: %SPANTREE-2-PVSTSIM_FAIL: Blocking designated port Fa0/4: Inconsitent superior PVST BPDU received on VLAN 10, claiming root 24586:0009.b752.d780
*Mar 15 20:30:22.667: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan1, changed state to down
*Mar 15 20:30:22.667: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan50, changed state to down
*Mar 15 20:30:22.671: %DUAL-5-NBRCHANGE: EIGRP-IPv4:(1) 1: Neighbor 172.16.50.50 (Vlan50) is down: interface down
*Mar 15 20:30:22.671: %DUAL-5-NBRCHANGE: EIGRP-IPv4:(1) 1: Neighbor 172.16.50.101 (Vlan50) is down: interface down
*Mar 15 20:30:22.683: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan10, changed state to down
*Mar 15 20:30:22.687: %DUAL-5-NBRCHANGE: EIGRP-IPv4:(1) 1: Neighbor 192.168.1.101 (Vlan10) is down: interface down
*Mar 15 20:30:24.387: %SYS-5-CONFIG_I: Configured from console by console
*Mar 15 20:30:42.479: %SPANTREE-2-PVSTSIM_OK: PVST Simulation inconsistency cleared on port FastEthernet0/2.
*Mar 15 20:30:52.651: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan1, changed state to up
*Mar 15 20:30:52.651: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan10, changed state to up
*Mar 15 20:30:52.679: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan50, changed state to up
*Mar 15 20:30:54.043: %DUAL-5-NBRCHANGE: EIGRP-IPv4:(1) 1: Neighbor 172.16.50.50 (Vlan50) is up: new adjacency
2d16h: %SYS-5-CONFIG_I: Configured from console by console
2d16h: %SPANTREE-2-PVSTSIM_FAIL: Blocking root port Fa0/5: Inconsitent inferior PVST BPDU received on VLAN 20, claiming root 24596:0009.b752.d780
2d16h: %DUAL-5-NBRCHANGE: EIGRP-IPv4:(1) 1: Neighbor 172.16.50.50 (Vlan50) is up: new adjacency
2d16h: %DUAL-5-NBRCHANGE: EIGRP-IPv4:(1) 1: Neighbor 192.168.1.100 (Vlan10) is up: new adjacency
2d16h: %DUAL-5-NBRCHANGE: EIGRP-IPv4:(1) 1: Neighbor 172.16.50.100 (Vlan50) is up: new adjacency

Spanning tree state change Syslog messages (.20)

Jan  3 19:49:29:I:MSTP: MST 1 Port 1/1/9:1  - Bridge TC Event
Jan  3 19:49:12:I:MSTP: MST 0 Port 1/1/9:1 - DISCARDING
Jan  3 19:49:29:I:MSTP: MST 0 Port 1/1/9:1 - LEARNING
Jan  3 19:49:29:I:MSTP: MST 0 Port 1/1/9:1 - FORWARDING
Jan  3 19:49:12:I:MRP: Interface ethernet 1/1/9:1 of ring 51  Vlan 51, changing to disabled
Jan  3 19:49:29:I:MSTP: MST 0 Port 1/1/11:1  - Bridge TC Event
Jan  3 19:49:29:I:MRP: Interface ethernet 1/1/9:1 of ring 51  Vlan 51, changing to preforwarding
Jan  3 19:50:43:I:MSTP: MST 0 Port 1/1/11:1  - Bridge TC Event
Jan  3 19:49:12:I:System: Interface ethernet 1/1/9:3, state down
Jan  3 19:49:29:I:MSTP: MST 1 Port 1/1/9:1 - FORWARDING
Jan  3 19:49:29:I:MSTP: MST 0 Port 1/1/9:1  - Bridge TC Event
Jan  3 19:49:29:I:System: Logical link on dynamic lag interface ethernet 1/1/9:4 is up.
Jan  3 19:49:29:I:System: Logical link on dynamic lag interface ethernet 1/1/9:1 is up.
Jan  3 20:23:10:I:Security: startup-config was changed by operator from console
Jan  3 19:49:29:I:System: Logical link on dynamic lag interface ethernet 1/1/9:3 is up.
Jan  3 19:49:12:I:System: Logical link on dynamic lag interface ethernet 1/1/9:4 is down.
Jan  3 19:49:29:I:Trunk: Group (1/1/9:1, 1/1/9:2, 1/1/9:3, 1/1/9:4) created by 802.3ad link-aggregation module.
Jan  3 19:49:29:I:System: Interface ethernet 1/1/9:2, state up
Jan  3 19:49:12:I:System: Logical link on dynamic lag interface ethernet 1/1/9:3 is down.
Jan  3 19:49:29:I:System: Interface ethernet 1/1/9:3, state up
Jan  3 19:49:29:I:MSTP: MST 0 Port 1/1/9:1 - DISCARDING
Jan  3 19:49:29:I:MSTP: MST 1 Port 1/1/9:1 - LEARNING
Jan  3 19:49:12:I:MSTP: MST 1 Port 1/1/9:1 - DISCARDING
Jan  3 19:49:12:I:Trunk: Group (1/1/9:1, 1/1/9:2, 1/1/9:3, 1/1/9:4) removed by 802.3ad link-aggregation module.
Jan  3 19:49:12:I:System: Logical link on dynamic lag interface ethernet 1/1/9:1 is down.
Jan  3 19:49:29:I:System: Logical link on dynamic lag interface ethernet 1/1/9:2 is up.
Jan  3 19:49:29:I:MSTP: MST 1 Port 1/1/11:1  - Bridge TC Event
Jan  3 19:49:12:I:System: Interface ethernet 1/1/9:4, state down
Jan  3 19:49:12:I:System: Logical link on dynamic lag interface ethernet 1/1/9:2 is down.
Jan  3 20:17:25:I:Security: startup-config was changed by operator from console
Jan  3 19:49:29:I:MSTP: MST 1 Port 1/1/9:1 - DISCARDING
Jan  3 19:50:43:I:MSTP: MST 1 Port 1/1/11:1  - Bridge TC Event
Jan  3 19:49:29:I:System: Interface ethernet 1/1/9:4, state up
Jan  3 19:49:29:I:System: Interface ethernet 1/1/9:1, state up
Jan  3 19:49:12:I:System: Interface ethernet 1/1/9:2, state down
Jan  3 19:49:29:I:MRP: Interface ethernet 1/1/9:1 of ring 51  Vlan 51, changing to forwarding
Jan  3 19:49:12:I:System: Interface ethernet 1/1/9:1, state down

以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

iOS软件开发揭密

iOS软件开发揭密

虞斌 / 电子工业出版社 / 2011-5-1 / 79.00元

本书以严密的体系性提供了iPhone和iPad软件开发从入门到专家的系统性知识,并提供来源于真实项目的可重用商业代码。书中的每个实例都是项目经验的提炼,深入浅出地讲解iPhone和iPad软件开发的核心技术要点,基本涵盖了iOS软件开发在真实商业项目中所需要的所有主题,并将实例介绍的技术深度和超值的实用性结合在一起,成为本书的特色。 随书附赠的光盘中包含了书中大量案例的完整工程源代码,可以让......一起来看看 《iOS软件开发揭密》 这本书的介绍吧!

XML 在线格式化
XML 在线格式化

在线 XML 格式化压缩工具

UNIX 时间戳转换
UNIX 时间戳转换

UNIX 时间戳转换

正则表达式在线测试
正则表达式在线测试

正则表达式在线测试