Algoritmo efficiente per trovare tutte le stringhe "uguali ai caratteri"?

Come possiamo scrivere una funzione efficiente che emetta "homoglyph equivalents" di una stringa di input?Algoritmo efficiente per trovare tutte le stringhe "uguali ai caratteri"?

Esempio 1 (pseudo-codice):

homoglyphs_list = [ 
        ["o", "0"], // "o" and "0" are homoglyphs 
        ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs 
        ] 

input_string = "someinput" 
output   = [ 
        "someinput", "s0meinput", "somelnput", 
        "s0melnput", "some1nput", "s0me1nput" 
        ]

Esempio 2:

homoglyphs_list = [ 
        ["rn", "m", "nn"], 
        ] 

input_string = "rnn" 
output   = ["rnn", "rm", "mn", "rrn", "nnn", "nm", "nrn"]

Esempio 3:

homoglyphs_list = [ 
        ["d", "ci", "a"], // "o" and "0" are homoglyphs 
        ["i", "l", "1"] // "i" and "l" and "1" are homoglyphs 
        ] 
        /* 
        notice that with the two rules above, 
        we can infer "d" = "ci" = "a" = "cl" = "c1" 
        */ 

input_string = "pacerier" 
output   = [ 
        "pacerier", "pacerler", "pacer1er", "pcicerier", 
        "pcicerler", "pcicer1er", "pclcerier", "pc1cerier", 
        "pclcerler", "pc1cerler", "pclcer1er", "pc1cer1er", 
        "pdcerier", "pdcerler", "pdcer1er" 
        ]

Nota: l'ordine dei membri nell'array di output non è importante e possiamo supporre che i dati relativi agli homoglyph siano considerati appropriati (gli input non ci danno un "ciclo infinito").

Il mio algoritmo attuale funziona, ma sta usando la bruteforcing raw e le prestazioni sono orribili. Per esempio. un ingresso di "mmmmm" con homoglyphs ["rn", "m", "nn"] prende 38 secondi per eseguire:

// This is php code (non-pseudo just so we could test the running time), 
// but the question remains language-agnostic 

public function Func($in, Array $mappings){ 
    $out_table = array(); 
    $out_table[$in] = null; 
    while(true){ 
     $number_of_entries_so_far = count($out_table); 
     foreach(array_keys($out_table) as $key){ 
      foreach($mappings as $mapping){ 
       foreach($mapping as $value){ 
        for($a=0, $aa=strlen($key); $a<$aa; ++$a){ 
         $pos = strpos($key, $value, $a); 
         if($pos === false){ 
          continue; 
         } 
         foreach($mapping as $equal_value){ 
          if($value === $equal_value){ 
           continue; 
          } 
          $out_table[substr($key, 0, $pos) . $equal_value . substr($key, $pos + strlen($value))] = null; 
         } 
        } 

       } 
      } 
     } 
     if($number_of_entries_so_far === count($out_table)){ 
      // it means we have tried bruteforcing but added no additional entries, 
      // so we can quit now 
      break; 
     } 
    } 
    return array_keys($out_table); 
}

Come possiamo implementare un algoritmo di espansione homoglyph efficace (veloce)?

fonte

2013-08-05 Pacerier

E cosa farà se scrivo come '$ homoglyph_mappings [0] = array (" n "," nn "," nnn ");' ?? – Rajan

@Rajan, come affermato, possiamo supporre che gli input siano corretti (cioè su input impropri, si ottiene un comportamento indefinito). Il tuo esempio è un caso di input improprio perché ci darebbe un ciclo infinito. Se 'n = nn', quindi' nn = nnnn', quindi 'nnnn = nnnnnnnn', quindi' nnnnnnnn = nnnnnnnnnnnnnnnn', etc, etc, * infinitamente * ... – Pacerier

Ok, quindi è un caso _eccezionale_. – Rajan

Si dovrebbe ottenere un po 'di prestazioni da un'implementazione ricorsiva, tuttavia non ho pensato molto alle prestazioni effettive. Questo eviterebbe di ripetere il loop della stringa di output più volte e di contare l'output ogni iterazione. Inoltre ho aggiunto un po 'di "cache" per evitare di calcolare due volte lo stesso homoglyph, per semplicità non controlla i mapping, ma puoi implementarlo o semplicemente cancellare la cache prima di ogni chiamata in cui le mappature cambiano (ma questo è brutto e facilmente introduce bug).

Codice assomiglia:

cache = {} 
def homoglyph(input_string, mappings): 
    output = [] 
    character = input_string[0] 
    if input_string in cache: 
     return cache[input_string] 
    elif not input_string: 
     return [] 
    for charset in mappings: 
     if character in charset: 
      temp = input_string 
      sub_homoglyphs = homoglyph(input_string[1:], mappings) 
      for char in charset: 
       temp[0] = char 
       output.append(temp) 
       #adding the homoglyph character to the rest of the string 
       for subhomoglyph in sub_homoglyphs: 
        output.append(char+subhomoglyph) 
    cache[input_string] = output 
    return output

(Questo è scritto con Python come io non sono esperto nella sintassi php, non ho effettivamente eseguito questo modo ci possono essere errori ma il succo della logica c'è)

fonte

2013-08-20 20:13:24 Don

Algoritmo efficiente per trovare tutte le stringhe "uguali ai caratteri"?

risposta

Problemi correlati