The substitution cipher is a simple cipher where every character in the plaintext is substituted by another according to a 1-to-1 mapping, producing a ciphertext. For example, using a mapping such as this:
The plaintext "hello_world" would be enciphered into "hnllqpoqilb".
Not long ago, my CS2107 (Introduction to Information Security) lecturer issued a challenge to crack his ciphertexts. I jumped at this opportunity and wrote a script to face off this challenge:
How it works:
Since there are 27! possible keys, it is not computationally feasible to bruteforce all possible keys. Instead, we use a hill-climbing algorithm (original idea from here) to arrive at a good guess of the actual key. The hill-climbing algorithm works like this:
The plaintext "hello_world" would be enciphered into "hnllqpoqilb".
Not long ago, my CS2107 (Introduction to Information Security) lecturer issued a challenge to crack his ciphertexts. I jumped at this opportunity and wrote a script to face off this challenge:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 | import enchant import random from ngram_score import ngram_score fitness = ngram_score('quadgrams.txt') d = enchant.Dict("en_US") # PROGRAM PARAMETERS ERROR_TOLERANCE = 1 # no. of tolerable non-words ALMOST_THERE_THRESHOLD = 80 # in % SWAP_COUNT = 1000 # no. of swaps in each improve() call print("Input ciphertext: ") ciphertext = input().upper() indices = [] for c in ciphertext: if c != '_': idx = ord(c) - 65 # 65 is ord('A') else: idx = -1 indices.append(idx) def is_word(string): return d.check(string) def decipher(key): chars = map(lambda i: key[i], indices) return "".join(chars) def is_sentence(string): errors, total = wrong_words(string) return errors <= ERROR_TOLERANCE and total > 1 # sentence shouldn't be just a word! def wrong_words(string): words = string.split() errors = 0 for word in words: if not is_word(word): errors += 1 return (errors, len(words)) def correct_words(string): errors, total = wrong_words(string) return (total-errors, total) def score(string): string = string.replace(" ", "") return fitness.score(string) def almost_there(string): errors, total = wrong_words(string) correct = total - errors correctness = correct/total * 100 return correctness >= ALMOST_THERE_THRESHOLD def improve(key, max_score): key = key[:] plaintext_guess = decipher(key) if almost_there(plaintext_guess): max_score = correct_words(plaintext_guess)[0] a = 0 while a < len(key)-1: b = a+1 while b < len(key): key[a], key[b] = key[b], key[a] text = decipher(key) curr_score = correct_words(text)[0] if curr_score > max_score: max_score = curr_score a = 0 b = 1 cycles = 0 print("Improved guess:", text) continue key[a], key[b] = key[b], key[a] b += 1 a += 1 else: count = 0 while count < SWAP_COUNT: a, b = random.sample(range(len(key)), 2) key[a], key[b] = key[b], key[a] text = decipher(key) new_score = score(text) if new_score > max_score: max_score = new_score count = 0 continue key[a], key[b] = key[b], key[a] count += 1 return (key, max_score) plaintext_guess = ciphertext max_score, max_key = -99e9, list("ABCDEFGHIJKLMNOPQRSTUVWXYZ ") while not is_sentence(plaintext_guess): curr_key = max_key[:] if not almost_there(plaintext_guess): random.shuffle(curr_key) curr_text = decipher(curr_key) curr_score = score(curr_text) curr_key, curr_score = improve(curr_key, curr_score) if curr_score > max_score: max_score = curr_score max_key = curr_key[:] plaintext_guess = decipher(max_key) print("Best guess:", plaintext_guess) print("Key:", max_key) |
Dependencies:
The script depends on the enchant library to check if given strings are actual English words. The script also depends on a file that contains quadgram statistics - that is, statistics of how often each 4-character string appears. Finally, the script also uses a script "ngram_score.py" (downloaded off the internet) to score the quadgrams in my program.
How it works:
Since there are 27! possible keys, it is not computationally feasible to bruteforce all possible keys. Instead, we use a hill-climbing algorithm (original idea from here) to arrive at a good guess of the actual key. The hill-climbing algorithm works like this:
- If the current key guess is not good enough, we shuffle it.
- We gradually improve the key so that our plaintext becomes more "English-like" (as determined by a sentence/word's quadgram score). If the final result sounds like a sentence, terminate and return this result. Otherwise, we repeat from step 1.
The bulk of the program is in the function "improve". What happens in the function is it first determines if there are enough correct words in the plaintext, and if the plaintext is "almost there", we try all pairwise swaps to fine-tune our key. Otherwise, we randomly swap any two characters in the key to make our plaintext sound more "English-like". If the swap does not improve our plaintext, we maintain the old key guess.
Eventually, the plaintext sounds more and more like a proper sentence and the program finally returns a good guess. Here's the program in action:
You can see that it starts off with poor guesses. However, as time passes, the guess improves to sound more and more like English and eventually the program returns the correct plaintext.
Too bad I wasn't the first to post the answer to the forum XD
Comments
Post a Comment