TE: Cracking the Substitution Cipher

The substitution cipher is a simple cipher where every character in the plaintext is substituted by another according to a 1-to-1 mapping, producing a ciphertext. For example, using a mapping such as this:

The plaintext "hello_world" would be enciphered into "hnllqpoqilb".

Not long ago, my CS2107 (Introduction to Information Security) lecturer issued a challenge to crack his ciphertexts. I jumped at this opportunity and wrote a script to face off this challenge:

import enchant
import random
from ngram_score import ngram_score
fitness = ngram_score('quadgrams.txt')
d = enchant.Dict("en_US")

# PROGRAM PARAMETERS
ERROR_TOLERANCE = 1 # no. of tolerable non-words
ALMOST_THERE_THRESHOLD = 80 # in %
SWAP_COUNT = 1000 # no. of swaps in each improve() call

print("Input ciphertext: ")
ciphertext = input().upper()
indices = []
for c in ciphertext:
    if c != '_':
        idx = ord(c) - 65 # 65 is ord('A')
    else:
        idx = -1
    indices.append(idx)

def is_word(string):
    return d.check(string)

def decipher(key):
    chars = map(lambda i: key[i], indices)
    return "".join(chars)

def is_sentence(string):
    errors, total = wrong_words(string)
    return errors <= ERROR_TOLERANCE and total > 1 # sentence shouldn't be just a word!

def wrong_words(string):
    words = string.split()
    errors = 0
    for word in words:
        if not is_word(word):
            errors += 1
    return (errors, len(words))

def correct_words(string):
    errors, total = wrong_words(string)
    return (total-errors, total)

def score(string):
    string = string.replace(" ", "")
    return fitness.score(string)

def almost_there(string):
    errors, total = wrong_words(string)
    correct = total - errors
    correctness = correct/total * 100
    return correctness >= ALMOST_THERE_THRESHOLD

def improve(key, max_score):
    key = key[:]
    plaintext_guess = decipher(key)
    if almost_there(plaintext_guess):
        max_score = correct_words(plaintext_guess)[0]
        a = 0
        while a < len(key)-1:
            b = a+1
            while b < len(key):
                key[a], key[b] = key[b], key[a]
                text = decipher(key)
                curr_score = correct_words(text)[0]
                if curr_score > max_score:
                    max_score = curr_score
                    a = 0
                    b = 1
                    cycles = 0
                    print("Improved guess:", text)
                    continue
                key[a], key[b] = key[b], key[a]
                b += 1
            a += 1
    else:
        count = 0
        while count < SWAP_COUNT:
            a, b = random.sample(range(len(key)), 2)
            key[a], key[b] = key[b], key[a]
            text = decipher(key)
            new_score = score(text)
            if new_score > max_score:
                max_score = new_score
                count = 0
                continue
            key[a], key[b] = key[b], key[a]
            count += 1
    return (key, max_score)


plaintext_guess = ciphertext
max_score, max_key = -99e9, list("ABCDEFGHIJKLMNOPQRSTUVWXYZ ")

while not is_sentence(plaintext_guess):
    curr_key = max_key[:]
    if not almost_there(plaintext_guess):
        random.shuffle(curr_key)
    curr_text = decipher(curr_key)
    curr_score = score(curr_text)
    curr_key, curr_score = improve(curr_key, curr_score)
    if curr_score > max_score:
        max_score = curr_score
        max_key = curr_key[:]
        plaintext_guess = decipher(max_key)
        print("Best guess:", plaintext_guess)
        print("Key:", max_key)

Dependencies:

The script depends on the enchant library to check if given strings are actual English words. The script also depends on a file that contains quadgram statistics - that is, statistics of how often each 4-character string appears. Finally, the script also uses a script "ngram_score.py" (downloaded off the internet) to score the quadgrams in my program.

How it works:

Since there are 27! possible keys, it is not computationally feasible to bruteforce all possible keys. Instead, we use a hill-climbing algorithm (original idea from here) to arrive at a good guess of the actual key. The hill-climbing algorithm works like this:

If the current key guess is not good enough, we shuffle it.
We gradually improve the key so that our plaintext becomes more "English-like" (as determined by a sentence/word's quadgram score). If the final result sounds like a sentence, terminate and return this result. Otherwise, we repeat from step 1.

The bulk of the program is in the function "improve". What happens in the function is it first determines if there are enough correct words in the plaintext, and if the plaintext is "almost there", we try all pairwise swaps to fine-tune our key. Otherwise, we randomly swap any two characters in the key to make our plaintext sound more "English-like". If the swap does not improve our plaintext, we maintain the old key guess.

Eventually, the plaintext sounds more and more like a proper sentence and the program finally returns a good guess. Here's the program in action:

You can see that it starts off with poor guesses. However, as time passes, the guess improves to sound more and more like English and eventually the program returns the correct plaintext.

Too bad I wasn't the first to post the answer to the forum XD

Open-Source Thoughts

Search This Blog

TE: Cracking the Substitution Cipher

Comments

Post a Comment