In today’s Programming Praxis exercise, our goal is to write a program to analyse whether two books were written by the same author. Let’s get started, shall we?
import Data.Char
import Data.List
import Data.List.Split
import qualified Data.List.Key as K
import qualified Data.Map as M
We record four pieces of information about a book: a list of the words used, the average sentence length, the average paragraph length, and the percentage of punctuation characters.
data Info = Info { _words          :: [String]
                 , _sentenceLength :: Float
                 , _paraLength     :: Float
                 , _puncPct        :: Float
                 }
Extracting these four properties from the text of a book is fairly straightforward.
avg :: (Fractional a, Integral a1) => [a1] -> a
avg xs = fromIntegral (sum xs) / fromIntegral (length xs)

sentenceLength :: String -> Float
sentenceLength = avg . map length . splitOneOf ".!?"

paragraphLength :: String -> Float
paragraphLength = avg . map (length . words . unlines) . splitOn [""] . lines

punctuationPct :: String -> Float
punctuationPct text = fromIntegral (length $ filter isPunctuation text) /
                      fromIntegral (length text) * 100

process :: String -> Info
process text = Info (words . filter (not . isPunctuation) $ map toLower text)
                    (sentenceLength text) (paragraphLength text)
                    (punctuationPct text)
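As a quick sanity check of the paragraph splitting (which relies on the split package's Data.List.Split, already imported above), here is the paragraph-length helper run standalone on a made-up two-paragraph snippet, where a blank line separates the paragraphs:

```haskell
import Data.List.Split (splitOn)

avg :: (Fractional a, Integral b) => [b] -> a
avg xs = fromIntegral (sum xs) / fromIntegral (length xs)

-- Paragraphs are the blank-line-separated groups of lines.
paragraphLength :: String -> Float
paragraphLength = avg . map (length . words . unlines) . splitOn [""] . lines

-- The first paragraph has 5 words and the second 2, so the average is 3.5.
main :: IO ()
main = print (paragraphLength "one two three\nfour five\n\nsix seven")
```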
We use the words of a book to determine its 100 most frequently used n-grams, on the assumption that every writer has certain expressions they use often.
topNgrams :: Int -> [String] -> [[String]]
topNgrams n ws = take 100 . map fst . K.sort (negate . snd) . M.assocs $
                 M.fromListWith (+) . map (flip (,) 1 . take n) $
                 foldr ($) (tails ws) $ replicate n init
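Note that `Data.List.Key` comes from a third-party package. If only base and containers are available, the same counting and ranking can be sketched with `sortBy` and `Down` instead of `K.sort` (assumed equivalent up to the ordering of ties):

```haskell
import Data.List (sortBy, tails)
import Data.Ord (Down(..), comparing)
import qualified Data.Map as M

-- Count every length-n window of words and keep the 100 most frequent.
-- foldr ($) (tails ws) (replicate n init) applies init n times, dropping
-- the n trailing suffixes that are too short to yield a full n-gram.
topNgrams' :: Int -> [String] -> [[String]]
topNgrams' n ws = take 100 . map fst . sortBy (comparing (Down . snd))
                . M.assocs . M.fromListWith (+)
                $ [ (take n s, 1 :: Int)
                  | s <- foldr ($) (tails ws) (replicate n init) ]

main :: IO ()
main = print (topNgrams' 2 (words "to be or not to be"))
```

On this toy input, ("to","be") occurs twice while every other bigram occurs once, so `["to","be"]` comes out first.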
To calculate the similarity of two books, we take a weighted sum of the number of shared n-grams of lengths 3, 4, and 5, minus the differences in sentence length, paragraph length, and punctuation use. The higher the score, the more similar the two books are.
similarity :: Info -> Info -> Float
similarity (Info wsA slA plA puA) (Info wsB slB plB puB) =
    1 * fromIntegral (length $ intersect (topNgrams 3 wsA) (topNgrams 3 wsB)) +
    2 * fromIntegral (length $ intersect (topNgrams 4 wsA) (topNgrams 4 wsB)) +
    4 * fromIntegral (length $ intersect (topNgrams 5 wsA) (topNgrams 5 wsB)) -
    abs (slA - slB) - abs (plA - plB) - 10 * abs (puA - puB)
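To see how the weights interact, here is the same formula evaluated on made-up numbers (not taken from any of the books below): 30 shared trigrams, 12 shared 4-grams, 5 shared 5-grams, sentence lengths 20 vs 18, paragraph lengths 100 vs 92, and punctuation percentages 3.5 vs 3.0:

```haskell
-- Hypothetical inputs, purely to illustrate the weighting.
score :: Float
score = 1 * 30                 -- shared trigrams
      + 2 * 12                 -- shared 4-grams
      + 4 * 5                  -- shared 5-grams
      - abs (20 - 18)          -- sentence-length difference
      - abs (100 - 92)         -- paragraph-length difference
      - 10 * abs (3.5 - 3.0)   -- punctuation difference

main :: IO ()
main = print score  -- 30 + 24 + 20 - 2 - 8 - 5 = 59.0
```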
To test our algorithm, we compare a few groups of books.
main :: IO ()
main = do hamlet      <- fmap process $ readFile "F:/hamlet.txt"
          romeo       <- fmap process $ readFile "F:/romeo.txt"
          oliver      <- fmap process $ readFile "F:/oliver.txt"
          huckleberry <- fmap process $ readFile "F:/huckleberry.txt"
          twocities   <- fmap process $ readFile "F:/twocities.txt"
          crusoe      <- fmap process $ readFile "F:/crusoe.txt"
          island      <- fmap process $ readFile "F:/island.txt"
          mystery     <- fmap process $ readFile "F:/sawyer.txt"
          print $ similarity romeo hamlet
          print $ similarity romeo huckleberry
          print $ similarity romeo oliver
          putStrLn "---"
          print $ similarity oliver twocities
          print $ similarity oliver romeo
          print $ similarity oliver huckleberry
          putStrLn "---"
          print $ similarity mystery crusoe
          print $ similarity mystery twocities
          print $ similarity mystery island
          print $ similarity mystery huckleberry
The results are as follows:
-2.1613884
-70.873985
-52.829903
---
51.257236
-52.829903
-4.4081688e-2
---
-271.75955
11.982366
4.711507
22.518066
As we can see, Romeo & Juliet is most similar to Hamlet, Oliver Twist is most similar to A Tale of Two Cities, and our mystery book is correctly identified as Mark Twain's by virtue of being most similar to Huckleberry Finn.
