Climbing Everest with #CopyPasteCris
This has been a while coming, and I finally have some time to sit down and write it all out. This is the story of how I’ve more or less accidentally gotten involved in a fairly large plagiarism scandal that’s been running on Twitter among romance writers.
First, some back-story.
The Brazilian Romance Novelist
Sometime in mid-February, a story broke on Courtney Milan’s blog. (Courtney is a well-known romance writer whom I’ve been following for a while because she’s awesome.) She had a very serious accusation: another writer, Cristiane Serruya, had plagiarized chunks of her novel, The Duchess War, in a book called Royal Love. A reader had tipped her off. She laid out the evidence, and it was compelling: the matching text in the two books could not be a coincidence.
I heard about it on Twitter, naturally, and I read the blog post just as more people were starting to look closely at Serruya’s other books. Personally, I was pretty shocked. It seemed pretty wild to me that no one had noticed before this—not even Amazon, and they definitely have the resources to detect and block this kind of thing.
More and more material came out. More instances of copying were found from other books, and different authors. It looked like she had mix’n’matched scenes from many different books, and changed some small details like names, and cobbled them all into something like a plot. Cris Serruya went into digital hiding pretty quickly; her social media shut down, her website went dark. Her official line was that she had used ghostwriters, and they had been the source of the plagiarism.
Yeah, no one believed that.
Then I spotted Nora Roberts’ blog post, and I felt my heart breaking. She was one of those who had been plagiarized, and this was certainly not the first time. Here’s what she said:
I can’t describe what I felt in that moment, the shock, the grief, the sense of betrayal.
This isn’t okay. None of it is okay. There is nothing more personal to authors than their own stories, even if they ultimately do it for money. We bleed for them, we pour our very souls into them, and not even God will help you if you try to take them. I knew what Nora was saying. I could feel her anger, and I got angry as well.
How dare someone do this to us? How dare she steal the words we have suffered for? HOW DARE SHE?
Anyway, one Wednesday night, I was bored, and I was angry. And I decided to do something.
So it all started because I got an idea.
When I’m not pretending to be a fantasy author, I’m a senior web programmer, and my specialty is data processing. For the record, I love my day job. I’d still be doing it even if I became a bestseller! I build cutting-edge web applications, frequently involving truly staggering amounts of data, and I have to find ways of processing it, converting it into different forms, analyzing it, and producing useful metrics from it without taking days. I’m uniquely suited to solving this particular problem - and it is just a data problem, when you think about it.
No one had picked up this obvious case of plagiarism by technical means. So I asked myself, as I sometimes do, how would I have done it, if I’d been asked to build a system to do it? The answer became the core of what eventually turned into the algorithm - a program that could find similar text between two ebooks, even if the text had been paraphrased or the names changed.
There were limitations. Too much paraphrasing meant it wouldn’t recognize similarity, and it would probably come back with complete nonsense sometimes. But it just might work.
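For illustration only, here is a minimal Python sketch of one common way to attack this kind of problem: break each book into sentences, drop words that look like character names, turn what is left into overlapping three-word "shingles", and score sentence pairs by how many shingles they share (Jaccard similarity). This is not the PHP script described in this post. The function names, the name-dropping heuristic, the shingle size, the 0.5 threshold, and the two sample sentences are all assumptions invented for the example.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words_of(sentence):
    return re.findall(r"[A-Za-z']+", sentence)

def probable_names(sentences):
    """Heuristic (assumption): a capitalized word that is not the first
    word of its sentence is probably a name."""
    names = set()
    for sentence in sentences:
        for word in words_of(sentence)[1:]:
            if word[0].isupper():
                names.add(word.lower())
    return names

def shingles(sentence, names, n=3):
    """Set of overlapping n-word tuples, lowercased, with probable names removed."""
    tokens = [w.lower() for w in words_of(sentence) if w.lower() not in names]
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Shared shingles divided by all distinct shingles; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_matches(text_a, text_b, n=3, threshold=0.5):
    """Return (score, sentence_a, sentence_b) for every pair at or above threshold."""
    sents_a, sents_b = split_sentences(text_a), split_sentences(text_b)
    names_a, names_b = probable_names(sents_a), probable_names(sents_b)
    shingled_b = [(s, shingles(s, names_b, n)) for s in sents_b]
    matches = []
    for sent_a in sents_a:
        sh_a = shingles(sent_a, names_a, n)
        for sent_b, sh_b in shingled_b:
            score = jaccard(sh_a, sh_b)
            if score >= threshold:
                matches.append((score, sent_a, sent_b))
    return sorted(matches, reverse=True)

# Invented sample sentences, not quoted from any real book.
original = ("When Minnie looked at the duke across the crowded library, "
            "she did not trust him.")
suspect = ("The rain fell softly on the roof. When Angela looked at the prince "
           "across the crowded library, she did not trust him.")

for score, a, b in find_matches(original, suspect):
    print(f"{score:.2f}\n  A: {a}\n  B: {b}")
```

The sketch shows both limitations mentioned above. Swapping a name or a single word still leaves most shingles intact, so the pair scores above the threshold, but heavier paraphrasing breaks up the shingles and the match disappears. Comparing every sentence against every other sentence is also slow on full-length books.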
So on that Wednesday night, I started to write some code. It was just a PHP script, nothing special, but I had a feeling that it would work pretty well. Then I found a copy of The Duchess War on Smashwords, and after a few tweets, one of my followers sent me a link to a copy of Royal Love.
I did the first run on those two books, and the results looked pretty good.
I started talking about what I’d done on Twitter. At this point, the #CopyPasteCris hashtag had started to take off in a big way, and Kristy from CaffeinatedFae had started the List. This was the #CopyPasteCrisList, and it was my go-to reference for the authors and books that had been plagiarized by Serruya. I needed more books to test.
Nora Roberts’ publicist had left her email at the end of the blog post, and I contacted her and asked for Nora’s books. I wasn’t expecting a response - I’m just some random nobody, after all.
Then Courtney Milan tweeted back at me, and things got interesting.
I started with her books, and as many of Serruya’s books as I’d been able to find, either in the less-than-legit corners of the Internet, or through copies and links sent to me. The algorithm worked beyond my expectations, and I kept tweaking it and refining the accuracy in different ways. I wanted to help, to find everything, to answer the question: just how much had she taken from us? I could do it, I knew I could. It felt like I was the only one who would. Amazon didn’t care, obviously, and no one else had the technical knowledge. All I needed were the books, from all the authors she’d ripped off. The algorithm was slow, but it worked.
I knew I had something powerful here. I just had to get the word out about it.
I tweeted, and authors started to retweet me, and once word started to spread, I got in contact with more and more people. I got emails. Nora Roberts’ people got in touch, and I can’t describe how intimidating that was, because it’s Nora freaking Roberts, for gods’ sake. I just kept setting up the algorithm to run more and more batches of books, and it kept finding more and more confirmed hits, and I kept sending back reports to the authors who’d contacted me.
The biggest batch, by far, was running almost two hundred titles for Nora. I had to let it run overnight, but when I checked the output in the morning, there were confirmed hits in nine.
Nine titles. She’d ripped off scenes from everything from Untamed, published in 1983, to Whiskey Beach, published in 2013. She thought no one would notice. I remember sitting at my computer, staring at the output files, and feeling shaky. Scared. Sick, even. Like I’d been hit in the head.
I didn’t realize it would affect me the way it did, watching my results folder grow with every batch. It wasn’t my work. I felt so bad, having to send every report back to these authors. They had to know, didn’t they? At least they’d know the extent of the lifted material in their books. But I got to see everything, all at once, and there was just so much of it that I sometimes had to take a break.
I always came back and I never stopped running batches. Every time I thought I might be done, I’d get another email, another author who wanted to know what had been taken from them, and I’d get angry all over again. It felt like climbing Everest sometimes but fuck it, it had to be me, and I’d be goddamned if I let someone do this to us and get away with it.
So, where are we at right now?
I’ve been sending Kristy more and more titles to add to the List. She’s been keeping it updated with buy links so that readers can support the authors affected by all this—she’s far more organized than me and I can’t say how much I appreciate that. Go and buy those authors’ books, if you can.
I’m still running batches. Not as many as before, but as long as authors keep asking, I’m going to keep doing it. I’ve pulled the results together and created a kind of visual barcode of one of Serruya’s books that shows what was copied from elsewhere.
My Twitter follower count has completely exploded, and I don’t even know what to do about having so much attention, but hopefully I can use it for good.
Finally, I’m going to turn the algorithm into a web application. We need this, all of us. No one else cares as much as authors do about protecting their work. No one else will build it because, frankly, it’d cost too much for almost no business benefit, and the technology involved is… well, like climbing Everest. It’s unknown territory. Nothing like it exists yet.
But someone’s got to do it and it might as well be me. Because no one should be able to get away with this, to make money from shit like this, off the backs of better authors and with the theft of their stories. Because fuck plagiarists.
Any authors reading this: use my Contact page to reach out, if you need your books checked. Readers, support authors—and stay vigilant. If it wasn’t for you, none of this would have been discovered, and I know the whole writing community is grateful to you.