Computers are supposed to make things easier for us. One simple example is deleting the lines in a text file that don’t contain a specific keyword. The task is a no-brainer, but doing it by hand is tedious and very time consuming. Recently I spent some time compiling a list of websites that have copied articles from this blog and published them on their own sites. Although Google does a pretty good job of determining the original publisher, it is still a robot based on a bunch of constantly changing algorithms that can make, and has made, mistakes. Searching by hand for websites that have copied posts from here is very time consuming, so I used Copyscape Premium to run a batch scan on all 2000 articles on this website and track down the plagiarized content.
Copyscape Premium finished scanning all 2000 posts in just 10 hours, and I was able to export the results to a CSV file for further investigation. There are over 20,000 URLs in the list, and I want to categorize the websites by domain name. Not all websites on the list are copycats, but most of those hosted on free services such as blogspot/blogger/wordpress are either scrapers or copy-pasters. Once the URLs are categorized, I can concentrate on filing DMCA complaints with Blogger first, followed by WordPress, instead of jumping back and forth. Linux users can easily delete lines that don’t contain specific words by using the global ex command, but unfortunately on Windows we need extra software to do that. Since I’m a Notepad++ user, I discovered that Notepad++ can automatically delete every line in which a word you specify is not present. Here is an example of how to remove lines that don’t contain the word “blogspot.com”, or in other words, how to keep only the lines that contain “blogspot.com”.
1. Run Notepad++ and either open the text file that you want to edit or paste the text into an empty document.
2. Go to the Search menu and select Find.
3. Go to the Mark tab, check the Bookmark line checkbox, enter blogspot.com in the Find what box, and click the Mark All button. A blue icon will be added to every line that contains the word blogspot.com.
4. Close the Mark window.
5. Go to Search > Bookmark and select Remove Unmarked Lines.
If the text file that you’re editing is very large, it may take a while for the process to complete. Alternatively, if you want to do the opposite and delete the lines that contain the words you specify, select Remove Bookmarked Lines from Search > Bookmark instead. Please see the embedded video below if you’re having trouble following the step-by-step instructions on how to delete lines without the keywords using Notepad++.
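If you would rather script the same filtering than click through the menus, here is a minimal Python sketch of the idea. It is not part of Notepad++, only an illustration. The function names (`contains`, `filter_lines`, `filter_file`), the `keep_matching` and `ignore_case` parameters, and the sample URLs are all made up for this example. Keeping the matching lines corresponds to Remove Unmarked Lines, and setting `keep_matching=False` corresponds to Remove Bookmarked Lines.

```python
def contains(line, keyword, ignore_case=True):
    """Return True if keyword occurs anywhere in line."""
    if ignore_case:
        return keyword.lower() in line.lower()
    return keyword in line


def filter_lines(lines, keyword, keep_matching=True, ignore_case=True):
    """Keep the lines that contain keyword, or drop them if keep_matching is False."""
    return [
        line for line in lines
        if contains(line, keyword, ignore_case) == keep_matching
    ]


def filter_file(src_path, dst_path, keyword, keep_matching=True, ignore_case=True):
    """Filter src_path into dst_path one line at a time and return the number of lines kept.

    Reading line by line means a large file never has to fit in memory.
    newline="" leaves the original line endings untouched.
    """
    kept = 0
    with open(src_path, encoding="utf-8", errors="replace", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            if contains(line, keyword, ignore_case) == keep_matching:
                dst.write(line)
                kept += 1
    return kept


# Hypothetical sample data, standing in for the exported list of URLs.
sample_urls = [
    "http://example-one.blogspot.com/2012/01/post.html",
    "http://example-two.wordpress.com/some-post/",
    "http://www.example.org/article",
    "http://Example-Three.BlogSpot.com/copy.html",
]

if __name__ == "__main__":
    # Keep only the lines that mention blogspot.com.
    for url in filter_lines(sample_urls, "blogspot.com"):
        print(url)
```

For a real file you would call something like `filter_file("urls.txt", "blogspot-only.txt", "blogspot.com")`, where both file names are placeholders. The sketch ignores letter case by default; pass `ignore_case=False` if you need an exact-case match.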