Trim Strings in Vim / Organize Firefox History

I was able to trim the end of a line in Vim. Combining that with a few other Vim commands allowed me to organize my Firefox history.

Background

I have a list of URLs from my Firefox history. I want to sort and deduplicate them. I don't care about what pages I visited. I am only interested in a list of the hosts.

So a record like

https://en.wikipedia.org/wiki/Large_Hadron_Collider

would be reduced to

https://en.wikipedia.org/

And if I visited several pages

# Pages
https://en.wikipedia.org/wiki/George_R._R._Martin
https://en.wikipedia.org/wiki/Game_of_Thrones

# Hosts 
https://en.wikipedia.org/
https://en.wikipedia.org/

then I only want one example of each host to show in the output.

https://en.wikipedia.org/

I have the URLs in Vim. But I am not sure how to proceed.

I was able to load the URLs in Vim by:

Open Firefox.
Click on the hamburger menu > History > Manage History (at the bottom). A new "Library" window will open showing your history.
Click on "This month" in the sidebar on the left.
Click on one of the entries in the main viewing area to select it.
Press Ctrl-A to select all entries.
Press Ctrl-C to copy.
Open Vim.
Enter the command below to paste.

:0 put +

This leaves me with a Vim window populated with my Firefox history. Specifically, there is a list of URLs. There is one URL per line.

I'm using Vim on Linux. These steps might vary on other platforms. For example, carriage returns, ^M, might not show on Windows.
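One platform detail worth checking is whether your Vim build can reach the system clipboard at all, since the + register depends on it. The check below is a minimal sketch; it assumes a build where the 'clipboard' feature name is reported by has().

```vim
" Prints 1 when this Vim can access the system clipboard, 0 otherwise.
:echo has('clipboard')
```

If this prints 0, the paste command above will probably not pick up what you copied from Firefox.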

Remove Carriage Returns

If you copy and paste from the Firefox History window, you will see a '^M' at the end of each line. These are explicit carriage returns which are normally non-printing characters.

You can remove the carriage returns using the :substitute command in Vim.

:%s/\r//
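A slightly stricter variant, shown here as a sketch rather than a required change, only removes a carriage return that is the last character on the line and stays quiet when there is nothing to remove.

```vim
" Remove a carriage return only when it is the last character on the line.
" The e flag suppresses the "pattern not found" error if no line has one.
:%s/\r$//e
```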

Match the Beginning of the Line

It is not immediately clear how to trim the end of each line.

One way to get started is to write a regular expression which captures the beginning of the string: the part of the URL I want to keep. Maybe we can use this as part of the solution.

So if we have a line like

https://www.phoronix.com/scan.php?page=news_item&px=System76-Scheduler-1.1

then we can progressively develop a regular expression that captures the beginning of the URL.

# Original URL
https://www.phoronix.com/scan.php?page=news_item&px=System76-Scheduler-1.1

# Manually delete the end of the string. 
https://www.phoronix.com/

# Anchor the beginning of the string by adding ^
^https://www.phoronix.com/

# Escape forward slashes. 
^https:\/\/www.phoronix.com\/

# Add wildcard for host name. 
^https:\/\/[^/]\+\/

We can test this regular expression by searching for it.

" Activate the Search Highlight option in Vim. 
:set hlsearch

" Search using the regular expression. 
/^https:\/\/[^/]\+\/
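Another way to check the expression, without touching the buffer, is to apply it to a single string with matchstr(). This is only a sketch: the variable name url is made up for the example, and the sample URL is the one used above.

```vim
" Hypothetical check: apply the pattern to one sample URL.
:let url = 'https://www.phoronix.com/scan.php?page=news_item&px=System76-Scheduler-1.1'
:echo matchstr(url, '^https:\/\/[^/]\+\/')
" Prints: https://www.phoronix.com/
```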

If you want to save typing the regular expression into the command line, it is possible to paste from the clipboard to the Vim command line. In this context, the search prompt is considered a command line for the purposes of Ctrl-R.

Copy the example regular expression.

^https:\/\/[^/]\+\/

Switch to the Vim window.
Press Esc twice to return to Vim's normal mode.
Type forward slash '/' to begin a search.
Press Ctrl-R. Vim shows a " on the command line while it waits for you to name a register.
Type plus '+' to specify the clipboard register.

After typing the plus symbol, the command line will fill with the contents of the clipboard. Note that if you happened to copy a newline then a '^M' might show at the end of the line. You can press Backspace once to remove '^M'.

Press Enter to begin the search.

If the 'incsearch' option is on, you might see matches highlighted as you type the regular expression. But you still have to press Enter to formally begin the search.

At this point, we can match the beginning of the URL.

Start of the Match

Vim has a regular expression atom \zs which sets the start of the match.

It is not immediately obvious what this means or how it is helpful.

At this point, we have two basic functions.

One, we can substitute text for other text. So, if we can find text then we can replace it with nothing. This effectively deletes the text we are searching for.

Two, we can match the beginning of the URL.

I want to delete the end of the string.

It doesn't make sense to delete the beginning of the string which is what we can match right now.

In principle, it is possible to match the end of a URL. But the end has an unknown number of path components separated by forward slashes. It is easier to match the beginning of the URL because it has a fixed form: https://wildcard/. That is why I've chosen to match that part.

What we need now is some way to combine a match for the beginning of the string with the substitute command, so that the combination deletes the end of the string.

This is where \zs comes in.

\zs works in combination with another regular expression.

First we search for something.

/^https:\/\/[^/]\+\/

where the leading / begins a search in Vim and

^https:\/\/[^/]\+\/

is the regular expression we are looking for.

Then we add \zs.

Note you can press / and then the up arrow on the keyboard to recall the last search. Then you can modify the search by adding \zs. Don't forget to press Enter to begin the search.

/^https:\/\/[^/]\+\/\zs

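At this point, Vim interprets this to mean:

Find the beginning of the URL.
Start the match itself immediately after that text.

The text before \zs must still be present on the line, but it is not part of the match. Right now nothing follows \zs, so the match is empty.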

We can add a wildcard .* to match the rest of the line.

/^https:\/\/[^/]\+\/\zs.*

If you try this search in Vim with the 'hlsearch' option turned on, you will see the end of each URL get highlighted.
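The same effect can be seen on a single string with matchstr(), which returns only the text that counts as the match. This is a sketch that reuses one of the sample URLs from the Background section.

```vim
" Everything before \zs must be present, but only the text after it is returned.
:echo matchstr('https://en.wikipedia.org/wiki/Game_of_Thrones', '^https:\/\/[^/]\+\/\zs.*')
" Prints: wiki/Game_of_Thrones
```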

Another way to think of \zs is as a "require but exclude" marker in Vim.

We:

Search for something.
\zs
Search for something else.

The first thing we search for must still be present, but it is left out of the match.

You might be interested to know there is a complementary Vim atom, \ze, which sets the end of the match.
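As a small illustration of \ze, which marks where the match ends, the sketch below uses \zs and \ze together to pull out just the host name. It is not needed for the rest of this article.

```vim
" \zs drops the scheme from the match; \ze drops the slash after the host.
:echo matchstr('https://en.wikipedia.org/wiki/Game_of_Thrones', '^https:\/\/\zs[^/]\+\ze\/')
" Prints: en.wikipedia.org
```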

Trim Strings

Now we have a way to reliably select the end of each URL. If we combine the regular expression with the substitute command then we can replace the end of the string with nothing. This will effectively trim the strings.

:%s/^https:\/\/[^/]\+\/\zs.*//
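To preview what the substitution does to one line before running it on the whole buffer, the same pattern can be passed to the substitute() function. This is a sketch using the first sample URL from the Background section.

```vim
" Replace the part of the string after \zs with nothing.
:echo substitute('https://en.wikipedia.org/wiki/Large_Hadron_Collider', '^https:\/\/[^/]\+\/\zs.*', '', '')
" Prints: https://en.wikipedia.org/
```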

One shortcut you can take here is to reuse the regular expression from a recent search. Using Ctrl-R in this way lets you experiment with different searches and then transfer a working expression to :substitute.

Type :%s/ to begin a substitution.
Press Ctrl-R and then / to recall the regular expression from the last search to the current command line. Here '/' represents a register dedicated to the last search pattern.
Type // to replace the match with nothing, then press Enter.

One consequence of using 'https' in my regular expression is that 'http' sites still have their path components. You can remove those paths by making the 's' optional.

:%s/^https\?:\/\/[^/]\+\/\zs.*//
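If the escaped slashes are hard to read, :substitute also accepts other delimiter characters. The version below is an equivalent sketch that uses # as the delimiter, so the forward slashes in the pattern can be written plainly.

```vim
" Same substitution with # as the delimiter, so slashes need no escaping.
:%s#^https\?://[^/]\+/\zs.*##
```

Note that this form only works with :substitute. A search started with / still needs the slashes escaped.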

At this point we have trimmed the strings.

Sort and Deduplicate

I'd like to sort the URLs alphabetically and remove duplicates.

Vim has a :sort command. The :sort command has an option, u, which only returns unique entries. By default, :sort applies to the entire buffer or file.
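Putting that together, the command below sorts the buffer and removes duplicates in one step. The function after it is only a recap sketch of the whole procedure; the name HostsOnly is made up for this example.

```vim
" Sort every line in the buffer and drop duplicate lines.
:sort u

" Hypothetical helper that repeats the steps from this article in order.
function! HostsOnly() abort
  " Remove trailing carriage returns, if any.
  %s/\r$//e
  " Trim everything after the host, for both http and https URLs.
  %s/^https\?:\/\/[^/]\+\/\zs.*//e
  " Sort and deduplicate.
  sort u
endfunction
```

With the function defined, :call HostsOnly() runs all three steps on the current buffer.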