Architecting Sidenotes

Livefyre recently released a new way of commenting and interacting with Content on your favorite websites: Sidenotes.

Sidenotes lets any user respond to specific sections of the Content, not just reply to the whole thing. The added context allows Sidenotes to display the annotation right next to the annotated image or paragraph (like this one).

But how does it work? This is how we built it.

Dependencies

Livefyre StreamHub provided most of the infrastructure needed for the Sidenotes Application. We didn’t have to start from scratch. We leveraged Livefyre’s existing world-class services to manage:

  • Content - Comments, Tweets, Facebook posts, Instagram photos, and RSS items are all normalized and stored as Content in StreamHub.
  • Collections - Every conversation powered by Livefyre is made up of a Collection of Content. StreamHub APIs provide high-availability indexing into these Collections billions of times per month.
  • Spam Protection - Livefyre Engineering’s affectionately labeled “Ministry of Abuse and Classification” keeps out all the bad stuff.
  • Identity - Whether Livefyre.com profiles for Livefyre Community Comments, Livefyre Enterprise Profiles, or single sign-on integrations with custom Identity Providers, StreamHub provides a link to any authentication system necessary.

Finally, we leveraged a lot of reusable JavaScript modules in streamhub-sdk to communicate with these Services. Altogether, the StreamHub Platform made building Sidenotes an order of magnitude less challenging than building from scratch.

Challenge

While we had a great foundation in StreamHub, there were still some challenges to solve.

Most importantly, we had to answer the question: “How can Sidenotes anchor user-generated StreamHub Content to subsections of another piece of external content?”

To accomplish this, we’ve added the notion of a Block to Livefyre. Each new sidenote is anchored to a Block in the corresponding article. That Block is frequently a paragraph of text, but it can also be an image, video, or any other HTML Element. When you use Sidenotes, you see a small icon next to each Block showing the number of sidenotes on that Block, and clicking allows you to add a new one.

The resulting problem for Livefyre Engineering was to determine the best algorithm for provisioning Blocks and allocating new sidenotes into the right Block.

The simplest imaginable solution would be to store the annotated text on each sidenote and then, to display a sidenote, search the article for that exact substring.

But we knew that in order to make the best product, we would need to do more than just find an exact match. Content on the internet changes all the time, as it should. Editors add or correct information, fix typos, or A/B test different ways of presenting their Content.

How do you anchor annotations to an always-shifting, living document?

Naive Solutions

We first looked to existing products to reverse-engineer their methods for dealing with this. There are quite a few annotation experiences that only annotate Content managed by the same system. That is a comparatively trivial Engineering problem, and the solutions used by these sorts of closed systems wouldn’t work for our mission of bringing annotation to the rest of the web.

The similar products we did find usually used one of two naive solutions, neither of which is resilient to changes in the annotated Content.

Anchor by paragraph index

Some of the systems we found would attach annotations based on paragraph ordering. As soon as an editor added a paragraph, all annotations on later parts of the document would be lost or, perhaps worse, attached to the wrong section of the document. In the worst case, adding an introduction paragraph at the beginning would break annotations for the entire article.

Anchor based on hash of the text

The other common method we found is to hash the contents of the annotated text, and store that with the annotation. This has the benefit of being agnostic to the order of annotated sections. Sections can be added, removed, or rearranged, and annotations will stay anchored.

However, as soon as any paragraph changes, the hash will also change and all previous annotations would be lost.

This is because most hashing algorithms (e.g. md5) are designed to minimize collisions. That is, two different inputs should never produce the same output. Usually, even slight changes to the input will result in wildly different output hashes.

md5('Hello, world') -> bc6e6f16b8a077ef5fbc8d59d0b931b9

md5('Hello, world!') -> 6cd3556deb0da54bca060b4c39479839
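As a quick illustration (a throwaway Python snippet, not part of Sidenotes itself), the standard hashlib module shows the same behavior:

    import hashlib

    # Even a one-character edit produces a completely unrelated digest.
    for text in ('Hello, world', 'Hello, world!'):
        print(text, '->', hashlib.md5(text.encode('utf-8')).hexdigest())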

Locality-Sensitive Hashing

What we wanted was a way of hashing slightly different inputs and getting either the same or only slightly different outputs.

This would let us reduce entire paragraphs of varying lengths to a fixed-size hash, and help us identify when a section had been edited so we could keep anchoring annotations across edited versions.

We started researching Locality-Sensitive Hashing (LSH), which refers to hashing methods that are designed not to minimize collision, but specifically to maximize it.

Locality-Sensitive Hashing has many applications, including

  • Spam detection - Once you identify something as spam, you also want to quickly determine that very similar things (e.g. with only one character added) are also spam. e.g. Nilsimsa Hashing
  • DNA Analysis - Given two sequences of nucleotides, determine if they are likely to be from the same person.
  • De-duplication - Search engines use LSH to determine when the same page contents are being returned from different URLs, so they can deduplicate documents in their indexes.

Compared to the md5 example above, an LSH would produce results something like:

LSH('Hello, world') -> bc7e6f16b8a087ef5fbc8d50d0b931c9

LSH('Hello, world!') -> bc6e6f16b8a077ef6fbc8d59d0b931b9
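To make that behavior concrete, here is a toy simhash-style LSH in Python. It is an illustrative sketch, not the hash Sidenotes uses in production: it hashes overlapping character shingles and keeps, at each bit position, the majority vote across shingles, so texts that share most of their shingles end up with mostly identical bits.

    import hashlib

    def simhash(text, num_bits=64, shingle_size=4):
        # Toy locality-sensitive hash: majority vote across shingle hashes.
        counts = [0] * num_bits
        shingles = [text[i:i + shingle_size]
                    for i in range(len(text) - shingle_size + 1)]
        for shingle in shingles:
            h = int(hashlib.md5(shingle.encode('utf-8')).hexdigest(), 16)
            for bit in range(num_bits):
                counts[bit] += 1 if (h >> bit) & 1 else -1
        return sum(1 << bit for bit in range(num_bits) if counts[bit] > 0)

    def hamming(a, b):
        # Number of bit positions where two hashes differ.
        return bin(a ^ b).count('1')

    print(hamming(simhash('Hello, world'), simhash('Hello, world!')))        # small
    print(hamming(simhash('Hello, world'), simhash('Something else entirely')))  # large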

Visualizing

This is all getting kind of theoretical, and we know that the best solutions frequently come only once one has developed a solid intuition about the problem space, so we set about trying to visualize these locality-sensitive hashes.

We built a dataset of four different paragraphs. For three of them, we created three different versions by changing punctuation, spelling, or adding sentences. In this example, there were ten total paragraphs.

Each LSH hash of N bits can be mapped to a point in <= N-dimensional space. Slightly different inputs that result in slightly different hashes will map to points in this space that are only slightly far apart. That is, their distance will be low.

We used scikit-learn and Principal Component Analysis (PCA) to reduce these N-dimensional vectors to three-dimensional ones, which we could then graph.
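The sketch below shows the shape of that step. It reuses the simhash() helper from the earlier sketch, and the sample paragraphs and 64-bit hash size are stand-ins for our real dataset:

    import numpy as np
    from sklearn.decomposition import PCA

    paragraphs = [
        'The quick brown fox jumps over the lazy dog.',
        'The quick brown fox jumped over the lazy dog!',   # lightly edited version
        'An unrelated paragraph about locality-sensitive hashing.',
        'A fourth paragraph that has nothing in common with the others.',
    ]

    def hash_to_bits(h, num_bits=64):
        # Unpack an integer hash into a vector with one 0/1 entry per bit.
        return [(h >> bit) & 1 for bit in range(num_bits)]

    points = np.array([hash_to_bits(simhash(p)) for p in paragraphs], dtype=float)

    # Project the 64-dimensional bit vectors down to three dimensions for plotting.
    coords = PCA(n_components=3).fit_transform(points)
    print(coords)   # one (x, y, z) row per paragraph; the edited pair lands close together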

You can see in this image four clusters of points. One of the clusters (on the top) only has one point, and the other three clusters (representing the paragraphs we made edits to) each have three points. Keep in mind that while we humans can spot the gestalt of these four clusters, so far the algorithm has no idea.

Points in 3D Space

Allocating Sidenotes to Blocks

This research gave us a method to allocate Sidenotes into Blocks, even when the Sidenotes annotate slightly different text. Each cluster in the above image should be one Block.

When you create a new sidenote on a paragraph, we allocate it to a Block by first applying an LSH algorithm to the annotated text, then checking the following (sketched in code after the list):

  1. Is the hash an exact match for an already-created Block? If so, add the annotation to that Block.
  2. If not, perform a Nearest Neighbor Search to find existing Blocks that are a small distance from this hash. We use a relatively simple Hamming distance metric. If we find an existing block within a certain distance threshold of the new hash, we place the annotation in that Block. This is what happens when the contents of a paragraph have slightly changed. This proximity search is what lets the algorithm recognize the clusters in the above graph.
  3. If there are no Blocks with similar hashes, this must be the first time someone has annotated this section of the article. So a new Block is provisioned with the hash of what’s being annotated.
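Here is roughly what that decision looks like in Python. It is a sketch under simplifying assumptions: Blocks live in an in-memory list rather than in StreamHub storage, the simhash() and hamming() helpers come from the sketches above, and the 8-bit distance threshold is an arbitrary illustrative value.

    def allocate_block(text, blocks, threshold=8):
        # `blocks` is an illustrative in-memory list of {'hash': int, 'sidenotes': [...]}.
        h = simhash(text)  # LSH of the text being annotated

        # 1. Exact match: the paragraph is unchanged since the Block was created.
        for block in blocks:
            if block['hash'] == h:
                return block

        # 2. Nearest neighbor: the paragraph has been edited, but only slightly.
        nearest = min(blocks, key=lambda b: hamming(b['hash'], h), default=None)
        if nearest is not None and hamming(nearest['hash'], h) <= threshold:
            return nearest

        # 3. No similar Block: first annotation on this section, so provision a new Block.
        new_block = {'hash': h, 'sidenotes': []}
        blocks.append(new_block)
        return new_block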

Visualizing Results

We have now successfully identified ‘clusters’ in the LSH points in this high-dimensional space. Each cluster represents the same text at a few different versions. And for each cluster, we’ve provisioned a Block.

After clustering the dataset visualized above, we gave each inferred Block a unique color so we could quickly check how our algorithm performed.

Clusters in 3D Space

Runtime

Now armed with a Block provisioning and allocation strategy, we felt comfortable starting work on Sidenotes and putting all the pieces together. Livefyre’s App Engineering, Product Design, and Product Management teams did a ton of brainstorming, user experience research, and straight up hacking and execution to make Sidenotes a reality.

Today, when Sidenotes loads on a webpage, it does the following:

1) Boot with configuration pointing to a StreamHub Collection and selectors indicating which elements of the page should be annotatable

2) Ask StreamHub for any known Blocks in the Collection and, for each, the corresponding LSH hash and number of Sidenotes in the Block

3) For each existing Block and corresponding hash:

  • Find the annotatable element whose LSH hash is most similar to that Block’s hash (see the sketch below)
  • Render the Sidenotes intent on that element, including the number of Sidenotes in that Block
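A rough sketch of that matching step follows, again in Python for brevity even though the real client is JavaScript built on streamhub-sdk. The element texts stand in for the annotatable DOM nodes, and simhash() and hamming() are the helpers from the earlier sketches:

    def match_blocks_to_elements(blocks, element_texts, threshold=8):
        # `blocks`: [{'hash': int, 'count': int}, ...] as returned for the Collection.
        # `element_texts`: the text content of each annotatable element on the page.
        element_hashes = [simhash(text) for text in element_texts]
        placements = []
        for block in blocks:
            distances = [hamming(block['hash'], h) for h in element_hashes]
            best = min(range(len(distances)), key=lambda i: distances[i])
            if distances[best] <= threshold:
                # Render the Sidenotes intent on element `best`, showing block['count'].
                placements.append((best, block['count']))
        return placements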

When a user clicks on this intent, Sidenotes asks StreamHub for the top Sidenotes in that Block, then renders them.

When a user selects text and posts a new sidenote, we allocate it to a Block using the method described above, and post the Content to StreamHub with a parameter indicating the Block to attach it to.

Questions

We had a ton of fun envisioning and creating Sidenotes, and we sincerely hope that it makes it more fun to hang out on your favorite websites, interact with your favorite communities, and engage and give feedback to your favorite authors.

Have questions? Post them in the comments below.

Just kidding. Sidenote it!

"Radical Man" on Freedom and Technology


I’m currently reading Charles Hampden-Turner’s Radical Man (c. 1971), which I bought because some random Amazon review said it was amazing and that they wished they had read it at a much younger age. It is very much worth reading, both for the surprisingly timeless philosophy of human psycho-social development and for some of the unfortunately untimely, incorrect predictions. The following passage is from the epically titled chapter, “Crypto-conservatism of Technological Thinking”:

Futurologists make great play with the notions of “freedom” and “technology creating opportunities” for us all and presenting us with a greater number of alternatives. Yet the meaning of freedom in the contexts which they use it is essentially a choice of a large number of games, or the choice in playing a particular game to reach a predetermined objective by interchangeable routes. Thus George A. Miller, a Harvard psychologist, and a prominent contributor to Toward the Year 2000 envisages numerous children in communication with a central computer.

Within the broad limits set by what materials have been prepared by the computer, the student is free to study those things that are of most interest to him.

Leaving aside for a moment the question of whether the computer as opposed to the programmer can be said to have prepared anything, let us consider this use of the term “freedom”. It is the freedom of a rat to run a maze by turning left or right on any of a hundred separate occasions. It is the freedom of a guided missile to reach a predetermined target by an infinite number of paths or the freedom of a quarterback to gain fifty yards by kicking, running, or passing in a myriad of ways. I can think of few marine commanders who would not endorse the kind of “initiative” by which recruits dropped in the wilderness find their way back to base by any means they wish, provided they cover three hundred miles in thirty six hours.

Freedom, of course, means more than the ability to explore a vast technological prison with new wings added yearly and play alternative technical games therein. Unless the paths chosen by the individual contain a human logic and meaning, then it makes little sense to consider the human being as investing himself freely, intensely, or authentically. Given a run-away technology, wherein “can” is repeatedly translated as “ought” as technical means become ends in themselves, the survival of human freedom and meaning and the confirmation of this meaning by others, becomes harder and harder to achieve.

Somewhat surprisingly, many of these patterns can be represented as linear translations. For example, the result of the vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector.
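For anyone who wants to poke at this themselves, here is a small sketch using the gensim library and its downloadable Google News vectors (both the library and the dataset name are my assumptions, not anything from the quoted work):

    import gensim.downloader as api

    # Pretrained Google News word2vec vectors (a large download).
    model = api.load('word2vec-google-news-300')

    # vec("Madrid") - vec("Spain") + vec("France") should land near vec("Paris").
    print(model.most_similar(positive=['Madrid', 'France'], negative=['Spain'], topn=1))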

This talk shows some common optimizations for improving webapp performance: smaller files, fewer connections, more hosts to parallelize past browser limits. It also shows that Data URIs are noticeably slower than CSS sprites, especially when served from cache.

less-sprites looks like a great npm package for LESS that makes this quite easy. We’ll definitely be switching from Data URIs to sprites in streamhub-sdk for our default ContentView templates.

More of the data in this talk can be found on his blog.

Graph from the blog post

Scalable Service Tradeoffs: Search vs Retrieve

This is a well-articulated core principle of Twitter’s architecture via HighScalability.com

Search And Pull Are Inverses

  • Search and pull look remarkably similar, but one key property is inverted between them.
  • On the home timeline:
    • Write. When a tweet comes in there’s an O(n) process to write to Redis clusters, where n is the number of people following you. Painful for Lady Gaga and Barack Obama where they are doing 10s of millions of inserts across the cluster. All the Redis clusters are backed by disk, the Flock cluster stores the user timeline to disk, but usually timelines are found in RAM in the Redis cluster.
    • Read. Via API or the web it’s O(1) to find the right Redis machine. Twitter is optimized to be highly available on the read path on the home timeline. The read path is in the 10s of milliseconds. Twitter is primarily a consumption mechanism, not a production mechanism: 300K requests per second for reading and 6000 RPS for writing.
  • On the search timeline:
    • Write. When a tweet comes in and hits the Ingester, only one Early Bird machine is hit. The write path is O(1). A single tweet is ingested in under 5 seconds between the queuing and processing to find the one Early Bird to write it to.
    • Read. When a read comes in it must do an O(n) read across the cluster. Most people don’t use search, so they can be efficient about how they store tweets for search, but they pay for it in time: reading is on the order of 100 msecs. Search never hits disk. The entire Lucene index is in RAM, so scatter-gather reading is efficient as they never hit disk.
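The inverse relationship is easy to see in a toy model. This is only an illustrative sketch of fan-out-on-write versus scatter-gather-on-read, with Python dicts standing in for the Redis timeline clusters and the search partitions:

    from collections import defaultdict

    timelines = defaultdict(list)           # one materialized home timeline per follower
    search_partitions = defaultdict(list)   # tweets sharded across search partitions

    def publish(tweet, author, followers, partition_count=4):
        # Home timeline: O(n) write, fanned out to every follower's timeline...
        for follower in followers:
            timelines[follower].append(tweet)
        # ...search: O(1) write, the tweet lands in exactly one partition.
        search_partitions[hash(tweet) % partition_count].append((author, tweet))

    def read_home_timeline(user):
        # O(1) read: the timeline is already materialized for this user.
        return timelines[user]

    def search(term, partition_count=4):
        # O(n) read: scatter-gather across every partition.
        return [t for p in range(partition_count)
                for (author, t) in search_partitions[p] if term in t]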


at American Airlines Arena – View on Path.

AMAZING Cuban food here in Miami. Heat vs Pacers live later tonight! #vacay at Versailles Restaurant – View on Path.

nbd my code is on Oprah – View on Path.
