If you aren’t familiar with SoundCloud, it is one of the fastest-growing sites in the US, with 10.2 million unique visitors in March and 26% traffic growth over the prior month. The service reaches over 200 million people worldwide.
It is also one of the coolest social networks for sharing music and sound. Many of today's most popular musicians, producers, and DJs release music there to a global audience, one that collectively uploads 12 hours of music every minute, covering electronic, classical, jazz, blues, comedy, storytelling, and more.
This past June, Sebastian Ohm, Technical Lead at SoundCloud, gave a talk on their use of RabbitMQ at the Erlang User Conference in Stockholm. His talk and this article cover the functionality, messaging architecture, and lessons learned.
SoundCloud’s Functionality
SoundCloud is a social platform: instead of uploading and commenting on pictures, SoundCloud users upload and comment on audio via a waveform image, as shown below. The content on SoundCloud is driven entirely by end users and third parties, and it can be accessed via the SoundCloud website, embedded in other sites like Facebook or blogs, and heard through mobile apps on Android and iOS.
One of the places RabbitMQ provides a service is the upload path. When a user uploads an audio file to SoundCloud, RabbitMQ is used to process the audio asynchronously, build the waveform image, and notify followers of the new sound.
The Messaging Architecture—Transcoding and Activity Updates
SoundCloud stores media in Amazon S3, and the worker pool is in EC2. A message-based architecture was chosen a few years back to coordinate these separate storage and processing clouds. After reviewing STOMP and other protocols, the engineering team settled on AMQP with RabbitMQ. The team wanted producers and consumers to be entirely decoupled so that pools of resources could be scaled independently.
The application was developed with Ruby on Rails. When a new media file is uploaded, the Ruby code creates a record in MySQL and publishes a message to the media exchange containing a unique ID for the media. Both the Ruby app and RabbitMQ run in the SoundCloud data center in Amsterdam, while the consumption end of the queue, a transcoding service, runs in EC2. The consumers receive the unique ID, transcode the media, and publish another message to the media exchange with some metadata and a unique ID for the files on S3, available via URI. A Rails app receives these messages and pushes some of the information into the database.
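The publishing side of this flow can be sketched in Ruby. The "media" exchange name comes from the description above; the routing key, payload fields, and helper names are illustrative assumptions, and the commented wiring uses the Bunny AMQP client:

```ruby
require "json"

# Build the message body: just the unique ID of the newly created
# MySQL record, so the payload stays small and the queue stays fast.
def build_payload(media_id)
  JSON.generate(media_id: media_id)
end

# Publish the transcode request to the media exchange. Called from the
# code path that creates the MySQL record for the uploaded file.
# The exchange object is any AMQP exchange that responds to #publish.
def publish_transcode_request(exchange, media_id)
  exchange.publish(build_payload(media_id),
                   routing_key: "media.uploaded", # assumed routing key
                   persistent: true)              # survive a broker restart
end

# With the Bunny client, wiring this up would look roughly like:
#   conn = Bunny.new
#   conn.start
#   ch = conn.create_channel
#   publish_transcode_request(ch.topic("media", durable: true), 42)
```

Keeping only the ID in the message, rather than the media itself, means the transcoders in EC2 fetch the actual file from S3 and the broker never carries large payloads.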
This approach addressed one of their first scale challenges: scaling uploads. Now, they can add resources to the pool of transcoders quickly and automatically during any spike in traffic. RabbitMQ distributes the workload in parallel across all transcoders, and the system can recover from tens of thousands of queued uploads within a few hours.
They also use a separate RabbitMQ broker to update the dashboard, which shows users the most recent activities and updates from the musicians and other users they follow. Scale is not a problem until a user like Skrillex, with about one million followers, uploads 10 tracks at once. In that case, the system would have to perform 10 million synchronous writes to Cassandra. Instead, the engineering team added a broadcast within their application's domain and used RabbitMQ for staged, asynchronous processing in three steps:
- Fan-out determines where activities should propagate
- Personalization captures the relationship between users and filters an index entry
- Serialization persists the information in Cassandra for end user display or API access
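The three stages above can be modeled as plain Ruby steps. The follower list, the mute-list filter, and the JSON row format are all illustrative assumptions standing in for the real rules and the Cassandra write; in production each stage is a separate RabbitMQ consumer so it can scale independently:

```ruby
require "json"

FOLLOWERS = { "skrillex" => %w[alice bob carol] } # assumed sample data

# Stage 1: fan-out -- decide which followers the activity propagates to.
def fan_out(activity)
  FOLLOWERS.fetch(activity[:artist], []).map do |follower|
    activity.merge(recipient: follower)
  end
end

# Stage 2: personalization -- filter based on the user relationship
# (a hypothetical mute list stands in for the real filtering rules).
def personalize(update, muted: [])
  muted.include?(update[:recipient]) ? nil : update
end

# Stage 3: serialization -- persist the index entry (a Cassandra write
# in the real system; a JSON row stands in for it here).
def serialize(update)
  JSON.generate(update)
end

activity = { artist: "skrillex", track: "new-track-1" }
rows = fan_out(activity)
         .map { |u| personalize(u, muted: ["bob"]) }
         .compact
         .map { |u| serialize(u) }
# rows now holds one serialized entry per non-filtered follower
```

Because each stage reads from its own queue, a burst at fan-out (10 tracks times one million followers) drains gradually through the later stages instead of forcing 10 million synchronous writes.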
Key Lessons Learned
With their current approach, the team has been handling about 20,000–30,000 persistent messages per second (as shown in the graph below). Sebastian was kind enough to share the honest challenges they faced and some key lessons learned during his talk:
- While things have not gone perfectly, Sebastian believes Erlang and RabbitMQ have performed well with no operational issues, even though the team had no prior Erlang experience
- Separating production, test, and dev environments is important and reduces headaches and errors
- Don’t put every type of processing on one queue or one broker; separate workloads with different usage profiles so they can scale independently
- Use clustering: a load balancer in front lets them publish once while workers subscribe across all nodes
- AMQP heartbeats worked more smoothly than maintaining one TCP connection per broker
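The last two lessons can be sketched as connection configuration. The host names, vhost names, and 10-second interval below are assumptions, not SoundCloud's actual settings; the option names match the Bunny client, which accepts a `:heartbeat` interval:

```ruby
# Build per-workload broker options so each workload profile gets its
# own broker and vhost, and connections carry AMQP heartbeats.
def broker_options(host, vhost)
  {
    host: host,
    vhost: vhost,
    heartbeat: 10 # AMQP heartbeat interval in seconds (assumed value)
  }
end

# One broker per workload, so a transcoding spike never backs up the
# dashboard's activity updates (and vice versa).
TRANSCODING = broker_options("mq-transcoding.example.internal", "/transcoding")
DASHBOARD   = broker_options("mq-dashboard.example.internal", "/dashboard")

# With Bunny, each would be opened as:
#   conn = Bunny.new(TRANSCODING)
#   conn.start
```

Heartbeats let both peers detect a dead TCP connection quickly, which matters when consumers in EC2 talk to a broker in a distant data center.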
Find similar information on RabbitMQ:
- View Sebastian’s 40-minute talk, which includes code examples and additional depth
- Read other Pivotal POV blog articles on RabbitMQ with overviews of talks from other conferences and case studies
- Read over 50 blog articles on RabbitMQ from VMware’s vFabric blog
- Read more about Pivotal One—and where RabbitMQ fits into our application fabric