
How can watchRTC improve your WebRTC service operations?

watchRTC is our most recent addition to the testRTC product portfolio. It is a passive monitoring service that collects event information and metrics from WebRTC clients and analyzes, aggregates and visualizes them for you. It is a powerful WebRTC monitoring and troubleshooting platform, meant to help you improve and optimize your service delivery.

Learn more about watchRTC

It’s interesting how you can start building something with one idea of how your users will utilize it, only to find out that what you’ve built has many other uses as well.

This is exactly where I am finding myself with watchRTC. Now, about a year after we announced its private beta, I thought it would be a good opportunity to look at the benefits our customers are deriving from it. The best way for me to think is by writing things down, so here are my thoughts at the moment:

What is watchRTC and how does it work?

watchRTC collects WebRTC related telemetry data from end users, making it available for analysis in real time and in aggregate.

For this to work, you need to integrate the watchRTC SDK into your application. This is straightforward integration work that takes an hour or less. Then the SDK can collect relevant WebRTC data in the background, while using as little CPU and network resources as possible.
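To give a feel for what that integration looks like, here is a minimal sketch of initializing the JavaScript SDK in a web application. Treat the package name and option names below as assumptions on my part and check the current watchRTC documentation for the exact API:

```javascript
// Minimal sketch of watchRTC SDK initialization (names are assumptions).
import watchRTC from "@testrtc/watchrtc-sdk";

watchRTC.init({
  rtcApiKey: "YOUR_WATCHRTC_API_KEY", // identifies your watchRTC account
  rtcRoomId: "daily-standup-42",      // groups all peers of the same session
  rtcPeerId: "alice",                 // identifies this particular user
});
```

From that point on, the SDK collects metrics from the page’s WebRTC sessions in the background and reports them to the watchRTC servers.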

On the server side, we have a cloud service that is ready to collect this telemetry data. This data is made available in real-time for our watchRTC Live feature. Once the session completes and the room closes, the collected data can get further analyzed and aggregated.

Here are 3 objectives we set out to address, and 6 more we find ourselves helping with:

#1 – Bird’s eye view of your WebRTC operations

This is the basic thing you want from a WebRTC passive monitoring solution. It collects data from all WebRTC clients, aggregates and shows it on nice dashboards:

The result offers powerful overall insights into your users and how they are interacting with your service.

#2 – Drilldown for debugging and troubleshooting WebRTC issues

watchRTC was built on the heels of other testRTC services. This means we came into this domain with some great tooling for debugging and troubleshooting automated tests.

With automated testing, the mindset is to collect anything and everything you can lay your hands on, make it as detailed as possible for your users, and keep it simple to review and quick to use.

We took that mindset to watchRTC with a minor difference – some limits on what we collect and how. While we’re running inside your application, we don’t want to interfere with what it needs to do.

What we ended up with is the short video above.

From a history view of all rooms (sessions), you can drill down to the room level, from there to the peer (user) level, and finally to the detailed WebRTC analytics if and when needed.

In each layer we immediately highlight the important statistics and bubble up important notifications. The data is shown on interactive graphs, which makes debugging a lot simpler.

#3 – Monitoring WebRTC at scale

Then there’s the monitoring piece. Obvious considering this is a monitoring service.

Here the intent is to bubble up discrepancies and abnormal behavior to the IT people.

We are doing that by letting you define the thresholds of various metric values and then bubbling up notifications when such thresholds are reached.

Now that we’re past the obvious, here are 6 more things our clients are doing with watchRTC that we didn’t think of when we started:

#4 – Application data enrichment and insights

There’s WebRTC session data that watchRTC collects automatically, and then there’s the application related metadata that is needed to make more sense out of the WebRTC metrics that are collected.

This additional data comes in different shapes and sizes, and with each release we add more at our clients’ request:

  • Share identifiers between the application and watchRTC, and quickly switch from one to the other across monitoring dashboards
  • Add application specific events to the session’s timeline
  • Map the names of incoming channels to other specific peers in a session
  • Designate different peers with different custom keys

The extra data is useful for later troubleshooting, when you need to understand who the users involved are and what actions they have taken in your application.
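To illustrate a couple of the items above (application events and custom keys), attaching such metadata from the client could look roughly like the sketch below. The function names and argument shapes are assumptions based on the SDK’s metadata features, not a verified API reference:

```javascript
// Hedged sketch: enriching a watchRTC session with application metadata.
// Function names and argument shapes are assumptions - check the SDK docs.

// Custom keys let you slice aggregate data later (e.g. by plan or customer)
watchRTC.addKeys({ plan: "enterprise", customer: "acme-corp" });

// Application specific events show up on the session's timeline
watchRTC.addEvent({
  type: "global",
  name: "screen-share-started",
  parameters: { source: "window" },
});
```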

#5 – Deriving business intelligence

Once we got going, we started getting requests to add more insights.

We already collect data and process it to show the aggregate information. So why not provide filters towards that aggregate information?

Starting with the basics, we let people investigate the information based on dates and then added custom keys aggregation.

Now? We’re full on with high level metrics – from browsers and operating systems, to score values, bitrates and packet loss. Slice and dice the metrics however you see fit to figure out trends within your own custom population filters.

On top of it all, we’re getting ready to bulk export the data to our clients’ external BI systems – some want to be able to build their own queries, dashboards and enrichment.

#6 – Rating, billing and reporting

Interestingly, once people started using the dashboards, they wanted to be able to make use of them in front of their own customers.

Not all vendors collect their own metrics for rating purposes. Being able to use our REST API to retrieve highlights, based on the filtering capabilities we have, enables exactly that. For example, you can set a custom key to denote your largest customers, and then track their usage of your service using our APIs.

Download information as PDFs with relevant graphs or call our API to get it in JSON format.

#7 – Optimization of media servers and client code

For developers, one huge attraction of watchRTC is the ability to optimize their infrastructure and code – including the media servers and the client code.

By using watchRTC, they can deploy fixes and optimizations and check “in the wild” how these affect performance for their users.

Since watchRTC collects every conceivable WebRTC metric, optimization work can be done across a wide range of areas and vectors, as the collected results capture the metrics needed to decide whether an optimization was useful.

#8 – A/B testing

With watchRTC you can A/B test things. This goes for optimizations as well as many other angles.

You can now think about and treat your WebRTC infrastructure as a marketer would. By creating a custom key and marking different users with it, based on your own logic, you can A/B test the results to see what’s working and what isn’t.

It is kind of an extension of optimizing media servers, just at a whole new level of sophistication.

#9 – Manual testing

If you remember, our origin is in testing, and testing is used by developers.

These same developers already use our stress and regression testing capabilities. But as any user relying on test automation will tell you – there are times when manual testing is still needed (and with WebRTC that happens quite a lot).

The challenge with manual testing and WebRTC is data collection. A tester decides to file a bug. Where does he get all of the information he needs? Did he remember to keep his chrome://webrtc-internals tab open and download the file in time? How long does it take him to file that bug and collect everything?

Well, when you integrate with watchRTC, all of that manual labor goes away. Now, the tester needs only to explain the steps that caused the bug and add a link to the relevant session in watchRTC. The developer will have all the logs already there.

With watchRTC, you can tighten your development cycles and save expensive time for your development team.

watchRTC – run your WebRTC deployment at the speed of thought

One thing we aren’t compromising on with watchRTC is speed and responsiveness. We work with developers and IT people who don’t have the time to sit and wait for dashboards to load and update. That’s why we make it a point for our UI to be snappy and interactive in the extreme.

From the aggregated bird’s eye dashboard, to filtering, searching and drilling down to the single peer level – you’ll find watchRTC a powerful and lightning fast tool. Something that can offer you answers the moment you think of the questions.

If you’re new to testRTC and would like to find out more we would love to speak with you. Please send us a brief message, and we will be in contact with you shortly.

Understanding a call center agent’s network in a WFH world

As we settle into 2022, it seems like call center agents may continue in their WFH (work from home) mode even beyond the pandemic. This will be done either part time or full time, for some agents or for all of them.
The reasons for that are wide and varied, but that’s probably a topic for another time. This time, I’d like to discuss what we are going to do moving forward, to ensure that those reaching out to your call center get the best call quality possible, even when your agents are working from home.

The shift of the call center agent to WFH

Since the pandemic started, those who are able to work remotely have been directed to do so. That includes call center agents – the people who answer the phone when we want to complain, book, order, cancel or do a myriad of other activities in front of businesses.
The whole environment and architecture of the call center has changed due to the new world we live in today.

In the past, this used to be the call center:

The call center PBX, the network connections to the agents, the agent’s environment (room), computer and phone have all been in our control and in our office facilities.

Now? It looks more like this for an on premise call center:

With an on premise call center and work from home agents, we’re likely to deploy an SBC (Session Border Controller) and/or a VPN to connect the agents back to the office. It adds more moving parts, and burdens the internet connection of the office, but it is the fastest patch that can be employed and it might be the only available solution if you can’t or don’t want to run your call center in the cloud.

Or this for a cloud call center:

In a cloud call center, the agents connect directly to the cloud from their home office.
Just like the on premise call center, the cloud solution ends up with some new challenges. Mainly – the loss of control:

  • We don’t control the network quality of the agent
  • The environment of the agent is out of our control
  • It is likely that the device and peripherals of the agent are still in our control. But that’s not always the case either

And, even with our best intentions in asking the agents to be on ethernet, on a good network and in a quiet environment, they can struggle with doing it well enough.

Can you hear me now?

With work from home call center agents our main challenge becomes controlling their home environment and network.

At home, agents will have noise around them. Kids playing, family members watching television, the neighbors renovating (I had my share of this one during the pandemic), or traffic noises from the street. By using better headsets and noise suppression these can be improved and even solved.

The network is the bigger headache though. Many of your agents are likely to be non-technical in nature. How do they configure their home network? Which ISP are they using and with which communication bundle? How are they even connected to the network – via wifi or ethernet? How far are they from the wifi access point? Who else is using their home network and how? How is their network configured?
The answers to the questions above are going to affect the network quality and resulting audibility of their calls.
Since we can’t control their network, we at least want to understand it properly to be able to make intelligent decisions, such as routing calls to agents that have better networks and environments or to assist our agents in improving their network and environment.

Assessing a WFH call center agent’s environment

They say that knowing is half the battle. In order to solve a call quality problem you should start from understanding what is causing it, and that comes from understanding the network and environment of your agent.
There’s no specific, single solution or problem here, which is why the process usually takes a lot of back and forth interactions between the agent and the IT/support helping them out remotely.

What are the things that you’d like and need answers to?

  • What machine, operating system and browser is the agent using?
  • Are they using a headset? Is it a bluetooth one? Is it the one provided to them for this purpose?
  • Where is the agent located exactly? What ISP are they connected through?
  • Is the agent using a VPN? Are they behind a firewall? Has someone configured the agent’s DNS servers inappropriately? (you’ll be surprised)
  • Are their calls directed to the correct call center in a region nearby?
  • Can their calls flow over UDP or are they forced over TCP?
  • Are all of your applications needed by the agent available and reachable?
  • What does the agent’s network look like? Is it fiber? ADSL? Something else? Is their uplink accommodating enough for calling services?
  • How much VoIP traffic can their network handle?
  • When the agent connects to the PBX, what call quality do we measure?
  • Is their network clean or noisy, with packet losses and jitter?
  • What’s the latency like?

Getting answers to these questions quickly and accurately reduces the handling time of such issues. This is what our clients use our qualityRTC product for – to get the data they need as fast as possible to help them resolve issues sooner.

What’s your workflow?

Each call center has its own nuances – different infrastructure to test and different locations.
You have your own workflow and support process to tackle issues. Do you empower agents with self service, or keep close tabs on when and how network tests are conducted?
Some would rather have agents test their network daily at the beginning of the shift, while others want that to take place only when issues arise.
Large call centers usually need access to the data for BI purposes. Others want to map all their call center agents’ status once in a while – just to understand where they stand.

We’ve built qualityRTC with the help of the biggest call center providers out there, so we’ve got you covered no matter your workflow. qualityRTC is flexible to the level you’ll need, helping you reduce the support strain of WFH agents and letting you focus on what really matters – your customers.

If you want to really up your game in WebRTC diagnostics – for either voice or video scenarios – with Twilio, some other CPaaS vendor or with your own infrastructure – let us know. We can help using our qualityRTC network testing solution.

 


Network Jitter or Round Trip Time – which is more important in WebRTC?

Network Jitter or Round Trip Time – which is more important when testing or monitoring a WebRTC application?

You’ve got your WebRTC application. You have users communicating with it. How do you know they are having a good experience? How do you know you’ve placed your servers in the right locations? Got the routes properly configured? Do you need to add a new server in Frankfurt? Or maybe it would be better to beef up your Australian presence?

These answers require looking at the users you have and the quality they are getting. And when the time comes to look at WebRTC quality, you’ll hear the terms network jitter, latency and round trip time thrown around a lot.

So which one is more important to track and focus on with WebRTC? Is it network jitter or maybe it is round trip time?

I’d say both. But not exactly…

Let’s try to break this down to understand it better.

Network vs “glass to glass”

We can look at these metrics, and especially latency and round trip time, in different ways. The first question to ask is what exactly we are measuring.

The illustration above is a simplified version of the network traffic in a WebRTC session. We don’t have servers here and we don’t have a lot of other components. Rest assured that each component along the way can add latency and even affect jitter.

In the illustration, I also delineated 3 different areas:

  1. The peripheral, where the media is acquired and played. Screens, microphones, cameras, speakers – they all add inherent delays and some of it can be considerable. Bluetooth devices for example are notorious for adding delays (anyone said iOS 15?)
  2. WebRTC processing, on its own, designed and built to reduce delays and jitter, but a contributor to it as well. This is doubly true in media servers that you own and operate but also true for browsers you don’t control and your users are using to access your service
  3. Network, which is what we’re trying to measure, at least in this article

Here’s the thing: for the most part, in most use cases, you have little control or knowledge of the peripherals being used. Measuring their own effects is also hard and in many real world applications impossible. So we are going to ignore peripherals.

WebRTC processing and the network are usually bunched together and there’s little in the way of splitting them up. Based on what you see and experience, you will need to decide if the issue is the network (=infrastructure and DevOps) or WebRTC processing (=software bugs and optimizations).

Network Jitter vs Round Trip Time (or Latency)

To me, the difference between network jitter and round trip time is akin to the difference between weather and climate: weather reflects short-term conditions of the atmosphere, while climate is the average daily weather for an extended period of time at a certain location.

By the same token, jitter reflects short-term conditions – or more accurately, inconsistencies – in the flow of packets over a network, while round trip time (or latency) is the average time it takes for packets to flow through the network from one location to another, measured over a longer period of time.

Network Jitter answers the question how inconsistent the network is.

Round Trip Time (or Latency) answers the question how much delay is there in the network.

What’s “Network Jitter”?

In a WebRTC session, we will be sending packets continuously. On a voice call, in many cases, a packet will be sent every 20 milliseconds. With video, we will be sending enough packets to reach 30 frames per second, usually more than a single packet per frame, which means hundreds of packets every second.

Assuming the network experiences no packet loss, then we expect to receive the same number of packets in the same frequency.

Let’s look at a span of 200 milliseconds of audio from a sender’s perspective versus a receiver’s one. That’s 10 packets worth of data:

The sender sends an SRTP audio packet every 20 milliseconds in the illustration above, but the receiver doesn’t receive them exactly every 20 milliseconds – they are somewhat jittery… and that’s what we’re measuring with network jitter.
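To make the measurement concrete, here is a small sketch of how interarrival jitter can be estimated, following the running-average formula from RFC 3550 that RTP stacks (including WebRTC’s) use. The packet timings are made up for illustration:

```javascript
// Estimate interarrival jitter (RFC 3550, section 6.4.1): compare the spacing
// between arrivals with the spacing between send times, and smooth the result.
function createJitterEstimator() {
  let prevArrival = null; // arrival time of the previous packet (ms)
  let prevSent = null;    // send time of the previous packet (ms)
  let jitter = 0;         // smoothed jitter estimate (ms)

  return function update(arrivalMs, sentMs) {
    if (prevArrival !== null) {
      // D = (arrival_i - arrival_prev) - (sent_i - sent_prev)
      const d = Math.abs((arrivalMs - prevArrival) - (sentMs - prevSent));
      jitter += (d - jitter) / 16; // exponential smoothing with gain 1/16
    }
    prevArrival = arrivalMs;
    prevSent = sentMs;
    return jitter;
  };
}

// Audio packets sent every 20ms, but arriving unevenly (made-up arrival times)
const estimate = createJitterEstimator();
[0, 21, 45, 61, 95].forEach((arrivalMs, i) => {
  console.log(estimate(arrivalMs, i * 20).toFixed(2));
});
```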

What contributes to network jitter?

Mainly the network.

When you send packets over the internet, who guarantees that what gets sent is actually received and in a timely manner?

Think about the post office. Not all letters sent get to their destination, and not all letters get to their destination with the same latency (=on time). The same is true for a computer network, and the more complex the network, the harder it gets to do this properly.

Here are some things that can affect network jitter badly:

  • The user’s network and his location
    • Poor location. A user connecting from inside an elevator over cellular or sitting far away from his WiFi access point will result in bursty connections that will introduce high jitter and packet loss
    • Congested network. Either the local one (your daughter on TikTok and your son on Fortnite while you’re trying to have a conversation over WebRTC; an office with too many people on the Internet on a slow connection; 50,000 people in a stadium trying to do Facebook Live at the same time) or the path to the WebRTC infrastructure being clogged by network traffic
    • Faulty hardware. A bad ethernet cable… a true story: we had a client some time ago stress testing his service, only to find that packet loss (and jitter) originated from a faulty cable in his data center
    • CPU. Local resources on a user’s device, or on your TURN and media servers, can add jitter in themselves. If the CPU of a machine starts throttling, the end result is going to be jitter (and packet loss)

Things that end up adding to the jitter itself are packet loss (we never receive what was sent), duplication of packets (yes, that can happen) and reordering of packets (if they arrive out of order, there’s definitely jitter, just with an added headache).

Why is network jitter a bad thing?

Why is this bad? Because if we want to smoothly playback the audio and video being sent, we need to align it yet again towards what the sender intended. Or more accurately, towards what the microphone and camera captured on the sender side.

If we don’t align the incoming media, the audio will not sound natural and the video will look choppy. If you want to experience this firsthand, just make sure the CPU of the device you are using is busy doing other things while being on a video call.

How does WebRTC compensate for jitter?

This is something all VoIP services have: a jitter buffer. A jitter buffer is a software component that collects the received packets and decides when to play them out. It is used to handle lip synchronization (playing out audio and video together in sync), to reorder packets, and to account for the jitter on the network.

If we know that jitter can be around 30 milliseconds, then the jitter buffer can wait for at least that time before playing back packets, so that whenever we need to play back a packet in a smooth manner, that packet has already been received.

Since network jitter is dynamic in nature, so is WebRTC’s jitter buffer – it is an adaptive jitter buffer that tries to estimate how much jitter there is on the network, and increases or decreases the buffer size (length) based on what the network exhibits. Why do we do that? Because too small a buffer means a bad user experience due to dropped packets or improper playback, while too large a buffer adds latency to the playout, which we don’t want in real time interactive WebRTC sessions.
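To show the basic idea (minus all of WebRTC’s adaptive logic), here is a toy fixed-delay jitter buffer sketch. The 30 millisecond figure is just the assumed jitter from the example above:

```javascript
// Toy fixed-delay jitter buffer: hold each packet for BUFFER_MS after arrival,
// so that late (jittery) packets can still be played out in order.
const BUFFER_MS = 30; // the amount of network jitter we assume we need to absorb

class JitterBuffer {
  constructor() {
    this.queue = []; // packets waiting for their playout time
  }

  // Called whenever a packet arrives from the network
  push(packet, arrivalMs) {
    this.queue.push({ packet, playAt: arrivalMs + BUFFER_MS });
    // Order by media timestamp, which also fixes reordered packets
    this.queue.sort((a, b) => a.packet.timestamp - b.packet.timestamp);
  }

  // Called by the playout clock; returns the next packet that is due, or null
  pop(nowMs) {
    if (this.queue.length > 0 && this.queue[0].playAt <= nowMs) {
      return this.queue.shift().packet;
    }
    return null; // nothing due yet - the decoder conceals the gap
  }
}
```

A real adaptive jitter buffer keeps re-estimating that delay, growing it when the network gets noisier and shrinking it back to keep latency low.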

Do we look at “Latency” or “Round Trip Time”?

Latency, round trip time and delay are words that get lumped together, along with RTT – the acronym for round trip time. While there are nuances between them in what exactly each one means, the lower they are, the better the experience will be and the more interactive the session can be.

Here’s how I usually look at these and categorize them:

Latency for me is the time it takes for a packet of data to get from one point in the network to another.

Round trip time is the time it takes from sending a packet until the response packet gets back.

You can argue around latency and delay and decide if they should include or shouldn’t include the peripheral’s built in delay or even the delay added by WebRTC processing in end units or servers in the network.

For round trip time, the argument can be around the processing time needed to handle the incoming message and then send out the reply to it (if done incorrectly, this can add a considerable delay on its own).

And how do you measure latency exactly? If the clocks on the two devices aren’t fully in sync, how can you measure it? The result is that in most cases, and WebRTC is no different, you rely on the round trip time instead – if I send a message and wait for a response, all I need to do is check the time that passed. And that’s exactly what you can glean from the RTCP reports and WebRTC statistics.
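In the browser, both numbers are exposed through the standard getStats() API. A minimal sketch (the stat fields follow the W3C webrtc-stats spec; `pc` is assumed to be an existing RTCPeerConnection):

```javascript
// Read jitter and round trip time from a live RTCPeerConnection.
async function logNetworkStats(pc) {
  const report = await pc.getStats();
  report.forEach((stat) => {
    // Jitter is reported on the media we receive (in seconds)
    if (stat.type === "inbound-rtp" && stat.jitter !== undefined) {
      console.log(`${stat.kind} inbound jitter: ${(stat.jitter * 1000).toFixed(1)} ms`);
    }
    // Round trip time comes from the remote side's RTCP receiver reports (in seconds)
    if (stat.type === "remote-inbound-rtp" && stat.roundTripTime !== undefined) {
      console.log(`${stat.kind} round trip time: ${(stat.roundTripTime * 1000).toFixed(1)} ms`);
    }
  });
}

// Usage: sample the stats every few seconds on an active connection
// setInterval(() => logNetworkStats(pc), 5000);
```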

What contributes to round trip time?

Besides the things that affect jitter, you’ll find here also the route taken by the packets over the network.

Here’s how I usually explain it – let’s say your TURN server or media server or gateway is located in “East US”. That’s the generic name we all give to our first cloud data center choice.

Why? We want a global service, but we try to target the US first, so it needs to be in the US. And on the maps, the best alternative to also reach Europe is the east coast. So we end up with US East on one of the cloud vendors. At least until we grow and distribute our service.

What happens if the session takes place between 2 people who are both located in Paris and the session is routed through our media servers in the US?

That most probably will take a longer route both geographically and when measured in time, which ends up adding to the latency of the session. In many cases, it also means a higher packet loss as there are more opportunities along that route to lose packets.

This means that the way we design our infrastructure, deploy it around the globe and configure it has a considerable impact on the round trip time users are going to experience.

Why is high round trip time a bad thing?

More latency means it takes longer from the moment we do something until the other side can hear or see it.

For live streaming (somewhat related to WebRTC), the effects of latency are simple to explain. Here’s a good video for that:

If you are dealing with surveillance cameras, then latency is bad. When you’re in an interactive session – a 1:1 conversation or a group meeting – you’ll be expecting latency of below 200 milliseconds. Anything above that would be noticeable and nagging. You won’t know when someone has finished speaking so you can contribute to the conversation right after them, for example.

So we’d like to have low round trip time as well as low network jitter for a good interactive experience in WebRTC applications.

How does WebRTC compensate for high round trip time?

It doesn’t. Not really. You’re on your own. You’ll need to decide where to place your servers and how to configure the routes between them to reduce latency.

Solutions we’ve seen recently include:

  • Placing more media servers and TURN servers in more data centers closer to where your users are
  • Using third party TURN servers that are highly distributed (think Subspace and Cloudflare)
  • Going for a service such as AWS Global Accelerator to end up with an optimized route

At the end of the day, you’ll need to invest energy or money or both in order to improve round trip time as you grow your service.

We didn’t talk about packet loss

Here’s something you should understand – high round trip time or network jitter can easily cause packet loss.

If there’s congestion on the network, you might end up with packet loss since a network switch or router along the path of your packets decided to drop some of your packets because it is congested.

But if the packets arrive too late (because of high round trip time or high jitter), then playing them might not be an option anymore – their time has passed. In such a case, WebRTC would simply drop the packet even though it received it. The real time nature of WebRTC doesn’t allow it to buffer data forever.

Network jitter and round trip time – are these an infrastructure problem or an end user problem?

Both.

At times, network jitter and round trip time can occur due to infrastructure issues – anything from faulty cables, bad network configurations or just machines that are too busy to process data fast enough.

Other times, your user is to blame. Either due to his device or the network he is using.

Then there’s the network. If everyone is currently trying to access the network, there are bound to be clogged routes, even if only periodically.

It is going to be your job to try and understand where the problem originates from.

How to fix network jitter and round trip time using testRTC’s tools?

Glad you asked 😀

testRTC offers tools for the full life cycle of WebRTC applications. For the most part, fixing jitter and round trip time is going to be part of the operations work on your end – understanding where traffic is routed and how to redirect it elsewhere (including the possible need to add new regions and servers). Here’s where you’ll meet network jitter and round trip time in our services:

testingRTC

Our WebRTC testing service enables you to conduct integration, regression, functional, non-functional, sizing, load and stress testing.

In all tests we collect network jitter and round trip time for all simulated probes in a session. We treat your service as a black box, launch our machines from different locations around the globe (you define which ones) and collect that as part of the metrics we store. We make it available on the channel level, browser level and test level as an aggregate of everything. Access to it is offered via the dashboard and through APIs. You can even add your expectations of these values and cause tests to fail based on your thresholds. If you want, you can dynamically change these values for each browser in the test and see how this affects your service.
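As a rough illustration of that last point, a test script can throttle a probe’s network mid-run and then watch how jitter and round trip time respond. The calls below are a sketch based on testRTC’s script API; the exact profile names and argument formats are assumptions on my part, so check the documentation:

```javascript
// Hedged sketch of part of a testRTC test script (Nightwatch-style client).
// Profile names and reset syntax are assumptions - verify against the docs.
client
  .pause(60 * 1000)                   // first minute on an unthrottled network
  .rtcSetNetworkProfile("Regular 3G") // degrade this probe's network mid-test
  .pause(60 * 1000)                   // observe jitter / RTT under the degraded network
  .rtcSetNetworkProfile("");          // remove the throttling for the rest of the test
```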

upRTC

upRTC is our WebRTC active monitoring service. Its main purpose is to understand the behavior of your infrastructure. It does that by bringing predictability to the user side and their network, so you can be sure that every time the monitor’s browsers run in front of your infrastructure, they behave the same from the network’s point of view.

Here, looking at network jitter and round trip time and setting thresholds for them to alert you via email and webhook is the way to go.

watchRTC

watchRTC offers WebRTC passive monitoring. It hooks up to your users’ devices and collects their WebRTC metrics. This gets processed, aggregated and analyzed. Part of the metrics we collect and share is network jitter and round trip time. We do that on the individual channel level, the peer level, the room level and in aggregate across complex filters:

The purpose of it all is:

  1. Let you understand what your end users are experiencing
  2. Assist you in tracking down outliers in device types, operating systems, networks, locations, etc.
  3. Drill down to a certain user’s complaint when needed

qualityRTC and probeRTC

With qualityRTC and probeRTC we help your support and users answer the question “how can I improve my connectivity to your service?”

This is done by a series of tests, many of them collecting network jitter and round trip time data.

Talk to us

Need to figure out your network jitter? Have a round trip time and latency issue with users?

Come and talk to us. I am sure we will be able to help you figure out the issues.


Preparing for WebRTC scale: What Honorlock did to validate their infrastructure

This week, I decided to have a conversation with Carl Scheller, VP of Engineering at Honorlock. I knew about the remote proctoring space, and have seen a few clients work with testRTC. This was the first real opportunity I had to understand firsthand about the scale-related requirements of this industry.

Honorlock is a remote proctoring vendor, providing online testing services to higher education institutions. Colleges and universities use Honorlock when they want to proctor test taking online. The purpose? Ensure academic integrity of the students who take online exams as part of their courses.

Here’s the thing – every student taking an exam online ends up using Honorlock’s service, which in turn connects to the user’s camera and screen to collect video feeds in real time along with other things that Honorlock does.

Proctoring with WebRTC and usage behavior

When a student takes an online exam, the proctoring platform connects to the student’s camera and screen using WebRTC. The media feeds then get sent and recorded on Honorlock’s servers in real time, along the way passing some AI related checks to find any signs of cheating. The recordings themselves are stored for an additional period of time to be used if and when needed for manual review.

Offering such a secure testing environment requires media servers that record the session for as long as the exam takes place. If 100 students need to take an exam in a specific subject, they might need to do so at the same scheduled time.

Exams have their own seasonality. There is a lot of usage during midterm and final exam periods, whereas January sees a lot less activity, as most schools are barely in session.

Online proctoring platforms need to make sure that each student taking an exam gets a high quality experience no matter how many other students are taking an exam at the same time. This frees students to worry about their test and not about the proctoring software.

Honorlock’s scaling challenge

Honorlock is aware of this seasonality. They wanted to make sure that their application can handle the load in its various areas, especially given the expected growth of their business in the near future.

What Honorlock were looking for was an answer to the question: at what point do they need to improve their application to scale further?

Honorlock is using a third party video platform. They have decided early on not to develop and deploy their own infrastructure, preferring to focus on the core experience for the students and the institutions using them.

Honorlock decided not to blindly rely on assumptions about the scale of the third party video platform, and instead went ahead and conducted end-to-end stress testing, validating their assumptions and scale requirements.

When researching alternatives, it was important for Honorlock to be able to test the whole client-side UI, to make sure the video infrastructure gets triggered the same way it would in real life. There was also a need to be able to test everything, and not only focus on scale testing each component separately. This led Honorlock to testRTC.

“We’ve selected testRTC because it enabled us to stress test our service as closely to the live usage we were expecting to see in our platform. Using testRTC significantly assisted us in validating our readiness to scale our business.”

Carl Scheller, VP of Engineering, Honorlock

Load testing with testRTC

Once Honorlock started using testRTC, it was immediately apparent that the testing requirements of Honorlock were met:

  • Honorlock made use of the powerful scripting available in testRTC. These enabled handling the complexity of the UX of a proctoring service
  • Once ready, being able to scale up tests to hundreds or thousands of concurrent browsers made the validation process easier. Especially with the built-in graphs in testRTC focusing on high level aggregate media quality information
  • The global distribution of testRTC’s probes and their granular network controls enabled Honorlock to run stress tests with different machine configurations mapping to Honorlock’s target audience of students

A big win for Honorlock was the level of support provided by testRTC throughout the process. testRTC played an important role in helping define the test plan, writing the test scripts and modifying the scripts to work in a realistic scenario for the Honorlock application.

Building a partnership

Working with testRTC has been useful for Honorlock. While the testRTC service offered a powerful and flexible solution, the real benefit was in the approach testRTC took to the work needed. From the get go, testRTC provided hands on assistance, making sure Honorlock could ramp up their testing quickly and validate their scale.

That ability to get hands on assistance, coupled with the self service capabilities found in testRTC were what Honorlock was looking for.

The validation itself assisted Honorlock in uncovering issues in both their platform as well as the third party they were using. These issues are being taken care of. Using testRTC enabled Honorlock to make better technical decisions.

How Houseparty uses testRTC as an integral part of its WebRTC testing

Houseparty selected testRTC for its WebRTC infrastructure regression testing through continuous integration.

Houseparty is a mobile group video chat application, where groups of up to eight friends gather to chat in virtual rooms. With over half a billion video chats conducted using WebRTC, Houseparty is massive in its scale. What makes Houseparty interesting is that the majority of its user base is 24 years old or younger, spending upwards of 50 minutes a day inside the app.

Being a social platform, Houseparty has to innovate on a daily basis. This calls for frequent updates of its mobile applications and infrastructure. An update to the Houseparty backend infrastructure happens on a daily basis and the mobile apps are updated every two weeks on average.

In Search of a Regression Testing Tool

The developers at Houseparty wanted an early warning system in place – one that would tell the team if the changes being made are breaking the service for its users. And breakage here means a reduction in media quality or the inability to work in certain network conditions. What Houseparty’s developers were looking for was higher confidence in their version rollouts.

Houseparty already had stress testing capabilities in place, along with the ability to test their mobile applications. What they were missing was regression testing for the infrastructure. When a decision had to be made, Houseparty preferred to use testRTC’s testing service instead of building their own testing environment, saving months of effort by experienced WebRTC developers, with the understanding that a homegrown solution would end up inferior in terms of feature set and capabilities.

By selecting testRTC, Houseparty’s developers were able to improve their confidence level when upgrading the service for their millions of users.

“testRTC offered us the fastest and cheapest way to get the type of regression testing we needed, increasing the confidence we had when rolling out new releases of the Houseparty application”

Simplicity is Key

One of the key reasons for selecting testRTC was the simplicity of the service. From writing tests, through selecting the machines’ configuration and defining test success criteria down to integrating with the API.

The ability to pick different network configurations was really important to Houseparty. Using both the preconfigured settings as well as dynamically modifying network conditions enabled Houseparty to quickly and efficiently understand how the behavior of their application is affected.

Furthermore, by using test expectations in testRTC – a mechanism that lets developers set success and failure criteria for a test based on the metrics collected – Houseparty developers are alerted when results need to be further analyzed. This enables Houseparty’s developers to spend more time on their application and less on drilling down into results, trying to understand their meaning.

When drilling down into results is needed, the graphs displayed assist the developers in debugging problems and resolving them faster.

Outgoing and incoming video bitrate for an 8-person room with simulcast enabled

Mobile Only and WebRTC

While Houseparty runs predominantly as a mobile application, its video processing makes use of WebRTC. Houseparty makes a distinction between its application testing and its infrastructure testing. It already had tools in its arsenal for testing its mobile clients. What it was looking for was a way to test its video infrastructure – its media servers and TURN servers – making sure they work as expected.

To that end, Houseparty is using a simple HTML page that can be used to create calls on its staging environment for the application. testRTC is then used to access that page and automate the testing process, simulating different network conditions while testing Houseparty’s video infrastructure.

Continuous Integration as a First Priority

Houseparty made the decision early on to use testRTC as part of their continuous integration environment. Using testRTC’s APIs, the developers at Houseparty were able to quickly integrate the testing scripts they’ve written in testRTC to their Jenkins automation server.

This allows Houseparty to run the testRTC regression tests every night. Integrating testRTC with Jenkins means that when tests complete, their results are reported back to Jenkins and from there they get sent to Slack, where developers get notified of potential failures.

Running testRTC tests nightly from Jenkins with integrated reporting and notifications

Moving Forward with testRTC

For Houseparty, the work is not done yet. testRTC is used on a daily basis, running a battery of tests designed to check their infrastructure. There are additional tests that are planned to be added to this test suite.

Peer-to-peer testing and direct TURN server testing will be added in the near future, increasing the coverage of regression testing done over testRTC.

How Clique Migrated Smoothly to the Newest AWS EC2 C5 Instance

In a need to focus resources on core activities, Clique Communications turned to testRTC for stress testing and sizing.

Clique API provides web-based voice and text application programming interfaces. In their eight years of existence, they have grown to support over 20 million users across 150 countries. This amounts to over 500 million minutes per month. Clique’s cloud services deliver multi-party voice that can be embedded by enterprises into their own business processes.

What are their current goals?

  • Grow the business
  • Add features to improve customer service and experience
  • Offer value-added services

Adding WebRTC

Clique started working with WebRTC some 18 months ago with customers starting to use it at the end of 2017.

Today, Clique supports all major browsers – Chrome, Firefox, Safari, and Edge; enabling its customers to offer uninterrupted interactions with their users. When a user joins a conference, he/she can do so over PSTN, directly from the browser or from within a native application that utilizes Clique SDKs.

Making Use of testRTC

As with any other software product, Clique had to test and validate its solution. To that end, Clique had already been using tools for handling call volumes and regression, testing the application and the SDKs. The challenge was the issue of scalability and quality of service, which is essential when it comes to WebRTC support. Clique had a decision to make – either invest in building their own set of testing tools on top of open source frameworks such as Selenium, or opt for a commercial alternative. They decided to go with the latter and use testRTC. They also preferred using a third party tool for testing as they didn’t want to burden their engineering team.

Switching from AWS EC2 C4 to C5

Clique had previously used a standard instance on Amazon from the AWS EC2 C4 series but when the AWS EC2 C5 series came out, they wanted to take advantage of it – not only was it more economical but it also had better performance. Furthermore, knowing Amazon would release newer sets of servers that would need to be tested again, Clique required this process to be repeatable.

The Action Plan

Since Clique is an embeddable service, they decided it was most strategic to have a third party develop an application using the Clique client SDK and APIs, and use that application as a test framework that could scale along with the performance of the platform. It was a wonderful opportunity to optimize their own resources and save on the instances that they deploy on Amazon. An added bonus was having a third party application that Clique’s customers and partners building their own applications could then use as part of their development process.

Clique wrote their own test scripts in testRTC. The main test scenario for Clique was having a moderator who creates the conference and then generates a URL for other participants in the conference to join. Once they figured out how to do that with testRTC, the rest was a piece of cake.

Using testRTC to assist in sizing the instances on AWS had ancillary benefits beyond Clique’s core objectives. Clique tested the full life-cycle of its solution. From developing yet another application with its SDKs and integrating its APIs, to continuous integration and devops, Clique discovered bugs that were then fixed, optimized performance and gained the confidence to run services at scale on next generation architectures.

“testRTC provided Clique with a reliable and repeatable mechanism to measure our CPaaS performance… allowing Clique to save money, remain confident in our architectural choices and more importantly showcase our platform to customers with the integrity of an independent test system.”

Moving Forward – Continued use of testRTC

There are a lot of moving pieces in Clique’s solution: backend infrastructure, media servers, and WebRTC gateways. Features such as recording can fit into various components within the architecture, and Clique is always looking for ways to optimize and simplify.

testRTC helps Clique evaluate if the assumptions made in their architecture are valid by determining bottlenecks and identifying places of consolidation.

In the future, Clique will be looking at testRTC’s monitoring capability as well as using testRTC to instantiate browsers in different locations.


How Many Sessions Can a Kurento Server Hold?

Here’s a question we come across quite often at testRTC.

You decided to self develop your own service. Manage your own media servers. And now the time comes to understand your ongoing costs as well as decide on the scale out scheme – at what point do you launch/spawn a new server to take up some of the load from your current media server farm? How many users can you cram into a single media server anyway?

We decided to check just that, doing it with the help of WebRTC.ventures who worked with us on the setup.

For the purpose of this set of sizing experiments, we picked Kurento, one of the most versatile open source media servers out there today. We selected a few key scenarios, and WebRTC.ventures installed the server and configured it for us.

We then used our testRTC probes to understand how many users we can cram onto the server in each scenario.

Simple scenario sizing is one step in the process. If you are serious about your service, then check out our best practices to stress testing your WebRTC application.

Get the best practices guide

Why Kurento?

There are a couple of reasons why we picked Kurento for this one.

  1. Because many use it out there, and we’ve been helping customers understand and debug it when they needed to
  2. It is versatile. We could try multiple scenarios with it with relative ease and little programming (although that wasn’t our part of the project)
  3. It does media processing beyond just routing media. We wanted to see how this will affect the numbers, especially considering the last reason below
  4. It’s the first of a few media servers we’re going to play with, so stay with us on this one

The Scenarios

For the Kurento service, we picked 3 different scenarios we wanted to test:

  1. 1:1 video calls. A typical doctor visitation or similar scenario, where two participants join the same session and the session gets recorded (two separate streams, one for each participant).
  2. 4-way group video calls. The classic scenario, in an MCU configuration. Kurento decodes and encodes all media streams, so we’re giving it quite a workout
  3. Live broadcast. A single person talking to a large group of viewers.

For scenarios (1) and (2) our question is how many concurrent sessions can the Kurento server hold.

For scenario (3) our question is how many viewers for a single broadcast can the Kurento server hold.

The Setup

To set things up for our test, we did the following:

  • We went for a simple AWS t2.medium machine, but quickly had to switch to a more capable machine. We ended up with a c4.2xlarge instance (8 vCPU, 15 GB RAM) on AWS
  • We had it monitored via New Relic, to be able to check the metrics (but later decided to forgo this approach and just use top with root access directly on the machine)
  • We also had an easy way to reset the Kurento server. We knew that rattling it too much between tests without a reset would affect our results. We wanted a clean slate each time we started

The machine was hosted in Amazon US-East.

testRTC probes were coming in from a different cloud vendor, East and West US locations.

We didn’t do any TURN related stuff – so our browser traffic hit the Kurento server directly and over UDP.

The Process

For each scenario, we’ve written a simple test script that can scale nicely.

We then executed the test script in its minimal size.

For 1:1 video calls and broadcasts we used 2 probes and for the 4-way group video call we started with 4 probes.

We ran each test for a period of 4-5 minutes, to check the stability of the media flow.

We used that as the baseline of our results and monitored to see when adding more probes caused the media metrics to start faltering.

1:1 Video Calls

The above screenshot is what you’ll see if you participated in these sessions. There’s a picture in picture view of the session, where the full screen area is the remote incoming video and the smaller window holds our local view.

Baseline

Kurento’s basic configuration limits bitrate of calls to around 500kbps. This can be seen from running a single session in our high level chart:

And here’s the stats on the channels of one of the two probes in this baseline test run:

Now that we have our baseline, it was time to scale things up.

30 Probes (=15 sessions)

When we went up to 30 probes, running in 15 parallel 1:1 video sessions, we ended up with this graph:

While the average bitrate is still around 500kbps, we can see that the min/max bands are not as stable.

If we look at the packet loss graph, things aren’t happy (the baseline had no packet losses):

This is where we went for the “By probe” tab, looking at individual bitrates across the probes:

What we can see immediately is that 4 probes out of 30 didn’t get the full attention of the Kurento media server – they got to send and receive less than 500kbps.

If we switch to the packet loss by probe, we see this:

A couple of things that come to mind:

  1. Kurento degrades quality to specific sessions and not across the board. Out of 30 users, 22 got the expected results, 4 had lower bitrates and another 4 had packet losses
  2. There’s correlation here. When Probe #04 exhibits reduction in bitrate, Probe #3 reports incoming packet losses

From here, we can easily go down the path of drilling down to the probes that showed issues. I won’t do it now, as there’s still a lot to cover.

22 Probes (=11 sessions)

It stands to reason then that lowering the capacity to 22 probes should give us pristine results.

Here’s what we’ve seen instead:

We still have that one session that goes bad.

20 or 18?

When we went down to 18 or 20 probes, things got better.

With 20 the issue is that we couldn’t really reproduce a good result at all times. Sometimes, the scenario worked, and other times, it looked like the issues we’ve seen with the 22 probes.

18 though seemed rather stable when tested a couple of times:

Depending on the service you’re offering, I’d pick 18. Or even go down to 16…

4-Way Group Video Calls

The above is a screen capture of the 4-way group video call scenario we’ve analyzed.

In this case, each probe (browser) sends out video at a resolution of 640×360 and receives a video resolution of 800×600.

The screenshot doesn’t show the images getting cropped, so we can assume the Kurento media server takes the following approach to its pipeline:

That’s lots of processing needed for each probe added, which means we can expect lower scaling for this scenario.

Baseline

Our baseline this time is going to need 4 probes.

Here’s how the high level video graph looks:

Not as stable as our 1:1 video calls, but it should do for what’s coming.

Note that each probe still has around 500kbps of video bitrate.

I’ll skip the drill down into a specific probe’s metrics and take this as our baseline.

20 Probes (=5 sessions)

Since 1:1 video sessions didn’t go well above 20, we started there and went down.

Here’s what 20 probes look like:

Erratic.

Checking packet losses and bitrates by probe yielded similar results to the bad 1:1 sessions. Here’s the by probe bitrate graph:

Going down to 16 probes (=4 sessions) wasn’t any better:

I’ve actually looked at the bitrates and packet losses by probe, and then decided to map them out into the sessions we had:

This paints a rather grim picture – all 4 sessions hosted on the Kurento server suffered in one way or another. Somehow, the bad behavior wasn’t limited to one session, but showed itself on all of them.

Down to 12 Probes (=3 sessions)

We ended up with 12 probes showing this high level bitrate graph:

It showed some sporadic packet losses that were spread across 3 different probes. The following shows the high level by probe bitrate graph:

There’s some instability in the bitrates and the packet losses which will need some further investigation, but this is probably something we can work with and try and optimize our service to run well.

Live Broadcast

The above screenshot shows what a viewer sees on a live broadcast scenario that we’ve set up using Kurento.

We’ve got multiple testRTC probes joining the same broadcast, with the first one acting as the broadcaster and the rest are just viewers.

Baseline

Our baseline this time is going to need 2 probes. A broadcaster and a viewer.

From now on, we’ll be focusing on what the viewers experience – a lot more than what happens to the broadcaster.

We’re still in the domain of 500kbps for the video channel:

One thing to remember here – outgoing media happens only for our broadcaster probe and incoming media happens for all the other probes.

30 Probes (=29 viewers)

We started with 30 probes – assuming we will fail miserably based on our previous tests, and got positively surprised:

Solid bitrate for this test.

Climbing up

We’ve then started moving up with the numbers.

50, 60 and 80 probes went really well.

That got our appetite going, and we jumped towards 150 probes.

And ended up with this high level graph:

There wasn’t any packet loss to indicate why that drop with the broadcaster occurred at around 240 seconds, so I switched to the “By probe” view.

This showed that things were starting to deteriorate somewhat:

We’re sorting the results just for this purpose – you can see there’s a slight decline in average bitrate across the probes here – something that is a lot less apparent for smaller test sizes. There was no packet loss.

We’ve tried going upwards to 200, but then 12 probes didn’t even connect properly:

Going down to 100 yielded some connection errors in some of the probes as well. Specifically, I saw this one:

This indicates we’ve got a wee bit of an issue here that needs to be solved before we can continue our stress tests any further. Most probably in the signaling layer of our server. It is either unstable when we place so many viewers at once against it, or just doesn’t really handle the load well enough.

Results Summary

The summary below shows the various limits we’ve reached in our rounds of sizing tests:

  • 1:1 video calls – 18 users in 9 parallel sessions
  • 4-way group video calls – 3 rooms of 4 users each
  • Live broadcast – 1 broadcaster + 80-150 viewers

What did we learn?

  1. Stress testing for sizing purposes is fun. I actually enjoyed going through the results and running a couple of tests of my own (I didn’t write the scripts or run the initial tests – I delegated that to our support engineer)
  2. Different scenarios will dictate very different sizing. With more time, I’d start working out on finding the bottlenecks and optimizing them – I’m sure more can be squeezed out of a Kurento machine
  3. Once set up and written intelligently, it’s really easy to rerun the tests and change the number of probes used

Next Steps

Once we got to the sweet spot in each scenario, the next thing to do would probably be to run it more than once.

We usually set up a testRTC monitor to run once every 15 minutes to an hour for a couple of days on such a scenario, just to make sure we’re seeing stable results more than once.

Other than that, this needs to be tested under different network conditions, varying load factors, etc.

Check out our best practices for stress testing WebRTC applications. It is relevant even if you are not using testRTC.

Get the best practices guide

I’d like to thank WebRTC.ventures for the assistance in setting this one up. If you are looking for a capable vendor to custom build your WebRTC application – check them out.


How to Prepare Your WebRTC Application for a Surge in Traffic

OK, this is the moment you’ve been waiting for: there’s a huge surge in traffic on your WebRTC application. Success! You even had the prescience to place all of your web application’s assets on a CDN, and whatever uptime monitoring service you use – be it New Relic, Datadog or a homegrown Nagios solution – says all is fine. But there’s just one nagging problem – users are complaining. More than they used to. Either because the service doesn’t work at all for them or the quality of the media just doesn’t cut it for them. What The–?!

We recently hosted a webinar about preparing for that big WebRTC launch. You might want to check the suggestions we made there as well.

Register now for free: WebRTC – How NOT to Fail in Your Big Launch

Let’s start by focusing on the positives here. Your service is being used by people. Then again, these people aren’t getting the real deal – the quality they are experiencing isn’t top notch. What they are experiencing is an inability to join sessions, low bitrates or inexplicable packet losses. These are different from your run-of-the-mill 500 and 502 errors, and you might not even notice something is wrong until a user complains.

So, what now?

Here’s what I’m going to cover today:

  1. Learn how to predict service hiccups
  2. Prepare your WebRTC application in advance for growth

Learn How to Predict Service Hiccups

While lots of users is probably what you are aiming for in your business, the effect they can have on your WebRTC application, if you are unprepared for it, can be devastating. Sure, sometimes they’ll force your service to go offline completely, but many other times the service will keep on running while delivering a bad user experience. This can manifest itself in users waiting a long time to connect, needing to refresh the page in order to connect, or just getting poor audio and video quality.

Once you get to that point, it is hard to know what to do:

  • Do you throw more machines on the problem?
  • Do you need to check your network connections?
  • How do you find the affected users and tell them things have been sorted out?

This mess is going to take up a lot of your time and attention to resolve.

Here is something you can do to try and predict when these hiccups are about to hit you:

Establish a Baseline

I’ve said it before and I’ll say it again. You need to understand the performance metrics of your WebRTC service. In order to do that, the best thing is to run it a bit with the acceptable load that you have and write down the important metrics for yourself.

A few that come to mind:

  • Bitrate of the channels
  • Average packet loss
  • Jitter

Now that you have your baseline, take the time to gauge what exactly your WebRTC application is capable of doing in terms of traffic. How much load can it carry as you stack up more users?

One neat trick you can do is place a testRTC monitor and use rtcSetTestExpectation() to indicate the thresholds you’ve selected for your baseline. Things like “I don’t expect more than 0.5% packet loss on average” or “average bitrate must be above 500kbps”. The moment these thresholds are breached, you’ll get notified and will be able to see whether this is caused by growth in your traffic, changes in usage behavior, etc.
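
As a rough sketch, such expectations can sit directly in the monitor’s script. The exact metric names below are an assumption on my part – verify them against the current testRTC documentation before relying on them:

  // fail (and notify) if the run breaches the thresholds derived from our baseline
  client.rtcSetTestExpectation("audio.in.packetloss < 0.5"); // at most 0.5% average packet loss
  client.rtcSetTestExpectation("video.in.bitrate > 500");    // average incoming video bitrate above 500kbps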

Prepare Your WebRTC Application in Advance for Growth

There aren’t always warning signs that let you know when a rampaging horde of users may come at your door. And even when there are, you had better have some kind of solution in place and be prepared to react. This preparation can be just knowing your numbers and having a resolution plan in place that you can roll out, or it can be an automated solution that doesn’t require any additional effort on your end.

To get there, here are some suggestions I have for you.

Find Your System’s Limits

In general, there are 3 main limits to look at:

  1. How big can a single session grow?
  2. How many users can I cram into a single server?
  3. How many users can my service serve concurrently?

You can read more on strategies and planning for stress testing and sizing WebRTC services. I want to briefly touch on these limits though.

1. How big can a single session grow?

Being able to handle 500 1:1 sessions doesn’t mean you can handle 100 sessions of 10 users each. The growth isn’t linear in nature. On top of that, above a certain number of users the end result might choke your server or simply deliver bitrates that are too low.

Make sure you know what’s the biggest session size you are willing to run.

Besides doing automated testing and checking the metrics against the baseline you want, you can always run an automated test using testRTC and at the same time join from your own browser to get a real feeling of what’s going on. Doing that will add the human factor into the mix.

2. How many users can I cram into a single server?

Most sizing tests are about understanding how many sessions/users/whatever you can fit into a single server. Once you hit that number, you should probably launch another server and use a load balancer to scale out.

Getting that number figured out based on your specific scenario and infrastructure is important.

3. How many users can my service serve concurrently?

Now that you know how to scale out from a single server, growth can be thought of as linear (up to a point). So it is probably time to put automatic scale-out and scale-down in place and test that it works nicely.

Doing so will greatly reduce the potential damage that a service hiccup can cause.

CDN and Caching

Make sure all of the HTML assets of your WebRTC application that are static are served through a CDN.

In some cases, when we stress test services, just putting 200 browsers in front of an HTML page that serves a WebRTC application can cause intermittent failures in loading the pages. That’s because the web serving part of the application is often neglected by WebRTC developers who are focusing their time and energy on the bigger resource hogs.

We’ve had numerous cases where the first roadblock we hit with a customer was a minor JavaScript file they forgot to place on the CDN.

Don’t be that person.

Geographically Distributed Deployment

The web and WebRTC are global, but traffic is local.

You don’t want to send users to the other side of the globe unnecessarily in your service. You want your media and NAT traversal servers to be as close to the users as possible. This gives you the flexibility of optimizing the backend network when needed.

Make sure your deployment is distributed along multiple datacenters, and that the users are routed to the correct one.

Philipp Hancke wrote about how they do it at appear.in for their TURN servers.

Monitor Everything

CPU. Memory. Storage. Network. The works.

Add application metrics you collect from your servers on top of it.

And then add a testRTC monitor to check media quality end-to-end, to make sure everything runs consistently.

Check across large time spans whether your service quality is improving or degrading.

Stress Testing

Check your system for the load you expect to be able to handle.

Do it whenever you upgrade pieces of your backend, as minor changes there may cause large changes in performance.

Don’t Let Things Out of Your Control

WebRTC has a lot of moving parts in it so deploying it isn’t as easy as putting up a WordPress site. You should be prepared for that surge in traffic, and that means:

  1. Understanding the baseline quality of your service
  2. Knowing where you stand with your sizing and scale out strategy
  3. Monitoring your service quality so you can react before customers start complaining

Do it on your own. Use testRTC. Use whatever other tool there is at your disposal.

Just make sure you take this seriously.

We recently hosted a webinar about preparing for that big WebRTC launch. You might want to check the suggestions we made there as well.

Register now for free: WebRTC – How NOT to Fail in Your Big Launch


3 Synchronization techniques to test WebRTC at scale

Testing WebRTC is hard enough when you need to automate a single test scenario with two people in it, so doing things at scale means lots more headache.

We’ve noticed that in the past several months, as more developers have started using our service to understand the capacity they can load onto a single server. And as we do with all of our customers, we assisted them in setting up the scripts properly – it is still early days for us, so we make it a point to learn from these interactions.

What we immediately noticed is that while our existing mechanisms for synchronization can be used, they should be used slightly differently, because at scale the problems are also different.

How do you synchronize with testRTC?

There are two main mechanisms in testRTC to synchronize tests, and we use them together.

What we do is think of a test run as a collection of sessions. Each session has its own group of agents/browsers who make up that session. And inside each such session group – you can share values across the agents.

So if we want to do a test run for our WebRTC service similar to the above – 4 video conference calls with 5 browsers in each call – we configure it the following way in testRTC:
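
As a rough sketch of that configuration – the exact run-option syntax is an assumption here, so check it against the testRTC documentation – you would set the test to 20 concurrent probes and add a session directive in the run options:

  #session:5

With 20 probes and a session size of 5, the run is split into 4 independent session groups – our 4 video conference calls of 5 browsers each.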

While this is all nice and peachy, let’s assume that in order to initiate a video conference, we need someone in each group of 5 browsers to be the first to do *something*. It can be setting up the conference, getting a random URL – whatever.

This is why we’ve added the session values mechanism. With it, one agent (=browser) inside the session can share a specific value with all other agents in its session – and agents can wait to receive such a value and act upon it.

Here’s what it looks like for a testRTC agent to announce, for example, that it has logged in and is ready to accept an incoming call:
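
A minimal sketch of that announcement (the command chain around it will depend on your application):

  // announce to the rest of the session that this agent has logged in
  // and can now receive an incoming call
  client.rtcSetSessionValue("readyForCall", "ignore");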

We decided arbitrarily to call our session key “readyForCall”, and we used an arbitrary value of “ignore” just because.

On the ‘receiving’ end here, we use the following code:
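
A sketch of the receiving side – the “.call” selector here is a stand-in for whatever button your own UI exposes:

  // wait (up to 30 seconds) for someone in this session to announce readiness,
  // then place the call
  client.rtcWaitForSessionValue("readyForCall", function (value) {
      client.click(".call");
  }, 30000);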

So now we have the second browser in the session waiting to get a value for “readyForCall”, and in this simple case, ignoring the value and clicking the “.call” button in the UI.

This technique is something we use all the time in most of the scripts these days to get agents to synchronize their actions properly.

How do we scale a WebRTC test up?

The neat thing about these session values is that they get signaled around only within the same session. So if we plan and write our test script properly, we can build a single simple session where browsers interact with each other, and then scale it up by increasing the session size to what we want and the number of concurrent agents in the test run.

With our video conferencing service, we start with a 3-way session, using 3 agents. We designate agent #1 in the session as our “leader”, who must be the first to log in and set up the session. Once done, it sends the URL as a session value to the other agents in the session.

The moment we want to scale that test up, we can grow the session size to 5, 10, 20, 100 or more. And when we want to check multiple video conferences in parallel, we can just grow the number of concurrent agents in the test run but leave the session size smaller.
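
In script terms, that leader/follower split can be sketched roughly like this. agentType is assumed to hold this browser’s index inside its session (as described later in this post), the “roomUrl” key name is arbitrary, and the way the room URL is actually created is entirely application specific:

  if (agentType === 1) {
      // leader: set up the conference first
      client
          .url(process.env.RTC_SERVICE_URL)      // assumed entry point; use your own
          .waitForElementVisible("body", 10000)
          // ... application-specific steps that create the room ...
          .url(function (result) {
              // publish the resulting room URL to the other agents in this session
              client.rtcSetSessionValue("roomUrl", result.value);
          });
  } else {
      // followers: wait for the leader to publish the URL, then join it
      client.rtcWaitForSessionValue("roomUrl", function (url) {
          client.url(url);
      }, 60000);
  }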

A typical configuration for several test runs of scale tests will look like this:

  1. Start with 5 agents in a single session
  2. Then run 10 agents in 2 sessions (5 agents per session)
  3. End with 200 agents in 10 sessions (20 agents per session)

What will usually go wrong as we scale our WebRTC scenario?

Loads of things. Mainly… load.

We’ve seen servers that break down due to poor network connection. Or maxed out CPU. Or I/O as they store logs (or media recordings) to the disk. And bad implementations and configurations. You name it.

There are, though, a few issues that seem to plague most (all?) WebRTC based services out there. And the main one of them is that they hate a horde logging in at roughly the same time.

That just kills them.

You take 20 browsers. Point them all to the same URL, in order to join the same session, and you get them to try it out all together in the span of less than a second. And things fall to pieces.

I am not sure why, but I have my own doubts and ideas here (something to do with the way RTCPeerConnection is used to maintain these media streams and how the SFUs manage it internally in their own crazy state machine). Now, for the most part, customers don’t care. Because this usually won’t happen in real life. And if it does – the user will hit F5 to refresh his browser and the world will get back to normalcy for him. So it gets lower priority.

Which leads us again to synchronization issues. How can we almost un-synchronize browsers and have them NOT join together, or at least have them join “slower”?

We’ve devised a few techniques that we are using with our customers, so we wanted to share them here. I’ll call them our 3 synchronization techniques for testing WebRTC at scale.

Here they are.

#1 – Real-users-join-randomly

This is as obvious as it gets.

If we have 10 users that need to enter the same session, then in real life they won’t be joining at the exact same time. Our browsers do. So what do you do? You randomize having them join.

For 3 browsers, we have them all join “at the same time”, we just spread it around a bit – just like in the illustration below, where you can see in the red lines where each browser decided to join:

Here’s how we usually achieve that in testRTC:
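
A sketch of that randomization – a single random-length pause at the top of the script, before anything else runs (the 4-second window is arbitrary):

  // spread the joins across a window of up to 4 seconds,
  // so the browsers don't all hit the service at the exact same moment
  client.pause(Math.floor(Math.random() * 4000));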

#2 – Pace-them-into-the-service technique

Random doesn’t always cut it for everyone. This becomes an issue when you have 100 or more browsers you want to load the server with. I am not sure why that is, as it has nothing to do with how testRTC operates (how do I know this? Using the same test on something like AppRTC with no pacing works perfectly well), but again – developers are usually too busy to look at these issues in most of the scenarios that we’ve seen.

The workaround is to have these browsers “walk in” to the room roughly one after the other, at a given interval.

Something like this:

Here, we pace the browsers to join at a 300 millisecond interval from one another. The script for it will be similar to this:
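
A sketch of that pacing, assuming agentType holds this browser’s index inside its session, so browser 1 waits 300ms, browser 2 waits 600ms, and so on:

  // pace the agents into the room, roughly one every 300 milliseconds
  client.pause(agentType * 300);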

This is a rather easy method we use a lot, but sometimes it doesn’t fit. This occurs when timing can get jumbled due to network and other backend shenanigans of services.

#3 – One-after-the-other technique

Which is why we use this one-after-the-other technique.

This one is slightly more difficult to implement, so we use it only when necessary. Which is when the delay we wish to create doesn’t sit at the beginning of the test, but rather after some asynchronous action needs to take place – like logging in, or waiting for one of the browsers to create the actual meeting place.

The idea here is that we let each browser join only after another one in the list has already joined. We create a kind of dependency between them using the testRTC synchronization commands. This is what we are trying to achieve here:

So we don’t really care how much time each browser takes to finish his action – we just want to make sure they join in an orderly fashion.

Usually we do that from the last browser in the session down to the first. There are three reasons why:

  1. It looks a lot smarter – like we know what we’re doing – so my ego demands it
  2. It makes it easier to scale a session up, since we’re counting the numbers down to zero
  3. We can stop in the middle easily, if we have different types of browsers in the same session

Here’s what the code for it looks like:
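
The sketch below captures the idea. agentType and sessionSize are explained right after it, the session-value keys are arbitrary names picked for this sketch, and “.join” stands in for whatever actually joins your room:

  function joinRoom() {
      // application specific: click/navigate to whatever joins the session
      client.click(".join");
  }

  if (agentType === sessionSize) {
      // the last browser in the session joins first...
      joinRoom();
      // ...and then signals the previous browser that it may join
      client.rtcSetSessionValue("ready-" + (agentType - 1), "go");
  } else {
      // wait for the browser after us to finish joining
      client.rtcWaitForSessionValue("ready-" + agentType, function (value) {
          joinRoom();
          if (agentType > 1) {
              // let the previous browser know it is now its turn
              client.rtcSetSessionValue("ready-" + (agentType - 1), "go");
          }
      }, 120000);
  }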

Here, what happens is this:

  • agentType holds the index number of the running browser inside the session
  • sessionSize holds the number of browsers in a single session
  • If we are not the last browser in the session, then we wait until the next browser tells us it is ready. When it does, we join, and then we tell the previous browser in line that we are ready
  • If we are the last browser, we just join and tell the previous one that we’re ready

A bit more complex, so we save it for when it is really necessary.

What’s next?

Here’s what we’ve learned:

  1. We use session and session values for synchronization and scale purposes
    1. We split a test run into groups of browsers, each designated to its own session
    2. Inside a session, we can give different roles to different browsers
    3. This enables us to pick and choose the size of a session and the size of a test run easily
  2. In most cases, large sessions don’t like browsers joining all at once – it breaks most services out there (and somehow, developers are fine with it)
  3. There are different ways to get testRTC to mimic real life when needed. Different techniques support different scenarios

If you are planning on stress testing your WebRTC service – and you probably will be at some point in time – then come check us out. Here are a few of the questions we can answer for you:

  • How many users can I cram into a single session/room/conference without degrading quality?
  • How many users can a single media server I have support?
  • How many parallel sessions/rooms/conferences can a single media server I have support?
  • What happens when my service needs to scale horizontally? Is there any degradation for the users?

Partial list, but a good starting point. See you in our service!


Executing a WebRTC test that scales

There’s a growing trend among the companies that have come to testRTC in recent months, and it has to do with the focus of what they are looking for.

Most are less interested in how testRTC can be used for functional testing – things like coverage of scenarios and finding edge cases and automating tests for them. What people are interested in now, when they want to run a WebRTC test scenario, is how to scale it.

Customers typically approach stress in WebRTC tests along two slightly different vectors: they either focus on testing how their WebRTC service handles multiple sessions in parallel, or on testing how it handles an increasing number of users in a single session.

Let’s review the meaning of each of these alternatives.

#1 – WebRTC test that scales to a large number of sessions

I decided to put things on a simple graph. The X axis denotes the number of sessions we’re going to focus on while the Y axis is all about the number of users in a single session.

In this case, where we want to test WebRTC for a large number of sessions, we will have this focus:

Scale a WebRTC test by the number of sessions

So we have a WebRTC service to test. It has a single user in a session (a contact center agent receiving calls from PSTN for example) or two users in a session (one person talking to another across browsers).

In such a case, vendors are usually concerned about stressing their servers – checking if they can fit their intended capacity.

When this is done, there are three different things that can be tested for scale:

  1. The signaling server
    • How well does it behave while increasing capacity? How is its connection to the database? Does it slow down as connections accumulate? Does it leak memory?
    • Usually, stress testing a signaling server is better done with other tools. Ones that have a lower cost per connection than testRTC and don’t really require a full browser per connection
    • That said, oftentimes you may also want to throw in a few “real” users using testRTC on top of a tool that loads your signaling connections separately – just to make sure there’s nothing that kills your service when media is added into the mix on top of the signaling
    • You also need to think about the third component below – how do you test your TURN server?
  2. The media server
    • These crop up in 1:1 tests when there’s a need to record the session or to enforce a given route. I’ve seen many of these recently, mainly in the healthcare and education markets
    • For single users, this usually means the gateway that connects the user to other networks is what we want to test, and there it will usually include a media server of sorts for media transcoding
    • In such a case, there’s no getting away from the fact that scale is in the low 10’s or 100’s of browsers and real ones are needed. It is also where we see a lot of interest in testRTC and its capabilities
  3. The TURN server
    • Anywhere between 5-20% of the calls will end up being relayed via a TURN server – and there’s nothing you can do about it
    • If you put up your own TURN servers – how confident are you in your setup and its ability to scale nicely as your service grows?
    • One way to find out is to place real browsers in front of your service, but doing so in a way that forces the browsers to negotiate via TURN. This can be achieved by changing the configuration of your client (see the sketch right after this list), filtering ICE candidates and doing SDP munging. A better way would be to enforce network rules on the machine running the browser and actually test your service in different network conditions
    • And yes. testRTC allows you to do just that
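
As mentioned above, here is a sketch of the client-configuration approach. This is plain WebRTC rather than anything testRTC specific, and the TURN server URL and credentials are placeholders:

  const pc = new RTCPeerConnection({
      iceServers: [{
          urls: "turn:turn.example.com:3478",   // placeholder TURN server
          username: "user",
          credential: "secret"
      }],
      iceTransportPolicy: "relay"               // gather and use only TURN relay candidates
  });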

#2 – WebRTC test that accommodates a large group of users in a single session

The other type of use case we see a lot from our customers is the one trying to answer the question “how many users can I cram into a single session without considerably degrading the quality?”

Scale a WebRTC test by the number of users per session

Many look for doing such tests at around 10-20 concurrent browsers, either in MCU or SFU models (see this post on the differences between the multiparty WebRTC technologies).

What happens next is usually a single session where browsers are added one on top of the other to check for scale. Here, the main purpose of a test is validating the media server and not much else.

The scenario is rather simple:

  • Try 1:1. Record the results
  • Go for 4 users. Record the results
  • Expand to 10 users. Record the results
  • Rinse and repeat

Now go back to the recorded results and see if the media got degraded:

  • Was latency introduced?
  • Do we see more packet losses?
  • Do bitrates go down the more browsers we add?
  • Is the bitrate stable or fluctuating all over the chart?
  • Is the degradation linear or exponential?

These types of questions are indicators of problems in the WebRTC product’s infrastructure (be it network connections, CPU, storage or software).

#3 – Test WebRTC at scale

And then you can try to accommodate for both these needs. And you should – scale the size of the sessions at the same time that you scale the number of sessions.

Scale a WebRTC test by the number of sessions and by the number of users in them

Here what we’re trying to do is everything at the same time.

We want to be able to place multiple users in the same session but spread our browsers across sessions.

How about running 100 browsers, split across 10 different sessions, where each session accommodates 10 browsers? This is where our customers are headed next, after they have tested their WebRTC multiparty service for single-session capacity.

Why is WebRTC test scaling so hard?

When you scale test WebRTC infrastructure, you end up needing lots of bandwidth and processing power. Remember that each user is a full browser (see here for why that is necessary). Running 2 or 4 of these may be simple, but running 20 or more becomes quite a challenge:

  • You can no longer place them all in a single machine, so you need to start distributing them – across machines, across data centers
  • You need to take care of both downlink and uplink network speeds – this isn’t easy to achieve at scale
  • You need to synchronize across your small army of browsers so they hit the server at roughly the right time for it all to work
  • Oh – and you need the WebRTC test environment to be stable, so that when issues occur, it will more often than not be due to an issue in the tested product and not in your test environment itself

testRTC, users and sessions

There are many ways to do multiple users in a single session:

  • All join the same URL or room, given the same level of access
  • A chair hosting a large conference, where control and access are asymmetric
  • A broadcaster and a large number of viewers
  • A few people in a discussion with a large number of viewers

Each of these scales differently and requires a slightly different treatment.

What we did at testRTC was introduce the notion of #session into the mix. When you indicate #session, the test will automatically wrap itself around that notion – splitting the number of concurrent users you want into sessions of the size you state with #session.

Want to see it in action? Check out our latest tutorial videos on how to scale WebRTC tests in testRTC, by using the notion of a session: