
Methodically testing and optimizing WebRTC applications at Vowel

“testRTC is the de facto standard for providing reliable WebRTC testing functionality.”

Paul Fisher, CTO and Co-Founder at Vowel

Many vendors these days are focused on making meetings more efficient. Vowel is a video conferencing tool that actually makes meetings better. It enables users to plan, host, transcribe, search, and share their meetings – all right from inside the browser, making use of WebRTC.

Vowel has been using testRTC throughout 2020 and I thought it was a good time to talk with Paul Fisher, CTO and Co-Founder at Vowel. I wanted to understand from him how testRTC helps Vowel improve their product and its user experience.

Identifying bottlenecks and issues, scaling up for launch

One of the most important things in a video conferencing platform is media quality. Before working with testRTC, Vowel lacked the visibility and the means to conduct systematic optimizations and improvements to their video platform. They got to know testRTC through a company advisor, whose first suggestion was to use it.

In the early days, Vowel used internal tools, but found that they carried a lot of overhead: they required far more work to run, manage and extract results from. Rolling their own was too time consuming and delivered a lot less value.

Once Vowel adopted testRTC, things changed for the better. By setting up an initial set of regression tests that can be executed on demand and through continuous integration, Vowel was able to create a baseline of their implementation’s performance and quality. From there, they could figure out what required improvement and optimization, as well as understand whether a new release or modification caused an unwanted regression.

testRTC was instrumental in helping Vowel resolve multiple issues in its implementation: congestion control, optimizing resolution and bandwidth, debugging simulcast, and understanding and optimizing latency, round trip time and jitter.

Vowel made huge strides in these areas by adopting testRTC. Prior to testRTC, Vowel had an ad-hoc approach, relying almost entirely on user feedback and metrics collected in Datadog and other tools. There was no methodical way to analyze and pinpoint the source of issues.

With the adoption of testRTC, Vowel is now able to reproduce and diagnose issues, as well as validate that they have been resolved. Vowel created a suite of test scripts for these issues and for the scenarios they focus on, and now methodically runs them as regression tests with each release.

“Using testRTC has had the most significant impact in improving the quality, stability and maintenance of our platform.”

This approach lets them catch regression bugs early, before breaking changes roll out to production – practically preventing them from happening.

Reliance on open source

Vowel was built on top of an open source media server, but significant improvements, customizations and additional features were required for their platform. All of these changes had to be rigorously tested to see how they would affect behavior, stability and scalability.

On top of that, when using open source media servers, there are still all the aspects and nuances of the infrastructure itself: the cloud platform, running across regions, video layouts, and so on.

One cannot just take an open source product or framework and expect it to work well without tweaking and tuning it.

Vowel made a number of significant modifications to lower-level media settings and behavior. testRTC was used to assess these changes — validating that there was a marked improvement across a range of scenarios, and ensuring that there were no unintentional, negative side effects or complications. Without the use of testRTC, it would be extremely difficult to run these validations — especially in a controlled, consistent, and replicable manner.

One approach is to roll out directly to production and try to figure out whether a change made an improvement or not. The challenge there is that testing in the wild carries so much variability unrelated to the changes made that it is easy to lose sight of the true effects of changes – big and small.

“A lot of the power of testRTC is that we can really isolate changes, create a clean room validation and make sure that there’s a net positive effect.”

testRTC enabled Vowel to establish a number of critical metrics and set goals across them. Vowel then runs recurring tests automatically as regressions and extracts these metrics to validate that they don’t “fail”.

On using testRTC

“testRTC is the de facto standard for providing reliable WebRTC testing functionality.”

testRTC is used today at Vowel by most of the engineering team.

Test results are shared across the teams, and data is exported into the internal company wiki. Vowel’s engineers constantly add new test scripts, and new Scrum stories commonly include the creation or improvement of test scripts in testRTC. Every release includes running a battery of tests on testRTC.

For Vowel, testRTC is extremely fast and easy to use.

It is easy to automate and spin up tests on demand with just the click of a button, no matter the scale needed.

The fact that testRTC uses Nightwatch, an open source browser automation framework, makes it powerful in its ability to create and customize practically any scenario.

The test results are well organized in ways that make it easy to understand the status of the test, pinpoint issues and drill down to see the things needed in each layer and level.

The new dashboard homepage for testRTC

New UI, assets and better WebRTC analytics

Earlier this week we started rolling out the new version of testRTC to our customers. This was one of those releases we’ve worked on for a long time, starting two or three releases back, when we decided that enough technical debt had accumulated and a refresh was needed. It started as a nicely sized Angular to React rewrite and “redesign”, which ended up being a lot more than that.

The results? Something that I am really proud of.

New top level view of test result in testRTC

The switch to React included a switch to Highcharts as well, so we can offer better graphs moving forward. This isn’t why I wanted to write about this release though.

If you’ve been using testRTC already, you will be quite comfortable with this new release. It will feel like an improvement, while keeping everything you wanted and were used to using in the same place.

There are four things we’ve added that you should really care about:

#1 – Assets

This is something we’ve been asked about for quite some time now. We have clients who are running multiple tests and multiple monitors.

In some cases, the different scripts have only slight variations in them. In others, they share common generic tasks, such as login procedures.

The problem was that we only allowed customers to create a single-file script and run it as a fully contained “program”. This kept our solution simple and elegant, but not flexible enough for growth.

This is why we are introducing Assets into the mix.

Assets screen in testRTC

You can now create asset files which are simple scripts. Once created, you can include them into any of your running test scripts. You do that by simply adding an .include(‘<asset-name>’) command into your test script.
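As a rough sketch of how this might look in practice (the asset name ‘login-procedure’ and the selectors here are made up, and we’re assuming .include() can be chained like any other script command):

client
    .include('login-procedure')              // runs the shared steps stored in the 'login-procedure' asset
    .waitForElementVisible('#videos', 20000) // hypothetical selector – continue the test once logged in
    .rtcScreenshot('after shared login');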

#2 – Advanced WebRTC Analytics

We’ve kinda revamped the whole Advanced WebRTC Analytics screen.

Up until now, it was a braindump of all getstats statistics without much thought. It gave power to the people, but it took its toll.

This time around, we’ve sat down to decide what information we have available, looked at what others are doing, and ended up with our own interpretation of what’s needed:

Advanced WebRTC Analytics view in testRTC with new information

The Advanced WebRTC Analytics section now includes the following capabilities:

  • Splits information by peer connection for easy viewing
  • Shows the getUserMedia constraints
  • Shows the PeerConnection configuration, so it is now super easy to see what STUN and TURN servers were configured
  • Shows cipher information for the security conscious
  • Shows ICE state machine progress, correlating it with the events log
  • Shows the ICE negotiation table, to pinpoint failure reasons (and understand which candidate pair got selected)
  • Shows the WebRTC API events log, with the detailed calls and callbacks
  • Shows the same graphs as before, just nicer, with Highcharts

Just last week, I used these new capabilities to explain to a new lead why his calls don’t connect with our probe’s firewall configuration.

#3 – Media scores everywhere

We’ve added media scores to our test results in the last release, but we placed them only on the test results page itself.

Media quality score in test results in testRTC

Now we’re taking the next step, putting the scores in monitor lists and test run lists. This means they are more accessible to you and can be seen everywhere.

What can you do with them?

  1. Quickly understand if your service degrades when you scale
    1. Run the smallest test possible. See the media score you get
    2. Start scaling the test up. Expect the media score to not drop. If it does, check why
  2. Make sure monitors are stable
    1. Run a monitor
    2. Check if the media score changes across its runs
    3. If it changes too much, you have an infrastructure problem

#4 – Client performance data

Another thing we’ve had for quite some time, but have now decided to move front and center.

There’s now a new tab in the test results of a single probe called “Performance”:

testRTC machine performance view

When opened, if you have the #perf directive in your run options, it will show you the probe’s machine performance – the CPU, memory and network use of the probe and browser.

This will give you some understanding of what user machines are going to be “feeling”, especially if you are aiming for a UI-heavy implementation.

We see customers using this for performance and stress testing.

Other

Other improvements that made it into this release?

  • Filtering webhooks to run only on failed test runs
  • Automating dynamic allocation of probes when no static ones are available
  • Export test run history
  • Ability to execute and collect traceroute on DNS lookups in the browser
  • Added support to run longer tests
  • Modified fields in most tables to make them more effective to users

Check it out 🙂

Automating Your WebRTC Product Testing (Recorded session)

I took part this week in Twilio’s Signal event in London.

As with the previous Signal event I attended, this one was excellent (but that’s for some other post).

Twilio were kind enough to invite me to talk at their event, which resulted in the recorded session below:

In the first part of this session, I tried explaining the challenges that WebRTC testing and automation brings with it. I ended up talking about these 5 challenges:

  1. WebRTC being a brand new technology (=always changing)
  2. Browser based (=you don’t control your whole tech stack)
  3. Resource intensive (=need to factor that in when allocating your testing machines)
  4. Network sensitive (=need to be able to test in different network conditions)
  5. It takes two to tango (=need to synchronize across browsers during a test)

The second part was going through some of the results we’ve collected in our recent Kurento experiment, where we tried to see how much we can scale a deployed Kurento media server in different scenarios.

After the session, everyone asked me how it went. Frankly – I don’t know. I wasn’t sitting and listening there – I was talking (and enjoying myself while doing so). I hope the audience in the room found the session useful. You can check it out on your own and make your own judgement.

Oh – and if you need to test your WebRTC application then you know where to find us 🙂

–> And if you don’t, then here’s our contact page.

Automated WebRTC Testing using testRTC

Yesterday, we hosted a webinar on testRTC. This time, we were really focused on showing some live demos of our service.

I wanted this one to be useful, so I sat down earlier this week, working on a general story outline with the idea of showing live how you can write a test script from scratch, building more and more capabilities and functionality into it as I went along.

It was real fun.

If you missed it, I’d like to invite you to watch the replay:

watch @ crowdcast

For the purpose of this webinar, I took Jitsi Meet (https://meet.jit.si/) and created the following scripts for it:

  1. Simple one-on-one test
    • Then I cleaned it up a bit from nagging warnings
    • And added a few basic expectations
  2. 4-way video test
    • For this one I’ve added some synchronization across the probes, and made sure Jitsi is the one generating the random rooms
    • I changed the script to be aware of sessions (parallel meeting rooms in the same test)
    • Then I played with the test reconfiguring it to run 40 probes, 8 in each meeting room
  3. One-on-one test with network limits
    • Switched back to a 1:1 session, this time with the flexibility we achieved in (2)
    • Increased the test length to 3 minutes
    • Injected 5% packet loss during the second minute of the test (a rough sketch of how this is done appears right after this list)
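Here is that sketch. It uses testRTC’s rtcEvent() and rtcSetNetworkProfile() commands – the same ones that appear in the AppRTC script later in this archive – with the exact pauses being illustrative rather than a copy of the webinar script:

client
    .pause(60 * 1000)                                          // minute 1: clean network
    .rtcEvent('5% Packet Loss start', 'global')
    .rtcSetNetworkProfile('custom', 'packet loss', 5, 'both', 'both')
    .pause(60 * 1000)                                          // minute 2: 5% packet loss in both directions
    .rtcSetNetworkProfile('')                                  // back to pristine network conditions
    .rtcEvent('5% Packet Loss end', 'global')
    .pause(60 * 1000);                                         // minute 3: recovery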

I also went over some of the results from the Kurento post we published yesterday, and went through the screen sharing script we recently wrote about, which uses appear.in as an example.

One of the things I was asked is to share the scripts used throughout the session.

So I cleaned up the scripts a bit and placed them on our Google Drive. I am sharing them here in two forms:

  1. The GDoc file of the script – open it to read, copy+paste it to wherever
  2. The JSON file of the script – you can import this one directly into your testRTC account (you’ll need to reconfigure the probe profiles before you run it):

Here they are:

  1. Simple one-on-one test: GDoc, JSON
  2. 4-way video test: GDoc, JSON
  3. One-on-one test with network limits: GDoc, JSON

We’re here for any questions you may have.


How Many Sessions Can a Kurento Server Hold?

Here’s a question we come across quite often at testRTC.

You decided to self-develop your own service. Manage your own media servers. And now the time comes to understand your ongoing costs, as well as decide on the scale-out scheme – at what point do you launch/spawn a new server to take up some of the load from your current media server farm? How many users can you cram into a single media server anyway?

We decided to check just that, doing it with the help of WebRTC.ventures who worked with us on the setup.

For the purpose of this set of sizing experiments, we picked Kurento, one of the most versatile open source media servers out there today. We selected a few key scenarios, and WebRTC.ventures installed the server and configured it for us.

We then used our testRTC probes to understand how many users we can cram onto the server in each scenario.

Simple scenario sizing is one step in the process. If you are serious about your service, then check out our best practices to stress testing your WebRTC application.

Get the best practices guide

Why Kurento?

There are a couple of reasons why we picked Kurento for this one.

  1. Because many use it out there, and we’ve been helping customers understand and debug it when they needed to
  2. It is versatile. We could try multiple scenarios with it with relative ease and little programming (although that wasn’t our part of the project)
  3. It does media processing beyond just routing media. We wanted to see how this will affect the numbers, especially considering the last reason below
  4. It’s the first of a few media servers we’re going to play with, so stay with us on this one

The Scenarios

For the Kurento service, we picked 3 different scenarios we wanted to test:

  1. 1:1 video calls. A typical doctor visitation or similar scenario, where two participants join the same session and the session gets recorded (two separate streams, one for each participant).
  2. 4-way group video calls. The classic scenario, in an MCU configuration. Kurento decodes and encodes all media streams, so we’re giving it quite a workout
  3. Live broadcast. A single person talking to a large group of viewers.

For scenarios (1) and (2) our question is how many concurrent sessions can the Kurento server hold.

For scenario (3) our question is how many viewers for a single broadcast can the Kurento server hold.

The Setup

To set things up for our test, we did the following:

  • We went for a simple AWS t2.medium machine, but quickly had to switch to a more capable machine. We ended up with a c4.2xlarge instance (8 vCPU, 15 GB RAM) on AWS
  • We had it monitored via New Relic, to be able to check the metrics (but later decided to forgo this approach and just use top with root access directly on the machine)
  • We also had an easy way to reset the Kurento server. We knew that rattling it too much between tests without a reset would affect our results. We wanted a clean slate each time we started

The machine was hosted in Amazon US-East.

testRTC probes were coming in from a different cloud vendor, East and West US locations.

We didn’t do any TURN related stuff – so our browser traffic hit the Kurento server directly and over UDP.

The Process

For each scenario, we’ve written a simple test script that can scale nicely.

We then executed the test script in its minimal size.

For 1:1 video calls and broadcasts we used 2 probes and for the 4-way group video call we started with 4 probes.

We ran each test for a period of 4-5 minutes, to check the stability of the media flow.

We used that as the baseline of our results and monitored to see when adding more probes caused the media metrics to start faltering.
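As a rough illustration of what “scales nicely” means here, the skeleton below shows the kind of structure such a script can follow, using the RTC_SERVICE_URL, RTC_SESSION_IDX and RTC_IN_SESSION_ID environment variables that also show up in the AppRTC script later in this archive (the room URL format and selector are made up):

// One meeting room per session; all probes sharing the same RTC_SESSION_IDX join the same room
var roomUrl = process.env.RTC_SERVICE_URL + 'room-' + process.env.RTC_SESSION_IDX; // URL format is hypothetical
var probeInSession = Number(process.env.RTC_IN_SESSION_ID);                        // 1..N within the session

client
    .rtcInfo('joining ' + roomUrl + ' as probe #' + probeInSession)
    .url(roomUrl)
    .waitForElementVisible('body', 60000)
    .pause(4 * 60 * 1000)            // let media flow for ~4 minutes, as in the tests described above
    .rtcScreenshot('end of run');

Scaling up is then just a matter of changing the number of probes and the session size in the test configuration, without touching the script itself.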

1:1 Video Calls

The above screenshot is what you’ll see if you participated in these sessions. There’s a picture in picture view of the session, where the full screen area is the remote incoming video and the smaller window holds our local view.

Baseline

Kurento’s basic configuration limits bitrate of calls to around 500kbps. This can be seen from running a single session in our high level chart:

And here’s the stats on the channels of one of the two probes in this baseline test run:

Now that we have our baseline, it was time to scale things up.

30 Probes (=15 sessions)

When we went up to 30 probes, running in 15 parallel 1:1 video sessions, we ended up with this graph:

While the average bitrate is still around 500kbps, we can see that the min/max bands are not as stable.

If we look at the packet loss graph, things aren’t happy (the baseline had no packet losses):

This is where we went for the “By probe” tab, looking at individual bitrates across the probes:

What we can see immediately is that 4 probes out of 30 didn’t get the full attention of the Kurento media server – they got to send and receive less than 500kbps.

If we switch to the packet loss by probe, we see this:

A couple of things that come to mind:

  1. Kurento degrades quality to specific sessions and not across the board. Out of 30 users, 22 got the expected results, 4 had lower bitrates and another 4 had packet losses
  2. There’s correlation here. When Probe #04 exhibits reduction in bitrate, Probe #3 reports incoming packet losses

From here, we can easily go down the path of drilling down to the probes that showed issues. I won’t do it now, as there’s still a lot to cover.

22 Probes (=11 sessions)

It stands to reason then that lowering the capacity to 22 probes should give us pristine results.

Here’s what we’ve seen instead:

We still have that one session that goes bad.

20 or 18?

When we went down to 18 or 20 probes, things got better.

With 20 the issue is that we couldn’t really reproduce a good result at all times. Sometimes, the scenario worked, and other times, it looked like the issues we’ve seen with the 22 probes.

18 though seemed rather stable when tested a couple of times:

Depending on the service you’re offering, I’d pick 18. Or even go down to 16…

4-Way Group Video Calls

The above is a screen capture of the 4-way group video call scenario we’ve analyzed.

In this case, each probe (browser) sends out video at a resolution of 640×360 and receives a video resolution of 800×600.

The screenshot doesn’t show the images getting cropped, so we can assume the Kurento media server takes the following approach to its pipeline:

That’s lots of processing needed for each probe added, which means we can expect lower scaling for this scenario.

Baseline

Our baseline this time is going to need 4 probes.

Here’s how the high level video graph looks:

Not as stable as our 1:1 video calls, but it should do for what’s coming.

Note that each probe still has around 500kbps of video bitrate.

I’ll skip the drill down into the results of a specific probe metrics and take this as our baseline.

20 Probes (=5 sessions)

Since 1:1 video sessions didn’t go well above 20, we started there and went down.

Here’s how 20 probes look:

Erratic.

Checking packet losses and bitrates by probe yielded similar results to the bad 1:1 sessions. Here’s the by probe bitrate graph:

Going down to 16 probes (=4 sessions) wasn’t any better:

I’ve actually looked at the bitrates and packet losses by probe, and then decided to map them out into the sessions we had:

This paints a rather grim picture – all 4 sessions hosted on the Kurento server suffered in one way or another. Somehow, the bad behavior wasn’t limited to one session, but showed itself on all of them.

Down to 12 Probes (=3 sessions)

We ended up with 12 probes showing this high level bitrate graph:

It showed some sporadic packet losses that were spread across 3 different probes. The following shows the high level by probe bitrate graph:

There’s some instability in the bitrates and the packet losses which will need some further investigation, but this is probably something we can work with and try and optimize our service to run well.

Live Broadcast

The above screenshot shows what a viewer sees on a live broadcast scenario that we’ve set up using Kurento.

We’ve got multiple testRTC probes joining the same broadcast, with the first one acting as the broadcaster and the rest are just viewers.

Baseline

Our baseline this time is going to need 2 probes. A broadcaster and a viewer.

From now on, we’ll be focusing on what the viewers experience – a lot more than what happens to the broadcaster.

We’re still in the domain of 500kbps for the video channel:

One thing to remember here – outgoing media happens only for our broadcaster probe and incoming media happens for all the other probes.

30 Probes (=29 viewers)

We started with 30 probes – assuming we would fail miserably based on our previous tests – and were positively surprised:

Solid bitrate for this test.

Climbing up

We’ve then started moving up with the numbers.

50, 60 and 80 probes went really well.

That whetted our appetite, so we jumped to 150 probes.

And ended up with this high level graph:

There wasn’t any packet loss to explain that drop for the broadcaster at around 240 seconds, so I switched to the “By probe” view.

This showed that things were starting to deteriorate somewhat:

We’re sorting the results just for this purpose – you can see there’s a slight decline in average bitrate across the probes here – something that is a lot less apparent for smaller test sizes. There was no packet loss.

We then tried going up to 200, but 12 probes didn’t even connect properly:

Going back down to 100 yielded connection errors in some of the probes as well. Specifically, I saw this one:

This indicates we’ve got a wee bit of an issue here that needs to be solved before we can continue our stress tests any further. Most probably in the signaling layer of our server. It is either unstable when we place so many viewers at once against it, or just doesn’t really handle the load well enough.

Results Summary

Here are the various limits we reached in our rounds of sizing tests:

  • 1:1 video calls: 18 users in 9 parallel sessions
  • 4-way group video calls: 3 rooms of 4 users each
  • Live broadcast: 1 broadcaster + 80-150 viewers

What did we learn?

  1. Stress testing for sizing purposes is fun. I actually enjoyed going through the results and running a couple of tests of my own (I didn’t write the scripts or run the initial tests – I delegated that to our support engineer)
  2. Different scenarios will dictate very different sizing. With more time, I’d start working on finding the bottlenecks and optimizing them – I’m sure more can be squeezed out of a Kurento machine
  3. Once set up and written intelligently, it’s really easy to rerun the tests and change the number of probes used

Next Steps

Once we got to the sweet spot in each scenario, the next thing to do would probably be to run it more than once.

We usually set up a testRTC monitor to run once every 15 minutes to an hour for a couple of days on such a scenario, just to make sure we’re seeing stable results more than once.

Other than that, this needs to be tested under different network conditions, varying load factors, etc.

Check out our best practices for stress testing WebRTC applications. It is relevant even if you are not using testRTC.

Get the best practices guide

I’d like to thank WebRTC.ventures for the assistance in setting this one up. If you are looking for a capable vendor to custom build your WebRTC application – check them out.


How do WebRTC Media Servers Behave on Packet Loss?

Differently from each other.

Whenever I see people comparing WebRTC media servers, they tend to focus on scale:

– How many sessions can you cram in parallel?

– How many streams can you serve from a single machine?

– How much bitrate can you pump out?

All of these are very important questions – they feed into your sizing calculations, which then go into the pricing model of your service. Oh, and we did cover this a bit here when talking about handling WebRTC browser synchronization at scale.

Now that our new version is taking shape (still in staging, so if you want access – ping us), it is time to play a bit with a few new toys we’ve added for our beloved community of sadists (you may know them as test engineers, but the good ones are sadists – they like inflicting pain upon digital products and services).

What I am talking about here is a combination of two script commands we have:

  1. rtcEvent() – place a vertical event in the graphs
  2. rtcSetNetworkProfile() – change network profiles in runtime

You’ll see how it looks in a second.

What Packet Loss Does?

Packet loss is bad.

You don’t control it. And it can happen at any time. Come and go as it pleases.

The moment you have packet loss, there will be some degradation in media quality. Lost packets mean lost data, and lost data means something can’t be played back. It might be minor. It might be important.

Next thing that happens? WebRTC (or most other VoIP products for that matter) will start lowering bitrates. Why? Because it assumes there’s congestion on the network, and it is trying to play nice with everyone.

But what happens once that packet loss is gone? Do things go back to normal? And if they do, how fast does that happen?

My Experiment

I decided to devise a simple enough experiment to get some answers here. I chose the following steps:

  1. Connect to a service
  2. Run for a full minute
  3. Set packet loss to 10% for a full minute
  4. Go back to normal – no packet loss
  5. Wait two minutes

That’s it. What I am interested in is less what happens during the second minute and more what happens in the last two minutes, and how that differs from the first minute of the session.

In general, I decided to place 5 users in the same session, to get that media server working a bit. And I also decided to focus on the SFU kind.

The services I tinkered with are:

  1. AppRTC, just as a baseline for this exercise
  2. Janus, an open source media framework, that can act as an SFU
  3. Jitsi Videobridge, an open source SFU
  4. mediasoup, a relatively new open source SFU
  5. SwitchRTC, a commercial SFU
  6. appear.in, a service that recently added its own self-developed SFU (in beta at the moment)

If you are looking for Kurento or other SFUs – they weren’t included not because I didn’t want to, but because there was no readily available installation out there that I could just use.

I’ll be happy to add more SFUs to the comparison, so give us a shout out if you want to run such an analysis.

Let the fun begin.

AppRTC – My Favorite Baseline

For our baseline, I decided to use AppRTC.

This time, I had to use only 2 browsers, as AppRTC doesn’t support any group calling capabilities.

What it does do is offer the vanilla WebRTC experience.

I started with writing a simple script to fit my needs:

var roomUrl = process.env.RTC_SERVICE_URL + "testRTC" + process.env.RTC_SESSION_IDX + '?vsc=VP8';

var agentType = Number(process.env.RTC_IN_SESSION_ID);
var recuperationTime = 60; // in seconds

client
    .rtcInfo(roomUrl)
    .rtcProgress('open ' + roomUrl)
    .url(roomUrl)
    .waitForElementVisible('body', 60000)
    .pause(2000)
    .click('#confirm-join-button')
    .waitForElementVisible('#videos', 20000)
    // Minute 1
    .pause(recuperationTime * 500)
    .rtcScreenshot('Phase 1')
    .rtcProgress('Phase 1')
    .pause(recuperationTime * 500);

// Minute 2
if (agentType === 1) {
    client
        .rtcEvent('10% Packet Loss start', 'global')
        .rtcSetNetworkProfile('custom', 'packet loss', 10, 'both', 'both'); // 10% packet loss
}

client
    .pause(recuperationTime * 500)
    .rtcScreenshot('Phase 2')
    .rtcProgress('Phase 2')
    .pause(recuperationTime * 500);

if (agentType === 1) {
    client
        .rtcSetNetworkProfile('') // back to pristine network conditions
        .rtcEvent('10% Packet Loss End', 'global');
}

// Minutes 3-4
client
    .pause(recuperationTime * 1000)
    .rtcScreenshot('Phase 3')
    .rtcProgress('Phase 3')
    .pause(recuperationTime * 1000);

A few things to note here:

  1. All test scripts on this post can be found on our github account. Easiest way to use them is to import them into your testRTC account
  2. I decided to force VP8 here. VP9 is a bit erratic in its bitrate, so I went for VP8 – hence the addition of ‘?vsc=VP8’ in the first line of this script (check out all of AppRTC’s parameters here)
  3. When the second minute is up, the first probe in each session generates a global rtcEvent and sets packet loss in both directions to 10% (the first if (agentType === 1) block in the script)
  4. After an additional minute is over, the first probe in each session generates another global rtcEvent and removes all packet loss and network constraints that might have been used (the second if (agentType === 1) block in the script)

Running that using testRTC yields these results once you drill into one of these sessions:

Above you see two things:

  1. The green vertical lines – these are the result of the rtcEvent() calls
  2. The blue and red bars, showing incoming and outgoing packet loss percentage, which averages at 10%

Above you see the video bitrate graph, with the two horizontal lines on it.

Notice how the outgoing bitrate tries going up in the beginning and then drops from 2.5mbps to 1mbps in 60 seconds?

The other thing that interests me is the time it takes for WebRTC/AppRTC to get back to 2.5mbps. And that’s somewhere in the range of 15-20 seconds.

Oh, and because I know you’ll be interested in this – also remember this screenshot of the video average delay we had:

Before we move on to the media servers – remember that what I tried doing with AppRTC is provide a baseline. And the baseline here is “picture perfect”. I didn’t really expect any of the SFUs that I’ve used to be able to match AppRTC with its metrics.

Janus

Janus is an open source media server created and maintained by Meetecho.

They have an online demo running that supports a simple video room.

So we just hooked our script on top of that to get the results we needed. We aimed for 5 browsers in a single room – which will be the norm from now on in this article.

The Janus demo has a single shared room of sorts, and I ended up with a “J3rry” user in there, though he seemed harmless, with no camera or bitrate in my session.

You can see above that the bitrates are rather low – around 140 kbps for each video stream coming into this room. And that’s even before I started adding packet loss.

During packet loss and after it, we “lost” two participants. Here’s a screenshot taken a minute after I stopped packet loss altogether:

The graphs in testRTC show a grim picture:

Janus reports packet losses at longer intervals than WebRTC does, which is why we see the spikes in the outgoing reporting that go up to 50% and more. The odd thing is the two incoming channels that show around 10% packet loss as well – more about this later.

Here’s how the video bitrates look for some of the streams (one outgoing and two incoming):

No change even though we have packet loss.

And here’s what happens in the two other incoming streams:

Apparently, these two incoming streams are the ones showing packet loss from the start. They somehow decided to drop to 0 the moment we cranked up the artificial packet loss from 0 to 10% – but never recuperated from it.

Looking at the average delay for the video…

Things can’t be good, but it seems like this has nothing to do with my packet loss shenanigans.

It might be Janus and it might just be the demo machine. If I could, I’d reboot it and start all over again.

Jitsi

For me the Jitsi Videobridge is where I go first to run demos and tests on an SFU with testRTC:

  • It is out there
  • It is easy to automate
  • And I am a creature of habit…

To run our test here, we’ve directed 5 of our probes into a single room on the Jitsi meet online service/demo.

After a few attempts, I decided it would be better to disable simulcast, using this prefix to the URL: ‘#config.disableSimulcast=true’. I didn’t do it because simulcast is a bad thing, but because it made analyzing the results much harder for what I had in mind.

If we look at the packet loss graph, it will tell a similar story to what we’ve seen so far:

While there are some packet losses outside the one minute killzone I created, they are negligible (or at least sporadic). Those negative values you see for packet losses in red? They are reports on the browser’s outgoing stream from the machine we induced packet loss on. This is most probably related to a Chrome bug (HT to Philipp Hancke).

I’ve split the video bitrate graphs here into two graphs – the outgoing one and the incoming ones since they tell two separate stories.

This one caught me by surprise – the outgoing bitrate shows no signs of a change due to packet loss. I wonder what Jitsi is doing (or not doing) to have packet loss ignored in such a way. So I decided to look at it from the receiving end of one of the other four browsers in the same session:

Bitrate drops to 0 for a duration of almost a full minute before coming back up.

Back to the browser with the trashed network, let’s see what happens to the incoming video streams:

Things drop down from around 2mbps to almost 0 on all incoming channels, taking around 40-60 seconds to get back to normal.

One last glance before we move on – check out video average delay:

Jitsi had a hard time recuperating from that packet loss.

It should be noted that I’ve played around with Jitsi before their recent updates – especially the ones including adaptivity.

Mediasoup

mediasoup is a rather new player in the open source SFU space. It is built in C++ as a Node.js module. After a quick Twitter chat, Iñaki Baz Castillo was kind enough to configure it to my needs (specifically, allowing for more bandwidth on the online demo).

Starting as always with packet loss:

The graph seems fine. Percentages are low because of the way packet losses are reported back from the media server. Probably some FEC / retransmissions are involved as well (this would be the case with many of the media servers out there).

Looking at the video bitrate, we see an interesting picture:

There’s a hiccup in the outgoing bitrate (the red line), but that for some reason takes place close to the end of the 60 seconds packet loss window.

There’s also a reduction in incoming bitrate for one of the video streams. It starts around 20 seconds into the packet loss zone, but doesn’t recover even when we remove the packet loss.

Video delay is also a bit problematic:

It starts off nicely, goes up when packet losses start and never recuperates.

SwitchRTC

Moving on from open source to commercial, there’s SwitchRTC.

It started with me asking for a 2mbps bitrate limit. Now, the way this was set up, and without simulcast, it meant the browser needed to encode 2mbps and decode 4 streams of 2mbps each. This turned out to be a bit too much for the way we configure our machines (and frankly – probably too much for what your typical customer’s machine can handle in almost any use case you plan on deploying).

The end result of it was graphs that went all over the place – each stream and each browser tried hard to compete on resources that were limited, and it wasn’t really nice.

So we dialed back down to 1mbps bitrate limit.

As always, let’s first look at the packet loss graph:

Two things here to note:

  1. One of the incoming video streams has packet losses outside the packet loss zone. Not unheard of, but a bit off the charts compared to the others. I think that is due to the data centers used by SwitchRTC for this demo
  2. There’s negative packet losses on the outgoing video stream. This is due to the way SwitchRTC handles packet loss reporting (or more likely filtering packet loss reporting)

For bitrate, I took two screenshots. One for the incoming video streams and one for the outgoing video stream.

On the incoming streams we see an interesting phenomenon.

When packet loss starts, bitrate picks up, most likely to overcome the packet loss. It makes sense, since we didn’t limit bitrates, so that seems like the correct strategy. Would be interesting to see what will happen if we limit bitrate as well.

The second thing is that one of the incoming streams drops down to almost zero and then picks up again. This is the same stream that shows high packet losses. I wonder what causes that.

The graph above shows the outgoing video stream. This is almost textbook behavior for the outgoing video. Once it notices there’s issues, it starts increasing bitrate to compensate, and when that fails – it drops down slowly. It is similar, though not as smooth as what you see with AppRTC.

appear.in

appear.in have a beta SFU, which Philipp Hancke was kind enough to let me use.

Now, appear.in isn’t a media server or a component you can use in your own service – it is a full service, which makes this comparison a bit unfair – checking demos and comparing them to a commercial service.

But I wanted to check this one out anyway, as it isn’t based on any external framework – it was self-developed in-house at appear.in.

The results are interesting.

Packet loss graph looks rather nice, if a tad low in the percentage:

This shows how far appear.in goes in gauging and polishing the way they make use of network resources.

Video bitrate stays at the 600kbps vicinity – not showing any real effects from my additional packet loss:

Best part though is that the video delay graph doesn’t look erratic:

I am not sure how to compare these results to the rest. I will need more time to check this out – time that I just didn’t have available for this experiment of mine. I will leave it for some future tinkering.

Summing things up

Different media servers will act differently. Especially when putting them under different network conditions.

What I wanted to show here, is how you can use testRTC to goof around with whatever setting you want. Here are a few other ideas:

  1. Drop the network down to 0 bitrate. Wait a bit. Put it back up. Did media return? How quickly did it come up again?
  2. Limit bitrates to different levels. Check if your media server adapts things like resolutions and other interesting parameters to fit the needs
  3. Go down to 50 or 100 kbps. Does video persist or is the media server shutting it down in favor of audio?
  4. Limit bitrate and add a bit of packet loss at the same time (this would be closest to real life). See what happens then – how will the media server behave?
  5. Do the above while adding some load on the server. Does it start fidgeting or is it handling this nicely?

A few things to remember here:

This isn’t an apples to apples comparison

I haven’t taken each and every media server and installed it on my own on the same server configuration. I just used the online demos each of these vendors had. At times, asking for assistance and a bit of configuration from the vendor.

What was different:

  • The server(s) the media server was installed on
  • The configuration of the server, especially what max bitrate it allows

What was similar:

  • I tried disabling simulcast in all servers. I assume that’s a bad thing to do, but I wanted a level playing field on that front
  • The browser used. It was the same for all tests. This includes their version, the machine they were installed on, the network they used, their geographical location – everything
  • The scenario itself. I essentially executed the same scenario over and over again in front of different media servers

Where do we go from here?

Media servers are hard to develop. They are hard to tweak and optimize. And they are hard when it comes to making sizing decisions with them.

They are also pretty good. Most of the ones shown here are running in production services with live customers.

When you go tomorrow to pick the media server for your own project. Or when you want to plan how to size capacities per machine. Or if you want to check your media server in real life scenarios – we’ve got your back.

Check us out. I am sure we can be of help to you.


The Missing chrome://webrtc-internals Documentation

There’s a wealth of information tucked into the chrome://webrtc-internals tab, but there was up until recently very little documentation about it. So we set out to solve that, and with the assistance of Philipp Hancke wrote a series of articles on what you can find in webrtc-internals and how to make use of it.

The result? The most up to date (and complete) webrtc-internals documentation.

To make sure it doesn’t get lost, here are the links to the various articles:

  1. webrtc-internals and getstats parameters – a detailed view of webrtc-internals and the getstats() parameters it collects
  2. active connections in webrtc-internals – an explanation of how to find the active connection in webrtc-internals – and how to wrap back from there to find the ICE candidates of the active connection
  3. webrtc-internals API trace – a guide on what to expect in the API trace for a successful WebRTC session, along with some typical failure cases

Enjoy!


Your best WebRTC debugging buddy? The webrtc-internals API trace

This time, we take you through the webrtc-internals API trace to see what you can learn from it.

To make this article as accurate as possible, I decided to go to my source of truth for the low level stuff related to WebRTC – Philipp Hancke, also known as fippo or hcornflower. This in a way, is a joint article we’ve put together.

Before we do, though, you should probably check out the other articles in this series:

  1. Parameter’s meaning in webrtc-internals
  2. Finding the current active connection in webrtc-internals

Now back to the API trace.

WebRTC is asynchronous

Here’s something you probably already noticed. WebRTC is asynchronous to the extreme. It almost painstakingly makes sure that whatever you are trying to achieve – you won’t be able to without multiple calls in different contexts of your JavaScript app in the browser.

It isn’t because the authors of WebRTC are mean. It is because the nature of communications is asynchronous. It is made worse by the various network topologies that require the use of curses like STUN, TURN and ICE and by the fact that we require the user to authorize things like accessing his camera.

This brings us to the tricky situation of error handling. With WebRTC, it takes place everywhere. Anything you do can fail twice:

  1. When you call the API and it returns
  2. When the callback/promise/event handler/whatever returns back with the result of your API call

This means that in many cases, you are going to be left with a half baked solution that looks at some of the error cases (did you ever see a sample that takes care of edge cases or failure scenarios?).

It also means that often times you’ll need to be able to debug them. And that’s what the API trace in webrtc-internals can help you with.

The webrtc-internals API trace

If you open chrome://webrtc-internals while in an active WebRTC session, you will immediately see the API trace:

WebRTC API trace sample

This is the list of API calls and events done on the peer connection, informing you of the progress and state of the connection.

You can click on any of these APIs to see its parameters.

WebRTC API trace click to expand

Before we look at what kind of analysis we can derive from these traces, let’s look at what some of the connection methods and events do.

    • addStream: if this method was called, the Javascript code has added a MediaStream to the peerconnection. You can see the id of the stream as well as the audio and video tracks. onAddStream shows a remote stream being added, including the audio and video track ids, it is called between the setRemoteDescription call and the setRemoteDescriptionOnSuccess callback
    • createOffer shows any calls to this API including the options such as offerToReceiveAudio, offerToReceiveVideo or iceRestart. createOfferOnSuccess shows the results of the createOffer call, including the type (which should be ‘offer’ obviously) and the SDP resulting from it. createOfferOnFailure could also be called indicating an error but that is quite rare
    • createAnswer and createAnswerOnSuccess and createAnswerOnFailure are similar but with no additional options
    • setLocalDescription shows you the type and SDP used in the setLocalDescription call. If you do any SDP munging between createOffer and setLocaldescription you will see this here. This results in either a setLocalDescriptionOnSuccess or setLocalDescriptionOnFailure callback which shows any errors. The same applies to the setRemoteDescription and its callbacks, setRemoteDescriptionOnSuccess and setRemoteDescriptionOnFailure
    • onRenegotiationNeeded is the old chrome-internal name for the onnegotiationneeded event. If your app uses this you might want to look for it
    • onSignalingStateChange shows the changes in the signaling state as a result of calls to setLocalDescription and setRemoteDescription. See the wonderful diagram in the specification for the gory details. At the end of the day, you will want to be in the stable state most of the time
    • iceGatheringStateChange is the little brother of the ice connection state. It will show you the state of the ice gatherer. It will change to gathering after setLocalDescription if there are ICE candidates to gather
    • onicecandidate events show all candidates gathered, with information about which m-line and MID they belong to. Likewise, the addIceCandidate method shows that information from the other side. Typically you should see both event types. See below for a more detailed discussion of these events
    • oniceconnectionstate is one of the most important event handlers. It tells you whether a peer-to-peer connection succeeded or not. From here, you can start searching for the active candidate as we explained in the previous post

The two basic flows we can see are those of the side offering a connection and the side answering it. The offerer case will typically consist of these events:

WebRTC API Trace - offer side

  • (addStream if the local side wants to send media)
  • createOffer
  • createOfferOnSuccess
  • setLocalDescription
  • setLocalDescriptionOnSuccess
  • setRemoteDescription
  • (onaddstream if the remote end signalled media streams in the SDP)
  • setRemoteDescriptionOnSuccess

While the answerer case will have:

WebRTC API Trace - answer side

  • setRemoteDescription
  • (onaddstream if the remote end signalled media streams in the SDP)
  • createAnswer
  • createAnswerOnSuccess
  • setLocalDescription
  • setLocalDescriptionOnSuccess

In both cases there should be a number of onicecandidate events and addIceCandidate calls along with signaling and ice connection state changes.
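To tie this back to code, here is a minimal offerer-side sketch using the standard WebRTC API that would produce a trace roughly like the one above. The sendToPeer() signaling helper and the localStream variable are assumptions (localStream would come from getUserMedia), and error handling is omitted:

var pc = new RTCPeerConnection({ iceServers: [{ urls: 'stun:stun.l.google.com:19302' }] });

pc.addStream(localStream);                 // shows up as addStream in the trace (newer code would use addTrack)

pc.onicecandidate = function (event) {     // each gathered candidate appears as an onicecandidate event
    if (event.candidate) {
        sendToPeer({ candidate: event.candidate });   // hypothetical signaling call
    }
};

pc.oniceconnectionstatechange = function () {
    console.log('ice connection state:', pc.iceConnectionState);
};

pc.createOffer()                           // createOffer / createOfferOnSuccess in the trace
    .then(function (offer) {
        return pc.setLocalDescription(offer);         // setLocalDescription / setLocalDescriptionOnSuccess
    })
    .then(function () {
        sendToPeer({ sdp: pc.localDescription });
        // the remote answer is later passed to pc.setRemoteDescription(), and remote candidates
        // to pc.addIceCandidate(), completing the flow shown above
    });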

Let us look at two specific cases next.

Example #1 – My WebRTC app works locally but not on a different network!

This is actually one of the most frequent questions on the discuss-webrtc list or on stackoverflow. Most of the time the answer is “you need a TURN server” and “no, you can not use some TURN server credentials that you found somewhere on the internet”.

So it works locally. That means that you are creating an offer, sending it to the remote side, calling setLocalDescription() and are getting an answer that you feed into setRemoteDescription(). It also means that you are getting candidates in the onicecandidate() event, sending them to the remote side and getting candidates from there which you call the addIceCandidate() method with.

And locally you get a oniceconnectionstatechange() event to connected or completed:

WebRTC API trace - ICE state changes

Great! You probably just copied and pasted these pieces of code from somewhere on github.

Now… why does it not work when you’re on a different network? On different networks, you need both a STUN and a TURN server. Check if your app is using a STUN and TURN server and that you’re passing them correctly at the top of webrtc-internals:

NAT configuration in webrtc-internals

As you can see (assuming you have good eyes), there are a number of ice servers used here. In the case of our screenshot, it’s Google’s apprtc sample. There is a stun server, stun:stun.l.google.com:19302. There are also four TURN servers:

  1. turn:64.233.165.127:19305?transport=udp
  2. turn:[2A00:1450:4010:C01::7F]:19305?transport=udp
  3. turn:64.233.165.127:443?transport=tcp
  4. turn:[2A00:1450:4010:C01::7F]:443?transport=tcp

As you can see, apprtc uses TURN over both UDP and TCP and is running TURN servers for both IPv4 and IPv6.

Now just because you configured a TURN server does not mean there won’t be any errors. The TURN server might not be reachable. Or your credentials might not work (this will happen if you “found” the credentials on a list of “free public servers”). In order to verify that the STUN and TURN servers you use actually work you need to look at the onicecandidate() events.

If you use a STUN or a TURN server, you should see an onicecandidate() event with a candidate that has a ‘typ srflx’.

Similarly, if you use a TURN server, you need to check if you get an onicecandidate() event where the candidate has a ‘typ relay’.

Note that Chrome stops gathering candidates once it establishes a connection. But if your connection gets established you are probably not going to debug this issue.

If you get both of these, you’re fine. But you also need to check which candidates your peer sent you – the ones addIceCandidate() was called with.
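To check this from your own code rather than by eyeballing webrtc-internals, a small sketch like the one below logs the type of every local candidate as it is gathered; if you never see ‘srflx’ or ‘relay’ here, your STUN/TURN configuration deserves a closer look (pc is your RTCPeerConnection, and the log format is made up):

pc.onicecandidate = function (event) {
    if (!event.candidate) {
        console.log('candidate gathering complete');
        return;
    }
    // the candidate SDP string contains 'typ host', 'typ srflx' or 'typ relay'
    var match = event.candidate.candidate.match(/ typ (\w+)/);
    console.log('local candidate type:', match ? match[1] : 'unknown');
};
// run the same check on the candidates received from the remote side before passing them to addIceCandidate()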

Example #2 – The network is blocking my connection

Networks that block UDP traffic are quite common. TURN/TCP and TURN/TLS (as well as ICE-TCP even though we mention this mostly to make Emil Ivov happy) provide a way to enable calls even on those networks. This has some effect on the media quality as we discussed previously but let us see how we can detect whether we are on a network that is blocking UDP traffic to begin with.

If you want to follow along, open webrtc-internals and the webrtc candidate gathering demo page and start gathering. By default, it uses one of Google’s STUN servers. To keep things simple, uncheck the “gather IPv6 candidates” and “gather RTCP candidates” boxes before clicking on the “gather candidates” button:

Uncheck ICE gathering candidates

On webrtc-internals you will see a createOffer call with offerToReceiveAudio set to true (this is to create an m-line and gather candidates for it):

WebRTC ICE gathering receive audio

Followed by a createOfferOnSuccess and a setLocalDescription call. After that there will be a couple of onicecandidate events and an icegatheringstatechange to completed, followed by a stop call.

There should be an onicecandidate with a candidate that has a “typ srflx” in it:

ICE candidate types

It shows your public ip. If you don’t get such a candidate but only host candidates, either the STUN server is not working (which in the case of Google’s STUN server is somewhat unlikely) or your network is blocking UDP.

Now block UDP on your network (but mind you, do not block port 53 for DNS). If you don’t know a quick way to block UDP, let’s try to simulate that by changing the stun server to something that will not respond, in this case Google’s well-known DNS server running at 8.8.8.8:

Fudging STUN configuration for WebRTC

Click “gather candidates” again. After around 10 seconds you will see a gathering state change to completed in webrtc-internals. But you will not see a server-reflexive candidate:

ICE negotiation with no STUN connectivity

You can try the same thing with a TURN UDP server. Make sure your credentials are valid (again, the “public TURN server list” is not a thing). This will show both a srflx and a relay candidate.

One of the nice tricks you can do here is to change the password to something invalid. Then you will only get a srflx but no relay candidate. Which is a nice and easy way to detect if your credentials are invalid — the candidates page even suggests this.

You can repeat this with TURN/TCP and TURN/TLS servers. You can even add all kinds of TURN servers and then use the priority trick we have shown in the last blog post to figure out from which servers you gathered candidates.

If you don’t get anything but host candidates you might be on a network which blocks both UDP traffic and is successful at blocking TURN/TCP and TURN/TLS. One scenario where that might happen currently is if there is a proxy that requires authentication which is not yet supported by Chrome.

Now let us take a step back. When is this useful? In a real-world scenario you will want to run with all kinds of STUN and TURN servers, otherwise you will get high failure rates. If you need to debug a failure to establish a connection, you should look for the onicecandidate and addIceCandidate events. They will allow you to figure out if the local or remote client was on a network that blocked it from establishing a connection to any peer outside the network.
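If you prefer running this check programmatically instead of through the demo page, here is a rough sketch that gathers candidates against a given ICE configuration and reports which candidate types showed up. The STUN URL is just an example; in a real deployment you would pass your own STUN/TURN servers and credentials:

function gatherCandidateTypes(config) {
    return new Promise(function (resolve) {
        var pc = new RTCPeerConnection(config);
        var types = new Set();

        pc.onicecandidate = function (event) {
            if (!event.candidate) {          // a null candidate means gathering is complete
                pc.close();
                resolve(Array.from(types));
                return;
            }
            var match = event.candidate.candidate.match(/ typ (\w+)/);
            if (match) {
                types.add(match[1]);         // 'host', 'srflx' or 'relay'
            }
        };

        // an audio m-line is enough to trigger candidate gathering
        pc.createOffer({ offerToReceiveAudio: true })
            .then(function (offer) { return pc.setLocalDescription(offer); });
    });
}

// no 'srflx' in the result suggests UDP towards the STUN server is blocked;
// no 'relay' suggests the TURN server is unreachable or the credentials are wrong
gatherCandidateTypes({ iceServers: [{ urls: 'stun:stun.l.google.com:19302' }] })
    .then(function (types) { console.log('gathered candidate types:', types); });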

What’s next?

So this time around, we’ve focused on the API traces:

  • We’ve seen the great service webrtc-internals does us just by capturing all of these WebRTC API calls
  • We even went through the typical API calls and flows that are expected to appear in the WebRTC API trace
  • We’ve looked at two examples where the WebRTC API trace can help us debug the problems we’re seeing (there are more)
    • #1 – misconfiguration of NAT traversal servers
    • #2 – network blocking and the forgotten TURN/TCP configuration

We’re not done yet with this series. We still have one or more articles in the pipeline to close the basics of what webrtc-internals got up its sleeves.

If you are interested in keeping up with us, you might want to consider subscribing.

Huge thanks to Fippo for assisting with this series!


How do you find the current active connection in webrtc-internals?

Today, we want to help you find the current active connection in webrtc-internals.

To make this article as accurate as possible, I decided to go to my source of truth for the low level stuff related to WebRTC – Philipp Hancke, also known as fippo or hcornflower. This in a way, is a joint article we’ve put together.

Before we do, though, make sure that you know what the parameters in chrome://webrtc-internals mean.  If you don’t, then you should start with our previous article, which explains the meaning of the parameters in webrtc-internals (and getstats).

First things first:

Why is this of any importance?

While you can always go and check the statistics on your connection and channels, there are times when what you really want to do is understand HOW you got connected.

WebRTC allows connections to take one of 3 main routes:

  1. Direct P2P
  2. By making use of STUN (to find a public IP address)
  3. By making use of TURN (to relay all of your media through it)

WebRTC connection types

Which one did you use for this connection? Did this call end up going via TURN over TCP?

There are two times when this will be useful to us:

  1. When trying to understand what occurred in this call, how the ICE negotiation took place, which candidate ended up being used, did we need to renegotiate or switch between candidates – we need to know how to find the active connection
  2. When we want to know this programmatically – especially if we are collecting these WebRTC stats and trying to ascertain the test results from it or how our network behaves in production with live users

What is an active connection in WebRTC?

WebRTC uses the ICE protocol (described in RFC 5245 if you are really keen on the details) to find a way to connect two peers which may be behind NAT routers. To establish a connection, each peer gathers a number of candidates.

There are host candidates which represent local IP addresses and ports. These have a “typ host” when you expand the onicecandidate or addIceCandidate events in webrtc-internals.

onIceCandidate

Similarly, there are server-reflexive candidates, where the peer asked a STUN server for its public IP address. These will have a “typ srflx” and an additional raddr field that shows the local IP address where this originated.

Last but not least there are relay candidates. Here, the peer asked a TURN server to open a port and relay traffic. These will show up in the onicecandidate and addIceCandidate with a “typ relay”. To make things more difficult, WebRTC clients can use either UDP, TCP or TLS to connect to the TURN server.
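To make these types a bit more tangible, here is a made-up srflx candidate line and how the type and raddr could be pulled out of it (the addresses are placeholders):

  const line = 'candidate:842163049 1 udp 1677729535 203.0.113.7 46154 typ srflx ' +
               'raddr 192.168.1.12 rport 46154 generation 0';
  const type = (line.match(/ typ (\w+)/) || [])[1];              // "srflx"
  const raddr = (line.match(/ raddr ([\d.a-fA-F:]+)/) || [])[1]; // "192.168.1.12"
  console.log(type, raddr);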

Let’s assume that you see a number of onicecandidate and addIceCandidate calls in webrtc-internals. There are conditions where this does not happen, but typically WebRTC will use a technique known as “trickle ICE”.

Once candidates have been exchanged, the WebRTC engine forms pairs of local and remote candidates and starts sending STUN packets to check if it gets a response. While the details of this are very much hidden from the application we can observe what is going on in webrtc-internals (and the getStats() API).
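You can also poke at this from the application itself. Here is a sketch using the spec-compliant getStats() names on a peer connection pc (note that the webrtc-internals screenshots in this post still show the older goog-prefixed names):

  pc.getStats().then((stats) => {
    stats.forEach((report) => {
      if (report.type === 'candidate-pair') {
        console.log(report.id, report.state, report.nominated,
                    'requests sent:', report.requestsSent,
                    'responses received:', report.responsesReceived);
      }
    });
  });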

TURN in wireshark

The screenshot above was taken from Wireshark, a tool that can be used to sniff the local network. It shows a TURN message flowing in a session during the connection setup stage.

How do we find the current active connection in webrtc-internals?

Open one of the WebRTC samples that establishes a local peer-to-peer connection, mute your speaker and start a call. Also open chrome://webrtc-internals. Now let’s see if we can find out together the active connection.

Now that we’re running the “local peer-to-peer connection” sample off the WebRTC samples repository, there’s something to remember – it does not use STUN and TURN servers, so the number of candidates exchanged will be quite small. We’ve selected it on purpose, so we won’t have so much clutter to go over. What you will see is just four candidates in the onicecandidate event of the first connection (because of some rules about rtcp-mux and BUNDLE) and a single candidate in the addIceCandidate event.

An active connection in webrtc-internals

Now which of those candidate pairs is used? Fortunately webrtc-internals gives us a hint by showing the active candidate pair in bold. Typically the id of this will be Conn-audio-1-0 (with BUNDLE, the same candidate pair and connection is used for the video as well). Let’s expand this and see why it is marked as bold:

WebRTC active connection statistics

The most important thing here is the googActiveConnection attribute which is true for the currently active connection and false for all others. We see quite a number of attributes here and while some of them have been discussed in the previous post let us repeat what we see here:

  1. bytesReceived, bytesSent and packetsSent specify the number of bytes and packets sent and received over this connection. packetsReceived is still MIA, but since it is not part of the specification that probably will not change anymore
  2. googReadable specifies whether STUN packets were received on this pair, googWritable whether responses were sent. These google properties have been replaced by the standard requestsReceived, requestsSent, responsesReceived and responsesSent counters
  3. googLocalAddress and googRemoteAddress show the local IP and port. This is quite useful if you need to correlate the information in webrtc-internals with a Wireshark dump. googLocalCandidateType and googRemoteCandidateType give you more information about whether this was a local, serverreflexive or relayed candidate
  4. localCandidateId and remoteCandidateId give you the names of the local and remote candidate statistics. These allow you to get even more details

Let’s search for the local and remote candidates specified in the localCandidateId and remoteCandidateId fields. Again, this is pretty easy since there is only a single candidate pair.

WebRTC connection candidates

Despite having a different type (localcandidate and remotecandidate; the specification changed this slightly to local-candidate and remote-candidate; we imagine there was a lengthy discussion about the merits of the hyphen), the information in both candidates is the same:

  • the ipAddress is where packets are sent from or sent to
  • the networkType allows you to figure out the type of the network interface
  • the portNumber gives you the port that packets are sent to or from
  • the transport specifies the transport protocol of the candidate as shown in the candidate line. Typically this will be udp unless ICE-TCP is used
  • the candidateType shows whether this is a host, srflx, prflx or relay candidate. See the specification for further details
  • the priority is the priority of the candidate as defined in RFC 5245

The most interesting field here is the priority. When the implementation follows the definition of the priority from RFC 5245, the highest 8 bits specify the type preference (i.e. you need to shift the priority attribute right by 24 bits). This is of particular interest since, for local candidates with a candidateType of relay, we can figure out the transport used between the client and the TURN server. Chrome uses a type preference of 2 for relay candidates allocated over TURN/UDP, 1 for TURN/TCP and 0 for TURN/TLS. Calculating priorities this way means that relayed connections will have a lower priority than any non-relayed candidates and that TURN servers will only be used as a fallback. As we have described previously, using TURN/TCP (or TURN/TLS) changes the behaviour of WebRTC quite drastically. If you find this hard to understand, do not worry – the specification is currently being updated to address this!
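As a sketch, the bit of arithmetic described above looks like this (localCandidateReport stands in for the statistics report of a local relay candidate):

  // Recover the type preference from a candidate priority, per RFC 5245:
  // the type preference lives in the highest 8 bits of the 32-bit priority.
  function typePreference(priority) {
    return priority >> 24;
  }
  // For a Chrome relay candidate: 2 means TURN/UDP, 1 TURN/TCP, 0 TURN/TLS.
  typePreference(parseInt(localCandidateReport.priority, 10));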

What happens during an ICE restart?

So far we have been using a very simple example. Let’s make things more complicated with one of Philipp’s favorite topics, ICE restarts. Navigate away from the previous sample and go to this sample. Make a call. You will see an active candidate pair on chrome://webrtc-internals. Click on the “Restart ICE” button. Note how the data on the candidate pair Conn-audio-1-0 has changed: the IP addresses have changed and the STUN counters have been reset. The data we saw previously is now (usually) on a report with the identifier Conn-audio-1-1. That candidate pair is not used any longer but will stick around in getStats (and consequently webrtc-internals).

This is somewhat unexpected (and led to filing an issue). It seems like the underlying webrtc.org library renames the reports in such a way that the active candidate pair always has the id Conn-audio-1-0. It has behaved like this for ages, but it has some disadvantages. Among other things, it is not safe to assume that the number of bytesSent will always increase, as it could reset – similar to a bug resetting the bytesSent counters on the reports with a type of ssrc that we saw in the last post.
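For completeness, this is roughly how an application triggers such an ICE restart itself; the signaling channel is assumed to be whatever your application already uses:

  pc.createOffer({ iceRestart: true })
    .then((offer) => pc.setLocalDescription(offer))
    .then(() => {
      // send pc.localDescription to the remote peer over your signaling channel;
      // once the answer is applied, fresh candidate pairs show up in webrtc-internals
    });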

What’s next?

What have we learned so far?

  1. The meaning of parameters in webrtc-internals and getstats
  2. Active connections in WebRTC
    1. We saw why knowing and understanding the active connection is important (mainly for debugging purposes)
    2. We then went to look at how we find an active connection (it is shown in bold, and its googActiveConnection attribute is set to true)
    3. We also reviewed the interesting attributes we can find on an active connection.
    4. We then linked the active connection back to its candidate pair using the candidate IDs, and found out the candidate type from it
    5. We even learned how to find over what transport protocol a candidate is allocated on a TURN server

We’ve only just started exploring webrtc-internals, and our plan is to continue working on them to create a series of articles that will act as the knowledge base around webrtc-internals.

If you are interested in keeping up with us, you might want to consider subscribing.

Huge thanks to Fippo for assisting with this series!

Check out the enhancements we’ve made to testRTC

It has been a while since we released a version, so it is with great pleasure that I am writing this announcement.

Yes. Our latest release is now out in the wild. We’ve upgraded our service on Sunday, so it is about time we take you for a quick roundup of the changes we’ve made.

#1 – Support for projects and users

This one is long overdue. Up until today, if you signed up for testRTC, you had to share your credentials with whoever was on your team in order to work together on the tests. This was impossible to work with, assuming you wanted QA, R&D and DevOps to share the account and work cooperatively with the tests and monitors that got logged inside testRTC.

So we did what we should have – we now support two modes of operation:

  1. A user can be linked to multiple projects
    • So if your company is running multiple projects, you can now run them separately, having people focused on their own environment and tests
    • This is great for those who run segregated services for their own customers
    • It also means that now, a user can switch between projects with a single set of credentials in the system
  2. A project can belong to multiple users
    • Need someone to work on writing the scripts and executing them? You got it
    • Have a developer working on a bug that got reported with a link to testRTC? Sure thing
    • The IT guy who just received a downtime alarm from the WebRTC monitor we run? That’s another user
    • Each user has their own place in the project, and each is distinguished by their own credentials

testRTC project selection

If you require multiple projects, or want to add more users to your account just contact our support.

#2 – Longer, bigger tests

While, theoretically, testRTC can run tests of any length and size, things aren’t always that easy.

There are usually two limitations to these requirements:

  1. The time they take to prepare, execute, run and collect results
  2. The time it takes to analyze the results

We worked hard in this release on both elements and got to a point where we’re quite happy with the results.

If you need long tests, we can handle those. One of the main concerns with long tests is what to do if you made a mistake while configuring them. Now you can cancel such tests mid-run if necessary.

Canceling a test run

If you need to scale tests to a large number of browsers – we can do that too.

We are making sure we bubble up the essentials from the browsers, so you don’t have to work hard and rummage through hundreds of browser logs to find out what went wrong. To that end, the tables that show browser results have been reworked and are now sorted in a way that will show failures first.

#3 – Advanced WebRTC analysis

We’ve noticed in the past few months that some of our customers are rather hardcore. They are technology savvy and know their way around WebRTC. For them, the graphs we offer of bitrates, latencies, packet losses and the like are just not enough.

Chrome’s webrtc-internals and getstats() offer a wealth of additional information that we offered up until now only in a JSON file download. Well… now we also visualize it upon request right from the report itself:

Advanced WebRTC graphs

These graphs are reachable by clicking the webrtc_internals_dump.txt link under the Logs tab of a test result. Or by clicking the Advanced WebRTC Analytics button located just below the channels list:

Access advanced WebRTC graphs

I’d like to thank Fippo for the work he did (webrtc-dump-importer) – we adopted it for this feature.

#4 – Simulation of call drops and dynamic network changes

This is something we’ve been asked for more than once. We have the capability of modeling the network of our probes, so that the browser runs with a specific firewall configuration or via a specific type of simulated network. We modify and tweak the profiles we have for these from time to time, but now we’ve added a script command so that you can change this configuration at runtime.

What can you do with it? Run two minutes of a test at 2 Mbps, then close virtually everything for 20-30 seconds, then open up the network again – and see what happens. It is a way to test WebRTC in your application under dynamic network conditions – ones that may require ICE restarts.

Dynamically changing network profile in testRTC

In the test above, we dynamically changed the network profile in mid-call to starve WebRTC and see how it affects the test.

How do you use this new capability? Use our new command rtcSetNetworkProfile(). Read all about it in our knowledge base: rtcSetNetworkProfile()
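To give you a feel for it, here is a sketch of the scenario above written as a testRTC script snippet. The profile arguments are illustrative assumptions – the exact values rtcSetNetworkProfile() accepts are in the knowledge base entry:

  client
    .rtcSetNetworkProfile('custom', 'bandwidth', 2000) // ~2 Mbps to start with (assumed syntax)
    .pause(120 * 1000)                                 // run for two minutes
    .rtcSetNetworkProfile('custom', 'bandwidth', 50)   // starve the connection
    .pause(30 * 1000)                                  // 20-30 seconds of near-outage
    .rtcSetNetworkProfile('')                          // remove the constraint
    .pause(60 * 1000);                                 // watch how WebRTC recovers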

#5 – Additional test expectations

We had the basics covered when it came to expectations. You could check the number and types of channels, validate that there are some bits flowing in there, and validate packet loss. And that’s about it.

To this list of capabilities that existed in rtcSetTestExpectations() we’ve now added the ability to add expectations related to jitter, video resolutions, frame rate, and call setup time. We’ve also taken the time to handle expectations on empty channels a lot better.

There’s really nothing new here, besides an enhancement of what rtcSetTestExpectations() can do.

#6 – Additional information in Webhook responses

testRTC can notify your backend whenever a test or a monitor run ends, with the status of that run – success or failure. This is done by configuring a webhook that is called at the end of the test run. We’ve had customers use it to collect the results into their own internal monitoring systems such as Splunk and Elasticsearch.

What we had on offer in the actual payload that was passed with the webhook was rather thin, and while we’re still trying to keep it simple, we did add the leading error in that response in cases of failure:

testRTC webhook test failure response
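If you are wiring the webhook into your own systems, the receiving side can be as small as the following sketch; it only assumes the webhook arrives as an HTTP POST with a JSON body, and the endpoint path is made up:

  const express = require('express');
  const app = express();
  app.use(express.json());

  app.post('/testrtc-webhook', (req, res) => {
    console.log('test run finished:', JSON.stringify(req.body));
    // forward req.body to Splunk / Elasticsearch / your alerting system here
    res.sendStatus(200);
  });

  app.listen(8080);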

#7 – API enabled to all customers

Yes. We had APIs in the past, but somehow there was friction involved, with customers needing to ask for their API key in order to use the API for their continuous integration plans. It worked well, but the number of people asking for API keys – both customers and prospects under evaluation – rose to a point where it was ridiculous to continue doing this manually. Especially when our intent is for customers to use our APIs.

So we took this one step forward. From now on, every account has an API key by default. That API key is accessible from the account’s dashboard when you login, so there’s no need to ask for it any longer.

testRTC API key

For those of you who have been using it – note that we’ve also reset your key to a new value.

Your turn

This has been quite a big release for us, and I am sure I’ve missed an enhancement or two (or more).

Now back to you. How would you want to test WebRTC in your product?