hate these ads?, log in or register to hide them
Results 1 to 13 of 13

Thread: [Dev Blog] Sleeping Beauty - How One of the Oldest Bugs in EVE Got Hunted Down

  1. #1
    Donor Verite Rendition's Avatar
    Join Date
    April 9, 2011
    Posts
    3,786

    [Dev Blog] Sleeping Beauty - How One of the Oldest Bugs in EVE Got Hunted Down

    Sleeping Beauty - How One of the Oldest Bugs in EVE Got Hunted Down and Resolved

    (And now the drone window, overview, and various other UI elements no longer stop updating)

    https://community.eveonline.com/news...eeping-beauty/

    Quote Originally Posted by CCP Snorlax


    Sometimes a story of a bug is so intriguing you have to write it down - so if you are technically minded or just curious about the realities of development on EVE Online. Join me for a deep-dive into the technical depths of the EVE Online code base, as I recount how one of the oldest defects in the game got hunted down and resolved!

    We're always working on improving EVE Online and our code base, and I hope you enjoy coming along on a journey of how this work can look.

    How it all began

    A little while ago CCP karkur asked me if I could help with an issue she had been looking into with the drone window, where it sometimes would stop updating when drones are being recalled. Her investigation did not uncover any bugs in the code related to drones or the UI. Over the course of a few hours, she managed to repro the drone bug 3 times, each time adding more logging where the problem seemed to lie. After the last repro, it became clear that it was failing at the least expected line of code - the drone window simply started updating, then went to sleep for a bit, but never woke up again - it seemed that tasklets were on occasion simply not returning from their sleep.

    The original bug was quite difficult to reproduce, but CCP karkur put together a script that generated 500,000 tasklets that simply sleep for some given time, with some tracking to identify tasklets that never finish. With this script, we've been able to reproduce this sleeping disorder somewhat reliably. This script identified the tasklet that didn't wake up, and upon inspection it was clear that the tasklet was neither scheduled nor blocked - it was in some sort of limbo.

    My first attempts at reproducing the bug with CCP karkur's script were not successful - I was running in Jessica (our authoring/debugging tool) without the client and didn't seem to have any luck. Later I tried in the client with better luck, so it seemed there was more to it than simply having lots of tasklets. I started reading through the relevant C++ code to try to come up with theories.

    Some background

    Tasklets (see Stackless Python for more information) are run in PyScheduler::Run, by calling PyStackless_RunWatchdogEx. This method takes in a timeout value and runs any scheduled tasklets until the given time has passed. When a tasklet is created, it is scheduled and so will run when the PyScheduler::Run is called. When a tasklet sleeps, it calls Synchro::SleepWallclock, which is implemented by creating a channel and calling receive on the channel. This blocks the channel until somebody sends something on that channel. Synchro maintains a heap of sleepers - a sleeper is a simple object that holds the due time when the tasklet should wake up, and the channel. Synchro::Tick pulls sleepers from the top of the heap and sends a value on the channel. This in turn unblocks the tasklet and schedules it.

    My first idea was that there might be a bug in the heap code. It's our own code, it's old and there are no unit tests for it. I considered writing some tests for it, or even finding a suitable replacement, but then I realized the symptoms didn't match with the tasklet never being pulled from that heap. If that were the case, the tasklet would simply remain blocked forever. It's still on my to-do list to add tests for this code, though.

    My next idea came from looking at the documentation of PyStackless_RunWatchdogEx. It can return a tasklet, that should either be killed or rescheduled. That turned out to be a dead end - this only applies when running in preemptive mode, where a tasklet can be interrupted.

    Show me the code

    I had a hunch that killing tasklets might be a factor in this, but I couldn't see anything wrong with the code around that. Nevertheless, I couldn't convince myself that the reference counting of the channel was correct, so I decided to clean up the code a bit. It felt awkward how it was done.
    Code:
    PyObject* Synchro::SleepWallclock(int ms, const int64_t &due)
    {
        Sleeper sl;
        sl.channel = PyChannel_New(NULL);
        if (!sl.channel)
            return 0;
        PyChannel_SetPreference(sl.channel, 0); //just make runnable on wakeup
        sl.due = due;
     
        mWallclockSleepers.Insert(sl);
         
        // Go to sleep and wake up! *(the sender releases the channel)
        PyObject *ret = PyChannel_Receive(sl.channel);
     
        if (!ret) {
            //we were killed, so lets try and find us in the queue, to release resources.
            RemoveSleeper( mWallclockSleepers, sl );
        }
        return ret;
    }
    The SleepWallclock function creates the channel, stores it in the Sleeper object, puts the Sleeper object in the heap and calls receive on the channel. The receive call returns after Synchro::Tick has called send on the channel. That unblocks it, causing it to be scheduled. The tasklet could also be killed, in which case receive returns a nullptr. Let's take a look at a snippet from Synchro::Tick:
    Code:
    //gather sleepers for wakeup
    std::vector sleepers;
    while (mWallclockSleepers.size()) {
        if (mWallclockSleepers.front().due > now)
            break;
        sleepers.push_back(mWallclockSleepers.Shift());
    }
    if (sleepers.size()) {
        Be::Time nnow = BeOS->GetActualTime();
        for(unsigned int i = 0; iPyError();
            }
            Py_DECREF(s.channel);
        }
    }
    This all looks correct - the channel could not pulled out from under the tasklet until after the Send call. What happens when a tasklet is killed?
    Code:
    void Synchro::RemoveSleeper( Heap &sleepers, Sleeper &sl )
    {
        SleeperIt it;
        for( it = sleepers.begin(); it != sleepers.end(); ++it )
        {
            if( it->channel == sl.channel )
            {
                break;
            }
        }
     
        if( it != sleepers.end() )
        {
            //Ok found us.  Let's delete us.
            sleepers.Remove( it );
            Py_DECREF( sl.channel );
        }
    }
    Again, it all looks alright - the tasklet is removed from the heap and the channel is released. Note that it is not an error if the tasklet can't be found on the heap - it could have woken up and been scheduled on the same tick as it is being killed, but in that case the channel will have been released in Tick.

    Let's examine this a bit further, though. Keep in mind that Py_DECREF frees the memory if the reference count goes to zero. The PyChannel_New call in SleepWallclock returns a PyChannelObject with a reference count of one. There are no references added, so the channel object is deleted in Tick when it calls Py_DECREF. This should be fine, as there are no references to the channel in SleepWallclock after the receive call. Except...

    In RemoveSleeper, which is called after a tasklet has been killed we're comparing channel objects. There should always be a one-to-one correspondence with tasklets and channels so we're effectively looking for a tasklet here, but since we needed to keep track of channels to wake them up, we just use the channels. I added a reference to the tasklet in the Sleeper object so that I could verify it here, and lo and behold, when running the scripts from CCP karkur I got a discrepancy - a tasklet was determined to be in the heap based on the channel, but the tasklet didn't match.

    The Eureka moment

    When a channel object is deleted, it's memory is now free and available for reuse. This implies that a tasklet that goes to sleep on the same tick as a tasklet is killed exactly when it wanted to wake up can get the same channel pointer as the killed tasklet. RemoveSleeper will in that case remove the wrong tasklet (the one that just called SleepWallclock) from the heap, kill it's channel, leaving it in a non-scheduled, non-blocked state.

    The fix is simple - move the Py_DECREF calls into SleepWallclock. This ensures that channel objects are not recycled prematurely, and in my opinion is actually cleaner, keeping the reference counting localized in one function.

    Never assume

    EVE Online is almost 14 years old now and has seen many battles during its lifetime, and one can say the code base is thoroughly battle tested. However, that doesn't mean sneaky bugs might not be hiding there, but finding a bug in such a fundamental function was surprising, to say the least. I guess we can never assume that any part of our code base is absolutely 100% correct. CCP karkur and others have spent a considerable amount of time tracking this issue down so we had high confidence in the higher level code. This allowed me to go into this investigation on the assumption that Sleep was broken, being stubborn until I found this rare edge case.

    We have reason to believe that a few other bugs are symptoms of the same underlying issue - the overview not updating, for example, or the shield, armor and structure bars not updating. These bugs have always been nearly impossible to reproduce but nevertheless are reported occasionally, especially after mass tests or heavy fights on Tranquility.
    Tranquility (EVE) Influence Map
    Life is absurd, but with a little effort we can make it completely ridiculous

  2. #2
    Movember '12 Best Facial Hair Movember 2012Donor Lallante's Avatar
    Join Date
    April 13, 2011
    Posts
    16,030
    I presume coders may find this interesting?

  3. #3
    Rakshasa The Cat's Avatar
    Join Date
    April 18, 2011
    Location
    Only one here to predict a win for God Emperor
    Posts
    11,771
    Quote Originally Posted by Lallante View Post
    I presume coders may find this interesting?
    Not really, it was a common ref counter bug. Interesting would be stuff like microsofts STL hash map iterator having non-strict ordering when the server-side had additional elements.

    (The desync issue)
    Are you an engineer? -- Quack

  4. #4
    Donor Verite Rendition's Avatar
    Join Date
    April 9, 2011
    Posts
    3,786
    Quote Originally Posted by Lallante View Post
    I presume coders may find this interesting?
    Coder here. It is indeed interesting.
    Tranquility (EVE) Influence Map
    Life is absurd, but with a little effort we can make it completely ridiculous

  5. #5

    Join Date
    May 31, 2011
    Posts
    2,971
    Yeah, interesting.

    Although - as Rak pointed out, it turns out to be a "simple" reference counter bug, the story on how it was identified as a bug is an interesting read. Making it somewhat reliably reproducable by CCP karkur's script, which implies she didn't gave up after the first couple of "can't be reproduced" - and the further hunt down the deep dark holes of decades old code makes it worth a read.

  6. #6
    Daneel Trevize's Avatar
    Join Date
    April 10, 2011
    Location
    T L A
    Posts
    11,870
    OP/CCP needs to update the Synchro::Tick snippet, one of the devs posted a valid for loop on reddit, the quoted one is mangled.
    Quote Originally Posted by QuackBot View Post
    Idk about that, and i'm fucking stupid.

  7. #7
    Rakshasa The Cat's Avatar
    Join Date
    April 18, 2011
    Location
    Only one here to predict a win for God Emperor
    Posts
    11,771
    Also passing a reference to 'const int64_t' is bad practice and slower than passing by value.
    Are you an engineer? -- Quack

  8. #8
    root's Avatar
    Join Date
    April 26, 2011
    Location
    The Camel Empire
    Posts
    2,793
    It's interesting because the speed-o-meter freezing was around for years.
    Let's hope they this fixed it.
    The Rapier is my love boat
    ~lowsec smallscale pvp 'n stuff~

  9. #9
    Lady Spank's Avatar
    Join Date
    April 12, 2011
    Location
    Get Out Nasty Face
    Posts
    4,930
    TL;DR boot.ini
      Spoiler:



    (ಠ_ృ) ゛Lady Spank is the best。゛ ~ Xenuria (ಠ_ృ)

  10. #10
    Movember 2012 I Legionnaire's Avatar
    Join Date
    April 9, 2011
    Posts
    1,412
    Can someone explain why they're co-mingling python and C++ code?

  11. #11
    Cortess's Avatar
    Join Date
    April 15, 2011
    Posts
    4,796
    Quote Originally Posted by I Legionnaire View Post
    Can someone explain why they're co-mingling python and C++ code?
    :fearless:

  12. #12
    Donor Verite Rendition's Avatar
    Join Date
    April 9, 2011
    Posts
    3,786
    Quote Originally Posted by I Legionnaire View Post
    Can someone explain why they're co-mingling python and C++ code?
    The core rendering code is in C++. The game logic is in Python.
    Tranquility (EVE) Influence Map
    Life is absurd, but with a little effort we can make it completely ridiculous

  13. #13
    Rakshasa The Cat's Avatar
    Join Date
    April 18, 2011
    Location
    Only one here to predict a win for God Emperor
    Posts
    11,771
    Quote Originally Posted by I Legionnaire View Post
    Can someone explain why they're co-mingling python and C++ code?
    You write the computationally intensive, often called and/or foundational parts in C++ that allows it to be highly optimized, while the higher-level stuff that involves more complicated logic and is often changed gets written in Python.

    This means that you only optimize (by writing in C++) the parts that are responsible for most of the run time while making it easy enough for the pleb coders to add or change gameplay related stuff.

    Edit: C++ doesn't just cover rendering or the game engine parts, you also code parts of the game logic in C++ when you know it is performance critical.
    Are you an engineer? -- Quack

Bookmarks

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •