nimbius a day ago

Tesla seeks to work to standardize a new high-speed/low-latency fabric (be that TTPoE or otherwise) for AI/ML/Datacenters; however, there's nothing inherently abject about TCP as it exists today. RDMA over Converged Ethernet suffices perfectly well for whatever an "AI/ML/Datacenter" is and, if we're being fair, the lackadaisical approach to the documentation suggests that they may not be taking it as seriously as they could anyway.

If Tesla were really seeking to shake things up they wouldn't have picked IPv4 to do it when the newest release has been around for nearly 30 years and has latency reduction baked in.

This smacks of a pandersome attempt from a company that sees the quite mandarin writing on the walls and has decided (in true Muskovite fashion) they too are just a misunderstood font of futurism.

  • pclmulqdq 21 hours ago

    RoCE sends huge packets down the wire.

    TCP has the wrong abstraction for truly high performance.

    I wouldn't necessarily standardize what Tesla does here, but most of the big companies have their own layer 3 transport protocol for things that need truly high speed and are operating within a datacenter.

    Cray/HPE has their own Ethernet-based protocol (Slingshot was an earlier version of it - not sure what its name is now) which seems to be better than whatever Tesla has, but is not necessarily published.

digitallis42 a day ago

I did a skim and didn't see any explanation of why one would want it over TCP. Did I miss it, or is it non-obvious?

  • lloeki a day ago

    From a cursory look:

    - looks dead simple

    - no IP layer (there's a ttpip folder in that repo though)

    - distributed congestion control (TCP has a "window" field + a bunch of tentative RFCs, this has a purposeful "congestion")

    - 100% implementable in hardware (TCP can, but it's complex)

    Not a general TCP replacement, but the README properly highlights a "many endpoints local link" use case:

    > the protocol executed entirely in hardware and deployed to a very large multi-ExaFlops (fp16) supercomputer with over 10s of thousands of concurrent endpoints. This protocol does not need a CPU or OS to be involved in any way to link and execute.

    • cyp0633 a day ago

      In Tesla's presentation slides, "Tesla Transport Protocol Over Ethernet (TTPoE): A New Lossy, Exa-Scale Fabric for the Dojo AI Supercomputer", they mentioned that the network layer is optional (but not removed)

    • dboreham 21 hours ago

      Resume-driven engineering.

  • Beretta_Vexee a day ago

    I think it's better to think of it as a fibre channel protocol rather than TCP. It's intended for use on managed internal data centre networks. It skips OSI layers to gain speed and probably do 100% hardware routing with FPGAs.

    It's of no interest on the internet or any small scale network.

    • KaiserPro a day ago

      > fibre channel protocol

      Except that FC is explicitly lossless and ordered.

      • bcrl a day ago

        FC is not entirely lossless. One ticket I had the joy of dealing with involved a customer using a Fibre Channel network for their storage using multipathd for failover. In theory it was a fully redundant configuration with dual FC ports on the server with each one going to a different FC switch all the way back to the SAN. However, the system was generating I/O errors on large writes while small writes would succeed. Needless to say that ext4 failed horribly, and there were worries that it was a kernel bug in the FC driver.

        After a good amount of back and forth with the customer, and several test programs run on the system in question, I eventually came up with a hypothesis that there was an error in the write path of the SAN as small writes succeeded while larger writes failed. The customer ultimately found there was a dirty fibre on one of the links in their FC fabric. It was dirty enough to corrupt large packets, but not so dirty that smaller writes and control packets were unable to get through. Since multipathd only checks to see if a given target can be read from, it would never fail over to the other path (which was fine). So much for trying to build a high availability system using an expensive SAN!

        Lesson of the story: what you think is a lossless network is not always lossless. Using the IP stack has a lot of beneficial diagnostic tools that you really start missing when something goes awry in a non-IP network.

        • KaiserPro 21 hours ago

          FC should be able to detect errors. I've had alerts shout at me when an FC switch detects a dropped packet.

          Moreover, the multi-path should have stopped that! It should have detected a bad link and failed over to the other one (but the config for that is hard, so I can see why that might not have worked).

          • bcrl 20 hours ago

            Last time I checked, multipathd does not and cannot detect faults on the write path as it only performs small reads to check the health of any given path. Checking writes would involve allocating space on the disk for multipathd to safely write to. Maybe someone has changed that in the past decade? I don't know as I'm not involved in anything SAN related anymore (and thank goodness for that!). SAN hardware is particularly awful as the underlying network is essentially hidden from the operating system most of the time. Storage subsystems built 30 years ago were built without any consideration that they might be running on top of networks.

            These and many other performance issues left me with a particular hatred of SANs.

        • stonogo a day ago

          Broken hardware does not make the protocol lossy. I think you're misunderstanding what 'lossless' is intended to mean in this context; it does not mean that it is error-free. In a lossy protocol, missing data is not necessarily an error. In a lossless protocol, missing data is treated as an error, which is consistent with what you experienced.

          • bcrl 20 hours ago

            I do understand what lossless means. The point of my anecdote is a tale of warning that when you go off and start designing new network protocols, especially one as bare bones as TTPoE, you need to consider what happens when someone has to deal with things going wrong. Diagnostics and maintenance matter in the real world for people running large systems with thousands or millions of moving parts. IPv4 and IPv6 bring along lots of tools that help in these scenarios, and IPv4/v6 headers don't actually have all that much overhead to parse and generate in hardware, plus they are protocols that have been around long enough to have many widely available hardware and software implementations in open source or to be purchased from vendors. I'm certain that there will be times when sysadmins will be cursing the fact that the folks who implemented TTPoE didn't have a ping-like tool available from the start.

            • stonogo 10 hours ago

              FC remains a lossless protocol; bugs in multipathd just mean we live in an imperfect world. Your initial sentence, "FC is not entirely lossless," conflates a specific networking term of art with a pedantic application of the denotative definition of the word. If your point was that immature network technologies do not have as many diagnostic tools as mature ones do, you should have made that point instead of misappropriating jargon.

              Anyway, to your specific point, IP at all is basically overkill in a cluster architecture. Very few IP stacks function properly without having to get things like ARP involved; the more of this stack you can get rid of, the better performance you get and there's less to maintain. TTPoE reminds me the most of ATA over Ethernet, a previous effort to shed the complexity of a protocol designed for global networking. It worked great until you hit scaling issues, which competing tech leveraged the aforementioned complexity to address.

              • bcrl an hour ago

                At a high level, FC lost the write without responding to the write in a timely fashion while other I/Os went through. The write request never gets through from the host to the target, and the host ends up timing out the write then throws an I/O error => that seems like a lost write to me at a higher level. Lossless only applies at the lowest layer of the stack; any holistic view of the system would view this scenario as lossy.

                I have implemented ARP and UDP on FPGAs for some toy projects, and it's really not that difficult. One of the use-cases I played around with was getting debug data out of an FPGA at multigigabit rates -- things like PCIe TLPs and raw SERDES data from an EPON implementation to debug a burst mode CDR. The fact that the protocol was IPv4/UDP was no impediment to having it push data through at line rate. Once you've implemented parallel CRC32 for ethernet packets from scratch on a 256-512 bit wide data bus where packets can start and end on arbitrary 32 bit boundaries, the complexity of IPv4 and UDP checksums is dead simple in comparison.
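To give a sense of how small the IPv4 header checksum is, here is a software sketch of the ones'-complement sum described in RFC 1071. It is plain Python rather than the FPGA logic described above, and the function name `internet_checksum` is just an illustrative choice.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """Ones'-complement checksum over 16-bit big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# A 20-byte IPv4 header with the checksum field (bytes 10-11) zeroed.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)
filled = header[:10] + struct.pack("!H", checksum) + header[12:]
```

Recomputing over a header that already contains its checksum yields 0, which is how a receiver validates it.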

                I understand and agree with throwing out TCP in TTPoE. I do not agree with throwing out IPv4 / IPv6. Heck, you don't even need ARP for v6, you could get away with link local addresses using the ethernet MAC address you already need to have anyways.
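As a rough illustration of the link-local idea, the modified EUI-64 scheme from RFC 4291 derives an fe80::/64 address directly from a 48-bit MAC. The helper name `link_local_from_mac` is hypothetical, and whether a TTPoE-style fabric would want this mapping is an assumption, not something stated in the repo.

```python
import ipaddress


def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """Build an fe80::/64 address from a MAC using modified EUI-64."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Flip the universal/local bit and insert ff:fe between the two halves.
    interface_id = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:]
    # fe80 prefix, 48 zero bits, then the 64-bit interface identifier.
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)


addr = link_local_from_mac("00:11:22:33:44:55")
```

Because the mapping is reversible, a peer's MAC can be read back out of such an address.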

    • delfinom a day ago

      Elon just doesn't want to pay Nvidia for Infiniband. Lol

      • andix a day ago

        If it works and it's cheaper, this is a very reasonable thing to do.

kardianos a day ago

There was a talk about this prior. This was used in place of TCP, but where TCP is designed to run over unreliable networks, this protocol achieves speed and latency figures comparable to others, while still being able to retain commodity IP switches in the cluster. By having a fixed buffer, no lingers, faster opens, they increase the speed and latency, without going to dedicated vendors or other stacks.

  • vardump a day ago

    > they increase the speed and latency

    I suppose you mean "increase the speed and decrease the latency"?

FuriouslyAdrift a day ago

Be interesting to see how this stacks up to the dominant protocol in supercomputers/ai clusters : Infiniband.

  • nine_k a day ago

    AFAICT this is very much about handling unreliable links and congestion control.

    Infiniband instead makes the sides bargain to avoid packet loss, while the medium is supposed to be reliable.

  • glzone1 a day ago

    I thought infiniband was more expensive and that even AI where bandwidth is super important was trying to get away from it towards cheaper options.

  • throw0101b a day ago

    > Be interesting to see how this stacks up to the dominant protocol in supercomputers/ai clusters : Infiniband.

    As mentioned in README, this was submitted to the larger Ultra Ethernet consortium for consideration:

    > Deliver an Ethernet based open, interoperable, high performance, full-communications stack architecture to meet the growing network demands of AI & HPC at scale

    * https://ultraethernet.org

iamleppert a day ago

How is this better than UDP? Or for that matter, just plain old Ethernet MAC addressing? You can achieve lower latency and speed (than this) if you don't care about reliability in your transport layer.

This reeks of NIH.
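For comparison, "plain old Ethernet MAC addressing" boils down to a 14-byte header in front of the payload. The sketch below only packs such a frame in memory; it is not TTPoE's wire format. The function name `build_frame` is made up for illustration, and the EtherType 0x88B5 is the IEEE "local experimental" value, chosen here as an assumption so the example does not collide with a real protocol.

```python
import struct

# IEEE 802 "local experimental" EtherType; an assumption for this example only.
ETHERTYPE_LOCAL_EXPERIMENTAL = 0x88B5
MIN_PAYLOAD = 46  # minimum Ethernet payload for an untagged frame


def build_frame(dst_mac: bytes, src_mac: bytes, payload: bytes,
                ethertype: int = ETHERTYPE_LOCAL_EXPERIMENTAL) -> bytes:
    """Pack destination MAC, source MAC, EtherType, and a padded payload."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    header = dst_mac + src_mac + struct.pack("!H", ethertype)
    # Pad short payloads with zeros; the frame check sequence is left to the NIC.
    return header + payload.ljust(MIN_PAYLOAD, b"\x00")


frame = build_frame(bytes.fromhex("ffffffffffff"),
                    bytes.fromhex("020000000001"),
                    b"hi")
```

Nothing here provides delivery guarantees, ordering, or congestion control, which is the part a transport has to add on top.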

  • mannyv a day ago

    I worked with a company that wrote its own protocol for Ethernet and got almost wire speed. It was worth it for 10, but not worth it at 100mbps.

    You can always beat general purpose solutions like the TCP/IP/UDP stack if you try. For most it isn’t worth it.

  • leetharris a day ago

    Did you even try reading the README?

    - TTPoE is designed to be implemented at hardware level unlike UDP

    - UDP cannot guarantee transmission whereas this does

    - TTPoE is built for distributed resilience

bilekas a day ago

> Some variables may have changed slightly without documentation updates, but we're sure you can figure it out

I hope they're not hoping for mass adoption with an attitude like that. Not exactly inspiring confidence in the longevity and maintainability.

  • Sysreq2 a day ago

    I don’t think mass adoption is their goal. They had a problem. They solved that problem. They shared how they solved said problem.

    Every engineering company releases stuff like this. It’s not meant to change the world. It’s marketing to recruit other engineers who would find that problem interesting.

    • bilekas 21 hours ago

      > I don’t think mass adoption is their goal.

      I'm not so sure about that. From the repo:

      > Tesla also announced joining the Ultra Ethernet Consortium (UEC) to share this protocol and work to standardize a new high-speed/low-latency fabric (be that TTPoE or otherwise) for AI/ML/Datacenters

      Also it's a protocol, personally I will only use a protocol that's fully spec'd. It's a pain sometimes to have consensus among all contributors but it's valuable.

      > edit : I will only use a protocol that's fully spec'd IN PROD

      • renewiltord 21 hours ago

        Nah, it's the base thing that a standard can be built out of. This is how things usually get done.

        • bilekas 8 hours ago

          Yup, for all the specs I've contributed towards, I should have just said : "but we're sure you can figure it out"

          That's how things usually get done right ?

  • RajT88 a day ago

    At least they're honest about it.

    This is currently the state of much modern documentation from huge tech companies.

  • serf a day ago

    it feels more like a way to sweep liability away rather than a real warning..

    ..which also does not inspire confidence.

    • glzone1 a day ago

      Why does this not inspire confidence in Tesla? Their internal software stack is available to their own developers who can review what is actually there.

      Why does it have to be perfectly documented in a public github? Are all other car companies "properly" publicly documenting things in github?

      Does it inspire more confidence in VW's software stack if they don't share it? Is VW's confidential stack some big competitive advantage? I've used a VW ID electric vehicle. I did not come away that impressed.

      • Fidelix 21 hours ago

        Because Tesla and Musk bad... or something of the sort.

        This is the way it goes here in HN for anything related to Musk.

        • bilekas 8 hours ago

          > Because Tesla and Musk bad... or something of the sort.

          No, spec bad. Protocol unknown. Poof

          edit: > This is the way it goes here in HN for anything related to Musk.

          Nobody mentioned Musk ... Except you.

7e a day ago

[flagged]

  • iphoneisbetter a day ago

    Can confirm FSD is a total fraud. I was only able to use it for 158 miles of my 160 mile journey yesterday. Absolute vaporware, totally.

    • travisgriggs a day ago

      I can’t tell which way the mocking/satire goes in your comment.

      Are you bragging that it did work for nearly 99% of your trip, and therefore the haters are out of sorts?

      Or

      Are you saying it worked great until it didn’t, and your car ended in some sort of wreck?

      • Sohcahtoa82 20 hours ago

        > Are you bragging that it did work for nearly 99% of your trip, and therefore the haters are out of sorts?

        As someone who has trialed FSD, I'd say the haters are out of sorts.

        There are situations it handles surprisingly well. I drove through some road construction and FSD gracefully handled following the traffic cones that guided cars outside the painted lines. I was able to have FSD drive me from right outside my house (It won't back out of the driveway yet) to a friend's house across town completely automated with zero intervention. 15 miles of both surface streets (Including neighborhood streets with no painted lines and curbs lined with cars) and busy highway.

        But there are still some bone-headed things it does. When one lane turns into two, it still sometimes gets confused and tries to drive in the middle and then suddenly decides to take a specific lane and swerves into it. It is overly cautious at stop signs and will easily piss off anybody behind you.

        I truly believe that Tesla will achieve actual FSD, but it just won't be on the timescale that Elon keeps saying. I also think that FSD will eventually be Level 5 capable, but they won't call it Level 5 and still expect drivers to pay attention so they can dodge legal liability.

      • iphoneisbetter a day ago

        Yes. Kidding aside I am fairly satisfied with FSD. I don't expect it to perform miracles - but it does a pretty damn good job of rolling the car safely down the road when I don't want to do it.

    • moooo99 a day ago

      Anecdotal evidence go brrrr.

      FSD may not be total trash, but it objectively isn’t what it was promised to be in 2020 (?)

      • dailykoder a day ago

        Well, who would've thought that the real world is actually hard?

        • mikestew a day ago

          Apparently not Tesla, but I’m not sure I get your point. Or maybe that was your point. :-)

      • scarby2 a day ago

        At worst it's a partial fraud though.

        • moooo99 a day ago

          I’m not a lawyer, but if courts would deem this to be fraud, Tesla would have to take accountability towards their customers as well as their shareholders.

    • qwerpy 21 hours ago

      It's sad how politics can dominate even "smart" people to the point that they refuse to appreciate how magical it is that an off the shelf car can drive itself on the vast majority of roads.

  • leesec a day ago

    Haven't had a critical intervention in weeks on that Fraud, personally

    • knallfrosch a day ago

      That's self-evident from you posting here. Just watch the lane dividers, emergency vehicles and truck trailers yourself and you should be fine.

  • MisterTea a day ago

    What irks me is they slapped their name on it like it's supposed to invoke supreme technical prowess.

Alex4386 a day ago

[flagged]

  • nsteel a day ago

    I don't think QUIC is a good example. If you realistically want to do something on the internet you absolutely have to use either TCP or UDP. There's no choice in it. But within the confines of your datacenter you can do whatever you want. Including re-inventing the wheel to be square-shaped if that fits better.

    • londons_explore a day ago

      Even if your training computer is fully within your control, occasionally you'll want to run your protocols over the internet for example to test a node in a remote location, debug some problem, etc.

      If you have requirements the internet cannot meet (eg. "latency must be <500us for correctness"), then it limits what you can do.

      • fidotron a day ago

        Do you run PCI over the internet to test boards remotely?

        • londons_explore a day ago

          No, but that's a bit of a pain.

          If it could run over the internet, I'd be able to use standard tooling (eg. wireshark to see what messages are being sent to debug my driver). I'd be able to connect to a PCI card on another machine remotely and have it 'just work', albeit with poor performance.

          There's a lot you lose by inventing your own protocol, and you lose even more if your protocol can't be tunnelled over IP.

          • ethbr1 a day ago

            That's what engineering is though: losing things you care less about in exchange for gaining things you care more about.

            Your point about being able to reach the Internet probably isn't as important of a design goal as some latency ceiling within their cluster.

    • fidotron a day ago

      I legitimately couldn’t tell if the poster was sarcastic. Overpromotion of QUIC is verging on meme territory.

  • Cthulhu_ a day ago

    What makes you think Musk was behind this personally? There's very few things he is involved in in terms of engineering.

  • billsmithaustin a day ago

    Or perhaps Elon is not personally involved in what network protocols they use on their Dojo AI supercomputer.

  • londons_explore a day ago

    Sometimes reinventing the wheel is the right call...

    But considering Dojo is years late and as far as I can see hasn't yet done any meaningful work, and Tesla is still buying up a lot of H100's, I'd say the bet on reinventing everything didn't work out this time.

    It's so late that it probably wouldn't be on the forefront of FLOPs/$ anymore, making the whole project have a business value of $0 (and negative if you consider sunk costs).

  • KaiserPro a day ago

    You don't want to use QUIC for this.

    QUIC is designed to handle loss, and it's not meant to be ultra low latency.

    QUIC is designed to be used over IP, not on raw ethernet.

    There is somewhat of a method to do it this way, especially as it's designed to pipe data directly into silicon.

    also "the giants of UDP" seems a bit off. UDP is the under developed step child of IP, completely outshined by TCP.

  • Spooky23 a day ago

    Tesla is a weird company. They are hyperfocused on GM-like micro cost cutting in the cars.

    Yet they spend expensive engineering time on stuff like this. Maybe there’s some big cost savings in the backend.

    • adgjlsfhk1 a day ago

      I think the difference is that Tesla plans like a tech company in that they assume they will build an infinite number of cars so any fixed cost optimization is going to be worth it eventually

    • zaroth a day ago

      It’s for FSD, which when fully realized, will have a global market cap of over $10 trillion.

  • paxys a day ago

    You are right Elon personally created the protocol and wrote the spec. It's not like Tesla employs engineers who can make decisions like these independently.

    • mkoubaa a day ago

      At best you can say his organizations give engineers the latitude to reinvent the wheel when they feel it's necessary

  • olalonde a day ago

    > Elon being Elon and reinventing the wheel again!

    Well he has a pretty good track record at doing that.

elcritch a day ago

Twice now I’ve been excited that this was for realtime ethernet used in Tesla's vehicles. Alas, it is not.

  • sgu999 a day ago

    Any reason to believe they don't use one of the standard industrial protocols like the poorly named EtherNet/IP?

    • kvmet a day ago

      Licensing probably?

      CAN (or one of its more modern variants) is historically more common in automotive. However with 2-wire Ethernet connections becoming more commonplace I do think you're right that more and more cars will be moving to ethernet fieldbus.

      EtherNet/IP is not as robust for many applications as its competitors (PROFINET, EtherCAT) since it is not fully deterministic. EtherCAT is my personal favorite.

      • DannyBee a day ago

        +1 - ethercat and profinet are the way.

        Random guessing - Ethercat seems more likely to take over for CAN because CoE (canopen over ethercat) is so common.

        It's very easy to turn CAN devices into ethercat ones.

        Harder to turn them into profinet ones.

        Seems like a more incremental path for car makers.

        otherwise the main advantage of profinet is that you can treat it like regular ethernet (IE switches, etc), but not sure anyone cares in a car.

    • LeifCarrotson a day ago

      Of all the (current) industrial protocols they could have picked, Ethernet/IP would be the worst.

      Its only advantage is that it can coexist with other TCP traffic and run over standard switches, but that just results in unreliable fieldbus performance.

    • MisterTea a day ago

      Please no EIP, it's utter crap and designed by an OOP huffing committee. The only serious protocol is EtherCAT with honorable mentions for Sercos 3 and Ethernet Powerlink (CANopen over Ethernet).

high_na_euv a day ago

Really interesting

  • thelastparadise a day ago

    Why?

    • high_na_euv a day ago

      Recreating foundational infra doesn't seem so common, especially for a car company

      • Cthulhu_ a day ago

        In a sense this wasn't from Tesla the car company, but Tesla the IT department with a supercomputer. I don't know what they do on it though, might be lots of physics simulations (aerodynamics etc) or deep learning for assisted driving tech.

        • martindbp a day ago

          They train an end-to-end model to drive based on 8 camera streams and recorded input from human drivers, training on tens (if not hundreds now) of millions of 30 second clips from their consumer fleet. That's why they've bought one of the largest GPU clusters and are making their own chips and transport protocols.

          It's not widely known, but Tesla probably has one of the largest training clusters, because practically all the GPUs they buy go towards training, while most of the GPUs for e.g. OpenAI go towards inference. Tesla does inference in the car.

        • literalAardvark a day ago

          In older interviews Musk said that the Dojo is intended for deep learning.

          So most likely that. I agree that this seems to have very little to do with cars.

      • aeonik a day ago

        CAN, MOST, Flexray, LIN, K-Line were all invented for automotive use.

        2 wire Ethernet is also a thing that they spearheaded.

KeepOnTruckin1 a day ago

[flagged]

  • dang 18 hours ago

    Can you please not post like this? Regardless of who you're talking about or how you feel about them, it's not what this site is for, and destroys what it is for.

    If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.

  • _joel a day ago

    [flagged]