Kai Zhang

Senior Staff Engineer at Alibaba Cloud
Hangzhou, Zhejiang, China
Kai Zhang's Location
Beijing, China

About Kai Zhang

Specialties: Cloud Computing, Kubernetes, Docker, Cloud Native AI, MLOps, Distributed system management, Job orchestration and scheduling, Deep learning, TensorFlow, Messaging Systems, QoS (High Availability, Scalability and Reliability) of Distributed Services, Service-Oriented Architecture, Search Engines, Semantic Web

Latest interests: Cloud native, Kubernetes, Deep Learning, Cloud native AI, MLOps

Kai Zhang's Current Company Details
Alibaba Cloud

Senior Staff Engineer at Alibaba Cloud
Hangzhou, Zhejiang, China
Website: aliyun.com
Employees: 2157
Kai Zhang Work Experience Details
  • Alibaba Cloud
    Senior Staff Engineer
    Alibaba Cloud Jan 2017 - Present
    Beijing City, China
    Working on Alibaba Cloud Container Service for Kubernetes (ACK), leading heterogeneous computing and AI solutions based on cloud-native technology. We deliver a large-scale AI platform built on Docker and Kubernetes to customers worldwide.
    - Lead product development of the ACK Kubernetes core scheduler. It supports batch scheduling policies (including gang, capacity, and fair share), job queues, elastic scheduling, colocation scheduling of hybrid workloads, resource/task topology-aware scheduling, and fine-grained CPU/GPU resource control, sharing, and isolation. ACK scheduling performance is 5-10 times that of the open-source scheduler.
    - Established the ACK Cloud-native AI Suite products, with a business growth rate in the triple digits for three consecutive years. The suite offers unified management, monitoring, and scheduling for GPU/NPU/RDMA; support for popular open-source AI frameworks (TF, PyTorch, DeepSpeed, Megatron, Horovod, Spark, MPI, Triton, KServe, etc.) and models (CV/NLP/LLM/AIGC); AI task orchestration and scheduling; elastic training and fault tolerance; serverless model inference; dataset management and acceleration; MLOps lifecycle management; integration with Hugging Face, ModelScope, AI container image repositories, and other ecosystems; and model development and platform operation tools and SDKs, providing full-stack support for AI system engineering efficiency. Manage over ten thousand GPU cards for clients, increase GPU utilization by 100%, accelerate AI tasks by 30%, and improve AI engineering efficiency by 50%.
    - Lead the team to create and contribute to multiple cloud-native open-source projects, including Kube-scheduler (K8s scheduler), Koordinator (colocation scheduler), Kube-queue/Kueue (K8s job queue), Fluid (dataset orchestration and acceleration, CNCF sandbox), Kubeflow (machine learning), and KServe (AI model inference). Promoted treating AI and big data workloads as first-class citizens in the cloud-native community and provided more native support for them.
  • Megvii (Face++) - 旷视科技
    Technical Director
    Megvii (Face++) - 旷视科技 Feb 2016 - Nov 2016
    Beijing City, China
    Megvii, a leading AI technology startup in China, has established solid competency in deep learning and computer vision. It provides popular intelligent services such as Face++ and FaceID. In addition, we build a comprehensive cloud platform to produce AI models and services more efficiently. It is a PaaS for the deep learning domain, designed to help the different roles involved in AI, such as researchers, consultants, and application developers.
    - As technical director and lead architect, I work with the CTO and other senior leaders to define the product and roadmap.
    - Lead the core team (about 15 top developers) to design, develop, and operate the whole platform, taking on the challenge of integrating cloud with AI.
      - The platform gives end-to-end support for deep learning: data collection, labeling, dataset preprocessing, data flow, neural network construction, model training and management, packaging models into executables for various platforms, and publishing them as APIs.
      - Monitoring, logging, and training experiment management services smooth deep learning iterations.
      - Solid cluster management (such as Mesos and Kubernetes), job scheduling, and orchestration ensure high efficiency, HA, and scalability throughout the training lifecycle. The cluster manages 100+ servers providing heterogeneous resources such as CPU, GPU, Ethernet, InfiniBand, and distributed storage.
      - All jobs are isolated with Docker, with volumes, overlay networks, and container lifecycles controlled automatically.
      - On top, it provides major services to accelerate deep learning, such as model training monitoring (similar to Google's TensorBoard), model checkpoint/restore, service registry/discovery, an admin console, and a CLI.
      - The most popular deep learning frameworks, such as Google's TensorFlow, are supported in a framework-neutral way, as well as our private training engine.
    - As technical lead, I am also responsible for helping the young, elite engineering team grow quickly in both professional engineering and technical insight.
  • NetDragon Websoft Inc. (网龙网络公司)
    Senior Software Architect
    NetDragon Websoft Inc. (网龙网络公司) May 2015 - Feb 2016
    Beijing City, China
    - As leader of the technical team (about 60 staff), responsible for the technical strategy and decisions, product development, and operation of NetDragon's PaaS cloud platform for its online education business. The platform now supports dozens of applications (including K12 education, IM, social, ERP, etc.) serving both internal users and external customers.
    - Lead architecture design and technology verification of core cloud services, such as content store, database, user management, security, application monitoring, scaling, profiling, and logging. Focus on HA, scalability, reliability, and QoS, as well as DevOps adoption in the product team.
    - Lead deployment of the platform on third-party IaaS clouds such as AWS, and address the technical challenges of hybrid cloud deployment.
    - Responsible for product design and development of several key K12 applications.
  • IBM
    Senior Software Engineer
    IBM Dec 2013 - May 2015
    Beijing City, China
    Working on the Platform as a Service (PaaS) layer of cloud computing, focusing on cloud application QoS and resiliency. Areas of focus include application monitoring, logging, auto scaling, dynamic configuration, and HA. Hands-on experience with Pivotal's Cloud Foundry and IBM's Bluemix PaaS cloud platforms. Leading Bluemix Fabric operation and development work in the China lab.
  • IBM
    Advisory Software Engineer
    IBM Jul 2011 - Nov 2013
    Beijing, China
    I work on the public cloud project, focusing on PaaS (platform as a service), SaaS (software as a service), and BPaaS (business process as a service). I am also studying mobile applications and IoT (Internet of Things).
  • IBM
    Staff Software Engineer
    IBM Jul 2009 - Jun 2011
    I am a staff software engineer on WebSphere at IBM China in Beijing. I have over 3 years of experience in every aspect of application integration software, from development, test, and assurance to planning, with strong skills in business integration, enterprise connectivity solutions, SOA, and related technologies. I am also interested in Web 2.0, the Semantic Web, and Information Retrieval.
  • Pivotal Labs
    Software Engineer
    Pivotal Labs Apr 2014 - May 2014
    San Francisco Bay Area
    I am participating in the Pivotal Dojo program onsite at the San Francisco office, working on the development of Cloud Foundry, a popular Platform-as-a-Service cloud offering. My responsibilities focus on Cloud Foundry runtime development, pull request verification and merging, new feature design, etc.
  • IBM China Development Lab
    Staff Software Engineer
    IBM China Development Lab 2006 - 2010

Kai Zhang Skills

Cloud Computing, SOA, SaaS, Mobile Applications, Java Enterprise Edition, XML, Enterprise Software, Design Patterns, Integration, High Availability, Unix, Distributed Systems, Web Services, JavaScript, WebSphere

Kai Zhang Education Details

Frequently Asked Questions about Kai Zhang

What company does Kai Zhang work for?

Kai Zhang works for Alibaba Cloud.

What is Kai Zhang's role at the current company?

Kai Zhang's current role is Senior Staff Engineer at Alibaba Cloud.

What schools did Kai Zhang attend?

Kai Zhang attended Beijing Institute of Technology and Lan Hua San Zhong.

What skills is Kai Zhang known for?

Kai Zhang has skills such as Cloud Computing, SOA, SaaS, Mobile Applications, Java Enterprise Edition, XML, Enterprise Software, Design Patterns, Integration, High Availability, Unix, and Distributed Systems.

Who are Kai Zhang's colleagues?

Kai Zhang's colleagues are Wayne Shi, Kell Xiong, Lei Shi, 王志国, Leaf Ye, 高凌霄, 贾少天.
