update link to the figures in doc
Change-Id: Idddf4f6bafce02d484f76c16a907f35e0c9dc336
This commit is contained in:
parent
579f3c1403
commit
14495c021c
@ -96,7 +96,7 @@ Jianchao Guo (AsiaInfo), Jian Xu (China Mobile), Jie Nie (China Mobile), Jintao
|
||||
|
||||
**T**he operational service layer is the user-facing service layer. On one hand, it enables the provision of computing force network services to users based on underlying computing and network resources. On the other hand, it collaborates with other computing providers to build a unified transaction service platform, supporting new business models such as "computing force e-commerce."
|
||||
|
||||
![Graph of Computing force Network Technology(from China Mobile "Computing Force Network White Paper")](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/3.%E5%9B%BE-CFN-Technology-Topo.png)
|
||||
![Graph of Computing force Network Technology(from China Mobile "Computing Force Network White Paper")](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/3.%E5%9B%BE-CFN-Technology-Topo-en.png)
|
||||
|
||||
(If you would like to learn more about specific technical areas in the graph, you can refer to China Mobile's "Computing Force Network White Paper" and "Computing Force Network Technical White Paper".)
|
||||
|
||||
@ -118,7 +118,7 @@ Jianchao Guo (AsiaInfo), Jian Xu (China Mobile), Jie Nie (China Mobile), Jintao
|
||||
| Use Case Description | With the deepening of digital transformation of the whole society and industry, cloud office has become more and more common. Cloud based office has the characteristics of resource on-demand, convenience and high mobility, and is favored by large and medium-sized enterprises. Cloud desktop is a specific implementation method. By centrally managing the office computing resources required by enterprise employees and adopting large-scale and digital methods, IT Capital Expenditure and Operating Expense(CAPEX & OPX) can be reduced, and production efficiency can be improved. Due to the presence of branches throughout the country and even globally, enterprises have requirements for both computing resource and network connectivity. Therefore, this scenario can be considered a typical scenario for computing force network.|
|
||||
| Current Solutions and Gap Analysis | In traditional solutions, cloud desktop allocation is usually based on the geographic location of employees, without considering network status. In this way, it is uncertain whether employees who move to another location will still have the same usage experience as the resident. The overall IT resource utilization rate of the enterprise cannot achieve optimal results.|
|
||||
| Computing Force Network Requirements Derivation | **Virtual Desktop Infrastructure Requirements Based on Computing and Network Convergence:** <br> 1. When users use cloud desktops, they have different requirements for latency and bandwidth based on the purpose of the cloud desktop. For example, cloud desktops for office use generally require a latency of less than 30ms and a bandwidth of around 2M, while for simulation design cloud desktops, the latency is lower and the bandwidth is higher. Therefore, a good solution requires the ability to calculate a network path with appropriate latency and bandwidth based on the cloud desktop type required by the user when selecting a computing resource pool. In this scenario, the computing force network is required to have network scheduling capability. <br> 2. Additionally, the required computing and storage resources may vary depending on the type of cloud desktop. For example, office type cloud desktops generally only require CPU and some memory, while designing simulation type cloud desktops typically requires GPU, as well as more CPU and memory. Therefore, when selecting a computing resource pool, computing force network should be able to select computing entities with the required hardware form and physical capacity based on the type of cloud desktop. Computing force network need to have computing scheduling capabilities. <br> 3. Employees may travel outside their premises at any time. In order to ensure a consistent experience for users using cloud desktops in their new location, computing force network needs to be able to reassign cloud desktop resources based on the new physical location of employees after identifying their movement, and then perform cloud desktop migration.|
|
||||
| Reference Implementation | In this scenario, a VDI management system (hereinafter referred to as VDI Center) and a computing force network brain are required to collaborate. The system architecture diagram is as follows: <br> ![VDI示意图](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.1%E5%9B%BE-VDI.png) <br> Specific workflow: <br> 1. Enterprises deploy cloud desktop server resource pools in different geographical locations. This can be achieved through self built private cloud data centers or by using public cloud services. VDI Center is a centralized cloud desktop management system. The VDI Center reports all computing resource information to the computing force network brain. The computing force network brain simultaneously maintains the information of computing and network. <br> 2. Users bring network SLA requirements and computing resource requirements to the VDI Center to apply for cloud desktop. <br> 3.The VDI Center requests an appropriate resource pool from the computing force network brain with the user's constrained needs. The computing force network brain selects an appropriate cloud resource pool based on the global optimal strategy and calculates the network path from the user access point to the cloud resource pool. Then, the computing force network brain returns the selected resource pool information and pre-establishes the path. <br> 4. After obtaining the optimal resource pool information, the VDI Center applies for virtual desktop through the cloud management platform of that resource pool and returns the information of the cloud desktop to the user. <br> 5. Users access the cloud desktop through the previously established path. <br> 6. User travels to other regions and initiates a cloud desktop migration request. <br> 7. The VDI Center requests a new and suitable resource pool from the computing force network brain again. The computing force network brain recalculates the optimal resource pool, establishes a new path, and returns the results to the VDI Center. <br> 8. VDI Center discovered a better resource pool and initiated virtual machine migration process. <br> 9. Users access a new cloud desktop through a new path.|
|
||||
| Reference Implementation | In this scenario, a VDI management system (hereinafter referred to as VDI Center) and a computing force network brain are required to collaborate. The system architecture diagram is as follows: <br> ![Figure for VDI](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.1%E5%9B%BE-VDI-en.png) <br> Specific workflow: <br> 1. Enterprises deploy cloud desktop server resource pools in different geographical locations. This can be achieved through self built private cloud data centers or by using public cloud services. VDI Center is a centralized cloud desktop management system. The VDI Center reports all computing resource information to the computing force network brain. The computing force network brain simultaneously maintains the information of computing and network. <br> 2. Users bring network SLA requirements and computing resource requirements to the VDI Center to apply for cloud desktop. <br> 3.The VDI Center requests an appropriate resource pool from the computing force network brain with the user's constrained needs. The computing force network brain selects an appropriate cloud resource pool based on the global optimal strategy and calculates the network path from the user access point to the cloud resource pool. Then, the computing force network brain returns the selected resource pool information and pre-establishes the path. <br> 4. After obtaining the optimal resource pool information, the VDI Center applies for virtual desktop through the cloud management platform of that resource pool and returns the information of the cloud desktop to the user. <br> 5. Users access the cloud desktop through the previously established path. <br> 6. User travels to other regions and initiates a cloud desktop migration request. <br> 7. The VDI Center requests a new and suitable resource pool from the computing force network brain again. The computing force network brain recalculates the optimal resource pool, establishes a new path, and returns the results to the VDI Center. <br> 8. VDI Center discovered a better resource pool and initiated virtual machine migration process. <br> 9. Users access a new cloud desktop through a new path.|
|
||||
| Proposal of Technology development and open-source work| It is recommended to conduct further research on the computing force network brain as follows: <br> 1. The computing force network brain has a set of multi-objective optimization scheduling algorithms. Multiple objectives include balancing computing resources, minimizing network latency, and optimizing paths, and so on. <br> 2. The computing force network brain can manage various forms of computing resources (such as CPU, GPU, ASIC, etc.) and establish unified metrics.|
|
||||
|
||||
|
||||
@ -128,7 +128,7 @@ Jianchao Guo (AsiaInfo), Jian Xu (China Mobile), Jie Nie (China Mobile), Jintao
|
||||
|:----|:----|
|
||||
| Contributor | China Mobile Research Institute: Weisen Pan |
|
||||
| Application Name | AI-based Computer Force Network Traffic Control and Computer Force Matching |
|
||||
| Use Case Description | 1. Computer Force Network integrates distributed and ubiquitous computing capabilities in different geographic locations, and its sources include various computing devices such as cloud computing nodes, edge computing nodes, end devices, network devices, etc. The computing tasks in the CFN environment are large in volume and diverse in type, including data analysis, AI reasoning, graphics rendering, and other computing tasks. In this case, the traditional traffic control strategy may not be able to effectively handle the diversity and magnitude of tasks, which may lead to the waste of computing resources, delay of computing tasks, and degradation of service quality. To solve these problems, AI-based traffic control and computing force matching can be used to train AI models using deep learning algorithms by collecting a large amount of network traffic data, device state data, and task demand data. The model can not only learn the pattern of network traffic and computing tasks but also predict future traffic changes and task demands, as well as the computing capacity of devices, and adjust the traffic control strategy and arithmetic matching strategy in real-time based on this information. <br> 2.With the help of AI, operators can manage traffic and computing force more effectively, reduce network congestion, improve the utilization of computing resources, reduce the latency of computing tasks, and improve the quality of service. For example, when a large number of data analysis tasks are predicted to be coming, AI systems can adjust network configurations in advance to prioritize allocating computing resources to these tasks to meet demand. When the capacity of computing devices is predicted to be insufficient to handle the upcoming tasks, the AI system can adjust the traffic control policy in advance to redirect some tasks to other devices to prevent congestion. <br> 3. AI-based Computer Force Network traffic control and computer force matching bring significant performance improvements to large-scale CFN, enabling operators to manage computing resources better to meet the demands of various computing tasks. <br> ![AI流量协同示意图](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.2%E5%9B%BE-AI-workload.png)|
|
||||
| Use Case Description | 1. Computer Force Network integrates distributed and ubiquitous computing capabilities in different geographic locations, and its sources include various computing devices such as cloud computing nodes, edge computing nodes, end devices, network devices, etc. The computing tasks in the CFN environment are large in volume and diverse in type, including data analysis, AI reasoning, graphics rendering, and other computing tasks. In this case, the traditional traffic control strategy may not be able to effectively handle the diversity and magnitude of tasks, which may lead to the waste of computing resources, delay of computing tasks, and degradation of service quality. To solve these problems, AI-based traffic control and computing force matching can be used to train AI models using deep learning algorithms by collecting a large amount of network traffic data, device state data, and task demand data. The model can not only learn the pattern of network traffic and computing tasks but also predict future traffic changes and task demands, as well as the computing capacity of devices, and adjust the traffic control strategy and arithmetic matching strategy in real-time based on this information. <br> 2.With the help of AI, operators can manage traffic and computing force more effectively, reduce network congestion, improve the utilization of computing resources, reduce the latency of computing tasks, and improve the quality of service. For example, when a large number of data analysis tasks are predicted to be coming, AI systems can adjust network configurations in advance to prioritize allocating computing resources to these tasks to meet demand. When the capacity of computing devices is predicted to be insufficient to handle the upcoming tasks, the AI system can adjust the traffic control policy in advance to redirect some tasks to other devices to prevent congestion. <br> 3. AI-based Computer Force Network traffic control and computer force matching bring significant performance improvements to large-scale CFN, enabling operators to manage computing resources better to meet the demands of various computing tasks. <br> ![Figure of AI Workload](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.2%E5%9B%BE-AI-workload-en.png)|
|
||||
| Current Solutions and Gap Analysis | AI-based Computer Force Network Traffic Control and Computer Force Matching Through artificial intelligence technology, it can monitor the status of CFN in real time, dynamically predict network traffic demand, and automatically optimize CFN resource allocation and load balancing. It can also continuously learn and improve its own traffic control strategy through deep learning algorithms to make it more adaptable to complex and variable network environments. <br> Gap Analysis: <br> 1.Dynamic and adaptive: Traditional traffic control methods tend to be more static and difficult to adapt to the rapid changes in the future CFN environment. AI-based traffic control and computer force matching are highly dynamic and adaptive, and can dynamically adjust traffic control policies and computer force allocation policies based on real-time network status and predicted traffic demand. <br> 2. Learning and improvement: Traditional traffic control methods cannot often learn and improve themselves. On the other hand, AI-based traffic control and computer force matching can continuously learn and improve their own traffic control and computer force matching strategies through deep learning algorithms, making them more adaptable to complex and changing network environments. <br> 3. Adaptability to future technologies: With the rapid development of CFN and related applications, the future CFN environment and traffic demand may be more complex and variable. Therefore, AI-based traffic control and computer force matching are better adaptable and forward-looking for future CFN and related applications.|
|
||||
| Computing Force Network Requirements Derivation | In CFN, traffic control and reasonable matching of computer force are crucial to ensure efficient operation and resource optimization. This requires a system that can adjust the flow control policy and arithmetic matching in real-time and dynamically. Artificial intelligence-based flow control and computer force matching are expected to meet this need. The following is the specific requirement derivation process: <br> 1. Efficient resource utilization: In a large-scale, distributed CFN, the efficiency of resource utilization directly affects the operational efficiency and cost of the entire network. AI technology enables more accurate traffic prediction and scheduling, resulting in more rational and efficient utilization of resources. <br> 2. Dynamic adjustment and optimization: Network traffic and task demand may change with time, applications, and user behavior, which requires traffic control policies to respond to these changes in real-time. AI technology can achieve dynamic adjustment and optimization of traffic control policies through real-time learning and prediction and reasonably match the optimal computing force. <br> 3.Load balancing: In the face of sudden changes in traffic or changes in task demand, it is critical to maintain network load balancing. AI technology can dynamically adjust traffic and task distribution to support load balancing by monitoring and predicting network status in real-time. <br> 4. Quality of Service Assurance: In ensuring the quality of service, AI technology can improve the quality of service by prioritizing essential tasks and services based on the predicted network state and task demands. <br> 5. Automation management: By automatically learning and updating rules, AI technology can reduce the workload of CFN management and achieve a higher degree of automation. <br> Therefore, the introduction of AI-based traffic control and computer force matching can improve the operational efficiency and service quality of CFN and achieve a higher degree of automated management, which is in line with the development needs of CFN.|
|
||||
| Reference Implementation | 1. Data collection: Collect historical data in CFN, such as the computing force utilization of each node, task execution time, network latency, etc., as the data basis for training AI models. <br> 2. Data pre-processing: Pre-process the collected data, including data cleaning, format conversion, feature extraction, etc. <br> 3. Model selection training: According to the characteristics and needs of CFN, suitable AI models (such as deep learning models, reinforcement learning models, etc.) are selected for training. The training goal is for AI models to learn how to perform optimal traffic control and arithmetic power allocation under various conditions. <br> 4. Model testing and optimization: The trained AI models are tested in a simulated or natural environment, and the model is adjusted and optimized according to the test results. <br> 5. Model deployment: The optimized AI model is deployed to CFN for traffic control and arithmetic guidance according to real-time network status and task requirements. <br> 6. Real-time adjustment: The model needs to be dynamically adjusted and optimized according to the real-time network status and task demand data collected after deployment. <br> 7. Model update: The model is regularly updated and optimized according to the network operation and model performance. <br> 8. Continuous monitoring and adjustment: After the model is deployed, the network state and task execution need to be continuously monitored, the AI model needs to be adjusted as required, and the model needs to be periodically retrained to cope with changes in the network environment. |
|
||||
@ -142,9 +142,9 @@ Jianchao Guo (AsiaInfo), Jian Xu (China Mobile), Jie Nie (China Mobile), Jintao
|
||||
| Contributor | Inspur - Geng Xiaoqiao |
|
||||
| Application Name |Integrated computing and network scheduling for video applications in rail transit scenario|
|
||||
| Use Case Description | Based on the intelligent video scene of rail transit and focusing on customer business, integrated computing and network scheduling capability is built. Perceiving and analyzing users' requirements on delay and cost, coordinately scheduling and allocating computing and storage resources, as well as realizing dynamic optimization of scheduling policies, so as to provide customers with on-demand, flexible and intelligent computing force network services. At the same time, it is widely adapted to various industry application scenarios, enabling intelligent transformation and innovation of video services. |
|
||||
| Current Solutions and Gap Analysis | 1.Computing and network resources are heterogeneous and ubiquitous, services are complex and performance requirements are high. However, the traditional solutions have synergy barriers between the network, cloud, and service, which cannot meet the scheduling requirements of various scenarios. <br> 2.During peak hours, the data transmission volume is large and the network load is high, leading to high latency and affecting the quality of video service, further putting forward higher requirements and challenges for the optimization of video scheduling solutions. <br> ![终端摄像头与网络协同示意图](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.3%E5%9B%BE-%E8%A7%86%E9%A2%91%E7%BD%91%E7%BB%9C%E5%9B%BE.png) <br> 3. Video business intelligence is insufficient, and a closed loop of automatic supervision and efficient early warning disposal has not been formed.|
|
||||
| Current Solutions and Gap Analysis | 1.Computing and network resources are heterogeneous and ubiquitous, services are complex and performance requirements are high. However, the traditional solutions have synergy barriers between the network, cloud, and service, which cannot meet the scheduling requirements of various scenarios. <br> 2.During peak hours, the data transmission volume is large and the network load is high, leading to high latency and affecting the quality of video service, further putting forward higher requirements and challenges for the optimization of video scheduling solutions. <br> ![Figure for the network of Camera](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.3%E5%9B%BE-%E8%A7%86%E9%A2%91%E7%BD%91%E7%BB%9C%E5%9B%BE-en.png) <br> 3. Video business intelligence is insufficient, and a closed loop of automatic supervision and efficient early warning disposal has not been formed.|
|
||||
| Computing Force Network Requirements Derivation | 1.Based on the measurement of resource status and business requirements, combining with optimization algorithms, the optimization scheduling and allocation of computing force resources in different time periods and different sites are carried out, to establish a collaborative technology system of computing force network scheduling for industrial applications. <br> 2.For all kinds of video users and business, to provide task-based services (such as optimal path, nearest distance, lowest cost, and etc.), as well as AI intelligent video services. |
|
||||
| Reference Implementation | ![算力网络一体化调度示意图](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.3%E5%9B%BE-%E7%AE%97%E5%8A%9B%E7%BD%91%E7%BB%9C%E4%B8%80%E4%BD%93%E5%8C%96%E8%B0%83%E5%BA%A6.png) <br> 1. Perceiving, collecting, and analyzing underlying computing resources, network resources, and storage resources, as well as perceiving user service types, demands on latency, transmitted data volume, upload traffic, and etc. <br> 2.Based on the user's business requirements of delay and cost, and combining the overall system modeling, business rule analysis, optimization strategy solving and system module docking, to provide intelligent scheduling capabilities including time-sharing tide scheduling, cross-region scheduling, and etc. <br> 3.Evaluating whether the current scheduling policy can meet the service requirements of users in real time, and feeding back relevant indicators to the intelligent scheduling module, then the scheduling policy can be dynamically optimized and adjusted. <br> 4.Relevant processing tasks such as video processing, AI reasoning, data processing, and etc. are flexibly delivered to related computing resources, and providing efficient intelligent video services such as automatic video data backup, AI training, and AI video real-time reasoning, and etc. |
|
||||
| Reference Implementation | ![Covergent Scheduling of CFN](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.3%E5%9B%BE-%E7%AE%97%E5%8A%9B%E7%BD%91%E7%BB%9C%E4%B8%80%E4%BD%93%E5%8C%96%E8%B0%83%E5%BA%A6-en.png) <br> 1. Perceiving, collecting, and analyzing underlying computing resources, network resources, and storage resources, as well as perceiving user service types, demands on latency, transmitted data volume, upload traffic, and etc. <br> 2.Based on the user's business requirements of delay and cost, and combining the overall system modeling, business rule analysis, optimization strategy solving and system module docking, to provide intelligent scheduling capabilities including time-sharing tide scheduling, cross-region scheduling, and etc. <br> 3.Evaluating whether the current scheduling policy can meet the service requirements of users in real time, and feeding back relevant indicators to the intelligent scheduling module, then the scheduling policy can be dynamically optimized and adjusted. <br> 4.Relevant processing tasks such as video processing, AI reasoning, data processing, and etc. are flexibly delivered to related computing resources, and providing efficient intelligent video services such as automatic video data backup, AI training, and AI video real-time reasoning, and etc. |
|
||||
| Proposal of Technology development and open-source work | 1. Combining AI intelligent video capabilities with computing force networks to meet the diversified requirements of industry scenarios. <br> 2. It is suggested to carry out related research on computing and network resources measurement, in order to provide a unified resource template for computing force network.|
|
||||
|
||||
## 4.4 Scheduling of Private Computing Service Based on Computing Force Network
|
||||
@ -155,7 +155,7 @@ Jianchao Guo (AsiaInfo), Jian Xu (China Mobile), Jie Nie (China Mobile), Jintao
|
||||
| Use Case Description |When individuals/enterprises apply for loans from banks, banks need to assess lending risks and identify the risks of users borrowing excessively or excessively. By building a privacy computing platform, it is possible to utilizes privacy queries and multi-party joint statistics, and collaborate with multiple banks to jointly calculate the total loan amount of each bank before lending. After receiving the joint statistical results, the bank decides whether to issue loans to users.|
|
||||
| Current Solutions and Gap Analysis | |
|
||||
| Computing Force Network Requirements Derivation | Without solving computing force and communication issues, the large-scale application of privacy computing will be impossible to achieve. Due to massive data needing processing, privacy computing requires super-high bandwidth and large amount of computing force. <br>Besides, privacy computing also put forward higher requirements for computing force network convergent scheduling. Hidden query services and multi-party joint statistical services involve collaborative computing among multiple computing force nodes, data ciphertexts between nodes, and encryption algorithms. It is necessary for computing force networks to have the collaborative scheduling ability of computing force and networks, which can meet network transmission needs, and comprehensively consider transmission delay and task execution delay of computing nodes to eliminate the "barrel effect" caused by computing force shortfalls. At the same time, it is necessary for the computing force network to have dynamic resource scheduling capabilities, which can meet the needs of business scheduling in real-time.|
|
||||
| Reference Implementation | ![隐私计算与算力网络协同流程图](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.4%E5%9B%BE-%E9%9A%90%E7%A7%81%E8%AE%A1%E7%AE%97-3.png) <br> 1. The bank initiates a joint model training service request; <br> 2. The operation center of the CFN submits a business request to the CFN brain; <br> 3. CFN brain analysis business requests (which may include location, resource requirements, etc.), query available nodes and paths (dedicated/private networks), and reply to the CFN operation center for optional solutions; <br> 4. The CFN operations center responds about resource solution and price information to bank; <br> 5. Bank select and confirm preferred solution and price; <br> 6. The CFN operation center will send the selected solution to the CFN brain; <br> 7. CFN brain conduct self verification to check whether the selected solution can be satisfied (based on digital twin technology or other simulation technologies); <br> 8. CFN brain reply to the self verification status of the solution to CFN operations center (confirmation of information such as computing force, network resources, etc.) ; <br> 9.CFN operations center double confirm the plan; <br> 10. CFN brain initiates resource opening and network establishment requests to the CFN infrastructure layer, and send service instantiation request to CFN infrastructure layer; <br> 11. CFN infrastructure layer reply that computing and network resources as well as services are prepared; <br> 12. CFN brain reply to CFN operation center about service ready information; <br> 13. The CFN operation center responds to bank about service ready info and the bank can start to conducts model training and deployed model inference.|
|
||||
| Reference Implementation | ![Private Computing Service Scheduling in CFN](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.4%E5%9B%BE-%E9%9A%90%E7%A7%81%E8%AE%A1%E7%AE%97-workflow-en.png) <br> 1. The bank initiates a joint model training service request; <br> 2. The operation center of the CFN submits a business request to the CFN brain; <br> 3. CFN brain analysis business requests (which may include location, resource requirements, etc.), query available nodes and paths (dedicated/private networks), and reply to the CFN operation center for optional solutions; <br> 4. The CFN operations center responds about resource solution and price information to bank; <br> 5. Bank select and confirm preferred solution and price; <br> 6. The CFN operation center will send the selected solution to the CFN brain; <br> 7. CFN brain conduct self verification to check whether the selected solution can be satisfied (based on digital twin technology or other simulation technologies); <br> 8. CFN brain reply to the self verification status of the solution to CFN operations center (confirmation of information such as computing force, network resources, etc.) ; <br> 9.CFN operations center double confirm the plan; <br> 10. CFN brain initiates resource opening and network establishment requests to the CFN infrastructure layer, and send service instantiation request to CFN infrastructure layer; <br> 11. CFN infrastructure layer reply that computing and network resources as well as services are prepared; <br> 12. CFN brain reply to CFN operation center about service ready information; <br> 13. The CFN operation center responds to bank about service ready info and the bank can start to conducts model training and deployed model inference.|
|
||||
| Proposal of Technology development and open-source work | |
|
||||
|
||||
## 4.5 Cross-Architecture deployment and migration of AI Applications in CFN
|
||||
@ -165,9 +165,9 @@ Jianchao Guo (AsiaInfo), Jian Xu (China Mobile), Jie Nie (China Mobile), Jintao
|
||||
| Contributor | China Mobile Research Institute – Qihui Zhao |
|
||||
| Application Name | AI applications |
|
||||
| Use Case Description | When users apply for AI services from computing force network, taking facial recognition as an example, they will provide facial recognition task type, data to be processed (such as image transmission interface, image quantity, image size), preferred detection location, processing speed, cost constraints, etc. The computing force network will choose a suitable facial recognition software and a computing force cluster to deploy facial recognition software, complete configuration, and provide services based on user needs. <br> Since the computing force network forms a unified network with computing force of different sources and types, the underlying computing force that used to carries the facial recognition service can be any one or more of Nvidia GPU, Intel CPU, Huawei NPU, Cambrian MLU, Haiguang DCU and many other intelligent chips. Therefore, the deployment, operation, and migration of AI applications on heterogeneous computing chips from multiple vendors is one of the typical use cases of computing force networks. This use case is similar to the AI application’s cross architecture migration scenario in cloud computing. <br> In addition to users applying for AI services from computing force network, the above use case is also suitable for users to deploy self-developed AI applications into computing force network. |
|
||||
| Current Solutions and Gap Analysis | AI applications (i.e., AI services) generally requires the support of "AI framework + Toolchain + Hardware", where: AI framework refers to PaddlePaddle, Pytorch, TensorFlow, etc.; Hardware refers to the AI chips of various device manufacturers; The Toolchain is a series of software built by various AI-chip manufacturers around their AI chips, including but not limited to IDE, compiler, runtime, device driver, etc. At present, users need to choose programming languages and framework models, specify hardware backend, and perform integrated compilation and linking at application development and design stages. If wanting to run AI applications on any AI chips in computing force network, it is necessary to develop multiple versions of codes, complete the compilation for different AI chips, using the development toolchain matched by each chip, and then deploy to the runtime of specific AI chips. The overall development difficulty is high, and the software maintenance cost is high. <br> ![智算生态壁垒示意图](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.5%E5%9B%BE-%E5%8A%A0%E9%80%9F%E7%A1%AC%E4%BB%B6%E7%94%9F%E6%80%81%E5%A3%81%E5%9E%92.png) |
|
||||
| Computing Force Network Requirements Derivation | 1. To simplify the difficulty of user development and deployment of AI applications, CFN needs to form a software stack that supports cross architecture development and deployment of AI applications in collaboration with AI frameworks, chips, and related toolchain, shield the underlying hardware differences for users. <br> ![算力原生逻辑方案](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.5%E5%9B%BE-%E7%AE%97%E5%8A%9B%E5%8E%9F%E7%94%9F%E9%80%BB%E8%BE%91%E6%96%B9%E6%A1%88.png) <br> 2. Computing force networks need to understand user’s descriptive service quality requirements, support transforming the requirements into resource, network, storage and other requirements that can be understood by cloud infrastructure, and support to form multiple resource combination as solutions to meet user SLA requirements. <br> 3. The computing force network needs to measure the underlying heterogeneous chips of different vendors according to unified standards, so that computing force network can calculate different resource combination schemes based on user descriptive SLA requirements. <br> 4. The computing force network needs to monitor the entire state of the system to support resource scheduling, application deployment, etc. |
|
||||
| Reference Implementation | This section is only a solution designed for cross architecture deployment and migration of AI applications, and other implementations and processes may also exist. <br> ![应用跨架构部署流程图](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.5-%E8%B7%A8%E6%9E%B6%E6%9E%84%E5%B7%A5%E4%BD%9C%E6%B5%81%E7%A8%8B.png) <br> Workflows: <br> Pre1. The orchestration and management layer has already managed the underlying computing resource pool and monitored the resource situation within the pool in real-time. <br> Pre2. The computing service provider prepares a cross architecture basic image based on the resource types in the computing resource pool, and registers the image in the central image repository of the orchestration and management layer. A cross architecture runtime is at least included in the cross architecture basic image. This cross architecture runtime can provide a unified abstraction of computing resources for upper level businesses, thereby shielding underlying hardware differences; and can translate unified AI calls into API or instruction call of chips’ toolchain to execute the computing task. <br> 1. Computing force network provides AI application developers (i.e. users) with a flexible and loadable local cross architecture development environment, including cross architecture IDEs (compilers, SDKs, etc.) and cross architecture environment simulators (flexibly adapted according to the user's local underlying hardware type, simulating unified and abstract computing resources for application debugging). <br> 2. After completing AI application development, users can generate cross architecture executable files through the cross architecture development environment, and upload them to the cross architecture business executable file repository of CFN orchestration and management layer. <br> 3. Users propose descriptive SLA requirements for AI application deployment to CFN, and the orchestration and management layer receives SLA requests. <br> 4. The orchestration and management layer analyzes SLA requirements based on monitoring data of multiple types of resources in CFN and converts them into resource requirements that can be understood by the infrastructure layer. This resource requirement can be met by multiple resource combinations, and after the user selects a specific combination scheme, it triggers application deployment. <br> 5. The application deployment process first completes the generation of AI application images and deployment files. For the image, the corresponding cross architecture basic image will be pulled from the central image repository based on the underlying resource type, and automatically packaged with the cross architecture executable file of the AI application to be deployed to generate a complete cross architecture AI application image. The complete AI application image will be distributed to the image repository in the underlying resource pool. For deployment files, automatic deployment files will be generated based on the underlying environment type (bare metal environment, container environment, virtual machine environment), resolved resource requirements, image location, business configuration, and other information/files (which can be scripts, Helm Chart, Heat templates, etc.) <br> 6. The orchestration and management component in the infrastructure layer resource pool deploys AI applications based on deployment files, and complete cross architecture AI application image. <br> 7. If AI application needs to be migrated to resource pools with other chips (CPU, FPGA, ASIC, etc.), repeat steps 4, 5, and 6. |
|
||||
| Current Solutions and Gap Analysis | AI applications (i.e., AI services) generally requires the support of "AI framework + Toolchain + Hardware", where: AI framework refers to PaddlePaddle, Pytorch, TensorFlow, etc.; Hardware refers to the AI chips of various device manufacturers; The Toolchain is a series of software built by various AI-chip manufacturers around their AI chips, including but not limited to IDE, compiler, runtime, device driver, etc. At present, users need to choose programming languages and framework models, specify hardware backend, and perform integrated compilation and linking at application development and design stages. If wanting to run AI applications on any AI chips in computing force network, it is necessary to develop multiple versions of codes, complete the compilation for different AI chips, using the development toolchain matched by each chip, and then deploy to the runtime of specific AI chips. The overall development difficulty is high, and the software maintenance cost is high. <br> ![Figure of Vertial Ecology of Intelligent Computing Industry](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.5%E5%9B%BE-%E7%AE%97%E5%8A%9B%E5%8E%9F%E7%94%9F%E9%80%BB%E8%BE%91%E6%96%B9%E6%A1%88-en.png) |
|
||||
| Computing Force Network Requirements Derivation | 1. To simplify the difficulty of user development and deployment of AI applications, CFN needs to form a software stack that supports cross architecture development and deployment of AI applications in collaboration with AI frameworks, chips, and related toolchain, shield the underlying hardware differences for users. <br> ![Figure of Computing Native Solution ](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.5%E5%9B%BE-%E7%AE%97%E5%8A%9B%E5%8E%9F%E7%94%9F%E9%80%BB%E8%BE%91%E6%96%B9%E6%A1%88-en.png) <br> 2. Computing force networks need to understand user’s descriptive service quality requirements, support transforming the requirements into resource, network, storage and other requirements that can be understood by cloud infrastructure, and support to form multiple resource combination as solutions to meet user SLA requirements. <br> 3. The computing force network needs to measure the underlying heterogeneous chips of different vendors according to unified standards, so that computing force network can calculate different resource combination schemes based on user descriptive SLA requirements. <br> 4. The computing force network needs to monitor the entire state of the system to support resource scheduling, application deployment, etc. |
|
||||
| Reference Implementation | This section is only a solution designed for cross architecture deployment and migration of AI applications, and other implementations and processes may also exist. <br> ![Workflow of Cross-Architecture Deployment of AI Application](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.5-%E8%B7%A8%E6%9E%B6%E6%9E%84%E5%B7%A5%E4%BD%9C%E6%B5%81%E7%A8%8B-en.png) <br> Workflows: <br> Pre1. The orchestration and management layer has already managed the underlying computing resource pool and monitored the resource situation within the pool in real-time. <br> Pre2. The computing service provider prepares a cross architecture basic image based on the resource types in the computing resource pool, and registers the image in the central image repository of the orchestration and management layer. A cross architecture runtime is at least included in the cross architecture basic image. This cross architecture runtime can provide a unified abstraction of computing resources for upper level businesses, thereby shielding underlying hardware differences; and can translate unified AI calls into API or instruction call of chips’ toolchain to execute the computing task. <br> 1. Computing force network provides AI application developers (i.e. users) with a flexible and loadable local cross architecture development environment, including cross architecture IDEs (compilers, SDKs, etc.) and cross architecture environment simulators (flexibly adapted according to the user's local underlying hardware type, simulating unified and abstract computing resources for application debugging). <br> 2. After completing AI application development, users can generate cross architecture executable files through the cross architecture development environment, and upload them to the cross architecture business executable file repository of CFN orchestration and management layer. <br> 3. Users propose descriptive SLA requirements for AI application deployment to CFN, and the orchestration and management layer receives SLA requests. <br> 4. The orchestration and management layer analyzes SLA requirements based on monitoring data of multiple types of resources in CFN and converts them into resource requirements that can be understood by the infrastructure layer. This resource requirement can be met by multiple resource combinations, and after the user selects a specific combination scheme, it triggers application deployment. <br> 5. The application deployment process first completes the generation of AI application images and deployment files. For the image, the corresponding cross architecture basic image will be pulled from the central image repository based on the underlying resource type, and automatically packaged with the cross architecture executable file of the AI application to be deployed to generate a complete cross architecture AI application image. The complete AI application image will be distributed to the image repository in the underlying resource pool. For deployment files, automatic deployment files will be generated based on the underlying environment type (bare metal environment, container environment, virtual machine environment), resolved resource requirements, image location, business configuration, and other information/files (which can be scripts, Helm Chart, Heat templates, etc.) <br> 6. The orchestration and management component in the infrastructure layer resource pool deploys AI applications based on deployment files, and complete cross architecture AI application image. <br> 7. If AI application needs to be migrated to resource pools with other chips (CPU, FPGA, ASIC, etc.), repeat steps 4, 5, and 6. |
|
||||
| Proposal of Technology development and open-source work |1. It is suggested to enhance the research on the migration scheme of cross architecture deployment of AI applications, which can rely on the CFN WG Computing Native sub-working group to explore referenceimplementation. The existing open-source solutions such as Alibaba HALO+ODLA, Intel's DPC++ and LevelZero in the industry can serve as the basis for exploration. <br> 2. It is recommended to conduct research on computing force measurement in order to provide a unified unit for resource computation capability. <br> 3. It is recommended to explore user descriptive SLA requirements and study the ways in which these requirements can be transformed into specific resource requirements. For details, please refer to service models such as PaaS, SaaS, and FaaS. |
|
||||
|
||||
## 4.6 CFN Elastic bare metal
|
||||
@ -176,9 +176,9 @@ Jianchao Guo (AsiaInfo), Jian Xu (China Mobile), Jie Nie (China Mobile), Jintao
|
||||
| Contributor | China Mobile-Jintao Wang、China Mobile-Qihui Zhao |
|
||||
| Application Name | / |
|
||||
| Use Case Description | According to Section 2.3, the CFN provides a resource-based service model, allowing users to directly apply for resources such as bare metal, virtual machines, and containers to the CFN. Some users are also more inclined to use bare metal services due to considerations such as performance, security, and the difficulty of business virtualization/containerization transformation. Traditional bare metal service resources are difficult to flexibly configure. Customers need to adapt to various drivers. The distribution and management process is lengthy. Whether it is customer experience or management and operation methods, there is a big gap compared with virtualization services. Therefore, how to realize the flexible management and operation and maintenance of various infrastructures by CFN providers, and simplify the use of bare metal by users is an important requirement of the CFN data center. This scenario is also applicable to the fields of cloud computing and cloud-network integration. |
|
||||
| Current Solutions and Gap Analysis | For CFN providers, in the traditional bare metal distribution process, manual configuration of the network is required in the inspection stage, and multiple network switching, node installation, and restart operations are required in the provisioning and tenant stages. The overall process is complicated and lengthy. For users, the number of network cards and storage specifications are fixed, which cannot meet the differentiated network card and storage requirements of different users. In addition, since bare metal NIC drivers and storage clients are visible to users, users need to make adaptations, which further increases the difficulty for users to use bare metal. At the same time, the exposure of the block storage network to the guest operating system also poses security risks. <br> ![传统裸机管理方案](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.6%E5%9B%BE-%E4%BC%A0%E7%BB%9F%E8%A3%B8%E6%9C%BA%E7%AE%A1%E7%90%86%E6%96%B9%E6%A1%88.png)|
|
||||
| Current Solutions and Gap Analysis | For CFN providers, in the traditional bare metal distribution process, manual configuration of the network is required in the inspection stage, and multiple network switching, node installation, and restart operations are required in the provisioning and tenant stages. The overall process is complicated and lengthy. For users, the number of network cards and storage specifications are fixed, which cannot meet the differentiated network card and storage requirements of different users. In addition, since bare metal NIC drivers and storage clients are visible to users, users need to make adaptations, which further increases the difficulty for users to use bare metal. At the same time, the exposure of the block storage network to the guest operating system also poses security risks. <br> ![Traditional Bare Metal Management Solution](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.6%E5%9B%BE-DPU%E8%A3%B8%E6%9C%BA%E7%AE%A1%E7%90%86%E6%96%B9%E6%A1%88-en.png)|
|
||||
| Computing Force Network Requirements Derivation | The goal is to achieve bare metal provisioning and management as elastic as virtual machines: <br> 1. Bare metal servers are fully automatically provisioned. Through the console self-service application, functions such as automatic image installation, network configuration, and cloud disk mounting can be completed without manual intervention. <br> 2. It is fully compatible with the cloud disk system of the virtualization platform. Bare metal can be started from the cloud disk without operating system installation, which meets the requirements of elastic storage. <br> 3. It is compatible with the virtual machine VPC network and realizes the intercommunication between the bare metal server and the virtual machine network. |
|
||||
| Reference Implementation | ![DPU裸机管理方案](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.6%E5%9B%BE-DPU%E8%A3%B8%E6%9C%BA%E7%AE%A1%E7%90%86%E6%96%B9%E6%A1%88.png) <br> A DPU card needs to be installed on a bare metal server, and the network ports and disk devices of a bare metal instance are all provided by the DPU. The management, network, and storage-related components of the cloud platform run on the DPU. <br> The management module is responsible for the life cycle management of bare metal instances, the network module is responsible for the realization of bare metal virtual network ports and the forwarding of network flows, and the storage module is responsible for the realization of bare metal cloud disks and the termination of the storage protocol stack. <br> In this scenario, the distribution and management process of bare metal is as follows: <br> 1. Select the bare metal flavor through the cloud platform API or UI interface, and create a bare metal instance; <br> 2. The conductor component of the cloud platform initializes the bare metal instantiation process, calls the management module running on the DPU, and completes the configuration of the bare metal instance; <br> 3. The cloud platform creates a virtual network port and initializes the backend of the virtual network port on the DPU. Bare Metal obtains the vNIC through the standard virtio driver after startup. At the same time, the cloud platform synchronizes the network port information to the SDN controller, and the SDN controller sends the flow table to the vSwitch on the DPU of the bare metal node.In this way, the network interworking between bare metal and other instances can be realized; <br> 4. The cloud platform creates a cloud disk, and connects to the remote block storage system on the DPU through a storage protocol stack such as iSCSI. Finally, it is provided to the bare metal instance as a disk through a backend such as NVMe or virtio-blk.|
|
||||
| Reference Implementation | ![Bera Metal Management Solution with DPU](https://opendev.org/cfn/use-case-and-architecture/src/branch/master/figures/4.6%E5%9B%BE-DPU%E8%A3%B8%E6%9C%BA%E7%AE%A1%E7%90%86%E6%96%B9%E6%A1%88-en.png) <br> A DPU card needs to be installed on a bare metal server, and the network ports and disk devices of a bare metal instance are all provided by the DPU. The management, network, and storage-related components of the cloud platform run on the DPU. <br> The management module is responsible for the life cycle management of bare metal instances, the network module is responsible for the realization of bare metal virtual network ports and the forwarding of network flows, and the storage module is responsible for the realization of bare metal cloud disks and the termination of the storage protocol stack. <br> In this scenario, the distribution and management process of bare metal is as follows: <br> 1. Select the bare metal flavor through the cloud platform API or UI interface, and create a bare metal instance; <br> 2. The conductor component of the cloud platform initializes the bare metal instantiation process, calls the management module running on the DPU, and completes the configuration of the bare metal instance; <br> 3. The cloud platform creates a virtual network port and initializes the backend of the virtual network port on the DPU. Bare Metal obtains the vNIC through the standard virtio driver after startup. At the same time, the cloud platform synchronizes the network port information to the SDN controller, and the SDN controller sends the flow table to the vSwitch on the DPU of the bare metal node.In this way, the network interworking between bare metal and other instances can be realized; <br> 4. The cloud platform creates a cloud disk, and connects to the remote block storage system on the DPU through a storage protocol stack such as iSCSI. Finally, it is provided to the bare metal instance as a disk through a backend such as NVMe or virtio-blk.|
|
||||
| Proposal of Technology development and open-source work |1. Research the decoupling scheme of cloud platform components and DPU hardware drivers, so that different cloud vendors can more efficiently deploy cloud platform software components to DPUs of different vendors; <br> 2. Explore the research of NVMe-oF storage network protocol stack to provide high-performance cloud disk services.|
|
||||
|
||||
# 5. Next Step
|
||||
|
Loading…
Reference in New Issue
Block a user