Abstract: Data compliance technologies, such as privacy computing technology, SaaS platforms, and data compliance monitoring systems, transform data processing requirements, constrained by regulatory rules, into computational problems that can be handled by programs and code. Data compliance technology integrates multi-node data and computing power resources, expands data source channels, and enhances the generalization capabilities of models. Consequently, it has emerged as a fundamental underlying technology for developing and applying large-scale artificial intelligence models with broad applicability across domains. Data compliance technology has the potential to fundamentally reshape the algorithm application process; its data processing involves multiple relationships among multiple data processors, and its technical implementation carries varying degrees of systemic risk. The risk regulation of data compliance technology applications should embed scattered and fragmented rules and tools in a dynamic program of multidimensional co-governance and carry out personalized and differentiated dynamic regulation according to the different mechanisms by which the various technical risks arise.
Keywords: data compliance technology; artificial intelligence; general large model; risk governance
CLC: D923; D920    Document Code: A    Article ID: 2096-9783(2024)02-0117-10
The three main elements of artificial intelligence (AI) technology are data, algorithms, and computing power. Supported by computing power, a preset algorithm is trained on data, and the organic combination of the three builds the model. AI models have evolved through the stages of machine learning models, deep learning models, and large-scale pre-trained models. Traditional machine learning models rely on handcrafted features and statistical methods. Widely applied deep learning models such as convolutional neural networks (CNN)[1], recurrent neural networks (RNN)[2], and graph neural networks (GNN)[3] require large amounts of labeled data for training to achieve good performance. With limited data, deep learning models are prone to overfitting as the number of parameters increases, which makes it challenging to generalize across task domains.
To improve the domain-transfer capability of deep learning models, models based on large datasets have emerged. The traditional "task-specific model" construction paradigm has been replaced by "one large-scale pre-trained model serving multiple downstream tasks". This large-scale pre-trained model (PM) is referred to as the large model. The PM provides a two-stage solution based on pre-training and fine-tuning: in the pre-training phase, the model learns domain knowledge from extensive unsupervised data; in the fine-tuning phase, only a small amount of annotated data is needed to apply the knowledge learned during pre-training to a specific task, without having to train from scratch. In this mode, researchers train large models for users with different needs through the design of advanced algorithms and the support of massive computing power, improving the generalization ability of the model and expanding the applications of artificial intelligence.
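This two-stage pattern can be made concrete with a minimal, hypothetical PyTorch sketch (all module names, dimensions, and data below are invented for illustration and do not come from the paper): an encoder is first trained on a self-supervised reconstruction objective over unlabeled data, and is then reused with a small task head that is fine-tuned on a handful of labeled examples.

```python
import torch
import torch.nn as nn

# --- Pre-training phase: learn general features from unlabeled data ---
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
decoder = nn.Linear(64, 32)  # reconstruction head used as a self-supervised proxy task
pretrain_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

unlabeled = torch.randn(1024, 32)  # stand-in for a large unsupervised corpus
for _ in range(10):
    loss = nn.functional.mse_loss(decoder(encoder(unlabeled)), unlabeled)
    pretrain_opt.zero_grad()
    loss.backward()
    pretrain_opt.step()

# --- Fine-tuning phase: adapt to a downstream task with little labeled data ---
head = nn.Linear(64, 2)  # small task-specific classification head
finetune_opt = torch.optim.Adam(head.parameters(), lr=1e-3)  # encoder frozen; only the head is trained

labeled_x, labeled_y = torch.randn(64, 32), torch.randint(0, 2, (64,))
for _ in range(20):
    with torch.no_grad():
        features = encoder(labeled_x)  # reuse the pre-trained representations
    loss = nn.functional.cross_entropy(head(features), labeled_y)
    finetune_opt.zero_grad()
    loss.backward()
    finetune_opt.step()
```

The expensive, data-hungry learning happens once during pre-training, while adaptation to each downstream task only updates the lightweight head, which is why far less annotated data is needed.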
The new generation of artificial intelligence represented by the large model is driven by data and knowledge: the more data it has, the smarter it becomes. Therefore, the development and application of large models inevitably face data compliance problems. To achieve data compliance, technology developers and users, on the one hand, disclose their data-use practices to data regulatory authorities and, on the other hand, proactively apply technical means to meet data regulatory requirements and reduce security risks. Technical means of this kind, adopted to satisfy the demands of data compliance regulation, are called data compliance technology, including but not limited to privacy computing technologies such as secure multi-party computation, federated learning, and trusted execution environments, as well as software as a service (SaaS) platforms built around low-code and zero-code development.
The emergence of data compliance technology originated from the concept of "privacy by design", which integrates data security requirements into software and large-scale model development from the outset. This approach makes privacy compliance the default operating rule of the technology rather than a regulatory requirement imposed after problems occur[4]. The concepts and principles of AI research and development, such as "ethics first, science and technology for good, and agile governance", advocate the values of full-cycle digital justice: actively preventing problems in the early stages of design and development rather than merely remediating them after they arise. Data compliance technology has become a universal underlying technology for the development and application of general large models of artificial intelligence; it has unique advantages in opening up application channels, breaking down industry barriers, and dismantling information silos, and it has therefore become an important part of the data sharing infrastructure under the trend of strong data supervision. The concept of "reflexive modernization", coined by German sociologist Ulrich Beck, suggests that in the context of the "risk society", the technological tools used to eliminate risks are sometimes the very incentives that create new risks[5-6]. Data compliance technology has the capacity to fundamentally reshape the application process of algorithms and can create varying degrees of systemic risk at the front, middle, and back ends of its technological application.
1 Data Compliance Technology Applications and Risks
1.1 Application and Risk of Privacy Computing Technology
Privacy computing technology allows data to be analyzed and processed while data privacy is protected. Through various algorithms and technical means, such as data desensitization, anonymization, and camouflage technology, data can be processed and used without disclosing personal information, ensuring the privacy and security of data during use. Privacy computing techniques mainly include secure multi-party computation (SMPC), homomorphic encryption, differential privacy, zero-knowledge proofs, and federated learning. SMPC is a technique that allows multiple parties to jointly compute a specific function without disclosing their respective data; through secure algorithms and protocols, participants encrypt or transform data before providing it to other parties to ensure privacy and security in the computation process. Homomorphic encryption allows computational operations to be performed on ciphertext, and the decrypted result is the same as the result of performing the operations directly on plaintext, which means data can be processed and analyzed without exposing the raw data. Differential privacy is a technology that protects the privacy of individuals when aggregated data is published: by adding noise to the data, it makes it impossible for an attacker to determine information about a particular individual, even if the attacker has information about every other individual. Zero-knowledge proofs allow one party to prove to another that a statement is true without revealing any information beyond the validity of the statement. Federated learning allows multiple devices or servers to train a model collaboratively; each participant computes updates only locally and shares only model updates rather than raw data, thus protecting the privacy of the data source.
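Two of these mechanisms can be illustrated with a short, self-contained Python sketch (all values and parameters are purely illustrative): the Laplace mechanism, a standard way to realize differential privacy, adds calibrated noise to an aggregate query, and additive secret sharing, a basic building block of secure multi-party computation, lets several parties compute a joint sum without any party revealing its own input.

```python
import math
import random

# --- Differential privacy: Laplace mechanism for a counting query ---
def private_count(values, epsilon=1.0, sensitivity=1.0):
    """Return a noisy count; the noise scale is sensitivity / epsilon."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5
    noise = -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))  # Laplace(0, scale) sample
    return len(values) + noise

print(private_count(["record-%d" % i for i in range(100)], epsilon=0.5))  # roughly 100, never exact

# --- Secure multi-party computation: additive secret sharing of a sum ---
PRIME = 2**61 - 1  # all share arithmetic is done modulo a large prime

def share(secret, n_parties=3):
    """Split a secret into n random shares that sum to the secret modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

inputs = [42, 17, 99]                                # private values held by three data owners
all_shares = [share(x) for x in inputs]              # each owner sends one share to each party
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]  # each party adds the shares it holds
joint_sum = sum(partial_sums) % PRIME                # combining partial sums reveals only the total
print(joint_sum == sum(inputs))                      # True: the sum is recovered without exposing inputs
```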
The risks of privacy computing technology mainly include inconsistent technical standards and specifications, security and credibility problems, and performance problems. As a relatively young field, privacy computing lacks unified technical standards and specifications. This not only brings uncertainty to the engineered application of privacy computing but also limits interoperability between different schemes. With the rapid iteration of technology, existing standards are often not updated in a timely manner, resulting in many uncertainties and risks in practice. In terms of security and credibility, privacy computing involves the processing of sensitive personal data, and a security breach or data leak would pose a huge threat to user privacy; ensuring the security of privacy computing is therefore an important engineering issue. Performance is also a challenge for privacy computing technologies. Although current performance has reached a basically usable level, there is still considerable room for improvement compared with plaintext computing. The development and application of privacy computing technology require continuous progress in technical standardization, security assurance, and performance optimization to reduce the existing risks and improve its practicality in various fields.
1.2 Application and Risk of SaaS Platform
SaaS platforms provide flexible tools for data management and analysis, and they often include a range of features such as data encryption, access control, and audit trails to help users manage their data assets and ensure compliance. The core of the SaaS model is "software as a service": the development, maintenance, and upgrading of software are the responsibility of the service provider, and users do not need to concern themselves with the underlying technical details; they simply access and use the software through the Internet. This model not only lowers the threshold of use but also makes software development more efficient and focused. Software development based on SaaS platforms often adopts a microservices architecture, a method of breaking down complex applications into a set of small, independent services. Each service is built around a business function and can be developed, deployed, and extended independently. This architecture makes software development more flexible and scalable and also facilitates team collaboration and continuous integration/continuous delivery (CI/CD). SaaS platform software development focuses on API design and openness. By providing unified and standardized API interfaces, various external systems and services can easily be connected to achieve data sharing and business collaboration. This not only helps improve the functionality and usability of the software but also provides rich development opportunities and platforms for third-party developers.
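The API-centric, access-controlled style of development described above can be sketched as follows, assuming the Flask web framework is available; the endpoint, API-key table, and tenant-isolation rule are hypothetical illustrations rather than any particular platform's actual interface.

```python
from datetime import datetime, timezone
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

# Hypothetical tenant/role table; a real SaaS platform would back this with an identity service.
API_KEYS = {
    "key-tenant-a": {"tenant": "a", "role": "analyst"},
    "key-tenant-b": {"tenant": "b", "role": "viewer"},
}
AUDIT_LOG = []  # append-only audit trail of data access


def authenticate():
    """Resolve the caller from the API key header, or reject the request."""
    caller = API_KEYS.get(request.headers.get("X-Api-Key", ""))
    if caller is None:
        abort(401)
    return caller


@app.route("/v1/datasets/<dataset_id>", methods=["GET"])
def read_dataset(dataset_id):
    caller = authenticate()
    # Access control: a tenant may only read datasets in its own namespace.
    if not dataset_id.startswith(caller["tenant"] + "-"):
        abort(403)
    AUDIT_LOG.append({"time": datetime.now(timezone.utc).isoformat(),
                      "tenant": caller["tenant"], "dataset": dataset_id})
    return jsonify({"dataset": dataset_id, "status": "ok"})
```

In production the key table would be replaced by an identity service and the audit log by durable storage, but the pattern of authentication, tenant-scoped access control, and audit trailing is the same one a platform exposes to every tenant through a standardized API.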
SaaS platform operations also face data security and compliance risks, as well as the risk of service disruption. Data security risk is one of the major risks faced by SaaS platforms. If a cloud service provider's security measures are inadequate or the platform is attacked by hackers, users' sensitive data may be illegally obtained and disclosed, which not only compromises the privacy of users but may also lead to the disclosure of trade secrets. While most cloud service providers have backup and disaster recovery plans in place, there is still a risk of data loss: a user's data may become unrecoverable due to operational error, system failure, or malicious attack. Data tampering is also a potential risk. On a multi-tenant SaaS platform, ensuring data isolation between different tenants is a challenge; if one tenant's data is tampered with by another tenant or a malicious attacker, serious consequences can result. Data protection and privacy regulations are among the most important regulations that SaaS platforms need to comply with. Different countries and regions may have different regulatory requirements for data security and privacy protection, and meeting country-specific regulations is a factor that SaaS platforms need to consider.
1.3 Application and Risk of Data Governance Tools and Compliance Monitoring Systems
Data governance tools are used to monitor and manage users' data flows, helping users establish and maintain data classification, data lifecycle management, and data access policies to ensure the safe and compliant use of data. Whole-life-cycle data management covers every processing activity across data collection, storage, use, processing, transmission, provision, disclosure, destruction, and deletion. Throughout the data processing process, security risk assessments are carried out for specific data needs. According to the relevant management measures, processors of important data and core data are required to complete a data security risk assessment at least once a year. Any processing activity that may affect the confidentiality, integrity, or availability of data requires a security risk assessment. When handling personal information, it is also necessary to conduct security risk assessments to protect personal privacy and comply with relevant laws and regulations. The compliance monitoring system automatically monitors users' data processing activities to ensure compliance with relevant laws and policies. When a potential compliance issue is identified, the system raises alerts and provides the necessary reporting capabilities so that users can take timely action to correct the issue.
In general, a data security risk assessment should first involve basic research to understand the background and environment of data processing, followed by the development of a detailed assessment plan. Information is then gathered through field research, data sorting and classification, threat and vulnerability identification, and identification of existing data security measures; the collected information is organized, qualitative or quantitative risk calculations are carried out, and corresponding risk treatment recommendations are formulated. The data compliance technology represented by data governance tools and compliance monitoring systems relies on advanced information technology, such as big data analysis and artificial intelligence. These technologies develop very quickly but are also unstable: if the relevant technology malfunctions or makes errors, data processing may become inaccurate, which affects the judgment and implementation of data compliance. If these systems operate without strict security measures, data leakage, data abuse, invasion of personal privacy, and other ethical issues may arise. Over-reliance on data governance tools and compliance monitoring systems also creates the risk of path dependence, potentially reducing users' ability to make autonomous decisions; if the system fails or the data is inaccurate, users may make the wrong decisions.
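As a rough illustration of how a compliance monitoring system can turn such requirements into automated checks, the following sketch scans hypothetical data-processing events against a few simplified rules and raises alerts; the event fields and rules are invented for illustration and are not drawn from any specific regulation's text.

```python
from dataclasses import dataclass

@dataclass
class ProcessingEvent:
    actor: str          # which system or job processed the data
    purpose: str        # declared purpose of the processing
    data_class: str     # e.g. "personal", "important", "core"
    consented: bool     # whether the data subject's consent is on file
    cross_border: bool  # whether the data leaves the jurisdiction

def check_event(event: ProcessingEvent) -> list[str]:
    """Return a list of alerts for a single processing event (illustrative rules only)."""
    alerts = []
    if event.data_class == "personal" and not event.consented:
        alerts.append("personal data processed without recorded consent")
    if event.data_class in ("important", "core") and event.cross_border:
        alerts.append("important/core data transferred cross-border: security assessment required")
    if event.purpose not in ("declared", "contract", "legal_obligation"):
        alerts.append("processing purpose outside the declared scope")
    return alerts

events = [ProcessingEvent("etl-job-7", "marketing", "personal", consented=False, cross_border=False)]
for e in events:
    for alert in check_event(e):
        print(f"[ALERT] {e.actor}: {alert}")  # in practice this would feed a reporting pipeline
```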
2 Data Compliance Risk in Large Model Training and Application
2.1 Mechanisms of Data Action in Large Model Development
Pre-trained models were first used in the field of computer vision (CV), where the emergence of large-scale image datasets provided a data basis for image pre-training models. With the success of pre-trained models in CV, they gradually entered the field of natural language processing (NLP). The availability of pre-trained models (PTM) based on the Transformer architecture improves model initialization and avoids overfitting on small datasets. The Transformer is a deep neural network model based on an attention mechanism: by introducing self-attention, it assigns different attention weights to different elements of the input and then combines their information weighted by the correlations between elements[7]. With increases in computing power, the emergence of deeper models, and improvements in training techniques, the depth and number of parameters of PTM architectures continue to increase.
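The self-attention computation referred to here is compact enough to write out: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where the softmax weights express how strongly each element attends to every other element. The following NumPy sketch implements single-head scaled dot-product attention on toy dimensions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # pairwise relevance of queries to keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted combination of the values

# Toy example: a sequence of 4 tokens with model dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))   # learned projections in a real model
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8): each token becomes a correlation-weighted mix of all tokens' values
```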
Driven by large-scale text corpora and self-supervised pre-training technologies, the Generative Pre-trained Transformer (GPT) uses vast amounts of data to read and generate text, and large language models (LLM) have developed rapidly. Building on the development of visual models and language models, multimodal large models such as vision-language models have also developed rapidly. Multimodal inputs and outputs allow AI systems to process and generate multiple types of data, such as text, images, audio, and video. This approach makes AI systems more flexible and able to adapt to the needs of different scenarios. The pre-trained model learns from and is trained on a large amount of data to generate text, images, audio, and other related content according to input instructions, forming artificial intelligence-generated content (AIGC). In the case of multimodal input, different types of data must be preprocessed and fused into a unified representation.
2.2 Risks Associated with Data in the Pre-Training Phase of Large Models
The content generated by a large model draws its information from a massive corpus. The training phase of a large model requires the collection of a large amount of training data. The training data determines what kind of content the large model product generates and, to a large extent, determines whether that content is compliant. The technical principles of large models dictate that content security compliance begins with the training data.
Since large models require massive data for training, it is difficult for artificially constructed data to meet such a large data requirement, so most of the training data of large models come from open data on the Internet. For the training of large models, data with a very low error rate is critical because the accuracy and performance of a machine learning model depend on the quality of the training data. This publicly available data carries risks in terms of accuracy and authenticity.
To address the two shortcomings of large models, timeliness and accuracy, many developers use retrieval technology to obtain real-time data, helping large models update and correct their knowledge and alleviating problems such as data lag and hallucination. In data retrieval, some developers have adopted crawling technology to illegally access and cache data from other servers. In legal practice, whether data crawling is compliant must be judged according to the specific scenario or use. To achieve compliance, first, the website operator's technical protection measures should not be circumvented; crawling that breaks through such technical measures is recognized as an infringement of the operator's data property rights and interests. At the same time, crawlers must not violate a robots protocol statement that explicitly blacklists them; must not disrupt the normal operation of other websites and servers; and must avoid crawling data that triggers strict personal information protection rules. Training large models on open-source datasets also requires a review of the requirements of the open-source licenses.
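One of the compliance requirements above, respecting a site's robots exclusion rules, can be checked programmatically before any page is fetched. The sketch below uses the robots.txt parser in Python's standard library; the site URL and user-agent string are placeholders.

```python
from urllib.robotparser import RobotFileParser

def may_crawl(page_url: str, robots_url: str, user_agent: str = "example-model-crawler") -> bool:
    """Return True only if the site's robots.txt allows this user agent to fetch the page."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # downloads and parses the site's robots.txt
    return parser.can_fetch(user_agent, page_url)

if __name__ == "__main__":
    url = "https://example.com/articles/page-1.html"  # placeholder target page
    if may_crawl(url, "https://example.com/robots.txt"):
        pass  # proceed to fetch, still subject to rate limits and the site's terms of service
    else:
        print("robots.txt disallows crawling this path; skipping")
```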
Large model training data also raises data ownership and intellectual property issues. Many pieces of data are shared by multiple organizations, such as medical records or traffic flow data; in the absence of clear data ownership and sharing rules, illegal use and infringement of data can occur. The use of text as training data, especially the mining of Chinese-language text in open data, often falls outside the fair-use limits of intellectual property law and may constitute infringement of works protected by the Copyright Law. In terms of data storage, access control, and dissemination, it is important to ensure that data is stored and accessed in accordance with laws, regulations, and industry standards. If data needs to be transferred overseas, it is necessary to consider the sectors and functions of the vertical enterprises served, determine whether the data constitutes important data, and ensure that the amount of personal data processed complies with cross-border data transfer regulations. AI-related technologies are subject to export controls, and technology export control risks exist where domestic and foreign affiliates are involved.
If the data in the training phase is derived from the collection of citizens' personal data, the data subjects' consent must be obtained for data protection. Data protection covers sensitive information about individuals, companies, organizations, and other subjects, such as names, addresses, ID numbers, and bank accounts. It is difficult to obtain informed consent for every piece of data that contains personal information, and the tension between the informed-consent dilemma and large-model development and training creates compliance risks. Personal information collected during the training phase may also breach the "minimum scope" requirement. If this personal information comes from open public data, compliance risks remain as to whether its use for training stays within a "reasonable scope" and whether it will have a "significant impact on the rights and interests of individuals".
2.3 Risks Associated with Large Model Product Use Phase Data
During public testing or when general large models are opened to users, some large models use the interactive information generated by users during use for continuous iterative training, and some are fine-tuned on enterprise data before being put into business operation. In these processes, personal information, enterprise information, and even trade secrets may be disclosed to the large models, posing great hidden dangers to information security and creating a risk of data leakage. In the process of testing or using large models, it is very difficult for users to exercise the right to delete personal information. Although privacy policies have been formulated for many large model tests, because deleting data generated in the interaction between large models and users is so complex, it is uncertain whether developers can erase traces of personal information and meet current compliance requirements.
Since the pre-training data of many large models comes from information on the public Internet, if the pre-training data cannot cover all possible languages or propositions, or if the data contains false, misleading, or even incorrect information, the output results may be wrong and mislead users. Compared with other Internet information services, large model products directly and interactively output conclusive content. Owing to the air of scientific authority lent by the algorithmic black box, such content can easily influence the ideology and value orientation of users, especially minors. In the process of use, large model products may, by themselves or under the guidance of users, generate false information, terrorism- or extremism-related content, pornography, violence, and other harmful information, adversely affecting users and damaging their legitimate rights and interests.
Users may also use false information to spread rumors and disturb social order, bringing public opinion risks to large-model products. Some lawbreakers use false information generated by deep synthesis technology to disrupt the order of online communication and social order, bringing governance risks caused by the abuse of artificial intelligence-generated content. In the testing and application of large models, there is also the legal risk of cross-border data flows. For example, when a large model developed by a foreign company is used by a domestic user in China, information about the user is transmitted to an overseas data processing center. If, in the interaction, the user transfers sensitive personal information or personal information of a certain scale to an overseas data processing center for data analysis or other purposes, this may constitute de facto cross-border data transfer, and if it has not been approved, compliance risks will arise.
3 Exogenous Risks of Large Model Data Compliance Technology
3.1 User Privacy Protection
The main purpose of developing and applying data compliance technology is to transform data processing requirements and technical constraints, in accordance with regulatory rules, into computational problems that can be solved by programs and code. For example, to obtain user data safely and legally while protecting user privacy, the data processing process is constructed using secure multi-party computation. In the development and application scenarios of data compliance technology, the mathematical process of solving specific problems is still realized through algorithms.
In the weak artificial intelligence stage, an algorithm's logic directly reflects the natural human logic of the developer, and the developer's values and implicit biases lead to risks. In the strong artificial intelligence stage, the application of data compliance technology imposes many restrictions on algorithmic automated decision-making to ensure that data processing meets regulatory requirements, but it also raises new concerns in code writing and technology development. In particular, in secure multi-party computation, keeping all participants motivated usually means taking the reduction of data cleaning costs as the starting point, so the technical side sometimes forgoes the costly cleaning of unstructured data. In federated learning, the initial model contributed by each participant needs to be highly adaptable, which makes the initial model carry multiple functions and lose the specificity and uniqueness it should have. All of this causes problems in connection and collaboration; even where data interoperability is ensured, the accuracy of the data translation process is difficult to guarantee, resulting in risks of bias and information dissipation. When data compliance technology is adopted in an AI general large model, all participants in the data compliance technology can directly obtain the complete technical parameters, and malicious attackers can exploit this feature by disguising themselves as honest participants to steal calculation results, distort model training, crack trusted environments, or generate malicious low code. How to ensure that participants remain honest and act in good faith is an issue that must be considered in the regulation of data compliance technology.
3.2 Increased Generalization Ability Comes with Source Data Risk
Improving the generalization ability of large models is a key problem in machine learning and is also key to whether general large models can be widely applied. In the design of a large model, different technical methods are adopted to address the overfitting phenomenon, in which a model performs well on the training set but poorly on the test set or new data. Overfitting occurs when the model learns noise and specific patterns in the training data that do not apply to previously unseen data, causing the model to lose its ability to generalize, that is, to become less applicable to new data. In the development of multimodal artificial intelligence models, in order to improve generality, the models are simplified to a certain extent; for example, regularization limits model complexity by adding a regularization term to the loss function that penalizes complexity, thereby reducing the risk of overfitting. With cross-validation, the training data is divided into multiple subsets, and training and validation on different subsets allow the performance of the model to be evaluated more accurately. Ensemble learning, which combines the predictions of multiple models, can also reduce the risk of overfitting. For deep learning models, attention mechanisms and pooling strategies can help the model focus on important information.
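As a small illustration of the regularization and cross-validation ideas just listed, the following sketch (assuming scikit-learn is available; the synthetic data and hyperparameters are illustrative) compares ridge regression models with different L2 penalty strengths and evaluates each by 5-fold cross-validation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data: few samples, many features, so an unregularized fit overfits easily.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=60)  # only the first feature truly matters

for alpha in (0.01, 1.0, 100.0):
    # Ridge adds an L2 penalty alpha * ||w||^2 to the loss, limiting model complexity.
    model = Ridge(alpha=alpha)
    # 5-fold cross-validation: train on four subsets, validate on the held-out one, and rotate.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"alpha={alpha:6.2f}  mean cross-validated R^2 = {scores.mean():.3f}")
```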
To improve the generalization ability of large models, a variety of methods and technologies must be used in combination and adjusted and optimized according to the specific application scenario and data characteristics. In large model training, data compliance techniques such as secure multi-party computation have greatly broadened the model's data source channels. Increasing the amount of training data can improve the generalization ability of the model: more data helps the model learn more general features rather than over-relying on specific patterns or noise in the training set. But sometimes the data channel itself is a source of risk. While greatly increasing the amount of data available for deep mining by machine learning models, the application of data compliance technology brings systemic risks such as difficulty in verifying the legitimacy and accuracy of external data sources, the blurring of data use boundaries under the sharing mode, and the contagion of data problems among multiple parties.
3.3 Algorithmic Interpretability and Algorithmic Discrimination Risk
The application of data compliance technology in general artificial intelligence models is inseparable from algorithms, and the implementation of algorithms brings an "algorithm black box". Owing to the technical characteristics of machine learning itself, the rules created by algorithms through self-learning are difficult for natural persons to observe and understand at the technical level. From the outside, many algorithm developers of large models hide the rules of algorithmic decision-making, the decision-making subject lacks transparency, and external parties find it difficult to know the process and logic of the decision. Although the regulatory authorities have put forward transparency requirements for algorithms, these provisions are generally abstract and lack specific guidelines, and algorithm providers often do not know how to implement their algorithmic transparency obligations. In the multi-party cooperation of data compliance technology, the interpretability of algorithms remains low due to the lack of unified interpretation standards and interpretation subjects; because the algorithms being interpreted may overlap with privacy, security, or trade secrets, the range of parties to whom they can be disclosed is limited. In addition to the interpretability risk, there is also the risk of algorithmic discrimination arising from data bias, algorithm design bias, and application scenario bias. Data bias is one of the main causes of algorithmic discrimination: due to limitations or biases in the training data, the algorithm may learn unfair or discriminatory behavior. For example, if the sample size of certain groups in the training data is insufficient or there is label noise, the algorithm may produce inaccurate predictions or classifications for these groups. Second, algorithm design bias is also an important cause of algorithmic discrimination: if fairness and diversity are not sufficiently considered during algorithm design, the algorithm may produce unfair results for certain groups; for example, some algorithms may amplify pre-existing biases or discrimination related to gender, race, or age. In addition, application scenario bias may also lead to algorithmic discrimination: in some scenarios, algorithms may be used for inappropriate purposes or have an unfair impact on certain groups; for example, in a recruitment scenario, if an algorithm is used to screen candidates without taking into account other important characteristics or abilities of the candidates, it may produce unfair results for certain groups.
4 Dynamic Regulation of Application Risks of Large Model Data Compliance Technology
4.1 Strengthen Legal Regulation
With the development of artificial intelligence research and application, in response to the challenges and opportunities brought by the development of AI technology, the legal norms for artificial intelligence have shown a trend of gradual refinement and improvement. Since 2017, the United States[8], the European Union[9], and other countries and regions have issued a series of laws and regulations, which have been gradually refined from macro-guidelines and strategies to specific areas such as autonomous driving governance and data protection. Overall, there are five main regulatory principles: "transparency, traceability, and interpretability"; "data protection, privacy, and data security"; "challenging or correcting AI decisions"; "prohibiting bias or discrimination"; and "preventing misuse of technology and illegal activities".
China's governance norms for AI mainly focus on ensuring the safety of AI, transparency of use, interpretability of algorithms, and ethical compliance. In the field of artificial intelligence, a multi-level governance normative structure has been initially established, forming a comprehensive governance system that combines legally binding enacted law (hard law) with industry self-regulatory norms (soft law). The main artificial intelligence legal norms include the Ethical Norms for New Generation Artificial Intelligence, the Provisions on the Administration of Deep Synthesis of Internet Information Services, the Interim Measures for the Management of Generative Artificial Intelligence Services, and the Opinions on Regulating and Strengthening the Judicial Application of Artificial Intelligence. The Interim Measures for the Management of Generative Artificial Intelligence Services put forward specific supervisory requirements for generative artificial intelligence in terms of compliance obligations and responsibilities, security assessment and algorithm filing, legitimacy of data sources, and protection of user information.
The multimodal general large model further improves the learning ability of artificial intelligence, enabling it to better understand and process multiple types of data, such as text, images, audio, and video, so that it can be applied to multiple fields, such as natural language processing, computer vision, and speech recognition, achieving cross-domain knowledge transfer and sharing. The multimodal general large model can better understand human language and behavior and improve the naturalness and intelligence of human-computer interaction. With the development of multimodal general large models, AI may surpass human intelligence in some areas; how to ensure that AI decisions are in line with human values and moral norms and how to prevent AI from being used for malicious purposes will bring more ethical and security challenges. Current artificial intelligence governance suffers from problems such as incomplete norms, overly general provisions, and limited effectiveness, and it is necessary to continuously improve the legal and regulatory system and strengthen the ability to evaluate and control artificial intelligence security.
4.2 Establish an Upstream and Downstream Linkage Mechanism for Data Compliance Technology Application
Large models adopt deep neural network architectures whose interpretability is poor; it is difficult to effectively track and explain their training processes and inference results, they face safety problems in practical applications, and they pose huge risks in areas with high reliability requirements (such as autonomous driving and AI-assisted medicine). The mechanism behind the emergent abilities of large models is still unclear. As the parameter scale of large models grows, the performance improvement brought by model size shows diminishing marginal returns, while larger models entail higher training costs, including computing power, data, and more complex training processes. It is therefore particularly important to develop more systematic and economical pre-training frameworks to optimize the training process of large models. In large model training, factors such as model validity, efficiency optimization, and training stability need to be considered, making it all the more necessary to optimize the resource scheduling mechanism in order to better organize and utilize the resources in the computing cluster. This will require more datasets and will also require the application of data compliance technologies.
Data compliance technology connects multi-node data and computing power resources and, by improving the generalization ability of machine learning models, completes the transformation of the data processing ecosystem from individual intelligence to group intelligence. Unlike traditional algorithmic decision-making with a fixed business model, data processing across the whole cycle of data compliance technology is no longer a simple "collection-processing-decision" contract but a series of contracts running from upstream data through midstream technology to downstream applications, and subtle changes in any link will produce a chain reaction across the contracts. The application of data compliance technology is highly dynamic, and data processors must adapt to changes in node states, contract strategies, encryption modes, the closedness of the environment, and so on. This dynamic business characteristic means that traditional data supervision must also change dynamically. For example, under the "informed consent" framework that is central to the protection of personal information, even when de-identified and anonymized personal data participates in secure multi-party computation, federated learning, and the like, the data processing subject should still request user authorization "successively, separately, and proactively".
The application of data compliance technology may involve multiple contractual relationships among multiple data processors, all of whom must perform data security assessment obligations. In the development and application of the general large model, there are master processors and slave processors of data processing, and the master processor shall ensure that the slave processors establish appropriate data security capabilities and implement the necessary management and technical measures in accordance with the requirements of relevant national standards. The master processor should have reasonable means and measures to curb opportunistic behavior by slave processors, ensuring that a slave processor's curiosity remains under control and never turns malicious. In the application of data compliance technology, upstream and downstream linkage mechanisms should be established: when the data subject changes a data processing behavior, the master processor should ensure that downstream processors adjust their behavior accordingly. Unified data standards and clear technical specifications are the basis for the linkage of upstream and downstream rules. There are many technical approaches to the development and application of data compliance technology, and methods of data collection and processing differ greatly; achieving unified standards depends on industry authorities taking the lead in formulating applicable data exchange rules so that data encoding, standards, statistical calibers, and formats are jointly observed.
4.3 Establish a Third-Party Risk Assessment Mechanism
For data compliance, the average data processor basically takes six key steps. The first is continuous monitoring: the collection, storage, processing, and transmission of data must be continuously monitored to ensure that all activities are carried out within a compliance framework. The second is to strengthen real-time analytics by analyzing data streams in real time to quickly identify and address potential compliance issues. The third is to conduct regular risk assessments to identify potential risks and weaknesses in data compliance and take appropriate actions to mitigate them. The fourth is to update policies and processes: as regulations and standards continue to change, data compliance policies and processes must be updated in a timely manner to remain consistent with current regulatory requirements. The fifth is to strengthen transparency and reporting, being transparent about data compliance and providing regular reports to relevant stakeholders. The sixth is to develop and implement an emergency response plan that allows quick action to mitigate damage in the event of a data breach.
Data compliance technology contains a series of technical black boxes, such as homomorphic encryption and differential privacy in privacy computing and the splicing of low-code modules, which break the linear correlation between data input and result output. Data compliance technologies need to be both structurally stable and economically workable, yet their layers of encryption and defense measures are dizzying and opaque. From the perspective of supervision, it is difficult to meet interpretability requirements for every process in the application of data compliance technology, but the participants must disclose information on the operating logic and construction mechanism of the overall scheme, as well as possible systemic deviations, privacy risks, operational failures, and remediation plans. Owing to rapid technology iteration and the superposition of multiple technical links, the results of repeated training may deviate from the initially preset procedure. In information disclosure, even when technology developers have good intentions and offer "correct explanations", wrong or even false disclosures remain possible. It is therefore necessary to introduce external review mechanisms and manual audit mechanisms to study in depth the hidden risks of applying data compliance technology. External audits of data compliance technology development and application programs need to be conducted by an independent third party, and the auditors should be experts with relevant technical backgrounds and no interest in the participants. External audits should ensure that the technical side introduces a default data-screening mechanism in the data cleaning process, prevent technical participants from compromising data compliance out of hunger for data, and curb algorithmic discrimination at the source of the data.
5 Conclusion
With the continuous iteration and upgrading of artificial intelligence technology, data compliance technology in the development and application of large models is booming, but it also brings many risks, such as data security risks and algorithm misuse. Data compliance technology is constantly undergoing technological change and new applications, and general large models of artificial intelligence face different application scenarios. Because the parties in the multi-party participation mechanism of data compliance technology have different endowments, the risk prevention scheme for data compliance technology has no immutable "optimal solution"; personalized and differentiated dynamic regulation must be carried out based on the different mechanisms by which the various technical risks arise. The mode of risk governance should be transformed from single-subject regulation to multi-subject co-governance, taking full account of the relationships among data, algorithms, applications, and enterprises, and forming a full-process regulatory model of "pre-supervision, mid-intervention, and post-verification".
The risks of data compliance technology in the development and application of large models ostensibly concern data use, algorithm misuse, and privacy; in essence, they reflect the conflict between the rational management of technology development and the choice of regulatory strategy. Risk regulation for the development and application of data compliance technology must establish a systematic and coherent framework, attach equal importance to promoting the development and use of data and ensuring data security, and form an "agile governance" model of risk governance that meets technological needs and adapts to economic development. Scattered and fragmented rules and tools should be integrated into a dynamic program of multidimensional co-governance, and the rapid perception, flexible response, and continuous coordination capabilities of risk governance should be continuously improved, so as to fully reconcile the contradiction between technological innovation and data risk and realize the improvement in overall social welfare brought by technological progress.
References:
[1] KALCHBRENNER N, GREFENSTETTE E, BLUNSOM P. A convolutional neural network for modelling sentences[C]. Annual Meeting of the Association for Computational Linguistics, 2014.
[2] SUTSKEVER I, VINYALS O, LE Q. Sequence to sequence learning with neural networks[C]. Advances in Neural Information Processing Systems, 2014.
[3] SOCHER R, PERELYGIN A, WU J, et al. Recursive deep models for semantic compositionality over a sentiment treebank[C]. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.
[4] RUBINSTEIN I S, GOOD N. Privacy by design: a counterfactual analysis of Google and Facebook privacy incidents[J]. Berkeley Technology Law Journal, 2013, 28(2): 1333-1413.
[5] BECK U, GIDDENS A, LASH S. Reflexive modernization: politics, tradition and aesthetics in the modern social order[M]. Redwood City, CA: Stanford University Press, 1994: 146.
[6] BECK U. Risk society: towards a new modernity[M]. London: Sage, 1992: 155-158.
[7] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Advances in Neural Information Processing Systems, 2017.
[8] The White House. Accelerating America's Leadership in Artificial Intelligence[EB/OL]. (2019-02-11) [2023-08-26]. https://trumpwhitehouse.archives.gov/articles/accelerating-americas-leadership-in-artificial-intelligence/.
[9] European Commission. White Paper on Artificial Intelligence: a European approach to excellence and trust [EB/OL]. (2020-02-19) [2023-12-24]. https://commission.europa.eu/publications/white-paper-artificial-intelligence-european-approach-excellence-and-trust_en.
Dynamic Regulation of the Application Risks of Data Compliance Technology in Multimodal General Large Models of Artificial Intelligence
WU Wei
(Party School of the CPC Sichuan Provincial Committee / Sichuan Administration Institute, Chengdu 610071, China)