Experiments
We conducted experiments on four datasets, where three datasets correspond to downstream generative tasks and one dataset to a classification task. Generative tasks are generally harder than classification tasks. This is because the generative tasks are evaluated by next-token prediction accuracy, which requires the synthetic data to preserve fine-grained textual information from the private data. In contrast, the classification tasks only require maintaining the co-occurrence patterns between labels and words in the private data.
The three generative tasks are chosen to cover a diverse set of practical scenarios: PubMed (medical paper abstracts), Chatbot Arena (human-to-machine interactions), and Multi-Session Chat (human-to-human daily dialogues). To evaluate the quality of the generated synthetic data, we followed the setup of Aug-PE to train a small downstream language model on the synthetic data and then compute the next-token prediction accuracy on the real test data.
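The metric above can be made concrete with a small sketch. The function below computes next-token prediction accuracy: at each position, the downstream model predicts the next token from the preceding context, and we count how often the top-1 prediction matches the true next token. The `predict_next` callable is a placeholder for whatever downstream language model is trained on the synthetic data; the toy model here is purely illustrative, not the paper's actual setup.

```python
def next_token_accuracy(predict_next, token_ids):
    """Fraction of positions where the model's top-1 prediction for the
    next token matches the true next token in the real test sequence."""
    correct = 0
    total = 0
    for i in range(len(token_ids) - 1):
        context = token_ids[: i + 1]
        if predict_next(context) == token_ids[i + 1]:
            correct += 1
        total += 1
    return correct / total if total else 0.0

# Toy stand-in "model": predicts that the next token repeats the last one.
repeat_last = lambda ctx: ctx[-1]

print(next_token_accuracy(repeat_last, [1, 1, 2, 2, 2]))  # 3 correct of 4 -> 0.75
```

In practice the accuracy is averaged over all real test sequences, so higher values indicate that the synthetic training data preserved more fine-grained textual information from the private data.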
The classification task is performed on the OpenReview (academic paper reviews) dataset. To evaluate the quality of the generated synthetic data, we train a downstream classifier on the synthetic data and compute the classification accuracy on the real test data.
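As a minimal sketch of this train-on-synthetic, test-on-real protocol, the snippet below fits a simple word-count classifier on (text, label) pairs and measures accuracy on held-out pairs. The tiny datasets and the classifier itself are illustrative placeholders, not the actual OpenReview data or the downstream classifier used in the paper; the point is that only label-word co-occurrence patterns need to survive in the synthetic data for this metric to be high.

```python
from collections import Counter, defaultdict

def train_word_classifier(pairs):
    """Fit a naive classifier: score each label by how often the input's
    words co-occurred with that label in the (synthetic) training pairs."""
    word_label = defaultdict(Counter)
    label_counts = Counter()
    for text, label in pairs:
        label_counts[label] += 1
        for w in text.split():
            word_label[w][label] += 1

    def predict(text):
        scores = Counter()
        for w in text.split():
            for lbl, c in word_label[w].items():
                scores[lbl] += c
        # Fall back to the majority training label for unseen words.
        return (scores.most_common(1)[0][0] if scores
                else label_counts.most_common(1)[0][0])

    return predict

def accuracy(predict, pairs):
    """Classification accuracy on (real) test pairs."""
    return sum(predict(t) == y for t, y in pairs) / len(pairs)

synthetic = [("great novel method", "accept"), ("weak unclear results", "reject")]
real_test = [("great novel results", "accept"), ("unclear method weak", "reject")]
clf = train_word_classifier(synthetic)
print(accuracy(clf, real_test))
```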
To mitigate concerns regarding data contamination, we carefully analyzed our chosen datasets. Our analysis confirmed no overlap between our pre-training data and the downstream datasets.
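One common way to operationalize such a contamination check is verbatim character n-gram overlap: a downstream example is flagged if any sufficiently long character n-gram also appears in the pre-training corpus. The sketch below illustrates this generic technique under that assumption; it is not necessarily the exact procedure used for the analysis above, and the corpora shown are toy placeholders.

```python
def char_ngrams(text, n=13):
    """Set of all whitespace-normalized character n-grams of `text`."""
    text = " ".join(text.split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def contaminated(pretrain_docs, downstream_doc, n=13):
    """True if any length-n character span of the downstream document
    appears verbatim in any pre-training document."""
    pretrain_grams = set()
    for doc in pretrain_docs:
        pretrain_grams |= char_ngrams(doc, n)
    return bool(char_ngrams(downstream_doc, n) & pretrain_grams)

pretrain = ["the quick brown fox jumps over the lazy dog"]
print(contaminated(pretrain, "a quick brown fox jumps away"))      # shares a long span
print(contaminated(pretrain, "completely unrelated medical abstract"))
```

The n-gram length trades off sensitivity against false positives: short n-grams flag incidental phrase reuse, while very long ones only catch near-verbatim copies.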